The other day a colleague of mine was trying to get distinct values out of a list of doubles. He wanted two doubles to be considered equal if the difference between them was less than 0.1.
C# does not offer a Distinct
overload that takes a Func<T, T, bool>
to handle the comparison. It does, however, offer one which takes an IEqualityComparer
. He decided to make use of this overload and implemented something that looked roughly like the following.
However instead of returning the expected output.
His implementation returned the following.
The answer to why this happened is hidden in a two-line comment in the Examples section of the Enumerable.Distinct
MSDN page.
If Equals() returns true for a pair of objects then GetHashCode() must return the same value for these objects.
A quick check of the Reference Source entry for the method confirms that in order for two elements to be considered equal, Equals
must return true and GetHashCode
must return the same value for both elements.
I found this behaviour to be pretty surprising. Although it is a well-known convention that equal objects should have the same hash code, when one is supplying a custom equality comparer, it is probably safe to say that the objects would not typically be considered to be equal to one another.
Perhaps the most disturbing facet of this implementation, however, is that it is notoriously easy to do the following in order to get the desired result.
While this implementation will work, it relies on the fact that Distinct
will always use Equals
and GetHashCode
to make the comparison. As we cannot guarantee that this will remain the case in future versions of the .NET Framework, I feel that this type of implementation is a bit too hacky for my liking, and should be avoided.
So, how can we get the result that we’re after without compromising good design principles? Two simple solutions come to mind. The first thing that we could do is simply brute-force the problem.
The obvious issue with this method is that it will execute in O(n²) time.
An alternative solution is to sort the list first.
This will execute in the same O as our sort, which we can generally take to mean O(n*log(n)).
The one important difference between these two solutions, aside from the execution time, is that DistinctBruteForce
will return distinct elements in the order that they appear in the source collection, whereas DistinctSorted
will add them in order from smallest to largest. We can see this difference if we modify our method to take a different collection, and to use all three comparison methods (hacky IEqualityComparer
, DistinctBruteForce
, and DistinctSorted
).
The console output obtained by running this is as follows.
As you can see, the Sorted
implementation includes 1
and 1.1
, whereas the other two only include the middle value, 1.01
. While in many cases I would expect the Sorted
result to be the more desirable of the two, knowing that these implementations give different results could no doubt prove useful when having to deal with a problem such as this in the future.