David R. MacIver's Blog
java.lang.String#hashCode
Apologies in advance. This is a really boring post.
For reasons you either know about by now or don’t care about, I was curious as to how well String’s hashCode was distributed (I suspected the answer was “not very”). I ran a few quick experiments to verify this.
For your amusement, here is a list of all hash collisions between alphanumeric strings of size 2: http://www.drmaciver.com/collisions.txt and here is a list of all which don’t collide with any others http://www.drmaciver.com/noncolliding.txt
Some statistics: There are 3844 alphanumeric strings of size 2. Of these 3570 collide with at least one other string. That is, 274 of these strings (or about 7% of them) *don’t* collide with something else.
Oh well. It’s a good thing no one would be stupid enough to rely on hashCode to distinguish the contents of two objects.
Edit: Additional facts I originally posted in the comments.
For what it’s worth, even fewer strings have unique hash codes for 3 characters. 3948 don’t collide, or about 1.6% of them.
This of course doesn’t mean that probability of a hash collision is really high. In reality it’s acceptably low. It’s just a demonstration that it’s not hard to find colliding pairs.
The following consist of all the String hash collisions contained in
the sowpods scrabble dictionary:
isohel, epistolaries
righto, buzzards
hierarch, crinolines
inwork, hypercatalexes
wainages, presentencing
variants, gelato
misused, horsemints
trichothecenes, locular
pomatoes, eructation
Comments
Justin lee on 2008-07-16 13:53:04:
haha. I *wonder* where all *that* came from! P^)=
--cheeser
Knah on 2008-07-16 14:04:13:
That’ll be 3844, not 3884.
Mind you, 3570+274=3844.…
david on 2008-07-16 14:06:54:
Oops. Thanks for catching the typo. Fixed now.
chessguy on 2008-07-16 14:22:56:
Sounds like you found what you were looking for. Which makes me a little suspicious... Now I’m not usually one to defend Java, by any means, but this experiment seems a little...strange. If you’ve gone this far, why not use the same infrastructure to do a quick calculation and see about 3-character strings? or 4? or.…whatever. What about toString()’ing some non-string objects? Testing only 2-character strings can’t be a very good measure.
david on 2008-07-16 14:27:32:
Well part of the reason I didn’t do it for much larger numbers is that it takes a longish time to run for them as the run time for calculating this is exponential in the number of characters. :-)
I did try it out for three character strings, but the results were uninterestingly similar so I didn’t bother posting them. If you really care I’ll run the results for larger numbers when I get home (I didn’t put the code anywhere central and can’t be bothered to recreate it here) and post them.
For what it’s worth, this isn’t intended as evidence that java Strings are terrible and awful. I just wanted to demonstrate that hash collisions are not remotely a rare thing.
Adam Goode on 2008-07-16 14:59:44:
See also:
http://developers.sun.com/learning/javaoneonline/2007/pdf/TS-2707.pdf
Puzzle #6, page 37
dave on 2008-07-16 16:25:47:
I once worked at a bank where the former Cobol programmers who made up the Java development staff thought you had to override hashCode() for every class you defined -- and almost universally defined it as “return 1;”
Cetin Sert on 2008-07-16 17:27:07:
I tested it with the same string size using a-zA-Z0-9. 32-bit Vista using .NET and mono.
running on .NET 3.5:
3844 two-char strings
14776336 comparisons
0 collisions with one or more items
0 total collisions
running on mono 1.9.1:
3844 two-char strings
14776336 comparisons
**3570** collisions with one or more items
5250 total collisions
http://sert.homelinux.org/projects/mono-collisions.txt
** the number is the same ^_^
Cetin Sert on 2008-07-16 17:29:25:
even the colliding strings seem to be the same ones ^^
david on 2008-07-16 21:24:04:
For what it’s worth, even fewer strings have unique hash codes for 3 characters. 3948 don’t collide, or about 1.6% of them.
Also, this of course doesn’t mean that probability of a hash collision is really high. In reality it’s acceptable low. It’s just a demonstration that it’s not hard to find colliding pairs.
david on 2008-07-16 21:30:45:
Utterly pointless fact of the day. The following consist of all the String hash collisions contained in sowpods:
isohel, epistolaries
righto, buzzards
hierarch, crinolines
inwork, hypercatalexes
wainages, presentencing
variants, gelato
misused, horsemints
trichothecenes, locular
pomatoes, eructation
Cetin Sert on 2008-07-17 01:30:28:
david: “3948 don’t collide, or about 1.6% of them”
3 characters mono 1.9.1:
again the same situation with mono 1.9.1 so I conclude java and mono
use the same hashing algorithm:
http://sert.homelinux.org/projects/mono-collisions3.txt
Oh by the way may I ask which implementation and version of java we are talking about here?
david on 2008-07-17 09:23:27:
It doesn’t actually matter which version I’m using. java.lang.String has a hash code implementation defined as part of the standard library specification, and it’s been the same since... not sure. 1.2 or 1.3 I think.
jual cd rpp kurikulum 2013 on 2014-07-25 07:35:35:
Audio books can be purchased online at a fraction of cost. Sometimes you can find them for free. Audible and ebook libraries and bookstores are both dedicated to their customers by providing the largest selection of audible books that are downloadable online or purchased in local stores. Some of the top providers are eBay, Amazon and several audible clubs. With millions of audio books available, various formats are used for the convenience of everyone ranging from children to adults. Some of those formats included are MP3 players, MP3 compatible mobile phones, iPods, iPhones, PDA’s, CD’s as well as kindles and android devices. Here is a list of places that offer audible books: