Posted to dev@shindig.apache.org by "Weygandt, Jon" <jw...@ebay.com> on 2009/12/21 22:56:36 UTC

HashUtil.rawChecksum - possible risk of collisions

Why do we have both HashUtil.checksum and HashUtil.rawChecksum?
 
Both are used as keys into a hash map or cache, so we want a low risk
of collisions. A message digest function keeps that risk low, but
rawChecksum does "new String(md.digest(data))", which converts the
binary digest to a string using the default character set encoding.
Since the conversion of invalid characters is "unspecified", the risk
of collisions goes up greatly when the conversion algorithm substitutes
a single character like "?" for every invalid character.
 
A quick check with single-byte values shows that the character sets
"windows-1252" and "ISO-8859-1" seem to work with no collisions, whereas
single-byte character sets like "US-ASCII" and multibyte character sets
like "EUC-JP" and "UTF-8" have numerous invalid characters, and hence
increased collisions.
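
A check along these lines could be sketched as follows. This is only an
illustration, not the code used for the check above: the class and
method names (SingleByteDecodeCheck, distinctDecodings, collide) are
hypothetical and not part of Shindig, and the charset is passed in
explicitly instead of relying on the platform default.

```java
import java.nio.charset.Charset;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Shindig.
final class SingleByteDecodeCheck {

    // Decodes each of the 256 possible single-byte values with the given
    // charset and returns how many distinct strings come out. Anything
    // below 256 means at least two byte values decoded to the same string.
    static int distinctDecodings(Charset cs) {
        Set<String> seen = new HashSet<>();
        for (int b = 0; b < 256; b++) {
            seen.add(new String(new byte[] {(byte) b}, cs));
        }
        return seen.size();
    }

    // True when two different byte arrays decode to equal strings, which is
    // the collision that "new String(md.digest(data))" makes possible.
    static boolean collide(byte[] a, byte[] b, Charset cs) {
        return new String(a, cs).equals(new String(b, cs));
    }
}
```

Calling distinctDecodings(Charset.forName("ISO-8859-1")) and comparing
the result with the one for "US-ASCII" or "UTF-8" shows the difference.
Charsets such as "EUC-JP" may not be installed on every JVM, so a guard
with Charset.isSupported(name) is advisable before looking them up.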
 
Suggestions:
1) Use checksum
2) Do a base64 conversion
3) Wrap the byte[] in an object with a proper hashCode and equals.
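
As a rough sketch of what options 2 and 3 might look like (not a
proposed patch): DigestKey is a hypothetical name, and java.util.Base64
is assumed here only for brevity. It requires Java 8 or later, so a
real fix would need whichever Base64 utility the project already
depends on.

```java
import java.util.Arrays;
import java.util.Base64;

// Hypothetical key type wrapping the raw digest bytes (option 3).
final class DigestKey {
    private final byte[] bytes;

    DigestKey(byte[] digest) {
        // Defensive copy so later changes to the caller's array
        // cannot alter a key that is already in a map.
        this.bytes = digest.clone();
    }

    @Override
    public boolean equals(Object o) {
        // Compares array contents, unlike byte[].equals which compares identity.
        return o instanceof DigestKey && Arrays.equals(bytes, ((DigestKey) o).bytes);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);
    }

    // Option 2: a lossless text form of the digest. Base64 maps every
    // byte sequence to a distinct string, independent of any charset.
    String toBase64() {
        return Base64.getEncoder().encodeToString(bytes);
    }
}
```

Either form keeps distinct digests distinct: two keys are equal only
when their underlying bytes are equal, regardless of the default
character set.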
 
I'll be glad to create a patch for the chosen fix.
 
Jon