You are viewing a plain text version of this content. The canonical link for it is here.

Posted to jira@kafka.apache.org by GitBox <gi...@apache.org> on 2020/12/01 15:26:59 UTC

[GitHub] [kafka] viktorsomogyi commented on pull request #9519: KAFKA-10650: Use Murmur3 instead of MD5 in SkimpyOffsetMap

viktorsomogyi commented on pull request #9519:
URL: https://github.com/apache/kafka/pull/9519#issuecomment-736624841

I ran another test with 2,000,000,000 keys being inserted into a Bloom Filter with 1% false positive rate (and it ran for 7 hours and 21 minutes). The test set was created by generating random printable ASCII characters and adding an incremental counter in a random position within a string. The counter was formatted to display 10 digits, so 16 would look like 0000000016 in the string, therefore preserving the length of the strings (which was 110 characters with the counter included).

MD5 hash collisions: 3,325,038 (0.1662%)
Murmur3 128bit hash collisions: 3,330,883 (0.1665%)
Murmur3 64bit hash collisions: 3,327,266 (0.1663%)
Murmur3 32bit hash collisions: 401,833,502 (20.0916%)
XXHash 64bit hash collisions: 3,327,266 (0.1663%)

It shows that interestingly there were slightly more collisions in Murmur3 than in MD5 (MD5 was 0.0002% better, which is 5845 collisions) and interestingly the 64bit hashes performed the best (and I won't mention the 32bit hash :D). Given that there were 2 billion unique keys, that 5845 statistically well within any range of error (it's 0,1% of 3,330,883). I don't expect that with more test runs we'll have significantly bigger differences. As I observed the values were growing linearly with values close together (sometimes MD5 falling behind, sometimes getting ahead) so I think this 0.0002% doesn't mean too much. I didn't save this progress data out as I thought about this 4 hours into the test but I'll do it for the next one.

I'd also like to try out xxHash 128bit but I haven't found any Java JNIs for it yet but I'll keep searching. I expect though that it would perform similarly to the other 128bit hashes but overall I think this could be a good foundation for using traditional hashing algorithms instead of the slow MD5.
Another experiment could be to try out SHA1. As I saw it should be somewhat faster and more collision resistant than MD5, however I never did any kind of experiments, so we'll see if its performance gets meaningfully better (meaning it exceeds at least the 1% error rate).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org