Posted to jira@kafka.apache.org by GitBox <gi...@apache.org> on 2020/12/01 06:49:40 UTC

[GitHub] [kafka] viktorsomogyi edited a comment on pull request #9519: KAFKA-10650: Use Murmur3 instead of MD5 in SkimpyOffsetMap

viktorsomogyi edited a comment on pull request #9519:
URL: https://github.com/apache/kafka/pull/9519#issuecomment-736027069


   @lbradstreet it is really hard to give an exact answer to this, as the collision rate is difficult to calculate mathematically and depends heavily on the size and values of the test set. For non-cryptographic hashes it is possible to craft keys that all land in the same bucket (a hash-flooding DoS attack) and thus slow down lookups. On the theoretical side, though, Murmur3 passes the most often cited chi-square test and has a very good avalanche effect, so the hashes it generates are very close to uniformly distributed.
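   (Not part of this PR, just a quick sketch of the avalanche point above, assuming Guava's Hashing utilities and a made-up pair of keys that differ in a single input bit:)

       import com.google.common.hash.Hashing;
       import java.nio.charset.StandardCharsets;

       public class Avalanche {
           public static void main(String[] args) {
               // "key-1" and "key-3" differ in exactly one input bit ('1' = 0x31, '3' = 0x33)
               byte[] a = Hashing.murmur3_128().hashString("key-1", StandardCharsets.UTF_8).asBytes();
               byte[] b = Hashing.murmur3_128().hashString("key-3", StandardCharsets.UTF_8).asBytes();
               int flipped = 0;
               for (int i = 0; i < a.length; i++) {
                   flipped += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
               }
               // With a good avalanche effect roughly half of the 128 output bits change
               System.out.println("output bits flipped: " + flipped + " / " + (a.length * 8));
           }
       }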
   Because of the lack of published mathematical analysis on this comparison (Murmur3 vs. MD5) I started brute-force tests where I generated a few billion unique keys and inserted them into a Bloom filter (which had a 1% false positive probability). That showed that Murmur3 is actually on the same level as MD5; it generates roughly the same number of collisions. I have two types of datasets: the first can use any UTF-8 characters and the second uses only printable ASCII characters. In fact MD5, Murmur3 128-bit, Murmur3 64-bit and xxHash 64-bit all generated around the same number of collisions, about 0.016% out of 200 million unique keys. I added Murmur3 32-bit as a baseline, but it was significantly worse, around 2% collisions. We may need a much larger keyset to show the difference between the others; I'll try to do what I can.
   I'll publish my code in the coming days; I just have to work on something else too, so it's a bit slow, sorry :).
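   In the meantime, here is a minimal sketch of the kind of brute-force collision count described above (this is not the actual test code; it assumes Guava's Hashing and BloomFilter utilities and a made-up key format, and it counts a "collision" whenever the Bloom filter has possibly seen the digest before, so the filter's own 1% false positive rate also contributes to the observed number):

       import com.google.common.hash.BloomFilter;
       import com.google.common.hash.Funnels;
       import com.google.common.hash.HashFunction;
       import com.google.common.hash.Hashing;
       import java.nio.charset.StandardCharsets;

       public class CollisionCount {
           public static void main(String[] args) {
               HashFunction hash = Hashing.murmur3_128();   // swap in Hashing.md5() etc. to compare
               long keys = 200_000_000L;
               // Bloom filter over the hash outputs, 1% false positive probability
               BloomFilter<byte[]> seen =
                       BloomFilter.create(Funnels.byteArrayFunnel(), keys, 0.01);
               long collisions = 0;
               for (long i = 0; i < keys; i++) {
                   byte[] digest = hash.hashString("key-" + i, StandardCharsets.UTF_8).asBytes();
                   if (seen.mightContain(digest)) {
                       collisions++;                        // possible hash collision or Bloom false positive
                   }
                   seen.put(digest);
               }
               System.out.printf("collisions: %d (%.4f%%)%n", collisions, 100.0 * collisions / keys);
           }
       }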
   
   On the other hand, if we want to make sure that there will be no collisions at all, I don't think that's possible with any of these hash functions; there is always a chance. To rule collisions out completely we would either have to store the user key-to-hash mappings (similarly to the offset indexes) and reject new, colliding keys, or use perfect hashing (but that wouldn't work well, since it requires knowledge of the full keyset, or the cache would have to be rebuilt on every insert, or at least often).
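   For illustration only (this is not what the PR proposes), rejecting colliding keys would amount to keeping the original key next to the stored offset and refusing a put whose hash matches an entry with a different key; a hypothetical sketch:

       import java.util.HashMap;
       import java.util.Map;

       // Hypothetical sketch: store the full key alongside the offset and reject
       // a new key whose hash collides with a different, already-stored key.
       public class CollisionRejectingMap {
           private final Map<Long, Map.Entry<String, Long>> slots = new HashMap<>();

           // Returns false (rejects the put) on a hash collision with a different key.
           public boolean put(String key, long offset, long hash) {
               Map.Entry<String, Long> existing = slots.get(hash);
               if (existing != null && !existing.getKey().equals(key)) {
                   return false;   // colliding key: reject instead of silently overwriting
               }
               slots.put(hash, Map.entry(key, offset));
               return true;
           }

           public Long get(String key, long hash) {
               Map.Entry<String, Long> existing = slots.get(hash);
               return (existing != null && existing.getKey().equals(key)) ? existing.getValue() : null;
           }
       }

   (Storing the full keys of course costs more memory than storing only hashes, which is part of why the SkimpyOffsetMap stores hashes in the first place.)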

