Posted to commits@datasketches.apache.org by GitBox <gi...@apache.org> on 2021/05/08 23:55:39 UTC

[GitHub] [datasketches-java] leerho edited a comment on pull request #353: Add ByteBuffer hashing methods to MurmurHash3, BaseHllSketch.

leerho edited a comment on pull request #353:
URL: https://github.com/apache/datasketches-java/pull/353#issuecomment-834857892


   @giianm
   Now that I've had a look at this, I have some questions and possible suggestions to make it even faster.
   
   1. It would help me if you could give me an idea of the range of sizes of the ByteBuffer objects you wish to hash. Even better would be a description of the distribution of sizes (e.g., 95th percentile is 100 bytes, median is 10 bytes, etc.).
   2. As I recall from some characterization studies I did several years ago, `ByteBuffer.getLong()` is much slower than `LongBuffer.get()`, because under the covers the BB does a brute-force byte-by-byte assembly of the `long` rather than calling `Unsafe.getLong()`. And if your BB is of any reasonable size, the getLong calls will dominate the cost. I understand why you chose this approach, but keep in mind what Java is doing under the covers; I think we can do better. (See the sketch of the two access patterns just after this list.)
   3. Also, I'm a bit surprised that the switch statement would be faster than a simple tight loop. But perhaps you have recently characterized both, and I could be wrong.
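
   A minimal sketch (mine, not code from this PR) of the two access patterns in point 2, assuming a heap `ByteBuffer`; the `sumVia*` names are just for illustration:
   
   ```java
   import java.nio.ByteBuffer;
   import java.nio.LongBuffer;
   
   public final class GetLongCompare {
   
     // Reads each long with absolute ByteBuffer.getLong(index) calls.
     static long sumViaByteBuffer(final ByteBuffer bb) {
       final int numLongs = bb.remaining() >>> 3; // whole longs only
       long sum = 0;
       for (int i = 0; i < numLongs; i++) {
         sum += bb.getLong(bb.position() + (i << 3));
       }
       return sum;
     }
   
     // Reads the same longs through a LongBuffer view of the remaining bytes.
     static long sumViaLongBuffer(final ByteBuffer bb) {
       final LongBuffer lb = bb.asLongBuffer();
       long sum = 0;
       for (int i = 0; i < lb.remaining(); i++) {
         sum += lb.get(i);
       }
       return sum;
     }
   
     public static void main(final String[] args) {
       final ByteBuffer bb = ByteBuffer.allocate(1 << 10);
       for (int i = 0; i < bb.capacity(); i++) { bb.put(i, (byte) i); }
       System.out.println(sumViaByteBuffer(bb) == sumViaLongBuffer(bb)); // true
     }
   }
   ```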
   
   **WRT the MurmurHash3 modifications.**
   You should know that MurmurHash3 is a great hash function; we chose it a long time ago and are stuck with it for binary compatibility reasons. Nonetheless, there is a newer hash function, `XxHash`, that has avalanche and bit-independence properties comparable to MurmurHash3 and is about twice as fast. We already have XxHash integrated into the `datasketches.hash` package as well as into the `Memory` component. In fact, it was integrated into the Memory component for just this kind of case.
   
   So consider this approach:
   - Use Memory to wrap the ByteBuffer. Memory uses Unsafe to access the bytes underneath the BB.
   - Hash the BB bytes using XxHash. This is already a built-in function and is very fast.
   - Take the resulting 64-bit hash and feed it as a `long` into HLL. Yes, you will be hashing twice, but if your BB is larger than 10 or so longs, the speed you gain from xxHash will more than make up for the double hashing (see the sketch just after this list).
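
   A rough sketch of those three steps, using the library calls roughly as I remember them (`Memory.wrap(ByteBuffer)`, an `xxHash64(offset, length, seed)` call on `Memory`, and `HllSketch.update(long)`); treat the exact signatures as approximate rather than final code:
   
   ```java
   import java.nio.ByteBuffer;
   import org.apache.datasketches.hll.HllSketch;
   import org.apache.datasketches.memory.Memory;
   
   public final class ByteBufferToHll {
   
     static void updateWithByteBuffer(final HllSketch sketch, final ByteBuffer bb) {
       // 1. Wrap the ByteBuffer; Memory reads the underlying bytes via Unsafe, no copy.
       final Memory mem = Memory.wrap(bb);
       // 2. Hash all of the wrapped bytes once with the built-in xxHash (seed 0 here).
       final long hash = mem.xxHash64(0, mem.getCapacity(), 0);
       // 3. Feed the single 64-bit result into the HLL sketch as a long.
       sketch.update(hash);
     }
   }
   ```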
   
   How much improvement you get will depend on your distribution of BB sizes, which comes back to my question in #1.
   
   Keep in mind that Panama is now integrated into JDK 14+ (JDK 16 is the most recent), and I am in the process of migrating the whole Java library, first to JDK 9-13, and then to heavily leverage Panama starting with JDK 17, which is the next LTS after JDK 11.
   
   With Panama, the scheme would be identical except that you would use the new Panama MemorySegment instead of our Memory; the steps would be the same. So the switch over to Panama should be really easy.
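
   Purely as an illustration of that point, the wrap-then-read step with a `MemorySegment` would look something like the sketch below; the exact class and method names depend on which Panama release you are on, so treat these as illustrative only:
   
   ```java
   import java.lang.foreign.MemorySegment;
   import java.lang.foreign.ValueLayout;
   import java.nio.ByteBuffer;
   
   public final class PanamaView {
   
     // Zero-copy view of the ByteBuffer, then read one long at offset 0.
     static long firstLong(final ByteBuffer bb) {
       final MemorySegment seg = MemorySegment.ofBuffer(bb);
       return seg.get(ValueLayout.JAVA_LONG_UNALIGNED, 0);
     }
   }
   ```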
   
   Lee.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org