Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/06/14 21:51:14 UTC

[GitHub] [arrow] westonpace commented on issue #36059: [C++] Performance of building up HashTable (MemoTable) in is_in kernel

westonpace commented on issue #36059:
URL: https://github.com/apache/arrow/issues/36059#issuecomment-1592037527

   If someone really wanted to be adventurous, a [swiss table](https://faultlore.com/blah/hashbrown-tldr/) is generally better suited to columnar batch operations (it is more vectorization friendly).
   
   We have one in `src/arrow/acero/swiss_join_internal.h`. However, we probably can't use it directly, since it is built around the row encoding and does a lot more work than an is_in operation strictly needs. Even so, I can perform a full hash_join of two tables faster than `is_in` (it's about 2-3x slower than pandas on my system), so that is encouraging. It could be used for inspiration.
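
To make the idea concrete, here is a minimal sketch of a swiss-table-style set used for an is_in-like lookup. It is an illustration only, not Arrow's swiss table or `MemoTable`: the names `SwissSet` and `IsIn`, the group size of 16, the fixed capacity (no resizing, no deletion), and the hash function are all assumptions made for this example. The point it shows is the layout: one control byte per slot holding a 7-bit hash tag, scanned a group at a time, so that most probes touch only the control bytes. The per-group loops are written in scalar form; a real implementation would typically compare a whole group of control bytes with a single SIMD instruction.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical constants for this sketch.
constexpr std::size_t kGroupSize = 16;  // control bytes scanned per probe step
constexpr std::uint8_t kEmpty = 0x80;   // high bit set: slot is unused

// Insert-only set of int64 keys with a swiss-table-style layout.
class SwissSet {
 public:
  explicit SwissSet(std::size_t expected_keys) {
    std::size_t groups = 1;
    // Keep the load factor at or below 50% so a probe always finds an empty slot.
    while (groups * kGroupSize < 2 * expected_keys) groups *= 2;
    ctrl_.assign(groups * kGroupSize, kEmpty);
    keys_.resize(groups * kGroupSize);
    group_mask_ = groups - 1;
  }

  void Insert(std::int64_t key) {
    const std::uint64_t h = Hash(key);
    const std::uint8_t tag = static_cast<std::uint8_t>(h & 0x7F);
    for (std::size_t g = (h >> 7) & group_mask_;; g = (g + 1) & group_mask_) {
      const std::size_t base = g * kGroupSize;
      // Tag comparison first; the key is only read when the tag matches.
      for (std::size_t i = 0; i < kGroupSize; ++i) {
        if (ctrl_[base + i] == tag && keys_[base + i] == key) return;
      }
      for (std::size_t i = 0; i < kGroupSize; ++i) {
        if (ctrl_[base + i] == kEmpty) {
          ctrl_[base + i] = tag;
          keys_[base + i] = key;
          ++size_;
          assert(size_ * 2 <= ctrl_.size());  // sketch does not resize
          return;
        }
      }
    }
  }

  bool Contains(std::int64_t key) const {
    const std::uint64_t h = Hash(key);
    const std::uint8_t tag = static_cast<std::uint8_t>(h & 0x7F);
    for (std::size_t g = (h >> 7) & group_mask_;; g = (g + 1) & group_mask_) {
      const std::size_t base = g * kGroupSize;
      bool saw_empty = false;
      for (std::size_t i = 0; i < kGroupSize; ++i) {
        const std::uint8_t c = ctrl_[base + i];
        if (c == tag && keys_[base + i] == key) return true;
        saw_empty |= (c == kEmpty);
      }
      // Without deletions, an empty slot in the group ends the probe sequence.
      if (saw_empty) return false;
    }
  }

 private:
  // 64-bit mixing function (the splitmix64 finalizer), chosen for this sketch.
  static std::uint64_t Hash(std::int64_t key) {
    std::uint64_t x = static_cast<std::uint64_t>(key);
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
  }

  std::vector<std::uint8_t> ctrl_;  // one control byte per slot
  std::vector<std::int64_t> keys_;  // key storage, parallel to ctrl_
  std::size_t group_mask_ = 0;
  std::size_t size_ = 0;
};

// Build the set once from value_set, then probe it for every input value.
std::vector<bool> IsIn(const std::vector<std::int64_t>& values,
                       const std::vector<std::int64_t>& value_set) {
  SwissSet set(value_set.size());
  for (std::int64_t v : value_set) set.Insert(v);
  std::vector<bool> out;
  out.reserve(values.size());
  for (std::int64_t v : values) out.push_back(set.Contains(v));
  return out;
}
```

Keeping the tags in their own dense array is what makes the structure batch friendly: a lookup over a column of values mostly reads control bytes and only touches the key array on a tag match.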

