Posted to issues@arrow.apache.org by "jorisvandenbossche (via GitHub)" <gi...@apache.org> on 2023/06/14 08:12:25 UTC

[GitHub] [arrow] jorisvandenbossche opened a new issue, #36059: [C++] Performance of building up HashTable (MemoTable) in is_in kernel

jorisvandenbossche opened a new issue, #36059:
URL: https://github.com/apache/arrow/issues/36059

   From some benchmarking in pandas, @phofl noticed that the "is_in" implementation of Arrow is considerably slower than pandas' (khash-based) implementation, _if_ the value_set is big (and thus when the execution time is dominated by creating the hashtable rather than by the actual lookups).
   
   A small illustration comparing `pc.is_in` with `pandas.core.algorithms.isin` (the equivalent pandas function), using a large value_set and a small array (so that we are mostly timing the hashtable creation):
   ```
   import numpy as np
   import pandas as pd
   import pyarrow as pa
   import pyarrow.compute as pc

   arr = pa.array(np.random.choice(np.arange(1000), 100))  # small lookup array
   for n in [10_000, 100_000, 1_000_000, 10_000_000]:
       print(n)
       value_set = pa.array(np.arange(n))  # large value_set dominates the timing
       %timeit pc.is_in(arr, value_set)
       %timeit pd.core.algorithms.isin(np.asarray(arr), np.asarray(value_set))
   ```
   
   gives (for each size, the first timing is `pc.is_in`, the second is pandas):
   
   ```
   10000
   158 µs ± 581 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
   66.8 µs ± 473 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
   100000
   8.51 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
   627 µs ± 41.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
   1000000
   103 ms ± 8.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
   28.5 ms ± 4.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
   10000000
   1.26 s ± 33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
   171 ms ± 22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
   ```
   
   So pandas is often 4-10x faster. I am not familiar with our HashTable implementation or whether this is a fully fair comparison, but it seems to be at least an indication that there is room for performance improvement (and I suppose this might apply to HashTable-based functions in general, not just "is_in").
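   
   If the cost is really in building up the hash table, other hash-based kernels (such as `unique`, `value_counts`, and dictionary encoding, which to my understanding go through the same MemoTable machinery) would presumably show a similar pattern. A quick way to check from Python (my own follow-up snippet, not part of the report above):
   
   ```
   import numpy as np
   import pyarrow as pa
   import pyarrow.compute as pc
   
   value_set = pa.array(np.arange(10_000_000))
   
   # These kernels also hash their entire input, so they should expose a
   # comparable hash-table build cost if that is where the time goes.
   %timeit pc.unique(value_set)
   %timeit pc.value_counts(value_set)
   %timeit value_set.dictionary_encode()
   ```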




[GitHub] [arrow] westonpace commented on issue #36059: [C++] Performance of building up HashTable (MemoTable) in is_in kernel

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #36059:
URL: https://github.com/apache/arrow/issues/36059#issuecomment-1592037527

   If someone really wanted to be adventurous, a [swiss table](https://faultlore.com/blah/hashbrown-tldr/) is generally better suited for columnar batch operations (more vectorization-friendly).
   
   We have one in `src/arrow/acero/swiss_join_internal.h`. However, we probably can't use it directly, since it is built around the row encoding and does a lot more work than would strictly be needed for an is_in operation. Even so, I can perform a full hash_join of two tables faster than `is_in` (it's about 2-3x slower than pandas on my system), so that is encouraging. It could be used for inspiration.
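   
   As a rough way to exercise both code paths from Python (this snippet is my own illustration, not from the comment above; the column name `"v"` and the sizes are arbitrary), one can time `pc.is_in` against an Acero-backed "left semi" join, which goes through Acero's hash-join machinery (where the swiss table lives) rather than the MemoTable:
   
   ```
   import numpy as np
   import pyarrow as pa
   import pyarrow.compute as pc
   
   arr = pa.array(np.random.choice(np.arange(1000), 100))
   value_set = pa.array(np.arange(10_000_000))
   
   # Membership via the hash kernel (MemoTable-based).
   mask = pc.is_in(arr, value_set)
   
   # Join-based proxy: a "left semi" join keeps the rows of the left table
   # that have a match in the right table, using the Acero hash join.
   left = pa.table({"v": arr})
   right = pa.table({"v": value_set})
   matched = left.join(right, keys="v", join_type="left semi")
   ```
   
   Note that the semi join filters rows rather than producing a boolean mask, so it is only a proxy for comparing the hash-build cost, not a drop-in replacement for `is_in`.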




[GitHub] [arrow] pitrou commented on issue #36059: [C++] Performance of building up HashTable (MemoTable) in is_in kernel

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #36059:
URL: https://github.com/apache/arrow/issues/36059#issuecomment-1603024003

   @phofl You can open another issue if you think performance can be further improved.




[GitHub] [arrow] phofl commented on issue #36059: [C++] Performance of building up HashTable (MemoTable) in is_in kernel

Posted by "phofl (via GitHub)" <gi...@apache.org>.
phofl commented on issue #36059:
URL: https://github.com/apache/arrow/issues/36059#issuecomment-1603011492

   Can we reopen this? It looks like it only addresses part of the difference.




[GitHub] [arrow] js8544 commented on issue #36059: [C++] Performance of building up HashTable (MemoTable) in is_in kernel

Posted by "js8544 (via GitHub)" <gi...@apache.org>.
js8544 commented on issue #36059:
URL: https://github.com/apache/arrow/issues/36059#issuecomment-1591516918

   I found out that we can pre-allocate buffer space for the hash table before inserting elements into it. This completely avoids rehashing during the insertion process and yields some performance improvements. On my machine I get:
   
   Before
   ```
   100
   4.49 µs ± 94.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
   16.2 µs ± 515 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
   1000
   8.75 µs ± 51.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
   22.4 µs ± 414 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
   10000
   71.4 µs ± 1.84 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
   53.7 µs ± 937 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
   100000
   3.78 ms ± 267 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
   375 µs ± 1.33 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
   1000000
   41.2 ms ± 929 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
   17.7 ms ± 705 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
   10000000
   679 ms ± 24.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
   127 ms ± 1.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 
   ```
   
   After
   ```
   100
   3.73 µs ± 57.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
   17 µs ± 156 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
   1000
   7.75 µs ± 190 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
   23.2 µs ± 602 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
   10000
   44.5 µs ± 751 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
   54.4 µs ± 926 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
   100000
   519 µs ± 27 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
   378 µs ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
   1000000
   38.3 ms ± 520 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
   17.4 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
   10000000
   557 ms ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
   124 ms ± 3.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
   ```
   
   The improvement is outstanding for input sizes <= 100000 (up to several times faster); however, the gain is smaller for larger inputs, since the rehashing cost was already amortized there. There is still a large performance gap during the `Lookup` phase that remains unresolved.
   
   Nevertheless, I'll submit a PR for this improvement.
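   
   Conceptually, the change amounts to reserving hash-table capacity up front when the number of insertions is known (here, the length of the value_set), so the table never has to grow and rehash mid-build. A minimal, illustrative sketch of the idea in Python (a toy linear-probing set, not Arrow's actual MemoTable):
   
   ```
   class ToyHashSet:
       """Toy linear-probing hash set illustrating up-front capacity reservation."""
   
       def __init__(self, expected_size=0):
           # Reserve enough slots for the expected number of inserts so the
           # table never needs to grow (and rehash every key) mid-build.
           cap = 8
           while cap * 0.75 < expected_size:
               cap *= 2
           self._slots = [None] * cap
           self._size = 0
   
       def add(self, key):
           if self._size + 1 > 0.75 * len(self._slots):
               self._grow()  # rehashes every stored key; this is the cost being avoided
           cap = len(self._slots)
           i = hash(key) % cap
           while self._slots[i] is not None and self._slots[i] != key:
               i = (i + 1) % cap  # linear probing
           if self._slots[i] is None:
               self._slots[i] = key
               self._size += 1
   
       def _grow(self):
           old, self._slots, self._size = self._slots, [None] * (2 * len(self._slots)), 0
           for key in old:
               if key is not None:
                   self.add(key)
   
   # Pre-sizing to the known value_set length skips every _grow() call;
   # the default capacity triggers repeated rehashing while building.
   table = ToyHashSet(expected_size=1_000_000)
   for v in range(1_000_000):
       table.add(v)
   ```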




[GitHub] [arrow] pitrou closed issue #36059: [C++] Performance of building up HashTable (MemoTable) in is_in kernel

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou closed issue #36059: [C++] Performance of building up HashTable (MemoTable) in is_in kernel
URL: https://github.com/apache/arrow/issues/36059

