Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/28 00:40:03 UTC

[GitHub] [arrow] wesm commented on pull request #9280: ARROW-8928: [C++] Add microbenchmarks to help measure ExecBatchIterator overhead

wesm commented on pull request #9280:
URL: https://github.com/apache/arrow/pull/9280#issuecomment-887924786


   Some updated performance (gcc 9.3 locally on x86):
   
   ```
   -------------------------------------------------------------------------------------
   Benchmark                           Time             CPU   Iterations UserCounters...
   -------------------------------------------------------------------------------------
   BM_ExecBatchIterator/256     11314787 ns     11313272 ns           62 items_per_second=88.3918/s
   BM_ExecBatchIterator/512      5670423 ns      5669598 ns          123 items_per_second=176.379/s
   BM_ExecBatchIterator/1024     2903937 ns      2903272 ns          242 items_per_second=344.439/s
   BM_ExecBatchIterator/2048     1461982 ns      1461711 ns          481 items_per_second=684.13/s
   BM_ExecBatchIterator/4096      739382 ns       739235 ns          951 items_per_second=1.35275k/s
   BM_ExecBatchIterator/8192      370264 ns       370207 ns         1892 items_per_second=2.70119k/s
   BM_ExecBatchIterator/16384     186622 ns       186573 ns         3755 items_per_second=5.35983k/s
   BM_ExecBatchIterator/32768      93581 ns        93567 ns         7437 items_per_second=10.6876k/s
   ```
   
   The way to read this is that breaking an `ExecBatch` with 32 primitive array fields into smaller ExecBatches (and then accessing a data pointer in each batch) has an overhead of approximately (a quick sanity check follows the list):
   
   * 2800 nanoseconds per batch
   * 88.6 nanoseconds per batch per field
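   
   As a quick sanity check, these constants reproduce the figure used in the next paragraph. This is just back-of-the-envelope arithmetic on the fitted numbers above; `slicing_overhead_us` is a hypothetical helper, not an Arrow API:
   
   ```python
   # Model the slicing overhead from the fitted constants quoted above
   # (hypothetical helper, not an Arrow API).
   def slicing_overhead_us(total_rows, batch_size, num_fields,
                           ns_per_batch_per_field=88.6):
       num_batches = -(-total_rows // batch_size)  # ceiling division
       overhead_ns = num_batches * num_fields * ns_per_batch_per_field
       return overhead_ns / 1000.0
   
   # 1M rows in batches of 1024 with 32 fields -> ~2903 microseconds
   print(slicing_overhead_us(1 << 20, 1024, 32))
   ```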
   
   So if you wanted to break a batch with 1M elements into batches of size 1024 for finer-grained parallel processing, you would pay roughly 2900 microseconds to do so. On this same machine, I have:
   
   ```
   In [2]: arr = np.random.randn(1 << 20)                                                                                                                                                         
   
   In [3]: timeit arr * 2                                                                                                                                                                         
   395 µs ± 8.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
   ```
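   
   Putting the two measurements side by side makes the gap explicit (same numbers as above, no new measurement):
   
   ```python
   # Per-batch comparison at batch size 1024, using the numbers above.
   total_rows, batch_size, num_fields = 1 << 20, 1024, 32
   num_batches = total_rows // batch_size    # 1024 batches
   
   overhead_ns = num_fields * 88.6           # ~2836 ns of iterator overhead per batch
   compute_ns = 395_000 / num_batches        # ~386 ns of numpy multiply work per batch
   
   print(overhead_ns / compute_ns)           # the overhead alone is ~7x the compute
   ```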
   
   This seems problematic if we wish to enable array expression evaluation on smaller batch sizes to keep more data in CPU caches: at batch size 1024, the slicing overhead alone is roughly 7x the cost of the multiply itself. I'll bring this up on the mailing list to see what people think.

