Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/28 00:40:03 UTC
[GitHub] [arrow] wesm commented on pull request #9280: ARROW-8928: [C++] Add microbenchmarks to help measure ExecBatchIterator overhead
wesm commented on pull request #9280:
URL: https://github.com/apache/arrow/pull/9280#issuecomment-887924786
Some updated performance numbers (gcc 9.3, run locally on x86):
```
-----------------------------------------------------------------------------------------------
Benchmark                          Time              CPU   Iterations  UserCounters...
-----------------------------------------------------------------------------------------------
BM_ExecBatchIterator/256     11314787 ns      11313272 ns           62  items_per_second=88.3918/s
BM_ExecBatchIterator/512      5670423 ns       5669598 ns          123  items_per_second=176.379/s
BM_ExecBatchIterator/1024     2903937 ns       2903272 ns          242  items_per_second=344.439/s
BM_ExecBatchIterator/2048     1461982 ns       1461711 ns          481  items_per_second=684.13/s
BM_ExecBatchIterator/4096      739382 ns        739235 ns          951  items_per_second=1.35275k/s
BM_ExecBatchIterator/8192      370264 ns        370207 ns         1892  items_per_second=2.70119k/s
BM_ExecBatchIterator/16384     186622 ns        186573 ns         3755  items_per_second=5.35983k/s
BM_ExecBatchIterator/32768      93581 ns         93567 ns         7437  items_per_second=10.6876k/s
```
The way to read this is that breaking an `ExecBatch` with 32 primitive array fields into smaller `ExecBatch`es (and then accessing a data pointer in each batch) has an overhead of approximately:
* 2800 nanoseconds per batch
* 88.6 nanoseconds per batch per field
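As a rough check of those two figures, the arithmetic can be sketched in Python. The times are copied from the benchmark table; the total of `1 << 20` rows being split is an assumption inferred from the numbers (it is not stated in the benchmark output), and the helper name `overhead` is purely illustrative.

```python
# Wall-clock times from the benchmark table: batch size -> ns per iteration.
TIMES_NS = {
    256: 11314787,
    512: 5670423,
    1024: 2903937,
    2048: 1461982,
    4096: 739382,
    8192: 370264,
    16384: 186622,
    32768: 93581,
}

TOTAL_ROWS = 1 << 20  # assumption: one iteration splits a 1M-row ExecBatch
NUM_FIELDS = 32       # stated in the comment: 32 primitive array fields


def overhead(batch_size, time_ns):
    """Return (number of batches, ns per batch, ns per batch per field)."""
    num_batches = TOTAL_ROWS // batch_size
    per_batch_ns = time_ns / num_batches
    per_field_ns = per_batch_ns / NUM_FIELDS
    return num_batches, per_batch_ns, per_field_ns


for size, time_ns in TIMES_NS.items():
    n, per_batch, per_field = overhead(size, time_ns)
    print(f"{size:>6} rows: {n:>5} batches, "
          f"{per_batch:7.0f} ns/batch, {per_field:5.1f} ns/batch/field")
```

Under that assumption the per-batch cost stays between roughly 2760 and 2930 ns across all batch sizes, and the 1024-row case gives the 88.6 ns per batch per field quoted above.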
So if you wanted to break a batch with 1M elements into batches of size 1024 for finer-grained parallel processing, you would pay about 2900 microseconds to do so (1024 batches at roughly 2800 nanoseconds each). For comparison, on this same machine, I have:
```
In [2]: arr = np.random.randn(1 << 20)
In [3]: timeit arr * 2
395 µs ± 8.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
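To put the two measurements side by side, here is a small sketch of the comparison. It only uses numbers quoted in this comment. Note that the benchmark batch has 32 fields while the NumPy expression touches a single array, so the sketch also shows a per-field figure; treating the splitting cost as evenly divisible across fields is an assumption.

```python
# Figures taken from this comment.
SPLIT_NS = 2903937   # BM_ExecBatchIterator/1024: time to split the whole batch
NUM_FIELDS = 32      # fields in the benchmarked ExecBatch
MULTIPLY_US = 395    # mean time of `arr * 2` on 1 << 20 float64 values

split_us = SPLIT_NS / 1_000

# Cost of splitting all 32 fields relative to one array multiplication.
ratio = split_us / MULTIPLY_US

# Assumption: the splitting cost divides evenly across the fields.
per_field_us = split_us / NUM_FIELDS
per_field_fraction = per_field_us / MULTIPLY_US

print(f"splitting all fields: {split_us:.0f} us ({ratio:.1f}x the multiply)")
print(f"splitting one field:  {per_field_us:.1f} us "
      f"({per_field_fraction:.0%} of the multiply)")
```

Either way, the iteration overhead is far from negligible next to the cost of a simple elementwise kernel.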
This seems problematic if we wish to enable array expression evaluation on smaller batch sizes to keep more data in CPU caches. I'll bring this up on the mailing list to see what people think.