Posted to issues@arrow.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/11/28 23:29:00 UTC
[jira] [Commented] (ARROW-1844) [C++] Basic benchmark suite for hash kernels
[ https://issues.apache.org/jira/browse/ARROW-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269706#comment-16269706 ]
ASF GitHub Bot commented on ARROW-1844:
---------------------------------------
wesm opened a new pull request #1370: ARROW-1844: [C++] Add initial Unique benchmarks for int64, variable-length strings
URL: https://github.com/apache/arrow/pull/1370
I also fixed a bug in the hash table resize that these benchmarks surfaced (unit test coverage was not adequate).
Now we have:
```
$ ./release/compute-benchmark
Run on (8 X 4174.84 MHz CPU s)
2017-11-28 18:18:26
Benchmark                                                         Time         CPU  Iterations
------------------------------------------------------------------------------------------------------------
BM_BuildDictionary/min_time:1.000                              1451 us     1451 us         959  2.68974GB/s
BM_BuildStringDictionary/min_time:1.000                        4005 us     4005 us         350  75.3785MB/s
BM_UniqueInt64NoNulls/16M/50/min_time:1.000/real_time         35940 us    35942 us          39  91.3192MB/s
BM_UniqueInt64NoNulls/16M/1024/min_time:1.000/real_time      120002 us   120006 us          12  88.8877MB/s
BM_UniqueInt64NoNulls/16M/10k/min_time:1.000/real_time       175855 us   175862 us           8  90.9838MB/s
BM_UniqueInt64NoNulls/16M/1024k/min_time:1.000/real_time     452242 us   452257 us           3  94.3449MB/s
BM_UniqueInt64WithNulls/16M/50/min_time:1.000/real_time       58632 us    58634 us          29  75.2797MB/s
BM_UniqueInt64WithNulls/16M/1024/min_time:1.000/real_time    134079 us   134084 us          10  95.4661MB/s
BM_UniqueInt64WithNulls/16M/10k/min_time:1.000/real_time     183846 us   183851 us           8  87.0295MB/s
BM_UniqueInt64WithNulls/16M/1024k/min_time:1.000/real_time   528790 us   528808 us           3  80.6873MB/s
BM_UniqueString10bytes/16M/50/min_time:1.000/real_time       152207 us   152212 us           9  116.8MB/s
BM_UniqueString10bytes/16M/1024/min_time:1.000/real_time     260047 us   260056 us           5  123.055MB/s
BM_UniqueString10bytes/16M/10k/min_time:1.000/real_time      426539 us   426552 us           3  125.038MB/s
BM_UniqueString10bytes/16M/1024k/min_time:1.000/real_time   1716739 us  1716791 us           1  93.2MB/s
BM_UniqueString100bytes/16M/50/min_time:1.000/real_time      556145 us   556165 us           3  958.982MB/s
BM_UniqueString100bytes/16M/1024/min_time:1.000/real_time    693922 us   693943 us           2  1.12585GB/s
BM_UniqueString100bytes/16M/10k/min_time:1.000/real_time    1000449 us  1000484 us           1  1.5618GB/s
BM_UniqueString100bytes/16M/1024k/min_time:1.000/real_time  3591215 us  3591314 us           1  445.532MB/s
```
This suggests quite a lot of room for improvement. It's counterintuitive to me that hashing strings appears faster than hashing integers, so we should figure out what's going on there.
We can also refactor the hash table implementations without worrying too much about whether we're making things slower, since these benchmarks will show any regression.
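For readers who want to reproduce the general shape of these measurements without the Arrow build, here is a minimal sketch using only the C++ standard library. It is not the benchmark code from the pull request: `std::unordered_set` stands in for Arrow's hash kernel, `std::chrono` stands in for the benchmark framework, and `MakeInt64Input`, `CountUnique`, `TimeUnique`, and `TimingResult` are hypothetical names chosen for illustration. Throughput here is defined as input bytes per second of wall-clock time, which is an assumption about how to count bytes.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <unordered_set>
#include <vector>

// Hypothetical helper: `length` int64 values drawn uniformly from
// [0, cardinality), so at most `cardinality` distinct values appear.
std::vector<int64_t> MakeInt64Input(std::size_t length, int64_t cardinality,
                                    uint32_t seed = 42) {
  assert(cardinality > 0);
  std::mt19937_64 gen(seed);
  std::uniform_int_distribution<int64_t> dist(0, cardinality - 1);
  std::vector<int64_t> values(length);
  for (auto& v : values) v = dist(gen);
  return values;
}

// Stand-in for a Unique kernel (assumption: not Arrow's implementation).
// Counts distinct values by inserting everything into a hash set.
std::size_t CountUnique(const std::vector<int64_t>& values) {
  std::unordered_set<int64_t> seen(values.begin(), values.end());
  return seen.size();
}

struct TimingResult {
  std::size_t num_unique;
  double seconds;
  double mb_per_second;
};

// Times a single pass over the input and reports input megabytes per second.
TimingResult TimeUnique(const std::vector<int64_t>& values) {
  const auto start = std::chrono::steady_clock::now();
  const std::size_t num_unique = CountUnique(values);
  const auto stop = std::chrono::steady_clock::now();
  const double seconds = std::chrono::duration<double>(stop - start).count();
  const double bytes = static_cast<double>(values.size() * sizeof(int64_t));
  return {num_unique, seconds, bytes / seconds / (1024.0 * 1024.0)};
}
```

Calling `TimeUnique(MakeInt64Input(1 << 24, 1024))` approximates the 16M-element, 1024-distinct-value case. A single timed pass is noisy, so a real comparison should repeat the measurement, as the benchmark framework does above.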
> [C++] Basic benchmark suite for hash kernels
> --------------------------------------------
>
> Key: ARROW-1844
> URL: https://issues.apache.org/jira/browse/ARROW-1844
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Labels: pull-request-available
> Fix For: 0.8.0
>
>
> * Integers, small cardinality and large cardinality
> * Short strings, small/large cardinality
> * Long strings, small/large cardinality
> These benchmarks will enable us to refactor without fear and to experiment with faster hash functions.
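
The string cases listed in the issue need inputs whose byte length and number of distinct values can be controlled independently. One way to build such inputs is sketched below, using only the C++ standard library. `EncodeIndex`, `MakeStringInput`, and `CountDistinct` are hypothetical helpers, not Arrow functions, and encoding the dictionary index in base 26 is just one convenient way to guarantee that the dictionary strings are distinct.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: maps an index to a fixed-length lowercase string by
// writing the index in base 26 into the trailing characters, padded with 'a'.
// Distinct indices give distinct strings as long as the index fits.
std::string EncodeIndex(std::size_t index, std::size_t byte_length) {
  std::string s(byte_length, 'a');
  for (std::size_t pos = byte_length; pos > 0 && index > 0; --pos) {
    s[pos - 1] = static_cast<char>('a' + index % 26);
    index /= 26;
  }
  assert(index == 0);  // fails if byte_length is too short for this index
  return s;
}

// Hypothetical helper: `length` strings of `byte_length` bytes each, drawn
// uniformly from a dictionary of `cardinality` distinct strings.
std::vector<std::string> MakeStringInput(std::size_t length,
                                         std::size_t cardinality,
                                         std::size_t byte_length,
                                         uint32_t seed = 42) {
  assert(cardinality > 0);
  std::mt19937_64 gen(seed);
  std::uniform_int_distribution<std::size_t> dist(0, cardinality - 1);
  std::vector<std::string> values;
  values.reserve(length);
  for (std::size_t i = 0; i < length; ++i) {
    values.push_back(EncodeIndex(dist(gen), byte_length));
  }
  return values;
}

// Checks how many distinct strings were actually generated.
std::size_t CountDistinct(const std::vector<std::string>& values) {
  std::unordered_set<std::string> seen(values.begin(), values.end());
  return seen.size();
}
```

For example, `MakeStringInput(n, 50, 10)` and `MakeStringInput(n, 1024 * 1024, 100)` would correspond to the short-string, small-cardinality and long-string, large-cardinality cases. Because the strings differ only in their trailing bytes, this input may favor or penalize particular hash functions, so it is worth varying the encoding when comparing them.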