Posted to issues@arrow.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/11/28 23:29:00 UTC

[jira] [Commented] (ARROW-1844) [C++] Basic benchmark suite for hash kernels

    [ https://issues.apache.org/jira/browse/ARROW-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269706#comment-16269706 ] 

ASF GitHub Bot commented on ARROW-1844:
---------------------------------------

wesm opened a new pull request #1370: ARROW-1844: [C++] Add initial Unique benchmarks for int64, variable-length strings
URL: https://github.com/apache/arrow/pull/1370
 
 
   I also fixed a bug in the hash table resize that these benchmarks surfaced (unit test coverage was not adequate).
   
   Now we have:
   
   ```
   $ ./release/compute-benchmark 
   Run on (8 X 4174.84 MHz CPU s)
   2017-11-28 18:18:26
   Benchmark                                                           Time           CPU Iterations
   -------------------------------------------------------------------------------------------------
   BM_BuildDictionary/min_time:1.000                                1451 us       1451 us        959   2.68974GB/s
   BM_BuildStringDictionary/min_time:1.000                          4005 us       4005 us        350   75.3785MB/s
   BM_UniqueInt64NoNulls/16M/50/min_time:1.000/real_time           35940 us      35942 us         39   91.3192MB/s
   BM_UniqueInt64NoNulls/16M/1024/min_time:1.000/real_time        120002 us     120006 us         12   88.8877MB/s
   BM_UniqueInt64NoNulls/16M/10k/min_time:1.000/real_time         175855 us     175862 us          8   90.9838MB/s
   BM_UniqueInt64NoNulls/16M/1024k/min_time:1.000/real_time       452242 us     452257 us          3   94.3449MB/s
   BM_UniqueInt64WithNulls/16M/50/min_time:1.000/real_time         58632 us      58634 us         29   75.2797MB/s
   BM_UniqueInt64WithNulls/16M/1024/min_time:1.000/real_time      134079 us     134084 us         10   95.4661MB/s
   BM_UniqueInt64WithNulls/16M/10k/min_time:1.000/real_time       183846 us     183851 us          8   87.0295MB/s
   BM_UniqueInt64WithNulls/16M/1024k/min_time:1.000/real_time     528790 us     528808 us          3   80.6873MB/s
   BM_UniqueString10bytes/16M/50/min_time:1.000/real_time         152207 us     152212 us          9     116.8MB/s
   BM_UniqueString10bytes/16M/1024/min_time:1.000/real_time       260047 us     260056 us          5   123.055MB/s
   BM_UniqueString10bytes/16M/10k/min_time:1.000/real_time        426539 us     426552 us          3   125.038MB/s
   BM_UniqueString10bytes/16M/1024k/min_time:1.000/real_time     1716739 us    1716791 us          1      93.2MB/s
   BM_UniqueString100bytes/16M/50/min_time:1.000/real_time        556145 us     556165 us          3   958.982MB/s
   BM_UniqueString100bytes/16M/1024/min_time:1.000/real_time      693922 us     693943 us          2   1.12585GB/s
   BM_UniqueString100bytes/16M/10k/min_time:1.000/real_time      1000449 us    1000484 us          1    1.5618GB/s
   BM_UniqueString100bytes/16M/1024k/min_time:1.000/real_time    3591215 us    3591314 us          1   445.532MB/s
   ```
   
   This suggests quite a lot of room for improvement. It's counter-intuitive to me that hashing strings appears faster than hashing integers when judged by the reported bytes per second, so we should figure out what's going on there.
   
   We can also refactor the hash table implementations without worrying too much about whether we're making things slower.
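
For readers who want to reproduce the shape of these measurements without building Arrow, here is a minimal stand-in sketch. It does not use Arrow's hash kernels or Google Benchmark: `std::unordered_set` takes the place of the hash table, and `MakeInt64Data`, `UniqueInt64`, `TimeUnique`, and `BenchResult` are hypothetical names invented for illustration. It varies the number of distinct values the same way the benchmark names above do (50, 1024, 10k, 1024k).

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <unordered_set>
#include <vector>

// Generate `length` values drawn uniformly from `num_unique` distinct integers.
std::vector<int64_t> MakeInt64Data(int64_t length, int64_t num_unique,
                                   uint64_t seed = 0) {
  std::mt19937_64 gen(seed);
  std::uniform_int_distribution<int64_t> dist(0, num_unique - 1);
  std::vector<int64_t> values(static_cast<std::size_t>(length));
  for (auto& v : values) v = dist(gen);
  return values;
}

// Stand-in for a Unique kernel: distinct values in order of first appearance.
std::vector<int64_t> UniqueInt64(const std::vector<int64_t>& values) {
  std::unordered_set<int64_t> seen;
  std::vector<int64_t> out;
  for (int64_t v : values) {
    if (seen.insert(v).second) out.push_back(v);
  }
  return out;
}

struct BenchResult {
  double seconds_per_iteration;
  double bytes_per_second;  // input bytes of one iteration / time of one iteration
  std::size_t num_unique_found;
};

// Time `iterations` runs of UniqueInt64 over the same input.
BenchResult TimeUnique(const std::vector<int64_t>& values, int iterations) {
  using Clock = std::chrono::steady_clock;
  std::size_t found = 0;
  const auto start = Clock::now();
  for (int i = 0; i < iterations; ++i) {
    found = UniqueInt64(values).size();
  }
  const std::chrono::duration<double> elapsed = Clock::now() - start;
  const double per_iteration = elapsed.count() / iterations;
  const double bytes = static_cast<double>(values.size() * sizeof(int64_t));
  return {per_iteration, bytes / per_iteration, found};
}
```

Calling `TimeUnique(MakeInt64Data(16 * 1024 * 1024, 50), 5)` and repeating with larger cardinalities gives a rough baseline to compare a refactored hash table against. Absolute numbers from this sketch are not comparable to the table above, since the hash table and the timing harness differ.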

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> [C++] Basic benchmark suite for hash kernels
> --------------------------------------------
>
>                 Key: ARROW-1844
>                 URL: https://issues.apache.org/jira/browse/ARROW-1844
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>              Labels: pull-request-available
>             Fix For: 0.8.0
>
>
> * Integers, small cardinality and large cardinality
> * Short strings, small/large cardinality
> * Long strings, small/large cardinality
>
> These benchmarks will enable us to refactor without fear, and to experiment with faster hash functions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)