Posted to jira@arrow.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2021/12/07 03:35:00 UTC

[jira] [Updated] (ARROW-14479) [C++][Compute] Hash Join microbenchmarks

     [ https://issues.apache.org/jira/browse/ARROW-14479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-14479:
-----------------------------------
    Labels: pull-request-available  (was: )

> [C++][Compute] Hash Join microbenchmarks
> ----------------------------------------
>
>                 Key: ARROW-14479
>                 URL: https://issues.apache.org/jira/browse/ARROW-14479
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 7.0.0
>            Reporter: Michal Nowakiewicz
>            Assignee: Sasha Krassovsky
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 7.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Implement a series of microbenchmarks that give a good picture of the performance of the hash join implemented in Arrow across a range of dimensions.
> Compare the performance against some other product(s).
> Add scripts for generating useful visual reports giving a good picture of the costs of hash join.
> Examples of dimensions to explore in microbenchmarks:
>  * number of duplicate keys on build side
>  * relative size of build side to probe side
>  * selectivity of the join
>  * number of key columns
>  * number of payload columns
>  * filtering performance for semi- and anti-joins
>  * dense integer key vs sparse integer key vs string key
>  * build size
>  * scaling of build, filtering, probe
>  * inner vs left outer, inner vs right outer
>  * left semi vs right semi, left anti vs right anti, left outer vs right outer
>  * non-uniform key distribution
>  * monotonic key values in input, partitioned key values in input (with and without per batch min-max metadata)
>  * chain of multiple hash joins
>  * overhead of the Bloom filter when the filter is non-selective
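
A few of the dimensions listed above (number of duplicate keys on the build side, relative size of build side to probe side, selectivity of the join) can be controlled directly when generating synthetic input. The following is a minimal sketch in plain Python, not Arrow's benchmark code: the helper names make_join_inputs and inner_join_row_count and all of their parameters are hypothetical, and dense integer keys are assumed.

```python
import random
from collections import Counter


def make_join_inputs(num_build_keys, dups_per_key, probe_rows, selectivity, seed=0):
    """Generate (build_keys, probe_keys) for a synthetic hash join.

    Hypothetical helper for illustration only; not part of Arrow.
    selectivity is the fraction of probe rows that match a build key.
    """
    rng = random.Random(seed)

    # Build side: dense keys 0..num_build_keys-1, each repeated dups_per_key times.
    build_keys = [k for k in range(num_build_keys) for _ in range(dups_per_key)]
    rng.shuffle(build_keys)

    # Probe side: a fixed number of rows hit the build key range,
    # the remaining rows use keys that cannot occur on the build side.
    num_hits = round(selectivity * probe_rows)
    hits = [rng.randrange(num_build_keys) for _ in range(num_hits)]
    misses = [
        num_build_keys + rng.randrange(num_build_keys)
        for _ in range(probe_rows - num_hits)
    ]
    probe_keys = hits + misses
    rng.shuffle(probe_keys)
    return build_keys, probe_keys


def inner_join_row_count(build_keys, probe_keys):
    """Count inner join output rows with a simple reference hash join."""
    build_counts = Counter(build_keys)  # the "hash table": key -> multiplicity
    return sum(build_counts[k] for k in probe_keys)
```

With these inputs the expected inner join output size is the number of matching probe rows times dups_per_key, which gives a cheap sanity check before timing anything. Varying one parameter at a time while holding the others fixed gives one benchmark series per dimension.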


