Posted to issues@hive.apache.org by "Mustafa Iman (Jira)" <ji...@apache.org> on 2020/10/02 23:38:00 UTC
[jira] [Commented] (HIVE-24205) Optimise CuckooSetBytes

    [ https://issues.apache.org/jira/browse/HIVE-24205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17206559#comment-17206559 ] 

Mustafa Iman commented on HIVE-24205:
-------------------------------------

I added a simple max/min length check in CuckooSetBytes#lookup. Attached file shows some benchmark results.
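For readers without the patch at hand, the idea can be sketched roughly as follows. This is an illustration, not the attached code: the class name LengthBoundedByteSet is hypothetical, a plain HashSet stands in for the cuckoo hash tables, and the lookup(bytes, start, len) shape is assumed to resemble what the vectorized string expressions pass in.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical stand-in for CuckooSetBytes. A HashSet replaces the cuckoo
 * tables; only the min/max length bookkeeping is the point of the sketch.
 */
public class LengthBoundedByteSet {
    private final Set<ByteBuffer> members = new HashSet<>();
    // Bounds over the lengths of all inserted values (empty set: min > max).
    private int minLen = Integer.MAX_VALUE;
    private int maxLen = 0;

    public void insert(byte[] value) {
        // Defensive copy so later changes to the caller's array do not affect the set.
        members.add(ByteBuffer.wrap(Arrays.copyOf(value, value.length)));
        minLen = Math.min(minLen, value.length);
        maxLen = Math.max(maxLen, value.length);
    }

    public boolean lookup(byte[] bytes, int start, int len) {
        // Early return: no member has this length, so hashing is skipped entirely.
        if (len < minLen || len > maxLen) {
            return false;
        }
        // ByteBuffer equality and hashing cover only the [start, start + len) slice.
        return members.contains(ByteBuffer.wrap(bytes, start, len));
    }
}
```

The check costs two integer comparisons per row, which is in line with the low overhead seen on TPCH_Q12 below.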

 

*TPCH_Q12* is a select with an IN clause followed by a join. Selectivity of the filter is 30%.

*Synthetic* query is a simple select with an IN clause. The IN list is over two of the longest comment fields (both 72 characters wide), so the filter is highly selective, at about 2%:

select o_orderkey, o_comment from orders where o_comment in ('jole quickly furiously bold escapades: regular accounts play regular req', 's foxes. regular warhorses detect fluffily. carefully regular tithes amo', 'grate ironic, pending sauternes. deposits do are slyly. carefully ironic')

*Synthetic Wide* query is the same as Synthetic, except that the IN clause is over one shortest-length and one longest-length comment. Selectivity is still high at 4%, but the optimization cannot eliminate any tuples, because every comment length falls within the resulting min/max range.

select o_orderkey, o_comment from orders where o_comment in ('jole quickly furiously bold escapades: regular accounts play regular req', 'ts nag furiously. even');
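
The difference between the two synthetic queries comes down to the length bounds of their IN lists. A small sketch that computes those bounds from the literals above (the class and method names are hypothetical and used only for illustration; lengths are in UTF-8 bytes):

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: computes the {min, max} byte length over an IN list. */
public class InListLengthBounds {
    static final String[] SYNTHETIC = {
        "jole quickly furiously bold escapades: regular accounts play regular req",
        "s foxes. regular warhorses detect fluffily. carefully regular tithes amo",
        "grate ironic, pending sauternes. deposits do are slyly. carefully ironic"
    };

    static final String[] SYNTHETIC_WIDE = {
        "jole quickly furiously bold escapades: regular accounts play regular req",
        "ts nag furiously. even"
    };

    /** Returns {minLength, maxLength} over the given literals. */
    public static int[] bounds(String... literals) {
        int min = Integer.MAX_VALUE;
        int max = 0;
        for (String s : literals) {
            int len = s.getBytes(StandardCharsets.UTF_8).length;
            min = Math.min(min, len);
            max = Math.max(max, len);
        }
        return new int[] {min, max};
    }
}
```

For Synthetic the bounds are [72, 72], so any comment that is not exactly 72 bytes long is rejected before hashing. For Synthetic Wide they are [22, 72], which rejects nothing if, as described above, these are the shortest and longest comment lengths in the table.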

 

The patch outperforms the original code by 50% on the synthetic query. For TPCH Q12, there is no meaningful difference between the two runs. My conclusion is that the optimization has very low overhead and gives a significant performance improvement in certain cases.

I also implemented a vectorized version of the early return from the cuckoo set; it is attached as vectorized.patch. However, in all cases the simpler patch outperforms the vectorized one.

> Optimise CuckooSetBytes
> -----------------------
>
>                 Key: HIVE-24205
>                 URL: https://issues.apache.org/jira/browse/HIVE-24205
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: Mustafa Iman
>            Priority: Major
>         Attachments: Screenshot 2020-09-28 at 4.29.24 PM.png, bench.png, vectorized.patch
>
>
> {{FilterStringColumnInList, StringColumnInList}}  etc use CuckooSetBytes for lookup.
> !Screenshot 2020-09-28 at 4.29.24 PM.png|width=714,height=508!
> One option to optimize would be to add boundary conditions on "length" with the min/max length stored in the hashes (ref: [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/CuckooSetBytes.java#L85]). This would significantly reduce the number of hash computations that need to happen. E.g. [TPCH-Q12|https://github.com/hortonworks/hive-testbench/blob/hdp3/sample-queries-tpch/tpch_query12.sql#L20]


