Posted to issues@drill.apache.org by "weijie.tong (JIRA)" <ji...@apache.org> on 2018/12/20 04:05:00 UTC

[jira] [Commented] (DRILL-6825) Applying different hash function according to data types and data size

    [ https://issues.apache.org/jira/browse/DRILL-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725571#comment-16725571 ] 

weijie.tong commented on DRILL-6825:
------------------------------------

[~ben-zvi] Some thoughts about this issue.
The suggestion to give ValueVector a hash method is not enough to solve the performance problem.
If there is only one key column, the suggestion is good, since different hash functions perform differently on different data types.
If there is more than one key column, then invoking a different hash function on each ValueVector in turn may not perform better, and the result may not be right, since some hash functions like XXHash perform well on large inputs. Maybe we can copy the values from the different columns to construct one bigger input row, then invoke XXHash on the constructed row to get a single hashed value. That is, for one row we hash once rather than several times as before. Here, ValueVector could have a maxByteSize method to indicate the maximum byte width over all of its rows. Then we could allocate one fixed-size block of memory ahead of time to hold the data from all the key columns, saving some memory allocation cost while copying the data.
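
A rough sketch of the row-at-a-time idea above, not Drill code. RowKeyHasher and its methods are hypothetical names; the constructor arguments stand in for what the proposed maxByteSize method would report per key column; and FNV-1a is used only as a self-contained stand-in for XXHash, which is not in the JDK. The 4-byte length prefix per column is an added assumption so that variable-width values cannot run together.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: hash a multi-column key once per row, not once per column. */
final class RowKeyHasher {
    // Allocated once from the columns' maximum widths and reused for every row.
    private final ByteBuffer scratch;

    /** Each argument plays the role of the proposed (hypothetical) maxByteSize of one key column. */
    RowKeyHasher(int... columnMaxByteSizes) {
        int total = 0;
        for (int size : columnMaxByteSizes) {
            total += Integer.BYTES + size; // room for a length prefix plus the value
        }
        this.scratch = ByteBuffer.allocate(total);
    }

    /** Copies one row's key values into the scratch buffer and hashes the buffer once. */
    long hashRow(byte[]... columnValues) {
        scratch.clear();
        for (byte[] value : columnValues) {
            scratch.putInt(value.length); // keeps ("ab","c") distinct from ("a","bc")
            scratch.put(value);
        }
        return fnv1a64(scratch.array(), scratch.position());
    }

    /** FNV-1a 64-bit: a simple stand-in for XXHash, chosen only to keep the sketch self-contained. */
    static long fnv1a64(byte[] data, int length) {
        long hash = 0xcbf29ce484222325L; // FNV offset basis
        for (int i = 0; i < length; i++) {
            hash ^= (data[i] & 0xff);
            hash *= 0x100000001b3L;      // FNV prime
        }
        return hash;
    }
}
```

In a real implementation the copy would read directly from the ValueVector buffers rather than from byte arrays, and whether the copy cost is repaid by the single hash call would need to be measured.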

In summary: with one key column, we look at the data type of the ValueVector to choose a suitable hash function; with more than one key column, we look at the key data size to choose a suitable hash function.

> Applying different hash function according to data types and data size
> ----------------------------------------------------------------------
>
>                 Key: DRILL-6825
>                 URL: https://issues.apache.org/jira/browse/DRILL-6825
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Codegen
>            Reporter: weijie.tong
>            Assignee: weijie.tong
>            Priority: Major
>             Fix For: 1.16.0
>
>
> Different hash functions perform differently depending on data type and data size. We should choose a suitable one for each case rather than always applying MurmurHash.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)