Posted to dev@drill.apache.org by "Arina Ielchiieva (JIRA)" <ji...@apache.org> on 2017/12/22 11:20:00 UTC

[jira] [Created] (DRILL-6052) Add hash(seed) method to each value vector

Arina Ielchiieva created DRILL-6052:
---------------------------------------

Summary: Add hash(seed) method to each value vector
Key: DRILL-6052
URL: https://issues.apache.org/jira/browse/DRILL-6052
Project: Apache Drill
Issue Type: Improvement
Affects Versions: 1.12.0
Reporter: Arina Ielchiieva

As part of DRILL-6028 we have to enhance the ChainedHashTable code generation to allow method splitting. However, as [~Paul.Rogers] has proposed, there is an alternative way to reduce the amount of generated code and improve performance:
{quote}
Thanks much for the example files and explanation for the need to hash.

The improvements look good. I wonder, however, if the code gen approach is overkill. There is exactly one allowable hash method per type. (Has to be the same for all queries to get reliable results.)

Here, we are generating code to do the work of:

* Bind to all vectors.
* Get a value out of the vector into a holder.
* Pass the value to the proper hash function.
* Combine the results.

The result is a huge amount of code to generate. The gist of this bug is that, when the number of columns becomes large, we generate so much code that we have to take extra steps to manage it. And, of course, compiling, caching and loading the code takes time.

As something to think about for the future, this entire mechanism can be replaced with a much simpler one:

* Add a hash(seed) method to each value vector.
* Here, iterate over the list of vectors.
* Call the hash() method on each vector.
* Combine the results.

Tradeoffs?

* The proposed method has no setup cost. It is, instead, an "interpreted" approach. The old method has a large setup cost.
* The proposed method must make a "virtual" call into each vector to get the value and hash it using the pre-coded, type-specific function. The old method makes a direct call to get the value in a holder, then a direct call to the hash method.
* The proposed method is insensitive to the number of columns (other than that more columns mean more iterations of the column loop). The old method needs special care to handle the extra code.
* The proposed method would be easy to test to see which is more efficient: (large code generation + direct method calls) vs. (no code generation + virtual method calls). My money is on the new method, as it eliminates the holders, the sloshing of variables around, and so on. The JIT can optimize the "pre-coded" methods once and for all early in the Drillbit run rather than having to re-optimize the (huge) generated methods per query.

The improvement is not necessary for this PR, but is something to think about. @Ben-Zvi may need something similar for the hash join to avoid generating query-specific key hashes. In fact, since hashing is used in many places (exchanges, hash agg, etc.), we might get quite a nice savings in time and code complexity by slowly moving to the proposed model.
{quote}

This Jira aims to apply the proposed solution.
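
To make the proposal concrete, here is a rough sketch of the "interpreted" approach quoted above: each vector exposes a hash method, and a single loop chains the seed through all key vectors. Everything in it is an assumption made for illustration only: the HashableVector interface, the IntVector and StringVector classes, the extra row-index parameter, and the simple 31 * seed + value mixing step are hypothetical and are not Drill's ValueVector API or its actual hash functions.

```java
import java.util.Arrays;
import java.util.List;

public class VectorHashSketch {

    /** Hypothetical stand-in for a value vector; NOT Drill's ValueVector API. */
    interface HashableVector {
        /** Hashes the value at the given row, folding in the incoming seed. */
        int hash(int index, int seed);
    }

    /** Toy fixed-width vector with a pre-coded, type-specific hash. */
    static final class IntVector implements HashableVector {
        private final int[] values;

        IntVector(int... values) {
            this.values = values;
        }

        @Override
        public int hash(int index, int seed) {
            // Placeholder mixing step (assumption); a real engine would use its own hash.
            return 31 * seed + Integer.hashCode(values[index]);
        }
    }

    /** Toy variable-width vector with a pre-coded, type-specific hash. */
    static final class StringVector implements HashableVector {
        private final String[] values;

        StringVector(String... values) {
            this.values = values;
        }

        @Override
        public int hash(int index, int seed) {
            // Same placeholder mixing step as above.
            return 31 * seed + values[index].hashCode();
        }
    }

    /**
     * Iterates over the key vectors, makes one virtual call per vector,
     * and combines the results by feeding each hash in as the next seed.
     */
    static int hashRow(List<HashableVector> keyVectors, int rowIndex, int seed) {
        int h = seed;
        for (HashableVector vector : keyVectors) {
            h = vector.hash(rowIndex, h);
        }
        return h;
    }

    public static void main(String[] args) {
        List<HashableVector> keys = Arrays.asList(
                new IntVector(1, 2, 1),
                new StringVector("x", "y", "x"));
        // Rows 0 and 2 hold the same key values, so they hash identically.
        for (int row = 0; row < 3; row++) {
            System.out.println("row " + row + " -> " + hashRow(keys, row, 0));
        }
    }
}
```

No code is generated per query in this sketch; adding a key column only adds one more element to the list that hashRow loops over, which is the tradeoff described in the quote.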

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)