You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@drill.apache.org by arina-ielchiieva <gi...@git.apache.org> on 2017/12/13 16:24:31 UTC

[GitHub] drill pull request #1071: DRILL-6028: Allow splitting generated code in Chai...

GitHub user arina-ielchiieva opened a pull request:

    https://github.com/apache/drill/pull/1071

    DRILL-6028: Allow splitting generated code in ChainedHashTable into b…

    …locks to avoid "code too large" error
    
    1. Added new parameter seedValue to getHashBuild and getHashProbe methods in HashTableTemplate.
    2. Generate logical expression for each key so its  can be split into blocks if number of expressions in method exceeds upper limit.
    3. ParameterExpression was added to generate reference to method parameter during code generation.
    
    Details in [DRILL-6028](https://issues.apache.org/jira/browse/DRILL-6028)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/arina-ielchiieva/drill DRILL-6028

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/1071.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1071
    
----
commit 90dbd2db96bed8fa3cd3dab7575aadab240a621c
Author: Arina Ielchiieva <ar...@gmail.com>
Date:   2017-12-12T16:45:57Z

    DRILL-6028: Allow splitting generated code in ChainedHashTable into blocks to avoid "code too large" error
    
    1. Added new parameter seedValue to getHashBuild and getHashProbe methods in HashTableTemplate.
    2. Generate logical expression for each key so its  can be split into blocks if number of expressions in method exceeds upper limit.
    3. ParameterExpression was added to generate reference to method parameter during code generation.

----


---

[GitHub] drill pull request #1071: DRILL-6028: Allow splitting generated code in Chai...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/1071


---

[GitHub] drill pull request #1071: DRILL-6028: Allow splitting generated code in Chai...

Posted by arina-ielchiieva <gi...@git.apache.org>.

Github user arina-ielchiieva commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1071#discussion_r158255284
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/TestUnionDistinct.java ---
    @@ -754,4 +756,37 @@ public void testDrill4147_1() throws Exception {
         }
       }
     
    +  @Test
    +  public void testUnionWithManyColumns() throws Exception {
    --- End diff --
    
    During union we need to add two result sets and then remove the duplicates.
    Depending on the query cost query plans will be different, when number of columns are not large (approx. less then 40), sorting and streaming aggregate is chosen:
    Query `select columns[0]...columns[5] from dfs.test_1.csv union select columns[0]...columns[5] from dfs.test_2.csv`:
    ```
    00-00 Screen
    00-01 Project
    00-02 StreamAgg
    00-03 Sort
    00-04 UnionAll(all=[true])
    00-05 Scan table 2
    00-06 Scan table 1
    ```
    When number of columns 40+ plan with hash aggregate is chosen since it is cheaper.
    Query `select columns[0]...columns[1200] from dfs.test_1.csv union select columns[0]...columns[1200] from dfs.test_2.csv`:
    ```
    00-00 Screen
    00-01 Project
    00-02 HashAgg
    00-03 UnionAll(all=[true])
    00-04 Scan table 2
    00-05 Scan table 1
    ```
    Also added in Jira more examples of generated code before and after the change. 


---

[GitHub] drill issue #1071: DRILL-6028: Allow splitting generated code in ChainedHash...

Posted by arina-ielchiieva <gi...@git.apache.org>.

Github user arina-ielchiieva commented on the issue:

    https://github.com/apache/drill/pull/1071
  
    @paul-rogers thanks for the code review. Your approach seems to be worth trying. I have created Jira for enhancement [1].
    
    [1] https://issues.apache.org/jira/browse/DRILL-6052


---

[GitHub] drill issue #1071: DRILL-6028: Allow splitting generated code in ChainedHash...

Posted by paul-rogers <gi...@git.apache.org>.

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/1071

Thanks much for the example files and explanation for the need to hash.

The improvements look good. I wonder, however, if the code gen approach is overkill. There is exactly one allowable hash method per type. (Has to be the same for all queries to get reliable results.)

Here, we are generating code to do the work of:

* Bind to all vectors.
* Get a value out of the vector into a holder.
* Pass the value to the proper hash function.
* Combine the results.

The result is a huge amount of code to generate. The gist of this bug is that, when the number of columns becomes large, we generate so much code that we have to take extra steps to manage it. And, of course, compiling, caching and loading the code takes time.

As something to think about for the future, this entire mechanism can be replaced with a much simpler one:

* Add a `hash(seed)` method to each value vector.
* Here, iterate over the list of vectors.
* Call the `hash()` method on each vector.
* Combine the results.

Tradeoffs?

* The proposed method has no setup cost. It is, instead an "interpreted" approach. The old method has a large setup cost.
* The proposed method must make a "virtual" call into each vector to get the value and hash it using the pre-coded, type-specific function. The old method makes a direct call to get the value in a holder, then a direct call to the hash method.
* The proposed method is insensitive to the number of columns (other than that it increases the size of the column loop.) The old method needs special care to handle the extra code.

The proposed method would be easy to test to see which is more efficient: (large code generation + direct method calls) vs. (no code generation and virtual method calls). My money is on the new method as it eliminates the holders, sloshing variables around and so on. The JIT can optimize the "pre-coded" methods once and for all early in the Drillbit run rather than having to re-optimize the (huge) generated methods per query.

The improvement is not necessary for this PR, but is something to think about. @Ben-Zvi may need something similar for the hash join to avoid generating query-specific key hashes. In fact, since hashing is used in many places (exchanges, hash agg, etc.), we might get quite a nice savings in time and code complexity by slowing moving to the proposed model.

---

[GitHub] drill pull request #1071: DRILL-6028: Allow splitting generated code in Chai...

Posted by amansinha100 <gi...@git.apache.org>.

Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1071#discussion_r158322278
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/TestUnionDistinct.java ---
    @@ -754,4 +756,37 @@ public void testDrill4147_1() throws Exception {
         }
       }
     
    +  @Test
    +  public void testUnionWithManyColumns() throws Exception {
    --- End diff --
    
    Perhaps the confusion is with union-all vs. union-distinct.  UNION (without the following ALL or DISTINCT keyword) is interpreted as DISTINCT and as @arina-ielchiieva 's example shows, it currently is processed in 2 steps: union-all followed by a distinct (group-by).  
    
    As a separate enhancement, unless a JIRA already exists, we should file one to natively support union-distinct within the same operator.  


---

[GitHub] drill pull request #1071: DRILL-6028: Allow splitting generated code in Chai...

Posted by paul-rogers <gi...@git.apache.org>.

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1071#discussion_r158164670
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/TestUnionDistinct.java ---
    @@ -754,4 +756,37 @@ public void testDrill4147_1() throws Exception {
         }
       }
     
    +  @Test
    +  public void testUnionWithManyColumns() throws Exception {
    --- End diff --
    
    Why would a UNION operator need code for a generated hash table? The union matches columns, then iterates over multiple result sets. Where would a run-time hash table fit?
    
    Do we need a different unit test that exercises the hash table code?


---

[GitHub] drill issue #1071: DRILL-6028: Allow splitting generated code in ChainedHash...

Posted by priteshm <gi...@git.apache.org>.

Github user priteshm commented on the issue:

    https://github.com/apache/drill/pull/1071
  
    @paul-rogers can you please review this?


---