Posted to dev@drill.apache.org by minji-kim <gi...@git.apache.org> on 2016/02/18 18:39:55 UTC

[GitHub] drill pull request: DRILL-4411: hash join should limit batch based...

GitHub user minji-kim opened a pull request:

    https://github.com/apache/drill/pull/381

    DRILL-4411: hash join should limit batch based on size and number of records

    Right now, hash joins can run out of memory if records are large, since each batch is limited only by record count (4000).  This patch implements a simple heuristic: if the allocator for the outputs grows larger than 10 MB before 4000 records have been output (say, after 2000 records), then the batch size limit is set to 2000 for future batches.
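
    In standalone form, the heuristic might look roughly like the sketch below. The class and
    method names are illustrative (the actual change lives in HashJoinProbeTemplate.java and
    HashJoinBatch.java); the 10 MB budget and 4000-record limit come from the description above,
    and the one-shot adjustment flag mirrors the diff discussed later in this thread.

    // Illustrative sketch only, not the actual Drill patch.
    public class BatchSizeHeuristicSketch {
      private static final long TARGET_BATCH_SIZE_IN_BYTES = 10L * 1024 * 1024; // 10 MB budget
      private int targetRecordsPerBatch = 4000;            // initial record-count limit
      private boolean adjustTargetRecordsPerBatch = true;  // allow a single downward adjustment

      /** The probe loop keeps emitting records into the current batch while this holds. */
      public boolean batchHasRoom(int recordsOutput, long bytesAllocated) {
        return recordsOutput < targetRecordsPerBatch
            && (!adjustTargetRecordsPerBatch || bytesAllocated < TARGET_BATCH_SIZE_IN_BYTES);
      }

      /** Called when a batch completes, with the records emitted and bytes allocated for it. */
      public void onBatchCompleted(int recordsOutput, long bytesAllocated) {
        if (adjustTargetRecordsPerBatch
            && bytesAllocated >= TARGET_BATCH_SIZE_IN_BYTES
            && recordsOutput < targetRecordsPerBatch) {
          // The byte budget was hit before the record limit (e.g. at 2000 of 4000 records),
          // so use the smaller count as the limit for all future batches, but never go
          // below one record so the probe keeps making progress.
          targetRecordsPerBatch = Math.max(1, recordsOutput);
          adjustTargetRecordsPerBatch = false;
        }
      }
    }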

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/minji-kim/drill DRILL-4411

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/381.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #381
    
----
commit 2e3b1c75273e1b87679d79bdc4f3877b72603e3c
Author: Minji Kim <mi...@dremio.com>
Date:   2016-02-18T17:05:51Z

    DRILL-4411: hash join should limit batch based on size as well as number of records

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-4411: hash join should limit batch based...

Posted by minji-kim <gi...@git.apache.org>.
Github user minji-kim commented on a diff in the pull request:

    https://github.com/apache/drill/pull/381#discussion_r55926325
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/HashJoinBatch.java ---
    @@ -532,6 +542,9 @@ public void close() {
         if (hashTable != null) {
           hashTable.clear();
         }
    +
    +    container.clear();
    +    outputAllocator.close();
    --- End diff --
    
    Yes, it could be null if we close before calling next()/buildSchema().  I will add a null check.
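
    A standalone sketch of the null-guarded cleanup being agreed on here, using stand-in types
    (the real fields are Drill's hash table, VectorContainer, and output BufferAllocator):

    import java.util.List;
    import java.util.Map;

    // Illustrative sketch only; field names mirror the diff above.
    public class CloseSketch implements AutoCloseable {
      private Map<Integer, Integer> hashTable; // stand-in for the real hash table
      private List<Object> container;          // stand-in for the VectorContainer
      private AutoCloseable outputAllocator;   // stand-in for the output allocator

      @Override
      public void close() throws Exception {
        if (hashTable != null) {
          hashTable.clear();
        }
        // container and outputAllocator are only created once buildSchema()/next() has run,
        // so they may still be null if close() is called before that.
        if (container != null) {
          container.clear();
        }
        if (outputAllocator != null) {
          outputAllocator.close();
        }
      }
    }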



[GitHub] drill pull request: DRILL-4411: hash join should limit batch based...

Posted by jaltekruse <gi...@git.apache.org>.
Github user jaltekruse commented on a diff in the pull request:

    https://github.com/apache/drill/pull/381#discussion_r54017535
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/HashJoinProbeTemplate.java ---
    @@ -47,7 +47,11 @@
     
       private HashJoinBatch outgoingJoinBatch = null;
     
    -  private static final int TARGET_RECORDS_PER_BATCH = 4000;
    +  private int targetRecordsPerBatch = 4000;
    +
    +  private boolean adjustTargetRecordsPerBatch = true;
    --- End diff --
    
    It looks like this flag is designed to allow the adjustment to happen only once; is that actually what we want? If the row size keeps growing, it would seem like a good idea to allow several batch size adjustments. Dropping the flag would also remove another piece of boolean state to manage.



[GitHub] drill pull request: DRILL-4411: hash join should limit batch based...

Posted by jaltekruse <gi...@git.apache.org>.
Github user jaltekruse commented on a diff in the pull request:

    https://github.com/apache/drill/pull/381#discussion_r54017219
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/HashJoinProbeTemplate.java ---
    @@ -47,7 +47,11 @@
     
       private HashJoinBatch outgoingJoinBatch = null;
     
    -  private static final int TARGET_RECORDS_PER_BATCH = 4000;
    +  private int targetRecordsPerBatch = 4000;
    --- End diff --
    
    I would add a comment here about when this value will be mutated, since we are moving it away from immutability and most of the other operators currently keep this as an immutable value.
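
    For illustration, the requested comment might read something like this (the wording below is
    illustrative, not taken from the patch):

      /**
       * Record-count limit for each outgoing batch. Starts at 4000 and, unlike the immutable
       * constant most other operators use, may be lowered once the output allocator shows
       * that the current limit would exceed the target batch size in bytes.
       */
      private int targetRecordsPerBatch = 4000;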



[GitHub] drill pull request: DRILL-4411: hash join should limit batch based...

Posted by jaltekruse <gi...@git.apache.org>.
Github user jaltekruse commented on a diff in the pull request:

    https://github.com/apache/drill/pull/381#discussion_r54018324
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/HashJoinProbeTemplate.java ---
    @@ -98,7 +102,9 @@ public void setupHashJoinProbe(FragmentContext context, VectorContainer buildBat
       }
     
       public void executeProjectRightPhase() {
    -    while (outputRecords < TARGET_RECORDS_PER_BATCH && recordsProcessed < recordsToProcess) {
    +    while (outputRecords < targetRecordsPerBatch
    +            && recordsProcessed < recordsToProcess
    +            && (!adjustTargetRecordsPerBatch || outgoingJoinBatch.getMemoryUsed() < TARGET_BATCH_SIZE_IN_BYTES)) {
    --- End diff --
    
    It seems like the thing we are testing for here isn't actually directly related to the condition we are trying to avoid. The overall memory consumed when outputting records will be a function of both the size of the values and the number of columns. I think this is a reasonable approach for now, but we should open a follow-up JIRA to look at where things will break when we encounter cases with many wide columns in a dataset.
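
    The point about value width and column count can be made concrete with a rough, hypothetical
    estimate like the one below (a possible follow-up direction, not part of this patch): divide
    the byte budget by an estimated row width that sums the per-column widths.

    // Illustrative sketch only, not Drill code.
    public final class RowWidthEstimate {
      /** Estimated bytes per row = sum of the average value width of every output column. */
      public static long estimatedRowWidth(long[] avgColumnWidthsInBytes) {
        long width = 0;
        for (long w : avgColumnWidthsInBytes) {
          width += w;
        }
        return Math.max(1, width);
      }

      /** Records that fit in the byte budget, never fewer than 1. */
      public static int recordsPerBatch(long batchBudgetInBytes, long[] avgColumnWidthsInBytes) {
        return (int) Math.max(1, batchBudgetInBytes / estimatedRowWidth(avgColumnWidthsInBytes));
      }

      public static void main(String[] args) {
        // Many wide columns: 50 columns averaging 200 bytes each give 10,000-byte rows,
        // so a 10 MB budget allows roughly 1048 records rather than 4000.
        long[] widths = new long[50];
        java.util.Arrays.fill(widths, 200L);
        System.out.println(recordsPerBatch(10L * 1024 * 1024, widths));
      }
    }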



[GitHub] drill pull request: DRILL-4411: hash join should limit batch based...

Posted by adeneche <gi...@git.apache.org>.
Github user adeneche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/381#discussion_r55917078
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/HashJoinBatch.java ---
    @@ -532,6 +542,9 @@ public void close() {
         if (hashTable != null) {
           hashTable.clear();
         }
    +
    +    container.clear();
    +    outputAllocator.close();
    --- End diff --
    
    outputAllocator can be null, right?



[GitHub] drill pull request: DRILL-4411: hash join should limit batch based...

Posted by minji-kim <gi...@git.apache.org>.
Github user minji-kim commented on the pull request:

    https://github.com/apache/drill/pull/381#issuecomment-195647307
  
    I changed it so that it no longer adjusts the batch size only once, but the minimum batch size (in terms of number of records) is kept to at least 1.
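
    In the same standalone style as the earlier sketch, the revised behavior described here would
    drop the one-shot flag while keeping the floor of one record, roughly:

    // Illustrative sketch only; mirrors the behavior described above, not the exact patch.
    public class RepeatedAdjustmentSketch {
      private static final long TARGET_BATCH_SIZE_IN_BYTES = 10L * 1024 * 1024;
      private int targetRecordsPerBatch = 4000;

      /** May shrink the limit on any batch where the byte budget is hit first. */
      public void onBatchCompleted(int recordsOutput, long bytesAllocated) {
        if (bytesAllocated >= TARGET_BATCH_SIZE_IN_BYTES && recordsOutput < targetRecordsPerBatch) {
          // Keep at least one record per batch so the probe always makes progress.
          targetRecordsPerBatch = Math.max(1, recordsOutput);
        }
      }
    }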

