You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by paul-rogers <gi...@git.apache.org> on 2017/03/03 00:09:33 UTC

[GitHub] drill pull request #767: DRILL-5226: Managed external sort fixes

GitHub user paul-rogers opened a pull request:

    https://github.com/apache/drill/pull/767

    DRILL-5226: Managed external sort fixes

    * Memory leak in managed sort if OOM during sv2 allocation
    * "Record batch sizer" does not include overhead for variable-sized
    vectors
    * Paranoid double-checking of merge batch sizes to prevent OOM when the
    sizes differ from expectations
    * Revised logging

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/paul-rogers/drill DRILL-5226

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/767.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #767
    
----
commit 57d99bdfdc5c7150d64d107b549dfb808f1c92a4
Author: Paul Rogers <pr...@maprtech.com>
Date:   2017-03-03T00:09:01Z

    DRILL-5226: Managed external sort fixes
    
    * Memory leak in managed sort if OOM during sv2 allocation
    * "Record batch sizer" does not include overhead for variable-sized
    vectors
    * Paranoid double-checking of merge batch sizes to prevent OOM when the
    sizes differ from expectations
    * Revised logging

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #767: DRILL-5226: Managed external sort fixes

Posted by Ben-Zvi <gi...@git.apache.org>.
Github user Ben-Zvi commented on a diff in the pull request:

    https://github.com/apache/drill/pull/767#discussion_r104252352
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/managed/ExternalSortBatch.java ---
    @@ -1333,8 +1339,43 @@ private void spillFromMemory() {
         mergeAndSpill(bufferedBatches, spillCount);
       }
     
    +  private void mergeRuns(int targetCount) {
    +
    +    // Determine the number of runs to merge. The count should be the
    +    // target count. However, to prevent possible memory overrun, we
    +    // double-check with actual spill batch size and only spill as much
    +    // as fits in the merge memory pool.
    +
    +    int mergeCount = 0;
    +    long mergeSize = 0;
    +    for (SpilledRun batch : spilledRuns) {
    --- End diff --
    
    Rename "batch" to "run" ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #767: DRILL-5226: Managed external sort fixes

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/767#discussion_r104275259
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/managed/ExternalSortBatch.java ---
    @@ -1333,8 +1339,43 @@ private void spillFromMemory() {
         mergeAndSpill(bufferedBatches, spillCount);
       }
     
    +  private void mergeRuns(int targetCount) {
    +
    +    // Determine the number of runs to merge. The count should be the
    +    // target count. However, to prevent possible memory overrun, we
    +    // double-check with actual spill batch size and only spill as much
    +    // as fits in the merge memory pool.
    +
    +    int mergeCount = 0;
    +    long mergeSize = 0;
    +    for (SpilledRun batch : spilledRuns) {
    +      long batchSize = batch.getBatchSize();
    +      if (mergeSize + batchSize > mergeMemoryPool) {
    +        break;
    +      }
    +      mergeSize += batchSize;
    +      mergeCount++;
    +      if (mergeCount == targetCount) {
    +        break;
    +      }
    +    }
    +
    +    // Must always spill at least 2, even if this creates an over-size
    +    // spill file. But, if this is a final consolidation, we may have only
    +    // a single batch.
    +
    +    mergeCount = Math.max(mergeCount, 2);
    +    mergeCount = Math.min(mergeCount, spilledRuns.size());
    +
    +    // Do the actual spill.
    +
    +    mergeAndSpill(spilledRuns, mergeCount);
    --- End diff --
    
    The logic is a bit more complex. We use the size of the largest row as our row size estimate. But, when we spill, we also consider the actual batch sizes. Earlier versions of the code were more sensitive to the row size estimate, more recent revisions have evolved to favor actual sizes in many cases.
    
    For "first generation" batches, we use the actual "payload" amount of data in each batch, sum that amount, and stop when we have enough data to fill a spill file. Thus, first generation spills are immune to the problem you describe.
    
    For merge/spills, we guess a target number of batches based on the target spill batch size. If the above logic works, then we know (roughly) the size of the spilled batches -- because we created them. As a sanity check, new code above sums the actual batch sizes in the files to merge to ensure that they fit in memory, choosing a smaller-than-requested number of batches if the merge batches turn out to be a bit larger than expected.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #767: DRILL-5226: Managed external sort fixes

Posted by Ben-Zvi <gi...@git.apache.org>.
Github user Ben-Zvi commented on a diff in the pull request:

    https://github.com/apache/drill/pull/767#discussion_r104252671
  
    --- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java ---
    @@ -253,7 +253,7 @@ public int getPayloadByteCount() {
           // If 1 or more values, then the last value is set to
           // the offset of the next value, which is the same as
           // the length of existing values.
    -      return a.get(count-1);
    +      return a.get(count-1) + offsetVector.getPayloadByteCount();
    --- End diff --
    
    Don't understand this new addition; doesn't the get(count-1) already reports all the memory used by this varchar, including the overhead ?  Maybe need to explain in the comment.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #767: DRILL-5226: Managed external sort fixes

Posted by Ben-Zvi <gi...@git.apache.org>.
Github user Ben-Zvi commented on a diff in the pull request:

    https://github.com/apache/drill/pull/767#discussion_r104252291
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/managed/ExternalSortBatch.java ---
    @@ -1333,8 +1339,43 @@ private void spillFromMemory() {
         mergeAndSpill(bufferedBatches, spillCount);
       }
     
    +  private void mergeRuns(int targetCount) {
    +
    +    // Determine the number of runs to merge. The count should be the
    +    // target count. However, to prevent possible memory overrun, we
    +    // double-check with actual spill batch size and only spill as much
    +    // as fits in the merge memory pool.
    +
    +    int mergeCount = 0;
    +    long mergeSize = 0;
    +    for (SpilledRun batch : spilledRuns) {
    +      long batchSize = batch.getBatchSize();
    +      if (mergeSize + batchSize > mergeMemoryPool) {
    +        break;
    +      }
    +      mergeSize += batchSize;
    +      mergeCount++;
    +      if (mergeCount == targetCount) {
    +        break;
    +      }
    +    }
    +
    +    // Must always spill at least 2, even if this creates an over-size
    +    // spill file. But, if this is a final consolidation, we may have only
    +    // a single batch.
    +
    +    mergeCount = Math.max(mergeCount, 2);
    +    mergeCount = Math.min(mergeCount, spilledRuns.size());
    +
    +    // Do the actual spill.
    +
    +    mergeAndSpill(spilledRuns, mergeCount);
    --- End diff --
    
    Just a comment: So we always merge the _FIRST_ mergeCount runs. So if (one of) the first run has some crazy "oversized" batch, we'd repeatedly merge a small number of runs, as that bad batch may be preserved on.
    Alternatively - select the runs with the smaller "max batch"es, hence getting more runs to merge.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #767: DRILL-5226: Managed external sort fixes

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/767#discussion_r104275436
  
    --- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java ---
    @@ -253,7 +253,7 @@ public int getPayloadByteCount() {
           // If 1 or more values, then the last value is set to
           // the offset of the next value, which is the same as
           // the length of existing values.
    -      return a.get(count-1);
    +      return a.get(count-1) + offsetVector.getPayloadByteCount();
    --- End diff --
    
    Added a comment. Basically, for planning purposes, the "payload" bytes for a Varchar vector include the bytes taken to hold the text data PLUS the bytes needed by the offset vector that points to those bytes. When averaged, it means a Varchar column width is 4 + avg. byte width.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill issue #767: DRILL-5226: Managed external sort fixes

Posted by amansinha100 <gi...@git.apache.org>.
Github user amansinha100 commented on the issue:

    https://github.com/apache/drill/pull/767
  
    +1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #767: DRILL-5226: Managed external sort fixes

Posted by Ben-Zvi <gi...@git.apache.org>.
Github user Ben-Zvi commented on a diff in the pull request:

    https://github.com/apache/drill/pull/767#discussion_r104229270
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/managed/CopierHolder.java ---
    @@ -303,5 +307,9 @@ public int getRecordCount() {
         public int getBatchCount() {
           return batchCount;
         }
    +
    +    public long getBatchSize() {
    --- End diff --
    
    This should be better named as getMaxBatchSize()



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #767: DRILL-5226: Managed external sort fixes

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/767#discussion_r104275056
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/managed/ExternalSortBatch.java ---
    @@ -1333,8 +1339,43 @@ private void spillFromMemory() {
         mergeAndSpill(bufferedBatches, spillCount);
       }
     
    +  private void mergeRuns(int targetCount) {
    +
    +    // Determine the number of runs to merge. The count should be the
    +    // target count. However, to prevent possible memory overrun, we
    +    // double-check with actual spill batch size and only spill as much
    +    // as fits in the merge memory pool.
    +
    +    int mergeCount = 0;
    +    long mergeSize = 0;
    +    for (SpilledRun batch : spilledRuns) {
    --- End diff --
    
    Fixed


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #767: DRILL-5226: Managed external sort fixes

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/767#discussion_r104275014
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/managed/CopierHolder.java ---
    @@ -303,5 +307,9 @@ public int getRecordCount() {
         public int getBatchCount() {
           return batchCount;
         }
    +
    +    public long getBatchSize() {
    --- End diff --
    
    Fixed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #767: DRILL-5226: Managed external sort fixes

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/767


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---