You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by paul-rogers <gi...@git.apache.org> on 2017/09/01 20:03:33 UTC

[GitHub] drill pull request #932: DRILL-5758: Fix for repeated columns; enable manage...

GitHub user paul-rogers opened a pull request:

    https://github.com/apache/drill/pull/932

    DRILL-5758: Fix for repeated columns; enable managed sort by default

    Fix for DRILL-5443. Basically, the “record batch sizer” did not handle repeated columns correctly.
    
    Enabled managed sort by default.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/paul-rogers/drill DRILL-5758

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/932.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #932
    
----
commit 2c9b2249fbad1bee2cb290a9c04fb82df5ecf2e8
Author: Paul Rogers <pr...@maprtech.com>
Date:   2017-09-01T00:24:43Z

    Fix for DRILL-5443. See DRILL-5758 for details. Basically, the “record
    batch sizer” did not handle repeated columns correctly.
    
    Enabled managed sort by default

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill issue #932: DRILL-5758: Fix for repeated columns; enable managed sort ...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/932
  
    @Ben-Zvi, added another fix. Please take another look.


---

[GitHub] drill issue #932: DRILL-5758: Fix for repeated columns; enable managed sort ...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/932
  
    Added a fix for repeated columns that have a low cardinality. If, say, one row in ten has an array entry, then average cardinality (values per row) is 0.1. This was represented by an int, rounded to 0 and caused a zero-length vector to be allocated. Drill then tried to double the length, which resulted in 0, which was doubled again, and so on forever.
    
    The fix has three parts:
    
    * The "record batch sizer" uses floats to allow fractional cardinality.
    * The vector initializer now works with fractional cardinality.
    * If all else fails, if a fixed-width vector is asked to double from zero, it sizes the vector to 256 bytes.


---

[GitHub] drill pull request #932: DRILL-5758: Fix for repeated columns; enable manage...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/932#discussion_r142017748
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/spill/RecordBatchSizer.java ---
    @@ -74,53 +74,52 @@
         public final int estSize;
     
         /**
    -     * Number of times the value here (possibly repeated) appears in
    -     * the record batch.
    +     * Number of occurrences of the value in the batch. This is trivial
    +     * for top-level scalars: it is the record count. For a top-level
    +     * repeated vector, this is the number of arrays, also the record
    +     * count. For a value nested inside a repeated map, it is the
    +     * total number of values across all maps, and may be less than,
    +     * greater than (but unlikely) same as the row count.
          */
     
         public final int valueCount;
     
         /**
    -     * The number of elements in the value vector. Consider two cases.
    -     * A required or nullable vector has one element per row, so the
    -     * <tt>entryCount</tt> is the same as the <tt>valueCount</tt> (which,
    -     * in turn, is the same as the row count.) But, if this vector is an
    -     * array, then the <tt>valueCount</tt> is the number of columns, while
    -     * <tt>entryCount</tt> is the total number of elements in all the arrays
    -     * that make up the columns, so <tt>entryCount</tt> will be different than
    -     * the <tt>valueCount</tt> (normally larger, but possibly smaller if most
    -     * arrays are empty.
    -     * <p>
    -     * Finally, the column may be part of another list. In this case, the above
    -     * logic still applies, but the <tt>valueCount</tt> is the number of entries
    -     * in the outer array, not the row count.
    +     * Total number of elements for a repeated type, or 1 if this is
    +     * a non-repeated type. That is, a batch of 100 rows may have an
    +     * array with 10 elements per row. In this case, the element count
    +     * is 1000.
          */
     
    -    public int entryCount;
    +    public final int elementCount;
    --- End diff --
    
    Good point. However, a single batch of greater than 2 GB is far more than the sort can handle, so we'd not even get this far if the batch was this large.
    
    Still, the point is valid, so a new commit changes batch size variables from int to long.


---

[GitHub] drill issue #932: DRILL-5758: Fix for repeated columns; enable managed sort ...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/932
  
    Squashed commits.


---

[GitHub] drill issue #932: DRILL-5758: Fix for repeated columns; enable managed sort ...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/932
  
    Rebased onto master.


---

[GitHub] drill issue #932: DRILL-5758: Fix for repeated columns; enable managed sort ...

Posted by amansinha100 <gi...@git.apache.org>.
Github user amansinha100 commented on the issue:

    https://github.com/apache/drill/pull/932
  
    Changes LGTM.  +1


---

[GitHub] drill pull request #932: DRILL-5758: Fix for repeated columns; enable manage...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/932


---

[GitHub] drill issue #932: DRILL-5758: Fix for repeated columns; enable managed sort ...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/932
  
    Rebased onto latest master.


---

[GitHub] drill pull request #932: DRILL-5758: Fix for repeated columns; enable manage...

Posted by Ben-Zvi <gi...@git.apache.org>.
Github user Ben-Zvi commented on a diff in the pull request:

    https://github.com/apache/drill/pull/932#discussion_r137963963
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/managed/SortMemoryManager.java ---
    @@ -680,9 +680,11 @@ public MergeTask consolidateBatches(long allocMemory, int inMemCount, int spille
         // spilled run.
     
         // Maximum spill batches that fit into available memory.
    +    // Use the maximum buffer size since spill batches seem to
    +    // be read with almost 50% internal fragmentation.
     
         int memMergeLimit = (int) ((mergeMemoryLimit - allocMemory) /
    -                                spillBatchSize.expectedBufferSize);
    +                                spillBatchSize.maxBufferSize);
         memMergeLimit = Math.max(0, memMergeLimit);
    --- End diff --
    
    This line of code is not needed ( Math.max(....) ), as the code above (around line 675) ensures that mergeMemoryLimit >= allocMemory .



---

[GitHub] drill pull request #932: DRILL-5758: Fix for repeated columns; enable manage...

Posted by amansinha100 <gi...@git.apache.org>.
Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/932#discussion_r141191139
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/spill/RecordBatchSizer.java ---
    @@ -74,53 +74,52 @@
         public final int estSize;
     
         /**
    -     * Number of times the value here (possibly repeated) appears in
    -     * the record batch.
    +     * Number of occurrences of the value in the batch. This is trivial
    +     * for top-level scalars: it is the record count. For a top-level
    +     * repeated vector, this is the number of arrays, also the record
    +     * count. For a value nested inside a repeated map, it is the
    +     * total number of values across all maps, and may be less than,
    +     * greater than (but unlikely) same as the row count.
          */
     
         public final int valueCount;
     
         /**
    -     * The number of elements in the value vector. Consider two cases.
    -     * A required or nullable vector has one element per row, so the
    -     * <tt>entryCount</tt> is the same as the <tt>valueCount</tt> (which,
    -     * in turn, is the same as the row count.) But, if this vector is an
    -     * array, then the <tt>valueCount</tt> is the number of columns, while
    -     * <tt>entryCount</tt> is the total number of elements in all the arrays
    -     * that make up the columns, so <tt>entryCount</tt> will be different than
    -     * the <tt>valueCount</tt> (normally larger, but possibly smaller if most
    -     * arrays are empty.
    -     * <p>
    -     * Finally, the column may be part of another list. In this case, the above
    -     * logic still applies, but the <tt>valueCount</tt> is the number of entries
    -     * in the outer array, not the row count.
    +     * Total number of elements for a repeated type, or 1 if this is
    +     * a non-repeated type. That is, a batch of 100 rows may have an
    +     * array with 10 elements per row. In this case, the element count
    +     * is 1000.
          */
     
    -    public int entryCount;
    +    public final int elementCount;
    --- End diff --
    
    Not related to elementCount per-se but I see that netBatchSize and accountedMemorySize are integers.  These could overflow depending on number of columns.  Should they be longs ? 


---