You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by daveoshinsky <gi...@git.apache.org> on 2016/02/10 00:06:56 UTC

[GitHub] drill pull request: DRILL-4184: support variable length decimal fi...

GitHub user daveoshinsky opened a pull request:

    https://github.com/apache/drill/pull/372

    DRILL-4184: support variable length decimal fields in parquet

    Support decimal fields in parquet that are stored as variable length BINARY.  Parquet files that store decimal values this way are often significantly smaller than ones storing decimal values as FIXED_LEN_BYTE_ARRAY's (full precision).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/daveoshinsky/drill master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/372.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #372
    
----
commit 9a47ca52125139d88adf39b5d894a02f870f37d9
Author: U-COMMVAULT-NJ\doshinsky <do...@daveoshinsky-pc.gp.cv.commvault.com>
Date:   2016-02-09T22:37:47Z

    DRILL-4184: support variable length decimal fields in parquet

commit dec00a808c99554f008e23fd21b944b858aa9ae0
Author: daveoshinsky <do...@daveoshinsky-pc.gp.cv.commvault.com>
Date:   2016-02-09T22:56:28Z

    DRILL-4184: changes to support variable length decimal fields in parquet

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-4184: support variable length decimal fi...

Posted by jaltekruse <gi...@git.apache.org>.
Github user jaltekruse commented on a diff in the pull request:

    https://github.com/apache/drill/pull/372#discussion_r55125265
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VarLengthValuesColumn.java ---
    @@ -81,6 +112,14 @@ public boolean skipReadyToReadPositionUpdate() {
         return false;
       }
     
    +  // decimalLengths list is part of a near-term fix for DRILL-4184.
    +  // Decimal[23]8SparseVector classes are fixed width vectors, without ability to "remember" offsets of
    +  // (variable width) field sizes.  so, we "remember" the array sizes in decimalLengths (also used to
    +  // "remember" whether a value was null, for nullable decimal columns).
    +  // TODO: storage of decimal values should support variable length values in a much cleaner way than this,
    +  // perhaps with a new variable width Decimal vector class.
    +  protected ArrayList<Integer> decimalLengths = new ArrayList();
    --- End diff --
    
    please put class members at the top of the class.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-4184: support variable length decimal fi...

Posted by jaltekruse <gi...@git.apache.org>.
Github user jaltekruse commented on a diff in the pull request:

    https://github.com/apache/drill/pull/372#discussion_r55125267
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/NullableVarLengthValuesColumn.java ---
    @@ -69,11 +73,16 @@ protected boolean readAndStoreValueSizeInformation() throws IOException {
         if ( currDefLevel == -1 ) {
           currDefLevel = pageReader.definitionLevels.readInteger();
         }
    -    if ( columnDescriptor.getMaxDefinitionLevel() > currDefLevel) {
    +
    +    if (columnDescriptor.getMaxDefinitionLevel() > currDefLevel) {
           nullsRead++;
    -      // set length of zero, each index in the vector defaults to null so no need to set the nullability
    -      variableWidthVector.getMutator().setValueLengthSafe(
    -          valuesReadInCurrentPass + pageReader.valuesReadyToRead, 0);
    +      // set length of zero, each index in the vector defaults to null so no
    +      // need to set the nullability
    +      if (variableWidthVector == null) {
    --- End diff --
    
    I know this class hierarchy is a bit messy as is ( I am the original author of most if it, I should go back to clean it up). But we shouldn't be using the presence or absence of this field as a flag to know which class we are in.
    
    I'm 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-4184: support variable length decimal fi...

Posted by daveoshinsky <gi...@git.apache.org>.
Github user daveoshinsky commented on a diff in the pull request:

    https://github.com/apache/drill/pull/372#discussion_r55148322
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/NullableVarLengthValuesColumn.java ---
    @@ -69,11 +73,16 @@ protected boolean readAndStoreValueSizeInformation() throws IOException {
         if ( currDefLevel == -1 ) {
           currDefLevel = pageReader.definitionLevels.readInteger();
         }
    -    if ( columnDescriptor.getMaxDefinitionLevel() > currDefLevel) {
    +
    +    if (columnDescriptor.getMaxDefinitionLevel() > currDefLevel) {
           nullsRead++;
    -      // set length of zero, each index in the vector defaults to null so no need to set the nullability
    -      variableWidthVector.getMutator().setValueLengthSafe(
    -          valuesReadInCurrentPass + pageReader.valuesReadyToRead, 0);
    +      // set length of zero, each index in the vector defaults to null so no
    +      // need to set the nullability
    +      if (variableWidthVector == null) {
    --- End diff --
    
    It took me a while to get some idea how this code is supposed to work.  It was not clear to me that only the last item in the list was accessed.  If you are certain that is the case, we don't need a list, but instead just a single value.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-4184: support variable length decimal fi...

Posted by daveoshinsky <gi...@git.apache.org>.
Github user daveoshinsky commented on a diff in the pull request:

    https://github.com/apache/drill/pull/372#discussion_r55148335
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VarLengthValuesColumn.java ---
    @@ -81,6 +112,14 @@ public boolean skipReadyToReadPositionUpdate() {
         return false;
       }
     
    +  // decimalLengths list is part of a near-term fix for DRILL-4184.
    +  // Decimal[23]8SparseVector classes are fixed width vectors, without ability to "remember" offsets of
    +  // (variable width) field sizes.  so, we "remember" the array sizes in decimalLengths (also used to
    +  // "remember" whether a value was null, for nullable decimal columns).
    +  // TODO: storage of decimal values should support variable length values in a much cleaner way than this,
    +  // perhaps with a new variable width Decimal vector class.
    +  protected ArrayList<Integer> decimalLengths = new ArrayList();
    --- End diff --
    
    Once I'm sure whether a list, or a single value, is needed, I will move the value.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-4184: support variable length decimal fi...

Posted by jaltekruse <gi...@git.apache.org>.
Github user jaltekruse commented on the pull request:

    https://github.com/apache/drill/pull/372#issuecomment-192701211
  
    @daveoshinsky I have a few questions about your design here, I have a proposed alternative but I might be wrong about what you are trying to accomplish.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-4184: support variable length decimal fi...

Posted by daveoshinsky <gi...@git.apache.org>.
Github user daveoshinsky commented on a diff in the pull request:

    https://github.com/apache/drill/pull/372#discussion_r55277730
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VarLengthValuesColumn.java ---
    @@ -81,6 +112,14 @@ public boolean skipReadyToReadPositionUpdate() {
         return false;
       }
     
    +  // decimalLengths list is part of a near-term fix for DRILL-4184.
    +  // Decimal[23]8SparseVector classes are fixed width vectors, without ability to "remember" offsets of
    +  // (variable width) field sizes.  so, we "remember" the array sizes in decimalLengths (also used to
    +  // "remember" whether a value was null, for nullable decimal columns).
    +  // TODO: storage of decimal values should support variable length values in a much cleaner way than this,
    +  // perhaps with a new variable width Decimal vector class.
    +  protected ArrayList<Integer> decimalLengths = new ArrayList();
    --- End diff --
    
    I have moved decimalLengths class member to top of the class.  I found that the list is NOT always accessed at the end, hence left it as a list.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-4184: support variable length decimal fi...

Posted by daveoshinsky <gi...@git.apache.org>.
Github user daveoshinsky commented on a diff in the pull request:

    https://github.com/apache/drill/pull/372#discussion_r55417098
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/NullableVarLengthValuesColumn.java ---
    @@ -69,11 +73,16 @@ protected boolean readAndStoreValueSizeInformation() throws IOException {
         if ( currDefLevel == -1 ) {
           currDefLevel = pageReader.definitionLevels.readInteger();
         }
    -    if ( columnDescriptor.getMaxDefinitionLevel() > currDefLevel) {
    +
    +    if (columnDescriptor.getMaxDefinitionLevel() > currDefLevel) {
           nullsRead++;
    -      // set length of zero, each index in the vector defaults to null so no need to set the nullability
    -      variableWidthVector.getMutator().setValueLengthSafe(
    -          valuesReadInCurrentPass + pageReader.valuesReadyToRead, 0);
    +      // set length of zero, each index in the vector defaults to null so no
    +      // need to set the nullability
    +      if (variableWidthVector == null) {
    --- End diff --
    
    Regarding the two variables variableWidthVector and fixedWidthVector that I added, here is my reasoning.  Either variableWidthVector is set if we have a VariableWidthVector, or fixedWidthVector is set if we have a FixedWidthVector (i.e., decimal).  Hence, variableWidthVector is non-null if and only if we are to invoke the pre-existing logic, that assumed a variable width vector.  When variableWidthVector is null (fixedWidthVector is non-null, but not currently used), we invoke the new logic to save the length information in decimalLengths.  If this is no good, please tell me why, and suggest an alternative.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request: DRILL-4184: support variable length decimal fi...

Posted by daveoshinsky <gi...@git.apache.org>.
Github user daveoshinsky commented on the pull request:

    https://github.com/apache/drill/pull/372#issuecomment-192999951
  
    Regarding the overall intent of the fix, as the "TODO" comment on decimalLengths implies, it's intended only as a short-term fix.  More long-term, I would suggest that decimal values should be stored in a VariableWidthVector (which was assumed by VarLenghValuesColumn, hence the class cast exception).  This would use memory more efficiently when most values are far smaller than full precision, as is often the case (think java.math.BigDecimal, which operates this way).  Moreover, there would be no need to have a whole bunch of separate (generated) classes for different decimal precisions.  Just one class, variable width, handling any precision.  I also suggest that some other "special cases" could be combined.  Fixed width is a special case of variable width, where there's no need to store a separate length for each value.  Non-nullable is a special case of nullable, where there's no need to store a nullable boolean (or equivalent) for each value.  One last bit of feedback - it wou
 ld be much easier to maintain the code if it did not involve generation of code (freemarker).  An old-fashioned class hierarchy, with no generated code, would probably work just fine for the vectoring mechanisms.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---