You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@drill.apache.org by dprofeta <gi...@git.apache.org> on 2017/10/06 08:45:23 UTC

[GitHub] drill pull request #976: DRILL-5797: Choose parquet reader from read columns

GitHub user dprofeta opened a pull request:

    https://github.com/apache/drill/pull/976

    DRILL-5797: Choose parquet reader from read columns

    ParquetRecordReader is not able to read complex columns. However it is
    able to read simple columns in a file containing complex
    columns. Instead of looking at the file to choose the reader, we
    now choose which reader to use based on what columns is asked and if
    they are simple or not.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dprofeta/drill DRILL-5797

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/976.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #976
    
----
commit 9669dd2c0c61e56c76bc9939c4f1c01aab908baf
Author: Damien Profeta <da...@amadeus.com>
Date:   2017-10-06T08:40:22Z

    DRILL-5797: Choose parquet reader from read columns
    
    ParquetRecordReader is not able to read complex columns. However it is
    able to read simple columns in a file containing complex
    columns. Instead of looking at the file to choose the reader, we
    now choose which reader to use based on what columns is asked and if
    they are simple or not.

----


---

[GitHub] drill issue #976: DRILL-5797: Choose parquet reader from read columns

Posted by priteshm <gi...@git.apache.org>.

Github user priteshm commented on the issue:

    https://github.com/apache/drill/pull/976
  
    @sachouche can you please take a final look? If it looks good, maybe one of the committers can include this for the 1.12 release. @arina-ielchiieva ?


---

[GitHub] drill pull request #976: DRILL-5797: Choose parquet reader from read columns

Posted by dprofeta <gi...@git.apache.org>.

Github user dprofeta commented on a diff in the pull request:

    https://github.com/apache/drill/pull/976#discussion_r143403559
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
    @@ -156,18 +160,39 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
         return new ScanBatch(rowGroupScan, context, oContext, readers, implicitColumns);
       }
     
    -  private static boolean isComplex(ParquetMetadata footer) {
    -    MessageType schema = footer.getFileMetaData().getSchema();
    +  private static boolean isComplex(ParquetMetadata footer, List<SchemaPath> columns) {
    +    if (Utilities.isStarQuery(columns)) {
    +      MessageType schema = footer.getFileMetaData().getSchema();
     
    -    for (Type type : schema.getFields()) {
    -      if (!type.isPrimitive()) {
    -        return true;
    +      for (Type type : schema.getFields()) {
    +        if (!type.isPrimitive()) {
    +          return true;
    +        }
           }
    -    }
    -    for (ColumnDescriptor col : schema.getColumns()) {
    -      if (col.getMaxRepetitionLevel() > 0) {
    -        return true;
    +      for (ColumnDescriptor col : schema.getColumns()) {
    +        if (col.getMaxRepetitionLevel() > 0) {
    +          return true;
    +        }
    +      }
    +      return false;
    +    } else {
    +      for (SchemaPath column : columns) {
    +        if (isColumnComplex(footer.getFileMetaData().getSchema(), column)) {
    +          return true;
    +        }
           }
    +      return false;
    +    }
    +  }
    +
    +  private static boolean isColumnComplex(GroupType grouptype, SchemaPath column) {
    +    PathSegment.NameSegment root = column.getRootSegment();
    +    if (!grouptype.containsField(root.getPath().toLowerCase())) {
    +      return false;
    +    }
    +    Type type = grouptype.getType(root.getPath().toLowerCase());
    +    if (type.isRepetition(Type.Repetition.REPEATED) || !type.isPrimitive()) {
    --- End diff --
    
    Yes, sure. I wanted to check it in a loop first, but ParquetRecordReader doesn't handle any nested type, so the loop is not needed now. But I didn't refactor enough.


---

[GitHub] drill issue #976: DRILL-5797: Choose parquet reader from read columns

Posted by paul-rogers <gi...@git.apache.org>.

Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/976
  
    @dprofeta, tried to commit this PR, but ran into multiple functional test failures:
    
    ```
    Execution Failures:
    /root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex12.q
    /root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex8.q
    /root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex56.q
    /root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex274.q
    /root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex7.q
    /root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex57.q
    /root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex102.q
    /root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex5.q
    /root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex10.q
    /root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex9.q
    /root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex203.q
    /root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex101.q
    /root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex275.q
    /root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex6.q
    /root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex205.q
    /root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex11.q
    /root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex58.q
    /root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex153.q
    /root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex202.q
    /root/drillAutomation/mapr/framework/resources/Functional/complex/parquet/complex151.q
    ```
    
    The common failure stack trace seems to be:
    
    ```
        org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.handleException():272
        org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.setup():256
        org.apache.drill.exec.physical.impl.ScanBatch.getNextReaderIfHas():241
        org.apache.drill.exec.physical.impl.ScanBatch.next():167
    ...
    ``` 


---

[GitHub] drill pull request #976: DRILL-5797: Choose parquet reader from read columns

Posted by paul-rogers <gi...@git.apache.org>.

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/976#discussion_r143263332
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
    @@ -156,18 +160,39 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
         return new ScanBatch(rowGroupScan, context, oContext, readers, implicitColumns);
       }
     
    -  private static boolean isComplex(ParquetMetadata footer) {
    -    MessageType schema = footer.getFileMetaData().getSchema();
    +  private static boolean isComplex(ParquetMetadata footer, List<SchemaPath> columns) {
    +    if (Utilities.isStarQuery(columns)) {
    --- End diff --
    
    Perhaps a comment with some explanation? If wildcard query, we query all columns, so check if any of them are complex. If project list, then check only the projected columns.


---

[GitHub] drill pull request #976: DRILL-5797: Choose parquet reader from read columns

Posted by sachouche <gi...@git.apache.org>.

Github user sachouche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/976#discussion_r143837353
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
    @@ -156,18 +160,39 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
         return new ScanBatch(rowGroupScan, context, oContext, readers, implicitColumns);
       }
     
    -  private static boolean isComplex(ParquetMetadata footer) {
    -    MessageType schema = footer.getFileMetaData().getSchema();
    +  private static boolean isComplex(ParquetMetadata footer, List<SchemaPath> columns) {
    +    if (Utilities.isStarQuery(columns)) {
    +      MessageType schema = footer.getFileMetaData().getSchema();
     
    -    for (Type type : schema.getFields()) {
    -      if (!type.isPrimitive()) {
    -        return true;
    +      for (Type type : schema.getFields()) {
    +        if (!type.isPrimitive()) {
    +          return true;
    +        }
           }
    -    }
    -    for (ColumnDescriptor col : schema.getColumns()) {
    -      if (col.getMaxRepetitionLevel() > 0) {
    -        return true;
    +      for (ColumnDescriptor col : schema.getColumns()) {
    +        if (col.getMaxRepetitionLevel() > 0) {
    +          return true;
    +        }
    +      }
    +      return false;
    +    } else {
    +      for (SchemaPath column : columns) {
    +        if (isColumnComplex(footer.getFileMetaData().getSchema(), column)) {
    --- End diff --
    
    Can you please use the already defined schema variable instead of invoking "footer.getFileMetaData().getSchema()" multiple times.


---

[GitHub] drill issue #976: DRILL-5797: Choose parquet reader from read columns

Posted by paul-rogers <gi...@git.apache.org>.

Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/976
  
    Drill follows SQL rules and is case insensitive. If case sensitivity has snuck in somewhere (perhaps due to the use of `equals()` rather than `equalsIgnorCase()` or the use of a case-sensitive map), then we should fix that.
    
    Note also that column aliases should not be visible to the Parquet reader.


---

[GitHub] drill pull request #976: DRILL-5797: Choose parquet reader from read columns

Posted by paul-rogers <gi...@git.apache.org>.

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/976#discussion_r143264522
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
    @@ -156,18 +160,39 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
         return new ScanBatch(rowGroupScan, context, oContext, readers, implicitColumns);
       }
     
    -  private static boolean isComplex(ParquetMetadata footer) {
    -    MessageType schema = footer.getFileMetaData().getSchema();
    +  private static boolean isComplex(ParquetMetadata footer, List<SchemaPath> columns) {
    +    if (Utilities.isStarQuery(columns)) {
    +      MessageType schema = footer.getFileMetaData().getSchema();
     
    -    for (Type type : schema.getFields()) {
    -      if (!type.isPrimitive()) {
    -        return true;
    +      for (Type type : schema.getFields()) {
    +        if (!type.isPrimitive()) {
    +          return true;
    +        }
           }
    -    }
    -    for (ColumnDescriptor col : schema.getColumns()) {
    -      if (col.getMaxRepetitionLevel() > 0) {
    -        return true;
    +      for (ColumnDescriptor col : schema.getColumns()) {
    +        if (col.getMaxRepetitionLevel() > 0) {
    +          return true;
    +        }
    +      }
    +      return false;
    +    } else {
    +      for (SchemaPath column : columns) {
    +        if (isColumnComplex(footer.getFileMetaData().getSchema(), column)) {
    +          return true;
    +        }
           }
    +      return false;
    +    }
    +  }
    +
    +  private static boolean isColumnComplex(GroupType grouptype, SchemaPath column) {
    +    PathSegment.NameSegment root = column.getRootSegment();
    +    if (!grouptype.containsField(root.getPath().toLowerCase())) {
    +      return false;
    +    }
    +    Type type = grouptype.getType(root.getPath().toLowerCase());
    --- End diff --
    
    What are the semantics of `getType()`? Does it return null if there is no such type? If so, then we can retrieve the type and check if it is null. Otherwise, if it throws an exception, perhaps we can catch that. Either way, we avoid two searches for the same column and two conversions of the path to lower case.
    
    Note also that a recent change deprecated `getPath()`. The preferred form is `getName()` so it is clear that we are getting a single name part. If the Parquet column is nested (a.b.c, say), then we have to explicitly handle the multiple name parts as Drill does support names with dots and one cannot simply concatenate or split a path to get a string. That is, `["a.b", "c"]` is two fields, `["a", "b", "c"]` is three, but both are represented as a full path as "a.b.c", creating an ambiguity.


---

[GitHub] drill pull request #976: DRILL-5797: Choose parquet reader from read columns

Posted by dprofeta <gi...@git.apache.org>.

Github user dprofeta commented on a diff in the pull request:

    https://github.com/apache/drill/pull/976#discussion_r143403232
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
    @@ -156,18 +160,39 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
         return new ScanBatch(rowGroupScan, context, oContext, readers, implicitColumns);
       }
     
    -  private static boolean isComplex(ParquetMetadata footer) {
    -    MessageType schema = footer.getFileMetaData().getSchema();
    +  private static boolean isComplex(ParquetMetadata footer, List<SchemaPath> columns) {
    +    if (Utilities.isStarQuery(columns)) {
    +      MessageType schema = footer.getFileMetaData().getSchema();
     
    -    for (Type type : schema.getFields()) {
    -      if (!type.isPrimitive()) {
    -        return true;
    +      for (Type type : schema.getFields()) {
    +        if (!type.isPrimitive()) {
    +          return true;
    +        }
           }
    -    }
    -    for (ColumnDescriptor col : schema.getColumns()) {
    -      if (col.getMaxRepetitionLevel() > 0) {
    -        return true;
    +      for (ColumnDescriptor col : schema.getColumns()) {
    +        if (col.getMaxRepetitionLevel() > 0) {
    +          return true;
    +        }
    +      }
    +      return false;
    +    } else {
    +      for (SchemaPath column : columns) {
    +        if (isColumnComplex(footer.getFileMetaData().getSchema(), column)) {
    +          return true;
    +        }
           }
    +      return false;
    +    }
    +  }
    +
    +  private static boolean isColumnComplex(GroupType grouptype, SchemaPath column) {
    +    PathSegment.NameSegment root = column.getRootSegment();
    +    if (!grouptype.containsField(root.getPath().toLowerCase())) {
    +      return false;
    +    }
    +    Type type = grouptype.getType(root.getPath().toLowerCase());
    --- End diff --
    
    ok, for `getType()`. It throws an exception so I will catch it.
    I don't see any `getName()` in the `SchemaPath` / `PathSegment` class. Can you tell me which `getName()` you mean?


---

[GitHub] drill issue #976: DRILL-5797: Choose parquet reader from read columns

Posted by sachouche <gi...@git.apache.org>.

Github user sachouche commented on the issue:

    https://github.com/apache/drill/pull/976
  
    Sure!
    
    
    Regards,
    
    Salim
    
    ________________________________
    From: dprofeta <no...@github.com>
    Sent: Friday, October 6, 2017 8:52:51 AM
    To: apache/drill
    Cc: Salim Achouche; Mention
    Subject: Re: [apache/drill] DRILL-5797: Choose parquet reader from read columns (#976)
    
    
    @sachouche<https://github.com/sachouche> Can you review it?
    
    —
    You are receiving this because you were mentioned.
    Reply to this email directly, view it on GitHub<https://github.com/apache/drill/pull/976#issuecomment-334795292>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AckoK82kUi_xZmBVaiKuujb-rxzaL2sIks5spkzTgaJpZM4PwLDB>.



---

[GitHub] drill issue #976: DRILL-5797: Choose parquet reader from read columns

Posted by priteshm <gi...@git.apache.org>.

Github user priteshm commented on the issue:

    https://github.com/apache/drill/pull/976
  
    @dprofeta will you be able to address the issues before the release?


---

[GitHub] drill issue #976: DRILL-5797: Choose parquet reader from read columns

Posted by dprofeta <gi...@git.apache.org>.

Github user dprofeta commented on the issue:

    https://github.com/apache/drill/pull/976
  
    here is the updated PR.
    Yes, I also wanted to add group without repetition. It is only a matter of naming so it should not be hard but when I tested, the fast reader was not able to handle it.


---

[GitHub] drill issue #976: DRILL-5797: Choose parquet reader from read columns

Posted by sachouche <gi...@git.apache.org>.

Github user sachouche commented on the issue:

    https://github.com/apache/drill/pull/976
  
    +1
    looks good!


---

[GitHub] drill pull request #976: DRILL-5797: Choose parquet reader from read columns

Posted by paul-rogers <gi...@git.apache.org>.

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/976#discussion_r144074677
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
    @@ -156,20 +161,46 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
         return new ScanBatch(rowGroupScan, context, oContext, readers, implicitColumns);
       }
     
    -  private static boolean isComplex(ParquetMetadata footer) {
    -    MessageType schema = footer.getFileMetaData().getSchema();
    -
    -    for (Type type : schema.getFields()) {
    -      if (!type.isPrimitive()) {
    -        return true;
    +  private static boolean isComplex(ParquetMetadata footer, List<SchemaPath> columns) {
    +    /*
    +    ParquetRecordReader is not able to read any nested columns and is not able to handle repeated columns.
    +    It only handles flat column and optional column.
    +    If it is a wildcard query, we check every columns in the metadata.
    +    If not, we only check the projected columns.
    +    */
    --- End diff --
    
    Very small request: this is a great Javadoc comment, so please use this form:
    
    ```
    /**
     * Your comment here.
     */
    ```
    
    It may also be worth pointing out that the algorithm here works regardless of the form of the column:
    
    * `a`: Must consider the type of column `a` in Parquet.
    * `a.b`: The top level column `a` must be a structure in Parquet. (If not, then presumably an error is thrown later on.) So, no need to check `b`.
    * `a[10]`: The column `a` must be an array (repeated), so no need to check the column `SchemaPath` itself. Again, presumably, Drill will throw an error internally if it turns out that `a` is not an array.


---

[GitHub] drill pull request #976: DRILL-5797: Choose parquet reader from read columns

Posted by dprofeta <gi...@git.apache.org>.

Github user dprofeta commented on a diff in the pull request:

    https://github.com/apache/drill/pull/976#discussion_r143403657
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
    @@ -156,18 +160,39 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
         return new ScanBatch(rowGroupScan, context, oContext, readers, implicitColumns);
       }
     
    -  private static boolean isComplex(ParquetMetadata footer) {
    -    MessageType schema = footer.getFileMetaData().getSchema();
    +  private static boolean isComplex(ParquetMetadata footer, List<SchemaPath> columns) {
    +    if (Utilities.isStarQuery(columns)) {
    --- End diff --
    
    Added in a new commit.


---

[GitHub] drill issue #976: DRILL-5797: Choose parquet reader from read columns

Posted by dprofeta <gi...@git.apache.org>.

Github user dprofeta commented on the issue:

    https://github.com/apache/drill/pull/976
  
    @sachouche Can you review it?


---

[GitHub] drill pull request #976: DRILL-5797: Choose parquet reader from read columns

Posted by paul-rogers <gi...@git.apache.org>.

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/976#discussion_r143263740
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java ---
    @@ -156,18 +160,39 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
         return new ScanBatch(rowGroupScan, context, oContext, readers, implicitColumns);
       }
     
    -  private static boolean isComplex(ParquetMetadata footer) {
    -    MessageType schema = footer.getFileMetaData().getSchema();
    +  private static boolean isComplex(ParquetMetadata footer, List<SchemaPath> columns) {
    +    if (Utilities.isStarQuery(columns)) {
    +      MessageType schema = footer.getFileMetaData().getSchema();
     
    -    for (Type type : schema.getFields()) {
    -      if (!type.isPrimitive()) {
    -        return true;
    +      for (Type type : schema.getFields()) {
    +        if (!type.isPrimitive()) {
    +          return true;
    +        }
           }
    -    }
    -    for (ColumnDescriptor col : schema.getColumns()) {
    -      if (col.getMaxRepetitionLevel() > 0) {
    -        return true;
    +      for (ColumnDescriptor col : schema.getColumns()) {
    +        if (col.getMaxRepetitionLevel() > 0) {
    +          return true;
    +        }
    +      }
    +      return false;
    +    } else {
    +      for (SchemaPath column : columns) {
    +        if (isColumnComplex(footer.getFileMetaData().getSchema(), column)) {
    +          return true;
    +        }
           }
    +      return false;
    +    }
    +  }
    +
    +  private static boolean isColumnComplex(GroupType grouptype, SchemaPath column) {
    +    PathSegment.NameSegment root = column.getRootSegment();
    +    if (!grouptype.containsField(root.getPath().toLowerCase())) {
    +      return false;
    +    }
    +    Type type = grouptype.getType(root.getPath().toLowerCase());
    +    if (type.isRepetition(Type.Repetition.REPEATED) || !type.isPrimitive()) {
    --- End diff --
    
    `return type.isRepetition(Type.Repetition.REPEATED) || !type.isPrimitive();`


---

[GitHub] drill issue #976: DRILL-5797: Choose parquet reader from read columns

Posted by dprofeta <gi...@git.apache.org>.

Github user dprofeta commented on the issue:

    https://github.com/apache/drill/pull/976
  
    I updated the javadoc with Paul remarks.


---

[GitHub] drill issue #976: DRILL-5797: Choose parquet reader from read columns

Posted by sachouche <gi...@git.apache.org>.

Github user sachouche commented on the issue:

    https://github.com/apache/drill/pull/976
  
    Looking at the stack trace:
    - The code definitely is initializing a column of type REPEATABLE
    - The Fast Reader didn't expect this scenario so it used a default container (NullableVarBinary) for VL binary DT
    
    Why this is happening?
    - The code in ReadState::buildReader() is processing all selected columns
    - This information is obtained from the ParquetSchema
    - Looking at the code, this seems a case-sensitivity issue
    - The ParquetSchema is case-insensitive whereas the Parquet GroupType is not
    - Damien added a catch handler (column not found) to handle use-cases where we are projecting non-existing columns
    - This basically is leading to an unforeseen use-case
    - Assume column XYZ is complex
    - User uses an alias (xyz)
    - The new code will allow this column to pass and treat is as simple
    - The ParquetSchema is being case insensitive will process this column
    - and thus the exception in the test suite
    
    Suggested Fix
    - Create a map (key to-lower-case) and register all current row-group columns
    - Use this map to locate a selected column type



---