Posted to issues@drill.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/07/09 16:35:00 UTC
[jira] [Commented] (DRILL-5797) Use more often the new parquet reader

    [ https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537205#comment-16537205 ] 

ASF GitHub Bot commented on DRILL-5797:
---------------------------------------

okalinin opened a new pull request #1368: DRILL-5797: use Parquet new reader more often
URL: https://github.com/apache/drill/pull/1368
 
 
   # DRILL-5797: use Parquet new reader more often
   
   ## Background
   This PR is a follow-up to previous work done by @dprofeta and documented in the JIRA. Previously, the new reader was used only if the file schema did not contain a single complex column. With this change, the new reader is also used on a complex schema as long as the queried column list does not contain a complex column, which should make new reader usage more frequent.
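
   To illustrate the difference between the old and the new selection rule, here is a minimal sketch. It is not the Drill code: the class, the method names and the map of column name to "is complex" flag are all hypothetical, and columns missing from the schema are simply assumed to be non-complex.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Drill.
public class ReaderChoiceSketch {

  // Old rule: a single complex column anywhere in the file schema
  // rules out the new reader.
  static boolean useNewReaderOld(Map<String, Boolean> isComplexByColumn) {
    return !isComplexByColumn.containsValue(true);
  }

  // New rule: only the columns projected by the query matter.
  // Assumption: a column absent from the schema counts as non-complex.
  static boolean useNewReaderNew(Map<String, Boolean> isComplexByColumn,
                                 List<String> projected) {
    for (String column : projected) {
      if (isComplexByColumn.getOrDefault(column, false)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Map<String, Boolean> schema = new HashMap<>();
    schema.put("id", false);
    schema.put("payload", true); // stands in for a nested (complex) column

    System.out.println(useNewReaderOld(schema));
    System.out.println(useNewReaderNew(schema, Arrays.asList("id")));
    System.out.println(useNewReaderNew(schema, Arrays.asList("id", "payload")));
  }
}
```

   In this sketch the old rule rejects the file outright, while the new rule accepts a query that projects only `id` and rejects one that also projects `payload`.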
   
   ## Change description
   To make the new reader usable on a complex schema, the following modifications had to be made:
   - `ParquetReaderUtility` class - modified and added several functions so that it works with nested schemas. E.g. one limitation was that the top-level schema element path was referenced explicitly with `column.getPath()[0]` in several locations. The top-level schema element path was also used when building the path to SchemaElement map, which corrupted the map when the schema contained columns `a` and `b`.`a` (key `a` was used for both schema elements, so one map entry overwrote the other).
   - `ParquetSchema` - `fieldSelected` function replaced with `columnSelected` so that it works with the full path. Previously, it would fail when the schema contained columns `a` and `b`.`a`, as both schema paths would be marked as selected.
   - `ParquetColumnMetadata` - replaced the top-level path reference with the full path; also changed the parameter passed to `ParquetToDrillTypeConverter.toMajorType()` from `se.getType_length()` to `column.getTypeLength()`. The reason is that `se.getType_length()` returns 0 on a FIXED_LEN_BYTE_ARRAY column, which caused a failure in minor type conversion and broke the complex parquet tests. `column.getTypeLength()` provides the correct result. I am not sure whether this is a Parquet bug - possibly a TBD item.
   - `AbstractParquetScanBatchCreator` - added a function which uses `ParquetReaderUtility` functions to determine whether the query column list contains a complex column.
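
   The map corruption mentioned for columns `a` and `b`.`a` can be reproduced with a minimal sketch. This is not the Drill implementation: the class and method names are hypothetical, a column is represented only by its list of path segments, and the full-path key is assumed to be the segments joined with a dot.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Drill.
public class ColumnPathKeySketch {

  // Key built from the element's own name: "a" and "b.a" both yield "a".
  static String nameKey(List<String> path) {
    return path.get(path.size() - 1);
  }

  // Key built from the full path (assumed here to be dot-joined segments).
  static String fullPathKey(List<String> path) {
    return String.join(".", path);
  }

  // Index every column path under the chosen key; a later put with the
  // same key silently replaces the earlier entry.
  static Map<String, List<String>> index(List<List<String>> paths, boolean useFullPath) {
    Map<String, List<String>> byKey = new HashMap<>();
    for (List<String> path : paths) {
      byKey.put(useFullPath ? fullPathKey(path) : nameKey(path), path);
    }
    return byKey;
  }

  public static void main(String[] args) {
    List<List<String>> paths = Arrays.asList(
        Arrays.asList("a"),
        Arrays.asList("b", "a"));

    System.out.println(index(paths, false).size()); // one entry: collision on "a"
    System.out.println(index(paths, true).size());  // two entries: "a" and "b.a"
  }
}
```

   With name-only keys the second column overwrites the first; with full-path keys both columns keep their own entry.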
   
   The added tests rely on the existing `complex.parquet` file used in other tests.
   
   ## Level of testing
   Build tests and complex*q query tests from the Drill test framework. Tests were added for the newly introduced methods, except for `ParquetReaderUtility.buildFullColumnPath()`.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Use more often the new parquet reader
> -------------------------------------
>
>                 Key: DRILL-5797
>                 URL: https://issues.apache.org/jira/browse/DRILL-5797
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>            Reporter: Damien Profeta
>            Assignee: Oleksandr Kalinin
>            Priority: Major
>             Fix For: 1.15.0
>
>
> The choice between the regular parquet reader and the optimized one is based on what types of columns are in the file, but the columns actually read by the query are not taken into account. We can slightly increase the number of cases where the optimized reader is used by checking whether the projected columns are simple or not.
> This is an optimization waiting for the fast parquet reader to handle complex structure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)