Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2018/07/09 16:34:57 UTC

[GitHub] okalinin opened a new pull request #1368: DRILL-5797: use Parquet new reader more often

URL: https://github.com/apache/drill/pull/1368
 
 
   # DRILL-5797: use Parquet new reader more often
   
   ## Background
   This PR is a follow-up to previous work done by @dprofeta and documented in the JIRA. Previously, the new reader was used only if the file schema did not contain a single complex column. With this change, the new reader will also be used on a complex schema as long as the queried column list does not contain any complex columns, which should make new reader usage more frequent.
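
   A minimal sketch of the new selection rule (not the actual Drill code; the helper name `canUseNewReader` and the way complex columns are flagged are assumptions for illustration):

   ```java
   import java.util.List;
   import java.util.Set;

   public class ReaderSelectionSketch {

     /**
      * New rule: the new Parquet reader can be used as long as none of the
      * queried columns is complex, even if the file schema contains complex
      * columns that the query does not touch.
      */
     static boolean canUseNewReader(List<List<String>> queriedPaths, Set<String> complexRoots) {
       return queriedPaths.stream().noneMatch(path -> complexRoots.contains(path.get(0)));
     }

     public static void main(String[] args) {
       Set<String> complexRoots = Set.of("b");            // `b` is a nested (complex) column
       System.out.println(canUseNewReader(
           List.of(List.of("a")), complexRoots));         // true  -> new reader can be used
       System.out.println(canUseNewReader(
           List.of(List.of("b", "a")), complexRoots));    // false -> fall back to old reader
     }
   }
   ```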
   
   ## Change description
   To make the new reader usable on a complex schema, the following modifications were needed:
   - `ParquetReaderUtility` class - modified and added several functions so that it works with nested schemas. E.g. one limitation was explicitly referencing the top-level schema element path with `column.getPath()[0]` in several locations. The element name alone was also used as the key when building the path-to-SchemaElement map, which corrupted the map when the schema contained columns `a` and `b`.`a` (the key `a` was used for both schema elements, so one entry overwrote the other); see the sketch after this list.
   - `ParquetSchema` - the `fieldSelected` function was replaced with `columnSelected` so that selection works on the full path. Previously it would fail when the schema contained columns `a` and `b`.`a`, because both schema paths would be marked as selected.
   - `ParquetColumnMetadata` - replaced the top-level path reference with the full path; also changed the parameter passed to `ParquetToDrillTypeConverter.toMajorType()` from `se.getType_length()` to `column.getTypeLength()`. The reason is that `se.getType_length()` returns 0 for a FIXED_LEN_BYTE_ARRAY column, which made the minor type conversion fail in the complex Parquet tests, while `column.getTypeLength()` returns the correct value. In fact, I am not sure whether this is a Parquet bug - possibly a TBD item.
   - `AbstractParquetScanBatchCreator` - added a function which uses the `ParquetReaderUtility` helpers to determine whether the queried column list contains a complex column.
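
   The map-corruption case mentioned for `ParquetReaderUtility` above can be illustrated with a minimal, self-contained sketch (not the actual Drill code; a plain `String` stands in for Parquet's `SchemaElement`, and keying by the full dotted path is presumably what the new `buildFullColumnPath()` helper mentioned below provides):

   ```java
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;

   public class SchemaElementMapSketch {

     // Old behaviour: the map is keyed by the element name only, so the schema
     // elements for `a` and `b`.`a` collide and one entry overwrites the other.
     static Map<String, String> keyByName(List<List<String>> columnPaths) {
       Map<String, String> map = new HashMap<>();
       for (List<String> path : columnPaths) {
         map.put(path.get(path.size() - 1), String.join(".", path)); // value stands in for SchemaElement
       }
       return map;
     }

     // Fixed behaviour: the map is keyed by the full dotted path, so each column
     // keeps its own entry.
     static Map<String, String> keyByFullPath(List<List<String>> columnPaths) {
       Map<String, String> map = new HashMap<>();
       for (List<String> path : columnPaths) {
         map.put(String.join(".", path), String.join(".", path));
       }
       return map;
     }

     public static void main(String[] args) {
       List<List<String>> columns = List.of(List.of("a"), List.of("b", "a"));
       System.out.println(keyByName(columns));     // {a=b.a} -> the entry for column `a` was lost
       System.out.println(keyByFullPath(columns)); // both `a` and `b.a` keep their own entries
     }
   }
   ```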
   
   The added tests rely on the existing `complex.parquet` file used in other tests.
   
   ## Level of testing
   Build tests and the complex*q query tests from the Drill test framework were run. Tests were added for the newly introduced methods, except for `ParquetReaderUtility.buildFullColumnPath()`.
