Posted to issues@drill.apache.org by "Nate Putnam (JIRA)" <ji...@apache.org> on 2017/02/23 18:03:44 UTC
[jira] [Commented] (DRILL-5292) Better Parquet handling of sparse columns
[ https://issues.apache.org/jira/browse/DRILL-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880951#comment-15880951 ]
Nate Putnam commented on DRILL-5292:
------------------------------------
Digging into this further, one approach would be to modify the ParquetScanBatchCreator and ParquetRecordReader classes along the following general lines:
* ParquetScanBatchCreator - Read the full list of footers for the SelectionRoot so that a Map of Parquet files to footers can be passed to the record reader. A TODO in that class already notes that this change would be desirable for performance reasons anyway.
* ParquetRecordReader - Refactor the nullFilledVectors to be the more general NullableVector instead of the specific NullableIntVector.
* ParquetRecordReader - Use the Map of footers passed in from the ParquetScanBatchCreator to reconcile the schema.
** If the requested vector is not present in the file being read, but is present in a different file and is optional, then add it as a NullableVector for the current file.
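
The reconciliation step above could be sketched roughly as follows. This is only an illustration in plain Java, not Drill code: SchemaReconciler, Column and resolveMissingColumn are hypothetical names, the nested Map stands in for the Map of Parquet files to footers, and type names are plain strings. Returning an empty result when files disagree on the type, or when the column is required elsewhere, is an assumption about how conflicts might be handled.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: none of these names are actual Drill classes.
final class SchemaReconciler {

    // Minimal stand-in for the column metadata found in a Parquet footer.
    static final class Column {
        final String type;       // e.g. "BIGINT", "VARCHAR"
        final boolean optional;  // true if the column is declared optional

        Column(String type, boolean optional) {
            this.type = type;
            this.optional = optional;
        }
    }

    /**
     * Looks for a column that is missing from currentFile in the footers of
     * the other files. Returns the type to use for the null-filled vector, or
     * an empty Optional if the column is not defined elsewhere, is required
     * in another file, or is defined with conflicting types (assumed policy).
     *
     * footers maps file path -> (column name -> column metadata).
     */
    static Optional<String> resolveMissingColumn(
            String currentFile,
            String columnName,
            Map<String, Map<String, Column>> footers) {
        String resolved = null;
        for (Map.Entry<String, Map<String, Column>> entry : footers.entrySet()) {
            if (entry.getKey().equals(currentFile)) {
                continue; // the column is known to be absent from this file
            }
            Column other = entry.getValue().get(columnName);
            if (other == null) {
                continue; // this file does not define the column either
            }
            if (!other.optional) {
                return Optional.empty(); // only optional columns are reconciled
            }
            if (resolved != null && !resolved.equals(other.type)) {
                return Optional.empty(); // files disagree on the type
            }
            resolved = other.type;
        }
        return Optional.ofNullable(resolved);
    }
}
```

A caller could keep the behaviour described in the issue as the fallback, for example resolveMissingColumn(file, column, footers).orElse("INT"), so that a column which cannot be reconciled is still filled as a nullable INT.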
> Better Parquet handling of sparse columns
> -----------------------------------------
>
> Key: DRILL-5292
> URL: https://issues.apache.org/jira/browse/DRILL-5292
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Parquet
> Affects Versions: 1.10.0
> Reporter: Nate Putnam
>
> It appears the current implementation of ParquetRecordReader fills in columns that are missing from some files as a NullableIntVector. It would be better if the code could determine whether that column is defined in a different file (without a conflict) and use the defined data type.
>