
Posted to issues@drill.apache.org by "Nate Putnam (JIRA)" <ji...@apache.org> on 2017/02/23 18:03:44 UTC

[jira] [Commented] (DRILL-5292) Better Parquet handling of sparse columns

    [ https://issues.apache.org/jira/browse/DRILL-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880951#comment-15880951 ] 

Nate Putnam commented on DRILL-5292:
------------------------------------

Digging into this further, one approach would be to modify the ParquetScanBatchCreator and ParquetRecordReader classes with the following general modifications.

* ParquetScanBatchCreator - Read the full list of footers for the SelectionRoot so that a Map of Parquet files to footers can be passed to the record reader. That class already has a TODO noting that this change would be desirable for performance reasons anyway.

* ParquetRecordReader - Refactor the nullFilledVectors to be the more general NullableVector instead of the specific NullableIntVector. 

* ParquetRecordReader - Use the Map of footers passed in from the ParquetScanBatchCreator to do a reconciliation on the schema. 
** If the requested vector is not present in the file being read, but is present in a different file and is optional, then add it as a NullableVector to the current file.
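
The reconciliation step described above could be sketched roughly as follows. This is a standalone illustration, not Drill code: the class SparseColumnReconciler, the ColumnDef type, the string type names, and the INT32 fallback constant are all hypothetical stand-ins for the real footer metadata and vector classes. It also assumes that a column defined only as required in other files, or defined with conflicting types, falls back to the current nullable-int behavior.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch only; none of these names exist in Drill.
class SparseColumnReconciler {

    // Stand-in for a column definition read from a Parquet footer.
    static final class ColumnDef {
        final String type;       // e.g. "INT64", "BINARY" (illustrative names)
        final boolean optional;  // true if the column is OPTIONAL in that file

        ColumnDef(String type, boolean optional) {
            this.type = type;
            this.optional = optional;
        }
    }

    // Mirrors the current behavior of filling missing columns as nullable ints.
    static final String FALLBACK_TYPE = "INT32";

    // For each requested column missing from currentFile, pick the type to use
    // for its null-filled vector. footers maps file path -> column name -> def.
    static Map<String, String> resolveMissingColumns(
            String currentFile,
            List<String> requestedColumns,
            Map<String, Map<String, ColumnDef>> footers) {
        Map<String, ColumnDef> current = footers.getOrDefault(
                currentFile, Collections.<String, ColumnDef>emptyMap());
        Map<String, String> nullFilled = new LinkedHashMap<>();
        for (String column : requestedColumns) {
            if (current.containsKey(column)) {
                continue; // present in this file, read normally
            }
            nullFilled.put(column,
                    typeFromOtherFiles(column, currentFile, footers).orElse(FALLBACK_TYPE));
        }
        return nullFilled;
    }

    // Returns the type other files agree on for an optional column, or empty
    // when no other file defines it as optional or the definitions conflict.
    static Optional<String> typeFromOtherFiles(
            String column,
            String currentFile,
            Map<String, Map<String, ColumnDef>> footers) {
        String found = null;
        for (Map.Entry<String, Map<String, ColumnDef>> entry : footers.entrySet()) {
            if (entry.getKey().equals(currentFile)) {
                continue;
            }
            ColumnDef def = entry.getValue().get(column);
            if (def == null || !def.optional) {
                continue; // assumption: only optional definitions are considered
            }
            if (found != null && !found.equals(def.type)) {
                return Optional.empty(); // conflicting types across files
            }
            found = def.type;
        }
        return Optional.ofNullable(found);
    }
}
```

In the real reader the returned type would select which nullable vector to allocate for the missing column, which is why nullFilledVectors would need to hold the more general vector type.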


> Better Parquet handling of sparse columns
> -----------------------------------------
>
>                 Key: DRILL-5292
>                 URL: https://issues.apache.org/jira/browse/DRILL-5292
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>    Affects Versions: 1.10.0
>            Reporter: Nate Putnam
>
> It appears the current implementation of ParquetRecordReader will fill in missing columns between files as a NullableIntVector. It would be better if the code could determine whether that column was defined in a different file (and didn't conflict) and use the defined data type.
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)