Posted to dev@drill.apache.org by "Aman Sinha (JIRA)" <ji...@apache.org> on 2016/08/24 01:12:21 UTC

[jira] [Created] (DRILL-4861) Revisit the 'entries' stored as part of ParquetGroupScan

Aman Sinha created DRILL-4861:
---------------------------------

             Summary: Revisit the 'entries' stored as part of ParquetGroupScan
                 Key: DRILL-4861
                 URL: https://issues.apache.org/jira/browse/DRILL-4861
             Project: Apache Drill
          Issue Type: Bug
          Components: Storage - Parquet
    Affects Versions: 1.7.0
            Reporter: Aman Sinha


The ParquetGroupScan stores a list of ReadEntryWithPath in the 'entries' field (https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java#L104) as well as a hash set of file names in the 'fileSet' field (https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java#L263).

The underlying data stored by both is essentially the same set of file names. We should try to consolidate these into a single entity. This is not just a code simplification; the duplication has a real performance cost: when a ParquetGroupScan is serialized and sent as part of a JSON plan fragment, the overhead is quite high if the number of files is large (tens of thousands or more).
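
One possible shape for the consolidation is sketched below. This is only an illustration, not the actual Drill code: ConsolidatedScanSketch, ReadEntry and toJsonArray are hypothetical names, and the sketch assumes each entry carries nothing but a file path. The paths are kept once in an insertion-ordered set, the 'entries' view is derived on demand, and only the single set would need to be written into the serialized plan.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; these are not the real Drill classes.
class ConsolidatedScanSketch {

    // Stand-in for ReadEntryWithPath, assumed here to wrap only a path.
    static final class ReadEntry {
        private final String path;
        ReadEntry(String path) { this.path = path; }
        String getPath() { return path; }
    }

    // Single source of truth: de-duplicated paths in insertion order.
    private final Set<String> files;

    ConsolidatedScanSketch(List<String> paths) {
        this.files = Collections.unmodifiableSet(new LinkedHashSet<>(paths));
    }

    // Plays the role of 'fileSet': supports fast membership checks.
    Set<String> getFileSet() {
        return files;
    }

    // Plays the role of 'entries': rebuilt from the set when needed,
    // so it never has to be stored or serialized separately.
    List<ReadEntry> getEntries() {
        List<ReadEntry> entries = new ArrayList<>(files.size());
        for (String file : files) {
            entries.add(new ReadEntry(file));
        }
        return entries;
    }

    // Naive JSON array writer (no escaping), only to show that the
    // paths would appear once in the serialized form.
    static String toJsonArray(Iterable<String> paths) {
        StringBuilder sb = new StringBuilder("[");
        boolean first = true;
        for (String p : paths) {
            if (!first) {
                sb.append(',');
            }
            sb.append('"').append(p).append('"');
            first = false;
        }
        return sb.append(']').toString();
    }
}
```

With tens of thousands of files, serializing one collection instead of two would roughly halve the bytes spent on path strings in each plan fragment, under the assumption above that both fields hold the same paths.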



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)