Posted to dev@drill.apache.org by "Aman Sinha (JIRA)" <ji...@apache.org> on 2016/08/24 01:12:21 UTC
[jira] [Created] (DRILL-4861) Revisit the 'entries' stored as part of ParquetGroupScan
Aman Sinha created DRILL-4861:
---------------------------------
Summary: Revisit the 'entries' stored as part of ParquetGroupScan
Key: DRILL-4861
URL: https://issues.apache.org/jira/browse/DRILL-4861
Project: Apache Drill
Issue Type: Bug
Components: Storage - Parquet
Affects Versions: 1.7.0
Reporter: Aman Sinha
The ParquetGroupScan stores a list of ReadEntryWithPath in the 'entries' field (https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java#L104) and also a hash set of file names in the 'fileSet' field (https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java#L263).
Both fields hold essentially the same set of file names, so we should try to consolidate them into a single entity. This is not just a code simplification: the duplication has a real performance cost. When a ParquetGroupScan is serialized and sent as part of a JSON plan fragment, the overhead is quite high if the number of files is large (tens of thousands or more).
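One possible shape for the consolidation is sketched below. It is only an illustration, not the actual ParquetGroupScan code: the class ScanFiles, its methods, and the hand-rolled JSON encoding are all hypothetical, and it assumes that both collections currently end up in the serialized form. The idea is to keep a single list as the source of truth and derive the set lazily, so only one copy of the file names is written out.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical stand-in for the file bookkeeping of a group scan.
// This is NOT Drill's ParquetGroupScan; names are for illustration only.
final class ScanFiles {
    // Single source of truth: the only field that would be serialized.
    private final List<String> entries;
    // Derived lookup structure, built on first use and never serialized.
    private transient Set<String> fileSet;

    ScanFiles(List<String> paths) {
        this.entries = Collections.unmodifiableList(new ArrayList<>(paths));
    }

    List<String> getEntries() {
        return entries;
    }

    // Rebuilds the set from 'entries' instead of storing a second copy.
    Set<String> getFileSet() {
        if (fileSet == null) {
            fileSet = Collections.unmodifiableSet(new LinkedHashSet<>(entries));
        }
        return fileSet;
    }

    // Naive JSON array encoding, for size comparison only (no escaping).
    static String toJsonArray(Collection<String> paths) {
        return paths.stream()
                .map(p -> "\"" + p + "\"")
                .collect(Collectors.joining(",", "[", "]"));
    }

    // Consolidated form: file names appear once.
    String toJson() {
        return "{\"entries\":" + toJsonArray(entries) + "}";
    }

    // Duplicated form: file names appear in both fields.
    String toJsonDuplicated() {
        return "{\"entries\":" + toJsonArray(entries)
                + ",\"fileSet\":" + toJsonArray(getFileSet()) + "}";
    }
}

class ScanFilesDemo {
    public static void main(String[] args) {
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < 20_000; i++) {
            paths.add("/data/table/part-" + i + ".parquet"); // made-up paths
        }
        ScanFiles files = new ScanFiles(paths);
        System.out.println("consolidated chars: " + files.toJson().length());
        System.out.println("duplicated chars:   " + files.toJsonDuplicated().length());
    }
}
```

With tens of thousands of files, the duplicated form is roughly twice the size of the consolidated one in this toy encoding; the real saving would depend on how the plan fragment is actually serialized.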