Posted to issues@drill.apache.org by "John Omernik (JIRA)" <ji...@apache.org> on 2016/07/01 18:04:10 UTC

[jira] [Created] (DRILL-4758) Option for Lazy/Late Materialization of columns during query with Parquet

John Omernik created DRILL-4758:
-----------------------------------

             Summary: Option for Lazy/Late Materialization of columns during query with Parquet
                 Key: DRILL-4758
                 URL: https://issues.apache.org/jira/browse/DRILL-4758
             Project: Apache Drill
          Issue Type: Improvement
          Components: Storage - Parquet
    Affects Versions: 1.6.0
            Reporter: John Omernik


On tables stored as Parquet with many columns, it appears that every column requested in the select statement is materialized for every row, regardless of whether the row passes the where clause filter.

For example, on a table with 100 columns, the query

select field1 from table where id = 123 and client BETWEEN 10 and 100 

scans a large amount of data (2 TB) in 30 seconds and returns no rows.

However, 

select * from table where id = 123 and client BETWEEN 10 and 100 

will take 15 minutes to run on the same amount of data, while still returning no rows.  

An option (perhaps it should be the default) to materialize the remaining columns only for rows that match the filter would provide a huge boon to performance.
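
To make the request concrete, here is a minimal sketch of the idea in Python. It is not Drill's Parquet reader, and every name in it (make_table, eager_scan, late_scan) is hypothetical. It assumes an in-memory columnar table and counts how many cell values each strategy reads; a real reader would work on row groups and pages, so the numbers are only illustrative.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical in-memory columnar table: column name -> one value per row.
Columns = Dict[str, List[int]]
Row = Dict[str, int]


def make_table(num_rows: int, num_cols: int) -> Columns:
    """Build a toy table with 'id', 'client' and filler columns (num_cols total)."""
    table: Columns = {
        "id": list(range(num_rows)),
        "client": [i % 50 for i in range(num_rows)],
    }
    for c in range(num_cols - 2):
        table[f"field{c}"] = [i + c for i in range(num_rows)]
    return table


def eager_scan(table: Columns, projected: Sequence[str], filter_cols: Sequence[str],
               predicate: Callable[[Row], bool]) -> Tuple[List[Row], int]:
    """Read every needed column for every row, then apply the filter."""
    needed = list(dict.fromkeys(list(filter_cols) + list(projected)))
    num_rows = len(next(iter(table.values())))
    cells_read = 0
    rows: List[Row] = []
    for i in range(num_rows):
        row = {name: table[name][i] for name in needed}
        cells_read += len(needed)
        if predicate(row):
            rows.append({name: row[name] for name in projected})
    return rows, cells_read


def late_scan(table: Columns, projected: Sequence[str], filter_cols: Sequence[str],
              predicate: Callable[[Row], bool]) -> Tuple[List[Row], int]:
    """Read only the filter columns first; read the rest for matching rows only."""
    num_rows = len(next(iter(table.values())))
    cells_read = 0
    matches: List[Tuple[int, Row]] = []
    for i in range(num_rows):
        key = {name: table[name][i] for name in filter_cols}
        cells_read += len(filter_cols)
        if predicate(key):
            matches.append((i, key))
    remaining = [name for name in projected if name not in filter_cols]
    rows: List[Row] = []
    for i, key in matches:
        row = {name: key[name] for name in projected if name in key}
        for name in remaining:
            row[name] = table[name][i]
        cells_read += len(remaining)
        rows.append(row)
    return rows, cells_read
```

With a 100-column toy table and a filter on id and client that matches nothing, eager_scan reads all 100 columns of every row, while late_scan reads only the two filter columns. That is the kind of gap between the two queries above, though the toy counts say nothing about actual Drill timings.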

If this is a concern because tables with a small number of columns would now have an extra step, one option would be to use table options (select with options) so that queries against certain tables use this behavior and queries against other tables do not.  This is up for discussion, but I think the first step is to discuss how something like this could be achieved.  The Impala project is also looking at this for Parquet files (IMPALA-2017).


