Posted to issues@drill.apache.org by "Aman Sinha (JIRA)" <ji...@apache.org> on 2017/03/15 20:10:42 UTC
[jira] [Commented] (DRILL-5357) Partition pruning information not available in query plan for COUNT aggregate query

    [ https://issues.apache.org/jira/browse/DRILL-5357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15926880#comment-15926880 ] 

Aman Sinha commented on DRILL-5357:
-----------------------------------

This should not be marked 'critical' if the issue is only about showing additional information in an Explain plan. For COUNT(*) and COUNT(non-nullable column), Drill converts the scan to a DirectGroupScan and uses the PojoRecordReader (that is what you see in the Explain plan instead of the typical ParquetGroupScan). The row count is read directly from the Parquet metadata and aggregated, instead of reading the actual data. If there is a filter condition on a partition column, we apply partition pruning and then read the metadata from the Parquet files within that partition. We could add some additional info to the DirectGroupScan to indicate how many files' metadata was read.
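
To illustrate the behaviour described above, here is a minimal Python sketch of answering COUNT(*) from file metadata after pruning by partition value. It is not Drill's code: the ParquetFileMeta class, the count_from_metadata function, the file names and the row counts are all hypothetical and exist only to show the idea, including the suggested "how many files' metadata was read" figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParquetFileMeta:
    """Hypothetical stand-in for the metadata of one Parquet file."""
    path: str
    partition_value: str  # value of the partition column (e.g. col_state)
    row_count: int        # row count recorded in the file's metadata


def count_from_metadata(files, partition_filter=None):
    """Answer COUNT(*) from metadata only, without reading any data pages.

    If partition_filter is given, files from other partitions are pruned
    first. Returns (total_rows, files_read), where files_read is the number
    of files whose metadata was consulted.
    """
    if partition_filter is None:
        selected = list(files)
    else:
        selected = [f for f in files if f.partition_value == partition_filter]
    total_rows = sum(f.row_count for f in selected)
    return total_rows, len(selected)


# Made-up file list; the names and counts are not taken from the issue.
files = [
    ParquetFileMeta("part_a.parquet", "CA", 100),
    ParquetFileMeta("part_b.parquet", "CA", 250),
    ParquetFileMeta("part_c.parquet", "NY", 400),
]

total, files_read = count_from_metadata(files, partition_filter="CA")
print(f"count={total}, files whose metadata was read={files_read}")
```

In this sketch the second return value is the kind of extra detail that could be surfaced in the plan so that pruning is visible even when no ParquetGroupScan appears.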

> Partition pruning information not available in query plan for COUNT aggregate query
> -----------------------------------------------------------------------------------
>
>                 Key: DRILL-5357
>                 URL: https://issues.apache.org/jira/browse/DRILL-5357
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 1.10.0
>         Environment: 3 node CentOS cluster
>            Reporter: Khurram Faraaz
>            Priority: Critical
>
> We are not seeing partition pruning information in the query plan for the COUNT(*) and COUNT(<col-name>) queries below.
> Drill 1.10.0-SNAPSHOT
> git commit id: b657d44f
> parquet table has 6 columns
> total number of rows = 1638640
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> CREATE TABLE tbl_prtn_prune_01 PARTITION BY (col_state) 
> AS 
> SELECT CAST(columns[0] AS DATE) col_date, 
> CAST(columns[1] AS CHAR(3)) col_state, 
> CAST(columns[2] AS INTEGER) col_prime, 
> CAST(columns[3] AS VARCHAR(256)) col_varstr, 
> CAST(columns[4] AS INTEGER) col_id, 
> CAST(columns[5] AS VARCHAR(50)) col_name 
> from `partition_prune_data.csv`;
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 1638640                    |
> +-----------+----------------------------+
> 1 row selected (17.675 seconds)
> 0: jdbc:drill:schema=dfs.tmp> select COUNT(*) from tbl_prtn_prune_01 where col_state = 'CA';
> +---------+
> | EXPR$0  |
> +---------+
> | 35653   |
> +---------+
> 1 row selected (0.471 seconds)
> 0: jdbc:drill:schema=dfs.tmp> explain plan for select COUNT(*) from tbl_prtn_prune_01 where col_state = 'CA';
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      Project(EXPR$0=[$0])
> 00-02        Project(EXPR$0=[$0])
> 00-03          Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@1d4bb67d[columns = null, isStarQuery = false, isSkipQuery = false]])
> {noformat}
> And then I did a REFRESH TABLE METADATA on the parquet table
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> refresh table metadata tbl_prtn_prune_01;
> +-------+-------------------------------------------------------------+
> |  ok   |                           summary                           |
> +-------+-------------------------------------------------------------+
> | true  | Successfully updated metadata for table tbl_prtn_prune_01.  |
> +-------+-------------------------------------------------------------+
> 1 row selected (0.321 seconds)
> 0: jdbc:drill:schema=dfs.tmp> explain plan for select COUNT(col_state) from tbl_prtn_prune_01 where col_state = 'CA';
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      Project(EXPR$0=[$0])
> 00-02        Project(EXPR$0=[$0])
> 00-03          Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@2e0f4be9[columns = null, isStarQuery = false, isSkipQuery = false]])
> 0: jdbc:drill:schema=dfs.tmp> explain plan for select COUNT(*) from tbl_prtn_prune_01 where col_state = 'CA';
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      Project(EXPR$0=[$0])
> 00-02        Project(EXPR$0=[$0])
> 00-03          Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@3fc1f8e7[columns = null, isStarQuery = false, isSkipQuery = false]])
> 0: jdbc:drill:schema=dfs.tmp> explain plan for select COUNT(col_date) from tbl_prtn_prune_01 where col_state = 'CA';
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      Project(EXPR$0=[$0])
> 00-02        Project(EXPR$0=[$0])
> 00-03          Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@7afc851e[columns = null, isStarQuery = false, isSkipQuery = false]])
> {noformat}


