Posted to issues@drill.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2019/04/06 00:27:00 UTC
[jira] [Commented] (DRILL-7064) Leverage the summary's totalRowCount and totalNullCount for COUNT() queries (also prevent eager expansion of files)

    [ https://issues.apache.org/jira/browse/DRILL-7064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16811377#comment-16811377 ] 

ASF GitHub Bot commented on DRILL-7064:
---------------------------------------

amansinha100 commented on pull request #1736: DRILL-7064: Leverage the summary metadata for plain COUNT aggregates.
URL: https://github.com/apache/drill/pull/1736
 
 
   Please see [DRILL-7064](https://issues.apache.org/jira/browse/DRILL-7064) for a description of this enhancement.  
   
   This PR adds a logical planning rule, `ConvertCountToDirectScanRule`, that creates a DirectScan plan for plain COUNT(*) and COUNT(column) aggregates with no GROUP BY. The rule reads the Summary metadata cache file and fetches `totalRowCount` and the per-column `totalNullCount`.
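
To illustrate how the two summary values can answer these aggregates, here is a minimal sketch. It is not Drill's actual code: the `SummaryCountSketch` and `Summary` classes and their method names are hypothetical stand-ins for whatever the Summary metadata cache file is deserialized into. It relies only on standard SQL semantics, where COUNT(*) counts all rows and COUNT(column) counts non-null values. How the real rule behaves when a column has no null-count statistic is not described here, so the sketch simply returns an empty result in that case.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical sketch; these are NOT Drill's actual classes or APIs.
final class SummaryCountSketch {

    // Assumed in-memory view of the summary: one total row count plus a
    // per-column null count, both aggregated across all files and row groups.
    static final class Summary {
        final long totalRowCount;
        final Map<String, Long> totalNullCount;

        Summary(long totalRowCount, Map<String, Long> totalNullCount) {
            this.totalRowCount = totalRowCount;
            this.totalNullCount = new HashMap<>(totalNullCount);
        }
    }

    // COUNT(*) counts every row, so the total row count is the answer.
    static long countStar(Summary summary) {
        return summary.totalRowCount;
    }

    // COUNT(column) counts non-null values: total rows minus the column's nulls.
    // Returns empty when the summary has no null count for the column
    // (an assumption of this sketch: the caller would fall back to a regular scan).
    static OptionalLong countColumn(Summary summary, String column) {
        Long nulls = summary.totalNullCount.get(column);
        if (nulls == null) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(summary.totalRowCount - nulls);
    }
}
```

Because both values come from the summary alone, no per-row-group entry needs to be read to produce the counts.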
   
   Note to reviewer: please review only the DRILL-7064 commit here and ignore the DRILL-7063 commit, which contains the underlying metadata changes on which this PR is based.
 


> Leverage the summary's totalRowCount and totalNullCount for COUNT() queries (also prevent eager expansion of files)
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-7064
>                 URL: https://issues.apache.org/jira/browse/DRILL-7064
>             Project: Apache Drill
>          Issue Type: Sub-task
>          Components: Metadata
>            Reporter: Venkata Jyothsna Donapati
>            Assignee: Aman Sinha
>            Priority: Major
>             Fix For: 1.16.0
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> This sub-task is meant to leverage the Parquet metadata cache's summary stats, totalRowCount and the per-column totalNullCount (both aggregated across all files and row groups), to answer plain COUNT aggregation queries without GROUP BY. Such queries are currently converted to a DirectScan by the ConvertCountToDirectScanRule, which uses the row group metadata. However, this rule is applied to Drill logical rels and converts the logical plan to a physical plan with a DirectScanPrel. That is too late: the DrillScanRel created during logical planning has already read the entire metadata cache file along with its full list of row group entries. The metadata cache file can grow quite large, so this approach does not scale.
> The solution is to use the Metadata Summary file created in DRILL-7063 and add a new rule that applies early, so that it operates on the Calcite logical rels instead of the Drill logical rels and prevents eager expansion of the list of files/row groups.
> We will not remove the existing rule. It will continue to operate as before, because after some transformations we may still want to apply the optimizations for COUNT queries.


