Posted to issues@drill.apache.org by "Aman Sinha (JIRA)" <ji...@apache.org> on 2015/03/23 04:10:11 UTC
[jira] [Commented] (DRILL-2287) Filesystem partitioning is slow

    [ https://issues.apache.org/jira/browse/DRILL-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375331#comment-14375331 ] 

Aman Sinha commented on DRILL-2287:
-----------------------------------

Can you try the 0.8 release and share performance numbers for the tests you are running, along with details such as the data set size?

The purpose of partition pruning is to reduce scan I/O. If your partition filter matches all the directories, you are not reducing I/O but are still paying the cost of evaluating the filter. Also note that you cannot assume the partition filter is the only filter present; if other conditions are combined with it through ANDs and ORs, it is not always possible to remove the Filter node from the plan.
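
To make both points concrete, here is a minimal Python sketch of directory pruning. It is a toy model under stated assumptions, not Drill's planner code: the function names prune_partitions and can_drop_filter are hypothetical, and a filter is assumed to be given as a list of OR branches, each branch listing the columns its ANDed conditions reference.

```python
# Toy model of partition pruning; NOT Drill's planner code.
# The names prune_partitions and can_drop_filter are hypothetical.

def prune_partitions(all_dirs, wanted_dirs):
    """Return the directories that still have to be scanned."""
    wanted = set(wanted_dirs)
    return [d for d in all_dirs if d in wanted]


def can_drop_filter(or_branches, partition_col="dir0"):
    """Decide whether the whole filter can be dropped after pruning.

    Assumption: the filter is an OR of branches, and each branch is the
    list of column names referenced by its ANDed conditions. The filter
    is droppable only if every branch touches the partition column alone.
    """
    return all(
        all(col == partition_col for col in branch)
        for branch in or_branches
    )


all_dirs = ["1", "2", "3", "4"]

# dir0 in (1, 2, 3, 4): every directory survives, so scan I/O is unchanged.
print(prune_partitions(all_dirs, ["1", "2", "3", "4"]))

# dir0 in (4): only one directory has to be scanned.
print(prune_partitions(all_dirs, ["4"]))

# WHERE dir0 = 4: the filter is fully covered by pruning.
print(can_drop_filter([["dir0"]]))

# WHERE dir0 = 4 OR price > 10: the filter has to stay in the plan.
print(can_drop_filter([["dir0"], ["price"]]))
```

In this model, a filter that lists every directory prunes nothing, and any branch that references a non-partition column (price here) keeps the filter in the plan.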
 

> Filesystem partitioning is slow
> -------------------------------
>
>                 Key: DRILL-2287
>                 URL: https://issues.apache.org/jira/browse/DRILL-2287
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Query Planning & Optimization
>    Affects Versions: 0.7.0, 0.8.0
>            Reporter: Adam Gilmore
>            Assignee: Jinfeng Ni
>            Priority: Minor
>             Fix For: 0.9.0
>
>
> We have created a number of Parquet files in different directories (e.g. 1, 2, 3, 4) to partition our data on the filesystem.
> Assuming we only have 4 directories (1, 2, 3 and 4), when executing a query like:
> {code:sql}
> select sum(price) from dfs.tmp.mydata where dir0 in (1, 2, 3, 4)
> {code}
> The query is significantly slower than:
> {code:sql}
> select sum(price) from dfs.tmp.mydata
> {code}
> Looking at the physical plans, it looks like even if dir0 appears only in the WHERE clause, the scan still emits it, which then requires an extra step (a projection) to pass through only the column being summed (removing dir0).  This appears to be the cause of the slowdown.
> To make it even more confusing, if you select only the LAST directory (i.e. in this case, 4), the physical plan is different again and seems to use a union-exchange.
> Ultimately, the query planner should realise that dir0 is not projected and, once the filesystem filter pushdown is done, stop emitting dir0 from the scan so that no project is required.


