Posted to dev@drill.apache.org by "Jinfeng Ni (JIRA)" <ji...@apache.org> on 2016/01/07 01:05:39 UTC

[jira] [Created] (DRILL-4250) File system directory-based partition pruning does not work when a directory contains both subdirectories and files.

Jinfeng Ni created DRILL-4250:
---------------------------------

             Summary: File system directory-based partition pruning does not work when a directory contains both subdirectories and files.  
                 Key: DRILL-4250
                 URL: https://issues.apache.org/jira/browse/DRILL-4250
             Project: Apache Drill
          Issue Type: Bug
          Components: Query Planning & Optimization
            Reporter: Jinfeng Ni


Directory-based partition pruning does not work when a directory contains both subdirectories and files.

For example, I have the following directory structure with nation.parquet (copied from the TPC-H sample dataset).

.//2001/Q1/nation.parquet
.//2001/Q2/nation.parquet

For the following query, directory-based partition pruning works correctly.
 
{code}
explain plan for select * from dfs.tmp.fileAndDir where dir0 = 2001 and dir1 = 'Q1';
00-00    Screen
00-01      Project(*=[$0])
00-02        Project(*=[$0])
00-03          Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/tmp/fileAndDir/2001/Q1/nation.parquet]], selectionRoot=file:/tmp/fileAndDir, numFiles=1, usedMetadataFile=false, columns=[`*`]]])
{code}

However, if I add a nation.parquet file to the 2001 directory, as follows:

.//2001/nation.parquet
.//2001/Q1/nation.parquet
.//2001/Q2/nation.parquet

Then partition pruning is no longer applied to the same query.
{code}
explain plan for select * from dfs.tmp.fileAndDir where dir0 = 2001 and dir1 = 'Q1';
00-00    Screen
00-01      Project(*=[$0])
00-02        Project(T0¦¦*=[$0])
00-03          SelectionVectorRemover
00-04            Filter(condition=[AND(=($1, 2001), =($2, 'Q1'))])
00-05              Project(T0¦¦*=[$0], dir0=[$1], dir1=[$2])
00-06                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/tmp/fileAndDir/2001/nation.parquet], ReadEntryWithPath [path=file:/tmp/fileAndDir/2001/Q1/nation.parquet], ReadEntryWithPath [path=file:/tmp/fileAndDir/2001/Q2/nation.parquet]], selectionRoot=file:/tmp/fileAndDir, numFiles=3, usedMetadataFile=false, columns=[`*`]]])
{code}

I should note that in the second case, where partition pruning did not work, the query still returned the correct result. Therefore, this issue only affects query performance, not the query result.
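
To make the expected behavior concrete, here is a minimal Python sketch of directory-based pruning over the three paths above. It is not Drill's implementation: the helper names dir_columns and prune are hypothetical, directory values are compared as strings, and the sketch assumes that a file sitting directly under 2001 has no dir1 value and therefore cannot satisfy dir1 = 'Q1'.

```python
from pathlib import PurePosixPath


def dir_columns(selection_root, file_path):
    """Return the directory names between selection_root and the file.

    Index 0 plays the role of dir0, index 1 of dir1, and so on.
    (Hypothetical helper, for illustration only.)
    """
    rel = PurePosixPath(file_path).relative_to(selection_root)
    return list(rel.parts[:-1])  # drop the file name itself


def prune(selection_root, files, wanted):
    """Keep only files whose dirN values match every entry in `wanted`.

    `wanted` maps a directory level N to the required value. A file with
    no value at level N (it sits higher in the tree) never matches.
    """
    kept = []
    for f in files:
        dirs = dir_columns(selection_root, f)
        if all(i < len(dirs) and dirs[i] == v for i, v in wanted.items()):
            kept.append(f)
    return kept


ROOT = "/tmp/fileAndDir"
FILES = [
    "/tmp/fileAndDir/2001/nation.parquet",
    "/tmp/fileAndDir/2001/Q1/nation.parquet",
    "/tmp/fileAndDir/2001/Q2/nation.parquet",
]

# Equivalent of: where dir0 = 2001 and dir1 = 'Q1'
survivors = prune(ROOT, FILES, {0: "2001", 1: "Q1"})
print(survivors)
```

Under these assumptions only the 2001/Q1 file survives, which matches the numFiles=1 scan in the first plan and is what the second plan would be expected to show if pruning were applied.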




