Posted to dev@drill.apache.org by "Robert Hou (JIRA)" <ji...@apache.org> on 2019/01/09 01:25:00 UTC
[jira] [Created] (DRILL-6957) Parquet rowgroup filtering can have incorrect file count

Robert Hou created DRILL-6957:
---------------------------------

             Summary: Parquet rowgroup filtering can have incorrect file count
                 Key: DRILL-6957
                 URL: https://issues.apache.org/jira/browse/DRILL-6957
             Project: Apache Drill
          Issue Type: Bug
            Reporter: Robert Hou
            Assignee: Jean-Blas IMBERT


If a query accesses all the files in a directory, the Scan operator in the plan reports that only one file is accessed.  The reported number of rowgroups is correct.

Here is an example query:
{noformat}
select count(*) from dfs.`/custdata/tudata/fact/vintage/snapshot_period_id=20151231/comp_id=120` where cur_tot_bal_amt < 100
{noformat}

Here is the plan:
{noformat}
00-00    Screen : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {9.8376721446E9 rows, 4.35668337906E10 cpu, 2.810763469E9 io, 4096.0 network, 0.0 memory}, id = 4477
00-01      Project(EXPR$0=[$0]) : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {9.8376721445E9 rows, 4.35668337905E10 cpu, 2.810763469E9 io, 4096.0 network, 0.0 memory}, id = 4476
00-02        StreamAgg(group=[{}], EXPR$0=[$SUM0($0)]) : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {9.8376721435E9 rows, 4.35668337895E10 cpu, 2.810763469E9 io, 4096.0 network, 0.0 memory}, id = 4475
00-03          UnionExchange : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {9.8376721425E9 rows, 4.35668337775E10 cpu, 2.810763469E9 io, 4096.0 network, 0.0 memory}, id = 4474
01-01            StreamAgg(group=[{}], EXPR$0=[COUNT()]) : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {9.8376721415E9 rows, 4.35668337695E10 cpu, 2.810763469E9 io, 0.0 network, 0.0 memory}, id = 4473
01-02              Project($f0=[0]) : rowType = RecordType(INTEGER $f0): rowcount = 1.4053817345E9, cumulative cost = {8.432290407E9 rows, 2.67022529555E10 cpu, 2.810763469E9 io, 0.0 network, 0.0 memory}, id = 4472
01-03                SelectionVectorRemover : rowType = RecordType(ANY cur_tot_bal_amt): rowcount = 1.4053817345E9, cumulative cost = {7.0269086725E9 rows, 2.10807260175E10 cpu, 2.810763469E9 io, 0.0 network, 0.0 memory}, id = 4471
01-04                  Filter(condition=[<($0, 100)]) : rowType = RecordType(ANY cur_tot_bal_amt): rowcount = 1.4053817345E9, cumulative cost = {5.621526938E9 rows, 1.9675344283E10 cpu, 2.810763469E9 io, 0.0 network, 0.0 memory}, id = 4470
01-05                    Scan(table=[[dfs, /custdata/tudata/fact/vintage/snapshot_period_id=20151231/comp_id=120]], groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:///custdata/tudata/fact/vintage/snapshot_period_id=20151231/comp_id=120]], selectionRoot=maprfs:/custdata/tudata/fact/vintage/snapshot_period_id=20151231/comp_id=120, numFiles=1, numRowGroups=1007, usedMetadataFile=false, columns=[`cur_tot_bal_amt`]]]) : rowType = RecordType(ANY cur_tot_bal_amt): rowcount = 2.810763469E9, cumulative cost = {2.810763469E9 rows, 2.810763469E9 cpu, 2.810763469E9 io, 0.0 network, 0.0 memory}, id = 4469
{noformat}

numFiles is set to 1 when it should be set to 21.
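
For anyone who wants to check this against their own plans, here is a minimal sketch that pulls numFiles and numRowGroups out of the Scan line and compares the file count with the expected value. The helper name scan_counts and the regular expression are illustrative assumptions, not part of Drill, and the Scan line is abbreviated from the plan above.

```python
import re

# Hypothetical helper (not part of Drill): extract the file and rowgroup
# counts from the ParquetGroupScan description in a plan's text.
SCAN_COUNTS = re.compile(r"numFiles=(\d+), numRowGroups=(\d+)")


def scan_counts(plan_text):
    """Return (numFiles, numRowGroups) from the first match, or None."""
    match = SCAN_COUNTS.search(plan_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Abbreviated copy of the Scan line from the plan in this report.
scan_line = (
    "01-05 Scan(table=[[dfs, /custdata/tudata/fact/vintage/"
    "snapshot_period_id=20151231/comp_id=120]], groupscan=[ParquetGroupScan "
    "[... numFiles=1, numRowGroups=1007, usedMetadataFile=false, "
    "columns=[`cur_tot_bal_amt`]]])"
)

expected_files = 21  # file count stated in this report
num_files, num_row_groups = scan_counts(scan_line)
print(num_files, num_row_groups, num_files == expected_files)
```

With the plan above, the reported file count (1) does not match the expected count (21), while the rowgroup count is read as 1007.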

All the files are in one directory.  If I add a level of directories (i.e. a directory containing multiple subdirectories, each with files), then I get the correct file count.


