You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Vuk Ercegovac (Code Review)" <ge...@cloudera.org> on 2017/11/06 07:31:44 UTC

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Vuk Ercegovac has uploaded this change for review. ( http://gerrit.cloudera.org:8080/8480


Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................

IMPALA-4985: use parquet stats of nested types for dynamic pruning

Currently, parquet row-groups can be pruned at run-time using
min/max stats when predicates (in, binary) are specified for
column scalar types. This patch extends pruning to nested types
for the same class of predicates. A nested value is defined to
be on a path of one or more nested types that is rooted at a
table column. For example, table T's column x is an array of structs,
one of whose fields is y. Given a predicate T.x.y > 3, the path
rooted at T is T.x.y. So long as all types on the path are required
(must be non-empty), the min-max pruning predicate can be used.

Testing:
- extended nested-types-parquet-stats e2e test cases.

Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
---
M be/src/exec/hdfs-parquet-scanner.h
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
4 files changed, 249 insertions(+), 13 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/8480/1
-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 1
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8480 )

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................


Patch Set 7: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 7
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Wed, 22 Nov 2017 02:05:34 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Vuk Ercegovac (Code Review)" <ge...@cloudera.org>.
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8480 )

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................


Patch Set 2:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/8480/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8480/2//COMMIT_MSG@13
PS2, Line 13: value
> Should this read "type"?
clarified


http://gerrit.cloudera.org:8080/#/c/8480/2//COMMIT_MSG@16
PS2, Line 16: the scalar value must be on a path
            : to the root of the nested value where every node on the path
            : is required
> I'm not sure I'm following the reasoning behind this. Please see my comment
reworded and included examples (in the tests) that I had trouble separating from cases that could indeed be pruned.


http://gerrit.cloudera.org:8080/#/c/8480/2/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8480/2/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@567
PS2, Line 567:       tryComputeMinMaxPredicate(analyzer, pred);
> nit: this looks like it could go on a single line now.
Done


http://gerrit.cloudera.org:8080/#/c/8480/2/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@575
PS2, Line 575:         tryComputeMinMaxPredicate(analyzer, pred);
> nit: single line?
Done


http://gerrit.cloudera.org:8080/#/c/8480/2/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test
File testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test:

http://gerrit.cloudera.org:8080/#/c/8480/2/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test@9
PS2, Line 9: row_regex: .*NumStatsFilteredRowGroups: 2 .*
> While you're here, do you mind converting them to the aggregation(..) synta
done. didn't make that connection until you pointed it out.


http://gerrit.cloudera.org:8080/#/c/8480/2/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test@58
PS2, Line 58: where bottom.item < -2;
> This looks like a c&p error from the query above. Can you double check that
latest update works.


http://gerrit.cloudera.org:8080/#/c/8480/2/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test@98
PS2, Line 98: where a.item.e < -10
> This may be seem like an ignorant question, but doesn't this predicate make
good point-- seems that the current approach is conservative. this could be pruned, but we currently do not. I've added a testcase below that illustrates an example of a collection filter, with no !empty guard, for which pruning would be incorrect (tried it and it produces different results than we currently do). I've also added a comment in the HDFSScanNode to point out that the current approach is conservative.


http://gerrit.cloudera.org:8080/#/c/8480/2/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test@107
PS2, Line 107: left outer join c.nested_struct.c.d cn, cn.item a where a.item.e < -10;
> Same here, if a row group has no values in nested_struct.c.d.item.item.e th
see comment above.



-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 2
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Wed, 15 Nov 2017 02:46:49 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Vuk Ercegovac (Code Review)" <ge...@cloudera.org>.
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8480 )

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8480/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8480/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@179
PS3, Line 179: it is checked to be not empty.
> It could help to explain how that check is generated, e.g. by mentioning th
Done


http://gerrit.cloudera.org:8080/#/c/8480/3/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test
File testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test:

http://gerrit.cloudera.org:8080/#/c/8480/3/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test@98
PS3, Line 98: !empty
> Do we introduce the "!empty" guard somewhere and explain what it means?
added a reference to where its discussed.


http://gerrit.cloudera.org:8080/#/c/8480/3/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test@145
PS3, Line 145: # Nested access in left-outer-join that would be incorrect if scan was pruned.
> Impressive find! Can you reword this to emphasize that this does not allow 
credit goes to Alex-- its a fun one. reworded.



-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 3
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Thu, 16 Nov 2017 22:19:48 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Vuk Ercegovac (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Alex Behm, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8480

to look at the new patch set (#10).

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................

IMPALA-4985: use parquet stats of nested types for dynamic pruning

Currently, parquet row-groups can be pruned at run-time using
min/max stats when predicates (in, binary) are specified for
column scalar types. This patch extends pruning to nested types
for the same class of predicates. A nested value is an instance
of a nested type (struct, array, map). A nested value consists of
other nested and scalar values (as declared by its type).
Predicates that can be used for row-group pruning must be applied to
nested scalar values. In addition, the parent of the nested scalar
must also be required, that is, not empty. The latter requirement
is conservative: some filters that could be used for pruning are
not used for correctness reasons.

Testing:
- extended nested-types-parquet-stats e2e test cases.

Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
---
M be/src/exec/hdfs-parquet-scanner.h
M fe/src/main/java/org/apache/impala/analysis/CollectionStructType.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test
M testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test
M testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test
M tests/query_test/test_nested_types.py
10 files changed, 649 insertions(+), 38 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/8480/10
-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 10
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Vuk Ercegovac (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Alex Behm, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8480

to look at the new patch set (#5).

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................

IMPALA-4985: use parquet stats of nested types for dynamic pruning

Currently, parquet row-groups can be pruned at run-time using
min/max stats when predicates (in, binary) are specified for
column scalar types. This patch extends pruning to nested types
for the same class of predicates. A nested value is an instance
of a nested type (struct, array, map). A nested value consists of
other nested and scalar values (as declared by its type).
Predicates that can be used for row-group pruning must be applied to
nested scalar values. In addition, the parent of the nested scalar
must also be required, that is, not empty. The latter requirement
is conservative: some filters that could be used for pruning are
not used for correctness reasons.

Testing:
- extended nested-types-parquet-stats e2e test cases.

Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
---
M be/src/exec/hdfs-parquet-scanner.h
M fe/src/main/java/org/apache/impala/analysis/CollectionStructType.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test
M tests/query_test/test_nested_types.py
7 files changed, 508 insertions(+), 35 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/8480/5
-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 5
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Vuk Ercegovac (Code Review)" <ge...@cloudera.org>.
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8480 )

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................


Patch Set 1:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/8480/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8480/1//COMMIT_MSG@12
PS1, Line 12: A nested value is defined to
            : be on a path of one or more nested types that is rooted at a
            : table column.
> I don't understand what that sentence means. Can you try to clarify the dis
Done


http://gerrit.cloudera.org:8080/#/c/8480/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8480/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@435
PS1, Line 435:   // Checks if slot refers to an array "pos" pseudo-column.
> Can you add a comment explaining why checking for getColumn() == null is no
added a clarifying comment. when I stepped through this code, getColumn was casting the net too. Yes it caught references to "pos" but it also included all nested types as well, which is not the intent with this change. If expectations for getColumn are different, then I can look into why its not behaving as expected.


http://gerrit.cloudera.org:8080/#/c/8480/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@441
PS1, Line 441: isMapStruct
> I think it would be clearer to add a isArrayStruct() method to CollectionSt
Done


http://gerrit.cloudera.org:8080/#/c/8480/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@564
PS1, Line 564:  
> nit: the surrounding code seems to omit this space.
Done


http://gerrit.cloudera.org:8080/#/c/8480/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@567
PS1, Line 567:       for (Expr pred: entry.getValue()) {
             :         if (pred instanceof BinaryPredicate) {
             :           tryComputeBinaryMinMaxPredicate(analyzer, (BinaryPredicate) pred);
             :         } else if (pred instanceof InPredicate) {
             :           tryComputeInListMinMaxPredicate(analyzer, (InPredicate) pred);
             :         }
             :       }
> This looks like a duplication of the above loop. Adding additional predicat
good point, done.


http://gerrit.cloudera.org:8080/#/c/8480/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1109
PS1, Line 1109: slot.getColumn() == null
> Is this another check for a pos slot?
hmm, looks like a proxy for estimating when a scan range will definitely be included. since nested types are currently not filtered, these columns will be scanned so it makes sense to increase the scanner estimate. 

for this patch, since we conservatively included possibly filtered scalar columns, we should treat nested types similarly, so it makes sense to me to leave the current logic as-is.


http://gerrit.cloudera.org:8080/#/c/8480/1/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test
File testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test:

http://gerrit.cloudera.org:8080/#/c/8480/1/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test@141
PS1, Line 141: Basics test
> I'm not sure I understand what Basics means. Can you elaborate? I think we 
yeah, that was confusing so removed it and clarified that these queries are distinguished from the previous ones since they are reading from the tpch data set.


http://gerrit.cloudera.org:8080/#/c/8480/1/testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test:

http://gerrit.cloudera.org:8080/#/c/8480/1/testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test@460
PS1, Line 460: ====
> Does this remove the trailing newline?
reverted... I had put tests here first so must have changed this file.



-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 1
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Tue, 14 Nov 2017 18:07:21 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8480 )

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................

IMPALA-4985: use parquet stats of nested types for dynamic pruning

Currently, parquet row-groups can be pruned at run-time using
min/max stats when predicates (in, binary) are specified for
column scalar types. This patch extends pruning to nested types
for the same class of predicates. A nested value is an instance
of a nested type (struct, array, map). A nested value consists of
other nested and scalar values (as declared by its type).
Predicates that can be used for row-group pruning must be applied to
nested scalar values. In addition, the parent of the nested scalar
must also be required, that is, not empty. The latter requirement
is conservative: some filters that could be used for pruning are
not used for correctness reasons.

Testing:
- extended nested-types-parquet-stats e2e test cases.

Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Reviewed-on: http://gerrit.cloudera.org:8080/8480
Reviewed-by: Alex Behm <al...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M be/src/exec/hdfs-parquet-scanner.h
M fe/src/main/java/org/apache/impala/analysis/CollectionStructType.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test
M testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test
M testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test
M tests/query_test/test_nested_types.py
10 files changed, 649 insertions(+), 38 deletions(-)

Approvals:
  Alex Behm: Looks good to me, approved
  Impala Public Jenkins: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 11
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8480 )

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................


Patch Set 8: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/1513/


-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 8
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Wed, 22 Nov 2017 05:39:57 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/8480 )

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................


Patch Set 3:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/8480/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8480/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@179
PS3, Line 179: it is checked to be not empty.
It could help to explain how that check is generated, e.g. by mentioning that the analyzer adds a IsNotEmptyPredicate for some queries.


http://gerrit.cloudera.org:8080/#/c/8480/3/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test
File testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test:

http://gerrit.cloudera.org:8080/#/c/8480/3/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test@98
PS3, Line 98: !empty
Do we introduce the "!empty" guard somewhere and explain what it means?


http://gerrit.cloudera.org:8080/#/c/8480/3/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test@145
PS3, Line 145: # Nested access in left-outer-join that would be incorrect if scan was pruned.
Impressive find! Can you reword this to emphasize that this does not allow pruning despite having a predicate on one of the columns that does not match any rows?


http://gerrit.cloudera.org:8080/#/c/8480/2/tests/query_test/test_nested_types.py
File tests/query_test/test_nested_types.py:

http://gerrit.cloudera.org:8080/#/c/8480/2/tests/query_test/test_nested_types.py@85
PS2, Line 85:   @SkipIfIsilon.hive
Thank you :)



-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 3
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Thu, 16 Nov 2017 21:28:26 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Vuk Ercegovac (Code Review)" <ge...@cloudera.org>.
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8480 )

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................


Patch Set 5:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/8480/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8480/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@439
PS5, Line 439:   private boolean isArrayPosReference(SlotRef slotRef) {
> Move to SlotRef?
ah, didn't understand this as "move this method to the SlotRef class". Done.


http://gerrit.cloudera.org:8080/#/c/8480/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@562
PS5, Line 562:       // Adds only predicates for collections that are guarded by an IsNotEmptyPredicate.
> guarded -> filtered
Done


http://gerrit.cloudera.org:8080/#/c/8480/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@563
PS5, Line 563:       // Its assumed that analysis adds these guards such that they are correct, but
> It is assumed
Done


http://gerrit.cloudera.org:8080/#/c/8480/5/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
File testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test:

http://gerrit.cloudera.org:8080/#/c/8480/5/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test@95
PS5, Line 95: where a.item.e < -10;
> Can you add a filter at all levels to make sure that all works together?
Done


http://gerrit.cloudera.org:8080/#/c/8480/5/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test@96
PS5, Line 96: ---- PLAN
> Do we need to go to explain level 2 in all these tests?
that's the lowest level (at the moment) that prints out "parquet statistics predicates".


http://gerrit.cloudera.org:8080/#/c/8480/5/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test@99
PS5, Line 99: PLAN-ROOT SINK
> Do we have tests for min-max filters on a top-level struct? I mean somethin
I have tests for collections nested in top-level structs. Added a test for scalar in a top-level struct. These are in the runtime filtering tests.


http://gerrit.cloudera.org:8080/#/c/8480/5/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test@266
PS5, Line 266: # Test collections in a way that would incorrect to apply a min-max
> garbled sentence
Done


http://gerrit.cloudera.org:8080/#/c/8480/5/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test
File testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test:

http://gerrit.cloudera.org:8080/#/c/8480/5/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test@40
PS5, Line 40: where  int_map.value < -1;
> Can you modify the tests to mix in more predicate variety? For example, use
Done


http://gerrit.cloudera.org:8080/#/c/8480/5/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test@145
PS5, Line 145: # False pruning example. There is one table that's scanned (complextypestbl).
> Add 1-2 more tests along these lines with non-selective min-max filters tha
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 5
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Tue, 21 Nov 2017 23:36:07 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8480 )

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................


Patch Set 4:

(12 comments)

Looks pretty good, mostly comments on phrasing.

http://gerrit.cloudera.org:8080/#/c/8480/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8480/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@179
PS4, Line 179:   // A collection type is "required" if it is checked to be not empty. Collection
Shrink? The following seems sufficient.

// Tuple descriptors of collection slots that have an IsNotEmptyPredicate. See SelectStmt#registerIsNotEmptyPredicates().

Do we really need a new term "required". The meaning of that term is not clear clear to me from this comment. I think my suggested comment above is pretty clear. A collection slot is required if an only if it has an IsNotEmptyPredicate.


http://gerrit.cloudera.org:8080/#/c/8480/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@184
PS4, Line 184:   // Analysis adds IsNotEmptyPredicates where needed to enforce required collections.
I think this could be misunderstood. IsNotEmptyPredicates are not required, they are purely a perf optimization, they don't really "enforce" anything in the query semantics sense.

Suggest shrinking text as indicated above.


http://gerrit.cloudera.org:8080/#/c/8480/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@436
PS4, Line 436:   // Checks if slot refers to an array "pos" pseudo-column.
nit: we typically use /** */ style class and function comments


http://gerrit.cloudera.org:8080/#/c/8480/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@440
PS4, Line 440:   private boolean isArrayPosReference(SlotRef slot) {
Move to SlotRef. I generally prefer 'slotRef' instead of 'slot' because the latter could be misunderstood as a SlotDescriptor.


http://gerrit.cloudera.org:8080/#/c/8480/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@442
PS4, Line 442:     if (parent != null) {
reverse? if (parent == null) return false;


http://gerrit.cloudera.org:8080/#/c/8480/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@533
PS4, Line 533:    * Adds nested collections that are required from 'conjuncts' to requiredCollections_.
Populates 'requiredCollections_' based on IsNotEmptyPredicates in the given conjuncts.

(don't think we need to say more)


http://gerrit.cloudera.org:8080/#/c/8480/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@537
PS4, Line 537:    * Example: table T has field A which is of type array<array<int>>.
I don't think examples really belong here because this is not where we are adding the predicates. Instead they should be in SelecmStmt.registerIsNotEmptyPredicates().


http://gerrit.cloudera.org:8080/#/c/8480/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@548
PS4, Line 548:         Preconditions.checkArgument(ref.getDesc().getType().isComplexType());
checkState


http://gerrit.cloudera.org:8080/#/c/8480/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@549
PS4, Line 549:         Preconditions.checkArgument(ref.getDesc().getItemTupleDesc() != null);
checkState


http://gerrit.cloudera.org:8080/#/c/8480/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@574
PS4, Line 574:       // Note: it is conservative (but correct) to depend on a collection to be required.
I think we should state what we mean more directly. We rely on analysis to generate IsNotEmptyPredicates where correct, but analysis is conservative and misses some opportunities.


http://gerrit.cloudera.org:8080/#/c/8480/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@586
PS4, Line 586:    * collection-typed slot. As conjuncts are seen, mark nested collections that are
mark -> collect


http://gerrit.cloudera.org:8080/#/c/8480/4/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test
File testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test:

http://gerrit.cloudera.org:8080/#/c/8480/4/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test@33
PS4, Line 33: aggregation(SUM, NumRowGroups): 2
we should also have planner tests to see how the explain looks



-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 4
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Mon, 20 Nov 2017 22:27:49 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Vuk Ercegovac (Code Review)" <ge...@cloudera.org>.
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8480 )

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................


Patch Set 6:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/8480/6/fe/src/main/java/org/apache/impala/analysis/SlotRef.java
File fe/src/main/java/org/apache/impala/analysis/SlotRef.java:

http://gerrit.cloudera.org:8080/#/c/8480/6/fe/src/main/java/org/apache/impala/analysis/SlotRef.java@154
PS6, Line 154:    * Checks if slotRef refers to an array "pos" pseudo-column.
> Checks if this SlotRef
Done


http://gerrit.cloudera.org:8080/#/c/8480/6/fe/src/main/java/org/apache/impala/analysis/SlotRef.java@160
PS6, Line 160:   public boolean isArrayPosReference() {
> isArrayPosRef()
Done


http://gerrit.cloudera.org:8080/#/c/8480/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8480/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@541
PS6, Line 541:       // It is assumed that analysis adds these guards such that they are correct, but
> guards -> filters
done. checked that this not used in the rest of this file.


http://gerrit.cloudera.org:8080/#/c/8480/5/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
File testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test:

http://gerrit.cloudera.org:8080/#/c/8480/5/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test@96
PS5, Line 96: ---- PLAN
> Got it. Might want to change that, but definitely not in this patch.
right, a naive change could cause too many changes/noise.

for the point about conflicts, perhaps a plan visitor to extract and test would be useful for checking for simple, local properties like these. that would move all of these tests from this text file to a plain java test.



-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 6
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Wed, 22 Nov 2017 00:05:47 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Vuk Ercegovac (Code Review)" <ge...@cloudera.org>.
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8480 )

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................


Patch Set 10:

found the issue with the planner test: mismatch between local and test environment. change includes the fix.

carrying +2


-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 10
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Wed, 22 Nov 2017 18:06:05 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8480 )

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................


Patch Set 10: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 10
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Wed, 22 Nov 2017 18:25:48 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Vuk Ercegovac (Code Review)" <ge...@cloudera.org>.
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8480 )

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................


Patch Set 4:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/8480/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8480/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@179
PS4, Line 179:   // A collection type is "required" if it is checked to be not empty. Collection
> Shrink? The following seems sufficient.
went with just referring to not-empty instead of not-empty and required.


http://gerrit.cloudera.org:8080/#/c/8480/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@184
PS4, Line 184:   // Analysis adds IsNotEmptyPredicates where needed to enforce required collections.
> I think this could be misunderstood. IsNotEmptyPredicates are not required,
Done


http://gerrit.cloudera.org:8080/#/c/8480/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@436
PS4, Line 436:   // Checks if slot refers to an array "pos" pseudo-column.
> nit: we typically use /** */ style class and function comments
whoops, Done


http://gerrit.cloudera.org:8080/#/c/8480/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@440
PS4, Line 440:   private boolean isArrayPosReference(SlotRef slot) {
> Move to SlotRef. I generally prefer 'slotRef' instead of 'slot' because the
Done. sync'd with similar uses below as well.


http://gerrit.cloudera.org:8080/#/c/8480/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@442
PS4, Line 442:     if (parent != null) {
> reverse? if (parent == null) return false;
Done


http://gerrit.cloudera.org:8080/#/c/8480/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@533
PS4, Line 533:    * Adds nested collections that are required from 'conjuncts' to requiredCollections_.
> Populates 'requiredCollections_' based on IsNotEmptyPredicates in the given
Done


http://gerrit.cloudera.org:8080/#/c/8480/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@537
PS4, Line 537:    * Example: table T has field A which is of type array<array<int>>.
> I don't think examples really belong here because this is not where we are 
Done


http://gerrit.cloudera.org:8080/#/c/8480/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@548
PS4, Line 548:         Preconditions.checkArgument(ref.getDesc().getType().isComplexType());
> checkState
Done


http://gerrit.cloudera.org:8080/#/c/8480/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@549
PS4, Line 549:         Preconditions.checkArgument(ref.getDesc().getItemTupleDesc() != null);
> checkState
Done


http://gerrit.cloudera.org:8080/#/c/8480/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@574
PS4, Line 574:       // Note: it is conservative (but correct) to depend on a collection to be required.
> I think we should state what we mean more directly. We rely on analysis to 
Done


http://gerrit.cloudera.org:8080/#/c/8480/4/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@586
PS4, Line 586:    * collection-typed slot. As conjuncts are seen, mark nested collections that are
> mark -> collect
Done


http://gerrit.cloudera.org:8080/#/c/8480/4/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test
File testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test:

http://gerrit.cloudera.org:8080/#/c/8480/4/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test@33
PS4, Line 33: aggregation(SUM, NumRowGroups): 2
> we should also have planner tests to see how the explain looks
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 4
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Tue, 21 Nov 2017 00:33:26 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8480 )

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................


Patch Set 8:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1513/


-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 8
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Wed, 22 Nov 2017 02:06:36 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8480 )

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................


Patch Set 5:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/8480/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8480/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@439
PS5, Line 439:   private boolean isArrayPosReference(SlotRef slotRef) {
Move to SlotRef?


http://gerrit.cloudera.org:8080/#/c/8480/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@562
PS5, Line 562:       // Adds only predicates for collections that are guarded by an IsNotEmptyPredicate.
guarded -> filtered

I think that makes it clearer that it's a pure perf optimization


http://gerrit.cloudera.org:8080/#/c/8480/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@563
PS5, Line 563:       // Its assumed that analysis adds these guards such that they are correct, but
It is assumed


http://gerrit.cloudera.org:8080/#/c/8480/5/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
File testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test:

http://gerrit.cloudera.org:8080/#/c/8480/5/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test@95
PS5, Line 95: where a.item.e < -10;
Can you add a filter at all levels to make sure that all works together?

If it's too hard with complextypestbl you can use tpch_nested_parquet.customer


http://gerrit.cloudera.org:8080/#/c/8480/5/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test@96
PS5, Line 96: ---- PLAN
Do we need to go to explain level 2 in all these tests?


http://gerrit.cloudera.org:8080/#/c/8480/5/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test@99
PS5, Line 99: PLAN-ROOT SINK
Do we have tests for min-max filters on a top-level struct? I mean something like:

create table t (s struct<f1:int,f2:int>);

select 1 from t where s.f1 < 10


http://gerrit.cloudera.org:8080/#/c/8480/5/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test@266
PS5, Line 266: # Test collections in a way that would incorrect to apply a min-max
garbled sentence


http://gerrit.cloudera.org:8080/#/c/8480/5/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test
File testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test:

http://gerrit.cloudera.org:8080/#/c/8480/5/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test@40
PS5, Line 40: where  int_map.value < -1;
Can you modify the tests to mix in more predicate variety? For example, use a binary predicate with "=", use an IN predicate somewhere, flip the lhs/rhs of some predicates, etc.


http://gerrit.cloudera.org:8080/#/c/8480/5/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test@145
PS5, Line 145: # False pruning example. There is one table that's scanned (complextypestbl).
Add 1-2 more tests along these lines with non-selective min-max filters that correctly prune nothing.



-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 5
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Tue, 21 Nov 2017 22:05:04 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/8480 )

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................


Patch Set 4: Code-Review+1


-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 4
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Thu, 16 Nov 2017 23:41:12 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Vuk Ercegovac (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8480

to look at the new patch set (#4).

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................

IMPALA-4985: use parquet stats of nested types for dynamic pruning

Currently, parquet row-groups can be pruned at run-time using
min/max stats when predicates (in, binary) are specified for
column scalar types. This patch extends pruning to nested types
for the same class of predicates. A nested value is an instance
of a nested type (struct, array, map). A nested value consists of
other nested and scalar values (as declared by its type).
Predicates that can be used for row-group pruning must be applied to
nested scalar values. In addition, the parent of the nested scalar
must also be required, that is, not empty. The latter requirement
is conservative: some filters that could be used for pruning are
not used for correctness reasons.

Testing:
- extended nested-types-parquet-stats e2e test cases.

Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
---
M be/src/exec/hdfs-parquet-scanner.h
M fe/src/main/java/org/apache/impala/analysis/CollectionStructType.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test
M tests/query_test/test_nested_types.py
5 files changed, 284 insertions(+), 24 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/8480/4
-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 4
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Vuk Ercegovac (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Alex Behm, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8480

to look at the new patch set (#6).

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................

IMPALA-4985: use parquet stats of nested types for dynamic pruning

Currently, parquet row-groups can be pruned at run-time using
min/max stats when predicates (in, binary) are specified for
column scalar types. This patch extends pruning to nested types
for the same class of predicates. A nested value is an instance
of a nested type (struct, array, map). A nested value consists of
other nested and scalar values (as declared by its type).
Predicates that can be used for row-group pruning must be applied to
nested scalar values. In addition, the parent of the nested scalar
must also be required, that is, not empty. The latter requirement
is conservative: some filters that could be used for pruning are
not used for correctness reasons.

Testing:
- extended nested-types-parquet-stats e2e test cases.

Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
---
M be/src/exec/hdfs-parquet-scanner.h
M fe/src/main/java/org/apache/impala/analysis/CollectionStructType.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test
M tests/query_test/test_nested_types.py
8 files changed, 644 insertions(+), 35 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/8480/6
-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 6
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Vuk Ercegovac (Code Review)" <ge...@cloudera.org>.
Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8480 )

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................


Patch Set 9:

fixes two tests that failed gvo. unable to repro locally third failure (and do not know where to get e2e logs from gvo run).


-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 9
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Wed, 22 Nov 2017 06:50:43 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8480 )

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................


Patch Set 10: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 10
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Wed, 22 Nov 2017 22:00:14 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8480 )

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................


Patch Set 8: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 8
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Wed, 22 Nov 2017 02:06:00 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8480 )

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................


Patch Set 10:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1515/


-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 10
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Wed, 22 Nov 2017 18:28:17 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/8480 )

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................


Patch Set 1:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/8480/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8480/1//COMMIT_MSG@12
PS1, Line 12: A nested value is defined to
            : be on a path of one or more nested types that is rooted at a
            : table column.
I don't understand what that sentence means. Can you try to clarify the distinction between nested value and nested type?


http://gerrit.cloudera.org:8080/#/c/8480/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8480/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@435
PS1, Line 435:   // Checks if slot refers to an array "pos" pseudo-column.
Can you add a comment explaining why checking for getColumn() == null is not sufficient?


http://gerrit.cloudera.org:8080/#/c/8480/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@441
PS1, Line 441: isMapStruct
I think it would be clearer to add a isArrayStruct() method to CollectionStructType to emphasize that that's what we're checking.


http://gerrit.cloudera.org:8080/#/c/8480/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@564
PS1, Line 564:  
nit: the surrounding code seems to omit this space.


http://gerrit.cloudera.org:8080/#/c/8480/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@567
PS1, Line 567:       for (Expr pred: entry.getValue()) {
             :         if (pred instanceof BinaryPredicate) {
             :           tryComputeBinaryMinMaxPredicate(analyzer, (BinaryPredicate) pred);
             :         } else if (pred instanceof InPredicate) {
             :           tryComputeInListMinMaxPredicate(analyzer, (InPredicate) pred);
             :         }
             :       }
This looks like a duplication of the above loop. Adding additional predicates in the future may require changing both loops. Have you considered factoring it into it's own method?


http://gerrit.cloudera.org:8080/#/c/8480/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1109
PS1, Line 1109: slot.getColumn() == null
Is this another check for a pos slot?


http://gerrit.cloudera.org:8080/#/c/8480/1/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test
File testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test:

http://gerrit.cloudera.org:8080/#/c/8480/1/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test@141
PS1, Line 141: Basics test
I'm not sure I understand what Basics means. Can you elaborate? I think we often order tests by ascending complexity so that the simpler ones fail before the complex ones.


http://gerrit.cloudera.org:8080/#/c/8480/1/testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test:

http://gerrit.cloudera.org:8080/#/c/8480/1/testdata/workloads/functional-query/queries/QueryTest/parquet-stats.test@460
PS1, Line 460: ====
Does this remove the trailing newline?



-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 1
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Comment-Date: Tue, 14 Nov 2017 00:10:08 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Vuk Ercegovac (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8480

to look at the new patch set (#3).

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................

IMPALA-4985: use parquet stats of nested types for dynamic pruning

Currently, parquet row-groups can be pruned at run-time using
min/max stats when predicates (in, binary) are specified for
column scalar types. This patch extends pruning to nested types
for the same class of predicates. A nested value is an instance
of a nested type (struct, array, map). A nested value consists of
other nested and scalar values (as declared by its type).
Predicates that can be used for row-group pruning must be applied to
nested scalar values. In addition, the parent of the nested scalar
must also be required, that is, not empty. The latter requirement
is conservative: some filters that could be used for pruning are
not used for correctness reasons.

Testing:
- extended nested-types-parquet-stats e2e test cases.

Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
---
M be/src/exec/hdfs-parquet-scanner.h
M fe/src/main/java/org/apache/impala/analysis/CollectionStructType.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test
M tests/query_test/test_nested_types.py
5 files changed, 280 insertions(+), 24 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/8480/3
-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 3
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Vuk Ercegovac (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8480

to look at the new patch set (#2).

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................

IMPALA-4985: use parquet stats of nested types for dynamic pruning

Currently, parquet row-groups can be pruned at run-time using
min/max stats when predicates (in, binary) are specified for
column scalar types. This patch extends pruning to nested types
for the same class of predicates. A nested value is an instance
of a nested type. The nested value itself is a tree of nested
values (e.g., map, array, struct) and scalar values. Predicates
that can be used for row-group pruning must be applied to scalar
nested values. In addition, the scalar value must be on a path
to the root of the nested value where every node on the path
is required. For example, let Table T's column x be an array
of structs, one of whose fields is y. Given a predicate T.x.y > 3,
the path rooted at T is T.x.y. So long as all types on the path
are required (must be non-empty), the min-max pruning predicate
can be used.

Testing:
- extended nested-types-parquet-stats e2e test cases.

Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
---
M be/src/exec/hdfs-parquet-scanner.h
M fe/src/main/java/org/apache/impala/analysis/CollectionStructType.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test
4 files changed, 259 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/8480/2
-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 2
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Vuk Ercegovac (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Alex Behm, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8480

to look at the new patch set (#9).

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................

IMPALA-4985: use parquet stats of nested types for dynamic pruning

Currently, parquet row-groups can be pruned at run-time using
min/max stats when predicates (in, binary) are specified for
column scalar types. This patch extends pruning to nested types
for the same class of predicates. A nested value is an instance
of a nested type (struct, array, map). A nested value consists of
other nested and scalar values (as declared by its type).
Predicates that can be used for row-group pruning must be applied to
nested scalar values. In addition, the parent of the nested scalar
must also be required, that is, not empty. The latter requirement
is conservative: some filters that could be used for pruning are
not used for correctness reasons.

Testing:
- extended nested-types-parquet-stats e2e test cases.

Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
---
M be/src/exec/hdfs-parquet-scanner.h
M fe/src/main/java/org/apache/impala/analysis/CollectionStructType.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test
M testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test
M testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test
M tests/query_test/test_nested_types.py
10 files changed, 649 insertions(+), 38 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/8480/9
-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 9
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Vuk Ercegovac (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Alex Behm, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8480

to look at the new patch set (#7).

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................

IMPALA-4985: use parquet stats of nested types for dynamic pruning

Currently, parquet row-groups can be pruned at run-time using
min/max stats when predicates (in, binary) are specified for
column scalar types. This patch extends pruning to nested types
for the same class of predicates. A nested value is an instance
of a nested type (struct, array, map). A nested value consists of
other nested and scalar values (as declared by its type).
Predicates that can be used for row-group pruning must be applied to
nested scalar values. In addition, the parent of the nested scalar
must also be required, that is, not empty. The latter requirement
is conservative: some filters that could be used for pruning are
not used for correctness reasons.

Testing:
- extended nested-types-parquet-stats e2e test cases.

Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
---
M be/src/exec/hdfs-parquet-scanner.h
M fe/src/main/java/org/apache/impala/analysis/CollectionStructType.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test
M tests/query_test/test_nested_types.py
8 files changed, 644 insertions(+), 35 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/8480/7
-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 7
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8480 )

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................


Patch Set 6:

(5 comments)

lgtm after final nits

http://gerrit.cloudera.org:8080/#/c/8480/6/fe/src/main/java/org/apache/impala/analysis/SlotRef.java
File fe/src/main/java/org/apache/impala/analysis/SlotRef.java:

http://gerrit.cloudera.org:8080/#/c/8480/6/fe/src/main/java/org/apache/impala/analysis/SlotRef.java@154
PS6, Line 154:    * Checks if slotRef refers to an array "pos" pseudo-column.
Checks if this SlotRef


http://gerrit.cloudera.org:8080/#/c/8480/6/fe/src/main/java/org/apache/impala/analysis/SlotRef.java@160
PS6, Line 160:   public boolean isArrayPosReference() {
isArrayPosRef()

(we typically abbreviate Reference with Ref)


http://gerrit.cloudera.org:8080/#/c/8480/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8480/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@439
PS5, Line 439:     Preconditions.checkState(slotRef.getDesc().isScanSlot());
> ah, didn't understand this as "move this method to the SlotRef class". Done
Sorry! I realize my comment was not clear. Thanks for addressing.


http://gerrit.cloudera.org:8080/#/c/8480/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8480/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@541
PS6, Line 541:       // It is assumed that analysis adds these guards such that they are correct, but
guards -> filters


http://gerrit.cloudera.org:8080/#/c/8480/5/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
File testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test:

http://gerrit.cloudera.org:8080/#/c/8480/5/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test@96
PS5, Line 96: ---- PLAN
> that's the lowest level (at the moment) that prints out "parquet statistics
Got it. Might want to change that, but definitely not in this patch.

The motivation for changing is that dealing with conflicts in planner tests is becoming increasingly painful.



-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 6
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Tue, 21 Nov 2017 23:57:17 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4985: use parquet stats of nested types for dynamic pruning

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/8480 )

Change subject: IMPALA-4985: use parquet stats of nested types for dynamic pruning
......................................................................


Patch Set 2:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/8480/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8480/2//COMMIT_MSG@13
PS2, Line 13: value
Should this read "type"?


http://gerrit.cloudera.org:8080/#/c/8480/2//COMMIT_MSG@16
PS2, Line 16: the scalar value must be on a path
            : to the root of the nested value where every node on the path
            : is required
I'm not sure I'm following the reasoning behind this. Please see my comments in the tests.


http://gerrit.cloudera.org:8080/#/c/8480/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8480/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1109
PS1, Line 1109: tScanRanges = 0;
> hmm, looks like a proxy for estimating when a scan range will definitely be
Sounds good, thanks for the clarification.


http://gerrit.cloudera.org:8080/#/c/8480/2/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8480/2/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@567
PS2, Line 567:       tryComputeMinMaxPredicate(analyzer, pred);
nit: this looks like it could go on a single line now.


http://gerrit.cloudera.org:8080/#/c/8480/2/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@575
PS2, Line 575:         tryComputeMinMaxPredicate(analyzer, pred);
nit: single line?


http://gerrit.cloudera.org:8080/#/c/8480/2/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test
File testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test:

http://gerrit.cloudera.org:8080/#/c/8480/2/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test@9
PS2, Line 9: row_regex: .*NumStatsFilteredRowGroups: 2 .*
While you're here, do you mind converting them to the aggregation(..) syntax? Then you could lift the restriction of 'num_nodes = 1' in test_nested_types.py.


http://gerrit.cloudera.org:8080/#/c/8480/2/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test@58
PS2, Line 58: where bottom.item < -2;
This looks like a c&p error from the query above. Can you double check that the tests run as you expect them to?


http://gerrit.cloudera.org:8080/#/c/8480/2/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test@98
PS2, Line 98: where a.item.e < -10
This may be seem like an ignorant question, but doesn't this predicate make the bottom field required? In general, does having a predicate on a field mean that it must not be null?


http://gerrit.cloudera.org:8080/#/c/8480/2/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-stats.test@107
PS2, Line 107: left outer join c.nested_struct.c.d cn, cn.item a where a.item.e < -10;
Same here, if a row group has no values in nested_struct.c.d.item.item.e that are < -10, then their rows will not show up in the result, no?



-- 
To view, visit http://gerrit.cloudera.org:8080/8480
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Gerrit-Change-Number: 8480
Gerrit-PatchSet: 2
Gerrit-Owner: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>
Gerrit-Comment-Date: Tue, 14 Nov 2017 18:54:08 +0000
Gerrit-HasComments: Yes