You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Gabor Kaszab (Code Review)" <ge...@cloudera.org> on 2022/06/07 15:31:23 UTC

[Impala-ASF-CR] IMPALA-9496: Allow struct type in the select list for Parquet tables

Gabor Kaszab has uploaded this change for review. ( http://gerrit.cloudera.org:8080/18596


Change subject: IMPALA-9496: Allow struct type in the select list for Parquet tables
......................................................................

IMPALA-9496: Allow struct type in the select list for Parquet tables

This patch is to extend the support of Struct columns in the select
list to Parquet files as well.

Testing:
  - There were a lot of tests already to exercise this functionality
    but they were only run on ORC table. I changed these to cover
    Parquet tables too.

Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
---
M be/src/exec/parquet/CMakeLists.txt
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-collection-column-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
A be/src/exec/parquet/parquet-complex-column-reader.cc
A be/src/exec/parquet/parquet-complex-column-reader.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
A be/src/exec/parquet/parquet-struct-column-reader.cc
A be/src/exec/parquet/parquet-struct-column-reader.h
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/workloads/functional-query/queries/QueryTest/nested-struct-in-select-list.test
M testdata/workloads/functional-query/queries/QueryTest/struct-in-select-list.test
M tests/query_test/test_nested_types.py
18 files changed, 556 insertions(+), 184 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/96/18596/1
-- 
To view, visit http://gerrit.cloudera.org:8080/18596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
Gerrit-Change-Number: 18596
Gerrit-PatchSet: 1
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>

[Impala-ASF-CR] IMPALA-9496: Allow struct type in the select list for Parquet tables

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18596 )

Change subject: IMPALA-9496: Allow struct type in the select list for Parquet tables
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10722/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
Gerrit-Change-Number: 18596
Gerrit-PatchSet: 1
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 07 Jun 2022 15:49:50 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9496: Allow struct type in the select list for Parquet tables

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18596 )

Change subject: IMPALA-9496: Allow struct type in the select list for Parquet tables
......................................................................


Patch Set 6:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/8259/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/18596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
Gerrit-Change-Number: 18596
Gerrit-PatchSet: 6
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 22 Jun 2022 13:14:44 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9496: Allow struct type in the select list for Parquet tables

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/18596 )

Change subject: IMPALA-9496: Allow struct type in the select list for Parquet tables
......................................................................


Patch Set 5: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18596/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18596/3//COMMIT_MSG@23
PS3, Line 23: ple ReadValue() on these readers instead of the
            :     batched version. The reason is that calling the batched reader in
            :     the member column readers would in fact read in batches, but it
            :     won't handle the case when the parent struct i
> Well I'm not completely in favor of the proposed approach. It would be comp
Ok, we can think about optimizations later.



-- 
To view, visit http://gerrit.cloudera.org:8080/18596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
Gerrit-Change-Number: 18596
Gerrit-PatchSet: 5
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Thu, 16 Jun 2022 11:16:56 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9496: Allow struct type in the select list for Parquet tables

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18596 )

Change subject: IMPALA-9496: Allow struct type in the select list for Parquet tables
......................................................................


Patch Set 6: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/18596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
Gerrit-Change-Number: 18596
Gerrit-PatchSet: 6
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 22 Jun 2022 17:55:06 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9496: Allow struct type in the select list for Parquet tables

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18596 )

Change subject: IMPALA-9496: Allow struct type in the select list for Parquet tables
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10732/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
Gerrit-Change-Number: 18596
Gerrit-PatchSet: 3
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 08 Jun 2022 13:06:59 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9496: Allow struct type in the select list for Parquet tables

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18596 )

Change subject: IMPALA-9496: Allow struct type in the select list for Parquet tables
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10731/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
Gerrit-Change-Number: 18596
Gerrit-PatchSet: 2
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 08 Jun 2022 13:02:43 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9496: Allow struct type in the select list for Parquet tables

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Hello Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18596

to look at the new patch set (#3).

Change subject: IMPALA-9496: Allow struct type in the select list for Parquet tables
......................................................................

IMPALA-9496: Allow struct type in the select list for Parquet tables

This patch is to extend the support of Struct columns in the select
list to Parquet files as well.

There are some limitation with this patch:
  - Dictionary filtering could work when we have conjuncts on a member
    of a struct, however, if this struct is given in the select list
    then the dictionary filtering is disabled. The reason is that in
    this case there would be a mismatch between the slot/tuple IDs in
    the conjunct between the ones in the select list due to expr
    substitution logic when a struct is in the select list. Solving
    this puzzle would be a nice future performance enhancement.
  - When structs are read in a batched manner it delegates the actual
    reading of the data to the column readers of its children, however,
    would use the simple ReadValue() on these readers instead of the
    batched version. The reason is that calling the batched reader in
    the member column readers would in fact read in batches, but it
    won't handle the case when the parent struct is NULL and would set
    only itself to NULL but not the parent struct. This might also be a
    future performance enhancement.

Testing:
  - There were a lot of tests already to exercise this functionality
    but they were only run on ORC table. I changed these to cover
    Parquet tables too.

Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
---
M be/src/exec/parquet/CMakeLists.txt
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-collection-column-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
A be/src/exec/parquet/parquet-complex-column-reader.cc
A be/src/exec/parquet/parquet-complex-column-reader.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
A be/src/exec/parquet/parquet-struct-column-reader.cc
A be/src/exec/parquet/parquet-struct-column-reader.h
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/workloads/functional-query/queries/QueryTest/nested-struct-in-select-list.test
M testdata/workloads/functional-query/queries/QueryTest/struct-in-select-list.test
M tests/query_test/test_nested_types.py
20 files changed, 586 insertions(+), 188 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/96/18596/3
-- 
To view, visit http://gerrit.cloudera.org:8080/18596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
Gerrit-Change-Number: 18596
Gerrit-PatchSet: 3
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-9496: Allow struct type in the select list for Parquet tables

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18596 )

Change subject: IMPALA-9496: Allow struct type in the select list for Parquet tables
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18596/2/tests/query_test/test_nested_types.py
File tests/query_test/test_nested_types.py:

http://gerrit.cloudera.org:8080/#/c/18596/2/tests/query_test/test_nested_types.py@153
PS2, Line 153: #
flake8: E265 block comment should start with '# '



-- 
To view, visit http://gerrit.cloudera.org:8080/18596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
Gerrit-Change-Number: 18596
Gerrit-PatchSet: 2
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 08 Jun 2022 12:45:00 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9496: Allow struct type in the select list for Parquet tables

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/18596 )

Change subject: IMPALA-9496: Allow struct type in the select list for Parquet tables
......................................................................


Patch Set 5:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/18596/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18596/3//COMMIT_MSG@23
PS3, Line 23: ple ReadValue() on these readers instead of the
            :     batched version. The reason is that calling the batched reader in
            :     the member column readers would in fact read in batches, but it
            :     won't handle the case when the parent struct i
> A relatively simple optimization that comes to mind is to use only the firs
Well I'm not completely in favor of the proposed approach. It would be complex to understand for a later reader why only one scalar member of a struct is added as a children to the struct and why the others aren't. The nested struct use case would make this even more complicated.

I opened a Jira with another suggestion for this. Added the Jira ID here to the commit msg.


http://gerrit.cloudera.org:8080/#/c/18596/3/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18596/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2235
PS3, Line 2235: HasStructColumnReader
> This means that if there are struct scanners, then we never use late materi
Added this to the commit msg.


http://gerrit.cloudera.org:8080/#/c/18596/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/18596/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1014
PS3, Line 1014:     // of a struct, where the struct is also given in the select list then skip dictionary
> typo
Done


http://gerrit.cloudera.org:8080/#/c/18596/1/tests/query_test/test_nested_types.py
File tests/query_test/test_nested_types.py:

http://gerrit.cloudera.org:8080/#/c/18596/1/tests/query_test/test_nested_types.py@153
PS1, Line 153: s
> flake8: E265 block comment should start with '# '
Note for myself: drop this commented line



-- 
To view, visit http://gerrit.cloudera.org:8080/18596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
Gerrit-Change-Number: 18596
Gerrit-PatchSet: 5
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Thu, 16 Jun 2022 09:59:58 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9496: Allow struct type in the select list for Parquet tables

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18596 )

Change subject: IMPALA-9496: Allow struct type in the select list for Parquet tables
......................................................................


Patch Set 5:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10782/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
Gerrit-Change-Number: 18596
Gerrit-PatchSet: 5
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Thu, 16 Jun 2022 10:18:54 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9496: Allow struct type in the select list for Parquet tables

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Hello Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18596

to look at the new patch set (#5).

Change subject: IMPALA-9496: Allow struct type in the select list for Parquet tables
......................................................................

IMPALA-9496: Allow struct type in the select list for Parquet tables

This patch is to extend the support of Struct columns in the select
list to Parquet files as well.

There are some limitation with this patch:
  - Dictionary filtering could work when we have conjuncts on a member
    of a struct, however, if this struct is given in the select list
    then the dictionary filtering is disabled. The reason is that in
    this case there would be a mismatch between the slot/tuple IDs in
    the conjunct between the ones in the select list due to expr
    substitution logic when a struct is in the select list. Solving
    this puzzle would be a nice future performance enhancement. See
    IMPALA-11361.
  - When structs are read in a batched manner it delegates the actual
    reading of the data to the column readers of its children, however,
    would use the simple ReadValue() on these readers instead of the
    batched version. The reason is that calling the batched reader in
    the member column readers would in fact read in batches, but it
    won't handle the case when the parent struct is NULL and would set
    only itself to NULL but not the parent struct. This might also be a
    future performance enhancement. See IMPALA-11363.
  - If there is a struct in the select list then late materialization
    is turned off. The reason is that LM expects the column readers to
    be used through the batched reading interface, however, as said in
    the above bulletpoint currently struct column readers use the
    non-batched reading interface of its children. As a result after
    reading the column readers are not in a state as SkipRows() of LM
    expects and then results in a query failure because it's not able
    to skip the rows for non-filter readers.
    Once IMPALA-11363 is implemented and the struct will also use the
    ReadValueBatch() interface of its children then late
    materialization could be turned on even if structs are in the
    select list. See IMPALA-11364.

Testing:
  - There were a lot of tests already to exercise this functionality
    but they were only run on ORC table. I changed these to cover
    Parquet tables too.

Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
---
M be/src/exec/parquet/CMakeLists.txt
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-collection-column-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
A be/src/exec/parquet/parquet-complex-column-reader.cc
A be/src/exec/parquet/parquet-complex-column-reader.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
A be/src/exec/parquet/parquet-struct-column-reader.cc
A be/src/exec/parquet/parquet-struct-column-reader.h
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/workloads/functional-query/queries/QueryTest/nested-struct-in-select-list.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-legacy.test
M testdata/workloads/functional-query/queries/QueryTest/struct-in-select-list.test
M tests/query_test/test_nested_types.py
21 files changed, 605 insertions(+), 207 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/96/18596/5
-- 
To view, visit http://gerrit.cloudera.org:8080/18596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
Gerrit-Change-Number: 18596
Gerrit-PatchSet: 5
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-9496: Allow struct type in the select list for Parquet tables

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/18596 )

Change subject: IMPALA-9496: Allow struct type in the select list for Parquet tables
......................................................................

IMPALA-9496: Allow struct type in the select list for Parquet tables

This patch is to extend the support of Struct columns in the select
list to Parquet files as well.

There are some limitation with this patch:
  - Dictionary filtering could work when we have conjuncts on a member
    of a struct, however, if this struct is given in the select list
    then the dictionary filtering is disabled. The reason is that in
    this case there would be a mismatch between the slot/tuple IDs in
    the conjunct between the ones in the select list due to expr
    substitution logic when a struct is in the select list. Solving
    this puzzle would be a nice future performance enhancement. See
    IMPALA-11361.
  - When structs are read in a batched manner it delegates the actual
    reading of the data to the column readers of its children, however,
    would use the simple ReadValue() on these readers instead of the
    batched version. The reason is that calling the batched reader in
    the member column readers would in fact read in batches, but it
    won't handle the case when the parent struct is NULL and would set
    only itself to NULL but not the parent struct. This might also be a
    future performance enhancement. See IMPALA-11363.
  - If there is a struct in the select list then late materialization
    is turned off. The reason is that LM expects the column readers to
    be used through the batched reading interface, however, as said in
    the above bulletpoint currently struct column readers use the
    non-batched reading interface of its children. As a result after
    reading the column readers are not in a state as SkipRows() of LM
    expects and then results in a query failure because it's not able
    to skip the rows for non-filter readers.
    Once IMPALA-11363 is implemented and the struct will also use the
    ReadValueBatch() interface of its children then late
    materialization could be turned on even if structs are in the
    select list. See IMPALA-11364.

Testing:
  - There were a lot of tests already to exercise this functionality
    but they were only run on ORC table. I changed these to cover
    Parquet tables too.

Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
Reviewed-on: http://gerrit.cloudera.org:8080/18596
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/exec/parquet/CMakeLists.txt
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-collection-column-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
A be/src/exec/parquet/parquet-complex-column-reader.cc
A be/src/exec/parquet/parquet-complex-column-reader.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
A be/src/exec/parquet/parquet-struct-column-reader.cc
A be/src/exec/parquet/parquet-struct-column-reader.h
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/workloads/functional-query/queries/QueryTest/nested-struct-in-select-list.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-legacy.test
M testdata/workloads/functional-query/queries/QueryTest/struct-in-select-list.test
M tests/query_test/test_nested_types.py
21 files changed, 605 insertions(+), 207 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/18596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
Gerrit-Change-Number: 18596
Gerrit-PatchSet: 7
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-9496: Allow struct type in the select list for Parquet tables

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Hello Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18596

to look at the new patch set (#2).

Change subject: IMPALA-9496: Allow struct type in the select list for Parquet tables
......................................................................

IMPALA-9496: Allow struct type in the select list for Parquet tables

This patch is to extend the support of Struct columns in the select
list to Parquet files as well.

There are some limitation with this patch:
  - Dictionary filtering could work when we have conjuncts on a member
    of a struct, however, if this struct is given in the select list
    then the dictionary filtering is disabled. The reason is that in
    this case there would be a mismatch between the slot/tuple IDs in
    the conjunct between the ones in the select list due to expr
    substitution logic when a struct is in the select list. Solving
    this puzzle would be a nice future performance enhancement.
  - When structs are read in a batched manner it delegates the actual
    reading of the data to the column readers of its children, however,
    would use the simple ReadValue() on these readers instead of the
    batched version. The reason is that calling the batched reader in
    the member column readers would in fact read in batches, but it
    won't handle the case when the parent struct is NULL and would set
    only itself to NULL but not the parent struct. This might also be a
    future performance enhancement.

Testing:
  - There were a lot of tests already to exercise this functionality
    but they were only run on ORC table. I changed these to cover
    Parquet tables too.

Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
---
M be/src/exec/parquet/CMakeLists.txt
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-collection-column-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
A be/src/exec/parquet/parquet-complex-column-reader.cc
A be/src/exec/parquet/parquet-complex-column-reader.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
A be/src/exec/parquet/parquet-struct-column-reader.cc
A be/src/exec/parquet/parquet-struct-column-reader.h
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/workloads/functional-query/queries/QueryTest/nested-struct-in-select-list.test
M testdata/workloads/functional-query/queries/QueryTest/struct-in-select-list.test
M tests/query_test/test_nested_types.py
20 files changed, 587 insertions(+), 188 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/96/18596/2
-- 
To view, visit http://gerrit.cloudera.org:8080/18596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
Gerrit-Change-Number: 18596
Gerrit-PatchSet: 2
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-9496: Allow struct type in the select list for Parquet tables

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18596 )

Change subject: IMPALA-9496: Allow struct type in the select list for Parquet tables
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10745/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
Gerrit-Change-Number: 18596
Gerrit-PatchSet: 4
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Thu, 09 Jun 2022 14:02:23 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9496: Allow struct type in the select list for Parquet tables

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/18596 )

Change subject: IMPALA-9496: Allow struct type in the select list for Parquet tables
......................................................................


Patch Set 5: Code-Review+2

Upgrading to +2 as no one had other comments.


-- 
To view, visit http://gerrit.cloudera.org:8080/18596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
Gerrit-Change-Number: 18596
Gerrit-PatchSet: 5
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 22 Jun 2022 11:43:52 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9496: Allow struct type in the select list for Parquet tables

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18596 )

Change subject: IMPALA-9496: Allow struct type in the select list for Parquet tables
......................................................................


Patch Set 6: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/18596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
Gerrit-Change-Number: 18596
Gerrit-PatchSet: 6
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 22 Jun 2022 13:14:43 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9496: Allow struct type in the select list for Parquet tables

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Hello Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18596

to look at the new patch set (#4).

Change subject: IMPALA-9496: Allow struct type in the select list for Parquet tables
......................................................................

IMPALA-9496: Allow struct type in the select list for Parquet tables

This patch is to extend the support of Struct columns in the select
list to Parquet files as well.

There are some limitation with this patch:
  - Dictionary filtering could work when we have conjuncts on a member
    of a struct, however, if this struct is given in the select list
    then the dictionary filtering is disabled. The reason is that in
    this case there would be a mismatch between the slot/tuple IDs in
    the conjunct between the ones in the select list due to expr
    substitution logic when a struct is in the select list. Solving
    this puzzle would be a nice future performance enhancement.
  - When structs are read in a batched manner it delegates the actual
    reading of the data to the column readers of its children, however,
    would use the simple ReadValue() on these readers instead of the
    batched version. The reason is that calling the batched reader in
    the member column readers would in fact read in batches, but it
    won't handle the case when the parent struct is NULL and would set
    only itself to NULL but not the parent struct. This might also be a
    future performance enhancement.

Testing:
  - There were a lot of tests already to exercise this functionality
    but they were only run on ORC table. I changed these to cover
    Parquet tables too.

Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
---
M be/src/exec/parquet/CMakeLists.txt
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-collection-column-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
A be/src/exec/parquet/parquet-complex-column-reader.cc
A be/src/exec/parquet/parquet-complex-column-reader.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
A be/src/exec/parquet/parquet-struct-column-reader.cc
A be/src/exec/parquet/parquet-struct-column-reader.h
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/workloads/functional-query/queries/QueryTest/nested-struct-in-select-list.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-ambiguous-list-legacy.test
M testdata/workloads/functional-query/queries/QueryTest/struct-in-select-list.test
M tests/query_test/test_nested_types.py
21 files changed, 605 insertions(+), 207 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/96/18596/4
-- 
To view, visit http://gerrit.cloudera.org:8080/18596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
Gerrit-Change-Number: 18596
Gerrit-PatchSet: 4
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-9496: Allow struct type in the select list for Parquet tables

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18596 )

Change subject: IMPALA-9496: Allow struct type in the select list for Parquet tables
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18596/1/tests/query_test/test_nested_types.py
File tests/query_test/test_nested_types.py:

http://gerrit.cloudera.org:8080/#/c/18596/1/tests/query_test/test_nested_types.py@153
PS1, Line 153: #
flake8: E265 block comment should start with '# '



-- 
To view, visit http://gerrit.cloudera.org:8080/18596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
Gerrit-Change-Number: 18596
Gerrit-PatchSet: 1
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 07 Jun 2022 15:32:15 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9496: Allow struct type in the select list for Parquet tables

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/18596 )

Change subject: IMPALA-9496: Allow struct type in the select list for Parquet tables
......................................................................


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/18596/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18596/3//COMMIT_MSG@23
PS3, Line 23: The reason is that calling the batched reader in
            :     the member column readers would in fact read in batches, but it
            :     won't handle the case when the parent struct is NULL and would set
            :     only itself to NULL but not the parent struct.
A relatively simple optimization that comes to mind is to use only the first child of a struct in the struct reader and treat the others as normal scalar column readers. As I see in StructColumnReader::NextLevels / StructColumnReader::ReadValue we are already using only the first child's def level to set the nullness of the struct.

The selection of the children to use in the struct reader could be done like this:
- if the struct has only scalar children, then the first child would be kept
- if the struct has also struct children, then all struct children but no scalar children would be kept


http://gerrit.cloudera.org:8080/#/c/18596/3/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18596/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2235
PS3, Line 2235: HasStructColumnReader
This means that if there are struct scanners, then we never use late materialization, right? This could be mentioned among the limitations in the commit message.


http://gerrit.cloudera.org:8080/#/c/18596/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/18596/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1014
PS3, Line 1014:     // o a struct, where the struct is also given in the select list then skip dictionary
typo



-- 
To view, visit http://gerrit.cloudera.org:8080/18596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
Gerrit-Change-Number: 18596
Gerrit-PatchSet: 3
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Fri, 10 Jun 2022 14:28:05 +0000
Gerrit-HasComments: Yes