You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Daniel Becker (Code Review)" <ge...@cloudera.org> on 2022/04/12 16:17:58 UTC

[Impala-ASF-CR] IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

Daniel Becker has uploaded this change for review. ( http://gerrit.cloudera.org:8080/18403


Change subject: IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)
......................................................................

IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

When querying a non-toplevel nested struct from an ORC file, the NULL
values are displayed at an incorrect level. E.g.:

select id, outer_struct.inner_struct3 from
functional_orc_def.complextypes_nested_structs where id >= 4;
+----+----------------------------+
| id | outer_struct.inner_struct3 |
+----+----------------------------+
| 4  | {"s":{"i":null,"s":null}}  |
| 5  | {"s":null}                 |
+----+----------------------------+

However, in the first row it is expected that 's' should be null and not
its members; in the second row the result should be 'NULL', i.e.
'outer_struct.inner_struct3' is null.
For reference see what is returned when querying 'outer_struct' instead
of 'outer_struct.inner_struct3':

+----+-------------------------------------------------------------------------------------------------------------------------------+
| 4  | {"str":"","inner_struct1":{"str":"somestr2","de":12345.12},"inner_struct2":{"i":1,"str":"string"},"inner_struct3":{"s":null}} |
| 5  | {"str":null,"inner_struct1":null,"inner_struct2":null,"inner_struct3":null}                                                   |
+----+-------------------------------------------------------------------------------------------------------------------------------+

The problem comes from the incorrect handling of the different depths of
the following trees:
 - the ORC type hierarchy (schema)
 - the tuple descriptor / slot descriptor hierarchy
as the ORC type hierarchy contains a node for every level in the schema
but the tuple/slot descriptor hierarchy omits the levels of structs that
are not in the select list (but an ancestor of theirs is), as these
structs are not materialised.

In the case of the example query, the two hierarchies are the following:
ORC:
 root --> outer_struct -> inner_struct3 -> s --> i
      |                                      \-> s
      \-> id
Tuple/slot descriptors:
 main_tuple --> inner_struct3 -> s --> i
            |                      \-> s
            \-> id

We create 'OrcColumnReader's for each node in the ORC type tree. Each
OrcColumnReader is assigned an ORC type node and a slot descriptor. The
incorrect behaviour comes from the incorrect pairing of ORC type nodes
with slot descriptors.

The old behaviour is described below:
Starting from the root, going along a path in both trees (for example
the path leading to outer_struct.inner_struct3.s.i), for each step we
consume a level in both trees until no more nodes remain in the
tuple/slot desc tree, and then we pair the last element from that tree
with the remaining ORC type node(s).

In the example, we get the following pairs:
(root, main_tuple) -> (outer_struct, inner_struct3) ->
(inner_struct3, s) -> (s, i) -> (i, i)

When we run out of structs in the tuple/slot desc tree, we still create
OrcStructReaders (because the ORC type is still a struct, but the slot
descriptor now refers to an INT), but we mark them incorrectly as
non-materialised.

Also, the OrcStructReaders for non-materialised structs do not need to
check for null-ness as they are not present in the select list, only
their descendants, and the ORC batch object stores null information also
for the descendants of null values.

Let's look at the row with id 4 in the example:
Because of the bug, the  the non-materialising OrcStructReader appears
at the level of the (s, i) pair, so the 's' struct is not checked for
null-ness, although it is actually null. One level lower, for 'i' (and
the inner 's' string field), the ORC batch object tells us that the
values are null (because their parent is). Therefore the nulls appear
one level lower than they should.

The correct behaviour is that ORC type nodes are paired with slot
descriptors if either
 - the ORC type node matches the slot descriptor (they refer to the same
   node in the schema) or
 - the slot descriptor is a descendant of the schema node that the ORC
   type node refers to.

This patch fixes the incorrect pairing of ORC types and slot
descriptors, so we have the following pairs:
(root, main_tuple) -> (outer_struct, main_tuple) ->
(inner_struct3, inner_struct3) -> (s, s) -> (i, i)

In this case the OrcStructReader for the pair (outer_struct, main_tuple)
becomes non-materialising and the one for (s, s) will be materialising,
so the 's' struct will also be null-checked, recognising null-ness at
the correct level.

This commit also fixes some comments in be/src/exec/orc-column-readers.h
and be/src/exec/hdfs-orc-scanner.h mentioning the field
HdfsOrcScanner::col_id_path_map_, which has been removed by
"IMPALA-10485: part(1): make ORC column reader creation independent of
schema resolution".

Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
---
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
3 files changed, 12 insertions(+), 8 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/18403/1
-- 
To view, visit http://gerrit.cloudera.org:8080/18403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
Gerrit-Change-Number: 18403
Gerrit-PatchSet: 1
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>

[Impala-ASF-CR] IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18403 )

Change subject: IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10438/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
Gerrit-Change-Number: 18403
Gerrit-PatchSet: 2
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 13 Apr 2022 11:39:47 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has uploaded a new patch set (#6). ( http://gerrit.cloudera.org:8080/18403 )

Change subject: IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)
......................................................................

IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

When querying a non-toplevel nested struct from an ORC file, the NULL
values are displayed at an incorrect level. E.g.:

select id, outer_struct.inner_struct3 from
functional_orc_def.complextypes_nested_structs where id >= 4;
+----+----------------------------+
| id | outer_struct.inner_struct3 |
+----+----------------------------+
| 4  | {"s":{"i":null,"s":null}}  |
| 5  | {"s":null}                 |
+----+----------------------------+

However, in the first row it is expected that 's' should be null and not
its members; in the second row the result should be 'NULL', i.e.
'outer_struct.inner_struct3' is null.
For reference see what is returned when querying 'outer_struct' instead
of 'outer_struct.inner_struct3':

+----+-------------------------------------------------------------------------------------------------------------------------------+
| 4  | {"str":"","inner_struct1":{"str":"somestr2","de":12345.12},"inner_struct2":{"i":1,"str":"string"},"inner_struct3":{"s":null}} |
| 5  | {"str":null,"inner_struct1":null,"inner_struct2":null,"inner_struct3":null}                                                   |
+----+-------------------------------------------------------------------------------------------------------------------------------+

The problem comes from the incorrect handling of the different depths of
the following trees:
 - the ORC type hierarchy (schema)
 - the tuple descriptor / slot descriptor hierarchy
as the ORC type hierarchy contains a node for every level in the schema
but the tuple/slot descriptor hierarchy omits the levels of structs that
are not in the select list (but an ancestor of theirs is), as these
structs are not materialised.

In the case of the example query, the two hierarchies are the following:
ORC:
 root --> outer_struct -> inner_struct3 -> s --> i
      |                                      \-> s
      \-> id
Tuple/slot descriptors:
 main_tuple --> inner_struct3 -> s --> i
            |                      \-> s
            \-> id

We create 'OrcColumnReader's for each node in the ORC type tree. Each
OrcColumnReader is assigned an ORC type node and a slot descriptor. The
incorrect behaviour comes from the incorrect pairing of ORC type nodes
with slot descriptors.

The old behaviour is described below:
Starting from the root, going along a path in both trees (for example
the path leading to outer_struct.inner_struct3.s.i), for each step we
consume a level in both trees until no more nodes remain in the
tuple/slot desc tree, and then we pair the last element from that tree
with the remaining ORC type node(s).

In the example, we get the following pairs:
(root, main_tuple) -> (outer_struct, inner_struct3) ->
(inner_struct3, s) -> (s, i) -> (i, i)

When we run out of structs in the tuple/slot desc tree, we still create
OrcStructReaders (because the ORC type is still a struct, but the slot
descriptor now refers to an INT), but we mark them incorrectly as
non-materialised.

Also, the OrcStructReaders for non-materialised structs do not need to
check for null-ness as they are not present in the select list, only
their descendants, and the ORC batch object stores null information also
for the descendants of null values.

Let's look at the row with id 4 in the example:
Because of the bug, the non-materialising OrcStructReader appears at the
level of the (s, i) pair, so the 's' struct is not checked for
null-ness, although it is actually null. One level lower, for 'i' (and
the inner 's' string field), the ORC batch object tells us that the
values are null (because their parent is). Therefore the nulls appear
one level lower than they should.

The correct behaviour is that ORC type nodes are paired with slot
descriptors if either
 - the ORC type node matches the slot descriptor (they refer to the same
   node in the schema) or
 - the slot descriptor is a descendant of the schema node that the ORC
   type node refers to.

This patch fixes the incorrect pairing of ORC types and slot
descriptors, so we have the following pairs:
(root, main_tuple) -> (outer_struct, main_tuple) ->
(inner_struct3, inner_struct3) -> (s, s) -> (i, i)

In this case the OrcStructReader for the pair (outer_struct, main_tuple)
becomes non-materialising and the one for (s, s) will be materialising,
so the 's' struct will also be null-checked, recognising null-ness at
the correct level.

This commit also fixes some comments in be/src/exec/orc-column-readers.h
and be/src/exec/hdfs-orc-scanner.h mentioning the field
HdfsOrcScanner::col_id_path_map_, which has been removed by
"IMPALA-10485: part(1): make ORC column reader creation independent of
schema resolution".

Testing:
  - added tests to
    testdata/workloads/functional-query/queries/QueryTest/nested-struct-in-select-list.test
    that query various levels of the struct 'outer_struct' to check that
    NULLs are at the correct level.

Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
---
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
M testdata/workloads/functional-query/queries/QueryTest/nested-struct-in-select-list.test
4 files changed, 55 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/18403/6
-- 
To view, visit http://gerrit.cloudera.org:8080/18403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
Gerrit-Change-Number: 18403
Gerrit-PatchSet: 6
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18403 )

Change subject: IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)
......................................................................


Patch Set 7:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10475/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
Gerrit-Change-Number: 18403
Gerrit-PatchSet: 7
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 21 Apr 2022 09:51:00 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/18403 )

Change subject: IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)
......................................................................

IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

When querying a non-toplevel nested struct from an ORC file, the NULL
values are displayed at an incorrect level. E.g.:

select id, outer_struct.inner_struct3 from
functional_orc_def.complextypes_nested_structs where id >= 4;
+----+----------------------------+
| id | outer_struct.inner_struct3 |
+----+----------------------------+
| 4  | {"s":{"i":null,"s":null}}  |
| 5  | {"s":null}                 |
+----+----------------------------+

However, in the first row it is expected that 's' should be null and not
its members; in the second row the result should be 'NULL', i.e.
'outer_struct.inner_struct3' is null.
For reference see what is returned when querying 'outer_struct' instead
of 'outer_struct.inner_struct3':

+----+-------------------------------------------------------------------------------------------------------------------------------+
| 4  | {"str":"","inner_struct1":{"str":"somestr2","de":12345.12},"inner_struct2":{"i":1,"str":"string"},"inner_struct3":{"s":null}} |
| 5  | {"str":null,"inner_struct1":null,"inner_struct2":null,"inner_struct3":null}                                                   |
+----+-------------------------------------------------------------------------------------------------------------------------------+

The problem comes from the incorrect handling of the different depths of
the following trees:
 - the ORC type hierarchy (schema)
 - the tuple descriptor / slot descriptor hierarchy
as the ORC type hierarchy contains a node for every level in the schema
but the tuple/slot descriptor hierarchy omits the levels of structs that
are not in the select list (but an ancestor of theirs is), as these
structs are not materialised.

In the case of the example query, the two hierarchies are the following:
ORC:
 root --> outer_struct -> inner_struct3 -> s --> i
      |                                      \-> s
      \-> id
Tuple/slot descriptors:
 main_tuple --> inner_struct3 -> s --> i
            |                      \-> s
            \-> id

We create 'OrcColumnReader's for each node in the ORC type tree. Each
OrcColumnReader is assigned an ORC type node and a slot descriptor. The
incorrect behaviour comes from the incorrect pairing of ORC type nodes
with slot descriptors.

The old behaviour is described below:
Starting from the root, going along a path in both trees (for example
the path leading to outer_struct.inner_struct3.s.i), for each step we
consume a level in both trees until no more nodes remain in the
tuple/slot desc tree, and then we pair the last element from that tree
with the remaining ORC type node(s).

In the example, we get the following pairs:
(root, main_tuple) -> (outer_struct, inner_struct3) ->
(inner_struct3, s) -> (s, i) -> (i, i)

When we run out of structs in the tuple/slot desc tree, we still create
OrcStructReaders (because the ORC type is still a struct, but the slot
descriptor now refers to an INT), but we mark them incorrectly as
non-materialised.

Also, the OrcStructReaders for non-materialised structs do not need to
check for null-ness as they are not present in the select list, only
their descendants, and the ORC batch object stores null information also
for the descendants of null values.

Let's look at the row with id 4 in the example:
Because of the bug, the non-materialising OrcStructReader appears at the
level of the (s, i) pair, so the 's' struct is not checked for
null-ness, although it is actually null. One level lower, for 'i' (and
the inner 's' string field), the ORC batch object tells us that the
values are null (because their parent is). Therefore the nulls appear
one level lower than they should.

The correct behaviour is that ORC type nodes are paired with slot
descriptors if either
 - the ORC type node matches the slot descriptor (they refer to the same
   node in the schema) or
 - the slot descriptor is a descendant of the schema node that the ORC
   type node refers to.

This patch fixes the incorrect pairing of ORC types and slot
descriptors, so we have the following pairs:
(root, main_tuple) -> (outer_struct, main_tuple) ->
(inner_struct3, inner_struct3) -> (s, s) -> (i, i)

In this case the OrcStructReader for the pair (outer_struct, main_tuple)
becomes non-materialising and the one for (s, s) will be materialising,
so the 's' struct will also be null-checked, recognising null-ness at
the correct level.

This commit also fixes some comments in be/src/exec/orc-column-readers.h
and be/src/exec/hdfs-orc-scanner.h mentioning the field
HdfsOrcScanner::col_id_path_map_, which has been removed by
"IMPALA-10485: part(1): make ORC column reader creation independent of
schema resolution".

Testing:
  - added tests to
    testdata/workloads/functional-query/queries/QueryTest/nested-struct-in-select-list.test
    that query various levels of the struct 'outer_struct' to check that
    NULLs are at the correct level.

Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
Reviewed-on: http://gerrit.cloudera.org:8080/18403
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
M testdata/workloads/functional-query/queries/QueryTest/nested-struct-in-select-list.test
4 files changed, 42 insertions(+), 9 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/18403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
Gerrit-Change-Number: 18403
Gerrit-PatchSet: 9
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/18403 )

Change subject: IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)
......................................................................


Patch Set 7: Code-Review+2

Carry Gabor's +2.


-- 
To view, visit http://gerrit.cloudera.org:8080/18403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
Gerrit-Change-Number: 18403
Gerrit-PatchSet: 7
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 21 Apr 2022 09:35:06 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/18403 )

Change subject: IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/18403/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18403/2//COMMIT_MSG@12
PS2, Line 12: select id, outer_struct.inner_struct3 from
            : functional_orc_def.complextypes_nested_structs where id >= 4
Can you add some tests like this?
testdata/workloads/functional-query/queries/QueryTest/struct-in-select-list.test


http://gerrit.cloudera.org:8080/#/c/18403/2/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/18403/2/be/src/exec/orc-column-readers.cc@52
PS2, Line 52:       scanner->slot_to_col_id_[slot_desc] == node->getColumnId();
nit: +2 indentation



-- 
To view, visit http://gerrit.cloudera.org:8080/18403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
Gerrit-Change-Number: 18403
Gerrit-PatchSet: 2
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 13 Apr 2022 15:03:17 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/18403 )

Change subject: IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18403/6/testdata/workloads/functional-query/queries/QueryTest/nested-struct-in-select-list.test
File testdata/workloads/functional-query/queries/QueryTest/nested-struct-in-select-list.test:

http://gerrit.cloudera.org:8080/#/c/18403/6/testdata/workloads/functional-query/queries/QueryTest/nested-struct-in-select-list.test@148
PS6, Line 148: select id, outer_struct.inner_struct3 from
> nit: I think the very first test in this file runs the same query as this o
Thanks, I removed it here.



-- 
To view, visit http://gerrit.cloudera.org:8080/18403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
Gerrit-Change-Number: 18403
Gerrit-PatchSet: 7
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 21 Apr 2022 09:31:24 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18403 )

Change subject: IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)
......................................................................


Patch Set 6:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10473/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
Gerrit-Change-Number: 18403
Gerrit-PatchSet: 6
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Wed, 20 Apr 2022 14:07:51 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18403 )

Change subject: IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)
......................................................................


Patch Set 8:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/8059/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/18403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
Gerrit-Change-Number: 18403
Gerrit-PatchSet: 8
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 21 Apr 2022 09:36:32 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/18403 )

Change subject: IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)
......................................................................


Patch Set 5: Code-Review+1


-- 
To view, visit http://gerrit.cloudera.org:8080/18403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
Gerrit-Change-Number: 18403
Gerrit-PatchSet: 5
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 19 Apr 2022 13:13:57 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has uploaded a new patch set (#7). ( http://gerrit.cloudera.org:8080/18403 )

Change subject: IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)
......................................................................

IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

When querying a non-toplevel nested struct from an ORC file, the NULL
values are displayed at an incorrect level. E.g.:

select id, outer_struct.inner_struct3 from
functional_orc_def.complextypes_nested_structs where id >= 4;
+----+----------------------------+
| id | outer_struct.inner_struct3 |
+----+----------------------------+
| 4  | {"s":{"i":null,"s":null}}  |
| 5  | {"s":null}                 |
+----+----------------------------+

However, in the first row it is expected that 's' should be null and not
its members; in the second row the result should be 'NULL', i.e.
'outer_struct.inner_struct3' is null.
For reference see what is returned when querying 'outer_struct' instead
of 'outer_struct.inner_struct3':

+----+-------------------------------------------------------------------------------------------------------------------------------+
| 4  | {"str":"","inner_struct1":{"str":"somestr2","de":12345.12},"inner_struct2":{"i":1,"str":"string"},"inner_struct3":{"s":null}} |
| 5  | {"str":null,"inner_struct1":null,"inner_struct2":null,"inner_struct3":null}                                                   |
+----+-------------------------------------------------------------------------------------------------------------------------------+

The problem comes from the incorrect handling of the different depths of
the following trees:
 - the ORC type hierarchy (schema)
 - the tuple descriptor / slot descriptor hierarchy
as the ORC type hierarchy contains a node for every level in the schema
but the tuple/slot descriptor hierarchy omits the levels of structs that
are not in the select list (but an ancestor of theirs is), as these
structs are not materialised.

In the case of the example query, the two hierarchies are the following:
ORC:
 root --> outer_struct -> inner_struct3 -> s --> i
      |                                      \-> s
      \-> id
Tuple/slot descriptors:
 main_tuple --> inner_struct3 -> s --> i
            |                      \-> s
            \-> id

We create 'OrcColumnReader's for each node in the ORC type tree. Each
OrcColumnReader is assigned an ORC type node and a slot descriptor. The
incorrect behaviour comes from the incorrect pairing of ORC type nodes
with slot descriptors.

The old behaviour is described below:
Starting from the root, going along a path in both trees (for example
the path leading to outer_struct.inner_struct3.s.i), for each step we
consume a level in both trees until no more nodes remain in the
tuple/slot desc tree, and then we pair the last element from that tree
with the remaining ORC type node(s).

In the example, we get the following pairs:
(root, main_tuple) -> (outer_struct, inner_struct3) ->
(inner_struct3, s) -> (s, i) -> (i, i)

When we run out of structs in the tuple/slot desc tree, we still create
OrcStructReaders (because the ORC type is still a struct, but the slot
descriptor now refers to an INT), but we mark them incorrectly as
non-materialised.

Also, the OrcStructReaders for non-materialised structs do not need to
check for null-ness as they are not present in the select list, only
their descendants, and the ORC batch object stores null information also
for the descendants of null values.

Let's look at the row with id 4 in the example:
Because of the bug, the non-materialising OrcStructReader appears at the
level of the (s, i) pair, so the 's' struct is not checked for
null-ness, although it is actually null. One level lower, for 'i' (and
the inner 's' string field), the ORC batch object tells us that the
values are null (because their parent is). Therefore the nulls appear
one level lower than they should.

The correct behaviour is that ORC type nodes are paired with slot
descriptors if either
 - the ORC type node matches the slot descriptor (they refer to the same
   node in the schema) or
 - the slot descriptor is a descendant of the schema node that the ORC
   type node refers to.

This patch fixes the incorrect pairing of ORC types and slot
descriptors, so we have the following pairs:
(root, main_tuple) -> (outer_struct, main_tuple) ->
(inner_struct3, inner_struct3) -> (s, s) -> (i, i)

In this case the OrcStructReader for the pair (outer_struct, main_tuple)
becomes non-materialising and the one for (s, s) will be materialising,
so the 's' struct will also be null-checked, recognising null-ness at
the correct level.

This commit also fixes some comments in be/src/exec/orc-column-readers.h
and be/src/exec/hdfs-orc-scanner.h mentioning the field
HdfsOrcScanner::col_id_path_map_, which has been removed by
"IMPALA-10485: part(1): make ORC column reader creation independent of
schema resolution".

Testing:
  - added tests to
    testdata/workloads/functional-query/queries/QueryTest/nested-struct-in-select-list.test
    that query various levels of the struct 'outer_struct' to check that
    NULLs are at the correct level.

Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
---
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
M testdata/workloads/functional-query/queries/QueryTest/nested-struct-in-select-list.test
4 files changed, 42 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/18403/7
-- 
To view, visit http://gerrit.cloudera.org:8080/18403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
Gerrit-Change-Number: 18403
Gerrit-PatchSet: 7
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18403 )

Change subject: IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)
......................................................................


Patch Set 5: Code-Review+2

(3 comments)

Carry Csaba's +1. Thanks for fixing this!

http://gerrit.cloudera.org:8080/#/c/18403/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18403/5//COMMIT_MSG@78
PS5, Line 78: the  the
nit: duplicated "the"


http://gerrit.cloudera.org:8080/#/c/18403/5/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/18403/5/be/src/exec/hdfs-orc-scanner.h@52
PS5, Line 52:   
nit: duplicated spaces


http://gerrit.cloudera.org:8080/#/c/18403/5/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/18403/5/be/src/exec/orc-column-readers.cc@52
PS5, Line 52: scanner->slot_to_col_id_[slot_desc]
nit: Can we add a DCHECK making sure the key exists?



-- 
To view, visit http://gerrit.cloudera.org:8080/18403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
Gerrit-Change-Number: 18403
Gerrit-PatchSet: 5
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Wed, 20 Apr 2022 13:02:09 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/18403 )

Change subject: IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)
......................................................................


Patch Set 6: Code-Review+2

(1 comment)

Sorry for the late review. I just left one nit, other than that, I'm fine with the patch. Thanks for taking care!

http://gerrit.cloudera.org:8080/#/c/18403/6/testdata/workloads/functional-query/queries/QueryTest/nested-struct-in-select-list.test
File testdata/workloads/functional-query/queries/QueryTest/nested-struct-in-select-list.test:

http://gerrit.cloudera.org:8080/#/c/18403/6/testdata/workloads/functional-query/queries/QueryTest/nested-struct-in-select-list.test@148
PS6, Line 148: select id, outer_struct from
nit: I think the very first test in this file runs the same query as this one.



-- 
To view, visit http://gerrit.cloudera.org:8080/18403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
Gerrit-Change-Number: 18403
Gerrit-PatchSet: 6
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 21 Apr 2022 08:56:46 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18403 )

Change subject: IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)
......................................................................


Patch Set 8: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/18403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
Gerrit-Change-Number: 18403
Gerrit-PatchSet: 8
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 21 Apr 2022 09:36:31 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18403 )

Change subject: IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)
......................................................................


Patch Set 8: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/18403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
Gerrit-Change-Number: 18403
Gerrit-PatchSet: 8
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 21 Apr 2022 13:59:16 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18403 )

Change subject: IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)
......................................................................


Patch Set 1:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/10429/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/18403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
Gerrit-Change-Number: 18403
Gerrit-PatchSet: 1
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 12 Apr 2022 16:26:26 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has uploaded a new patch set (#2). ( http://gerrit.cloudera.org:8080/18403 )

Change subject: IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)
......................................................................

IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

When querying a non-toplevel nested struct from an ORC file, the NULL
values are displayed at an incorrect level. E.g.:

select id, outer_struct.inner_struct3 from
functional_orc_def.complextypes_nested_structs where id >= 4;
+----+----------------------------+
| id | outer_struct.inner_struct3 |
+----+----------------------------+
| 4  | {"s":{"i":null,"s":null}}  |
| 5  | {"s":null}                 |
+----+----------------------------+

However, in the first row it is expected that 's' should be null and not
its members; in the second row the result should be 'NULL', i.e.
'outer_struct.inner_struct3' is null.
For reference see what is returned when querying 'outer_struct' instead
of 'outer_struct.inner_struct3':

+----+-------------------------------------------------------------------------------------------------------------------------------+
| 4  | {"str":"","inner_struct1":{"str":"somestr2","de":12345.12},"inner_struct2":{"i":1,"str":"string"},"inner_struct3":{"s":null}} |
| 5  | {"str":null,"inner_struct1":null,"inner_struct2":null,"inner_struct3":null}                                                   |
+----+-------------------------------------------------------------------------------------------------------------------------------+

The problem comes from the incorrect handling of the different depths of
the following trees:
 - the ORC type hierarchy (schema)
 - the tuple descriptor / slot descriptor hierarchy
as the ORC type hierarchy contains a node for every level in the schema
but the tuple/slot descriptor hierarchy omits the levels of structs that
are not in the select list (but an ancestor of theirs is), as these
structs are not materialised.

In the case of the example query, the two hierarchies are the following:
ORC:
 root --> outer_struct -> inner_struct3 -> s --> i
      |                                      \-> s
      \-> id
Tuple/slot descriptors:
 main_tuple --> inner_struct3 -> s --> i
            |                      \-> s
            \-> id

We create 'OrcColumnReader's for each node in the ORC type tree. Each
OrcColumnReader is assigned an ORC type node and a slot descriptor. The
incorrect behaviour comes from the incorrect pairing of ORC type nodes
with slot descriptors.

The old behaviour is described below:
Starting from the root, going along a path in both trees (for example
the path leading to outer_struct.inner_struct3.s.i), for each step we
consume a level in both trees until no more nodes remain in the
tuple/slot desc tree, and then we pair the last element from that tree
with the remaining ORC type node(s).

In the example, we get the following pairs:
(root, main_tuple) -> (outer_struct, inner_struct3) ->
(inner_struct3, s) -> (s, i) -> (i, i)

When we run out of structs in the tuple/slot desc tree, we still create
OrcStructReaders (because the ORC type is still a struct, but the slot
descriptor now refers to an INT), but we mark them incorrectly as
non-materialised.

Also, the OrcStructReaders for non-materialised structs do not need to
check for null-ness as they are not present in the select list, only
their descendants, and the ORC batch object stores null information also
for the descendants of null values.

Let's look at the row with id 4 in the example:
Because of the bug, the  the non-materialising OrcStructReader appears
at the level of the (s, i) pair, so the 's' struct is not checked for
null-ness, although it is actually null. One level lower, for 'i' (and
the inner 's' string field), the ORC batch object tells us that the
values are null (because their parent is). Therefore the nulls appear
one level lower than they should.

The correct behaviour is that ORC type nodes are paired with slot
descriptors if either
 - the ORC type node matches the slot descriptor (they refer to the same
   node in the schema) or
 - the slot descriptor is a descendant of the schema node that the ORC
   type node refers to.

This patch fixes the incorrect pairing of ORC types and slot
descriptors, so we have the following pairs:
(root, main_tuple) -> (outer_struct, main_tuple) ->
(inner_struct3, inner_struct3) -> (s, s) -> (i, i)

In this case the OrcStructReader for the pair (outer_struct, main_tuple)
becomes non-materialising and the one for (s, s) will be materialising,
so the 's' struct will also be null-checked, recognising null-ness at
the correct level.

This commit also fixes some comments in be/src/exec/orc-column-readers.h
and be/src/exec/hdfs-orc-scanner.h mentioning the field
HdfsOrcScanner::col_id_path_map_, which has been removed by
"IMPALA-10485: part(1): make ORC column reader creation independent of
schema resolution".

Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
---
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
3 files changed, 12 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/18403/2
-- 
To view, visit http://gerrit.cloudera.org:8080/18403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
Gerrit-Change-Number: 18403
Gerrit-PatchSet: 2
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/18403 )

Change subject: IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)
......................................................................


Patch Set 5:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/10464/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/18403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
Gerrit-Change-Number: 18403
Gerrit-PatchSet: 5
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 19 Apr 2022 10:13:48 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has uploaded a new patch set (#5). ( http://gerrit.cloudera.org:8080/18403 )

Change subject: IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)
......................................................................

IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

When querying a non-toplevel nested struct from an ORC file, the NULL
values are displayed at an incorrect level. E.g.:

select id, outer_struct.inner_struct3 from
functional_orc_def.complextypes_nested_structs where id >= 4;
+----+----------------------------+
| id | outer_struct.inner_struct3 |
+----+----------------------------+
| 4  | {"s":{"i":null,"s":null}}  |
| 5  | {"s":null}                 |
+----+----------------------------+

However, in the first row it is expected that 's' should be null and not
its members; in the second row the result should be 'NULL', i.e.
'outer_struct.inner_struct3' is null.
For reference see what is returned when querying 'outer_struct' instead
of 'outer_struct.inner_struct3':

+----+-------------------------------------------------------------------------------------------------------------------------------+
| 4  | {"str":"","inner_struct1":{"str":"somestr2","de":12345.12},"inner_struct2":{"i":1,"str":"string"},"inner_struct3":{"s":null}} |
| 5  | {"str":null,"inner_struct1":null,"inner_struct2":null,"inner_struct3":null}                                                   |
+----+-------------------------------------------------------------------------------------------------------------------------------+

The problem comes from the incorrect handling of the different depths of
the following trees:
 - the ORC type hierarchy (schema)
 - the tuple descriptor / slot descriptor hierarchy
as the ORC type hierarchy contains a node for every level in the schema
but the tuple/slot descriptor hierarchy omits the levels of structs that
are not in the select list (but an ancestor of theirs is), as these
structs are not materialised.

In the case of the example query, the two hierarchies are the following:
ORC:
 root --> outer_struct -> inner_struct3 -> s --> i
      |                                      \-> s
      \-> id
Tuple/slot descriptors:
 main_tuple --> inner_struct3 -> s --> i
            |                      \-> s
            \-> id

We create 'OrcColumnReader's for each node in the ORC type tree. Each
OrcColumnReader is assigned an ORC type node and a slot descriptor. The
incorrect behaviour comes from the incorrect pairing of ORC type nodes
with slot descriptors.

The old behaviour is described below:
Starting from the root, going along a path in both trees (for example
the path leading to outer_struct.inner_struct3.s.i), for each step we
consume a level in both trees until no more nodes remain in the
tuple/slot desc tree, and then we pair the last element from that tree
with the remaining ORC type node(s).

In the example, we get the following pairs:
(root, main_tuple) -> (outer_struct, inner_struct3) ->
(inner_struct3, s) -> (s, i) -> (i, i)

When we run out of structs in the tuple/slot desc tree, we still create
OrcStructReaders (because the ORC type is still a struct, but the slot
descriptor now refers to an INT), but we mark them incorrectly as
non-materialised.

Also, the OrcStructReaders for non-materialised structs do not need to
check for null-ness as they are not present in the select list, only
their descendants, and the ORC batch object stores null information also
for the descendants of null values.

Let's look at the row with id 4 in the example:
Because of the bug, the  the non-materialising OrcStructReader appears
at the level of the (s, i) pair, so the 's' struct is not checked for
null-ness, although it is actually null. One level lower, for 'i' (and
the inner 's' string field), the ORC batch object tells us that the
values are null (because their parent is). Therefore the nulls appear
one level lower than they should.

The correct behaviour is that ORC type nodes are paired with slot
descriptors if either
 - the ORC type node matches the slot descriptor (they refer to the same
   node in the schema) or
 - the slot descriptor is a descendant of the schema node that the ORC
   type node refers to.

This patch fixes the incorrect pairing of ORC types and slot
descriptors, so we have the following pairs:
(root, main_tuple) -> (outer_struct, main_tuple) ->
(inner_struct3, inner_struct3) -> (s, s) -> (i, i)

In this case the OrcStructReader for the pair (outer_struct, main_tuple)
becomes non-materialising and the one for (s, s) will be materialising,
so the 's' struct will also be null-checked, recognising null-ness at
the correct level.

This commit also fixes some comments in be/src/exec/orc-column-readers.h
and be/src/exec/hdfs-orc-scanner.h mentioning the field
HdfsOrcScanner::col_id_path_map_, which has been removed by
"IMPALA-10485: part(1): make ORC column reader creation independent of
schema resolution".

Testing:
  - added tests to
    testdata/workloads/functional-query/queries/QueryTest/nested-struct-in-select-list.test
    that query various levels of the struct 'outer_struct' to check that
    NULLs are at the correct level.

Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
---
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
M testdata/workloads/functional-query/queries/QueryTest/nested-struct-in-select-list.test
4 files changed, 54 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/18403/5
-- 
To view, visit http://gerrit.cloudera.org:8080/18403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
Gerrit-Change-Number: 18403
Gerrit-PatchSet: 5
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>