Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2022/07/02 03:28:00 UTC

[jira] [Commented] (IMPALA-11363) Use ReadValueBatch() when the members of Parquet StructColumnReader

    [ https://issues.apache.org/jira/browse/IMPALA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561674#comment-17561674 ] 

ASF subversion and git services commented on IMPALA-11363:
----------------------------------------------------------

Commit 5d021ce5a72060d243ae4c56ad803c2fc686a5ce in impala's branch refs/heads/dependabot/pip/infra/python/deps/urllib3-1.26.5 from Gabor Kaszab
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=5d021ce5a ]

IMPALA-9496: Allow struct type in the select list for Parquet tables

This patch extends the support for struct columns in the select
list to Parquet files as well.

There are some limitations with this patch:
  - Dictionary filtering could work when we have conjuncts on a member
    of a struct. However, if this struct is given in the select list
    then dictionary filtering is disabled. The reason is that in
    this case there would be a mismatch between the slot/tuple IDs in
    the conjunct and the ones in the select list, due to the expr
    substitution logic applied when a struct is in the select list.
    Solving this puzzle would be a nice future performance
    enhancement. See IMPALA-11361.
  - When a struct is read in a batched manner, it delegates the actual
    reading of the data to the column readers of its children.
    However, it uses the simple ReadValue() on these readers instead
    of the batched version. The reason is that the batched reader of
    a member column would indeed read in batches, but it does not
    handle the case when the parent struct is NULL: it would set only
    itself to NULL but not the parent struct. This might also be a
    future performance enhancement. See IMPALA-11363.
  - If there is a struct in the select list then late materialization
    (LM) is turned off. The reason is that LM expects the column
    readers to be used through the batched reading interface.
    However, as described in the previous bullet point, struct column
    readers currently use the non-batched reading interface of their
    children. As a result, after reading, the column readers are not
    in the state that SkipRows() of LM expects, and the query fails
    because LM is not able to skip the rows for non-filter readers.
    Once IMPALA-11363 is implemented and the struct also uses the
    ReadValueBatch() interface of its children, late materialization
    could be turned on even if structs are in the select list. See
    IMPALA-11364.
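
The second limitation can be illustrated with a small, simplified model.
This is a sketch only and not Impala code: the schema, the constants and
both functions are hypothetical, and it assumes standard Parquet
definition levels for "optional struct s { optional int32 a; }" (0 means
s is NULL, 1 means s is set but s.a is NULL, 2 means s.a has a value).

```python
# Simplified model, NOT Impala code. Hypothetical schema:
#   optional struct s { optional int32 a; }
STRUCT_DEF_LEVEL = 1      # def level at which the struct s is non-NULL
MEMBER_MAX_DEF_LEVEL = 2  # def level at which s.a holds a value


def read_member_batch(def_levels, values):
    """Batched member read that only knows its own max def level."""
    it = iter(values)
    return [next(it) if d == MEMBER_MAX_DEF_LEVEL else None
            for d in def_levels]


def read_struct_row(def_level, value_iter):
    """Row-at-a-time read where the struct checks the def level first."""
    if def_level < STRUCT_DEF_LEVEL:
        return None                 # the whole struct is NULL
    if def_level < MEMBER_MAX_DEF_LEVEL:
        return {"a": None}          # struct is set, member is NULL
    return {"a": next(value_iter)}


def_levels = [2, 1, 0]
values = [7]

# The batched member reader cannot tell def level 1 from def level 0.
batched = [{"a": v} for v in read_member_batch(def_levels, values)]

# The row-wise path distinguishes a NULL member from a NULL struct.
it = iter(values)
row_wise = [read_struct_row(d, it) for d in def_levels]
```

In this model the last row comes back as a struct with a NULL member
from the batched path, while the row-wise path returns a NULL struct.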

Testing:
  - There were already a lot of tests exercising this functionality,
    but they were only run on ORC tables. I changed these to cover
    Parquet tables too.

Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
Reviewed-on: http://gerrit.cloudera.org:8080/18596
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Use ReadValueBatch() when the members of Parquet StructColumnReader
> -------------------------------------------------------------------
>
>                 Key: IMPALA-11363
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11363
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 4.1.0
>            Reporter: Gabor Kaszab
>            Priority: Major
>              Labels: complextype
>
> IMPALA-9496 introduced support for querying structs in the select list from Parquet tables as well. This required adding a new column reader, StructColumnReader, which has the same interface as all the other Parquet column readers. However, the ReadValueBatch() of the StructColumnReader calls the ReadValue() of its children instead of ReadValueBatch(), so even though the batched read is called on the StructColumnReader, it in fact does a non-batched read.
> The reason for this is that if the batched read were called on the child readers, there would currently be no way to set the parent struct to NULL when a child reader finds that the def_level_ indicates that the struct member is null. It's even more complicated when there is a nested struct column.
> This has an impact on performance as querying a struct is slower than querying its children together. As a solution I see 2 approaches:
> 1) Enhance the ScalarColumnReaders so that when they do a ReadValueBatch() and see, based on def_level_, that there is a NULL value, they also set the parent structs to NULL, not just themselves. For this, the scalar reader should keep track of the max def levels of the parent structs and their details in the internal representation (e.g. tuple offset).
> 2) Only the first child of the struct is used as a struct child while the others could be regular column readers not inside the struct. As a result the first child wouldn't be read in a batched manner but then the struct could be set based on the def_level coming from this child. All the other members could be then read in a batched manner.
> This needs some extra care when there are nested structs. In this case all the structs would be added as children to the current struct, and what I described above would only apply to the struct(s) at the bottom of the tree.
> I personally would go for 1) as it is more straightforward and easier to understand.
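
A rough sketch of approach 1), under stated assumptions: this is not
Impala code, and Ancestor, read_value_batch and the string paths are
hypothetical stand-ins (a real reader would presumably set NULL
indicators at tuple offsets rather than return names). The member reader
is given the definition level of every enclosing struct and, for each
row, reports which of them must be NULL.

```python
# Hypothetical model of approach 1), NOT Impala code.
from dataclasses import dataclass


@dataclass
class Ancestor:
    name: str       # path of the enclosing struct, e.g. "s" or "s.t"
    def_level: int  # def level at which this struct is non-NULL


def read_value_batch(def_levels, values, max_def_level, ancestors):
    """Batched read of one scalar member.

    Returns (member_values, null_structs), where null_structs[i] lists
    the enclosing structs that are NULL in row i, outermost first.
    """
    it = iter(values)
    member_values, null_structs = [], []
    for d in def_levels:
        member_values.append(next(it) if d == max_def_level else None)
        null_structs.append(
            [a.name for a in ancestors if d < a.def_level])
    return member_values, null_structs


# Assumed schema:
#   optional struct s { optional struct t { optional int32 a; } }
ancestors = [Ancestor("s", 1), Ancestor("s.t", 2)]
member_values, null_structs = read_value_batch(
    def_levels=[3, 2, 1, 0], values=[42], max_def_level=3,
    ancestors=ancestors)
```

With these inputs, the row with definition level 1 marks only s.t as
NULL, and the row with definition level 0 marks both s and s.t, which
covers the nested struct case mentioned in the description.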



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
