Posted to issues-all@impala.apache.org by "Daniel Becker (Jira)" <ji...@apache.org> on 2022/06/16 09:48:00 UTC
[jira] [Commented] (IMPALA-11363) Use ReadValueBatch() when the members of Parquet StructColumnReader

    [ https://issues.apache.org/jira/browse/IMPALA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554998#comment-17554998 ] 

Daniel Becker commented on IMPALA-11363:
----------------------------------------

I agree that solution 1) seems to be cleaner and easier to understand.

> Use ReadValueBatch() when the members of Parquet StructColumnReader
> -------------------------------------------------------------------
>
>                 Key: IMPALA-11363
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11363
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 4.1.0
>            Reporter: Gabor Kaszab
>            Priority: Major
>              Labels: complextype
>
> IMPALA-9496 introduced support for querying structs in the select list from Parquet tables as well. This required adding a new column reader, StructColumnReader, which has the same interface as all the other Parquet column readers. However, StructColumnReader's ReadValueBatch() calls ReadValue() on its children instead of ReadValueBatch(), so even when the batched read is called on the StructColumnReader, it in fact does a non-batched read.
> The reason is that if the batched read were called on the child readers, there would currently be no way to set the parent struct to NULL when a child reader finds that def_level_ indicates that the struct member is NULL. It's even more complicated when there is a nested struct column.
> This has an impact on performance, as querying a struct is slower than querying its children together. As a solution I see 2 approaches:
> 1) Enhance the ScalarColumnReaders so that when they do a ReadValueBatch() and see from def_level_ that there is a NULL value, they also set the parent structs to NULL, not just their own value. For this, the scalar reader should keep track of the max def levels of the parent structs and their details in the internal representation (e.g. tuple offset, etc.).
> 2) Treat only the first child of the struct as a struct child, while the others become regular column readers outside the struct. As a result, the first child wouldn't be read in a batched manner, but the struct's NULL state could then be set based on the def_level coming from this child. All the other members could then be read in a batched manner.
> This needs some extra care when there are nested structs. In this case all the nested structs would be added as children to the current struct, and what I described above would only apply to the struct(s) at the bottom of the tree.
> I personally would go for 1) as it is more straightforward and easier to understand.
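
Approach 1) from the description above could be sketched roughly as follows. This is a minimal, self-contained illustration and not Impala's actual code: AncestorStruct, Row, ScalarReaderSketch and RunExample are hypothetical names, a vector of NULL flags stands in for tuple offsets, and the sketch assumes every level of the column path is optional, so that an enclosing struct is NULL exactly when the definition level is below that struct's own max def level.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical description of one enclosing struct of a scalar column.
struct AncestorStruct {
  int max_def_level;  // smallest def level at which this struct is non-NULL
  int null_slot;      // index of the struct's NULL flag (stand-in for a tuple offset)
};

// Hypothetical, simplified output row: one NULL flag per slot plus one value.
struct Row {
  explicit Row(std::size_t num_slots) : is_null(num_slots, false), value(0) {}
  std::vector<bool> is_null;
  int32_t value;
};

// Sketch of a scalar reader that also knows about its enclosing structs.
class ScalarReaderSketch {
 public:
  ScalarReaderSketch(int max_def_level, int null_slot,
                     std::vector<AncestorStruct> ancestors)
      : max_def_level_(max_def_level),
        null_slot_(null_slot),
        ancestors_(std::move(ancestors)) {}

  // Batched read: def_levels has one entry per row, values holds only the
  // non-NULL values in row order. Sets the NULL flag of the scalar and of
  // every enclosing struct that the def level shows to be NULL.
  void ReadValueBatch(const std::vector<int>& def_levels,
                      const std::vector<int32_t>& values,
                      std::vector<Row>* rows) const {
    assert(rows->size() >= def_levels.size());
    std::size_t next_value = 0;
    for (std::size_t i = 0; i < def_levels.size(); ++i) {
      Row& row = (*rows)[i];
      const int def_level = def_levels[i];
      for (const AncestorStruct& ancestor : ancestors_) {
        if (def_level < ancestor.max_def_level) {
          row.is_null[ancestor.null_slot] = true;
        }
      }
      if (def_level < max_def_level_) {
        row.is_null[null_slot_] = true;
      } else {
        assert(next_value < values.size());
        row.value = values[next_value++];
      }
    }
  }

 private:
  int max_def_level_;
  int null_slot_;
  std::vector<AncestorStruct> ancestors_;
};

// Example column path s.t.x, all optional: slot 0 = s, slot 1 = t, slot 2 = x.
std::vector<Row> RunExample() {
  ScalarReaderSketch reader(/*max_def_level=*/3, /*null_slot=*/2,
                            {{/*max_def_level=*/1, /*null_slot=*/0},
                             {/*max_def_level=*/2, /*null_slot=*/1}});
  std::vector<Row> rows(4, Row(3));
  reader.ReadValueBatch(/*def_levels=*/{3, 2, 1, 0}, /*values=*/{42}, &rows);
  return rows;
}
```

In this sketch every child reader of a struct would set the same parent flag, which is redundant but harmless; whether a real implementation would deduplicate that work is left open here.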



