Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2020/04/20 13:35:01 UTC

[jira] [Commented] (IMPALA-9469) ORC scanner vectorization for collection types

    [ https://issues.apache.org/jira/browse/IMPALA-9469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17087731#comment-17087731 ] 

ASF subversion and git services commented on IMPALA-9469:
---------------------------------------------------------

Commit 248c6d2495d4628c10e6bdbb00f9ed170bba19b6 in impala's branch refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=248c6d2 ]

IMPALA-9469: ORC scanner vectorization for collection types

This commit only keeps the batched read path of ORC columns, i.e.
from now on we always read ORC values into a scratch batch. Thanks to
this we also get codegen out of the box.

From now on, materialization of the table-level tuples is always driven
by the root struct reader. This will make it much easier to implement
row validation (against a valid write id list), which is needed for
IMPALA-9512.

I eliminated the OrcComplexColumnReader::TransferTuple() interface
and the related code. HdfsOrcScanner became simpler: now it just calls
TopLevelReadValueBatch() on the root struct reader, which tracks the
row index of the table-level tuples and calls ReadValueBatch() on its
children accordingly. The children don't need to track any state,
as they are always told which row to read.
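The control flow described above can be sketched roughly as follows. The class and method names are modeled on the ones mentioned in this commit, but the bodies, the ScratchBatch layout, and the IntColumnReader leaf are illustrative assumptions, not Impala's actual implementation:

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Stand-in for Impala's scratch batch: just counts materialised tuples
// and collects values for this sketch.
struct ScratchBatch {
  int num_tuples = 0;
  std::vector<int> values;  // stand-in for tuple memory
};

class OrcColumnReader {
 public:
  virtual ~OrcColumnReader() = default;
  // Children don't track any position; the caller says which row to read.
  virtual void ReadValue(int row_idx, ScratchBatch* batch) = 0;
};

// Leaf reader over an in-memory column (stand-in for a decoded ORC stripe).
class IntColumnReader : public OrcColumnReader {
 public:
  explicit IntColumnReader(std::vector<int> col) : col_(std::move(col)) {}
  void ReadValue(int row_idx, ScratchBatch* batch) override {
    batch->values.push_back(col_[row_idx]);
  }
 private:
  std::vector<int> col_;
};

// Root struct reader: the only reader that tracks the table-level row index.
class OrcStructReader : public OrcColumnReader {
 public:
  OrcStructReader(std::vector<std::unique_ptr<OrcColumnReader>> children,
                  int num_rows)
      : children_(std::move(children)), num_rows_(num_rows) {}

  // Fills the batch up to 'capacity' tuples, driving all children.
  void TopLevelReadValueBatch(ScratchBatch* batch, int capacity) {
    while (batch->num_tuples < capacity && row_idx_ < num_rows_) {
      ReadValue(row_idx_, batch);
      ++row_idx_;
      ++batch->num_tuples;
    }
  }

  void ReadValue(int row_idx, ScratchBatch* batch) override {
    for (auto& child : children_) child->ReadValue(row_idx, batch);
  }

 private:
  std::vector<std::unique_ptr<OrcColumnReader>> children_;
  int row_idx_ = 0;  // only the root keeps position state
  int num_rows_ = 0;
};
```

The point of the design is visible in the sketch: only the root keeps `row_idx_`, so children stay stateless and a second call to TopLevelReadValueBatch() resumes where the first one stopped.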

Testing:
 * ran exhaustive tests

Performance:
 * non-nested benchmark results stayed the same, as expected
 * Overall 1-2% gain on TPCH Nested, scale=1
 ** In some cases scanning was ~20% more efficient

Change-Id: I477961b427406035a04529c5175dbee8f8a93ad5
Reviewed-on: http://gerrit.cloudera.org:8080/15730
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> ORC scanner vectorization for collection types
> ----------------------------------------------
>
>                 Key: IMPALA-9469
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9469
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Gabor Kaszab
>            Assignee: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: complextype
>
> https://issues.apache.org/jira/browse/IMPALA-9228 introduced vectorization for primitive types and structs. This Jira covers the same for collections (array, map) and structs containing collections.
> *Prerequisites:*
> 1) Check how IMPALA-9228 introduces scratch batches to hold a batch of rows, and how they are populated from primitive and struct fields.
> 2) Read the following document to understand the difference between materialising and non-materialising collection readers: https://docs.google.com/presentation/d/1uj8m7y69o47MhpqCc0SJ03GDTtPDrg4m04eAFVmq34A
> 3) Check how parquet handles collections when populating its scratch batch.
> Implementation details:
> 1) Materialising collection readers should be handled similarly to primitive types. In this case each collection reader writes one slot into the outgoing RowBatch for each collection it reads. In other words, one collection is represented as one CollectionValue in the RowBatch.
> 2) The other case is when the top-level collection reader doesn't materialise directly into the RowBatch but instead delegates materialisation to its children. In this case it's not guaranteed that the number of slots required in the RowBatch equals the number of collections read by the collection reader.
> E.g. assume a table with one column, a list of integers. If the top-level ListColumnReader is not materialising, then its child, the IntColumnReader, does. The number of required slots is then the number of int values within the collections, rather than the number of collections as it would be if the ListColumnReader materialised directly.
> As a result, while populating the scratch batch we might reach a situation where a whole collection doesn't fit into the scratch batch. Check how Parquet handles this case.
> 3) Once populating the scratch batch works for collections, verify that codegen also runs in these cases. It should work out of the box, but let's make sure.
> 4) Currently the ORC scanner chooses between row-by-row processing of the rows read by the ORC reader and scratch-batch reading. Once this Jira is implemented, the row-by-row approach is no longer needed.
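The slot-count difference in point 2 above can be shown with a small sketch. The ListColumn layout and function names here are illustrative assumptions (a common offsets-plus-flat-child representation), not Impala's actual structures:

```cpp
#include <cassert>
#include <vector>

// A decoded list column: offsets into a flat child value array, e.g.
// [[1,2],[3],[4,5,6]] -> offsets {0,2,3,6}, child {1,2,3,4,5,6}.
struct ListColumn {
  std::vector<int> offsets;  // offsets.size() == num_lists + 1
  std::vector<int> child;
};

// Materialising list reader: one CollectionValue slot per collection.
int SlotsIfListMaterialises(const ListColumn& col) {
  return static_cast<int>(col.offsets.size()) - 1;
}

// Non-materialising list reader: the child materialises, so the scratch
// batch needs one slot per child value -- potentially far more, and a
// single long list may not even fit into one batch.
int SlotsIfChildMaterialises(const ListColumn& col) {
  return static_cast<int>(col.child.size());
}
```

For the three lists above, the materialising reader needs 3 slots while the delegating reader needs 6, which is why the scratch batch can run out of capacity mid-collection.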



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
