Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2020/04/20 13:35:01 UTC

[jira] [Commented] (IMPALA-9512) Milestone 2: Validate each row against the valid write id list

    [ https://issues.apache.org/jira/browse/IMPALA-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17087732#comment-17087732 ] 

ASF subversion and git services commented on IMPALA-9512:
---------------------------------------------------------

Commit 248c6d2495d4628c10e6bdbb00f9ed170bba19b6 in impala's branch refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=248c6d2 ]

IMPALA-9469: ORC scanner vectorization for collection types

This commit keeps only the batched read path for ORC columns, i.e.
from now on we always read ORC values into a scratch batch. Thanks to
this we also get codegen out of the box.
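
As a rough illustration of why the scratch batch enables codegen (the
types below are simplified stand-ins, not Impala's actual classes):
values are first decoded column-wise into plain arrays, so tuple
materialization becomes a tight loop without virtual calls.

#include <cstdint>
#include <vector>

// Simplified stand-in: decoded ORC values for one column.
struct ScratchColumn {
  std::vector<int64_t> vals;
  std::vector<uint8_t> is_null;
};

// Materializing slots from the scratch batch is a tight loop with no
// virtual calls, which is what makes it amenable to codegen.
void MaterializeSlots(const ScratchColumn& col, int64_t* out_slots) {
  for (size_t i = 0; i < col.vals.size(); ++i) {
    out_slots[i] = col.is_null[i] ? 0 : col.vals[i];
  }
}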

From now on, materialization of the table-level tuples is always
driven by the root struct reader. This will make it much easier to
implement row validation (against a valid write id list), which is
needed for IMPALA-9512.

I eliminated the OrcComplexColumnReader::TransferTuple() interface
and the related code, which made HdfsOrcScanner simpler. Now it just
calls TopLevelReadValueBatch() on the root struct reader, which tracks
the row index of the table-level tuples and calls ReadValueBatch() on
its children accordingly. The children don't need to track any state,
as they are always told which row to read.
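
A rough sketch of the new control flow (TopLevelReadValueBatch() and
ReadValueBatch() are the real entry points; the types and members
below are simplified stand-ins, and the real code reads row ranges
rather than one row per call):

#include <memory>
#include <vector>

struct ScratchBatch { int num_tuples = 0; };  // simplified stand-in

class OrcColumnReader {
 public:
  virtual ~OrcColumnReader() = default;
  // Reads the value at table-level row 'row_idx' into the scratch
  // batch. Children keep no position state of their own; the caller
  // supplies the row index.
  virtual void ReadValueBatch(int row_idx, ScratchBatch* scratch) = 0;
};

class OrcStructReader : public OrcColumnReader {
 public:
  // The root struct reader is the only place that tracks the current
  // table-level row index; it drives all of its children.
  void TopLevelReadValueBatch(ScratchBatch* scratch, int capacity) {
    while (scratch->num_tuples < capacity && row_idx_ < num_rows_) {
      ReadValueBatch(row_idx_, scratch);
      ++row_idx_;
      ++scratch->num_tuples;
    }
  }
  void ReadValueBatch(int row_idx, ScratchBatch* scratch) override {
    for (auto& child : children_) child->ReadValueBatch(row_idx, scratch);
  }

 private:
  std::vector<std::unique_ptr<OrcColumnReader>> children_;
  int row_idx_ = 0;
  int num_rows_ = 0;
};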

Testing:
 * ran exhaustive tests

Performance:
 * Non-nested benchmark results stayed the same, as expected
 * Overall 1-2% gain on TPCH Nested, scale=1
 ** In some cases scanning was ~20% more efficient

Change-Id: I477961b427406035a04529c5175dbee8f8a93ad5
Reviewed-on: http://gerrit.cloudera.org:8080/15730
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Milestone 2: Validate each row against the valid write id list
> --------------------------------------------------------------
>
>                 Key: IMPALA-9512
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9512
>             Project: IMPALA
>          Issue Type: Sub-task
>            Reporter: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-acid
>
> Minor compactions can merge several delta directories into a single delta directory. The current directory filtering algorithm needs to be modified to handle minor-compacted directories and prefer them over plain delta directories.
> On top of that, in minor-compacted directories we need to filter out rows we are not allowed to see. E.g. we can have the following delta directory:
> {noformat}
> full_acid/delta_0000001_0000010_0000/0000 # minWriteId: 1
>                                           # maxWriteId: 10
> {noformat}
> So this delta dir contains rows with write ids between 1 and 10, but maybe we are only allowed to see write ids less than 5. Therefore we need to check the ACID write id column (named originalTransaction) for each row to decide whether the row is valid.
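> A minimal sketch of this per-row check (the ValidWriteIdList below is an illustrative stand-in, not the real Hive/Impala API):
> {noformat}
> #include <cstdint>
> #include <set>
> 
> // Illustrative stand-in: everything above the high watermark is
> // invisible, as are open/aborted write ids below it.
> struct ValidWriteIdList {
>   int64_t high_watermark;
>   std::set<int64_t> open_or_aborted;
> 
>   bool IsWriteIdValid(int64_t write_id) const {
>     return write_id <= high_watermark &&
>            open_or_aborted.count(write_id) == 0;
>   }
> };
> 
> // Per-row check inside a minor-compacted delta: keep the row only if
> // its originalTransaction write id is visible to this snapshot.
> bool RowIsVisible(int64_t original_transaction,
>                   const ValidWriteIdList& valid_write_ids) {
>   return valid_write_ids.IsWriteIdValid(original_transaction);
> }
> {noformat}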
> There are several ways to optimize this. E.g., based on the min/max write ids of the delta directory and the validWriteIdList, we can decide whether we need to validate the rows at all. Or, once we pass the high watermark (which tells us the max valid write id), we can stop the scanner, since rows are ordered by record ID.
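> A sketch of the first shortcut, reusing the hypothetical ValidWriteIdList stand-in from above (names are illustrative, not Impala's actual API):
> {noformat}
> enum class RangeCheck { kAllValid, kNoneValid, kSomeValid };
> 
> // Decide per delta directory whether per-row validation is needed at
> // all, based only on the directory's min/max write ids.
> RangeCheck CheckWriteIdRange(const ValidWriteIdList& valid_write_ids,
>                              int64_t min_write_id, int64_t max_write_id) {
>   // The whole directory is above the high watermark: nothing visible.
>   if (min_write_id > valid_write_ids.high_watermark)
>     return RangeCheck::kNoneValid;
>   // Any open/aborted write id inside [min, max] forces per-row checks.
>   auto it = valid_write_ids.open_or_aborted.lower_bound(min_write_id);
>   bool invalid_in_range =
>       it != valid_write_ids.open_or_aborted.end() && *it <= max_write_id;
>   if (max_write_id <= valid_write_ids.high_watermark && !invalid_in_range)
>     return RangeCheck::kAllValid;   // skip per-row validation entirely
>   return RangeCheck::kSomeValid;    // fall back to per-row checks
> }
> {noformat}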



