You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Quanlong Huang (Code Review)" <ge...@cloudera.org> on 2022/12/01 00:25:03 UTC

[Impala-ASF-CR] IMPALA-11751: Template tuple of Avro header should be transferred to ScanRangeSharedState

Hello Zoltan Borok-Nagy, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19289

to look at the new patch set (#2).

Change subject: IMPALA-11751: Template tuple of Avro header should be transferred to ScanRangeSharedState
......................................................................

IMPALA-11751: Template tuple of Avro header should be transferred to ScanRangeSharedState

Sequence container based file formats (SequenceFile, RCFile, Avro) have
a file header in each file that describes the metadata of the file, e.g.
codec, default values, etc. The header should be decoded before reading
the file content. The initial scanners will read the header and then
issue follow-up scan ranges for the file content. The decoded header
will be referenced by follow-up scanners.

Since IMPALA-9655, when MT_DOP > 1, the issued scan ranges could be
scheduled to other scan node instances. So the header resource should
live until all scan node instances close. Header objects are owned by
the object pool of the RuntimeState, which meets the requirement.

AvroFileHeader is special than other headers in that it references a
template tuple which contains the partition values and default values
for missing fields. The template tuple is initially owned by the header
scanner, then transferred to the scan node before the scanner closes.
However, when the scan node instance closes, the template tuple is
freed. Scanners of other scan node instances might still depend on it.
This could cause wrong results or crash the impalad.

When partition columns are used in the query, or when the underlying
avro files have missing fields and the table schema has default values
for them, the AvroFileHeader will have a non-null template tuple, which
could hit this bug when MT_DOP>1.

This patch fixes the bug by transferring the template tuple to
ScanRangeSharedState directly. The scan_node_pool of HdfsScanNodeBase is
also removed since it's only used to hold the template tuple (and
related buffers) of the avro header. Also no need to override
TransferToScanNodePool in HdfsScanNode since the original purpose is to
protect the pool by a lock, and now the method in ScanRangeSharedState
already has a lock.

Tests
 - Add missing test coverage for compute stats on avro tables. Note that
   MT_DOP=4 is set by default for compute stats.
 - Add the MT_DOP dimension for TestScannersAllTableFormats. Also add
   some queries that can reveal the bug in scanners.test. The ASAN build
   can easily crash by heap-use-after-free error without this fix.
 - Ran exhaustive tests.

Change-Id: Iafa43fce7c2ffdc867004d11e5873327c3d8cb42
---
M be/src/exec/avro/hdfs-avro-scanner.cc
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/grouping-aggregator.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/hdfs-scan-node.h
M be/src/exec/streaming-aggregation-node.h
M testdata/workloads/functional-query/queries/QueryTest/compute-stats-avro.test
M testdata/workloads/functional-query/queries/QueryTest/scanners.test
M tests/query_test/test_scanners.py
11 files changed, 228 insertions(+), 23 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/89/19289/2
-- 
To view, visit http://gerrit.cloudera.org:8080/19289
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iafa43fce7c2ffdc867004d11e5873327c3d8cb42
Gerrit-Change-Number: 19289
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>