You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Tim Armstrong (Code Review)" <ge...@cloudera.org> on 2020/06/17 22:46:42 UTC

[Impala-ASF-CR] IMPALA-2515: support parquet decimal with extra padding

Tim Armstrong has uploaded a new patch set (#5). ( http://gerrit.cloudera.org:8080/16090 )

Change subject: IMPALA-2515: support parquet decimal with extra padding
......................................................................

IMPALA-2515: support parquet decimal with extra padding

This adds support for reading Parquet files where the
DECIMAL is encoded as a FIXED_LEN_BYTE_ARRAY field
with extra padding. This requires loosening file
validation and fixing up the decoding so that it
no longer assumes that the in-memory value is at
least as large as the encoded representation.

The decimal decoding logic was reworked so that we
could add the extra condition handling without regressing
performance of the decoding logic in the common case. In
the end I was able to significantly speed up the decoding
logic. The bottleneck, revealed by perf record while running
the below benchmark, was CPU stalls on the bitshift used for
sign extension instruction waiting on loading the result of
ByteSwap(). I worked around this by doing the sign-extension
before the ByteSwap(),

Perf:
Ran a microbenchmark to check that scanning perf didn't regress
as a result of the change. The query scans a DECIMAL column
that is mostly plain-encoded, so to maximally stress the
FIXED_LEN_BYTE_ARRAY decoding performance.

  set mt_dop=1;
  set num_nodes=1;
  select min(l_extendedprice)
  from tpch_parquet.lineitem

The SCAN time in the summary averaged out to 94ms before
the change and is reduced to 74ms after the change. The
actual speedup of the DECIMAL decoding is greater - it
went from ~20% of time in to ~6% of time as measured by
perf.

Testing:
Added a couple of parquet files that were generated with a
hacked version of Impala to have extra padding. Sanity-checked
that hacked tables returned the same results on Hive. The
tests failed before this code change.

Ran exhaustive tests with the hacked version of Impala
(so that all decimal tables got extra padding).

Change-Id: I2700652eab8ba7f23ffa75800a1712d310d4e1ec
---
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/util/decimal-util.h
M testdata/data/README
A testdata/data/decimal_padded_fixed_len_byte_array.parquet
A testdata/data/decimal_padded_fixed_len_byte_array2.parquet
M testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-formats.test
M tests/query_test/test_scanners.py
7 files changed, 105 insertions(+), 19 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/90/16090/5
-- 
To view, visit http://gerrit.cloudera.org:8080/16090
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2700652eab8ba7f23ffa75800a1712d310d4e1ec
Gerrit-Change-Number: 16090
Gerrit-PatchSet: 5
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>