Posted to reviews@impala.apache.org by "Tim Armstrong (Code Review)" <ge...@cloudera.org> on 2017/09/27 15:22:32 UTC
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Tim Armstrong has uploaded a new patch set (#7). ( http://gerrit.cloudera.org:8080/8146 )
Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
......................................................................
IMPALA-5307: Part 2: copy out strings in uncompressed Avro
The approach is to re-materialize strings only in those tuples that
survive conjunct evaluation and that may reference disk I/O buffers
directly. This means that perf should not regress in the
following cases:
* Compressed Avro files.
* Non-string columns.
* Selective scans where the majority of tuples are filtered out.
This approach will also work for the Sequence and Text scanners.
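A minimal sketch of the copy-out idea described above, for illustration
only. The types and names here (StringSlot, Row, Arena, FilterAndCopyOut)
are hypothetical stand-ins and are not the classes used in the patch; the
sketch only shows that the memcpy() cost is paid for surviving rows and
not for rows that are filtered out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

// Hypothetical stand-ins for illustration; these are not Impala's real types.
struct StringSlot {
  const char* ptr;  // may point into an I/O buffer that will be recycled
  std::size_t len;
};

struct Row {
  StringSlot comment;
};

// Simple arena that owns the copied string bytes for as long as it lives.
class Arena {
 public:
  const char* Copy(const char* src, std::size_t len) {
    // Allocate at least one byte so that empty strings get a valid pointer.
    chunks_.push_back(std::unique_ptr<char[]>(new char[len > 0 ? len : 1]));
    char* dst = chunks_.back().get();
    if (len > 0) std::memcpy(dst, src, len);
    return dst;
  }

 private:
  std::vector<std::unique_ptr<char[]>> chunks_;
};

// Keeps the rows that pass 'pred' and, only for those rows, copies the string
// bytes out of the I/O buffer into 'arena'. Filtered-out rows cost no memcpy().
template <typename Pred>
std::vector<Row> FilterAndCopyOut(const std::vector<Row>& batch, Pred pred,
                                  Arena* arena) {
  assert(arena != nullptr);
  std::vector<Row> out;
  for (const Row& row : batch) {
    if (!pred(row)) continue;
    Row copy = row;
    copy.comment.ptr = arena->Copy(row.comment.ptr, row.comment.len);
    out.push_back(copy);
  }
  return out;
}
```

After the call, the returned rows no longer depend on the I/O buffer, so
the buffer can be reused while the rows are still live.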
Includes some improvements to Avro codegen that substitute constants in
more places to help win back some performance (with limited success):
InitTuple() is replaced with an optimised version and calls to
tuple_byte_size() are replaced with a constant.
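To illustrate why substituting a constant for a runtime size can help,
here is a rough sketch that uses a C++ template parameter as an analogy
for the codegen'd constant. It assumes that tuple initialisation amounts
to copying a fixed-size template; InitTupleGeneric and InitTupleConst are
hypothetical names, not functions from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Generic path: the size is only known at run time, so the compiler has to
// emit a general-purpose memcpy() call.
inline void InitTupleGeneric(char* tuple, const char* tmpl,
                             std::size_t byte_size) {
  assert(tuple != nullptr && tmpl != nullptr);
  std::memcpy(tuple, tmpl, byte_size);
}

// Specialised path: the size is a compile-time constant, which lets the
// compiler expand the copy inline (for example into a few fixed-width moves).
template <std::size_t kByteSize>
inline void InitTupleConst(char* tuple, const char* tmpl) {
  assert(tuple != nullptr && tmpl != nullptr);
  std::memcpy(tuple, tmpl, kByteSize);
}
```

Both functions produce the same bytes; the only difference is how much
the optimiser knows about the size when it compiles the copy.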
Removes dead code for handling CHAR(n): CHAR(n) is now always fixed
length.
Perf:
Did microbenchmarks on uncompressed Avro files, one with all columns
from lineitem and one with only l_comment. Tests were run with:
set num_scanner_threads=1;
I ran each query 5 times and extracted MaterializeTupleTime from the
profile to measure the CPU cost of materialization. Overall, string
materialization got significantly slower, mainly because of the
extra memcpy() calls required.
Selecting one string from a table with multiple columns:
select min(l_comment) from biglineitem_avro
1.814 -> 2.096
Selecting one string from a table with one column:
select min(l_comment) from biglineitem_comment; profile;
1.708 -> 3.7
Selecting one string from a table with one column with predicate:
select min(l_comment) from biglineitem_comment where length(l_comment) > 10000;
1.691 -> 1.449
Selecting all columns:
select min(l_orderkey), min(l_partkey), min(l_suppkey), min(l_linenumber),
min(l_quantity), min(l_extendedprice), min(l_discount), min(l_tax),
min(l_returnflag), min(l_linestatus), min(l_shipdate),
min(l_commitdate), min(l_receiptdate), min(l_shipinstruct),
min(l_shipmode), min(l_comment) from biglineitem_avro; profile;
2.335 -> 3.711
Selecting an int column (no strings):
select min(l_linenumber) from biglineitem_avro
1.806 -> 1.819
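For convenience, the before/after numbers above can be turned into
relative changes with a small helper such as the following sketch. The
labels are shorthand for the queries above, the figures are copied from
the results as reported, and the names Measurement, PercentChange and
PrintChanges are illustrative only.

```cpp
#include <cassert>
#include <cstdio>

struct Measurement {
  const char* label;
  double before;  // MaterializeTupleTime before the change, as reported
  double after;   // MaterializeTupleTime after the change, as reported
};

// Relative change in percent; positive means the scan got slower.
double PercentChange(double before, double after) {
  assert(before > 0.0);
  return (after - before) / before * 100.0;
}

const Measurement kResults[] = {
    {"one string, multi-column table", 1.814, 2.096},
    {"one string, one-column table", 1.708, 3.7},
    {"one string, one column, with predicate", 1.691, 1.449},
    {"all columns", 2.335, 3.711},
    {"int column (no strings)", 1.806, 1.819},
};

// Prints one line per measurement with the signed percentage change.
void PrintChanges() {
  for (const Measurement& m : kResults) {
    std::printf("%-40s %+.1f%%\n", m.label, PercentChange(m.before, m.after));
  }
}
```

By this calculation the changes are roughly +15.5%, +116.6%, -14.3%,
+58.9% and +0.7% respectively.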
Testing:
Ran exhaustive tests.
Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/codegen/llvm-codegen.cc
M be/src/codegen/llvm-codegen.h
M be/src/common/status.cc
M be/src/common/status.h
M be/src/exec/hdfs-avro-scanner-ir.cc
M be/src/exec/hdfs-avro-scanner.cc
M be/src/exec/hdfs-avro-scanner.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/runtime/CMakeLists.txt
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
A be/src/runtime/tuple-ir.cc
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
19 files changed, 329 insertions(+), 56 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/8146/7
--
To view, visit http://gerrit.cloudera.org:8080/8146
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 7
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>