You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Tim Armstrong (Code Review)" <ge...@cloudera.org> on 2017/09/27 15:22:32 UTC

[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro

Tim Armstrong has uploaded a new patch set (#7). ( http://gerrit.cloudera.org:8080/8146 )

Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro
......................................................................

IMPALA-5307: Part 2: copy out strings in uncompressed Avro

The approach is to re-materialize strings in those tuples that
survive conjunct evaluation and may reference disk I/O buffers
directly. This means that perf should not regress for the
following cases:
* Compressed Avro files.
* Non-string columns.
* Selective scans where the majority of tuples are filtered out.

This approach will also work for the Sequence and Text scanners.

Includes some improvements to Avro codegen to replace more constants to
help win back some performance (with limited success): replaced
InitTuple() with an optimised version and substituted
tuple_byte_size() with a constant.

Removes dead code for handling CHAR(n) - CHAR(n) is now always fixed
length.

Perf:
Did microbenchmarks on uncompressed Avro files, one with all columns
from lineitem and one with only l_comment. Tests were run with:
  set num_scanner_threads=1;

I ran the query 5 times and extracted MaterializeTupleTime from the
profile to measure CPU cost of materialization. Overall string
materialization got significantly slower, mainly because of the
extra memcpy() calls required.

Selecting one string from a table with multiple columns:

  select min(l_comment) from biglineitem_avro
  1.814 -> 2.096

Selecting one string from a table with one column:

  select min(l_comment) from biglineitem_comment; profile;
  1.708 -> 3.7

Selecting one string from a table with one column with predicate:

  select min(l_comment) from biglineitem_comment where length(l_comment) > 10000;
  1.691 -> 1.449

Selecting all columns:

  select min(l_orderkey), min(l_partkey), min(l_suppkey), min(l_linenumber),
        min(l_quantity), min(l_extendedprice), min(l_discount), min(l_tax),
        min(l_returnflag), min(l_linestatus), min(l_shipdate),
        min(l_commitdate), min(l_receiptdate), min(l_shipinstruct),
        min(l_shipmode), min(l_comment) from biglineitem_avro; profile;
  2.335 -> 3.711

Selecting an int column (no strings):
  select min(l_linenumber) from biglineitem_avro
  1.806 -> 1.819

Testing:
Ran exhaustive tests.

Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/codegen/llvm-codegen.cc
M be/src/codegen/llvm-codegen.h
M be/src/common/status.cc
M be/src/common/status.h
M be/src/exec/hdfs-avro-scanner-ir.cc
M be/src/exec/hdfs-avro-scanner.cc
M be/src/exec/hdfs-avro-scanner.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/runtime/CMakeLists.txt
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
A be/src/runtime/tuple-ir.cc
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
19 files changed, 329 insertions(+), 56 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/8146/7
-- 
To view, visit http://gerrit.cloudera.org:8080/8146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Gerrit-Change-Number: 8146
Gerrit-PatchSet: 7
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>