You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2022/04/05 08:58:00 UTC
[jira] [Commented] (IMPALA-11204) OrcStringColumnReader should be a template class

    [ https://issues.apache.org/jira/browse/IMPALA-11204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517322#comment-17517322 ] 

ASF subversion and git services commented on IMPALA-11204:
----------------------------------------------------------

Commit 85ddd27b640bf42a61d8af238938e16618db537e in impala's branch refs/heads/master from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=85ddd27b6 ]

IMPALA-11204: Template implementation for OrcStringColumnReader::ReadValue

There are some checks in OrcStringColumnReader::ReadValue() that we can
determine outside the scope of this method. They should be optimized
since this is a critical method that will be executed for each row (and
for each string column). With these checks, the method is too complex to
be inlined in OrcBatchedReader::ReadValueBatch() by the compiler.

This patch templates OrcStringColumnReader::ReadValue() with two
parameters, one for the target slot type (i.e. STRING/CHAR/VARCHAR),
and the other one for whether the column is dictionary encoded. Also
adds an ALWAYS_INLINE marker to force inlining it.

OrcStringColumnReader::ReadValueBatch() will call a template version of
ReadValue() based on the slot type and the orc batch encoded state.

Ran a single node perf test on TPCH(30) on my dev box using 3 impalad
instances. There are some improvements and no significant regressions:
+----------+--------+-------------+------------+
| Query    | Avg(s) | Base Avg(s) | Delta(Avg) |
+----------+--------+-------------+------------+
| TPCH-Q19 | 5.62   | 6.07        | I -7.41%   |
| TPCH-Q6  | 2.56   | 2.78        | I -7.77%   |
| TPCH-Q4  | 3.85   | 4.25        | I -9.42%   |
| TPCH-Q12 | 4.25   | 4.99        | I -14.78%  |
+----------+--------+-------------+------------+
Base commit: ff21728
File Format: orc/snap/block
Iterations: 30

Change-Id: I5e5f88c28059fb3d3ac1172e6d383d06ee3bedd5
Reviewed-on: http://gerrit.cloudera.org:8080/18366
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> OrcStringColumnReader should be a template class
> ------------------------------------------------
>
>                 Key: IMPALA-11204
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11204
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Critical
>         Attachments: perf-report-tpch30-orc-snap.txt
>
>
> There are some duplicated checks in OrcStringColumnReader::ReadValue() which is executed for each row. They are static and should only be performed once per ORC stripe.
> {code:cpp}
> Status OrcStringColumnReader::ReadValue(int row_idx, Tuple* tuple, MemPool* pool) {
>   if (IsNull(DCHECK_NOTNULL(batch_), row_idx)) {
>     SetNullSlot(tuple);
>     return Status::OK();
>   }
>   char* src_ptr;
>   int src_len;
>   if (batch_->isEncoded) {
>     orc::EncodedStringVectorBatch* currentBatch =
>         static_cast<orc::EncodedStringVectorBatch*>(batch_);
>     orc::DataBuffer<int64_t>& offsets = currentBatch->dictionary->dictionaryOffset;
>     int64_t index = currentBatch->index[row_idx];
>     if (UNLIKELY(index < 0  || static_cast<uint64_t>(index) + 1 >= offsets.size())) {
>       return Status(Substitute("Corrupt ORC file: $0. Index ($1) out of range [0, $2) in "
>           "StringDictionary.", scanner_->filename(), index, offsets.size()));;
>     }
>     src_ptr = blob_ + offsets[index];
>     src_len = offsets[index + 1] - offsets[index];
>   } else {
>     // The pointed data is now in blob_, a buffer handled by Impala.
>     src_ptr = blob_ + (batch_->data[row_idx] - batch_->blob.data());
>     src_len = batch_->length[row_idx];
>   }
>   int dst_len = slot_desc_->type().len;
>   if (slot_desc_->type().type == TYPE_CHAR) {
>     int unpadded_len = min(dst_len, static_cast<int>(src_len));
>     char* dst_char = reinterpret_cast<char*>(GetSlot(tuple));
>     memcpy(dst_char, src_ptr, unpadded_len);
>     StringValue::PadWithSpaces(dst_char, dst_len, unpadded_len);
>     return Status::OK();
>   }
>   StringValue* dst = reinterpret_cast<StringValue*>(GetSlot(tuple));
>   if (slot_desc_->type().type == TYPE_VARCHAR && src_len > dst_len) {
>     dst->len = dst_len;
>   } else {
>     dst->len = src_len;
>   }
>   dst->ptr = src_ptr;
>   return Status::OK();
> }
> {code}
> This method is executed for each row. The complexity of it causes the compiler unable to inline it. I can see this method takes some portion of the time in TPC-H queries (e.g. 21% in Q12).
> Actually, the slot type is determined when we are creating the reader. Whether the orc vector batch is encoded is determined when we start reading a new stripe. If the string column is in dictionary encodings in the stripe and we set EnableLazyDecoding to true in RowReaderOptions, the orc vector batch will be encoded.
> [https://github.com/apache/orc/blob/21259815ae665f6a4beac29edb4fab52c2403e13/c%2B%2B/src/Reader.cc#L1205]
> [https://github.com/apache/orc/blob/21259815ae665f6a4beac29edb4fab52c2403e13/c%2B%2B/src/ColumnReader.cc#L1843-L1847]
> We can switch to a template implementation so the compile can remove these checks and inline the method.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org