Posted to issues@impala.apache.org by "Zoltán Borók-Nagy (Jira)" <ji...@apache.org> on 2019/12/10 15:39:00 UTC

[jira] [Created] (IMPALA-9226) Improve string allocations of the ORC scanner

Zoltán Borók-Nagy created IMPALA-9226:
-----------------------------------------

             Summary: Improve string allocations of the ORC scanner
                 Key: IMPALA-9226
                 URL: https://issues.apache.org/jira/browse/IMPALA-9226
             Project: IMPALA
          Issue Type: Improvement
            Reporter: Zoltán Borók-Nagy


Currently the ORC scanner allocates new memory for each string value (except for fixed-size strings):

https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/orc-column-readers.cc#L172

Since ORC-501, StringVectorBatch has a member named 'blob' that contains the strings in the batch: https://github.com/apache/orc/blob/branch-1.6/c%2B%2B/include/orc/Vector.hh#L126

'blob' has type DataBuffer, which is movable, so Impala might be able to take ownership of it. Failing that, we could at least copy the whole blob in one operation instead of copying the strings one by one.
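
A minimal sketch of the copy-the-whole-blob fallback, to make the idea concrete. It does not use the real ORC or Impala types: MockStringBatch, StringSlot, CopyBlobOnce and MakeBatch are hypothetical stand-ins invented for illustration, and the only assumption taken from the issue is that every string pointer in the batch points into 'blob'.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a string batch whose values all live in one
// contiguous buffer: data[i] points into `blob`, length[i] is its length.
struct MockStringBatch {
  std::vector<char> blob;
  std::vector<const char*> data;
  std::vector<int64_t> length;
};

// Hypothetical stand-in for a scanner-side string slot (pointer + length).
struct StringSlot {
  const char* ptr;
  int64_t len;
};

// Copies the whole blob in one bulk operation, then rebases every string
// pointer onto the copy. No per-value allocation is performed.
std::vector<StringSlot> CopyBlobOnce(const MockStringBatch& batch,
                                     std::vector<char>* dst_blob) {
  assert(batch.data.size() == batch.length.size());
  dst_blob->assign(batch.blob.begin(), batch.blob.end());
  std::vector<StringSlot> slots;
  slots.reserve(batch.data.size());
  for (std::size_t i = 0; i < batch.data.size(); ++i) {
    // A string keeps the same offset in the copy as in the source blob.
    std::ptrdiff_t offset = batch.data[i] - batch.blob.data();
    slots.push_back({dst_blob->data() + offset, batch.length[i]});
  }
  return slots;
}

// Test helper: packs the given strings back to back into a batch.
MockStringBatch MakeBatch(const std::vector<std::string>& values) {
  MockStringBatch batch;
  for (const std::string& v : values) {
    batch.blob.insert(batch.blob.end(), v.begin(), v.end());
  }
  std::size_t offset = 0;
  for (const std::string& v : values) {
    batch.data.push_back(batch.blob.data() + offset);
    batch.length.push_back(static_cast<int64_t>(v.size()));
    offset += v.size();
  }
  return batch;
}
```

If taking ownership turns out to be possible, the bulk copy would be replaced by a move of the buffer and the pointers would not need rebasing at all. Either way the destination buffer has to outlive the slots that point into it.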

ORC-501 is included in ORC version 1.6, but Impala currently only uses ORC 1.5.5.

ORC 1.6 also introduces a new string vector type, EncodedStringVectorBatch:

https://github.com/apache/orc/blob/e40b9a7205d51995f11fe023c90769c0b7c4bb93/c%2B%2B/include/orc/Vector.hh#L153

It uses dictionary encoding for storing the values. Impala could copy/move the dictionary as well.
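
The same idea for the dictionary-encoded case could look roughly like the sketch below. Again, nothing here is the ORC API: MockEncodedBatch, its field names, StringSlot and TakeDictionary are hypothetical, and the layout (one byte buffer, an offsets array with one extra trailing entry, and a per-row index) is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a dictionary-encoded string batch: entry k
// occupies bytes [offsets[k], offsets[k + 1]) of dict_blob, and index[i]
// selects the dictionary entry for row i.
struct MockEncodedBatch {
  std::vector<char> dict_blob;
  std::vector<int64_t> offsets;  // one more element than there are entries
  std::vector<int64_t> index;
};

// Hypothetical stand-in for a scanner-side string slot (pointer + length).
struct StringSlot {
  const char* ptr;
  int64_t len;
};

// Takes ownership of the dictionary bytes by moving them, then points each
// row at its dictionary entry. Rows sharing an entry share the same bytes.
std::vector<StringSlot> TakeDictionary(MockEncodedBatch&& batch,
                                       std::vector<char>* owned_blob) {
  *owned_blob = std::move(batch.dict_blob);
  std::vector<StringSlot> slots;
  slots.reserve(batch.index.size());
  for (int64_t entry : batch.index) {
    assert(entry >= 0 &&
           static_cast<std::size_t>(entry) + 1 < batch.offsets.size());
    int64_t begin = batch.offsets[entry];
    int64_t end = batch.offsets[entry + 1];
    slots.push_back({owned_blob->data() + begin, end - begin});
  }
  return slots;
}
```

With this layout the cost of materializing a batch is proportional to the number of rows plus one buffer move, independent of the total length of the strings.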



--
This message was sent by Atlassian Jira
(v8.3.4#803005)