Posted to issues-all@impala.apache.org by "Gabor Kaszab (Jira)" <ji...@apache.org> on 2020/01/24 14:56:00 UTC

[jira] [Issue Comment Deleted] (IMPALA-9228) ORC scanner could be vectorized

     [ https://issues.apache.org/jira/browse/IMPALA-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Kaszab updated IMPALA-9228:
---------------------------------
    Comment: was deleted

(was: Created a very basic PoC implementation that covers OrcIntColumnReader only: [https://gerrit.cloudera.org/#/c/15104/]

+Perf test details:+
 - Ran a basic test on TPC-H (scale factor 25) querying avg() over columns of lineitem, both on a single column and on all 4 int/bigint columns in one query.
 - MT_DOP=1, NUM_NODES=1
 - Compared the MaterializeTupleTime metric of the runs.
 - Did the measurement both with and without this change.
 - Unfortunately I did it with debug builds, but the results seem quite clear. Will repeat the measurement with a release build later.

+Results:+
    !screenshot-1.png!

Apparently, this enhancement actually decreases performance. The root cause I identified is that the ORC library first reads values into its own representation (OrcIntColumnReader::batch_). We have to copy these items into a scratch batch, and only after that can we do the filtering on the scratch. Doing the copy column by column in batches does not improve performance enough to offset the cost of the copy itself.

 )

> ORC scanner could be vectorized
> -------------------------------
>
>                 Key: IMPALA-9228
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9228
>             Project: IMPALA
>          Issue Type: Improvement
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Gabor Kaszab
>            Priority: Major
>              Labels: orc
>
> The ORC scanner uses an external library to read ORC files. The library reads the file contents into its own memory representation, a vectorized representation similar to the Arrow format.
> Impala needs to convert the ORC row batch to an Impala row batch. Currently the conversion happens row-wise via virtual function calls:
> [https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/hdfs-orc-scanner.cc#L671]
> [https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/orc-column-readers.cc#L352]
> Instead, it could work similarly to the Parquet scanner, which fills the columns one by one into a scratch batch and then evaluates the conjuncts on the scratch batch. For more details see HdfsParquetScanner::AssembleRows():
> [https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/parquet/hdfs-parquet-scanner.cc#L1077-L1088]
> This way we'll need far fewer virtual function calls, and the memory reads/writes will be much more localized and predictable.



