Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2020/03/03 17:17:00 UTC

[jira] [Commented] (IMPALA-9228) ORC scanner could be vectorized

    [ https://issues.apache.org/jira/browse/IMPALA-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050404#comment-17050404 ] 

ASF subversion and git services commented on IMPALA-9228:
---------------------------------------------------------

Commit f7c78ba27eaa4d228fd7fb9f349cf18c92cddc46 in impala's branch refs/heads/master from Gabor Kaszab
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=f7c78ba ]

IMPALA-9228: ORC scanner reads rows into scratch batch

For performance reasons, this change enhances the ORC scanner to
populate a scratch batch column by column using data from the column
readers. Once the scratch batch is filled, the existing Parquet code
is reused to apply runtime filters and conjuncts and to populate the
outgoing row batch.

This approach reduces the number of virtual function calls and takes
advantage of the columnar orientation of the data to improve scan
performance. Additionally, introducing the scratch batch concept
opens the door to codegen for runtime filtering and conjunct
evaluation.

Note that this change covers only primitive types and structs, not
collection types. Collection types will continue to follow the
previous row-by-row approach.

Testing:
  - Re-ran the full test suite to verify that no regression was
    introduced.
  - Checked the performance impact by running the TPCH workload on a
    scale 25 database using single_node_perf_run.py. Total query
    runtime decreased by 0-20% depending on how scan-heavy the
    particular query was; the more scan-heavy the query, the larger
    the observed gain.

Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca
Reviewed-on: http://gerrit.cloudera.org:8080/15104
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> ORC scanner could be vectorized
> -------------------------------
>
>                 Key: IMPALA-9228
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9228
>             Project: IMPALA
>          Issue Type: Improvement
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Gabor Kaszab
>            Priority: Major
>              Labels: orc
>         Attachments: 1-4_col_measurement_int_only.png
>
>
> The ORC scanner uses an external library to read ORC files. The library reads the file contents into its own memory representation, a vectorized representation similar to the Arrow format.
> Impala needs to convert the ORC row batch to an Impala row batch. Currently the conversion happens row by row via virtual function calls:
> [https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/hdfs-orc-scanner.cc#L671]
> [https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/orc-column-readers.cc#L352]
> Instead, it could work like the Parquet scanner, which fills the columns one by one into a scratch batch and then evaluates the conjuncts on the scratch batch. For more details see HdfsParquetScanner::AssembleRows():
> [https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/parquet/hdfs-parquet-scanner.cc#L1077-L1088]
> This way we would need far fewer virtual function calls, and the memory reads/writes would be much more localized and predictable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org