Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2021/10/29 10:38:00 UTC

[jira] [Commented] (IMPALA-9873) Skip decoding of non-materialised columns in Parquet

    [ https://issues.apache.org/jira/browse/IMPALA-9873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17435895#comment-17435895 ] 

ASF subversion and git services commented on IMPALA-9873:
---------------------------------------------------------

Commit cd64271a0c4df6906d036a5a831001fdc8000285 in impala's branch refs/heads/master from Amogh Margoor
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=cd64271 ]

IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Currently, the entire row is materialized before filtering during a scan.
For columnar formats, we can avoid paying that cost upfront for rows
that are filtered out. Only the columns required for filtering need to
be materialized before filtering. For the rest of the columns,
materialization can be delayed and done only for rows that survive.
This patch implements this technique for the Parquet format only.
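
The idea can be illustrated with a minimal Python sketch. It is not
Impala's scanner code: the names decode and scan_late, and the decode
counter, are hypothetical and exist only to make the saved work visible.

```python
from typing import Callable, Dict, List, Sequence


def decode(encoded: Sequence[int], row: int, counter: Dict[str, int]) -> int:
    """Stand-in for a costly per-value decode; counts how often it runs."""
    counter["decodes"] += 1
    return encoded[row]


def scan_late(
    columns: Dict[str, Sequence[int]],
    filter_col: str,
    predicate: Callable[[int], bool],
    counter: Dict[str, int],
) -> List[Dict[str, int]]:
    """Two-phase scan: filter column first, other columns for survivors only."""
    num_rows = len(columns[filter_col])

    # Phase 1: materialize only the column the predicate needs.
    filter_values = [decode(columns[filter_col], r, counter) for r in range(num_rows)]
    survivors = [r for r in range(num_rows) if predicate(filter_values[r])]

    # Phase 2: materialize the remaining columns for surviving rows only.
    rows = []
    for r in survivors:
        row = {filter_col: filter_values[r]}
        for name, data in columns.items():
            if name != filter_col:
                row[name] = decode(data, r, counter)
        rows.append(row)
    return rows
```

With three columns of four rows and a predicate that keeps two rows,
this sketch decodes 8 values where eager materialization would decode
all 12.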

A new configuration, 'parquet_materialization_threshold', is introduced.
It is the minimum number of consecutive filtered-out rows required
to skip materialization. Setting it to less than 0 disables
late materialization.
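
One way to read the threshold is sketched below in Python. This is an
assumption about the semantics, not the patch's actual logic: the
function name micro_batches and the inclusive (start, end) ranges are
hypothetical. Runs of filtered-out rows shorter than the threshold are
absorbed into the surrounding range, since skipping them is assumed
not to pay off.

```python
from typing import List, Optional, Tuple


def micro_batches(selected: List[bool], threshold: int) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) row ranges to materialize.

    A run of filtered-out rows is skipped only if it is at least
    `threshold` rows long. A negative threshold disables skipping.
    """
    n = len(selected)
    if n == 0:
        return []
    if threshold < 0:
        # Late materialization disabled: materialize every row.
        return [(0, n - 1)]

    # Adjacent surviving rows (gap of 0) always belong to the same range.
    min_gap = max(threshold, 1)
    batches: List[Tuple[int, int]] = []
    start: Optional[int] = None
    last = 0
    for row, keep in enumerate(selected):
        if not keep:
            continue
        if start is None:
            start = last = row
        elif row - last - 1 < min_gap:
            last = row  # Gap too short to be worth skipping.
        else:
            batches.append((start, last))
            start = last = row
    if start is not None:
        batches.append((start, last))
    return batches
```

For example, with surviving rows 0, 2, 7 and 8 out of 10 and a
threshold of 3, the one-row gap is absorbed and the four-row gap is
skipped, giving the ranges (0, 2) and (7, 8).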

Performance:
Performance was measured on a single-daemon, single-threaded impalad
against the TPCH scale 42 lineitem table with 252 million rows of
unsorted data. Up to 2.5x improvement was seen for non-page-indexed
queries and up to 4x with the page index. The page index queries were
borrowed from this blog:
https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
More details:
https://docs.google.com/spreadsheets/d/17s5OLaFOPo-64kimAPP6n3kJA42vM-iVT24OvsQgfuA/edit?usp=sharing

Testing:
 1. Ran existing tests
 2. Added UT for 'ScratchTupleBatch::GetMicroBatch'
 3. Added end-to-end test for late materialization.
Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Reviewed-on: http://gerrit.cloudera.org:8080/17860
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-by: Qifan Chen <qc...@cloudera.com>


> Skip decoding of non-materialised columns in Parquet
> ----------------------------------------------------
>
>                 Key: IMPALA-9873
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9873
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Backend
>            Reporter: Tim Armstrong
>            Assignee: Amogh Margoor
>            Priority: Major
>
> This is a first milestone for lazy materialization in Parquet, focusing on avoiding decompression and decoding of columns.
> * Identify columns referenced by predicates and runtime row filters and determine what order the columns need to be materialised in. Probably we want to evaluate static predicates before runtime filters to match current behaviour.
> * Rework this loop so that it alternates between materialising columns and evaluating predicates: https://github.com/apache/impala/blob/052129c/be/src/exec/parquet/hdfs-parquet-scanner.cc#L1110
> * We probably need to keep track of filtered rows using a new data structure, e.g. a bitmap.
> * We then need to check that bitmap at each step to see if we can skip materialising part or all of the following columns. E.g. if the first N rows were pruned, we can skip the remaining readers forward N rows.
> * This part may be a little tricky - there is the risk of adding overhead compared to the current code.
> * It is probably OK to just materialise the partition columns to start off with - avoiding materialising those is not going to buy that much.
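
The steps in the description above can be sketched in Python as
follows. This is an illustration under stated assumptions, not the
scanner's design: ColumnReader, read_with_bitmap and scan are
hypothetical names, a plain list of booleans stands in for the bitmap,
and each reader is assumed to support a cheap skip.

```python
from typing import Callable, List, Optional, Sequence, Tuple


class ColumnReader:
    """Toy sequential reader: values are either decoded or skipped."""

    def __init__(self, values: Sequence[int]) -> None:
        self._values = list(values)
        self._pos = 0
        self.decoded = 0  # Number of values actually decoded.

    def skip(self, n: int) -> None:
        self._pos += n

    def read(self) -> int:
        value = self._values[self._pos]
        self._pos += 1
        self.decoded += 1
        return value


def read_with_bitmap(reader: ColumnReader, alive: List[bool]) -> List[Optional[int]]:
    """Decode values for live rows; skip each run of pruned rows in one call."""
    out: List[Optional[int]] = []
    pending_skip = 0
    for keep in alive:
        if keep:
            if pending_skip:
                reader.skip(pending_skip)
                pending_skip = 0
            out.append(reader.read())
        else:
            pending_skip += 1
            out.append(None)
    if pending_skip:
        reader.skip(pending_skip)
    return out


def scan(
    predicate_cols: List[Tuple[ColumnReader, Callable[[int], bool]]],
    other_cols: List[ColumnReader],
    num_rows: int,
) -> Tuple[List[bool], List[List[Optional[int]]]]:
    """Alternate between materialising a column and evaluating its predicate."""
    alive = [True] * num_rows
    for reader, pred in predicate_cols:
        values = read_with_bitmap(reader, alive)
        # Pruned rows hold None and are never passed to the predicate.
        alive = [keep and pred(v) for keep, v in zip(alive, values)]
    materialized = [read_with_bitmap(r, alive) for r in other_cols]
    return alive, materialized
```

Ordering predicate_cols with static predicates before runtime filters
would match the ordering suggested in the description. Each later
reader decodes fewer values as the bitmap narrows.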



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org