Posted to issues-all@impala.apache.org by "Gabor Kaszab (Jira)" <ji...@apache.org> on 2022/01/13 14:48:00 UTC

[jira] [Commented] (IMPALA-3841) Avoid materializing nested collections if top-level predicates already disqualify the row.

    [ https://issues.apache.org/jira/browse/IMPALA-3841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475411#comment-17475411 ] 

Gabor Kaszab commented on IMPALA-3841:
--------------------------------------

FYI, this has been implemented for Parquet under this ticket: https://issues.apache.org/jira/browse/IMPALA-9873

The change introduces late materialisation for all columns that are not part of the decision to filter out a row (not just complex types).
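
As a rough illustration of the idea only (this is not the Impala scanner code; the function, the `decode` callback and the toy table are all hypothetical), the columns the predicate reads are decoded first, and every other column, scalar or complex, is decoded only for rows that survive:

```python
from typing import Any, Callable, Dict, List, Sequence


def scan_with_late_materialization(
    encoded: Dict[str, Sequence[Any]],
    predicate_cols: List[str],
    predicate: Callable[[Dict[str, Any]], bool],
    decode: Callable[[str, Any], Any],
) -> List[Dict[str, Any]]:
    """Decode predicate columns first, other columns only for surviving rows."""
    num_rows = len(next(iter(encoded.values())))
    # Phase 1: materialize only the columns the predicate needs.
    filter_vals = {
        name: [decode(name, v) for v in encoded[name]] for name in predicate_cols
    }
    surviving = [
        i
        for i in range(num_rows)
        if predicate({name: filter_vals[name][i] for name in predicate_cols})
    ]
    # Phase 2: materialize the remaining columns for surviving rows only.
    other_cols = [name for name in encoded if name not in predicate_cols]
    rows = []
    for i in surviving:
        row = {name: filter_vals[name][i] for name in predicate_cols}
        for name in other_cols:
            row[name] = decode(name, encoded[name][i])
        rows.append(row)
    return rows


# Count decode calls per column to see how much work is avoided.
decode_calls = {"id": 0, "name": 0, "orders": 0}


def counting_decode(name: str, value: Any) -> Any:
    decode_calls[name] += 1
    return value  # stand-in for real decoding work


table = {
    "id": [1, 10, 25],
    "name": ["ann", "bob", "cy"],
    "orders": [["a"], ["b", "c"], []],
}
result = scan_with_late_materialization(
    table, ["id"], lambda r: r["id"] == 10, counting_decode
)
```

In this toy run the `id` column is decoded for every row, while `name` and `orders` are each decoded once, for the single matching row.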

> Avoid materializing nested collections if top-level predicates already disqualify the row.
> ------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-3841
>                 URL: https://issues.apache.org/jira/browse/IMPALA-3841
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 2.5.0, Impala 2.6.0
>            Reporter: Alexander Behm
>            Priority: Minor
>              Labels: complextype, nested_types, parquet, performance
>
> Today, we fully materialize a row before evaluating the top-level conjuncts when scanning Parquet. This includes materializing nested collections. We should avoid materializing nested collections if top-level conjuncts already discard the row. Our recent move to column-wise materialization makes this improvement feasible (IMPALA-2736).
> To illustrate the problem, consider this query:
> {code}
> select * from customer c, c.orders o where c.id = 10
> {code}
> Even though we have a very selective predicate on the top-level customer, our scanner will still fully materialize all orders of all customers. The non-matches will be filtered, but we still pay the cost of materializing the orders.
> The proposed improvement is to avoid materializing the orders of non-qualifying customers.
> The improvement will require several things:
> * Analyze and separate the top-level conjuncts into those that can be evaluated before materializing the nested collections and those that require nested collections to be materialized. In particular, we need to be careful with our auto-generated !empty() predicates on nested collections.
> * Add a new SkipValues() or similar interface to the Parquet column readers to advance the scanner without actually materializing values. If possible, we should skip entire blocks.
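
The two steps above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the Impala implementation: `Conjunct`, `split_conjuncts`, `ColumnReader` and `skip_values` are hypothetical names, each nested collection is treated as a single value (a real Parquet reader would have to walk repetition and definition levels to skip a collection), and the `len(...) > 0` check only stands in for the auto-generated !empty() predicate.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Conjunct:
    slots: FrozenSet[str]              # columns the conjunct reads
    fn: Callable[[Dict[str, Any]], bool]


def split_conjuncts(
    conjuncts: List[Conjunct], nested_slots: FrozenSet[str]
) -> Tuple[List[Conjunct], List[Conjunct]]:
    """Separate conjuncts that need nested collections from those that do not."""
    early = [c for c in conjuncts if not (c.slots & nested_slots)]
    late = [c for c in conjuncts if c.slots & nested_slots]
    return early, late


class ColumnReader:
    """Toy column reader that tracks how many values it materialized."""

    def __init__(self, values: List[Any]) -> None:
        self._values = values
        self._pos = 0
        self.materialized = 0

    def read_value(self) -> Any:
        value = self._values[self._pos]
        self._pos += 1
        self.materialized += 1
        return value

    def skip_values(self, n: int) -> None:
        # Advance the read position without materializing anything.
        self._pos += n


def scan(id_reader, orders_reader, num_rows, early, late):
    out = []
    for _ in range(num_rows):
        row = {"id": id_reader.read_value()}
        if not all(c.fn(row) for c in early):
            # Row already disqualified: skip its collection instead of reading it.
            orders_reader.skip_values(1)
            continue
        row["orders"] = orders_reader.read_value()
        if all(c.fn(row) for c in late):
            out.append(row)
    return out


conjuncts = [
    Conjunct(frozenset({"id"}), lambda r: r["id"] == 10),
    # Stand-in for the auto-generated !empty() predicate on the collection.
    Conjunct(frozenset({"orders"}), lambda r: len(r["orders"]) > 0),
]
early, late = split_conjuncts(conjuncts, frozenset({"orders"}))
ids = ColumnReader([1, 10, 25])
orders = ColumnReader([["a"], ["b", "c"], []])
rows = scan(ids, orders, 3, early, late)
```

Here the predicate on `id` is classified as early and the emptiness check as late, so only the collection of the one qualifying customer is materialized; the other two are skipped.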



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
