Posted to user@impala.apache.org by Antoni Ivanov <ai...@vmware.com> on 2018/03/20 07:45:08 UTC

Does Impala support or plan to support Late Materialization

I don't mean partition pruning, but late materialization as described in
https://aws.amazon.com/about-aws/whats-new/2017/12/amazon-redshift-introduces-late-materialization-for-faster-query-processing/

It basically fetches the filter columns first and, after applying the filter, fetches data from the remaining columns only for the rows where the filter applies.

Thanks
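
The two-step read described above can be sketched in a few lines of Python. This is only a toy illustration of the idea, not how Redshift or Impala implement it; the in-memory table, its column names and the `late_materialize` helper are all hypothetical.

```python
# Toy column store: each column is a separate list, as in a columnar file.
# All names and values here are hypothetical, for illustration only.
columns = {
    "state":  ["CA", "MI", "NY", "MI", "TX"],
    "amount": [10, 20, 30, 40, 50],
    "item":   ["a", "b", "c", "d", "e"],
}


def late_materialize(columns, filter_col, predicate):
    """Read the filter column first, then fetch other columns only for matching rows."""
    # Step 1: scan only the filter column and remember the matching row positions.
    matches = [i for i, v in enumerate(columns[filter_col]) if predicate(v)]
    # Step 2: touch the remaining columns only at those positions.
    return [{name: values[i] for name, values in columns.items()} for i in matches]


rows = late_materialize(columns, "state", lambda s: s == "MI")
print(rows)
```

Only the `state` list is read in full; `amount` and `item` are accessed just at the two matching positions.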

Re: Does Impala support or plan to support Late Materialization

Posted by Mostafa Mokhtar <mm...@cloudera.com>.

@Antoni,

Check the blog post below; it has examples of how to optimize the schema for
selective queries.

https://blog.cloudera.com/blog/2017/12/faster-performance-for-selective-queries/


On Tue, Mar 20, 2018 at 3:30 PM, Tim Armstrong <ta...@cloudera.com>
wrote:

> The page indices should solve a large part of this problem, but I can
> definitely come up with examples where the page indices aren't sufficient
> to avoid most materialisation if we have a predicate on an unsorted column.
>
> E.g. if you have a predicate on a state column with 50 distinct values
> (I'm being US-centric).
>
>   select * from sales where state = 'MI'
>
> Suppose there is some amount of locality to the data and on average you
> get 2 states per data page. You're probably only going to be able to filter
> out ~50% of pages using min-max filters since 'MI' will lie in-between many
> pairs of states. Whereas if you scanned the 'state' column and materialized
> the other columns lazily, you could filter out a large majority of the data
> before materialising the other columns.

Re: Does Impala support or plan to support Late Materialization

Posted by Tim Armstrong <ta...@cloudera.com>.

The page indices should solve a large part of this problem, but I can
definitely come up with examples where the page indices aren't sufficient
to avoid most materialisation if we have a predicate on an unsorted column.

E.g. if you have a predicate on a state column with 50 distinct values (I'm
being US-centric).

  select * from sales where state = 'MI'

Suppose there is some amount of locality to the data and on average you get
2 states per data page. You're probably only going to be able to filter out
~50% of pages using min-max filters since 'MI' will lie in-between many
pairs of states. Whereas if you scanned the 'state' column and materialized
the other columns lazily, you could filter out a large majority of the data
before materialising the other columns.
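
To put rough numbers on the example above, here is a small Python sketch. It rests on a deliberately simple, hypothetical model: one data page for every unordered pair of distinct US state codes, so each page holds exactly 2 states. Real data layouts will differ, and the variable names are made up for the illustration.

```python
from itertools import combinations

STATES = """AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD
MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC
SD TN TX UT VT VA WA WV WI WY""".split()

# Hypothetical model: one data page per unordered pair of distinct states.
pages = [set(pair) for pair in combinations(STATES, 2)]

target = "MI"
# Min/max pruning can skip a page only if the target lies outside [min, max].
kept_by_min_max = [p for p in pages if min(p) <= target <= max(p)]
# Pages that really hold at least one matching row.
truly_matching = [p for p in pages if target in p]

print(len(pages), len(kept_by_min_max), len(truly_matching))
```

Under this model, min/max statistics let the scan skip 588 of the 1225 pages (about 48%), while only 49 pages (4%) contain 'MI' at all. Reading the 'state' column first would identify those 49 pages before any other column is materialised.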

On Tue, Mar 20, 2018 at 9:20 AM, Alexander Behm <al...@cloudera.com>
wrote:

> I think we do eventually want to support it. For highly selective queries
> the existing dictionary and min/max filtering can already be very
> effective. In addition, we plan to add indexes for finer-grained page
> pruning. See https://issues.apache.org/jira/browse/IMPALA-5842
>
> After all those improvements, it's not clear what the additional benefit
> of late materialization is going to be in practice.
>
> Do you have a case in mind that specifically requires late materialization
> to work well?

Re: Does Impala support or plan to support Late Materialization

Posted by Alexander Behm <al...@cloudera.com>.

I think we do eventually want to support it. For highly selective queries
the existing dictionary and min/max filtering can already be very
effective. In addition, we plan to add indexes for finer-grained page
pruning. See https://issues.apache.org/jira/browse/IMPALA-5842

After all those improvements, it's not clear what the additional benefit of
late materialization is going to be in practice.

Do you have a case in mind that specifically requires late materialization
to work well?
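
For readers unfamiliar with the dictionary and min/max filtering mentioned above, the following Python sketch shows the general idea for an equality predicate. It is an assumption-laden toy: `ColumnChunk` and `can_skip` are hypothetical names, not Parquet's or Impala's actual API, and real chunk metadata is richer than this.

```python
from dataclasses import dataclass


@dataclass
class ColumnChunk:
    """Toy stand-in for per-chunk column metadata (hypothetical structure)."""
    min_value: str
    max_value: str
    dictionary: frozenset  # distinct values present in the chunk


def can_skip(chunk: ColumnChunk, value: str) -> bool:
    """Return True if the predicate `col = value` cannot match any row in the chunk."""
    # Min/max check: the value lies outside the chunk's value range.
    if value < chunk.min_value or value > chunk.max_value:
        return True
    # Dictionary check: the value is in range but not among the distinct values.
    return value not in chunk.dictionary


chunks = [
    ColumnChunk("AK", "CA", frozenset({"AK", "AZ", "CA"})),  # out of range for 'MI'
    ColumnChunk("CA", "NY", frozenset({"CA", "NY"})),        # in range, absent from dictionary
    ColumnChunk("IL", "MN", frozenset({"IL", "MI", "MN"})),  # contains 'MI', must be scanned
]
print([can_skip(c, "MI") for c in chunks])
```

The second chunk shows why the dictionary check adds value on top of min/max: 'MI' falls inside the range CA..NY, yet the chunk can still be skipped because 'MI' is not one of its distinct values.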

On Tue, Mar 20, 2018 at 12:47 AM, Antoni Ivanov <ai...@vmware.com> wrote:

> Hi,
>
> You can ignore my question. I found the relevant JIRA,
> https://issues.apache.org/jira/browse/IMPALA-2017, so I guess the answer
> is "not yet".
>
> Regards,
> Antoni

RE: Does Impala support or plan to support Late Materialization

Posted by Antoni Ivanov <ai...@vmware.com>.

Hi,

You can ignore my question. I found the relevant JIRA, https://issues.apache.org/jira/browse/IMPALA-2017, so I guess the answer is "not yet".

Regards,
Antoni
