You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@impala.apache.org by Amos Bird <am...@gmail.com> on 2016/09/05 12:01:02 UTC

questions about runtime filters

Deer impala comunity,
  I'm reading the docs of the newly runtime filtering technique,
  http://www.cloudera.com/documentation/enterprise/latest/topics/impala_runtime_filtering.html#runtime_filtering
  While being quite enlightening, this document still confuses me in some
  aspects.

Question 1,
  >> For other file formats (text, Avro, RCFile, and SequenceFile),
  >> runtime filtering speeds up queries against partitioned tables only.

  Is it a technical problem to implement runtime filtering for other
  formats? Will they be supported in the future?

Question 2,
  >> explain select s from yy2 where year in (select year from yy where year between 2000 and 2005);
  >> +----------------------------------------------------------+
  >> | Explain String                                           |
  >> +----------------------------------------------------------+
  >> | Estimated Per-Host Requirements: Memory=16.00MB VCores=2 |
  >> |                                                          |
  >> | 04:EXCHANGE [UNPARTITIONED]                              |
  >> | |                                                        |
  >> | 02:HASH JOIN [LEFT SEMI JOIN, BROADCAST]                 |
  >> | |  hash predicates: year = year                          |
  >> | |  runtime filters: RF000 <- year                        |
  >> | |                                                        |
  >> | |--03:EXCHANGE [BROADCAST]                               |
  >> | |  |                                                     |
  >> | |  01:SCAN HDFS [dpp.yy]                                 |
  >> | |     partitions=2/4 files=2 size=468B                   |
  >> | |                                                        |
  >> | 00:SCAN HDFS [dpp.yy2]                                   |
  >> |    partitions=2/3 files=2 size=468B                      |
  >> |    runtime filters: RF000 -> year                        |
  >> +----------------------------------------------------------+

  How does the planner be able to tell that there gonna be 2/3
  partitions surviving after runtime fitering?

Question 3,
  >> When the spill-to-disk mechanism is activated on a particular host
  >> during a query, that host does not produce any filters while
  >> processing that query.

  Isn't the spill-to-disk ON by default? However, through my observation,
  runtime filtering is still enabled. What does this quote sentence
  mean?

Question 4,
  In which scenarios will Impala encounter Memory limit exceeded when
  spill-to-disk is enabled.

  Any help is highly appreciated!

Regards,
Amos

Re: questions about runtime filters

Posted by Henry Robinson <he...@apache.org>.
On 5 September 2016 at 18:36, Amos Bird <am...@gmail.com> wrote:

>
> > Henry Robinson writes:
>
> >> Question 2,
> >>>> explain select s from yy2 where year in (select year from yy where
> year between 2000 and 2005);
> >>>> +----------------------------------------------------------+
> >>>> | Explain String                                           |
> >>>> +----------------------------------------------------------+
> >>>> | Estimated Per-Host Requirements: Memory=16.00MB VCores=2 |
> >>>> |                                                          |
> >>>> | 04:EXCHANGE [UNPARTITIONED]                              |
> >>>> | |                                                        |
> >>>> | 02:HASH JOIN [LEFT SEMI JOIN, BROADCAST]                 |
> >>>> | |  hash predicates: year = year                          |
> >>>> | |  runtime filters: RF000 <- year                        |
> >>>> | |                                                        |
> >>>> | |--03:EXCHANGE [BROADCAST]                               |
> >>>> | |  |                                                     |
> >>>> | |  01:SCAN HDFS [dpp.yy]                                 |
> >>>> | |     partitions=2/4 files=2 size=468B                   |
> >>>> | |                                                        |
> >>>> | 00:SCAN HDFS [dpp.yy2]                                   |
> >>>> |    partitions=2/3 files=2 size=468B                      |
> >>>> |    runtime filters: RF000 -> year                        |
> >>>> +----------------------------------------------------------+
> >>
> >> How does the planner be able to tell that there gonna be 2/3
> >> partitions surviving after runtime fitering?
> >
> > The planner doesn't know about runtime filters' selectivity. But the
> between predicate does eliminate the 1999 partition which can be statically
> determined by the planner. That's what changes the number of expected
> scanned partitions in the plan.
>
> So this example isn't related to runtime filtering. It's just a
> case of predicate transitivity and static partition pruning right?
>

Sorry, lost track of this in my inbox. You're right - although runtime
filters are created, they won't be effective because the partition with
year=1999 has already been eliminated by static filtering. I've asked
Cloudera's docs team to fix this. Thanks for pointing it out!

Re: questions about runtime filters

Posted by Amos Bird <am...@gmail.com>.
> Henry Robinson writes:

>> Question 2,
>>>> explain select s from yy2 where year in (select year from yy where year between 2000 and 2005);
>>>> +----------------------------------------------------------+
>>>> | Explain String                                           |
>>>> +----------------------------------------------------------+
>>>> | Estimated Per-Host Requirements: Memory=16.00MB VCores=2 |
>>>> |                                                          |
>>>> | 04:EXCHANGE [UNPARTITIONED]                              |
>>>> | |                                                        |
>>>> | 02:HASH JOIN [LEFT SEMI JOIN, BROADCAST]                 |
>>>> | |  hash predicates: year = year                          |
>>>> | |  runtime filters: RF000 <- year                        |
>>>> | |                                                        |
>>>> | |--03:EXCHANGE [BROADCAST]                               |
>>>> | |  |                                                     |
>>>> | |  01:SCAN HDFS [dpp.yy]                                 |
>>>> | |     partitions=2/4 files=2 size=468B                   |
>>>> | |                                                        |
>>>> | 00:SCAN HDFS [dpp.yy2]                                   |
>>>> |    partitions=2/3 files=2 size=468B                      |
>>>> |    runtime filters: RF000 -> year                        |
>>>> +----------------------------------------------------------+
>>
>> How does the planner be able to tell that there gonna be 2/3
>> partitions surviving after runtime fitering?
>
> The planner doesn't know about runtime filters' selectivity. But the between predicate does eliminate the 1999 partition which can be statically determined by the planner. That's what changes the number of expected scanned partitions in the plan.

So this example isn't related to runtime filtering. It's just a
case of predicate transitivity and static partition pruning right?

Re: questions about runtime filters

Posted by Henry Robinson <he...@cloudera.com>.

> On Sep 5, 2016, at 5:01 AM, Amos Bird <am...@gmail.com> wrote:
> 
> 
> Deer impala comunity,
> I'm reading the docs of the newly runtime filtering technique,
> http://www.cloudera.com/documentation/enterprise/latest/topics/impala_runtime_filtering.html#runtime_filtering
> While being quite enlightening, this document still confuses me in some
> aspects.
> 
> Question 1,
>>> For other file formats (text, Avro, RCFile, and SequenceFile),
>>> runtime filtering speeds up queries against partitioned tables only.
> 
> Is it a technical problem to implement runtime filtering for other
> formats? Will they be supported in the future?

If someone decides to implement support then yes :) The most effective filters are those that apply against partition columns, which allow you to skip the scan in some cases. Those work regardless of file format. 

Per-row filtering is an expensive operation, so would need implementing with a lot of care. 


> 
> Question 2,
>>> explain select s from yy2 where year in (select year from yy where year between 2000 and 2005);
>>> +----------------------------------------------------------+
>>> | Explain String                                           |
>>> +----------------------------------------------------------+
>>> | Estimated Per-Host Requirements: Memory=16.00MB VCores=2 |
>>> |                                                          |
>>> | 04:EXCHANGE [UNPARTITIONED]                              |
>>> | |                                                        |
>>> | 02:HASH JOIN [LEFT SEMI JOIN, BROADCAST]                 |
>>> | |  hash predicates: year = year                          |
>>> | |  runtime filters: RF000 <- year                        |
>>> | |                                                        |
>>> | |--03:EXCHANGE [BROADCAST]                               |
>>> | |  |                                                     |
>>> | |  01:SCAN HDFS [dpp.yy]                                 |
>>> | |     partitions=2/4 files=2 size=468B                   |
>>> | |                                                        |
>>> | 00:SCAN HDFS [dpp.yy2]                                   |
>>> |    partitions=2/3 files=2 size=468B                      |
>>> |    runtime filters: RF000 -> year                        |
>>> +----------------------------------------------------------+
> 
> How does the planner be able to tell that there gonna be 2/3
> partitions surviving after runtime fitering?

The planner doesn't know about runtime filters' selectivity. But the between predicate does eliminate the 1999 partition which can be statically determined by the planner. That's what changes the number of expected scanned partitions in the plan. 



> 
> Question 3,
>>> When the spill-to-disk mechanism is activated on a particular host
>>> during a query, that host does not produce any filters while
>>> processing that query.
> 
> Isn't the spill-to-disk ON by default? However, through my observation,
> runtime filtering is still enabled. What does this quote sentence
> mean?

This is out of date - thanks! I'll make sure someone changes that. Filters work with spilling. 


> 
> Question 4,
> In which scenarios will Impala encounter Memory limit exceeded when
> spill-to-disk is enabled.

In general Impala still needs some fixed amount of memory to run any query - there's a minimum amount of memory a join might need for example to keep just one partition in memory while spilling.   When Impala can't allocate even that memory without going over its limit, the query will fail regardless of spill-to-disk. Others on this list will have more specific examples. 

Henry


> 
> Any help is highly appreciated!
> 
> Regards,
> Amos