Posted to issues@kudu.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2016/10/06 23:23:20 UTC
[jira] [Commented] (KUDU-1683) Kudu client support for pushing runtime min/max filters

    [ https://issues.apache.org/jira/browse/KUDU-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553535#comment-15553535 ] 

Todd Lipcon commented on KUDU-1683:
-----------------------------------

I think we can implement this in two phases:

----
*Phase 1: given a KuduScanToken, add additional predicates to it before "hydrating" it into a Scanner*

Using this, the Impala scanner code can apply newly discovered predicates to scan ranges before it opens them up and begins scanning them.

This will benefit two types of queries:

1) In the case that the runtime filters are generated very quickly, the ScanNode which will apply them already has code to wait until filters are ready (see https://www.cloudera.com/documentation/enterprise/latest/topics/impala_runtime_filtering.html#runtime_filtering_timing ). 

This should apply for queries like the following:
{code}
select c.*  from orders o join customer c on c.c_custkey = o.o_custkey where o_orderkey = 1125;
{code}
which are currently very slow because they scan the entire 'customer' table. For comparison, the above query takes 100 seconds, whereas we can estimate the time with runtime filters by doing the "join" manually:
{code}
[vd0340:21000] > select o_custkey from orders where o_orderkey = 1125;
Query: select o_custkey from orders where o_orderkey = 1125
+-----------+
| o_custkey |
+-----------+
| 72370418  |
+-----------+
Fetched 1 row(s) in 0.15s
[vd0340:21000] > select c.*  from customer  c where c_custkey = 72370418;
Query: select c.*  from customer  c where c_custkey = 72370418
+-----------+--------------------+-----------------------------------+-------------+-----------------+-----------+--------------+-------------------------------------------------------+
| c_custkey | c_name             | c_address                         | c_nationkey | c_phone         | c_acctbal | c_mktsegment | c_comment                                             |
+-----------+--------------------+-----------------------------------+-------------+-----------------+-----------+--------------+-------------------------------------------------------+
| 72370418  | Customer#072370418 | dw7lQ2wpAXeBFb1FAovuSO7N3GM,wv gc | 19          | 29-711-497-7607 | 6479.07   | FURNITURE    | boost quickly. even attainments snooze. even deposits |
+-----------+--------------------+-----------------------------------+-------------+-----------------+-----------+--------------+-------------------------------------------------------+
Fetched 1 row(s) in 0.18s
{code}

Given that the 'orders' lookup took <1sec, the 'customer' scan would have waited for the filter to be ready, resulting in a very fast single-partition range scan.

2) In the case that the scan is very large, it will have many scan ranges (tokens). So, even if we can't "add" a predicate in the middle of a single scan range, it's plausible that the filter will arrive when there are still many ranges left to scan, and we'll be able to cull them or make them more efficient.
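
To make the culling idea concrete, here is a rough sketch in Python. It is a toy model only: ToyScanToken and apply_min_max_filter are hypothetical names, not part of the Kudu client, and it assumes each token covers a simple integer key range [lower, upper).

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class ToyScanToken:
    """Hypothetical stand-in for a scan token covering keys in [lower, upper)."""
    lower: int
    upper: int
    # Extra inclusive (min, max) bounds attached before the token is opened.
    bounds: List[Tuple[int, int]] = field(default_factory=list)


def apply_min_max_filter(tokens: List[ToyScanToken], opened: Set[int],
                         filter_min: int, filter_max: int) -> List[ToyScanToken]:
    """Attach an inclusive [filter_min, filter_max] filter to unopened tokens.

    Unopened tokens whose key range cannot overlap the filter are culled.
    Tokens that were already opened are returned unchanged, since this
    phase cannot alter a scan that is in progress.
    """
    result = []
    for i, tok in enumerate(tokens):
        if i in opened:
            result.append(tok)
            continue
        if tok.upper <= filter_min or tok.lower > filter_max:
            continue  # no key in [lower, upper) can satisfy the filter
        tok.bounds.append((filter_min, filter_max))
        result.append(tok)
    return result


# Ten ranges of 10M keys each; the first was already opened when the filter arrived.
tokens = [ToyScanToken(i * 10_000_000, (i + 1) * 10_000_000) for i in range(10)]
remaining = apply_min_max_filter(tokens, opened={0},
                                 filter_min=72370418, filter_max=72370418)
print([(t.lower, t.upper) for t in remaining])
```

In this toy run only the already-opened range and the single range that contains the key survive; the other eight ranges are never scanned.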

----
*Phase 2: allow adding a predicate mid-scan*

This would be more complicated to implement, but we could add an API to the tablet server that adds a predicate to an existing scanner. The simplest implementation would avoid "re-optimizing" the scan and just start evaluating the predicate server-side. This can give a good linear improvement by avoiding extra IO, etc. A more complex implementation would evaluate whether the additional predicate allows re-optimization, in which case a full tablet scan could be converted to a range scan, etc.
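
As a rough illustration of the simplest variant (no re-optimization), here is a toy sketch in Python. ToyScanner and its methods are hypothetical and do not reflect the tablet server's actual scanner code or RPCs; the sketch only shows a predicate being attached between batches and evaluated on the rows read afterwards.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, int]
Predicate = Callable[[Row], bool]


class ToyScanner:
    """Hypothetical scanner; not the Kudu tablet server implementation."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self._rows = iter(rows)
        self._predicates: List[Predicate] = []
        self.rows_read = 0

    def add_predicate(self, pred: Predicate) -> None:
        # No re-optimization: the scan order and range stay the same, and the
        # predicate is simply evaluated on every row read from now on.
        self._predicates.append(pred)

    def next_batch(self, max_rows: int) -> List[Row]:
        """Read up to max_rows rows and return those passing all predicates."""
        batch = []
        for _ in range(max_rows):
            row = next(self._rows, None)
            if row is None:
                break
            self.rows_read += 1
            if all(p(row) for p in self._predicates):
                batch.append(row)
        return batch


scanner = ToyScanner({"c_custkey": k} for k in range(10))
first = scanner.next_batch(4)    # no filter yet: every row read is returned
scanner.add_predicate(lambda r: 6 <= r["c_custkey"] <= 7)
second = scanner.next_batch(6)   # remaining rows are read but filtered
print(len(first), [r["c_custkey"] for r in second], scanner.rows_read)
```

Note that the toy scanner still visits every row after the predicate is added; it only returns fewer of them. Skipping rows entirely would correspond to the more complex re-optimizing variant.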

I think we should implement these two phases separately, since the second one seems more complicated, and the first would still yield good benefits.


> Kudu client support for pushing runtime min/max filters
> -------------------------------------------------------
>
>                 Key: KUDU-1683
>                 URL: https://issues.apache.org/jira/browse/KUDU-1683
>             Project: Kudu
>          Issue Type: New Feature
>          Components: client
>    Affects Versions: 1.0.0
>            Reporter: Matthew Jacobs
>            Priority: Critical
>              Labels: impala
>
> Impala would like to generate runtime min/max filters to be pushed to Kudu, at least for scan tokens that haven't been opened yet.
> https://issues.cloudera.org/browse/IMPALA-4252


