You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Aman Sinha (Code Review)" <ge...@cloudera.org> on 2020/11/30 01:51:33 UTC

[Impala-ASF-CR] IMPALA-10360: Allow simple limit to be treated as sampling hint

Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16792

to look at the new patch set (#2).

Change subject: IMPALA-10360: Allow simple limit to be treated as sampling hint
......................................................................

IMPALA-10360: Allow simple limit to be treated as sampling hint

As a follow-up to IMPALA-10314, it is sometimes useful to consider
a simple limit as a way to sample from a table if a relevant hint
has been provided. Doing a sample instead of pure limit serves
dual purposes: (a) it still helps with reducing the planning time
since the scan ranges need be computed only for the sample files,
(b) it allows sufficient number of files/rows to be read from
the table such that after applying filter conditions or joins with
another table, the query may still produce the N rows needed for
limit.

This fuctionality is especially useful if the query is against a
view (note that TABLESAMPLE clause cannot be applied to a view).

In this patch, a new table level hint, 'convert_limit_to_sample'
is added. If this hint is attached to a table either in the main
query block or within a view/subquery and simple limit optimization
conditions are satisfied (according to IMPALA-10314), the limit
is converted to a table sample. For example:

 set optimize_simple_limit = true;
 CREATE VIEW v1 as SELECT * FROM T [convert_limit_to_sample]
    WHERE [always_true] <predicate>;
 SELECT * FROM v1 LIMIT 10;

In this case, the limit 10 is converted to a sample of T and the
sampling percent is the greater of 1% or ratio (in percent) of
limit to the estimated row count of the table.

Testing:
 - Added a alltypes_date_partition_2 table where the date and
   timestamp values match (this helps with setting the
   'always_true' hint).
 - Added views with 'convert_limit_to_sample' and 'always_true'
   hints and added new tests against the views. Modified a few
   existing tests to reference the new table variant.
 - Added an end-to-end test.

Change-Id: Ife05a5343c913006f7659949b327b63d3f10c04b
---
M fe/src/main/java/org/apache/impala/analysis/CompoundPredicate.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/analysis/TableRef.java
M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test
M testdata/workloads/functional-query/queries/QueryTest/range-constant-propagation.test
9 files changed, 224 insertions(+), 35 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/92/16792/2
-- 
To view, visit http://gerrit.cloudera.org:8080/16792
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ife05a5343c913006f7659949b327b63d3f10c04b
Gerrit-Change-Number: 16792
Gerrit-PatchSet: 2
Gerrit-Owner: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>