Posted to commits@impala.apache.org by jr...@apache.org on 2018/01/11 22:30:54 UTC

[3/3] impala git commit: IMPALA-4252: [DOCS] Document min/max filters for Kudu tables

IMPALA-4252: [DOCS] Document min/max filters for Kudu tables

Change-Id: I15d8c952ab5b90e89fdd57640dfb4da882f7ecb2
Reviewed-on: http://gerrit.cloudera.org:8080/8986
Reviewed-by: Thomas Tauber-Marshall <tm...@cloudera.com>
Tested-by: Impala Public Jenkins


Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/b27537a1
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/b27537a1
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/b27537a1

Branch: refs/heads/master
Commit: b27537a15b679c0c84b7d2e9f49c9a51f9ae93ba
Parents: ab81c48
Author: John Russell <jr...@cloudera.com>
Authored: Tue Jan 9 13:52:02 2018 -0800
Committer: Impala Public Jenkins <im...@gerrit.cloudera.org>
Committed: Thu Jan 11 21:49:23 2018 +0000

----------------------------------------------------------------------
 docs/shared/impala_common.xml                   |  6 +++++
 .../impala_disable_row_runtime_filtering.xml    | 14 ++++++++++++
 docs/topics/impala_kudu.xml                     | 17 ++++++++++++++
 docs/topics/impala_max_num_runtime_filters.xml  |  4 ++++
 .../topics/impala_runtime_bloom_filter_size.xml |  4 ++++
 docs/topics/impala_runtime_filter_max_size.xml  |  4 ++++
 docs/topics/impala_runtime_filter_min_size.xml  |  4 ++++
 docs/topics/impala_runtime_filtering.xml        | 24 +++++++++++++++-----
 8 files changed, 71 insertions(+), 6 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/impala/blob/b27537a1/docs/shared/impala_common.xml
----------------------------------------------------------------------
diff --git a/docs/shared/impala_common.xml b/docs/shared/impala_common.xml
index dc8cdb5..c1496c6 100644
--- a/docs/shared/impala_common.xml
+++ b/docs/shared/impala_common.xml
@@ -1194,6 +1194,12 @@ drop database temp;
         <codeph>hadoop fs -cp</codeph>, or <codeph>INSERT</codeph> in Impala or Hive.
       </p>
 
+      <p rev="2.11.0 IMPALA-4252" id="filter_option_bloom_only">
+        This query option affects only Bloom filters, not the min/max filters
+        that are applied to Kudu tables. Therefore, it does not affect the
+        performance of queries against Kudu tables.
+      </p>
+
       <p rev="2.9.0 IMPALA-5333" id="adls_dml_performance">
         <draft-comment>
           Currently nothing to say on this subject. Leaving this placeholder

http://git-wip-us.apache.org/repos/asf/impala/blob/b27537a1/docs/topics/impala_disable_row_runtime_filtering.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_disable_row_runtime_filtering.xml b/docs/topics/impala_disable_row_runtime_filtering.xml
index 3280084..8bdbcda 100644
--- a/docs/topics/impala_disable_row_runtime_filtering.xml
+++ b/docs/topics/impala_disable_row_runtime_filtering.xml
@@ -72,6 +72,20 @@ under the License.
       unsetting it immediately afterward.
     </p>
 
+    <p conref="../shared/impala_common.xml#common/file_format_blurb"/>
+
+    <p rev="2.11.0 IMPALA-4252">
+      This query option only applies to queries against HDFS-based tables
+      using the Parquet file format.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/kudu_blurb"/>
+
+    <p rev="2.11.0 IMPALA-4252">
+      When applied to a query involving a Kudu table, this option turns off
+      all runtime filtering for the Kudu table.
+    </p>
+
     <p conref="../shared/impala_common.xml#common/related_info"/>
     <p>
       <xref href="impala_runtime_filtering.xml"/>,

http://git-wip-us.apache.org/repos/asf/impala/blob/b27537a1/docs/topics/impala_kudu.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_kudu.xml b/docs/topics/impala_kudu.xml
index 08d3559..4260c56 100644
--- a/docs/topics/impala_kudu.xml
+++ b/docs/topics/impala_kudu.xml
@@ -1373,6 +1373,23 @@ kudu.table_name  | impala::some_database.table_name_demo
         parallelize the query very efficiently.
       </p>
 
+      <p rev="2.11.0 IMPALA-4252">
+        In <keyword keyref="impala211_full"/> and higher, Impala can push down additional
+        information to optimize join queries involving Kudu tables. If the join clause
+        contains predicates of the form
+        <codeph><varname>column</varname> = <varname>expression</varname></codeph>,
+        after Impala constructs a hash table of possible matching values for the
+        join columns from the bigger table (either an HDFS table or a Kudu table), Impala
+        can <q>push down</q> the minimum and maximum matching column values to Kudu,
+        so that Kudu can more efficiently locate matching rows in the second (smaller) table.
+        These min/max filters are affected by the <codeph>RUNTIME_FILTER_MODE</codeph>,
+        <codeph>RUNTIME_FILTER_WAIT_TIME_MS</codeph>, and <codeph>DISABLE_ROW_RUNTIME_FILTERING</codeph>
+        query options; the min/max filters are not affected by the
+        <codeph>RUNTIME_BLOOM_FILTER_SIZE</codeph>, <codeph>RUNTIME_FILTER_MIN_SIZE</codeph>,
+        <codeph>RUNTIME_FILTER_MAX_SIZE</codeph>, and <codeph>MAX_NUM_RUNTIME_FILTERS</codeph>
+        query options.
+      </p>
+
       <p>
         See <xref keyref="explain"/> for examples of evaluating the effectiveness of
         the predicate pushdown for a specific query against a Kudu table.

http://git-wip-us.apache.org/repos/asf/impala/blob/b27537a1/docs/topics/impala_max_num_runtime_filters.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_max_num_runtime_filters.xml b/docs/topics/impala_max_num_runtime_filters.xml
index 7ac06ee..0fa52e0 100644
--- a/docs/topics/impala_max_num_runtime_filters.xml
+++ b/docs/topics/impala_max_num_runtime_filters.xml
@@ -67,6 +67,10 @@ under the License.
 
     <p conref="../shared/impala_common.xml#common/runtime_filtering_option_caveat"/>
 
+    <p conref="../shared/impala_common.xml#common/kudu_blurb"/>
+
+    <p rev="2.11.0 IMPALA-4252" conref="../shared/impala_common.xml#common/filter_option_bloom_only"/>
+
     <p conref="../shared/impala_common.xml#common/related_info"/>
     <p>
       <xref href="impala_runtime_filtering.xml"/>,

http://git-wip-us.apache.org/repos/asf/impala/blob/b27537a1/docs/topics/impala_runtime_bloom_filter_size.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_runtime_bloom_filter_size.xml b/docs/topics/impala_runtime_bloom_filter_size.xml
index 62d0c03..96f0ca5 100644
--- a/docs/topics/impala_runtime_bloom_filter_size.xml
+++ b/docs/topics/impala_runtime_bloom_filter_size.xml
@@ -98,6 +98,10 @@ under the License.
       unsetting it immediately afterward.
     </p>
 
+    <p conref="../shared/impala_common.xml#common/kudu_blurb"/>
+
+    <p rev="2.11.0 IMPALA-4252" conref="../shared/impala_common.xml#common/filter_option_bloom_only"/>
+
     <p conref="../shared/impala_common.xml#common/related_info"/>
     <p>
       <xref href="impala_runtime_filtering.xml"/>,

http://git-wip-us.apache.org/repos/asf/impala/blob/b27537a1/docs/topics/impala_runtime_filter_max_size.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_runtime_filter_max_size.xml b/docs/topics/impala_runtime_filter_max_size.xml
index 7e86914..2370d44 100644
--- a/docs/topics/impala_runtime_filter_max_size.xml
+++ b/docs/topics/impala_runtime_filter_max_size.xml
@@ -57,6 +57,10 @@ under the License.
 
     <p conref="../shared/impala_common.xml#common/runtime_filtering_option_caveat"/>
 
+    <p conref="../shared/impala_common.xml#common/kudu_blurb"/>
+
+    <p rev="2.11.0 IMPALA-4252" conref="../shared/impala_common.xml#common/filter_option_bloom_only"/>
+
     <p conref="../shared/impala_common.xml#common/related_info"/>
     <p>
       <xref href="impala_runtime_filtering.xml"/>,

http://git-wip-us.apache.org/repos/asf/impala/blob/b27537a1/docs/topics/impala_runtime_filter_min_size.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_runtime_filter_min_size.xml b/docs/topics/impala_runtime_filter_min_size.xml
index c341b79..36a4426 100644
--- a/docs/topics/impala_runtime_filter_min_size.xml
+++ b/docs/topics/impala_runtime_filter_min_size.xml
@@ -57,6 +57,10 @@ under the License.
 
     <p conref="../shared/impala_common.xml#common/runtime_filtering_option_caveat"/>
 
+    <p conref="../shared/impala_common.xml#common/kudu_blurb"/>
+
+    <p rev="2.11.0 IMPALA-4252" conref="../shared/impala_common.xml#common/filter_option_bloom_only"/>
+
     <p conref="../shared/impala_common.xml#common/related_info"/>
     <p>
       <xref href="impala_runtime_filtering.xml"/>,

http://git-wip-us.apache.org/repos/asf/impala/blob/b27537a1/docs/topics/impala_runtime_filtering.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_runtime_filtering.xml b/docs/topics/impala_runtime_filtering.xml
index 3323afb..ae0f1b4 100644
--- a/docs/topics/impala_runtime_filtering.xml
+++ b/docs/topics/impala_runtime_filtering.xml
@@ -169,16 +169,23 @@ under the License.
         of values for join key columns. When this list of values is transmitted in time to a scan node,
         Impala can filter out non-matching values immediately after reading them, rather than transmitting
         the raw data to another host to compare against the in-memory hash table on that host.
-        This data structure is implemented as a <term>Bloom filter</term>, which uses a probability-based
-        algorithm to determine all possible matching values. (The probability-based aspects means that the
-        filter might include some non-matching values, but if so, that does not cause any inaccuracy
+      </p>
+      <p>
+        For HDFS-based tables, this data structure is implemented as a <term>Bloom filter</term>, which uses
+        a probability-based algorithm to determine all possible matching values. (The probability-based aspect
+        means that the filter might include some non-matching values, but if so, that does not cause any inaccuracy
         in the final results.)
       </p>
+      <p rev="2.11.0 IMPALA-4252">
+        Another kind of filter is the <q>min-max</q> filter. It currently only applies to Kudu tables. The
+        filter is a data structure representing a minimum and maximum value. These filters are passed to
+        Kudu to reduce the number of rows returned to Impala when scanning the probe side of the join.
+      </p>
       <p>
         There are different kinds of filters to match the different kinds of joins (partitioned and broadcast).
-        A broadcast filter is a complete list of relevant values that can be immediately evaluated by a scan node.
-        A partitioned filter is a partial list of relevant values (based on the data processed by one host in the
-        cluster); all the partitioned filters must be combined into one (by the coordinator node) before the
+        A broadcast filter reflects the complete list of relevant values and can be immediately evaluated by a scan node.
+        A partitioned filter reflects only the values processed by one host in the
+        cluster; all the partitioned filters must be combined into one (by the coordinator node) before the
         scan nodes can use the results to accurately filter the data as it is read from storage.
       </p>
       <p>
@@ -331,6 +338,9 @@ under the License.
         <codeph>runtime filters: <varname>filter_id</varname> &lt;- <varname>table</varname>.<varname>column</varname></codeph>,
         while a plan fragment that consumes a filter includes an annotation such as
         <codeph>runtime filters: <varname>filter_id</varname> -&gt; <varname>table</varname>.<varname>column</varname></codeph>.
+        <ph rev="2.11.0 IMPALA-4252">Setting the query option <codeph>EXPLAIN_LEVEL=2</codeph> adds additional
+        annotations showing the type of the filter, either <codeph><varname>filter_id</varname>[bloom]</codeph>
+        (for HDFS-based tables) or <codeph><varname>filter_id</varname>[min_max]</codeph> (for Kudu tables).</ph>
       </p>
 
       <p>
@@ -507,6 +517,8 @@ select c1 from huge_t1 join [shuffle] huge_t2
         The runtime filtering feature is most effective for the Parquet file formats.
         For other file formats, filtering only applies for partitioned tables.
         See <xref href="impala_runtime_filtering.xml#runtime_filtering_file_formats"/>.
+        For the ways in which runtime filtering works for Kudu tables, see
+        <xref href="impala_kudu.xml#kudu_performance"/>.
       </p>
 
       <!-- To do: check if this restriction is lifted in 5.8 / 2.6. -->
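
Note appended after the diff (not part of the commit): the min/max filter described in the impala_runtime_filtering.xml and impala_kudu.xml changes can be illustrated with a small sketch. This is a conceptual model only, not Impala or Kudu code. The names MinMaxFilter, build_min_max, and scan_with_filter are hypothetical, and integer join keys are assumed for simplicity. It shows the filter being derived from one side of the join, applied while scanning the other side, and passing some non-matching rows that the join itself later discards, so results stay accurate.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class MinMaxFilter:
    """Hypothetical stand-in for a min/max runtime filter (not an Impala class)."""
    lo: int
    hi: int

    def may_match(self, value: int) -> bool:
        # A value outside [lo, hi] cannot match any join key that was seen.
        return self.lo <= value <= self.hi


def build_min_max(build_keys: Iterable[int]) -> Optional[MinMaxFilter]:
    """Summarize the join keys seen while building the hash table."""
    keys = list(build_keys)
    if not keys:
        return None  # assumption in this sketch: no keys means no filter
    return MinMaxFilter(min(keys), max(keys))


def scan_with_filter(probe_keys: Iterable[int],
                     flt: Optional[MinMaxFilter]) -> List[int]:
    """Simulate the storage layer dropping rows before returning them."""
    if flt is None:
        return list(probe_keys)
    return [k for k in probe_keys if flt.may_match(k)]


build_keys = [12, 15, 40]          # keys collected for the hash table
probe_keys = [3, 12, 20, 40, 57]   # keys in the table being scanned

flt = build_min_max(build_keys)
survivors = scan_with_filter(probe_keys, flt)

# The join still checks exact equality, so 20 is discarded here.
build_set = set(build_keys)
joined = [k for k in survivors if k in build_set]

print(survivors)  # [12, 20, 40]
print(joined)     # [12, 40]
```

The range check is cheap to evaluate and small to transmit, at the cost of letting through values that fall inside the range without matching (20 in this example).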