Posted to commits@impala.apache.org by ar...@apache.org on 2018/10/31 18:34:36 UTC

[1/4] impala git commit: IMPALA-7687: [DOCS] Support for multiple DISTINCT in a query

Repository: impala
Updated Branches:
  refs/heads/master 85166afa8 -> 01f60d938


IMPALA-7687: [DOCS] Support for multiple DISTINCT in a query

- Removed notes about the single DISTINCT restriction.
- Rewrote the description for the APPX_COUNT_DISTINCT query option.

Change-Id: I3a6e664b016e9408a3ff809f1811253a91764481
Reviewed-on: http://gerrit.cloudera.org:8080/11823
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-by: Thomas Marshall <th...@cmu.edu>


Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/dcc4024b
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/dcc4024b
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/dcc4024b

Branch: refs/heads/master
Commit: dcc4024b1d13631ec57e0dcd3dddb461c918cb1b
Parents: 85166af
Author: Alex Rodoni <ar...@cloudera.com>
Authored: Mon Oct 29 17:33:30 2018 -0700
Committer: Alex Rodoni <ar...@cloudera.com>
Committed: Tue Oct 30 23:55:25 2018 +0000

----------------------------------------------------------------------
 docs/shared/impala_common.xml              | 27 ----------
 docs/topics/impala_appx_count_distinct.xml | 65 ++++++-------------------
 docs/topics/impala_count.xml               |  2 -
 docs/topics/impala_distinct.xml            | 38 +++++++--------
 docs/topics/impala_langref_unsupported.xml |  6 ---
 docs/topics/impala_select.xml              |  6 ---
 6 files changed, 33 insertions(+), 111 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/impala/blob/dcc4024b/docs/shared/impala_common.xml
----------------------------------------------------------------------
diff --git a/docs/shared/impala_common.xml b/docs/shared/impala_common.xml
index cb4be6c..a45f802 100644
--- a/docs/shared/impala_common.xml
+++ b/docs/shared/impala_common.xml
@@ -2117,33 +2117,6 @@ show functions in _impala_builtins like '*<varname>substring</varname>*';
         <codeph>--insert_inherit_permissions</codeph> startup option for the <cmdname>impalad</cmdname> daemon.
       </p>
 
-      <note id="multiple_count_distinct">
-        <p>
-          By default, Impala only allows a single <codeph>COUNT(DISTINCT <varname>columns</varname>)</codeph>
-          expression in each query.
-        </p>
-        <p>
-          If you do not need precise accuracy, you can produce an estimate of the distinct values for a column by
-          specifying <codeph>NDV(<varname>column</varname>)</codeph>; a query can contain multiple instances of
-          <codeph>NDV(<varname>column</varname>)</codeph>. To make Impala automatically rewrite
-          <codeph>COUNT(DISTINCT)</codeph> expressions to <codeph>NDV()</codeph>, enable the
-          <codeph>APPX_COUNT_DISTINCT</codeph> query option.
-        </p>
-        <p>
-          To produce the same result as multiple <codeph>COUNT(DISTINCT)</codeph> expressions, you can use the
-          following technique for queries involving a single table:
-        </p>
-<codeblock xml:space="preserve">select v1.c1 result1, v2.c1 result2 from
-  (select count(distinct col1) as c1 from t1) v1
-    cross join
-  (select count(distinct col2) as c1 from t1) v2;
-</codeblock>
-        <p>
-          Because <codeph>CROSS JOIN</codeph> is an expensive operation, prefer to use the <codeph>NDV()</codeph>
-          technique wherever practical.
-        </p>
-      </note>
-
       <p>
         <ph id="union_all_vs_union">Prefer <codeph>UNION ALL</codeph> over <codeph>UNION</codeph> when you know the
         data sets are disjoint or duplicate values are not a problem; <codeph>UNION ALL</codeph> is more efficient

http://git-wip-us.apache.org/repos/asf/impala/blob/dcc4024b/docs/topics/impala_appx_count_distinct.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_appx_count_distinct.xml b/docs/topics/impala_appx_count_distinct.xml
index 8655968..28544e0 100644
--- a/docs/topics/impala_appx_count_distinct.xml
+++ b/docs/topics/impala_appx_count_distinct.xml
@@ -21,7 +21,13 @@ under the License.
 <concept rev="2.0.0" id="appx_count_distinct">
 
   <title>APPX_COUNT_DISTINCT Query Option (<keyword keyref="impala20"/> or higher only)</title>
-  <titlealts audience="PDF"><navtitle>APPX_COUNT_DISTINCT</navtitle></titlealts>
+
+  <titlealts audience="PDF">
+
+    <navtitle>APPX_COUNT_DISTINCT</navtitle>
+
+  </titlealts>
+
   <prolog>
     <metadata>
       <data name="Category" value="Impala"/>
@@ -35,65 +41,26 @@ under the License.
   <conbody>
 
     <p rev="2.0.0">
-      <indexterm audience="hidden">APPX_COUNT_DISTINCT query option</indexterm>
-      Allows multiple <codeph>COUNT(DISTINCT)</codeph> operations within a single query, by internally rewriting
-      each <codeph>COUNT(DISTINCT)</codeph> to use the <codeph>NDV()</codeph> function. The resulting count is
-      approximate rather than precise.
+      When the <codeph>APPX_COUNT_DISTINCT</codeph> query option is set to
+      <codeph>TRUE</codeph>, Impala implicitly converts <codeph>COUNT(DISTINCT)</codeph>
+      operations to <codeph>NDV()</codeph> function calls. The resulting count is
+      approximate rather than precise. Enable the query option when a small amount of
+      error is acceptable in exchange for faster results than exact
+      <codeph>COUNT(DISTINCT)</codeph> queries provide.
     </p>
 
     <p conref="../shared/impala_common.xml#common/type_boolean"/>
 
     <p conref="../shared/impala_common.xml#common/default_false_0"/>
 
-    <p conref="../shared/impala_common.xml#common/example_blurb"/>
-
-    <p>
-      The following examples show how the <codeph>APPX_COUNT_DISTINCT</codeph> lets you work around the restriction
-      where a query can only evaluate <codeph>COUNT(DISTINCT <varname>col_name</varname>)</codeph> for a single
-      column. By default, you can count the distinct values of one column or another, but not both in a single
-      query:
-    </p>
-
-<codeblock>[localhost:21000] &gt; select count(distinct x) from int_t;
-+-------------------+
-| count(distinct x) |
-+-------------------+
-| 10                |
-+-------------------+
-[localhost:21000] &gt; select count(distinct property) from int_t;
-+--------------------------+
-| count(distinct property) |
-+--------------------------+
-| 7                        |
-+--------------------------+
-[localhost:21000] &gt; select count(distinct x), count(distinct property) from int_t;
-ERROR: AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters
-as count(DISTINCT x); deviating function: count(DISTINCT property)
-</codeblock>
-
-    <p>
-      When you enable the <codeph>APPX_COUNT_DISTINCT</codeph> query option, now the query with multiple
-      <codeph>COUNT(DISTINCT)</codeph> works. The reason this behavior requires a query option is that each
-      <codeph>COUNT(DISTINCT)</codeph> is rewritten internally to use the <codeph>NDV()</codeph> function instead,
-      which provides an approximate result rather than a precise count.
-    </p>
-
-<codeblock>[localhost:21000] &gt; set APPX_COUNT_DISTINCT=true;
-[localhost:21000] &gt; select count(distinct x), count(distinct property) from int_t;
-+-------------------+--------------------------+
-| count(distinct x) | count(distinct property) |
-+-------------------+--------------------------+
-| 10                | 7                        |
-+-------------------+--------------------------+
-</codeblock>
-
     <p conref="../shared/impala_common.xml#common/related_info"/>
 
     <p>
       <xref href="impala_count.xml#count"/>,
-      <xref href="impala_distinct.xml#distinct"/>,
-      <xref href="impala_ndv.xml#ndv"/>
+      <xref
+        href="impala_distinct.xml#distinct"/>, <xref href="impala_ndv.xml#ndv"/>
     </p>
 
   </conbody>
+
 </concept>

http://git-wip-us.apache.org/repos/asf/impala/blob/dcc4024b/docs/topics/impala_count.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_count.xml b/docs/topics/impala_count.xml
index 59180c7..d489c6d 100644
--- a/docs/topics/impala_count.xml
+++ b/docs/topics/impala_count.xml
@@ -242,8 +242,6 @@ ERROR: AnalysisException: RANGE is only supported with both the lower and upper
 </codeblock>
     </p>
 
-    <note conref="../shared/impala_common.xml#common/multiple_count_distinct"/>
-
     <p conref="../shared/impala_common.xml#common/related_info"/>
 
     <p>

http://git-wip-us.apache.org/repos/asf/impala/blob/dcc4024b/docs/topics/impala_distinct.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_distinct.xml b/docs/topics/impala_distinct.xml
index 710ea0c..1a5a947 100644
--- a/docs/topics/impala_distinct.xml
+++ b/docs/topics/impala_distinct.xml
@@ -21,6 +21,7 @@ under the License.
 <concept id="distinct">
 
   <title>DISTINCT Operator</title>
+
   <prolog>
     <metadata>
       <data name="Category" value="Impala"/>
@@ -35,45 +36,40 @@ under the License.
   <conbody>
 
     <p>
-      <indexterm audience="hidden">DISTINCT operator</indexterm>
-      The <codeph>DISTINCT</codeph> operator in a <codeph>SELECT</codeph> statement filters the result set to
-      remove duplicates:
+      The <codeph>DISTINCT</codeph> operator in a <codeph>SELECT</codeph> statement filters the
+      result set to remove duplicates.
     </p>
 
 <codeblock>-- Returns the unique values from one column.
 -- NULL is included in the set of values if any rows have a NULL in this column.
-select distinct c_birth_country from customer;
+SELECT DISTINCT c_birth_country FROM customer;
+
 -- Returns the unique combinations of values from multiple columns.
-select distinct c_salutation, c_last_name from customer;</codeblock>
+SELECT DISTINCT c_salutation, c_last_name FROM customer;</codeblock>
 
     <p>
-      You can use <codeph>DISTINCT</codeph> in combination with an aggregation function, typically
-      <codeph>COUNT()</codeph>, to find how many different values a column contains:
+      You can use <codeph>DISTINCT</codeph> in combination with an aggregation function,
+      typically <codeph>COUNT()</codeph>, to find how many different values a column contains.
     </p>
 
 <codeblock>-- Counts the unique values from one column.
 -- NULL is not included as a distinct value in the count.
-select count(distinct c_birth_country) from customer;
--- Counts the unique combinations of values from multiple columns.
-select count(distinct c_salutation, c_last_name) from customer;</codeblock>
+SELECT COUNT(DISTINCT c_birth_country) FROM customer;
 
-    <p>
-      One construct that Impala SQL does <i>not</i> support is using <codeph>DISTINCT</codeph> in more than one
-      aggregation function in the same query. For example, you could not have a single query with both
-      <codeph>COUNT(DISTINCT c_first_name)</codeph> and <codeph>COUNT(DISTINCT c_last_name)</codeph> in the
-      <codeph>SELECT</codeph> list.
-    </p>
+-- Counts the unique combinations of values from multiple columns.
+SELECT COUNT(DISTINCT c_salutation, c_last_name) FROM customer;</codeblock>
 
     <p conref="../shared/impala_common.xml#common/zero_length_strings"/>
 
-    <note conref="../shared/impala_common.xml#common/multiple_count_distinct"/>
-
     <note>
       <p>
-        In contrast with some database systems that always return <codeph>DISTINCT</codeph> values in sorted order,
-        Impala does not do any ordering of <codeph>DISTINCT</codeph> values. Always include an <codeph>ORDER
-        BY</codeph> clause if you need the values in alphabetical or numeric sorted order.
+        In contrast with some database systems that always return <codeph>DISTINCT</codeph>
+        values in sorted order, Impala does not do any ordering of <codeph>DISTINCT</codeph>
+        values. Always include an <codeph>ORDER BY</codeph> clause if you need the values in
+        alphabetical or numeric sorted order.
       </p>
     </note>
+
   </conbody>
+
 </concept>
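
For illustration only, the query shape that the removed notes described as unsupported (two `COUNT(DISTINCT)` expressions over different columns in one `SELECT` list) can be sketched as follows. This is a minimal sketch under stated assumptions: it runs against Python's built-in `sqlite3` module as a stand-in rather than against Impala, the table name `int_t` and its columns `x` and `property` are borrowed from the removed example, and the rows are made up.

```python
import sqlite3

# In-memory SQLite database, used purely as a stand-in for Impala.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE int_t (x INTEGER, property TEXT)")

# Made-up data: x cycles through 10 values, property through 7 values.
rows = [(i % 10, "p%d" % (i % 7)) for i in range(70)]
conn.executemany("INSERT INTO int_t VALUES (?, ?)", rows)

# Two DISTINCT aggregates over different columns in a single query.
distinct_x, distinct_property = conn.execute(
    "SELECT COUNT(DISTINCT x), COUNT(DISTINCT property) FROM int_t"
).fetchone()

print(distinct_x, distinct_property)
```

Both counts here are exact. With `APPX_COUNT_DISTINCT` enabled in Impala, the same query would instead return `NDV()` estimates, which this stand-in does not model.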

http://git-wip-us.apache.org/repos/asf/impala/blob/dcc4024b/docs/topics/impala_langref_unsupported.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_langref_unsupported.xml b/docs/topics/impala_langref_unsupported.xml
index 8f46cec..a7b7d65 100644
--- a/docs/topics/impala_langref_unsupported.xml
+++ b/docs/topics/impala_langref_unsupported.xml
@@ -105,12 +105,6 @@ under the License.
           rather than the <codeph>EXPLODE()</codeph> keyword.
           See <xref href="impala_complex_types.xml#complex_types"/> for details about Impala support for complex types.
         </li>
-
-        <li>
-          Multiple <codeph>DISTINCT</codeph> clauses per query, although Impala includes some workarounds for this
-          limitation.
-          <note conref="../shared/impala_common.xml#common/multiple_count_distinct"/>
-        </li>
       </ul>
 
       <p>

http://git-wip-us.apache.org/repos/asf/impala/blob/dcc4024b/docs/topics/impala_select.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_select.xml b/docs/topics/impala_select.xml
index 0253712..7b516a7 100644
--- a/docs/topics/impala_select.xml
+++ b/docs/topics/impala_select.xml
@@ -108,12 +108,6 @@ table_reference := { <varname>table_name</varname> | (<varname>subquery</varname
       </li>
 
       <li>
-        By default, one <codeph>DISTINCT</codeph> clause per query. See <xref href="impala_distinct.xml#distinct"/>
-        for details. See <xref href="impala_appx_count_distinct.xml#appx_count_distinct"/> for a query option to
-        allow multiple <codeph>COUNT(DISTINCT)</codeph> expressions in the same query.
-      </li>
-
-      <li>
         Subqueries in a <codeph>FROM</codeph> clause. In <keyword keyref="impala20_full"/> and higher,
         subqueries can also go in the <codeph>WHERE</codeph> clause, for example with the
         <codeph>IN()</codeph>, <codeph>EXISTS</codeph>, and <codeph>NOT EXISTS</codeph> operators.

[3/4] impala git commit: IMPALA-7765: [DOCS] Document IMPALA_MAX_MEM_ESTIMATE_FOR_ADMISSION option

Posted by ar...@apache.org.

IMPALA-7765: [DOCS] Document IMPALA_MAX_MEM_ESTIMATE_FOR_ADMISSION option

Change-Id: Ibef89c98530c6974dc791666cc51c1ded52e7910
Reviewed-on: http://gerrit.cloudera.org:8080/11804
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-by: Tim Armstrong <ta...@cloudera.com>


Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/f7794cf2
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/f7794cf2
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/f7794cf2

Branch: refs/heads/master
Commit: f7794cf2280ec9742c47d8b425751ad92a25c675
Parents: d4c0ce3
Author: Alex Rodoni <ar...@cloudera.com>
Authored: Fri Oct 26 15:56:06 2018 -0700
Committer: Alex Rodoni <ar...@cloudera.com>
Committed: Wed Oct 31 01:05:04 2018 +0000

----------------------------------------------------------------------
 docs/impala.ditamap                             |  1 +
 .../impala_max_mem_estimate_for_admission.xml   | 89 ++++++++++++++++++++
 2 files changed, 90 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/impala/blob/f7794cf2/docs/impala.ditamap
----------------------------------------------------------------------
diff --git a/docs/impala.ditamap b/docs/impala.ditamap
index 373b92d..d1d09cc 100644
--- a/docs/impala.ditamap
+++ b/docs/impala.ditamap
@@ -200,6 +200,7 @@ under the License.
           <topicref href="topics/impala_live_progress.xml"/>
           <topicref href="topics/impala_live_summary.xml"/>
           <topicref href="topics/impala_max_errors.xml"/>
+          <topicref rev="3.1 IMPALA-6847" href="topics/impala_max_mem_estimate_for_admission.xml"/>
           <topicref rev="2.10.0 IMPALA-3200" href="topics/impala_max_row_size.xml"/>
           <topicref rev="2.5.0" href="topics/impala_max_num_runtime_filters.xml"/>
           <topicref href="topics/impala_max_scan_range_length.xml"/>

http://git-wip-us.apache.org/repos/asf/impala/blob/f7794cf2/docs/topics/impala_max_mem_estimate_for_admission.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_max_mem_estimate_for_admission.xml b/docs/topics/impala_max_mem_estimate_for_admission.xml
new file mode 100644
index 0000000..ee5136d
--- /dev/null
+++ b/docs/topics/impala_max_mem_estimate_for_admission.xml
@@ -0,0 +1,89 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="max_mem_estimate_for_admission">
+
+  <title>MAX_MEM_ESTIMATE_FOR_ADMISSION Query Option</title>
+
+  <titlealts audience="PDF">
+
+    <navtitle>MAX_MEM_ESTIMATE_FOR_ADMISSION</navtitle>
+
+  </titlealts>
+
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Impala Query Options"/>
+      <data name="Category" value="Querying"/>
+      <data name="Category" value="Developers"/>
+      <data name="Category" value="Data Analysts"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      Use the <codeph>MAX_MEM_ESTIMATE_FOR_ADMISSION</codeph> query option to set an upper limit
+      on the memory estimate of a query, as a workaround for over-estimates that prevent the
+      query from being admitted.
+    </p>
+
+    <p>
+      The query option takes effect when all of the below conditions are met:
+    </p>
+
+    <ul>
+      <li>
+        Memory-based admission control is enabled for the pool.
+      </li>
+
+      <li>
+        The <codeph>MEM_LIMIT</codeph> query option is not set at the query, session, resource
+        pool, or global level.
+      </li>
+    </ul>
+
+    <p>
+      When the above conditions are met, MIN(<codeph>MAX_MEM_ESTIMATE_FOR_ADMISSION</codeph>,
+      mem_estimate) is used for admission control.
+    </p>
+
+    <p>
+      Setting the <codeph>MEM_LIMIT</codeph> query option is usually a better option. Use the
+      <codeph>MAX_MEM_ESTIMATE_FOR_ADMISSION</codeph> query option when it is not feasible to
+      set <codeph>MEM_LIMIT</codeph> for each individual query.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/type_integer"/>
+
+    <p conref="../shared/impala_common.xml#common/default_blurb"/>
+
+    <p>
+      <b>Added in:</b> <keyword keyref="impala31"/>
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+    <p conref="../shared/impala_common.xml#common/related_info"/>
+
+  </conbody>
+
+</concept>
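
The rule described in the new topic, MIN(MAX_MEM_ESTIMATE_FOR_ADMISSION, mem_estimate) under the two listed conditions, can be sketched as below. This is a hedged illustration rather than Impala's admission controller: the function name and its parameters are hypothetical, and because the topic does not say which value is used when the conditions are not met, the sketch returns `None` in that case.

```python
from typing import Optional


def capped_estimate(
    mem_estimate: int,
    max_mem_estimate_for_admission: int,
    *,
    memory_admission_enabled: bool,
    mem_limit_set: bool,
) -> Optional[int]:
    """Return the estimate used for admission when the option takes effect.

    Hypothetical helper: returns None when the option does not apply,
    since the topic does not describe that case.
    """
    # Condition 1: memory-based admission control is enabled for the pool.
    # Condition 2: MEM_LIMIT is not set at any level.
    if not memory_admission_enabled or mem_limit_set:
        return None
    return min(max_mem_estimate_for_admission, mem_estimate)


GIB = 1024 ** 3

# An over-estimated 10 GiB query capped at 4 GiB for admission purposes.
capped = capped_estimate(
    10 * GIB, 4 * GIB, memory_admission_enabled=True, mem_limit_set=False
)
# With MEM_LIMIT set, the option does not take effect.
not_applied = capped_estimate(
    10 * GIB, 4 * GIB, memory_admission_enabled=True, mem_limit_set=True
)
print(capped, not_applied)
```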

[4/4] impala git commit: IMPALA-7614: [DOCS] Document the New Invalidate Options

Posted by ar...@apache.org.

IMPALA-7614: [DOCS] Document the New Invalidate Options

--invalidate_tables_timeout_s
--invalidate_tables_on_memory_pressure

Change-Id: I40c552eeaee81ee6528d9f725bd416b51d8ab837
Reviewed-on: http://gerrit.cloudera.org:8080/11809
Reviewed-by: Tianyi Wang <tw...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/01f60d93
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/01f60d93
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/01f60d93

Branch: refs/heads/master
Commit: 01f60d9389a52453c346d11e36a6bce6ed0d2fcd
Parents: f7794cf
Author: Alex Rodoni <ar...@cloudera.com>
Authored: Thu Oct 25 19:02:45 2018 -0700
Committer: Alex Rodoni <ar...@cloudera.com>
Committed: Wed Oct 31 01:06:23 2018 +0000

----------------------------------------------------------------------
 docs/topics/impala_config_options.xml | 69 +++++++++++++++---------------
 1 file changed, 35 insertions(+), 34 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/impala/blob/01f60d93/docs/topics/impala_config_options.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_config_options.xml b/docs/topics/impala_config_options.xml
index 2f8a83c..7dc1add 100644
--- a/docs/topics/impala_config_options.xml
+++ b/docs/topics/impala_config_options.xml
@@ -262,30 +262,6 @@ Starting Impala Catalog Server:                            [  OK  ]</codeblock>
 
     </conbody>
 
-    <concept audience="hidden" id="config_options_statestored_details">
-
-      <title>Configuration Options for statestored Daemon</title>
-
-      <conbody>
-
-        <p></p>
-
-      </conbody>
-
-    </concept>
-
-    <concept audience="hidden" id="config_options_catalogd_details">
-
-      <title>Configuration Options for catalogd Daemon</title>
-
-      <conbody>
-
-        <p></p>
-
-      </conbody>
-
-    </concept>
-
   </concept>
 
   <concept id="config_options_checking">
@@ -348,11 +324,10 @@ Starting Impala Catalog Server:                            [  OK  ]</codeblock>
 
     <conbody>
 
-      <p>
-        The <cmdname>statestored</cmdname> daemon implements the Impala statestore service,
-        which monitors the availability of Impala services across the cluster, and handles
-        situations such as nodes becoming unavailable or becoming available again.
-      </p>
+      <p> The <cmdname>statestored</cmdname> daemon implements the Impala
+        StateStore service, which monitors the availability of Impala services
+        across the cluster, and handles situations such as nodes becoming
+        unavailable or becoming available again. </p>
 
     </conbody>
 
@@ -364,16 +339,42 @@ Starting Impala Catalog Server:                            [  OK  ]</codeblock>
 
     <conbody>
 
-      <p>
-        The <cmdname>catalogd</cmdname> daemon implements the Impala catalog service, which
-        broadcasts metadata changes to all the Impala nodes when Impala creates a table, inserts
-        data, or performs other kinds of DDL and DML operations.
-      </p>
+      <p> The <cmdname>catalogd</cmdname> daemon implements the Impala Catalog
+        service, which broadcasts metadata changes to all the Impala nodes when
+        Impala creates a table, inserts data, or performs other kinds of DDL and
+        DML operations. </p>
 
       <p conref="../shared/impala_common.xml#common/load_catalog_in_background"/>
 
     </conbody>
 
   </concept>
+  <concept id="auto_invalidate_metadata">
+    <title>Startup Options for Automatic Invalidation of Metadata</title>
+    <conbody>
+      <p>To keep the size of metadata small, <codeph>catalogd</codeph>
+        periodically scans all the tables and invalidates those not recently
+        used. Two startup flags control this behavior in
+        <codeph>catalogd</codeph>.</p>
+      <ul>
+        <li>Time-based invalidation with the
+            <codeph>--invalidate_tables_timeout_s</codeph> flag:
+            <codeph>catalogd</codeph> invalidates tables that have not been
+          used within the specified time period. This flag needs to be applied to
+          both <codeph>impalad</codeph> and <codeph>catalogd</codeph>.</li>
+        <li>Java garbage collection-based invalidation with the
+            <codeph>--invalidate_tables_on_memory_pressure</codeph> flag: When
+          the memory pressure is high after a Java garbage collection in
+            <codeph>catalogd</codeph>, Impala invalidates a certain fraction of
+          the least recently used tables. This flag needs to be applied to both
+            <codeph>impalad</codeph> and <codeph>catalogd</codeph>.</li>
+      </ul>
+      <p>Automatic invalidation of metadata provides more stability with lower
+        chances of running out of memory, but the feature could potentially
+        degrade performance.</p>
+      <note>This is a preview feature in Impala 3.1 and not generally
+        available.</note>
+    </conbody>
+  </concept>
 
 </concept>
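
The two invalidation policies described in the new section can be sketched conceptually as follows. This is an illustration under stated assumptions, not catalogd's implementation: the function names, the `last_used` mapping of table name to last-access time, and the `fraction` parameter are all hypothetical, and the text above only says that "a certain fraction" of the least recently used tables is invalidated.

```python
from typing import Dict, List


def invalidate_by_timeout(
    last_used: Dict[str, float], timeout_s: float, now: float
) -> List[str]:
    """Time-based policy: tables not used within the last timeout_s seconds."""
    return sorted(name for name, ts in last_used.items() if now - ts > timeout_s)


def invalidate_on_memory_pressure(
    last_used: Dict[str, float], fraction: float
) -> List[str]:
    """Memory-pressure policy: a fraction of the least recently used tables."""
    by_age = sorted(last_used, key=last_used.get)  # least recently used first
    count = int(len(by_age) * fraction)
    return by_age[:count]


# Made-up last-access times, in seconds on an arbitrary clock.
last_used = {"sales": 100.0, "logs": 900.0, "users": 950.0, "tmp": 400.0}

stale = invalidate_by_timeout(last_used, timeout_s=300.0, now=1000.0)
evicted = invalidate_on_memory_pressure(last_used, fraction=0.25)
print(stale, evicted)
```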

[2/4] impala git commit: IMPALA-7743: [DOCS] A new option to load incremental statistics from catalog

Posted by ar...@apache.org.

IMPALA-7743: [DOCS] A new option to load incremental statistics from catalog

--pull_incremental_statistics described in the Incremental Stats section.

Change-Id: I8fd9b88138350406065df2f39a48043178759949
Reviewed-on: http://gerrit.cloudera.org:8080/11790
Reviewed-by: Greg Rahn <gr...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/d4c0ce32
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/d4c0ce32
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/d4c0ce32

Branch: refs/heads/master
Commit: d4c0ce32a67a3f8d7fd4b8e92e42f6d4567d8db2
Parents: dcc4024
Author: Alex Rodoni <ar...@cloudera.com>
Authored: Thu Oct 25 11:32:27 2018 -0700
Committer: Alex Rodoni <ar...@cloudera.com>
Committed: Wed Oct 31 00:25:45 2018 +0000

----------------------------------------------------------------------
 docs/shared/impala_common.xml     |  16 ++---
 docs/topics/impala_perf_stats.xml | 106 ++++++++++++++++++++-------------
 2 files changed, 75 insertions(+), 47 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/impala/blob/d4c0ce32/docs/shared/impala_common.xml
----------------------------------------------------------------------
diff --git a/docs/shared/impala_common.xml b/docs/shared/impala_common.xml
index a45f802..8b79596 100644
--- a/docs/shared/impala_common.xml
+++ b/docs/shared/impala_common.xml
@@ -1422,13 +1422,15 @@ drop database temp;
         for the first time on a given table.
       </p>
 
-      <p id="incremental_stats_caveats">
-        For a table with a huge number of partitions and many columns, the approximately 400 bytes
-        of metadata per column per partition can add up to significant memory overhead, as it must
-        be cached on the <cmdname>catalogd</cmdname> host and on every <cmdname>impalad</cmdname> host
-        that is eligible to be a coordinator. If this metadata for all tables combined exceeds 2 GB,
-        you might experience service downtime.
-      </p>
+      <p id="incremental_stats_caveats"> In Impala 3.0 and lower, approximately
+        400 bytes of metadata per column per partition are needed for caching.
+        For tables with a large number of partitions and many columns, this can
+        add up to significant memory overhead, as the metadata must be cached on the
+          <cmdname>catalogd</cmdname> host and on every
+          <cmdname>impalad</cmdname> host that is eligible to be a coordinator.
+        If this metadata for all tables exceeds 2 GB, you might experience
+        service downtime. In Impala 3.1 and higher, the issue is alleviated
+        by improved handling of incremental stats.</p>
 
       <p id="incremental_partition_spec">
         The <codeph>PARTITION</codeph> clause is only allowed in combination with the <codeph>INCREMENTAL</codeph>
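
As a rough worked example of the figure in the caveat above (approximately 400 bytes of metadata per column per partition in Impala 3.0 and lower), the overhead for one table can be estimated as below. The helper name and the example table shape are hypothetical, the 400-byte figure is only approximate, and 2 GB is read here as 2 * 1024^3 bytes, which is an assumption.

```python
BYTES_PER_COLUMN_PER_PARTITION = 400  # approximate figure from the caveat
TWO_GB = 2 * 1024 ** 3                # assumption: 2 GB read as 2 GiB


def incremental_stats_bytes(num_partitions: int, num_columns: int) -> int:
    """Approximate incremental-stats metadata size for one table, in bytes."""
    return num_partitions * num_columns * BYTES_PER_COLUMN_PER_PARTITION


# Hypothetical table: 20,000 partitions and 100 columns.
estimate = incremental_stats_bytes(20_000, 100)
share_of_limit = estimate / TWO_GB

print(f"{estimate} bytes, {share_of_limit:.1%} of 2 GB")
```

A single table of this shape would already account for more than a third of the 2 GB total, which is why the caveat is stated for all tables combined.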

http://git-wip-us.apache.org/repos/asf/impala/blob/d4c0ce32/docs/topics/impala_perf_stats.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_perf_stats.xml b/docs/topics/impala_perf_stats.xml
index 15a00f7..861aba3 100644
--- a/docs/topics/impala_perf_stats.xml
+++ b/docs/topics/impala_perf_stats.xml
@@ -581,8 +581,10 @@ show column stats year_month_day;
         </p>
 
         <note type="important">
-          <p conref="../shared/impala_common.xml#common/incremental_stats_caveats"/>
-          <p conref="../shared/impala_common.xml#common/incremental_stats_after_full"/>
+          <p
+            conref="../shared/impala_common.xml#common/incremental_stats_caveats"/>
+          <p
+            conref="../shared/impala_common.xml#common/incremental_stats_after_full"/>
         </note>
 
         <p>
@@ -629,12 +631,13 @@ show column stats year_month_day;
           <li>
             <p>
               <codeph>COMPUTE INCREMENTAL STATS</codeph> uses some memory in the
-              <cmdname>catalogd</cmdname> process, proportional to the number of partitions and
-              number of columns in the applicable table. The memory overhead is approximately 400
-              bytes for each column in each partition. This memory is reserved in the
-              <cmdname>catalogd</cmdname> daemon, the <cmdname>statestored</cmdname> daemon, and
-              in each instance of the <cmdname>impalad</cmdname> daemon.
-            </p>
+                <cmdname>catalogd</cmdname> process, proportional to the number
+              of partitions and number of columns in the applicable table. The
+              memory overhead is approximately 400 bytes for each column in each
+              partition. This memory is reserved in the
+                <cmdname>catalogd</cmdname> daemon, the
+                <cmdname>statestored</cmdname> daemon, and in each instance of
+              the impalad daemon. </p>
           </li>
 
           <li>
@@ -705,42 +708,66 @@ show column stats year_month_day;
       <concept id="inc_stats_size_limit_bytes">
         <title>Maximum Serialized Stats Size</title>
         <conbody>
-          <p>
-            When executing <codeph>COMPUTE INCREMENTAL STATS</codeph> on
-            very large tables, use the configuration setting
-              <codeph>inc_stats_size_limit_bytes</codeph> to prevent Impala from
-            running out of memory while updating table metadata. If this limit
-            is reached, Impala will stop loading the table and return an error.
-            The error serves as an indication that <codeph>COMPUTE INCREMENTAL
-              STATS</codeph> should not be used on the particular table.
-            Consider splitting the table and using regular <codeph>COMPUTE
-              STATS</codeph> if possible.
-          </p>
-
-          <p>
-            The <codeph>inc_stats_size_limit_bytes</codeph> limit is set as a
-            safety check, to prevent Impala from hitting the maximum limit for
+          <p>In Impala 3.0 and lower, when executing <codeph>COMPUTE INCREMENTAL
+              STATS</codeph> on very large tables, use the configuration setting
+              <codeph>--inc_stats_size_limit_bytes</codeph> to prevent Impala
+            from running out of memory while updating table metadata. If this
+            limit is reached, Impala will stop loading the table and return an
+            error. The error serves as an indication that <codeph>COMPUTE
+              INCREMENTAL STATS</codeph> should not be used on the particular
+            table. Consider splitting the table and using regular <codeph>COMPUTE
+              STATS</codeph> if possible. </p>
+
+          <p> The <codeph>--inc_stats_size_limit_bytes</codeph> limit is set as
+            a safety check, to prevent Impala from hitting the maximum limit for
             the table metadata. Note that this limit is only one part of the
-            entire table's metadata all of which together must be below 2 GB.
-          </p>
+            entire table's metadata, all of which together must be below 2 GB. </p>
 
-          <p>
-            The default value for <codeph>inc_stats_size_limit_bytes</codeph>
-            is 209715200, 200 MB.
-          </p>
+          <p> The default value for
+              <codeph>--inc_stats_size_limit_bytes</codeph> is 209715200, 200
+            MB. </p>
 
-          <p> To change the <codeph>inc_stats_size_limit_bytes</codeph> value,
-            restart <codeph>impalad</codeph> and <codeph>catalogd</codeph> with
-            the new value specified in bytes, for example, 1048576000 for 1 GB.
-            See <xref href="impala_config_options.xml#config_options"/> for the
-            steps to change the option and restart Impala daemons. </p>
+          <p> To change the <codeph>--inc_stats_size_limit_bytes</codeph> value,
+            restart impalad and catalogd with the new value specified in bytes,
+            for example, 1048576000 for 1 GB. See <xref
+              href="impala_config_options.xml#config_options"/> for the steps to
+            change the option and restart Impala daemons. </p>
 
-          <note type="attention">
-            The <codeph>inc_stats_size_limit_bytes</codeph> setting should be
+          <note type="attention"> The
+              <codeph>--inc_stats_size_limit_bytes</codeph> setting should be
             increased with care. A big value for the setting, such as 1 GB or
             more, can result in a spike in heap usage as well as a crash of
-            Impala.
-          </note>
+            Impala. </note>
+          <p>In Impala 3.1 and higher, Impala improved how metadata is updated
+            when executing <codeph>COMPUTE INCREMENTAL STATS</codeph>,
+            significantly reducing the need for
+              <codeph>--inc_stats_size_limit_bytes</codeph>. </p>
+        </conbody>
+      </concept>
+      <concept id="pull_incremental_statistics">
+        <title>Loading Incremental Statistics from Catalogd</title>
+        <conbody>
+          <p>
+            Starting in Impala 3.1, a new configuration setting,
+              <codeph>--pull_incremental_statistics</codeph>, was added and set
+            to <codeph>true</codeph> by default. When you start Impala catalogd
+            and impalad coordinators with this setting enabled:
+          </p>
+          <ul>
+            <li> Newly created incremental stats will be smaller in size, thus
+              reducing memory pressure on the catalogd daemon. Your users can
+              keep more tables and partitions in the same catalog and have lower
+              chances of crashing catalogd due to out-of-memory issues. </li>
+            <li>
+              Incremental stats will not be replicated to impalad and will be
+              accessed on demand from catalogd, resulting in a reduced memory
+              footprint of impalad.
+            </li>
+          </ul>
+          <p>
+            We do not recommend you change the default setting of
+              <codeph>--pull_incremental_statistics</codeph>.
+          </p>
         </conbody>
       </concept>
 
@@ -980,8 +1007,7 @@ alter table <varname>table_name</varname> partition (<varname>keycol1</varname>=
           frequently enough to keep up with data changes for a huge table.
         </p>
 
-        <p conref="../shared/impala_common.xml#common/set_column_stats_example"
-        />
+        <p conref="../shared/impala_common.xml#common/set_column_stats_example"/>
 
       </conbody>