Posted to commits@impala.apache.org by ar...@apache.org on 2018/10/31 18:34:36 UTC
[1/4] impala git commit: IMPALA-7687: [DOCS] Support for multiple
DISTINCT in a query
Repository: impala
Updated Branches:
refs/heads/master 85166afa8 -> 01f60d938
IMPALA-7687: [DOCS] Support for multiple DISTINCT in a query
- Removed notes about the single DISTINCT restriction.
- Rewrote the description for the APPX_COUNT_DISTINCT query option.
Change-Id: I3a6e664b016e9408a3ff809f1811253a91764481
Reviewed-on: http://gerrit.cloudera.org:8080/11823
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-by: Thomas Marshall <th...@cmu.edu>
Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/dcc4024b
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/dcc4024b
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/dcc4024b
Branch: refs/heads/master
Commit: dcc4024b1d13631ec57e0dcd3dddb461c918cb1b
Parents: 85166af
Author: Alex Rodoni <ar...@cloudera.com>
Authored: Mon Oct 29 17:33:30 2018 -0700
Committer: Alex Rodoni <ar...@cloudera.com>
Committed: Tue Oct 30 23:55:25 2018 +0000
----------------------------------------------------------------------
docs/shared/impala_common.xml | 27 ----------
docs/topics/impala_appx_count_distinct.xml | 65 ++++++-------------------
docs/topics/impala_count.xml | 2 -
docs/topics/impala_distinct.xml | 38 +++++++--------
docs/topics/impala_langref_unsupported.xml | 6 ---
docs/topics/impala_select.xml | 6 ---
6 files changed, 33 insertions(+), 111 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/impala/blob/dcc4024b/docs/shared/impala_common.xml
----------------------------------------------------------------------
diff --git a/docs/shared/impala_common.xml b/docs/shared/impala_common.xml
index cb4be6c..a45f802 100644
--- a/docs/shared/impala_common.xml
+++ b/docs/shared/impala_common.xml
@@ -2117,33 +2117,6 @@ show functions in _impala_builtins like '*<varname>substring</varname>*';
<codeph>--insert_inherit_permissions</codeph> startup option for the <cmdname>impalad</cmdname> daemon.
</p>
- <note id="multiple_count_distinct">
- <p>
- By default, Impala only allows a single <codeph>COUNT(DISTINCT <varname>columns</varname>)</codeph>
- expression in each query.
- </p>
- <p>
- If you do not need precise accuracy, you can produce an estimate of the distinct values for a column by
- specifying <codeph>NDV(<varname>column</varname>)</codeph>; a query can contain multiple instances of
- <codeph>NDV(<varname>column</varname>)</codeph>. To make Impala automatically rewrite
- <codeph>COUNT(DISTINCT)</codeph> expressions to <codeph>NDV()</codeph>, enable the
- <codeph>APPX_COUNT_DISTINCT</codeph> query option.
- </p>
- <p>
- To produce the same result as multiple <codeph>COUNT(DISTINCT)</codeph> expressions, you can use the
- following technique for queries involving a single table:
- </p>
-<codeblock xml:space="preserve">select v1.c1 result1, v2.c1 result2 from
- (select count(distinct col1) as c1 from t1) v1
- cross join
- (select count(distinct col2) as c1 from t1) v2;
-</codeblock>
- <p>
- Because <codeph>CROSS JOIN</codeph> is an expensive operation, prefer to use the <codeph>NDV()</codeph>
- technique wherever practical.
- </p>
- </note>
-
<p>
<ph id="union_all_vs_union">Prefer <codeph>UNION ALL</codeph> over <codeph>UNION</codeph> when you know the
data sets are disjoint or duplicate values are not a problem; <codeph>UNION ALL</codeph> is more efficient
http://git-wip-us.apache.org/repos/asf/impala/blob/dcc4024b/docs/topics/impala_appx_count_distinct.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_appx_count_distinct.xml b/docs/topics/impala_appx_count_distinct.xml
index 8655968..28544e0 100644
--- a/docs/topics/impala_appx_count_distinct.xml
+++ b/docs/topics/impala_appx_count_distinct.xml
@@ -21,7 +21,13 @@ under the License.
<concept rev="2.0.0" id="appx_count_distinct">
<title>APPX_COUNT_DISTINCT Query Option (<keyword keyref="impala20"/> or higher only)</title>
- <titlealts audience="PDF"><navtitle>APPX_COUNT_DISTINCT</navtitle></titlealts>
+
+ <titlealts audience="PDF">
+
+ <navtitle>APPX_COUNT_DISTINCT</navtitle>
+
+ </titlealts>
+
<prolog>
<metadata>
<data name="Category" value="Impala"/>
@@ -35,65 +41,26 @@ under the License.
<conbody>
<p rev="2.0.0">
- <indexterm audience="hidden">APPX_COUNT_DISTINCT query option</indexterm>
- Allows multiple <codeph>COUNT(DISTINCT)</codeph> operations within a single query, by internally rewriting
- each <codeph>COUNT(DISTINCT)</codeph> to use the <codeph>NDV()</codeph> function. The resulting count is
- approximate rather than precise.
+ When the <codeph>APPX_COUNT_DISTINCT</codeph> query option is set to
+ <codeph>TRUE</codeph>, Impala implicitly converts <codeph>COUNT(DISTINCT)</codeph>
+ operations to <codeph>NDV()</codeph> function calls. The resulting count is
+ approximate rather than precise. Enable the query option when some error is
+ acceptable in exchange for faster results than exact
+ <codeph>COUNT(DISTINCT)</codeph> queries provide.
</p>
<p conref="../shared/impala_common.xml#common/type_boolean"/>
<p conref="../shared/impala_common.xml#common/default_false_0"/>
- <p conref="../shared/impala_common.xml#common/example_blurb"/>
-
- <p>
- The following examples show how the <codeph>APPX_COUNT_DISTINCT</codeph> lets you work around the restriction
- where a query can only evaluate <codeph>COUNT(DISTINCT <varname>col_name</varname>)</codeph> for a single
- column. By default, you can count the distinct values of one column or another, but not both in a single
- query:
- </p>
-
-<codeblock>[localhost:21000] > select count(distinct x) from int_t;
-+-------------------+
-| count(distinct x) |
-+-------------------+
-| 10 |
-+-------------------+
-[localhost:21000] > select count(distinct property) from int_t;
-+--------------------------+
-| count(distinct property) |
-+--------------------------+
-| 7 |
-+--------------------------+
-[localhost:21000] > select count(distinct x), count(distinct property) from int_t;
-ERROR: AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters
-as count(DISTINCT x); deviating function: count(DISTINCT property)
-</codeblock>
-
- <p>
- When you enable the <codeph>APPX_COUNT_DISTINCT</codeph> query option, now the query with multiple
- <codeph>COUNT(DISTINCT)</codeph> works. The reason this behavior requires a query option is that each
- <codeph>COUNT(DISTINCT)</codeph> is rewritten internally to use the <codeph>NDV()</codeph> function instead,
- which provides an approximate result rather than a precise count.
- </p>
-
-<codeblock>[localhost:21000] > set APPX_COUNT_DISTINCT=true;
-[localhost:21000] > select count(distinct x), count(distinct property) from int_t;
-+-------------------+--------------------------+
-| count(distinct x) | count(distinct property) |
-+-------------------+--------------------------+
-| 10 | 7 |
-+-------------------+--------------------------+
-</codeblock>
-
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_count.xml#count"/>,
- <xref href="impala_distinct.xml#distinct"/>,
- <xref href="impala_ndv.xml#ndv"/>
+ <xref
+ href="impala_distinct.xml#distinct"/>, <xref href="impala_ndv.xml#ndv"/>
</p>
</conbody>
+
</concept>
http://git-wip-us.apache.org/repos/asf/impala/blob/dcc4024b/docs/topics/impala_count.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_count.xml b/docs/topics/impala_count.xml
index 59180c7..d489c6d 100644
--- a/docs/topics/impala_count.xml
+++ b/docs/topics/impala_count.xml
@@ -242,8 +242,6 @@ ERROR: AnalysisException: RANGE is only supported with both the lower and upper
</codeblock>
</p>
- <note conref="../shared/impala_common.xml#common/multiple_count_distinct"/>
-
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
http://git-wip-us.apache.org/repos/asf/impala/blob/dcc4024b/docs/topics/impala_distinct.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_distinct.xml b/docs/topics/impala_distinct.xml
index 710ea0c..1a5a947 100644
--- a/docs/topics/impala_distinct.xml
+++ b/docs/topics/impala_distinct.xml
@@ -21,6 +21,7 @@ under the License.
<concept id="distinct">
<title>DISTINCT Operator</title>
+
<prolog>
<metadata>
<data name="Category" value="Impala"/>
@@ -35,45 +36,40 @@ under the License.
<conbody>
<p>
- <indexterm audience="hidden">DISTINCT operator</indexterm>
- The <codeph>DISTINCT</codeph> operator in a <codeph>SELECT</codeph> statement filters the result set to
- remove duplicates:
+ The <codeph>DISTINCT</codeph> operator in a <codeph>SELECT</codeph> statement filters the
+ result set to remove duplicates.
</p>
<codeblock>-- Returns the unique values from one column.
-- NULL is included in the set of values if any rows have a NULL in this column.
-select distinct c_birth_country from customer;
+SELECT DISTINCT c_birth_country FROM customer;
+
-- Returns the unique combinations of values from multiple columns.
-select distinct c_salutation, c_last_name from customer;</codeblock>
+SELECT DISTINCT c_salutation, c_last_name FROM customer;</codeblock>
<p>
- You can use <codeph>DISTINCT</codeph> in combination with an aggregation function, typically
- <codeph>COUNT()</codeph>, to find how many different values a column contains:
+ You can use <codeph>DISTINCT</codeph> in combination with an aggregation function,
+ typically <codeph>COUNT()</codeph>, to find how many different values a column contains.
</p>
<codeblock>-- Counts the unique values from one column.
-- NULL is not included as a distinct value in the count.
-select count(distinct c_birth_country) from customer;
--- Counts the unique combinations of values from multiple columns.
-select count(distinct c_salutation, c_last_name) from customer;</codeblock>
+SELECT COUNT(DISTINCT c_birth_country) FROM customer;
- <p>
- One construct that Impala SQL does <i>not</i> support is using <codeph>DISTINCT</codeph> in more than one
- aggregation function in the same query. For example, you could not have a single query with both
- <codeph>COUNT(DISTINCT c_first_name)</codeph> and <codeph>COUNT(DISTINCT c_last_name)</codeph> in the
- <codeph>SELECT</codeph> list.
- </p>
+-- Counts the unique combinations of values from multiple columns.
+SELECT COUNT(DISTINCT c_salutation, c_last_name) FROM customer;</codeblock>
<p conref="../shared/impala_common.xml#common/zero_length_strings"/>
- <note conref="../shared/impala_common.xml#common/multiple_count_distinct"/>
-
<note>
<p>
- In contrast with some database systems that always return <codeph>DISTINCT</codeph> values in sorted order,
- Impala does not do any ordering of <codeph>DISTINCT</codeph> values. Always include an <codeph>ORDER
- BY</codeph> clause if you need the values in alphabetical or numeric sorted order.
+ In contrast with some database systems that always return <codeph>DISTINCT</codeph>
+ values in sorted order, Impala does not do any ordering of <codeph>DISTINCT</codeph>
+ values. Always include an <codeph>ORDER BY</codeph> clause if you need the values in
+ alphabetical or numeric sorted order.
</p>
</note>
+
</conbody>
+
</concept>
http://git-wip-us.apache.org/repos/asf/impala/blob/dcc4024b/docs/topics/impala_langref_unsupported.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_langref_unsupported.xml b/docs/topics/impala_langref_unsupported.xml
index 8f46cec..a7b7d65 100644
--- a/docs/topics/impala_langref_unsupported.xml
+++ b/docs/topics/impala_langref_unsupported.xml
@@ -105,12 +105,6 @@ under the License.
rather than the <codeph>EXPLODE()</codeph> keyword.
See <xref href="impala_complex_types.xml#complex_types"/> for details about Impala support for complex types.
</li>
-
- <li>
- Multiple <codeph>DISTINCT</codeph> clauses per query, although Impala includes some workarounds for this
- limitation.
- <note conref="../shared/impala_common.xml#common/multiple_count_distinct"/>
- </li>
</ul>
<p>
http://git-wip-us.apache.org/repos/asf/impala/blob/dcc4024b/docs/topics/impala_select.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_select.xml b/docs/topics/impala_select.xml
index 0253712..7b516a7 100644
--- a/docs/topics/impala_select.xml
+++ b/docs/topics/impala_select.xml
@@ -108,12 +108,6 @@ table_reference := { <varname>table_name</varname> | (<varname>subquery</varname
</li>
<li>
- By default, one <codeph>DISTINCT</codeph> clause per query. See <xref href="impala_distinct.xml#distinct"/>
- for details. See <xref href="impala_appx_count_distinct.xml#appx_count_distinct"/> for a query option to
- allow multiple <codeph>COUNT(DISTINCT)</codeph> impressions in the same query.
- </li>
-
- <li>
Subqueries in a <codeph>FROM</codeph> clause. In <keyword keyref="impala20_full"/> and higher,
subqueries can also go in the <codeph>WHERE</codeph> clause, for example with the
<codeph>IN()</codeph>, <codeph>EXISTS</codeph>, and <codeph>NOT EXISTS</codeph> operators.
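Note on the behavior documented in this commit: with the single-DISTINCT restriction removed from the docs, a query can presumably combine several COUNT(DISTINCT) expressions directly, and APPX_COUNT_DISTINCT becomes a speed-versus-accuracy switch. The following is a minimal sketch, reusing the customer table and column names that appear in impala_distinct.xml above. The Impala release in which multiple DISTINCT became available is not stated in this diff, so treat that as an assumption.

```sql
-- Exact distinct counts for two different columns in one query.
SELECT COUNT(DISTINCT c_first_name), COUNT(DISTINCT c_last_name) FROM customer;

-- Trade accuracy for speed: each COUNT(DISTINCT) is implicitly converted to NDV().
SET APPX_COUNT_DISTINCT=TRUE;
SELECT COUNT(DISTINCT c_first_name), COUNT(DISTINCT c_last_name) FROM customer;

-- Explicit approximate form, with the query option switched back off.
SET APPX_COUNT_DISTINCT=FALSE;
SELECT NDV(c_first_name), NDV(c_last_name) FROM customer;
```

The second and third queries both return estimates, so their results may differ slightly from the exact counts of the first query.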
[3/4] impala git commit: IMPALA-7765: [DOCS] Document
IMPALA_MAX_MEM_ESTIMATE_FOR_ADMISSION option
Posted by ar...@apache.org.
IMPALA-7765: [DOCS] Document IMPALA_MAX_MEM_ESTIMATE_FOR_ADMISSION option
Change-Id: Ibef89c98530c6974dc791666cc51c1ded52e7910
Reviewed-on: http://gerrit.cloudera.org:8080/11804
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-by: Tim Armstrong <ta...@cloudera.com>
Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/f7794cf2
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/f7794cf2
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/f7794cf2
Branch: refs/heads/master
Commit: f7794cf2280ec9742c47d8b425751ad92a25c675
Parents: d4c0ce3
Author: Alex Rodoni <ar...@cloudera.com>
Authored: Fri Oct 26 15:56:06 2018 -0700
Committer: Alex Rodoni <ar...@cloudera.com>
Committed: Wed Oct 31 01:05:04 2018 +0000
----------------------------------------------------------------------
docs/impala.ditamap | 1 +
.../impala_max_mem_estimate_for_admission.xml | 89 ++++++++++++++++++++
2 files changed, 90 insertions(+)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/impala/blob/f7794cf2/docs/impala.ditamap
----------------------------------------------------------------------
diff --git a/docs/impala.ditamap b/docs/impala.ditamap
index 373b92d..d1d09cc 100644
--- a/docs/impala.ditamap
+++ b/docs/impala.ditamap
@@ -200,6 +200,7 @@ under the License.
<topicref href="topics/impala_live_progress.xml"/>
<topicref href="topics/impala_live_summary.xml"/>
<topicref href="topics/impala_max_errors.xml"/>
+ <topicref rev="3.1 IMPALA-6847" href="topics/impala_max_mem_estimate_for_admission.xml"/>
<topicref rev="2.10.0 IMPALA-3200" href="topics/impala_max_row_size.xml"/>
<topicref rev="2.5.0" href="topics/impala_max_num_runtime_filters.xml"/>
<topicref href="topics/impala_max_scan_range_length.xml"/>
http://git-wip-us.apache.org/repos/asf/impala/blob/f7794cf2/docs/topics/impala_max_mem_estimate_for_admission.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_max_mem_estimate_for_admission.xml b/docs/topics/impala_max_mem_estimate_for_admission.xml
new file mode 100644
index 0000000..ee5136d
--- /dev/null
+++ b/docs/topics/impala_max_mem_estimate_for_admission.xml
@@ -0,0 +1,89 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="max_mem_estimate_for_admission">
+
+ <title>MAX_MEM_ESTIMATE_FOR_ADMISSION Query Option</title>
+
+ <titlealts audience="PDF">
+
+ <navtitle>MAX_MEM_ESTIMATE_FOR_ADMISSION</navtitle>
+
+ </titlealts>
+
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="Impala Query Options"/>
+ <data name="Category" value="Querying"/>
+ <data name="Category" value="Developers"/>
+ <data name="Category" value="Data Analysts"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ Use the <codeph>MAX_MEM_ESTIMATE_FOR_ADMISSION</codeph> query option to set an upper limit
+ on the memory estimates of a query as a workaround for overestimates that prevent the
+ query from being admitted.
+ </p>
+
+ <p>
+ The query option takes effect when all of the following conditions are met:
+ </p>
+
+ <ul>
+ <li>
+ Memory-based admission control is enabled for the pool.
+ </li>
+
+ <li>
+ The <codeph>MEM_LIMIT</codeph> query option is not set at the query, session, resource
+ pool, or global level.
+ </li>
+ </ul>
+
+ <p>
+ When the above conditions are met, MIN(<codeph>MAX_MEM_ESTIMATE_FOR_ADMISSION</codeph>,
+ mem_estimate) is used for admission control.
+ </p>
+
+ <p>
+ Setting the <codeph>MEM_LIMIT</codeph> query option is usually a better option. Use the
+ <codeph>MAX_MEM_ESTIMATE_FOR_ADMISSION</codeph> query option when it is not feasible to
+ set <codeph>MEM_LIMIT</codeph> for each individual query.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/type_integer"/>
+
+ <p conref="../shared/impala_common.xml#common/default_blurb"/>
+
+ <p>
+ <b>Added in:</b> <keyword keyref="impala31"/>
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+ <p conref="../shared/impala_common.xml#common/related_info"/>
+
+ </conbody>
+
+</concept>
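The MIN(MAX_MEM_ESTIMATE_FOR_ADMISSION, mem_estimate) rule in the new topic above can be illustrated with a short session. This is only a sketch: the 50 GB estimate and the table name big_fact are hypothetical, and the value is given in bytes on the assumption that the option takes a plain integer, as the type_integer conref suggests.

```sql
-- Suppose the planner estimates 50 GB for this query (hypothetical figure),
-- which is more than the pool can admit, while the query actually needs far less.
-- Cap the estimate seen by admission control at 30 GB = 30 * 1024^3 bytes.
SET MAX_MEM_ESTIMATE_FOR_ADMISSION=32212254720;

-- Admission control now uses MIN(32212254720, mem_estimate), provided that
-- memory-based admission control is enabled for the pool and MEM_LIMIT is not set.
SELECT COUNT(*) FROM big_fact;
```

As the topic itself notes, setting MEM_LIMIT is usually the better choice. The cap is meant for cases where setting a limit on each individual query is not feasible.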
[4/4] impala git commit: IMPALA-7614: [DOCS] Document the New
Invalidate Options
Posted by ar...@apache.org.
IMPALA-7614: [DOCS] Document the New Invalidate Options
--invalidate_tables_timeout_s
--invalidate_tables_on_memory_pressure
Change-Id: I40c552eeaee81ee6528d9f725bd416b51d8ab837
Reviewed-on: http://gerrit.cloudera.org:8080/11809
Reviewed-by: Tianyi Wang <tw...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/01f60d93
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/01f60d93
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/01f60d93
Branch: refs/heads/master
Commit: 01f60d9389a52453c346d11e36a6bce6ed0d2fcd
Parents: f7794cf
Author: Alex Rodoni <ar...@cloudera.com>
Authored: Thu Oct 25 19:02:45 2018 -0700
Committer: Alex Rodoni <ar...@cloudera.com>
Committed: Wed Oct 31 01:06:23 2018 +0000
----------------------------------------------------------------------
docs/topics/impala_config_options.xml | 69 +++++++++++++++---------------
1 file changed, 35 insertions(+), 34 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/impala/blob/01f60d93/docs/topics/impala_config_options.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_config_options.xml b/docs/topics/impala_config_options.xml
index 2f8a83c..7dc1add 100644
--- a/docs/topics/impala_config_options.xml
+++ b/docs/topics/impala_config_options.xml
@@ -262,30 +262,6 @@ Starting Impala Catalog Server: [ OK ]</codeblock>
</conbody>
- <concept audience="hidden" id="config_options_statestored_details">
-
- <title>Configuration Options for statestored Daemon</title>
-
- <conbody>
-
- <p></p>
-
- </conbody>
-
- </concept>
-
- <concept audience="hidden" id="config_options_catalogd_details">
-
- <title>Configuration Options for catalogd Daemon</title>
-
- <conbody>
-
- <p></p>
-
- </conbody>
-
- </concept>
-
</concept>
<concept id="config_options_checking">
@@ -348,11 +324,10 @@ Starting Impala Catalog Server: [ OK ]</codeblock>
<conbody>
- <p>
- The <cmdname>statestored</cmdname> daemon implements the Impala statestore service,
- which monitors the availability of Impala services across the cluster, and handles
- situations such as nodes becoming unavailable or becoming available again.
- </p>
+ <p> The <cmdname>statestored</cmdname> daemon implements the Impala
+ StateStore service, which monitors the availability of Impala services
+ across the cluster, and handles situations such as nodes becoming
+ unavailable or becoming available again. </p>
</conbody>
@@ -364,16 +339,42 @@ Starting Impala Catalog Server: [ OK ]</codeblock>
<conbody>
- <p>
- The <cmdname>catalogd</cmdname> daemon implements the Impala catalog service, which
- broadcasts metadata changes to all the Impala nodes when Impala creates a table, inserts
- data, or performs other kinds of DDL and DML operations.
- </p>
+ <p> The <cmdname>catalogd</cmdname> daemon implements the Impala Catalog
+ service, which broadcasts metadata changes to all the Impala nodes when
+ Impala creates a table, inserts data, or performs other kinds of DDL and
+ DML operations. </p>
<p conref="../shared/impala_common.xml#common/load_catalog_in_background"/>
</conbody>
</concept>
+ <concept id="auto_invalidate_metadata">
+ <title>Startup Options for Automatic Invalidation of Metadata</title>
+ <conbody>
+ <p>To keep the size of metadata small, <codeph>catalogd</codeph>
+ periodically scans all the tables and invalidates those not recently
+ used. You can configure two types of invalidation in
+ <codeph>catalogd</codeph>.</p>
+ <ul>
+ <li>Time-based invalidation with the
+ <codeph>--invalidate_tables_timeout_s</codeph> flag:
+ <codeph>catalogd</codeph> invalidates tables that have not been
+ used within the specified time period. This flag needs to be applied to
+ both <codeph>impalad</codeph> and <codeph>catalogd</codeph>.</li>
+ <li>Java garbage collection-based invalidation with the
+ <codeph>--invalidate_tables_on_memory_pressure</codeph> flag: When
+ the memory pressure is high after a Java garbage collection in
+ <codeph>catalogd</codeph>, Impala invalidates a certain fraction of
+ the least recently used tables. This flag needs to be applied to both
+ <codeph>impalad</codeph> and <codeph>catalogd</codeph>.</li>
+ </ul>
+ <p>Automatic invalidation of metadata improves stability by lowering the
+ chances of running out of memory, but the feature carries potential
+ performance risks.</p>
+ <note>This is a preview feature in Impala 3.1 and is not generally
+ available.</note>
+ </conbody>
+ </concept>
</concept>
[2/4] impala git commit: IMPALA-7743: [DOCS] A new option to load
incremental statistics from catalog
Posted by ar...@apache.org.
IMPALA-7743: [DOCS] A new option to load incremental statistics from catalog
--pull_incremental_statistics described in the Incremental Stats section.
Change-Id: I8fd9b88138350406065df2f39a48043178759949
Reviewed-on: http://gerrit.cloudera.org:8080/11790
Reviewed-by: Greg Rahn <gr...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/d4c0ce32
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/d4c0ce32
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/d4c0ce32
Branch: refs/heads/master
Commit: d4c0ce32a67a3f8d7fd4b8e92e42f6d4567d8db2
Parents: dcc4024
Author: Alex Rodoni <ar...@cloudera.com>
Authored: Thu Oct 25 11:32:27 2018 -0700
Committer: Alex Rodoni <ar...@cloudera.com>
Committed: Wed Oct 31 00:25:45 2018 +0000
----------------------------------------------------------------------
docs/shared/impala_common.xml | 16 ++---
docs/topics/impala_perf_stats.xml | 106 ++++++++++++++++++++-------------
2 files changed, 75 insertions(+), 47 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/impala/blob/d4c0ce32/docs/shared/impala_common.xml
----------------------------------------------------------------------
diff --git a/docs/shared/impala_common.xml b/docs/shared/impala_common.xml
index a45f802..8b79596 100644
--- a/docs/shared/impala_common.xml
+++ b/docs/shared/impala_common.xml
@@ -1422,13 +1422,15 @@ drop database temp;
for the first time on a given table.
</p>
- <p id="incremental_stats_caveats">
- For a table with a huge number of partitions and many columns, the approximately 400 bytes
- of metadata per column per partition can add up to significant memory overhead, as it must
- be cached on the <cmdname>catalogd</cmdname> host and on every <cmdname>impalad</cmdname> host
- that is eligible to be a coordinator. If this metadata for all tables combined exceeds 2 GB,
- you might experience service downtime.
- </p>
+ <p id="incremental_stats_caveats"> In Impala 3.0 and lower, approximately
+ 400 bytes of metadata per column per partition are needed for caching.
+ Tables with a large number of partitions and many columns can add up to a
+ significant memory overhead, as the metadata must be cached on the
+ <cmdname>catalogd</cmdname> host and on every
+ <cmdname>impalad</cmdname> host that is eligible to be a coordinator.
+ If this metadata for all tables exceeds 2 GB, you might experience
+ service downtime. In Impala 3.1 and higher, the issue is alleviated
+ by improved handling of incremental stats.</p>
<p id="incremental_partition_spec">
The <codeph>PARTITION</codeph> clause is only allowed in combination with the <codeph>INCREMENTAL</codeph>
http://git-wip-us.apache.org/repos/asf/impala/blob/d4c0ce32/docs/topics/impala_perf_stats.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_perf_stats.xml b/docs/topics/impala_perf_stats.xml
index 15a00f7..861aba3 100644
--- a/docs/topics/impala_perf_stats.xml
+++ b/docs/topics/impala_perf_stats.xml
@@ -581,8 +581,10 @@ show column stats year_month_day;
</p>
<note type="important">
- <p conref="../shared/impala_common.xml#common/incremental_stats_caveats"/>
- <p conref="../shared/impala_common.xml#common/incremental_stats_after_full"/>
+ <p
+ conref="../shared/impala_common.xml#common/incremental_stats_caveats"/>
+ <p
+ conref="../shared/impala_common.xml#common/incremental_stats_after_full"/>
</note>
<p>
@@ -629,12 +631,13 @@ show column stats year_month_day;
<li>
<p>
<codeph>COMPUTE INCREMENTAL STATS</codeph> uses some memory in the
- <cmdname>catalogd</cmdname> process, proportional to the number of partitions and
- number of columns in the applicable table. The memory overhead is approximately 400
- bytes for each column in each partition. This memory is reserved in the
- <cmdname>catalogd</cmdname> daemon, the <cmdname>statestored</cmdname> daemon, and
- in each instance of the <cmdname>impalad</cmdname> daemon.
- </p>
+ <cmdname>catalogd</cmdname> process, proportional to the number
+ of partitions and number of columns in the applicable table. The
+ memory overhead is approximately 400 bytes for each column in each
+ partition. This memory is reserved in the
+ <cmdname>catalogd</cmdname> daemon, the
+ <cmdname>statestored</cmdname> daemon, and in each instance of
+ the <cmdname>impalad</cmdname> daemon. </p>
</li>
<li>
@@ -705,42 +708,66 @@ show column stats year_month_day;
<concept id="inc_stats_size_limit_bytes">
<title>Maximum Serialized Stats Size</title>
<conbody>
- <p>
- When executing <codeph>COMPUTE INCREMENTAL STATS</codeph> on
- very large tables, use the configuration setting
- <codeph>inc_stats_size_limit_bytes</codeph> to prevent Impala from
- running out of memory while updating table metadata. If this limit
- is reached, Impala will stop loading the table and return an error.
- The error serves as an indication that <codeph>COMPUTE INCREMENTAL
- STATS</codeph> should not be used on the particular table.
- Consider spitting the table and using regular <codeph>COMPUTE
- STATS</codeph> ]if possible.
- </p>
-
- <p>
- The <codeph>inc_stats_size_limit_bytes</codeph> limit is set as a
- safety check, to prevent Impala from hitting the maximum limit for
+ <p>In Impala 3.0 and lower, when executing <codeph>COMPUTE INCREMENTAL
+ STATS</codeph> on very large tables, use the configuration setting
+ <codeph>--inc_stats_size_limit_bytes</codeph> to prevent Impala
+ from running out of memory while updating table metadata. If this
+ limit is reached, Impala will stop loading the table and return an
+ error. The error serves as an indication that <codeph>COMPUTE
+ INCREMENTAL STATS</codeph> should not be used on the particular
+ table. Consider splitting the table and using regular <codeph>COMPUTE
+ STATS</codeph> if possible. </p>
+
+ <p> The <codeph>--inc_stats_size_limit_bytes</codeph> limit is set as
+ a safety check, to prevent Impala from hitting the maximum limit for
the table metadata. Note that this limit is only one part of the
- entire table's metadata all of which together must be below 2 GB.
- </p>
+ entire table's metadata, all of which together must be below 2 GB. </p>
- <p>
- The default value for <codeph>inc_stats_size_limit_bytes</codeph>
- is 209715200, 200 MB.
- </p>
+ <p> The default value for
+ <codeph>--inc_stats_size_limit_bytes</codeph> is 209715200 bytes
+ (200 MB). </p>
- <p> To change the <codeph>inc_stats_size_limit_bytes</codeph> value,
- restart <codeph>impalad</codeph> and <codeph>catalogd</codeph> with
- the new value specified in bytes, for example, 1048576000 for 1 GB.
- See <xref href="impala_config_options.xml#config_options"/> for the
- steps to change the option and restart Impala daemons. </p>
+ <p> To change the <codeph>--inc_stats_size_limit_bytes</codeph> value,
+ restart impalad and catalogd with the new value specified in bytes,
+ for example, 1048576000 for 1000 MB (approximately 1 GB). See <xref
+ href="impala_config_options.xml#config_options"/> for the steps to
+ change the option and restart Impala daemons. </p>
- <note type="attention">
- The <codeph>inc_stats_size_limit_bytes</codeph> setting should be
+ <note type="attention"> The
+ <codeph>--inc_stats_size_limit_bytes</codeph> setting should be
increased with care. A big value for the setting, such as 1 GB or
more, can result in a spike in heap usage as well as a crash of
- Impala.
- </note>
+ Impala. </note>
+ <p>In Impala 3.1 and higher, Impala improved how metadata is updated
+ when executing <codeph>COMPUTE INCREMENTAL STATS</codeph>,
+ significantly reducing the need for
+ <codeph>--inc_stats_size_limit_bytes</codeph>. </p>
+ </conbody>
+ </concept>
+ <concept id="pull_incremental_statistics">
+ <title>Loading Incremental Statistics from Catalogd</title>
+ <conbody>
+ <p>
+ Starting in Impala 3.1, a new configuration setting,
+ <codeph>--pull_incremental_statistics</codeph>, was added and set
+ to <codeph>true</codeph> by default. When you start Impala catalogd
+ and impalad coordinators with this setting enabled:
+ </p>
+ <ul>
+ <li> Newly created incremental stats will be smaller in size, thus
+ reducing memory pressure on the catalogd daemon. Your users can
+ keep more tables and partitions in the same catalog and have lower
+ chances of crashing catalogd due to out-of-memory issues. </li>
+ <li>
+ Incremental stats will not be replicated to impalad and will be
+ accessed on demand from catalogd, resulting in a reduced memory
+ footprint of impalad.
+ </li>
+ </ul>
+ <p>
+ We do not recommend changing the default setting of
+ <codeph>--pull_incremental_statistics</codeph>.
+ </p>
</conbody>
</concept>
@@ -980,8 +1007,7 @@ alter table <varname>table_name</varname> partition (<varname>keycol1</varname>=
frequently enough to keep up with data changes for a huge table.
</p>
- <p conref="../shared/impala_common.xml#common/set_column_stats_example"
- />
+ <p conref="../shared/impala_common.xml#common/set_column_stats_example"/>
</conbody>
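As a worked example of the sizing figures in impala_perf_stats.xml above (approximately 400 bytes per column per partition in Impala 3.0 and lower, against the 209715200-byte default of --inc_stats_size_limit_bytes), the arithmetic can be run directly as a query. The table shape is hypothetical, and the 400-byte figure is only an approximation, so this is a rough estimate rather than an exact prediction.

```sql
-- Hypothetical table: 100 columns and 10,000 partitions.
-- Estimated incremental stats metadata = 400 bytes * columns * partitions.
SELECT 400 * 100 * 10000 AS est_stats_bytes,
       400 * 100 * 10000 > 209715200 AS exceeds_default_limit;
```

The estimate works out to 400,000,000 bytes, which is above the 200 MB default. Under these assumptions, such a table would be at risk of hitting the limit in Impala 3.0 and lower, which is the situation where the text suggests splitting the table and using regular COMPUTE STATS.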