Posted to commits@impala.apache.org by jo...@apache.org on 2018/04/17 20:25:51 UTC
[2/4] impala git commit: IMPALA-6464: [DOCS] COMPUTE STATS supports a
list of columns
IMPALA-6464: [DOCS] COMPUTE STATS supports a list of columns
Change-Id: I609c38eac29e36eca008bfb66f5e78f5491e719a
Reviewed-on: http://gerrit.cloudera.org:8080/10070
Reviewed-by: Vuk Ercegovac <ve...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/0e98b9ab
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/0e98b9ab
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/0e98b9ab
Branch: refs/heads/master
Commit: 0e98b9abd05ccfb3f01657434f913ad7d061f087
Parents: a6767de
Author: Alex Rodoni <ar...@cloudera.com>
Authored: Fri Apr 13 18:14:57 2018 -0700
Committer: Impala Public Jenkins <im...@cloudera.com>
Committed: Mon Apr 16 20:28:34 2018 +0000
----------------------------------------------------------------------
docs/topics/impala_compute_stats.xml | 116 ++++++++++++++++++++----------
1 file changed, 77 insertions(+), 39 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/impala/blob/0e98b9ab/docs/topics/impala_compute_stats.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_compute_stats.xml b/docs/topics/impala_compute_stats.xml
index 98694f8..b62972c 100644
--- a/docs/topics/impala_compute_stats.xml
+++ b/docs/topics/impala_compute_stats.xml
@@ -49,7 +49,11 @@ under the License.
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
-<codeblock rev="2.1.0">COMPUTE STATS [<varname>db_name</varname>.]<varname>table_name</varname>
+<codeblock rev="impala-3562">COMPUTE STATS
+ [<varname>db_name</varname>.]<varname>table_name</varname> [ ( <varname>column_list</varname> ) ]
+
+<varname>column_list</varname> ::= <varname>column_name</varname> [ , <varname>column_name</varname>, ... ]
+
COMPUTE INCREMENTAL STATS [<varname>db_name</varname>.]<varname>table_name</varname> [PARTITION (<varname>partition_spec</varname>)]
<varname>partition_spec</varname> ::= <varname>simple_partition_spec</varname> | <ph rev="IMPALA-1654"><varname>complex_partition_spec</varname></ph>
@@ -64,12 +68,40 @@ COMPUTE INCREMENTAL STATS [<varname>db_name</varname>.]<varname>table_name</varn
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
<p>
- Originally, Impala relied on users to run the Hive <codeph>ANALYZE TABLE</codeph> statement, but that method
- of gathering statistics proved unreliable and difficult to use. The Impala <codeph>COMPUTE STATS</codeph>
- statement is built from the ground up to improve the reliability and user-friendliness of this operation.
- <codeph>COMPUTE STATS</codeph> does not require any setup steps or special configuration. You only run a
- single Impala <codeph>COMPUTE STATS</codeph> statement to gather both table and column statistics, rather
- than separate Hive <codeph>ANALYZE TABLE</codeph> statements for each kind of statistics.
+ Originally, Impala relied on users to run the Hive <codeph>ANALYZE
+ TABLE</codeph> statement, but that method of gathering statistics proved
+ unreliable and difficult to use. The Impala <codeph>COMPUTE STATS</codeph>
+ statement was built to improve the reliability and user-friendliness of
+ this operation. <codeph>COMPUTE STATS</codeph> does not require any setup
+ steps or special configuration. You only run a single Impala
+ <codeph>COMPUTE STATS</codeph> statement to gather both table and column
+ statistics, rather than separate Hive <codeph>ANALYZE TABLE</codeph>
+ statements for each kind of statistics.
+ </p>
+
+ <p rev="impala-3562">
+ For a non-incremental <codeph>COMPUTE STATS</codeph>
+ statement, the columns for which statistics are computed can be specified
+ with an optional comma-separated list of columns.
+ </p>
+
+ <p rev="impala-3562">
+ If no column list is given, the <codeph>COMPUTE STATS</codeph> statement
+ computes column-level statistics for all columns of the table. This adds
+ potentially unneeded work for columns whose stats are not needed by
+ queries. It can be especially costly for very wide tables and unneeded
+ large string fields.
+ </p>
+ <p rev="impala-3562">
+ <codeph>COMPUTE STATS</codeph> returns an error when a specified column
+ cannot be analyzed, such as when the column does not exist, the column is
+ of an unsupported type for COMPUTE STATS, e.g. columns of complex types,
+ or the column is a partitioning column.
+
+ </p>
+ <p rev="impala-3562">
+ If an empty column list is given, no column is analyzed by <codeph>COMPUTE
+ STATS</codeph>.
</p>
<p rev="2.1.0">
@@ -92,39 +124,45 @@ COMPUTE INCREMENTAL STATS [<varname>db_name</varname>.]<varname>table_name</varn
<codeph>COMPUTE STATS</codeph> statement. Such tables display <codeph>false</codeph> under the
<codeph>Incremental stats</codeph> column of the <codeph>SHOW TABLE STATS</codeph> output.
</p>
-
<note>
- Because many of the most performance-critical and resource-intensive operations rely on table and column
- statistics to construct accurate and efficient plans, <codeph>COMPUTE STATS</codeph> is an important step at
- the end of your ETL process. Run <codeph>COMPUTE STATS</codeph> on all tables as your first step during
- performance tuning for slow queries, or troubleshooting for out-of-memory conditions:
- <ul>
- <li>
- Accurate statistics help Impala construct an efficient query plan for join queries, improving performance
- and reducing memory usage.
- </li>
-
- <li>
- Accurate statistics help Impala distribute the work effectively for insert operations into Parquet
- tables, improving performance and reducing memory usage.
- </li>
-
- <li rev="1.3.0">
- Accurate statistics help Impala estimate the memory required for each query, which is important when you
- use resource management features, such as admission control and the YARN resource management framework.
- The statistics help Impala to achieve high concurrency, full utilization of available memory, and avoid
- contention with workloads from other Hadoop components.
- </li>
- <li rev="IMPALA-4572">
- In <keyword keyref="impala28_full"/> and higher, when you run the
- <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph>
- statement against a Parquet table, Impala automatically applies the query
- option setting <codeph>MT_DOP=4</codeph> to increase the amount of intra-node
- parallelism during this CPU-intensive operation. See <xref keyref="mt_dop"/>
- for details about what this query option does and how to use it with
- CPU-intensive <codeph>SELECT</codeph> statements.
- </li>
- </ul>
+ <p>
+ Because many of the most performance-critical and resource-intensive
+ operations rely on table and column statistics to construct accurate and
+ efficient plans, <codeph>COMPUTE STATS</codeph> is an important step at
+ the end of your ETL process. Run <codeph>COMPUTE STATS</codeph> on all
+ tables as your first step during performance tuning for slow queries, or
+ troubleshooting for out-of-memory conditions:
+ <ul>
+ <li>
+ Accurate statistics help Impala construct an efficient query plan
+ for join queries, improving performance and reducing memory usage.
+ </li>
+ <li>
+ Accurate statistics help Impala distribute the work effectively
+ for insert operations into Parquet tables, improving performance and
+ reducing memory usage.
+ </li>
+ <li rev="1.3.0">
+ Accurate statistics help Impala estimate the memory
+ required for each query, which is important when you use resource
+ management features, such as admission control and the YARN resource
+ management framework. The statistics help Impala to achieve high
+ concurrency, full utilization of available memory, and avoid
+ contention with workloads from other Hadoop components.
+ </li>
+ <li rev="IMPALA-4572">
+ In <keyword keyref="impala28_full"/> and
+ higher, when you run the <codeph>COMPUTE STATS</codeph> or
+ <codeph>COMPUTE INCREMENTAL STATS</codeph> statement against a
+ Parquet table, Impala automatically applies the query option setting
+ <codeph>MT_DOP=4</codeph> to increase the amount of intra-node
+ parallelism during this CPU-intensive operation. See <xref
+ keyref="mt_dop"/> for details about what this query option does
+ and how to use it with CPU-intensive <codeph>SELECT</codeph>
+ statements.
+ </li>
+ </ul>
+ </p>
</note>
<p rev="IMPALA-1654">