You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@impala.apache.org by sa...@apache.org on 2018/04/23 17:39:02 UTC
[14/20] impala git commit: IMPALA-6459: [DOCS] Part 2: Stats
extrapolation and sampling.
IMPALA-6459: [DOCS] Part 2: Stats extrapolation and sampling.
Adds new materials under COMPUTE STATS describing
the experimental stats extrapolation and sampling
features.
More cleanup and examples are needed. This patch provides
a reasonable starting point which we can extend.
Change-Id: Idae7a377b5873701e91f60afa62dde2bd8aacd1b
Reviewed-on: http://gerrit.cloudera.org:8080/10112
Reviewed-by: Alex Behm <al...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/d42f8d7c
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/d42f8d7c
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/d42f8d7c
Branch: refs/heads/2.x
Commit: d42f8d7c61bc819047b36564f4be1c0a544a4f0d
Parents: 62885d8
Author: Alex Behm <al...@cloudera.com>
Authored: Tue Apr 17 17:12:17 2018 -0700
Committer: Impala Public Jenkins <im...@gerrit.cloudera.org>
Committed: Fri Apr 20 20:17:58 2018 +0000
----------------------------------------------------------------------
docs/topics/impala_perf_stats.xml | 135 +++++++++++++++++++++++++++++++++
1 file changed, 135 insertions(+)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/impala/blob/d42f8d7c/docs/topics/impala_perf_stats.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_perf_stats.xml b/docs/topics/impala_perf_stats.xml
index f503a68..dab2eb8 100644
--- a/docs/topics/impala_perf_stats.xml
+++ b/docs/topics/impala_perf_stats.xml
@@ -392,6 +392,12 @@ show column stats year_month_day;
This feature is available since Impala 2.8.
</p>
</li>
+ <li>
+ <p>
+ Consider the experimental extrapolation and sampling features (see below)
+ to further increase the efficiency of computing stats.
+ </p>
+ </li>
</ul>
</p>
@@ -415,6 +421,135 @@ show column stats year_month_day;
</conbody>
+ <concept id="experimental_stats_features">
+ <title>Experimental: Extrapolation and Sampling</title>
+ <conbody>
+ <p>
+ Impala 2.12 and higher includes two experimental features to alleviate
+ common issues for computing and maintaining statistics on very large tables.
+ The following shortcomings are improved upon:
+ <ul>
+ <li>
+ <p>
+ Newly added partitions do not have row count statistics. Table scans
+ that only access those new partitions are treated as not having stats.
+ Similarly, table scans that access both new and old partitions estimate
+ the scan cardinality based on those old partitions that have stats, and
+ the new partitions without stats are treated as having 0 rows.
+ </p>
+ </li>
+ <li>
+ <p>
+ The row counts of existing partitions become stale when data is added
+ or dropped.
+ </p>
+ </li>
+ <li>
+ <p>
+ Computing stats for tables with a 100,000 or more partitions might fail
+ or be very slow due to the high cost of updating the partition metadata
+ in the Hive Metastore.
+ </p>
+ </li>
+ <li>
+ <p>
+ With transient compute resources it is important to minimize the time
+ from starting a new cluster to successfully running queries.
+ Since the cluster might be relatively short-lived, users might prefer to
+ quickly collect stats that are "good enough" as opposed to spending
+ a lot of time and resouces on computing full-fidelity stats.
+ </p>
+ </li>
+ </ul>
+ For very large tables, it is often wasteful or impractical to run a full
+ COMPUTE STATS to address the scenarios above on a frequent basis.
+ </p>
+ <p>
+ The sampling feature makes COMPUTE STATS more efficient by processing a
+ fraction of the table data, and the extrapolation feature aims to reduce
+ the frequency at which COMPUTE STATS needs to be re-run by estimating
+ the row count of new and modified partitions.
+ </p>
+ <p>
+ The sampling and extrapolation features are disabled by default.
+ They can be enabled globally or for specific tables, as follows.
+ Set the impalad start-up configuration "--enable_stats_extrapolation" to
+ enable the features globally. To enable them only for a specific table, set
+ the "impala.enable.stats.extrapolation" table property to "true" for the
+ desired table. The tbale-level property overrides the global setting, so
+ it is also possible to enable sampling and extrapolation globally, but
+ disable it for specific tables by setting the table property to "false".
+ Example:
+ ALTER TABLE mytable test_table SET TBLPROPERTIES("impala.enable.stats.extrapolation"="true")
+ </p>
+ <note>
+ Why are these features experimental? Due to their probabilistic nature
+ it is possible that these features perform pathologically poorly on tables
+ with extreme data/file/size distributions. Since it is not feasible for us
+ to test all possible scenarios we only cautiously advertise these new
+ capabilities. That said, the features have been thoroughly tested and
+ are considered functionally stable. If you decide to give these features
+ a try, please tell us about your experience at user@impala.apache.org!
+ We rely on user feedback to guide future inprovements in statistics
+ collection.
+ </note>
+ </conbody>
+
+ <concept id="experimental_stats_extrapolation">
+ <title>Stats Extrapolation</title>
+ <conbody>
+ <p>
+ The main idea of stats extrapolation is to estimate the row count of new
+ and modified partitions based on the result of the last COMPUTE STATS.
+ Enabling stats extrapolation changes the behavior of COMPUTE STATS,
+ as well as the cardinality estimation of table scans. COMPUTE STATS no
+ longer computes and stores per-partition row counts, and instead, only
+ computes a table-level row count together with the total number of file
+ bytes in the table at that time. No partition metadata is modified. The
+ input cardinality of a table scan is estimated by converting the data
+ volume of relevant partitions to a row count, based on the table-level
+ row count and file bytes statistics. It is assumed that within the same
+ table, different sets of files with the same data volume correspond
+ to the similar number of rows on average. With extrapolation enabled,
+ the scan cardinality estimation ignores per-partition row counts. It
+ only relies on the table-level statistics and the scanned data volume.
+ </p>
+ <p>
+ The SHOW TABLE STATS and EXPLAIN commands distinguish between row counts
+ stored in the Hive Metastore, and the row counts extrapolated based on the
+ above process. Consult the SHOW TABLE STATS and EXPLAIN documentation
+ for more details.
+ </p>
+ </conbody>
+ </concept>
+
+ <concept id="experimental_stats_sampling">
+ <title>Sampling</title>
+ <conbody>
+ <p>
+ A TABLESAMPLE clause may be added to COMPUTE STATS to limit the
+ percentage of data to be processed. The final statistics are obtained
+ by extrapolating the statistics from the data sample over the entire table.
+ The extrapolated statistics are stored in the Hive Metastore, just as if no
+ sampling was used. The following example runs COMPUTE STATS over a 10 percent
+ data sample: COMPUTE STATS test_table TABLESAMPLE SYSTEM(10)
+ </p>
+ <p>
+ We have found that a 10 percent sampling rate typically offers a good
+ tradeoff between statistics accuracy and execution cost. A sampling rate
+ well below 10 percent has shown poor results and is not recommended.
+ </p>
+ <note type="important">
+ Sampling-based techniques sacrifice result accuracy for execution
+ efficiency, so your mileage may vary for different tables and columns
+ depending on their data distribution. The extrapolation procedure Impala
+ uses for estimating the number of distinct values per column is inherently
+ non-detetministic, so your results may even vary between runs of
+ COMPUTE STATS TABLESAMPLE, even if no data has changed.
+ </note>
+ </conbody>
+ </concept>
+ </concept>
</concept>
<concept id="concept_bmk_pfl_mdb">