You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@impala.apache.org by mi...@apache.org on 2018/05/09 21:10:22 UTC
[13/51] [partial] impala git commit: [DOCS] Impala doc site update
for 3.0
http://git-wip-us.apache.org/repos/asf/impala/blob/fae51ec2/docs/build3x/html/topics/impala_perf_stats.html
----------------------------------------------------------------------
diff --git a/docs/build3x/html/topics/impala_perf_stats.html b/docs/build3x/html/topics/impala_perf_stats.html
new file mode 100644
index 0000000..c4bdf0c
--- /dev/null
+++ b/docs/build3x/html/topics/impala_perf_stats.html
@@ -0,0 +1,1192 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2018"><meta name="DC.rights.owner" content="(C) Copyright 2018"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_performance.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="perf_stats"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Table and Column Statistics</title></head><body id="perf_stats"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+ <h1 class="title topictitle1" id="ariaid-title1">Table and Column Statistics</h1>
+
+
+
+ <div class="body conbody">
+
+ <p class="p">
+ Impala can do better optimization for complex or multi-table queries when it has access to
+ statistics about the volume of data and how the values are distributed. Impala uses this
+ information to help parallelize and distribute the work for a query. For example,
+ optimizing join queries requires a way of determining if one table is <span class="q">"bigger"</span> than
+ another, which is a function of the number of rows and the average row size for each
+ table. The following sections describe the categories of statistics Impala can work with,
+ and how to produce them and keep them up to date.
+ </p>
+
+ <p class="p toc inpage all"></p>
+
+ </div>
+
+ <nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_performance.html">Tuning Impala for Performance</a></div></div></nav><article class="topic concept nested1" aria-labelledby="perf_table_stats__table_stats" id="perf_stats__perf_table_stats">
+
+ <h2 class="title topictitle2" id="perf_table_stats__table_stats">Overview of Table Statistics</h2>
+
+
+
+ <div class="body conbody">
+
+ <p class="p">
+ The Impala query planner can make use of statistics about entire tables and partitions.
+ This information includes physical characteristics such as the number of rows, number of
+ data files, the total size of the data files, and the file format. For partitioned
+ tables, the numbers are calculated per partition, and as totals for the whole table.
+ This metadata is stored in the metastore database, and can be updated by either Impala
+ or Hive. If a number is not available, the value -1 is used as a placeholder. Some
+ numbers, such as number and total sizes of data files, are always kept up to date
+ because they can be calculated cheaply, as part of gathering HDFS block metadata.
+ </p>
+
+ <p class="p">
+ The following example shows table stats for an unpartitioned Parquet table. The values
+ for the number and sizes of files are always available. Initially, the number of rows is
+ not known, because it requires a potentially expensive scan through the entire table,
+ and so that value is displayed as -1. The <code class="ph codeph">COMPUTE STATS</code> statement fills
+ in any unknown table stats values.
+ </p>
+
+<pre class="pre codeblock"><code>
+show table stats parquet_snappy;
++-------+--------+---------+--------------+-------------------+---------+-------------------+...
+| #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |...
++-------+--------+---------+--------------+-------------------+---------+-------------------+...
+| -1 | 96 | 23.35GB | NOT CACHED | NOT CACHED | PARQUET | false |...
++-------+--------+---------+--------------+-------------------+---------+-------------------+...
+
+compute stats parquet_snappy;
++-----------------------------------------+
+| summary |
++-----------------------------------------+
+| Updated 1 partition(s) and 6 column(s). |
++-----------------------------------------+
+
+
+show table stats parquet_snappy;
++------------+--------+---------+--------------+-------------------+---------+-------------------+...
+| #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |...
++------------+--------+---------+--------------+-------------------+---------+-------------------+...
+| 1000000000 | 96 | 23.35GB | NOT CACHED | NOT CACHED | PARQUET | false |...
++------------+--------+---------+--------------+-------------------+---------+-------------------+...
+</code></pre>
+
+ <p class="p">
+ Impala performs some optimizations using this metadata on its own, and other
+ optimizations by using a combination of table and column statistics.
+ </p>
+
+ <p class="p">
+ To check that table statistics are available for a table, and see the details of those
+ statistics, use the statement <code class="ph codeph">SHOW TABLE STATS
+ <var class="keyword varname">table_name</var></code>. See <a class="xref" href="impala_show.html#show">SHOW Statement</a> for
+ details.
+ </p>
+
+ <p class="p">
+ If you use the Hive-based methods of gathering statistics, see
+ <a class="xref" href="https://cwiki.apache.org/confluence/display/Hive/StatsDev" target="_blank">the
+ Hive wiki</a> for information about the required configuration on the Hive side.
+ Where practical, use the Impala <code class="ph codeph">COMPUTE STATS</code> statement to avoid
+ potential configuration and scalability issues with the statistics-gathering process.
+ </p>
+
+ <p class="p">
+ If you run the Hive statement <code class="ph codeph">ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS</code>,
+ Impala can only use the resulting column statistics if the table is unpartitioned.
+ Impala cannot use Hive-generated column statistics for a partitioned table.
+ </p>
+
+ </div>
+
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="perf_column_stats__column_stats" id="perf_stats__perf_column_stats">
+
+ <h2 class="title topictitle2" id="perf_column_stats__column_stats">Overview of Column Statistics</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ The Impala query planner can make use of statistics about individual columns when that
+ metadata is available in the metastore database. This technique is most valuable for
+ columns compared across tables in <a class="xref" href="impala_perf_joins.html#perf_joins">join
+ queries</a>, to help estimate how many rows the query will retrieve from each table.
+ <span class="ph"> These statistics are also important for correlated subqueries using the
+ <code class="ph codeph">EXISTS()</code> or <code class="ph codeph">IN()</code> operators, which are processed
+ internally the same way as join queries.</span>
+ </p>
+
+ <p class="p">
+ The following example shows column stats for an unpartitioned Parquet table. The values
+ for the maximum and average sizes of some types are always available, because those
+ figures are constant for numeric and other fixed-size types. Initially, the number of
+ distinct values is not known, because it requires a potentially expensive scan through
+ the entire table, and so that value is displayed as -1. The same applies to maximum and
+ average sizes of variable-sized types, such as <code class="ph codeph">STRING</code>. The
+ <code class="ph codeph">COMPUTE STATS</code> statement fills in most unknown column stats values. (It
+ does not record the number of <code class="ph codeph">NULL</code> values, because currently Impala
+ does not use that figure for query optimization.)
+ </p>
+
+<pre class="pre codeblock"><code>
+show column stats parquet_snappy;
++-------------+----------+------------------+--------+----------+----------+
+| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
++-------------+----------+------------------+--------+----------+----------+
+| id | BIGINT | -1 | -1 | 8 | 8 |
+| val | INT | -1 | -1 | 4 | 4 |
+| zerofill | STRING | -1 | -1 | -1 | -1 |
+| name | STRING | -1 | -1 | -1 | -1 |
+| assertion | BOOLEAN | -1 | -1 | 1 | 1 |
+| location_id | SMALLINT | -1 | -1 | 2 | 2 |
++-------------+----------+------------------+--------+----------+----------+
+
+compute stats parquet_snappy;
++-----------------------------------------+
+| summary |
++-----------------------------------------+
+| Updated 1 partition(s) and 6 column(s). |
++-----------------------------------------+
+
+show column stats parquet_snappy;
++-------------+----------+------------------+--------+----------+-------------------+
+| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
++-------------+----------+------------------+--------+----------+-------------------+
+| id | BIGINT | 183861280 | -1 | 8 | 8 |
+| val | INT | 139017 | -1 | 4 | 4 |
+| zerofill | STRING | 101761 | -1 | 6 | 6 |
+| name | STRING | 145636240 | -1 | 22 | 13.00020027160645 |
+| assertion | BOOLEAN | 2 | -1 | 1 | 1 |
+| location_id | SMALLINT | 339 | -1 | 2 | 2 |
++-------------+----------+------------------+--------+----------+-------------------+
+</code></pre>
+
+ <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+ <p class="p">
+ For column statistics to be effective in Impala, you also need to have table
+ statistics for the applicable tables, as described in
+ <a class="xref" href="impala_perf_stats.html#perf_table_stats">Overview of Table Statistics</a>. When you use the Impala
+ <code class="ph codeph">COMPUTE STATS</code> statement, both table and column statistics are
+ automatically gathered at the same time, for all columns in the table.
+ </p>
+ </div>
+
+ <div class="note note note_note"><span class="note__title notetitle">Note:</span> Prior to Impala 1.4.0,
+ <code class="ph codeph">COMPUTE STATS</code> counted the number of
+ <code class="ph codeph">NULL</code> values in each column and recorded that figure
+ in the metastore database. Because Impala does not currently use the
+ <code class="ph codeph">NULL</code> count during query planning, Impala 1.4.0 and
+ higher speeds up the <code class="ph codeph">COMPUTE STATS</code> statement by
+ skipping this <code class="ph codeph">NULL</code> counting. </div>
+
+ <p class="p">
+ To check whether column statistics are available for a particular set of columns, use
+ the <code class="ph codeph">SHOW COLUMN STATS <var class="keyword varname">table_name</var></code> statement, or check
+ the extended <code class="ph codeph">EXPLAIN</code> output for a query against that table that refers
+ to those columns. See <a class="xref" href="impala_show.html#show">SHOW Statement</a> and
+ <a class="xref" href="impala_explain.html#explain">EXPLAIN Statement</a> for details.
+ </p>
+
+ <p class="p">
+ If you run the Hive statement <code class="ph codeph">ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS</code>,
+ Impala can only use the resulting column statistics if the table is unpartitioned.
+ Impala cannot use Hive-generated column statistics for a partitioned table.
+ </p>
+
+ </div>
+
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="perf_stats_partitions__stats_partitions" id="perf_stats__perf_stats_partitions">
+
+ <h2 class="title topictitle2" id="perf_stats_partitions__stats_partitions">How Table and Column Statistics Work for Partitioned Tables</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ When you use Impala for <span class="q">"big data"</span>, you are highly likely to use partitioning for
+ your biggest tables, the ones representing data that can be logically divided based on
+ dates, geographic regions, or similar criteria. The table and column statistics are
+ especially useful for optimizing queries on such tables. For example, a query involving
+ one year might involve substantially more or less data than a query involving a
+ different year, or a range of several years. Each query might be optimized differently
+ as a result.
+ </p>
+
+ <p class="p">
+ The following examples show how table and column stats work with a partitioned table.
+ The table for this example is partitioned by year, month, and day. For simplicity, the
+ sample data consists of 5 partitions, all from the same year and month. Table stats are
+ collected independently for each partition. (In fact, the <code class="ph codeph">SHOW
+ PARTITIONS</code> statement displays exactly the same information as <code class="ph codeph">SHOW
+ TABLE STATS</code> for a partitioned table.) Column stats apply to the entire table,
+ not to individual partitions. Because the partition key column values are represented as
+ HDFS directories, their characteristics are typically known in advance, even when the
+ values for non-key columns are shown as -1.
+ </p>
+
+<pre class="pre codeblock"><code>
+show partitions year_month_day;
++-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
+| year | month | day | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format |...
++-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
+| 2013 | 12 | 1 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |...
+| 2013 | 12 | 2 | -1 | 1 | 2.53MB | NOT CACHED | NOT CACHED | PARQUET |...
+| 2013 | 12 | 3 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |...
+| 2013 | 12 | 4 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |...
+| 2013 | 12 | 5 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |...
+| Total | | | -1 | 5 | 12.58MB | 0B | | |...
++-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
+
+show table stats year_month_day;
++-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
+| year | month | day | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format |...
++-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
+| 2013 | 12 | 1 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |...
+| 2013 | 12 | 2 | -1 | 1 | 2.53MB | NOT CACHED | NOT CACHED | PARQUET |...
+| 2013 | 12 | 3 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |...
+| 2013 | 12 | 4 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |...
+| 2013 | 12 | 5 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |...
+| Total | | | -1 | 5 | 12.58MB | 0B | | |...
++-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
+
+show column stats year_month_day;
++-----------+---------+------------------+--------+----------+----------+
+| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
++-----------+---------+------------------+--------+----------+----------+
+| id | INT | -1 | -1 | 4 | 4 |
+| val | INT | -1 | -1 | 4 | 4 |
+| zfill | STRING | -1 | -1 | -1 | -1 |
+| name | STRING | -1 | -1 | -1 | -1 |
+| assertion | BOOLEAN | -1 | -1 | 1 | 1 |
+| year | INT | 1 | 0 | 4 | 4 |
+| month | INT | 1 | 0 | 4 | 4 |
+| day | INT | 5 | 0 | 4 | 4 |
++-----------+---------+------------------+--------+----------+----------+
+
+compute stats year_month_day;
++-----------------------------------------+
+| summary |
++-----------------------------------------+
+| Updated 5 partition(s) and 5 column(s). |
++-----------------------------------------+
+
+show table stats year_month_day;
++-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+...
+| year | month | day | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format |...
++-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+...
+| 2013 | 12 | 1 | 93606 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |...
+| 2013 | 12 | 2 | 94158 | 1 | 2.53MB | NOT CACHED | NOT CACHED | PARQUET |...
+| 2013 | 12 | 3 | 94122 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |...
+| 2013 | 12 | 4 | 93559 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |...
+| 2013 | 12 | 5 | 93845 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |...
+| Total | | | 469290 | 5 | 12.58MB | 0B | | |...
++-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+...
+
+show column stats year_month_day;
++-----------+---------+------------------+--------+----------+-------------------+
+| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
++-----------+---------+------------------+--------+----------+-------------------+
+| id | INT | 511129 | -1 | 4 | 4 |
+| val | INT | 364853 | -1 | 4 | 4 |
+| zfill | STRING | 311430 | -1 | 6 | 6 |
+| name | STRING | 471975 | -1 | 22 | 13.00160026550293 |
+| assertion | BOOLEAN | 2 | -1 | 1 | 1 |
+| year | INT | 1 | 0 | 4 | 4 |
+| month | INT | 1 | 0 | 4 | 4 |
+| day | INT | 5 | 0 | 4 | 4 |
++-----------+---------+------------------+--------+----------+-------------------+
+</code></pre>
+
+ <p class="p">
+ If you run the Hive statement <code class="ph codeph">ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS</code>,
+ Impala can only use the resulting column statistics if the table is unpartitioned.
+ Impala cannot use Hive-generated column statistics for a partitioned table.
+ </p>
+
+ </div>
+
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title5" id="perf_stats__perf_generating_stats">
+
+ <h2 class="title topictitle2" id="ariaid-title5">Generating Table and Column Statistics</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ Use the <code class="ph codeph">COMPUTE STATS</code> family of commands to collect table and
+ column statistics. The <code class="ph codeph">COMPUTE STATS</code> variants offer
+ different tradeoffs between computation cost, staleness, and maintenance
+ workflows which are explained below.
+ </p>
+
+ <div class="note important note_important"><span class="note__title importanttitle">Important:</span>
+ <p class="p">
+ For a particular table, use either <code class="ph codeph">COMPUTE STATS</code> or
+ <code class="ph codeph">COMPUTE INCREMENTAL STATS</code>, but never combine the two or
+ alternate between them. If you switch from <code class="ph codeph">COMPUTE STATS</code> to
+ <code class="ph codeph">COMPUTE INCREMENTAL STATS</code> during the lifetime of a table, or
+ vice versa, drop all statistics by running <code class="ph codeph">DROP STATS</code> before
+ making the switch.
+ </p>
+ </div>
+
+
+
+ </div>
+
+ <article class="topic concept nested2" aria-labelledby="ariaid-title6" id="perf_generating_stats__concept_y2f_nfl_mdb">
+
+ <h3 class="title topictitle3" id="ariaid-title6">COMPUTE STATS</h3>
+
+ <div class="body conbody">
+
+ <p class="p">
+ The <code class="ph codeph">COMPUTE STATS</code> command collects and sets the table-level
+ and partition-level row counts as well as all column statistics for a given
+ table. The collection process is CPU-intensive and can take a long time to
+ complete for very large tables.
+ </p>
+ <div class="p">
+ To speed up <code class="ph codeph">COMPUTE STATS</code> consider the following options
+ which can be combined.
+ <ul class="ul">
+ <li class="li">
+ <p class="p">
+ Limit the number of columns for which statistics are collected to increase
+ the efficiency of COMPUTE STATS. Queries benefit from statistics for those
+ columns involved in filters, join conditions, group by or partition by
+ clauses. Other columns are good candidates to exclude from COMPUTE STATS.
+ This feature is available since Impala 2.12.
+ </p>
+ </li>
+ <li class="li">
+ <p class="p">
+ Set the MT_DOP query option to use more threads within each participating
+ impalad to compute the statistics faster - but not more efficiently. Note
+ that computing stats on a large table with a high MT_DOP value can
+ negatively affect other queries running at the same time if the
+ COMPUTE STATS claims most CPU cycles.
+ This feature is available since Impala 2.8.
+ </p>
+ </li>
+ <li class="li">
+ <p class="p">
+ Consider the experimental extrapolation and sampling features (see below)
+ to further increase the efficiency of computing stats.
+ </p>
+ </li>
+ </ul>
+ </div>
+
+ <p class="p">
+ <code class="ph codeph">COMPUTE STATS</code> is intended to be run periodically,
+ e.g. weekly, or on-demand when the contents of a table have changed
+ significantly. Due to the high resource utilization and long repsonse
+ time of t<code class="ph codeph">COMPUTE STATS</code>, it is most practical to run it
+ in a scheduled maintnance window where the Impala cluster is idle
+ enough to accommodate the expensive operation. The degree of change that
+ qualifies as <span class="q">"significant"</span> depends on the query workload, but typically,
+ if 30% of the rows have changed then it is recommended to recompute
+ statistics.
+ </p>
+
+ <p class="p">
+ If you reload a complete new set of data for a table, but the number of rows and
+ number of distinct values for each column is relatively unchanged from before, you
+ do not need to recompute stats for the table.
+ </p>
+
+ </div>
+
+ <article class="topic concept nested3" aria-labelledby="ariaid-title7" id="concept_y2f_nfl_mdb__experimental_stats_features">
+ <h4 class="title topictitle4" id="ariaid-title7">Experimental: Extrapolation and Sampling</h4>
+ <div class="body conbody">
+ <div class="p">
+ Impala 2.12 and higher includes two experimental features to alleviate
+ common issues for computing and maintaining statistics on very large tables.
+ The following shortcomings are improved upon:
+ <ul class="ul">
+ <li class="li">
+ <p class="p">
+ Newly added partitions do not have row count statistics. Table scans
+ that only access those new partitions are treated as not having stats.
+ Similarly, table scans that access both new and old partitions estimate
+ the scan cardinality based on those old partitions that have stats, and
+ the new partitions without stats are treated as having 0 rows.
+ </p>
+ </li>
+ <li class="li">
+ <p class="p">
+ The row counts of existing partitions become stale when data is added
+ or dropped.
+ </p>
+ </li>
+ <li class="li">
+ <p class="p">
+ Computing stats for tables with a 100,000 or more partitions might fail
+ or be very slow due to the high cost of updating the partition metadata
+ in the Hive Metastore.
+ </p>
+ </li>
+ <li class="li">
+ <p class="p">
+ With transient compute resources it is important to minimize the time
+ from starting a new cluster to successfully running queries.
+ Since the cluster might be relatively short-lived, users might prefer to
+ quickly collect stats that are "good enough" as opposed to spending
+ a lot of time and resouces on computing full-fidelity stats.
+ </p>
+ </li>
+ </ul>
+ For very large tables, it is often wasteful or impractical to run a full
+ COMPUTE STATS to address the scenarios above on a frequent basis.
+ </div>
+ <p class="p">
+ The sampling feature makes COMPUTE STATS more efficient by processing a
+ fraction of the table data, and the extrapolation feature aims to reduce
+ the frequency at which COMPUTE STATS needs to be re-run by estimating
+ the row count of new and modified partitions.
+ </p>
+ <p class="p">
+ The sampling and extrapolation features are disabled by default.
+ They can be enabled globally or for specific tables, as follows.
+ Set the impalad start-up configuration "--enable_stats_extrapolation" to
+ enable the features globally. To enable them only for a specific table, set
+ the "impala.enable.stats.extrapolation" table property to "true" for the
+ desired table. The tbale-level property overrides the global setting, so
+ it is also possible to enable sampling and extrapolation globally, but
+ disable it for specific tables by setting the table property to "false".
+ Example:
+ ALTER TABLE mytable test_table SET TBLPROPERTIES("impala.enable.stats.extrapolation"="true")
+ </p>
+ <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+ Why are these features experimental? Due to their probabilistic nature
+ it is possible that these features perform pathologically poorly on tables
+ with extreme data/file/size distributions. Since it is not feasible for us
+ to test all possible scenarios we only cautiously advertise these new
+ capabilities. That said, the features have been thoroughly tested and
+ are considered functionally stable. If you decide to give these features
+ a try, please tell us about your experience at user@impala.apache.org!
+ We rely on user feedback to guide future inprovements in statistics
+ collection.
+ </div>
+ </div>
+
+ <article class="topic concept nested4" aria-labelledby="ariaid-title8" id="experimental_stats_features__experimental_stats_extrapolation">
+ <h5 class="title topictitle5" id="ariaid-title8">Stats Extrapolation</h5>
+ <div class="body conbody">
+ <p class="p">
+ The main idea of stats extrapolation is to estimate the row count of new
+ and modified partitions based on the result of the last COMPUTE STATS.
+ Enabling stats extrapolation changes the behavior of COMPUTE STATS,
+ as well as the cardinality estimation of table scans. COMPUTE STATS no
+ longer computes and stores per-partition row counts, and instead, only
+ computes a table-level row count together with the total number of file
+ bytes in the table at that time. No partition metadata is modified. The
+ input cardinality of a table scan is estimated by converting the data
+ volume of relevant partitions to a row count, based on the table-level
+ row count and file bytes statistics. It is assumed that within the same
+ table, different sets of files with the same data volume correspond
+ to the similar number of rows on average. With extrapolation enabled,
+ the scan cardinality estimation ignores per-partition row counts. It
+ only relies on the table-level statistics and the scanned data volume.
+ </p>
+ <p class="p">
+ The SHOW TABLE STATS and EXPLAIN commands distinguish between row counts
+ stored in the Hive Metastore, and the row counts extrapolated based on the
+ above process. Consult the SHOW TABLE STATS and EXPLAIN documentation
+ for more details.
+ </p>
+ </div>
+ </article>
+
+ <article class="topic concept nested4" aria-labelledby="ariaid-title9" id="experimental_stats_features__experimental_stats_sampling">
+ <h5 class="title topictitle5" id="ariaid-title9">Sampling</h5>
+ <div class="body conbody">
+ <p class="p">
+ A TABLESAMPLE clause may be added to COMPUTE STATS to limit the
+ percentage of data to be processed. The final statistics are obtained
+ by extrapolating the statistics from the data sample over the entire table.
+ The extrapolated statistics are stored in the Hive Metastore, just as if no
+ sampling was used. The following example runs COMPUTE STATS over a 10 percent
+ data sample: COMPUTE STATS test_table TABLESAMPLE SYSTEM(10)
+ </p>
+ <p class="p">
+ We have found that a 10 percent sampling rate typically offers a good
+ tradeoff between statistics accuracy and execution cost. A sampling rate
+ well below 10 percent has shown poor results and is not recommended.
+ </p>
+ <div class="note important note_important"><span class="note__title importanttitle">Important:</span>
+ Sampling-based techniques sacrifice result accuracy for execution
+ efficiency, so your mileage may vary for different tables and columns
+ depending on their data distribution. The extrapolation procedure Impala
+ uses for estimating the number of distinct values per column is inherently
+ non-detetministic, so your results may even vary between runs of
+ COMPUTE STATS TABLESAMPLE, even if no data has changed.
+ </div>
+ </div>
+ </article>
+ </article>
+ </article>
+
+ <article class="topic concept nested2" aria-labelledby="ariaid-title10" id="perf_generating_stats__concept_bmk_pfl_mdb">
+
+ <h3 class="title topictitle3" id="ariaid-title10">COMPUTE INCREMENTAL STATS</h3>
+
+ <div class="body conbody">
+
+ <p class="p">
+ In Impala 2.1.0 and higher, you can use the
+ <code class="ph codeph">COMPUTE INCREMENTAL STATS</code> and
+ <code class="ph codeph">DROP INCREMENTAL STATS</code> commands.
+ The <code class="ph codeph">INCREMENTAL</code> clauses work with incremental statistics,
+ a specialized feature for partitioned tables.
+ </p>
+
+ <p class="p">
+ When you compute incremental statistics for a partitioned table, by default Impala only
+ processes those partitions that do not yet have incremental statistics. By processing
+ only newly added partitions, you can keep statistics up to date without incurring the
+ overhead of reprocessing the entire table each time.
+ </p>
+
+ <p class="p">
+ You can also compute or drop statistics for a specified subset of partitions by
+ including a <code class="ph codeph">PARTITION</code> clause in the
+ <code class="ph codeph">COMPUTE INCREMENTAL STATS</code> or <code class="ph codeph">DROP INCREMENTAL STATS</code>
+ statement.
+ </p>
+
+ <div class="note important note_important"><span class="note__title importanttitle">Important:</span>
+ <p class="p">
+ For a table with a huge number of partitions and many columns, the approximately 400 bytes
+ of metadata per column per partition can add up to significant memory overhead, as it must
+ be cached on the <span class="keyword cmdname">catalogd</span> host and on every <span class="keyword cmdname">impalad</span> host
+ that is eligible to be a coordinator. If this metadata for all tables combined exceeds 2 GB,
+ you might experience service downtime.
+ </p>
+ <p class="p">
+ When you run <code class="ph codeph">COMPUTE INCREMENTAL STATS</code> on a table for the first time,
+ the statistics are computed again from scratch regardless of whether the table already
+ has statistics. Therefore, expect a one-time resource-intensive operation
+ for scanning the entire table when running <code class="ph codeph">COMPUTE INCREMENTAL STATS</code>
+ for the first time on a given table.
+ </p>
+ </div>
+
+ <p class="p">
+ The metadata for incremental statistics is handled differently from the original style
+ of statistics:
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ <p class="p">
+ Issuing a <code class="ph codeph">COMPUTE INCREMENTAL STATS</code> without a partition
+ clause causes Impala to compute incremental stats for all partitions that
+ do not already have incremental stats. This might be the entire table when
+ running the command for the first time, but subsequent runs should only
+ update new partitions. You can force updating a partition that already has
+ incremental stats by issuing a <code class="ph codeph">DROP INCREMENTAL STATS</code>
+ before running <code class="ph codeph">COMPUTE INCREMENTAL STATS</code>.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ The <code class="ph codeph">SHOW TABLE STATS</code> and <code class="ph codeph">SHOW PARTITIONS</code>
+ statements now include an additional column showing whether incremental statistics
+ are available for each column. A partition could already be covered by the original
+ type of statistics based on a prior <code class="ph codeph">COMPUTE STATS</code> statement, as
+ indicated by a value other than <code class="ph codeph">-1</code> under the <code class="ph codeph">#Rows</code>
+ column. Impala query planning uses either kind of statistics when available.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ <code class="ph codeph">COMPUTE INCREMENTAL STATS</code> takes more time than <code class="ph codeph">COMPUTE
+ STATS</code> for the same volume of data. Therefore it is most suitable for tables
+ with large data volume where new partitions are added frequently, making it
+ impractical to run a full <code class="ph codeph">COMPUTE STATS</code> operation for each new
+ partition. For unpartitioned tables, or partitioned tables that are loaded once and
+ not updated with new partitions, use the original <code class="ph codeph">COMPUTE STATS</code>
+ syntax.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ <code class="ph codeph">COMPUTE INCREMENTAL STATS</code> uses some memory in the
+ <span class="keyword cmdname">catalogd</span> process, proportional to the number of partitions and
+ number of columns in the applicable table. The memory overhead is approximately 400
+ bytes for each column in each partition. This memory is reserved in the
+ <span class="keyword cmdname">catalogd</span> daemon, the <span class="keyword cmdname">statestored</span> daemon, and
+ in each instance of the <span class="keyword cmdname">impalad</span> daemon.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ In cases where new files are added to an existing partition, issue a
+ <code class="ph codeph">REFRESH</code> statement for the table, followed by a <code class="ph codeph">DROP
+ INCREMENTAL STATS</code> and <code class="ph codeph">COMPUTE INCREMENTAL STATS</code> sequence
+ for the changed partition.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ The <code class="ph codeph">DROP INCREMENTAL STATS</code> statement operates only on a single
+ partition at a time. To remove statistics (whether incremental or not) from all
+ partitions of a table, issue a <code class="ph codeph">DROP STATS</code> statement with no
+ <code class="ph codeph">INCREMENTAL</code> or <code class="ph codeph">PARTITION</code> clauses.
+ </p>
+ </li>
+ </ul>
+
+ <p class="p">
+ The following considerations apply to incremental statistics when the structure of an
+ existing table is changed (known as <dfn class="term">schema evolution</dfn>):
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ <p class="p">
+ If you use an <code class="ph codeph">ALTER TABLE</code> statement to drop a column, the existing
+ statistics remain valid and <code class="ph codeph">COMPUTE INCREMENTAL STATS</code> does not
+ rescan any partitions.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ If you use an <code class="ph codeph">ALTER TABLE</code> statement to add a column, Impala rescans
+ all partitions and fills in the appropriate column-level values the next time you
+ run <code class="ph codeph">COMPUTE INCREMENTAL STATS</code>.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ If you use an <code class="ph codeph">ALTER TABLE</code> statement to change the data type of a
+ column, Impala rescans all partitions and fills in the appropriate column-level
+ values the next time you run <code class="ph codeph">COMPUTE INCREMENTAL STATS</code>.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ If you use an <code class="ph codeph">ALTER TABLE</code> statement to change the file format of a
+ table, the existing statistics remain valid and a subsequent <code class="ph codeph">COMPUTE
+ INCREMENTAL STATS</code> does not rescan any partitions.
+ </p>
+ </li>
+ </ul>
+
+ <p class="p">
+ See <a class="xref" href="impala_compute_stats.html#compute_stats">COMPUTE STATS Statement</a> and
+ <a class="xref" href="impala_drop_stats.html#drop_stats">DROP STATS Statement</a> for syntax details.
+ </p>
+
+ </div>
+
+ </article>
+
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title11" id="perf_stats__perf_stats_checking">
+
+ <h2 class="title topictitle2" id="ariaid-title11">Detecting Missing Statistics</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ You can check whether a specific table has statistics using the <code class="ph codeph">SHOW TABLE
+ STATS</code> statement (for any table) or the <code class="ph codeph">SHOW PARTITIONS</code>
+ statement (for a partitioned table). Both statements display the same information. If a
+ table or a partition does not have any statistics, the <code class="ph codeph">#Rows</code> field
+ contains <code class="ph codeph">-1</code>. Once you compute statistics for the table or partition,
+ the <code class="ph codeph">#Rows</code> field changes to an accurate value.
+ </p>
+
+ <p class="p">
+ The following example shows a table that initially does not have any statistics. The
+ <code class="ph codeph">SHOW TABLE STATS</code> statement displays different values for
+ <code class="ph codeph">#Rows</code> before and after the <code class="ph codeph">COMPUTE STATS</code> operation.
+ </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > create table no_stats (x int);
+[localhost:21000] > show table stats no_stats;
++-------+--------+------+--------------+--------+-------------------+
+| #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
++-------+--------+------+--------------+--------+-------------------+
+| -1 | 0 | 0B | NOT CACHED | TEXT | false |
++-------+--------+------+--------------+--------+-------------------+
+[localhost:21000] > compute stats no_stats;
++-----------------------------------------+
+| summary |
++-----------------------------------------+
+| Updated 1 partition(s) and 1 column(s). |
++-----------------------------------------+
+[localhost:21000] > show table stats no_stats;
++-------+--------+------+--------------+--------+-------------------+
+| #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
++-------+--------+------+--------------+--------+-------------------+
+| 0 | 0 | 0B | NOT CACHED | TEXT | false |
++-------+--------+------+--------------+--------+-------------------+
+</code></pre>
+
+ <p class="p">
+ The following example shows a similar progression with a partitioned table. Initially,
+ <code class="ph codeph">#Rows</code> is <code class="ph codeph">-1</code>. After a <code class="ph codeph">COMPUTE STATS</code>
+ operation, <code class="ph codeph">#Rows</code> changes to an accurate value. Any newly added
+ partition starts with no statistics, meaning that you must collect statistics after
+ adding a new partition.
+ </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > create table no_stats_partitioned (x int) partitioned by (year smallint);
+[localhost:21000] > show table stats no_stats_partitioned;
++-------+-------+--------+------+--------------+--------+-------------------+
+| year | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
++-------+-------+--------+------+--------------+--------+-------------------+
+| Total | -1 | 0 | 0B | 0B | | |
++-------+-------+--------+------+--------------+--------+-------------------+
+[localhost:21000] > show partitions no_stats_partitioned;
++-------+-------+--------+------+--------------+--------+-------------------+
+| year | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
++-------+-------+--------+------+--------------+--------+-------------------+
+| Total | -1 | 0 | 0B | 0B | | |
++-------+-------+--------+------+--------------+--------+-------------------+
+[localhost:21000] > alter table no_stats_partitioned add partition (year=2013);
+[localhost:21000] > compute stats no_stats_partitioned;
++-----------------------------------------+
+| summary |
++-----------------------------------------+
+| Updated 1 partition(s) and 1 column(s). |
++-----------------------------------------+
+[localhost:21000] > alter table no_stats_partitioned add partition (year=2014);
+[localhost:21000] > show partitions no_stats_partitioned;
++-------+-------+--------+------+--------------+--------+-------------------+
+| year | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
++-------+-------+--------+------+--------------+--------+-------------------+
+| 2013 | 0 | 0 | 0B | NOT CACHED | TEXT | false |
+| 2014 | -1 | 0 | 0B | NOT CACHED | TEXT | false |
+| Total | 0 | 0 | 0B | 0B | | |
++-------+-------+--------+------+--------------+--------+-------------------+
+</code></pre>
+
+ <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+ Because the default <code class="ph codeph">COMPUTE STATS</code> statement creates and updates
+ statistics for all partitions in a table, if you expect to frequently add new
+ partitions, use the <code class="ph codeph">COMPUTE INCREMENTAL STATS</code> syntax instead, which
+ lets you compute stats for a single specified partition, or only for those partitions
+ that do not already have incremental stats.
+ </div>
+
+ <p class="p">
+ If checking each individual table is impractical, due to a large number of tables or
+ views that hide the underlying base tables, you can also check for missing statistics
+ for a particular query. Use the <code class="ph codeph">EXPLAIN</code> statement to preview query
+ efficiency before actually running the query. Use the query profile output available
+ through the <code class="ph codeph">PROFILE</code> command in <span class="keyword cmdname">impala-shell</span> or the
+ web UI to verify query execution and timing after running the query. Both the
+ <code class="ph codeph">EXPLAIN</code> plan and the <code class="ph codeph">PROFILE</code> output display a warning
+ if any tables or partitions involved in the query do not have statistics.
+ </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > create table no_stats (x int);
+[localhost:21000] > explain select count(*) from no_stats;
++------------------------------------------------------------------------------------+
+| Explain String |
++------------------------------------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=10.00MB VCores=1 |
+| WARNING: The following tables are missing relevant table and/or column statistics. |
+| incremental_stats.no_stats |
+| |
+| 03:AGGREGATE [FINALIZE] |
+| | output: count:merge(*) |
+| | |
+| 02:EXCHANGE [UNPARTITIONED] |
+| | |
+| 01:AGGREGATE |
+| | output: count(*) |
+| | |
+| 00:SCAN HDFS [incremental_stats.no_stats] |
+| partitions=1/1 files=0 size=0B |
++------------------------------------------------------------------------------------+
+</code></pre>
+
+ <p class="p">
+ Because Impala uses the <dfn class="term">partition pruning</dfn> technique when possible to only
+ evaluate certain partitions, if you have a partitioned table with statistics for some
+ partitions and not others, whether or not the <code class="ph codeph">EXPLAIN</code> statement shows
+ the warning depends on the actual partitions used by the query. For example, you might
+ see warnings or not for different queries against the same table:
+ </p>
+
+<pre class="pre codeblock"><code>-- No warning because all the partitions for the year 2012 have stats.
+EXPLAIN SELECT ... FROM t1 WHERE year = 2012;
+
+-- Missing stats warning because one or more partitions in this range
+-- do not have stats.
+EXPLAIN SELECT ... FROM t1 WHERE year BETWEEN 2006 AND 2009;
+</code></pre>
+
+ <p class="p">
+ To confirm if any partitions at all in the table are missing statistics, you might
+ explain a query that scans the entire table, such as <code class="ph codeph">SELECT COUNT(*) FROM
+ <var class="keyword varname">table_name</var></code>.
+ </p>
+
+ </div>
+
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title12" id="perf_stats__concept_s3c_4gl_mdb">
+
+ <h2 class="title topictitle2" id="ariaid-title12">Manually Setting Table and Column Statistics with ALTER TABLE</h2>
+
+ <article class="topic concept nested2" aria-labelledby="ariaid-title13" id="concept_s3c_4gl_mdb__concept_wpt_pgl_mdb">
+
+ <h3 class="title topictitle3" id="ariaid-title13">Setting Table Statistics</h3>
+
+ <div class="body conbody">
+
+ <p class="p">
+ The most crucial piece of data in all the statistics is the number of rows in the
+ table (for an unpartitioned or partitioned table) and for each partition (for a
+ partitioned table). The <code class="ph codeph">COMPUTE STATS</code> statement always gathers
+ statistics about all columns, as well as overall table statistics. If it is not
+ practical to do a full <code class="ph codeph">COMPUTE STATS</code> or <code class="ph codeph">COMPUTE INCREMENTAL
+ STATS</code> operation after adding a partition or inserting data, or if you can see
+ that Impala would produce a more efficient plan if the number of rows was different,
+ you can manually set the number of rows through an <code class="ph codeph">ALTER TABLE</code>
+ statement:
+ </p>
+
+<pre class="pre codeblock"><code>
+-- Set total number of rows. Applies to both unpartitioned and partitioned tables.
+alter table <var class="keyword varname">table_name</var> set tblproperties('numRows'='<var class="keyword varname">new_value</var>', 'STATS_GENERATED_VIA_STATS_TASK'='true');
+
+-- Set total number of rows for a specific partition. Applies to partitioned tables only.
+-- You must specify all the partition key columns in the PARTITION clause.
+alter table <var class="keyword varname">table_name</var> partition (<var class="keyword varname">keycol1</var>=<var class="keyword varname">val1</var>,<var class="keyword varname">keycol2</var>=<var class="keyword varname">val2</var>...) set tblproperties('numRows'='<var class="keyword varname">new_value</var>', 'STATS_GENERATED_VIA_STATS_TASK'='true');
+</code></pre>
+
+ <p class="p">
+ This statement avoids re-scanning any data files. (The requirement to include the
+ <code class="ph codeph">STATS_GENERATED_VIA_STATS_TASK</code> property is relatively new, as a
+ result of the issue
+ <a class="xref" href="https://issues.apache.org/jira/browse/HIVE-8648" target="_blank">HIVE-8648</a>
+ for the Hive metastore.)
+ </p>
+
+<pre class="pre codeblock"><code>create table analysis_data stored as parquet as select * from raw_data;
+Inserted 1000000000 rows in 181.98s
+compute stats analysis_data;
+insert into analysis_data select * from smaller_table_we_forgot_before;
+Inserted 1000000 rows in 15.32s
+-- Now there are 1001000000 rows. We can update this single data point in the stats.
+alter table analysis_data set tblproperties('numRows'='1001000000', 'STATS_GENERATED_VIA_STATS_TASK'='true');</code></pre>
+
+ <p class="p">
+ For a partitioned table, update both the per-partition number of rows and the number
+ of rows for the whole table:
+ </p>
+
+<pre class="pre codeblock"><code>-- If the table originally contained 1 million rows, and we add another partition with 30 thousand rows,
+-- change the numRows property for the partition and the overall table.
+alter table partitioned_data partition(year=2009, month=4) set tblproperties ('numRows'='30000', 'STATS_GENERATED_VIA_STATS_TASK'='true');
+alter table partitioned_data set tblproperties ('numRows'='1030000', 'STATS_GENERATED_VIA_STATS_TASK'='true');</code></pre>
+
+ <p class="p">
+ In practice, the <code class="ph codeph">COMPUTE STATS</code> statement, or <code class="ph codeph">COMPUTE
+ INCREMENTAL STATS</code> for a partitioned table, should be fast and convenient
+ enough that this technique is only useful for the very largest partitioned tables.
+
+
+ Because the column statistics might be left in a stale state, do not use this
+ technique as a replacement for <code class="ph codeph">COMPUTE STATS</code>. Only use this technique
+ if all other means of collecting statistics are impractical, or as a low-overhead
+ operation that you run in between periodic <code class="ph codeph">COMPUTE STATS</code> or
+ <code class="ph codeph">COMPUTE INCREMENTAL STATS</code> operations.
+ </p>
+
+ </div>
+
+ </article>
+
+ <article class="topic concept nested2" aria-labelledby="ariaid-title14" id="concept_s3c_4gl_mdb__concept_asb_vgl_mdb">
+
+ <h3 class="title topictitle3" id="ariaid-title14">Setting Column Statistics</h3>
+
+ <div class="body conbody">
+
+ <p class="p">
+ In <span class="keyword">Impala 2.6</span> and higher, you can also use the <code class="ph codeph">SET
+ COLUMN STATS</code> clause of <code class="ph codeph">ALTER TABLE</code> to manually set or change
+ column statistics. Only use this technique in cases where it is impractical to run
+ <code class="ph codeph">COMPUTE STATS</code> or <code class="ph codeph">COMPUTE INCREMENTAL STATS</code>
+ frequently enough to keep up with data changes for a huge table.
+ </p>
+
+ <div class="p">
+ You specify a case-insensitive symbolic name for the kind of statistics:
+ <code class="ph codeph">numDVs</code>, <code class="ph codeph">numNulls</code>, <code class="ph codeph">avgSize</code>, <code class="ph codeph">maxSize</code>.
+ The key names and values are both quoted. This operation applies to an entire table,
+ not a specific partition. For example:
+<pre class="pre codeblock"><code>
+create table t1 (x int, s string);
+insert into t1 values (1, 'one'), (2, 'two'), (2, 'deux');
+show column stats t1;
++--------+--------+------------------+--------+----------+----------+
+| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
++--------+--------+------------------+--------+----------+----------+
+| x | INT | -1 | -1 | 4 | 4 |
+| s | STRING | -1 | -1 | -1 | -1 |
++--------+--------+------------------+--------+----------+----------+
+alter table t1 set column stats x ('numDVs'='2','numNulls'='0');
+alter table t1 set column stats s ('numdvs'='3','maxsize'='4');
+show column stats t1;
++--------+--------+------------------+--------+----------+----------+
+| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
++--------+--------+------------------+--------+----------+----------+
+| x | INT | 2 | 0 | 4 | 4 |
+| s | STRING | 3 | -1 | 4 | -1 |
++--------+--------+------------------+--------+----------+----------+
+</code></pre>
+ </div>
+
+ </div>
+
+ </article>
+
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title15" id="perf_stats__perf_stats_examples">
+
+ <h2 class="title topictitle2" id="ariaid-title15">Examples of Using Table and Column Statistics with Impala</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ The following examples walk through a sequence of <code class="ph codeph">SHOW TABLE STATS</code>,
+ <code class="ph codeph">SHOW COLUMN STATS</code>, <code class="ph codeph">ALTER TABLE</code>, and
+ <code class="ph codeph">SELECT</code> and <code class="ph codeph">INSERT</code> statements to illustrate various
+ aspects of how Impala uses statistics to help optimize queries.
+ </p>
+
+ <p class="p">
+ This example shows table and column statistics for the <code class="ph codeph">STORE</code> column
+ used in the <a class="xref" href="http://www.tpc.org/tpcds/" target="_blank">TPC-DS
+ benchmarks for decision support</a> systems. It is a tiny table holding data for 12
+ stores. Initially, before any statistics are gathered by a <code class="ph codeph">COMPUTE
+ STATS</code> statement, most of the numeric fields show placeholder values of -1,
+ indicating that the figures are unknown. The figures that are filled in are values that
+ are easily countable or deducible at the physical level, such as the number of files,
+ total data size of the files, and the maximum and average sizes for data types that have
+ a constant size such as <code class="ph codeph">INT</code>, <code class="ph codeph">FLOAT</code>, and
+ <code class="ph codeph">TIMESTAMP</code>.
+ </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > show table stats store;
++-------+--------+--------+--------+
+| #Rows | #Files | Size | Format |
++-------+--------+--------+--------+
+| -1 | 1 | 3.08KB | TEXT |
++-------+--------+--------+--------+
+Returned 1 row(s) in 0.03s
+[localhost:21000] > show column stats store;
++--------------------+-----------+------------------+--------+----------+----------+
+| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
++--------------------+-----------+------------------+--------+----------+----------+
+| s_store_sk | INT | -1 | -1 | 4 | 4 |
+| s_store_id | STRING | -1 | -1 | -1 | -1 |
+| s_rec_start_date | TIMESTAMP | -1 | -1 | 16 | 16 |
+| s_rec_end_date | TIMESTAMP | -1 | -1 | 16 | 16 |
+| s_closed_date_sk | INT | -1 | -1 | 4 | 4 |
+| s_store_name | STRING | -1 | -1 | -1 | -1 |
+| s_number_employees | INT | -1 | -1 | 4 | 4 |
+| s_floor_space | INT | -1 | -1 | 4 | 4 |
+| s_hours | STRING | -1 | -1 | -1 | -1 |
+| s_manager | STRING | -1 | -1 | -1 | -1 |
+| s_market_id | INT | -1 | -1 | 4 | 4 |
+| s_geography_class | STRING | -1 | -1 | -1 | -1 |
+| s_market_desc | STRING | -1 | -1 | -1 | -1 |
+| s_market_manager | STRING | -1 | -1 | -1 | -1 |
+| s_division_id | INT | -1 | -1 | 4 | 4 |
+| s_division_name | STRING | -1 | -1 | -1 | -1 |
+| s_company_id | INT | -1 | -1 | 4 | 4 |
+| s_company_name | STRING | -1 | -1 | -1 | -1 |
+| s_street_number | STRING | -1 | -1 | -1 | -1 |
+| s_street_name | STRING | -1 | -1 | -1 | -1 |
+| s_street_type | STRING | -1 | -1 | -1 | -1 |
+| s_suite_number | STRING | -1 | -1 | -1 | -1 |
+| s_city | STRING | -1 | -1 | -1 | -1 |
+| s_county | STRING | -1 | -1 | -1 | -1 |
+| s_state | STRING | -1 | -1 | -1 | -1 |
+| s_zip | STRING | -1 | -1 | -1 | -1 |
+| s_country | STRING | -1 | -1 | -1 | -1 |
+| s_gmt_offset | FLOAT | -1 | -1 | 4 | 4 |
+| s_tax_percentage | FLOAT | -1 | -1 | 4 | 4 |
++--------------------+-----------+------------------+--------+----------+----------+
+Returned 29 row(s) in 0.04s</code></pre>
+
+ <p class="p">
+ With the Hive <code class="ph codeph">ANALYZE TABLE</code> statement for column statistics, you had to
+ specify each column for which to gather statistics. The Impala <code class="ph codeph">COMPUTE
+ STATS</code> statement automatically gathers statistics for all columns, because it
+ reads through the entire table relatively quickly and can efficiently compute the values
+ for all the columns. This example shows how after running the <code class="ph codeph">COMPUTE
+ STATS</code> statement, statistics are filled in for both the table and all its
+ columns:
+ </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > compute stats store;
++------------------------------------------+
+| summary |
++------------------------------------------+
+| Updated 1 partition(s) and 29 column(s). |
++------------------------------------------+
+Returned 1 row(s) in 1.88s
+[localhost:21000] > show table stats store;
++-------+--------+--------+--------+
+| #Rows | #Files | Size | Format |
++-------+--------+--------+--------+
+| 12 | 1 | 3.08KB | TEXT |
++-------+--------+--------+--------+
+Returned 1 row(s) in 0.02s
+[localhost:21000] > show column stats store;
++--------------------+-----------+------------------+--------+----------+-------------------+
+| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
++--------------------+-----------+------------------+--------+----------+-------------------+
+| s_store_sk | INT | 12 | -1 | 4 | 4 |
+| s_store_id | STRING | 6 | -1 | 16 | 16 |
+| s_rec_start_date | TIMESTAMP | 4 | -1 | 16 | 16 |
+| s_rec_end_date | TIMESTAMP | 3 | -1 | 16 | 16 |
+| s_closed_date_sk | INT | 3 | -1 | 4 | 4 |
+| s_store_name | STRING | 8 | -1 | 5 | 4.25 |
+| s_number_employees | INT | 9 | -1 | 4 | 4 |
+| s_floor_space | INT | 10 | -1 | 4 | 4 |
+| s_hours | STRING | 2 | -1 | 8 | 7.083300113677979 |
+| s_manager | STRING | 7 | -1 | 15 | 12 |
+| s_market_id | INT | 7 | -1 | 4 | 4 |
+| s_geography_class | STRING | 1 | -1 | 7 | 7 |
+| s_market_desc | STRING | 10 | -1 | 94 | 55.5 |
+| s_market_manager | STRING | 7 | -1 | 16 | 14 |
+| s_division_id | INT | 1 | -1 | 4 | 4 |
+| s_division_name | STRING | 1 | -1 | 7 | 7 |
+| s_company_id | INT | 1 | -1 | 4 | 4 |
+| s_company_name | STRING | 1 | -1 | 7 | 7 |
+| s_street_number | STRING | 9 | -1 | 3 | 2.833300113677979 |
+| s_street_name | STRING | 12 | -1 | 11 | 6.583300113677979 |
+| s_street_type | STRING | 8 | -1 | 9 | 4.833300113677979 |
+| s_suite_number | STRING | 11 | -1 | 9 | 8.25 |
+| s_city | STRING | 2 | -1 | 8 | 6.5 |
+| s_county | STRING | 1 | -1 | 17 | 17 |
+| s_state | STRING | 1 | -1 | 2 | 2 |
+| s_zip | STRING | 2 | -1 | 5 | 5 |
+| s_country | STRING | 1 | -1 | 13 | 13 |
+| s_gmt_offset | FLOAT | 1 | -1 | 4 | 4 |
+| s_tax_percentage | FLOAT | 5 | -1 | 4 | 4 |
++--------------------+-----------+------------------+--------+----------+-------------------+
+Returned 29 row(s) in 0.04s</code></pre>
+
+ <p class="p">
+ The following example shows how statistics are represented for a partitioned table. In
+ this case, we have set up a table to hold the world's most trivial census data, a single
+ <code class="ph codeph">STRING</code> field, partitioned by a <code class="ph codeph">YEAR</code> column. The table
+ statistics include a separate entry for each partition, plus final totals for the
+ numeric fields. The column statistics include some easily deducible facts for the
+ partitioning column, such as the number of distinct values (the number of partition
+ subdirectories).
+
+ </p>
+
+<pre class="pre codeblock"><code>localhost:21000] > describe census;
++------+----------+---------+
+| name | type | comment |
++------+----------+---------+
+| name | string | |
+| year | smallint | |
++------+----------+---------+
+Returned 2 row(s) in 0.02s
+[localhost:21000] > show table stats census;
++-------+-------+--------+------+---------+
+| year | #Rows | #Files | Size | Format |
++-------+-------+--------+------+---------+
+| 2000 | -1 | 0 | 0B | TEXT |
+| 2004 | -1 | 0 | 0B | TEXT |
+| 2008 | -1 | 0 | 0B | TEXT |
+| 2010 | -1 | 0 | 0B | TEXT |
+| 2011 | 0 | 1 | 22B | TEXT |
+| 2012 | -1 | 1 | 22B | TEXT |
+| 2013 | -1 | 1 | 231B | PARQUET |
+| Total | 0 | 3 | 275B | |
++-------+-------+--------+------+---------+
+Returned 8 row(s) in 0.02s
+[localhost:21000] > show column stats census;
++--------+----------+------------------+--------+----------+----------+
+| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
++--------+----------+------------------+--------+----------+----------+
+| name | STRING | -1 | -1 | -1 | -1 |
+| year | SMALLINT | 7 | -1 | 2 | 2 |
++--------+----------+------------------+--------+----------+----------+
+Returned 2 row(s) in 0.02s</code></pre>
+
+ <p class="p">
+ The following example shows how the statistics are filled in by a <code class="ph codeph">COMPUTE
+ STATS</code> statement in Impala.
+ </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > compute stats census;
++-----------------------------------------+
+| summary |
++-----------------------------------------+
+| Updated 3 partition(s) and 1 column(s). |
++-----------------------------------------+
+Returned 1 row(s) in 2.16s
+[localhost:21000] > show table stats census;
++-------+-------+--------+------+---------+
+| year | #Rows | #Files | Size | Format |
++-------+-------+--------+------+---------+
+| 2000 | -1 | 0 | 0B | TEXT |
+| 2004 | -1 | 0 | 0B | TEXT |
+| 2008 | -1 | 0 | 0B | TEXT |
+| 2010 | -1 | 0 | 0B | TEXT |
+| 2011 | 4 | 1 | 22B | TEXT |
+| 2012 | 4 | 1 | 22B | TEXT |
+| 2013 | 1 | 1 | 231B | PARQUET |
+| Total | 9 | 3 | 275B | |
++-------+-------+--------+------+---------+
+Returned 8 row(s) in 0.02s
+[localhost:21000] > show column stats census;
++--------+----------+------------------+--------+----------+----------+
+| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
++--------+----------+------------------+--------+----------+----------+
+| name | STRING | 4 | -1 | 5 | 4.5 |
+| year | SMALLINT | 7 | -1 | 2 | 2 |
++--------+----------+------------------+--------+----------+----------+
+Returned 2 row(s) in 0.02s</code></pre>
+
+ <p class="p">
+ For examples showing how some queries work differently when statistics are available,
+ see <a class="xref" href="impala_perf_joins.html#perf_joins_examples">Examples of Join Order Optimization</a>. You can see how Impala
+ executes a query differently in each case by observing the <code class="ph codeph">EXPLAIN</code>
+ output before and after collecting statistics. Measure the before and after query times,
+ and examine the throughput numbers in before and after <code class="ph codeph">SUMMARY</code> or
+ <code class="ph codeph">PROFILE</code> output, to verify how much the improved plan speeds up
+ performance.
+ </p>
+
+ </div>
+
+ </article>
+
+</article></main></body></html>
http://git-wip-us.apache.org/repos/asf/impala/blob/fae51ec2/docs/build3x/html/topics/impala_perf_testing.html
----------------------------------------------------------------------
diff --git a/docs/build3x/html/topics/impala_perf_testing.html b/docs/build3x/html/topics/impala_perf_testing.html
new file mode 100644
index 0000000..fee319a
--- /dev/null
+++ b/docs/build3x/html/topics/impala_perf_testing.html
@@ -0,0 +1,152 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2018"><meta name="DC.rights.owner" content="(C) Copyright 2018"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_performance.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="performance_testing"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Testing Impala Performance</title></head><body id="performance_testing"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+ <h1 class="title topictitle1" id="ariaid-title1">Testing Impala Performance</h1>
+
+
+ <div class="body conbody">
+
+ <p class="p">
+ Test to ensure that Impala is configured for optimal performance. If you have installed Impala with cluster
+ management software, complete the processes described in this topic to help ensure a proper
+ configuration. These procedures can be used to verify that Impala is set up correctly.
+ </p>
+
+ <section class="section" id="performance_testing__checking_config_performance"><h2 class="title sectiontitle">Checking Impala Configuration Values</h2>
+
+
+
+ <p class="p">
+ You can inspect Impala configuration values by connecting to your Impala server using a browser.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">To check Impala configuration values:</strong>
+ </p>
+
+ <ol class="ol">
+ <li class="li">
+ Use a browser to connect to one of the hosts running <code class="ph codeph">impalad</code> in your environment.
+ Connect using an address of the form
+ <code class="ph codeph">http://<var class="keyword varname">hostname</var>:<var class="keyword varname">port</var>/varz</code>.
+ <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+ In the preceding example, replace <code class="ph codeph">hostname</code> and <code class="ph codeph">port</code> with the name and
+ port of your Impala server. The default port is 25000.
+ </div>
+ </li>
+
+ <li class="li">
+ Review the configured values.
+ <p class="p">
+ For example, to check that your system is configured to use block locality tracking information, you
+ would check that the value for <code class="ph codeph">dfs.datanode.hdfs-blocks-metadata.enabled</code> is
+ <code class="ph codeph">true</code>.
+ </p>
+ </li>
+ </ol>
+
+ <p class="p" id="performance_testing__p_31">
+ <strong class="ph b">To check data locality:</strong>
+ </p>
+
+ <ol class="ol">
+ <li class="li">
+ Execute a query on a dataset that is available across multiple nodes. For example, for a table named
+ <code class="ph codeph">MyTable</code> that has a reasonable chance of being spread across multiple DataNodes:
+<pre class="pre codeblock"><code>[impalad-host:21000] > SELECT COUNT (*) FROM MyTable</code></pre>
+ </li>
+
+ <li class="li">
+ After the query completes, review the contents of the Impala logs. You should find a recent message
+ similar to the following:
+<pre class="pre codeblock"><code>Total remote scan volume = 0</code></pre>
+ </li>
+ </ol>
+
+ <p class="p">
+ The presence of remote scans may indicate <code class="ph codeph">impalad</code> is not running on the correct nodes.
+ This can be because some DataNodes do not have <code class="ph codeph">impalad</code> running or it can be because the
+ <code class="ph codeph">impalad</code> instance that is starting the query is unable to contact one or more of the
+ <code class="ph codeph">impalad</code> instances.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">To understand the causes of this issue:</strong>
+ </p>
+
+ <ol class="ol">
+ <li class="li">
+ Connect to the debugging web server. By default, this server runs on port 25000. This page lists all
+ <code class="ph codeph">impalad</code> instances running in your cluster. If there are fewer instances than you expect,
+ this often indicates some DataNodes are not running <code class="ph codeph">impalad</code>. Ensure
+ <code class="ph codeph">impalad</code> is started on all DataNodes.
+ </li>
+
+ <li class="li">
+
+ If you are using multi-homed hosts, ensure that the Impala daemon's hostname resolves to the interface on
+ which <code class="ph codeph">impalad</code> is running. The hostname Impala is using is displayed when
+ <code class="ph codeph">impalad</code> starts. To explicitly set the hostname, use the <code class="ph codeph">--hostname</code> flag.
+ </li>
+
+ <li class="li">
+ Check that <code class="ph codeph">statestored</code> is running as expected. Review the contents of the state store
+ log to ensure all instances of <code class="ph codeph">impalad</code> are listed as having connected to the state
+ store.
+ </li>
+ </ol>
+ </section>
+
+ <section class="section" id="performance_testing__checking_config_logs"><h2 class="title sectiontitle">Reviewing Impala Logs</h2>
+
+
+
+ <p class="p">
+ You can review the contents of the Impala logs for signs that short-circuit reads or block location
+ tracking are not functioning. Before checking logs, execute a simple query against a small HDFS dataset.
+ Completing a query task generates log messages using current settings. Information on starting Impala and
+ executing queries can be found in <a class="xref" href="impala_processes.html#processes">Starting Impala</a> and
+ <a class="xref" href="impala_impala_shell.html#impala_shell">Using the Impala Shell (impala-shell Command)</a>. Information on logging can be found in
+ <a class="xref" href="impala_logging.html#logging">Using Impala Logging</a>. Log messages and their interpretations are as follows:
+ </p>
+
+ <table class="table"><caption></caption><colgroup><col style="width:75%"><col style="width:25%"></colgroup><thead class="thead">
+ <tr class="row">
+ <th class="entry nocellnorowborder" id="performance_testing__checking_config_logs__entry__1">
+ Log Message
+ </th>
+ <th class="entry nocellnorowborder" id="performance_testing__checking_config_logs__entry__2">
+ Interpretation
+ </th>
+ </tr>
+ </thead><tbody class="tbody">
+ <tr class="row">
+ <td class="entry nocellnorowborder" headers="performance_testing__checking_config_logs__entry__1 ">
+ <div class="p">
+<pre class="pre">Unknown disk id. This will negatively affect performance. Check your hdfs settings to enable block location metadata
+</pre>
+ </div>
+ </td>
+ <td class="entry nocellnorowborder" headers="performance_testing__checking_config_logs__entry__2 ">
+ <p class="p">
+ Tracking block locality is not enabled.
+ </p>
+ </td>
+ </tr>
+ <tr class="row">
+ <td class="entry nocellnorowborder" headers="performance_testing__checking_config_logs__entry__1 ">
+ <div class="p">
+<pre class="pre">Unable to load native-hadoop library for your platform... using builtin-java classes where applicable</pre>
+ </div>
+ </td>
+ <td class="entry nocellnorowborder" headers="performance_testing__checking_config_logs__entry__2 ">
+ <p class="p">
+ Native checksumming is not enabled.
+ </p>
+ </td>
+ </tr>
+ </tbody></table>
+ </section>
+ </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_performance.html">Tuning Impala for Performance</a></div></div></nav></article></main></body></html>
http://git-wip-us.apache.org/repos/asf/impala/blob/fae51ec2/docs/build3x/html/topics/impala_performance.html
----------------------------------------------------------------------
diff --git a/docs/build3x/html/topics/impala_performance.html b/docs/build3x/html/topics/impala_performance.html
new file mode 100644
index 0000000..bc87821
--- /dev/null
+++ b/docs/build3x/html/topics/impala_performance.html
@@ -0,0 +1,116 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2018"><meta name="DC.rights.owner" content="(C) Copyright 2018"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_perf_cookbook.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_perf_joins.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_perf_stats.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_perf_benchmarking.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_perf_resources.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_runtime_filtering.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_perf_hdfs_caching.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_perf_testing.html"><meta name="DC.Relation" scheme="URI" content="../topics/im
pala_explain_plan.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_perf_skew.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="performance"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Tuning Impala for Performance</title></head><body id="performance"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+ <h1 class="title topictitle1" id="ariaid-title1">Tuning Impala for Performance</h1>
+
+
+
+ <div class="body conbody">
+
+ <p class="p">
+ The following sections explain the factors affecting the performance of Impala features, and procedures for
+ tuning, monitoring, and benchmarking Impala queries and other SQL operations.
+ </p>
+
+ <p class="p">
+ This section also describes techniques for maximizing Impala scalability. Scalability is tied to performance:
+ it means that performance remains high as the system workload increases. For example, reducing the disk I/O
+ performed by a query can speed up an individual query, and at the same time improve scalability by making it
+ practical to run more queries simultaneously. Sometimes, an optimization technique improves scalability more
+ than performance. For example, reducing memory usage for a query might not change the query performance much,
+ but might improve scalability by allowing more Impala queries or other kinds of jobs to run at the same time
+ without running out of memory.
+ </p>
+
+ <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+ <p class="p">
+ Before starting any performance tuning or benchmarking, make sure your system is configured with all the
+ recommended minimum hardware requirements from <a class="xref" href="impala_prereqs.html#prereqs_hardware">Hardware Requirements</a> and
+ software settings from <a class="xref" href="impala_config_performance.html#config_performance">Post-Installation Configuration for Impala</a>.
+ </p>
+ </div>
+
+ <ul class="ul">
+ <li class="li">
+ <a class="xref" href="impala_partitioning.html#partitioning">Partitioning for Impala Tables</a>. This technique physically divides the data based on
+ the different values in frequently queried columns, allowing queries to skip reading a large percentage of
+ the data in a table.
+ </li>
+
+ <li class="li">
+ <a class="xref" href="impala_perf_joins.html#perf_joins">Performance Considerations for Join Queries</a>. Joins are the main class of queries that you can tune at
+ the SQL level, as opposed to changing physical factors such as the file format or the hardware
+ configuration. The related topics <a class="xref" href="impala_perf_stats.html#perf_column_stats">Overview of Column Statistics</a> and
+ <a class="xref" href="impala_perf_stats.html#perf_table_stats">Overview of Table Statistics</a> are also important primarily for join performance.
+ </li>
+
+ <li class="li">
+ <a class="xref" href="impala_perf_stats.html#perf_table_stats">Overview of Table Statistics</a> and
+ <a class="xref" href="impala_perf_stats.html#perf_column_stats">Overview of Column Statistics</a>. Gathering table and column statistics, using the
+ <code class="ph codeph">COMPUTE STATS</code> statement, helps Impala automatically optimize the performance for join
+ queries, without requiring changes to SQL query statements. (This process is greatly simplified in Impala
+ 1.2.2 and higher, because the <code class="ph codeph">COMPUTE STATS</code> statement gathers both kinds of statistics in
+ one operation, and does not require any setup and configuration as was previously necessary for the
+ <code class="ph codeph">ANALYZE TABLE</code> statement in Hive.)
+ </li>
+
+ <li class="li">
+ <a class="xref" href="impala_perf_testing.html#performance_testing">Testing Impala Performance</a>. Do some post-setup testing to ensure Impala is
+ using optimal settings for performance, before conducting any benchmark tests.
+ </li>
+
+ <li class="li">
+ <a class="xref" href="impala_perf_benchmarking.html#perf_benchmarks">Benchmarking Impala Queries</a>. The configuration and sample data that you use
+ for initial experiments with Impala is often not appropriate for doing performance tests.
+ </li>
+
+ <li class="li">
+ <a class="xref" href="impala_perf_resources.html#mem_limits">Controlling Impala Resource Usage</a>. The more memory Impala can utilize, the better query
+ performance you can expect. In a cluster running other kinds of workloads as well, you must make tradeoffs
+ to make sure all Hadoop components have enough memory to perform well, so you might cap the memory that
+ Impala can use.
+ </li>
+
+
+
+ <li class="li">
+ <a class="xref" href="impala_s3.html#s3">Using Impala with the Amazon S3 Filesystem</a>. Queries against data stored in the Amazon Simple Storage Service (S3)
+ have different performance characteristics than when the data is stored in HDFS.
+ </li>
+ </ul>
+
+ <p class="p toc"></p>
+
+ <p class="p">
+ A good source of tips related to scalability and performance tuning is the
+ <a class="xref" href="http://www.slideshare.net/cloudera/the-impala-cookbook-42530186" target="_blank">Impala Cookbook</a>
+ presentation. These slides are updated periodically as new features come out and new benchmarks are performed.
+ </p>
+
+ </div>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+<nav role="navigation" class="related-links"><ul class="ullinks"><li class="link ulchildlink"><strong><a href="../topics/impala_perf_cookbook.html">Impala Performance Guidelines and Best Practices</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_perf_joins.html">Performance Considerations for Join Queries</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_perf_stats.html">Table and Column Statistics</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_perf_benchmarking.html">Benchmarking Impala Queries</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_perf_resources.html">Controlling Impala Resource Usage</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_runtime_filtering.html">Runtime Filtering for Impala Queries (Impala 2.5 or higher only)</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/i
mpala_perf_hdfs_caching.html">Using HDFS Caching with Impala (Impala 2.1 or higher only)</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_perf_testing.html">Testing Impala Performance</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_explain_plan.html">Understanding Impala Query Performance - EXPLAIN Plans and Query Profiles</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_perf_skew.html">Detecting and Correcting HDFS Block Skew Conditions</a></strong><br></li></ul></nav></article></main></body></html>