You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@impala.apache.org by jb...@apache.org on 2017/04/12 18:25:20 UTC
[16/51] [partial] incubator-impala git commit: IMPALA-4181 [DOCS] Publish rendered Impala documentation to ASF site

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_partitioning.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_partitioning.html b/docs/build/html/topics/impala_partitioning.html
new file mode 100644
index 0000000..b361083
--- /dev/null
+++ b/docs/build/html/topics/impala_partitioning.html
@@ -0,0 +1,653 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version"
  content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="partitioning"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Partitioning for Impala Tables</title></head><body id="partitioning"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">Partitioning for Impala Tables</h1>
+
+  
+
+  
+
+  <div class="body conbody">
+
+    <p class="p">
+      
+      By default, all the data files for a table are located in a single directory. Partitioning is a technique for physically dividing the
+      data during loading, based on values from one or more columns, to speed up queries that test those columns. For example, with a
+      <code class="ph codeph">school_records</code> table partitioned on a <code class="ph codeph">year</code> column, there is a separate data directory for each
+      different year value, and all the data for that year is stored in a data file in that directory. A query that includes a
+      <code class="ph codeph">WHERE</code> condition such as <code class="ph codeph">YEAR=1966</code>, <code class="ph codeph">YEAR IN (1989,1999)</code>, or <code class="ph codeph">YEAR BETWEEN
+      1984 AND 1989</code> can examine only the data files from the appropriate directory or directories, greatly reducing the amount of
+      data to read and test.
+    </p>
+
+    <p class="p toc inpage"></p>
+
+    <p class="p">
+      See <a class="xref" href="impala_tutorial.html#tut_external_partition_data">Attaching an External Partitioned Table to an HDFS Directory Structure</a> for an example that illustrates the syntax for creating partitioned
+      tables, the underlying directory structure in HDFS, and how to attach a partitioned Impala external table to data files stored
+      elsewhere in HDFS.
+    </p>
+
+    <p class="p">
+      Parquet is a popular format for partitioned Impala tables because it is well suited to handle huge data volumes. See
+      <a class="xref" href="impala_parquet.html#parquet_performance">Query Performance for Impala Parquet Tables</a> for performance considerations for partitioned Parquet tables.
+    </p>
+
+    <p class="p">
+      See <a class="xref" href="impala_literals.html#null">NULL</a> for details about how <code class="ph codeph">NULL</code> values are represented in partitioned tables.
+    </p>
+
+    <p class="p">
+      See <a class="xref" href="impala_s3.html#s3">Using Impala with the Amazon S3 Filesystem</a> for details about setting up tables where some or all partitions reside on the Amazon Simple
+      Storage Service (S3).
+    </p>
+
+  </div>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title2" id="partitioning__partitioning_choosing">
+
+    <h2 class="title topictitle2" id="ariaid-title2">When to Use Partitioned Tables</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        Partitioning is typically appropriate for:
+      </p>
+
+      <ul class="ul">
+        <li class="li">
+          Tables that are very large, where reading the entire data set takes an impractical amount of time.
+        </li>
+
+        <li class="li">
+          Tables that are always or almost always queried with conditions on the partitioning columns. In our example of a table partitioned
+          by year, <code class="ph codeph">SELECT COUNT(*) FROM school_records WHERE year = 1985</code> is efficient, only examining a small fraction of
+          the data; but <code class="ph codeph">SELECT COUNT(*) FROM school_records</code> has to process a separate data file for each year, resulting in
+          more overall work than in an unpartitioned table. You would probably not partition this way if you frequently queried the table
+          based on last name, student ID, and so on without testing the year.
+        </li>
+
+        <li class="li">
+          Columns that have reasonable cardinality (number of different values). If a column only has a small number of values, for example
+          <code class="ph codeph">Male</code> or <code class="ph codeph">Female</code>, you do not gain much efficiency by eliminating only about 50% of the data to
+          read for each query. If a column has only a few rows matching each value, the number of directories to process can become a
+          limiting factor, and the data file in each directory could be too small to take advantage of the Hadoop mechanism for transmitting
+          data in multi-megabyte blocks. For example, you might partition census data by year, store sales data by year and month, and web
+          traffic data by year, month, and day. (Some users with high volumes of incoming data might even partition down to the individual
+          hour and minute.)
+        </li>
+
+        <li class="li">
+          Data that already passes through an extract, transform, and load (ETL) pipeline. The values of the partitioning columns are
+          stripped from the original data files and represented by directory names, so loading data into a partitioned table involves some
+          sort of transformation or preprocessing.
+        </li>
+      </ul>
+
+    </div>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title3" id="partitioning__partition_sql">
+
+    <h2 class="title topictitle2" id="ariaid-title3">SQL Statements for Partitioned Tables</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        In terms of Impala SQL syntax, partitioning affects these statements:
+      </p>
+
+      <ul class="ul">
+        <li class="li">
+          <code class="ph codeph"><a class="xref" href="impala_create_table.html#create_table">CREATE TABLE</a></code>: you specify a <code class="ph codeph">PARTITIONED
+          BY</code> clause when creating the table to identify names and data types of the partitioning columns. These columns are not
+          included in the main list of columns for the table.
+        </li>
+
+        <li class="li">
+          In <span class="keyword">Impala 2.5</span> and higher, you can also use the <code class="ph codeph">PARTITIONED BY</code> clause in a <code class="ph codeph">CREATE TABLE AS
+          SELECT</code> statement. This syntax lets you use a single statement to create a partitioned table, copy data into it, and
+          create new partitions based on the values in the inserted data.
+        </li>
+
+        <li class="li">
+          <code class="ph codeph"><a class="xref" href="impala_alter_table.html#alter_table">ALTER TABLE</a></code>: you can add or drop partitions, to work with
+          different portions of a huge data set. You can designate the HDFS directory that holds the data files for a specific partition.
+          With data partitioned by date values, you might <span class="q">"age out"</span> data that is no longer relevant.
+          <div class="note note note_note"><span class="note__title notetitle">Note:</span> 
+        If you are creating a partition for the first time and specifying its location, for maximum efficiency, use
+        a single <code class="ph codeph">ALTER TABLE</code> statement including both the <code class="ph codeph">ADD PARTITION</code> and
+        <code class="ph codeph">LOCATION</code> clauses, rather than separate statements with <code class="ph codeph">ADD PARTITION</code> and
+        <code class="ph codeph">SET LOCATION</code> clauses.
+      </div>
+        </li>
+
+        <li class="li">
+          <code class="ph codeph"><a class="xref" href="impala_insert.html#insert">INSERT</a></code>: When you insert data into a partitioned table, you identify
+          the partitioning columns. One or more values from each inserted row are not stored in data files, but instead determine the
+          directory where that row value is stored. You can also specify which partition to load a set of data into, with <code class="ph codeph">INSERT
+          OVERWRITE</code> statements; you can replace the contents of a specific partition but you cannot append data to a specific
+          partition.
+          <p class="p">
+        By default, if an <code class="ph codeph">INSERT</code> statement creates any new subdirectories underneath a partitioned
+        table, those subdirectories are assigned default HDFS permissions for the <code class="ph codeph">impala</code> user. To
+        make each subdirectory have the same permissions as its parent directory in HDFS, specify the
+        <code class="ph codeph">--insert_inherit_permissions</code> startup option for the <span class="keyword cmdname">impalad</span> daemon.
+      </p>
+        </li>
+
+        <li class="li">
+          Although the syntax of the <code class="ph codeph"><a class="xref" href="impala_select.html#select">SELECT</a></code> statement is the same whether or
+          not the table is partitioned, the way queries interact with partitioned tables can have a dramatic impact on performance and
+          scalability. The mechanism that lets queries skip certain partitions during a query is known as partition pruning; see
+          <a class="xref" href="impala_partitioning.html#partition_pruning">Partition Pruning for Queries</a> for details.
+        </li>
+
+        <li class="li">
+          In Impala 1.4 and later, there is a <code class="ph codeph">SHOW PARTITIONS</code> statement that displays information about each partition in a
+          table. See <a class="xref" href="impala_show.html#show">SHOW Statement</a> for details.
+        </li>
+      </ul>
+
+    </div>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title4" id="partitioning__partition_static_dynamic">
+
+    <h2 class="title topictitle2" id="ariaid-title4">Static and Dynamic Partitioning Clauses</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        Specifying all the partition columns in a SQL statement is called <dfn class="term">static partitioning</dfn>, because the statement affects a
+        single predictable partition. For example, you use static partitioning with an <code class="ph codeph">ALTER TABLE</code> statement that affects
+        only one partition, or with an <code class="ph codeph">INSERT</code> statement that inserts all values into the same partition:
+      </p>
+
+<pre class="pre codeblock"><code>insert into t1 <strong class="ph b">partition(x=10, y='a')</strong> select c1 from some_other_table;
+</code></pre>
+
+      <p class="p">
+        When you specify some partition key columns in an <code class="ph codeph">INSERT</code> statement, but leave out the values, Impala determines
+        which partition to insert. This technique is called <dfn class="term">dynamic partitioning</dfn>:
+      </p>
+
+<pre class="pre codeblock"><code>insert into t1 <strong class="ph b">partition(x, y='b')</strong> select c1, c2 from some_other_table;
+-- Create new partition if necessary based on variable year, month, and day; insert a single value.
+insert into weather <strong class="ph b">partition (year, month, day)</strong> select 'cloudy',2014,4,21;
+-- Create new partition if necessary for specified year and month but variable day; insert a single value.
+insert into weather <strong class="ph b">partition (year=2014, month=04, day)</strong> select 'sunny',22;
+</code></pre>
+
+      <p class="p">
+        The more key columns you specify in the <code class="ph codeph">PARTITION</code> clause, the fewer columns you need in the <code class="ph codeph">SELECT</code>
+        list. The trailing columns in the <code class="ph codeph">SELECT</code> list are substituted in order for the partition key columns with no
+        specified value.
+      </p>
+
+    </div>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title5" id="partitioning__partition_refresh">
+
+    <h2 class="title topictitle2" id="ariaid-title5">Refreshing a Single Partition</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        The <code class="ph codeph">REFRESH</code> statement is typically used with partitioned tables when new data files are loaded into a partition by
+        some non-Impala mechanism, such as a Hive or Spark job. The <code class="ph codeph">REFRESH</code> statement makes Impala aware of the new data
+        files so that they can be used in Impala queries. Because partitioned tables typically contain a high volume of data, the
+        <code class="ph codeph">REFRESH</code> operation for a full partitioned table can take significant time.
+      </p>
+
+      <p class="p">
+        In <span class="keyword">Impala 2.7</span> and higher, you can include a <code class="ph codeph">PARTITION (<var class="keyword varname">partition_spec</var>)</code> clause in the
+        <code class="ph codeph">REFRESH</code> statement so that only a single partition is refreshed. For example, <code class="ph codeph">REFRESH big_table PARTITION
+        (year=2017, month=9, day=30)</code>. The partition spec must include all the partition key columns. See
+        <a class="xref" href="impala_refresh.html#refresh">REFRESH Statement</a> for more details and examples of <code class="ph codeph">REFRESH</code> syntax and usage.
+      </p>
+
+    </div>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title6" id="partitioning__partition_permissions">
+
+    <h2 class="title topictitle2" id="ariaid-title6">Permissions for Partition Subdirectories</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        By default, if an <code class="ph codeph">INSERT</code> statement creates any new subdirectories underneath a partitioned
+        table, those subdirectories are assigned default HDFS permissions for the <code class="ph codeph">impala</code> user. To
+        make each subdirectory have the same permissions as its parent directory in HDFS, specify the
+        <code class="ph codeph">--insert_inherit_permissions</code> startup option for the <span class="keyword cmdname">impalad</span> daemon.
+      </p>
+
+    </div>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title7" id="partitioning__partition_pruning">
+
+    <h2 class="title topictitle2" id="ariaid-title7">Partition Pruning for Queries</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        Partition pruning refers to the mechanism where a query can skip reading the data files corresponding to one or more partitions. If
+        you can arrange for queries to prune large numbers of unnecessary partitions from the query execution plan, the queries use fewer
+        resources and are thus proportionally faster and more scalable.
+      </p>
+
+      <p class="p">
+        For example, if a table is partitioned by columns <code class="ph codeph">YEAR</code>, <code class="ph codeph">MONTH</code>, and <code class="ph codeph">DAY</code>, then
+        <code class="ph codeph">WHERE</code> clauses such as <code class="ph codeph">WHERE year = 2013</code>, <code class="ph codeph">WHERE year &lt; 2010</code>, or <code class="ph codeph">WHERE
+        year BETWEEN 1995 AND 1998</code> allow Impala to skip the data files in all partitions outside the specified range. Likewise,
+        <code class="ph codeph">WHERE year = 2013 AND month BETWEEN 1 AND 3</code> could prune even more partitions, reading the data files for only a
+        portion of one year.
+      </p>
+
+      <p class="p toc inpage"></p>
+
+    </div>
+
+    <article class="topic concept nested2" aria-labelledby="ariaid-title8" id="partition_pruning__partition_pruning_checking">
+
+      <h3 class="title topictitle3" id="ariaid-title8">Checking if Partition Pruning Happens for a Query</h3>
+
+      <div class="body conbody">
+
+        <p class="p">
+          To check the effectiveness of partition pruning for a query, check the <code class="ph codeph">EXPLAIN</code> output for the query before
+          running it. For example, this example shows a table with 3 partitions, where the query only reads 1 of them. The notation
+          <code class="ph codeph">#partitions=1/3</code> in the <code class="ph codeph">EXPLAIN</code> plan confirms that Impala can do the appropriate partition
+          pruning.
+        </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; insert into census partition (year=2010) values ('Smith'),('Jones');
+[localhost:21000] &gt; insert into census partition (year=2011) values ('Smith'),('Jones'),('Doe');
+[localhost:21000] &gt; insert into census partition (year=2012) values ('Smith'),('Doe');
+[localhost:21000] &gt; select name from census where year=2010;
++-------+
+| name  |
++-------+
+| Smith |
+| Jones |
++-------+
+[localhost:21000] &gt; explain select name from census <strong class="ph b">where year=2010</strong>;
++------------------------------------------------------------------+
+| Explain String                                                   |
++------------------------------------------------------------------+
+| PLAN FRAGMENT 0                                                  |
+|   PARTITION: UNPARTITIONED                                       |
+|                                                                  |
+|   1:EXCHANGE                                                     |
+|                                                                  |
+| PLAN FRAGMENT 1                                                  |
+|   PARTITION: RANDOM                                              |
+|                                                                  |
+|   STREAM DATA SINK                                               |
+|     EXCHANGE ID: 1                                               |
+|     UNPARTITIONED                                                |
+|                                                                  |
+|   0:SCAN HDFS                                                    |
+|      table=predicate_propagation.census <strong class="ph b">#partitions=1/3</strong> size=12B |
++------------------------------------------------------------------+</code></pre>
+
+        <p class="p">
+          For a report of the volume of data that was actually read and processed at each stage of the query, check the output of the
+          <code class="ph codeph">SUMMARY</code> command immediately after running the query. For a more detailed analysis, look at the output of the
+          <code class="ph codeph">PROFILE</code> command; it includes this same summary report near the start of the profile output.
+        </p>
+
+      </div>
+
+    </article>
+
+    <article class="topic concept nested2" aria-labelledby="ariaid-title9" id="partition_pruning__partition_pruning_sql">
+
+      <h3 class="title topictitle3" id="ariaid-title9">What SQL Constructs Work with Partition Pruning</h3>
+
+      <div class="body conbody">
+
+        <p class="p">
+          
+          Impala can even do partition pruning in cases where the partition key column is not directly compared to a constant, by applying
+          the transitive property to other parts of the <code class="ph codeph">WHERE</code> clause. This technique is known as predicate propagation, and
+          is available in Impala 1.2.2 and later. In this example, the census table includes another column indicating when the data was
+          collected, which happens in 10-year intervals. Even though the query does not compare the partition key column
+          (<code class="ph codeph">YEAR</code>) to a constant value, Impala can deduce that only the partition <code class="ph codeph">YEAR=2010</code> is required, and
+          again only reads 1 out of 3 partitions.
+        </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; drop table census;
+[localhost:21000] &gt; create table census (name string, census_year int) partitioned by (year int);
+[localhost:21000] &gt; insert into census partition (year=2010) values ('Smith',2010),('Jones',2010);
+[localhost:21000] &gt; insert into census partition (year=2011) values ('Smith',2020),('Jones',2020),('Doe',2020);
+[localhost:21000] &gt; insert into census partition (year=2012) values ('Smith',2020),('Doe',2020);
+[localhost:21000] &gt; select name from census where year = census_year and census_year=2010;
++-------+
+| name  |
++-------+
+| Smith |
+| Jones |
++-------+
+[localhost:21000] &gt; explain select name from census <strong class="ph b">where year = census_year and census_year=2010</strong>;
++------------------------------------------------------------------+
+| Explain String                                                   |
++------------------------------------------------------------------+
+| PLAN FRAGMENT 0                                                  |
+|   PARTITION: UNPARTITIONED                                       |
+|                                                                  |
+|   1:EXCHANGE                                                     |
+|                                                                  |
+| PLAN FRAGMENT 1                                                  |
+|   PARTITION: RANDOM                                              |
+|                                                                  |
+|   STREAM DATA SINK                                               |
+|     EXCHANGE ID: 1                                               |
+|     UNPARTITIONED                                                |
+|                                                                  |
+|   0:SCAN HDFS                                                    |
+|      table=predicate_propagation.census <strong class="ph b">#partitions=1/3</strong> size=22B |
+|      predicates: census_year = 2010, year = census_year          |
++------------------------------------------------------------------+
+</code></pre>
+
+        <p class="p">
+        If a view applies to a partitioned table, any partition pruning considers the clauses on both
+        the original query and any additional <code class="ph codeph">WHERE</code> predicates in the query that refers to the view.
+        Prior to Impala 1.4, only the <code class="ph codeph">WHERE</code> clauses on the original query from the
+        <code class="ph codeph">CREATE VIEW</code> statement were used for partition pruning.
+      </p>
+
+        <p class="p">
+        In queries involving both analytic functions and partitioned tables, partition pruning only occurs for columns named in the <code class="ph codeph">PARTITION BY</code>
+        clause of the analytic function call. For example, if an analytic function query has a clause such as <code class="ph codeph">WHERE year=2016</code>,
+        the way to make the query prune all other <code class="ph codeph">YEAR</code> partitions is to include <code class="ph codeph">PARTITION BY year</code>in the analytic function call;
+        for example, <code class="ph codeph">OVER (PARTITION BY year,<var class="keyword varname">other_columns</var> <var class="keyword varname">other_analytic_clauses</var>)</code>.
+
+      </p>
+
+      </div>
+
+    </article>
+
+    <article class="topic concept nested2" aria-labelledby="ariaid-title10" id="partition_pruning__dynamic_partition_pruning">
+
+      <h3 class="title topictitle3" id="ariaid-title10">Dynamic Partition Pruning</h3>
+
+      <div class="body conbody">
+
+        <p class="p">
+          The original mechanism uses to prune partitions is <dfn class="term">static partition pruning</dfn>, in which the conditions in the
+          <code class="ph codeph">WHERE</code> clause are analyzed to determine in advance which partitions can be safely skipped. In <span class="keyword">Impala 2.5</span>
+          and higher, Impala can perform <dfn class="term">dynamic partition pruning</dfn>, where information about the partitions is collected during
+          the query, and Impala prunes unnecessary partitions in ways that were impractical to predict in advance.
+        </p>
+
+        <p class="p">
+          For example, if partition key columns are compared to literal values in a <code class="ph codeph">WHERE</code> clause, Impala can perform static
+          partition pruning during the planning phase to only read the relevant partitions:
+        </p>
+
+<pre class="pre codeblock"><code>
+-- The query only needs to read 3 partitions whose key values are known ahead of time.
+-- That's static partition pruning.
+SELECT COUNT(*) FROM sales_table WHERE year IN (2005, 2010, 2015);
+</code></pre>
+
+        <p class="p">
+          Dynamic partition pruning involves using information only available at run time, such as the result of a subquery:
+        </p>
+
+<pre class="pre codeblock"><code>
+create table yy (s string) partitioned by (year int) stored as parquet;
+insert into yy partition (year) values ('1999', 1999), ('2000', 2000),
+  ('2001', 2001), ('2010',2010);
+compute stats yy;
+
+create table yy2 (s string) partitioned by (year int) stored as parquet;
+insert into yy2 partition (year) values ('1999', 1999), ('2000', 2000),
+  ('2001', 2001);
+compute stats yy2;
+
+-- The query reads an unknown number of partitions, whose key values are only
+-- known at run time. The 'runtime filters' lines show how the information about
+-- the partitions is calculated in query fragment 02, and then used in query
+-- fragment 00 to decide which partitions to skip.
+explain select s from yy2 where year in (select year from yy where year between 2000 and 2005);
++----------------------------------------------------------+
+| Explain String                                           |
++----------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=16.00MB VCores=2 |
+|                                                          |
+| 04:EXCHANGE [UNPARTITIONED]                              |
+| |                                                        |
+| 02:HASH JOIN [LEFT SEMI JOIN, BROADCAST]                 |
+| |  hash predicates: year = year                          |
+| |  <strong class="ph b">runtime filters: RF000 &lt;- year</strong>                        |
+| |                                                        |
+| |--03:EXCHANGE [BROADCAST]                               |
+| |  |                                                     |
+| |  01:SCAN HDFS [dpp.yy]                                 |
+| |     partitions=2/4 files=2 size=468B                   |
+| |                                                        |
+| 00:SCAN HDFS [dpp.yy2]                                   |
+|    partitions=2/3 files=2 size=468B                      |
+|    <strong class="ph b">runtime filters: RF000 -&gt; year</strong>                        |
++----------------------------------------------------------+
+</code></pre>
+
+
+
+        <p class="p">
+          In this case, Impala evaluates the subquery, sends the subquery results to all Impala nodes participating in the query, and then
+          each <span class="keyword cmdname">impalad</span> daemon uses the dynamic partition pruning optimization to read only the partitions with the
+          relevant key values.
+        </p>
+
+        <p class="p">
+          Dynamic partition pruning is especially effective for queries involving joins of several large partitioned tables. Evaluating the
+          <code class="ph codeph">ON</code> clauses of the join predicates might normally require reading data from all partitions of certain tables. If
+          the <code class="ph codeph">WHERE</code> clauses of the query refer to the partition key columns, Impala can now often skip reading many of the
+          partitions while evaluating the <code class="ph codeph">ON</code> clauses. The dynamic partition pruning optimization reduces the amount of I/O
+          and the amount of intermediate data stored and transmitted across the network during the query.
+        </p>
+
+        <p class="p">
+        When the spill-to-disk feature is activated for a join node within a query, Impala does not
+        produce any runtime filters for that join operation on that host. Other join nodes within
+        the query are not affected.
+      </p>
+
+        <p class="p">
+          Dynamic partition pruning is part of the runtime filtering feature, which applies to other kinds of queries in addition to queries
+          against partitioned tables. See <a class="xref" href="impala_runtime_filtering.html#runtime_filtering">Runtime Filtering for Impala Queries (Impala 2.5 or higher only)</a> for full details about this feature.
+        </p>
+
+      </div>
+
+    </article>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title11" id="partitioning__partition_key_columns">
+
+    <h2 class="title topictitle2" id="ariaid-title11">Partition Key Columns</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        The columns you choose as the partition keys should be ones that are frequently used to filter query results in important,
+        large-scale queries. Popular examples are some combination of year, month, and day when the data has associated time values, and
+        geographic region when the data is associated with some place.
+      </p>
+
+      <ul class="ul">
+        <li class="li">
+          <p class="p">
+            For time-based data, split out the separate parts into their own columns, because Impala cannot partition based on a
+            <code class="ph codeph">TIMESTAMP</code> column.
+          </p>
+        </li>
+
+        <li class="li">
+          <p class="p">
+            The data type of the partition columns does not have a significant effect on the storage required, because the values from those
+            columns are not stored in the data files, rather they are represented as strings inside HDFS directory names.
+          </p>
+        </li>
+
+        <li class="li">
+          <p class="p">
+            In <span class="keyword">Impala 2.5</span> and higher, you can enable the <code class="ph codeph">OPTIMIZE_PARTITION_KEY_SCANS</code> query option to speed up
+            queries that only refer to partition key columns, such as <code class="ph codeph">SELECT MAX(year)</code>. This setting is not enabled by
+            default because the query behavior is slightly different if the table contains partition directories without actual data inside.
+            See <a class="xref" href="impala_optimize_partition_key_scans.html#optimize_partition_key_scans">OPTIMIZE_PARTITION_KEY_SCANS Query Option (Impala 2.5 or higher only)</a> for details.
+          </p>
+        </li>
+
+        <li class="li">
+          <p class="p">
+        Partitioned tables can contain complex type columns.
+        All the partition key columns must be scalar types.
+      </p>
+        </li>
+
+        <li class="li">
+          <p class="p">
+            Remember that when Impala queries data stored in HDFS, it is most efficient to use multi-megabyte files to take advantage of the
+            HDFS block size. For Parquet tables, the block size (and ideal size of the data files) is <span class="ph">256 MB in
+            Impala 2.0 and later</span>. Therefore, avoid specifying too many partition key columns, which could result in individual
+            partitions containing only small amounts of data. For example, if you receive 1 GB of data per day, you might partition by year,
+            month, and day; while if you receive 5 GB of data per minute, you might partition by year, month, day, hour, and minute. If you
+            have data with a geographic component, you might partition based on postal code if you have many megabytes of data for each
+            postal code, but if not, you might partition by some larger region such as city, state, or country. state
+          </p>
+        </li>
+      </ul>
+
+      <p class="p">
+        If you frequently run aggregate functions such as <code class="ph codeph">MIN()</code>, <code class="ph codeph">MAX()</code>, and
+        <code class="ph codeph">COUNT(DISTINCT)</code> on partition key columns, consider enabling the <code class="ph codeph">OPTIMIZE_PARTITION_KEY_SCANS</code>
+        query option, which optimizes such queries. This feature is available in <span class="keyword">Impala 2.5</span> and higher.
+        See <a class="xref" href="../shared/../topics/impala_optimize_partition_key_scans.html">OPTIMIZE_PARTITION_KEY_SCANS Query Option (Impala 2.5 or higher only)</a>
+        for the kinds of queries that this option applies to, and slight differences in how partitions are
+        evaluated when this query option is enabled.
+      </p>
+
+    </div>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title12" id="partitioning__mixed_format_partitions">
+
+    <h2 class="title topictitle2" id="ariaid-title12">Setting Different File Formats for Partitions</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        Partitioned tables have the flexibility to use different file formats for different partitions. (For background information about
+        the different file formats Impala supports, see <a class="xref" href="impala_file_formats.html#file_formats">How Impala Works with Hadoop File Formats</a>.) For example, if you originally
+        received data in text format, then received new data in RCFile format, and eventually began receiving data in Parquet format, all
+        that data could reside in the same table for queries. You just need to ensure that the table is structured so that the data files
+        that use different file formats reside in separate partitions.
+      </p>
+
+      <p class="p">
+        For example, here is how you might switch from text to Parquet data as you receive data for different years:
+      </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; create table census (name string) partitioned by (year smallint);
+[localhost:21000] &gt; alter table census add partition (year=2012); -- Text format;
+
+[localhost:21000] &gt; alter table census add partition (year=2013); -- Text format switches to Parquet before data loaded;
+[localhost:21000] &gt; alter table census partition (year=2013) set fileformat parquet;
+
+[localhost:21000] &gt; insert into census partition (year=2012) values ('Smith'),('Jones'),('Lee'),('Singh');
+[localhost:21000] &gt; insert into census partition (year=2013) values ('Flores'),('Bogomolov'),('Cooper'),('Appiah');</code></pre>
+
+      <p class="p">
+        At this point, the HDFS directory for <code class="ph codeph">year=2012</code> contains a text-format data file, while the HDFS directory for
+        <code class="ph codeph">year=2013</code> contains a Parquet data file. As always, when loading non-trivial data, you would use <code class="ph codeph">INSERT ...
+        SELECT</code> or <code class="ph codeph">LOAD DATA</code> to import data in large batches, rather than <code class="ph codeph">INSERT ... VALUES</code> which
+        produces small files that are inefficient for real-world queries.
+      </p>
+
+      <p class="p">
+        For other file types that Impala cannot create natively, you can switch into Hive and issue the <code class="ph codeph">ALTER TABLE ... SET
+        FILEFORMAT</code> statements and <code class="ph codeph">INSERT</code> or <code class="ph codeph">LOAD DATA</code> statements there. After switching back to
+        Impala, issue a <code class="ph codeph">REFRESH <var class="keyword varname">table_name</var></code> statement so that Impala recognizes any partitions or new
+        data added through Hive.
+      </p>
+
+    </div>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title13" id="partitioning__partition_management">
+
+    <h2 class="title topictitle2" id="ariaid-title13">Managing Partitions</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        You can add, drop, set the expected file format, or set the HDFS location of the data files for individual partitions within an
+        Impala table. See <a class="xref" href="impala_alter_table.html#alter_table">ALTER TABLE Statement</a> for syntax details, and
+        <a class="xref" href="impala_partitioning.html#mixed_format_partitions">Setting Different File Formats for Partitions</a> for tips on managing tables containing partitions with different file
+        formats.
+      </p>
+
+      <div class="note note note_note"><span class="note__title notetitle">Note:</span> 
+        If you are creating a partition for the first time and specifying its location, for maximum efficiency, use
+        a single <code class="ph codeph">ALTER TABLE</code> statement including both the <code class="ph codeph">ADD PARTITION</code> and
+        <code class="ph codeph">LOCATION</code> clauses, rather than separate statements with <code class="ph codeph">ADD PARTITION</code> and
+        <code class="ph codeph">SET LOCATION</code> clauses.
+      </div>
+
+      <p class="p">
+        What happens to the data files when a partition is dropped depends on whether the partitioned table is designated as internal or
+        external. For an internal (managed) table, the data files are deleted. For example, if data in the partitioned table is a copy of
+        raw data files stored elsewhere, you might save disk space by dropping older partitions that are no longer required for reporting,
+        knowing that the original data is still available if needed later. For an external table, the data files are left alone. For
+        example, dropping a partition without deleting the associated files lets Impala consider a smaller set of partitions, improving
+        query efficiency and reducing overhead for DDL operations on the table; if the data is needed again later, you can add the partition
+        again. See <a class="xref" href="impala_tables.html#tables">Overview of Impala Tables</a> for details and examples.
+      </p>
+
+    </div>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title14" id="partitioning__partition_kudu">
+
+    <h2 class="title topictitle2" id="ariaid-title14">Using Partitioning with Kudu Tables</h2>
+
+    
+
+    <div class="body conbody">
+
+      <p class="p">
+        Kudu tables use a more fine-grained partitioning scheme than tables containing HDFS data files. You specify a <code class="ph codeph">PARTITION
+        BY</code> clause with the <code class="ph codeph">CREATE TABLE</code> statement to identify how to divide the values from the partition key
+        columns.
+      </p>
+
+      <p class="p">
+        See <a class="xref" href="impala_kudu.html#kudu_partitioning">Partitioning for Kudu Tables</a> for
+        details and examples of the partitioning techniques
+        for Kudu tables.
+      </p>
+
+    </div>
+
+  </article>
+
+</article></main></body></html>
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_perf_benchmarking.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_perf_benchmarking.html b/docs/build/html/topics/impala_perf_benchmarking.html
new file mode 100644
index 0000000..ce9d995
--- /dev/null
+++ b/docs/build/html/topics/impala_perf_benchmarking.html
@@ -0,0 +1,27 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_performance.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="perf_benchmarks"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Benchmarking Impala Queries</title></head><body id="perf_benchmarks"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">Benchmarking Impala Queries</h1>
+  
+  
+
+  <div class="body conbody">
+
+    <p class="p">
+      Because Impala, like other Hadoop components, is designed to handle large data volumes in a distributed
+      environment, conduct any performance tests using realistic data and cluster configurations. Use a multi-node
+      cluster rather than a single node; run queries against tables containing terabytes of data rather than tens
+      of gigabytes. The parallel processing techniques used by Impala are most appropriate for workloads that are
+      beyond the capacity of a single server.
+    </p>
+
+    <p class="p">
+      When you run queries returning large numbers of rows, the CPU time to pretty-print the output can be
+      substantial, giving an inaccurate measurement of the actual query time. Consider using the
+      <code class="ph codeph">-B</code> option on the <code class="ph codeph">impala-shell</code> command to turn off the pretty-printing, and
+      optionally the <code class="ph codeph">-o</code> option to store query results in a file rather than printing to the
+      screen. See <a class="xref" href="impala_shell_options.html#shell_options">impala-shell Configuration Options</a> for details.
+    </p>
+  </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_performance.html">Tuning Impala for Performance</a></div></div></nav></article></main></body></html>
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_perf_cookbook.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_perf_cookbook.html b/docs/build/html/topics/impala_perf_cookbook.html
new file mode 100644
index 0000000..fc68b45
--- /dev/null
+++ b/docs/build/html/topics/impala_perf_cookbook.html
@@ -0,0 +1,256 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_performance.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="perf_cookbook"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Impala Performance Guidelines and Best Practices</title></head><body id="perf_cookbook"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">Impala Performance Guidelines and Best Practices</h1>
+  
+  
+
+  <div class="body conbody">
+
+    <p class="p">
+      Here are performance guidelines and best practices that you can use during planning, experimentation, and
+      performance tuning for an Impala-enabled <span class="keyword"></span> cluster. All of this information is also available in more
+      detail elsewhere in the Impala documentation; it is gathered together here to serve as a cookbook and
+      emphasize which performance techniques typically provide the highest return on investment
+    </p>
+
+    <p class="p toc inpage"></p>
+
+    <section class="section" id="perf_cookbook__perf_cookbook_file_format"><h2 class="title sectiontitle">Choose the appropriate file format for the data.</h2>
+
+      
+
+      <p class="p">
+        Typically, for large volumes of data (multiple gigabytes per table or partition), the Parquet file format
+        performs best because of its combination of columnar storage layout, large I/O request size, and
+        compression and encoding. See <a class="xref" href="impala_file_formats.html#file_formats">How Impala Works with Hadoop File Formats</a> for comparisons of all
+        file formats supported by Impala, and <a class="xref" href="impala_parquet.html#parquet">Using the Parquet File Format with Impala Tables</a> for details about the
+        Parquet file format.
+      </p>
+
+      <div class="note note note_note"><span class="note__title notetitle">Note:</span> 
+        For smaller volumes of data, a few gigabytes or less for each table or partition, you might not see
+        significant performance differences between file formats. At small data volumes, reduced I/O from an
+        efficient compressed file format can be counterbalanced by reduced opportunity for parallel execution. When
+        planning for a production deployment or conducting benchmarks, always use realistic data volumes to get a
+        true picture of performance and scalability.
+      </div>
+    </section>
+
+    <section class="section" id="perf_cookbook__perf_cookbook_small_files"><h2 class="title sectiontitle">Avoid data ingestion processes that produce many small files.</h2>
+
+      
+
+      <p class="p">
+        When producing data files outside of Impala, prefer either text format or Avro, where you can build up the
+        files row by row. Once the data is in Impala, you can convert it to the more efficient Parquet format and
+        split into multiple data files using a single <code class="ph codeph">INSERT ... SELECT</code> statement. Or, if you have
+        the infrastructure to produce multi-megabyte Parquet files as part of your data preparation process, do
+        that and skip the conversion step inside Impala.
+      </p>
+
+      <p class="p">
+        Always use <code class="ph codeph">INSERT ... SELECT</code> to copy significant volumes of data from table to table
+        within Impala. Avoid <code class="ph codeph">INSERT ... VALUES</code> for any substantial volume of data or
+        performance-critical tables, because each such statement produces a separate tiny data file. See
+        <a class="xref" href="impala_insert.html#insert">INSERT Statement</a> for examples of the <code class="ph codeph">INSERT ... SELECT</code> syntax.
+      </p>
+
+      <p class="p">
+        For example, if you have thousands of partitions in a Parquet table, each with less than
+        <span class="ph">256 MB</span> of data, consider partitioning in a less granular way, such as by
+        year / month rather than year / month / day. If an inefficient data ingestion process produces thousands of
+        data files in the same table or partition, consider compacting the data by performing an <code class="ph codeph">INSERT ...
+        SELECT</code> to copy all the data to a different table; the data will be reorganized into a smaller
+        number of larger files by this process.
+      </p>
+    </section>
+
+    <section class="section" id="perf_cookbook__perf_cookbook_partitioning"><h2 class="title sectiontitle">Choose partitioning granularity based on actual data volume.</h2>
+
+      
+
+      <p class="p">
+        Partitioning is a technique that physically divides the data based on values of one or more columns, such
+        as by year, month, day, region, city, section of a web site, and so on. When you issue queries that request
+        a specific value or range of values for the partition key columns, Impala can avoid reading the irrelevant
+        data, potentially yielding a huge savings in disk I/O.
+      </p>
+
+      <p class="p">
+        When deciding which column(s) to use for partitioning, choose the right level of granularity. For example,
+        should you partition by year, month, and day, or only by year and month? Choose a partitioning strategy
+        that puts at least <span class="ph">256 MB</span> of data in each partition, to take advantage of
+        HDFS bulk I/O and Impala distributed queries.
+      </p>
+
+      <p class="p">
+        Over-partitioning can also cause query planning to take longer than necessary, as Impala prunes the
+        unnecessary partitions. Ideally, keep the number of partitions in the table under 30 thousand.
+      </p>
+
+      <p class="p">
+        When preparing data files to go in a partition directory, create several large files rather than many small
+        ones. If you receive data in the form of many small files and have no control over the input format,
+        consider using the <code class="ph codeph">INSERT ... SELECT</code> syntax to copy data from one table or partition to
+        another, which compacts the files into a relatively small number (based on the number of nodes in the
+        cluster).
+      </p>
+
+      <p class="p">
+        If you need to reduce the overall number of partitions and increase the amount of data in each partition,
+        first look for partition key columns that are rarely referenced or are referenced in non-critical queries
+        (not subject to an SLA). For example, your web site log data might be partitioned by year, month, day, and
+        hour, but if most queries roll up the results by day, perhaps you only need to partition by year, month,
+        and day.
+      </p>
+
+      <p class="p">
+        If you need to reduce the granularity even more, consider creating <span class="q">"buckets"</span>, computed values
+        corresponding to different sets of partition key values. For example, you can use the
+        <code class="ph codeph">TRUNC()</code> function with a <code class="ph codeph">TIMESTAMP</code> column to group date and time values
+        based on intervals such as week or quarter. See
+        <a class="xref" href="impala_datetime_functions.html#datetime_functions">Impala Date and Time Functions</a> for details.
+      </p>
+
+      <p class="p">
+        See <a class="xref" href="impala_partitioning.html#partitioning">Partitioning for Impala Tables</a> for full details and performance considerations for
+        partitioning.
+      </p>
+    </section>
+
+    <section class="section" id="perf_cookbook__perf_cookbook_partition_keys"><h2 class="title sectiontitle">Use smallest appropriate integer types for partition key columns.</h2>
+
+      
+
+      <p class="p">
+        Although it is tempting to use strings for partition key columns, since those values are turned into HDFS
+        directory names anyway, you can minimize memory usage by using numeric values for common partition key
+        fields such as <code class="ph codeph">YEAR</code>, <code class="ph codeph">MONTH</code>, and <code class="ph codeph">DAY</code>. Use the smallest
+        integer type that holds the appropriate range of values, typically <code class="ph codeph">TINYINT</code> for
+        <code class="ph codeph">MONTH</code> and <code class="ph codeph">DAY</code>, and <code class="ph codeph">SMALLINT</code> for <code class="ph codeph">YEAR</code>.
+        Use the <code class="ph codeph">EXTRACT()</code> function to pull out individual date and time fields from a
+        <code class="ph codeph">TIMESTAMP</code> value, and <code class="ph codeph">CAST()</code> the return value to the appropriate integer
+        type.
+      </p>
+    </section>
+
+    <section class="section" id="perf_cookbook__perf_cookbook_parquet_block_size"><h2 class="title sectiontitle">Choose an appropriate Parquet block size.</h2>
+
+      
+
+      <p class="p">
+        By default, the Impala <code class="ph codeph">INSERT ... SELECT</code> statement creates Parquet files with a 256 MB
+        block size. (This default was changed in Impala 2.0. Formerly, the limit was 1 GB, but Impala made
+        conservative estimates about compression, resulting in files that were smaller than 1 GB.)
+      </p>
+
+      <p class="p">
+        Each Parquet file written by Impala is a single block, allowing the whole file to be processed as a unit by a single host.
+        As you copy Parquet files into HDFS or between HDFS filesystems, use <code class="ph codeph">hdfs dfs -pb</code> to preserve the original
+        block size.
+      </p>
+
+      <p class="p">
+        If there is only one or a few data block in your Parquet table, or in a partition that is the only one
+        accessed by a query, then you might experience a slowdown for a different reason: not enough data to take
+        advantage of Impala's parallel distributed queries. Each data block is processed by a single core on one of
+        the DataNodes. In a 100-node cluster of 16-core machines, you could potentially process thousands of data
+        files simultaneously. You want to find a sweet spot between <span class="q">"many tiny files"</span> and <span class="q">"single giant
+        file"</span> that balances bulk I/O and parallel processing. You can set the <code class="ph codeph">PARQUET_FILE_SIZE</code>
+        query option before doing an <code class="ph codeph">INSERT ... SELECT</code> statement to reduce the size of each
+        generated Parquet file. <span class="ph">(Specify the file size as an absolute number of bytes, or in Impala
+        2.0 and later, in units ending with <code class="ph codeph">m</code> for megabytes or <code class="ph codeph">g</code> for
+        gigabytes.)</span> Run benchmarks with different file sizes to find the right balance point for your
+        particular data volume.
+      </p>
+    </section>
+
+    <section class="section" id="perf_cookbook__perf_cookbook_stats"><h2 class="title sectiontitle">Gather statistics for all tables used in performance-critical or high-volume join queries.</h2>
+
+      
+
+      <p class="p">
+        Gather the statistics with the <code class="ph codeph">COMPUTE STATS</code> statement. See
+        <a class="xref" href="impala_perf_joins.html#perf_joins">Performance Considerations for Join Queries</a> for details.
+      </p>
+    </section>
+
+    <section class="section" id="perf_cookbook__perf_cookbook_network"><h2 class="title sectiontitle">Minimize the overhead of transmitting results back to the client.</h2>
+
+      
+
+      <p class="p">
+        Use techniques such as:
+      </p>
+
+      <ul class="ul">
+        <li class="li">
+          Aggregation. If you need to know how many rows match a condition, the total values of matching values
+          from some column, the lowest or highest matching value, and so on, call aggregate functions such as
+          <code class="ph codeph">COUNT()</code>, <code class="ph codeph">SUM()</code>, and <code class="ph codeph">MAX()</code> in the query rather than
+          sending the result set to an application and doing those computations there. Remember that the size of an
+          unaggregated result set could be huge, requiring substantial time to transmit across the network.
+        </li>
+
+        <li class="li">
+          Filtering. Use all applicable tests in the <code class="ph codeph">WHERE</code> clause of a query to eliminate rows
+          that are not relevant, rather than producing a big result set and filtering it using application logic.
+        </li>
+
+        <li class="li">
+          <code class="ph codeph">LIMIT</code> clause. If you only need to see a few sample values from a result set, or the top
+          or bottom values from a query using <code class="ph codeph">ORDER BY</code>, include the <code class="ph codeph">LIMIT</code> clause
+          to reduce the size of the result set rather than asking for the full result set and then throwing most of
+          the rows away.
+        </li>
+
+        <li class="li">
+          Avoid overhead from pretty-printing the result set and displaying it on the screen. When you retrieve the
+          results through <span class="keyword cmdname">impala-shell</span>, use <span class="keyword cmdname">impala-shell</span> options such as
+          <code class="ph codeph">-B</code> and <code class="ph codeph">--output_delimiter</code> to produce results without special
+          formatting, and redirect output to a file rather than printing to the screen. Consider using
+          <code class="ph codeph">INSERT ... SELECT</code> to write the results directly to new files in HDFS. See
+          <a class="xref" href="impala_shell_options.html#shell_options">impala-shell Configuration Options</a> for details about the
+          <span class="keyword cmdname">impala-shell</span> command-line options.
+        </li>
+      </ul>
+    </section>
+
+    <section class="section" id="perf_cookbook__perf_cookbook_explain"><h2 class="title sectiontitle">Verify that your queries are planned in an efficient logical manner.</h2>
+
+      
+
+      <p class="p">
+        Examine the <code class="ph codeph">EXPLAIN</code> plan for a query before actually running it. See
+        <a class="xref" href="impala_explain.html#explain">EXPLAIN Statement</a> and <a class="xref" href="impala_explain_plan.html#perf_explain">Using the EXPLAIN Plan for Performance Tuning</a> for
+        details.
+      </p>
+    </section>
+
+    <section class="section" id="perf_cookbook__perf_cookbook_profile"><h2 class="title sectiontitle">Verify performance characteristics of queries.</h2>
+
+      
+
+      <p class="p">
+        Verify that the low-level aspects of I/O, memory usage, network bandwidth, CPU utilization, and so on are
+        within expected ranges by examining the query profile for a query after running it. See
+        <a class="xref" href="impala_explain_plan.html#perf_profile">Using the Query Profile for Performance Tuning</a> for details.
+      </p>
+    </section>
+
+    <section class="section" id="perf_cookbook__perf_cookbook_os"><h2 class="title sectiontitle">Use appropriate operating system settings.</h2>
+
+      
+
+      <p class="p">
+        See <span class="xref">the documentation for your Apache Hadoop distribution</span> for recommendations about operating system
+        settings that you can change to influence Impala performance. In particular, you might find
+        that changing the <code class="ph codeph">vm.swappiness</code> Linux kernel setting to a non-zero value improves
+        overall performance.
+      </p>
+    </section>
+
+  </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_performance.html">Tuning Impala for Performance</a></div></div></nav></article></main></body></html>
\ No newline at end of file