Posted to commits@impala.apache.org by jb...@apache.org on 2017/04/12 18:25:38 UTC
[34/51] [partial] incubator-impala git commit: IMPALA-4181 [DOCS]
Publish rendered Impala documentation to ASF site
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_explain.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_explain.html b/docs/build/html/topics/impala_explain.html
new file mode 100644
index 0000000..473a94d
--- /dev/null
+++ b/docs/build/html/topics/impala_explain.html
@@ -0,0 +1,291 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_langref_sql.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="explain"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>EXPLAIN Statement</title></head><body id="explain"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+ <h1 class="title topictitle1" id="ariaid-title1">EXPLAIN Statement</h1>
+
+
+
+ <div class="body conbody">
+
+  <p class="p">
+
+      Returns the execution plan for a statement, showing the low-level mechanisms that Impala will use to read the
+      data, divide the work among nodes in the cluster, and transmit intermediate and final results across the
+      network. Use <code class="ph codeph">EXPLAIN</code> followed by a complete <code class="ph codeph">SELECT</code> query,
+      <code class="ph codeph">INSERT</code> statement, or <code class="ph codeph">CREATE TABLE AS SELECT</code> statement.
+  </p>
+
+ <p class="p">
+ <strong class="ph b">Syntax:</strong>
+ </p>
+
+<pre class="pre codeblock"><code>EXPLAIN { <var class="keyword varname">select_query</var> | <var class="keyword varname">ctas_stmt</var> | <var class="keyword varname">insert_stmt</var> }
+</code></pre>
+
+ <p class="p">
+ The <var class="keyword varname">select_query</var> is a <code class="ph codeph">SELECT</code> statement, optionally prefixed by a
+ <code class="ph codeph">WITH</code> clause. See <a class="xref" href="impala_select.html#select">SELECT Statement</a> for details.
+ </p>
+
+ <p class="p">
+ The <var class="keyword varname">insert_stmt</var> is an <code class="ph codeph">INSERT</code> statement that inserts into or overwrites an
+ existing table. It can use either the <code class="ph codeph">INSERT ... SELECT</code> or <code class="ph codeph">INSERT ...
+ VALUES</code> syntax. See <a class="xref" href="impala_insert.html#insert">INSERT Statement</a> for details.
+ </p>
+
+ <p class="p">
+ The <var class="keyword varname">ctas_stmt</var> is a <code class="ph codeph">CREATE TABLE</code> statement using the <code class="ph codeph">AS
+ SELECT</code> clause, typically abbreviated as a <span class="q">"CTAS"</span> operation. See
+ <a class="xref" href="impala_create_table.html#create_table">CREATE TABLE Statement</a> for details.
+ </p>
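+  <p class="p">
+      For example, any of the three statement forms can be prefixed with <code class="ph codeph">EXPLAIN</code>.
+      (The table names here are hypothetical, for illustration only.)
+  </p>
+
+<pre class="pre codeblock"><code>EXPLAIN SELECT c1, c2 FROM example_table WHERE c1 > 100;
+EXPLAIN INSERT INTO example_table SELECT * FROM staging_table;
+EXPLAIN CREATE TABLE example_copy AS SELECT * FROM example_table;
+</code></pre>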
+
+ <p class="p">
+ <strong class="ph b">Usage notes:</strong>
+ </p>
+
+  <p class="p">
+      You can interpret the output to judge whether the query is performing efficiently, and adjust the query
+      and/or the schema if not. For example, you might change the conditions in the <code class="ph codeph">WHERE</code> clause, add
+      hints to make join operations more efficient, introduce subqueries, change the order of tables in a join, add
+      or change partitioning for a table, collect column statistics and/or table statistics in Hive, or take other
+      performance tuning steps.
+  </p>
+
+ <p class="p">
+ The <code class="ph codeph">EXPLAIN</code> output reminds you if table or column statistics are missing from any table
+ involved in the query. These statistics are important for optimizing queries involving large tables or
+ multi-table joins. See <a class="xref" href="impala_compute_stats.html#compute_stats">COMPUTE STATS Statement</a> for how to gather statistics,
+ and <a class="xref" href="impala_perf_stats.html#perf_stats">Table and Column Statistics</a> for how to use this information for query tuning.
+ </p>
+
+ <div class="p">
+ Read the <code class="ph codeph">EXPLAIN</code> plan from bottom to top:
+ <ul class="ul">
+ <li class="li">
+ The last part of the plan shows the low-level details such as the expected amount of data that will be
+ read, where you can judge the effectiveness of your partitioning strategy and estimate how long it will
+ take to scan a table based on total data size and the size of the cluster.
+ </li>
+
+      <li class="li">
+        As you work your way up, you see the operations that will be parallelized and performed on each
+        Impala node.
+      </li>
+
+ <li class="li">
+ At the higher levels, you see how data flows when intermediate result sets are combined and transmitted
+ from one node to another.
+ </li>
+
+ <li class="li">
+ See <a class="xref" href="../shared/../topics/impala_explain_level.html#explain_level">EXPLAIN_LEVEL Query Option</a> for details about the
+ <code class="ph codeph">EXPLAIN_LEVEL</code> query option, which lets you customize how much detail to show in the
+ <code class="ph codeph">EXPLAIN</code> plan depending on whether you are doing high-level or low-level tuning,
+ dealing with logical or physical aspects of the query.
+ </li>
+ </ul>
+ </div>
+
+ <p class="p">
+ If you come from a traditional database background and are not familiar with data warehousing, keep in mind
+ that Impala is optimized for full table scans across very large tables. The structure and distribution of
+ this data is typically not suitable for the kind of indexing and single-row lookups that are common in OLTP
+      environments. Seeing a query scan an entire large table is common, and is not necessarily a sign of
+      an inefficient query. Of course, if you can reduce the volume of scanned data by orders of magnitude, for
+ example by using a query that affects only certain partitions within a partitioned table, then you might be
+ able to optimize a query so that it executes in seconds rather than minutes.
+ </p>
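+  <p class="p">
+      For example, on a hypothetical table partitioned by a <code class="ph codeph">year</code> column, a filter
+      on the partition key lets you confirm in the <code class="ph codeph">partitions=</code> line of the scan node
+      that only a fraction of the partitions will be read:
+  </p>
+
+<pre class="pre codeblock"><code>EXPLAIN SELECT COUNT(*) FROM sales_partitioned WHERE year = 2016;
+</code></pre>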
+
+ <p class="p">
+ For more information and examples to help you interpret <code class="ph codeph">EXPLAIN</code> output, see
+ <a class="xref" href="impala_explain_plan.html#perf_explain">Using the EXPLAIN Plan for Performance Tuning</a>.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Extended EXPLAIN output:</strong>
+ </p>
+
+ <p class="p">
+ For performance tuning of complex queries, and capacity planning (such as using the admission control and
+ resource management features), you can enable more detailed and informative output for the
+ <code class="ph codeph">EXPLAIN</code> statement. In the <span class="keyword cmdname">impala-shell</span> interpreter, issue the command
+ <code class="ph codeph">SET EXPLAIN_LEVEL=<var class="keyword varname">level</var></code>, where <var class="keyword varname">level</var> is an integer
+ from 0 to 3 or corresponding mnemonic values <code class="ph codeph">minimal</code>, <code class="ph codeph">standard</code>,
+ <code class="ph codeph">extended</code>, or <code class="ph codeph">verbose</code>.
+ </p>
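+  <p class="p">
+      For example, the following two commands are equivalent ways of enabling the
+      <code class="ph codeph">extended</code> level of detail:
+  </p>
+
+<pre class="pre codeblock"><code>SET EXPLAIN_LEVEL=2;
+SET EXPLAIN_LEVEL=extended;
+</code></pre>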
+
+ <p class="p">
+ When extended <code class="ph codeph">EXPLAIN</code> output is enabled, <code class="ph codeph">EXPLAIN</code> statements print
+ information about estimated memory requirements, minimum number of virtual cores, and so on.
+
+ </p>
+
+ <p class="p">
+ See <a class="xref" href="impala_explain_level.html#explain_level">EXPLAIN_LEVEL Query Option</a> for details and examples.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Examples:</strong>
+ </p>
+
+ <p class="p">
+ This example shows how the standard <code class="ph codeph">EXPLAIN</code> output moves from the lowest (physical) level to
+ the higher (logical) levels. The query begins by scanning a certain amount of data; each node performs an
+ aggregation operation (evaluating <code class="ph codeph">COUNT(*)</code>) on some subset of data that is local to that
+ node; the intermediate results are transmitted back to the coordinator node (labelled here as the
+ <code class="ph codeph">EXCHANGE</code> node); lastly, the intermediate results are summed to display the final result.
+ </p>
+
+<pre class="pre codeblock" id="explain__explain_plan_simple"><code>[impalad-host:21000] > explain select count(*) from customer_address;
++----------------------------------------------------------+
+| Explain String |
++----------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=42.00MB VCores=1 |
+| |
+| 03:AGGREGATE [MERGE FINALIZE] |
+| | output: sum(count(*)) |
+| | |
+| 02:EXCHANGE [PARTITION=UNPARTITIONED] |
+| | |
+| 01:AGGREGATE |
+| | output: count(*) |
+| | |
+| 00:SCAN HDFS [default.customer_address] |
+| partitions=1/1 size=5.25MB |
++----------------------------------------------------------+
+</code></pre>
+
+ <p class="p">
+ These examples show how the extended <code class="ph codeph">EXPLAIN</code> output becomes more accurate and informative as
+ statistics are gathered by the <code class="ph codeph">COMPUTE STATS</code> statement. Initially, much of the information
+ about data size and distribution is marked <span class="q">"unavailable"</span>. Impala can determine the raw data size, but
+ not the number of rows or number of distinct values for each column without additional analysis. The
+ <code class="ph codeph">COMPUTE STATS</code> statement performs this analysis, so a subsequent <code class="ph codeph">EXPLAIN</code>
+ statement has additional information to use in deciding how to optimize the distributed query.
+ </p>
+
+
+
+<pre class="pre codeblock"><code>[localhost:21000] > set explain_level=extended;
+EXPLAIN_LEVEL set to extended
+[localhost:21000] > explain select x from t1;
++----------------------------------------------------------+
+| Explain String |
++----------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=32.00MB VCores=1 |
+| |
+| 01:EXCHANGE [PARTITION=UNPARTITIONED] |
+| | hosts=1 per-host-mem=unavailable |
+<strong class="ph b">| | tuple-ids=0 row-size=4B cardinality=unavailable |</strong>
+| | |
+| 00:SCAN HDFS [default.t1, PARTITION=RANDOM]              |
+| partitions=1/1 size=36B |
+<strong class="ph b">| table stats: unavailable |</strong>
+<strong class="ph b">| column stats: unavailable |</strong>
+| hosts=1 per-host-mem=32.00MB |
+<strong class="ph b">| tuple-ids=0 row-size=4B cardinality=unavailable |</strong>
++----------------------------------------------------------+
+</code></pre>
+
+<pre class="pre codeblock"><code>[localhost:21000] > compute stats t1;
++-----------------------------------------+
+| summary |
++-----------------------------------------+
+| Updated 1 partition(s) and 1 column(s). |
++-----------------------------------------+
+[localhost:21000] > explain select x from t1;
++----------------------------------------------------------+
+| Explain String |
++----------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=64.00MB VCores=1 |
+| |
+| 01:EXCHANGE [PARTITION=UNPARTITIONED] |
+| | hosts=1 per-host-mem=unavailable |
+| | tuple-ids=0 row-size=4B cardinality=0 |
+| | |
+| 00:SCAN HDFS [default.t1, PARTITION=RANDOM] |
+| partitions=1/1 size=36B |
+<strong class="ph b">| table stats: 0 rows total |</strong>
+<strong class="ph b">| column stats: all |</strong>
+| hosts=1 per-host-mem=64.00MB |
+<strong class="ph b">| tuple-ids=0 row-size=4B cardinality=0 |</strong>
++----------------------------------------------------------+
+</code></pre>
+
+ <p class="p">
+ <strong class="ph b">Security considerations:</strong>
+ </p>
+ <p class="p">
+ If these statements in your environment contain sensitive literal values such as credit card numbers or tax
+ identifiers, Impala can redact this sensitive information when displaying the statements in log files and
+ other administrative contexts. See <span class="xref">the documentation for your Apache Hadoop distribution</span> for details.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Cancellation:</strong> Cannot be cancelled.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">HDFS permissions:</strong>
+ </p>
+ <p class="p">
+
+ The user ID that the <span class="keyword cmdname">impalad</span> daemon runs under,
+ typically the <code class="ph codeph">impala</code> user, must have read
+ and execute permissions for all applicable directories in all source tables
+ for the query that is being explained.
+ (A <code class="ph codeph">SELECT</code> operation could read files from multiple different HDFS directories
+ if the source table is partitioned.)
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Kudu considerations:</strong>
+ </p>
+ <p class="p">
+ The <code class="ph codeph">EXPLAIN</code> statement displays equivalent plan
+ information for queries against Kudu tables as for queries
+ against HDFS-based tables.
+ </p>
+
+ <div class="p">
+ To see which predicates Impala can <span class="q">"push down"</span> to Kudu for
+ efficient evaluation, without transmitting unnecessary rows back
+ to Impala, look for the <code class="ph codeph">kudu predicates</code> item in
+ the scan phase of the query. The label <code class="ph codeph">kudu predicates</code>
+ indicates a condition that can be evaluated efficiently on the Kudu
+ side. The label <code class="ph codeph">predicates</code> in a <code class="ph codeph">SCAN KUDU</code>
+ node indicates a condition that is evaluated by Impala.
+ For example, in a table with primary key column <code class="ph codeph">X</code>
+ and non-primary key column <code class="ph codeph">Y</code>, you can see that
+ some operators in the <code class="ph codeph">WHERE</code> clause are evaluated
+ immediately by Kudu and others are evaluated later by Impala:
+<pre class="pre codeblock"><code>EXPLAIN SELECT x, y FROM kudu_table WHERE
+  x = 1 AND x NOT IN (2,3) AND y = 1
+  AND x IS NOT NULL AND x > 0;
++----------------
+| Explain String
++----------------
+...
+| 00:SCAN KUDU [default.kudu_table]
+| predicates: x IS NOT NULL, x NOT IN (2, 3)
+| kudu predicates: x = 1, x > 0, y = 1
+</code></pre>
+    Only binary predicates and <code class="ph codeph">IN</code> predicates whose
+    literal values exactly match the types in the Kudu table, and therefore do not
+    require any casting, can be pushed down to Kudu.
+ </div>
+
+ <p class="p">
+ <strong class="ph b">Related information:</strong>
+ </p>
+ <p class="p">
+ <a class="xref" href="impala_select.html#select">SELECT Statement</a>,
+ <a class="xref" href="impala_insert.html#insert">INSERT Statement</a>,
+ <a class="xref" href="impala_create_table.html#create_table">CREATE TABLE Statement</a>,
+ <a class="xref" href="impala_explain_plan.html#explain_plan">Understanding Impala Query Performance - EXPLAIN Plans and Query Profiles</a>
+ </p>
+
+ </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_langref_sql.html">Impala SQL Statements</a></div></div></nav></article></main></body></html>
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_explain_level.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_explain_level.html b/docs/build/html/topics/impala_explain_level.html
new file mode 100644
index 0000000..c9f527b
--- /dev/null
+++ b/docs/build/html/topics/impala_explain_level.html
@@ -0,0 +1,342 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="explain_level"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>EXPLAIN_LEVEL Query Option</title></head><body id="explain_level"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+ <h1 class="title topictitle1" id="ariaid-title1">EXPLAIN_LEVEL Query Option</h1>
+
+
+
+ <div class="body conbody">
+
+ <p class="p">
+
+ Controls the amount of detail provided in the output of the <code class="ph codeph">EXPLAIN</code> statement. The basic
+ output can help you identify high-level performance issues such as scanning a higher volume of data or more
+ partitions than you expect. The higher levels of detail show how intermediate results flow between nodes and
+ how different SQL operations such as <code class="ph codeph">ORDER BY</code>, <code class="ph codeph">GROUP BY</code>, joins, and
+ <code class="ph codeph">WHERE</code> clauses are implemented within a distributed query.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Type:</strong> <code class="ph codeph">STRING</code> or <code class="ph codeph">INT</code>
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Default:</strong> <code class="ph codeph">1</code>
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Arguments:</strong>
+ </p>
+
+ <p class="p">
+ The allowed range of numeric values for this option is 0 to 3:
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ <code class="ph codeph">0</code> or <code class="ph codeph">MINIMAL</code>: A barebones list, one line per operation. Primarily useful
+ for checking the join order in very long queries where the regular <code class="ph codeph">EXPLAIN</code> output is too
+ long to read easily.
+ </li>
+
+ <li class="li">
+ <code class="ph codeph">1</code> or <code class="ph codeph">STANDARD</code>: The default level of detail, showing the logical way that
+ work is split up for the distributed query.
+ </li>
+
+ <li class="li">
+      <code class="ph codeph">2</code> or <code class="ph codeph">EXTENDED</code>: Includes additional detail about how the query planner
+      uses statistics in its decision-making process, helping you understand how a query could be tuned by gathering
+      statistics, using query hints, adding or removing predicates, and so on.
+ </li>
+
+ <li class="li">
+ <code class="ph codeph">3</code> or <code class="ph codeph">VERBOSE</code>: The maximum level of detail, showing how work is split up
+ within each node into <span class="q">"query fragments"</span> that are connected in a pipeline. This extra detail is
+ primarily useful for low-level performance testing and tuning within Impala itself, rather than for
+ rewriting the SQL code at the user level.
+ </li>
+ </ul>
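+  <p class="p">
+      For example, either the numeric level or its mnemonic can be used when setting this option in
+      <span class="keyword cmdname">impala-shell</span>:
+  </p>
+
+<pre class="pre codeblock"><code>SET EXPLAIN_LEVEL=0;
+SET EXPLAIN_LEVEL=minimal;
+</code></pre>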
+
+ <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+ Prior to Impala 1.3, the allowed argument range for <code class="ph codeph">EXPLAIN_LEVEL</code> was 0 to 1: level 0 had
+ the mnemonic <code class="ph codeph">NORMAL</code>, and level 1 was <code class="ph codeph">VERBOSE</code>. In Impala 1.3 and higher,
+ <code class="ph codeph">NORMAL</code> is not a valid mnemonic value, and <code class="ph codeph">VERBOSE</code> still applies to the
+ highest level of detail but now corresponds to level 3. You might need to adjust the values if you have any
+ older <code class="ph codeph">impala-shell</code> script files that set the <code class="ph codeph">EXPLAIN_LEVEL</code> query option.
+ </div>
+
+ <p class="p">
+ Changing the value of this option controls the amount of detail in the output of the <code class="ph codeph">EXPLAIN</code>
+ statement. The extended information from level 2 or 3 is especially useful during performance tuning, when
+ you need to confirm whether the work for the query is distributed the way you expect, particularly for the
+ most resource-intensive operations such as join queries against large tables, queries against tables with
+ large numbers of partitions, and insert operations for Parquet tables. The extended information also helps to
+ check estimated resource usage when you use the admission control or resource management features explained
+ in <a class="xref" href="impala_resource_management.html#resource_management">Resource Management for Impala</a>. See
+ <a class="xref" href="impala_explain.html#explain">EXPLAIN Statement</a> for the syntax of the <code class="ph codeph">EXPLAIN</code> statement, and
+ <a class="xref" href="impala_explain_plan.html#perf_explain">Using the EXPLAIN Plan for Performance Tuning</a> for details about how to use the extended information.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Usage notes:</strong>
+ </p>
+
+ <p class="p">
+ As always, read the <code class="ph codeph">EXPLAIN</code> output from bottom to top. The lowest lines represent the
+ initial work of the query (scanning data files), the lines in the middle represent calculations done on each
+ node and how intermediate results are transmitted from one node to another, and the topmost lines represent
+ the final results being sent back to the coordinator node.
+ </p>
+
+ <p class="p">
+ The numbers in the left column are generated internally during the initial planning phase and do not
+ represent the actual order of operations, so it is not significant if they appear out of order in the
+ <code class="ph codeph">EXPLAIN</code> output.
+ </p>
+
+ <p class="p">
+ At all <code class="ph codeph">EXPLAIN</code> levels, the plan contains a warning if any tables in the query are missing
+ statistics. Use the <code class="ph codeph">COMPUTE STATS</code> statement to gather statistics for each table and suppress
+ this warning. See <a class="xref" href="impala_perf_stats.html#perf_stats">Table and Column Statistics</a> for details about how the statistics help
+ query performance.
+ </p>
+
+ <p class="p">
+ The <code class="ph codeph">PROFILE</code> command in <span class="keyword cmdname">impala-shell</span> always starts with an explain plan
+ showing full detail, the same as with <code class="ph codeph">EXPLAIN_LEVEL=3</code>. <span class="ph">After the explain
+ plan comes the executive summary, the same output as produced by the <code class="ph codeph">SUMMARY</code> command in
+ <span class="keyword cmdname">impala-shell</span>.</span>
+ </p>
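+  <p class="p">
+      For example, after running a query in <span class="keyword cmdname">impala-shell</span>, you can issue the
+      <code class="ph codeph">PROFILE</code> command by itself to see the full-detail plan and execution
+      information for the most recent query:
+  </p>
+
+<pre class="pre codeblock"><code>SELECT COUNT(*) FROM t1;
+PROFILE;
+</code></pre>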
+
+ <p class="p">
+ <strong class="ph b">Examples:</strong>
+ </p>
+
+ <p class="p">
+ These examples use a trivial, empty table to illustrate how the essential aspects of query planning are shown
+ in <code class="ph codeph">EXPLAIN</code> output:
+ </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > create table t1 (x int, s string);
+[localhost:21000] > set explain_level=1;
+[localhost:21000] > explain select count(*) from t1;
++------------------------------------------------------------------------+
+| Explain String |
++------------------------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=10.00MB VCores=1 |
+| WARNING: The following tables are missing relevant table and/or column |
+| statistics. |
+| explain_plan.t1 |
+| |
+| 03:AGGREGATE [MERGE FINALIZE] |
+| | output: sum(count(*)) |
+| | |
+| 02:EXCHANGE [PARTITION=UNPARTITIONED] |
+| | |
+| 01:AGGREGATE |
+| | output: count(*) |
+| | |
+| 00:SCAN HDFS [explain_plan.t1] |
+| partitions=1/1 size=0B |
++------------------------------------------------------------------------+
+[localhost:21000] > explain select * from t1;
++------------------------------------------------------------------------+
+| Explain String |
++------------------------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=-9223372036854775808B VCores=0 |
+| WARNING: The following tables are missing relevant table and/or column |
+| statistics. |
+| explain_plan.t1 |
+| |
+| 01:EXCHANGE [PARTITION=UNPARTITIONED] |
+| | |
+| 00:SCAN HDFS [explain_plan.t1] |
+| partitions=1/1 size=0B |
++------------------------------------------------------------------------+
+[localhost:21000] > set explain_level=2;
+[localhost:21000] > explain select * from t1;
++------------------------------------------------------------------------+
+| Explain String |
++------------------------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=-9223372036854775808B VCores=0 |
+| WARNING: The following tables are missing relevant table and/or column |
+| statistics. |
+| explain_plan.t1 |
+| |
+| 01:EXCHANGE [PARTITION=UNPARTITIONED] |
+| | hosts=0 per-host-mem=unavailable |
+| | tuple-ids=0 row-size=19B cardinality=unavailable |
+| | |
+| 00:SCAN HDFS [explain_plan.t1, PARTITION=RANDOM] |
+| partitions=1/1 size=0B |
+| table stats: unavailable |
+| column stats: unavailable |
+| hosts=0 per-host-mem=0B |
+| tuple-ids=0 row-size=19B cardinality=unavailable |
++------------------------------------------------------------------------+
+[localhost:21000] > set explain_level=3;
+[localhost:21000] > explain select * from t1;
++------------------------------------------------------------------------+
+| Explain String |
++------------------------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=-9223372036854775808B VCores=0 |
+<strong class="ph b">| WARNING: The following tables are missing relevant table and/or column |</strong>
+<strong class="ph b">| statistics. |</strong>
+<strong class="ph b">| explain_plan.t1 |</strong>
+| |
+| F01:PLAN FRAGMENT [PARTITION=UNPARTITIONED] |
+| 01:EXCHANGE [PARTITION=UNPARTITIONED] |
+| hosts=0 per-host-mem=unavailable |
+| tuple-ids=0 row-size=19B cardinality=unavailable |
+| |
+| F00:PLAN FRAGMENT [PARTITION=RANDOM] |
+| DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, PARTITION=UNPARTITIONED] |
+| 00:SCAN HDFS [explain_plan.t1, PARTITION=RANDOM] |
+| partitions=1/1 size=0B |
+<strong class="ph b">| table stats: unavailable |</strong>
+<strong class="ph b">| column stats: unavailable |</strong>
+| hosts=0 per-host-mem=0B |
+| tuple-ids=0 row-size=19B cardinality=unavailable |
++------------------------------------------------------------------------+
+</code></pre>
+
+ <p class="p">
+ As the warning message demonstrates, most of the information needed for Impala to do efficient query
+ planning, and for you to understand the performance characteristics of the query, requires running the
+ <code class="ph codeph">COMPUTE STATS</code> statement for the table:
+ </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > compute stats t1;
++-----------------------------------------+
+| summary |
++-----------------------------------------+
+| Updated 1 partition(s) and 2 column(s). |
++-----------------------------------------+
+[localhost:21000] > explain select * from t1;
++------------------------------------------------------------------------+
+| Explain String |
++------------------------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=-9223372036854775808B VCores=0 |
+| |
+| F01:PLAN FRAGMENT [PARTITION=UNPARTITIONED] |
+| 01:EXCHANGE [PARTITION=UNPARTITIONED] |
+| hosts=0 per-host-mem=unavailable |
+| tuple-ids=0 row-size=20B cardinality=0 |
+| |
+| F00:PLAN FRAGMENT [PARTITION=RANDOM] |
+| DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, PARTITION=UNPARTITIONED] |
+| 00:SCAN HDFS [explain_plan.t1, PARTITION=RANDOM] |
+| partitions=1/1 size=0B |
+<strong class="ph b">| table stats: 0 rows total |</strong>
+<strong class="ph b">| column stats: all |</strong>
+| hosts=0 per-host-mem=0B |
+| tuple-ids=0 row-size=20B cardinality=0 |
++------------------------------------------------------------------------+
+</code></pre>
+
+ <p class="p">
+ Joins and other complicated, multi-part queries are the ones where you most commonly need to examine the
+ <code class="ph codeph">EXPLAIN</code> output and customize the amount of detail in the output. This example shows the
+ default <code class="ph codeph">EXPLAIN</code> output for a three-way join query, then the equivalent output with a
+ <code class="ph codeph">[SHUFFLE]</code> hint to change the join mechanism between the first two tables from a broadcast
+ join to a shuffle join.
+ </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > set explain_level=1;
+[localhost:21000] > explain select one.*, two.*, three.* from t1 one, t1 two, t1 three where one.x = two.x and two.x = three.x;
++---------------------------------------------------------+
+| Explain String |
++---------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=4.00GB VCores=3 |
+| |
+| 07:EXCHANGE [PARTITION=UNPARTITIONED] |
+| | |
+<strong class="ph b">| 04:HASH JOIN [INNER JOIN, BROADCAST] |</strong>
+| | hash predicates: two.x = three.x |
+| | |
+<strong class="ph b">| |--06:EXCHANGE [BROADCAST] |</strong>
+| | | |
+| | 02:SCAN HDFS [explain_plan.t1 three] |
+| | partitions=1/1 size=0B |
+| | |
+<strong class="ph b">| 03:HASH JOIN [INNER JOIN, BROADCAST] |</strong>
+| | hash predicates: one.x = two.x |
+| | |
+<strong class="ph b">| |--05:EXCHANGE [BROADCAST] |</strong>
+| | | |
+| | 01:SCAN HDFS [explain_plan.t1 two] |
+| | partitions=1/1 size=0B |
+| | |
+| 00:SCAN HDFS [explain_plan.t1 one] |
+| partitions=1/1 size=0B |
++---------------------------------------------------------+
+[localhost:21000] > explain select one.*, two.*, three.*
+ > from t1 one join [shuffle] t1 two join t1 three
+ > where one.x = two.x and two.x = three.x;
++---------------------------------------------------------+
+| Explain String |
++---------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=4.00GB VCores=3 |
+| |
+| 08:EXCHANGE [PARTITION=UNPARTITIONED] |
+| | |
+<strong class="ph b">| 04:HASH JOIN [INNER JOIN, BROADCAST] |</strong>
+| | hash predicates: two.x = three.x |
+| | |
+<strong class="ph b">| |--07:EXCHANGE [BROADCAST] |</strong>
+| | | |
+| | 02:SCAN HDFS [explain_plan.t1 three] |
+| | partitions=1/1 size=0B |
+| | |
+<strong class="ph b">| 03:HASH JOIN [INNER JOIN, PARTITIONED] |</strong>
+| | hash predicates: one.x = two.x |
+| | |
+<strong class="ph b">| |--06:EXCHANGE [PARTITION=HASH(two.x)] |</strong>
+| | | |
+| | 01:SCAN HDFS [explain_plan.t1 two] |
+| | partitions=1/1 size=0B |
+| | |
+<strong class="ph b">| 05:EXCHANGE [PARTITION=HASH(one.x)] |</strong>
+| | |
+| 00:SCAN HDFS [explain_plan.t1 one] |
+| partitions=1/1 size=0B |
++---------------------------------------------------------+
+</code></pre>
+
+ <p class="p">
+ For a join involving many different tables, the default <code class="ph codeph">EXPLAIN</code> output might stretch over
+ several pages, and the only details you care about might be the join order and the mechanism (broadcast or
+ shuffle) for joining each pair of tables. In that case, you might set <code class="ph codeph">EXPLAIN_LEVEL</code> to its
+ lowest value of 0, to focus on just the join order and join mechanism for each stage. The following example
+ shows how the rows from the first and second joined tables are hashed and divided among the nodes of the
+ cluster for further filtering; then the entire contents of the third table are broadcast to all nodes for the
+ final stage of join processing.
+ </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > set explain_level=0;
+[localhost:21000] > explain select one.*, two.*, three.*
+ > from t1 one join [shuffle] t1 two join t1 three
+ > where one.x = two.x and two.x = three.x;
++---------------------------------------------------------+
+| Explain String |
++---------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=4.00GB VCores=3 |
+| |
+| 08:EXCHANGE [PARTITION=UNPARTITIONED] |
+<strong class="ph b">| 04:HASH JOIN [INNER JOIN, BROADCAST] |</strong>
+<strong class="ph b">| |--07:EXCHANGE [BROADCAST] |</strong>
+| | 02:SCAN HDFS [explain_plan.t1 three] |
+<strong class="ph b">| 03:HASH JOIN [INNER JOIN, PARTITIONED] |</strong>
+<strong class="ph b">| |--06:EXCHANGE [PARTITION=HASH(two.x)] |</strong>
+| | 01:SCAN HDFS [explain_plan.t1 two] |
+<strong class="ph b">| 05:EXCHANGE [PARTITION=HASH(one.x)] |</strong>
+| 00:SCAN HDFS [explain_plan.t1 one] |
++---------------------------------------------------------+
+</code></pre>
+
+
+
+ </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html>
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_explain_plan.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_explain_plan.html b/docs/build/html/topics/impala_explain_plan.html
new file mode 100644
index 0000000..bcd0855
--- /dev/null
+++ b/docs/build/html/topics/impala_explain_plan.html
@@ -0,0 +1,592 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_performance.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="explain_plan"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Understanding Impala Query Performance - EXPLAIN Plans and Query Profiles</title></head><body id="explain_plan"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+ <h1 class="title topictitle1" id="ariaid-title1">Understanding Impala Query Performance - EXPLAIN Plans and Query Profiles</h1>
+
+
+
+ <div class="body conbody">
+
+ <p class="p">
+ To understand the high-level performance considerations for Impala queries, read the output of the
+ <code class="ph codeph">EXPLAIN</code> statement for the query. You can get the <code class="ph codeph">EXPLAIN</code> plan without
+ actually running the query itself.
+ </p>
+
+ <p class="p">
+ For an overview of the physical performance characteristics of a query, issue the <code class="ph codeph">SUMMARY</code>
+ command in <span class="keyword cmdname">impala-shell</span> immediately after executing a query. This condensed information
+ shows which phases of execution took the most time, and how the estimates for memory usage and number of rows
+ at each phase compare to the actual values.
+ </p>
+
+ <p class="p">
+ To understand the detailed performance characteristics for a query, issue the <code class="ph codeph">PROFILE</code>
+ statement in <span class="keyword cmdname">impala-shell</span> immediately after executing a query. This low-level information
+ includes physical details about memory, CPU, I/O, and network usage, and thus is only available after the
+ query is actually run.
+ </p>
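+
+ <p class="p">
+ The three statements fit together in a typical tuning session. The following sketch (using a hypothetical
+ table name, <code class="ph codeph">sample_table</code>) shows the usual order: examine the plan first, then
+ run the query, then review the condensed and detailed reports:
+ </p>
+
+<pre class="pre codeblock"><code>[impalad-host:21000] > explain select count(*) from sample_table;  -- plan only; the query does not run
+[impalad-host:21000] > select count(*) from sample_table;          -- actually run the query
+[impalad-host:21000] > summary;  -- condensed overview of the query just run
+[impalad-host:21000] > profile;  -- detailed low-level report for the query just run
+</code></pre>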
+
+ <p class="p toc inpage"></p>
+
+ <p class="p">
+ Also, see <a class="xref" href="impala_hbase.html#hbase_performance">Performance Considerations for the Impala-HBase Integration</a>
+ and <a class="xref" href="impala_s3.html#s3_performance">Understanding and Tuning Impala Query Performance for S3 Data</a>
+ for examples of interpreting
+ <code class="ph codeph">EXPLAIN</code> plans for queries against HBase tables
+ <span class="ph">and data stored in the Amazon Simple Storage Service (S3)</span>.
+ </p>
+ </div>
+
+ <nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_performance.html">Tuning Impala for Performance</a></div></div></nav><article class="topic concept nested1" aria-labelledby="ariaid-title2" id="explain_plan__perf_explain">
+
+ <h2 class="title topictitle2" id="ariaid-title2">Using the EXPLAIN Plan for Performance Tuning</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ The <code class="ph codeph"><a class="xref" href="impala_explain.html#explain">EXPLAIN</a></code> statement gives you an outline
+ of the logical steps that a query will perform, such as how the work will be distributed among the nodes
+ and how intermediate results will be combined to produce the final result set. You can see these details
+ before actually running the query. You can use this information to check that the query will not operate in
+ an unexpected or inefficient way.
+ </p>
+
+
+
+<pre class="pre codeblock"><code>[impalad-host:21000] > explain select count(*) from customer_address;
++----------------------------------------------------------+
+| Explain String |
++----------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=42.00MB VCores=1 |
+| |
+| 03:AGGREGATE [MERGE FINALIZE] |
+| | output: sum(count(*)) |
+| | |
+| 02:EXCHANGE [PARTITION=UNPARTITIONED] |
+| | |
+| 01:AGGREGATE |
+| | output: count(*) |
+| | |
+| 00:SCAN HDFS [default.customer_address] |
+| partitions=1/1 size=5.25MB |
++----------------------------------------------------------+
+</code></pre>
+
+ <div class="p">
+ Read the <code class="ph codeph">EXPLAIN</code> plan from bottom to top:
+ <ul class="ul">
+ <li class="li">
+ The last part of the plan shows the low-level details such as the expected amount of data that will be
+ read, where you can judge the effectiveness of your partitioning strategy and estimate how long it will
+ take to scan a table based on total data size and the size of the cluster.
+ </li>
+
+ <li class="li">
+ Working your way up, you next see the operations that will be parallelized and performed on each
+ Impala node.
+ </li>
+
+ <li class="li">
+ At the higher levels, you see how data flows when intermediate result sets are combined and transmitted
+ from one node to another.
+ </li>
+
+ <li class="li">
+ See <a class="xref" href="../shared/../topics/impala_explain_level.html#explain_level">EXPLAIN_LEVEL Query Option</a> for details about the
+ <code class="ph codeph">EXPLAIN_LEVEL</code> query option, which lets you customize how much detail to show in the
+ <code class="ph codeph">EXPLAIN</code> plan depending on whether you are doing high-level or low-level tuning,
+ and whether you are concerned with the logical or the physical aspects of the query.
+ </li>
+ </ul>
+ </div>
+
+ <p class="p">
+ The <code class="ph codeph">EXPLAIN</code> plan is also printed at the beginning of the query profile report described in
+ <a class="xref" href="#perf_profile">Using the Query Profile for Performance Tuning</a>, for convenience in examining both the logical and physical aspects of the
+ query side-by-side.
+ </p>
+
+ <p class="p">
+ The amount of detail displayed in the <code class="ph codeph">EXPLAIN</code> output is controlled by the
+ <a class="xref" href="impala_explain_level.html#explain_level">EXPLAIN_LEVEL</a> query option. You typically
+ increase this setting from <code class="ph codeph">normal</code> to <code class="ph codeph">verbose</code> (or from <code class="ph codeph">0</code>
+ to <code class="ph codeph">1</code>) when double-checking the presence of table and column statistics during performance
+ tuning, or when estimating query resource usage in conjunction with the resource management features.
+ </p>
+
+
+ </div>
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title3" id="explain_plan__perf_summary">
+
+ <h2 class="title topictitle2" id="ariaid-title3">Using the SUMMARY Report for Performance Tuning</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ The <code class="ph codeph"><a class="xref" href="impala_shell_commands.html#shell_commands">SUMMARY</a></code> command within
+ the <span class="keyword cmdname">impala-shell</span> interpreter gives you an easy-to-digest overview of the timings for the
+ different phases of execution for a query. Like the <code class="ph codeph">EXPLAIN</code> plan, it makes it
+ easy to spot potential performance bottlenecks. Like the <code class="ph codeph">PROFILE</code> output, it is
+ only available after the query runs, and so it displays actual timing numbers.
+ </p>
+
+ <p class="p">
+ The <code class="ph codeph">SUMMARY</code> report is also printed at the beginning of the query profile report described
+ in <a class="xref" href="#perf_profile">Using the Query Profile for Performance Tuning</a>, for convenience in examining high-level and low-level aspects of the query
+ side-by-side.
+ </p>
+
+ <p class="p">
+ For example, here is a query involving an aggregate function, on a single-node VM. The different stages of
+ the query and their timings are shown (rolled up for all nodes), along with estimated and actual values
+ used in planning the query. In this case, the <code class="ph codeph">AVG()</code> function is computed for a subset of
+ data on each node (stage 01) and then the aggregated results from all nodes are combined at the end (stage
+ 03). You can see which stages took the most time, and whether any estimates were substantially different
+ from the actual values. (When examining the time values, be sure to consider the suffixes such
+ as <code class="ph codeph">us</code> for microseconds and <code class="ph codeph">ms</code> for milliseconds, rather than just looking
+ for the largest numbers.)
+ </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > select avg(ss_sales_price) from store_sales where ss_coupon_amt = 0;
++---------------------+
+| avg(ss_sales_price) |
++---------------------+
+| 37.80770926328327 |
++---------------------+
+[localhost:21000] > summary;
++--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
+| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail |
++--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
+| 03:AGGREGATE | 1 | 1.03ms | 1.03ms | 1 | 1 | 48.00 KB | -1 B | MERGE FINALIZE |
+| 02:EXCHANGE | 1 | 0ns | 0ns | 1 | 1 | 0 B | -1 B | UNPARTITIONED |
+| 01:AGGREGATE | 1 | 30.79ms | 30.79ms | 1 | 1 | 80.00 KB | 10.00 MB | |
+| 00:SCAN HDFS | 1 | 5.45s | 5.45s | 2.21M | -1 | 64.05 MB | 432.00 MB | tpc.store_sales |
++--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
+</code></pre>
+
+ <p class="p">
+ Notice how the longest initial phase of the query is measured in seconds (s), while later phases working on
+ smaller intermediate results are measured in milliseconds (ms) or even nanoseconds (ns).
+ </p>
+
+ <p class="p">
+ Here is an example from a more complicated query, as it would appear in the <code class="ph codeph">PROFILE</code>
+ output:
+ </p>
+
+<pre class="pre codeblock"><code>Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail
+------------------------------------------------------------------------------------------------------------------------
+09:MERGING-EXCHANGE 1 79.738us 79.738us 5 5 0 -1.00 B UNPARTITIONED
+05:TOP-N 3 84.693us 88.810us 5 5 12.00 KB 120.00 B
+04:AGGREGATE 3 5.263ms 6.432ms 5 5 44.00 KB 10.00 MB MERGE FINALIZE
+08:AGGREGATE 3 16.659ms 27.444ms 52.52K 600.12K 3.20 MB 15.11 MB MERGE
+07:EXCHANGE 3 2.644ms 5.1ms 52.52K 600.12K 0 0 HASH(o_orderpriority)
+03:AGGREGATE 3 342.913ms 966.291ms 52.52K 600.12K 10.80 MB 15.11 MB
+02:HASH JOIN 3 2s165ms 2s171ms 144.87K 600.12K 13.63 MB 941.01 KB INNER JOIN, BROADCAST
+|--06:EXCHANGE 3 8.296ms 8.692ms 57.22K 15.00K 0 0 BROADCAST
+| 01:SCAN HDFS 2 1s412ms 1s978ms 57.22K 15.00K 24.21 MB 176.00 MB tpch.orders o
+00:SCAN HDFS 3 8s032ms 8s558ms 3.79M 600.12K 32.29 MB 264.00 MB tpch.lineitem l
+</code></pre>
+ </div>
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title4" id="explain_plan__perf_profile">
+
+ <h2 class="title topictitle2" id="ariaid-title4">Using the Query Profile for Performance Tuning</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ The <code class="ph codeph">PROFILE</code> statement, available in the <span class="keyword cmdname">impala-shell</span> interpreter,
+ produces a detailed low-level report showing how the most recent query was executed. Unlike the
+ <code class="ph codeph">EXPLAIN</code> plan described in <a class="xref" href="#perf_explain">Using the EXPLAIN Plan for Performance Tuning</a>, this information is only available
+ after the query has finished. It shows physical details such as the number of bytes read, maximum memory
+ usage, and so on for each node. You can use this information to determine if the query is I/O-bound or
+ CPU-bound, whether some network condition is imposing a bottleneck, whether a slowdown is affecting some
+ nodes but not others, and to check that recommended configuration settings such as short-circuit local
+ reads are in effect.
+ </p>
+
+ <p class="p">
+ By default, time values in the profile output reflect the wall-clock time taken by an operation.
+ For values denoting system time or user time, the kind of time being measured is reflected in the metric
+ name, such as <code class="ph codeph">ScannerThreadsSysTime</code> or <code class="ph codeph">ScannerThreadsUserTime</code>.
+ For example, a multi-threaded I/O operation might show a small figure for wall-clock time,
+ while the corresponding system time is larger, representing the sum of the CPU time taken by each thread.
+ Or a wall-clock time figure might be larger because it counts time spent waiting, while
+ the corresponding system and user time figures only measure the time while the operation
+ is actively using CPU cycles.
+ </p>
+
+ <p class="p">
+ The <a class="xref" href="impala_explain_plan.html#perf_explain"><code class="ph codeph">EXPLAIN</code> plan</a> is also printed
+ at the beginning of the query profile report, for convenience in examining both the logical and physical
+ aspects of the query side-by-side. The
+ <a class="xref" href="impala_explain_level.html#explain_level">EXPLAIN_LEVEL</a> query option also controls the
+ verbosity of the <code class="ph codeph">EXPLAIN</code> output printed by the <code class="ph codeph">PROFILE</code> command.
+ </p>
+
+
+
+ <p class="p">
+ Here is an example of a query profile, from a straightforward query on a single-node
+ pseudo-distributed cluster, to keep the output relatively brief.
+ </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > profile;
+Query Runtime Profile:
+Query (id=6540a03d4bee0691:4963d6269b210ebd):
+ Summary:
+ Session ID: ea4a197f1c7bf858:c74e66f72e3a33ba
+ Session Type: BEESWAX
+ Start Time: 2013-12-02 17:10:30.263067000
+ End Time: 2013-12-02 17:10:50.932044000
+ Query Type: QUERY
+ Query State: FINISHED
+ Query Status: OK
+ Impala Version: impalad version 1.2.1 RELEASE (build edb5af1bcad63d410bc5d47cc203df3a880e9324)
+ User: doc_demo
+ Network Address: 127.0.0.1:49161
+ Default Db: stats_testing
+ Sql Statement: select t1.s, t2.s from t1 join t2 on (t1.id = t2.parent)
+ Plan:
+----------------
+Estimated Per-Host Requirements: Memory=2.09GB VCores=2
+
+PLAN FRAGMENT 0
+ PARTITION: UNPARTITIONED
+
+ 4:EXCHANGE
+ cardinality: unavailable
+ per-host memory: unavailable
+ tuple ids: 0 1
+
+PLAN FRAGMENT 1
+ PARTITION: RANDOM
+
+ STREAM DATA SINK
+ EXCHANGE ID: 4
+ UNPARTITIONED
+
+ 2:HASH JOIN
+ | join op: INNER JOIN (BROADCAST)
+ | hash predicates:
+ | t1.id = t2.parent
+ | cardinality: unavailable
+ | per-host memory: 2.00GB
+ | tuple ids: 0 1
+ |
+ |----3:EXCHANGE
+ | cardinality: unavailable
+ | per-host memory: 0B
+ | tuple ids: 1
+ |
+ 0:SCAN HDFS
+ table=stats_testing.t1 #partitions=1/1 size=33B
+ table stats: unavailable
+ column stats: unavailable
+ cardinality: unavailable
+ per-host memory: 32.00MB
+ tuple ids: 0
+
+PLAN FRAGMENT 2
+ PARTITION: RANDOM
+
+ STREAM DATA SINK
+ EXCHANGE ID: 3
+ UNPARTITIONED
+
+ 1:SCAN HDFS
+ table=stats_testing.t2 #partitions=1/1 size=960.00KB
+ table stats: unavailable
+ column stats: unavailable
+ cardinality: unavailable
+ per-host memory: 96.00MB
+ tuple ids: 1
+----------------
+ Query Timeline: 20s670ms
+ - Start execution: 2.559ms (2.559ms)
+ - Planning finished: 23.587ms (21.27ms)
+ - Rows available: 666.199ms (642.612ms)
+ - First row fetched: 668.919ms (2.719ms)
+ - Unregister query: 20s668ms (20s000ms)
+ ImpalaServer:
+ - ClientFetchWaitTimer: 19s637ms
+ - RowMaterializationTimer: 167.121ms
+ Execution Profile 6540a03d4bee0691:4963d6269b210ebd:(Active: 837.815ms, % non-child: 0.00%)
+ Per Node Peak Memory Usage: impala-1.example.com:22000(7.42 MB)
+ - FinalizationTimer: 0ns
+ Coordinator Fragment:(Active: 195.198ms, % non-child: 0.00%)
+ MemoryUsage(500.0ms): 16.00 KB, 7.42 MB, 7.33 MB, 7.10 MB, 6.94 MB, 6.71 MB, 6.56 MB, 6.40 MB, 6.17 MB, 6.02 MB, 5.79 MB, 5.63 MB, 5.48 MB, 5.25 MB, 5.09 MB, 4.86 MB, 4.71 MB, 4.47 MB, 4.32 MB, 4.09 MB, 3.93 MB, 3.78 MB, 3.55 MB, 3.39 MB, 3.16 MB, 3.01 MB, 2.78 MB, 2.62 MB, 2.39 MB, 2.24 MB, 2.08 MB, 1.85 MB, 1.70 MB, 1.54 MB, 1.31 MB, 1.16 MB, 948.00 KB, 790.00 KB, 553.00 KB, 395.00 KB, 237.00 KB
+ ThreadUsage(500.0ms): 1
+ - AverageThreadTokens: 1.00
+ - PeakMemoryUsage: 7.42 MB
+ - PrepareTime: 36.144us
+ - RowsProduced: 98.30K (98304)
+ - TotalCpuTime: 20s449ms
+ - TotalNetworkWaitTime: 191.630ms
+ - TotalStorageWaitTime: 0ns
+ CodeGen:(Active: 150.679ms, % non-child: 77.19%)
+ - CodegenTime: 0ns
+ - CompileTime: 139.503ms
+ - LoadTime: 10.7ms
+ - ModuleFileSize: 95.27 KB
+ EXCHANGE_NODE (id=4):(Active: 194.858ms, % non-child: 99.83%)
+ - BytesReceived: 2.33 MB
+ - ConvertRowBatchTime: 2.732ms
+ - DataArrivalWaitTime: 191.118ms
+ - DeserializeRowBatchTimer: 14.943ms
+ - FirstBatchArrivalWaitTime: 191.117ms
+ - PeakMemoryUsage: 7.41 MB
+ - RowsReturned: 98.30K (98304)
+ - RowsReturnedRate: 504.49 K/sec
+ - SendersBlockedTimer: 0ns
+ - SendersBlockedTotalTimer(*): 0ns
+ Averaged Fragment 1:(Active: 442.360ms, % non-child: 0.00%)
+ split sizes: min: 33.00 B, max: 33.00 B, avg: 33.00 B, stddev: 0.00
+ completion times: min:443.720ms max:443.720ms mean: 443.720ms stddev:0ns
+ execution rates: min:74.00 B/sec max:74.00 B/sec mean:74.00 B/sec stddev:0.00 /sec
+ num instances: 1
+ - AverageThreadTokens: 1.00
+ - PeakMemoryUsage: 6.06 MB
+ - PrepareTime: 7.291ms
+ - RowsProduced: 98.30K (98304)
+ - TotalCpuTime: 784.259ms
+ - TotalNetworkWaitTime: 388.818ms
+ - TotalStorageWaitTime: 3.934ms
+ CodeGen:(Active: 312.862ms, % non-child: 70.73%)
+ - CodegenTime: 2.669ms
+ - CompileTime: 302.467ms
+ - LoadTime: 9.231ms
+ - ModuleFileSize: 95.27 KB
+ DataStreamSender (dst_id=4):(Active: 80.63ms, % non-child: 18.10%)
+ - BytesSent: 2.33 MB
+ - NetworkThroughput(*): 35.89 MB/sec
+ - OverallThroughput: 29.06 MB/sec
+ - PeakMemoryUsage: 5.33 KB
+ - SerializeBatchTime: 26.487ms
+ - ThriftTransmitTime(*): 64.814ms
+ - UncompressedRowBatchSize: 6.66 MB
+ HASH_JOIN_NODE (id=2):(Active: 362.25ms, % non-child: 3.92%)
+ - BuildBuckets: 1.02K (1024)
+ - BuildRows: 98.30K (98304)
+ - BuildTime: 12.622ms
+ - LoadFactor: 0.00
+ - PeakMemoryUsage: 6.02 MB
+ - ProbeRows: 3
+ - ProbeTime: 3.579ms
+ - RowsReturned: 98.30K (98304)
+ - RowsReturnedRate: 271.54 K/sec
+ EXCHANGE_NODE (id=3):(Active: 344.680ms, % non-child: 77.92%)
+ - BytesReceived: 1.15 MB
+ - ConvertRowBatchTime: 2.792ms
+ - DataArrivalWaitTime: 339.936ms
+ - DeserializeRowBatchTimer: 9.910ms
+ - FirstBatchArrivalWaitTime: 199.474ms
+ - PeakMemoryUsage: 156.00 KB
+ - RowsReturned: 98.30K (98304)
+ - RowsReturnedRate: 285.20 K/sec
+ - SendersBlockedTimer: 0ns
+ - SendersBlockedTotalTimer(*): 0ns
+ HDFS_SCAN_NODE (id=0):(Active: 13.616us, % non-child: 0.00%)
+ - AverageHdfsReadThreadConcurrency: 0.00
+ - AverageScannerThreadConcurrency: 0.00
+ - BytesRead: 33.00 B
+ - BytesReadLocal: 33.00 B
+ - BytesReadShortCircuit: 33.00 B
+ - NumDisksAccessed: 1
+ - NumScannerThreadsStarted: 1
+ - PeakMemoryUsage: 46.00 KB
+ - PerReadThreadRawHdfsThroughput: 287.52 KB/sec
+ - RowsRead: 3
+ - RowsReturned: 3
+ - RowsReturnedRate: 220.33 K/sec
+ - ScanRangesComplete: 1
+ - ScannerThreadsInvoluntaryContextSwitches: 26
+ - ScannerThreadsTotalWallClockTime: 55.199ms
+ - DelimiterParseTime: 2.463us
+ - MaterializeTupleTime(*): 1.226us
+ - ScannerThreadsSysTime: 0ns
+ - ScannerThreadsUserTime: 42.993ms
+ - ScannerThreadsVoluntaryContextSwitches: 1
+ - TotalRawHdfsReadTime(*): 112.86us
+ - TotalReadThroughput: 0.00 /sec
+ Averaged Fragment 2:(Active: 190.120ms, % non-child: 0.00%)
+ split sizes: min: 960.00 KB, max: 960.00 KB, avg: 960.00 KB, stddev: 0.00
+ completion times: min:191.736ms max:191.736ms mean: 191.736ms stddev:0ns
+ execution rates: min:4.89 MB/sec max:4.89 MB/sec mean:4.89 MB/sec stddev:0.00 /sec
+ num instances: 1
+ - AverageThreadTokens: 0.00
+ - PeakMemoryUsage: 906.33 KB
+ - PrepareTime: 3.67ms
+ - RowsProduced: 98.30K (98304)
+ - TotalCpuTime: 403.351ms
+ - TotalNetworkWaitTime: 34.999ms
+ - TotalStorageWaitTime: 108.675ms
+ CodeGen:(Active: 162.57ms, % non-child: 85.24%)
+ - CodegenTime: 3.133ms
+ - CompileTime: 148.316ms
+ - LoadTime: 12.317ms
+ - ModuleFileSize: 95.27 KB
+ DataStreamSender (dst_id=3):(Active: 70.620ms, % non-child: 37.14%)
+ - BytesSent: 1.15 MB
+ - NetworkThroughput(*): 23.30 MB/sec
+ - OverallThroughput: 16.23 MB/sec
+ - PeakMemoryUsage: 5.33 KB
+ - SerializeBatchTime: 22.69ms
+ - ThriftTransmitTime(*): 49.178ms
+ - UncompressedRowBatchSize: 3.28 MB
+ HDFS_SCAN_NODE (id=1):(Active: 118.839ms, % non-child: 62.51%)
+ - AverageHdfsReadThreadConcurrency: 0.00
+ - AverageScannerThreadConcurrency: 0.00
+ - BytesRead: 960.00 KB
+ - BytesReadLocal: 960.00 KB
+ - BytesReadShortCircuit: 960.00 KB
+ - NumDisksAccessed: 1
+ - NumScannerThreadsStarted: 1
+ - PeakMemoryUsage: 869.00 KB
+ - PerReadThreadRawHdfsThroughput: 130.21 MB/sec
+ - RowsRead: 98.30K (98304)
+ - RowsReturned: 98.30K (98304)
+ - RowsReturnedRate: 827.20 K/sec
+ - ScanRangesComplete: 15
+ - ScannerThreadsInvoluntaryContextSwitches: 34
+ - ScannerThreadsTotalWallClockTime: 189.774ms
+ - DelimiterParseTime: 15.703ms
+ - MaterializeTupleTime(*): 3.419ms
+ - ScannerThreadsSysTime: 1.999ms
+ - ScannerThreadsUserTime: 44.993ms
+ - ScannerThreadsVoluntaryContextSwitches: 118
+ - TotalRawHdfsReadTime(*): 7.199ms
+ - TotalReadThroughput: 0.00 /sec
+ Fragment 1:
+ Instance 6540a03d4bee0691:4963d6269b210ebf (host=impala-1.example.com:22000):(Active: 442.360ms, % non-child: 0.00%)
+ Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1/33.00 B
+ MemoryUsage(500.0ms): 69.33 KB
+ ThreadUsage(500.0ms): 1
+ - AverageThreadTokens: 1.00
+ - PeakMemoryUsage: 6.06 MB
+ - PrepareTime: 7.291ms
+ - RowsProduced: 98.30K (98304)
+ - TotalCpuTime: 784.259ms
+ - TotalNetworkWaitTime: 388.818ms
+ - TotalStorageWaitTime: 3.934ms
+ CodeGen:(Active: 312.862ms, % non-child: 70.73%)
+ - CodegenTime: 2.669ms
+ - CompileTime: 302.467ms
+ - LoadTime: 9.231ms
+ - ModuleFileSize: 95.27 KB
+ DataStreamSender (dst_id=4):(Active: 80.63ms, % non-child: 18.10%)
+ - BytesSent: 2.33 MB
+ - NetworkThroughput(*): 35.89 MB/sec
+ - OverallThroughput: 29.06 MB/sec
+ - PeakMemoryUsage: 5.33 KB
+ - SerializeBatchTime: 26.487ms
+ - ThriftTransmitTime(*): 64.814ms
+ - UncompressedRowBatchSize: 6.66 MB
+ HASH_JOIN_NODE (id=2):(Active: 362.25ms, % non-child: 3.92%)
+ ExecOption: Build Side Codegen Enabled, Probe Side Codegen Enabled, Hash Table Built Asynchronously
+ - BuildBuckets: 1.02K (1024)
+ - BuildRows: 98.30K (98304)
+ - BuildTime: 12.622ms
+ - LoadFactor: 0.00
+ - PeakMemoryUsage: 6.02 MB
+ - ProbeRows: 3
+ - ProbeTime: 3.579ms
+ - RowsReturned: 98.30K (98304)
+ - RowsReturnedRate: 271.54 K/sec
+ EXCHANGE_NODE (id=3):(Active: 344.680ms, % non-child: 77.92%)
+ - BytesReceived: 1.15 MB
+ - ConvertRowBatchTime: 2.792ms
+ - DataArrivalWaitTime: 339.936ms
+ - DeserializeRowBatchTimer: 9.910ms
+ - FirstBatchArrivalWaitTime: 199.474ms
+ - PeakMemoryUsage: 156.00 KB
+ - RowsReturned: 98.30K (98304)
+ - RowsReturnedRate: 285.20 K/sec
+ - SendersBlockedTimer: 0ns
+ - SendersBlockedTotalTimer(*): 0ns
+ HDFS_SCAN_NODE (id=0):(Active: 13.616us, % non-child: 0.00%)
+ Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1/33.00 B
+ Hdfs Read Thread Concurrency Bucket: 0:0% 1:0%
+ File Formats: TEXT/NONE:1
+ ExecOption: Codegen enabled: 1 out of 1
+ - AverageHdfsReadThreadConcurrency: 0.00
+ - AverageScannerThreadConcurrency: 0.00
+ - BytesRead: 33.00 B
+ - BytesReadLocal: 33.00 B
+ - BytesReadShortCircuit: 33.00 B
+ - NumDisksAccessed: 1
+ - NumScannerThreadsStarted: 1
+ - PeakMemoryUsage: 46.00 KB
+ - PerReadThreadRawHdfsThroughput: 287.52 KB/sec
+ - RowsRead: 3
+ - RowsReturned: 3
+ - RowsReturnedRate: 220.33 K/sec
+ - ScanRangesComplete: 1
+ - ScannerThreadsInvoluntaryContextSwitches: 26
+ - ScannerThreadsTotalWallClockTime: 55.199ms
+ - DelimiterParseTime: 2.463us
+ - MaterializeTupleTime(*): 1.226us
+ - ScannerThreadsSysTime: 0ns
+ - ScannerThreadsUserTime: 42.993ms
+ - ScannerThreadsVoluntaryContextSwitches: 1
+ - TotalRawHdfsReadTime(*): 112.86us
+ - TotalReadThroughput: 0.00 /sec
+ Fragment 2:
+ Instance 6540a03d4bee0691:4963d6269b210ec0 (host=impala-1.example.com:22000):(Active: 190.120ms, % non-child: 0.00%)
+ Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:15/960.00 KB
+ - AverageThreadTokens: 0.00
+ - PeakMemoryUsage: 906.33 KB
+ - PrepareTime: 3.67ms
+ - RowsProduced: 98.30K (98304)
+ - TotalCpuTime: 403.351ms
+ - TotalNetworkWaitTime: 34.999ms
+ - TotalStorageWaitTime: 108.675ms
+ CodeGen:(Active: 162.57ms, % non-child: 85.24%)
+ - CodegenTime: 3.133ms
+ - CompileTime: 148.316ms
+ - LoadTime: 12.317ms
+ - ModuleFileSize: 95.27 KB
+ DataStreamSender (dst_id=3):(Active: 70.620ms, % non-child: 37.14%)
+ - BytesSent: 1.15 MB
+ - NetworkThroughput(*): 23.30 MB/sec
+ - OverallThroughput: 16.23 MB/sec
+ - PeakMemoryUsage: 5.33 KB
+ - SerializeBatchTime: 22.69ms
+ - ThriftTransmitTime(*): 49.178ms
+ - UncompressedRowBatchSize: 3.28 MB
+ HDFS_SCAN_NODE (id=1):(Active: 118.839ms, % non-child: 62.51%)
+ Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:15/960.00 KB
+ Hdfs Read Thread Concurrency Bucket: 0:0% 1:0%
+ File Formats: TEXT/NONE:15
+ ExecOption: Codegen enabled: 15 out of 15
+ - AverageHdfsReadThreadConcurrency: 0.00
+ - AverageScannerThreadConcurrency: 0.00
+ - BytesRead: 960.00 KB
+ - BytesReadLocal: 960.00 KB
+ - BytesReadShortCircuit: 960.00 KB
+ - NumDisksAccessed: 1
+ - NumScannerThreadsStarted: 1
+ - PeakMemoryUsage: 869.00 KB
+ - PerReadThreadRawHdfsThroughput: 130.21 MB/sec
+ - RowsRead: 98.30K (98304)
+ - RowsReturned: 98.30K (98304)
+ - RowsReturnedRate: 827.20 K/sec
+ - ScanRangesComplete: 15
+ - ScannerThreadsInvoluntaryContextSwitches: 34
+ - ScannerThreadsTotalWallClockTime: 189.774ms
+ - DelimiterParseTime: 15.703ms
+ - MaterializeTupleTime(*): 3.419ms
+ - ScannerThreadsSysTime: 1.999ms
+ - ScannerThreadsUserTime: 44.993ms
+ - ScannerThreadsVoluntaryContextSwitches: 118
+ - TotalRawHdfsReadTime(*): 7.199ms
+ - TotalReadThroughput: 0.00 /sec</code></pre>
+ </div>
+ </article>
+</article></main></body></html>
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_faq.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_faq.html b/docs/build/html/topics/impala_faq.html
new file mode 100644
index 0000000..b85bb8a
--- /dev/null
+++ b/docs/build/html/topics/impala_faq.html
@@ -0,0 +1,21 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="faq"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Impala Frequently Asked Questions</title></head><body id="faq"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+ <h1 class="title topictitle1" id="ariaid-title1">Impala Frequently Asked Questions</h1>
+
+
+ <div class="body conbody">
+
+ <p class="p">
+ This section lists frequently asked questions for Apache Impala (incubating),
+ the interactive SQL engine for Hadoop.
+ </p>
+
+ <p class="p">
+ This section is under construction.
+ </p>
+
+ </div>
+
+</article></main></body></html>
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_file_formats.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_file_formats.html b/docs/build/html/topics/impala_file_formats.html
new file mode 100644
index 0000000..d9ccbca
--- /dev/null
+++ b/docs/build/html/topics/impala_file_formats.html
@@ -0,0 +1,236 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_txtfile.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_parquet.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_avro.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_rcfile.html"><meta name="DC.Relation" scheme="URI" content="../topics/impala_seqfile.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="file_formats"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>How Impala Works with Hadoop File Formats</title></head><body id="file_formats"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+ <h1 class="title topictitle1" id="ariaid-title1">How Impala Works with Hadoop File Formats</h1>
+
+
+
+ <div class="body conbody">
+
+ <p class="p">
+
+
+ Impala supports several familiar file formats used in Apache Hadoop. Impala can load and query data files
+ produced by other Hadoop components such as Pig or MapReduce, and data files produced by Impala can be used
+ by other components also. The following sections discuss the procedures, limitations, and performance
+ considerations for using each file format with Impala.
+ </p>
+
+ <p class="p">
+ The file format used for an Impala table has significant performance consequences. Some file formats include
+ compression support that affects the size of data on disk and, consequently, the amount of I/O and CPU
+ resources required to deserialize the data. Because a query often begins by moving and decompressing data,
+ the I/O and CPU resources required can be a limiting factor in query performance. Compressing the data means
+ fewer total bytes are transferred from disk to memory, which reduces transfer time, at the cost of the CPU
+ cycles needed to decompress the content.
+ </p>
+
+ <p class="p">
+ Impala can query files encoded with most of the popular file formats and compression codecs used in Hadoop.
+ Impala can create and insert data into tables that use some file formats but not others; for file formats
+ that Impala cannot write to, create the table in Hive, issue the <code class="ph codeph">INVALIDATE METADATA <var class="keyword varname">table_name</var></code>
+ statement in <code class="ph codeph">impala-shell</code>, and query the table through Impala. File formats can be
+ structured, in which case they may include metadata and built-in compression. Supported formats include:
+ </p>
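+    <p class="p">
+      For a file format that Impala cannot write to, the workflow described above might look like the
+      following sketch (the table name <code class="ph codeph">t1</code> is illustrative):
+    </p>
+
+<pre class="pre codeblock"><code>-- In Hive: create the table and insert the data.
+CREATE TABLE t1 (c1 STRING) STORED AS RCFILE;
+INSERT INTO TABLE t1 SELECT c1 FROM some_source_table;
+
+-- In impala-shell: make the new table visible to Impala, then query it.
+INVALIDATE METADATA t1;
+SELECT COUNT(*) FROM t1;</code></pre>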
+
+ <table class="table"><caption><span class="table--title-label">Table 1. </span><span class="title">File Format Support in Impala</span></caption><colgroup><col style="width:10%"><col style="width:10%"><col style="width:20%"><col style="width:30%"><col style="width:30%"></colgroup><thead class="thead">
+ <tr class="row">
+ <th class="entry nocellnorowborder" id="file_formats__entry__1">
+ File Type
+ </th>
+ <th class="entry nocellnorowborder" id="file_formats__entry__2">
+ Format
+ </th>
+ <th class="entry nocellnorowborder" id="file_formats__entry__3">
+ Compression Codecs
+ </th>
+ <th class="entry nocellnorowborder" id="file_formats__entry__4">
+ Impala Can CREATE?
+ </th>
+ <th class="entry nocellnorowborder" id="file_formats__entry__5">
+ Impala Can INSERT?
+ </th>
+ </tr>
+ </thead><tbody class="tbody">
+ <tr class="row" id="file_formats__parquet_support">
+ <td class="entry nocellnorowborder" headers="file_formats__entry__1 ">
+ <a class="xref" href="impala_parquet.html#parquet">Parquet</a>
+ </td>
+ <td class="entry nocellnorowborder" headers="file_formats__entry__2 ">
+ Structured
+ </td>
+ <td class="entry nocellnorowborder" headers="file_formats__entry__3 ">
+ Snappy, gzip; currently Snappy by default
+ </td>
+ <td class="entry nocellnorowborder" headers="file_formats__entry__4 ">
+ Yes.
+ </td>
+ <td class="entry nocellnorowborder" headers="file_formats__entry__5 ">
+ Yes: <code class="ph codeph">CREATE TABLE</code>, <code class="ph codeph">INSERT</code>, <code class="ph codeph">LOAD DATA</code>, and query.
+ </td>
+ </tr>
+ <tr class="row" id="file_formats__txtfile_support">
+ <td class="entry nocellnorowborder" headers="file_formats__entry__1 ">
+ <a class="xref" href="impala_txtfile.html#txtfile">Text</a>
+ </td>
+ <td class="entry nocellnorowborder" headers="file_formats__entry__2 ">
+ Unstructured
+ </td>
+ <td class="entry nocellnorowborder" headers="file_formats__entry__3 ">
+ LZO, gzip, bzip2, Snappy
+ </td>
+ <td class="entry nocellnorowborder" headers="file_formats__entry__4 ">
+ Yes. For <code class="ph codeph">CREATE TABLE</code> with no <code class="ph codeph">STORED AS</code> clause, the default file
+ format is uncompressed text, with values separated by ASCII <code class="ph codeph">0x01</code> characters
+ (typically represented as Ctrl-A).
+ </td>
+ <td class="entry nocellnorowborder" headers="file_formats__entry__5 ">
+ Yes: <code class="ph codeph">CREATE TABLE</code>, <code class="ph codeph">INSERT</code>, <code class="ph codeph">LOAD DATA</code>, and query.
+ If LZO compression is used, you must create the table and load data in Hive. If other kinds of
+ compression are used, you must load data through <code class="ph codeph">LOAD DATA</code>, Hive, or manually in
+ HDFS.
+
+
+ </td>
+ </tr>
+ <tr class="row" id="file_formats__avro_support">
+ <td class="entry nocellnorowborder" headers="file_formats__entry__1 ">
+ <a class="xref" href="impala_avro.html#avro">Avro</a>
+ </td>
+ <td class="entry nocellnorowborder" headers="file_formats__entry__2 ">
+ Structured
+ </td>
+ <td class="entry nocellnorowborder" headers="file_formats__entry__3 ">
+ Snappy, gzip, deflate, bzip2
+ </td>
+ <td class="entry nocellnorowborder" headers="file_formats__entry__4 ">
+ Yes, in Impala 1.4.0 and higher. Before that, create the table using Hive.
+ </td>
+ <td class="entry nocellnorowborder" headers="file_formats__entry__5 ">
+ No. Import data by using <code class="ph codeph">LOAD DATA</code> on data files already in the right format, or use
+ <code class="ph codeph">INSERT</code> in Hive followed by <code class="ph codeph">REFRESH <var class="keyword varname">table_name</var></code> in Impala.
+ </td>
+
+ </tr>
+ <tr class="row" id="file_formats__rcfile_support">
+ <td class="entry nocellnorowborder" headers="file_formats__entry__1 ">
+ <a class="xref" href="impala_rcfile.html#rcfile">RCFile</a>
+ </td>
+ <td class="entry nocellnorowborder" headers="file_formats__entry__2 ">
+ Structured
+ </td>
+ <td class="entry nocellnorowborder" headers="file_formats__entry__3 ">
+ Snappy, gzip, deflate, bzip2
+ </td>
+ <td class="entry nocellnorowborder" headers="file_formats__entry__4 ">
+ Yes.
+ </td>
+ <td class="entry nocellnorowborder" headers="file_formats__entry__5 ">
+ No. Import data by using <code class="ph codeph">LOAD DATA</code> on data files already in the right format, or use
+ <code class="ph codeph">INSERT</code> in Hive followed by <code class="ph codeph">REFRESH <var class="keyword varname">table_name</var></code> in Impala.
+ </td>
+
+ </tr>
+ <tr class="row" id="file_formats__sequencefile_support">
+ <td class="entry nocellnorowborder" headers="file_formats__entry__1 ">
+ <a class="xref" href="impala_seqfile.html#seqfile">SequenceFile</a>
+ </td>
+ <td class="entry nocellnorowborder" headers="file_formats__entry__2 ">
+ Structured
+ </td>
+ <td class="entry nocellnorowborder" headers="file_formats__entry__3 ">
+ Snappy, gzip, deflate, bzip2
+ </td>
+ <td class="entry nocellnorowborder" headers="file_formats__entry__4 ">Yes.</td>
+ <td class="entry nocellnorowborder" headers="file_formats__entry__5 ">
+ No. Import data by using <code class="ph codeph">LOAD DATA</code> on data files already in the right format, or use
+ <code class="ph codeph">INSERT</code> in Hive followed by <code class="ph codeph">REFRESH <var class="keyword varname">table_name</var></code> in Impala.
+ </td>
+
+ </tr>
+ </tbody></table>
+
+ <p class="p">
+ Impala can only query the file formats listed in the preceding table.
+ In particular, Impala does not support the ORC file format.
+ </p>
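+
+  <p class="p">
+    As the preceding table notes, for formats that Impala can query but not write, you can attach
+    existing data files with <code class="ph codeph">LOAD DATA</code>, or insert through Hive and then
+    refresh the table in Impala. A sketch of both approaches (the table name and HDFS path are
+    illustrative):
+  </p>
+
+<pre class="pre codeblock"><code>-- In impala-shell: attach existing Avro data files to the table.
+LOAD DATA INPATH '/user/impala/staging/avro_files' INTO TABLE avro_table;
+
+-- Or, after running INSERT statements in Hive:
+REFRESH avro_table;</code></pre>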
+
+ <p class="p">
+ Impala supports the following compression codecs:
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ Snappy. Recommended for its effective balance between compression ratio and decompression speed. Snappy
+ compression is very fast, but gzip provides greater space savings. Supported for text files in Impala 2.0
+ and higher.
+
+ </li>
+
+ <li class="li">
+ Gzip. Recommended when achieving the highest level of compression (and therefore greatest disk-space
+ savings) is desired. Supported for text files in Impala 2.0 and higher.
+ </li>
+
+ <li class="li">
+ Deflate. Not supported for text files.
+ </li>
+
+ <li class="li">
+ Bzip2. Supported for text files in Impala 2.0 and higher.
+
+ </li>
+
+ <li class="li">
+ <p class="p"> LZO, for text files only. Impala can query
+ LZO-compressed text tables, but currently cannot create them or insert
+ data into them; perform these operations in Hive. </p>
+ </li>
+ </ul>
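+
+    <p class="p">
+      For Parquet tables, you can choose among these codecs at insert time through the
+      <code class="ph codeph">COMPRESSION_CODEC</code> query option in
+      <code class="ph codeph">impala-shell</code>. A brief sketch (the table names are illustrative):
+    </p>
+
+<pre class="pre codeblock"><code>-- Write gzip-compressed Parquet data instead of the default Snappy.
+SET COMPRESSION_CODEC=gzip;
+INSERT INTO parquet_table SELECT * FROM text_table;</code></pre>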
+ </div>
+
+ <nav role="navigation" class="related-links"><ul class="ullinks"><li class="link ulchildlink"><strong><a href="../topics/impala_txtfile.html">Using Text Data Files with Impala Tables</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_parquet.html">Using the Parquet File Format with Impala Tables</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_avro.html">Using the Avro File Format with Impala Tables</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_rcfile.html">Using the RCFile File Format with Impala Tables</a></strong><br></li><li class="link ulchildlink"><strong><a href="../topics/impala_seqfile.html">Using the SequenceFile File Format with Impala Tables</a></strong><br></li></ul></nav><article class="topic concept nested1" aria-labelledby="ariaid-title2" id="file_formats__file_format_choosing">
+
+ <h2 class="title topictitle2" id="ariaid-title2">Choosing the File Format for a Table</h2>
+
+
+ <div class="body conbody">
+
+ <p class="p">
+ Different file formats and compression codecs work better for different data sets. While Impala typically
+ provides performance gains regardless of file format, choosing the proper format for your data can yield
+ further performance improvements. Use the following considerations to decide which combination of file
+ format and compression to use for a particular table:
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ If you are working with existing files that are already in a supported file format, use the same format
+ for the Impala table where practical. If the original format does not yield acceptable query performance
+ or resource usage, consider creating a new Impala table with different file format or compression
+ characteristics, and doing a one-time conversion by copying the data to the new table using the
+ <code class="ph codeph">INSERT</code> statement. Depending on the file format, you might run the
+ <code class="ph codeph">INSERT</code> statement in <code class="ph codeph">impala-shell</code> or in Hive.
+ </li>
+
+ <li class="li">
+ Text files are convenient to produce through many different tools, and are human-readable for ease of
+ verification and debugging. Those characteristics are why text is the default format for an Impala
+ <code class="ph codeph">CREATE TABLE</code> statement. When performance and resource usage are the primary
+ considerations, use one of the other file formats and consider using compression. A typical workflow
+ might involve bringing data into an Impala table by copying CSV or TSV files into the appropriate data
+ directory, and then using the <code class="ph codeph">INSERT ... SELECT</code> syntax to copy the data into a table
+ using a different, more compact file format.
+ </li>
+
+        <li class="li">
+          If your architecture involves storing data to be queried in memory, do not compress the data. There are
+          no I/O savings, since the data does not need to be moved from disk, but there is a CPU cost to
+          decompress the data.
+        </li>
+ </ul>
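+
+      <p class="p">
+        The one-time conversion workflow described above can be sketched as follows (the table and
+        column names are illustrative):
+      </p>
+
+<pre class="pre codeblock"><code>-- Text table whose data directory holds the original comma-separated files.
+CREATE TABLE csv_staging (id INT, s STRING)
+  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
+
+-- One-time conversion into a compact, query-optimized copy of the same data.
+CREATE TABLE converted STORED AS PARQUET AS SELECT * FROM csv_staging;</code></pre>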
+ </div>
+ </article>
+</article></main></body></html>
\ No newline at end of file