You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@impala.apache.org by mi...@apache.org on 2018/05/09 21:10:39 UTC
[30/51] [partial] impala git commit: [DOCS] Impala doc site update
for 3.0
http://git-wip-us.apache.org/repos/asf/impala/blob/fae51ec2/docs/build3x/html/topics/impala_hbase.html
----------------------------------------------------------------------
diff --git a/docs/build3x/html/topics/impala_hbase.html b/docs/build3x/html/topics/impala_hbase.html
new file mode 100644
index 0000000..ef339ea
--- /dev/null
+++ b/docs/build3x/html/topics/impala_hbase.html
@@ -0,0 +1,772 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2018"><meta name="DC.rights.owner" content="(C) Copyright 2018"><meta name="DC.Type" content="concept"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.
0.x"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="impala_hbase"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Using Impala to Query HBase Tables</title></head><body id="impala_hbase"><main role="main"><article role="article" aria-labelledby="impala_hbase__hbase">
+
+ <h1 class="title topictitle1" id="impala_hbase__hbase">Using Impala to Query HBase Tables</h1>
+
+
+
+ <div class="body conbody">
+
+ <p class="p">
+
+ You can use Impala to query HBase tables. This capability allows convenient access to a storage system that
+ is tuned for different kinds of workloads than the default with Impala. The default Impala tables use data
+ files stored on HDFS, which are ideal for bulk loads and queries using full-table scans. In contrast, HBase
+ can do efficient queries for data organized for OLTP-style workloads, with lookups of individual rows or
+ ranges of values.
+ </p>
+
+ <p class="p">
+ From the perspective of an Impala user, coming from an RDBMS background, HBase is a kind of key-value store
+ where the value consists of multiple fields. The key is mapped to one column in the Impala table, and the
+ various fields of the value are mapped to the other columns in the Impala table.
+ </p>
+
+ <p class="p">
+ For background information on HBase, see <a class="xref" href="https://hbase.apache.org/book.html" target="_blank">the Apache HBase documentation</a>.
+ </p>
+
+ <p class="p toc inpage"></p>
+ </div>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title2" id="impala_hbase__hbase_using">
+
+ <h2 class="title topictitle2" id="ariaid-title2">Overview of Using HBase with Impala</h2>
+
+
+ <div class="body conbody">
+
+ <p class="p">
+ When you use Impala with HBase:
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ You create the tables on the Impala side using the Hive shell, because the Impala <code class="ph codeph">CREATE
+ TABLE</code> statement currently does not support custom SerDes and some other syntax needed for these
+ tables:
+ <ul class="ul">
+ <li class="li">
+ You designate it as an HBase table using the <code class="ph codeph">STORED BY
+ 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'</code> clause on the Hive <code class="ph codeph">CREATE
+ TABLE</code> statement.
+ </li>
+
+ <li class="li">
+ You map these specially created tables to corresponding tables that exist in HBase, with the clause
+ <code class="ph codeph">TBLPROPERTIES("hbase.table.name" = "<var class="keyword varname">table_name_in_hbase</var>")</code> on the
+ Hive <code class="ph codeph">CREATE TABLE</code> statement.
+ </li>
+
+ <li class="li">
+ See <a class="xref" href="#hbase_queries">Examples of Querying HBase Tables from Impala</a> for a full example.
+ </li>
+ </ul>
+ </li>
+
+ <li class="li">
+ You define the column corresponding to the HBase row key as a string with the <code class="ph codeph">#string</code>
+ keyword, or map it to a <code class="ph codeph">STRING</code> column.
+ </li>
+
+ <li class="li">
+ Because Impala and Hive share the same metastore database, once you create the table in Hive, you can
+ query or insert into it through Impala. (After creating a new table through Hive, issue the
+ <code class="ph codeph">INVALIDATE METADATA</code> statement in <span class="keyword cmdname">impala-shell</span> to make Impala aware of
+ the new table.)
+ </li>
+
+ <li class="li">
+ You issue queries against the Impala tables. For efficient queries, use <code class="ph codeph">WHERE</code> clauses to
+ find a single key value or a range of key values wherever practical, by testing the Impala column
+ corresponding to the HBase row key. Avoid queries that do full-table scans, which are efficient for
+ regular Impala tables but inefficient in HBase.
+ </li>
+ </ul>
+
+ <p class="p">
+ To work with an HBase table from Impala, ensure that the <code class="ph codeph">impala</code> user has read/write
+ privileges for the HBase table, using the <code class="ph codeph">GRANT</code> command in the HBase shell. For details
+ about HBase security, see <a class="xref" href="https://hbase.apache.org/book.html#security" target="_blank">the Security chapter in the Apache HBase documentation</a>.
+ </p>
+ </div>
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title3" id="impala_hbase__hbase_config">
+
+ <h2 class="title topictitle2" id="ariaid-title3">Configuring HBase for Use with Impala</h2>
+
+
+ <div class="body conbody">
+
+ <p class="p">
+ HBase works out of the box with Impala. There is no mandatory configuration needed to use these two
+ components together.
+ </p>
+
+ <p class="p">
+ To avoid delays if HBase is unavailable during Impala startup or after an <code class="ph codeph">INVALIDATE
+ METADATA</code> statement, set timeout values similar to the following in
+ <span class="ph filepath">/etc/impala/conf/hbase-site.xml</span>:
+ </p>
+
+<pre class="pre codeblock"><code><property>
+ <name>hbase.client.retries.number</name>
+ <value>3</value>
+</property>
+<property>
+ <name>hbase.rpc.timeout</name>
+ <value>3000</value>
+</property>
+</code></pre>
+
+ </div>
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title4" id="impala_hbase__hbase_types">
+
+ <h2 class="title topictitle2" id="ariaid-title4">Supported Data Types for HBase Columns</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ To understand how Impala column data types are mapped to fields in HBase, you should have some background
+ knowledge about HBase first. You set up the mapping by running the <code class="ph codeph">CREATE TABLE</code> statement
+ in the Hive shell. See
+ <a class="xref" href="https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration" target="_blank">the
+ Hive wiki</a> for a starting point, and <a class="xref" href="#hbase_queries">Examples of Querying HBase Tables from Impala</a> for examples.
+ </p>
+
+ <p class="p">
+ HBase works as a kind of <span class="q">"bit bucket"</span>, in the sense that HBase does not enforce any typing for the
+ key or value fields. All the type enforcement is done on the Impala side.
+ </p>
+
+ <p class="p">
+ For best performance of Impala queries against HBase tables, most queries will perform comparisons in the
+ <code class="ph codeph">WHERE</code> against the column that corresponds to the HBase row key. When creating the table
+ through the Hive shell, use the <code class="ph codeph">STRING</code> data type for the column that corresponds to the
+ HBase row key. Impala can translate conditional tests (through operators such as <code class="ph codeph">=</code>,
+ <code class="ph codeph"><</code>, <code class="ph codeph">BETWEEN</code>, and <code class="ph codeph">IN</code>) against this column into fast
+ lookups in HBase, but this optimization (<span class="q">"predicate pushdown"</span>) only works when that column is
+ defined as <code class="ph codeph">STRING</code>.
+ </p>
+
+ <p class="p">
+ Starting in Impala 1.1, Impala also supports reading and writing to columns that are defined in the Hive
+ <code class="ph codeph">CREATE TABLE</code> statement using binary data types, represented in the Hive table definition
+ using the <code class="ph codeph">#binary</code> keyword, often abbreviated as <code class="ph codeph">#b</code>. Defining numeric
+ columns as binary can reduce the overall data volume in the HBase tables. You should still define the
+ column that corresponds to the HBase row key as a <code class="ph codeph">STRING</code>, to allow fast lookups using
+ those columns.
+ </p>
+ </div>
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title5" id="impala_hbase__hbase_performance">
+
+ <h2 class="title topictitle2" id="ariaid-title5">Performance Considerations for the Impala-HBase Integration</h2>
+
+
+ <div class="body conbody">
+
+ <p class="p">
+ To understand the performance characteristics of SQL queries against data stored in HBase, you should have
+ some background knowledge about how HBase interacts with SQL-oriented systems first. See
+ <a class="xref" href="https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration" target="_blank">the
+ Hive wiki</a> for a starting point; because Impala shares the same metastore database as Hive, the
+ information about mapping columns from Hive tables to HBase tables is generally applicable to Impala too.
+ </p>
+
+ <p class="p">
+ Impala uses the HBase client API via Java Native Interface (JNI) to query data stored in HBase. This
+ querying does not read HFiles directly. The extra communication overhead makes it important to choose what
+ data to store in HBase or in HDFS, and construct efficient queries that can retrieve the HBase data
+ efficiently:
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ Use HBase table for queries that return a single row or a range of rows, not queries that scan the entire
+ table. (If a query has no <code class="ph codeph">WHERE</code> clause, that is a strong indicator that it is an
+ inefficient query for an HBase table.)
+ </li>
+
+ <li class="li">
+ If you have join queries that do aggregation operations on large fact tables and join the results against
+ small dimension tables, consider using Impala for the fact tables and HBase for the dimension tables.
+ (Because Impala does a full scan on the HBase table in this case, rather than doing single-row HBase
+ lookups based on the join column, only use this technique where the HBase table is small enough that
+ doing a full table scan does not cause a performance bottleneck for the query.)
+ </li>
+ </ul>
+
+ <p class="p">
+ Query predicates are applied to row keys as start and stop keys, thereby limiting the scope of a particular
+ lookup. If row keys are not mapped to string columns, then ordering is typically incorrect and comparison
+ operations do not work. For example, if row keys are not mapped to string columns, evaluating for greater
+ than (>) or less than (<) cannot be completed.
+ </p>
+
+ <p class="p">
+ Predicates on non-key columns can be sent to HBase to scan as <code class="ph codeph">SingleColumnValueFilters</code>,
+ providing some performance gains. In such a case, HBase returns fewer rows than if those same predicates
+ were applied using Impala. While there is some improvement, it is not as great when start and stop rows are
+ used. This is because the number of rows that HBase must examine is not limited as it is when start and
+ stop rows are used. As long as the row key predicate only applies to a single row, HBase will locate and
+ return that row. Conversely, if a non-key predicate is used, even if it only applies to a single row, HBase
+ must still scan the entire table to find the correct result.
+ </p>
+
+ <div class="example"><h3 class="title sectiontitle">Interpreting EXPLAIN Output for HBase Queries</h3>
+
+
+
+ <p class="p">
+ For example, here are some queries against the following Impala table, which is mapped to an HBase table.
+ The examples show excerpts from the output of the <code class="ph codeph">EXPLAIN</code> statement, demonstrating what
+ things to look for to indicate an efficient or inefficient query against an HBase table.
+ </p>
+
+ <p class="p">
+ The first column (<code class="ph codeph">cust_id</code>) was specified as the key column in the <code class="ph codeph">CREATE
+ EXTERNAL TABLE</code> statement; for performance, it is important to declare this column as
+ <code class="ph codeph">STRING</code>. Other columns, such as <code class="ph codeph">BIRTH_YEAR</code> and
+ <code class="ph codeph">NEVER_LOGGED_ON</code>, are also declared as <code class="ph codeph">STRING</code>, rather than their
+ <span class="q">"natural"</span> types of <code class="ph codeph">INT</code> or <code class="ph codeph">BOOLEAN</code>, because Impala can optimize
+ those types more effectively in HBase tables. For comparison, we leave one column,
+ <code class="ph codeph">YEAR_REGISTERED</code>, as <code class="ph codeph">INT</code> to show that filtering on this column is
+ inefficient.
+ </p>
+
+<pre class="pre codeblock"><code>describe hbase_table;
+Query: describe hbase_table
++-----------------------+--------+---------+
+| name | type | comment |
++-----------------------+--------+---------+
+| cust_id | <strong class="ph b">string</strong> | |
+| birth_year | <strong class="ph b">string</strong> | |
+| never_logged_on | <strong class="ph b">string</strong> | |
+| private_email_address | string | |
+| year_registered | <strong class="ph b">int</strong> | |
++-----------------------+--------+---------+
+</code></pre>
+
+ <p class="p">
+ The best case for performance involves a single row lookup using an equality comparison on the column
+ defined as the row key:
+ </p>
+
+<pre class="pre codeblock"><code>explain select count(*) from hbase_table where cust_id = 'some_user@example.com';
++------------------------------------------------------------------------------------+
+| Explain String |
++------------------------------------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=1.01GB VCores=1 |
+| WARNING: The following tables are missing relevant table and/or column statistics. |
+| hbase.hbase_table |
+| |
+| 03:AGGREGATE [MERGE FINALIZE] |
+| | output: sum(count(*)) |
+| | |
+| 02:EXCHANGE [PARTITION=UNPARTITIONED] |
+| | |
+| 01:AGGREGATE |
+| | output: count(*) |
+| | |
+<strong class="ph b">| 00:SCAN HBASE [hbase.hbase_table] |</strong>
+<strong class="ph b">| start key: some_user@example.com |</strong>
+<strong class="ph b">| stop key: some_user@example.com\0 |</strong>
++------------------------------------------------------------------------------------+
+</code></pre>
+
+ <p class="p">
+ Another type of efficient query involves a range lookup on the row key column, using SQL operators such
+ as greater than (or equal), less than (or equal), or <code class="ph codeph">BETWEEN</code>. This example also includes
+ an equality test on a non-key column; because that column is a <code class="ph codeph">STRING</code>, Impala can let
+ HBase perform that test, indicated by the <code class="ph codeph">hbase filters:</code> line in the
+ <code class="ph codeph">EXPLAIN</code> output. Doing the filtering within HBase is more efficient than transmitting all
+ the data to Impala and doing the filtering on the Impala side.
+ </p>
+
+<pre class="pre codeblock"><code>explain select count(*) from hbase_table where cust_id between 'a' and 'b'
+ and never_logged_on = 'true';
++------------------------------------------------------------------------------------+
+| Explain String |
++------------------------------------------------------------------------------------+
+...
+
+| 01:AGGREGATE |
+| | output: count(*) |
+| | |
+<strong class="ph b">| 00:SCAN HBASE [hbase.hbase_table] |</strong>
+<strong class="ph b">| start key: a |</strong>
+<strong class="ph b">| stop key: b\0 |</strong>
+<strong class="ph b">| hbase filters: cols:never_logged_on EQUAL 'true' |</strong>
++------------------------------------------------------------------------------------+
+</code></pre>
+
+ <p class="p">
+ The query is less efficient if Impala has to evaluate any of the predicates, because Impala must scan the
+ entire HBase table. Impala can only push down predicates to HBase for columns declared as
+ <code class="ph codeph">STRING</code>. This example tests a column declared as <code class="ph codeph">INT</code>, and the
+ <code class="ph codeph">predicates:</code> line in the <code class="ph codeph">EXPLAIN</code> output indicates that the test is
+ performed after the data is transmitted to Impala.
+ </p>
+
+<pre class="pre codeblock"><code>explain select count(*) from hbase_table where year_registered = 2010;
++------------------------------------------------------------------------------------+
+| Explain String |
++------------------------------------------------------------------------------------+
+...
+
+| 01:AGGREGATE |
+| | output: count(*) |
+| | |
+<strong class="ph b">| 00:SCAN HBASE [hbase.hbase_table] |</strong>
+<strong class="ph b">| predicates: year_registered = 2010 |</strong>
++------------------------------------------------------------------------------------+
+</code></pre>
+
+ <p class="p">
+ The same inefficiency applies if the key column is compared to any non-constant value. Here, even though
+ the key column is a <code class="ph codeph">STRING</code>, and is tested using an equality operator, Impala must scan
+ the entire HBase table because the key column is compared to another column value rather than a constant.
+ </p>
+
+<pre class="pre codeblock"><code>explain select count(*) from hbase_table where cust_id = private_email_address;
++------------------------------------------------------------------------------------+
+| Explain String |
++------------------------------------------------------------------------------------+
+...
+
+| 01:AGGREGATE |
+| | output: count(*) |
+| | |
+<strong class="ph b">| 00:SCAN HBASE [hbase.hbase_table] |</strong>
+<strong class="ph b">| predicates: cust_id = private_email_address |</strong>
++------------------------------------------------------------------------------------+
+</code></pre>
+
+ <p class="p">
+ Currently, tests on the row key using <code class="ph codeph">OR</code> or <code class="ph codeph">IN</code> clauses are not
+ optimized into direct lookups either. Such limitations might be lifted in the future, so always check the
+ <code class="ph codeph">EXPLAIN</code> output to be sure whether a particular SQL construct results in an efficient
+ query or not for HBase tables.
+ </p>
+
+<pre class="pre codeblock"><code>explain select count(*) from hbase_table where
+ cust_id = 'some_user@example.com' or cust_id = 'other_user@example.com';
++----------------------------------------------------------------------------------------+
+| Explain String |
++----------------------------------------------------------------------------------------+
+...
+
+| 01:AGGREGATE |
+| | output: count(*) |
+| | |
+<strong class="ph b">| 00:SCAN HBASE [hbase.hbase_table] |</strong>
+<strong class="ph b">| predicates: cust_id = 'some_user@example.com' OR cust_id = 'other_user@example.com' |</strong>
++----------------------------------------------------------------------------------------+
+
+explain select count(*) from hbase_table where
+ cust_id in ('some_user@example.com', 'other_user@example.com');
++------------------------------------------------------------------------------------+
+| Explain String |
++------------------------------------------------------------------------------------+
+...
+
+| 01:AGGREGATE |
+| | output: count(*) |
+| | |
+<strong class="ph b">| 00:SCAN HBASE [hbase.hbase_table] |</strong>
+<strong class="ph b">| predicates: cust_id IN ('some_user@example.com', 'other_user@example.com') |</strong>
++------------------------------------------------------------------------------------+
+</code></pre>
+
+ <p class="p">
+ Either rewrite into separate queries for each value and combine the results in the application, or
+ combine the single-row queries using UNION ALL:
+ </p>
+
+<pre class="pre codeblock"><code>select count(*) from hbase_table where cust_id = 'some_user@example.com';
+select count(*) from hbase_table where cust_id = 'other_user@example.com';
+
+explain
+ select count(*) from hbase_table where cust_id = 'some_user@example.com'
+ union all
+ select count(*) from hbase_table where cust_id = 'other_user@example.com';
++------------------------------------------------------------------------------------+
+| Explain String |
++------------------------------------------------------------------------------------+
+...
+
+| | 04:AGGREGATE |
+| | | output: count(*) |
+| | | |
+<strong class="ph b">| | 03:SCAN HBASE [hbase.hbase_table] |</strong>
+<strong class="ph b">| | start key: other_user@example.com |</strong>
+<strong class="ph b">| | stop key: other_user@example.com\0 |</strong>
+| | |
+| 10:MERGE |
+...
+
+| 02:AGGREGATE |
+| | output: count(*) |
+| | |
+<strong class="ph b">| 01:SCAN HBASE [hbase.hbase_table] |</strong>
+<strong class="ph b">| start key: some_user@example.com |</strong>
+<strong class="ph b">| stop key: some_user@example.com\0 |</strong>
++------------------------------------------------------------------------------------+
+</code></pre>
+
+ </div>
+
+ <div class="example"><h3 class="title sectiontitle">Configuration Options for Java HBase Applications</h3>
+
+
+
+ <p class="p"> If you have an HBase Java application that calls the
+ <code class="ph codeph">setCacheBlocks</code> or <code class="ph codeph">setCaching</code>
+ methods of the class <a class="xref" href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html" target="_blank">org.apache.hadoop.hbase.client.Scan</a>, you can set these same
+ caching behaviors through Impala query options, to control the memory
+ pressure on the HBase RegionServer. For example, when doing queries in
+ HBase that result in full-table scans (which by default are
+ inefficient for HBase), you can reduce memory usage and speed up the
+ queries by turning off the <code class="ph codeph">HBASE_CACHE_BLOCKS</code> setting
+ and specifying a large number for the <code class="ph codeph">HBASE_CACHING</code>
+ setting.
+ </p>
+
+ <p class="p">
+ To set these options, issue commands like the following in <span class="keyword cmdname">impala-shell</span>:
+ </p>
+
+<pre class="pre codeblock"><code>-- Same as calling setCacheBlocks(true) or setCacheBlocks(false).
+set hbase_cache_blocks=true;
+set hbase_cache_blocks=false;
+
+-- Same as calling setCaching(rows).
+set hbase_caching=1000;
+</code></pre>
+
+ <p class="p">
+ Or update the <span class="keyword cmdname">impalad</span> defaults file <span class="ph filepath">/etc/default/impala</span> and
+ include settings for <code class="ph codeph">HBASE_CACHE_BLOCKS</code> and/or <code class="ph codeph">HBASE_CACHING</code> in the
+ <code class="ph codeph">-default_query_options</code> setting for <code class="ph codeph">IMPALA_SERVER_ARGS</code>. See
+ <a class="xref" href="impala_config_options.html#config_options">Modifying Impala Startup Options</a> for details.
+ </p>
+
+ <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+ In Impala 2.0 and later, these options are settable through the JDBC or ODBC interfaces using the
+ <code class="ph codeph">SET</code> statement.
+ </div>
+
+ </div>
+ </div>
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title6" id="impala_hbase__hbase_scenarios">
+
+ <h2 class="title topictitle2" id="ariaid-title6">Use Cases for Querying HBase through Impala</h2>
+
+
+ <div class="body conbody">
+
+ <p class="p">
+ The following are popular use cases for using Impala to query HBase tables:
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ Keeping large fact tables in Impala, and smaller dimension tables in HBase. The fact tables use Parquet
+ or other binary file format optimized for scan operations. Join queries scan through the large Impala
+ fact tables, and cross-reference the dimension tables using efficient single-row lookups in HBase.
+ </li>
+
+ <li class="li">
+ Using HBase to store rapidly incrementing counters, such as how many times a web page has been viewed, or
+ on a social network, how many connections a user has or how many votes a post received. HBase is
+ efficient for capturing such changeable data: the append-only storage mechanism is efficient for writing
+ each change to disk, and a query always returns the latest value. An application could query specific
+ totals like these from HBase, and combine the results with a broader set of data queried from Impala.
+ </li>
+
+ <li class="li">
+ <p class="p">
+ Storing very wide tables in HBase. Wide tables have many columns, possibly thousands, typically
+ recording many attributes for an important subject such as a user of an online service. These tables
+ are also often sparse, that is, most of the columns values are <code class="ph codeph">NULL</code>, 0,
+ <code class="ph codeph">false</code>, empty string, or other blank or placeholder value. (For example, any particular
+ web site user might have never used some site feature, filled in a certain field in their profile,
+ visited a particular part of the site, and so on.) A typical query against this kind of table is to
+ look up a single row to retrieve all the information about a specific subject, rather than summing,
+ averaging, or filtering millions of rows as in typical Impala-managed tables.
+ </p>
+ <p class="p">
+ Or the HBase table could be joined with a larger Impala-managed table. For example, analyze the large
+ Impala table representing web traffic for a site and pick out 50 users who view the most pages. Join
+ that result with the wide user table in HBase to look up attributes of those users. The HBase side of
+ the join would result in 50 efficient single-row lookups in HBase, rather than scanning the entire user
+ table.
+ </p>
+ </li>
+ </ul>
+ </div>
+ </article>
+
+
+
+
+
+
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title7" id="impala_hbase__hbase_loading">
+
+ <h2 class="title topictitle2" id="ariaid-title7">Loading Data into an HBase Table</h2>
+
+
+ <div class="body conbody">
+
+ <p class="p">
+ The Impala <code class="ph codeph">INSERT</code> statement works for HBase tables. The <code class="ph codeph">INSERT ... VALUES</code>
+ syntax is ideally suited to HBase tables, because inserting a single row is an efficient operation for an
+ HBase table. (For regular Impala tables, with data files in HDFS, the tiny data files produced by
+ <code class="ph codeph">INSERT ... VALUES</code> are extremely inefficient, so you would not use that technique with
+ tables containing any significant data volume.)
+ </p>
+
+
+
+ <p class="p">
+ When you use the <code class="ph codeph">INSERT ... SELECT</code> syntax, the result in the HBase table could be fewer
+ rows than you expect. HBase only stores the most recent version of each unique row key, so if an
+ <code class="ph codeph">INSERT ... SELECT</code> statement copies over multiple rows containing the same value for the
+ key column, subsequent queries will only return one row with each key column value:
+ </p>
+
+ <p class="p">
+ Although Impala does not have an <code class="ph codeph">UPDATE</code> statement, you can achieve the same effect by
+ doing successive <code class="ph codeph">INSERT</code> statements using the same value for the key column each time:
+ </p>
+
+ </div>
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title8" id="impala_hbase__hbase_limitations">
+
+ <h2 class="title topictitle2" id="ariaid-title8">Limitations and Restrictions of the Impala and HBase Integration</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ The Impala integration with HBase has the following limitations and restrictions, some inherited from the
+ integration between HBase and Hive, and some unique to Impala:
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ <p class="p">
+ If you issue a <code class="ph codeph">DROP TABLE</code> for an internal (Impala-managed) table that is mapped to an
+ HBase table, the underlying table is not removed in HBase. The Hive <code class="ph codeph">DROP TABLE</code>
+ statement also removes the HBase table in this case.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ The <code class="ph codeph">INSERT OVERWRITE</code> statement is not available for HBase tables. You can insert new
+ data, or modify an existing row by inserting a new row with the same key value, but not replace the
+ entire contents of the table. You can do an <code class="ph codeph">INSERT OVERWRITE</code> in Hive if you need this
+ capability.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ If you issue a <code class="ph codeph">CREATE TABLE LIKE</code> statement for a table mapped to an HBase table, the
+ new table is also an HBase table, but inherits the same underlying HBase table name as the original.
+ The new table is effectively an alias for the old one, not a new table with identical column structure.
+ Avoid using <code class="ph codeph">CREATE TABLE LIKE</code> for HBase tables, to avoid any confusion.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ Copying data into an HBase table using the Impala <code class="ph codeph">INSERT ... SELECT</code> syntax might
+ produce fewer new rows than are in the query result set. If the result set contains multiple rows with
+ the same value for the key column, each row supercedes any previous rows with the same key value.
+ Because the order of the inserted rows is unpredictable, you cannot rely on this technique to preserve
+ the <span class="q">"latest"</span> version of a particular key value.
+ </p>
+ </li>
+ <li class="li">
+ <p class="p">
+ Because the complex data types (<code class="ph codeph">ARRAY</code>, <code class="ph codeph">STRUCT</code>, and <code class="ph codeph">MAP</code>)
+ available in <span class="keyword">Impala 2.3</span> and higher are currently only supported in Parquet tables, you cannot
+ use these types in HBase tables that are queried through Impala.
+ </p>
+ </li>
+ <li class="li">
+ <p class="p">
+ The <code class="ph codeph">LOAD DATA</code> statement cannot be used with HBase tables.
+ </p>
+ </li>
+ <li class="li">
+ <p class="p">
+ The <code class="ph codeph">TABLESAMPLE</code> clause of the <code class="ph codeph">SELECT</code>
+ statement does not apply to a table reference derived from a view, a subquery,
+ or anything other than a real base table. This clause only works for tables
+ backed by HDFS or HDFS-like data files, therefore it does not apply to Kudu or
+ HBase tables.
+ </p>
+ </li>
+ </ul>
+ </div>
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title9" id="impala_hbase__hbase_queries">
+
+ <h2 class="title topictitle2" id="ariaid-title9">Examples of Querying HBase Tables from Impala</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ The following examples create an HBase table with four column families,
+ create a corresponding table through Hive,
+ then insert and query the table through Impala.
+ </p>
+ <p class="p">
+ In HBase shell, the table
+ name is quoted in <code class="ph codeph">CREATE</code> and <code class="ph codeph">DROP</code> statements. Tables created in HBase
+ begin in <span class="q">"enabled"</span> state; before dropping them through the HBase shell, you must issue a
+ <code class="ph codeph">disable '<var class="keyword varname">table_name</var>'</code> statement.
+ </p>
+
+<pre class="pre codeblock"><code>$ hbase shell
+15/02/10 16:07:45
+HBase Shell; enter 'help<RETURN>' for list of supported commands.
+Type "exit<RETURN>" to leave the HBase Shell
+...
+
+hbase(main):001:0> create 'hbasealltypessmall', 'boolsCF', 'intsCF', 'floatsCF', 'stringsCF'
+0 row(s) in 4.6520 seconds
+
+=> Hbase::Table - hbasealltypessmall
+hbase(main):006:0> quit
+</code></pre>
+
+ <p class="p">
+ Issue the following <code class="ph codeph">CREATE TABLE</code> statement in the Hive shell. (The Impala <code class="ph codeph">CREATE
+ TABLE</code> statement currently does not support the <code class="ph codeph">STORED BY</code> clause, so you switch into Hive to
+ create the table, then back to Impala and the <span class="keyword cmdname">impala-shell</span> interpreter to issue the
+ queries.)
+ </p>
+
+ <p class="p">
+ This example creates an external table mapped to the HBase table, usable by both Impala and Hive. It is
+ defined as an external table so that when dropped by Impala or Hive, the original HBase table is not touched at all.
+ </p>
+
+ <p class="p">
+ The <code class="ph codeph">WITH SERDEPROPERTIES</code> clause
+ specifies that the first column (<code class="ph codeph">ID</code>) represents the row key, and maps the remaining
+ columns of the SQL table to HBase column families. The mapping relies on the ordinal order of the
+ columns in the table, not the column names in the <code class="ph codeph">CREATE TABLE</code> statement.
+ The first column is defined to be the lookup key; the
+ <code class="ph codeph">STRING</code> data type produces the fastest key-based lookups for HBase tables.
+ </p>
+
+ <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+ For Impala with HBase tables, the most important aspect to ensure good performance is to use a
+ <code class="ph codeph">STRING</code> column as the row key, as shown in this example.
+ </div>
+
+<pre class="pre codeblock"><code>$ hive
+...
+hive> use hbase;
+OK
+Time taken: 4.095 seconds
+hive> CREATE EXTERNAL TABLE hbasestringids (
+ > id string,
+ > bool_col boolean,
+ > tinyint_col tinyint,
+ > smallint_col smallint,
+ > int_col int,
+ > bigint_col bigint,
+ > float_col float,
+ > double_col double,
+ > date_string_col string,
+ > string_col string,
+ > timestamp_col timestamp)
+ > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
+ > WITH SERDEPROPERTIES (
+ > "hbase.columns.mapping" =
+ > ":key,boolsCF:bool_col,intsCF:tinyint_col,intsCF:smallint_col,intsCF:int_col,intsCF:\
+ > bigint_col,floatsCF:float_col,floatsCF:double_col,stringsCF:date_string_col,\
+ > stringsCF:string_col,stringsCF:timestamp_col"
+ > )
+ > TBLPROPERTIES("hbase.table.name" = "hbasealltypessmall");
+OK
+Time taken: 2.879 seconds
+hive> quit;
+</code></pre>
+
+ <p class="p">
+ Once you have established the mapping to an HBase table, you can issue DML statements and queries
+ from Impala. The following example shows a series of <code class="ph codeph">INSERT</code>
+ statements followed by a query.
+ The ideal kind of query from a performance standpoint
+ retrieves a row from the table based on a row key
+ mapped to a string column.
+ An initial <code class="ph codeph">INVALIDATE METADATA <var class="keyword varname">table_name</var></code>
+ statement makes the table created through Hive visible to Impala.
+ </p>
+
+<pre class="pre codeblock"><code>$ impala-shell -i localhost -d hbase
+Starting Impala Shell without Kerberos authentication
+Connected to localhost:21000
+...
+Query: use `hbase`
+[localhost:21000] > invalidate metadata hbasestringids;
+Fetched 0 row(s) in 0.09s
+[localhost:21000] > desc hbasestringids;
++-----------------+-----------+---------+
+| name | type | comment |
++-----------------+-----------+---------+
+| id | string | |
+| bool_col | boolean | |
+| double_col | double | |
+| float_col | float | |
+| bigint_col | bigint | |
+| int_col | int | |
+| smallint_col | smallint | |
+| tinyint_col | tinyint | |
+| date_string_col | string | |
+| string_col | string | |
+| timestamp_col | timestamp | |
++-----------------+-----------+---------+
+Fetched 11 row(s) in 0.02s
+[localhost:21000] > insert into hbasestringids values ('0001',true,3.141,9.94,1234567,32768,4000,76,'2014-12-31','Hello world',now());
+Inserted 1 row(s) in 0.26s
+[localhost:21000] > insert into hbasestringids values ('0002',false,2.004,6.196,1500,8000,129,127,'2014-01-01','Foo bar',now());
+Inserted 1 row(s) in 0.12s
+[localhost:21000] > select * from hbasestringids where id = '0001';
++------+----------+------------+-------------------+------------+---------+--------------+-------------+-----------------+-------------+-------------------------------+
+| id | bool_col | double_col | float_col | bigint_col | int_col | smallint_col | tinyint_col | date_string_col | string_col | timestamp_col |
++------+----------+------------+-------------------+------------+---------+--------------+-------------+-----------------+-------------+-------------------------------+
+| 0001 | true | 3.141 | 9.939999580383301 | 1234567 | 32768 | 4000 | 76 | 2014-12-31 | Hello world | 2015-02-10 16:36:59.764838000 |
++------+----------+------------+-------------------+------------+---------+--------------+-------------+-----------------+-------------+-------------------------------+
+Fetched 1 row(s) in 0.54s
+</code></pre>
+
+ <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+ After you create a table in Hive, such as the HBase mapping table in this example, issue an
+ <code class="ph codeph">INVALIDATE METADATA <var class="keyword varname">table_name</var></code> statement the next time you connect to
+ Impala, make Impala aware of the new table. (Prior to Impala 1.2.4, you could not specify the table name if
+ Impala was not aware of the table yet; in Impala 1.2.4 and higher, specifying the table name avoids
+ reloading the metadata for other tables that are not changed.)
+ </div>
+ </div>
+ </article>
+</article></main></body></html>
http://git-wip-us.apache.org/repos/asf/impala/blob/fae51ec2/docs/build3x/html/topics/impala_hbase_cache_blocks.html
----------------------------------------------------------------------
diff --git a/docs/build3x/html/topics/impala_hbase_cache_blocks.html b/docs/build3x/html/topics/impala_hbase_cache_blocks.html
new file mode 100644
index 0000000..27ebee3
--- /dev/null
+++ b/docs/build3x/html/topics/impala_hbase_cache_blocks.html
@@ -0,0 +1,36 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2018"><meta name="DC.rights.owner" content="(C) Copyright 2018"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="hbase_cache_blocks"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>HBASE_CACHE_BLOCKS Query Option</title></head><body id="hbase_cache_blocks"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+ <h1 class="title topictitle1" id="ariaid-title1">HBASE_CACHE_BLOCKS Query Option</h1>
+
+
+
+ <div class="body conbody">
+
+ <p class="p">
+
+ Setting this option is equivalent to calling the
+ <code class="ph codeph">setCacheBlocks</code> method of the class <a class="xref" href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html" target="_blank">org.apache.hadoop.hbase.client.Scan</a>, in an HBase Java
+ application. Helps to control the memory pressure on the HBase
+ RegionServer, in conjunction with the <code class="ph codeph">HBASE_CACHING</code> query
+ option. </p>
+
+ <p class="p">
+ <strong class="ph b">Type:</strong> Boolean; recognized values are 1 and 0, or <code class="ph codeph">true</code> and <code class="ph codeph">false</code>;
+ any other value interpreted as <code class="ph codeph">false</code>
+ </p>
+ <p class="p">
+ <strong class="ph b">Default:</strong> <code class="ph codeph">false</code> (shown as 0 in output of <code class="ph codeph">SET</code> statement)
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Related information:</strong>
+ </p>
+ <p class="p">
+ <a class="xref" href="impala_hbase.html#impala_hbase">Using Impala to Query HBase Tables</a>,
+ <a class="xref" href="impala_hbase_caching.html#hbase_caching">HBASE_CACHING Query Option</a>
+ </p>
+
+ </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html>
http://git-wip-us.apache.org/repos/asf/impala/blob/fae51ec2/docs/build3x/html/topics/impala_hbase_caching.html
----------------------------------------------------------------------
diff --git a/docs/build3x/html/topics/impala_hbase_caching.html b/docs/build3x/html/topics/impala_hbase_caching.html
new file mode 100644
index 0000000..e2082d4
--- /dev/null
+++ b/docs/build3x/html/topics/impala_hbase_caching.html
@@ -0,0 +1,36 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2018"><meta name="DC.rights.owner" content="(C) Copyright 2018"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="hbase_caching"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>HBASE_CACHING Query Option</title></head><body id="hbase_caching"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+ <h1 class="title topictitle1" id="ariaid-title1">HBASE_CACHING Query Option</h1>
+
+
+
+ <div class="body conbody">
+
+ <p class="p">
+
+ Setting this option is equivalent to calling the
+ <code class="ph codeph">setCaching</code> method of the class <a class="xref" href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html" target="_blank">org.apache.hadoop.hbase.client.Scan</a>, in an HBase Java
+ application. Helps to control the memory pressure on the HBase
+ RegionServer, in conjunction with the <code class="ph codeph">HBASE_CACHE_BLOCKS</code>
+ query option. </p>
+
+ <p class="p">
+ <strong class="ph b">Type:</strong> <code class="ph codeph">BOOLEAN</code>
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Default:</strong> 0
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Related information:</strong>
+ </p>
+ <p class="p">
+ <a class="xref" href="impala_hbase.html#impala_hbase">Using Impala to Query HBase Tables</a>,
+ <a class="xref" href="impala_hbase_cache_blocks.html#hbase_cache_blocks">HBASE_CACHE_BLOCKS Query Option</a>
+ </p>
+
+ </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html>
http://git-wip-us.apache.org/repos/asf/impala/blob/fae51ec2/docs/build3x/html/topics/impala_hints.html
----------------------------------------------------------------------
diff --git a/docs/build3x/html/topics/impala_hints.html b/docs/build3x/html/topics/impala_hints.html
new file mode 100644
index 0000000..7777fa2
--- /dev/null
+++ b/docs/build3x/html/topics/impala_hints.html
@@ -0,0 +1,488 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2018"><meta name="DC.rights.owner" content="(C) Copyright 2018"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_langref_sql.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="hints"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Optimizer Hints</title></head><body id="hints"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+ <h1 class="title topictitle1" id="ariaid-title1">Optimizer Hints</h1>
+
+
+
+ <div class="body conbody">
+
+ <p class="p">
+ The Impala SQL supports
+ query hints, for fine-tuning the inner workings of queries. Specify hints
+ as a temporary workaround for expensive queries, where missing statistics
+ or other factors cause inefficient performance. </p>
+
+ <p class="p"> Hints are most often used for the resource-intensive Impala queries,
+ such as: </p>
+
+ <ul class="ul">
+ <li class="li">
+ Join queries involving large tables, where intermediate result sets are transmitted across the network to
+ evaluate the join conditions.
+ </li>
+
+ <li class="li">
+ Inserting into partitioned Parquet tables, where many memory buffers could be allocated on each host to
+ hold intermediate results for each partition.
+ </li>
+ </ul>
+
+ <p class="p">
+ <strong class="ph b">Syntax:</strong>
+ </p>
+
+ <p class="p"> In <span class="keyword">Impala 2.0</span> and higher, you can
+ specify the hints inside comments that use either the <code class="ph codeph">/*
+ */</code> or <code class="ph codeph">--</code> notation. Specify a
+ <code class="ph codeph">+</code> symbol immediately before the hint name. Recently
+ added hints are only available using the <code class="ph codeph">/* */</code> and
+ <code class="ph codeph">--</code> notation. For clarity, the <code class="ph codeph">/* */</code>
+ and <code class="ph codeph">--</code> styles are used in the syntax and examples
+ throughout this section. With the <code class="ph codeph">/* */</code> or
+ <code class="ph codeph">--</code> notation for hints, specify a <code class="ph codeph">+</code>
+ symbol immediately before the first hint name. Multiple hints can be
+ specified separated by commas, for example <code class="ph codeph">/* +clustered,shuffle
+ */</code>
+ </p>
+
+<pre class="pre codeblock"><code>SELECT STRAIGHT_JOIN <var class="keyword varname">select_list</var> FROM
+<var class="keyword varname">join_left_hand_table</var>
+ JOIN /* +BROADCAST|SHUFFLE */
+<var class="keyword varname">join_right_hand_table</var>
+<var class="keyword varname">remainder_of_query</var>;
+
+SELECT <var class="keyword varname">select_list</var> FROM
+<var class="keyword varname">join_left_hand_table</var>
+ JOIN -- +BROADCAST|SHUFFLE
+<var class="keyword varname">join_right_hand_table</var>
+<var class="keyword varname">remainder_of_query</var>;
+
+INSERT <var class="keyword varname">insert_clauses</var>
+ /* +SHUFFLE|NOSHUFFLE */
+ SELECT <var class="keyword varname">remainder_of_query</var>;
+
+INSERT <var class="keyword varname">insert_clauses</var>
+ -- +SHUFFLE|NOSHUFFLE
+ SELECT <var class="keyword varname">remainder_of_query</var>;
+
+<span class="ph">
+INSERT /* +SHUFFLE|NOSHUFFLE */
+ <var class="keyword varname">insert_clauses</var>
+ SELECT <var class="keyword varname">remainder_of_query</var>;</span>
+
+<span class="ph">
+INSERT -- +SHUFFLE|NOSHUFFLE
+ <var class="keyword varname">insert_clauses</var>
+ SELECT <var class="keyword varname">remainder_of_query</var>;</span>
+
+<span class="ph">
+UPSERT /* +SHUFFLE|NOSHUFFLE */
+ <var class="keyword varname">upsert_clauses</var>
+ SELECT <var class="keyword varname">remainder_of_query</var>;</span>
+
+<span class="ph">
+UPSERT -- +SHUFFLE|NOSHUFFLE
+ <var class="keyword varname">upsert_clauses</var>
+ SELECT <var class="keyword varname">remainder_of_query</var>;</span>
+
+<span class="ph">SELECT <var class="keyword varname">select_list</var> FROM
+<var class="keyword varname">table_ref</var>
+ /* +{SCHEDULE_CACHE_LOCAL | SCHEDULE_DISK_LOCAL | SCHEDULE_REMOTE}
+ [,RANDOM_REPLICA] */
+<var class="keyword varname">remainder_of_query</var>;</span>
+
+<span class="ph">INSERT <var class="keyword varname">insert_clauses</var>
+ -- +CLUSTERED
+ SELECT <var class="keyword varname">remainder_of_query</var>;
+
+INSERT <var class="keyword varname">insert_clauses</var>
+ /* +CLUSTERED */
+ SELECT <var class="keyword varname">remainder_of_query</var>;</span>
+
+<span class="ph">INSERT -- +CLUSTERED
+ <var class="keyword varname">insert_clauses</var>
+ SELECT <var class="keyword varname">remainder_of_query</var>;
+
+INSERT /* +CLUSTERED */
+ <var class="keyword varname">insert_clauses</var>
+ SELECT <var class="keyword varname">remainder_of_query</var>;
+
+UPSERT -- +CLUSTERED
+ <var class="keyword varname">upsert_clauses</var>
+ SELECT <var class="keyword varname">remainder_of_query</var>;
+
+UPSERT /* +CLUSTERED */
+ <var class="keyword varname">upsert_clauses</var>
+ SELECT <var class="keyword varname">remainder_of_query</var>;</span>
+
+CREATE /* +SHUFFLE|NOSHUFFLE */
+ <var class="keyword varname">table_clauses</var>
+ AS SELECT <var class="keyword varname">remainder_of_query</var>;
+
+CREATE -- +SHUFFLE|NOSHUFFLE
+ <var class="keyword varname">table_clauses</var>
+ AS SELECT <var class="keyword varname">remainder_of_query</var>;
+
+CREATE /* +CLUSTER|NOCLUSTER */
+ <var class="keyword varname">table_clauses</var>
+ AS SELECT <var class="keyword varname">remainder_of_query</var>;
+
+CREATE -- +CLUSTER|NOCLUSTER
+ <var class="keyword varname">table_clauses</var>
+ AS SELECT <var class="keyword varname">remainder_of_query</var>;
+</code></pre>
+ <p class="p">The square bracket style hints are supported for backward compatibility,
+ but the syntax is deprecated and will be removed in a future release. For
+ that reason, any newly added hints are not available with the square
+ bracket syntax.</p>
+ <pre class="pre codeblock"><code>SELECT STRAIGHT_JOIN <var class="keyword varname">select_list</var> FROM
+<var class="keyword varname">join_left_hand_table</var>
+ JOIN [{ /* +BROADCAST */ | /* +SHUFFLE */ }]
+<var class="keyword varname">join_right_hand_table</var>
+<var class="keyword varname">remainder_of_query</var>;
+
+INSERT <var class="keyword varname">insert_clauses</var>
+ [{ /* +SHUFFLE */ | /* +NOSHUFFLE */ }]
+ [<span class="ph">/* +CLUSTERED */</span>]
+ SELECT <var class="keyword varname">remainder_of_query</var>;
+
+<span class="ph">
+UPSERT [{ /* +SHUFFLE */ | /* +NOSHUFFLE */ }]
+ [<span class="ph">/* +CLUSTERED */</span>]
+ <var class="keyword varname">upsert_clauses</var>
+ SELECT <var class="keyword varname">remainder_of_query</var>;</span>
+</code></pre>
+
+ <p class="p">
+ <strong class="ph b">Usage notes:</strong>
+ </p>
+
+ <p class="p">
+ With both forms of hint syntax, include the <code class="ph codeph">STRAIGHT_JOIN</code>
+ keyword immediately after the <code class="ph codeph">SELECT</code> and any
+ <code class="ph codeph">DISTINCT</code> or <code class="ph codeph">ALL</code> keywords to prevent Impala from
+ reordering the tables in a way that makes the join-related hints ineffective.
+ </p>
+
+ <p class="p">
+ The <code class="ph codeph">STRAIGHT_JOIN</code> hint affects the join order of table references in the query
+ block containing the hint. It does not affect the join order of nested queries, such as views,
+ inline views, or <code class="ph codeph">WHERE</code>-clause subqueries. To use this hint for performance
+ tuning of complex queries, apply the hint to all query blocks that need a fixed join order.
+ </p>
+
+ <p class="p">
+ To reduce the need to use hints, run the <code class="ph codeph">COMPUTE STATS</code> statement against all tables involved
+ in joins, or used as the source tables for <code class="ph codeph">INSERT ... SELECT</code> operations where the
+ destination is a partitioned Parquet table. Do this operation after loading data or making substantial
+ changes to the data within each table. Having up-to-date statistics helps Impala choose more efficient query
+ plans without the need for hinting. See <a class="xref" href="impala_perf_stats.html#perf_stats">Table and Column Statistics</a> for details and
+ examples.
+ </p>
+
+ <p class="p">
+ To see which join strategy is used for a particular query, examine the <code class="ph codeph">EXPLAIN</code> output for
+ that query. See <a class="xref" href="impala_explain_plan.html#perf_explain">Using the EXPLAIN Plan for Performance Tuning</a> for details and examples.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Hints for join queries:</strong>
+ </p>
+
+ <p class="p">
+ The <code class="ph codeph">/* +BROADCAST */</code> and <code class="ph codeph">/* +SHUFFLE */</code> hints control the execution strategy for join
+ queries. Specify one of the following constructs immediately after the <code class="ph codeph">JOIN</code> keyword in a
+ query:
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ <code class="ph codeph">/* +SHUFFLE */</code> - Makes that join operation use the <span class="q">"partitioned"</span> technique, which divides
+ up corresponding rows from both tables using a hashing algorithm, sending subsets of the rows to other
+ nodes for processing. (The keyword <code class="ph codeph">SHUFFLE</code> is used to indicate a <span class="q">"partitioned join"</span>,
+ because that type of join is not related to <span class="q">"partitioned tables"</span>.) Since the alternative
+ <span class="q">"broadcast"</span> join mechanism is the default when table and index statistics are unavailable, you might
+ use this hint for queries where broadcast joins are unsuitable; typically, partitioned joins are more
+ efficient for joins between large tables of similar size.
+ </li>
+
+ <li class="li">
+ <code class="ph codeph">/* +BROADCAST */</code> - Makes that join operation use the <span class="q">"broadcast"</span> technique that sends the
+ entire contents of the right-hand table to all nodes involved in processing the join. This is the default
+ mode of operation when table and index statistics are unavailable, so you would typically only need it if
+ stale metadata caused Impala to mistakenly choose a partitioned join operation. Typically, broadcast joins
+ are more efficient in cases where one table is much smaller than the other. (Put the smaller table on the
+ right side of the <code class="ph codeph">JOIN</code> operator.)
+ </li>
+ </ul>
+
+ <p class="p">
+ <strong class="ph b">Hints for INSERT ... SELECT and CREATE TABLE AS SELECT (CTAS):</strong>
+ </p>
+ <p class="p" id="hints__insert_hints">
+ When inserting into partitioned tables, such as using the Parquet file
+ format, you can include a hint in the <code class="ph codeph">INSERT</code> or <code class="ph codeph">CREATE TABLE AS SELECT(CTAS)</code>
+ statements to fine-tune the overall performance of the operation and its
+ resource usage.</p>
+ <p class="p">
+ You would only use hints if an <code class="ph codeph">INSERT</code> or
+ <code class="ph codeph">CTAS</code> into a partitioned table was failing due to
+ capacity limits, or if such an operation was succeeding but with
+ less-than-optimal performance.
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ <code class="ph codeph">/* +SHUFFLE */</code> and <code class="ph codeph">/* +NOSHUFFLE */</code> Hints
+ <ul class="ul">
+ <li class="li">
+ <code class="ph codeph">/* +SHUFFLE */</code> adds an exchange node, before
+ writing the data, which re-partitions the result of the
+ <code class="ph codeph">SELECT</code> based on the partitioning columns of the
+ target table. With this hint, only one node writes to a partition at
+ a time, minimizing the global number of simultaneous writes and the
+ number of memory buffers holding data for individual partitions.
+ This also reduces fragmentation, resulting in fewer files. Thus it
+ reduces overall resource usage of the <code class="ph codeph">INSERT</code> or
+ <code class="ph codeph">CTAS</code> operation and allows some operations to
+ succeed that otherwise would fail. It does involve some data
+ transfer between the nodes so that the data files for a particular
+ partition are all written on the same node.
+
+ <p class="p">
+ Use <code class="ph codeph">/* +SHUFFLE */</code> in cases where an <code class="ph codeph">INSERT</code>
+ or <code class="ph codeph">CTAS</code> statement fails or runs inefficiently due
+ to all nodes attempting to write data for all partitions.
+ </p>
+
+ <p class="p"> If the table is unpartitioned or every partitioning expression
+ is constant, then <code class="ph codeph">/* +SHUFFLE */</code> will cause every
+ write to happen on the coordinator node.
+ </p>
+ </li>
+
+ <li class="li">
+ <code class="ph codeph">/* +NOSHUFFLE */</code> does not add exchange node before
+ inserting to partitioned tables and disables re-partitioning. So the
+ selected execution plan might be faster overall, but might also
+ produce a larger number of small data files or exceed capacity
+ limits, causing the <code class="ph codeph">INSERT</code> or <code class="ph codeph">CTAS</code>
+ operation to fail.
+
+ <p class="p"> Impala automatically uses the <code class="ph codeph">/*
+ +SHUFFLE */</code> method if any partition key column in the
+ source table, mentioned in the <code class="ph codeph">SELECT</code> clause,
+ does not have column statistics. In this case, use the <code class="ph codeph">/*
+ +NOSHUFFLE */</code> hint if you want to override this default
+ behavior.
+ </p>
+ </li>
+
+ <li class="li">
+ If column statistics are available for all partition key columns
+ in the source table mentioned in the <code class="ph codeph">INSERT ...
+ SELECT</code> or <code class="ph codeph">CTAS</code> query, Impala chooses
+ whether to use the <code class="ph codeph">/* +SHUFFLE */</code> or <code class="ph codeph">/*
+ +NOSHUFFLE */</code> technique based on the estimated number of
+ distinct values in those columns and the number of nodes involved in
+ the operation. In this case, you might need the <code class="ph codeph">/* +SHUFFLE
+ */</code> or the <code class="ph codeph">/* +NOSHUFFLE */</code> hint to
+ override the execution plan selected by Impala.
+ </li>
+ </ul>
+ </li>
+
+ <li class="li">
+ <code class="ph codeph">/* +CLUSTERED */</code> and <code class="ph codeph">/* +NOCLUSTERED
+ */</code> Hints
+ <ul class="ul">
+ <li class="li">
+ <code class="ph codeph">/* +CLUSTERED */</code> sorts data by the partition
+ columns before inserting to ensure that only one partition is
+ written at a time per node. Use this hint to reduce the number of
+ files kept open and the number of buffers kept in memory
+ simultaneously. This technique is primarily useful for inserts into
+ Parquet tables, where the large block size requires substantial
+ memory to buffer data for multiple output files at once. This hint
+ is available in <span class="keyword">Impala 2.8</span> or higher.
+
+ <p class="p">
+ Starting in <span class="keyword">Impala 3.0</span>, <code class="ph codeph">/*
+ +CLUSTERED */</code> is the default behavior for HDFS tables.
+ </p>
+ </li>
+
+ <li class="li">
+ <code class="ph codeph">/* +NOCLUSTERED */</code> does not sort by primary key
+ before insert. This hint is available in <span class="keyword">Impala 2.8</span> or higher.
+
+ <p class="p">
+ Use this hint when inserting to Kudu tables.
+ </p>
+
+ <p class="p">
+ In the versions lower than <span class="keyword">Impala 3.0</span>,
+ <code class="ph codeph">/* +NOCLUSTERED */</code> is the default in HDFS
+ tables.
+ </p>
+ </li>
+ </ul>
+ </li>
+ </ul>
+
+ <p class="p">
+ Starting from <span class="keyword">Impala 2.9</span>, <code class="ph codeph">INSERT</code>
+ or <code class="ph codeph">UPSERT</code> operations into Kudu tables automatically have
+ an exchange and sort node added to the plan that partitions and sorts the
+ rows according to the partitioning/primary key scheme of the target table
+ (unless the number of rows to be inserted is small enough to trigger
+ single node execution). Use the<code class="ph codeph"> /* +NOCLUSTERED */</code> and
+ <code class="ph codeph">/* +NOSHUFFLE */</code> hints together to disable partitioning
+ and sorting before the rows are sent to Kudu.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Hints for scheduling of HDFS blocks:</strong>
+ </p>
+
+ <p class="p">
+ The hints <code class="ph codeph">/* +SCHEDULE_CACHE_LOCAL */</code>,
+ <code class="ph codeph">/* +SCHEDULE_DISK_LOCAL */</code>, and
+ <code class="ph codeph">/* +SCHEDULE_REMOTE */</code> have the same effect
+ as specifying the <code class="ph codeph">REPLICA_PREFERENCE</code> query
+ option with the respective option settings of <code class="ph codeph">CACHE_LOCAL</code>,
+ <code class="ph codeph">DISK_LOCAL</code>, or <code class="ph codeph">REMOTE</code>.
+ The hint <code class="ph codeph">/* +RANDOM_REPLICA */</code> is the same as
+ enabling the <code class="ph codeph">SCHEDULE_RANDOM_REPLICA</code> query option.
+ </p>
+
+ <p class="p">
+ You can use these hints in combination by separating them with commas,
+ for example, <code class="ph codeph">/* +SCHEDULE_CACHE_LOCAL,RANDOM_REPLICA */</code>.
+ See <a class="xref" href="impala_replica_preference.html">REPLICA_PREFERENCE Query Option (Impala 2.7 or higher only)</a> and
+ <a class="xref" href="impala_schedule_random_replica.html">SCHEDULE_RANDOM_REPLICA Query Option (Impala 2.5 or higher only)</a> for information about how
+ these settings influence the way Impala processes HDFS data blocks.
+ </p>
+
+ <p class="p">
+ Specifying the replica preference as a query hint always overrides the
+ query option setting. Specifying either the <code class="ph codeph">SCHEDULE_RANDOM_REPLICA</code>
+ query option or the corresponding <code class="ph codeph">RANDOM_REPLICA</code> query hint
+ enables the random tie-breaking behavior when processing data blocks
+ during the query.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Suggestions versus directives:</strong>
+ </p>
+
+ <p class="p">
+ In early Impala releases, hints were always obeyed and so acted more like directives. Once Impala gained join
+ order optimizations, sometimes join queries were automatically reordered in a way that made a hint
+ irrelevant. Therefore, the hints act more like suggestions in Impala 1.2.2 and higher.
+ </p>
+
+ <p class="p">
+ To force Impala to follow the hinted execution mechanism for a join query, include the
+ <code class="ph codeph">STRAIGHT_JOIN</code> keyword in the <code class="ph codeph">SELECT</code> statement. See
+ <a class="xref" href="impala_perf_joins.html#straight_join">Overriding Join Reordering with STRAIGHT_JOIN</a> for details. When you use this technique, Impala does not
+ reorder the joined tables at all, so you must be careful to arrange the join order to put the largest table
+ (or subquery result set) first, then the smallest, second smallest, third smallest, and so on. This ordering lets Impala do the
+ most I/O-intensive parts of the query using local reads on the DataNodes, and then reduce the size of the
+ intermediate result set as much as possible as each subsequent table or subquery result set is joined.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Restrictions:</strong>
+ </p>
+
+ <p class="p">
+ Queries that include subqueries in the <code class="ph codeph">WHERE</code> clause can be rewritten internally as join
+ queries. Currently, you cannot apply hints to the joins produced by these types of queries.
+ </p>
+
+ <p class="p">
+ Because hints can prevent queries from taking advantage of new metadata or improvements in query planning,
+ use them only when required to work around performance issues, and be prepared to remove them when they are
+ no longer required, such as after a new Impala release or bug fix.
+ </p>
+
+ <p class="p">
+ In particular, the <code class="ph codeph">/* +BROADCAST */</code> and <code class="ph codeph">/* +SHUFFLE */</code> hints are expected to be
+ needed much less frequently in Impala 1.2.2 and higher, because the join order optimization feature in
+ combination with the <code class="ph codeph">COMPUTE STATS</code> statement now automatically choose join order and join
+ mechanism without the need to rewrite the query and add hints. See
+ <a class="xref" href="impala_perf_joins.html#perf_joins">Performance Considerations for Join Queries</a> for details.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Compatibility:</strong>
+ </p>
+
+ <p class="p">
+ The hints embedded within <code class="ph codeph">--</code> comments are compatible with Hive queries. The hints embedded
+ within <code class="ph codeph">/* */</code> comments or <code class="ph codeph">[ ]</code> square brackets are not recognized by or not
+ compatible with Hive. For example, Hive raises an error for Impala hints within <code class="ph codeph">/* */</code>
+ comments because it does not recognize the Impala hint names.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Considerations for views:</strong>
+ </p>
+
+ <p class="p">
+ If you use a hint in the query that defines a view, the hint is preserved when you query the view. Impala
+ internally rewrites all hints in views to use the <code class="ph codeph">--</code> comment notation, so that Hive can
+ query such views without errors due to unrecognized hint names.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Examples:</strong>
+ </p>
+
+ <p class="p">
+ For example, this query joins a large customer table with a small lookup table of less than 100 rows. The
+ right-hand table can be broadcast efficiently to all nodes involved in the join. Thus, you would use the
+ <code class="ph codeph">/* +broadcast */</code> hint to force a broadcast join strategy:
+ </p>
+
+<pre class="pre codeblock"><code>select straight_join customer.address, state_lookup.state_name
+ from customer join <strong class="ph b">/* +broadcast */</strong> state_lookup
+ on customer.state_id = state_lookup.state_id;</code></pre>
+
+ <p class="p">
+ This query joins two large tables of unpredictable size. You might benchmark the query with both kinds of
+ hints and find that it is more efficient to transmit portions of each table to other nodes for processing.
+ Thus, you would use the <code class="ph codeph">/* +shuffle */</code> hint to force a partitioned join strategy:
+ </p>
+
+<pre class="pre codeblock"><code>select straight_join weather.wind_velocity, geospatial.altitude
+ from weather join <strong class="ph b">/* +shuffle */</strong> geospatial
+ on weather.lat = geospatial.lat and weather.long = geospatial.long;</code></pre>
+
+ <p class="p">
+ For joins involving three or more tables, the hint applies to the tables on either side of that specific
+ <code class="ph codeph">JOIN</code> keyword. The <code class="ph codeph">STRAIGHT_JOIN</code> keyword ensures that joins are processed
+ in a predictable order from left to right. For example, this query joins
+ <code class="ph codeph">t1</code> and <code class="ph codeph">t2</code> using a partitioned join, then joins that result set to
+ <code class="ph codeph">t3</code> using a broadcast join:
+ </p>
+
+<pre class="pre codeblock"><code>select straight_join t1.name, t2.id, t3.price
+ from t1 join <strong class="ph b">/* +shuffle */</strong> t2 join <strong class="ph b">/* +broadcast */</strong> t3
+ on t1.id = t2.id and t2.id = t3.id;</code></pre>
+
+
+
+ <p class="p">
+ <strong class="ph b">Related information:</strong>
+ </p>
+
+ <p class="p">
+ For more background information about join queries, see <a class="xref" href="impala_joins.html#joins">Joins in Impala SELECT Statements</a>. For
+ performance considerations, see <a class="xref" href="impala_perf_joins.html#perf_joins">Performance Considerations for Join Queries</a>.
+ </p>
+ </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_langref_sql.html">Impala SQL Statements</a></div></div></nav></article></main></body></html>
http://git-wip-us.apache.org/repos/asf/impala/blob/fae51ec2/docs/build3x/html/topics/impala_identifiers.html
----------------------------------------------------------------------
diff --git a/docs/build3x/html/topics/impala_identifiers.html b/docs/build3x/html/topics/impala_identifiers.html
new file mode 100644
index 0000000..267b91d
--- /dev/null
+++ b/docs/build3x/html/topics/impala_identifiers.html
@@ -0,0 +1,110 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2018"><meta name="DC.rights.owner" content="(C) Copyright 2018"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_schema_objects.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="identifiers"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Overview of Impala Identifiers</title></head><body id="identifiers"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+ <h1 class="title topictitle1" id="ariaid-title1">Overview of Impala Identifiers</h1>
+
+
+
+ <div class="body conbody">
+
+ <p class="p">
+ Identifiers are the names of databases, tables, or columns that you specify in a SQL statement. The rules for
+ identifiers govern what names you can give to things you create, the notation for referring to names
+ containing unusual characters, and other aspects such as case sensitivity.
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ <p class="p">
+ The minimum length of an identifier is 1 character.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ The maximum length of an identifier is currently 128 characters, enforced by the metastore database.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ An identifier must start with an alphabetic character. The remainder can contain any combination of
+ alphanumeric characters and underscores. Quoting the identifier with backticks has no effect on the allowed
+ characters in the name.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ An identifier can contain only ASCII characters.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ To use an identifier name that matches one of the Impala reserved keywords (listed in
+ <a class="xref" href="impala_reserved_words.html#reserved_words">Impala Reserved Words</a>), surround the identifier with <code class="ph codeph">``</code>
+ characters (backticks). Quote the reserved word even if it is part of a fully qualified name.
+ The following example shows how a reserved word can be used as a column name if it is quoted
+ with backticks in the <code class="ph codeph">CREATE TABLE</code> statement, and how the column name
+ must also be quoted with backticks in a query:
+ </p>
+<pre class="pre codeblock"><code>[localhost:21000] > create table reserved (`data` string);
+
+[localhost:21000] > select data from reserved;
+ERROR: AnalysisException: Syntax error in line 1:
+select data from reserved
+ ^
+Encountered: DATA
+Expected: ALL, CASE, CAST, DISTINCT, EXISTS, FALSE, IF, INTERVAL, NOT, NULL, STRAIGHT_JOIN, TRUE, IDENTIFIER
+CAUSED BY: Exception: Syntax error
+
+[localhost:21000] > select reserved.data from reserved;
+ERROR: AnalysisException: Syntax error in line 1:
+select reserved.data from reserved
+ ^
+Encountered: DATA
+Expected: IDENTIFIER
+CAUSED BY: Exception: Syntax error
+
+[localhost:21000] > select reserved.`data` from reserved;
+
+[localhost:21000] >
+</code></pre>
+
+ <div class="note important note_important"><span class="note__title importanttitle">Important:</span>
+ Because the list of reserved words grows over time as new SQL syntax is added,
+ consider adopting coding conventions (especially for any automated scripts
+ or in packaged applications) to always quote all identifiers with backticks.
+ Quoting all identifiers protects your SQL from compatibility issues if
+ new reserved words are added in later releases.
+ </div>
+
+ </li>
+
+ <li class="li">
+ <p class="p">
+ Impala identifiers are always case-insensitive. That is, tables named <code class="ph codeph">t1</code> and
+ <code class="ph codeph">T1</code> always refer to the same table, regardless of quote characters. Internally, Impala
+ always folds all specified table and column names to lowercase. This is why the column headers in query
+ output are always displayed in lowercase.
+ </p>
+ </li>
+ </ul>
+
+ <p class="p">
+ See <a class="xref" href="impala_aliases.html#aliases">Overview of Impala Aliases</a> for how to define shorter or easier-to-remember aliases if the
+ original names are long or cryptic identifiers.
+ <span class="ph"> Aliases follow the same rules as identifiers when it comes to case
+ insensitivity. Aliases can be longer than identifiers (up to the maximum length of a Java string) and can
+ include additional characters such as spaces and dashes when they are quoted using backtick characters.
+ </span>
+ </p>
+
+ <p class="p">
+ Another way to define different names for the same tables or columns is to create views. See
+ <a class="xref" href="../shared/../topics/impala_views.html#views">Overview of Impala Views</a> for details.
+ </p>
+ </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_schema_objects.html">Impala Schema Objects and Object Names</a></div></div></nav></article></main></body></html>