You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@impala.apache.org by jb...@apache.org on 2017/04/12 18:25:32 UTC
[28/51] [partial] incubator-impala git commit: IMPALA-4181 [DOCS]
Publish rendered Impala documentation to ASF site
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_isilon.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_isilon.html b/docs/build/html/topics/impala_isilon.html
new file mode 100644
index 0000000..304e8ec
--- /dev/null
+++ b/docs/build/html/topics/impala_isilon.html
@@ -0,0 +1,89 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="impala_isilon"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Using Impala with Isilon Storage</title></head><body id="impala_isilon"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+ <h1 class="title topictitle1" id="ariaid-title1">Using Impala with Isilon Storage</h1>
+
+
+
+
+ <div class="body conbody">
+
+ <p class="p">
+
+ You can use Impala to query data files that reside on EMC Isilon storage devices, rather than in HDFS.
+ This capability allows convenient query access to a storage system where you might already be
+ managing large volumes of data. The combination of the Impala query engine and Isilon storage is
+ certified on <span class="keyword">Impala 2.2.4</span> or higher.
+ </p>
+
+ <div class="p">
+ Because the EMC Isilon storage devices use a global value for the block size
+ rather than a configurable value for each file, the <code class="ph codeph">PARQUET_FILE_SIZE</code>
+ query option has no effect when Impala inserts data into a table or partition
+ residing on Isilon storage. Use the <code class="ph codeph">isi</code> command to set the
+ default block size globally on the Isilon device. For example, to set the
+ Isilon default block size to 256 MB, the recommended size for Parquet
+ data files for Impala, issue the following command:
+<pre class="pre codeblock"><code>isi hdfs settings modify --default-block-size=256MB</code></pre>
+ </div>
+
+ <p class="p">
+ The typical use case for Impala and Isilon together is to use Isilon for the
+ default filesystem, replacing HDFS entirely. In this configuration,
+ when you create a database, table, or partition, the data always resides on
+ Isilon storage and you do not need to specify any special <code class="ph codeph">LOCATION</code>
+ attribute. If you do specify a <code class="ph codeph">LOCATION</code> attribute, its value refers
+ to a path within the Isilon filesystem.
+ For example:
+ </p>
+<pre class="pre codeblock"><code>-- If the default filesystem is Isilon, all Impala data resides there
+-- and all Impala databases and tables are located there.
+CREATE TABLE t1 (x INT, s STRING);
+
+-- You can specify LOCATION for database, table, or partition,
+-- using values from the Isilon filesystem.
+CREATE DATABASE d1 LOCATION '/some/path/on/isilon/server/d1.db';
+CREATE TABLE d1.t2 (a TINYINT, b BOOLEAN);
+</code></pre>
+
+ <p class="p">
+ Impala can write to, delete, and rename data files and database, table,
+ and partition directories on Isilon storage. Therefore, Impala statements such
+ as
+ <code class="ph codeph">CREATE TABLE</code>, <code class="ph codeph">DROP TABLE</code>,
+ <code class="ph codeph">CREATE DATABASE</code>, <code class="ph codeph">DROP DATABASE</code>,
+ <code class="ph codeph">ALTER TABLE</code>,
+ and
+ <code class="ph codeph">INSERT</code> work the same with Isilon storage as with HDFS.
+ </p>
+
+ <p class="p">
+ When the Impala spill-to-disk feature is activated by a query that approaches
+ the memory limit, Impala writes all the temporary data to a local (not Isilon)
+ storage device. Because the I/O bandwidth for the temporary data depends on
+ the number of local disks, and clusters using Isilon storage might not have
+ as many local disks attached, pay special attention on Isilon-enabled clusters
+ to any queries that use the spill-to-disk feature. Where practical, tune the
+ queries or allocate extra memory for Impala to avoid spilling.
+ Although you can specify an Isilon storage device as the destination for
+ the temporary data for the spill-to-disk feature, that configuration is
+ not recommended due to the need to transfer the data both ways using remote I/O.
+ </p>
+
+ <p class="p">
+ When tuning Impala queries on HDFS, you typically try to avoid any remote reads.
+ When the data resides on Isilon storage, all the I/O consists of remote reads.
+ Do not be alarmed when you see non-zero numbers for remote read measurements
+ in query profile output. The benefit of the Impala and Isilon integration is
+ primarily convenience of not having to move or copy large volumes of data to HDFS,
+ rather than raw query performance. You can increase the performance of Impala
+ I/O for Isilon systems by increasing the value for the
+ <code class="ph codeph">--num_remote_hdfs_io_threads</code> startup option for the
+ <span class="keyword cmdname">impalad</span> daemon.
+ </p>
+
+
+ </div>
+
+</article></main></body></html>
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_jdbc.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_jdbc.html b/docs/build/html/topics/impala_jdbc.html
new file mode 100644
index 0000000..cdb031b
--- /dev/null
+++ b/docs/build/html/topics/impala_jdbc.html
@@ -0,0 +1,326 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_config.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="impala_jdbc"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Configuring Impala to Work with JDBC</title></head><body id="impala_jdbc"><main role="main"><article role="article" aria-labelledby="impala_jdbc__jdbc">
+
+ <h1 class="title topictitle1" id="impala_jdbc__jdbc">Configuring Impala to Work with JDBC</h1>
+
+
+ <div class="body conbody">
+
+ <p class="p">
+
+ Impala supports the standard JDBC interface, allowing access from commercial Business Intelligence tools and
+ custom software written in Java or other programming languages. The JDBC driver allows you to access Impala
+ from a Java program that you write, or a Business Intelligence or similar tool that uses JDBC to communicate
+ with various database products.
+ </p>
+
+ <p class="p">
+ Setting up a JDBC connection to Impala involves the following steps:
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ Verifying the communication port where the Impala daemons in your cluster are listening for incoming JDBC
+ requests.
+ </li>
+
+ <li class="li">
+ Installing the JDBC driver on every system that runs the JDBC-enabled application.
+ </li>
+
+ <li class="li">
+ Specifying a connection string for the JDBC application to access one of the servers running the
+ <span class="keyword cmdname">impalad</span> daemon, with the appropriate security settings.
+ </li>
+ </ul>
+
+ <p class="p toc inpage"></p>
+ </div>
+
+ <nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_config.html">Managing Impala</a></div></div></nav><article class="topic concept nested1" aria-labelledby="ariaid-title2" id="impala_jdbc__jdbc_port">
+
+ <h2 class="title topictitle2" id="ariaid-title2">Configuring the JDBC Port</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ The default port used by JDBC 2.0 and later (as well as ODBC 2.x) is 21050. Impala server accepts JDBC
+ connections through this same port 21050 by default. Make sure this port is available for communication
+ with other hosts on your network, for example, that it is not blocked by firewall software. If your JDBC
+ client software connects to a different port, specify that alternative port number with the
+ <code class="ph codeph">--hs2_port</code> option when starting <code class="ph codeph">impalad</code>. See
+ <a class="xref" href="impala_processes.html#processes">Starting Impala</a> for details about Impala startup options. See
+ <a class="xref" href="impala_ports.html#ports">Ports Used by Impala</a> for information about all ports used for communication between Impala
+ and clients or between Impala components.
+ </p>
+ </div>
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title3" id="impala_jdbc__jdbc_driver_choice">
+
+ <h2 class="title topictitle2" id="ariaid-title3">Choosing the JDBC Driver</h2>
+
+
+ <div class="body conbody">
+
+ <p class="p">
+ In Impala 2.0 and later, you can use the Hive 0.13 JDBC driver. If you are
+ already using JDBC applications with an earlier Impala release, you should update
+ your JDBC driver, because the Hive 0.12 driver that was formerly the only choice
+ is not compatible with Impala 2.0 and later.
+ </p>
+
+ <p class="p">
+ The Hive JDBC driver provides a substantial speed increase for JDBC
+ applications with Impala 2.0 and higher, for queries that return large result sets.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Complex type considerations:</strong>
+ </p>
+
+ <p class="p">
+ The Impala complex types (<code class="ph codeph">STRUCT</code>, <code class="ph codeph">ARRAY</code>, or <code class="ph codeph">MAP</code>)
+ are available in <span class="keyword">Impala 2.3</span> and higher.
+ To use these types with JDBC requires version 2.5.28 or higher of the JDBC Connector for Impala.
+ To use these types with ODBC requires version 2.5.30 or higher of the ODBC Connector for Impala.
+ Consider upgrading all JDBC and ODBC drivers at the same time you upgrade from <span class="keyword">Impala 2.3</span> or higher.
+ </p>
+ <p class="p">
+ Although the result sets from queries involving complex types consist of all scalar values,
+ the queries involve join notation and column references that might not be understood by
+ a particular JDBC or ODBC connector. Consider defining a view that represents the
+ flattened version of a table containing complex type columns, and pointing the JDBC
+ or ODBC application at the view.
+ See <a class="xref" href="../shared/../topics/impala_complex_types.html#complex_types">Complex Types (Impala 2.3 or higher only)</a> for details.
+ </p>
+
+ </div>
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title4" id="impala_jdbc__jdbc_setup">
+
+ <h2 class="title topictitle2" id="ariaid-title4">Enabling Impala JDBC Support on Client Systems</h2>
+
+
+ <div class="body conbody">
+
+ <section class="section" id="jdbc_setup__install_hive_driver"><h3 class="title sectiontitle">Using the Hive JDBC Driver</h3>
+
+ <p class="p">
+ You install the Hive JDBC driver (<code class="ph codeph">hive-jdbc</code> package) through the Linux package manager, on
+ hosts within the cluster. The driver consists of several Java JAR files. The same driver can be used by Impala and Hive.
+ </p>
+
+ <p class="p">
+ To get the JAR files, install the Hive JDBC driver on each host in the cluster that will run
+ JDBC applications.
+ </p>
+
+ <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+ The latest JDBC driver, corresponding to Hive 0.13, provides substantial performance improvements for
+ Impala queries that return large result sets. Impala 2.0 and later are compatible with the Hive 0.13
+ driver. If you already have an older JDBC driver installed, and are running Impala 2.0 or higher, consider
+ upgrading to the latest Hive JDBC driver for best performance with JDBC applications.
+ </div>
+
+ <p class="p">
+ If you are using JDBC-enabled applications on hosts outside the cluster, you cannot use the the same install
+ procedure on the hosts. Install the JDBC driver on at least one cluster host using the preceding
+ procedure. Then download the JAR files to each client machine that will use JDBC with Impala:
+ </p>
+
+ <pre class="pre codeblock"><code>commons-logging-X.X.X.jar
+ hadoop-common.jar
+ hive-common-X.XX.X.jar
+ hive-jdbc-X.XX.X.jar
+ hive-metastore-X.XX.X.jar
+ hive-service-X.XX.X.jar
+ httpclient-X.X.X.jar
+ httpcore-X.X.X.jar
+ libfb303-X.X.X.jar
+ libthrift-X.X.X.jar
+ log4j-X.X.XX.jar
+ slf4j-api-X.X.X.jar
+ slf4j-logXjXX-X.X.X.jar
+ </code></pre>
+
+ <p class="p">
+ <strong class="ph b">To enable JDBC support for Impala on the system where you run the JDBC application:</strong>
+ </p>
+
+ <ol class="ol">
+ <li class="li">
+ Download the JAR files listed above to each client machine.
+ <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+ For Maven users, see
+ <a class="xref" href="https://github.com/onefoursix/Cloudera-Impala-JDBC-Example" target="_blank">this sample github page</a> for an example of the
+ dependencies you could add to a <code class="ph codeph">pom</code> file instead of downloading the individual JARs.
+ </div>
+ </li>
+
+ <li class="li">
+ Store the JAR files in a location of your choosing, ideally a directory already referenced in your
+ <code class="ph codeph">CLASSPATH</code> setting. For example:
+ <ul class="ul">
+ <li class="li">
+ On Linux, you might use a location such as <code class="ph codeph">/opt/jars/</code>.
+ </li>
+
+ <li class="li">
+ On Windows, you might use a subdirectory underneath <span class="ph filepath">C:\Program Files</span>.
+ </li>
+ </ul>
+ </li>
+
+ <li class="li">
+ To successfully load the Impala JDBC driver, client programs must be able to locate the associated JAR
+ files. This often means setting the <code class="ph codeph">CLASSPATH</code> for the client process to include the
+ JARs. Consult the documentation for your JDBC client for more details on how to install new JDBC drivers,
+ but some examples of how to set <code class="ph codeph">CLASSPATH</code> variables include:
+ <ul class="ul">
+ <li class="li">
+ On Linux, if you extracted the JARs to <code class="ph codeph">/opt/jars/</code>, you might issue the following
+ command to prepend the JAR files path to an existing classpath:
+ <pre class="pre codeblock"><code>export CLASSPATH=/opt/jars/*.jar:$CLASSPATH</code></pre>
+ </li>
+
+ <li class="li">
+ On Windows, use the <strong class="ph b">System Properties</strong> control panel item to modify the <strong class="ph b">Environment
+ Variables</strong> for your system. Modify the environment variables to include the path to which you
+ extracted the files.
+ <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+ If the existing <code class="ph codeph">CLASSPATH</code> on your client machine refers to some older version of
+ the Hive JARs, ensure that the new JARs are the first ones listed. Either put the new JAR files
+ earlier in the listings, or delete the other references to Hive JAR files.
+ </div>
+ </li>
+ </ul>
+ </li>
+ </ol>
+ </section>
+
+ </div>
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title5" id="impala_jdbc__jdbc_connect">
+
+ <h2 class="title topictitle2" id="ariaid-title5">Establishing JDBC Connections</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ The JDBC driver class depends on which driver you select.
+ </p>
+
+ <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+ If your JDBC or ODBC application connects to Impala through a load balancer such as
+ <code class="ph codeph">haproxy</code>, be cautious about reusing the connections. If the load balancer has set up
+ connection timeout values, either check the connection frequently so that it never sits idle longer than
+ the load balancer timeout value, or check the connection validity before using it and create a new one if
+ the connection has been closed.
+ </div>
+
+ <section class="section" id="jdbc_connect__class_hive_driver"><h3 class="title sectiontitle">Using the Hive JDBC Driver</h3>
+
+
+ <p class="p">
+ For example, with the Hive JDBC driver, the class name is <code class="ph codeph">org.apache.hive.jdbc.HiveDriver</code>.
+ Once you have configured Impala to work with JDBC, you can establish connections between the two.
+ To do so for a cluster that does not use
+ Kerberos authentication, use a connection string of the form
+ <code class="ph codeph">jdbc:hive2://<var class="keyword varname">host</var>:<var class="keyword varname">port</var>/;auth=noSasl</code>.
+
+ For example, you might use:
+ </p>
+
+<pre class="pre codeblock"><code>jdbc:hive2://myhost.example.com:21050/;auth=noSasl</code></pre>
+
+ <p class="p">
+ To connect to an instance of Impala that requires Kerberos authentication, use a connection string of the
+ form
+ <code class="ph codeph">jdbc:hive2://<var class="keyword varname">host</var>:<var class="keyword varname">port</var>/;principal=<var class="keyword varname">principal_name</var></code>.
+ The principal must be the same user principal you used when starting Impala. For example, you might use:
+ </p>
+
+<pre class="pre codeblock"><code>jdbc:hive2://myhost.example.com:21050/;principal=impala/myhost.example.com@H2.EXAMPLE.COM</code></pre>
+
+ <p class="p">
+ To connect to an instance of Impala that requires LDAP authentication, use a connection string of the form
+ <code class="ph codeph">jdbc:hive2://<var class="keyword varname">host</var>:<var class="keyword varname">port</var>/<var class="keyword varname">db_name</var>;user=<var class="keyword varname">ldap_userid</var>;password=<var class="keyword varname">ldap_password</var></code>.
+ For example, you might use:
+ </p>
+
+<pre class="pre codeblock"><code>jdbc:hive2://myhost.example.com:21050/test_db;user=fred;password=xyz123</code></pre>
+
+ <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+ <p class="p">
+ Prior to <span class="keyword">Impala 2.5</span>, the Hive JDBC driver did not support connections that use both Kerberos authentication
+ and SSL encryption. If your cluster is running an older release that has this restriction,
+ use an alternative JDBC driver that supports
+ both of these security features.
+ </p>
+ </div>
+
+ </section>
+
+ </div>
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title6" id="impala_jdbc__jdbc_odbc_notes">
+ <h2 class="title topictitle2" id="ariaid-title6">Notes about JDBC and ODBC Interaction with Impala SQL Features</h2>
+ <div class="body conbody">
+ <p class="p">
+ Most Impala SQL features work equivalently through the <span class="keyword cmdname">impala-shell</span> interpreter
+ of the JDBC or ODBC APIs. The following are some exceptions to keep in mind when switching between
+ the interactive shell and applications using the APIs:
+ </p>
+ <ul class="ul">
+ <li class="li">
+ <p class="p">
+ <strong class="ph b">Complex type considerations:</strong>
+ </p>
+ <ul class="ul">
+ <li class="li">
+ <p class="p">
+ Queries involving the complex types (<code class="ph codeph">ARRAY</code>, <code class="ph codeph">STRUCT</code>, and <code class="ph codeph">MAP</code>)
+ require notation that might not be available in all levels of JDBC and ODBC drivers.
+ If you have trouble querying such a table due to the driver level or
+ inability to edit the queries used by the application, you can create a view that exposes
+ a <span class="q">"flattened"</span> version of the complex columns and point the application at the view.
+ See <a class="xref" href="impala_complex_types.html#complex_types">Complex Types (Impala 2.3 or higher only)</a> for details.
+ </p>
+ </li>
+ <li class="li">
+ <p class="p">
+ The complex types available in <span class="keyword">Impala 2.3</span> and higher are supported by the
+ JDBC <code class="ph codeph">getColumns()</code> API.
+ Both <code class="ph codeph">MAP</code> and <code class="ph codeph">ARRAY</code> are reported as the JDBC SQL Type <code class="ph codeph">ARRAY</code>,
+ because this is the closest matching Java SQL type. This behavior is consistent with Hive.
+ <code class="ph codeph">STRUCT</code> types are reported as the JDBC SQL Type <code class="ph codeph">STRUCT</code>.
+ </p>
+ <div class="p">
+ To be consistent with Hive's behavior, the TYPE_NAME field is populated
+ with the primitive type name for scalar types, and with the full <code class="ph codeph">toSql()</code>
+ for complex types. The resulting type names are somewhat inconsistent,
+ because nested types are printed differently than top-level types. For example,
+ the following list shows how <code class="ph codeph">toSQL()</code> for Impala types are
+ translated to <code class="ph codeph">TYPE_NAME</code> values:
+<pre class="pre codeblock"><code>DECIMAL(10,10) becomes DECIMAL
+CHAR(10) becomes CHAR
+VARCHAR(10) becomes VARCHAR
+ARRAY<DECIMAL(10,10)> becomes ARRAY<DECIMAL(10,10)>
+ARRAY<CHAR(10)> becomes ARRAY<CHAR(10)>
+ARRAY<VARCHAR(10)> becomes ARRAY<VARCHAR(10)>
+
+</code></pre>
+ </div>
+ </li>
+ </ul>
+ </li>
+ </ul>
+ </div>
+ </article>
+
+</article></main></body></html>
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_joins.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_joins.html b/docs/build/html/topics/impala_joins.html
new file mode 100644
index 0000000..436f9f5
--- /dev/null
+++ b/docs/build/html/topics/impala_joins.html
@@ -0,0 +1,531 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_select.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="joins"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Joins in Impala SELECT Statements</title></head><body id="joins"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+ <h1 class="title topictitle1" id="ariaid-title1">Joins in Impala SELECT Statements</h1>
+
+
+
+ <div class="body conbody">
+
+ <p class="p">
+
+ A join query is a <code class="ph codeph">SELECT</code> statement that combines data from two or more tables,
+ and returns a result set containing items from some or all of those tables. It is a way to
+ cross-reference and correlate related data that is organized into multiple tables, typically
+ using identifiers that are repeated in each of the joined tables.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Syntax:</strong>
+ </p>
+
+ <p class="p">
+ Impala supports a wide variety of <code class="ph codeph">JOIN</code> clauses. Left, right, semi, full, and outer joins
+ are supported in all Impala versions. The <code class="ph codeph">CROSS JOIN</code> operator is available in Impala 1.2.2
+ and higher. During performance tuning, you can override the reordering of join clauses that Impala does
+ internally by including the keyword <code class="ph codeph">STRAIGHT_JOIN</code> immediately after the
+ <code class="ph codeph">SELECT</code> keyword
+ </p>
+
+<pre class="pre codeblock"><code>SELECT <var class="keyword varname">select_list</var> FROM
+ <var class="keyword varname">table_or_subquery1</var> [INNER] JOIN <var class="keyword varname">table_or_subquery2</var> |
+ <var class="keyword varname">table_or_subquery1</var> {LEFT [OUTER] | RIGHT [OUTER] | FULL [OUTER]} JOIN <var class="keyword varname">table_or_subquery2</var> |
+ <var class="keyword varname">table_or_subquery1</var> {LEFT | RIGHT} SEMI JOIN <var class="keyword varname">table_or_subquery2</var> |
+ <span class="ph"><var class="keyword varname">table_or_subquery1</var> {LEFT | RIGHT} ANTI JOIN <var class="keyword varname">table_or_subquery2</var> |</span>
+ [ ON <var class="keyword varname">col1</var> = <var class="keyword varname">col2</var> [AND <var class="keyword varname">col3</var> = <var class="keyword varname">col4</var> ...] |
+ USING (<var class="keyword varname">col1</var> [, <var class="keyword varname">col2</var> ...]) ]
+ [<var class="keyword varname">other_join_clause</var> ...]
+[ WHERE <var class="keyword varname">where_clauses</var> ]
+
+SELECT <var class="keyword varname">select_list</var> FROM
+ <var class="keyword varname">table_or_subquery1</var>, <var class="keyword varname">table_or_subquery2</var> [, <var class="keyword varname">table_or_subquery3</var> ...]
+ [<var class="keyword varname">other_join_clause</var> ...]
+WHERE
+ <var class="keyword varname">col1</var> = <var class="keyword varname">col2</var> [AND <var class="keyword varname">col3</var> = <var class="keyword varname">col4</var> ...]
+
+SELECT <var class="keyword varname">select_list</var> FROM
+ <var class="keyword varname">table_or_subquery1</var> CROSS JOIN <var class="keyword varname">table_or_subquery2</var>
+ [<var class="keyword varname">other_join_clause</var> ...]
+[ WHERE <var class="keyword varname">where_clauses</var> ]</code></pre>
+
+ <p class="p">
+ <strong class="ph b">SQL-92 and SQL-89 Joins:</strong>
+ </p>
+
+ <p class="p">
+ Queries with the explicit <code class="ph codeph">JOIN</code> keywords are known as SQL-92 style joins, referring to the
+ level of the SQL standard where they were introduced. The corresponding <code class="ph codeph">ON</code> or
+ <code class="ph codeph">USING</code> clauses clearly show which columns are used as the join keys in each case:
+ </p>
+
+<pre class="pre codeblock"><code>SELECT t1.c1, t2.c2 FROM <strong class="ph b">t1 JOIN t2</strong>
+ <strong class="ph b">ON t1.id = t2.id and t1.type_flag = t2.type_flag</strong>
+ WHERE t1.c1 > 100;
+
+SELECT t1.c1, t2.c2 FROM <strong class="ph b">t1 JOIN t2</strong>
+ <strong class="ph b">USING (id, type_flag)</strong>
+ WHERE t1.c1 > 100;</code></pre>
+
+ <p class="p">
+ The <code class="ph codeph">ON</code> clause is a general way to compare columns across the two tables, even if the column
+ names are different. The <code class="ph codeph">USING</code> clause is a shorthand notation for specifying the join
+ columns, when the column names are the same in both tables. You can code equivalent <code class="ph codeph">WHERE</code>
+ clauses that compare the columns, instead of <code class="ph codeph">ON</code> or <code class="ph codeph">USING</code> clauses, but that
+ practice is not recommended because mixing the join comparisons with other filtering clauses is typically
+ less readable and harder to maintain.
+ </p>
+
+ <p class="p">
+ Queries with a comma-separated list of tables and subqueries are known as SQL-89 style joins. In these
+ queries, the equality comparisons between columns of the joined tables go in the <code class="ph codeph">WHERE</code>
+ clause alongside other kinds of comparisons. This syntax is easy to learn, but it is also easy to
+ accidentally remove a <code class="ph codeph">WHERE</code> clause needed for the join to work correctly.
+ </p>
+
+<pre class="pre codeblock"><code>SELECT t1.c1, t2.c2 FROM <strong class="ph b">t1, t2</strong>
+ WHERE
+ <strong class="ph b">t1.id = t2.id AND t1.type_flag = t2.type_flag</strong>
+ AND t1.c1 > 100;</code></pre>
+
+ <p class="p">
+ <strong class="ph b">Self-joins:</strong>
+ </p>
+
+ <p class="p">
+ Impala can do self-joins, for example to join on two different columns in the same table to represent
+ parent-child relationships or other tree-structured data. There is no explicit syntax for this; just use the
+ same table name for both the left-hand and right-hand table, and assign different table aliases to use when
+ referring to the fully qualified column names:
+ </p>
+
+<pre class="pre codeblock"><code>-- Combine fields from both parent and child rows.
+SELECT lhs.id, rhs.parent, lhs.c1, rhs.c2 FROM tree_data lhs, tree_data rhs WHERE lhs.id = rhs.parent;</code></pre>
+
+ <p class="p">
+ <strong class="ph b">Cartesian joins:</strong>
+ </p>
+
+ <div class="p">
+ To avoid producing huge result sets by mistake, Impala does not allow Cartesian joins of the form:
+<pre class="pre codeblock"><code>SELECT ... FROM t1 JOIN t2;
+SELECT ... FROM t1, t2;</code></pre>
+ If you intend to join the tables based on common values, add <code class="ph codeph">ON</code> or <code class="ph codeph">WHERE</code>
+ clauses to compare columns across the tables. If you truly intend to do a Cartesian join, use the
+ <code class="ph codeph">CROSS JOIN</code> keyword as the join operator. The <code class="ph codeph">CROSS JOIN</code> form does not use
+ any <code class="ph codeph">ON</code> clause, because it produces a result set with all combinations of rows from the
+ left-hand and right-hand tables. The result set can still be filtered by subsequent <code class="ph codeph">WHERE</code>
+ clauses. For example:
+ </div>
+
+<pre class="pre codeblock"><code>SELECT ... FROM t1 CROSS JOIN t2;
+SELECT ... FROM t1 CROSS JOIN t2 WHERE <var class="keyword varname">tests_on_non_join_columns</var>;</code></pre>
+
+ <p class="p">
+ <strong class="ph b">Inner and outer joins:</strong>
+ </p>
+
+ <p class="p">
+ An inner join is the most common and familiar type: rows in the result set contain the requested columns from
+ the appropriate tables, for all combinations of rows where the join columns of the tables have identical
+ values. If a column with the same name occurs in both tables, use a fully qualified name or a column alias to
+ refer to the column in the select list or other clauses. Impala performs inner joins by default for both
+ SQL-89 and SQL-92 join syntax:
+ </p>
+
+<pre class="pre codeblock"><code>-- The following 3 forms are all equivalent.
+SELECT t1.id, c1, c2 FROM t1, t2 WHERE t1.id = t2.id;
+SELECT t1.id, c1, c2 FROM t1 JOIN t2 ON t1.id = t2.id;
+SELECT t1.id, c1, c2 FROM t1 INNER JOIN t2 ON t1.id = t2.id;</code></pre>
+
+ <p class="p">
+ An outer join retrieves all rows from the left-hand table, or the right-hand table, or both; wherever there
+ is no matching data in the table on the other side of the join, the corresponding columns in the result set
+ are set to <code class="ph codeph">NULL</code>. To perform an outer join, include the <code class="ph codeph">OUTER</code> keyword in the
+ join operator, along with either <code class="ph codeph">LEFT</code>, <code class="ph codeph">RIGHT</code>, or <code class="ph codeph">FULL</code>:
+ </p>
+
+<pre class="pre codeblock"><code>SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.id = t2.id;
+SELECT * FROM t1 RIGHT OUTER JOIN t2 ON t1.id = t2.id;
+SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.id = t2.id;</code></pre>
+
+ <p class="p">
+ For outer joins, Impala requires SQL-92 syntax; that is, the <code class="ph codeph">JOIN</code> keyword instead of
+ comma-separated table names. Impala does not support vendor extensions such as <code class="ph codeph">(+)</code> or
+ <code class="ph codeph">*=</code> notation for doing outer joins with SQL-89 query syntax.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Equijoins and Non-Equijoins:</strong>
+ </p>
+
+ <p class="p">
+ By default, Impala requires an equality comparison between the left-hand and right-hand tables, either
+ through <code class="ph codeph">ON</code>, <code class="ph codeph">USING</code>, or <code class="ph codeph">WHERE</code> clauses. These types of
+ queries are classified broadly as equijoins. Inner, outer, full, and semi joins can all be equijoins based on
+ the presence of equality tests between columns in the left-hand and right-hand tables.
+ </p>
+
+ <p class="p">
+ In Impala 1.2.2 and higher, non-equijoin queries are also possible, with comparisons such as
+ <code class="ph codeph">!=</code> or <code class="ph codeph"><</code> between the join columns. These kinds of queries require care to
+ avoid producing huge result sets that could exceed resource limits. Once you have planned a non-equijoin
+ query that produces a result set of acceptable size, you can code the query using the <code class="ph codeph">CROSS
+ JOIN</code> operator, and add the extra comparisons in the <code class="ph codeph">WHERE</code> clause:
+ </p>
+
+<pre class="pre codeblock"><code>SELECT * FROM t1 CROSS JOIN t2 WHERE t1.total > t2.maximum_price;</code></pre>
+
+ <p class="p">
+ In <span class="keyword">Impala 2.3</span> and higher, additional non-equijoin queries are possible due to the addition
+ of nested loop joins. These queries typically involve <code class="ph codeph">SEMI JOIN</code>,
+ <code class="ph codeph">ANTI JOIN</code>, or <code class="ph codeph">FULL OUTER JOIN</code> clauses.
+ Impala sometimes also uses nested loop joins internally when evaluating <code class="ph codeph">OUTER JOIN</code>
+ queries involving complex type columns.
+ Query phases involving nested loop joins do not use the spill-to-disk mechanism if they
+ exceed the memory limit. Impala decides internally when to use each join mechanism; you cannot
+ specify any query hint to choose between the nested loop join or the original hash join algorithm.
+ </p>
+
+<pre class="pre codeblock"><code>SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.int_col < t2.int_col;</code></pre>
+
+ <p class="p">
+ <strong class="ph b">Semi-joins:</strong>
+ </p>
+
+ <p class="p">
+ Semi-joins are a relatively rarely used variation. With the left semi-join, only data from the left-hand
+ table is returned, for rows where there is matching data in the right-hand table, based on comparisons
+ between join columns in <code class="ph codeph">ON</code> or <code class="ph codeph">WHERE</code> clauses. Only one instance of each row
+ from the left-hand table is returned, regardless of how many matching rows exist in the right-hand table.
+ <span class="ph">A right semi-join (available in Impala 2.0 and higher) reverses the comparison and returns
+ data from the right-hand table.</span>
+ </p>
+
+<pre class="pre codeblock"><code>SELECT t1.c1, t1.c2, t1.c2 FROM t1 LEFT SEMI JOIN t2 ON t1.id = t2.id;</code></pre>
+
+ <p class="p">
+ <strong class="ph b">Natural joins (not supported):</strong>
+ </p>
+
+ <p class="p">
+ Impala does not support the <code class="ph codeph">NATURAL JOIN</code> operator, again to avoid inconsistent or huge
+ result sets. Natural joins do away with the <code class="ph codeph">ON</code> and <code class="ph codeph">USING</code> clauses, and
+ instead automatically join on all columns with the same names in the left-hand and right-hand tables. This
+ kind of query is not recommended for rapidly evolving data structures such as are typically used in Hadoop.
+ Thus, Impala does not support the <code class="ph codeph">NATURAL JOIN</code> syntax, which can produce different query
+ results as columns are added to or removed from tables.
+ </p>
+
+ <p class="p">
+ If you do have any queries that use <code class="ph codeph">NATURAL JOIN</code>, make sure to rewrite them with explicit
+ <code class="ph codeph">USING</code> clauses, because Impala could interpret the <code class="ph codeph">NATURAL</code> keyword as a
+ table alias:
+ </p>
+
+<pre class="pre codeblock"><code>-- 'NATURAL' is interpreted as an alias for 't1' and Impala attempts an inner join,
+-- resulting in an error because inner joins require explicit comparisons between columns.
+SELECT t1.c1, t2.c2 FROM t1 NATURAL JOIN t2;
+ERROR: NotImplementedException: Join with 't2' requires at least one conjunctive equality predicate.
+ To perform a Cartesian product between two tables, use a CROSS JOIN.
+
+-- If you expect the tables to have identically named columns with matching values,
+-- list the corresponding column names in a USING clause.
+SELECT t1.c1, t2.c2 FROM t1 JOIN t2 USING (id, type_flag, name, address);</code></pre>
+
+ <p class="p">
+ <strong class="ph b">Anti-joins (<span class="keyword">Impala 2.0</span> and higher only):</strong>
+ </p>
+
+ <p class="p">
+ Impala supports the <code class="ph codeph">LEFT ANTI JOIN</code> and <code class="ph codeph">RIGHT ANTI JOIN</code> clauses in
+ <span class="keyword">Impala 2.0</span> and higher. The <code class="ph codeph">LEFT</code> or <code class="ph codeph">RIGHT</code>
+ keyword is required for this kind of join. For <code class="ph codeph">LEFT ANTI JOIN</code>, this clause returns those
+ values from the left-hand table that have no matching value in the right-hand table. <code class="ph codeph">RIGHT ANTI
+ JOIN</code> reverses the comparison and returns values from the right-hand table. You can express this
+ negative relationship either through the <code class="ph codeph">ANTI JOIN</code> clause or through a <code class="ph codeph">NOT
+ EXISTS</code> operator with a subquery.
+ </p>
+
+
+
+ <p class="p">
+ <strong class="ph b">Complex type considerations:</strong>
+ </p>
+
+
+
+ <p class="p">
+ When referring to a column with a complex type (<code class="ph codeph">STRUCT</code>, <code class="ph codeph">ARRAY</code>, or <code class="ph codeph">MAP</code>)
+ in a query, you use join notation to <span class="q">"unpack"</span> the scalar fields of the struct, the elements of the array, or
+ the key-value pairs of the map. (The join notation is not required for aggregation operations, such as
+ <code class="ph codeph">COUNT()</code> or <code class="ph codeph">SUM()</code> for array elements.) Because Impala recognizes which complex type elements are associated with which row
+ of the result set, you use the same syntax as for a cross or cartesian join, without an explicit join condition.
+ See <a class="xref" href="impala_complex_types.html#complex_types">Complex Types (Impala 2.3 or higher only)</a> for details about Impala support for complex types.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Usage notes:</strong>
+ </p>
+
+ <p class="p">
+ You typically use join queries in situations like these:
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ When related data arrives from different sources, with each data set physically residing in a separate
+ table. For example, you might have address data from business records that you cross-check against phone
+ listings or census data.
+ <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+ Impala can join tables of different file formats, including Impala-managed tables and HBase tables. For
+ example, you might keep small dimension tables in HBase, for convenience of single-row lookups and
+ updates, and for the larger fact tables use Parquet or other binary file format optimized for scan
+ operations. Then, you can issue a join query to cross-reference the fact tables with the dimension
+ tables.
+ </div>
+ </li>
+
+ <li class="li">
+ When data is normalized, a technique for reducing data duplication by dividing it across multiple tables.
+ This kind of organization is often found in data that comes from traditional relational database systems.
+ For example, instead of repeating some long string such as a customer name in multiple tables, each table
+ might contain a numeric customer ID. Queries that need to display the customer name could <span class="q">"join"</span> the
+ table that specifies which customer ID corresponds to which name.
+ </li>
+
+ <li class="li">
+ When certain columns are rarely needed for queries, so they are moved into separate tables to reduce
+ overhead for common queries. For example, a <code class="ph codeph">biography</code> field might be rarely needed in
+ queries on employee data. Putting that field in a separate table reduces the amount of I/O for common
+ queries on employee addresses or phone numbers. Queries that do need the <code class="ph codeph">biography</code> column
+ can retrieve it by performing a join with that separate table.
+ </li>
+
+ <li class="li">
+ In <span class="keyword">Impala 2.3</span> or higher, when referring to complex type columns in queries.
+ See <a class="xref" href="impala_complex_types.html#complex_types">Complex Types (Impala 2.3 or higher only)</a> for details.
+ </li>
+ </ul>
+
+ <p class="p">
+ When comparing columns with the same names in <code class="ph codeph">ON</code> or <code class="ph codeph">WHERE</code> clauses, use the
+ fully qualified names such as <code class="ph codeph"><var class="keyword varname">db_name</var>.<var class="keyword varname">table_name</var></code>, or
+ assign table aliases, column aliases, or both to make the code more compact and understandable:
+ </p>
+
+<pre class="pre codeblock"><code>select t1.c1 as first_id, t2.c2 as second_id from
+ t1 join t2 on first_id = second_id;
+
+select fact.custno, dimension.custno from
+ customer_data as fact join customer_address as dimension
+ using (custno)</code></pre>
+
+ <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+ <p class="p">
+ Performance for join queries is a crucial aspect for Impala, because complex join queries are
+ resource-intensive operations. An efficient join query produces much less network traffic and CPU overhead
+ than an inefficient one. For best results:
+ </p>
+ <ul class="ul">
+ <li class="li">
+ Make sure that both <a class="xref" href="impala_perf_stats.html#perf_stats">table and column statistics</a> are
+ available for all the tables involved in a join query, and especially for the columns referenced in any
+ join conditions. Impala uses the statistics to automatically deduce an efficient join order.
+ Use <a class="xref" href="impala_show.html#show"><code class="ph codeph">SHOW TABLE STATS <var class="keyword varname">table_name</var></code> and
+ <code class="ph codeph">SHOW COLUMN STATS <var class="keyword varname">table_name</var></code></a> to check if statistics are
+ already present. Issue the <code class="ph codeph">COMPUTE STATS <var class="keyword varname">table_name</var></code> for a nonpartitioned table,
+ or (in Impala 2.1.0 and higher) <code class="ph codeph">COMPUTE INCREMENTAL STATS <var class="keyword varname">table_name</var></code>
+ for a partitioned table, to collect the initial statistics at both the table and column levels, and to keep the
+ statistics up to date after any substantial <code class="ph codeph">INSERT</code> or <code class="ph codeph">LOAD DATA</code> operations.
+ </li>
+
+ <li class="li">
+ If table or column statistics are not available, join the largest table first. You can check the
+ existence of statistics with the <code class="ph codeph">SHOW TABLE STATS <var class="keyword varname">table_name</var></code> and
+ <code class="ph codeph">SHOW COLUMN STATS <var class="keyword varname">table_name</var></code> statements.
+ </li>
+
+ <li class="li">
+ If table or column statistics are not available, join subsequent tables according to which table has the
+ most selective filter, based on overall size and <code class="ph codeph">WHERE</code> clauses. Joining the table with
+ the most selective filter results in the fewest number of rows being returned.
+ </li>
+ </ul>
+ <p class="p">
+ For more information and examples of performance for join queries, see
+ <a class="xref" href="impala_perf_joins.html#perf_joins">Performance Considerations for Join Queries</a>.
+ </p>
+ </div>
+
+ <p class="p">
+ To control the result set from a join query, include the names of corresponding column names in both tables
+ in an <code class="ph codeph">ON</code> or <code class="ph codeph">USING</code> clause, or by coding equality comparisons for those
+ columns in the <code class="ph codeph">WHERE</code> clause.
+ </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > select c_last_name, ca_city from customer join customer_address where c_customer_sk = ca_address_sk;
++-------------+-----------------+
+| c_last_name | ca_city |
++-------------+-----------------+
+| Lewis | Fairfield |
+| Moses | Fairview |
+| Hamilton | Pleasant Valley |
+| White | Oak Ridge |
+| Moran | Glendale |
+...
+| Richards | Lakewood |
+| Day | Lebanon |
+| Painter | Oak Hill |
+| Bentley | Greenfield |
+| Jones | Stringtown |
++-------------+------------------+
+Returned 50000 row(s) in 9.82s</code></pre>
+
+ <p class="p">
+ One potential downside of joins is the possibility of excess resource usage in poorly constructed queries.
+ Impala imposes restrictions on join queries to guard against such issues. To minimize the chance of runaway
+ queries on large data sets, Impala requires every join query to contain at least one equality predicate
+ between the columns of the various tables. For example, if <code class="ph codeph">T1</code> contains 1000 rows and
+ <code class="ph codeph">T2</code> contains 1,000,000 rows, a query <code class="ph codeph">SELECT <var class="keyword varname">columns</var> FROM t1 JOIN
+ t2</code> could return up to 1 billion rows (1000 * 1,000,000); Impala requires that the query include a
+ clause such as <code class="ph codeph">ON t1.c1 = t2.c2</code> or <code class="ph codeph">WHERE t1.c1 = t2.c2</code>.
+ </p>
+
+ <p class="p">
+ Because even with equality clauses, the result set can still be large, as we saw in the previous example, you
+ might use a <code class="ph codeph">LIMIT</code> clause to return a subset of the results:
+ </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > select c_last_name, ca_city from customer, customer_address where c_customer_sk = ca_address_sk limit 10;
++-------------+-----------------+
+| c_last_name | ca_city |
++-------------+-----------------+
+| Lewis | Fairfield |
+| Moses | Fairview |
+| Hamilton | Pleasant Valley |
+| White | Oak Ridge |
+| Moran | Glendale |
+| Sharp | Lakeview |
+| Wiles | Farmington |
+| Shipman | Union |
+| Gilbert | New Hope |
+| Brunson | Martinsville |
++-------------+-----------------+
+Returned 10 row(s) in 0.63s</code></pre>
+
+ <p class="p">
+ Or you might use additional comparison operators or aggregation functions to condense a large result set into
+ a smaller set of values:
+ </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > -- Find the names of customers who live in one particular town.
+[localhost:21000] > select distinct c_last_name from customer, customer_address where
+ c_customer_sk = ca_address_sk
+ and ca_city = "Green Acres";
++---------------+
+| c_last_name |
++---------------+
+| Hensley |
+| Pearson |
+| Mayer |
+| Montgomery |
+| Ricks |
+...
+| Barrett |
+| Price |
+| Hill |
+| Hansen |
+| Meeks |
++---------------+
+Returned 332 row(s) in 0.97s
+
+[localhost:21000] > -- See how many different customers in this town have names starting with "A".
+[localhost:21000] > select count(distinct c_last_name) from customer, customer_address where
+ c_customer_sk = ca_address_sk
+ and ca_city = "Green Acres"
+ and substr(c_last_name,1,1) = "A";
++-----------------------------+
+| count(distinct c_last_name) |
++-----------------------------+
+| 12 |
++-----------------------------+
+Returned 1 row(s) in 1.00s</code></pre>
+
+ <p class="p">
+ Because a join query can involve reading large amounts of data from disk, sending large amounts of data
+ across the network, and loading large amounts of data into memory to do the comparisons and filtering, you
+ might do benchmarking, performance analysis, and query tuning to find the most efficient join queries for
+ your data set, hardware capacity, network configuration, and cluster workload.
+ </p>
+
+ <p class="p">
+ The two categories of joins in Impala are known as <strong class="ph b">partitioned joins</strong> and <strong class="ph b">broadcast joins</strong>. If
+ inaccurate table or column statistics, or some quirk of the data distribution, causes Impala to choose the
+ wrong mechanism for a particular join, consider using query hints as a temporary workaround. For details, see
+ <a class="xref" href="impala_hints.html#hints">Query Hints in Impala SELECT Statements</a>.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Handling NULLs in Join Columns:</strong>
+ </p>
+
+ <p class="p">
+ By default, join key columns do not match if either one contains a <code class="ph codeph">NULL</code> value.
+ To treat such columns as equal if both contain <code class="ph codeph">NULL</code>, you can use an expression
+ such as <code class="ph codeph">A = B OR (A IS NULL AND B IS NULL)</code>.
+ In <span class="keyword">Impala 2.5</span> and higher, the <code class="ph codeph"><=></code> operator (shorthand for
+ <code class="ph codeph">IS NOT DISTINCT FROM</code>) performs the same comparison in a concise and efficient form.
+ The <code class="ph codeph"><=></code> operator is more efficient in for comparing join keys in a <code class="ph codeph">NULL</code>-safe
+ manner, because the operator can use a hash join while the <code class="ph codeph">OR</code> expression cannot.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Examples:</strong>
+ </p>
+
+ <div class="p">
+ The following examples refer to these simple tables containing small sets of integers:
+<pre class="pre codeblock"><code>[localhost:21000] > create table t1 (x int);
+[localhost:21000] > insert into t1 values (1), (2), (3), (4), (5), (6);
+
+[localhost:21000] > create table t2 (y int);
+[localhost:21000] > insert into t2 values (2), (4), (6);
+
+[localhost:21000] > create table t3 (z int);
+[localhost:21000] > insert into t3 values (1), (3), (5);
+</code></pre>
+ </div>
+
+
+
+ <p class="p">
+ The following example demonstrates an anti-join, returning the values from <code class="ph codeph">T1</code> that do not
+ exist in <code class="ph codeph">T2</code> (in this case, the odd numbers 1, 3, and 5):
+ </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > select x from t1 left anti join t2 on (t1.x = t2.y);
++---+
+| x |
++---+
+| 1 |
+| 3 |
+| 5 |
++---+
+</code></pre>
+
+ <p class="p">
+ <strong class="ph b">Related information:</strong>
+ </p>
+
+ <p class="p">
+ See these tutorials for examples of different kinds of joins:
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ <a class="xref" href="impala_tutorial.html#tut_cross_join">Cross Joins and Cartesian Products with the CROSS JOIN Operator</a>
+ </li>
+ </ul>
+ </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_select.html">SELECT Statement</a></div></div></nav></article></main></body></html>
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_kerberos.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_kerberos.html b/docs/build/html/topics/impala_kerberos.html
new file mode 100644
index 0000000..858ff87
--- /dev/null
+++ b/docs/build/html/topics/impala_kerberos.html
@@ -0,0 +1,342 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_authentication.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala
2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="kerberos"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Enabling Kerberos Authentication for Impala</title></head><body id="kerberos"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+ <h1 class="title topictitle1" id="ariaid-title1">Enabling Kerberos Authentication for Impala</h1>
+
+
+ <div class="body conbody">
+
+ <p class="p">
+ Impala supports an enterprise-grade authentication system called Kerberos. Kerberos provides strong security benefits including
+ capabilities that render intercepted authentication packets unusable by an attacker. It virtually eliminates the threat of
+ impersonation by never sending a user's credentials in cleartext over the network. For more information on Kerberos, visit
+ the <a class="xref" href="https://web.mit.edu/kerberos/" target="_blank">MIT Kerberos website</a>.
+ </p>
+
+ <p class="p">
+ The rest of this topic assumes you have a working <a class="xref" href="https://web.mit.edu/kerberos/krb5-latest/doc/admin/install_kdc.html" target="_blank">Kerberos Key Distribution Center (KDC)</a>
+ set up. To enable Kerberos, you first create a Kerberos principal for each host running
+ <span class="keyword cmdname">impalad</span> or <span class="keyword cmdname">statestored</span>.
+ </p>
+
+ <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+ Regardless of the authentication mechanism used, Impala always creates HDFS directories and data files
+ owned by the same user (typically <code class="ph codeph">impala</code>). To implement user-level access to different
+ databases, tables, columns, partitions, and so on, use the Sentry authorization feature, as explained in
+ <a class="xref" href="../shared/../topics/impala_authorization.html#authorization">Enabling Sentry Authorization for Impala</a>.
+ </div>
+
+ <p class="p">
+ An alternative form of authentication you can use is LDAP, described in <a class="xref" href="impala_ldap.html#ldap">Enabling LDAP Authentication for Impala</a>.
+ </p>
+
+ <p class="p toc inpage"></p>
+
+ </div>
+
+ <nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_authentication.html">Impala Authentication</a></div></div></nav><article class="topic concept nested1" aria-labelledby="ariaid-title2" id="kerberos__kerberos_prereqs">
+
+ <h2 class="title topictitle2" id="ariaid-title2">Requirements for Using Impala with Kerberos</h2>
+
+
+ <div class="body conbody">
+
+ <div class="p">
+ On version 5 of Red Hat Enterprise Linux and comparable distributions, some additional setup is needed for
+ the <span class="keyword cmdname">impala-shell</span> interpreter to connect to a Kerberos-enabled Impala cluster:
+<pre class="pre codeblock"><code>sudo yum install python-devel openssl-devel python-pip
+sudo pip-python install ssl</code></pre>
+ </div>
+
+ <div class="note important note_important"><span class="note__title importanttitle">Important:</span>
+ <p class="p">
+ If you plan to use Impala in your cluster, you must configure your KDC to allow tickets to be renewed,
+ and you must configure <span class="ph filepath">krb5.conf</span> to request renewable tickets. Typically, you can do
+ this by adding the <code class="ph codeph">max_renewable_life</code> setting to your realm in
+ <span class="ph filepath">kdc.conf</span>, and by adding the <span class="ph filepath">renew_lifetime</span> parameter to the
+ <span class="ph filepath">libdefaults</span> section of <span class="ph filepath">krb5.conf</span>. For more information about
+ renewable tickets, see the
+ <a class="xref" href="http://web.mit.edu/Kerberos/krb5-1.8/" target="_blank"> Kerberos
+ documentation</a>.
+ </p>
+ <p class="p">
+ Currently, you cannot use the resource management feature on a cluster that has Kerberos
+ authentication enabled.
+ </p>
+ </div>
+
+ <p class="p">
+ Start all <span class="keyword cmdname">impalad</span> and <span class="keyword cmdname">statestored</span> daemons with the
+ <code class="ph codeph">--principal</code> and <code class="ph codeph">--keytab-file</code> flags set to the principal and full path
+ name of the <code class="ph codeph">keytab</code> file containing the credentials for the principal.
+ </p>
+
+ <p class="p">
+ To enable Kerberos in the Impala shell, start the <span class="keyword cmdname">impala-shell</span> command using the
+ <code class="ph codeph">-k</code> flag.
+ </p>
+
+ <p class="p">
+ To enable Impala to work with Kerberos security on your Hadoop cluster, make sure you perform the
+ installation and configuration steps in
+ <a class="xref" href="https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SecureMode.html#Authentication" target="_blank">Authentication in Hadoop</a>.
+ Note that when Kerberos security is enabled in Impala, a web browser that
+ supports Kerberos HTTP SPNEGO is required to access the Impala web console (for example, Firefox, Internet
+ Explorer, or Chrome).
+ </p>
+
+ <p class="p">
+ If the NameNode, Secondary NameNode, DataNode, JobTracker, TaskTrackers, ResourceManager, NodeManagers,
+ HttpFS, Oozie, Impala, or Impala statestore services are configured to use Kerberos HTTP SPNEGO
+ authentication, and two or more of these services are running on the same host, then all of the running
+ services must use the same HTTP principal and keytab file used for their HTTP endpoints.
+ </p>
+ </div>
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title3" id="kerberos__kerberos_config">
+
+ <h2 class="title topictitle2" id="ariaid-title3">Configuring Impala to Support Kerberos Security</h2>
+
+
+ <div class="body conbody">
+
+ <p class="p">
+ Enabling Kerberos authentication for Impala involves steps that can be summarized as follows:
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ Creating service principals for Impala and the HTTP service. Principal names take the form:
+ <code class="ph codeph"><var class="keyword varname">serviceName</var>/<var class="keyword varname">fully.qualified.domain.name</var>@<var class="keyword varname">KERBEROS.REALM</var></code>.
+ <p class="p">
+ In Impala 2.0 and later, <code class="ph codeph">user()</code> returns the full Kerberos principal string, such as
+ <code class="ph codeph">user@example.com</code>, in a Kerberized environment.
+ </p>
+ </li>
+
+ <li class="li">
+ Creating, merging, and distributing key tab files for these principals.
+ </li>
+
+ <li class="li">
+ Editing <code class="ph codeph">/etc/default/impala</code>
+ to accommodate Kerberos authentication.
+ </li>
+ </ul>
+ </div>
+
+ <article class="topic concept nested2" aria-labelledby="ariaid-title4" id="kerberos_config__kerberos_setup">
+
+ <h3 class="title topictitle3" id="ariaid-title4">Enabling Kerberos for Impala</h3>
+
+ <div class="body conbody">
+
+
+
+ <ol class="ol">
+ <li class="li">
+ Create an Impala service principal, specifying the name of the OS user that the Impala daemons run
+ under, the fully qualified domain name of each node running <span class="keyword cmdname">impalad</span>, and the realm
+ name. For example:
+<pre class="pre codeblock"><code>$ kadmin
+kadmin: addprinc -requires_preauth -randkey impala/impala_host.example.com@TEST.EXAMPLE.COM</code></pre>
+ </li>
+
+ <li class="li">
+ Create an HTTP service principal. For example:
+<pre class="pre codeblock"><code>kadmin: addprinc -randkey HTTP/impala_host.example.com@TEST.EXAMPLE.COM</code></pre>
+ <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+ The <code class="ph codeph">HTTP</code> component of the service principal must be uppercase as shown in the
+ preceding example.
+ </div>
+ </li>
+
+ <li class="li">
+ Create <code class="ph codeph">keytab</code> files with both principals. For example:
+<pre class="pre codeblock"><code>kadmin: xst -k impala.keytab impala/impala_host.example.com
+kadmin: xst -k http.keytab HTTP/impala_host.example.com
+kadmin: quit</code></pre>
+ </li>
+
+ <li class="li">
+ Use <code class="ph codeph">ktutil</code> to read the contents of the two keytab files and then write those contents
+ to a new file. For example:
+<pre class="pre codeblock"><code>$ ktutil
+ktutil: rkt impala.keytab
+ktutil: rkt http.keytab
+ktutil: wkt impala-http.keytab
+ktutil: quit</code></pre>
+ </li>
+
+ <li class="li">
+ (Optional) Test that credentials in the merged keytab file are valid, and that the <span class="q">"renew until"</span>
+ date is in the future. For example:
+<pre class="pre codeblock"><code>$ klist -e -k -t impala-http.keytab</code></pre>
+ </li>
+
+ <li class="li">
+ Copy the <span class="ph filepath">impala-http.keytab</span> file to the Impala configuration directory. Change the
+ permissions to be only read for the file owner and change the file owner to the <code class="ph codeph">impala</code>
+ user. By default, the Impala user and group are both named <code class="ph codeph">impala</code>. For example:
+<pre class="pre codeblock"><code>$ cp impala-http.keytab /etc/impala/conf
+$ cd /etc/impala/conf
+$ chmod 400 impala-http.keytab
+$ chown impala:impala impala-http.keytab</code></pre>
+ </li>
+
+ <li class="li">
+ Add Kerberos options to the Impala defaults file, <span class="ph filepath">/etc/default/impala</span>. Add the
+ options for both the <span class="keyword cmdname">impalad</span> and <span class="keyword cmdname">statestored</span> daemons, using the
+ <code class="ph codeph">IMPALA_SERVER_ARGS</code> and <code class="ph codeph">IMPALA_STATE_STORE_ARGS</code> variables. For
+ example, you might add:
+
+<pre class="pre codeblock"><code>-kerberos_reinit_interval=60
+-principal=impala_1/impala_host.example.com@TEST.EXAMPLE.COM
+-keytab_file=<var class="keyword varname">/path/to/impala.keytab</var></code></pre>
+ <p class="p">
+ For more information on changing the Impala defaults specified in
+ <span class="ph filepath">/etc/default/impala</span>, see
+ <a class="xref" href="impala_config_options.html#config_options">Modifying Impala Startup
+ Options</a>.
+ </p>
+ </li>
+ </ol>
+
+ <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+ Restart <span class="keyword cmdname">impalad</span> and <span class="keyword cmdname">statestored</span> for these configuration changes to
+ take effect.
+ </div>
+ </div>
+ </article>
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title5" id="kerberos__kerberos_proxy">
+
+ <h2 class="title topictitle2" id="ariaid-title5">Enabling Kerberos for Impala with a Proxy Server</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ A common configuration for Impala with High Availability is to use a proxy server to submit requests to the
+ actual <span class="keyword cmdname">impalad</span> daemons on different hosts in the cluster. This configuration avoids
+ connection problems in case of machine failure, because the proxy server can route new requests through one
+ of the remaining hosts in the cluster. This configuration also helps with load balancing, because the
+ additional overhead of being the <span class="q">"coordinator node"</span> for each query is spread across multiple hosts.
+ </p>
+
+ <p class="p">
+ Although you can set up a proxy server with or without Kerberos authentication, typically users set up a
+ secure Kerberized configuration. For information about setting up a proxy server for Impala, including
+ Kerberos-specific steps, see <a class="xref" href="impala_proxy.html#proxy">Using Impala through a Proxy for High Availability</a>.
+ </p>
+ </div>
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title6" id="kerberos__spnego">
+
+ <h2 class="title topictitle2" id="ariaid-title6">Using a Web Browser to Access a URL Protected by Kerberos HTTP SPNEGO</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ Your web browser must support Kerberos HTTP SPNEGO. For example, Chrome, Firefox, or Internet Explorer.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">To configure Firefox to access a URL protected by Kerberos HTTP SPNEGO:</strong>
+ </p>
+
+ <ol class="ol">
+ <li class="li">
+ Open the advanced settings Firefox configuration page by loading the <code class="ph codeph">about:config</code> page.
+ </li>
+
+ <li class="li">
+ Use the <strong class="ph b">Filter</strong> text box to find <code class="ph codeph">network.negotiate-auth.trusted-uris</code>.
+ </li>
+
+ <li class="li">
+ Double-click the <code class="ph codeph">network.negotiate-auth.trusted-uris</code> preference and enter the hostname
+ or the domain of the web server that is protected by Kerberos HTTP SPNEGO. Separate multiple domains and
+ hostnames with a comma.
+ </li>
+
+ <li class="li">
+ Click <strong class="ph b">OK</strong>.
+ </li>
+ </ol>
+ </div>
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title7" id="kerberos__kerberos_delegation">
+ <h2 class="title topictitle2" id="ariaid-title7">Enabling Impala Delegation for Kerberos Users</h2>
+ <div class="body conbody">
+ <p class="p">
+ See <a class="xref" href="impala_delegation.html#delegation">Configuring Impala Delegation for Hue and BI Tools</a> for details about the delegation feature
+ that lets certain users submit queries using the credentials of other users.
+ </p>
+ </div>
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title8" id="kerberos__ssl_jdbc_odbc">
+ <h2 class="title topictitle2" id="ariaid-title8">Using TLS/SSL with Business Intelligence Tools</h2>
+ <div class="body conbody">
+ <p class="p">
+ You can use Kerberos authentication, TLS/SSL encryption, or both to secure
+ connections from JDBC and ODBC applications to Impala.
+ See <a class="xref" href="impala_jdbc.html#impala_jdbc">Configuring Impala to Work with JDBC</a> and <a class="xref" href="impala_odbc.html#impala_odbc">Configuring Impala to Work with ODBC</a>
+ for details.
+ </p>
+
+ <p class="p">
+ Prior to <span class="keyword">Impala 2.5</span>, the Hive JDBC driver did not support connections that use both Kerberos authentication
+ and SSL encryption. If your cluster is running an older release that has this restriction,
+ use an alternative JDBC driver that supports
+ both of these security features.
+ </p>
+ </div>
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title9" id="kerberos__whitelisting_internal_apis">
+ <h2 class="title topictitle2" id="ariaid-title9">Enabling Access to Internal Impala APIs for Kerberos Users</h2>
+ <div class="body conbody">
+
+ <p class="p">
+ For applications that need direct access
+ to Impala APIs, without going through the HiveServer2 or Beeswax interfaces, you can
+ specify a list of Kerberos users who are allowed to call those APIs. By default, the
+ <code class="ph codeph">impala</code> and <code class="ph codeph">hdfs</code> users are the only ones authorized
+ for this kind of access.
+ Any users not explicitly authorized through the <code class="ph codeph">internal_principals_whitelist</code>
+ configuration setting are blocked from accessing the APIs. This setting applies to all the
+ Impala-related daemons, although currently it is primarily used for HDFS to control the
+ behavior of the catalog server.
+ </p>
+ </div>
+
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title10" id="kerberos__auth_to_local">
+ <h2 class="title topictitle2" id="ariaid-title10">Mapping Kerberos Principals to Short Names for Impala</h2>
+ <div class="body conbody">
+ <div class="p">
+ In <span class="keyword">Impala 2.6</span> and higher, Impala recognizes the <code class="ph codeph">auth_to_local</code> setting,
+ specified through the HDFS configuration setting
+ <code class="ph codeph">hadoop.security.auth_to_local</code>.
+ This feature is disabled by default, to avoid an unexpected change in security-related behavior.
+ To enable it:
+ <ul class="ul">
+ <li class="li">
+ <p class="p">
+ Specify <code class="ph codeph">--load_auth_to_local_rules=true</code>
+ in the <span class="keyword cmdname">impalad</span> and <span class="keyword cmdname">catalogd</span> configuration settings.
+ </p>
+ </li>
+ </ul>
+ </div>
+ </div>
+ </article>
+
+</article></main></body></html>
\ No newline at end of file