You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@impala.apache.org by mi...@apache.org on 2018/05/09 21:10:36 UTC

[27/51] [partial] impala git commit: [DOCS] Impala doc site update for 3.0

http://git-wip-us.apache.org/repos/asf/impala/blob/fae51ec2/docs/build3x/html/topics/impala_invalidate_metadata.html
----------------------------------------------------------------------
diff --git a/docs/build3x/html/topics/impala_invalidate_metadata.html b/docs/build3x/html/topics/impala_invalidate_metadata.html
new file mode 100644
index 0000000..ae7e419
--- /dev/null
+++ b/docs/build3x/html/topics/impala_invalidate_metadata.html
@@ -0,0 +1,286 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2018"><meta name="DC.rights.owner" content="(C) Copyright 2018"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_langref_sql.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="invalidate_metadata"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>INVALIDATE METADATA Statement</title></head><body id="invalidate_metadata"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">INVALIDATE METADATA Statement</h1>
+
+
+
+  <div class="body conbody">
+
+    <p class="p">
+
+      Marks the metadata for one or all tables as stale. Required after a table is created through the Hive shell,
+      before the table is available for Impala queries. The next time the current Impala node performs a query
+      against a table whose metadata is invalidated, Impala reloads the associated metadata before the query
+      proceeds. This is a relatively expensive operation compared to the incremental metadata update done by the
+      <code class="ph codeph">REFRESH</code> statement, so in the common scenario of adding new data files to an existing table,
+      prefer <code class="ph codeph">REFRESH</code> rather than <code class="ph codeph">INVALIDATE METADATA</code>. If you are not familiar
+      with the way Impala uses metadata and how it shares the same metastore database as Hive, see
+      <a class="xref" href="impala_hadoop.html#intro_metastore">Overview of Impala Metadata and the Metastore</a> for background information.
+    </p>
+
+    <p class="p">
+        <strong class="ph b">Syntax:</strong>
+      </p>
+
+<pre class="pre codeblock"><code>INVALIDATE METADATA [[<var class="keyword varname">db_name</var>.]<var class="keyword varname">table_name</var>]</code></pre>
+
+    <p class="p">
+      By default, the cached metadata for all tables is flushed. If you specify a table name, only the metadata for
+      that one table is flushed. Even for a single table, <code class="ph codeph">INVALIDATE METADATA</code> is more expensive
+      than <code class="ph codeph">REFRESH</code>, so prefer <code class="ph codeph">REFRESH</code> in the common case where you add new data
+      files for an existing table.
+    </p>
+
+    <p class="p">
+        <strong class="ph b">Internal details:</strong>
+      </p>
+
+    <p class="p">
+      To accurately respond to queries, Impala must have current metadata about those databases and tables that
+      clients query directly. Therefore, if some other entity modifies information used by Impala in the metastore
+      that Impala and Hive share, the information cached by Impala must be updated. However, this does not mean
+      that all metadata updates require an Impala update.
+    </p>
+
+    <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+      <p class="p">
+        In Impala 1.2.4 and higher, you can specify a table name with <code class="ph codeph">INVALIDATE METADATA</code> after
+        the table is created in Hive, allowing you to make individual tables visible to Impala without doing a full
+        reload of the catalog metadata. Impala 1.2.4 also includes other changes to make the metadata broadcast
+        mechanism faster and more responsive, especially during Impala startup. See
+        <a class="xref" href="../shared/../topics/impala_new_features.html#new_features_124">New Features in Impala 1.2.4</a> for details.
+      </p>
+      <p class="p">
+        In Impala 1.2 and higher, a dedicated daemon (<span class="keyword cmdname">catalogd</span>) broadcasts DDL changes made
+        through Impala to all Impala nodes. Formerly, after you created a database or table while connected to one
+        Impala node, you needed to issue an <code class="ph codeph">INVALIDATE METADATA</code> statement on another Impala node
+        before accessing the new database or table from the other node. Now, newly created or altered objects are
+        picked up automatically by all Impala nodes. You must still use the <code class="ph codeph">INVALIDATE METADATA</code>
+        technique after creating or altering objects through Hive. See
+        <a class="xref" href="impala_components.html#intro_catalogd">The Impala Catalog Service</a> for more information on the catalog service.
+      </p>
+      <p class="p">
+        The <code class="ph codeph">INVALIDATE METADATA</code> statement is new in Impala 1.1 and higher, and takes over some of
+        the use cases of the Impala 1.0 <code class="ph codeph">REFRESH</code> statement. Because <code class="ph codeph">REFRESH</code> now
+        requires a table name parameter, to flush the metadata for all tables at once, use the <code class="ph codeph">INVALIDATE
+        METADATA</code> statement.
+      </p>
+      <p class="p">
+      Because <code class="ph codeph">REFRESH <var class="keyword varname">table_name</var></code> only works for tables that the current
+      Impala node is already aware of, when you create a new table in the Hive shell, enter
+      <code class="ph codeph">INVALIDATE METADATA <var class="keyword varname">new_table</var></code> before you can see the new table in
+      <span class="keyword cmdname">impala-shell</span>. Once the table is known by Impala, you can issue <code class="ph codeph">REFRESH
+      <var class="keyword varname">table_name</var></code> after you add data files for that table.
+    </p>
+    </div>
+
+    <p class="p">
+      <code class="ph codeph">INVALIDATE METADATA</code> and <code class="ph codeph">REFRESH</code> are counterparts: <code class="ph codeph">INVALIDATE
+      METADATA</code> waits to reload the metadata when needed for a subsequent query, but reloads all the
+      metadata for the table, which can be an expensive operation, especially for large tables with many
+      partitions. <code class="ph codeph">REFRESH</code> reloads the metadata immediately, but only loads the block location
+      data for newly added data files, making it a less expensive operation overall. If data was altered in some
+      more extensive way, such as being reorganized by the HDFS balancer, use <code class="ph codeph">INVALIDATE
+      METADATA</code> to avoid a performance penalty from reduced local reads. If you used Impala version 1.0,
+      the <code class="ph codeph">INVALIDATE METADATA</code> statement works just like the Impala 1.0 <code class="ph codeph">REFRESH</code>
+      statement did, while the Impala 1.1 <code class="ph codeph">REFRESH</code> is optimized for the common use case of adding
+      new data files to an existing table, thus the table name argument is now required.
+    </p>
+
+    <p class="p">
+        <strong class="ph b">Usage notes:</strong>
+      </p>
+
+    <p class="p">
+      A metadata update for an <code class="ph codeph">impalad</code> instance <strong class="ph b">is</strong> required if:
+    </p>
+
+    <ul class="ul">
+      <li class="li">
+        A metadata change occurs.
+      </li>
+
+      <li class="li">
+        <strong class="ph b">and</strong> the change is made from another <code class="ph codeph">impalad</code> instance in your cluster, or through
+        Hive.
+      </li>
+
+      <li class="li">
+        <strong class="ph b">and</strong> the change is made to a metastore database to which clients such as the Impala shell or ODBC directly
+        connect.
+      </li>
+    </ul>
+
+    <p class="p">
+      A metadata update for an Impala node is <strong class="ph b">not</strong> required when you issue queries from the same Impala node
+      where you ran <code class="ph codeph">ALTER TABLE</code>, <code class="ph codeph">INSERT</code>, or other table-modifying statement.
+    </p>
+
+    <p class="p">
+      Database and table metadata is typically modified by:
+    </p>
+
+    <ul class="ul">
+      <li class="li">
+        Hive - via <code class="ph codeph">ALTER</code>, <code class="ph codeph">CREATE</code>, <code class="ph codeph">DROP</code> or
+        <code class="ph codeph">INSERT</code> operations.
+      </li>
+
+      <li class="li">
+        Impalad - via <code class="ph codeph">CREATE TABLE</code>, <code class="ph codeph">ALTER TABLE</code>, and <code class="ph codeph">INSERT</code>
+        operations.
+      </li>
+    </ul>
+
+    <p class="p">
+      <code class="ph codeph">INVALIDATE METADATA</code> causes the metadata for that table to be marked as stale, and reloaded
+      the next time the table is referenced. For a huge table, that process could take a noticeable amount of time;
+      thus you might prefer to use <code class="ph codeph">REFRESH</code> where practical, to avoid an unpredictable delay later,
+      for example if the next reference to the table is during a benchmark test.
+    </p>
+
+    <p class="p">
+        <strong class="ph b">Examples:</strong>
+      </p>
+
+    <p class="p">
+      The following example shows how you might use the <code class="ph codeph">INVALIDATE METADATA</code> statement after
+      creating new tables (such as SequenceFile or HBase tables) through the Hive shell. Before the
+      <code class="ph codeph">INVALIDATE METADATA</code> statement was issued, Impala would give a <span class="q">"table not found"</span> error
+      if you tried to refer to those table names. The <code class="ph codeph">DESCRIBE</code> statements cause the latest
+      metadata to be immediately loaded for the tables, avoiding a delay the next time those tables are queried.
+    </p>
+
+<pre class="pre codeblock"><code>[impalad-host:21000] &gt; invalidate metadata;
+[impalad-host:21000] &gt; describe t1;
+...
+[impalad-host:21000] &gt; describe t2;
+... </code></pre>
+
+    <p class="p">
+      For more examples of using <code class="ph codeph">REFRESH</code> and <code class="ph codeph">INVALIDATE METADATA</code> with a
+      combination of Impala and Hive operations, see <a class="xref" href="impala_tutorial.html#tutorial_impala_hive">Switching Back and Forth Between Impala and Hive</a>.
+    </p>
+
+    <p class="p">
+        <strong class="ph b">HDFS permissions:</strong>
+      </p>
+    <p class="p">
+      The user ID that the <span class="keyword cmdname">impalad</span> daemon runs under,
+      typically the <code class="ph codeph">impala</code> user, must have execute
+      permissions for all the relevant directories holding table data.
+      (A table could have data spread across multiple directories,
+      or in unexpected paths, if it uses partitioning or
+      specifies a <code class="ph codeph">LOCATION</code> attribute for
+      individual partitions or the entire table.)
+      Issues with permissions might not cause an immediate error for this statement,
+      but subsequent statements such as <code class="ph codeph">SELECT</code>
+      or <code class="ph codeph">SHOW TABLE STATS</code> could fail.
+    </p>
+
+    <p class="p">
+        <strong class="ph b">HDFS considerations:</strong>
+      </p>
+
+    <p class="p">
+      By default, the <code class="ph codeph">INVALIDATE METADATA</code> command checks HDFS permissions of the underlying data
+      files and directories, caching this information so that a statement can be cancelled immediately if for
+      example the <code class="ph codeph">impala</code> user does not have permission to write to the data directory for the
+      table. (This checking does not apply when the <span class="keyword cmdname">catalogd</span> configuration option
+      <code class="ph codeph">--load_catalog_in_background</code> is set to <code class="ph codeph">false</code>, which it is by default.)
+      Impala reports any lack of write permissions as an <code class="ph codeph">INFO</code> message in the log file, in case
+      that represents an oversight. If you change HDFS permissions to make data readable or writeable by the Impala
+      user, issue another <code class="ph codeph">INVALIDATE METADATA</code> to make Impala aware of the change.
+    </p>
+
+    <p class="p">
+        <strong class="ph b">Usage notes:</strong>
+      </p>
+
+    <p class="p">
+      This example illustrates creating a new database and new table in Hive, then doing an <code class="ph codeph">INVALIDATE
+      METADATA</code> statement in Impala using the fully qualified table name, after which both the new table
+      and the new database are visible to Impala. The ability to specify <code class="ph codeph">INVALIDATE METADATA
+      <var class="keyword varname">table_name</var></code> for a table created in Hive is a new capability in Impala 1.2.4. In
+      earlier releases, that statement would have returned an error indicating an unknown table, requiring you to
+      do <code class="ph codeph">INVALIDATE METADATA</code> with no table name, a more expensive operation that reloaded metadata
+      for all tables and databases.
+    </p>
+
+<pre class="pre codeblock"><code>$ hive
+hive&gt; create database new_db_from_hive;
+OK
+Time taken: 4.118 seconds
+hive&gt; create table new_db_from_hive.new_table_from_hive (x int);
+OK
+Time taken: 0.618 seconds
+hive&gt; quit;
+$ impala-shell
+[localhost:21000] &gt; show databases like 'new*';
+[localhost:21000] &gt; refresh new_db_from_hive.new_table_from_hive;
+ERROR: AnalysisException: Database does not exist: new_db_from_hive
+[localhost:21000] &gt; invalidate metadata new_db_from_hive.new_table_from_hive;
+[localhost:21000] &gt; show databases like 'new*';
++--------------------+
+| name               |
++--------------------+
+| new_db_from_hive   |
++--------------------+
+[localhost:21000] &gt; show tables in new_db_from_hive;
++---------------------+
+| name                |
++---------------------+
+| new_table_from_hive |
++---------------------+</code></pre>
+
+    <p class="p">
+        <strong class="ph b">Amazon S3 considerations:</strong>
+      </p>
+    <p class="p">
+        The <code class="ph codeph">REFRESH</code> and <code class="ph codeph">INVALIDATE METADATA</code> statements also cache metadata
+        for tables where the data resides in the Amazon Simple Storage Service (S3).
+        In particular, issue a <code class="ph codeph">REFRESH</code> for a table after adding or removing files
+        in the associated S3 data directory.
+        See <a class="xref" href="../shared/../topics/impala_s3.html#s3">Using Impala with the Amazon S3 Filesystem</a> for details about working with S3 tables.
+      </p>
+
+    <p class="p">
+        <strong class="ph b">Cancellation:</strong> Cannot be cancelled.
+      </p>
+
+    <p class="p">
+        <strong class="ph b">Kudu considerations:</strong>
+      </p>
+    <p class="p">
+        Much of the metadata for Kudu tables is handled by the underlying
+        storage layer. Kudu tables have less reliance on the metastore
+        database, and require less metadata caching on the Impala side.
+        For example, information about partitions in Kudu tables is managed
+        by Kudu, and Impala does not cache any block locality metadata
+        for Kudu tables.
+      </p>
+    <p class="p">
+        The <code class="ph codeph">REFRESH</code> and <code class="ph codeph">INVALIDATE METADATA</code>
+        statements are needed less frequently for Kudu tables than for
+        HDFS-backed tables. Neither statement is needed when data is
+        added to, removed, or updated in a Kudu table, even if the changes
+        are made directly to Kudu through a client program using the Kudu API.
+        Run <code class="ph codeph">REFRESH <var class="keyword varname">table_name</var></code> or
+        <code class="ph codeph">INVALIDATE METADATA <var class="keyword varname">table_name</var></code>
+        for a Kudu table only after making a change to the Kudu table schema,
+        such as adding or dropping a column, by a mechanism other than
+        Impala.
+      </p>
+
+    <p class="p">
+        <strong class="ph b">Related information:</strong>
+      </p>
+    <p class="p">
+      <a class="xref" href="impala_hadoop.html#intro_metastore">Overview of Impala Metadata and the Metastore</a>,
+      <a class="xref" href="impala_refresh.html#refresh">REFRESH Statement</a>
+    </p>
+
+  </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_langref_sql.html">Impala SQL Statements</a></div></div></nav></article></main></body></html>

http://git-wip-us.apache.org/repos/asf/impala/blob/fae51ec2/docs/build3x/html/topics/impala_isilon.html
----------------------------------------------------------------------
diff --git a/docs/build3x/html/topics/impala_isilon.html b/docs/build3x/html/topics/impala_isilon.html
new file mode 100644
index 0000000..b0a2a2a
--- /dev/null
+++ b/docs/build3x/html/topics/impala_isilon.html
@@ -0,0 +1,89 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2018"><meta name="DC.rights.owner" content="(C) Copyright 2018"><meta name="DC.Type" content="concept"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="impala_isilon"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Using Impala with Isilon Storage</title></head><body id="impala_isilon"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">Using Impala with Isilon Storage</h1>
+
+
+
+
+  <div class="body conbody">
+
+    <p class="p">
+
+      You can use Impala to query data files that reside on EMC Isilon storage devices, rather than in HDFS.
+      This capability allows convenient query access to a storage system where you might already be
+      managing large volumes of data. The combination of the Impala query engine and Isilon storage is
+      certified on <span class="keyword">Impala 2.2.4</span> or higher.
+    </p>
+
+    <div class="p">
+        Because the EMC Isilon storage devices use a global value for the block size
+        rather than a configurable value for each file, the <code class="ph codeph">PARQUET_FILE_SIZE</code>
+        query option has no effect when Impala inserts data into a table or partition
+        residing on Isilon storage. Use the <code class="ph codeph">isi</code> command to set the
+        default block size globally on the Isilon device. For example, to set the
+        Isilon default block size to 256 MB, the recommended size for Parquet
+        data files for Impala, issue the following command:
+<pre class="pre codeblock"><code>isi hdfs settings modify --default-block-size=256MB</code></pre>
+      </div>
+
+    <p class="p">
+      The typical use case for Impala and Isilon together is to use Isilon for the
+      default filesystem, replacing HDFS entirely. In this configuration,
+      when you create a database, table, or partition, the data always resides on
+      Isilon storage and you do not need to specify any special <code class="ph codeph">LOCATION</code>
+      attribute. If you do specify a <code class="ph codeph">LOCATION</code> attribute, its value refers
+      to a path within the Isilon filesystem.
+      For example:
+    </p>
+<pre class="pre codeblock"><code>-- If the default filesystem is Isilon, all Impala data resides there
+-- and all Impala databases and tables are located there.
+CREATE TABLE t1 (x INT, s STRING);
+
+-- You can specify LOCATION for database, table, or partition,
+-- using values from the Isilon filesystem.
+CREATE DATABASE d1 LOCATION '/some/path/on/isilon/server/d1.db';
+CREATE TABLE d1.t2 (a TINYINT, b BOOLEAN);
+</code></pre>
+
+    <p class="p">
+      Impala can write to, delete, and rename data files and database, table,
+      and partition directories on Isilon storage. Therefore, Impala statements such
+      as
+      <code class="ph codeph">CREATE TABLE</code>, <code class="ph codeph">DROP TABLE</code>,
+      <code class="ph codeph">CREATE DATABASE</code>, <code class="ph codeph">DROP DATABASE</code>,
+      <code class="ph codeph">ALTER TABLE</code>,
+      and
+      <code class="ph codeph">INSERT</code> work the same with Isilon storage as with HDFS.
+    </p>
+
+    <p class="p">
+      When the Impala spill-to-disk feature is activated by a query that approaches
+      the memory limit, Impala writes all the temporary data to a local (not Isilon)
+      storage device. Because the I/O bandwidth for the temporary data depends on
+      the number of local disks, and clusters using Isilon storage might not have
+      as many local disks attached, pay special attention on Isilon-enabled clusters
+      to any queries that use the spill-to-disk feature. Where practical, tune the
+      queries or allocate extra memory for Impala to avoid spilling.
+      Although you can specify an Isilon storage device as the destination for
+      the temporary data for the spill-to-disk feature, that configuration is
+      not recommended due to the need to transfer the data both ways using remote I/O.
+    </p>
+
+    <p class="p">
+      When tuning Impala queries on HDFS, you typically try to avoid any remote reads.
+      When the data resides on Isilon storage, all the I/O consists of remote reads.
+      Do not be alarmed when you see non-zero numbers for remote read measurements
+      in query profile output. The benefit of the Impala and Isilon integration is
+      primarily convenience of not having to move or copy large volumes of data to HDFS,
+      rather than raw query performance. You can increase the performance of Impala
+      I/O for Isilon systems by increasing the value for the
+      <code class="ph codeph">--num_remote_hdfs_io_threads</code> startup option for the
+      <span class="keyword cmdname">impalad</span> daemon.
+    </p>
+
+
+  </div>
+
+</article></main></body></html>

http://git-wip-us.apache.org/repos/asf/impala/blob/fae51ec2/docs/build3x/html/topics/impala_jdbc.html
----------------------------------------------------------------------
diff --git a/docs/build3x/html/topics/impala_jdbc.html b/docs/build3x/html/topics/impala_jdbc.html
new file mode 100644
index 0000000..33ed714
--- /dev/null
+++ b/docs/build3x/html/topics/impala_jdbc.html
@@ -0,0 +1,340 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2018"><meta name="DC.rights.owner" content="(C) Copyright 2018"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_config.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="impala_jdbc"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Configuring Impala to Work with JDBC</title></head><body id="impala_jdbc"><main role="main"><article role="article" aria-labelledby="impala_jdbc__jdbc">
+
+  <h1 class="title topictitle1" id="impala_jdbc__jdbc">Configuring Impala to Work with JDBC</h1>
+
+
+  <div class="body conbody">
+
+    <p class="p">
+
+      Impala supports the standard JDBC interface, allowing access from commercial Business Intelligence tools and
+      custom software written in Java or other programming languages. The JDBC driver allows you to access Impala
+      from a Java program that you write, or a Business Intelligence or similar tool that uses JDBC to communicate
+      with various database products.
+    </p>
+
+    <p class="p">
+      Setting up a JDBC connection to Impala involves the following steps:
+    </p>
+
+    <ul class="ul">
+      <li class="li">
+        Verifying the communication port where the Impala daemons in your cluster are listening for incoming JDBC
+        requests.
+      </li>
+
+      <li class="li">
+        Installing the JDBC driver on every system that runs the JDBC-enabled application.
+      </li>
+
+      <li class="li">
+        Specifying a connection string for the JDBC application to access one of the servers running the
+        <span class="keyword cmdname">impalad</span> daemon, with the appropriate security settings.
+      </li>
+    </ul>
+
+    <p class="p toc inpage"></p>
+  </div>
+
+  <nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_config.html">Managing Impala</a></div></div></nav><article class="topic concept nested1" aria-labelledby="ariaid-title2" id="impala_jdbc__jdbc_port">
+
+    <h2 class="title topictitle2" id="ariaid-title2">Configuring the JDBC Port</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        The default port used by JDBC 2.0 and later (as well as ODBC 2.x) is 21050. Impala server accepts JDBC
+        connections through this same port 21050 by default. Make sure this port is available for communication
+        with other hosts on your network, for example, that it is not blocked by firewall software. If your JDBC
+        client software connects to a different port, specify that alternative port number with the
+        <code class="ph codeph">--hs2_port</code> option when starting <code class="ph codeph">impalad</code>. See
+        <a class="xref" href="impala_processes.html#processes">Starting Impala</a> for details about Impala startup options. See
+        <a class="xref" href="impala_ports.html#ports">Ports Used by Impala</a> for information about all ports used for communication between Impala
+        and clients or between Impala components.
+      </p>
+    </div>
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title3" id="impala_jdbc__jdbc_driver_choice">
+
+    <h2 class="title topictitle2" id="ariaid-title3">Choosing the JDBC Driver</h2>
+
+
+    <div class="body conbody">
+
+      <p class="p">
+        In Impala 2.0 and later, you can use the Hive 0.13 JDBC driver.  If you are
+        already using JDBC applications with an earlier Impala release, you should update
+        your JDBC driver, because the Hive 0.12 driver that was formerly the only choice
+        is not compatible with Impala 2.0 and later.
+      </p>
+
+      <p class="p">
+        The Hive JDBC driver provides a substantial speed increase for JDBC
+        applications with Impala 2.0 and higher, for queries that return large result sets.
+      </p>
+
+      <p class="p">
+        <strong class="ph b">Complex type considerations:</strong>
+      </p>
+
+      <p class="p">
+        The Impala complex types (<code class="ph codeph">STRUCT</code>, <code class="ph codeph">ARRAY</code>, or <code class="ph codeph">MAP</code>)
+        are available in <span class="keyword">Impala 2.3</span> and higher.
+        To use these types with JDBC requires version 2.5.28 or higher of the JDBC Connector for Impala.
+        To use these types with ODBC requires version 2.5.30 or higher of the ODBC Connector for Impala.
+        Consider upgrading all JDBC and ODBC drivers at the same time you upgrade from <span class="keyword">Impala 2.3</span> or higher.
+      </p>
+      <p class="p">
+        Although the result sets from queries involving complex types consist of all scalar values,
+        the queries involve join notation and column references that might not be understood by
+        a particular JDBC or ODBC connector. Consider defining a view that represents the
+        flattened version of a table containing complex type columns, and pointing the JDBC
+        or ODBC application at the view.
+        See <a class="xref" href="../shared/../topics/impala_complex_types.html#complex_types">Complex Types (Impala 2.3 or higher only)</a> for details.
+      </p>
+
+    </div>
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title4" id="impala_jdbc__jdbc_setup">
+
+    <h2 class="title topictitle2" id="ariaid-title4">Enabling Impala JDBC Support on Client Systems</h2>
+
+
+    <div class="body conbody">
+
+      <section class="section" id="jdbc_setup__install_hive_driver"><h3 class="title sectiontitle">Using the Hive JDBC Driver</h3>
+
+        <p class="p">
+          You install the Hive JDBC driver (<code class="ph codeph">hive-jdbc</code> package) through the Linux package manager, on
+          hosts within the cluster. The driver consists of several Java JAR files. The same driver can be used by Impala and Hive.
+        </p>
+
+        <p class="p">
+          To get the JAR files, install the Hive JDBC driver on each host in the cluster that will run
+          JDBC applications.
+        </p>
+
+        <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+          The latest JDBC driver, corresponding to Hive 0.13, provides substantial performance improvements for
+          Impala queries that return large result sets. Impala 2.0 and later are compatible with the Hive 0.13
+          driver. If you already have an older JDBC driver installed, and are running Impala 2.0 or higher, consider
+          upgrading to the latest Hive JDBC driver for best performance with JDBC applications.
+        </div>
+
+        <p class="p">
+          If you are using JDBC-enabled applications on hosts outside the cluster, you cannot use the the same install
+          procedure on the hosts. Install the JDBC driver on at least one cluster host using the preceding
+          procedure. Then download the JAR files to each client machine that will use JDBC with Impala:
+        </p>
+
+  <pre class="pre codeblock"><code>commons-logging-X.X.X.jar
+  hadoop-common.jar
+  hive-common-X.XX.X.jar
+  hive-jdbc-X.XX.X.jar
+  hive-metastore-X.XX.X.jar
+  hive-service-X.XX.X.jar
+  httpclient-X.X.X.jar
+  httpcore-X.X.X.jar
+  libfb303-X.X.X.jar
+  libthrift-X.X.X.jar
+  log4j-X.X.XX.jar
+  slf4j-api-X.X.X.jar
+  slf4j-logXjXX-X.X.X.jar
+  </code></pre>
+
+        <p class="p">
+          <strong class="ph b">To enable JDBC support for Impala on the system where you run the JDBC application:</strong>
+        </p>
+
+        <ol class="ol">
+          <li class="li">
+            Download the JAR files listed above to each client machine.
+            <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+              For Maven users, see
+              <a class="xref" href="https://github.com/onefoursix/Cloudera-Impala-JDBC-Example" target="_blank">this sample github page</a> for an example of the
+              dependencies you could add to a <code class="ph codeph">pom</code> file instead of downloading the individual JARs.
+            </div>
+          </li>
+
+          <li class="li">
+            Store the JAR files in a location of your choosing, ideally a directory already referenced in your
+            <code class="ph codeph">CLASSPATH</code> setting. For example:
+            <ul class="ul">
+              <li class="li">
+                On Linux, you might use a location such as <code class="ph codeph">/opt/jars/</code>.
+              </li>
+
+              <li class="li">
+                On Windows, you might use a subdirectory underneath <span class="ph filepath">C:\Program Files</span>.
+              </li>
+            </ul>
+          </li>
+
+          <li class="li">
+            To successfully load the Impala JDBC driver, client programs must be able to locate the associated JAR
+            files. This often means setting the <code class="ph codeph">CLASSPATH</code> for the client process to include the
+            JARs. Consult the documentation for your JDBC client for more details on how to install new JDBC drivers,
+            but some examples of how to set <code class="ph codeph">CLASSPATH</code> variables include:
+            <ul class="ul">
+              <li class="li">
+                On Linux, if you extracted the JARs to <code class="ph codeph">/opt/jars/</code>, you might issue the following
+                command to prepend the JAR files path to an existing classpath:
+  <pre class="pre codeblock"><code>export CLASSPATH=/opt/jars/*.jar:$CLASSPATH</code></pre>
+              </li>
+
+              <li class="li">
+                On Windows, use the <strong class="ph b">System Properties</strong> control panel item to modify the <strong class="ph b">Environment
+                Variables</strong> for your system. Modify the environment variables to include the path to which you
+                extracted the files.
+                <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+                  If the existing <code class="ph codeph">CLASSPATH</code> on your client machine refers to some older version of
+                  the Hive JARs, ensure that the new JARs are the first ones listed. Either put the new JAR files
+                  earlier in the listings, or delete the other references to Hive JAR files.
+                </div>
+              </li>
+            </ul>
+          </li>
+        </ol>
+      </section>
+
+    </div>
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title5" id="impala_jdbc__jdbc_connect">
+
+    <h2 class="title topictitle2" id="ariaid-title5">Establishing JDBC Connections</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        The JDBC driver class depends on which driver you select.
+      </p>
+
+      <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+        If your JDBC or ODBC application connects to Impala through a load balancer such as
+        <code class="ph codeph">haproxy</code>, be cautious about reusing the connections. If the load balancer has set up
+        connection timeout values, either check the connection frequently so that it never sits idle longer than
+        the load balancer timeout value, or check the connection validity before using it and create a new one if
+        the connection has been closed.
+      </div>
+
+      <section class="section" id="jdbc_connect__class_hive_driver"><h3 class="title sectiontitle">Using the Hive JDBC Driver</h3>
+
+
+      <p class="p">
+        For example, with the Hive JDBC driver, the class name is <code class="ph codeph">org.apache.hive.jdbc.HiveDriver</code>.
+        Once you have configured Impala to work with JDBC, you can establish connections between the two.
+        To do so for a cluster that does not use
+        Kerberos authentication, use a connection string of the form
+        <code class="ph codeph">jdbc:hive2://<var class="keyword varname">host</var>:<var class="keyword varname">port</var>/;auth=noSasl</code>.
+
+        For example, you might use:
+      </p>
+
+<pre class="pre codeblock"><code>jdbc:hive2://myhost.example.com:21050/;auth=noSasl</code></pre>
+
+      <p class="p">
+        To connect to an instance of Impala that requires Kerberos authentication, use a connection string of the
+        form
+        <code class="ph codeph">jdbc:hive2://<var class="keyword varname">host</var>:<var class="keyword varname">port</var>/;principal=<var class="keyword varname">principal_name</var></code>.
+        The principal must be the same user principal you used when starting Impala. For example, you might use:
+      </p>
+
+<pre class="pre codeblock"><code>jdbc:hive2://myhost.example.com:21050/;principal=impala/myhost.example.com@H2.EXAMPLE.COM</code></pre>
+
+      <p class="p">
+        To connect to an instance of Impala that requires LDAP authentication, use a connection string of the form
+        <code class="ph codeph">jdbc:hive2://<var class="keyword varname">host</var>:<var class="keyword varname">port</var>/<var class="keyword varname">db_name</var>;user=<var class="keyword varname">ldap_userid</var>;password=<var class="keyword varname">ldap_password</var></code>.
+        For example, you might use:
+      </p>
+
+<pre class="pre codeblock"><code>jdbc:hive2://myhost.example.com:21050/test_db;user=fred;password=xyz123</code></pre>
+
+      <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+          <p class="p">
+        Prior to <span class="keyword">Impala 2.5</span>, the Hive JDBC driver did not support connections that use both Kerberos authentication
+        and SSL encryption. If your cluster is running an older release that has this restriction,
+        use an alternative JDBC driver that supports
+        both of these security features.
+      </p>
+      </div>
+
+      </section>
+
+    </div>
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title6" id="impala_jdbc__jdbc_odbc_notes">
+    <h2 class="title topictitle2" id="ariaid-title6">Notes about JDBC and ODBC Interaction with Impala SQL Features</h2>
+    <div class="body conbody">
+      <p class="p">
+        Most Impala SQL features work equivalently through the <span class="keyword cmdname">impala-shell</span> interpreter
+        of the JDBC or ODBC APIs. The following are some exceptions to keep in mind when switching between
+        the interactive shell and applications using the APIs:
+      </p>
+      <ul class="ul">
+        <li class="li">
+          <p class="p">
+        <strong class="ph b">Complex type considerations:</strong>
+      </p>
+          <ul class="ul">
+          <li class="li">
+          <p class="p">
+            Queries involving the complex types (<code class="ph codeph">ARRAY</code>, <code class="ph codeph">STRUCT</code>, and <code class="ph codeph">MAP</code>)
+            require notation that might not be available in all levels of JDBC and ODBC drivers.
+            If you have trouble querying such a table due to the driver level or
+            inability to edit the queries used by the application, you can create a view that exposes
+            a <span class="q">"flattened"</span> version of the complex columns and point the application at the view.
+            See <a class="xref" href="impala_complex_types.html#complex_types">Complex Types (Impala 2.3 or higher only)</a> for details.
+          </p>
+        </li>
+        <li class="li">
+          <p class="p">
+            The complex types available in <span class="keyword">Impala 2.3</span> and higher are supported by the
+            JDBC <code class="ph codeph">getColumns()</code> API.
+            Both <code class="ph codeph">MAP</code> and <code class="ph codeph">ARRAY</code> are reported as the JDBC SQL Type <code class="ph codeph">ARRAY</code>,
+            because this is the closest matching Java SQL type. This behavior is consistent with Hive.
+            <code class="ph codeph">STRUCT</code> types are reported as the JDBC SQL Type <code class="ph codeph">STRUCT</code>.
+          </p>
+          <div class="p">
+            To be consistent with Hive's behavior, the TYPE_NAME field is populated
+            with the primitive type name for scalar types, and with the full <code class="ph codeph">toSql()</code>
+            for complex types. The resulting type names are somewhat inconsistent,
+            because nested types are printed differently than top-level types. For example,
+            the following list shows how <code class="ph codeph">toSQL()</code> for Impala types are
+            translated to <code class="ph codeph">TYPE_NAME</code> values:
+<pre class="pre codeblock"><code>DECIMAL(10,10)         becomes  DECIMAL
+CHAR(10)               becomes  CHAR
+VARCHAR(10)            becomes  VARCHAR
+ARRAY&lt;DECIMAL(10,10)&gt;  becomes  ARRAY&lt;DECIMAL(10,10)&gt;
+ARRAY&lt;CHAR(10)&gt;        becomes  ARRAY&lt;CHAR(10)&gt;
+ARRAY&lt;VARCHAR(10)&gt;     becomes  ARRAY&lt;VARCHAR(10)&gt;
+
+</code></pre>
+          </div>
+          </li>
+        </ul>
+        </li>
+      </ul>
+    </div>
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title7" id="impala_jdbc__jdbc_kudu">
+    <h2 class="title topictitle2" id="ariaid-title7">Kudu Considerations for DML Statements</h2>
+    <div class="body conbody">
+      <p class="p">
+        Currently, Impala <code class="ph codeph">INSERT</code>, <code class="ph codeph">UPDATE</code>, or
+        other DML statements issued through the JDBC interface against a Kudu
+        table do not return JDBC error codes for conditions such as duplicate
+        primary key columns. Therefore, for applications that issue a high
+        volume of DML statements, prefer to use the Kudu Java API directly
+        rather than a JDBC application.
+      </p>
+    </div>
+  </article>
+
+</article></main></body></html>

http://git-wip-us.apache.org/repos/asf/impala/blob/fae51ec2/docs/build3x/html/topics/impala_joins.html
----------------------------------------------------------------------
diff --git a/docs/build3x/html/topics/impala_joins.html b/docs/build3x/html/topics/impala_joins.html
new file mode 100644
index 0000000..51ccf6b
--- /dev/null
+++ b/docs/build3x/html/topics/impala_joins.html
@@ -0,0 +1,531 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2018"><meta name="DC.rights.owner" content="(C) Copyright 2018"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_select.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="joins"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Joins in Impala SELECT Statements</title></head><body id="joins"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">Joins in Impala SELECT Statements</h1>
+
+
+
+  <div class="body conbody">
+
+    <p class="p">
+
+      A join query is a <code class="ph codeph">SELECT</code> statement that combines data from two or more tables,
+      and returns a result set containing items from some or all of those tables. It is a way to
+      cross-reference and correlate related data that is organized into multiple tables, typically
+      using identifiers that are repeated in each of the joined tables.
+    </p>
+
+    <p class="p">
+        <strong class="ph b">Syntax:</strong>
+      </p>
+
+    <p class="p">
+        Impala supports a wide variety of <code class="ph codeph">JOIN</code> clauses. Left, right, semi, full, and outer joins
+        are supported in all Impala versions. The <code class="ph codeph">CROSS JOIN</code> operator is available in Impala 1.2.2
+        and higher. During performance tuning, you can override the reordering of join clauses that Impala does
+        internally by including the keyword <code class="ph codeph">STRAIGHT_JOIN</code> immediately after the
+        <code class="ph codeph">SELECT</code> and any <code class="ph codeph">DISTINCT</code> or <code class="ph codeph">ALL</code> keywords.
+      </p>
+
+<pre class="pre codeblock"><code>SELECT <var class="keyword varname">select_list</var> FROM
+  <var class="keyword varname">table_or_subquery1</var> [INNER] JOIN <var class="keyword varname">table_or_subquery2</var> |
+  <var class="keyword varname">table_or_subquery1</var> {LEFT [OUTER] | RIGHT [OUTER] | FULL [OUTER]} JOIN <var class="keyword varname">table_or_subquery2</var> |
+  <var class="keyword varname">table_or_subquery1</var> {LEFT | RIGHT} SEMI JOIN <var class="keyword varname">table_or_subquery2</var> |
+  <span class="ph"><var class="keyword varname">table_or_subquery1</var> {LEFT | RIGHT} ANTI JOIN <var class="keyword varname">table_or_subquery2</var> |</span>
+    [ ON <var class="keyword varname">col1</var> = <var class="keyword varname">col2</var> [AND <var class="keyword varname">col3</var> = <var class="keyword varname">col4</var> ...] |
+      USING (<var class="keyword varname">col1</var> [, <var class="keyword varname">col2</var> ...]) ]
+  [<var class="keyword varname">other_join_clause</var> ...]
+[ WHERE <var class="keyword varname">where_clauses</var> ]
+
+SELECT <var class="keyword varname">select_list</var> FROM
+  <var class="keyword varname">table_or_subquery1</var>, <var class="keyword varname">table_or_subquery2</var> [, <var class="keyword varname">table_or_subquery3</var> ...]
+  [<var class="keyword varname">other_join_clause</var> ...]
+WHERE
+    <var class="keyword varname">col1</var> = <var class="keyword varname">col2</var> [AND <var class="keyword varname">col3</var> = <var class="keyword varname">col4</var> ...]
+
+SELECT <var class="keyword varname">select_list</var> FROM
+  <var class="keyword varname">table_or_subquery1</var> CROSS JOIN <var class="keyword varname">table_or_subquery2</var>
+  [<var class="keyword varname">other_join_clause</var> ...]
+[ WHERE <var class="keyword varname">where_clauses</var> ]</code></pre>
+
+    <p class="p">
+      <strong class="ph b">SQL-92 and SQL-89 Joins:</strong>
+    </p>
+
+    <p class="p">
+      Queries with the explicit <code class="ph codeph">JOIN</code> keywords are known as SQL-92 style joins, referring to the
+      level of the SQL standard where they were introduced. The corresponding <code class="ph codeph">ON</code> or
+      <code class="ph codeph">USING</code> clauses clearly show which columns are used as the join keys in each case:
+    </p>
+
+<pre class="pre codeblock"><code>SELECT t1.c1, t2.c2 FROM <strong class="ph b">t1 JOIN t2</strong>
+  <strong class="ph b">ON t1.id = t2.id and t1.type_flag = t2.type_flag</strong>
+  WHERE t1.c1 &gt; 100;
+
+SELECT t1.c1, t2.c2 FROM <strong class="ph b">t1 JOIN t2</strong>
+  <strong class="ph b">USING (id, type_flag)</strong>
+  WHERE t1.c1 &gt; 100;</code></pre>
+
+    <p class="p">
+      The <code class="ph codeph">ON</code> clause is a general way to compare columns across the two tables, even if the column
+      names are different. The <code class="ph codeph">USING</code> clause is a shorthand notation for specifying the join
+      columns, when the column names are the same in both tables. You can code equivalent <code class="ph codeph">WHERE</code>
+      clauses that compare the columns, instead of <code class="ph codeph">ON</code> or <code class="ph codeph">USING</code> clauses, but that
+      practice is not recommended because mixing the join comparisons with other filtering clauses is typically
+      less readable and harder to maintain.
+    </p>
+
+    <p class="p">
+      Queries with a comma-separated list of tables and subqueries are known as SQL-89 style joins. In these
+      queries, the equality comparisons between columns of the joined tables go in the <code class="ph codeph">WHERE</code>
+      clause alongside other kinds of comparisons. This syntax is easy to learn, but it is also easy to
+      accidentally remove a <code class="ph codeph">WHERE</code> clause needed for the join to work correctly.
+    </p>
+
+<pre class="pre codeblock"><code>SELECT t1.c1, t2.c2 FROM <strong class="ph b">t1, t2</strong>
+  WHERE
+  <strong class="ph b">t1.id = t2.id AND t1.type_flag = t2.type_flag</strong>
+  AND t1.c1 &gt; 100;</code></pre>
+
+    <p class="p">
+      <strong class="ph b">Self-joins:</strong>
+    </p>
+
+    <p class="p">
+      Impala can do self-joins, for example to join on two different columns in the same table to represent
+      parent-child relationships or other tree-structured data. There is no explicit syntax for this; just use the
+      same table name for both the left-hand and right-hand table, and assign different table aliases to use when
+      referring to the fully qualified column names:
+    </p>
+
+<pre class="pre codeblock"><code>-- Combine fields from both parent and child rows.
+SELECT lhs.id, rhs.parent, lhs.c1, rhs.c2 FROM tree_data lhs, tree_data rhs WHERE lhs.id = rhs.parent;</code></pre>
+
+    <p class="p">
+      <strong class="ph b">Cartesian joins:</strong>
+    </p>
+
+    <div class="p">
+      To avoid producing huge result sets by mistake, Impala does not allow Cartesian joins of the form:
+<pre class="pre codeblock"><code>SELECT ... FROM t1 JOIN t2;
+SELECT ... FROM t1, t2;</code></pre>
+      If you intend to join the tables based on common values, add <code class="ph codeph">ON</code> or <code class="ph codeph">WHERE</code>
+      clauses to compare columns across the tables. If you truly intend to do a Cartesian join, use the
+      <code class="ph codeph">CROSS JOIN</code> keyword as the join operator. The <code class="ph codeph">CROSS JOIN</code> form does not use
+      any <code class="ph codeph">ON</code> clause, because it produces a result set with all combinations of rows from the
+      left-hand and right-hand tables. The result set can still be filtered by subsequent <code class="ph codeph">WHERE</code>
+      clauses. For example:
+    </div>
+
+<pre class="pre codeblock"><code>SELECT ... FROM t1 CROSS JOIN t2;
+SELECT ... FROM t1 CROSS JOIN t2 WHERE <var class="keyword varname">tests_on_non_join_columns</var>;</code></pre>
+
+    <p class="p">
+      <strong class="ph b">Inner and outer joins:</strong>
+    </p>
+
+    <p class="p">
+      An inner join is the most common and familiar type: rows in the result set contain the requested columns from
+      the appropriate tables, for all combinations of rows where the join columns of the tables have identical
+      values. If a column with the same name occurs in both tables, use a fully qualified name or a column alias to
+      refer to the column in the select list or other clauses. Impala performs inner joins by default for both
+      SQL-89 and SQL-92 join syntax:
+    </p>
+
+<pre class="pre codeblock"><code>-- The following 3 forms are all equivalent.
+SELECT t1.id, c1, c2 FROM t1, t2 WHERE t1.id = t2.id;
+SELECT t1.id, c1, c2 FROM t1 JOIN t2 ON t1.id = t2.id;
+SELECT t1.id, c1, c2 FROM t1 INNER JOIN t2 ON t1.id = t2.id;</code></pre>
+
+    <p class="p">
+      An outer join retrieves all rows from the left-hand table, or the right-hand table, or both; wherever there
+      is no matching data in the table on the other side of the join, the corresponding columns in the result set
+      are set to <code class="ph codeph">NULL</code>. To perform an outer join, include the <code class="ph codeph">OUTER</code> keyword in the
+      join operator, along with either <code class="ph codeph">LEFT</code>, <code class="ph codeph">RIGHT</code>, or <code class="ph codeph">FULL</code>:
+    </p>
+
+<pre class="pre codeblock"><code>SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.id = t2.id;
+SELECT * FROM t1 RIGHT OUTER JOIN t2 ON t1.id = t2.id;
+SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.id = t2.id;</code></pre>
+
+    <p class="p">
+      For outer joins, Impala requires SQL-92 syntax; that is, the <code class="ph codeph">JOIN</code> keyword instead of
+      comma-separated table names. Impala does not support vendor extensions such as <code class="ph codeph">(+)</code> or
+      <code class="ph codeph">*=</code> notation for doing outer joins with SQL-89 query syntax.
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Equijoins and Non-Equijoins:</strong>
+    </p>
+
+    <p class="p">
+      By default, Impala requires an equality comparison between the left-hand and right-hand tables, either
+      through <code class="ph codeph">ON</code>, <code class="ph codeph">USING</code>, or <code class="ph codeph">WHERE</code> clauses. These types of
+      queries are classified broadly as equijoins. Inner, outer, full, and semi joins can all be equijoins based on
+      the presence of equality tests between columns in the left-hand and right-hand tables.
+    </p>
+
+    <p class="p">
+      In Impala 1.2.2 and higher, non-equijoin queries are also possible, with comparisons such as
+      <code class="ph codeph">!=</code> or <code class="ph codeph">&lt;</code> between the join columns. These kinds of queries require care to
+      avoid producing huge result sets that could exceed resource limits. Once you have planned a non-equijoin
+      query that produces a result set of acceptable size, you can code the query using the <code class="ph codeph">CROSS
+      JOIN</code> operator, and add the extra comparisons in the <code class="ph codeph">WHERE</code> clause:
+    </p>
+
+<pre class="pre codeblock"><code>SELECT * FROM t1 CROSS JOIN t2 WHERE t1.total &gt; t2.maximum_price;</code></pre>
+
+    <p class="p">
+      In <span class="keyword">Impala 2.3</span> and higher, additional non-equijoin queries are possible due to the addition
+      of nested loop joins. These queries typically involve <code class="ph codeph">SEMI JOIN</code>,
+      <code class="ph codeph">ANTI JOIN</code>, or <code class="ph codeph">FULL OUTER JOIN</code> clauses.
+      Impala sometimes also uses nested loop joins internally when evaluating <code class="ph codeph">OUTER JOIN</code>
+      queries involving complex type columns.
+      Query phases involving nested loop joins do not use the spill-to-disk mechanism if they
+      exceed the memory limit. Impala decides internally when to use each join mechanism; you cannot
+      specify any query hint to choose between the nested loop join or the original hash join algorithm.
+    </p>
+
+<pre class="pre codeblock"><code>SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.int_col &lt; t2.int_col;</code></pre>
+
+    <p class="p">
+      <strong class="ph b">Semi-joins:</strong>
+    </p>
+
+    <p class="p">
+      Semi-joins are a relatively rarely used variation. With the left semi-join, only data from the left-hand
+      table is returned, for rows where there is matching data in the right-hand table, based on comparisons
+      between join columns in <code class="ph codeph">ON</code> or <code class="ph codeph">WHERE</code> clauses. Only one instance of each row
+      from the left-hand table is returned, regardless of how many matching rows exist in the right-hand table.
+      <span class="ph">A right semi-join (available in Impala 2.0 and higher) reverses the comparison and returns
+      data from the right-hand table.</span>
+    </p>
+
+<pre class="pre codeblock"><code>SELECT t1.c1, t1.c2, t1.c2 FROM t1 LEFT SEMI JOIN t2 ON t1.id = t2.id;</code></pre>
+
+    <p class="p">
+      <strong class="ph b">Natural joins (not supported):</strong>
+    </p>
+
+    <p class="p">
+      Impala does not support the <code class="ph codeph">NATURAL JOIN</code> operator, again to avoid inconsistent or huge
+      result sets. Natural joins do away with the <code class="ph codeph">ON</code> and <code class="ph codeph">USING</code> clauses, and
+      instead automatically join on all columns with the same names in the left-hand and right-hand tables. This
+      kind of query is not recommended for rapidly evolving data structures such as are typically used in Hadoop.
+      Thus, Impala does not support the <code class="ph codeph">NATURAL JOIN</code> syntax, which can produce different query
+      results as columns are added to or removed from tables.
+    </p>
+
+    <p class="p">
+      If you do have any queries that use <code class="ph codeph">NATURAL JOIN</code>, make sure to rewrite them with explicit
+      <code class="ph codeph">USING</code> clauses, because Impala could interpret the <code class="ph codeph">NATURAL</code> keyword as a
+      table alias:
+    </p>
+
+<pre class="pre codeblock"><code>-- 'NATURAL' is interpreted as an alias for 't1' and Impala attempts an inner join,
+-- resulting in an error because inner joins require explicit comparisons between columns.
+SELECT t1.c1, t2.c2 FROM t1 NATURAL JOIN t2;
+ERROR: NotImplementedException: Join with 't2' requires at least one conjunctive equality predicate.
+  To perform a Cartesian product between two tables, use a CROSS JOIN.
+
+-- If you expect the tables to have identically named columns with matching values,
+-- list the corresponding column names in a USING clause.
+SELECT t1.c1, t2.c2 FROM t1 JOIN t2 USING (id, type_flag, name, address);</code></pre>
+
+    <p class="p">
+      <strong class="ph b">Anti-joins (<span class="keyword">Impala 2.0</span> and higher only):</strong>
+    </p>
+
+    <p class="p">
+      Impala supports the <code class="ph codeph">LEFT ANTI JOIN</code> and <code class="ph codeph">RIGHT ANTI JOIN</code> clauses in
+      <span class="keyword">Impala 2.0</span> and higher. The <code class="ph codeph">LEFT</code> or <code class="ph codeph">RIGHT</code>
+      keyword is required for this kind of join. For <code class="ph codeph">LEFT ANTI JOIN</code>, this clause returns those
+      values from the left-hand table that have no matching value in the right-hand table. <code class="ph codeph">RIGHT ANTI
+      JOIN</code> reverses the comparison and returns values from the right-hand table. You can express this
+      negative relationship either through the <code class="ph codeph">ANTI JOIN</code> clause or through a <code class="ph codeph">NOT
+      EXISTS</code> operator with a subquery.
+    </p>
+
+
+
+    <p class="p">
+        <strong class="ph b">Complex type considerations:</strong>
+      </p>
+
+
+
+    <p class="p">
+      When referring to a column with a complex type (<code class="ph codeph">STRUCT</code>, <code class="ph codeph">ARRAY</code>, or <code class="ph codeph">MAP</code>)
+      in a query, you use join notation to <span class="q">"unpack"</span> the scalar fields of the struct, the elements of the array, or
+      the key-value pairs of the map. (The join notation is not required for aggregation operations, such as
+      <code class="ph codeph">COUNT()</code> or <code class="ph codeph">SUM()</code> for array elements.) Because Impala recognizes which complex type elements are associated with which row
+      of the result set, you use the same syntax as for a cross or cartesian join, without an explicit join condition.
+      See <a class="xref" href="impala_complex_types.html#complex_types">Complex Types (Impala 2.3 or higher only)</a> for details about Impala support for complex types.
+    </p>
+
+    <p class="p">
+        <strong class="ph b">Usage notes:</strong>
+      </p>
+
+    <p class="p">
+      You typically use join queries in situations like these:
+    </p>
+
+    <ul class="ul">
+      <li class="li">
+        When related data arrives from different sources, with each data set physically residing in a separate
+        table. For example, you might have address data from business records that you cross-check against phone
+        listings or census data.
+        <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+          Impala can join tables of different file formats, including Impala-managed tables and HBase tables. For
+          example, you might keep small dimension tables in HBase, for convenience of single-row lookups and
+          updates, and for the larger fact tables use Parquet or other binary file format optimized for scan
+          operations. Then, you can issue a join query to cross-reference the fact tables with the dimension
+          tables.
+        </div>
+      </li>
+
+      <li class="li">
+        When data is normalized, a technique for reducing data duplication by dividing it across multiple tables.
+        This kind of organization is often found in data that comes from traditional relational database systems.
+        For example, instead of repeating some long string such as a customer name in multiple tables, each table
+        might contain a numeric customer ID. Queries that need to display the customer name could <span class="q">"join"</span> the
+        table that specifies which customer ID corresponds to which name.
+      </li>
+
+      <li class="li">
+        When certain columns are rarely needed for queries, so they are moved into separate tables to reduce
+        overhead for common queries. For example, a <code class="ph codeph">biography</code> field might be rarely needed in
+        queries on employee data. Putting that field in a separate table reduces the amount of I/O for common
+        queries on employee addresses or phone numbers. Queries that do need the <code class="ph codeph">biography</code> column
+        can retrieve it by performing a join with that separate table.
+      </li>
+
+      <li class="li">
+        In <span class="keyword">Impala 2.3</span> or higher, when referring to complex type columns in queries.
+        See <a class="xref" href="impala_complex_types.html#complex_types">Complex Types (Impala 2.3 or higher only)</a> for details.
+      </li>
+    </ul>
+
+    <p class="p">
+      When comparing columns with the same names in <code class="ph codeph">ON</code> or <code class="ph codeph">WHERE</code> clauses, use the
+      fully qualified names such as <code class="ph codeph"><var class="keyword varname">db_name</var>.<var class="keyword varname">table_name</var></code>, or
+      assign table aliases, column aliases, or both to make the code more compact and understandable:
+    </p>
+
+<pre class="pre codeblock"><code>select t1.c1 as first_id, t2.c2 as second_id from
+  t1 join t2 on first_id = second_id;
+
+select fact.custno, dimension.custno from
+  customer_data as fact join customer_address as dimension
+  using (custno)</code></pre>
+
+    <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+      <p class="p">
+        Performance for join queries is a crucial aspect for Impala, because complex join queries are
+        resource-intensive operations. An efficient join query produces much less network traffic and CPU overhead
+        than an inefficient one. For best results:
+      </p>
+      <ul class="ul">
+        <li class="li">
+          Make sure that both <a class="xref" href="impala_perf_stats.html#perf_stats">table and column statistics</a> are
+          available for all the tables involved in a join query, and especially for the columns referenced in any
+          join conditions. Impala uses the statistics to automatically deduce an efficient join order.
+          Use <a class="xref" href="impala_show.html#show"><code class="ph codeph">SHOW TABLE STATS <var class="keyword varname">table_name</var></code> and
+          <code class="ph codeph">SHOW COLUMN STATS <var class="keyword varname">table_name</var></code></a> to check if statistics are
+          already present. Issue the <code class="ph codeph">COMPUTE STATS <var class="keyword varname">table_name</var></code> for a nonpartitioned table,
+          or (in Impala 2.1.0 and higher) <code class="ph codeph">COMPUTE INCREMENTAL STATS <var class="keyword varname">table_name</var></code>
+          for a partitioned table, to collect the initial statistics at both the table and column levels, and to keep the
+          statistics up to date after any substantial <code class="ph codeph">INSERT</code> or <code class="ph codeph">LOAD DATA</code> operations.
+        </li>
+
+        <li class="li">
+          If table or column statistics are not available, join the largest table first. You can check the
+          existence of statistics with the <code class="ph codeph">SHOW TABLE STATS <var class="keyword varname">table_name</var></code> and
+          <code class="ph codeph">SHOW COLUMN STATS <var class="keyword varname">table_name</var></code> statements.
+        </li>
+
+        <li class="li">
+          If table or column statistics are not available, join subsequent tables according to which table has the
+          most selective filter, based on overall size and <code class="ph codeph">WHERE</code> clauses. Joining the table with
+          the most selective filter results in the fewest number of rows being returned.
+        </li>
+      </ul>
+      <p class="p">
+        For more information and examples of performance for join queries, see
+        <a class="xref" href="impala_perf_joins.html#perf_joins">Performance Considerations for Join Queries</a>.
+      </p>
+    </div>
+
+    <p class="p">
+      To control the result set from a join query, include the names of corresponding column names in both tables
+      in an <code class="ph codeph">ON</code> or <code class="ph codeph">USING</code> clause, or by coding equality comparisons for those
+      columns in the <code class="ph codeph">WHERE</code> clause.
+    </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; select c_last_name, ca_city from customer join customer_address where c_customer_sk = ca_address_sk;
++-------------+-----------------+
+| c_last_name | ca_city         |
++-------------+-----------------+
+| Lewis       | Fairfield       |
+| Moses       | Fairview        |
+| Hamilton    | Pleasant Valley |
+| White       | Oak Ridge       |
+| Moran       | Glendale        |
+...
+| Richards    | Lakewood         |
+| Day         | Lebanon          |
+| Painter     | Oak Hill         |
+| Bentley     | Greenfield       |
+| Jones       | Stringtown       |
++-------------+------------------+
+Returned 50000 row(s) in 9.82s</code></pre>
+
+    <p class="p">
+      One potential downside of joins is the possibility of excess resource usage in poorly constructed queries.
+      Impala imposes restrictions on join queries to guard against such issues. To minimize the chance of runaway
+      queries on large data sets, Impala requires every join query to contain at least one equality predicate
+      between the columns of the various tables. For example, if <code class="ph codeph">T1</code> contains 1000 rows and
+      <code class="ph codeph">T2</code> contains 1,000,000 rows, a query <code class="ph codeph">SELECT <var class="keyword varname">columns</var> FROM t1 JOIN
+      t2</code> could return up to 1 billion rows (1000 * 1,000,000); Impala requires that the query include a
+      clause such as <code class="ph codeph">ON t1.c1 = t2.c2</code> or <code class="ph codeph">WHERE t1.c1 = t2.c2</code>.
+    </p>
+
+    <p class="p">
+      Because even with equality clauses, the result set can still be large, as we saw in the previous example, you
+      might use a <code class="ph codeph">LIMIT</code> clause to return a subset of the results:
+    </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; select c_last_name, ca_city from customer, customer_address where c_customer_sk = ca_address_sk limit 10;
++-------------+-----------------+
+| c_last_name | ca_city         |
++-------------+-----------------+
+| Lewis       | Fairfield       |
+| Moses       | Fairview        |
+| Hamilton    | Pleasant Valley |
+| White       | Oak Ridge       |
+| Moran       | Glendale        |
+| Sharp       | Lakeview        |
+| Wiles       | Farmington      |
+| Shipman     | Union           |
+| Gilbert     | New Hope        |
+| Brunson     | Martinsville    |
++-------------+-----------------+
+Returned 10 row(s) in 0.63s</code></pre>
+
+    <p class="p">
+      Or you might use additional comparison operators or aggregation functions to condense a large result set into
+      a smaller set of values:
+    </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; -- Find the names of customers who live in one particular town.
+[localhost:21000] &gt; select distinct c_last_name from customer, customer_address where
+  c_customer_sk = ca_address_sk
+  and ca_city = "Green Acres";
++---------------+
+| c_last_name   |
++---------------+
+| Hensley       |
+| Pearson       |
+| Mayer         |
+| Montgomery    |
+| Ricks         |
+...
+| Barrett       |
+| Price         |
+| Hill          |
+| Hansen        |
+| Meeks         |
++---------------+
+Returned 332 row(s) in 0.97s
+
+[localhost:21000] &gt; -- See how many different customers in this town have names starting with "A".
+[localhost:21000] &gt; select count(distinct c_last_name) from customer, customer_address where
+  c_customer_sk = ca_address_sk
+  and ca_city = "Green Acres"
+  and substr(c_last_name,1,1) = "A";
++-----------------------------+
+| count(distinct c_last_name) |
++-----------------------------+
+| 12                          |
++-----------------------------+
+Returned 1 row(s) in 1.00s</code></pre>
+
+    <p class="p">
+      Because a join query can involve reading large amounts of data from disk, sending large amounts of data
+      across the network, and loading large amounts of data into memory to do the comparisons and filtering, you
+      might do benchmarking, performance analysis, and query tuning to find the most efficient join queries for
+      your data set, hardware capacity, network configuration, and cluster workload.
+    </p>
+
+    <p class="p">
+      The two categories of joins in Impala are known as <strong class="ph b">partitioned joins</strong> and <strong class="ph b">broadcast joins</strong>. If
+      inaccurate table or column statistics, or some quirk of the data distribution, causes Impala to choose the
+      wrong mechanism for a particular join, consider using query hints as a temporary workaround. For details, see
+      <a class="xref" href="impala_hints.html#hints">Optimizer Hints</a>.
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Handling NULLs in Join Columns:</strong>
+    </p>
+
+    <p class="p">
+      By default, join key columns do not match if either one contains a <code class="ph codeph">NULL</code> value.
+      To treat such columns as equal if both contain <code class="ph codeph">NULL</code>, you can use an expression
+      such as <code class="ph codeph">A = B OR (A IS NULL AND B IS NULL)</code>.
+      In <span class="keyword">Impala 2.5</span> and higher, the <code class="ph codeph">&lt;=&gt;</code> operator (shorthand for
+      <code class="ph codeph">IS NOT DISTINCT FROM</code>) performs the same comparison in a concise and efficient form.
+      The <code class="ph codeph">&lt;=&gt;</code> operator is more efficient in for comparing join keys in a <code class="ph codeph">NULL</code>-safe
+      manner, because the operator can use a hash join while the <code class="ph codeph">OR</code> expression cannot.
+    </p>
+
+    <p class="p">
+        <strong class="ph b">Examples:</strong>
+      </p>
+
+    <div class="p">
+      The following examples refer to these simple tables containing small sets of integers:
+<pre class="pre codeblock"><code>[localhost:21000] &gt; create table t1 (x int);
+[localhost:21000] &gt; insert into t1 values (1), (2), (3), (4), (5), (6);
+
+[localhost:21000] &gt; create table t2 (y int);
+[localhost:21000] &gt; insert into t2 values (2), (4), (6);
+
+[localhost:21000] &gt; create table t3 (z int);
+[localhost:21000] &gt; insert into t3 values (1), (3), (5);
+</code></pre>
+    </div>
+
+
+
+    <p class="p">
+      The following example demonstrates an anti-join, returning the values from <code class="ph codeph">T1</code> that do not
+      exist in <code class="ph codeph">T2</code> (in this case, the odd numbers 1, 3, and 5):
+    </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; select x from t1 left anti join t2 on (t1.x = t2.y);
++---+
+| x |
++---+
+| 1 |
+| 3 |
+| 5 |
++---+
+</code></pre>
+
+    <p class="p">
+        <strong class="ph b">Related information:</strong>
+      </p>
+
+    <p class="p">
+      See these tutorials for examples of different kinds of joins:
+    </p>
+
+    <ul class="ul">
+      <li class="li">
+        <a class="xref" href="impala_tutorial.html#tut_cross_join">Cross Joins and Cartesian Products with the CROSS JOIN Operator</a>
+      </li>
+    </ul>
+  </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_select.html">SELECT Statement</a></div></div></nav></article></main></body></html>