You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@impala.apache.org by jr...@apache.org on 2016/07/26 23:05:02 UTC
[10/22] incubator-impala git commit: First try at porting over the
source files necessary for the Impala SQL Reference.
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_invalidate_metadata.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_invalidate_metadata.xml b/docs/topics/impala_invalidate_metadata.xml
new file mode 100644
index 0000000..96fca7d
--- /dev/null
+++ b/docs/topics/impala_invalidate_metadata.xml
@@ -0,0 +1,236 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept rev="1.1" id="invalidate_metadata">
+
+ <title>INVALIDATE METADATA Statement</title>
+ <titlealts><navtitle>INVALIDATE METADATA</navtitle></titlealts>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="SQL"/>
+ <data name="Category" value="DDL"/>
+ <data name="Category" value="Metastore"/>
+ <data name="Category" value="Schemas"/>
+ <data name="Category" value="Tables"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">INVALIDATE METADATA statement</indexterm>
+ Marks the metadata for one or all tables as stale. Required after a table is created through the Hive shell,
+ before the table is available for Impala queries. The next time the current Impala node performs a query
+ against a table whose metadata is invalidated, Impala reloads the associated metadata before the query
+ proceeds. This is a relatively expensive operation compared to the incremental metadata update done by the
+ <codeph>REFRESH</codeph> statement, so in the common scenario of adding new data files to an existing table,
+ prefer <codeph>REFRESH</codeph> rather than <codeph>INVALIDATE METADATA</codeph>. If you are not familiar
+ with the way Impala uses metadata and how it shares the same metastore database as Hive, see
+ <xref href="impala_hadoop.xml#intro_metastore"/> for background information.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock>INVALIDATE METADATA [[<varname>db_name</varname>.]<varname>table_name</varname>]</codeblock>
+
+ <p>
+ By default, the cached metadata for all tables is flushed. If you specify a table name, only the metadata for
+ that one table is flushed. Even for a single table, <codeph>INVALIDATE METADATA</codeph> is more expensive
+ than <codeph>REFRESH</codeph>, so prefer <codeph>REFRESH</codeph> in the common case where you add new data
+ files for an existing table.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/internals_blurb"/>
+
+ <p>
+ To accurately respond to queries, Impala must have current metadata about those databases and tables that
+ clients query directly. Therefore, if some other entity modifies information used by Impala in the metastore
+ that Impala and Hive share, the information cached by Impala must be updated. However, this does not mean
+ that all metadata updates require an Impala update.
+ </p>
+
+ <note>
+ <p conref="../shared/impala_common.xml#common/catalog_server_124"/>
+ <p rev="1.2">
+ In Impala 1.2 and higher, a dedicated daemon (<cmdname>catalogd</cmdname>) broadcasts DDL changes made
+ through Impala to all Impala nodes. Formerly, after you created a database or table while connected to one
+ Impala node, you needed to issue an <codeph>INVALIDATE METADATA</codeph> statement on another Impala node
+ before accessing the new database or table from the other node. Now, newly created or altered objects are
+ picked up automatically by all Impala nodes. You must still use the <codeph>INVALIDATE METADATA</codeph>
+ technique after creating or altering objects through Hive. See
+ <xref href="impala_components.xml#intro_catalogd"/> for more information on the catalog service.
+ </p>
+ <p>
+ The <codeph>INVALIDATE METADATA</codeph> statement is new in Impala 1.1 and higher, and takes over some of
+ the use cases of the Impala 1.0 <codeph>REFRESH</codeph> statement. Because <codeph>REFRESH</codeph> now
+ requires a table name parameter, to flush the metadata for all tables at once, use the <codeph>INVALIDATE
+ METADATA</codeph> statement.
+ </p>
+ <draft-comment translate="no"> Almost-identical wording here, under INVALIDATE METADATA, and in Release Notes :: New Features. Makes sense to conref. </draft-comment>
+ <p>
+ Because <codeph>REFRESH <varname>table_name</varname></codeph> only works for tables that the current
+ Impala node is already aware of, when you create a new table in the Hive shell, you must enter
+ <codeph>INVALIDATE METADATA</codeph> with no table parameter before you can see the new table in
+ <cmdname>impala-shell</cmdname>. Once the table is known by the Impala node, you can issue <codeph>REFRESH
+ <varname>table_name</varname></codeph> after you add data files for that table.
+ </p>
+ </note>
+
+ <p conref="../shared/impala_common.xml#common/refresh_vs_invalidate"/>
+
+ <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+ <p>
+ A metadata update for an <codeph>impalad</codeph> instance <b>is</b> required if:
+ </p>
+
+ <ul>
+ <li>
+ A metadata change occurs.
+ </li>
+
+ <li>
+ <b>and</b> the change is made from another <codeph>impalad</codeph> instance in your cluster, or through
+ Hive.
+ </li>
+
+ <li>
+ <b>and</b> the change is made to a database to which clients such as the Impala shell or ODBC directly
+ connect.
+ </li>
+ </ul>
+
+ <p>
+ A metadata update for an Impala node is <b>not</b> required when you issue queries from the same Impala node
+ where you ran <codeph>ALTER TABLE</codeph>, <codeph>INSERT</codeph>, or other table-modifying statement.
+ </p>
+
+ <p>
+ Database and table metadata is typically modified by:
+ </p>
+
+ <ul>
+ <li>
+ Hive - via <codeph>ALTER</codeph>, <codeph>CREATE</codeph>, <codeph>DROP</codeph> or
+ <codeph>INSERT</codeph> operations.
+ </li>
+
+ <li>
+ Impalad - via <codeph>CREATE TABLE</codeph>, <codeph>ALTER TABLE</codeph>, and <codeph>INSERT</codeph>
+ operations.
+ </li>
+ </ul>
+
+ <p>
+ <codeph>INVALIDATE METADATA</codeph> causes the metadata for that table to be marked as stale, and reloaded
+ the next time the table is referenced. For a huge table, that process could take a noticeable amount of time;
+ thus you might prefer to use <codeph>REFRESH</codeph> where practical, to avoid an unpredictable delay later,
+ for example if the next reference to the table is during a benchmark test.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+ <p>
+ The following example shows how you might use the <codeph>INVALIDATE METADATA</codeph> statement after
+ creating new tables (such as SequenceFile or HBase tables) through the Hive shell. Before the
+ <codeph>INVALIDATE METADATA</codeph> statement was issued, Impala would give a <q>table not found</q> error
+ if you tried to refer to those table names. The <codeph>DESCRIBE</codeph> statements cause the latest
+ metadata to be immediately loaded for the tables, avoiding a delay the next time those tables are queried.
+ </p>
+
+<codeblock>[impalad-host:21000] > invalidate metadata;
+[impalad-host:21000] > describe t1;
+...
+[impalad-host:21000] > describe t2;
+... </codeblock>
+
+ <p>
+ For more examples of using <codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph> with a
+ combination of Impala and Hive operations, see <xref href="impala_tutorial.xml#tutorial_impala_hive"/>.
+ </p>
+
+ <p>
+ If you need to ensure that the metadata is up-to-date when you start an <cmdname>impala-shell</cmdname>
+ session, run <cmdname>impala-shell</cmdname> with the <codeph>-r</codeph> or
+ <codeph>--refresh_after_connect</codeph> command-line option. Because this operation adds a delay to the next
+ query against each table, potentially expensive for large tables with many partitions, try to avoid using
+ this option for day-to-day operations in a production environment.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/permissions_blurb"/>
+ <p rev="CDH-19187">
+ The user ID that the <cmdname>impalad</cmdname> daemon runs under,
+ typically the <codeph>impala</codeph> user, must have execute
+ permissions for all the relevant directories holding table data.
+ (A table could have data spread across multiple directories,
+ or in unexpected paths, if it uses partitioning or
+ specifies a <codeph>LOCATION</codeph> attribute for
+ individual partitions or the entire table.)
+ Issues with permissions might not cause an immediate error for this statement,
+ but subsequent statements such as <codeph>SELECT</codeph>
+ or <codeph>SHOW TABLE STATS</codeph> could fail.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/hdfs_blurb"/>
+
+ <p>
+ By default, the <codeph>INVALIDATE METADATA</codeph> command checks HDFS permissions of the underlying data
+ files and directories, caching this information so that a statement can be cancelled immediately if for
+ example the <codeph>impala</codeph> user does not have permission to write to the data directory for the
+ table. (This checking does not apply if you have set the <cmdname>catalogd</cmdname> configuration option
+ <codeph>--load_catalog_in_background=false</codeph>.) Impala reports any lack of write permissions as an
+ <codeph>INFO</codeph> message in the log file, in case that represents an oversight. If you change HDFS
+ permissions to make data readable or writeable by the Impala user, issue another <codeph>INVALIDATE
+ METADATA</codeph> to make Impala aware of the change.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+ <p rev="1.2.4">
+ This example illustrates creating a new database and new table in Hive, then doing an <codeph>INVALIDATE
+ METADATA</codeph> statement in Impala using the fully qualified table name, after which both the new table
+ and the new database are visible to Impala. The ability to specify <codeph>INVALIDATE METADATA
+ <varname>table_name</varname></codeph> for a table created in Hive is a new capability in Impala 1.2.4. In
+ earlier releases, that statement would have returned an error indicating an unknown table, requiring you to
+ do <codeph>INVALIDATE METADATA</codeph> with no table name, a more expensive operation that reloaded metadata
+ for all tables and databases.
+ </p>
+
+<codeblock rev="1.2.4">$ hive
+hive> create database new_db_from_hive;
+OK
+Time taken: 4.118 seconds
+hive> create table new_db_from_hive.new_table_from_hive (x int);
+OK
+Time taken: 0.618 seconds
+hive> quit;
+$ impala-shell
+[localhost:21000] > show databases like 'new*';
+[localhost:21000] > refresh new_db_from_hive.new_table_from_hive;
+ERROR: AnalysisException: Database does not exist: new_db_from_hive
+[localhost:21000] > invalidate metadata new_db_from_hive.new_table_from_hive;
+[localhost:21000] > show databases like 'new*';
++--------------------+
+| name |
++--------------------+
+| new_db_from_hive |
++--------------------+
+[localhost:21000] > show tables in new_db_from_hive;
++---------------------+
+| name |
++---------------------+
+| new_table_from_hive |
++---------------------+</codeblock>
+
+ <p conref="../shared/impala_common.xml#common/s3_blurb"/>
+ <p conref="../shared/impala_common.xml#common/s3_metadata"/>
+
+ <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
+ <p conref="../shared/impala_common.xml#common/related_info"/>
+ <p>
+ <xref href="impala_hadoop.xml#intro_metastore"/>,
+ <xref href="impala_refresh.xml#refresh"/>
+ </p>
+
+ </conbody>
+</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_joins.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_joins.xml b/docs/topics/impala_joins.xml
new file mode 100644
index 0000000..011a488
--- /dev/null
+++ b/docs/topics/impala_joins.xml
@@ -0,0 +1,520 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="joins">
+
+ <title>Joins in Impala SELECT Statements</title>
+ <titlealts><navtitle>Joins</navtitle></titlealts>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="Data Analysts"/>
+ <data name="Category" value="Developers"/>
+ <data name="Category" value="SQL"/>
+ <data name="Category" value="Querying"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">joins</indexterm>
+ A join query is a <codeph>SELECT</codeph> statement that combines data from two or more tables,
+ and returns a result set containing items from some or all of those tables. It is a way to
+ cross-reference and correlate related data that is organized into multiple tables, typically
+ using identifiers that are repeated in each of the joined tables.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+ <p conref="../shared/impala_common.xml#common/join_types"/>
+
+<codeblock>SELECT <varname>select_list</varname> FROM
+ <varname>table_or_subquery1</varname> [INNER] JOIN <varname>table_or_subquery2</varname> |
+ <varname>table_or_subquery1</varname> {LEFT [OUTER] | RIGHT [OUTER] | FULL [OUTER]} JOIN <varname>table_or_subquery2</varname> |
+ <varname>table_or_subquery1</varname> {LEFT | RIGHT} SEMI JOIN <varname>table_or_subquery2</varname> |
+ <ph rev="2.0.0"><varname>table_or_subquery1</varname> {LEFT | RIGHT} ANTI JOIN <varname>table_or_subquery2</varname> |</ph>
+ [ ON <varname>col1</varname> = <varname>col2</varname> [AND <varname>col3</varname> = <varname>col4</varname> ...] |
+ USING (<varname>col1</varname> [, <varname>col2</varname> ...]) ]
+ [<varname>other_join_clause</varname> ...]
+[ WHERE <varname>where_clauses</varname> ]
+
+SELECT <varname>select_list</varname> FROM
+ <varname>table_or_subquery1</varname>, <varname>table_or_subquery2</varname> [, <varname>table_or_subquery3</varname> ...]
+ [<varname>other_join_clause</varname> ...]
+WHERE
+ <varname>col1</varname> = <varname>col2</varname> [AND <varname>col3</varname> = <varname>col4</varname> ...]
+
+SELECT <varname>select_list</varname> FROM
+ <varname>table_or_subquery1</varname> CROSS JOIN <varname>table_or_subquery2</varname>
+ [<varname>other_join_clause</varname> ...]
+[ WHERE <varname>where_clauses</varname> ]</codeblock>
+
+ <p>
+ <b>SQL-92 and SQL-89 Joins:</b>
+ </p>
+
+ <p>
+ Queries with the explicit <codeph>JOIN</codeph> keywords are known as SQL-92 style joins, referring to the
+ level of the SQL standard where they were introduced. The corresponding <codeph>ON</codeph> or
+ <codeph>USING</codeph> clauses clearly show which columns are used as the join keys in each case:
+ </p>
+
+<codeblock>SELECT t1.c1, t2.c2 FROM <b>t1 JOIN t2</b>
+ <b>ON t1.id = t2.id and t1.type_flag = t2.type_flag</b>
+ WHERE t1.c1 > 100;
+
+SELECT t1.c1, t2.c2 FROM <b>t1 JOIN t2</b>
+ <b>USING (id, type_flag)</b>
+ WHERE t1.c1 > 100;</codeblock>
+
+ <p>
+ The <codeph>ON</codeph> clause is a general way to compare columns across the two tables, even if the column
+ names are different. The <codeph>USING</codeph> clause is a shorthand notation for specifying the join
+ columns, when the column names are the same in both tables. You can code equivalent <codeph>WHERE</codeph>
+ clauses that compare the columns, instead of <codeph>ON</codeph> or <codeph>USING</codeph> clauses, but that
+ practice is not recommended because mixing the join comparisons with other filtering clauses is typically
+ less readable and harder to maintain.
+ </p>
+
+ <p>
+ Queries with a comma-separated list of tables and subqueries are known as SQL-89 style joins. In these
+ queries, the equality comparisons between columns of the joined tables go in the <codeph>WHERE</codeph>
+ clause alongside other kinds of comparisons. This syntax is easy to learn, but it is also easy to
+ accidentally remove a <codeph>WHERE</codeph> clause needed for the join to work correctly.
+ </p>
+
+<codeblock>SELECT t1.c1, t2.c2 FROM <b>t1, t2</b>
+ WHERE
+ <b>t1.id = t2.id AND t1.type_flag = t2.type_flag</b>
+ AND t1.c1 > 100;</codeblock>
+
+ <p>
+ <b>Self-joins:</b>
+ </p>
+
+ <p>
+ Impala can do self-joins, for example to join on two different columns in the same table to represent
+ parent-child relationships or other tree-structured data. There is no explicit syntax for this; just use the
+ same table name for both the left-hand and right-hand table, and assign different table aliases to use when
+ referring to the fully qualified column names:
+ </p>
+
+<codeblock>-- Combine fields from both parent and child rows.
+SELECT lhs.id, rhs.parent, lhs.c1, rhs.c2 FROM tree_data lhs, tree_data rhs WHERE lhs.id = rhs.parent;</codeblock>
+
+ <p>
+ <b>Cartesian joins:</b>
+ </p>
+
+ <p>
+ To avoid producing huge result sets by mistake, Impala does not allow Cartesian joins of the form:
+<codeblock>SELECT ... FROM t1 JOIN t2;
+SELECT ... FROM t1, t2;</codeblock>
+ If you intend to join the tables based on common values, add <codeph>ON</codeph> or <codeph>WHERE</codeph>
+ clauses to compare columns across the tables. If you truly intend to do a Cartesian join, use the
+ <codeph>CROSS JOIN</codeph> keyword as the join operator. The <codeph>CROSS JOIN</codeph> form does not use
+ any <codeph>ON</codeph> clause, because it produces a result set with all combinations of rows from the
+ left-hand and right-hand tables. The result set can still be filtered by subsequent <codeph>WHERE</codeph>
+ clauses. For example:
+ </p>
+
+<codeblock>SELECT ... FROM t1 CROSS JOIN t2;
+SELECT ... FROM t1 CROSS JOIN t2 WHERE <varname>tests_on_non_join_columns</varname>;</codeblock>
+
+ <p>
+ <b>Inner and outer joins:</b>
+ </p>
+
+ <p>
+ An inner join is the most common and familiar type: rows in the result set contain the requested columns from
+ the appropriate tables, for all combinations of rows where the join columns of the tables have identical
+ values. If a column with the same name occurs in both tables, use a fully qualified name or a column alias to
+ refer to the column in the select list or other clauses. Impala performs inner joins by default for both
+ SQL-89 and SQL-92 join syntax:
+ </p>
+
+<codeblock>-- The following 3 forms are all equivalent.
+SELECT t1.id, c1, c2 FROM t1, t2 WHERE t1.id = t2.id;
+SELECT t1.id, c1, c2 FROM t1 JOIN t2 ON t1.id = t2.id;
+SELECT t1.id, c1, c2 FROM t1 INNER JOIN t2 ON t1.id = t2.id;</codeblock>
+
+ <p>
+ An outer join retrieves all rows from the left-hand table, or the right-hand table, or both; wherever there
+ is no matching data in the table on the other side of the join, the corresponding columns in the result set
+ are set to <codeph>NULL</codeph>. To perform an outer join, include the <codeph>OUTER</codeph> keyword in the
+ join operator, along with either <codeph>LEFT</codeph>, <codeph>RIGHT</codeph>, or <codeph>FULL</codeph>:
+ </p>
+
+<codeblock>SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.id = t2.id;
+SELECT * FROM t1 RIGHT OUTER JOIN t2 ON t1.id = t2.id;
+SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.id = t2.id;</codeblock>
+
+ <p>
+ For outer joins, Impala requires SQL-92 syntax; that is, the <codeph>JOIN</codeph> keyword instead of
+ comma-separated table names. Impala does not support vendor extensions such as <codeph>(+)</codeph> or
+ <codeph>*=</codeph> notation for doing outer joins with SQL-89 query syntax.
+ </p>
+
+ <p>
+ <b>Equijoins and Non-Equijoins:</b>
+ </p>
+
+ <p>
+ By default, Impala requires an equality comparison between the left-hand and right-hand tables, either
+ through <codeph>ON</codeph>, <codeph>USING</codeph>, or <codeph>WHERE</codeph> clauses. These types of
+ queries are classified broadly as equijoins. Inner, outer, full, and semi joins can all be equijoins based on
+ the presence of equality tests between columns in the left-hand and right-hand tables.
+ </p>
+
+ <p>
+ In Impala 1.2.2 and higher, non-equijoin queries are also possible, with comparisons such as
+ <codeph>!=</codeph> or <codeph><</codeph> between the join columns. These kinds of queries require care to
+ avoid producing huge result sets that could exceed resource limits. Once you have planned a non-equijoin
+ query that produces a result set of acceptable size, you can code the query using the <codeph>CROSS
+ JOIN</codeph> operator, and add the extra comparisons in the <codeph>WHERE</codeph> clause:
+ </p>
+
+<codeblock>SELECT * FROM t1 CROSS JOIN t2 WHERE t1.total > t2.maximum_price;</codeblock>
+
+ <p rev="2.3.0">
+ In CDH 5.5 / Impala 2.3 and higher, additional non-equijoin queries are possible due to the addition
+ of nested loop joins. These queries typically involve <codeph>SEMI JOIN</codeph>,
+ <codeph>ANTI JOIN</codeph>, or <codeph>FULL OUTER JOIN</codeph> clauses.
+ Impala sometimes also uses nested loop joins internally when evaluating <codeph>OUTER JOIN</codeph>
+ queries involving complex type columns.
+ Query phases involving nested loop joins do not use the spill-to-disk mechanism if they
+ exceed the memory limit. Impala decides internally when to use each join mechanism; you cannot
+ specify any query hint to choose between the nested loop join or the original hash join algorithm.
+ </p>
+
+<codeblock rev="2.3.0">SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.int_col < t2.int_col;</codeblock>
+
+ <p>
+ <b>Semi-joins:</b>
+ </p>
+
+ <p>
+ Semi-joins are a relatively rarely used variation. With the left semi-join, only data from the left-hand
+ table is returned, for rows where there is matching data in the right-hand table, based on comparisons
+ between join columns in <codeph>ON</codeph> or <codeph>WHERE</codeph> clauses. Only one instance of each row
+ from the left-hand table is returned, regardless of how many matching rows exist in the right-hand table.
+ <ph rev="2.0.0">A right semi-join (available in Impala 2.0 and higher) reverses the comparison and returns
+ data from the right-hand table.</ph>
+ </p>
+
+<codeblock>SELECT t1.c1, t1.c2, t1.c2 FROM t1 LEFT SEMI JOIN t2 ON t1.id = t2.id;</codeblock>
+
+ <p>
+ <b>Natural joins (not supported):</b>
+ </p>
+
+ <p>
+ Impala does not support the <codeph>NATURAL JOIN</codeph> operator, again to avoid inconsistent or huge
+ result sets. Natural joins do away with the <codeph>ON</codeph> and <codeph>USING</codeph> clauses, and
+ instead automatically join on all columns with the same names in the left-hand and right-hand tables. This
+ kind of query is not recommended for rapidly evolving data structures such as are typically used in Hadoop.
+ Thus, Impala does not support the <codeph>NATURAL JOIN</codeph> syntax, which can produce different query
+ results as columns are added to or removed from tables.
+ </p>
+
+ <p>
+ If you do have any queries that use <codeph>NATURAL JOIN</codeph>, make sure to rewrite them with explicit
+ <codeph>USING</codeph> clauses, because Impala could interpret the <codeph>NATURAL</codeph> keyword as a
+ table alias:
+ </p>
+
+<codeblock>-- 'NATURAL' is interpreted as an alias for 't1' and Impala attempts an inner join,
+-- resulting in an error because inner joins require explicit comparisons between columns.
+SELECT t1.c1, t2.c2 FROM t1 NATURAL JOIN t2;
+ERROR: NotImplementedException: Join with 't2' requires at least one conjunctive equality predicate.
+ To perform a Cartesian product between two tables, use a CROSS JOIN.
+
+-- If you expect the tables to have identically named columns with matching values,
+-- list the corresponding column names in a USING clause.
+SELECT t1.c1, t2.c2 FROM t1 JOIN t2 USING (id, type_flag, name, address);</codeblock>
+
+ <p rev="2.0.0">
+ <b>Anti-joins (Impala 2.0 / CDH 5.2 and higher only):</b>
+ </p>
+
+ <p rev="2.0.0">
+ Impala supports the <codeph>LEFT ANTI JOIN</codeph> and <codeph>RIGHT ANTI JOIN</codeph> clauses in Impala
+ 2.0 and higher on CDH 4, or CDH 5.2 and higher on CDH 5. The <codeph>LEFT</codeph> or <codeph>RIGHT</codeph>
+ keyword is required for this kind of join. For <codeph>LEFT ANTI JOIN</codeph>, this clause returns those
+ values from the left-hand table that have no matching value in the right-hand table. <codeph>RIGHT ANTI
+ JOIN</codeph> reverses the comparison and returns values from the right-hand table. You can express this
+ negative relationship either through the <codeph>ANTI JOIN</codeph> clause or through a <codeph>NOT
+ EXISTS</codeph> operator with a subquery.
+ </p>
+
+<!-- Restriction lifted in Impala 2.0.
+<p>
+Impala does not support <codeph>WHERE</codeph> clauses
+such as <codeph>IN (<varname>subquery</varname>)</codeph>,
+<codeph>NOT IN (<varname>subquery</varname>)</codeph>,
+<codeph>EXISTS (<varname>subquery</varname>)</codeph>,
+and <codeph>NOT EXISTS (<varname>subquery</varname>)</codeph>.
+Therefore from a practical standpoint, you cannot
+express an anti-join condition, where values from one table
+are returned only if no matching values are present in another table.
+</p>
+-->
+
+ <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
+
+<!-- To do: reuse some complex types examples with joins here or under Examples farther down. -->
+
+ <p rev="2.3.0">
+ When referring to a column with a complex type (<codeph>STRUCT</codeph>, <codeph>ARRAY</codeph>, or <codeph>MAP</codeph>)
+ in a query, you use join notation to <q>unpack</q> the scalar fields of the struct, the elements of the array, or
+ the key-value pairs of the map. (The join notation is not required for aggregation operations, such as
+ <codeph>COUNT()</codeph> or <codeph>SUM()</codeph> for array elements.) Because Impala recognizes which complex type elements are associated with which row
+ of the result set, you use the same syntax as for a cross or cartesian join, without an explicit join condition.
+ See <xref href="impala_complex_types.xml#complex_types"/> for details about Impala support for complex types.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+ <p>
+ You typically use join queries in situations like these:
+ </p>
+
+ <ul>
+ <li>
+ When related data arrives from different sources, with each data set physically residing in a separate
+ table. For example, you might have address data from business records that you cross-check against phone
+ listings or census data.
+ <note>
+ Impala can join tables of different file formats, including Impala-managed tables and HBase tables. For
+ example, you might keep small dimension tables in HBase, for convenience of single-row lookups and
+ updates, and for the larger fact tables use Parquet or other binary file format optimized for scan
+ operations. Then, you can issue a join query to cross-reference the fact tables with the dimension
+ tables.
+ </note>
+ </li>
+
+ <li>
+ When data is normalized, a technique for reducing data duplication by dividing it across multiple tables.
+ This kind of organization is often found in data that comes from traditional relational database systems.
+ For example, instead of repeating some long string such as a customer name in multiple tables, each table
+ might contain a numeric customer ID. Queries that need to display the customer name could <q>join</q> the
+ table that specifies which customer ID corresponds to which name.
+ </li>
+
+ <li>
+ When certain columns are rarely needed for queries, so they are moved into separate tables to reduce
+ overhead for common queries. For example, a <codeph>biography</codeph> field might be rarely needed in
+ queries on employee data. Putting that field in a separate table reduces the amount of I/O for common
+ queries on employee addresses or phone numbers. Queries that do need the <codeph>biography</codeph> column
+ can retrieve it by performing a join with that separate table.
+ </li>
+
+ <li>
+ In CDH 5.5 / Impala 2.3 or higher, when referring to complex type columns in queries.
+ See <xref href="impala_complex_types.xml#complex_types"/> for details.
+ </li>
+ </ul>
+
+ <p>
+ When comparing columns with the same names in <codeph>ON</codeph> or <codeph>WHERE</codeph> clauses, use the
+ fully qualified names such as <codeph><varname>db_name</varname>.<varname>table_name</varname></codeph>, or
+ assign table aliases, column aliases, or both to make the code more compact and understandable:
+ </p>
+
+<codeblock>select t1.c1 as first_id, t2.c2 as second_id from
+ t1 join t2 on first_id = second_id;
+
+select fact.custno, dimension.custno from
+ customer_data as fact join customer_address as dimension
+ using (custno)</codeblock>
+
+ <note>
+ <p>
+ Performance for join queries is a crucial aspect for Impala, because complex join queries are
+ resource-intensive operations. An efficient join query produces much less network traffic and CPU overhead
+ than an inefficient one. For best results:
+ </p>
+ <ul>
+ <li rev="1.2">
+ Make sure that both <xref href="impala_perf_stats.xml#perf_stats">table and column statistics</xref> are
+ available for all the tables involved in a join query, and especially for the columns referenced in any
+ join conditions. Impala uses the statistics to automatically deduce an efficient join order.
+ Use <xref href="impala_show.xml#show"><codeph>SHOW TABLE STATS <varname>table_name</varname></codeph> and
+ <codeph>SHOW COLUMN STATS <varname>table_name</varname></codeph></xref> to check if statistics are
+ already present. Issue the <codeph>COMPUTE STATS <varname>table_name</varname></codeph> for a nonpartitioned table,
+ or (in Impala 2.1.0 and higher) <codeph>COMPUTE INCREMENTAL STATS <varname>table_name</varname></codeph>
+ for a partitioned table, to collect the initial statistics at both the table and column levels, and to keep the
+ statistics up to date after any substantial <codeph>INSERT</codeph> or <codeph>LOAD DATA</codeph> operations.
+ </li>
+
+ <li rev="1.2">
+ If table or column statistics are not available, join the largest table first. You can check the
+ existence of statistics with the <codeph>SHOW TABLE STATS <varname>table_name</varname></codeph> and
+ <codeph>SHOW COLUMN STATS <varname>table_name</varname></codeph> statements.
+ </li>
+
+ <li rev="1.2.2">
+ If table or column statistics are not available, join subsequent tables according to which table has the
+ most selective filter, based on overall size and <codeph>WHERE</codeph> clauses. Joining the table with
+ the most selective filter results in the fewest number of rows being returned.
+ </li>
+ </ul>
+ <p>
+ For more information and examples of performance for join queries, see
+ <xref href="impala_perf_joins.xml#perf_joins"/>.
+ </p>
+ </note>
+
+ <p>
+ To control the result set from a join query, include the names of corresponding column names in both tables
+ in an <codeph>ON</codeph> or <codeph>USING</codeph> clause, or by coding equality comparisons for those
+ columns in the <codeph>WHERE</codeph> clause.
+ </p>
+
+<codeblock>[localhost:21000] > select c_last_name, ca_city from customer join customer_address where c_customer_sk = ca_address_sk;
++-------------+-----------------+
+| c_last_name | ca_city |
++-------------+-----------------+
+| Lewis | Fairfield |
+| Moses | Fairview |
+| Hamilton | Pleasant Valley |
+| White | Oak Ridge |
+| Moran | Glendale |
+...
+| Richards | Lakewood |
+| Day | Lebanon |
+| Painter | Oak Hill |
+| Bentley | Greenfield |
+| Jones | Stringtown |
++-------------+------------------+
+Returned 50000 row(s) in 9.82s</codeblock>
+
+ <p>
+ One potential downside of joins is the possibility of excess resource usage in poorly constructed queries.
+ Impala imposes restrictions on join queries to guard against such issues. To minimize the chance of runaway
+ queries on large data sets, Impala requires every join query to contain at least one equality predicate
+ between the columns of the various tables. For example, if <codeph>T1</codeph> contains 1000 rows and
+ <codeph>T2</codeph> contains 1,000,000 rows, a query <codeph>SELECT <varname>columns</varname> FROM t1 JOIN
+ t2</codeph> could return up to 1 billion rows (1000 * 1,000,000); Impala requires that the query include a
+ clause such as <codeph>ON t1.c1 = t2.c2</codeph> or <codeph>WHERE t1.c1 = t2.c2</codeph>.
+ </p>
+
+ <p>
+ Because even with equality clauses, the result set can still be large, as we saw in the previous example, you
+ might use a <codeph>LIMIT</codeph> clause to return a subset of the results:
+ </p>
+
+<codeblock>[localhost:21000] > select c_last_name, ca_city from customer, customer_address where c_customer_sk = ca_address_sk limit 10;
++-------------+-----------------+
+| c_last_name | ca_city |
++-------------+-----------------+
+| Lewis | Fairfield |
+| Moses | Fairview |
+| Hamilton | Pleasant Valley |
+| White | Oak Ridge |
+| Moran | Glendale |
+| Sharp | Lakeview |
+| Wiles | Farmington |
+| Shipman | Union |
+| Gilbert | New Hope |
+| Brunson | Martinsville |
++-------------+-----------------+
+Returned 10 row(s) in 0.63s</codeblock>
+
+ <p>
+ Or you might use additional comparison operators or aggregation functions to condense a large result set into
+ a smaller set of values:
+ </p>
+
+<codeblock>[localhost:21000] > -- Find the names of customers who live in one particular town.
+[localhost:21000] > select distinct c_last_name from customer, customer_address where
+ c_customer_sk = ca_address_sk
+ and ca_city = "Green Acres";
++---------------+
+| c_last_name |
++---------------+
+| Hensley |
+| Pearson |
+| Mayer |
+| Montgomery |
+| Ricks |
+...
+| Barrett |
+| Price |
+| Hill |
+| Hansen |
+| Meeks |
++---------------+
+Returned 332 row(s) in 0.97s
+
+[localhost:21000] > -- See how many different customers in this town have names starting with "A".
+[localhost:21000] > select count(distinct c_last_name) from customer, customer_address where
+ c_customer_sk = ca_address_sk
+ and ca_city = "Green Acres"
+ and substr(c_last_name,1,1) = "A";
++-----------------------------+
+| count(distinct c_last_name) |
++-----------------------------+
+| 12 |
++-----------------------------+
+Returned 1 row(s) in 1.00s</codeblock>
+
+ <p>
+ Because a join query can involve reading large amounts of data from disk, sending large amounts of data
+ across the network, and loading large amounts of data into memory to do the comparisons and filtering, you
+ might do benchmarking, performance analysis, and query tuning to find the most efficient join queries for
+ your data set, hardware capacity, network configuration, and cluster workload.
+ </p>
+
+ <p>
+ The two categories of joins in Impala are known as <b>partitioned joins</b> and <b>broadcast joins</b>. If
+ inaccurate table or column statistics, or some quirk of the data distribution, causes Impala to choose the
+ wrong mechanism for a particular join, consider using query hints as a temporary workaround. For details, see
+ <xref href="impala_hints.xml#hints"/>.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+ <p>
+ The following examples refer to these simple tables containing small sets of integers:
+<codeblock>[localhost:21000] > create table t1 (x int);
+[localhost:21000] > insert into t1 values (1), (2), (3), (4), (5), (6);
+
+[localhost:21000] > create table t2 (y int);
+[localhost:21000] > insert into t2 values (2), (4), (6);
+
+[localhost:21000] > create table t3 (z int);
+[localhost:21000] > insert into t3 values (1), (3), (5);
+</codeblock>
+ </p>
+
+<!-- To do: fill in examples for other join types. -->
+
+ <p>
+ The following example demonstrates an anti-join, returning the values from <codeph>T1</codeph> that do not
+ exist in <codeph>T2</codeph> (in this case, the odd numbers 1, 3, and 5):
+ </p>
+
+<codeblock>[localhost:21000] > select x from t1 left anti join t2 on (t1.x = t2.y);
++---+
+| x |
++---+
+| 1 |
+| 3 |
+| 5 |
++---+
+</codeblock>
+
+ <p conref="../shared/impala_common.xml#common/related_info"/>
+
+ <p>
+ See these tutorials for examples of different kinds of joins:
+ </p>
+
+ <ul>
+ <li>
+ <xref href="impala_tutorial.xml#tut_cross_join"/>
+ </li>
+ </ul>
+ </conbody>
+</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_langref.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_langref.xml b/docs/topics/impala_langref.xml
new file mode 100644
index 0000000..aaa76aa
--- /dev/null
+++ b/docs/topics/impala_langref.xml
@@ -0,0 +1,179 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="langref">
+
+ <title><ph audience="PDF">Impala SQL Language Reference</ph><ph audience="HTML">Overview of Impala SQL</ph></title>
+
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="SQL"/>
+ <data name="Category" value="Data Analysts"/>
+ <data name="Category" value="Developers"/>
+ <data name="Category" value="impala-shell"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ Impala uses SQL as its query language. Impala interprets SQL statements and performs the
+ full end-to-end processing for each statement. (As opposed to acting as a translation
+ layer for some other Hadoop subsystem.)
+ </p>
+
+ <p>
+ Impala implements many familiar statements, such as <codeph>CREATE TABLE</codeph>,
+ <codeph>INSERT</codeph>, and <codeph>SELECT</codeph>. Currently, the DML statements
+ <codeph>UPDATE</codeph> and <codeph>DELETE</codeph> are not available in the production
+ level of Impala, because big data analytics with Hadoop and HDFS typically involves
+ unchanging data. <codeph>UPDATE</codeph> and <codeph>DELETE</codeph> <i>are</i> available
+ in beta form in the version of Impala used with the Kudu storage layer. For full details
+ about Impala SQL syntax and semantics, see
+ <xref href="impala_langref_sql.xml#langref_sql"/>.
+ </p>
+
+ <p>
+ Queries include clauses such as <codeph>WHERE</codeph>, <codeph>GROUP BY</codeph>,
+ <codeph>ORDER BY</codeph>, and <codeph>JOIN</codeph>. For information about query syntax,
+ see <xref href="impala_select.xml#select"/>.
+ </p>
+
+ <p>
+ Queries can also include function calls, to scalar functions such as
+ <codeph>sin()</codeph> and <codeph>substr()</codeph>, aggregate functions such as
+ <codeph>count()</codeph> and <codeph>avg()</codeph>, and analytic functions such as
+ <codeph>lag()</codeph> and <codeph>rank()</codeph>. For a list of the built-in functions
+ available in Impala queries, see <xref href="impala_functions.xml#builtins"/>.
+ </p>
+
+ <p outputclass="toc"/>
+
+ </conbody>
+
+ <concept id="langref_performance">
+
+ <title>Performance Features</title>
+
+ <conbody>
+
+ <p>
+ The main performance-related SQL features for Impala are:
+ </p>
+
+ <ul>
+ <li>
+ <p>
+ The <codeph>COMPUTE STATS</codeph> statement, and the underlying table statistics
+ and column statistics used in query planning. The statistics are used to estimate
+ the number of rows and size of the result set for queries, subqueries, and the
+ different <q>sides</q> of a join query.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ The output of the <codeph>EXPLAIN</codeph> statement. It outlines the ways in which
+ the query is parallelized, and how much I/O, memory, and so on the query expects to
+ use. You can control the level of detail in the output through a query option.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Partitioning for tables. By organizing the data for efficient access along one or
+ more dimensions, this technique lets queries read only the relevant data.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Query hints, especially for join queries. Impala selects from different join
+ algorithms based on the relative sizes of the result sets for each side of the join.
+ In cases where you know the most effective technique for a particular query, you can
+ override the estimates that Impala uses to make that choice, and select the join
+ technique directly.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Query options. These options control settings that can influence the performance of
+ individual queries when you know the special considerations based on your workload,
+ hardware configuration, or data distribution.
+ </p>
+ </li>
+ </ul>
+
+ <p>
+ Because analytic queries against high volumes of data tend to require full scans against
+ large portions of data from each table, Impala does not include index-related SQL
+ statements such as <codeph>CREATE INDEX</codeph>. The <codeph>COMPUTE STATS</codeph>
+ serves the purpose of analyzing the distribution of data within each column and the
+ overall table. Partitioning optimizes the physical layout of the data for queries that
+ filter on one or more crucial columns.
+ </p>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="hive_interoperability">
+
+ <title>Sharing Tables, Data, and Queries Between Impala and Hive</title>
+
+ <conbody>
+
+ <p>
+ To protect user investment in skills development and query design, Impala provides a
+ high degree of compatibility with the Hive Query Language (HiveQL):
+ </p>
+
+ <ul>
+ <li>
+ Because Impala uses the same metadata store as Hive to record information about table
+ structure and properties, Impala can access tables defined through the native Impala
+ <codeph>CREATE TABLE</codeph> command, or tables created using the Hive data
+ definition language (DDL).
+ </li>
+
+ <li>
+ Impala supports data manipulation (DML) statements similar to the DML component of
+ HiveQL.
+ </li>
+
+ <li>
+ Impala provides many <xref href="impala_functions.xml#builtins">built-in
+ functions</xref> with the same names and parameter types as their HiveQL equivalents.
+ </li>
+ </ul>
+
+ <p>
+ Impala supports most of the same
+ <xref href="impala_langref_sql.xml#langref_sql">statements and clauses</xref> as HiveQL,
+ including, but not limited to <codeph>JOIN</codeph>, <codeph>AGGREGATE</codeph>,
+ <codeph>DISTINCT</codeph>, <codeph>UNION ALL</codeph>, <codeph>ORDER BY</codeph>,
+ <codeph>LIMIT</codeph> and (uncorrelated) subquery in the <codeph>FROM</codeph> clause.
+ Impala also supports <codeph>INSERT INTO</codeph> and <codeph>INSERT OVERWRITE</codeph>.
+ </p>
+
+ <p>
+ Impala supports data types with the same names and semantics as the equivalent Hive data
+ types: <codeph>STRING</codeph>, <codeph>TINYINT</codeph>, <codeph>SMALLINT</codeph>,
+ <codeph>INT</codeph>, <codeph>BIGINT</codeph>, <codeph>FLOAT</codeph>,
+ <codeph>DOUBLE</codeph>, <codeph>BOOLEAN</codeph>, <codeph>STRING</codeph>,
+ <codeph>TIMESTAMP</codeph>. CDH 5.5 / Impala 2.3 and higher also include the complex
+ types <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph>.
+ </p>
+
+ <p>
+ Most HiveQL <codeph>SELECT</codeph> and <codeph>INSERT</codeph> statements run
+ unmodified with Impala. For information about Hive syntax not available in Impala, see
+ <xref href="impala_langref_unsupported.xml#langref_hiveql_delta"/>.
+ </p>
+
+ </conbody>
+
+ </concept>
+
+</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_langref_sql.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_langref_sql.xml b/docs/topics/impala_langref_sql.xml
new file mode 100644
index 0000000..d759e76
--- /dev/null
+++ b/docs/topics/impala_langref_sql.xml
@@ -0,0 +1,35 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="langref_sql">
+
+ <title>Impala SQL Statements</title>
+ <titlealts><navtitle>SQL Statements</navtitle></titlealts>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="SQL"/>
+ <data name="Category" value="Developers"/>
+ <data name="Category" value="Data Analysts"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ The Impala SQL dialect supports a range of standard elements, plus some extensions for Big Data use cases
+ related to data loading and data warehousing.
+ </p>
+
+ <note>
+ <p>
+ In the <cmdname>impala-shell</cmdname> interpreter, a semicolon at the end of each statement is required.
+ Since the semicolon is not actually part of the SQL syntax, we do not include it in the syntax definition
+ of each statement, but we do show it in examples intended to be run in <cmdname>impala-shell</cmdname>.
+ </p>
+ </note>
+
+ <p audience="PDF" outputclass="toc all">
+ The following sections show the major SQL statements that you work with in Impala:
+ </p>
+ </conbody>
+</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_langref_unsupported.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_langref_unsupported.xml b/docs/topics/impala_langref_unsupported.xml
new file mode 100644
index 0000000..f2b0560
--- /dev/null
+++ b/docs/topics/impala_langref_unsupported.xml
@@ -0,0 +1,296 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="langref_hiveql_delta">
+
+ <title>SQL Differences Between Impala and Hive</title>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="SQL"/>
+ <data name="Category" value="Hive"/>
+ <data name="Category" value="Porting"/>
+ <data name="Category" value="Data Analysts"/>
+ <data name="Category" value="Developers"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">Hive</indexterm>
+ <indexterm audience="Cloudera">HiveQL</indexterm>
+ Impala's SQL syntax follows the SQL-92 standard, and includes many industry extensions in areas such as
+ built-in functions. See <xref href="impala_porting.xml#porting"/> for a general discussion of adapting SQL
+ code from a variety of database systems to Impala.
+ </p>
+
+ <p>
+ Because Impala and Hive share the same metastore database and their tables are often used interchangeably,
+ the following section covers differences between Impala and Hive in detail.
+ </p>
+
+ <p outputclass="toc inpage"/>
+ </conbody>
+
+ <concept id="langref_hiveql_unsupported">
+
+ <title>HiveQL Features not Available in Impala</title>
+
+ <conbody>
+
+ <p>
+ The current release of Impala does not support the following SQL features that you might be familiar with
+ from HiveQL:
+ </p>
+
+ <draft-comment translate="no">
+Yeesh, too many separate lists of unsupported Hive syntax.
+Here, the FAQ, and in some of the intro topics.
+Some discussion in IMP-1061 about how best to reorg.
+Lots of opportunities for conrefs.
+</draft-comment>
+
+ <ul>
+<!-- Now supported in CDH 5.5 / Impala 2.3 and higher. Find places on this page (like already done under lateral views) to note the new data type support.
+ <li>
+ Non-scalar data types such as maps, arrays, structs.
+ </li>
+-->
+
+ <li rev="1.2">
+ Extensibility mechanisms such as <codeph>TRANSFORM</codeph>, custom file formats, or custom SerDes.
+ </li>
+
+ <li>
+ XML and JSON functions.
+ </li>
+
+ <li>
+ Certain aggregate functions from HiveQL: <codeph>covar_pop</codeph>, <codeph>covar_samp</codeph>,
+ <codeph>corr</codeph>, <codeph>percentile</codeph>, <codeph>percentile_approx</codeph>,
+ <codeph>histogram_numeric</codeph>, <codeph>collect_set</codeph>; Impala supports the set of aggregate
+ functions listed in <xref href="impala_aggregate_functions.xml#aggregate_functions"/> and analytic
+ functions listed in <xref href="impala_analytic_functions.xml#analytic_functions"/>.
+ </li>
+
+ <li>
+ Sampling.
+ </li>
+
+ <li>
+ Lateral views. In CDH 5.5 / Impala 2.3 and higher, Impala supports queries on complex types
+ (<codeph>STRUCT</codeph>, <codeph>ARRAY</codeph>, or <codeph>MAP</codeph>), using join notation
+ rather than the <codeph>EXPLODE()</codeph> keyword.
+ See <xref href="impala_complex_types.xml#complex_types"/> for details about Impala support for complex types.
+ </li>
+
+ <li>
+ Multiple <codeph>DISTINCT</codeph> clauses per query, although Impala includes some workarounds for this
+ limitation.
+ <note conref="../shared/impala_common.xml#common/multiple_count_distinct"/>
+ </li>
+ </ul>
+
+ <p>
+ User-defined functions (UDFs) are supported starting in Impala 1.2. See <xref href="impala_udf.xml#udfs"/>
+ for full details on Impala UDFs.
+ <ul>
+ <li>
+ Impala supports high-performance UDFs written in C++, as well as reusing some Java-based Hive UDFs.
+ </li>
+
+ <li>
+ Impala supports scalar UDFs and user-defined aggregate functions (UDAFs). Impala does not currently
+ support user-defined table generating functions (UDTFs).
+ </li>
+
+ <li>
+ Only Impala-supported column types are supported in Java-based UDFs.
+ </li>
+ </ul>
+ </p>
+
+ <p>
+ Impala does not currently support these HiveQL statements:
+ </p>
+
+ <ul>
+ <li>
+ <codeph>ANALYZE TABLE</codeph> (the Impala equivalent is <codeph>COMPUTE STATS</codeph>)
+ </li>
+
+ <li>
+ <codeph>DESCRIBE COLUMN</codeph>
+ </li>
+
+ <li>
+ <codeph>DESCRIBE DATABASE</codeph>
+ </li>
+
+ <li>
+ <codeph>EXPORT TABLE</codeph>
+ </li>
+
+ <li>
+ <codeph>IMPORT TABLE</codeph>
+ </li>
+
+ <li>
+ <codeph>SHOW TABLE EXTENDED</codeph>
+ </li>
+
+ <li>
+ <codeph>SHOW INDEXES</codeph>
+ </li>
+
+ <li>
+ <codeph>SHOW COLUMNS</codeph>
+ </li>
+ </ul>
+ </conbody>
+ </concept>
+
+ <concept id="langref_hiveql_semantics">
+
+ <title>Semantic Differences Between Impala and HiveQL Features</title>
+
+ <conbody>
+
+ <p>
+ This section covers instances where Impala and Hive have similar functionality, sometimes including the
+ same syntax, but there are differences in the runtime semantics of those features.
+ </p>
+
+ <p>
+ <b>Security:</b>
+ </p>
+
+ <p>
+ Impala utilizes the <xref href="http://sentry.incubator.apache.org/" scope="external" format="html">Apache
+ Sentry (incubating)</xref> authorization framework, which provides fine-grained role-based access control
+ to protect data against unauthorized access or tampering.
+ </p>
+
+ <p>
+ The Hive component included in CDH 5.1 and higher now includes Sentry-enabled <codeph>GRANT</codeph>,
+ <codeph>REVOKE</codeph>, and <codeph>CREATE/DROP ROLE</codeph> statements. Earlier Hive releases had a
+ privilege system with <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements that were primarily
+ intended to prevent accidental deletion of data, rather than a security mechanism to protect against
+ malicious users.
+ </p>
+
+ <p>
+ Impala can make use of privileges set up through Hive <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements.
+ Impala has its own <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements in Impala 2.0 and higher.
+ See <xref href="impala_authorization.xml#authorization"/> for the details of authorization in Impala, including
+ how to switch from the original policy file-based privilege model to the Sentry service using privileges
+ stored in the metastore database.
+ </p>
+
+ <p>
+ <b>SQL statements and clauses:</b>
+ </p>
+
+ <p>
+ The semantics of Impala SQL statements varies from HiveQL in some cases where they use similar SQL
+ statement and clause names:
+ </p>
+
+ <ul>
+ <li>
+ Impala uses different syntax and names for query hints, <codeph>[SHUFFLE]</codeph> and
+ <codeph>[NOSHUFFLE]</codeph> rather than <codeph>MapJoin</codeph> or <codeph>StreamJoin</codeph>. See
+ <xref href="impala_joins.xml#joins"/> for the Impala details.
+ </li>
+
+ <li>
+ Impala does not expose MapReduce specific features of <codeph>SORT BY</codeph>, <codeph>DISTRIBUTE
+ BY</codeph>, or <codeph>CLUSTER BY</codeph>.
+ </li>
+
+ <li>
+ Impala does not require queries to include a <codeph>FROM</codeph> clause.
+ </li>
+ </ul>
+
+ <p>
+ <b>Data types:</b>
+ </p>
+
+ <ul>
+ <li>
+ Impala supports a limited set of implicit casts. This can help avoid undesired results from unexpected
+ casting behavior.
+ <ul>
+ <li>
+ Impala does not implicitly cast between string and numeric or Boolean types. Always use
+ <codeph>CAST()</codeph> for these conversions.
+ </li>
+
+ <li>
+ Impala does perform implicit casts among the numeric types, when going from a smaller or less precise
+ type to a larger or more precise one. For example, Impala will implicitly convert a
+ <codeph>SMALLINT</codeph> to a <codeph>BIGINT</codeph> or <codeph>FLOAT</codeph>, but to convert from
+ <codeph>DOUBLE</codeph> to <codeph>FLOAT</codeph> or <codeph>INT</codeph> to <codeph>TINYINT</codeph>
+ requires a call to <codeph>CAST()</codeph> in the query.
+ </li>
+
+ <li>
+ Impala does perform implicit casts from string to timestamp. Impala has a restricted set of literal
+ formats for the <codeph>TIMESTAMP</codeph> data type and the <codeph>from_unixtime()</codeph> format
+ string; see <xref href="impala_timestamp.xml#timestamp"/> for details.
+ </li>
+ </ul>
+ <p>
+ See <xref href="impala_datatypes.xml#datatypes"/> for full details on implicit and explicit casting for
+ all types, and <xref href="impala_conversion_functions.xml#conversion_functions"/> for details about
+ the <codeph>CAST()</codeph> function.
+ </p>
+ </li>
+
+ <li>
+ Impala does not store or interpret timestamps using the local timezone, to avoid undesired results from
+ unexpected time zone issues. Timestamps are stored and interpreted relative to UTC. This difference can
+ produce different results for some calls to similarly named date/time functions between Impala and Hive.
+ See <xref href="impala_datetime_functions.xml#datetime_functions"/> for details about the Impala
+ functions. See <xref href="impala_timestamp.xml#timestamp"/> for a discussion of how Impala handles
+ time zones, and configuration options you can use to make Impala match the Hive behavior more closely
+ when dealing with Parquet-encoded <codeph>TIMESTAMP</codeph> data or when converting between
+ the local time zone and UTC.
+ </li>
+
+ <li>
+ The Impala <codeph>TIMESTAMP</codeph> type can represent dates ranging from 1400-01-01 to 9999-12-31.
+ This is different from the Hive date range, which is 0000-01-01 to 9999-12-31.
+ </li>
+
+ <li>
+ Impala does not return column overflows as <codeph>NULL</codeph>, so that customers can distinguish
+ between <codeph>NULL</codeph> data and overflow conditions similar to how they do so with traditional
+ database systems. Impala returns the largest or smallest value in the range for the type. For example,
+ valid values for a <codeph>tinyint</codeph> range from -128 to 127. In Impala, a <codeph>tinyint</codeph>
+ with a value of -200 returns -128 rather than <codeph>NULL</codeph>. A <codeph>tinyint</codeph> with a
+ value of 200 returns 127.
+ </li>
+ </ul>
+
+ <p>
+ <b>Miscellaneous features:</b>
+ </p>
+
+ <ul>
+ <li>
+ Impala does not provide virtual columns.
+ </li>
+
+ <li>
+ Impala does not expose locking.
+ </li>
+
+ <li>
+ Impala does not expose some configuration properties.
+ </li>
+ </ul>
+ </conbody>
+ </concept>
+</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_limit.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_limit.xml b/docs/topics/impala_limit.xml
new file mode 100644
index 0000000..c186cd4
--- /dev/null
+++ b/docs/topics/impala_limit.xml
@@ -0,0 +1,149 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="limit">
+
+ <title>LIMIT Clause</title>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="SQL"/>
+ <data name="Category" value="Querying"/>
+ <data name="Category" value="Reports"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ The <codeph>LIMIT</codeph> clause in a <codeph>SELECT</codeph> query sets a maximum number of rows for the
+ result set. Pre-selecting the maximum size of the result set helps Impala to optimize memory usage while
+ processing a distributed query.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock>LIMIT <varname>constant_integer_expression</varname></codeblock>
+
+ <p>
+ The argument to the <codeph>LIMIT</codeph> clause must evaluate to a constant value. It can be a numeric
+ literal, or another kind of numeric expression involving operators, casts, and function return values. You
+ cannot refer to a column or use a subquery.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+ <p>
+ This clause is useful in contexts such as:
+ </p>
+
+ <ul>
+ <li>
+ To return exactly N items from a top-N query, such as the 10 highest-rated items in a shopping category or
+ the 50 hostnames that refer the most traffic to a web site.
+ </li>
+
+ <li>
+ To demonstrate some sample values from a table or a particular query. (To display some arbitrary items, use
+ a query with no <codeph>ORDER BY</codeph> clause. An <codeph>ORDER BY</codeph> clause causes additional
+ memory and/or disk usage during the query.)
+ </li>
+
+ <li>
+ To keep queries from returning huge result sets by accident if a table is larger than expected, or a
+ <codeph>WHERE</codeph> clause matches more rows than expected.
+ </li>
+ </ul>
+
+ <p rev="1.2.1">
+ Originally, the value for the <codeph>LIMIT</codeph> clause had to be a numeric literal. In Impala 1.2.1 and
+ higher, it can be a numeric expression.
+ </p>
+
+ <p rev="obwl" conref="../shared/impala_common.xml#common/order_by_limit"/>
+
+ <p>
+ See <xref href="impala_order_by.xml#order_by"/> for details.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/limit_and_offset"/>
+
+ <p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
+
+ <p conref="../shared/impala_common.xml#common/subquery_no_limit"/>
+
+ <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+ <p>
+ The following example shows how the <codeph>LIMIT</codeph> clause caps the size of the result set, with the
+ limit being applied after any other clauses such as <codeph>WHERE</codeph>.
+ </p>
+
+<codeblock>[localhost:21000] > create database limits;
+[localhost:21000] > use limits;
+[localhost:21000] > create table numbers (x int);
+[localhost:21000] > insert into numbers values (1), (3), (4), (5), (2);
+Inserted 5 rows in 1.34s
+[localhost:21000] > select x from numbers limit 100;
++---+
+| x |
++---+
+| 1 |
+| 3 |
+| 4 |
+| 5 |
+| 2 |
++---+
+Returned 5 row(s) in 0.26s
+[localhost:21000] > select x from numbers limit 3;
++---+
+| x |
++---+
+| 1 |
+| 3 |
+| 4 |
++---+
+Returned 3 row(s) in 0.27s
+[localhost:21000] > select x from numbers where x > 2 limit 2;
++---+
+| x |
++---+
+| 3 |
+| 4 |
++---+
+Returned 2 row(s) in 0.27s</codeblock>
+
+ <p>
+ For top-N and bottom-N queries, you use the <codeph>ORDER BY</codeph> and <codeph>LIMIT</codeph> clauses
+ together:
+ </p>
+
+<codeblock rev="obwl">[localhost:21000] > select x as "Top 3" from numbers order by x desc limit 3;
++-------+
+| top 3 |
++-------+
+| 5 |
+| 4 |
+| 3 |
++-------+
+[localhost:21000] > select x as "Bottom 3" from numbers order by x limit 3;
++----------+
+| bottom 3 |
++----------+
+| 1 |
+| 2 |
+| 3 |
++----------+
+</codeblock>
+
+ <p>
+ You can use constant values besides integer literals as the <codeph>LIMIT</codeph> argument:
+ </p>
+
+<codeblock>-- Other expressions that yield constant integer values work too.
+SELECT x FROM t1 LIMIT 1e6; -- Limit is one million.
+SELECT x FROM t1 LIMIT length('hello world'); -- Limit is 11.
+SELECT x FROM t1 LIMIT 2+2; -- Limit is 4.
+SELECT x FROM t1 LIMIT cast(truncate(9.9) AS INT); -- Limit is 9.
+</codeblock>
+ </conbody>
+</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_literals.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_literals.xml b/docs/topics/impala_literals.xml
new file mode 100644
index 0000000..3c53796
--- /dev/null
+++ b/docs/topics/impala_literals.xml
@@ -0,0 +1,384 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="literals">
+
+ <title>Literals</title>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="Impala Data Types"/>
+ <data name="Category" value="SQL"/>
+ <data name="Category" value="Data Analysts"/>
+ <data name="Category" value="Developers"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">literals</indexterm>
+ Each of the Impala data types has corresponding notation for literal values of that type. You specify literal
+ values in SQL statements, such as in the <codeph>SELECT</codeph> list or <codeph>WHERE</codeph> clause of a
+ query, or as an argument to a function call. See <xref href="impala_datatypes.xml#datatypes"/> for a complete
+ list of types, ranges, and conversion rules.
+ </p>
+
+ <p outputclass="toc inpage"/>
+ </conbody>
+
+ <concept id="numeric_literals">
+
+ <title>Numeric Literals</title>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">numeric literals</indexterm>
+ To write literals for the integer types (<codeph>TINYINT</codeph>, <codeph>SMALLINT</codeph>,
+ <codeph>INT</codeph>, and <codeph>BIGINT</codeph>), use a sequence of digits with optional leading zeros.
+ </p>
+
+ <p rev="1.4.0">
+ To write literals for the floating-point types (<codeph rev="1.4.0">DECIMAL</codeph>,
+ <codeph>FLOAT</codeph>, and <codeph>DOUBLE</codeph>), use a sequence of digits with an optional decimal
+ point (<codeph>.</codeph> character). To preserve accuracy during arithmetic expressions, Impala interprets
+ floating-point literals as the <codeph>DECIMAL</codeph> type with the smallest appropriate precision and
+ scale, until required by the context to convert the result to <codeph>FLOAT</codeph> or
+ <codeph>DOUBLE</codeph>.
+ </p>
+
+ <p>
+ Integer values are promoted to floating-point when necessary, based on the context.
+ </p>
+
+ <p>
+ You can also use exponential notation by including an <codeph>e</codeph> character. For example,
+ <codeph>1e6</codeph> is 1 times 10 to the power of 6 (1 million). A number in exponential notation is
+ always interpreted as floating-point.
+ </p>
+
+ <p rev="tk">
+ When Impala encounters a numeric literal, it considers the type to be the <q>smallest</q> that can
+ accurately represent the value. The type is promoted to larger or more accurate types if necessary, based
+ on subsequent parts of an expression.
+ </p>
+ <p>
+ For example, you can see by the types Impala defines for the following table columns
+ how it interprets the corresponding numeric literals:
+ </p>
+<codeblock>[localhost:21000] > create table ten as select 10 as x;
++-------------------+
+| summary |
++-------------------+
+| Inserted 1 row(s) |
++-------------------+
+[localhost:21000] > desc ten;
++------+---------+---------+
+| name | type | comment |
++------+---------+---------+
+| x | tinyint | |
++------+---------+---------+
+
+[localhost:21000] > create table four_k as select 4096 as x;
++-------------------+
+| summary |
++-------------------+
+| Inserted 1 row(s) |
++-------------------+
+[localhost:21000] > desc four_k;
++------+----------+---------+
+| name | type | comment |
++------+----------+---------+
+| x | smallint | |
++------+----------+---------+
+
+[localhost:21000] > create table one_point_five as select 1.5 as x;
++-------------------+
+| summary |
++-------------------+
+| Inserted 1 row(s) |
++-------------------+
+[localhost:21000] > desc one_point_five;
++------+--------------+---------+
+| name | type | comment |
++------+--------------+---------+
+| x | decimal(2,1) | |
++------+--------------+---------+
+
+[localhost:21000] > create table one_point_three_three_three as select 1.333 as x;
++-------------------+
+| summary |
++-------------------+
+| Inserted 1 row(s) |
++-------------------+
+[localhost:21000] > desc one_point_three_three_three;
++------+--------------+---------+
+| name | type | comment |
++------+--------------+---------+
+| x | decimal(4,3) | |
++------+--------------+---------+
+</codeblock>
+ </conbody>
+ </concept>
+
+ <concept id="string_literals">
+
+ <title>String Literals</title>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">string literals</indexterm>
+ String literals are quoted using either single or double quotation marks. You can use either kind of quotes
+ for string literals, even both kinds for different literals within the same statement.
+ </p>
+
+ <p rev="2.0.0">
+ Quoted literals are considered to be of type <codeph>STRING</codeph>. To use quoted literals in contexts
+ requiring a <codeph>CHAR</codeph> or <codeph>VARCHAR</codeph> value, <codeph>CAST()</codeph> the literal to
+ a <codeph>CHAR</codeph> or <codeph>VARCHAR</codeph> of the appropriate length.
+ </p>
+
+ <p>
+ <b>Escaping special characters:</b>
+ </p>
+
+ <p>
+ To encode special characters within a string literal, precede them with the backslash (<codeph>\</codeph>)
+ escape character:
+ </p>
+
+ <ul>
+ <li>
+ <codeph>\t</codeph> represents a tab.
+ </li>
+
+ <li>
+ <codeph>\n</codeph> represents a newline or linefeed. This might cause extra line breaks in
+ <cmdname>impala-shell</cmdname> output.
+ </li>
+
+ <li>
+ <codeph>\r</codeph> represents a carriage return. This might cause unusual formatting (making it appear
+ that some content is overwritten) in <cmdname>impala-shell</cmdname> output.
+ </li>
+
+ <li>
+ <codeph>\b</codeph> represents a backspace. This might cause unusual formatting (making it appear that
+ some content is overwritten) in <cmdname>impala-shell</cmdname> output.
+ </li>
+
+ <li>
+ <codeph>\0</codeph> represents an ASCII <codeph>nul</codeph> character (not the same as a SQL
+ <codeph>NULL</codeph>). This might not be visible in <cmdname>impala-shell</cmdname> output.
+ </li>
+
+ <li>
+ <codeph>\Z</codeph> represents a DOS end-of-file character. This might not be visible in
+ <cmdname>impala-shell</cmdname> output.
+ </li>
+
+ <li>
+ <codeph>\%</codeph> and <codeph>\_</codeph> can be used to escape wildcard characters within the string
+ passed to the <codeph>LIKE</codeph> operator.
+ </li>
+
+ <li>
+ <codeph>\</codeph> followed by 3 octal digits represents the ASCII code of a single character; for
+ example, <codeph>\101</codeph> is ASCII 65, the character <codeph>A</codeph>.
+ </li>
+
+ <li>
+ Use two consecutive backslashes (<codeph>\\</codeph>) to prevent the backslash from being interpreted as
+ an escape character.
+ </li>
+
+ <li>
+ Use the backslash to escape single or double quotation mark characters within a string literal, if the
+ literal is enclosed by the same type of quotation mark.
+ </li>
+
+ <li>
+ If the character following the <codeph>\</codeph> does not represent the start of a recognized escape
+ sequence, the character is passed through unchanged.
+ </li>
+ </ul>
+
+ <p>
+ <b>Quotes within quotes:</b>
+ </p>
+
+ <p>
+ To include a single quotation character within a string value, enclose the literal with either single or
+ double quotation marks, and optionally escape the single quote as a <codeph>\'</codeph> sequence. Earlier
+ releases required escaping a single quote inside double quotes. Continue using escape sequences in this
+ case if you also need to run your SQL code on older versions of Impala.
+ </p>
+
+ <p>
+ To include a double quotation character within a string value, enclose the literal with single quotation
+ marks, no escaping is necessary in this case. Or, enclose the literal with double quotation marks and
+ escape the double quote as a <codeph>\"</codeph> sequence.
+ </p>
+
+<codeblock>[localhost:21000] > select "What\'s happening?" as single_within_double,
+ > 'I\'m not sure.' as single_within_single,
+ > "Homer wrote \"The Iliad\"." as double_within_double,
+ > 'Homer also wrote "The Odyssey".' as double_within_single;
++----------------------+----------------------+--------------------------+---------------------------------+
+| single_within_double | single_within_single | double_within_double | double_within_single |
++----------------------+----------------------+--------------------------+---------------------------------+
+| What's happening? | I'm not sure. | Homer wrote "The Iliad". | Homer also wrote "The Odyssey". |
++----------------------+----------------------+--------------------------+---------------------------------+
+</codeblock>
+
+ <p>
+ <b>Field terminator character in CREATE TABLE:</b>
+ </p>
+
+ <note conref="../shared/impala_common.xml#common/thorn"/>
+
+ <p>
+ <b>impala-shell considerations:</b>
+ </p>
+
+ <p>
+ When dealing with output that includes non-ASCII or non-printable characters such as linefeeds and
+ backspaces, use the <cmdname>impala-shell</cmdname> options to save to a file, turn off pretty printing, or
+ both rather than relying on how the output appears visually. See
+ <xref href="impala_shell_options.xml#shell_options"/> for a list of <cmdname>impala-shell</cmdname>
+ options.
+ </p>
+ </conbody>
+ </concept>
+
+ <concept id="boolean_literals">
+
+ <title>Boolean Literals</title>
+
+ <conbody>
+
+ <p>
+ For <codeph>BOOLEAN</codeph> values, the literals are <codeph>TRUE</codeph> and <codeph>FALSE</codeph>,
+ with no quotation marks and case-insensitive.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+<codeblock>select true;
+select * from t1 where assertion = false;
+select case bool_col when true then 'yes' when false 'no' else 'null' end from t1;</codeblock>
+ </conbody>
+ </concept>
+
+ <concept id="timestamp_literals">
+
+ <title>Timestamp Literals</title>
+
+ <conbody>
+
+ <p conref="../shared/impala_common.xml#common/timestamp_conversions"/>
+
+ <p>
+ You can also use <codeph>INTERVAL</codeph> expressions to add or subtract from timestamp literal values,
+ such as <codeph>'1966-07-30' + INTERVAL 5 YEARS + INTERVAL 3 DAYS</codeph>. See
+ <xref href="impala_timestamp.xml#timestamp"/> for details.
+ </p>
+
+ <p>
+ Depending on your data pipeline, you might receive date and time data as text, in notation that does not
+ exactly match the format for Impala <codeph>TIMESTAMP</codeph> literals.
+ See <xref href="impala_datetime_functions.xml#datetime_functions"/> for functions that can convert
+ between a variety of string literals (including different field order, separators, and timezone notation)
+ and equivalent <codeph>TIMESTAMP</codeph> or numeric values.
+ </p>
+ </conbody>
+ </concept>
+
+ <concept id="null">
+
+ <title>NULL</title>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">NULL</indexterm>
+ The notion of <codeph>NULL</codeph> values is familiar from all kinds of database systems, but each SQL
+ dialect can have its own behavior and restrictions on <codeph>NULL</codeph> values. For Big Data
+ processing, the precise semantics of <codeph>NULL</codeph> values are significant: any misunderstanding
+ could lead to inaccurate results or misformatted data, that could be time-consuming to correct for large
+ data sets.
+ </p>
+
+ <ul>
+ <li>
+ <codeph>NULL</codeph> is a different value than an empty string. The empty string is represented by a
+ string literal with nothing inside, <codeph>""</codeph> or <codeph>''</codeph>.
+ </li>
+
+ <li>
+ In a delimited text file, the <codeph>NULL</codeph> value is represented by the special token
+ <codeph>\N</codeph>.
+ </li>
+
+ <li>
+ When Impala inserts data into a partitioned table, and the value of one of the partitioning columns is
+ <codeph>NULL</codeph> or the empty string, the data is placed in a special partition that holds only
+ these two kinds of values. When these values are returned in a query, the result is <codeph>NULL</codeph>
+ whether the value was originally <codeph>NULL</codeph> or an empty string. This behavior is compatible
+ with the way Hive treats <codeph>NULL</codeph> values in partitioned tables. Hive does not allow empty
+ strings as partition keys, and it returns a string value such as
+ <codeph>__HIVE_DEFAULT_PARTITION__</codeph> instead of <codeph>NULL</codeph> when such values are
+ returned from a query. For example:
+<codeblock>create table t1 (i int) partitioned by (x int, y string);
+-- Select an INT column from another table, with all rows going into a special HDFS subdirectory
+-- named __HIVE_DEFAULT_PARTITION__. Depending on whether one or both of the partitioning keys
+-- are null, this special directory name occurs at different levels of the physical data directory
+-- for the table.
+insert into t1 partition(x=NULL, y=NULL) select c1 from some_other_table;
+insert into t1 partition(x, y=NULL) select c1, c2 from some_other_table;
+insert into t1 partition(x=NULL, y) select c1, c3 from some_other_table;</codeblock>
+ </li>
+
+ <li>
+ There is no <codeph>NOT NULL</codeph> clause when defining a column to prevent <codeph>NULL</codeph>
+ values in that column.
+ </li>
+
+ <li>
+ There is no <codeph>DEFAULT</codeph> clause to specify a non-<codeph>NULL</codeph> default value.
+ </li>
+
+ <li>
+ If an <codeph>INSERT</codeph> operation mentions some columns but not others, the unmentioned columns
+ contain <codeph>NULL</codeph> for all inserted rows.
+ </li>
+
+ <li rev="1.2.1">
+ <p conref="../shared/impala_common.xml#common/null_sorting_change"/>
+ <note>
+ <draft-comment translate="no"> Probably a bunch of similar view-related restrictions like this that should be collected, reused, or cross-referenced under the Views topic. </draft-comment>
+ Because the <codeph>NULLS FIRST</codeph> and <codeph>NULLS LAST</codeph> keywords are not currently
+ available in Hive queries, any views you create using those keywords will not be available through
+ Hive.
+ </note>
+ </li>
+
+ <li>
+ In all other contexts besides sorting with <codeph>ORDER BY</codeph>, comparing a <codeph>NULL</codeph>
+ to anything else returns <codeph>NULL</codeph>, making the comparison meaningless. For example,
+ <codeph>10 > NULL</codeph> produces <codeph>NULL</codeph>, <codeph>10 < NULL</codeph> also produces
+ <codeph>NULL</codeph>, <codeph>5 BETWEEN 1 AND NULL</codeph> produces <codeph>NULL</codeph>, and so on.
+ </li>
+ </ul>
+
+ <p>
+ Several built-in functions serve as shorthand for evaluating expressions and returning
+ <codeph>NULL</codeph>, 0, or some other substitution value depending on the expression result:
+ <codeph>ifnull()</codeph>, <codeph>isnull()</codeph>, <codeph>nvl()</codeph>, <codeph>nullif()</codeph>,
+ <codeph>nullifzero()</codeph>, and <codeph>zeroifnull()</codeph>. See
+ <xref href="impala_conditional_functions.xml#conditional_functions"/> for details.
+ </p>
+ </conbody>
+ </concept>
+</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_live_progress.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_live_progress.xml b/docs/topics/impala_live_progress.xml
new file mode 100644
index 0000000..f58cdcb
--- /dev/null
+++ b/docs/topics/impala_live_progress.xml
@@ -0,0 +1,81 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept rev="2.3.0" id="live_progress">
+
+ <title>LIVE_PROGRESS Query Option</title>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="Impala Query Options"/>
+ <data name="Category" value="Querying"/>
+ <data name="Category" value="Performance"/>
+ <data name="Category" value="Reports"/>
+ <data name="Category" value="impala-shell"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">LIVE_PROGRESS query option</indexterm>
+ For queries submitted through the <cmdname>impala-shell</cmdname> command,
+ displays an interactive progress bar showing roughly what percentage of
+ processing has been completed. When the query finishes, the progress bar is erased
+ from the <cmdname>impala-shell</cmdname> console output.
+ </p>
+
+ <p>
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/type_boolean"/>
+ <p conref="../shared/impala_common.xml#common/default_false_0"/>
+
+ <p conref="../shared/impala_common.xml#common/command_line_blurb"/>
+ <p>
+ You can enable this query option within <cmdname>impala-shell</cmdname>
+ by starting the shell with the <codeph>--live_progress</codeph>
+ command-line option.
+ You can still turn this setting off and on again within the shell through the
+ <codeph>SET</codeph> command.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+ <p conref="../shared/impala_common.xml#common/live_reporting_details"/>
+ <p>
+ For a more detailed way of tracking the progress of an interactive query through
+ all phases of processing, see <xref href="impala_live_summary.xml#live_summary"/>.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
+ <p>
+ Because the percentage complete figure is calculated using the number of
+ issued and completed <q>scan ranges</q>, which occur while reading the table
+ data, the progress bar might reach 100% before the query is entirely finished.
+ For example, the query might do work to perform aggregations after all the
+ table data has been read. If many of your queries fall into this category,
+ consider using the <codeph>LIVE_SUMMARY</codeph> option instead for
+ more granular progress reporting.
+ </p>
+ <p conref="../shared/impala_common.xml#common/impala_shell_progress_reports_compute_stats_caveat"/>
+ <p conref="../shared/impala_common.xml#common/impala_shell_progress_reports_shell_only_caveat"/>
+
+ <p conref="../shared/impala_common.xml#common/example_blurb"/>
+<codeblock><![CDATA[[localhost:21000] > set live_progress=true;
+LIVE_PROGRESS set to true
+[localhost:21000] > select count(*) from customer;
++----------+
+| count(*) |
++----------+
+| 150000 |
++----------+
+[localhost:21000] > select count(*) from customer t1 cross join customer t2;
+[################################################## ] 50%
+[####################################################################################################] 100%
+
+]]>
+</codeblock>
+
+ <p conref="../shared/impala_common.xml#common/live_progress_live_summary_asciinema"/>
+
+ </conbody>
+</concept>