You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@impala.apache.org by jr...@apache.org on 2016/07/26 23:04:58 UTC
[06/22] incubator-impala git commit: First try at porting over the
source files necessary for the Impala SQL Reference.
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_porting.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_porting.xml b/docs/topics/impala_porting.xml
new file mode 100644
index 0000000..c9c8e52
--- /dev/null
+++ b/docs/topics/impala_porting.xml
@@ -0,0 +1,622 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="porting">
+
+ <title>Porting SQL from Other Database Systems to Impala</title>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="SQL"/>
+ <data name="Category" value="Databases"/>
+ <data name="Category" value="Hive"/>
+ <data name="Category" value="Oracle"/>
+ <data name="Category" value="MySQL"/>
+ <data name="Category" value="PostgreSQL"/>
+ <data name="Category" value="Troubleshooting"/>
+ <data name="Category" value="Porting"/>
+ <data name="Category" value="Data Analysts"/>
+ <data name="Category" value="Developers"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">porting</indexterm>
+ Although Impala uses standard SQL for queries, you might need to modify SQL source when bringing applications
+ to Impala, due to variations in data types, built-in functions, vendor language extensions, and
+ Hadoop-specific syntax. Even when SQL is working correctly, you might make further minor modifications for
+ best performance.
+ </p>
+
+ <p outputclass="toc inpage"/>
+ </conbody>
+
+ <concept id="porting_ddl_dml">
+
+ <title>Porting DDL and DML Statements</title>
+
+ <conbody>
+
+ <p>
+ When adapting SQL code from a traditional database system to Impala, expect to find a number of differences
+ in the DDL statements that you use to set up the schema. Clauses related to physical layout of files,
+ tablespaces, and indexes have no equivalent in Impala. You might restructure your schema considerably to
+ account for the Impala partitioning scheme and Hadoop file formats.
+ </p>
+
+ <p>
+ Expect SQL queries to have a much higher degree of compatibility. With modest rewriting to address vendor
+ extensions and features not yet supported in Impala, you might be able to run identical or almost-identical
+ query text on both systems.
+ </p>
+
+ <p>
+ Therefore, consider separating out the DDL into a separate Impala-specific setup script. Focus your reuse
+ and ongoing tuning efforts on the code for SQL queries.
+ </p>
+ </conbody>
+ </concept>
+
+ <concept id="porting_data_types">
+
+ <title>Porting Data Types from Other Database Systems</title>
+
+ <conbody>
+
+ <ul>
+ <li>
+ <p>
+ Change any <codeph>VARCHAR</codeph>, <codeph>VARCHAR2</codeph>, and <codeph>CHAR</codeph> columns to
+ <codeph>STRING</codeph>. Remove any length constraints from the column declarations; for example,
+ change <codeph>VARCHAR(32)</codeph> or <codeph>CHAR(1)</codeph> to <codeph>STRING</codeph>. Impala is
+ very flexible about the length of string values; it does not impose any length constraints
+ or do any special processing (such as blank-padding) for <codeph>STRING</codeph> columns.
+ (In Impala 2.0 and higher, there are data types <codeph>VARCHAR</codeph> and <codeph>CHAR</codeph>,
+ with length constraints for both types and blank-padding for <codeph>CHAR</codeph>.
+ However, for performance reasons, it is still preferable to use <codeph>STRING</codeph>
+ columns where practical.)
+ </p>
+ </li>
+
+ <li>
+ <p>
+ For national language character types such as <codeph>NCHAR</codeph>, <codeph>NVARCHAR</codeph>, or
+ <codeph>NCLOB</codeph>, be aware that while Impala can store and query UTF-8 character data, currently
+ some string manipulation operations only work correctly with ASCII data. See
+ <xref href="impala_string.xml#string"/> for details.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Change any <codeph>DATE</codeph>, <codeph>DATETIME</codeph>, or <codeph>TIME</codeph> columns to
+ <codeph>TIMESTAMP</codeph>. Remove any precision constraints. Remove any timezone clauses, and make
+ sure your application logic or ETL process accounts for the fact that Impala expects all
+ <codeph>TIMESTAMP</codeph> values to be in
+ <xref href="http://en.wikipedia.org/wiki/Coordinated_Universal_Time" scope="external" format="html">Coordinated
+ Universal Time (UTC)</xref>. See <xref href="impala_timestamp.xml#timestamp"/> for information about
+ the <codeph>TIMESTAMP</codeph> data type, and
+ <xref href="impala_datetime_functions.xml#datetime_functions"/> for conversion functions for different
+ date and time formats.
+ </p>
+ <p>
+ You might also need to adapt date- and time-related literal values and format strings to use the
+ supported Impala date and time formats. If you have date and time literals with different separators or
+ different numbers of <codeph>YY</codeph>, <codeph>MM</codeph>, and so on placeholders than Impala
+ expects, consider using calls to <codeph>regexp_replace()</codeph> to transform those values to the
+ Impala-compatible format. See <xref href="impala_timestamp.xml#timestamp"/> for information about the
+ allowed formats for date and time literals, and
+ <xref href="impala_string_functions.xml#string_functions"/> for string conversion functions such as
+ <codeph>regexp_replace()</codeph>.
+ </p>
+ <p>
+ Instead of <codeph>SYSDATE</codeph>, call the function <codeph>NOW()</codeph>.
+ </p>
+ <p>
+ Instead of adding or subtracting directly from a date value to produce a value <varname>N</varname>
+ days in the past or future, use an <codeph>INTERVAL</codeph> expression, for example <codeph>NOW() +
+ INTERVAL 30 DAYS</codeph>.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Although Impala supports <codeph>INTERVAL</codeph> expressions for datetime arithmetic, as shown in
+ <xref href="impala_timestamp.xml#timestamp"/>, <codeph>INTERVAL</codeph> is not available as a column
+ data type in Impala. For any <codeph>INTERVAL</codeph> values stored in tables, convert them to numeric
+ values that you can add or subtract using the functions in
+ <xref href="impala_datetime_functions.xml#datetime_functions"/>. For example, if you had a table
+ <codeph>DEADLINES</codeph> with an <codeph>INT</codeph> column <codeph>TIME_PERIOD</codeph>, you could
+ construct dates N days in the future like so:
+ </p>
+<codeblock>SELECT NOW() + INTERVAL time_period DAYS from deadlines;</codeblock>
+ </li>
+
+ <li>
+ <p>
+ For <codeph>YEAR</codeph> columns, change to the smallest Impala integer type that has sufficient
+ range. See <xref href="impala_datatypes.xml#datatypes"/> for details about ranges, casting, and so on
+ for the various numeric data types.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Change any <codeph>DECIMAL</codeph> and <codeph>NUMBER</codeph> types. If fixed-point precision is not
+ required, you can use <codeph>FLOAT</codeph> or <codeph>DOUBLE</codeph> on the Impala side depending on
+ the range of values. For applications that require precise decimal values, such as financial data, you
+ might need to make more extensive changes to table structure and application logic, such as using
+ separate integer columns for dollars and cents, or encoding numbers as string values and writing UDFs
+ to manipulate them. See <xref href="impala_datatypes.xml#datatypes"/> for details about ranges,
+ casting, and so on for the various numeric data types.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ <codeph>FLOAT</codeph>, <codeph>DOUBLE</codeph>, and <codeph>REAL</codeph> types are supported in
+ Impala. Remove any precision and scale specifications. (In Impala, <codeph>REAL</codeph> is just an
+ alias for <codeph>DOUBLE</codeph>; columns declared as <codeph>REAL</codeph> are turned into
+ <codeph>DOUBLE</codeph> behind the scenes.) See <xref href="impala_datatypes.xml#datatypes"/> for
+ details about ranges, casting, and so on for the various numeric data types.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Most integer types from other systems have equivalents in Impala, perhaps under different names such as
+ <codeph>BIGINT</codeph> instead of <codeph>INT8</codeph>. For any that are unavailable, for example
+ <codeph>MEDIUMINT</codeph>, switch to the smallest Impala integer type that has sufficient range.
+ Remove any precision specifications. See <xref href="impala_datatypes.xml#datatypes"/> for details
+ about ranges, casting, and so on for the various numeric data types.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Remove any <codeph>UNSIGNED</codeph> constraints. All Impala numeric types are signed. See
+ <xref href="impala_datatypes.xml#datatypes"/> for details about ranges, casting, and so on for the
+ various numeric data types.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ For any types holding bitwise values, use an integer type with enough range to hold all the relevant
+ bits within a positive integer. See <xref href="impala_datatypes.xml#datatypes"/> for details about
+ ranges, casting, and so on for the various numeric data types.
+ </p>
+ <p>
+ For example, <codeph>TINYINT</codeph> has a maximum positive value of 127, not 256, so to manipulate
+ 8-bit bitfields as positive numbers switch to the next largest type <codeph>SMALLINT</codeph>.
+ </p>
+<codeblock>[localhost:21000] > select cast(127*2 as tinyint);
++--------------------------+
+| cast(127 * 2 as tinyint) |
++--------------------------+
+| -2 |
++--------------------------+
+[localhost:21000] > select cast(128 as tinyint);
++----------------------+
+| cast(128 as tinyint) |
++----------------------+
+| -128 |
++----------------------+
+[localhost:21000] > select cast(127*2 as smallint);
++---------------------------+
+| cast(127 * 2 as smallint) |
++---------------------------+
+| 254 |
++---------------------------+</codeblock>
+ <p>
+ Impala does not support notation such as <codeph>b'0101'</codeph> for bit literals.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ For BLOB values, use <codeph>STRING</codeph> to represent <codeph>CLOB</codeph> or
+ <codeph>TEXT</codeph> types (character based large objects) up to 32 KB in size. Binary large objects
+ such as <codeph>BLOB</codeph>, <codeph>RAW</codeph> <codeph>BINARY</codeph>, and
+ <codeph>VARBINARY</codeph> do not currently have an equivalent in Impala.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ For Boolean-like types such as <codeph>BOOL</codeph>, use the Impala <codeph>BOOLEAN</codeph> type.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Because Impala currently does not support composite or nested types, any spatial data types in other
+ database systems do not have direct equivalents in Impala. You could represent spatial values in string
+ format and write UDFs to process them. See <xref href="impala_udf.xml#udfs"/> for details. Where
+ practical, separate spatial types into separate tables so that Impala can still work with the
+ non-spatial data.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Take out any <codeph>DEFAULT</codeph> clauses. Impala can use data files produced from many different
+ sources, such as Pig, Hive, or MapReduce jobs. The fast import mechanisms of <codeph>LOAD DATA</codeph>
+ and external tables mean that Impala is flexible about the format of data files, and Impala does not
+ necessarily validate or cleanse data before querying it. When copying data through Impala
+ <codeph>INSERT</codeph> statements, you can use conditional functions such as <codeph>CASE</codeph> or
+ <codeph>NVL</codeph> to substitute some other value for <codeph>NULL</codeph> fields; see
+ <xref href="impala_conditional_functions.xml#conditional_functions"/> for details.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Take out any constraints from your <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph>
+ statements, for example <codeph>PRIMARY KEY</codeph>, <codeph>FOREIGN KEY</codeph>,
+ <codeph>UNIQUE</codeph>, <codeph>NOT NULL</codeph>, <codeph>UNSIGNED</codeph>, or
+ <codeph>CHECK</codeph> constraints. Impala can use data files produced from many different sources,
+ such as Pig, Hive, or MapReduce jobs. Therefore, Impala expects initial data validation to happen
+ earlier during the ETL or ELT cycle. After data is loaded into Impala tables, you can perform queries
+ to test for <codeph>NULL</codeph> values. When copying data through Impala <codeph>INSERT</codeph>
+ statements, you can use conditional functions such as <codeph>CASE</codeph> or <codeph>NVL</codeph> to
+ substitute some other value for <codeph>NULL</codeph> fields; see
+ <xref href="impala_conditional_functions.xml#conditional_functions"/> for details.
+ </p>
+ <p>
+ Do as much verification as practical before loading data into Impala. After data is loaded into Impala,
+ you can do further verification using SQL queries to check if values have expected ranges, if values
+ are <codeph>NULL</codeph> or not, and so on. If there is a problem with the data, you will need to
+ re-run earlier stages of the ETL process, or do an <codeph>INSERT ... SELECT</codeph> statement in
+ Impala to copy the faulty data to a new table and transform or filter out the bad values.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Take out any <codeph>CREATE INDEX</codeph>, <codeph>DROP INDEX</codeph>, and <codeph>ALTER
+ INDEX</codeph> statements, and equivalent <codeph>ALTER TABLE</codeph> statements. Remove any
+ <codeph>INDEX</codeph>, <codeph>KEY</codeph>, or <codeph>PRIMARY KEY</codeph> clauses from
+ <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> statements. Impala is optimized for bulk
+ read operations for data warehouse-style queries, and therefore does not support indexes for its
+ tables.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Calls to built-in functions with out-of-range or otherwise incorrect arguments, return
+ <codeph>NULL</codeph> in Impala as opposed to raising exceptions. (This rule applies even when the
+ <codeph>ABORT_ON_ERROR=true</codeph> query option is in effect.) Run small-scale queries using
+ representative data to doublecheck that calls to built-in functions are returning expected values
+ rather than <codeph>NULL</codeph>. For example, unsupported <codeph>CAST</codeph> operations do not
+ raise an error in Impala:
+ </p>
+<codeblock>select cast('foo' as int);
++--------------------+
+| cast('foo' as int) |
++--------------------+
+| NULL |
++--------------------+</codeblock>
+ </li>
+
+ <li>
+ <p>
+ For any other type not supported in Impala, you could represent their values in string format and write
+ UDFs to process them. See <xref href="impala_udf.xml#udfs"/> for details.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ To detect the presence of unsupported or unconvertable data types in data files, do initial testing
+ with the <codeph>ABORT_ON_ERROR=true</codeph> query option in effect. This option causes queries to
+ fail immediately if they encounter disallowed type conversions. See
+ <xref href="impala_abort_on_error.xml#abort_on_error"/> for details. For example:
+ </p>
+<codeblock>set abort_on_error=true;
+select count(*) from (select * from t1);
+-- The above query will fail if the data files for T1 contain any
+-- values that can't be converted to the expected Impala data types.
+-- For example, if T1.C1 is defined as INT but the column contains
+-- floating-point values like 1.1, the query will return an error.</codeblock>
+ </li>
+ </ul>
+ </conbody>
+ </concept>
+
+ <concept id="porting_statements">
+
+ <title>SQL Statements to Remove or Adapt</title>
+
+ <conbody>
+
+ <p>
+ Some SQL statements or clauses that you might be familiar with are not currently supported in Impala:
+ </p>
+
+ <ul>
+ <li>
+ <p>
+ Impala has no <codeph>DELETE</codeph> statement. Impala is intended for data warehouse-style operations
+ where you do bulk moves and transforms of large quantities of data. Instead of using
+ <codeph>DELETE</codeph>, use <codeph>INSERT OVERWRITE</codeph> to entirely replace the contents of a
+ table or partition, or use <codeph>INSERT ... SELECT</codeph> to copy a subset of data (everything but
+ the rows you intended to delete) from one table to another. See <xref href="impala_dml.xml#dml"/> for
+ an overview of Impala DML statements.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Impala has no <codeph>UPDATE</codeph> statement. Impala is intended for data warehouse-style operations
+ where you do bulk moves and transforms of large quantities of data. Instead of using
+ <codeph>UPDATE</codeph>, do all necessary transformations early in the ETL process, such as in the job
+ that generates the original data, or when copying from one table to another to convert to a particular
+ file format or partitioning scheme. See <xref href="impala_dml.xml#dml"/> for an overview of Impala DML
+ statements.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Impala has no transactional statements, such as <codeph>COMMIT</codeph> or <codeph>ROLLBACK</codeph>.
+ Impala effectively works like the <codeph>AUTOCOMMIT</codeph> mode in some database systems, where
+ changes take effect as soon as they are made.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ If your database, table, column, or other names conflict with Impala reserved words, use different
+ names or quote the names with backticks. See <xref href="impala_reserved_words.xml#reserved_words"/>
+ for the current list of Impala reserved words.
+ </p>
+ <p>
+ Conversely, if you use a keyword that Impala does not recognize, it might be interpreted as a table or
+ column alias. For example, in <codeph>SELECT * FROM t1 NATURAL JOIN t2</codeph>, Impala does not
+ recognize the <codeph>NATURAL</codeph> keyword and interprets it as an alias for the table
+ <codeph>t1</codeph>. If you experience any unexpected behavior with queries, check the list of reserved
+ words to make sure all keywords in join and <codeph>WHERE</codeph> clauses are recognized.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Impala supports subqueries only in the <codeph>FROM</codeph> clause of a query, not within the
+ <codeph>WHERE</codeph> clauses. Therefore, you cannot use clauses such as <codeph>WHERE
+ <varname>column</varname> IN (<varname>subquery</varname>)</codeph>. Also, Impala does not allow
+ <codeph>EXISTS</codeph> or <codeph>NOT EXISTS</codeph> clauses (although <codeph>EXISTS</codeph> is a
+ reserved keyword).
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Impala supports <codeph>UNION</codeph> and <codeph>UNION ALL</codeph> set operators, but not
+ <codeph>INTERSECT</codeph>. <ph conref="../shared/impala_common.xml#common/union_all_vs_union"/>
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Within queries, Impala requires query aliases for any subqueries:
+ </p>
+<codeblock>-- Without the alias 'contents_of_t1' at the end, query gives syntax error.
+select count(*) from (select * from t1) contents_of_t1;</codeblock>
+ </li>
+
+ <li>
+ <p>
+ When an alias is declared for an expression in a query, that alias cannot be referenced again within
+ the same query block:
+ </p>
+<codeblock>-- Can't reference AVERAGE twice in the SELECT list where it's defined.
+select avg(x) as average, average+1 from t1 group by x;
+ERROR: AnalysisException: couldn't resolve column reference: 'average'
+
+-- Although it can be referenced again later in the same query.
+select avg(x) as average from t1 group by x having average > 3;</codeblock>
+ <p>
+ For Impala, either repeat the expression again, or abstract the expression into a <codeph>WITH</codeph>
+ clause, creating named columns that can be referenced multiple times anywhere in the base query:
+ </p>
+<codeblock>-- The following 2 query forms are equivalent.
+select avg(x) as average, avg(x)+1 from t1 group by x;
+with avg_t as (select avg(x) average from t1 group by x) select average, average+1 from avg_t;</codeblock>
+<!-- An alternative bunch of queries to use in the example above.
+[localhost:21000] > select x*x as x_squared from t1;
+
+[localhost:21000] > select x*x as x_squared from t1 where x_squared < 100;
+ERROR: AnalysisException: couldn't resolve column reference: 'x_squared'
+[localhost:21000] > select x*x as x_squared, x_squared * pi() as pi_x_squared from t1;
+ERROR: AnalysisException: couldn't resolve column reference: 'x_squared'
+[localhost:21000] > select x*x as x_squared from t1 group by x_squared;
+
+[localhost:21000] > select x*x as x_squared from t1 group by x_squared having x_squared < 100;
+-->
+ </li>
+
+ <li>
+ <p>
+ Impala does not support certain rarely used join types that are less appropriate for high-volume tables
+ used for data warehousing. In some cases, Impala supports join types but requires explicit syntax to
+ ensure you do not do inefficient joins of huge tables by accident. For example, Impala does not support
+ natural joins or anti-joins, and requires the <codeph>CROSS JOIN</codeph> operator for Cartesian
+ products. See <xref href="impala_joins.xml#joins"/> for details on the syntax for Impala join clauses.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Impala has a limited choice of partitioning types. Partitions are defined based on each distinct
+ combination of values for one or more partition key columns. Impala does not redistribute or check data
+ to create evenly distributed partitions; you must choose partition key columns based on your knowledge
+ of the data volume and distribution. Adapt any tables that use range, list, hash, or key partitioning
+ to use the Impala partition syntax for <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph>
+ statements. Impala partitioning is similar to range partitioning where every range has exactly one
+ value, or key partitioning where the hash function produces a separate bucket for every combination of
+ key values. See <xref href="impala_partitioning.xml#partitioning"/> for usage details, and
+ <xref href="impala_create_table.xml#create_table"/> and
+ <xref href="impala_alter_table.xml#alter_table"/> for syntax.
+ </p>
+ <note>
+ Because the number of separate partitions is potentially higher than in other database systems, keep a
+ close eye on the number of partitions and the volume of data in each one; scale back the number of
+ partition key columns if you end up with too many partitions with a small volume of data in each one.
+ Remember, to distribute work for a query across a cluster, you need at least one HDFS block per node.
+ HDFS blocks are typically multiple megabytes, <ph rev="parquet_block_size">especially</ph> for Parquet
+ files. Therefore, if each partition holds only a few megabytes of data, you are unlikely to see much
+ parallelism in the query because such a small amount of data is typically processed by a single node.
+ </note>
+ </li>
+
+ <li>
+ <p>
+ For <q>top-N</q> queries, Impala uses the <codeph>LIMIT</codeph> clause rather than comparing against a
+ pseudocolumn named <codeph>ROWNUM</codeph> or <codeph>ROW_NUM</codeph>. See
+ <xref href="impala_limit.xml#limit"/> for details.
+ </p>
+ </li>
+ </ul>
+ </conbody>
+ </concept>
+
+ <concept id="porting_antipatterns">
+
+ <title>SQL Constructs to Doublecheck</title>
+
+ <conbody>
+
+ <p>
+ Some SQL constructs that are supported have behavior or defaults more oriented towards convenience than
+ optimal performance. Also, sometimes machine-generated SQL, perhaps issued through JDBC or ODBC
+ applications, might have inefficiencies or exceed internal Impala limits. As you port SQL code, be alert
+ and change these things where appropriate:
+ </p>
+
+ <ul>
+ <li>
+ <p>
+ A <codeph>CREATE TABLE</codeph> statement with no <codeph>STORED AS</codeph> clause creates data files
+ in plain text format, which is convenient for data interchange but not a good choice for high-volume
+ data with high-performance queries. See <xref href="impala_file_formats.xml#file_formats"/> for why and
+ how to use specific file formats for compact data and high-performance queries. Especially see
+ <xref href="impala_parquet.xml#parquet"/>, for details about the file format most heavily optimized for
+ large-scale data warehouse queries.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ A <codeph>CREATE TABLE</codeph> statement with no <codeph>PARTITIONED BY</codeph> clause stores all the
+ data files in the same physical location, which can lead to scalability problems when the data volume
+ becomes large.
+ </p>
+ <p>
+ On the other hand, adapting tables that were already partitioned in a different database system could
+ produce an Impala table with a high number of partitions and not enough data in each one, leading to
+ underutilization of Impala's parallel query features.
+ </p>
+ <p>
+ See <xref href="impala_partitioning.xml#partitioning"/> for details about setting up partitioning and
+ tuning the performance of queries on partitioned tables.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ The <codeph>INSERT ... VALUES</codeph> syntax is suitable for setting up toy tables with a few rows for
+ functional testing, but because each such statement creates a separate tiny file in HDFS, it is not a
+ scalable technique for loading megabytes or gigabytes (let alone petabytes) of data. Consider revising
+ your data load process to produce raw data files outside of Impala, then setting up Impala external
+ tables or using the <codeph>LOAD DATA</codeph> statement to use those data files instantly in Impala
+ tables, with no conversion or indexing stage. See <xref href="impala_tables.xml#external_tables"/> and
+ <xref href="impala_load_data.xml#load_data"/> for details about the Impala techniques for working with
+ data files produced outside of Impala; see <xref href="impala_tutorial.xml#tutorial_etl"/> for examples
+ of ETL workflow for Impala.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ If your ETL process is not optimized for Hadoop, you might end up with highly fragmented small data
+ files, or a single giant data file that cannot take advantage of distributed parallel queries or
+ partitioning. In this case, use an <codeph>INSERT ... SELECT</codeph> statement to copy the data into a
+ new table and reorganize into a more efficient layout in the same operation. See
+ <xref href="impala_insert.xml#insert"/> for details about the <codeph>INSERT</codeph> statement.
+ </p>
+ <p>
+ You can do <codeph>INSERT ... SELECT</codeph> into a table with a more efficient file format (see
+ <xref href="impala_file_formats.xml#file_formats"/>) or from an unpartitioned table into a partitioned
+ one (see <xref href="impala_partitioning.xml#partitioning"/>).
+ </p>
+ </li>
+
+ <li>
+ <p>
+ The number of expressions allowed in an Impala query might be smaller than for some other database
+ systems, causing failures for very complicated queries (typically produced by automated SQL
+ generators). Where practical, keep the number of expressions in the <codeph>WHERE</codeph> clauses to
+ approximately 2000 or fewer. As a workaround, set the query option
+ <codeph>DISABLE_CODEGEN=true</codeph> if queries fail for this reason. See
+ <xref href="impala_disable_codegen.xml#disable_codegen"/> for details.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ If practical, rewrite <codeph>UNION</codeph> queries to use the <codeph>UNION ALL</codeph> operator
+ instead. <ph conref="../shared/impala_common.xml#common/union_all_vs_union"/>
+ </p>
+ </li>
+ </ul>
+ </conbody>
+ </concept>
+
+ <concept id="porting_next">
+
+ <title>Next Porting Steps after Verifying Syntax and Semantics</title>
+
+ <conbody>
+
+ <p>
+ Throughout this section, some of the decisions you make during the porting process also have a substantial
+ impact on performance. After your SQL code is ported and working correctly, doublecheck the
+ performance-related aspects of your schema design, physical layout, and queries to make sure that the
+ ported application is taking full advantage of Impala's parallelism, performance-related SQL features, and
+ integration with Hadoop components.
+ </p>
+
+ <ul>
+ <li>
+ Have you run the <codeph>COMPUTE STATS</codeph> statement on each table involved in join queries? Have
+ you also run <codeph>COMPUTE STATS</codeph> for each table used as the source table in an <codeph>INSERT
+ ... SELECT</codeph> or <codeph>CREATE TABLE AS SELECT</codeph> statement?
+ </li>
+
+ <li>
+ Are you using the most efficient file format for your data volumes, table structure, and query
+ characteristics?
+ </li>
+
+ <li>
+ Are you using partitioning effectively? That is, have you partitioned on columns that are often used for
+ filtering in <codeph>WHERE</codeph> clauses? Have you partitioned at the right granularity so that there
+ is enough data in each partition to parallelize the work for each query?
+ </li>
+
+ <li>
+ Does your ETL process produce a relatively small number of multi-megabyte data files (good) rather than a
+ huge number of small files (bad)?
+ </li>
+ </ul>
+
+ <p>
+ See <xref href="impala_performance.xml#performance"/> for details about the whole performance tuning
+ process.
+ </p>
+ </conbody>
+ </concept>
+</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_query_options.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_query_options.xml b/docs/topics/impala_query_options.xml
new file mode 100644
index 0000000..1011746
--- /dev/null
+++ b/docs/topics/impala_query_options.xml
@@ -0,0 +1,75 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="query_options">
+
+ <title>Query Options for the SET Statement</title>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala Query Options"/>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="impala-shell"/>
+ <data name="Category" value="SQL"/>
+ <data name="Category" value="Querying"/>
+ <data name="Category" value="Configuring"/>
+ <data name="Category" value="Data Analysts"/>
+ <data name="Category" value="Developers"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ You can specify the following options using the <codeph>SET</codeph> statement, and those settings affect all
+ queries issued from that session.
+ </p>
+
+ <p>
+ Some query options are useful in day-to-day operations for improving usability, performance, or flexibility.
+ </p>
+
+ <p>
+ Other query options control special-purpose aspects of Impala operation and are intended primarily for
+ advanced debugging or troubleshooting.
+ </p>
+
+ <p>
+ Options with Boolean parameters can be set to 1 or <codeph>true</codeph> to enable, or 0 or <codeph>false</codeph>
+ to turn off.
+ </p>
+
+ <note rev="2.0.0">
+ In Impala 2.0 and later, you can set query options directly through the JDBC and ODBC interfaces by using the
+ <codeph>SET</codeph> statement. Formerly, <codeph>SET</codeph> was only available as a command within the
+ <cmdname>impala-shell</cmdname> interpreter.
+ </note>
+
+<!-- This is the list including defaults from the pre-release 1.2 impala-shell:
+ ABORT_ON_DEFAULT_LIMIT_EXCEEDED: 0
+ ABORT_ON_ERROR: 0
+ ALLOW_UNSUPPORTED_FORMATS: 0
+ BATCH_SIZE: 0
+ DEBUG_ACTION:
+ DEFAULT_ORDER_BY_LIMIT: -1
+ DISABLE_CODEGEN: 0
+ HBASE_CACHE_BLOCKS: 0
+ HBASE_CACHING: 0
+ MAX_ERRORS: 0
+ MAX_IO_BUFFERS: 0
+ MAX_SCAN_RANGE_LENGTH: 0
+ MEM_LIMIT: 0
+ NUM_NODES: 0
+ NUM_SCANNER_THREADS: 0
+ PARQUET_COMPRESSION_CODEC: SNAPPY
+ PARQUET_FILE_SIZE: 0
+ SUPPORT_START_OVER: false
+-->
+
+ <p outputclass="toc"/>
+
+ <p conref="../shared/impala_common.xml#common/related_info"/>
+
+ <p>
+ <xref href="impala_set.xml#set"/>
+ </p>
+ </conbody>
+</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_query_timeout_s.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_query_timeout_s.xml b/docs/topics/impala_query_timeout_s.xml
new file mode 100644
index 0000000..41f2918
--- /dev/null
+++ b/docs/topics/impala_query_timeout_s.xml
@@ -0,0 +1,51 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept rev="2.0.0" id="query_timeout_s">
+
+ <title>QUERY_TIMEOUT_S Query Option</title>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="Impala Query Options"/>
+ <data name="Category" value="Querying"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">QUERY_TIMEOUT_S query option</indexterm>
+ Sets the idle query timeout value for the session, in seconds. Queries that sit idle for longer than the
+ timeout value are automatically cancelled. If the system administrator specified the
+ <codeph>--idle_query_timeout</codeph> startup option, <codeph>QUERY_TIMEOUT_S</codeph> must be smaller than
+ or equal to the <codeph>--idle_query_timeout</codeph> value.
+ </p>
+
+ <note conref="../shared/impala_common.xml#common/timeout_clock_blurb"/>
+
+ <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock>SET QUERY_TIMEOUT_S=<varname>seconds</varname>;</codeblock>
+
+<!-- Don't have a compelling example to show at this time because the 'idle' aspect only applies
+ when the client is careless and leaves the query open. Can't easily demonstrate in impala-shell.
+
+ <p conref="/Content/impala_common_xi44078.xml#common/example_blurb"/>
+-->
+
+ <p>
+ <b>Type:</b> numeric
+ </p>
+
+ <p>
+ <b>Default:</b> 0 (no timeout if <codeph>--idle_query_timeout</codeph> not in effect; otherwise, use
+ <codeph>--idle_query_timeout</codeph> value)
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/related_info"/>
+
+ <p>
+ <xref href="impala_timeouts.xml#timeouts"/>
+ </p>
+ </conbody>
+</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_real.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_real.xml b/docs/topics/impala_real.xml
new file mode 100644
index 0000000..e6430e3
--- /dev/null
+++ b/docs/topics/impala_real.xml
@@ -0,0 +1,46 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="real">
+
+ <title>REAL Data Type</title>
+ <titlealts><navtitle>REAL</navtitle></titlealts>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="Impala Data Types"/>
+ <data name="Category" value="SQL"/>
+ <data name="Category" value="Data Analysts"/>
+ <data name="Category" value="Developers"/>
+ <data name="Category" value="Schemas"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ An alias for the <codeph>DOUBLE</codeph> data type. See <xref href="impala_double.xml#double"/> for details.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+ <p>
+ These examples show how you can use the type names <codeph>REAL</codeph> and <codeph>DOUBLE</codeph>
+ interchangeably, and behind the scenes Impala treats them always as <codeph>DOUBLE</codeph>.
+ </p>
+
+<codeblock>[localhost:21000] > create table r1 (x real);
+[localhost:21000] > describe r1;
++------+--------+---------+
+| name | type | comment |
++------+--------+---------+
+| x | double | |
++------+--------+---------+
+[localhost:21000] > insert into r1 values (1.5), (cast (2.2 as double));
+[localhost:21000] > select cast (1e6 as real);
++---------------------------+
+| cast(1000000.0 as double) |
++---------------------------+
+| 1000000 |
++---------------------------+</codeblock>
+ </conbody>
+</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_refresh.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_refresh.xml b/docs/topics/impala_refresh.xml
new file mode 100644
index 0000000..ee022d5
--- /dev/null
+++ b/docs/topics/impala_refresh.xml
@@ -0,0 +1,234 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="refresh">
+
+ <title>REFRESH Statement</title>
+ <titlealts><navtitle>REFRESH</navtitle></titlealts>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="SQL"/>
+ <data name="Category" value="DDL"/>
+ <data name="Category" value="Tables"/>
+ <data name="Category" value="Hive"/>
+ <data name="Category" value="Metastore"/>
+ <data name="Category" value="ETL"/>
+ <data name="Category" value="Ingest"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">REFRESH statement</indexterm>
+ To accurately respond to queries, the Impala node that acts as the coordinator (the node to which you are
+ connected through <cmdname>impala-shell</cmdname>, JDBC, or ODBC) must have current metadata about those
+ databases and tables that are referenced in Impala queries. If you are not familiar with the way Impala uses
+ metadata and how it shares the same metastore database as Hive, see
+ <xref href="impala_hadoop.xml#intro_metastore"/> for background information.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock>REFRESH [<varname>db_name</varname>.]<varname>table_name</varname></codeblock>
+
+ <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+ <p>
+ Use the <codeph>REFRESH</codeph> statement to load the latest metastore metadata and block location data for
+ a particular table in these scenarios:
+ </p>
+
+ <ul>
+ <li>
+ After loading new data files into the HDFS data directory for the table. (Once you have set up an ETL
+ pipeline to bring data into Impala on a regular basis, this is typically the most frequent reason why
+ metadata needs to be refreshed.)
+ </li>
+
+ <li>
+ After issuing <codeph>ALTER TABLE</codeph>, <codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>, or other
+ table-modifying SQL statement in Hive.
+ </li>
+ </ul>
+
+ <p>
+ You only need to issue the <codeph>REFRESH</codeph> statement on the node to which you connect to issue
+ queries. The coordinator node divides the work among all the Impala nodes in a cluster, and sends read
+ requests for the correct HDFS blocks without relying on the metadata on the other nodes.
+ </p>
+
+ <p>
+ <codeph>REFRESH</codeph> reloads the metadata for the table from the metastore database, and does an
+ incremental reload of the low-level block location data to account for any new data files added to the HDFS
+ data directory for the table. It is a low-overhead, single-table operation, specifically tuned for the common
+ scenario where new data files are added to HDFS.
+ </p>
+
+ <p>
+ Only the metadata for the specified table is flushed. The table must already exist and be known to Impala,
+ either because the <codeph>CREATE TABLE</codeph> statement was run in Impala rather than Hive, or because a
+ previous <codeph>INVALIDATE METADATA</codeph> statement caused Impala to reload its entire metadata catalog.
+ </p>
+
+ <note>
+ <p rev="1.2">
+ In Impala 1.2 and higher, the catalog service broadcasts any changed metadata as a result of Impala
+ <codeph>ALTER TABLE</codeph>, <codeph>INSERT</codeph> and <codeph>LOAD DATA</codeph> statements to all
+ Impala nodes. Thus, the <codeph>REFRESH</codeph> statement is only required if you load data through Hive
+ or by manipulating data files in HDFS directly. See <xref href="impala_components.xml#intro_catalogd"/> for
+ more information on the catalog service.
+ </p>
+ <p rev="1.2.1">
+ In Impala 1.2.1 and higher, another way to avoid inconsistency across nodes is to enable the
+ <codeph>SYNC_DDL</codeph> query option before performing a DDL statement or an <codeph>INSERT</codeph> or
+ <codeph>LOAD DATA</codeph>.
+ </p>
+ <p>
+ The functionality of the <codeph>REFRESH</codeph> statement has changed in Impala 1.1 and higher. Now the
+ table name is a required parameter. To flush the metadata for all tables, use the
+ <codeph><xref href="impala_invalidate_metadata.xml#invalidate_metadata">INVALIDATE METADATA</xref></codeph>
+ command.
+ </p>
+ <draft-comment translate="no"> Almost-identical wording here, under INVALIDATE METADATA, and in Release Notes :: New Features. Makes sense to conref. </draft-comment>
+ <p>
+ Because <codeph>REFRESH <varname>table_name</varname></codeph> only works for tables that Impala is already
+ aware of, when you create a new table in the Hive shell, you must enter <codeph>INVALIDATE
+ METADATA</codeph> with no table parameter before you can see the new table in
+ <cmdname>impala-shell</cmdname>. Once the table is known to Impala, you can issue <codeph>REFRESH
+ <varname>table_name</varname></codeph> as needed after you add more data files for that table.
+ </p>
+ </note>
+
+ <p conref="../shared/impala_common.xml#common/refresh_vs_invalidate"/>
+
+ <p>
+ A metadata update for an <codeph>impalad</codeph> instance <b>is</b> required if:
+ </p>
+
+ <ul>
+ <li>
+ A metadata change occurs.
+ </li>
+
+ <li>
+ <b>and</b> the change is made through Hive.
+ </li>
+
+ <li>
+ <b>and</b> the change is made to a database to which clients such as the Impala shell or ODBC directly
+ connect.
+ </li>
+ </ul>
+
+ <p rev="1.2">
+ A metadata update for an Impala node is <b>not</b> required after you run <codeph>ALTER TABLE</codeph>,
+ <codeph>INSERT</codeph>, or other table-modifying statement in Impala rather than Hive. Impala handles the
+ metadata synchronization automatically through the catalog service.
+ </p>
+
+ <p>
+ Database and table metadata is typically modified by:
+ </p>
+
+ <ul>
+ <li>
+ Hive - through <codeph>ALTER</codeph>, <codeph>CREATE</codeph>, <codeph>DROP</codeph> or
+ <codeph>INSERT</codeph> operations.
+ </li>
+
+ <li>
+ Impalad - through <codeph>CREATE TABLE</codeph>, <codeph>ALTER TABLE</codeph>, and <codeph>INSERT</codeph>
+ operations. <ph rev="1.2">In Impala 1.2 and higher, such changes are propagated to all Impala nodes by the
+ Impala catalog service.</ph>
+ </li>
+ </ul>
+
+ <p>
+ <codeph>REFRESH</codeph> causes the metadata for that table to be immediately reloaded. For a huge table,
+ that process could take a noticeable amount of time; but doing the refresh up front avoids an unpredictable
+ delay later, for example if the next reference to the table is during a benchmark test.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/>
+
+ <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+ <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+ <p>
+ The following example shows how you might use the <codeph>REFRESH</codeph> statement after manually adding
+ new HDFS data files to the Impala data directory for a table:
+ </p>
+
+<codeblock>[impalad-host:21000] > refresh t1;
+[impalad-host:21000] > refresh t2;
+[impalad-host:21000] > select * from t1;
+...
+[impalad-host:21000] > select * from t2;
+... </codeblock>
+
+ <p>
+ For more examples of using <codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph> with a
+ combination of Impala and Hive operations, see <xref href="impala_tutorial.xml#tutorial_impala_hive"/>.
+ </p>
+
+ <p>
+ <b>Related impalad options:</b>
+ </p>
+
+ <p>
+ In Impala 1.0, the <codeph>-r</codeph> option of <cmdname>impala-shell</cmdname> issued
+ <codeph>REFRESH</codeph> to reload metadata for all tables.
+ </p>
+
+ <p>
+ In Impala 1.1 and higher, this option issues <codeph>INVALIDATE METADATA</codeph> because
+ <codeph>REFRESH</codeph> now requires a table name parameter. Due to the expense of reloading the metadata
+ for all tables, the <cmdname>impala-shell</cmdname> <codeph>-r</codeph> option is not recommended for
+ day-to-day use in a production environment.
+ </p>
+
+ <p rev="1.2">
+ In Impala 1.2 and higher, the <codeph>-r</codeph> option is needed even less frequently, because metadata
+ changes caused by SQL statements in Impala are automatically broadcast to all Impala nodes.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/permissions_blurb"/>
+ <p rev="CDH-19187">
+ The user ID that the <cmdname>impalad</cmdname> daemon runs under,
+ typically the <codeph>impala</codeph> user, must have execute
+ permissions for all the relevant directories holding table data.
+ (A table could have data spread across multiple directories,
+ or in unexpected paths, if it uses partitioning or
+ specifies a <codeph>LOCATION</codeph> attribute for
+ individual partitions or the entire table.)
+ Issues with permissions might not cause an immediate error for this statement,
+ but subsequent statements such as <codeph>SELECT</codeph>
+ or <codeph>SHOW TABLE STATS</codeph> could fail.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/hdfs_blurb"/>
+
+ <p>
+ The <codeph>REFRESH</codeph> command checks HDFS permissions of the underlying data files and directories,
+ caching this information so that a statement can be cancelled immediately if for example the
+ <codeph>impala</codeph> user does not have permission to write to the data directory for the table. Impala
+ reports any lack of write permissions as an <codeph>INFO</codeph> message in the log file, in case that
+ represents an oversight. If you change HDFS permissions to make data readable or writeable by the Impala
+ user, issue another <codeph>REFRESH</codeph> to make Impala aware of the change.
+ </p>
+
+ <note conref="../shared/impala_common.xml#common/compute_stats_next"/>
+
+ <p conref="../shared/impala_common.xml#common/s3_blurb"/>
+ <p conref="../shared/impala_common.xml#common/s3_metadata"/>
+
+ <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
+ <p conref="../shared/impala_common.xml#common/related_info"/>
+ <p>
+ <xref href="impala_hadoop.xml#intro_metastore"/>,
+ <xref href="impala_invalidate_metadata.xml#invalidate_metadata"/>
+ </p>
+ </conbody>
+</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_request_pool.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_request_pool.xml b/docs/topics/impala_request_pool.xml
new file mode 100644
index 0000000..cf2a811
--- /dev/null
+++ b/docs/topics/impala_request_pool.xml
@@ -0,0 +1,45 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept rev="1.3.0" id="request_pool">
+
+ <title>REQUEST_POOL Query Option</title>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="Resource Management"/>
+ <data name="Category" value="Impala Query Options"/>
+ <data name="Category" value="Admission Control"/>
+ <data name="Category" value="YARN"/>
+ <data name="Category" value="Llama"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">REQUEST_POOL query option</indexterm>
+ The pool or queue name that queries should be submitted to. Only applies when you enable the Impala admission
+ control feature (CDH 4 or CDH 5; see <xref href="impala_admission.xml#admission_control"/>), or the YARN
+ resource management feature (CDH 5 only; see
+ <xref href="impala_resource_management.xml#resource_management"/>). Specifies the name of the pool used by
+ requests from Impala to the resource manager.
+ </p>
+
+ <p>
+ Formerly known as <codeph>YARN_POOL</codeph> during the CDH 5 beta period. Renamed to reflect that it can be
+ used both with YARN and with the lightweight admission control feature introduced in Impala 1.3.
+ </p>
+
+ <p>
+ <b>Type:</b> <codeph>STRING</codeph>
+ </p>
+
+ <p>
+ <b>Default:</b> empty (use the user-to-pool mapping defined by an <cmdname>impalad</cmdname> startup option
+ in the Impala configuration file)
+ </p>
+
+<!-- Worth adding a couple of related info links here. -->
+
+ </conbody>
+</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_reservation_request_timeout.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_reservation_request_timeout.xml b/docs/topics/impala_reservation_request_timeout.xml
new file mode 100644
index 0000000..0316e44
--- /dev/null
+++ b/docs/topics/impala_reservation_request_timeout.xml
@@ -0,0 +1,35 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept rev="1.2" id="reservation_request_timeout">
+
+ <title>RESERVATION_REQUEST_TIMEOUT Query Option (CDH 5 only)</title>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="Impala Query Options"/>
+ <data name="Category" value="Resource Management"/>
+ <data name="Category" value="YARN"/>
+ <data name="Category" value="Llama"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">RESERVATION_REQUEST_TIMEOUT query option</indexterm>
+ Maximum number of milliseconds Impala will wait for a reservation to be completely granted or denied. Used in
+ conjunction with the Impala resource management feature in Impala 1.2 and higher with CDH 5.
+ </p>
+
+ <p>
+ <b>Type:</b> numeric
+ </p>
+
+ <p>
+ <b>Default:</b> 300000 (5 minutes)
+ </p>
+
+<!-- Worth adding a couple of related info links here. -->
+
+ </conbody>
+</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_revoke.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_revoke.xml b/docs/topics/impala_revoke.xml
new file mode 100644
index 0000000..88fbbf9
--- /dev/null
+++ b/docs/topics/impala_revoke.xml
@@ -0,0 +1,96 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept rev="2.0.0" id="revoke">
+
+ <title>REVOKE Statement (CDH 5.2 or higher only)</title>
+ <titlealts><navtitle>REVOKE (CDH 5.2 or higher only)</navtitle></titlealts>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="DDL"/>
+ <data name="Category" value="SQL"/>
+ <data name="Category" value="Sentry"/>
+ <data name="Category" value="Roles"/>
+ <!-- Consider whether to go deeper into categories like Security for the Sentry-related statements. -->
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">REVOKE statement</indexterm>
+<!-- Copied from Sentry docs. Turn into conref. I did some rewording for clarity. -->
+ The <codeph>REVOKE</codeph> statement revokes roles or privileges on a specified object from groups. Only
+ Sentry administrative users can revoke the role from a group. The revocation has a cascading effect. For
+ example, revoking the <codeph>ALL</codeph> privilege on a database also revokes the same privilege for all
+ the tables in that database.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock rev="2.3.0 collevelauth">REVOKE ROLE <varname>role_name</varname> FROM GROUP <varname>group_name</varname>
+
+REVOKE <varname>privilege</varname> ON <varname>object_type</varname> <varname>object_name</varname>
+ FROM [ROLE] <varname>role_name</varname>
+
+<ph rev="2.3.0">privilege ::= SELECT | SELECT(<varname>column_name</varname>) | INSERT | ALL</ph>
+object_type ::= TABLE | DATABASE | SERVER | URI
+</codeblock>
+
+ <p>
+ Typically, the object name is an identifier. For URIs, it is a string literal.
+ </p>
+
+ <p rev="2.3.0 collevelauth">
+ The ability to grant or revoke <codeph>SELECT</codeph> privilege on specific columns is available
+ in CDH 5.5 / Impala 2.3 and higher. See <xref href="sg_hive_sql.xml#concept_c2q_4qx_p4/col_level_auth_sentry"/>
+ for details.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/privileges_blurb"/>
+
+ <p>
+ Only administrative users (those with <codeph>ALL</codeph> privileges on the server, defined in the Sentry
+ policy file) can use this statement.
+ </p>
+
+<!-- Turn compatibility info into a conref or series of conrefs. (In both GRANT and REVOKE.) -->
+
+ <p conref="../shared/impala_common.xml#common/compatibility_blurb"/>
+
+ <p>
+ <ul>
+ <li>
+ The Impala <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements are available in CDH 5.2 and
+ higher.
+ </li>
+
+ <li>
+ In CDH 5.1 and higher, Impala makes use of any roles and privileges specified by the
+ <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements in Hive, when your system is configured to
+ use the Sentry service instead of the file-based policy mechanism.
+ </li>
+
+ <li>
+ The Impala <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements do not require the
+ <codeph>ROLE</codeph> keyword to be repeated before each role name, unlike the equivalent Hive
+ statements.
+ </li>
+
+ <li conref="../shared/impala_common.xml#common/grant_revoke_single"/>
+ </ul>
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
+
+ <p conref="../shared/impala_common.xml#common/permissions_blurb_no"/>
+
+ <p conref="../shared/impala_common.xml#common/related_info"/>
+
+ <p>
+ <xref href="impala_authorization.xml#authorization"/>, <xref href="impala_grant.xml#grant"/>
+ <xref href="impala_create_role.xml#create_role"/>, <xref href="impala_drop_role.xml#drop_role"/>,
+ <xref href="impala_show.xml#show"/>
+ </p>
+ </conbody>
+</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_schema_objects.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_schema_objects.xml b/docs/topics/impala_schema_objects.xml
new file mode 100644
index 0000000..d8abe12
--- /dev/null
+++ b/docs/topics/impala_schema_objects.xml
@@ -0,0 +1,57 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="schema_objects">
+
+ <title>Impala Schema Objects and Object Names</title>
+ <titlealts><navtitle>Schema Objects and Object Names</navtitle></titlealts>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="SQL"/>
+ <data name="Category" value="Data Analysts"/>
+ <data name="Category" value="Developers"/>
+ <data name="Category" value="Querying"/>
+ <data name="Category" value="Schemas"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">schema objects</indexterm>
+ With Impala, you work with schema objects that are familiar to database users: primarily databases, tables, views,
+ and functions. The SQL syntax to work with these objects is explained in
+ <xref href="impala_langref_sql.xml#langref_sql"/>. This section explains the conceptual knowledge you need to
+ work with these objects and the various ways to specify their names.
+ </p>
+
+ <p>
+ Within a table, partitions can also be considered a kind of object. Partitioning is an important subject for
+ Impala, with its own documentation section covering use cases and performance considerations. See
+ <xref href="impala_partitioning.xml#partitioning"/> for details.
+ </p>
+
+ <p>
+ Impala does not have a counterpart of the <q>tablespace</q> notion from some database systems. By default,
+ all the data files for a database, table, or partition are located within nested folders within the HDFS file
+ system. You can also specify a particular HDFS location for a given Impala table or partition. The raw data
+ for these objects is represented as a collection of data files, providing the flexibility to load data by
+ simply moving files into the expected HDFS location.
+ </p>
+
+ <p>
+ Information about the schema objects is held in the
+ <xref href="impala_hadoop.xml#intro_metastore">metastore</xref> database. This database is shared between
+ Impala and Hive, allowing each to create, drop, and query each other's databases, tables, and so on. When
+ Impala makes a change to schema objects through a <codeph>CREATE</codeph>, <codeph>ALTER</codeph>,
+ <codeph>DROP</codeph>, <codeph>INSERT</codeph>, or <codeph>LOAD DATA</codeph> statement, it broadcasts those
+ changes to all nodes in the cluster through the <xref href="impala_components.xml#intro_catalogd">catalog
+ service</xref>. When you make such changes through Hive or directly through manipulating HDFS files, you use
+ the <xref href="impala_refresh.xml#refresh">REFRESH</xref> or
+ <xref href="impala_invalidate_metadata.xml#invalidate_metadata">INVALIDATE METADATA</xref> statements on the
+ Impala side to recognize the newly loaded data, new tables, and so on.
+ </p>
+
+ <p outputclass="toc"/>
+ </conbody>
+</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_select.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_select.xml b/docs/topics/impala_select.xml
new file mode 100644
index 0000000..db63f71
--- /dev/null
+++ b/docs/topics/impala_select.xml
@@ -0,0 +1,203 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="select">
+
+ <title>SELECT Statement</title>
+ <titlealts><navtitle>SELECT</navtitle></titlealts>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="SQL"/>
+ <data name="Category" value="Querying"/>
+ <data name="Category" value="Reports"/>
+ <data name="Category" value="Tables"/>
+ <data name="Category" value="Data Analysts"/>
+ <data name="Category" value="Developers"/>
+ <!-- This is such an important statement, think if there are more applicable categories. -->
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">SELECT statement</indexterm>
+ The <codeph>SELECT</codeph> statement performs queries, retrieving data from one or more tables and producing
+ result sets consisting of rows and columns.
+ </p>
+
+ <p>
+ The Impala <codeph><xref href="impala_insert.xml#insert">INSERT</xref></codeph> statement also typically ends
+ with a <codeph>SELECT</codeph> statement, to define data to copy from one table to another.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock>[WITH <i>name</i> AS (<i>select_expression</i>) [, ...] ]
+SELECT
+ [ALL | DISTINCT]
+ [STRAIGHT_JOIN]
+ <i>expression</i> [, <i>expression</i> ...]
+FROM <i>table_reference</i> [, <i>table_reference</i> ...]
+[[FULL | [LEFT | RIGHT] INNER | [LEFT | RIGHT] OUTER | [LEFT | RIGHT] SEMI | [LEFT | RIGHT] ANTI | CROSS]
+ JOIN <i>table_reference</i>
+ [ON <i>join_equality_clauses</i> | USING (<varname>col1</varname>[, <varname>col2</varname> ...]] ...
+WHERE <i>conditions</i>
+GROUP BY { <i>column</i> | <i>expression</i> [ASC | DESC] [NULLS FIRST | NULLS LAST] [, ...] }
+HAVING <codeph>conditions</codeph>
+GROUP BY { <i>column</i> | <i>expression</i> [ASC | DESC] [, ...] }
+LIMIT <i>expression</i> [OFFSET <i>expression</i>]
+[UNION [ALL] <i>select_statement</i>] ...]
+</codeblock>
+
+ <p>
+ Impala <codeph>SELECT</codeph> queries support:
+ </p>
+
+ <ul>
+ <li>
+ SQL scalar data types: <codeph><xref href="impala_boolean.xml#boolean">BOOLEAN</xref></codeph>,
+ <codeph><xref href="impala_tinyint.xml#tinyint">TINYINT</xref></codeph>,
+ <codeph><xref href="impala_smallint.xml#smallint">SMALLINT</xref></codeph>,
+ <codeph><xref href="impala_int.xml#int">INT</xref></codeph>,
+ <codeph><xref href="impala_bigint.xml#bigint">BIGINT</xref></codeph>,
+ <codeph><xref href="impala_decimal.xml#decimal">DECIMAL</xref></codeph>
+ <codeph><xref href="impala_float.xml#float">FLOAT</xref></codeph>,
+ <codeph><xref href="impala_double.xml#double">DOUBLE</xref></codeph>,
+ <codeph><xref href="impala_timestamp.xml#timestamp">TIMESTAMP</xref></codeph>,
+ <codeph><xref href="impala_string.xml#string">STRING</xref></codeph>,
+ <codeph><xref href="impala_varchar.xml#varchar">VARCHAR</xref></codeph>,
+ <codeph><xref href="impala_char.xml#char">CHAR</xref></codeph>.
+ </li>
+
+<!-- To do: Consider promoting 'querying complex types' to its own subtopic or pseudo-heading. -->
+ <li rev="2.3.0">
+ The complex data types <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph>,
+ are available in CDH 5.5 / Impala 2.3 and higher.
+ Queries involving these types typically involve special qualified names
+ using dot notation for referring to the complex column fields,
+ and join clauses for bringing the complex columns into the result set.
+ See <xref href="impala_complex_types.xml#complex_types"/> for details.
+ </li>
+
+ <li rev="1.1">
+ An optional <xref href="impala_with.xml#with"><codeph>WITH</codeph> clause</xref> before the
+ <codeph>SELECT</codeph> keyword, to define a subquery whose name or column names can be referenced from
+ later in the main query. This clause lets you abstract repeated clauses, such as aggregation functions,
+ that are referenced multiple times in the same query.
+ </li>
+
+ <li>
+ By default, one <codeph>DISTINCT</codeph> clause per query. See <xref href="impala_distinct.xml#distinct"/>
+ for details. See <xref href="impala_appx_count_distinct.xml#appx_count_distinct"/> for a query option to
+ allow multiple <codeph>COUNT(DISTINCT)</codeph> impressions in the same query.
+ </li>
+
+ <li>
+ Subqueries in a <codeph>FROM</codeph> clause. In CDH 5.2 / Impala 2.0 and higher,
+ subqueries can also go in the <codeph>WHERE</codeph> clause, for example with the
+ <codeph>IN()</codeph>, <codeph>EXISTS</codeph>, and <codeph>NOT EXISTS</codeph> operators.
+ </li>
+
+ <li>
+ <codeph>WHERE</codeph>, <codeph>GROUP BY</codeph>, <codeph>HAVING</codeph> clauses.
+ </li>
+
+ <li rev="obwl">
+ <codeph><xref href="impala_order_by.xml#order_by">ORDER BY</xref></codeph>. Prior to Impala 1.4.0, Impala
+ required that queries using an <codeph>ORDER BY</codeph> clause also include a
+ <codeph><xref href="impala_limit.xml#limit">LIMIT</xref></codeph> clause. In Impala 1.4.0 and higher, this
+ restriction is lifted; sort operations that would exceed the Impala memory limit automatically use a
+ temporary disk work area to perform the sort.
+ </li>
+
+ <li>
+ <p conref="../shared/impala_common.xml#common/join_types"/>
+ <p>
+ See <xref href="impala_joins.xml#joins"/> for details and examples of join queries.
+ </p>
+ </li>
+
+ <li>
+ <codeph>UNION ALL</codeph>.
+ </li>
+
+ <li>
+ <codeph>LIMIT</codeph>.
+ </li>
+
+ <li>
+ External tables.
+ </li>
+
+ <li>
+ Relational operators such as greater than, less than, or equal to.
+ </li>
+
+ <li>
+ Arithmetic operators such as addition or subtraction.
+ </li>
+
+ <li>
+ Logical/Boolean operators <codeph>AND</codeph>, <codeph>OR</codeph>, and <codeph>NOT</codeph>. Impala does
+ not support the corresponding symbols <codeph>&&</codeph>, <codeph>||</codeph>, and
+ <codeph>!</codeph>.
+ </li>
+
+ <li>
+ Common SQL built-in functions such as <codeph>COUNT</codeph>, <codeph>SUM</codeph>, <codeph>CAST</codeph>,
+ <codeph>LIKE</codeph>, <codeph>IN</codeph>, <codeph>BETWEEN</codeph>, and <codeph>COALESCE</codeph>. Impala
+ specifically supports built-ins described in <xref href="impala_functions.xml#builtins"/>.
+ </li>
+ </ul>
+
+ <p conref="../shared/impala_common.xml#common/ignore_file_extensions"/>
+
+ <p conref="../shared/impala_common.xml#common/security_blurb"/>
+ <p conref="../shared/impala_common.xml#common/redaction_yes"/>
+
+ <p conref="../shared/impala_common.xml#common/cancel_blurb_yes"/>
+
+ <p conref="../shared/impala_common.xml#common/permissions_blurb"/>
+ <p rev="CDH-19187">
+ The user ID that the <cmdname>impalad</cmdname> daemon runs under,
+ typically the <codeph>impala</codeph> user, must have read
+ permissions for the files in all applicable directories in all source tables,
+ and read and execute permissions for the relevant data directories.
+ (A <codeph>SELECT</codeph> operation could read files from multiple different HDFS directories
+ if the source table is partitioned.)
+ If a query attempts to read a data file and is unable to because of an HDFS permission error,
+ the query halts and does not return any further results.
+ </p>
+
+ <p outputclass="toc"/>
+
+ <p conref="../shared/impala_common.xml#common/related_info"/>
+
+ <p>
+ The <codeph>SELECT</codeph> syntax is so extensive that it forms its own category of statements: queries. The
+ other major classifications of SQL statements are data definition language (see
+ <xref href="impala_ddl.xml#ddl"/>) and data manipulation language (see <xref href="impala_dml.xml#dml"/>).
+ </p>
+
+ <p>
+ Because the focus of Impala is on fast queries with interactive response times over huge data sets, query
+ performance and scalability are important considerations. See
+ <xref href="impala_performance.xml#performance"/> and <xref href="impala_scalability.xml#scalability"/> for
+ details.
+ </p>
+ </conbody>
+
+ <concept id="where" audience="Cloudera">
+
+<!-- WHERE hidden for the moment until there's the chance to add some reasonably comprehensive content
+
+ and make it its own file. -->
+
+ <title>WHERE Clause</title>
+
+ <conbody>
+
+ <p/>
+ </conbody>
+ </concept>
+</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_set.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_set.xml b/docs/topics/impala_set.xml
new file mode 100644
index 0000000..afa6777
--- /dev/null
+++ b/docs/topics/impala_set.xml
@@ -0,0 +1,90 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept rev="2.0.0" id="set">
+
+ <title>SET Statement</title>
+ <titlealts><navtitle>SET</navtitle></titlealts>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="SQL"/>
+ <data name="Category" value="Querying"/>
+ <data name="Category" value="Configuring"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">SET statement</indexterm>
+ Specifies values for query options that control the runtime behavior of other statements within the same
+ session.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/syntax_blurb"/>
+
+<codeblock>SET [<varname>query_option</varname>=<varname>option_value</varname>]
+</codeblock>
+
+ <p>
+ <codeph>SET</codeph> with no arguments returns a result set consisting of all available query options and
+ their current values.
+ </p>
+
+ <p>
+ The query option name and any string argument values are case-insensitive.
+ </p>
+
+ <p>
+ Each query option has a specific allowed notation for its arguments. Boolean options can be enabled and
+ disabled by assigning values of either <codeph>true</codeph> and <codeph>false</codeph>, or
+ <codeph>1</codeph> and <codeph>0</codeph>. Some numeric options accept a final character signifying the unit,
+ such as <codeph>2g</codeph> for 2 gigabytes or <codeph>100m</codeph> for 100 megabytes. See
+ <xref href="impala_query_options.xml#query_options"/> for the details of each query option.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+
+ <p>
+ <codeph>MEM_LIMIT</codeph> is probably the most commonly used query option. You can specify a high value to
+ allow a resource-intensive query to complete. For testing how queries would work on memory-constrained
+ systems, you might specify an artificially low value.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
+
+ <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+ <p>
+ The following example sets some numeric and some Boolean query options to control usage of memory, disk
+ space, and timeout periods, then runs a query whose success could depend on the options in effect:
+ </p>
+
+<codeblock>set mem_limit=64g;
+set DISABLE_UNSAFE_SPILLS=true;
+set parquet_file_size=400m;
+set RESERVATION_REQUEST_TIMEOUT=900000;
+insert overwrite parquet_table select c1, c2, count(c3) from text_table group by c1, c2, c3;
+</codeblock>
+
+ <p conref="../shared/impala_common.xml#common/added_in_20"/>
+
+ <p>
+ <codeph>SET</codeph> has always been available as an <cmdname>impala-shell</cmdname> command. Promoting it to
+ a SQL statement lets you use this feature in client applications through the JDBC and ODBC APIs.
+ </p>
+
+<!-- <p conref="/Content/impala_common_xi44078.xml#common/jdbc_blurb"/> -->
+
+ <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
+
+ <p conref="../shared/impala_common.xml#common/permissions_blurb_no"/>
+
+ <p conref="../shared/impala_common.xml#common/related_info"/>
+
+ <p>
+ See <xref href="impala_query_options.xml#query_options"/> for the query options you can adjust using this
+ statement.
+ </p>
+ </conbody>
+</concept>