Posted to commits@impala.apache.org by mj...@apache.org on 2017/02/17 23:17:06 UTC
[3/9] incubator-impala git commit: [DOCS] Major update to Impala +
Kudu page
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/661921b2/docs/topics/impala_kudu.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_kudu.xml b/docs/topics/impala_kudu.xml
index ef90e63..cf08671 100644
--- a/docs/topics/impala_kudu.xml
+++ b/docs/topics/impala_kudu.xml
@@ -20,7 +20,7 @@ under the License.
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="impala_kudu" rev="kudu">
- <title>Using Impala to Query Kudu Tables</title>
+ <title id="kudu">Using Impala to Query Kudu Tables</title>
<prolog>
<metadata>
@@ -36,19 +36,30 @@ under the License.
<p>
<indexterm audience="hidden">Kudu</indexterm>
- You can use Impala to query Kudu tables. This capability allows convenient access to a storage system that is
- tuned for different kinds of workloads than the default with Impala. The default Impala tables use data files
- stored on HDFS, which are ideal for bulk loads and queries using full-table scans. In contrast, Kudu can do
- efficient queries for data organized either in data warehouse style (with full table scans) or for OLTP-style
- workloads (with key-based lookups for single rows or small ranges of values).
+ You can use Impala to query tables stored by Apache Kudu. This capability
+ allows convenient access to a storage system that is tuned for different kinds of
+ workloads than the default Impala storage layer.
</p>
<p>
- Certain Impala SQL statements, such as <codeph>UPDATE</codeph> and <codeph>DELETE</codeph>, only work with
- Kudu tables. These operations were impractical from a performance perspective to perform at large scale on
- HDFS data, or on HBase tables.
+ By default, Impala tables are stored on HDFS using data files with various file formats.
+ HDFS files are ideal for bulk loads (append operations) and queries using full-table scans,
+ but do not support in-place updates or deletes. Kudu is an alternative storage engine used
+ by Impala that can perform both in-place updates (for mixed read/write workloads) and fast scans
+ (for data-warehouse/analytic operations). Using Kudu tables with Impala can simplify the
+ ETL pipeline by avoiding extra steps to segregate and reorganize newly arrived data.
</p>
+ <p>
+ Certain Impala SQL statements and clauses, such as <codeph>DELETE</codeph>,
+ <codeph>UPDATE</codeph>, <codeph>UPSERT</codeph>, and <codeph>PRIMARY KEY</codeph>, work
+ only with Kudu tables. Other statements and clauses, such as <codeph>LOAD DATA</codeph>,
+ <codeph>TRUNCATE TABLE</codeph>, and <codeph>INSERT OVERWRITE</codeph>, are not applicable
+ to Kudu tables.
+ </p>
+
+ <p outputclass="toc inpage"/>
+
</conbody>
<concept id="kudu_benefits">
@@ -58,50 +69,1097 @@ under the License.
<conbody>
<p>
- The combination of Kudu and Impala works best for tables where scan performance is important, but data
- arrives continuously, in small batches, or needs to be updated without being completely replaced. In these
- scenarios (such as for streaming data), it might be impractical to use Parquet tables because Parquet works
- best with multi-megabyte data files, requiring substantial overhead to replace or reorganize data files to
- accomodate frequent additions or changes to data. Impala can query Kudu tables with scan performance close
- to that of Parquet, and Impala can also perform update or delete operations without replacing the entire
- table contents. You can also use the Kudu API to do ingestion or transformation operations outside of
- Impala, and Impala can query the current data at any time.
+ The combination of Kudu and Impala works best for tables where scan performance is
+ important, but data arrives continuously, in small batches, or needs to be updated
+ without being completely replaced. HDFS-backed tables can require substantial overhead
+ to replace or reorganize data files as new data arrives. Impala can perform efficient
+ lookups and scans within Kudu tables, and Impala can also perform update or
+ delete operations efficiently. You can also use the Kudu Java, C++, and Python APIs to
+ do ingestion or transformation operations outside of Impala, and Impala can query the
+ current data at any time.
</p>
</conbody>
</concept>
- <concept id="kudu_primary_key">
+ <concept id="kudu_config">
- <title>Primary Key Columns for Kudu Tables</title>
+ <title>Configuring Impala for Use with Kudu</title>
<conbody>
<p>
- Kudu tables introduce the notion of primary keys to Impala for the first time. The primary key is made up
- of one or more columns, whose values are combined and used as a lookup key during queries. These columns
- cannot contain any <codeph>NULL</codeph> values or any duplicate values, and can never be updated. For a
- partitioned Kudu table, all the partition key columns must come from the set of primary key columns.
+ The <codeph>-kudu_master_hosts</codeph> configuration property must be set correctly
+ for the <cmdname>impalad</cmdname> daemon so that <codeph>CREATE TABLE ... STORED AS
+ KUDU</codeph> statements can connect to the appropriate Kudu server. Typically, the
+ required value for this setting is <codeph><varname>kudu_host</varname>:7051</codeph>.
+ In a high-availability Kudu deployment, specify the names of multiple Kudu hosts, separated by commas.
</p>
<p>
- Impala itself still does not have the notion of unique or non-<codeph>NULL</codeph> constraints. These
- restrictions on the primary key columns are enforced on the Kudu side.
+ If the <codeph>-kudu_master_hosts</codeph> configuration property is not set, you can
+ still associate the appropriate value for each table by specifying a
+ <codeph>TBLPROPERTIES('kudu.master_addresses')</codeph> clause in the <codeph>CREATE TABLE</codeph> statement or
+ changing the <codeph>TBLPROPERTIES('kudu.master_addresses')</codeph> value with an <codeph>ALTER TABLE</codeph>
+ statement.
</p>
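+
+      <p>
+        For example, the following statements sketch both approaches. (The table name and the
+        host names <codeph>kudu-master-1</codeph> and <codeph>kudu-master-2</codeph> are
+        hypothetical placeholders.)
+      </p>
+
+<codeblock>
+CREATE TABLE por_table (id BIGINT PRIMARY KEY, s STRING)
+  PARTITION BY HASH(id) PARTITIONS 2
+  STORED AS KUDU
+  TBLPROPERTIES ('kudu.master_addresses'='kudu-master-1:7051,kudu-master-2:7051');
+
+ALTER TABLE por_table
+  SET TBLPROPERTIES ('kudu.master_addresses'='kudu-master-2:7051');
+</codeblock>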
+ </conbody>
+
+ <concept id="kudu_topology">
+
+ <title>Cluster Topology for Kudu Tables</title>
+
+ <conbody>
+
+ <p>
+ With HDFS-backed tables, you are typically concerned with the number of DataNodes in
+ the cluster, how many and how large HDFS data files are read during a query, and
+ therefore the amount of work performed by each DataNode and the network communication
+ to combine intermediate results and produce the final result set.
+ </p>
+
+ <p>
+ With Kudu tables, the topology considerations are different, because:
+ </p>
+
+ <ul>
+ <li>
+ <p>
+ The underlying storage is managed and organized by Kudu, not represented as HDFS
+ data files.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Kudu handles some of the underlying mechanics of partitioning the data. You can specify
+ the partitioning scheme with combinations of hash and range partitioning, so that you can
+ decide how much effort to expend to manage the partitions as new data arrives. For example,
+ you can construct partitions that apply to date ranges rather than a separate partition for each
+ day or each hour.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Data is physically divided based on units of storage called <term>tablets</term>. Tablets are
+ stored by <term>tablet servers</term>. Each tablet server can store multiple tablets,
+ and each tablet is replicated across multiple tablet servers, managed automatically by Kudu.
+ Where practical, colocate the tablet servers on the same hosts as the DataNodes, although that is not required.
+ </p>
+ </li>
+ </ul>
+
+ <p>
+ One consideration for the cluster topology is that the number of replicas for a Kudu table
+ must be odd.
+ </p>
+
+ </conbody>
+
+ </concept>
+
+ </concept>
+
+ <concept id="kudu_ddl">
+
+ <title>Impala DDL Enhancements for Kudu Tables (CREATE TABLE and ALTER TABLE)</title>
+
+ <prolog>
+ <metadata>
+ <data name="Category" value="DDL"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
<p>
- The primary key columns must be the first ones specified in the <codeph>CREATE TABLE</codeph> statement.
- You specify which column or columns make up the primary key in the table properties, rather than through
- attributes in the column list.
+ You can use the Impala <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph>
+ statements to create and fine-tune the characteristics of Kudu tables. Because Kudu
+ tables have features and properties that do not apply to other kinds of Impala tables,
+ familiarize yourself with Kudu-related concepts and syntax first.
+ For the general syntax of the <codeph>CREATE TABLE</codeph>
+ statement for Kudu tables, see <xref keyref="create_table"/>.
+ </p>
+
+ <p outputclass="toc inpage"/>
+
+ </conbody>
+
+ <concept id="kudu_primary_key">
+
+ <title>Primary Key Columns for Kudu Tables</title>
+
+ <conbody>
+
+ <p>
+ Kudu tables introduce the notion of primary keys to Impala for the first time. The
+ primary key is made up of one or more columns, whose values are combined and used as a
+ lookup key during queries. The tuple represented by these columns must be unique, cannot contain any
+ <codeph>NULL</codeph> values, and can never be updated once inserted. For a
+ Kudu table, all the partition key columns must come from the set of
+ primary key columns.
+ </p>
+
+ <p>
+ The primary key has both physical and logical aspects:
+ </p>
+
+ <ul>
+ <li>
+ <p>
+ On the physical side, it is used to map the data values to particular tablets for fast retrieval.
+ Because the tuples formed by the primary key values are unique, the primary key columns are typically
+ highly selective.
+ </p>
+ </li>
+ <li>
+ <p>
+ On the logical side, the uniqueness constraint allows you to avoid duplicate data in a table.
+ For example, if an <codeph>INSERT</codeph> operation fails partway through, only some of the
+ new rows might be present in the table. You can re-run the same <codeph>INSERT</codeph>, and
+ only the missing rows will be added. Or if data in the table is stale, you can run an
+ <codeph>UPSERT</codeph> statement that brings the data up to date, without the possibility
+ of creating duplicate copies of existing rows.
+ </p>
+ </li>
+ </ul>
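+
+      <p>
+        The following statements sketch the idempotent re-run behavior described above. (The
+        table name and values are illustrative.)
+      </p>
+
+<codeblock>
+CREATE TABLE inventory (item_id BIGINT PRIMARY KEY, qty BIGINT)
+  PARTITION BY HASH(item_id) PARTITIONS 2 STORED AS KUDU;
+
+-- If this INSERT fails partway through, re-running it adds only the
+-- missing rows; rows whose primary key already exists are skipped.
+INSERT INTO inventory VALUES (1, 10), (2, 20), (3, 30);
+
+-- UPSERT updates stale rows in place and inserts new ones, without
+-- creating duplicate copies of existing rows.
+UPSERT INTO inventory VALUES (2, 25), (4, 40);
+</codeblock>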
+
+ <note>
+ <p>
+ Impala only allows <codeph>PRIMARY KEY</codeph> clauses and <codeph>NOT NULL</codeph>
+ constraints on columns for Kudu tables. These constraints are enforced on the Kudu side.
+ </p>
+ </note>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="kudu_column_attributes" rev="IMPALA-3726">
+
+ <title>Kudu-Specific Column Attributes for CREATE TABLE</title>
+
+ <conbody>
+
+ <p>
+ For the general syntax of the <codeph>CREATE TABLE</codeph>
+ statement for Kudu tables, see <xref keyref="create_table"/>.
+ The following sections provide more detail for some of the
+ Kudu-specific keywords you can use in column definitions.
+ </p>
+
+ <p>
+ The column list in a <codeph>CREATE TABLE</codeph> statement can include the following
+ attributes, which only apply to Kudu tables:
+ </p>
+
+<codeblock>
+ PRIMARY KEY
+| [NOT] NULL
+| ENCODING <varname>codec</varname>
+| COMPRESSION <varname>algorithm</varname>
+| DEFAULT <varname>constant_expression</varname>
+| BLOCK_SIZE <varname>number</varname>
+</codeblock>
+
+ <p outputclass="toc inpage">
+ See the following sections for details about each column attribute.
+ </p>
+
+ </conbody>
+
+ <concept id="kudu_primary_key_attribute">
+
+ <title>PRIMARY KEY Attribute</title>
+
+ <conbody>
+
+ <p>
+ The primary key for a Kudu table is a column, or set of columns, that uniquely
+ identifies every row. The primary key value is also used as the natural sort order
+ for the values from the table. The primary key value for each row is based on the
+ combination of values for the columns.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/pk_implies_not_null"/>
+
+ <p>
+ The primary key columns must be the first ones specified in the <codeph>CREATE
+ TABLE</codeph> statement. For a single-column primary key, you can include a
+ <codeph>PRIMARY KEY</codeph> attribute inline with the column definition. For a
+ multi-column primary key, you include a <codeph>PRIMARY KEY (<varname>c1</varname>,
+ <varname>c2</varname>, ...)</codeph> clause as a separate entry at the end of the
+ column list.
+ </p>
+
+ <p>
+ You can specify the <codeph>PRIMARY KEY</codeph> attribute either inline in a single
+ column definition, or as a separate clause at the end of the column list:
+ </p>
+
+<codeblock>
+CREATE TABLE pk_inline
+(
+ col1 BIGINT PRIMARY KEY,
+ col2 STRING,
+ col3 BOOLEAN
+) PARTITION BY HASH(col1) PARTITIONS 2 STORED AS KUDU;
+
+CREATE TABLE pk_at_end
+(
+ col1 BIGINT,
+ col2 STRING,
+ col3 BOOLEAN,
+ PRIMARY KEY (col1)
+) PARTITION BY HASH(col1) PARTITIONS 2 STORED AS KUDU;
+</codeblock>
+
+ <p>
+ When the primary key is a single column, these two forms are equivalent. If the
+ primary key consists of more than one column, you must specify the primary key using
+ a separate entry in the column list:
+ </p>
+
+<codeblock>
+CREATE TABLE pk_multiple_columns
+(
+ col1 BIGINT,
+ col2 STRING,
+ col3 BOOLEAN,
+ <b>PRIMARY KEY (col1, col2)</b>
+) PARTITION BY HASH(col2) PARTITIONS 2 STORED AS KUDU;
+</codeblock>
+
+ <p>
+ The <codeph>SHOW CREATE TABLE</codeph> statement always represents the
+ <codeph>PRIMARY KEY</codeph> specification as a separate item in the column list:
+ </p>
+
+<codeblock>
+CREATE TABLE inline_pk_rewritten (id BIGINT <b>PRIMARY KEY</b>, s STRING)
+ PARTITION BY HASH(id) PARTITIONS 2 STORED AS KUDU;
+
+SHOW CREATE TABLE inline_pk_rewritten;
++------------------------------------------------------------------------------+
+| result |
++------------------------------------------------------------------------------+
+| CREATE TABLE user.inline_pk_rewritten ( |
+| id BIGINT NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
+| s STRING NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
+| <b>PRIMARY KEY (id)</b> |
+| ) |
+| PARTITION BY HASH (id) PARTITIONS 2 |
+| STORED AS KUDU |
+| TBLPROPERTIES ('kudu.master_addresses'='host.example.com') |
++------------------------------------------------------------------------------+
+</codeblock>
+
+ <p>
+ The notion of primary key applies only to Kudu tables. Every Kudu table requires a
+ primary key consisting of one or more columns, which must be specified first in the
+ column list.
+ </p>
+
+ <p>
+ The contents of the primary key columns cannot be changed by an
+ <codeph>UPDATE</codeph> or <codeph>UPSERT</codeph> statement. Including too many
+ columns in the primary key (more than 5 or 6) can also reduce the performance of
+ write operations. Therefore, pick the most selective and most frequently
+ tested non-null columns for the primary key specification.
+ If a column must always have a value, but that value
+ might change later, leave it out of the primary key and use a <codeph>NOT
+ NULL</codeph> clause for that column instead. If an existing row has an
+ incorrect or outdated key column value, delete the old row and insert an entirely
+ new row with the correct primary key.
+ </p>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="kudu_not_null_attribute">
+
+ <title>NULL | NOT NULL Attribute</title>
+
+ <conbody>
+
+ <p>
+ For Kudu tables, you can specify which columns can contain nulls or not. This
+ constraint offers an extra level of consistency enforcement for Kudu tables. If an
+ application requires a field to always be specified, include a <codeph>NOT
+ NULL</codeph> clause in the corresponding column definition, and Kudu prevents rows
+ from being inserted with a <codeph>NULL</codeph> in that column.
+ </p>
+
+ <p>
+ For example, a table containing geographic information might require the latitude
+ and longitude coordinates to always be specified. Other attributes might be allowed
+ to be <codeph>NULL</codeph>. For example, a location might not have a designated
+ place name, its altitude might be unimportant, and its population might be initially
+ unknown, to be filled in later.
+ </p>
+
+ <p conref="../shared/impala_common.xml#common/pk_implies_not_null"/>
+
+ <p>
+ For non-Kudu tables, Impala allows any column to contain <codeph>NULL</codeph>
+ values, because it is not practical to enforce a <q>not null</q> constraint on HDFS
+ data files that could be prepared using external tools and ETL processes.
+ </p>
+
+<codeblock>
+CREATE TABLE required_columns
+(
+ id BIGINT PRIMARY KEY,
+ latitude DOUBLE NOT NULL,
+ longitude DOUBLE NOT NULL,
+ place_name STRING,
+ altitude DOUBLE,
+ population BIGINT
+) PARTITION BY HASH(id) PARTITIONS 2 STORED AS KUDU;
+</codeblock>
+
+ <p>
+ During performance optimization, Kudu can use the knowledge that nulls are not
+ allowed to skip certain checks on each input row, speeding up queries and join
+ operations. Therefore, specify <codeph>NOT NULL</codeph> constraints when
+ appropriate.
+ </p>
+
+ <p>
+ The <codeph>NULL</codeph> clause is the default condition for all columns that are not
+ part of the primary key. You can omit it, or specify it to clarify that you have made a
+ conscious design decision to allow nulls in a column.
+ </p>
+
+ <p>
+ Because primary key columns cannot contain any <codeph>NULL</codeph> values, the
+ <codeph>NOT NULL</codeph> clause is not required for the primary key columns,
+ but you might still specify it to make your code self-describing.
+ </p>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="kudu_default_attribute">
+
+ <title>DEFAULT Attribute</title>
+
+ <conbody>
+
+ <p>
+ You can specify a default value for columns in Kudu tables. The default value can be
+ any constant expression, for example, a combination of literal values, arithmetic
+ and string operations. It cannot contain references to columns or non-deterministic
+ function calls.
+ </p>
+
+ <p>
+ The following example shows different kinds of expressions for the
+ <codeph>DEFAULT</codeph> clause. The requirement to use a constant expression means that
+ you can fill in a placeholder value such as <codeph>NULL</codeph>, empty string,
+ 0, -1, <codeph>'N/A'</codeph> and so on, but you cannot reference other columns or
+ call non-deterministic functions. Therefore, you cannot use <codeph>DEFAULT</codeph> to do things such as
+ automatically making an uppercase copy of a string value from another column, storing Boolean values based
+ on tests of other columns, or adding or subtracting one from another column representing a sequence number.
+ </p>
+
+<codeblock>
+CREATE TABLE default_vals
+(
+ id BIGINT PRIMARY KEY,
+ name STRING NOT NULL DEFAULT 'unknown',
+ address STRING DEFAULT upper('no fixed address'),
+ age INT DEFAULT -1,
+ earthling BOOLEAN DEFAULT TRUE,
+ planet_of_origin STRING DEFAULT 'Earth',
+ optional_col STRING DEFAULT NULL
+) PARTITION BY HASH(id) PARTITIONS 2 STORED AS KUDU;
+</codeblock>
+
+ <note>
+ <p>
+ When designing an entirely new schema, prefer to use <codeph>NULL</codeph> as the
+ placeholder for any unknown or missing values, because that is the universal convention
+ among database systems. Null values can be stored efficiently, and easily checked with the
+ <codeph>IS NULL</codeph> or <codeph>IS NOT NULL</codeph> operators. The <codeph>DEFAULT</codeph>
+ attribute is appropriate when ingesting data that already has an established convention for
+ representing unknown or missing values, or where the vast majority of rows have some common
+ non-null value.
+ </p>
+ </note>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="kudu_encoding_attribute">
+
+ <title>ENCODING Attribute</title>
+
+ <conbody>
+
+ <p>
+ Each column in a Kudu table can optionally use an encoding, a low-overhead form of
+ compression that reduces the size on disk, at the cost of additional CPU cycles to
+ reconstruct the original values during queries. Typically, highly compressible data
+ benefits from the reduced I/O to read the data back from disk. By default, each
+ column uses the <q>plain</q> encoding where the data is stored unchanged.
+ </p>
+
+ <p>
+ The encoding keywords that Impala recognizes are:
+
+ <ul>
+ <li>
+ <p>
+ <codeph>AUTO_ENCODING</codeph>: use the default encoding based on the column
+ type; currently always the same as <codeph>PLAIN_ENCODING</codeph>, but subject to
+ change in the future.
+ </p>
+ </li>
+ <li>
+ <p>
+ <codeph>PLAIN_ENCODING</codeph>: leave the value in its original binary format.
+ </p>
+ </li>
+ <!-- GROUP_VARINT is internal use only, not documenting that although it shows up
+ in parser error messages. -->
+ <li>
+ <p>
+ <codeph>RLE</codeph>: compress repeated values (when sorted in primary key
+ order) by including a count.
+ </p>
+ </li>
+ <li>
+ <p>
+ <codeph>DICT_ENCODING</codeph>: when the number of different string values is
+ low, replace the original string with a numeric ID.
+ </p>
+ </li>
+ <li>
+ <p>
+ <codeph>BIT_SHUFFLE</codeph>: rearrange the bits of the values to efficiently
+ compress sequences of values that are identical or vary only slightly based
+ on primary key order. The resulting encoded data is also compressed with LZ4.
+ </p>
+ </li>
+ <li>
+ <p>
+ <codeph>PREFIX_ENCODING</codeph>: compress common prefixes in string values; mainly for use internally within Kudu.
+ </p>
+ </li>
+ </ul>
+ </p>
+
+<!--
+UNKNOWN, AUTO_ENCODING, PLAIN_ENCODING, PREFIX_ENCODING, GROUP_VARINT, RLE, DICT_ENCODING, BIT_SHUFFLE
+
+No joy trying keywords UNKNOWN, or GROUP_VARINT with TINYINT and BIGINT.
+-->
+
+ <p>
+ The following example shows the Impala keywords representing the encoding types.
+ (The Impala keywords match the symbolic names used within Kudu.)
+ For usage guidelines on the different kinds of encoding, see
+ <xref href="https://kudu.apache.org/docs/schema_design.html" scope="external" format="html">the Kudu documentation</xref>.
+ The <codeph>DESCRIBE</codeph> output shows how the encoding is reported after
+ the table is created, and that omitting the encoding (in this case, for the
+ <codeph>ID</codeph> column) is the same as specifying <codeph>AUTO_ENCODING</codeph>.
+ </p>
+
+<codeblock>
+CREATE TABLE various_encodings
+(
+ id BIGINT PRIMARY KEY,
+ c1 BIGINT ENCODING PLAIN_ENCODING,
+ c2 BIGINT ENCODING AUTO_ENCODING,
+ c3 TINYINT ENCODING BIT_SHUFFLE,
+ c4 DOUBLE ENCODING BIT_SHUFFLE,
+ c5 BOOLEAN ENCODING RLE,
+ c6 STRING ENCODING DICT_ENCODING,
+ c7 STRING ENCODING PREFIX_ENCODING
+) PARTITION BY HASH(id) PARTITIONS 2 STORED AS KUDU;
+
+-- Some columns are omitted from the output for readability.
+describe various_encodings;
++------+---------+-------------+----------+-----------------+
+| name | type | primary_key | nullable | encoding |
++------+---------+-------------+----------+-----------------+
+| id | bigint | true | false | AUTO_ENCODING |
+| c1 | bigint | false | true | PLAIN_ENCODING |
+| c2 | bigint | false | true | AUTO_ENCODING |
+| c3 | tinyint | false | true | BIT_SHUFFLE |
+| c4 | double | false | true | BIT_SHUFFLE |
+| c5 | boolean | false | true | RLE |
+| c6 | string | false | true | DICT_ENCODING |
+| c7 | string | false | true | PREFIX_ENCODING |
++------+---------+-------------+----------+-----------------+
+</codeblock>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="kudu_compression_attribute">
+
+ <title>COMPRESSION Attribute</title>
+
+ <conbody>
+
+ <p>
+ You can specify a compression algorithm to use for each column in a Kudu table. This
+ attribute imposes more CPU overhead when retrieving the values than the
+ <codeph>ENCODING</codeph> attribute does. Therefore, use it primarily for columns with
+ long strings that do not benefit much from the less-expensive <codeph>ENCODING</codeph>
+ attribute.
+ </p>
+
+ <p>
+ The choices for <codeph>COMPRESSION</codeph> are <codeph>LZ4</codeph>,
+ <codeph>SNAPPY</codeph>, and <codeph>ZLIB</codeph>.
+ </p>
+
+ <note>
+ <p>
+ Columns that use the <codeph>BIT_SHUFFLE</codeph> encoding are already compressed
+ using <codeph>LZ4</codeph>, and so typically do not need any additional
+ <codeph>COMPRESSION</codeph> attribute.
+ </p>
+ </note>
+
+ <p>
+ The following example shows design considerations for several
+ <codeph>STRING</codeph> columns with different distribution characteristics, leading
+ to choices for both the <codeph>ENCODING</codeph> and <codeph>COMPRESSION</codeph>
+ attributes. The <codeph>user_id</codeph> values are drawn from a comparatively small set of repeated strings,
+ therefore this column is a good candidate for dictionary encoding. The
+ <codeph>post_id</codeph> column contains an ascending sequence of integers, where
+ several leading bits are likely to be all zeroes, therefore this column is a good
+ candidate for bitshuffle encoding. The <codeph>body</codeph>
+ column and the corresponding columns for translated versions tend to be long unique
+ strings that are not practical to use with any of the encoding schemes, therefore
+ they employ the <codeph>COMPRESSION</codeph> attribute instead. The ideal compression
+ codec in each case would require some experimentation to determine how much space
+ savings it provided and how much CPU overhead it added, based on real-world data.
+ </p>
+
+<codeblock>
+CREATE TABLE blog_posts
+(
+ user_id STRING ENCODING DICT_ENCODING,
+ post_id BIGINT ENCODING BIT_SHUFFLE,
+ subject STRING ENCODING PLAIN_ENCODING,
+ body STRING COMPRESSION LZ4,
+ spanish_translation STRING COMPRESSION SNAPPY,
+ esperanto_translation STRING COMPRESSION ZLIB,
+ PRIMARY KEY (user_id, post_id)
+) PARTITION BY HASH(user_id, post_id) PARTITIONS 2 STORED AS KUDU;
+</codeblock>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="kudu_block_size_attribute">
+
+ <title>BLOCK_SIZE Attribute</title>
+
+ <conbody>
+
+ <p>
+ Although Kudu does not use HDFS files internally, and thus is not affected by
+ the HDFS block size, it does have an underlying unit of I/O called the
+ <term>block size</term>. The <codeph>BLOCK_SIZE</codeph> attribute lets you set the
+ block size for any column.
+ </p>
+
+ <p>
+ The block size attribute is a relatively advanced feature. Refer to
+ <xref href="https://kudu.apache.org/docs/index.html" scope="external" format="html">the Kudu documentation</xref>
+ for usage details.
+ </p>
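+
+      <p>
+        The following statement illustrates the syntax only; the values shown are arbitrary,
+        not tuning recommendations.
+      </p>
+
+<codeblock>
+CREATE TABLE block_size_example
+(
+  id BIGINT PRIMARY KEY,
+  c1 STRING BLOCK_SIZE 16384
+) PARTITION BY HASH(id) PARTITIONS 2 STORED AS KUDU;
+</codeblock>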
+
+<!-- Commenting out this example for the time being.
+<codeblock>
+CREATE TABLE performance_for_benchmark_xyz
+(
+ id BIGINT PRIMARY KEY,
+ col1 BIGINT BLOCK_SIZE 4096,
+ col2 STRING BLOCK_SIZE 16384,
+ col3 SMALLINT BLOCK_SIZE 2048
+) PARTITION BY HASH(id) PARTITIONS 2 STORED AS KUDU;
+</codeblock>
+-->
+
+ </conbody>
+
+ </concept>
+
+ </concept>
+
+ <concept id="kudu_partitioning">
+
+ <title>Partitioning for Kudu Tables</title>
+
+ <conbody>
+
+ <p>
+ Kudu tables use special mechanisms to distribute data among the underlying
+ tablet servers. Although we refer to such tables as partitioned tables, they are
+ distinguished from traditional Impala partitioned tables by use of different clauses
+ on the <codeph>CREATE TABLE</codeph> statement. Kudu tables use
+ <codeph>PARTITION BY</codeph>, <codeph>HASH</codeph>, <codeph>RANGE</codeph>, and
+ range specification clauses rather than the <codeph>PARTITIONED BY</codeph> clause
+ for HDFS-backed tables, which specifies only a column name and creates a new partition for each
+ different value.
+ </p>
+
+ <p>
+ For background information and architectural details about the Kudu partitioning
+ mechanism, see
+ <xref href="https://kudu.apache.org/kudu.pdf" scope="external" format="html">the Kudu white paper, section 3.2</xref>.
+ </p>
+
+<!-- Hiding but leaving in place for the moment, in case the white paper discussion isn't enough.
+ <p>
+ With Kudu tables, all of the columns involved in these clauses must be primary key
+ columns. These clauses let you specify different ways to divide the data for each
+ column, or even for different value ranges within a column. This flexibility lets you
+ avoid problems with uneven distribution of data, where the partitioning scheme for
+ HDFS tables might result in some partitions being much larger than others. By setting
+ up an effective partitioning scheme for a Kudu table, you can ensure that the work for
+ a query can be parallelized evenly across the hosts in a cluster.
+ </p>
+-->
+
+ <note>
+ <p>
+ The Impala DDL syntax for Kudu tables is different from that in early Kudu versions,
+ which used an experimental fork of the Impala code. For example, the
+ <codeph>DISTRIBUTE BY</codeph> clause is now <codeph>PARTITION BY</codeph>, the
+ <codeph>INTO <varname>n</varname> BUCKETS</codeph> clause is now
+ <codeph>PARTITIONS <varname>n</varname></codeph>, and the range partitioning syntax
+ is reworked to replace the <codeph>SPLIT ROWS</codeph> clause with more expressive
+ syntax involving comparison operators.
+ </p>
+ </note>
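+
+      <p>
+        For example, a table distributed with the experimental clause
+        <codeph>DISTRIBUTE BY HASH (id) INTO 16 BUCKETS</codeph> would now be declared as
+        follows (the table name is illustrative):
+      </p>
+
+<codeblock>
+CREATE TABLE reworked_syntax (id BIGINT PRIMARY KEY, s STRING)
+  PARTITION BY HASH (id) PARTITIONS 16
+  STORED AS KUDU;
+</codeblock>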
+
+ <p outputclass="toc inpage"/>
+
+ </conbody>
+
+ <concept id="kudu_hash_partitioning">
+ <title>Hash Partitioning</title>
+ <conbody>
+
+ <p>
+ Hash partitioning is the simplest type of partitioning for Kudu tables.
+ For hash-partitioned Kudu tables, inserted rows are divided up between a fixed number
+ of <q>buckets</q> by applying a hash function to the values of the columns specified
+ in the <codeph>HASH</codeph> clause.
+ Hashing ensures that rows with similar values are evenly distributed, instead of
+ clumping together all in the same bucket. Spreading new rows across the buckets this
+ way lets insertion operations work in parallel across multiple tablet servers.
+ Separating the hashed values can impose additional overhead on queries with
+ range-based predicates, which might have to read multiple tablets to retrieve
+ all the relevant values.
+ </p>
+
+<codeblock>
+-- 1M rows with 50 hash partitions = approximately 20,000 rows per partition.
+-- The values in each partition are not sequential, but rather based on a hash function.
+-- Rows 1, 99999, and 123456 might be in the same partition.
+CREATE TABLE million_rows (id string primary key, s string)
+ PARTITION BY HASH(id) PARTITIONS 50
+ STORED AS KUDU;
+
+-- Because the ID values are unique, we expect the rows to be roughly
+-- evenly distributed between the buckets in the destination table.
+INSERT INTO million_rows SELECT * FROM billion_rows ORDER BY id LIMIT 1e6;
+</codeblock>
+
+ <note>
+ <p>
+ The largest number of buckets that you can create with a <codeph>PARTITIONS</codeph>
+ clause varies depending on the number of tablet servers in the cluster, while the smallest is 2.
+ For simplicity, some of the simple <codeph>CREATE TABLE</codeph> statements throughout this section
+ use <codeph>PARTITIONS 2</codeph> to illustrate the minimum requirements for a Kudu table.
+ For large tables, prefer to use roughly 10 partitions per server in the cluster.
+ </p>
+ </note>
+
+ </conbody>
+ </concept>
+
+ <concept id="kudu_range_partitioning">
+ <title>Range Partitioning</title>
+ <conbody>
+
+ <p>
+ Range partitioning lets you specify partitioning precisely, based on single values or ranges
+ of values within one or more columns. You add one or more <codeph>RANGE</codeph> clauses to the
+ <codeph>CREATE TABLE</codeph> statement, following the <codeph>PARTITION BY</codeph>
+ clause.
+ </p>
+
+ <p>
+ Range-partitioned Kudu tables use one or more range clauses, which include a
+ combination of constant expressions, <codeph>VALUE</codeph> or <codeph>VALUES</codeph>
+ keywords, and comparison operators. (This syntax replaces the <codeph>SPLIT
+ ROWS</codeph> clause used with early Kudu versions.)
+ For the full syntax, see <xref keyref="create_table"/>.
+ </p>
+
+<codeblock><![CDATA[
+-- 50 buckets, all for IDs beginning with a lowercase letter.
+-- Having only a single range enforces the allowed range of values
+-- but does not add any extra parallelism.
+create table million_rows_one_range (id string primary key, s string)
+ partition by hash(id) partitions 50,
+ range (partition 'a' <= values < '{')
+ stored as kudu;
+
+-- 50 buckets for IDs beginning with a lowercase letter
+-- plus 50 buckets for IDs beginning with an uppercase letter.
+-- Total number of buckets = number in the PARTITIONS clause x number of ranges.
+-- We are still enforcing constraints on the primary key values
+-- allowed in the table, and the 2 ranges provide better parallelism
+-- as rows are inserted or the table is scanned.
+create table million_rows_two_ranges (id string primary key, s string)
+ partition by hash(id) partitions 50,
+ range (partition 'a' <= values < '{', partition 'A' <= values < '[')
+ stored as kudu;
+
+-- Same as previous table, with an extra range covering the single key value '00000'.
+create table million_rows_three_ranges (id string primary key, s string)
+ partition by hash(id) partitions 50,
+ range (partition 'a' <= values < '{', partition 'A' <= values < '[', partition value = '00000')
+ stored as kudu;
+
+-- The range partitioning can be displayed with a SHOW command in impala-shell.
+show range partitions million_rows_three_ranges;
++---------------------+
+| RANGE (id) |
++---------------------+
+| VALUE = "00000" |
+| "A" <= VALUES < "[" |
+| "a" <= VALUES < "{" |
++---------------------+
+]]>
+</codeblock>
+
+ <note>
+ <p>
+ When defining ranges, be careful to avoid <q>fencepost errors</q> where values at the
+ extreme ends might be included or omitted by accident. For example, in the tables defined
+ in the preceding code listings, the range <codeph><![CDATA["a" <= VALUES < "{"]]></codeph> ensures that
+ any values starting with <codeph>z</codeph>, such as <codeph>za</codeph> or <codeph>zzz</codeph>
+        or <codeph>zzz-ZZZ</codeph>, are all included, because the upper bound <codeph>{</codeph>
+        is the first character greater than <codeph>z</codeph> in ASCII order.
+ </p>
+ </note>
+
+ <p>
+ For range-partitioned Kudu tables, an appropriate range must exist before a data value can be created in the table.
+ Any <codeph>INSERT</codeph>, <codeph>UPDATE</codeph>, or <codeph>UPSERT</codeph> statements fail if they try to
+ create column values that fall outside the specified ranges. The error checking for ranges is performed on the
+ Kudu side; Impala passes the specified range information to Kudu, and passes back any error or warning if the
+ ranges are not valid. (A nonsensical range specification causes an error for a DDL statement, but only a warning
+ for a DML statement.)
+ </p>
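+
+    <p>
+      As a minimal sketch (using a hypothetical <codeph>range_demo</codeph> table), a row
+      whose key falls outside all the defined ranges is rejected, and the DML statement
+      completes with a warning:
+    </p>
+
+<codeblock><![CDATA[
+-- Hypothetical table with a single allowed range of key values.
+create table range_demo (x bigint primary key, s string)
+  partition by range (x) (partition 0 <= values < 100)
+  stored as kudu;
+
+insert into range_demo values (50, 'ok');    -- within the range; row is inserted
+insert into range_demo values (500, 'bad');  -- no matching range; row is rejected with a warning
+]]>
+</codeblock>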
+
+ <p>
+ Ranges can be non-contiguous:
+ </p>
+
+<codeblock><![CDATA[
+partition by range (year) (partition 1885 <= values <= 1889, partition 1893 <= values <= 1897)
+
+partition by range (letter_grade) (partition value = 'A', partition value = 'B',
+ partition value = 'C', partition value = 'D', partition value = 'F')
+]]>
+</codeblock>
+
+ <p>
+      The <codeph>ALTER TABLE</codeph> statement with the <codeph>ADD RANGE PARTITION</codeph> or
+      <codeph>DROP RANGE PARTITION</codeph> clauses can be used to add or remove ranges from an
+      existing Kudu table.
+    </p>
+
+<codeblock><![CDATA[
+ALTER TABLE foo ADD RANGE PARTITION 30 <= VALUES < 50;
+ALTER TABLE foo DROP RANGE PARTITION 1 <= VALUES < 5;
+]]>
+</codeblock>
+
+ <p>
+ When a range is added, the new range must not overlap with any of the previous ranges;
+ that is, it can only fill in gaps within the previous ranges.
+ </p>
+
+<codeblock><![CDATA[
+alter table test_scores add range partition value = 'E';
+
+alter table year_ranges add range partition 1890 <= values < 1893;
+]]>
+</codeblock>
+
+ <p>
+ When a range is removed, all the associated rows in the table are deleted. (This
+ is true whether the table is internal or external.)
+ </p>
+
+<codeblock><![CDATA[
+alter table test_scores drop range partition value = 'E';
+
+alter table year_ranges drop range partition 1890 <= values < 1893;
+]]>
+</codeblock>
+
+ <p>
+ Kudu tables can also use a combination of hash and range partitioning.
+ </p>
+
+<codeblock><![CDATA[
+partition by hash (school) partitions 10,
+ range (letter_grade) (partition value = 'A', partition value = 'B',
+ partition value = 'C', partition value = 'D', partition value = 'F')
+]]>
+</codeblock>
+
+ </conbody>
+ </concept>
+
+ <concept id="kudu_partitioning_misc">
+ <title>Working with Partitioning in Kudu Tables</title>
+ <conbody>
+
+ <p>
+ To see the current partitioning scheme for a Kudu table, you can use the <codeph>SHOW
+ CREATE TABLE</codeph> statement or the <codeph>SHOW PARTITIONS</codeph> statement. The
+        <codeph>CREATE TABLE</codeph> syntax displayed by this statement includes all the
+        hash and range clauses, reflecting the original table structure plus any
+        subsequent <codeph>ALTER TABLE</codeph> statements that changed the table structure.
+ </p>
+
+ <p>
+ To see the underlying buckets and partitions for a Kudu table, use the
+ <codeph>SHOW TABLE STATS</codeph> or <codeph>SHOW PARTITIONS</codeph> statement.
+ </p>
+
+ </conbody>
+ </concept>
+
+ </concept>
+
+ <concept id="kudu_timestamps">
+
+ <title>Handling Date, Time, or Timestamp Data with Kudu</title>
+
+ <conbody>
+
+ <p>
+        Because a Kudu table currently cannot have a column with the Impala
+        <codeph>TIMESTAMP</codeph> type, expect to store date/time information as the number
+        of seconds, milliseconds, or microseconds since the Unix epoch date of January 1,
+        1970. Specify the column as <codeph>BIGINT</codeph> in the Impala <codeph>CREATE
+        TABLE</codeph> statement, corresponding to an 8-byte integer (an
+        <codeph>int64</codeph>) in the underlying Kudu table. Then use Impala date/time
+ conversion functions as necessary to produce a numeric, <codeph>TIMESTAMP</codeph>,
+ or <codeph>STRING</codeph> value depending on the context.
+ </p>
+
+ <p>
+ For example, the <codeph>unix_timestamp()</codeph> function returns an integer result
+ representing the number of seconds past the epoch. The <codeph>now()</codeph> function
+ produces a <codeph>TIMESTAMP</codeph> representing the current date and time, which can
+ be passed as an argument to <codeph>unix_timestamp()</codeph>. And string literals
+ representing dates and date/times can be cast to <codeph>TIMESTAMP</codeph>, and from there
+ converted to numeric values. The following examples show how you might store a date/time
+ column as <codeph>BIGINT</codeph> in a Kudu table, but still use string literals and
+ <codeph>TIMESTAMP</codeph> values for convenience.
+ </p>
+
+<codeblock><![CDATA[
+-- now() returns a TIMESTAMP and shows the format for string literals you can cast to TIMESTAMP.
+select now();
++-------------------------------+
+| now() |
++-------------------------------+
+| 2017-01-25 23:50:10.132385000 |
++-------------------------------+
+
+-- unix_timestamp() accepts either a TIMESTAMP or an equivalent string literal.
+select unix_timestamp(now());
++-----------------------+
+| unix_timestamp(now()) |
++-----------------------+
+| 1485386670            |
++-----------------------+
+
+select unix_timestamp('2017-01-01');
++------------------------------+
+| unix_timestamp('2017-01-01') |
++------------------------------+
+| 1483228800 |
++------------------------------+
+
+-- Make a table representing a date/time value as BIGINT.
+-- Construct 1 range partition per year, each with 20 associated hash partitions.
+-- Use date/time conversion functions to express the ranges as human-readable dates.
+create table time_series(id bigint, when_exactly bigint, event string, primary key (id, when_exactly))
+ partition by hash (id) partitions 20,
+ range (when_exactly)
+ (
+ partition unix_timestamp('2015-01-01') <= values < unix_timestamp('2016-01-01'),
+ partition unix_timestamp('2016-01-01') <= values < unix_timestamp('2017-01-01'),
+ partition unix_timestamp('2017-01-01') <= values < unix_timestamp('2018-01-01')
+ )
+ stored as kudu;
+
+-- On insert, we can transform a human-readable date/time into a numeric value.
+insert into time_series values (12345, unix_timestamp('2017-01-25 23:24:56'), 'Working on doc examples');
+
+-- On retrieval, we can examine the numeric date/time value or turn it back into a string for readability.
+select id, when_exactly, from_unixtime(when_exactly) as 'human-readable date/time', event
+ from time_series order by when_exactly limit 100;
++-------+--------------+--------------------------+-------------------------+
+| id | when_exactly | human-readable date/time | event |
++-------+--------------+--------------------------+-------------------------+
+| 12345 | 1485386696 | 2017-01-25 23:24:56 | Working on doc examples |
++-------+--------------+--------------------------+-------------------------+
+]]>
+</codeblock>
+
+ <note>
+ <p>
+ If you do high-precision arithmetic involving numeric date/time values,
+ when dividing millisecond values by 1000, or microsecond values by 1 million, always
+ cast the integer numerator to a <codeph>DECIMAL</codeph> with sufficient precision
+ and scale to avoid any rounding or loss of precision.
+ </p>
+ </note>
+
+<codeblock><![CDATA[
+-- 1 million and 1 microseconds = 1.000001 seconds.
+select microseconds,
+ cast (microseconds as decimal(20,7)) / 1e6 as fractional_seconds
+ from table_with_microsecond_column;
++--------------+----------------------+
+| microseconds | fractional_seconds |
++--------------+----------------------+
+| 1000001 | 1.000001000000000000 |
++--------------+----------------------+
+]]>
+</codeblock>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="kudu_metadata">
+
+ <title>How Impala Handles Kudu Metadata</title>
+
+ <conbody>
+
+ <p conref="../shared/impala_common.xml#common/kudu_metadata_intro"/>
+ <p conref="../shared/impala_common.xml#common/kudu_metadata_details"/>
+
+ <p>
+ Because Kudu manages the metadata for its own tables separately from the metastore
+ database, there is a table name stored in the metastore database for Impala to use,
+ and a table name on the Kudu side, and these names can be modified independently
+ through <codeph>ALTER TABLE</codeph> statements.
+ </p>
+
+ <p>
+ To avoid potential name conflicts, the prefix <codeph>impala::</codeph>
+ and the Impala database name are encoded into the underlying Kudu
+ table name:
+ </p>
+
+<codeblock><![CDATA[
+create database some_database;
+use some_database;
+
+create table table_name_demo (x int primary key, y int)
+ partition by hash (x) partitions 2 stored as kudu;
+
+describe formatted table_name_demo;
+...
+kudu.table_name | impala::some_database.table_name_demo
+]]>
+</codeblock>
+
+ <p>
+ See <xref keyref="kudu_tables"/> for examples of how to change the name of
+ the Impala table in the metastore database, the name of the underlying Kudu
+ table, or both.
+ </p>
+
+ </conbody>
+
+ </concept>
+
+ </concept>
+
+ <concept id="kudu_etl">
+
+ <title>Loading Data into Kudu Tables</title>
+
+ <conbody>
+
+ <p>
+ Kudu tables are well-suited to use cases where data arrives continuously, in small or
+ moderate volumes. To bring data into Kudu tables, use the Impala <codeph>INSERT</codeph>
+ and <codeph>UPSERT</codeph> statements. The <codeph>LOAD DATA</codeph> statement does
+ not apply to Kudu tables.
</p>
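+
+    <p>
+      For bulk loads of existing HDFS data, a common pattern (sketched here with
+      hypothetical table names) is to stage the data in an HDFS-backed table, where
+      <codeph>LOAD DATA</codeph> does apply, and then copy it into the Kudu table:
+    </p>
+
+<codeblock>
+-- LOAD DATA works only on the HDFS-backed staging table, not on the Kudu table.
+LOAD DATA INPATH '/staging/events' INTO TABLE hdfs_staging_table;
+INSERT INTO kudu_events SELECT id, event_time, payload FROM hdfs_staging_table;
+</codeblock>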
<p>
- Kudu can do extra optimizations for queries that refer to the primary key columns in the
- <codeph>WHERE</codeph> clause. It is not crucial though to include the primary key columns in the
- <codeph>WHERE</codeph> clause of every query. The benefit is mainly for partitioned tables,
- which divide the data among various tablet servers based on the distribution of
- data values in some or all of the primary key columns.
+ Because Kudu manages its own storage layer that is optimized for smaller block sizes than
+ HDFS, and performs its own housekeeping to keep data evenly distributed, it is not
+ subject to the <q>many small files</q> issue and does not need explicit reorganization
+ and compaction as the data grows over time. The partitions within a Kudu table can be
+ specified to cover a variety of possible data distributions, instead of hardcoding a new
+ partition for each new day, hour, and so on, which can lead to inefficient,
+ hard-to-scale, and hard-to-manage partition schemes with HDFS tables.
+ </p>
+
+ <p>
+ Your strategy for performing ETL or bulk updates on Kudu tables should take into account
+ the limitations on consistency for DML operations.
+ </p>
+
+ <p>
+ Make <codeph>INSERT</codeph>, <codeph>UPDATE</codeph>, and <codeph>UPSERT</codeph>
+ operations <term>idempotent</term>: that is, able to be applied multiple times and still
+ produce an identical result.
+ </p>
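+
+    <p>
+      For example (with a hypothetical <codeph>events</codeph> table), <codeph>UPSERT</codeph>
+      makes a retried load idempotent: running the same statement again after a timeout or
+      partial failure leaves the table in the same end state.
+    </p>
+
+<codeblock>
+-- Hypothetical table used for illustration.
+create table events (id bigint primary key, payload string)
+  partition by hash (id) partitions 2 stored as kudu;
+
+-- Safe to re-run: rows whose primary keys already exist are
+-- updated rather than rejected as duplicates.
+upsert into events values (1, 'first'), (2, 'second');
+upsert into events values (1, 'first'), (2, 'second');  -- same end state
+</codeblock>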
+
+ <p>
+ If a bulk operation is in danger of exceeding capacity limits due to timeouts or high
+ memory usage, split it into a series of smaller operations.
+ </p>
+
+ <p>
+ Avoid running concurrent ETL operations where the end results depend on precise
+ ordering. In particular, do not rely on an <codeph>INSERT ... SELECT</codeph> statement
+ that selects from the same table into which it is inserting, unless you include extra
+ conditions in the <codeph>WHERE</codeph> clause to avoid reading the newly inserted rows
+ within the same statement.
+ </p>
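+
+    <p>
+      As an illustration (with hypothetical table and column names), one way to guard an
+      <codeph>INSERT ... SELECT</codeph> that reads from the same table is an extra
+      <codeph>WHERE</codeph> condition that excludes the newly inserted rows:
+    </p>
+
+<codeblock><![CDATA[
+-- Risky: the SELECT might see some of the rows inserted by this same statement.
+insert into log_table select id + 1000000, event from log_table;
+
+-- Safer: the extra condition prevents re-reading the newly inserted rows.
+insert into log_table select id + 1000000, event from log_table where id < 1000000;
+]]>
+</codeblock>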
+
+ <p>
+      Because Impala and Kudu cannot enforce relationships between tables, and changes to
+      multiple tables cannot be committed or rolled back together, do not expect
+      transactional semantics for multi-table operations.
</p>
</conbody>
@@ -110,52 +1168,149 @@ under the License.
<concept id="kudu_dml">
- <title>Impala DML Support for Kudu Tables</title>
+ <title>Impala DML Support for Kudu Tables (INSERT, UPDATE, DELETE, UPSERT)</title>
+
+ <prolog>
+ <metadata>
+ <data name="Category" value="DML"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ Impala supports certain DML statements for Kudu tables only. The <codeph>UPDATE</codeph>
+ and <codeph>DELETE</codeph> statements let you modify data within Kudu tables without
+ rewriting substantial amounts of table data. The <codeph>UPSERT</codeph> statement acts
+ as a combination of <codeph>INSERT</codeph> and <codeph>UPDATE</codeph>, inserting rows
+ where the primary key does not already exist, and updating the non-primary key columns
+ where the primary key does already exist in the table.
+ </p>
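+
+      <p>
+        The following sketch (with a hypothetical <codeph>kv</codeph> table) illustrates
+        the insert-or-update behavior:
+      </p>
+
+<codeblock>
+create table kv (k string primary key, v string)
+  partition by hash (k) partitions 2 stored as kudu;
+
+insert into kv values ('a', 'original');
+
+-- 'a' already exists, so its value is updated; 'b' is new, so it is inserted.
+upsert into kv values ('a', 'updated'), ('b', 'brand new');
+</codeblock>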
+
+ <p>
+ The <codeph>INSERT</codeph> statement for Kudu tables honors the unique and <codeph>NOT
+ NULL</codeph> requirements for the primary key columns.
+ </p>
+
+ <p>
+ Because Impala and Kudu do not support transactions, the effects of any
+ <codeph>INSERT</codeph>, <codeph>UPDATE</codeph>, or <codeph>DELETE</codeph> statement
+ are immediately visible. For example, you cannot do a sequence of
+ <codeph>UPDATE</codeph> statements and only make the changes visible after all the
+ statements are finished. Also, if a DML statement fails partway through, any rows that
+ were already inserted, deleted, or changed remain in the table; there is no rollback
+ mechanism to undo the changes.
+ </p>
+
+ <p>
+ In particular, an <codeph>INSERT ... SELECT</codeph> statement that refers to the table
+ being inserted into might insert more rows than expected, because the
+ <codeph>SELECT</codeph> part of the statement sees some of the new rows being inserted
+ and processes them again.
+ </p>
+
+ <note>
+ <p>
+ The <codeph>LOAD DATA</codeph> statement, which involves manipulation of HDFS data files,
+ does not apply to Kudu tables.
+ </p>
+ </note>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="kudu_consistency">
+
+ <title>Consistency Considerations for Kudu Tables</title>
<conbody>
<p>
- Impala supports certain DML statements for Kudu tables only. The <codeph>UPDATE</codeph> and
- <codeph>DELETE</codeph> statements let you modify data within Kudu tables without rewriting substantial
- amounts of table data.
+ Kudu tables have consistency characteristics such as uniqueness, controlled by the
+ primary key columns, and non-nullable columns. The emphasis for consistency is on
+ preventing duplicate or incomplete data from being stored in a table.
</p>
<p>
- The <codeph>INSERT</codeph> statement for Kudu tables honors the unique and non-<codeph>NULL</codeph>
- requirements for the primary key columns.
+ Currently, Kudu does not enforce strong consistency for order of operations, total
+ success or total failure of a multi-row statement, or data that is read while a write
+ operation is in progress. Changes are applied atomically to each row, but not applied
+ as a single unit to all rows affected by a multi-row DML statement. That is, Kudu does
+ not currently have atomic multi-row statements or isolation between statements.
</p>
<p>
- Because Impala and Kudu do not support transactions, the effects of any <codeph>INSERT</codeph>,
- <codeph>UPDATE</codeph>, or <codeph>DELETE</codeph> statement are immediately visible. For example, you
- cannot do a sequence of <codeph>UPDATE</codeph> statements and only make the change visible after all the
- statements are finished. Also, if a DML statement fails partway through, any rows that were already
- inserted, deleted, or changed remain in the table; there is no rollback mechanism to undo the changes.
+      If some rows are rejected during a DML operation, for example because of duplicate
+      primary key values or <codeph>NOT NULL</codeph> constraint violations, the statement
+      succeeds with a warning. Impala still inserts, deletes, or updates the other rows that
+      are not affected by the constraint violations.
+ </p>
+
+ <p>
+ Consequently, the number of rows affected by a DML operation on a Kudu table might be
+      different from what you expect.
+ </p>
+
+ <p>
+ Because there is no strong consistency guarantee for information being inserted into,
+ deleted from, or updated across multiple tables simultaneously, consider denormalizing
+ the data where practical. That is, if you run separate <codeph>INSERT</codeph>
+ statements to insert related rows into two different tables, one <codeph>INSERT</codeph>
+ might fail while the other succeeds, leaving the data in an inconsistent state. Even if
+ both inserts succeed, a join query might happen during the interval between the
+ completion of the first and second statements, and the query would encounter incomplete
+ inconsistent data. Denormalizing the data into a single wide table can reduce the
+ possibility of inconsistency due to multi-table operations.
+ </p>
+
+ <p>
+ Information about the number of rows affected by a DML operation is reported in
+ <cmdname>impala-shell</cmdname> output, and in the <codeph>PROFILE</codeph> output, but
+ is not currently reported to HiveServer2 clients such as JDBC or ODBC applications.
</p>
</conbody>
</concept>
- <concept id="kudu_partitioning">
+ <concept id="kudu_security">
- <title>Partitioning for Kudu Tables</title>
+ <title>Security Considerations for Kudu Tables</title>
<conbody>
<p>
- Kudu tables use special mechanisms to evenly distribute data among the underlying tablet servers. Although
- we refer to such tables as partitioned tables, they are distinguished from traditional Impala partitioned
- tables by use of different clauses on the <codeph>CREATE TABLE</codeph> statement. Partitioned Kudu tables
- use <codeph>PARTITION BY</codeph>, <codeph>HASH</codeph>, and <codeph>RANGE</codeph> clauses rather than
- the traditional <codeph>PARTITIONED BY</codeph> clause. All of the columns involved in these clauses must
- be primary key columns. These clauses let you specify different ways to divide the data for each column,
- or even for different value ranges within a column. This flexibility lets you avoid problems with uneven
- distribution of data, where the partitioning scheme for HDFS tables might result in some partitions being
- much larger than others. By setting up an effective partitioning scheme for a Kudu table, you can ensure
- that the work for a query can be parallelized evenly across the hosts in a cluster.
+ Security for Kudu tables involves:
</p>
+ <ul>
+ <li>
+ <p>
+ Sentry authorization.
+ </p>
+ <p conref="../shared/impala_common.xml#common/kudu_sentry_limitations"/>
+ </li>
+
+ <li>
+ <p>
+ Lineage tracking.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Auditing.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Redaction of sensitive information from log files.
+ </p>
+ </li>
+ </ul>
+
</conbody>
</concept>
@@ -167,18 +1322,66 @@ under the License.
<conbody>
<p>
- For queries involving Kudu tables, Impala can delegate much of the work of filtering the result set to
- Kudu, avoiding some of the I/O involved in full table scans of tables containing HDFS data files. This type
- of optimization is especially effective for partitioned Kudu tables, where the Impala query
- <codeph>WHERE</codeph> clause refers to one or more primary key columns that are also used as partition key
- columns. For example, if a partitioned Kudu table uses a <codeph>HASH</codeph> clause for
- <codeph>col1</codeph> and a <codeph>RANGE</codeph> clause for <codeph>col2</codeph>, a query using a clause
- such as <codeph>WHERE col1 IN (1,2,3) AND col2 > 100</codeph> can determine exactly which tablet servers
- contain relevant data, and therefore parallelize the query very efficiently.
+ For queries involving Kudu tables, Impala can delegate much of the work of filtering the
+ result set to Kudu, avoiding some of the I/O involved in full table scans of tables
+ containing HDFS data files. This type of optimization is especially effective for
+ partitioned Kudu tables, where the Impala query <codeph>WHERE</codeph> clause refers to
+ one or more primary key columns that are also used as partition key columns. For
+ example, if a partitioned Kudu table uses a <codeph>HASH</codeph> clause for
+ <codeph>col1</codeph> and a <codeph>RANGE</codeph> clause for <codeph>col2</codeph>, a
+ query using a clause such as <codeph>WHERE col1 IN (1,2,3) AND col2 > 100</codeph>
+ can determine exactly which tablet servers contain relevant data, and therefore
+ parallelize the query very efficiently.
</p>
+ <p>
+ See <xref keyref="explain"/> for examples of evaluating the effectiveness of
+ the predicate pushdown for a specific query against a Kudu table.
+ </p>
+
+ <!-- Hide until subtopics are ready to display. -->
+ <p outputclass="toc inpage" audience="hidden"/>
+
</conbody>
+ <concept id="kudu_vs_parquet" audience="hidden">
+ <!-- To do: if there is enough real-world experience in future to have a
+ substantive discussion of this subject, revisit this topic and
+ consider unhiding it. -->
+
+ <title>How Kudu Works with Column-Oriented Operations</title>
+
+ <conbody>
+
+ <p>
+ For immutable data, Impala is often used with Parquet tables due to the efficiency of
+ the column-oriented Parquet layout. This section describes how Kudu stores and
+ retrieves columnar data, to help you understand performance and storage considerations
+ of Kudu tables as compared with Parquet tables.
+ </p>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="kudu_memory" audience="hidden">
+ <!-- To do: if there is enough real-world experience in future to have a
+ substantive discussion of this subject, revisit this topic and
+ consider unhiding it. -->
+
+ <title>Memory Usage for Operations on Kudu Tables</title>
+
+ <conbody>
+
+ <p>
+ The Apache Kudu architecture, topology, and data storage techniques result in
+ different patterns of memory usage for Impala statements than with HDFS-backed tables.
+ </p>
+
+ </conbody>
+
+ </concept>
+
</concept>
</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/661921b2/docs/topics/impala_literals.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_literals.xml b/docs/topics/impala_literals.xml
index d0d475c..22404d9 100644
--- a/docs/topics/impala_literals.xml
+++ b/docs/topics/impala_literals.xml
@@ -397,6 +397,24 @@ insert into t1 partition(x=NULL, y) select c1, c3 from some_other_table;</codeb
<codeph>nullifzero()</codeph>, and <codeph>zeroifnull()</codeph>. See
<xref href="impala_conditional_functions.xml#conditional_functions"/> for details.
</p>
+
+ <p conref="../shared/impala_common.xml#common/kudu_blurb"/>
+ <p rev="kudu">
+ Columns in Kudu tables have an attribute that specifies whether or not they can contain
+ <codeph>NULL</codeph> values. A column with a <codeph>NULL</codeph> attribute can contain
+ nulls. A column with a <codeph>NOT NULL</codeph> attribute cannot contain any nulls, and
+ an <codeph>INSERT</codeph>, <codeph>UPDATE</codeph>, or <codeph>UPSERT</codeph> statement
+ will skip any row that attempts to store a null in a column designated as <codeph>NOT NULL</codeph>.
+ Kudu tables default to the <codeph>NULL</codeph> setting for each column, except columns that
+ are part of the primary key.
+ </p>
+ <p rev="kudu">
+        In addition to columns with an explicit <codeph>NOT NULL</codeph> attribute, Kudu tables
+        restrict <codeph>NULL</codeph> values in the columns that make up the primary key:
+        no primary key column in a Kudu table can contain any <codeph>NULL</codeph> values.
+ </p>
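+
+      <p rev="kudu">
+        For example, a table definition (with hypothetical names) might combine these
+        attributes as follows; the primary key column is implicitly <codeph>NOT NULL</codeph>:
+      </p>
+
+<codeblock rev="kudu">
+create table null_demo (id bigint primary key, required_val string not null, optional_val string null)
+  partition by hash (id) partitions 2 stored as kudu;
+</codeblock>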
+
</conbody>
</concept>
</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/661921b2/docs/topics/impala_map.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_map.xml b/docs/topics/impala_map.xml
index 3d03129..6fb697b 100644
--- a/docs/topics/impala_map.xml
+++ b/docs/topics/impala_map.xml
@@ -85,6 +85,9 @@ type ::= <varname>primitive_type</varname> | <varname>complex_type</varname>
<li/>
</ul>
+ <p conref="../shared/impala_common.xml#common/kudu_blurb"/>
+ <p conref="../shared/impala_common.xml#common/kudu_unsupported_data_type"/>
+
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<note conref="../shared/impala_common.xml#common/complex_type_schema_pointer"/>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/661921b2/docs/topics/impala_partitioning.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_partitioning.xml b/docs/topics/impala_partitioning.xml
index 4d723ad..1f70d2b 100644
--- a/docs/topics/impala_partitioning.xml
+++ b/docs/topics/impala_partitioning.xml
@@ -575,7 +575,7 @@ SELECT COUNT(*) FROM sales_table WHERE year IN (2005, 2010, 2015);
</concept>
- <concept rev="kudu" id="partition_kudu" audience="hidden">
+ <concept rev="kudu 2.8.0" id="partition_kudu">
<title>Using Partitioning with Kudu Tables</title>
@@ -593,6 +593,12 @@ SELECT COUNT(*) FROM sales_table WHERE year IN (2005, 2010, 2015);
columns.
</p>
+ <p>
+ See <xref href="impala_kudu.xml#kudu_partitioning"/> for
+ details and examples of the partitioning techniques
+ for Kudu tables.
+ </p>
+
</conbody>
</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/661921b2/docs/topics/impala_refresh.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_refresh.xml b/docs/topics/impala_refresh.xml
index 8244aa4..7897ecd 100644
--- a/docs/topics/impala_refresh.xml
+++ b/docs/topics/impala_refresh.xml
@@ -333,6 +333,11 @@ ERROR: AnalysisException: Items in partition spec must exactly match the partiti
<p conref="../shared/impala_common.xml#common/s3_metadata"/>
<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
+
+ <p rev="kudu" conref="../shared/impala_common.xml#common/kudu_blurb"/>
+ <p conref="../shared/impala_common.xml#common/kudu_metadata_intro"/>
+ <p conref="../shared/impala_common.xml#common/kudu_metadata_details"/>
+
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_hadoop.xml#intro_metastore"/>,
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/661921b2/docs/topics/impala_reserved_words.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_reserved_words.xml b/docs/topics/impala_reserved_words.xml
index 423fd43..440120c 100644
--- a/docs/topics/impala_reserved_words.xml
+++ b/docs/topics/impala_reserved_words.xml
@@ -82,7 +82,9 @@ avro
between
bigint
<ph rev="1.4.0">binary</ph>
+<ph rev="kudu">blocksize</ph>
boolean
+<!-- <ph rev="kudu">buckets</ph> -->
by
<ph rev="1.4.0">cached</ph>
<ph rev="2.3.0">cascade</ph>
@@ -95,6 +97,7 @@ change
column
columns
comment
+<ph rev="kudu">compression</ph>
compute
create
cross
@@ -105,15 +108,18 @@ databases
date
datetime
decimal
-<ph rev="2.6.0">delete</ph>
+<ph rev="kudu">default</ph>
+<ph rev="kudu">delete</ph>
delimited
desc
describe
distinct
+<!-- <ph rev="kudu">distribute</ph> -->
div
double
drop
else
+<ph rev="kudu">encoding</ph>
end
escaped
exists
@@ -136,10 +142,10 @@ function
functions
<ph rev="2.1.0">grant</ph>
group
-<ph rev="2.6.0">hash</ph>
+<ph rev="kudu">hash</ph>
having
if
-<ph rev="2.6.0">ignore</ph>
+<!-- <ph rev="kudu">ignore</ph> -->
<ph rev="2.5.0">ilike</ph>
in
<ph rev="2.1.0">incremental</ph>
@@ -210,6 +216,7 @@ serdeproperties
set
show
smallint
+<!-- <ph rev="kudu">split</ph> -->
stats
stored
straight_join
@@ -229,8 +236,9 @@ true
<ph rev="2.0.0">unbounded</ph>
<ph rev="1.4.0">uncached</ph>
union
-<ph rev="2.6.0">update</ph>
+<ph rev="kudu">update</ph>
<ph rev="1.2.1">update_fn</ph>
+<ph rev="kudu">upsert</ph>
use
using
values
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/661921b2/docs/topics/impala_revoke.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_revoke.xml b/docs/topics/impala_revoke.xml
index 97a912d..b7a0908 100644
--- a/docs/topics/impala_revoke.xml
+++ b/docs/topics/impala_revoke.xml
@@ -108,6 +108,9 @@ object_type ::= TABLE | DATABASE | SERVER | URI
<p conref="../shared/impala_common.xml#common/permissions_blurb_no"/>
+ <p rev="2.8.0" conref="../shared/impala_common.xml#common/kudu_blurb"/>
+ <p conref="../shared/impala_common.xml#common/kudu_sentry_limitations"/>
+
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/661921b2/docs/topics/impala_show.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_show.xml b/docs/topics/impala_show.xml
index 84e9c0b..4e9e7fc 100644
--- a/docs/topics/impala_show.xml
+++ b/docs/topics/impala_show.xml
@@ -28,6 +28,7 @@ under the License.
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Reports"/>
+ <data name="Category" value="Kudu"/>
</metadata>
</prolog>
@@ -49,7 +50,8 @@ SHOW TABLES [IN <varname>database_name</varname>] [[LIKE] '<varname>pattern</var
<ph rev="1.2.1">SHOW TABLE STATS [<varname>database_name</varname>.]<varname>table_name</varname></ph>
<ph rev="1.2.1">SHOW COLUMN STATS [<varname>database_name</varname>.]<varname>table_name</varname></ph>
<ph rev="1.4.0">SHOW PARTITIONS [<varname>database_name</varname>.]<varname>table_name</varname></ph>
-SHOW FILES IN [<varname>database_name</varname>.]<varname>table_name</varname> <ph rev="IMPALA-1654">[PARTITION (<varname>key_col_expression</varname> [, <varname>key_col_expression</varname>]</ph>]
+<ph rev="1.4.0">SHOW <ph rev="kudu">[RANGE]</ph> PARTITIONS [<varname>database_name</varname>.]<varname>table_name</varname></ph>
+SHOW FILES IN [<varname>database_name</varname>.]<varname>table_name</varname> <ph rev="IMPALA-1654">[PARTITION (<varname>key_col_expression</varname> [, <varname>key_col_expression</varname>]</ph>]
<ph rev="2.0.0">SHOW ROLES
SHOW CURRENT ROLES
@@ -129,7 +131,8 @@ show files in sample_table partition (month like 'J%');
<note>
This statement applies to tables and partitions stored on HDFS, or in the Amazon Simple Storage System (S3).
It does not apply to views.
- It does not apply to tables mapped onto HBase, because HBase does not use the same file-based storage layout.
+ It does not apply to tables mapped onto HBase <ph rev="kudu">or Kudu</ph>,
+ because those data management systems do not use the same file-based storage layout.
</note>
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
@@ -742,6 +745,61 @@ show tables like '*dim*|t*';
<p conref="../shared/impala_common.xml#common/permissions_blurb_no"/>
+ <p rev="kudu">
+ For Kudu tables:
+ </p>
+
+ <ul rev="kudu">
+ <li>
+ <p>
+ The column specifications include attributes such as <codeph>NULL</codeph>,
+ <codeph>NOT NULL</codeph>, <codeph>ENCODING</codeph>, and <codeph>COMPRESSION</codeph>.
+ If you do not specify those attributes in the original <codeph>CREATE TABLE</codeph> statement,
+ the <codeph>SHOW CREATE TABLE</codeph> output displays the defaults that were used.
+ </p>
+ </li>
+ <li>
+ <p>
+ The specifications of any <codeph>RANGE</codeph> clauses are not displayed in full.
+ To see the definition of the range clauses for a Kudu table, use the <codeph>SHOW RANGE PARTITIONS</codeph> statement.
+ </p>
+ </li>
+ <li>
+ <p>
+ The <codeph>TBLPROPERTIES</codeph> output reflects the Kudu master address
+ and the internal Kudu name associated with the Impala table.
+ </p>
+ </li>
+ </ul>
+
+<codeblock rev="kudu">
+show create table numeric_grades_default_letter;
++------------------------------------------------------------------------------------------------+
+| result |
++------------------------------------------------------------------------------------------------+
+| CREATE TABLE user.numeric_grades_default_letter ( |
+| score TINYINT NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
+| letter_grade STRING NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION DEFAULT '-', |
+| student STRING NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
+| PRIMARY KEY (score) |
+| ) |
+| PARTITION BY <b>RANGE (score) (...)</b> |
+| STORED AS KUDU |
+| TBLPROPERTIES ('kudu.master_addresses'='vd0342.example.com:7051', |
+| 'kudu.table_name'='impala::USER.numeric_grades_default_letter') |
++------------------------------------------------------------------------------------------------+
+
+show range partitions numeric_grades_default_letter;
++--------------------+
+| RANGE (score) |
++--------------------+
+| 0 <= VALUES < 50 |
+| 50 <= VALUES < 65 |
+| 65 <= VALUES < 80 |
+| 80 <= VALUES < 100 |
++--------------------+
+</codeblock>
+
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
@@ -855,6 +913,39 @@ show create table show_create_table_demo;
<p conref="../shared/impala_common.xml#common/show_security"/>
+ <p conref="../shared/impala_common.xml#common/kudu_blurb"/>
+
+ <p rev="kudu IMPALA-2830">
+ Because Kudu tables do not have characteristics derived from HDFS, such
+ as number of files, file format, and HDFS cache status, the output of
+ <codeph>SHOW TABLE STATS</codeph> reflects different characteristics
+ that apply to Kudu tables. If the Kudu table is created with the
+ clause <codeph>PARTITIONS 20</codeph>, then the result set of
+ <codeph>SHOW TABLE STATS</codeph> consists of 20 rows, each representing
+ one of the numbered partitions. For example:
+ </p>
+
+<codeblock rev="kudu IMPALA-2830">
+show table stats kudu_table;
++--------+-----------+----------+-----------------------+------------+
+| # Rows | Start Key | Stop Key | Leader Replica | # Replicas |
++--------+-----------+----------+-----------------------+------------+
+| -1 | | 00000001 | host.example.com:7050 | 3 |
+| -1 | 00000001 | 00000002 | host.example.com:7050 | 3 |
+| -1 | 00000002 | 00000003 | host.example.com:7050 | 3 |
+| -1 | 00000003 | 00000004 | host.example.com:7050 | 3 |
+| -1 | 00000004 | 00000005 | host.example.com:7050 | 3 |
+...
+</codeblock>
+
+ <p rev="IMPALA-2830">
+ Impala does not compute the number of rows for each partition for
+ Kudu tables. Therefore, you do not need to re-run <codeph>COMPUTE STATS</codeph>
+ when you see -1 in the <codeph># Rows</codeph> column of the output from
+ <codeph>SHOW TABLE STATS</codeph>. That column always shows -1 for
+ all Kudu tables.
+ </p>
+
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
@@ -959,6 +1050,14 @@ show table stats store_sales;
<p conref="../shared/impala_common.xml#common/show_security"/>
+ <p rev="kudu IMPALA-2830">
+ The output of <codeph>SHOW COLUMN STATS</codeph> includes
+ the relevant information for Kudu tables.
+ Column statistics that originate in the
+ underlying Kudu storage layer are also represented in the
+ metastore database that Impala uses.
+ </p>
+
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
@@ -1145,8 +1244,31 @@ show column stats store_sales;
<p conref="../shared/impala_common.xml#common/show_security"/>
+ <p conref="../shared/impala_common.xml#common/kudu_blurb"/>
+
+ <p rev="kudu IMPALA-4403">
+ The optional <codeph>RANGE</codeph> clause only applies to Kudu tables. It displays only the partitions
+ defined by the <codeph>RANGE</codeph> clause of <codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph>.
+ </p>
+
+ <p rev="kudu IMPALA-4403">
+ Although you can specify <codeph><</codeph> or
+ <codeph><=</codeph> comparison operators when defining
+ range partitions for Kudu tables, Kudu rewrites them if necessary
+ to represent each range as
+ <codeph><varname>low_bound</varname> <= VALUES < <varname>high_bound</varname></codeph>.
+ This rewriting might involve incrementing one of the boundary values
+ or appending a <codeph>\0</codeph> for string values, so that the
+ partition covers the same range as originally specified.
+ </p>
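+
+ <p rev="kudu IMPALA-4403">
+ For example, the following statements (the table name and output shown
+ here are illustrative, not captured from an actual cluster) define one
+ range with an inclusive upper bound; <codeph>SHOW RANGE PARTITIONS</codeph>
+ then displays that range rewritten into the
+ <codeph><varname>low_bound</varname> <= VALUES < <varname>high_bound</varname></codeph> form:
+ </p>
+
+<codeblock rev="kudu IMPALA-4403"><![CDATA[
+create table range_rewrite_demo (id bigint primary key, s string)
+ partition by range (id)
+ (partition 0 <= values <= 49, partition 50 <= values < 100)
+ stored as kudu;
+
+show range partitions range_rewrite_demo;
++---------------------+
+| RANGE (id) |
++---------------------+
+| 0 <= VALUES < 50 |
+| 50 <= VALUES < 100 |
++---------------------+
+]]>
+</codeblock>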
+
<p conref="../shared/impala_common.xml#common/example_blurb"/>
+ <p>
+ The following example shows the output for a Parquet, text, or other
+ HDFS-backed table partitioned on the <codeph>YEAR</codeph> column:
+ </p>
+
<codeblock rev="1.4.0">[localhost:21000] > show partitions census;
+-------+-------+--------+------+---------+
| year | #Rows | #Files | Size | Format |
@@ -1162,6 +1284,53 @@ show column stats store_sales;
+-------+-------+--------+------+---------+
</codeblock>
+ <p rev="kudu IMPALA-4403">
+ The following example shows the output for a Kudu table
+ using the hash partitioning mechanism. The number of
+ rows in the result set corresponds to the value used
+ in the <codeph>PARTITIONS <varname>N</varname></codeph>
+ clause of <codeph>CREATE TABLE</codeph>.
+ </p>
+
+<codeblock rev="kudu IMPALA-4403"><![CDATA[
+show partitions million_rows_hash;
+
++--------+-----------+----------+-----------------------+--
+| # Rows | Start Key | Stop Key | Leader Replica | # Replicas
++--------+-----------+----------+-----------------------+--
+| -1 | | 00000001 | n236.example.com:7050 | 3
+| -1 | 00000001 | 00000002 | n236.example.com:7050 | 3
+| -1 | 00000002 | 00000003 | n336.example.com:7050 | 3
+| -1 | 00000003 | 00000004 | n238.example.com:7050 | 3
+| -1 | 00000004 | 00000005 | n338.example.com:7050 | 3
+....
+| -1 | 0000002E | 0000002F | n240.example.com:7050 | 3
+| -1 | 0000002F | 00000030 | n336.example.com:7050 | 3
+| -1 | 00000030 | 00000031 | n240.example.com:7050 | 3
+| -1 | 00000031 | | n334.example.com:7050 | 3
++--------+-----------+----------+-----------------------+--
+Fetched 50 row(s) in 0.05s
+]]>
+</codeblock>
+
+ <p rev="kudu IMPALA-4403">
+ The following example shows the output for a Kudu table
+ using the range partitioning mechanism:
+ </p>
+
+<codeblock rev="kudu IMPALA-4403"><![CDATA[
+show range partitions million_rows_range;
++-----------------------+
+| RANGE (id) |
++-----------------------+
+| VALUES < "A" |
+| "A" <= VALUES < "[" |
+| "a" <= VALUES < "{" |
+| "{" <= VALUES < "~\0" |
++-----------------------+
+]]>
+</codeblock>
+
<p conref="../shared/impala_common.xml#common/permissions_blurb"/>
<p rev="CDH-19187">
The user ID that the <cmdname>impalad</cmdname> daemon runs under,
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/661921b2/docs/topics/impala_struct.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_struct.xml b/docs/topics/impala_struct.xml
index d103d95..b4289ab 100644
--- a/docs/topics/impala_struct.xml
+++ b/docs/topics/impala_struct.xml
@@ -112,6 +112,9 @@ type ::= <varname>primitive_type</varname> | <varname>complex_type</varname>
<li/>
</ul>
+ <p conref="../shared/impala_common.xml#common/kudu_blurb"/>
+ <p conref="../shared/impala_common.xml#common/kudu_unsupported_data_type"/>
+
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<note conref="../shared/impala_common.xml#common/complex_type_schema_pointer"/>