Posted to commits@impala.apache.org by mi...@apache.org on 2018/05/09 21:10:25 UTC
[16/51] [partial] impala git commit: [DOCS] Impala doc site update for 3.0
http://git-wip-us.apache.org/repos/asf/impala/blob/fae51ec2/docs/build3x/html/topics/impala_parquet.html
----------------------------------------------------------------------
diff --git a/docs/build3x/html/topics/impala_parquet.html b/docs/build3x/html/topics/impala_parquet.html
new file mode 100644
index 0000000..ce5242e
--- /dev/null
+++ b/docs/build3x/html/topics/impala_parquet.html
@@ -0,0 +1,1421 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2018"><meta name="DC.rights.owner" content="(C) Copyright 2018"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_file_formats.html"><meta name="prodname" content="Impala"><meta name="version" content="Impala 3.0.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="parquet"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Using the Parquet File Format with Impala Tables</title></head><body id="parquet"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+ <h1 class="title topictitle1" id="ariaid-title1">Using the Parquet File Format with Impala Tables</h1>
+
+
+
+ <div class="body conbody">
+
+ <p class="p">
+
+ Impala helps you to create, manage, and query Parquet tables. Parquet is a column-oriented binary file format
+ intended to be highly efficient for the types of large-scale queries that Impala is best at. Parquet is
+ especially good for queries scanning particular columns within a table, for example, to query <span class="q">"wide"</span>
+ tables with many columns, or to perform aggregation operations such as <code class="ph codeph">SUM()</code> and
+ <code class="ph codeph">AVG()</code> that need to process most or all of the values from a column. Each data file contains
+ the values for a set of rows (the <span class="q">"row group"</span>). Within a data file, the values from each column are
+ organized so that they are all adjacent, enabling good compression for the values from that column. Queries
+ against a Parquet table can retrieve and analyze these values from any column quickly and with minimal I/O.
+ </p>
+
+ <table class="table"><caption><span class="table--title-label">Table 1. </span><span class="title">Parquet Format Support in Impala</span></caption><colgroup><col style="width:10%"><col style="width:10%"><col style="width:20%"><col style="width:30%"><col style="width:30%"></colgroup><thead class="thead">
+ <tr class="row">
+ <th class="entry nocellnorowborder" id="parquet__entry__1">
+ File Type
+ </th>
+ <th class="entry nocellnorowborder" id="parquet__entry__2">
+ Format
+ </th>
+ <th class="entry nocellnorowborder" id="parquet__entry__3">
+ Compression Codecs
+ </th>
+ <th class="entry nocellnorowborder" id="parquet__entry__4">
+ Impala Can CREATE?
+ </th>
+ <th class="entry nocellnorowborder" id="parquet__entry__5">
+ Impala Can INSERT?
+ </th>
+ </tr>
+ </thead><tbody class="tbody">
+ <tr class="row">
+ <td class="entry nocellnorowborder" headers="parquet__entry__1 ">
+ <a class="xref" href="impala_parquet.html#parquet">Parquet</a>
+ </td>
+ <td class="entry nocellnorowborder" headers="parquet__entry__2 ">
+ Structured
+ </td>
+ <td class="entry nocellnorowborder" headers="parquet__entry__3 ">
+ Snappy, gzip; currently Snappy by default
+ </td>
+ <td class="entry nocellnorowborder" headers="parquet__entry__4 ">
+ Yes.
+ </td>
+ <td class="entry nocellnorowborder" headers="parquet__entry__5 ">
+ Yes: <code class="ph codeph">CREATE TABLE</code>, <code class="ph codeph">INSERT</code>, <code class="ph codeph">LOAD DATA</code>, and query.
+ </td>
+ </tr>
+ </tbody></table>
+
+ <p class="p toc inpage"></p>
+
+ </div>
+
+
+ <nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_file_formats.html">How Impala Works with Hadoop File Formats</a></div></div></nav><article class="topic concept nested1" aria-labelledby="ariaid-title2" id="parquet__parquet_ddl">
+
+ <h2 class="title topictitle2" id="ariaid-title2">Creating Parquet Tables in Impala</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ To create a table that uses the Parquet format, use a command like the following, substituting your own
+ table name, column names, and data types:
+ </p>
+
+<pre class="pre codeblock"><code>[impala-host:21000] > create table <var class="keyword varname">parquet_table_name</var> (x INT, y STRING) STORED AS PARQUET;</code></pre>
+
+
+
+ <p class="p">
+ Or, to clone the column names and data types of an existing table:
+ </p>
+
+<pre class="pre codeblock"><code>[impala-host:21000] > create table <var class="keyword varname">parquet_table_name</var> LIKE <var class="keyword varname">other_table_name</var> STORED AS PARQUET;</code></pre>
+
+ <p class="p">
+ In Impala 1.4.0 and higher, you can derive column definitions from a raw Parquet data file, even without an
+ existing Impala table. For example, you can create an external table pointing to an HDFS directory, and
+ base the column definitions on one of the files in that directory:
+ </p>
+
+<pre class="pre codeblock"><code>CREATE EXTERNAL TABLE ingest_existing_files LIKE PARQUET '/user/etl/destination/datafile1.dat'
+ STORED AS PARQUET
+ LOCATION '/user/etl/destination';
+</code></pre>
+
+ <p class="p">
+ Or, you can refer to an existing data file and create a new empty table with suitable column definitions.
+ Then you can use <code class="ph codeph">INSERT</code> to create new data files or <code class="ph codeph">LOAD DATA</code> to transfer
+ existing data files into the new table.
+ </p>
+
+<pre class="pre codeblock"><code>CREATE TABLE columns_from_data_file LIKE PARQUET '/user/etl/destination/datafile1.dat'
+ STORED AS PARQUET;
+</code></pre>
+
+ <p class="p">
+ The default properties of the newly created table are the same as for any other <code class="ph codeph">CREATE
+ TABLE</code> statement. For example, the default file format is text; if you want the new table to use
+ the Parquet file format, include the <code class="ph codeph">STORED AS PARQUET</code> clause as well.
+ </p>
+
+ <p class="p">
+ In this example, the new table is partitioned by year, month, and day. These partition key columns are not
+ part of the data file, so you specify them in the <code class="ph codeph">CREATE TABLE</code> statement:
+ </p>
+
+<pre class="pre codeblock"><code>CREATE TABLE columns_from_data_file LIKE PARQUET '/user/etl/destination/datafile1.dat'
+ PARTITIONED BY (year INT, month TINYINT, day TINYINT)
+ STORED AS PARQUET;
+</code></pre>
+
+ <p class="p">
+ See <a class="xref" href="impala_create_table.html#create_table">CREATE TABLE Statement</a> for more details about the <code class="ph codeph">CREATE TABLE
+ LIKE PARQUET</code> syntax.
+ </p>
+
+ <p class="p">
+ Once you have created a table, to insert data into that table, use a command similar to the following,
+ again with your own table names:
+ </p>
+
+
+
+<pre class="pre codeblock"><code>[impala-host:21000] > insert overwrite table <var class="keyword varname">parquet_table_name</var> select * from <var class="keyword varname">other_table_name</var>;</code></pre>
+
+ <p class="p">
+ If the Parquet table has a different number of columns or different column names than the other table,
+ specify the names of columns from the other table rather than <code class="ph codeph">*</code> in the
+ <code class="ph codeph">SELECT</code> statement.
+ </p>
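+
+ <p class="p">
+ As a minimal sketch of that case, the following statement lists the source columns explicitly. The column
+ names <code class="ph codeph">c1</code> and <code class="ph codeph">c2</code> are hypothetical placeholders, assumed here to
+ correspond in order and type to the two columns of the destination table:
+ </p>
+
+<pre class="pre codeblock"><code>-- c1 and c2 are illustrative names; substitute the columns from your own source table.
+insert overwrite table <var class="keyword varname">parquet_table_name</var> select c1, c2 from <var class="keyword varname">other_table_name</var>;
+</code></pre>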
+
+ </div>
+
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title3" id="parquet__parquet_etl">
+
+ <h2 class="title topictitle2" id="ariaid-title3">Loading Data into Parquet Tables</h2>
+
+
+ <div class="body conbody">
+
+ <p class="p">
+ Choose from the following techniques for loading data into Parquet tables, depending on whether the
+ original data is already in an Impala table, or exists as raw data files outside Impala.
+ </p>
+
+ <p class="p">
+ If you already have data in an Impala or Hive table, perhaps in a different file format or partitioning
+ scheme, you can transfer the data to a Parquet table using the Impala <code class="ph codeph">INSERT...SELECT</code>
+ syntax. You can convert, filter, repartition, and do other things to the data as part of this same
+ <code class="ph codeph">INSERT</code> statement. See <a class="xref" href="#parquet_compression">Snappy and GZip Compression for Parquet Data Files</a> for some examples showing how to
+ insert data into Parquet tables.
+ </p>
+
+ <div class="p">
+ When inserting into partitioned tables, especially using the Parquet file format, you can include a hint in
+ the <code class="ph codeph">INSERT</code> statement to fine-tune the overall performance of the operation and its
+ resource usage:
+ <ul class="ul">
+
+ <li class="li">
+ You would only use hints if an <code class="ph codeph">INSERT</code> into a partitioned Parquet table was
+ failing due to capacity limits, or if such an <code class="ph codeph">INSERT</code> was succeeding but with
+ less-than-optimal performance.
+ </li>
+
+ <li class="li">
+ To use a hint to influence how the work of the <code class="ph codeph">INSERT</code> is distributed, put the hint <code class="ph codeph">/* +SHUFFLE */</code> or <code class="ph codeph">/* +NOSHUFFLE */</code>
+ (including the comment delimiters) after the <code class="ph codeph">PARTITION</code> clause, immediately before the
+ <code class="ph codeph">SELECT</code> keyword.
+ </li>
+
+ <li class="li">
+ <code class="ph codeph">/* +SHUFFLE */</code> selects an execution plan that reduces the number of files being written
+ simultaneously to HDFS, and the number of memory buffers holding data for individual partitions. Thus
+ it reduces overall resource usage for the <code class="ph codeph">INSERT</code> operation, allowing some
+ <code class="ph codeph">INSERT</code> operations to succeed that otherwise would fail. It does involve some data
+ transfer between the nodes so that the data files for a particular partition are all constructed on the
+ same node.
+ </li>
+
+ <li class="li">
+ <code class="ph codeph">/* +NOSHUFFLE */</code> selects an execution plan that might be faster overall, but might also
+ produce a larger number of small data files or exceed capacity limits, causing the
+ <code class="ph codeph">INSERT</code> operation to fail. Use <code class="ph codeph">/* +SHUFFLE */</code> in cases where an
+ <code class="ph codeph">INSERT</code> statement fails or runs inefficiently due to all nodes attempting to construct
+ data for all partitions.
+ </li>
+
+ <li class="li">
+ Impala automatically uses the <code class="ph codeph">/* +SHUFFLE */</code> method if any partition key column in the
+ source table, mentioned in the <code class="ph codeph">INSERT ... SELECT</code> query, does not have column
+ statistics. In this case, only the <code class="ph codeph">/* +NOSHUFFLE */</code> hint would have any effect.
+ </li>
+
+ <li class="li">
+ If column statistics are available for all partition key columns in the source table mentioned in the
+ <code class="ph codeph">INSERT ... SELECT</code> query, Impala chooses whether to use the <code class="ph codeph">/* +SHUFFLE */</code>
+ or <code class="ph codeph">/* +NOSHUFFLE */</code> technique based on the estimated number of distinct values in those
+ columns and the number of nodes involved in the <code class="ph codeph">INSERT</code> operation. In this case, you
+ might need the <code class="ph codeph">/* +SHUFFLE */</code> or the <code class="ph codeph">/* +NOSHUFFLE */</code> hint to override the
+ execution plan selected by Impala.
+ </li>
+
+ <li class="li">
+ In <span class="keyword">Impala 2.8</span> or higher, you can make the
+ <code class="ph codeph">INSERT</code> operation organize (<span class="q">"cluster"</span>)
+ the data for each partition to avoid buffering data for multiple partitions
+ and reduce the risk of an out-of-memory condition. Specify the hint as
+ <code class="ph codeph">/* +CLUSTERED */</code>. This technique is primarily
+ useful for inserts into Parquet tables, where the large block
+ size requires substantial memory to buffer data for multiple
+ output files at once.
+ </li>
+
+ </ul>
+ </div>
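+
+ <p class="p">
+ The hint placement described in the preceding list might look like the following sketch. The table and
+ column names (<code class="ph codeph">sales_parquet</code>, <code class="ph codeph">sales_staging</code>,
+ <code class="ph codeph">id</code>, <code class="ph codeph">amount</code>) are hypothetical, and the destination table is
+ assumed to be partitioned by <code class="ph codeph">year</code> and <code class="ph codeph">month</code>. Whether either hint
+ helps depends on your data and cluster, so treat this as a starting point for experimentation:
+ </p>
+
+<pre class="pre codeblock"><code>-- Hypothetical tables: sales_parquet is assumed to be partitioned by (year, month).
+-- The hint goes after the PARTITION clause and immediately before SELECT.
+INSERT INTO sales_parquet PARTITION (year, month) /* +SHUFFLE */
+  SELECT id, amount, year, month FROM sales_staging;
+
+-- In Impala 2.8 or higher, the CLUSTERED hint is placed the same way.
+INSERT INTO sales_parquet PARTITION (year, month) /* +CLUSTERED */
+  SELECT id, amount, year, month FROM sales_staging;
+</code></pre>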
+
+ <p class="p">
+ Any <code class="ph codeph">INSERT</code> statement for a Parquet table requires enough free space in the HDFS filesystem
+ to write one block. Because Parquet data files written by Impala use a large block size (256 MB by default,
+ matching the row group size described under query performance), an
+ <code class="ph codeph">INSERT</code> might fail (even for a very small amount of data) if your HDFS is running low on
+ space.
+ </p>
+
+
+
+ <p class="p">
+ Avoid the <code class="ph codeph">INSERT...VALUES</code> syntax for Parquet tables, because
+ <code class="ph codeph">INSERT...VALUES</code> produces a separate tiny data file for each
+ <code class="ph codeph">INSERT...VALUES</code> statement, and the strength of Parquet is in its handling of data
+ (compressing, parallelizing, and so on) in <span class="ph">large</span> chunks.
+ </p>
+
+ <p class="p">
+ If you have one or more Parquet data files produced outside of Impala, you can quickly make the data
+ queryable through Impala by one of the following methods:
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ The <code class="ph codeph">LOAD DATA</code> statement moves a single data file or a directory full of data files into
+ the data directory for an Impala table. It does no validation or conversion of the data. The original
+ data files must be somewhere in HDFS, not the local filesystem.
+
+ </li>
+
+ <li class="li">
+ The <code class="ph codeph">CREATE TABLE</code> statement with the <code class="ph codeph">LOCATION</code> clause creates a table
+ where the data continues to reside outside the Impala data directory. The original data files must be
+ somewhere in HDFS, not the local filesystem. For extra safety, if the data is intended to be long-lived
+ and reused by other applications, you can use the <code class="ph codeph">CREATE EXTERNAL TABLE</code> syntax so that
+ the data files are not deleted by an Impala <code class="ph codeph">DROP TABLE</code> statement.
+
+ </li>
+
+ <li class="li">
+ If the Parquet table already exists, you can copy Parquet data files directly into it, then use the
+ <code class="ph codeph">REFRESH</code> statement to make Impala recognize the newly added data. Remember to preserve
+ the block size of the Parquet data files by using the <code class="ph codeph">hadoop distcp -pb</code> command rather
+ than a <code class="ph codeph">-put</code> or <code class="ph codeph">-cp</code> operation on the Parquet files. See
+ <a class="xref" href="#parquet_compression_multiple">Example of Copying Parquet Data Files</a> for an example of this kind of operation.
+ </li>
+ </ul>
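+
+ <p class="p">
+ The three methods above could be sketched as follows. The HDFS paths, table names, and column definitions
+ are hypothetical; the column list of the external table is assumed to match the layout of the existing
+ Parquet files:
+ </p>
+
+<pre class="pre codeblock"><code>-- Paths and table names below are illustrative only.
+-- Move files that are already in HDFS into the data directory of an existing table.
+LOAD DATA INPATH '/user/etl/staging' INTO TABLE parquet_table_name;
+
+-- Leave the files where they are and point a new external table at them.
+CREATE EXTERNAL TABLE external_parquet (x INT, y STRING)
+  STORED AS PARQUET
+  LOCATION '/user/etl/destination';
+
+-- After copying files into the directory of an existing table, make Impala recognize them.
+REFRESH parquet_table_name;
+</code></pre>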
+
+ <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+ <p class="p">
+ Currently, Impala always decodes the column data in Parquet files based on the ordinal position of the
+ columns, not by looking up the position of each column based on its name. Parquet files produced outside
+ of Impala must write column data in the same order as the columns are declared in the Impala table. Any
+ optional columns that are omitted from the data files must be the rightmost columns in the Impala table
+ definition.
+ </p>
+
+ <p class="p">
+ If you created compressed Parquet files through some tool other than Impala, make sure that any
+ compression codecs are supported in Parquet by Impala. For example, Impala does not currently support LZO
+ compression in Parquet files. Also double-check that you used any recommended compatibility settings in
+ the other tool, such as <code class="ph codeph">spark.sql.parquet.binaryAsString</code> when writing Parquet files
+ through Spark.
+ </p>
+ </div>
+
+ <p class="p">
+ Recent versions of Sqoop can produce Parquet output files using the <code class="ph codeph">--as-parquetfile</code>
+ option.
+ </p>
+
+ <p class="p"> If you use Sqoop to
+ convert RDBMS data to Parquet, be careful with interpreting any
+ resulting values from <code class="ph codeph">DATE</code>, <code class="ph codeph">DATETIME</code>,
+ or <code class="ph codeph">TIMESTAMP</code> columns. The underlying values are
+ represented as the Parquet <code class="ph codeph">INT64</code> type, which is
+ represented as <code class="ph codeph">BIGINT</code> in the Impala table. The Parquet
+ values represent the time in milliseconds, while Impala interprets
+ <code class="ph codeph">BIGINT</code> as the time in seconds. Therefore, if you have
+ a <code class="ph codeph">BIGINT</code> column in a Parquet table that was imported
+ this way from Sqoop, divide the values by 1000 when interpreting as the
+ <code class="ph codeph">TIMESTAMP</code> type.</p>
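+
+ <p class="p">
+ As a sketch of that conversion, the following query divides the imported value by 1000 before casting it.
+ The table name <code class="ph codeph">sqoop_import</code> and the column name <code class="ph codeph">created_ms</code>
+ are hypothetical. The example uses integer division, which discards the millisecond part; that choice is an
+ assumption made for simplicity:
+ </p>
+
+<pre class="pre codeblock"><code>-- sqoop_import and created_ms are illustrative names for a table imported through Sqoop
+-- and a BIGINT column holding milliseconds since the epoch.
+-- DIV performs integer division, so the sub-second part is discarded.
+SELECT created_ms, CAST(created_ms DIV 1000 AS TIMESTAMP) AS created_ts
+  FROM sqoop_import;
+</code></pre>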
+
+ <p class="p">
+ If the data exists outside Impala and is in some other format, combine both of the preceding techniques.
+ First, use a <code class="ph codeph">LOAD DATA</code> or <code class="ph codeph">CREATE EXTERNAL TABLE ... LOCATION</code> statement to
+ bring the data into an Impala table that uses the appropriate file format. Then, use an
+ <code class="ph codeph">INSERT...SELECT</code> statement to copy the data to the Parquet table, converting to Parquet
+ format as part of the process.
+ </p>
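+
+ <p class="p">
+ A minimal sketch of this two-step approach follows. The table names, the HDFS path, and the assumption that
+ the raw files are comma-delimited text are all hypothetical:
+ </p>
+
+<pre class="pre codeblock"><code>-- Step 1: expose the raw files through a table in their own format (assumed here to be delimited text).
+CREATE EXTERNAL TABLE raw_csv_data (x INT, y STRING)
+  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
+  STORED AS TEXTFILE
+  LOCATION '/user/etl/raw_csv';
+
+-- Step 2: copy the rows into a Parquet table; the conversion happens during the INSERT.
+CREATE TABLE parquet_copy (x INT, y STRING) STORED AS PARQUET;
+INSERT INTO parquet_copy SELECT * FROM raw_csv_data;
+</code></pre>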
+
+
+
+ <p class="p">
+ Loading data into Parquet tables is a memory-intensive operation, because the incoming data is buffered
+ until it reaches <span class="ph">one data block</span> in size, then that chunk of data is
+ organized and compressed in memory before being written out. The memory consumption can be larger when
+ inserting data into partitioned Parquet tables, because a separate data file is written for each
+ combination of partition key column values, potentially requiring several
+ <span class="ph">large</span> chunks to be manipulated in memory at once.
+ </p>
+
+ <p class="p">
+ When inserting into a partitioned Parquet table, Impala redistributes the data among the nodes to reduce
+ memory consumption. You might still need to temporarily increase the memory dedicated to Impala during the
+ insert operation, or break up the load operation into several <code class="ph codeph">INSERT</code> statements, or both.
+ </p>
+
+ <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+ All the preceding techniques assume that the data you are loading matches the structure of the destination
+ table, including column order, column names, and partition layout. To transform or reorganize the data,
+ start by loading the data into a Parquet table that matches the underlying structure of the data, then use
+ one of the table-copying techniques such as <code class="ph codeph">CREATE TABLE AS SELECT</code> or <code class="ph codeph">INSERT ...
+ SELECT</code> to reorder or rename columns, divide the data among multiple partitions, and so on. For
+ example, to take a single comprehensive Parquet data file and load it into a partitioned table, you would
+ use an <code class="ph codeph">INSERT ... SELECT</code> statement with dynamic partitioning to let Impala create separate
+ data files with the appropriate partition values; for an example, see
+ <a class="xref" href="impala_insert.html#insert">INSERT Statement</a>.
+ </div>
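+
+ <p class="p">
+ The dynamic partitioning approach mentioned in the note might be sketched as follows. All table and column
+ names are hypothetical; <code class="ph codeph">events_unpartitioned</code> is assumed to match the layout of the
+ original data file and to contain <code class="ph codeph">year</code>, <code class="ph codeph">month</code>, and
+ <code class="ph codeph">day</code> columns of compatible types:
+ </p>
+
+<pre class="pre codeblock"><code>-- Illustrative destination table, partitioned by date components.
+CREATE TABLE events_by_day (id BIGINT, detail STRING)
+  PARTITIONED BY (year INT, month TINYINT, day TINYINT)
+  STORED AS PARQUET;
+
+-- With no constant values in the PARTITION clause, the partition key values
+-- come from the last columns of the SELECT list.
+INSERT INTO events_by_day PARTITION (year, month, day)
+  SELECT id, detail, year, month, day FROM events_unpartitioned;
+</code></pre>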
+
+ </div>
+
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title4" id="parquet__parquet_performance">
+
+ <h2 class="title topictitle2" id="ariaid-title4">Query Performance for Impala Parquet Tables</h2>
+
+
+ <div class="body conbody">
+
+ <p class="p">
+ Query performance for Parquet tables depends on the number of columns needed to process the
+ <code class="ph codeph">SELECT</code> list and <code class="ph codeph">WHERE</code> clauses of the query, the way data is divided into
+ <span class="ph">large data files with block size equal to file size</span>, the reduction in I/O
+ by reading the data for each column in compressed format, which data files can be skipped (for partitioned
+ tables), and the CPU overhead of decompressing the data for each column.
+ </p>
+
+ <div class="p">
+ For example, the following is an efficient query for a Parquet table:
+<pre class="pre codeblock"><code>select avg(income) from census_data where state = 'CA';</code></pre>
+ The query processes only 2 columns out of a large number of total columns. If the table is partitioned by
+ the <code class="ph codeph">STATE</code> column, it is even more efficient because the query only has to read and decode
+ 1 column from each data file, and it can read only the data files in the partition directory for the state
+ <code class="ph codeph">'CA'</code>, skipping the data files for all the other states, which will be physically located
+ in other directories.
+ </div>
+
+ <div class="p">
+ The following is a relatively inefficient query for a Parquet table:
+<pre class="pre codeblock"><code>select * from census_data;</code></pre>
+ Impala would have to read the entire contents of each <span class="ph">large</span> data file,
+ and decompress the contents of each column for each row group, negating the I/O optimizations of the
+ column-oriented format. This query might still be faster for a Parquet table than a table with some other
+ file format, but it does not take advantage of the unique strengths of Parquet data files.
+ </div>
+
+ <p class="p">
+ Impala can optimize queries on Parquet tables, especially join queries, better when statistics are
+ available for all the tables. Issue the <code class="ph codeph">COMPUTE STATS</code> statement for each table after
+ substantial amounts of data are loaded into or appended to it. See
+ <a class="xref" href="impala_compute_stats.html#compute_stats">COMPUTE STATS Statement</a> for details.
+ </p>
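+
+ <p class="p">
+ For instance, using the <code class="ph codeph">census_data</code> table from the earlier examples, you might gather
+ statistics and then confirm that they were recorded:
+ </p>
+
+<pre class="pre codeblock"><code>-- Gather table and column statistics after a substantial load.
+COMPUTE STATS census_data;
+
+-- Check that row counts and per-column statistics are now populated.
+SHOW TABLE STATS census_data;
+SHOW COLUMN STATS census_data;
+</code></pre>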
+
+ <p class="p">
+ The runtime filtering feature, available in <span class="keyword">Impala 2.5</span> and higher, works best with Parquet tables.
+ The per-row filtering aspect only applies to Parquet tables.
+ See <a class="xref" href="impala_runtime_filtering.html#runtime_filtering">Runtime Filtering for Impala Queries (Impala 2.5 or higher only)</a> for details.
+ </p>
+
+ <p class="p">
+ In <span class="keyword">Impala 2.6</span> and higher, Impala queries are optimized for files stored in Amazon S3.
+ For Impala tables that use the file formats Parquet, RCFile, SequenceFile,
+ Avro, and uncompressed text, the setting <code class="ph codeph">fs.s3a.block.size</code>
+ in the <span class="ph filepath">core-site.xml</span> configuration file determines
+ how Impala divides the I/O work of reading the data files. This configuration
+ setting is specified in bytes. By default, this
+ value is 33554432 (32 MB), meaning that Impala parallelizes S3 read operations on the files
+ as if they were made up of 32 MB blocks. For example, if your S3 queries primarily access
+ Parquet files written by MapReduce or Hive, increase <code class="ph codeph">fs.s3a.block.size</code>
+ to 134217728 (128 MB) to match the row group size of those files. If most S3 queries involve
+ Parquet files written by Impala, increase <code class="ph codeph">fs.s3a.block.size</code>
+ to 268435456 (256 MB) to match the row group size produced by Impala.
+ </p>
+
+ <p class="p">
+ In <span class="keyword">Impala 2.9</span> and higher, Parquet files written by Impala include
+ embedded metadata specifying the minimum and maximum values for each column, within
+ each row group and each data page within the row group. Impala-written Parquet files
+ typically contain a single row group; a row group can contain many data pages.
+ Impala uses this information (currently, only the metadata for each row group)
+ when reading each Parquet data file during a query, to quickly determine whether each
+ row group within the file potentially includes any rows that match the conditions in the
+ <code class="ph codeph">WHERE</code> clause. For example, if the column <code class="ph codeph">X</code> within
+ a particular Parquet file has a minimum value of 1 and a maximum value of 100, then
+ a query including the clause <code class="ph codeph">WHERE x > 200</code> can quickly determine
+ that it is safe to skip that particular file, instead of scanning all the associated
+ column values. This optimization technique is especially effective for tables that
+ use the <code class="ph codeph">SORT BY</code> clause for the columns most frequently checked in
+ <code class="ph codeph">WHERE</code> clauses, because any <code class="ph codeph">INSERT</code> operation on
+ such tables produces Parquet data files with relatively narrow ranges of column values
+ within each file.
+ </p>
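+
+ <p class="p">
+ The following sketch shows one way to set up such a table. The table and column names
+ (<code class="ph codeph">sorted_readings</code>, <code class="ph codeph">raw_readings</code>, <code class="ph codeph">reading</code>)
+ are hypothetical, and how many row groups are actually skipped depends on how the values are distributed in
+ your data:
+ </p>
+
+<pre class="pre codeblock"><code>-- Illustrative table; the SORT BY clause makes each INSERT write rows ordered by x.
+CREATE TABLE sorted_readings (x INT, reading DOUBLE)
+  SORT BY (x)
+  STORED AS PARQUET;
+
+INSERT INTO sorted_readings SELECT x, reading FROM raw_readings;
+
+-- Row groups whose maximum value of x is 200 or less cannot match and can be skipped.
+SELECT count(*) FROM sorted_readings WHERE x > 200;
+</code></pre>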
+
+ </div>
+
+ <article class="topic concept nested2" aria-labelledby="ariaid-title5" id="parquet_performance__parquet_partitioning">
+
+ <h3 class="title topictitle3" id="ariaid-title5">Partitioning for Parquet Tables</h3>
+
+ <div class="body conbody">
+
+ <p class="p">
+ As explained in <a class="xref" href="impala_partitioning.html#partitioning">Partitioning for Impala Tables</a>, partitioning is an important
+ performance technique for Impala generally. This section explains some of the performance considerations
+ for partitioned Parquet tables.
+ </p>
+
+ <p class="p">
+ The Parquet file format is ideal for tables containing many columns, where most queries only refer to a
+ small subset of the columns. As explained in <a class="xref" href="#parquet_data_files">How Parquet Data Files Are Organized</a>, the physical layout of
+ Parquet data files lets Impala read only a small fraction of the data for many queries. The performance
+ benefits of this approach are amplified when you use Parquet tables in combination with partitioning.
+ Impala can skip the data files for certain partitions entirely, based on the comparisons in the
+ <code class="ph codeph">WHERE</code> clause that refer to the partition key columns. For example, queries on
+ partitioned tables often analyze data for time intervals based on columns such as <code class="ph codeph">YEAR</code>,
+ <code class="ph codeph">MONTH</code>, and/or <code class="ph codeph">DAY</code>, or for geographic regions. Remember that Parquet
+ data files use a <span class="ph">large</span> block size, so when deciding how finely to
+ partition the data, try to find a granularity where each partition contains
+ <span class="ph">256 MB</span> or more of data, rather than creating a large number of smaller
+ files split among many partitions.
+ </p>
+
+ <p class="p">
+ Inserting into a partitioned Parquet table can be a resource-intensive operation, because each Impala
+ node could potentially be writing a separate data file to HDFS for each combination of different values
+ for the partition key columns. The large number of simultaneous open files could exceed the HDFS
+ <span class="q">"transceivers"</span> limit. To avoid exceeding this limit, consider the following techniques:
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ Load different subsets of data using separate <code class="ph codeph">INSERT</code> statements with specific values
+ for the <code class="ph codeph">PARTITION</code> clause, such as <code class="ph codeph">PARTITION (year=2010)</code>.
+ </li>
+
+ <li class="li">
+ Increase the <span class="q">"transceivers"</span> value for HDFS, sometimes spelled <span class="q">"xcievers"</span> (sic). The property
+ value in the <span class="ph filepath">hdfs-site.xml</span> configuration file is
+
+ <code class="ph codeph">dfs.datanode.max.transfer.threads</code>. For example, if you were loading 12 years of data
+ partitioned by year, month, and day, even a value of 4096 might not be high enough. This
+ <a class="xref" href="http://blog.cloudera.com/blog/2012/03/hbase-hadoop-xceivers/" target="_blank">blog post</a> explores the considerations for setting this value
+ higher or lower, using HBase examples for illustration.
+ </li>
+
+ <li class="li">
+ Use the <code class="ph codeph">COMPUTE STATS</code> statement to collect
+ <a class="xref" href="impala_perf_stats.html#perf_column_stats">column statistics</a> on the source table from
+ which data is being copied, so that the Impala query can estimate the number of different values in the
+ partition key columns and distribute the work accordingly.
+ </li>
+ </ul>
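+
+ <p class="p">
+ The first technique in the list, loading subsets with separate statements, might look like the following
+ sketch. The table and column names are hypothetical; the destination table is assumed to be partitioned by
+ <code class="ph codeph">year</code>, <code class="ph codeph">month</code>, and <code class="ph codeph">day</code>:
+ </p>
+
+<pre class="pre codeblock"><code>-- Each statement fixes the year and lets Impala assign month and day dynamically,
+-- so only the partitions for a single year are open for writing at one time.
+INSERT INTO events_by_day PARTITION (year=2010, month, day)
+  SELECT id, detail, month, day FROM events_unpartitioned WHERE year = 2010;
+
+INSERT INTO events_by_day PARTITION (year=2011, month, day)
+  SELECT id, detail, month, day FROM events_unpartitioned WHERE year = 2011;
+</code></pre>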
+
+ </div>
+
+ </article>
+
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title6" id="parquet__parquet_compression">
+
+ <h2 class="title topictitle2" id="ariaid-title6">Snappy and GZip Compression for Parquet Data Files</h2>
+
+
+ <div class="body conbody">
+
+ <p class="p">
+
+ When Impala writes Parquet data files using the <code class="ph codeph">INSERT</code> statement, the underlying
+ compression is controlled by the <code class="ph codeph">COMPRESSION_CODEC</code> query option. (Prior to Impala 2.0, the
+ query option name was <code class="ph codeph">PARQUET_COMPRESSION_CODEC</code>.) The allowed values for this query option
+ are <code class="ph codeph">snappy</code> (the default), <code class="ph codeph">gzip</code>, and <code class="ph codeph">none</code>. The option
+ value is not case-sensitive. If the option is set to an unrecognized value, all kinds of queries will fail
+ due to the invalid option setting, not just queries involving Parquet tables.
+ </p>
+
+ </div>
+
+ <article class="topic concept nested2" aria-labelledby="ariaid-title7" id="parquet_compression__parquet_snappy">
+
+ <h3 class="title topictitle3" id="ariaid-title7">Example of Parquet Table with Snappy Compression</h3>
+
+ <div class="body conbody">
+
+ <p class="p">
+
+ By default, the underlying data files for a Parquet table are compressed with Snappy. The combination of
+ fast compression and decompression makes it a good choice for many data sets. To ensure Snappy
+ compression is used, for example after experimenting with other compression codecs, set the
+ <code class="ph codeph">COMPRESSION_CODEC</code> query option to <code class="ph codeph">snappy</code> before inserting the data:
+ </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > create database parquet_compression;
+[localhost:21000] > use parquet_compression;
+[localhost:21000] > create table parquet_snappy like raw_text_data;
+[localhost:21000] > set COMPRESSION_CODEC=snappy;
+[localhost:21000] > insert into parquet_snappy select * from raw_text_data;
+Inserted 1000000000 rows in 181.98s
+</code></pre>
+
+ </div>
+
+ </article>
+
+ <article class="topic concept nested2" aria-labelledby="ariaid-title8" id="parquet_compression__parquet_gzip">
+
+ <h3 class="title topictitle3" id="ariaid-title8">Example of Parquet Table with GZip Compression</h3>
+
+ <div class="body conbody">
+
+ <p class="p">
+ If you need more intensive compression (at the expense of more CPU cycles for uncompressing during
+ queries), set the <code class="ph codeph">COMPRESSION_CODEC</code> query option to <code class="ph codeph">gzip</code> before
+ inserting the data:
+ </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > create table parquet_gzip like raw_text_data;
+[localhost:21000] > set COMPRESSION_CODEC=gzip;
+[localhost:21000] > insert into parquet_gzip select * from raw_text_data;
+Inserted 1000000000 rows in 1418.24s
+</code></pre>
+
+ </div>
+
+ </article>
+
+ <article class="topic concept nested2" aria-labelledby="ariaid-title9" id="parquet_compression__parquet_none">
+
+ <h3 class="title topictitle3" id="ariaid-title9">Example of Uncompressed Parquet Table</h3>
+
+ <div class="body conbody">
+
+ <p class="p">
+ If your data compresses very poorly, or you want to avoid the CPU overhead of compression and
+ decompression entirely, set the <code class="ph codeph">COMPRESSION_CODEC</code> query option to <code class="ph codeph">none</code>
+ before inserting the data:
+ </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > create table parquet_none like raw_text_data;
+[localhost:21000] > set COMPRESSION_CODEC=none;
+[localhost:21000] > insert into parquet_none select * from raw_text_data;
+Inserted 1000000000 rows in 146.90s
+</code></pre>
+
+ </div>
+
+ </article>
+
+ <article class="topic concept nested2" aria-labelledby="ariaid-title10" id="parquet_compression__parquet_compression_examples">
+
+ <h3 class="title topictitle3" id="ariaid-title10">Examples of Sizes and Speeds for Compressed Parquet Tables</h3>
+
+ <div class="body conbody">
+
+ <p class="p">
+ Here are some examples showing differences in data sizes and query speeds for 1 billion rows of synthetic
+ data, compressed with each kind of codec. As always, run similar tests with realistic data sets of your
+ own. The actual compression ratios, and relative insert and query speeds, will vary depending on the
+ characteristics of the actual data.
+ </p>
+
+ <p class="p">
+ In this case, switching from Snappy to GZip compression shrinks the data by an additional 40% or so,
+ while switching from Snappy compression to no compression expands the data also by about 40%:
+ </p>
+
+<pre class="pre codeblock"><code>$ hdfs dfs -du -h /user/hive/warehouse/parquet_compression.db
+23.1 G /user/hive/warehouse/parquet_compression.db/parquet_snappy
+13.5 G /user/hive/warehouse/parquet_compression.db/parquet_gzip
+32.8 G /user/hive/warehouse/parquet_compression.db/parquet_none
+</code></pre>
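+
+ <p class="p">
+ The percentages quoted above can be checked against the directory sizes and the insert times reported
+ in the preceding examples. The following Python sketch is illustrative only: the helper name
+ <code class="ph codeph">percent_change</code> is an assumption made for this example and is not part of Impala.
+ </p>

```python
# Sizes in GB from the hdfs dfs -du output; insert times in seconds
# from the INSERT examples earlier in this section.
sizes_gb = {"snappy": 23.1, "gzip": 13.5, "none": 32.8}
insert_secs = {"snappy": 181.98, "gzip": 1418.24, "none": 146.90}


def percent_change(baseline, value):
    """Return the signed percentage change from baseline to value."""
    return (value - baseline) / baseline * 100.0


for codec in ("gzip", "none"):
    size_delta = percent_change(sizes_gb["snappy"], sizes_gb[codec])
    time_ratio = insert_secs[codec] / insert_secs["snappy"]
    print(f"{codec}: size {size_delta:+.1f}% vs. Snappy, "
          f"insert time {time_ratio:.2f}x Snappy")
```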
+
+ <p class="p">
+ Because Parquet data files are typically <span class="ph">large</span>, each directory will
+ have a different number of data files and the row groups will be arranged differently.
+ </p>
+
+ <p class="p">
+ At the same time, the less aggressive the compression, the faster the data can be decompressed. In this
+ case, using a table with a billion rows, a query that evaluates all the values for a particular column
+ runs faster with no compression than with Snappy compression, and faster with Snappy compression than
+ with GZip compression. Query performance depends on several other factors, so as always, run your own
+ benchmarks with your own data to determine the ideal tradeoff between data size, CPU efficiency, and
+ speed of insert and query operations.
+ </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > desc parquet_snappy;
+Query finished, fetching results ...
++-----------+---------+---------+
+| name | type | comment |
++-----------+---------+---------+
+| id | int | |
+| val | int | |
+| zfill | string | |
+| name | string | |
+| assertion | boolean | |
++-----------+---------+---------+
+Returned 5 row(s) in 0.14s
+[localhost:21000] > select avg(val) from parquet_snappy;
+Query finished, fetching results ...
++-----------------+
+| _c0 |
++-----------------+
+| 250000.93577915 |
++-----------------+
+Returned 1 row(s) in 4.29s
+[localhost:21000] > select avg(val) from parquet_gzip;
+Query finished, fetching results ...
++-----------------+
+| _c0 |
++-----------------+
+| 250000.93577915 |
++-----------------+
+Returned 1 row(s) in 6.97s
+[localhost:21000] > select avg(val) from parquet_none;
+Query finished, fetching results ...
++-----------------+
+| _c0 |
++-----------------+
+| 250000.93577915 |
++-----------------+
+Returned 1 row(s) in 3.67s
+</code></pre>
+
+ </div>
+
+ </article>
+
+ <article class="topic concept nested2" aria-labelledby="ariaid-title11" id="parquet_compression__parquet_compression_multiple">
+
+ <h3 class="title topictitle3" id="ariaid-title11">Example of Copying Parquet Data Files</h3>
+
+ <div class="body conbody">
+
+ <p class="p">
+ Here is a final example, to illustrate how the data files using the various compression codecs are all
+ compatible with each other for read operations. The metadata about the compression format is written into
+ each data file, and can be decoded during queries regardless of the <code class="ph codeph">COMPRESSION_CODEC</code>
+ setting in effect at the time. In this example, we copy data files from the
+ <code class="ph codeph">PARQUET_SNAPPY</code>, <code class="ph codeph">PARQUET_GZIP</code>, and <code class="ph codeph">PARQUET_NONE</code> tables
+ used in the previous examples, each containing 1 billion rows, all to the data directory of a new table
+ <code class="ph codeph">PARQUET_EVERYTHING</code>. A couple of sample queries demonstrate that the new table now
+ contains 3 billion rows featuring a variety of compression codecs for the data files.
+ </p>
+
+ <p class="p">
+ First, we create the table in Impala so that there is a destination directory in HDFS to put the data
+ files:
+ </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > create table parquet_everything like parquet_snappy;
+Query: create table parquet_everything like parquet_snappy
+</code></pre>
+
+ <p class="p">
+ Then in the shell, we copy the relevant data files into the data directory for this new table. Rather
+ than using <code class="ph codeph">hdfs dfs -cp</code> as with typical files, we use <code class="ph codeph">hadoop distcp -pb</code>
+ to ensure that the special <span class="ph"> block size</span> of the Parquet data files is
+ preserved.
+ </p>
+
+<pre class="pre codeblock"><code>$ hadoop distcp -pb /user/hive/warehouse/parquet_compression.db/parquet_snappy \
+ /user/hive/warehouse/parquet_compression.db/parquet_everything
+...<var class="keyword varname">MapReduce output</var>...
+$ hadoop distcp -pb /user/hive/warehouse/parquet_compression.db/parquet_gzip \
+ /user/hive/warehouse/parquet_compression.db/parquet_everything
+...<var class="keyword varname">MapReduce output</var>...
+$ hadoop distcp -pb /user/hive/warehouse/parquet_compression.db/parquet_none \
+ /user/hive/warehouse/parquet_compression.db/parquet_everything
+...<var class="keyword varname">MapReduce output</var>...
+</code></pre>
+
+ <p class="p">
+ Back in the <span class="keyword cmdname">impala-shell</span> interpreter, we use the <code class="ph codeph">REFRESH</code> statement to
+ alert the Impala server to the new data files for this table, then we can run queries demonstrating that
+ the data files represent 3 billion rows, and the values for one of the numeric columns match what was in
+ the original smaller tables:
+ </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > refresh parquet_everything;
+Query finished, fetching results ...
+
+Returned 0 row(s) in 0.32s
+[localhost:21000] > select count(*) from parquet_everything;
+Query finished, fetching results ...
++------------+
+| _c0 |
++------------+
+| 3000000000 |
++------------+
+Returned 1 row(s) in 8.18s
+[localhost:21000] > select avg(val) from parquet_everything;
+Query finished, fetching results ...
++-----------------+
+| _c0 |
++-----------------+
+| 250000.93577915 |
++-----------------+
+Returned 1 row(s) in 13.35s
+</code></pre>
+
+ </div>
+
+ </article>
+
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title12" id="parquet__parquet_complex_types">
+
+ <h2 class="title topictitle2" id="ariaid-title12">Parquet Tables for Impala Complex Types</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ In <span class="keyword">Impala 2.3</span> and higher, Impala supports the complex types
+ <code class="ph codeph">ARRAY</code>, <code class="ph codeph">STRUCT</code>, and <code class="ph codeph">MAP</code>.
+ See <a class="xref" href="impala_complex_types.html#complex_types">Complex Types (Impala 2.3 or higher only)</a> for details.
+ Because these data types are currently supported only for the Parquet file format,
+ if you plan to use them, become familiar with the performance and storage aspects
+ of Parquet first.
+ </p>
+
+ </div>
+
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title13" id="parquet__parquet_interop">
+
+ <h2 class="title topictitle2" id="ariaid-title13">Exchanging Parquet Data Files with Other Hadoop Components</h2>
+
+
+ <div class="body conbody">
+
+ <p class="p">
+ You can read and write Parquet data files from other <span class="keyword">Hadoop</span> components.
+ See <span class="xref">the documentation for your Apache Hadoop distribution</span> for details.
+ </p>
+
+
+
+
+
+
+
+
+
+ <p class="p">
+ Previously, it was not possible to create Parquet data through Impala and reuse that table within Hive. Now
+ that Parquet support is available for Hive, reusing existing Impala Parquet data files in Hive
+ requires updating the table metadata. Use the following command if you are already running Impala 1.1.1 or
+ higher:
+ </p>
+
+<pre class="pre codeblock"><code>ALTER TABLE <var class="keyword varname">table_name</var> SET FILEFORMAT PARQUET;
+</code></pre>
+
+ <p class="p">
+ If you are running a level of Impala that is older than 1.1.1, do the metadata update through Hive:
+ </p>
+
+<pre class="pre codeblock"><code>ALTER TABLE <var class="keyword varname">table_name</var> SET SERDE 'parquet.hive.serde.ParquetHiveSerDe';
+ALTER TABLE <var class="keyword varname">table_name</var> SET FILEFORMAT
+ INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
+ OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat";
+</code></pre>
+
+ <p class="p">
+ Impala 1.1.1 and higher can reuse Parquet data files created by Hive, without any action required.
+ </p>
+
+
+
+ <p class="p">
+ Impala supports the scalar data types that you can encode in a Parquet data file, but not composite or
+ nested types such as maps or arrays. In <span class="keyword">Impala 2.2</span> and higher, Impala can query Parquet data
+ files that include composite or nested types, as long as the query only refers to columns with scalar
+ types.
+
+ </p>
+
+ <p class="p">
+ If you copy Parquet data files between nodes, or even between different directories on the same node, make
+ sure to preserve the block size by using the command <code class="ph codeph">hadoop distcp -pb</code>. To verify that the
+ block size was preserved, issue the command <code class="ph codeph">hdfs fsck -blocks
+ <var class="keyword varname">HDFS_path_of_impala_table_dir</var></code> and check that the average block size is at or
+ near <span class="ph">256 MB (or whatever other size is defined by the
+ <code class="ph codeph">PARQUET_FILE_SIZE</code> query option)</span>. (The <code class="ph codeph">hadoop distcp</code> operation
+ typically leaves some directories behind, with names matching <span class="ph filepath">_distcp_logs_*</span>, that you
+ can delete from the destination directory afterward.)
+
+
+
+ Issue the command <span class="keyword cmdname">hadoop distcp</span> for details about <span class="keyword cmdname">distcp</span> command
+ syntax.
+ </p>
+
+
+
+ <p class="p">
+ Impala can query Parquet files that use the <code class="ph codeph">PLAIN</code>, <code class="ph codeph">PLAIN_DICTIONARY</code>,
+ <code class="ph codeph">BIT_PACKED</code>, and <code class="ph codeph">RLE</code> encodings.
+ Currently, Impala does not support <code class="ph codeph">RLE_DICTIONARY</code> encoding.
+ When creating files outside of Impala for use by Impala, make sure to use one of the supported encodings.
+ In particular, for MapReduce jobs, do not define <code class="ph codeph">parquet.writer.version</code>
+ (especially not as <code class="ph codeph">PARQUET_2_0</code>) in the configuration of Parquet MR jobs.
+ Use the default version (or format). The default format, 1.0, includes some enhancements that are compatible with older versions.
+ Data using the 2.0 format might not be consumable by Impala, due to use of the <code class="ph codeph">RLE_DICTIONARY</code> encoding.
+ </p>
+ <div class="p">
+ To examine the internal structure and data of Parquet files, you can use the
+ <span class="keyword cmdname">parquet-tools</span> command. Make sure this
+ command is in your <code class="ph codeph">$PATH</code>. (Typically, it is symlinked from
+ <span class="ph filepath">/usr/bin</span>; sometimes, depending on your installation setup, you
+ might need to locate it under an alternative <code class="ph codeph">bin</code> directory.)
+ The arguments to this command let you perform operations such as:
+ <ul class="ul">
+ <li class="li">
+ <code class="ph codeph">cat</code>: Print a file's contents to standard output. In <span class="keyword">Impala 2.3</span> and higher, you can use
+ the <code class="ph codeph">-j</code> option to output JSON.
+ </li>
+ <li class="li">
+ <code class="ph codeph">head</code>: Print the first few records of a file to standard output.
+ </li>
+ <li class="li">
+ <code class="ph codeph">schema</code>: Print the Parquet schema for the file.
+ </li>
+ <li class="li">
+ <code class="ph codeph">meta</code>: Print the file footer metadata, including key-value properties (like Avro schema), compression ratios,
+ encodings, compression used, and row group information.
+ </li>
+ <li class="li">
+ <code class="ph codeph">dump</code>: Print all data and metadata.
+ </li>
+ </ul>
+ Use <code class="ph codeph">parquet-tools -h</code> to see usage information for all the arguments.
+ Here are some examples showing <span class="keyword cmdname">parquet-tools</span> usage:
+
+<pre class="pre codeblock"><code>
+$ # Be careful doing this for a big file! Use parquet-tools head to be safe.
+$ parquet-tools cat sample.parq
+year = 1992
+month = 1
+day = 2
+dayofweek = 4
+dep_time = 748
+crs_dep_time = 750
+arr_time = 851
+crs_arr_time = 846
+carrier = US
+flight_num = 53
+actual_elapsed_time = 63
+crs_elapsed_time = 56
+arrdelay = 5
+depdelay = -2
+origin = CMH
+dest = IND
+distance = 182
+cancelled = 0
+diverted = 0
+
+year = 1992
+month = 1
+day = 3
+...
+
+</code></pre>
+
+<pre class="pre codeblock"><code>
+$ parquet-tools head -n 2 sample.parq
+year = 1992
+month = 1
+day = 2
+dayofweek = 4
+dep_time = 748
+crs_dep_time = 750
+arr_time = 851
+crs_arr_time = 846
+carrier = US
+flight_num = 53
+actual_elapsed_time = 63
+crs_elapsed_time = 56
+arrdelay = 5
+depdelay = -2
+origin = CMH
+dest = IND
+distance = 182
+cancelled = 0
+diverted = 0
+
+year = 1992
+month = 1
+day = 3
+...
+
+</code></pre>
+
+<pre class="pre codeblock"><code>
+$ parquet-tools schema sample.parq
+message schema {
+ optional int32 year;
+ optional int32 month;
+ optional int32 day;
+ optional int32 dayofweek;
+ optional int32 dep_time;
+ optional int32 crs_dep_time;
+ optional int32 arr_time;
+ optional int32 crs_arr_time;
+ optional binary carrier;
+ optional int32 flight_num;
+...
+
+</code></pre>
+
+<pre class="pre codeblock"><code>
+$ parquet-tools meta sample.parq
+creator: impala version 2.2.0-...
+
+file schema: schema
+-------------------------------------------------------------------
+year: OPTIONAL INT32 R:0 D:1
+month: OPTIONAL INT32 R:0 D:1
+day: OPTIONAL INT32 R:0 D:1
+dayofweek: OPTIONAL INT32 R:0 D:1
+dep_time: OPTIONAL INT32 R:0 D:1
+crs_dep_time: OPTIONAL INT32 R:0 D:1
+arr_time: OPTIONAL INT32 R:0 D:1
+crs_arr_time: OPTIONAL INT32 R:0 D:1
+carrier: OPTIONAL BINARY R:0 D:1
+flight_num: OPTIONAL INT32 R:0 D:1
+...
+
+row group 1: RC:20636601 TS:265103674
+-------------------------------------------------------------------
+year: INT32 SNAPPY DO:4 FPO:35 SZ:10103/49723/4.92 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+month: INT32 SNAPPY DO:10147 FPO:10210 SZ:11380/35732/3.14 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+day: INT32 SNAPPY DO:21572 FPO:21714 SZ:3071658/9868452/3.21 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+dayofweek: INT32 SNAPPY DO:3093276 FPO:3093319 SZ:2274375/5941876/2.61 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+dep_time: INT32 SNAPPY DO:5367705 FPO:5373967 SZ:28281281/28573175/1.01 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+crs_dep_time: INT32 SNAPPY DO:33649039 FPO:33654262 SZ:10220839/11574964/1.13 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+arr_time: INT32 SNAPPY DO:43869935 FPO:43876489 SZ:28562410/28797767/1.01 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+crs_arr_time: INT32 SNAPPY DO:72432398 FPO:72438151 SZ:10908972/12164626/1.12 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+carrier: BINARY SNAPPY DO:83341427 FPO:83341558 SZ:114916/128611/1.12 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+flight_num: INT32 SNAPPY DO:83456393 FPO:83488603 SZ:10216514/11474301/1.12 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+...
+
+</code></pre>
+ </div>
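+
+ <p class="p">
+ The <code class="ph codeph">ENC:</code> field in the <code class="ph codeph">meta</code> output is a convenient place to confirm
+ that a file written outside Impala uses only the encodings listed earlier. The following Python sketch
+ shows one way to check a single column line. The function name and the parsing approach are assumptions
+ made for this example, and the exact layout of the <span class="keyword cmdname">parquet-tools</span> output
+ might differ between versions.
+ </p>

```python
# Encodings that the text above lists as readable by Impala.
SUPPORTED_ENCODINGS = {"PLAIN", "PLAIN_DICTIONARY", "BIT_PACKED", "RLE"}


def unsupported_encodings(meta_line):
    """Return the encodings named in the ENC: field of one column line
    that are not in SUPPORTED_ENCODINGS."""
    for field in meta_line.split():
        if field.startswith("ENC:"):
            encodings = set(field[len("ENC:"):].split(","))
            return encodings - SUPPORTED_ENCODINGS
    return set()  # no ENC: field on this line


# Column line copied from the parquet-tools meta example above.
line = ("year: INT32 SNAPPY DO:4 FPO:35 SZ:10103/49723/4.92 "
        "VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN")
print(unsupported_encodings(line))  # empty set: every encoding is supported
```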
+
+ </div>
+
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title14" id="parquet__parquet_data_files">
+
+ <h2 class="title topictitle2" id="ariaid-title14">How Parquet Data Files Are Organized</h2>
+
+
+ <div class="body conbody">
+
+ <p class="p">
+ Although Parquet is a column-oriented file format, do not expect to find one data file for each column.
+ Parquet keeps all the data for a row within the same data file, to ensure that the columns for a row are
+ always available on the same node for processing. What Parquet does is to set a large HDFS block size and a
+ matching maximum data file size, to ensure that I/O and network transfer requests apply to large batches of
+ data.
+ </p>
+
+ <p class="p">
+ Within that data file, the data for a set of rows is rearranged so that all the values from the first
+ column are organized in one contiguous block, then all the values from the second column, and so on.
+ Putting the values from the same column next to each other lets Impala use effective compression techniques
+ on the values in that column.
+ </p>
+
+ <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+ <p class="p">
+ Impala <code class="ph codeph">INSERT</code> statements write Parquet data files using an HDFS block size
+ <span class="ph">that matches the data file size</span>, to ensure that each data file is
+ represented by a single HDFS block, and the entire file can be processed on a single node without
+ requiring any remote reads.
+ </p>
+
+ <p class="p">
+ If you create Parquet data files outside of Impala, such as through a MapReduce or Pig job, ensure that
+ the HDFS block size is greater than or equal to the file size, so that the <span class="q">"one file per block"</span>
+ relationship is maintained. Set the <code class="ph codeph">dfs.block.size</code> or the <code class="ph codeph">dfs.blocksize</code>
+ property large enough that each file fits within a single HDFS block, even if that size is larger than
+ the normal HDFS block size.
+ </p>
+
+ <p class="p">
+ If the block size is reset to a lower value during a file copy, you will see lower performance for
+ queries involving those files, and the <code class="ph codeph">PROFILE</code> statement will reveal that some I/O is
+ being done suboptimally, through remote reads. See
+ <a class="xref" href="impala_parquet.html#parquet_compression_multiple">Example of Copying Parquet Data Files</a> for an example showing how to preserve the
+ block size when copying Parquet data files.
+ </p>
+ </div>
+
+ <p class="p">
+ When Impala retrieves or tests the data for a particular column, it opens all the data files, but only
+ reads the portion of each file containing the values for that column. The column values are stored
+ consecutively, minimizing the I/O required to process the values within a single column. If other columns
+ are named in the <code class="ph codeph">SELECT</code> list or <code class="ph codeph">WHERE</code> clauses, the data for all columns
+ in the same row is available within that same data file.
+ </p>
+
+ <p class="p">
+ If an <code class="ph codeph">INSERT</code> statement brings in less than <span class="ph">one Parquet
+ block's worth</span> of data, the resulting data file is smaller than ideal. Thus, if you do split up an ETL
+ job to use multiple <code class="ph codeph">INSERT</code> statements, try to keep the volume of data for each
+ <code class="ph codeph">INSERT</code> statement to approximately <span class="ph">256 MB, or a multiple of
+ 256 MB</span>.
+ </p>
+
+ </div>
+
+ <article class="topic concept nested2" aria-labelledby="ariaid-title15" id="parquet_data_files__parquet_encoding">
+
+ <h3 class="title topictitle3" id="ariaid-title15">RLE and Dictionary Encoding for Parquet Data Files</h3>
+
+ <div class="body conbody">
+
+ <p class="p">
+ Parquet uses some automatic compression techniques, such as run-length encoding (RLE) and dictionary
+ encoding, based on analysis of the actual data values. Once the data values are encoded in a compact
+ form, the encoded data can optionally be further compressed using a compression algorithm. Parquet data
+ files created by Impala can use Snappy, GZip, or no compression; the Parquet spec also allows LZO
+ compression, but currently Impala does not support LZO-compressed Parquet files.
+ </p>
+
+ <p class="p">
+ RLE and dictionary encoding are compression techniques that Impala applies automatically to groups of
+ Parquet data values, in addition to any Snappy or GZip compression applied to the entire data files.
+ These automatic optimizations can save you time and planning that are normally needed for a traditional
+ data warehouse. For example, dictionary encoding reduces the need to create numeric IDs as abbreviations
+ for longer string values.
+ </p>
+
+ <p class="p">
+ Run-length encoding condenses sequences of repeated data values. For example, if many consecutive rows
+ all contain the same value for a country code, those repeating values can be represented by the value
+ followed by a count of how many times it appears consecutively.
+ </p>
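+
+ <p class="p">
+ As a conceptual illustration of run-length encoding, the following Python sketch collapses consecutive
+ repeated values into value and count pairs. It is not the on-disk representation that Parquet uses, and
+ the function names are assumptions made for this example.
+ </p>

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


country_codes = ["US", "US", "US", "US", "CA", "CA", "US"]
encoded = run_length_encode(country_codes)
print(encoded)  # [('US', 4), ('CA', 2), ('US', 1)]
```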
+
+ <p class="p">
+ Dictionary encoding takes the different values present in a column, and represents each one in compact
+ 2-byte form rather than the original value, which could be several bytes. (Additional compression is
+ applied to the compacted values, for extra space savings.) This type of encoding applies when the number
+ of different values for a column is less than 2**16 (65,536). It does not apply to columns of data type
+ <code class="ph codeph">BOOLEAN</code>, which are already very short. <code class="ph codeph">TIMESTAMP</code> columns sometimes have
+ a unique value for each row, in which case they can quickly exceed the 2**16 limit on distinct values.
+ The 2**16 limit on different values within a column is reset for each data file, so if several different
+ data files each contained 10,000 different city names, the city name column in each data file could still
+ be condensed using dictionary encoding.
+ </p>
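+
+ <p class="p">
+ The idea behind dictionary encoding can be sketched in a few lines of Python. This is a simplified model
+ rather than the Parquet implementation: the function name, the in-memory layout, and the way the limit is
+ enforced are all assumptions made for this example.
+ </p>

```python
MAX_DICTIONARY_ENTRIES = 2**16  # limit on distinct values described above


def dictionary_encode(values, max_entries=MAX_DICTIONARY_ENTRIES):
    """Return (dictionary, codes), or None when the number of distinct
    values is not less than max_entries."""
    dictionary = []  # code -> original value
    index = {}       # original value -> code
    codes = []
    for value in values:
        if value not in index:
            if len(dictionary) + 1 >= max_entries:
                return None  # too many distinct values for this data file
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


cities = ["San Francisco", "Toronto", "San Francisco", "Toronto", "Toronto"]
dictionary, codes = dictionary_encode(cities)
print(dictionary)  # ['San Francisco', 'Toronto']
print(codes)       # [0, 1, 0, 1, 1]
```

+ <p class="p">
+ Each long string is stored once in the dictionary, and the column itself holds only small integer codes.
+ </p>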
+
+ </div>
+
+ </article>
+
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title16" id="parquet__parquet_compacting">
+
+ <h2 class="title topictitle2" id="ariaid-title16">Compacting Data Files for Parquet Tables</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ If you reuse existing table structures or ETL processes for Parquet tables, you might encounter a <span class="q">"many
+ small files"</span> situation, which is suboptimal for query efficiency. For example, statements like these
+ might produce inefficiently organized data files:
+ </p>
+
+<pre class="pre codeblock"><code>-- In an N-node cluster, each node produces a data file
+-- for the INSERT operation. If you have less than
+-- N GB of data to copy, some files are likely to be
+-- much smaller than the <span class="ph">default Parquet</span> block size.
+insert into parquet_table select * from text_table;
+
+-- Even if this operation involves an overall large amount of data,
+-- when split up by year/month/day, each partition might only
+-- receive a small amount of data. Then the data files for
+-- the partition might be divided between the N nodes in the cluster.
+-- A multi-gigabyte copy operation might produce files of only
+-- a few MB each.
+insert into partitioned_parquet_table partition (year, month, day)
+ select year, month, day, url, referer, user_agent, http_code, response_time
+ from web_stats;
+</code></pre>
+
+ <p class="p">
+ Here are techniques to help you produce large data files in Parquet <code class="ph codeph">INSERT</code> operations, and
+ to compact existing too-small data files:
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ <p class="p">
+ When inserting into a partitioned Parquet table, use statically partitioned <code class="ph codeph">INSERT</code>
+ statements where the partition key values are specified as constant values. Ideally, use a separate
+ <code class="ph codeph">INSERT</code> statement for each partition.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ You might set the <code class="ph codeph">NUM_NODES</code> option to 1 briefly, during <code class="ph codeph">INSERT</code> or
+ <code class="ph codeph">CREATE TABLE AS SELECT</code> statements. Normally, those statements produce one or more data
+ files per data node. If the write operation involves small amounts of data, a Parquet table, and/or a
+ partitioned table, the default behavior could produce many small files when intuitively you might expect
+ only a single output file. <code class="ph codeph">SET NUM_NODES=1</code> turns off the <span class="q">"distributed"</span> aspect of the
+ write operation, making it more likely to produce only one or a few data files.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ Be prepared to reduce the number of partition key columns from what you are used to with traditional
+ analytic database systems.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ Do not expect Impala-written Parquet files to fill up the entire Parquet block size. Impala estimates
+ on the conservative side when figuring out how much data to write to each Parquet file. Typically, the
+ volume of uncompressed data in memory is substantially reduced on disk by the compression and encoding
+ techniques in the Parquet file format.
+
+ The final data file size varies depending on the compressibility of the data. Therefore, it is not an
+ indication of a problem if <span class="ph">256 MB</span> of text data is turned into 2
+ Parquet data files, each less than <span class="ph">256 MB</span>.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ If you accidentally end up with a table with many small data files, consider using one or more of the
+ preceding techniques and copying all the data into a new Parquet table, either through <code class="ph codeph">CREATE
+ TABLE AS SELECT</code> or <code class="ph codeph">INSERT ... SELECT</code> statements.
+ </p>
+
+ <p class="p">
+ To avoid rewriting queries to change table names, you can adopt a convention of always running
+ important queries against a view. Changing the view definition immediately switches any subsequent
+ queries to use the new underlying tables:
+ </p>
+<pre class="pre codeblock"><code>create view production_table as select * from table_with_many_small_files;
+-- CTAS or INSERT...SELECT all the data into a more efficient layout...
+alter view production_table as select * from table_with_few_big_files;
+select * from production_table where c1 = 100 and c2 < 50 and ...;
+</code></pre>
+ </li>
+ </ul>
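+
+ <p class="p">
+ To see why fewer partition key columns and fewer writer nodes lead to larger files, a rough estimate can
+ help. The following Python sketch assumes that rows are spread evenly and that every node writes one file
+ for every partition, which is a simplification. The function name and the workload numbers are
+ hypothetical.
+ </p>

```python
def average_file_size_mb(total_gb, num_partitions, num_nodes):
    """Estimate the average data file size in MB, assuming an even spread
    of data and one file per node per partition."""
    num_files = num_partitions * num_nodes
    return total_gb * 1024 / num_files


# Hypothetical workload: 50 GB of written data, 3 years of daily
# partitions, 10 nodes taking part in the INSERT.
spread_out = average_file_size_mb(total_gb=50, num_partitions=3 * 365, num_nodes=10)
# Same data partitioned by month only, written with NUM_NODES=1.
compacted = average_file_size_mb(total_gb=50, num_partitions=3 * 12, num_nodes=1)
print(f"daily partitions, 10 nodes: {spread_out:.1f} MB per file")
print(f"monthly partitions, 1 node: {compacted:.1f} MB per file")
```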
+
+ </div>
+
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title17" id="parquet__parquet_schema_evolution">
+
+ <h2 class="title topictitle2" id="ariaid-title17">Schema Evolution for Parquet Tables</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ Schema evolution refers to using the statement <code class="ph codeph">ALTER TABLE ... REPLACE COLUMNS</code> to change
+ the names, data types, or number of columns in a table. You can perform schema evolution for Parquet tables
+ as follows:
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ <p class="p">
+ The Impala <code class="ph codeph">ALTER TABLE</code> statement never changes any data files in the tables. From the
+ Impala side, schema evolution involves interpreting the same data files in terms of a new table
+ definition. Some types of schema changes make sense and are represented correctly. Other types of
+ changes cannot be represented in a sensible way, and produce special result values or conversion errors
+ during queries.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ The <code class="ph codeph">INSERT</code> statement always creates data using the latest table definition. You might
+ end up with data files with different numbers of columns or internal data representations if you do a
+ sequence of <code class="ph codeph">INSERT</code> and <code class="ph codeph">ALTER TABLE ... REPLACE COLUMNS</code> statements.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ If you use <code class="ph codeph">ALTER TABLE ... REPLACE COLUMNS</code> to define additional columns at the end,
+ when the original data files are used in a query, these final columns are considered to be all
+ <code class="ph codeph">NULL</code> values.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ If you use <code class="ph codeph">ALTER TABLE ... REPLACE COLUMNS</code> to define fewer columns than before, when
+ the original data files are used in a query, the unused columns still present in the data file are
+ ignored.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ Parquet represents the <code class="ph codeph">TINYINT</code>, <code class="ph codeph">SMALLINT</code>, and <code class="ph codeph">INT</code>
+ types the same internally, all stored in 32-bit integers.
+ </p>
+ <ul class="ul">
+ <li class="li">
+ That means it is easy to promote a <code class="ph codeph">TINYINT</code> column to <code class="ph codeph">SMALLINT</code> or
+ <code class="ph codeph">INT</code>, or a <code class="ph codeph">SMALLINT</code> column to <code class="ph codeph">INT</code>. The numbers are
+ represented exactly the same in the data file, and the columns being promoted would not contain any
+ out-of-range values.
+ </li>
+
+ <li class="li">
+ <p class="p">
+ If you change any of these column types to a smaller type, any values that are out-of-range for the
+ new type are returned incorrectly, typically as negative numbers.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ You cannot change a <code class="ph codeph">TINYINT</code>, <code class="ph codeph">SMALLINT</code>, or <code class="ph codeph">INT</code>
+ column to <code class="ph codeph">BIGINT</code>, or the other way around. Although the <code class="ph codeph">ALTER
+ TABLE</code> succeeds, any attempt to query those columns results in conversion errors.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ Any other type conversion for columns produces a conversion error during queries. Examples include
+ <code class="ph codeph">INT</code> to <code class="ph codeph">STRING</code>, <code class="ph codeph">FLOAT</code> to <code class="ph codeph">DOUBLE</code>,
+ <code class="ph codeph">TIMESTAMP</code> to <code class="ph codeph">STRING</code>, and <code class="ph codeph">DECIMAL(9,0)</code> to
+ <code class="ph codeph">DECIMAL(5,2)</code>.
+ </p>
+ </li>
+ </ul>
+ </li>
+ </ul>
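+
+ <p class="p">
+ As a minimal sketch of the rules above, the following statements promote a
+ <code class="ph codeph">TINYINT</code> column to <code class="ph codeph">INT</code> and define an additional
+ column at the end. The table and column names (<code class="ph codeph">evolve_demo</code>,
+ <code class="ph codeph">id</code>, <code class="ph codeph">s</code>, <code class="ph codeph">extra</code>) are hypothetical,
+ and the comments describe the behavior expected from these rules rather than captured output.
+ </p>
+
+<pre class="pre codeblock"><code>
+-- Hypothetical table used only for illustration.
+create table evolve_demo (id tinyint, s string) stored as parquet;
+insert into evolve_demo values (1, 'a'), (2, 'b');
+
+-- Promote ID from TINYINT to INT and define an additional column at the end.
+alter table evolve_demo replace columns (id int, s string, extra bigint);
+
+-- The original data file has no EXTRA column, so EXTRA is expected to be
+-- NULL for the rows inserted before the ALTER TABLE statement.
+select id, s, extra from evolve_demo;
+</code></pre>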
+
+ <div class="p">
+ You might find that you have Parquet files where the columns do not line up in the same
+ order as in your Impala table. For example, you might have a Parquet file that was part of
+ a table with columns <code class="ph codeph">C1,C2,C3,C4</code>, and now you want to reuse the same
+ Parquet file in a table with columns <code class="ph codeph">C4,C2</code>. By default, Impala expects the
+ columns in the data file to appear in the same order as the columns defined for the table,
+ making it impractical to do some kinds of file reuse or schema evolution. In <span class="keyword">Impala 2.6</span>
+ and higher, the query option <code class="ph codeph">PARQUET_FALLBACK_SCHEMA_RESOLUTION=name</code> lets Impala
+ resolve columns by name, and therefore handle out-of-order or extra columns in the data file.
+ For example:
+
+<pre class="pre codeblock"><code>
+create database schema_evolution;
+use schema_evolution;
+create table t1 (c1 int, c2 boolean, c3 string, c4 timestamp)
+ stored as parquet;
+insert into t1 values
+ (1, true, 'yes', now()),
+ (2, false, 'no', now() + interval 1 day);
+
+select * from t1;
++----+-------+-----+-------------------------------+
+| c1 | c2 | c3 | c4 |
++----+-------+-----+-------------------------------+
+| 1 | true | yes | 2016-06-28 14:53:26.554369000 |
+| 2 | false | no | 2016-06-29 14:53:26.554369000 |
++----+-------+-----+-------------------------------+
+
+desc formatted t1;
+...
+| Location: | /user/hive/warehouse/schema_evolution.db/t1 |
+...
+
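+-- Create T2 with only two of the columns, in a different order than in T1.
+create table t2 (c4 timestamp, c2 boolean)
+ stored as parquet;
+
+-- Make T2 have the same data file as in T1, including 2
+-- unused columns and column order different than T2 expects.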
+load data inpath '/user/hive/warehouse/schema_evolution.db/t1'
+ into table t2;
++----------------------------------------------------------+
+| summary |
++----------------------------------------------------------+
+| Loaded 1 file(s). Total files in destination location: 1 |
++----------------------------------------------------------+
+
+-- 'position' is the default setting.
+-- Impala cannot read the Parquet file if the column order does not match.
+set PARQUET_FALLBACK_SCHEMA_RESOLUTION=position;
+PARQUET_FALLBACK_SCHEMA_RESOLUTION set to position
+
+select * from t2;
+WARNINGS:
+File 'schema_evolution.db/t2/45331705_data.0.parq'
+has an incompatible Parquet schema for column 'schema_evolution.t2.c4'.
+Column type: TIMESTAMP, Parquet schema: optional int32 c1 [i:0 d:1 r:0]
+
+File 'schema_evolution.db/t2/45331705_data.0.parq'
+has an incompatible Parquet schema for column 'schema_evolution.t2.c4'.
+Column type: TIMESTAMP, Parquet schema: optional int32 c1 [i:0 d:1 r:0]
+
+-- With the 'name' setting, Impala can read the Parquet data files
+-- despite mismatching column order.
+set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;
+PARQUET_FALLBACK_SCHEMA_RESOLUTION set to name
+
+select * from t2;
++-------------------------------+-------+
+| c4 | c2 |
++-------------------------------+-------+
+| 2016-06-28 14:53:26.554369000 | true |
+| 2016-06-29 14:53:26.554369000 | false |
++-------------------------------+-------+
+
+</code></pre>
+
+ See <a class="xref" href="impala_parquet_fallback_schema_resolution.html#parquet_fallback_schema_resolution">PARQUET_FALLBACK_SCHEMA_RESOLUTION Query Option (Impala 2.6 or higher only)</a>
+ for more details.
+ </div>
+
+ </div>
+
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title18" id="parquet__parquet_data_types">
+
+ <h2 class="title topictitle2" id="ariaid-title18">Data Type Considerations for Parquet Tables</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ The Parquet format defines a set of data types whose names differ from the names of the corresponding
+ Impala data types. If you are preparing Parquet files using other Hadoop components such as Pig or
+ MapReduce, you might need to work with the type names defined by Parquet. The following lists show the
+ Parquet-defined types and the equivalent types in Impala.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Primitive types:</strong>
+ </p>
+
+<pre class="pre codeblock"><code>BINARY -> STRING
+BOOLEAN -> BOOLEAN
+DOUBLE -> DOUBLE
+FLOAT -> FLOAT
+INT32 -> INT
+INT64 -> BIGINT
+INT96 -> TIMESTAMP
+</code></pre>
+
+ <p class="p">
+ <strong class="ph b">Logical types:</strong>
+ </p>
+
+<pre class="pre codeblock"><code>BINARY + OriginalType UTF8 -> STRING
+BINARY + OriginalType ENUM -> STRING
+BINARY + OriginalType DECIMAL -> DECIMAL
+</code></pre>
+
+ <p class="p">
+ <strong class="ph b">Complex types:</strong>
+ </p>
+
+ <p class="p">
+ For the complex types (<code class="ph codeph">ARRAY</code>, <code class="ph codeph">MAP</code>, and <code class="ph codeph">STRUCT</code>)
+ available in <span class="keyword">Impala 2.3</span> and higher, Impala supports queries
+ against those types only in Parquet tables.
+ </p>
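+
+ <p class="p">
+ To illustrate the primitive and logical type mappings, a Parquet file with the hypothetical schema
+ shown in the comments below could be exposed through an Impala table declared as follows. The
+ schema, the table name <code class="ph codeph">parquet_types_demo</code>, and the HDFS path are
+ assumptions made for this sketch.
+ </p>
+
+<pre class="pre codeblock"><code>
+-- Hypothetical Parquet schema of the existing data files:
+--   message hypothetical_schema {
+--     optional binary name (UTF8);
+--     optional boolean active;
+--     optional double score;
+--     optional int32 visits;
+--     optional int64 total_bytes;
+--     optional int96 created;
+--   }
+create external table parquet_types_demo (
+  name string,          -- BINARY + UTF8
+  active boolean,       -- BOOLEAN
+  score double,         -- DOUBLE
+  visits int,           -- INT32
+  total_bytes bigint,   -- INT64
+  created timestamp     -- INT96
+)
+  stored as parquet
+  location '/hypothetical/path/to/parquet_files';
+</code></pre>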
+
+ </div>
+
+ </article>
+
+</article></main></body></html>
http://git-wip-us.apache.org/repos/asf/impala/blob/fae51ec2/docs/build3x/html/topics/impala_parquet_annotate_strings_utf8.html
----------------------------------------------------------------------
diff --git a/docs/build3x/html/topics/impala_parquet_annotate_strings_utf8.html b/docs/build3x/html/topics/impala_parquet_annotate_strings_utf8.html
new file mode 100644
index 0000000..f72b664
--- /dev/null
+++ b/docs/build3x/html/topics/impala_parquet_annotate_strings_utf8.html
@@ -0,0 +1,54 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2018"><meta name="DC.rights.owner" content="(C) Copyright 2018"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="parquet_annotate_strings_utf8"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>PARQUET_ANNOTATE_STRINGS_UTF8 Query Option (Impala 2.6 or higher only)</title></head><body id="parquet_annotate_strings_utf8"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+ <h1 class="title topictitle1" id="ariaid-title1">PARQUET_ANNOTATE_STRINGS_UTF8 Query Option (<span class="keyword">Impala 2.6</span> or higher only)</h1>
+
+
+
+ <div class="body conbody">
+
+ <p class="p">
+
+ Causes Impala <code class="ph codeph">INSERT</code> and <code class="ph codeph">CREATE TABLE AS SELECT</code> statements
+ to write Parquet files that use the UTF-8 annotation for <code class="ph codeph">STRING</code> columns.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Usage notes:</strong>
+ </p>
+ <p class="p">
+ By default, Impala represents a <code class="ph codeph">STRING</code> column in Parquet as an unannotated binary field.
+ </p>
+ <p class="p">
+ Impala always uses the UTF-8 annotation when writing <code class="ph codeph">CHAR</code> and <code class="ph codeph">VARCHAR</code>
+ columns to Parquet files. An alternative to using the query option is to cast <code class="ph codeph">STRING</code>
+ values to <code class="ph codeph">VARCHAR</code>.
+ </p>
+ <p class="p">
+ This option helps make Impala-written data more interoperable with other data processing engines.
+ Impala itself currently does not support all operations on UTF-8 data.
+ Although data processed by Impala is typically represented in ASCII, it is valid to designate the
+ data as UTF-8 when storing on disk, because ASCII is a subset of UTF-8.
+ </p>
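+ <p class="p">
+ The two approaches described above might look like the following sketch. The table names
+ <code class="ph codeph">utf8_demo</code> and <code class="ph codeph">varchar_demo</code>, the column alias,
+ and the <code class="ph codeph">VARCHAR</code> length are hypothetical.
+ </p>
+
+<pre class="pre codeblock"><code>
+-- Approach 1: enable the query option so that STRING columns
+-- are written with the UTF-8 annotation.
+set PARQUET_ANNOTATE_STRINGS_UTF8=true;
+create table utf8_demo stored as parquet
+  as select 'hello' as s;
+
+-- Approach 2: leave the option at its default and cast to VARCHAR,
+-- which is always written with the UTF-8 annotation.
+set PARQUET_ANNOTATE_STRINGS_UTF8=false;
+create table varchar_demo stored as parquet
+  as select cast('hello' as varchar(10)) as s;
+</code></pre>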
+ <p class="p">
+ <strong class="ph b">Type:</strong> Boolean; recognized values are 1 and 0, or <code class="ph codeph">true</code> and <code class="ph codeph">false</code>;
+ any other value is interpreted as <code class="ph codeph">false</code>
+ </p>
+ <p class="p">
+ <strong class="ph b">Default:</strong> <code class="ph codeph">false</code> (shown as 0 in output of <code class="ph codeph">SET</code> statement)
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Added in:</strong> <span class="keyword">Impala 2.6.0</span>
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Related information:</strong>
+ </p>
+ <p class="p">
+ <a class="xref" href="impala_parquet.html#parquet">Using the Parquet File Format with Impala Tables</a>
+ </p>
+
+ </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html>
http://git-wip-us.apache.org/repos/asf/impala/blob/fae51ec2/docs/build3x/html/topics/impala_parquet_array_resolution.html
----------------------------------------------------------------------
diff --git a/docs/build3x/html/topics/impala_parquet_array_resolution.html b/docs/build3x/html/topics/impala_parquet_array_resolution.html
new file mode 100644
index 0000000..831ac46
--- /dev/null
+++ b/docs/build3x/html/topics/impala_parquet_array_resolution.html
@@ -0,0 +1,180 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2018"><meta name="DC.rights.owner" content="(C) Copyright 2018"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="parquet_array_resolution"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>PARQUET_ARRAY_RESOLUTION Query Option (Impala 2.9 or higher only)</title></head><body id="parquet_array_resolution"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+ <h1 class="title topictitle1" id="ariaid-title1">
+ PARQUET_ARRAY_RESOLUTION Query Option (<span class="keyword">Impala 2.9</span> or higher only)
+ </h1>
+
+
+
+
+
+ <div class="body conbody">
+
+ <p class="p">
+ The <code class="ph codeph">PARQUET_ARRAY_RESOLUTION</code> query option controls the
+ behavior of the index-based resolution for nested arrays in Parquet.
+ </p>
+
+ <p class="p">
+ In Parquet, you can represent an array using a 2-level or 3-level
+ representation. The modern, standard representation is 3-level. The legacy
+ 2-level scheme is supported for compatibility with older Parquet files.
+ However, there is no reliable metadata within Parquet files to indicate
+ which encoding was used. It is even possible to have mixed encodings within
+ the same file if there are multiple arrays. The
+ <code class="ph codeph">PARQUET_ARRAY_RESOLUTION</code> option controls how Impala
+ performs resolution, that is, how it matches every column/field reference from a query to a
+ column in the Parquet file.</p>
+
+ <p class="p">
+ The supported values for the query option are:
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ <code class="ph codeph">THREE_LEVEL</code>: Assumes arrays are encoded with the 3-level
+ representation, and does not attempt the 2-level resolution.
+ </li>
+
+ <li class="li">
+ <code class="ph codeph">TWO_LEVEL</code>: Assumes arrays are encoded with the 2-level
+ representation, and does not attempt the 3-level resolution.
+ </li>
+
+ <li class="li">
+ <code class="ph codeph">TWO_LEVEL_THEN_THREE_LEVEL</code>: First tries to resolve
+ assuming a 2-level representation, and if unsuccessful, tries a 3-level
+ representation.
+ </li>
+ </ul>
+
+ <p class="p">
+ All of the above options resolve arrays encoded with a single level.
+ </p>
+
+ <p class="p">
+ A failure to resolve a column/field reference in a query with a given array
+ resolution policy does not necessarily result in a warning or error returned
+ by the query. A mismatch might be treated like a missing column (returns
+ NULL values), and it is not possible to reliably distinguish the 'bad
+ resolution' and 'legitimately missing column' cases.
+ </p>
+
+ <p class="p">
+ The name-based policy generally does not have the problem of ambiguous
+ array representations. You specify to use the name-based policy by setting
+ the <code class="ph codeph">PARQUET_FALLBACK_SCHEMA_RESOLUTION</code> query option to
+ <code class="ph codeph">NAME</code>.
+ </p>
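+
+ <p class="p">
+ For instance, a session might switch to the name-based policy before querying a nested array.
+ This is a sketch only: the table <code class="ph codeph">t</code> and column
+ <code class="ph codeph">col1</code> are placeholders, and no output is shown because the result
+ depends on the data files.
+ </p>
+
+<pre class="pre codeblock"><code>
+-- Resolve columns and fields by name rather than by position.
+SET PARQUET_FALLBACK_SCHEMA_RESOLUTION=NAME;
+SELECT ITEM FROM t.col1;
+</code></pre>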
+
+ <p class="p">
+ <strong class="ph b">Type:</strong> Enum of <code class="ph codeph">THREE_LEVEL</code>, <code class="ph codeph">TWO_LEVEL</code>,
+ <code class="ph codeph">TWO_LEVEL_THEN_THREE_LEVEL</code>
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Default:</strong> <code class="ph codeph">THREE_LEVEL</code>
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Added in:</strong> <span class="keyword">Impala 2.9.0</span>
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Examples:</strong>
+ </p>
+
+ <p class="p">
+ EXAMPLE A: The following Parquet schema of a file can be interpreted as either a
+ 2-level or a 3-level representation:
+ </p>
+
+<pre class="pre codeblock"><code>
+ParquetSchemaExampleA {
+ optional group single_element_groups (LIST) {
+ repeated group single_element_group {
+ required int64 count;
+ }
+ }
+}
+</code></pre>
+
+ <p class="p">
+ The following table schema corresponds to a 2-level interpretation:
+ </p>
+
+<pre class="pre codeblock"><code>
+CREATE TABLE t (col1 array&lt;struct&lt;f1: bigint&gt;&gt;) STORED AS PARQUET;
+</code></pre>
+
+ <p class="p">
+ Successful query with a 2-level interpretation:
+ </p>
+
+<pre class="pre codeblock"><code>
+SET PARQUET_ARRAY_RESOLUTION=TWO_LEVEL;
+SELECT ITEM.f1 FROM t.col1;
+</code></pre>
+
+ <p class="p">
+ The following table schema corresponds to a 3-level interpretation:
+ </p>
+
+<pre class="pre codeblock"><code>
+CREATE TABLE t (col1 array&lt;bigint&gt;) STORED AS PARQUET;
+</code></pre>
+
+ <p class="p">
+ Successful query with a 3-level interpretation:
+ </p>
+
+<pre class="pre codeblock"><code>
+SET PARQUET_ARRAY_RESOLUTION=THREE_LEVEL;
+SELECT ITEM FROM t.col1;
+</code></pre>
+
+ <p class="p">
+ EXAMPLE B: The following Parquet schema of a file can only be successfully
+ interpreted as a 2-level representation:
+ </p>
+
+<pre class="pre codeblock"><code>
+ParquetSchemaExampleB {
+ required group list_of_ints (LIST) {
+ repeated int32 list_of_ints_tuple;
+ }
+}
+</code></pre>
+
+ <p class="p">
+ The following table schema corresponds to a 2-level interpretation:
+ </p>
+
+<pre class="pre codeblock"><code>
+CREATE TABLE t (col1 array&lt;int&gt;) STORED AS PARQUET;
+</code></pre>
+
+ <p class="p">
+ Successful query with a 2-level interpretation:
+ </p>
+
+<pre class="pre codeblock"><code>
+SET PARQUET_ARRAY_RESOLUTION=TWO_LEVEL;
+SELECT ITEM FROM t.col1;
+</code></pre>
+
+ <p class="p">
+ Unsuccessful query with a 3-level interpretation. The query returns
+ <code class="ph codeph">NULL</code>s as if the column were missing in the file:
+ </p>
+
+<pre class="pre codeblock"><code>
+SET PARQUET_ARRAY_RESOLUTION=THREE_LEVEL;
+SELECT ITEM FROM t.col1;
+</code></pre>
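+
+ <p class="p">
+ Going by the description of <code class="ph codeph">TWO_LEVEL_THEN_THREE_LEVEL</code> earlier in
+ this topic, the same query against the EXAMPLE B file would be expected to succeed with that
+ setting, because the 2-level resolution is attempted first. The following is a sketch of that
+ case, not captured output:
+ </p>
+
+<pre class="pre codeblock"><code>
+-- Try the 2-level resolution first, then fall back to 3-level.
+SET PARQUET_ARRAY_RESOLUTION=TWO_LEVEL_THEN_THREE_LEVEL;
+SELECT ITEM FROM t.col1;
+</code></pre>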
+
+ </div>
+
+<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html>
http://git-wip-us.apache.org/repos/asf/impala/blob/fae51ec2/docs/build3x/html/topics/impala_parquet_compression_codec.html
----------------------------------------------------------------------
diff --git a/docs/build3x/html/topics/impala_parquet_compression_codec.html b/docs/build3x/html/topics/impala_parquet_compression_codec.html
new file mode 100644
index 0000000..ac5551a
--- /dev/null
+++ b/docs/build3x/html/topics/impala_parquet_compression_codec.html
@@ -0,0 +1,17 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2018"><meta name="DC.rights.owner" content="(C) Copyright 2018"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="parquet_compression_codec"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>PARQUET_COMPRESSION_CODEC Query Option</title></head><body id="parquet_compression_codec"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+ <h1 class="title topictitle1" id="ariaid-title1">PARQUET_COMPRESSION_CODEC Query Option</h1>
+
+
+
+ <div class="body conbody">
+
+ <p class="p">
+
+ Deprecated. Use <code class="ph codeph">COMPRESSION_CODEC</code> in Impala 2.0 and later. See
+ <a class="xref" href="impala_compression_codec.html#compression_codec">COMPRESSION_CODEC Query Option (Impala 2.0 or higher only)</a> for details.
+ </p>
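+
+ <p class="p">
+ A minimal sketch of using the replacement option, assuming a hypothetical Parquet table
+ <code class="ph codeph">parquet_dest</code> and a hypothetical source table
+ <code class="ph codeph">source_table</code>:
+ </p>
+
+<pre class="pre codeblock"><code>
+-- Use COMPRESSION_CODEC instead of the deprecated PARQUET_COMPRESSION_CODEC.
+SET COMPRESSION_CODEC=gzip;
+INSERT INTO parquet_dest SELECT * FROM source_table;
+
+-- Switch back to Snappy for subsequent statements.
+SET COMPRESSION_CODEC=snappy;
+</code></pre>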
+ </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html>