Posted to commits@impala.apache.org by mi...@apache.org on 2018/05/09 21:10:25 UTC

[16/51] [partial] impala git commit: [DOCS] Impala doc site update for 3.0

http://git-wip-us.apache.org/repos/asf/impala/blob/fae51ec2/docs/build3x/html/topics/impala_parquet.html
----------------------------------------------------------------------
diff --git a/docs/build3x/html/topics/impala_parquet.html b/docs/build3x/html/topics/impala_parquet.html
new file mode 100644
index 0000000..ce5242e
--- /dev/null
+++ b/docs/build3x/html/topics/impala_parquet.html
@@ -0,0 +1,1421 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2018"><meta name="DC.rights.owner" content="(C) Copyright 2018"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_file_formats.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content=
 "Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content=
 "parquet"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Using the Parquet File Format with Impala Tables</title></head><body id="parquet"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">Using the Parquet File Format with Impala Tables</h1>
+
+
+
+  <div class="body conbody">
+
+    <p class="p">
+
+      Impala helps you to create, manage, and query Parquet tables. Parquet is a column-oriented binary file format
+      intended to be highly efficient for the types of large-scale queries that Impala is best at. Parquet is
+      especially good for queries scanning particular columns within a table, for example to query <span class="q">"wide"</span>
+      tables with many columns, or to perform aggregation operations such as <code class="ph codeph">SUM()</code> and
+      <code class="ph codeph">AVG()</code> that need to process most or all of the values from a column. Each data file contains
+      the values for a set of rows (the <span class="q">"row group"</span>). Within a data file, the values from each column are
+      organized so that they are all adjacent, enabling good compression for the values from that column. Queries
+      against a Parquet table can retrieve and analyze these values from any column quickly and with minimal I/O.
+    </p>
+
+    <table class="table"><caption><span class="table--title-label">Table 1. </span><span class="title">Parquet Format Support in Impala</span></caption><colgroup><col style="width:10%"><col style="width:10%"><col style="width:20%"><col style="width:30%"><col style="width:30%"></colgroup><thead class="thead">
+          <tr class="row">
+            <th class="entry nocellnorowborder" id="parquet__entry__1">
+              File Type
+            </th>
+            <th class="entry nocellnorowborder" id="parquet__entry__2">
+              Format
+            </th>
+            <th class="entry nocellnorowborder" id="parquet__entry__3">
+              Compression Codecs
+            </th>
+            <th class="entry nocellnorowborder" id="parquet__entry__4">
+              Impala Can CREATE?
+            </th>
+            <th class="entry nocellnorowborder" id="parquet__entry__5">
+              Impala Can INSERT?
+            </th>
+          </tr>
+        </thead><tbody class="tbody">
+          <tr class="row">
+            <td class="entry nocellnorowborder" headers="parquet__entry__1 ">
+              <a class="xref" href="impala_parquet.html#parquet">Parquet</a>
+            </td>
+            <td class="entry nocellnorowborder" headers="parquet__entry__2 ">
+              Structured
+            </td>
+            <td class="entry nocellnorowborder" headers="parquet__entry__3 ">
+              Snappy, gzip; currently Snappy by default
+            </td>
+            <td class="entry nocellnorowborder" headers="parquet__entry__4 ">
+              Yes.
+            </td>
+            <td class="entry nocellnorowborder" headers="parquet__entry__5 ">
+              Yes: <code class="ph codeph">CREATE TABLE</code>, <code class="ph codeph">INSERT</code>, <code class="ph codeph">LOAD DATA</code>, and query.
+            </td>
+          </tr>
+        </tbody></table>
+
+    <p class="p toc inpage"></p>
+
+  </div>
+
+
+  <nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_file_formats.html">How Impala Works with Hadoop File Formats</a></div></div></nav><article class="topic concept nested1" aria-labelledby="ariaid-title2" id="parquet__parquet_ddl">
+
+    <h2 class="title topictitle2" id="ariaid-title2">Creating Parquet Tables in Impala</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        To create a table named <code class="ph codeph">PARQUET_TABLE</code> that uses the Parquet format, you would use a
+        command like the following, substituting your own table name, column names, and data types:
+      </p>
+
+<pre class="pre codeblock"><code>[impala-host:21000] &gt; create table <var class="keyword varname">parquet_table_name</var> (x INT, y STRING) STORED AS PARQUET;</code></pre>
+
+
+
+      <p class="p">
+        Or, to clone the column names and data types of an existing table:
+      </p>
+
+<pre class="pre codeblock"><code>[impala-host:21000] &gt; create table <var class="keyword varname">parquet_table_name</var> LIKE <var class="keyword varname">other_table_name</var> STORED AS PARQUET;</code></pre>
+
+      <p class="p">
+        In Impala 1.4.0 and higher, you can derive column definitions from a raw Parquet data file, even without an
+        existing Impala table. For example, you can create an external table pointing to an HDFS directory, and
+        base the column definitions on one of the files in that directory:
+      </p>
+
+<pre class="pre codeblock"><code>CREATE EXTERNAL TABLE ingest_existing_files LIKE PARQUET '/user/etl/destination/datafile1.dat'
+  STORED AS PARQUET
+  LOCATION '/user/etl/destination';
+</code></pre>
+
+      <p class="p">
+        Or, you can refer to an existing data file and create a new empty table with suitable column definitions.
+        Then you can use <code class="ph codeph">INSERT</code> to create new data files or <code class="ph codeph">LOAD DATA</code> to transfer
+        existing data files into the new table.
+      </p>
+
+<pre class="pre codeblock"><code>CREATE TABLE columns_from_data_file LIKE PARQUET '/user/etl/destination/datafile1.dat'
+  STORED AS PARQUET;
+</code></pre>
+
+      <p class="p">
+        The default properties of the newly created table are the same as for any other <code class="ph codeph">CREATE
+        TABLE</code> statement. For example, the default file format is text; if you want the new table to use
+        the Parquet file format, include the <code class="ph codeph">STORED AS PARQUET</code> clause.
+      </p>
+
+      <p class="p">
+        In this example, the new table is partitioned by year, month, and day. These partition key columns are not
+        part of the data file, so you specify them in the <code class="ph codeph">CREATE TABLE</code> statement:
+      </p>
+
+<pre class="pre codeblock"><code>CREATE TABLE columns_from_data_file LIKE PARQUET '/user/etl/destination/datafile1.dat'
+  PARTITIONED BY (year INT, month TINYINT, day TINYINT)
+  STORED AS PARQUET;
+</code></pre>
+
+      <p class="p">
+        See <a class="xref" href="impala_create_table.html#create_table">CREATE TABLE Statement</a> for more details about the <code class="ph codeph">CREATE TABLE
+        LIKE PARQUET</code> syntax.
+      </p>
+
+      <p class="p">
+        Once you have created a table, to insert data into that table, use a command similar to the following,
+        again with your own table names:
+      </p>
+
+
+
+<pre class="pre codeblock"><code>[impala-host:21000] &gt; insert overwrite table <var class="keyword varname">parquet_table_name</var> select * from <var class="keyword varname">other_table_name</var>;</code></pre>
+
+      <p class="p">
+        If the Parquet table has a different number of columns or different column names than the other table,
+        specify the names of columns from the other table rather than <code class="ph codeph">*</code> in the
+        <code class="ph codeph">SELECT</code> statement.
+      </p>
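+
+      <p class="p">
+        For example, the following sketch uses hypothetical column names (<code class="ph codeph">cust_id</code> and
+        <code class="ph codeph">cust_name</code>) to copy two named columns into a two-column Parquet table:
+      </p>
+
+<pre class="pre codeblock"><code>[impala-host:21000] &gt; insert overwrite table <var class="keyword varname">parquet_table_name</var> select cust_id, cust_name from <var class="keyword varname">other_table_name</var>;</code></pre>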
+
+    </div>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title3" id="parquet__parquet_etl">
+
+    <h2 class="title topictitle2" id="ariaid-title3">Loading Data into Parquet Tables</h2>
+
+
+    <div class="body conbody">
+
+      <p class="p">
+        Choose from the following techniques for loading data into Parquet tables, depending on whether the
+        original data is already in an Impala table, or exists as raw data files outside Impala.
+      </p>
+
+      <p class="p">
+        If you already have data in an Impala or Hive table, perhaps in a different file format or partitioning
+        scheme, you can transfer the data to a Parquet table using the Impala <code class="ph codeph">INSERT...SELECT</code>
+        syntax. You can convert, filter, repartition, and do other things to the data as part of this same
+        <code class="ph codeph">INSERT</code> statement. See <a class="xref" href="#parquet_compression">Snappy and GZip Compression for Parquet Data Files</a> for some examples showing how to
+        insert data into Parquet tables.
+      </p>
+
+      <div class="p">
+        When inserting into partitioned tables, especially using the Parquet file format, you can include a hint in
+        the <code class="ph codeph">INSERT</code> statement to fine-tune the overall performance of the operation and its
+        resource usage:
+        <ul class="ul">
+
+          <li class="li">
+            You would only use hints if an <code class="ph codeph">INSERT</code> into a partitioned Parquet table was
+            failing due to capacity limits, or if such an <code class="ph codeph">INSERT</code> was succeeding but with
+            less-than-optimal performance.
+          </li>
+
+          <li class="li">
+            To use a hint to influence the join order, put the hint keyword <code class="ph codeph">/* +SHUFFLE */</code> or <code class="ph codeph">/* +NOSHUFFLE */</code>
+            (including the square brackets) after the <code class="ph codeph">PARTITION</code> clause, immediately before the
+            <code class="ph codeph">SELECT</code> keyword.
+          </li>
+
+          <li class="li">
+            <code class="ph codeph">/* +SHUFFLE */</code> selects an execution plan that reduces the number of files being written
+            simultaneously to HDFS, and the number of memory buffers holding data for individual partitions. Thus
+            it reduces overall resource usage for the <code class="ph codeph">INSERT</code> operation, allowing some
+            <code class="ph codeph">INSERT</code> operations to succeed that otherwise would fail. It does involve some data
+            transfer between the nodes so that the data files for a particular partition are all constructed on the
+            same node.
+          </li>
+
+          <li class="li">
+            <code class="ph codeph">/* +NOSHUFFLE */</code> selects an execution plan that might be faster overall, but might also
+            produce a larger number of small data files or exceed capacity limits, causing the
+            <code class="ph codeph">INSERT</code> operation to fail. Use <code class="ph codeph">/* +SHUFFLE */</code> in cases where an
+            <code class="ph codeph">INSERT</code> statement fails or runs inefficiently due to all nodes attempting to construct
+            data for all partitions.
+          </li>
+
+          <li class="li">
+            Impala automatically uses the <code class="ph codeph">/* +SHUFFLE */</code> method if any partition key column in the
+            source table, mentioned in the <code class="ph codeph">INSERT ... SELECT</code> query, does not have column
+            statistics. In this case, only the <code class="ph codeph">/* +NOSHUFFLE */</code> hint would have any effect.
+          </li>
+
+          <li class="li">
+            If column statistics are available for all partition key columns in the source table mentioned in the
+            <code class="ph codeph">INSERT ... SELECT</code> query, Impala chooses whether to use the <code class="ph codeph">/* +SHUFFLE */</code>
+            or <code class="ph codeph">/* +NOSHUFFLE */</code> technique based on the estimated number of distinct values in those
+            columns and the number of nodes involved in the <code class="ph codeph">INSERT</code> operation. In this case, you
+            might need the <code class="ph codeph">/* +SHUFFLE */</code> or the <code class="ph codeph">/* +NOSHUFFLE */</code> hint to override the
+            execution plan selected by Impala.
+          </li>
+
+          <li class="li">
+            In <span class="keyword">Impala 2.8</span> or higher, you can make the
+            <code class="ph codeph">INSERT</code> operation organize (<span class="q">"cluster"</span>)
+            the data for each partition to avoid buffering data for multiple partitions
+            and reduce the risk of an out-of-memory condition. Specify the hint as
+            <code class="ph codeph">/* +CLUSTERED */</code>. This technique is primarily
+            useful for inserts into Parquet tables, where the large block
+            size requires substantial memory to buffer data for multiple
+            output files at once.
+          </li>
+
+        </ul>
+      </div>
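+
+      <p class="p">
+        The following sketch shows where such a hint goes. The table and column names are hypothetical;
+        only the position of the hint, after the <code class="ph codeph">PARTITION</code> clause and immediately
+        before the <code class="ph codeph">SELECT</code> keyword, is significant:
+      </p>
+
+<pre class="pre codeblock"><code>INSERT INTO TABLE sales_parquet PARTITION (year, month) /* +SHUFFLE */
+  SELECT id, amount, year, month FROM sales_staging;
+</code></pre>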
+
+      <p class="p">
+        Any <code class="ph codeph">INSERT</code> statement for a Parquet table requires enough free space in the HDFS filesystem
+        to write one block. Because Parquet data files use a block size of 256 MB by default, an
+        <code class="ph codeph">INSERT</code> might fail (even for a very small amount of data) if your HDFS is running low on
+        space.
+      </p>
+
+
+
+      <p class="p">
+        Avoid the <code class="ph codeph">INSERT...VALUES</code> syntax for Parquet tables, because
+        <code class="ph codeph">INSERT...VALUES</code> produces a separate tiny data file for each
+        <code class="ph codeph">INSERT...VALUES</code> statement, and the strength of Parquet is in its handling of data
+        (compressing, parallelizing, and so on) in <span class="ph">large</span> chunks.
+      </p>
+
+      <p class="p">
+        If you have one or more Parquet data files produced outside of Impala, you can quickly make the data
+        queryable through Impala by one of the following methods (a brief sketch follows the list):
+      </p>
+
+      <ul class="ul">
+        <li class="li">
+          The <code class="ph codeph">LOAD DATA</code> statement moves a single data file or a directory full of data files into
+          the data directory for an Impala table. It does no validation or conversion of the data. The original
+          data files must be somewhere in HDFS, not the local filesystem.
+
+        </li>
+
+        <li class="li">
+          The <code class="ph codeph">CREATE TABLE</code> statement with the <code class="ph codeph">LOCATION</code> clause creates a table
+          where the data continues to reside outside the Impala data directory. The original data files must be
+          somewhere in HDFS, not the local filesystem. For extra safety, if the data is intended to be long-lived
+          and reused by other applications, you can use the <code class="ph codeph">CREATE EXTERNAL TABLE</code> syntax so that
+          the data files are not deleted by an Impala <code class="ph codeph">DROP TABLE</code> statement.
+
+        </li>
+
+        <li class="li">
+          If the Parquet table already exists, you can copy Parquet data files directly into it, then use the
+          <code class="ph codeph">REFRESH</code> statement to make Impala recognize the newly added data. Remember to preserve
+          the block size of the Parquet data files by using the <code class="ph codeph">hadoop distcp -pb</code> command rather
+          than a <code class="ph codeph">-put</code> or <code class="ph codeph">-cp</code> operation on the Parquet files. See
+          <a class="xref" href="#parquet_compression_multiple">Example of Copying Parquet Data Files</a> for an example of this kind of operation.
+        </li>
+      </ul>
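+
+      <p class="p">
+        As a brief sketch of the first and third techniques, assuming a hypothetical staging path and the
+        table name <code class="ph codeph">parquet_table_name</code>: the <code class="ph codeph">LOAD DATA</code> statement moves files
+        that are already in HDFS into the table, while <code class="ph codeph">REFRESH</code> makes Impala aware of files
+        copied directly into the table directory.
+      </p>
+
+<pre class="pre codeblock"><code>[impala-host:21000] &gt; load data inpath '/user/etl/staging/parquet_files' into table parquet_table_name;
+[impala-host:21000] &gt; refresh parquet_table_name;
+</code></pre>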
+
+      <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+        <p class="p">
+          Currently, Impala always decodes the column data in Parquet files based on the ordinal position of the
+          columns, not by looking up the position of each column based on its name. Parquet files produced outside
+          of Impala must write column data in the same order as the columns are declared in the Impala table. Any
+          optional columns that are omitted from the data files must be the rightmost columns in the Impala table
+          definition.
+        </p>
+
+        <p class="p">
+          If you created compressed Parquet files through some tool other than Impala, make sure that any
+          compression codecs are supported in Parquet by Impala. For example, Impala does not currently support LZO
+          compression in Parquet files. Also double-check that you used any recommended compatibility settings in
+          the other tool, such as <code class="ph codeph">spark.sql.parquet.binaryAsString</code> when writing Parquet files
+          through Spark.
+        </p>
+      </div>
+
+      <p class="p">
+        Recent versions of Sqoop can produce Parquet output files using the <code class="ph codeph">--as-parquetfile</code>
+        option.
+      </p>
+
+      <p class="p"> If you use Sqoop to
+        convert RDBMS data to Parquet, be careful with interpreting any
+        resulting values from <code class="ph codeph">DATE</code>, <code class="ph codeph">DATETIME</code>,
+        or <code class="ph codeph">TIMESTAMP</code> columns. The underlying values are
+        represented as the Parquet <code class="ph codeph">INT64</code> type, which is
+        represented as <code class="ph codeph">BIGINT</code> in the Impala table. The Parquet
+        values represent the time in milliseconds, while Impala interprets
+          <code class="ph codeph">BIGINT</code> as the time in seconds. Therefore, if you have
+        a <code class="ph codeph">BIGINT</code> column in a Parquet table that was imported
+        this way from Sqoop, divide the values by 1000 when interpreting as the
+          <code class="ph codeph">TIMESTAMP</code> type.</p>
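+
+      <p class="p">
+        For example, assuming a hypothetical <code class="ph codeph">BIGINT</code> column named
+        <code class="ph codeph">event_time</code> that was imported this way, a query could divide by 1000 and
+        display the result as a readable date/time value:
+      </p>
+
+<pre class="pre codeblock"><code>[impala-host:21000] &gt; select from_unixtime(cast(event_time / 1000 as bigint)) from sqoop_imported_table limit 5;
+</code></pre>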
+
+      <p class="p">
+        If the data exists outside Impala and is in some other format, combine both of the preceding techniques.
+        First, use a <code class="ph codeph">LOAD DATA</code> or <code class="ph codeph">CREATE EXTERNAL TABLE ... LOCATION</code> statement to
+        bring the data into an Impala table that uses the appropriate file format. Then, use an
+        <code class="ph codeph">INSERT...SELECT</code> statement to copy the data to the Parquet table, converting to Parquet
+        format as part of the process.
+      </p>
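+
+      <p class="p">
+        For example, the following sketch (with hypothetical column names and paths) points an external
+        text-format table at existing comma-delimited files, then converts the data to Parquet:
+      </p>
+
+<pre class="pre codeblock"><code>CREATE EXTERNAL TABLE csv_staging (id INT, name STRING)
+  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
+  LOCATION '/user/etl/incoming_csv';
+
+INSERT INTO parquet_table_name SELECT id, name FROM csv_staging;
+</code></pre>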
+
+
+
+      <p class="p">
+        Loading data into Parquet tables is a memory-intensive operation, because the incoming data is buffered
+        until it reaches <span class="ph">one data block</span> in size, then that chunk of data is
+        organized and compressed in memory before being written out. The memory consumption can be larger when
+        inserting data into partitioned Parquet tables, because a separate data file is written for each
+        combination of partition key column values, potentially requiring several
+        <span class="ph">large</span> chunks to be manipulated in memory at once.
+      </p>
+
+      <p class="p">
+        When inserting into a partitioned Parquet table, Impala redistributes the data among the nodes to reduce
+        memory consumption. You might still need to temporarily increase the memory dedicated to Impala during the
+        insert operation, or break up the load operation into several <code class="ph codeph">INSERT</code> statements, or both.
+      </p>
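+
+      <p class="p">
+        For example, a large load could be broken up along partition boundaries, as in this sketch with
+        hypothetical table and column names:
+      </p>
+
+<pre class="pre codeblock"><code>INSERT INTO TABLE sales_parquet PARTITION (year=2010, month)
+  SELECT id, amount, month FROM sales_staging WHERE year = 2010;
+
+INSERT INTO TABLE sales_parquet PARTITION (year=2011, month)
+  SELECT id, amount, month FROM sales_staging WHERE year = 2011;
+</code></pre>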
+
+      <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+        All the preceding techniques assume that the data you are loading matches the structure of the destination
+        table, including column order, column names, and partition layout. To transform or reorganize the data,
+        start by loading the data into a Parquet table that matches the underlying structure of the data, then use
+        one of the table-copying techniques such as <code class="ph codeph">CREATE TABLE AS SELECT</code> or <code class="ph codeph">INSERT ...
+        SELECT</code> to reorder or rename columns, divide the data among multiple partitions, and so on. For
+        example to take a single comprehensive Parquet data file and load it into a partitioned table, you would
+        use an <code class="ph codeph">INSERT ... SELECT</code> statement with dynamic partitioning to let Impala create separate
+        data files with the appropriate partition values; for an example, see
+        <a class="xref" href="impala_insert.html#insert">INSERT Statement</a>.
+      </div>
+
+    </div>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title4" id="parquet__parquet_performance">
+
+    <h2 class="title topictitle2" id="ariaid-title4">Query Performance for Impala Parquet Tables</h2>
+
+
+    <div class="body conbody">
+
+      <p class="p">
+        Query performance for Parquet tables depends on the number of columns needed to process the
+        <code class="ph codeph">SELECT</code> list and <code class="ph codeph">WHERE</code> clauses of the query, the way data is divided into
+        <span class="ph">large data files with block size equal to file size</span>, the reduction in I/O
+        by reading the data for each column in compressed format, which data files can be skipped (for partitioned
+        tables), and the CPU overhead of decompressing the data for each column.
+      </p>
+
+      <div class="p">
+        For example, the following is an efficient query for a Parquet table:
+<pre class="pre codeblock"><code>select avg(income) from census_data where state = 'CA';</code></pre>
+        The query processes only 2 columns out of a large number of total columns. If the table is partitioned by
+        the <code class="ph codeph">STATE</code> column, it is even more efficient because the query only has to read and decode
+        1 column from each data file, and it can read only the data files in the partition directory for the state
+        <code class="ph codeph">'CA'</code>, skipping the data files for all the other states, which will be physically located
+        in other directories.
+      </div>
+
+      <div class="p">
+        The following is a relatively inefficient query for a Parquet table:
+<pre class="pre codeblock"><code>select * from census_data;</code></pre>
+        Impala would have to read the entire contents of each <span class="ph">large</span> data file,
+        and decompress the contents of each column for each row group, negating the I/O optimizations of the
+        column-oriented format. This query might still be faster for a Parquet table than a table with some other
+        file format, but it does not take advantage of the unique strengths of Parquet data files.
+      </div>
+
+      <p class="p">
+        Impala can optimize queries on Parquet tables, especially join queries, better when statistics are
+        available for all the tables. Issue the <code class="ph codeph">COMPUTE STATS</code> statement for each table after
+        substantial amounts of data are loaded into or appended to it. See
+        <a class="xref" href="impala_compute_stats.html#compute_stats">COMPUTE STATS Statement</a> for details.
+      </p>
+
+      <p class="p">
+        The runtime filtering feature, available in <span class="keyword">Impala 2.5</span> and higher, works best with Parquet tables.
+        The per-row filtering aspect only applies to Parquet tables.
+        See <a class="xref" href="impala_runtime_filtering.html#runtime_filtering">Runtime Filtering for Impala Queries (Impala 2.5 or higher only)</a> for details.
+      </p>
+
+      <p class="p">
+        In <span class="keyword">Impala 2.6</span> and higher, Impala queries are optimized for files stored in Amazon S3.
+        For Impala tables that use the file formats Parquet, RCFile, SequenceFile,
+        Avro, and uncompressed text, the setting <code class="ph codeph">fs.s3a.block.size</code>
+        in the <span class="ph filepath">core-site.xml</span> configuration file determines
+        how Impala divides the I/O work of reading the data files. This configuration
+        setting is specified in bytes. By default, this
+        value is 33554432 (32 MB), meaning that Impala parallelizes S3 read operations on the files
+        as if they were made up of 32 MB blocks. For example, if your S3 queries primarily access
+        Parquet files written by MapReduce or Hive, increase <code class="ph codeph">fs.s3a.block.size</code>
+        to 134217728 (128 MB) to match the row group size of those files. If most S3 queries involve
+        Parquet files written by Impala, increase <code class="ph codeph">fs.s3a.block.size</code>
+        to 268435456 (256 MB) to match the row group size produced by Impala.
+      </p>
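+
+      <p class="p">
+        For example, a sketch of the corresponding <span class="ph filepath">core-site.xml</span> entry for
+        128 MB S3 reads would look like the following:
+      </p>
+
+<pre class="pre codeblock"><code>&lt;property&gt;
+  &lt;name&gt;fs.s3a.block.size&lt;/name&gt;
+  &lt;value&gt;134217728&lt;/value&gt;
+&lt;/property&gt;
+</code></pre>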
+
+      <p class="p">
+        In <span class="keyword">Impala 2.9</span> and higher, Parquet files written by Impala include
+        embedded metadata specifying the minimum and maximum values for each column, within
+        each row group and each data page within the row group. Impala-written Parquet files
+        typically contain a single row group; a row group can contain many data pages.
+        Impala uses this information (currently, only the metadata for each row group)
+        when reading each Parquet data file during a query, to quickly determine whether each
+        row group within the file potentially includes any rows that match the conditions in the
+        <code class="ph codeph">WHERE</code> clause. For example, if the column <code class="ph codeph">X</code> within
+        a particular Parquet file has a minimum value of 1 and a maximum value of 100, then
+        a query including the clause <code class="ph codeph">WHERE x &gt; 200</code> can quickly determine
+        that it is safe to skip that particular file, instead of scanning all the associated
+        column values. This optimization technique is especially effective for tables that
+        use the <code class="ph codeph">SORT BY</code> clause for the columns most frequently checked in
+        <code class="ph codeph">WHERE</code> clauses, because any <code class="ph codeph">INSERT</code> operation on
+        such tables produces Parquet data files with relatively narrow ranges of column values
+        within each file.
+      </p>
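+
+      <p class="p">
+        For example, the following sketch (hypothetical table and column names) creates a Parquet table
+        whose <code class="ph codeph">INSERT</code> operations keep narrow ranges of values for the column
+        <code class="ph codeph">x</code> within each data file, so that entire row groups can be skipped for
+        predicates such as <code class="ph codeph">WHERE x &gt; 200</code>:
+      </p>
+
+<pre class="pre codeblock"><code>CREATE TABLE events_sorted (event_id BIGINT, x INT)
+  SORT BY (x)
+  STORED AS PARQUET;
+
+INSERT INTO events_sorted SELECT event_id, x FROM events_raw;
+</code></pre>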
+
+    </div>
+
+    <article class="topic concept nested2" aria-labelledby="ariaid-title5" id="parquet_performance__parquet_partitioning">
+
+      <h3 class="title topictitle3" id="ariaid-title5">Partitioning for Parquet Tables</h3>
+
+      <div class="body conbody">
+
+        <p class="p">
+          As explained in <a class="xref" href="impala_partitioning.html#partitioning">Partitioning for Impala Tables</a>, partitioning is an important
+          performance technique for Impala generally. This section explains some of the performance considerations
+          for partitioned Parquet tables.
+        </p>
+
+        <p class="p">
+          The Parquet file format is ideal for tables containing many columns, where most queries only refer to a
+          small subset of the columns. As explained in <a class="xref" href="#parquet_data_files">How Parquet Data Files Are Organized</a>, the physical layout of
+          Parquet data files lets Impala read only a small fraction of the data for many queries. The performance
+          benefits of this approach are amplified when you use Parquet tables in combination with partitioning.
+          Impala can skip the data files for certain partitions entirely, based on the comparisons in the
+          <code class="ph codeph">WHERE</code> clause that refer to the partition key columns. For example, queries on
+          partitioned tables often analyze data for time intervals based on columns such as <code class="ph codeph">YEAR</code>,
+          <code class="ph codeph">MONTH</code>, and/or <code class="ph codeph">DAY</code>, or for geographic regions. Remember that Parquet
+          data files use a <span class="ph">large</span> block size, so when deciding how finely to
+          partition the data, try to find a granularity where each partition contains
+          <span class="ph">256 MB</span> or more of data, rather than creating a large number of smaller
+          files split among many partitions.
+        </p>
+
+        <p class="p">
+          Inserting into a partitioned Parquet table can be a resource-intensive operation, because each Impala
+          node could potentially be writing a separate data file to HDFS for each combination of different values
+          for the partition key columns. The large number of simultaneous open files could exceed the HDFS
+          <span class="q">"transceivers"</span> limit. To avoid exceeding this limit, consider the following techniques:
+        </p>
+
+        <ul class="ul">
+          <li class="li">
+            Load different subsets of data using separate <code class="ph codeph">INSERT</code> statements with specific values
+            for the <code class="ph codeph">PARTITION</code> clause, such as <code class="ph codeph">PARTITION (year=2010)</code>.
+          </li>
+
+          <li class="li">
+            Increase the <span class="q">"transceivers"</span> value for HDFS, sometimes spelled <span class="q">"xcievers"</span> (sic). The property
+            value in the <span class="ph filepath">hdfs-site.xml</span> configuration file is
+
+            <code class="ph codeph">dfs.datanode.max.transfer.threads</code>. For example, if you were loading 12 years of data
+            partitioned by year, month, and day, even a value of 4096 might not be high enough. This
+            <a class="xref" href="http://blog.cloudera.com/blog/2012/03/hbase-hadoop-xceivers/" target="_blank">blog post</a> explores the considerations for setting this value
+            higher or lower, using HBase examples for illustration. A hypothetical setting is sketched after this list.
+          </li>
+
+          <li class="li">
+            Use the <code class="ph codeph">COMPUTE STATS</code> statement to collect
+            <a class="xref" href="impala_perf_stats.html#perf_column_stats">column statistics</a> on the source table from
+            which data is being copied, so that the Impala query can estimate the number of different values in the
+            partition key columns and distribute the work accordingly.
+          </li>
+        </ul>
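+
+        <p class="p">
+          For example, a hypothetical <span class="ph filepath">hdfs-site.xml</span> entry raising this limit
+          might look like the following (the appropriate value depends on your workload and cluster):
+        </p>
+
+<pre class="pre codeblock"><code>&lt;property&gt;
+  &lt;name&gt;dfs.datanode.max.transfer.threads&lt;/name&gt;
+  &lt;value&gt;8192&lt;/value&gt;
+&lt;/property&gt;
+</code></pre>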
+
+      </div>
+
+    </article>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title6" id="parquet__parquet_compression">
+
+    <h2 class="title topictitle2" id="ariaid-title6">Snappy and GZip Compression for Parquet Data Files</h2>
+
+
+    <div class="body conbody">
+
+      <p class="p">
+
+        When Impala writes Parquet data files using the <code class="ph codeph">INSERT</code> statement, the underlying
+        compression is controlled by the <code class="ph codeph">COMPRESSION_CODEC</code> query option. (Prior to Impala 2.0, the
+        query option name was <code class="ph codeph">PARQUET_COMPRESSION_CODEC</code>.) The allowed values for this query option
+        are <code class="ph codeph">snappy</code> (the default), <code class="ph codeph">gzip</code>, and <code class="ph codeph">none</code>. The option
+        value is not case-sensitive. If the option is set to an unrecognized value, all kinds of queries will fail
+        due to the invalid option setting, not just queries involving Parquet tables.
+      </p>
+
+    </div>
+
+    <article class="topic concept nested2" aria-labelledby="ariaid-title7" id="parquet_compression__parquet_snappy">
+
+      <h3 class="title topictitle3" id="ariaid-title7">Example of Parquet Table with Snappy Compression</h3>
+
+      <div class="body conbody">
+
+        <p class="p">
+
+          By default, the underlying data files for a Parquet table are compressed with Snappy. The combination of
+          fast compression and decompression makes it a good choice for many data sets. To ensure Snappy
+          compression is used, for example after experimenting with other compression codecs, set the
+          <code class="ph codeph">COMPRESSION_CODEC</code> query option to <code class="ph codeph">snappy</code> before inserting the data:
+        </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; create database parquet_compression;
+[localhost:21000] &gt; use parquet_compression;
+[localhost:21000] &gt; create table parquet_snappy like raw_text_data;
+[localhost:21000] &gt; set COMPRESSION_CODEC=snappy;
+[localhost:21000] &gt; insert into parquet_snappy select * from raw_text_data;
+Inserted 1000000000 rows in 181.98s
+</code></pre>
+
+      </div>
+
+    </article>
+
+    <article class="topic concept nested2" aria-labelledby="ariaid-title8" id="parquet_compression__parquet_gzip">
+
+      <h3 class="title topictitle3" id="ariaid-title8">Example of Parquet Table with GZip Compression</h3>
+
+      <div class="body conbody">
+
+        <p class="p">
+          If you need more intensive compression (at the expense of more CPU cycles for uncompressing during
+          queries), set the <code class="ph codeph">COMPRESSION_CODEC</code> query option to <code class="ph codeph">gzip</code> before
+          inserting the data:
+        </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; create table parquet_gzip like raw_text_data;
+[localhost:21000] &gt; set COMPRESSION_CODEC=gzip;
+[localhost:21000] &gt; insert into parquet_gzip select * from raw_text_data;
+Inserted 1000000000 rows in 1418.24s
+</code></pre>
+
+      </div>
+
+    </article>
+
+    <article class="topic concept nested2" aria-labelledby="ariaid-title9" id="parquet_compression__parquet_none">
+
+      <h3 class="title topictitle3" id="ariaid-title9">Example of Uncompressed Parquet Table</h3>
+
+      <div class="body conbody">
+
+        <p class="p">
+          If your data compresses very poorly, or you want to avoid the CPU overhead of compression and
+          decompression entirely, set the <code class="ph codeph">COMPRESSION_CODEC</code> query option to <code class="ph codeph">none</code>
+          before inserting the data:
+        </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; create table parquet_none like raw_text_data;
+[localhost:21000] &gt; set COMPRESSION_CODEC=none;
+[localhost:21000] &gt; insert into parquet_none select * from raw_text_data;
+Inserted 1000000000 rows in 146.90s
+</code></pre>
+
+      </div>
+
+    </article>
+
+    <article class="topic concept nested2" aria-labelledby="ariaid-title10" id="parquet_compression__parquet_compression_examples">
+
+      <h3 class="title topictitle3" id="ariaid-title10">Examples of Sizes and Speeds for Compressed Parquet Tables</h3>
+
+      <div class="body conbody">
+
+        <p class="p">
+          Here are some examples showing differences in data sizes and query speeds for 1 billion rows of synthetic
+          data, compressed with each kind of codec. As always, run similar tests with realistic data sets of your
+          own. The actual compression ratios, and relative insert and query speeds, will vary depending on the
+          characteristics of the actual data.
+        </p>
+
+        <p class="p">
+          In this case, switching from Snappy to GZip compression shrinks the data by an additional 40% or so,
+          while switching from Snappy compression to no compression expands the data also by about 40%:
+        </p>
+
+<pre class="pre codeblock"><code>$ hdfs dfs -du -h /user/hive/warehouse/parquet_compression.db
+23.1 G  /user/hive/warehouse/parquet_compression.db/parquet_snappy
+13.5 G  /user/hive/warehouse/parquet_compression.db/parquet_gzip
+32.8 G  /user/hive/warehouse/parquet_compression.db/parquet_none
+</code></pre>
+
+        <p class="p">
+          Because Parquet data files are typically <span class="ph">large</span>, each directory will
+          have a different number of data files and the row groups will be arranged differently.
+        </p>
+
+        <p class="p">
+          At the same time, the less aggressive the compression, the faster the data can be decompressed. In this
+          case using a table with a billion rows, a query that evaluates all the values for a particular column
+          runs faster with no compression than with Snappy compression, and faster with Snappy compression than
+          with Gzip compression. Query performance depends on several other factors, so as always, run your own
+          benchmarks with your own data to determine the ideal tradeoff between data size, CPU efficiency, and
+          speed of insert and query operations.
+        </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; desc parquet_snappy;
+Query finished, fetching results ...
++-----------+---------+---------+
+| name      | type    | comment |
++-----------+---------+---------+
+| id        | int     |         |
+| val       | int     |         |
+| zfill     | string  |         |
+| name      | string  |         |
+| assertion | boolean |         |
++-----------+---------+---------+
+Returned 5 row(s) in 0.14s
+[localhost:21000] &gt; select avg(val) from parquet_snappy;
+Query finished, fetching results ...
++-----------------+
+| _c0             |
++-----------------+
+| 250000.93577915 |
++-----------------+
+Returned 1 row(s) in 4.29s
+[localhost:21000] &gt; select avg(val) from parquet_gzip;
+Query finished, fetching results ...
++-----------------+
+| _c0             |
++-----------------+
+| 250000.93577915 |
++-----------------+
+Returned 1 row(s) in 6.97s
+[localhost:21000] &gt; select avg(val) from parquet_none;
+Query finished, fetching results ...
++-----------------+
+| _c0             |
++-----------------+
+| 250000.93577915 |
++-----------------+
+Returned 1 row(s) in 3.67s
+</code></pre>
+
+      </div>
+
+    </article>
+
+    <article class="topic concept nested2" aria-labelledby="ariaid-title11" id="parquet_compression__parquet_compression_multiple">
+
+      <h3 class="title topictitle3" id="ariaid-title11">Example of Copying Parquet Data Files</h3>
+
+      <div class="body conbody">
+
+        <p class="p">
+          Here is a final example, to illustrate how the data files using the various compression codecs are all
+          compatible with each other for read operations. The metadata about the compression format is written into
+          each data file, and can be decoded during queries regardless of the <code class="ph codeph">COMPRESSION_CODEC</code>
+          setting in effect at the time. In this example, we copy data files from the
+          <code class="ph codeph">PARQUET_SNAPPY</code>, <code class="ph codeph">PARQUET_GZIP</code>, and <code class="ph codeph">PARQUET_NONE</code> tables
+          used in the previous examples, each containing 1 billion rows, all to the data directory of a new table
+          <code class="ph codeph">PARQUET_EVERYTHING</code>. A couple of sample queries demonstrate that the new table now
+          contains 3 billion rows featuring a variety of compression codecs for the data files.
+        </p>
+
+        <p class="p">
+          First, we create the table in Impala so that there is a destination directory in HDFS to put the data
+          files:
+        </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; create table parquet_everything like parquet_snappy;
+Query: create table parquet_everything like parquet_snappy
+</code></pre>
+
+        <p class="p">
+          Then in the shell, we copy the relevant data files into the data directory for this new table. Rather
+          than using <code class="ph codeph">hdfs dfs -cp</code> as with typical files, we use <code class="ph codeph">hadoop distcp -pb</code>
+          to ensure that the special <span class="ph"> block size</span> of the Parquet data files is
+          preserved.
+        </p>
+
+<pre class="pre codeblock"><code>$ hadoop distcp -pb /user/hive/warehouse/parquet_compression.db/parquet_snappy \
+  /user/hive/warehouse/parquet_compression.db/parquet_everything
+...<var class="keyword varname">MapReduce output</var>...
+$ hadoop distcp -pb /user/hive/warehouse/parquet_compression.db/parquet_gzip  \
+  /user/hive/warehouse/parquet_compression.db/parquet_everything
+...<var class="keyword varname">MapReduce output</var>...
+$ hadoop distcp -pb /user/hive/warehouse/parquet_compression.db/parquet_none  \
+  /user/hive/warehouse/parquet_compression.db/parquet_everything
+...<var class="keyword varname">MapReduce output</var>...
+</code></pre>
+
+        <p class="p">
+          Back in the <span class="keyword cmdname">impala-shell</span> interpreter, we use the <code class="ph codeph">REFRESH</code> statement to
+          alert the Impala server to the new data files for this table, then we can run queries demonstrating that
+          the data files represent 3 billion rows, and the values for one of the numeric columns match what was in
+          the original smaller tables:
+        </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] &gt; refresh parquet_everything;
+Query finished, fetching results ...
+
+Returned 0 row(s) in 0.32s
+[localhost:21000] &gt; select count(*) from parquet_everything;
+Query finished, fetching results ...
++------------+
+| _c0        |
++------------+
+| 3000000000 |
++------------+
+Returned 1 row(s) in 8.18s
+[localhost:21000] &gt; select avg(val) from parquet_everything;
+Query finished, fetching results ...
++-----------------+
+| _c0             |
++-----------------+
+| 250000.93577915 |
++-----------------+
+Returned 1 row(s) in 13.35s
+</code></pre>
+
+      </div>
+
+    </article>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title12" id="parquet__parquet_complex_types">
+
+    <h2 class="title topictitle2" id="ariaid-title12">Parquet Tables for Impala Complex Types</h2>
+
+    <div class="body conbody">
+
+    <p class="p">
+      In <span class="keyword">Impala 2.3</span> and higher, Impala supports the complex types
+      <code class="ph codeph">ARRAY</code>, <code class="ph codeph">STRUCT</code>, and <code class="ph codeph">MAP</code>.
+      See <a class="xref" href="impala_complex_types.html#complex_types">Complex Types (Impala 2.3 or higher only)</a> for details.
+      Because these data types are currently supported only for the Parquet file format,
+      if you plan to use them, become familiar with the performance and storage aspects
+      of Parquet first.
+    </p>
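+
+    <p class="p">
+      For example, a sketch of a Parquet table declaring complex type columns (hypothetical names) might
+      look like the following:
+    </p>
+
+<pre class="pre codeblock"><code>CREATE TABLE contacts_parquet
+(
+  id BIGINT,
+  name STRING,
+  phones ARRAY&lt;STRING&gt;,
+  address STRUCT&lt;street:STRING, city:STRING&gt;
+)
+STORED AS PARQUET;
+</code></pre>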
+
+    </div>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title13" id="parquet__parquet_interop">
+
+    <h2 class="title topictitle2" id="ariaid-title13">Exchanging Parquet Data Files with Other Hadoop Components</h2>
+
+
+    <div class="body conbody">
+
+      <p class="p">
+        You can read and write Parquet data files from other Hadoop components.
+        See <span class="xref">the documentation for your Apache Hadoop distribution</span> for details.
+      </p>
+
+
+
+
+
+
+
+
+
+      <p class="p">
+        Previously, it was not possible to create Parquet data through Impala and reuse that table within Hive. Now
+        that Parquet support is available for Hive, reusing existing Impala Parquet data files in Hive
+        requires updating the table metadata. Use the following command if you are already running Impala 1.1.1 or
+        higher:
+      </p>
+
+<pre class="pre codeblock"><code>ALTER TABLE <var class="keyword varname">table_name</var> SET FILEFORMAT PARQUET;
+</code></pre>
+
+      <p class="p">
+        If you are running a level of Impala that is older than 1.1.1, do the metadata update through Hive:
+      </p>
+
+<pre class="pre codeblock"><code>ALTER TABLE <var class="keyword varname">table_name</var> SET SERDE 'parquet.hive.serde.ParquetHiveSerDe';
+ALTER TABLE <var class="keyword varname">table_name</var> SET FILEFORMAT
+  INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
+  OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat";
+</code></pre>
+
+      <p class="p">
+        Impala 1.1.1 and higher can reuse Parquet data files created by Hive, without any action required.
+      </p>
+
+
+
+      <p class="p">
+        Originally, Impala handled only the scalar data types that you can encode in a Parquet data file, not
+        composite or nested types such as maps or arrays. In <span class="keyword">Impala 2.2</span> and higher, Impala can query Parquet data
+        files that include composite or nested types, as long as the query only refers to columns with scalar
+        types. In <span class="keyword">Impala 2.3</span> and higher, Impala can also query the complex types themselves, as described in
+        <a class="xref" href="#parquet_complex_types">Parquet Tables for Impala Complex Types</a>.
+
+      </p>
+
+      <p class="p">
+        If you copy Parquet data files between nodes, or even between different directories on the same node, make
+        sure to preserve the block size by using the command <code class="ph codeph">hadoop distcp -pb</code>. To verify that the
+        block size was preserved, issue the command <code class="ph codeph">hdfs fsck -blocks
+        <var class="keyword varname">HDFS_path_of_impala_table_dir</var></code> and check that the average block size is at or
+        near <span class="ph">256 MB (or whatever other size is defined by the
+        <code class="ph codeph">PARQUET_FILE_SIZE</code> query option).</span> (The <code class="ph codeph">hadoop distcp</code> operation
+        typically leaves some directories behind, with names matching <span class="ph filepath">_distcp_logs_*</span>, that you
+        can delete from the destination directory afterward.)
+
+
+
+        Run the <span class="keyword cmdname">hadoop distcp</span> command with no arguments to see usage details for the
+        <span class="keyword cmdname">distcp</span> command syntax.
+      </p>
+
+
+
+      <p class="p">
+        Impala can query Parquet files that use the <code class="ph codeph">PLAIN</code>, <code class="ph codeph">PLAIN_DICTIONARY</code>,
+        <code class="ph codeph">BIT_PACKED</code>, and <code class="ph codeph">RLE</code> encodings.
+        Currently, Impala does not support <code class="ph codeph">RLE_DICTIONARY</code> encoding.
+        When creating files outside of Impala for use by Impala, make sure to use one of the supported encodings.
+        In particular, in the configuration of Parquet MR (MapReduce) jobs, do not set the
+        <code class="ph codeph">parquet.writer.version</code> property to <code class="ph codeph">PARQUET_2_0</code>.
+        Use the default version (format 1.0), which remains compatible with older readers.
+        Data written using the 2.0 format might not be consumable by Impala, due to its use of the <code class="ph codeph">RLE_DICTIONARY</code> encoding.
+      </p>
+      <div class="p">
+        To examine the internal structure and data of Parquet files, you can use the
+        <span class="keyword cmdname">parquet-tools</span> command. Make sure this
+        command is in your <code class="ph codeph">$PATH</code>. (Typically, it is symlinked from
+        <span class="ph filepath">/usr/bin</span>; sometimes, depending on your installation setup, you
+        might need to locate it under an alternative  <code class="ph codeph">bin</code> directory.)
+        The arguments to this command let you perform operations such as:
+        <ul class="ul">
+          <li class="li">
+            <code class="ph codeph">cat</code>: Print a file's contents to standard out. In <span class="keyword">Impala 2.3</span> and higher, you can use
+            the <code class="ph codeph">-j</code> option to output JSON.
+          </li>
+          <li class="li">
+            <code class="ph codeph">head</code>: Print the first few records of a file to standard output.
+          </li>
+          <li class="li">
+            <code class="ph codeph">schema</code>: Print the Parquet schema for the file.
+          </li>
+          <li class="li">
+            <code class="ph codeph">meta</code>: Print the file footer metadata, including key-value properties (like Avro schema), compression ratios,
+            encodings, compression used, and row group information.
+          </li>
+          <li class="li">
+            <code class="ph codeph">dump</code>: Print all data and metadata.
+          </li>
+        </ul>
+        Use <code class="ph codeph">parquet-tools -h</code> to see usage information for all the arguments.
+        Here are some examples showing <span class="keyword cmdname">parquet-tools</span> usage:
+
+<pre class="pre codeblock"><code>
+$ # Be careful doing this for a big file! Use parquet-tools head to be safe.
+$ parquet-tools cat sample.parq
+year = 1992
+month = 1
+day = 2
+dayofweek = 4
+dep_time = 748
+crs_dep_time = 750
+arr_time = 851
+crs_arr_time = 846
+carrier = US
+flight_num = 53
+actual_elapsed_time = 63
+crs_elapsed_time = 56
+arrdelay = 5
+depdelay = -2
+origin = CMH
+dest = IND
+distance = 182
+cancelled = 0
+diverted = 0
+
+year = 1992
+month = 1
+day = 3
+...
+
+</code></pre>
+
+<pre class="pre codeblock"><code>
+$ parquet-tools head -n 2 sample.parq
+year = 1992
+month = 1
+day = 2
+dayofweek = 4
+dep_time = 748
+crs_dep_time = 750
+arr_time = 851
+crs_arr_time = 846
+carrier = US
+flight_num = 53
+actual_elapsed_time = 63
+crs_elapsed_time = 56
+arrdelay = 5
+depdelay = -2
+origin = CMH
+dest = IND
+distance = 182
+cancelled = 0
+diverted = 0
+
+year = 1992
+month = 1
+day = 3
+...
+
+</code></pre>
+
+<pre class="pre codeblock"><code>
+$ parquet-tools schema sample.parq
+message schema {
+  optional int32 year;
+  optional int32 month;
+  optional int32 day;
+  optional int32 dayofweek;
+  optional int32 dep_time;
+  optional int32 crs_dep_time;
+  optional int32 arr_time;
+  optional int32 crs_arr_time;
+  optional binary carrier;
+  optional int32 flight_num;
+...
+
+</code></pre>
+
+<pre class="pre codeblock"><code>
+$ parquet-tools meta sample.parq
+creator:             impala version 2.2.0-...
+
+file schema:         schema
+-------------------------------------------------------------------
+year:                OPTIONAL INT32 R:0 D:1
+month:               OPTIONAL INT32 R:0 D:1
+day:                 OPTIONAL INT32 R:0 D:1
+dayofweek:           OPTIONAL INT32 R:0 D:1
+dep_time:            OPTIONAL INT32 R:0 D:1
+crs_dep_time:        OPTIONAL INT32 R:0 D:1
+arr_time:            OPTIONAL INT32 R:0 D:1
+crs_arr_time:        OPTIONAL INT32 R:0 D:1
+carrier:             OPTIONAL BINARY R:0 D:1
+flight_num:          OPTIONAL INT32 R:0 D:1
+...
+
+row group 1:         RC:20636601 TS:265103674
+-------------------------------------------------------------------
+year:                 INT32 SNAPPY DO:4 FPO:35 SZ:10103/49723/4.92 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+month:                INT32 SNAPPY DO:10147 FPO:10210 SZ:11380/35732/3.14 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+day:                  INT32 SNAPPY DO:21572 FPO:21714 SZ:3071658/9868452/3.21 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+dayofweek:            INT32 SNAPPY DO:3093276 FPO:3093319 SZ:2274375/5941876/2.61 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+dep_time:             INT32 SNAPPY DO:5367705 FPO:5373967 SZ:28281281/28573175/1.01 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+crs_dep_time:         INT32 SNAPPY DO:33649039 FPO:33654262 SZ:10220839/11574964/1.13 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+arr_time:             INT32 SNAPPY DO:43869935 FPO:43876489 SZ:28562410/28797767/1.01 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+crs_arr_time:         INT32 SNAPPY DO:72432398 FPO:72438151 SZ:10908972/12164626/1.12 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+carrier:              BINARY SNAPPY DO:83341427 FPO:83341558 SZ:114916/128611/1.12 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+flight_num:           INT32 SNAPPY DO:83456393 FPO:83488603 SZ:10216514/11474301/1.12 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN
+...
+
+</code></pre>
+      </div>
+
+    </div>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title14" id="parquet__parquet_data_files">
+
+    <h2 class="title topictitle2" id="ariaid-title14">How Parquet Data Files Are Organized</h2>
+
+
+    <div class="body conbody">
+
+      <p class="p">
+        Although Parquet is a column-oriented file format, do not expect to find one data file for each column.
+        Parquet keeps all the data for a row within the same data file, to ensure that the columns for a row are
+        always available on the same node for processing. What Parquet does is to set a large HDFS block size and a
+        matching maximum data file size, to ensure that I/O and network transfer requests apply to large batches of
+        data.
+      </p>
+
+      <p class="p">
+        Within that data file, the data for a set of rows is rearranged so that all the values from the first
+        column are organized in one contiguous block, then all the values from the second column, and so on.
+        Putting the values from the same column next to each other lets Impala use effective compression techniques
+        on the values in that column.
+      </p>
+
+      <div class="note note note_note"><span class="note__title notetitle">Note:</span>
+        <p class="p">
+          Impala <code class="ph codeph">INSERT</code> statements write Parquet data files using an HDFS block size
+          <span class="ph">that matches the data file size</span>, to ensure that each data file is
+          represented by a single HDFS block, and the entire file can be processed on a single node without
+          requiring any remote reads.
+        </p>
+
+        <p class="p">
+          If you create Parquet data files outside of Impala, such as through a MapReduce or Pig job, ensure that
+          the HDFS block size is greater than or equal to the file size, so that the <span class="q">"one file per block"</span>
+          relationship is maintained. Set the <code class="ph codeph">dfs.block.size</code> or the <code class="ph codeph">dfs.blocksize</code>
+          property large enough that each file fits within a single HDFS block, even if that size is larger than
+          the normal HDFS block size.
+        </p>
+
+        <p class="p">
+          If the block size is reset to a lower value during a file copy, you will see lower performance for
+          queries involving those files, and the <code class="ph codeph">PROFILE</code> statement will reveal that some I/O is
+          being done suboptimally, through remote reads. See
+          <a class="xref" href="impala_parquet.html#parquet_compression_multiple">Example of Copying Parquet Data Files</a> for an example showing how to preserve the
+          block size when copying Parquet data files.
+        </p>
+      </div>
+
+      <p class="p">
+        When Impala retrieves or tests the data for a particular column, it opens all the data files, but only
+        reads the portion of each file containing the values for that column. The column values are stored
+        consecutively, minimizing the I/O required to process the values within a single column. If other columns
+        are named in the <code class="ph codeph">SELECT</code> list or <code class="ph codeph">WHERE</code> clauses, the data for all columns
+        in the same row is available within that same data file.
+      </p>
+
+      <p class="p">
+        If an <code class="ph codeph">INSERT</code> statement brings in less than <span class="ph">one Parquet
+        block's worth</span> of data, the resulting data file is smaller than ideal. Thus, if you do split up an ETL
+        job to use multiple <code class="ph codeph">INSERT</code> statements, try to keep the volume of data for each
+        <code class="ph codeph">INSERT</code> statement to approximately <span class="ph">256 MB, or a multiple of
+        256 MB</span>.
+      </p>
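+
+      <p class="p">
+        For example, the following sketch, which uses hypothetical staging and destination
+        table names, splits an ETL job into per-partition <code class="ph codeph">INSERT</code>
+        statements so that each statement writes close to one Parquet block's worth of data:
+      </p>
+
+<pre class="pre codeblock"><code>-- Hypothetical names: staging_events is a large text table;
+-- events_parquet is a Parquet table partitioned by year and month.
+insert into events_parquet partition (year=2018, month=1)
+  select event_id, event_time, url, user_agent
+  from staging_events
+  where year = 2018 and month = 1;
+
+insert into events_parquet partition (year=2018, month=2)
+  select event_id, event_time, url, user_agent
+  from staging_events
+  where year = 2018 and month = 2;
+</code></pre>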
+
+    </div>
+
+    <article class="topic concept nested2" aria-labelledby="ariaid-title15" id="parquet_data_files__parquet_encoding">
+
+      <h3 class="title topictitle3" id="ariaid-title15">RLE and Dictionary Encoding for Parquet Data Files</h3>
+
+      <div class="body conbody">
+
+        <p class="p">
+          Parquet uses some automatic compression techniques, such as run-length encoding (RLE) and dictionary
+          encoding, based on analysis of the actual data values. Once the data values are encoded in a compact
+          form, the encoded data can optionally be further compressed using a compression algorithm. Parquet data
+          files created by Impala can use Snappy, GZip, or no compression; the Parquet spec also allows LZO
+          compression, but currently Impala does not support LZO-compressed Parquet files.
+        </p>
+
+        <p class="p">
+          RLE and dictionary encoding are compression techniques that Impala applies automatically to groups of
+          Parquet data values, in addition to any Snappy or GZip compression applied to the entire data files.
+          These automatic optimizations can save you the time and planning normally needed for a traditional
+          data warehouse. For example, dictionary encoding reduces the need to create numeric IDs as abbreviations
+          for longer string values.
+        </p>
+
+        <p class="p">
+          Run-length encoding condenses sequences of repeated data values. For example, if many consecutive rows
+          all contain the same value for a country code, those repeating values can be represented by the value
+          followed by a count of how many times it appears consecutively.
+        </p>
+
+        <p class="p">
+          Dictionary encoding takes the different values present in a column, and represents each one in compact
+          2-byte form rather than the original value, which could be several bytes. (Additional compression is
+          applied to the compacted values, for extra space savings.) This type of encoding applies when the number
+          of different values for a column is less than 2**16 (65,536). It does not apply to columns of data type
+          <code class="ph codeph">BOOLEAN</code>, which are already very short. <code class="ph codeph">TIMESTAMP</code> columns sometimes have
+          a unique value for each row, in which case they can quickly exceed the 2**16 limit on distinct values.
+          The 2**16 limit on different values within a column is reset for each data file, so if several different
+          data files each contained 10,000 different city names, the city name column in each data file could still
+          be condensed using dictionary encoding.
+        </p>
+
+      </div>
+
+    </article>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title16" id="parquet__parquet_compacting">
+
+    <h2 class="title topictitle2" id="ariaid-title16">Compacting Data Files for Parquet Tables</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        If you reuse existing table structures or ETL processes for Parquet tables, you might encounter a <span class="q">"many
+        small files"</span> situation, which is suboptimal for query efficiency. For example, statements like these
+        might produce inefficiently organized data files:
+      </p>
+
+<pre class="pre codeblock"><code>-- In an N-node cluster, each node produces a data file
+-- for the INSERT operation. If you have less than
+-- N * 256 MB of data to copy, some files are likely to be
+-- much smaller than the <span class="ph">default Parquet</span> block size.
+insert into parquet_table select * from text_table;
+
+-- Even if this operation involves an overall large amount of data,
+-- when split up by year/month/day, each partition might only
+-- receive a small amount of data. Then the data files for
+-- the partition might be divided between the N nodes in the cluster.
+-- A multi-gigabyte copy operation might produce files of only
+-- a few MB each.
+insert into partitioned_parquet_table partition (year, month, day)
+  select year, month, day, url, referer, user_agent, http_code, response_time
+  from web_stats;
+</code></pre>
+
+      <p class="p">
+        Here are techniques to help you produce large data files in Parquet <code class="ph codeph">INSERT</code> operations, and
+        to compact existing too-small data files:
+      </p>
+
+      <ul class="ul">
+        <li class="li">
+          <p class="p">
+            When inserting into a partitioned Parquet table, use statically partitioned <code class="ph codeph">INSERT</code>
+            statements where the partition key values are specified as constant values. Ideally, use a separate
+            <code class="ph codeph">INSERT</code> statement for each partition, as shown in the sketch after this list.
+          </p>
+        </li>
+
+        <li class="li">
+          <p class="p">
+        You might set the <code class="ph codeph">NUM_NODES</code> option to 1 briefly, during <code class="ph codeph">INSERT</code> or
+        <code class="ph codeph">CREATE TABLE AS SELECT</code> statements. Normally, those statements produce one or more data
+        files per data node. If the write operation involves small amounts of data, a Parquet table, and/or a
+        partitioned table, the default behavior could produce many small files when intuitively you might expect
+        only a single output file. <code class="ph codeph">SET NUM_NODES=1</code> turns off the <span class="q">"distributed"</span> aspect of the
+        write operation, making it more likely to produce only one or a few data files.
+      </p>
+        </li>
+
+        <li class="li">
+          <p class="p">
+            Be prepared to reduce the number of partition key columns from what you are used to with traditional
+            analytic database systems.
+          </p>
+        </li>
+
+        <li class="li">
+          <p class="p">
+            Do not expect Impala-written Parquet files to fill up the entire Parquet block size. Impala estimates
+            on the conservative side when figuring out how much data to write to each Parquet file. Typically, the
+            amount of uncompressed data in memory is substantially reduced on disk by the compression and encoding
+            techniques in the Parquet file format.
+
+            The final data file size varies depending on the compressibility of the data. Therefore, it is not an
+            indication of a problem if <span class="ph">256 MB</span> of text data is turned into 2
+            Parquet data files, each less than <span class="ph">256 MB</span>.
+          </p>
+        </li>
+
+        <li class="li">
+          <p class="p">
+            If you accidentally end up with a table with many small data files, consider using one or more of the
+            preceding techniques and copying all the data into a new Parquet table, either through <code class="ph codeph">CREATE
+            TABLE AS SELECT</code> or <code class="ph codeph">INSERT ... SELECT</code> statements.
+          </p>
+
+          <p class="p">
+            To avoid rewriting queries to change table names, you can adopt a convention of always running
+            important queries against a view. Changing the view definition immediately switches any subsequent
+            queries to use the new underlying tables:
+          </p>
+<pre class="pre codeblock"><code>create view production_table as select * from table_with_many_small_files;
+-- CTAS or INSERT...SELECT all the data into a more efficient layout...
+alter view production_table as select * from table_with_few_big_files;
+select * from production_table where c1 = 100 and c2 &lt; 50 and ...;
+</code></pre>
+        </li>
+      </ul>
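+
+      <p class="p">
+        For example, the following sketch combines several of the preceding techniques to
+        rewrite the earlier <code class="ph codeph">partitioned_parquet_table</code> into a new,
+        more compact Parquet table. The new table name and the exact partition values are
+        illustrative only:
+      </p>
+
+<pre class="pre codeblock"><code>-- Write through a single node so each partition gets only a few large files.
+set NUM_NODES=1;
+
+-- New table with the same column and partition layout.
+create table compacted_web_stats like partitioned_parquet_table stored as parquet;
+
+-- One statically partitioned INSERT per partition.
+insert into compacted_web_stats partition (year=2018, month=5, day=9)
+  select url, referer, user_agent, http_code, response_time
+  from partitioned_parquet_table
+  where year = 2018 and month = 5 and day = 9;
+
+-- Repeat for the remaining partitions, then restore the default setting
+-- (0 means use all nodes).
+set NUM_NODES=0;
+</code></pre>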
+
+    </div>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title17" id="parquet__parquet_schema_evolution">
+
+    <h2 class="title topictitle2" id="ariaid-title17">Schema Evolution for Parquet Tables</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        Schema evolution refers to using the statement <code class="ph codeph">ALTER TABLE ... REPLACE COLUMNS</code> to change
+        the names, data types, or number of columns in a table. You can perform schema evolution for Parquet tables
+        as follows:
+      </p>
+
+      <ul class="ul">
+        <li class="li">
+          <p class="p">
+            The Impala <code class="ph codeph">ALTER TABLE</code> statement never changes any data files in the tables. From the
+            Impala side, schema evolution involves interpreting the same data files in terms of a new table
+            definition. Some types of schema changes make sense and are represented correctly. Other types of
+            changes cannot be represented in a sensible way, and produce special result values or conversion errors
+            during queries.
+          </p>
+        </li>
+
+        <li class="li">
+          <p class="p">
+            The <code class="ph codeph">INSERT</code> statement always creates data using the latest table definition. You might
+            end up with data files with different numbers of columns or internal data representations if you do a
+            sequence of <code class="ph codeph">INSERT</code> and <code class="ph codeph">ALTER TABLE ... REPLACE COLUMNS</code> statements.
+          </p>
+        </li>
+
+        <li class="li">
+          <p class="p">
+            If you use <code class="ph codeph">ALTER TABLE ... REPLACE COLUMNS</code> to define additional columns at the end,
+            when the original data files are used in a query, these final columns are considered to be all
+            <code class="ph codeph">NULL</code> values. (A sketch after this list illustrates this case.)
+          </p>
+        </li>
+
+        <li class="li">
+          <p class="p">
+            If you use <code class="ph codeph">ALTER TABLE ... REPLACE COLUMNS</code> to define fewer columns than before, when
+            the original data files are used in a query, the unused columns still present in the data file are
+            ignored.
+          </p>
+        </li>
+
+        <li class="li">
+          <p class="p">
+            Parquet represents the <code class="ph codeph">TINYINT</code>, <code class="ph codeph">SMALLINT</code>, and <code class="ph codeph">INT</code>
+            types the same way internally, all stored as 32-bit integers.
+          </p>
+          <ul class="ul">
+            <li class="li">
+              That means it is easy to promote a <code class="ph codeph">TINYINT</code> column to <code class="ph codeph">SMALLINT</code> or
+              <code class="ph codeph">INT</code>, or a <code class="ph codeph">SMALLINT</code> column to <code class="ph codeph">INT</code>. The numbers are
+              represented exactly the same in the data file, and the columns being promoted would not contain any
+              out-of-range values.
+            </li>
+
+            <li class="li">
+              <p class="p">
+                If you change any of these column types to a smaller type, any values that are out-of-range for the
+                new type are returned incorrectly, typically as negative numbers.
+              </p>
+            </li>
+
+            <li class="li">
+              <p class="p">
+                You cannot change a <code class="ph codeph">TINYINT</code>, <code class="ph codeph">SMALLINT</code>, or <code class="ph codeph">INT</code>
+                column to <code class="ph codeph">BIGINT</code>, or the other way around. Although the <code class="ph codeph">ALTER
+                TABLE</code> succeeds, any attempt to query those columns results in conversion errors.
+              </p>
+            </li>
+
+            <li class="li">
+              <p class="p">
+                Any other type conversion for columns produces a conversion error during queries. For example,
+                <code class="ph codeph">INT</code> to <code class="ph codeph">STRING</code>, <code class="ph codeph">FLOAT</code> to <code class="ph codeph">DOUBLE</code>,
+                <code class="ph codeph">TIMESTAMP</code> to <code class="ph codeph">STRING</code>, <code class="ph codeph">DECIMAL(9,0)</code> to
+                <code class="ph codeph">DECIMAL(5,2)</code>, and so on.
+              </p>
+            </li>
+          </ul>
+        </li>
+      </ul>
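+
+      <p class="p">
+        The following sketch, with illustrative table and column names, shows the
+        trailing-column case: after <code class="ph codeph">ALTER TABLE ... REPLACE COLUMNS</code>
+        adds a column at the end, rows from the original data files show
+        <code class="ph codeph">NULL</code> for that column:
+      </p>
+
+<pre class="pre codeblock"><code>create table evolve_demo (id int, name string) stored as parquet;
+insert into evolve_demo values (1, 'alpha'), (2, 'beta');
+
+-- Redefine the table with an extra trailing column.
+-- The existing Parquet data files are not rewritten.
+alter table evolve_demo replace columns (id int, name string, score double);
+
+-- Rows from the original data files report NULL for the new SCORE column.
+select * from evolve_demo;
+</code></pre>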
+
+      <div class="p">
+        You might find that you have Parquet files where the columns do not line up in the same
+        order as in your Impala table. For example, you might have a Parquet file that was part of
+        a table with columns <code class="ph codeph">C1,C2,C3,C4</code>, and now you want to reuse the same
+        Parquet file in a table with columns <code class="ph codeph">C4,C2</code>. By default, Impala expects the
+        columns in the data file to appear in the same order as the columns defined for the table,
+        making it impractical to do some kinds of file reuse or schema evolution. In <span class="keyword">Impala 2.6</span>
+        and higher, the query option <code class="ph codeph">PARQUET_FALLBACK_SCHEMA_RESOLUTION=name</code> lets Impala
+        resolve columns by name, and therefore handle out-of-order or extra columns in the data file.
+        For example:
+
+<pre class="pre codeblock"><code>
+create database schema_evolution;
+use schema_evolution;
+create table t1 (c1 int, c2 boolean, c3 string, c4 timestamp)
+  stored as parquet;
+insert into t1 values
+  (1, true, 'yes', now()),
+  (2, false, 'no', now() + interval 1 day);
+
+select * from t1;
++----+-------+-----+-------------------------------+
+| c1 | c2    | c3  | c4                            |
++----+-------+-----+-------------------------------+
+| 1  | true  | yes | 2016-06-28 14:53:26.554369000 |
+| 2  | false | no  | 2016-06-29 14:53:26.554369000 |
++----+-------+-----+-------------------------------+
+
+desc formatted t1;
+...
+| Location:   | /user/hive/warehouse/schema_evolution.db/t1 |
+...
+
+-- Create T2 with a subset of T1's columns, in a different order.
+create table t2 (c4 timestamp, c2 boolean)
+  stored as parquet;
+
+-- Make T2 have the same data file as in T1, including 2
+-- unused columns and a column order different from what T2 expects.
+load data inpath '/user/hive/warehouse/schema_evolution.db/t1'
+  into table t2;
++----------------------------------------------------------+
+| summary                                                  |
++----------------------------------------------------------+
+| Loaded 1 file(s). Total files in destination location: 1 |
++----------------------------------------------------------+
+
+-- 'position' is the default setting.
+-- Impala cannot read the Parquet file if the column order does not match.
+set PARQUET_FALLBACK_SCHEMA_RESOLUTION=position;
+PARQUET_FALLBACK_SCHEMA_RESOLUTION set to position
+
+select * from t2;
+WARNINGS:
+File 'schema_evolution.db/t2/45331705_data.0.parq'
+has an incompatible Parquet schema for column 'schema_evolution.t2.c4'.
+Column type: TIMESTAMP, Parquet schema: optional int32 c1 [i:0 d:1 r:0]
+
+File 'schema_evolution.db/t2/45331705_data.0.parq'
+has an incompatible Parquet schema for column 'schema_evolution.t2.c4'.
+Column type: TIMESTAMP, Parquet schema: optional int32 c1 [i:0 d:1 r:0]
+
+-- With the 'name' setting, Impala can read the Parquet data files
+-- despite mismatching column order.
+set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;
+PARQUET_FALLBACK_SCHEMA_RESOLUTION set to name
+
+select * from t2;
++-------------------------------+-------+
+| c4                            | c2    |
++-------------------------------+-------+
+| 2016-06-28 14:53:26.554369000 | true  |
+| 2016-06-29 14:53:26.554369000 | false |
++-------------------------------+-------+
+
+</code></pre>
+
+        See <a class="xref" href="impala_parquet_fallback_schema_resolution.html#parquet_fallback_schema_resolution">PARQUET_FALLBACK_SCHEMA_RESOLUTION Query Option (Impala 2.6 or higher only)</a>
+        for more details.
+      </div>
+
+    </div>
+
+  </article>
+
+  <article class="topic concept nested1" aria-labelledby="ariaid-title18" id="parquet__parquet_data_types">
+
+    <h2 class="title topictitle2" id="ariaid-title18">Data Type Considerations for Parquet Tables</h2>
+
+    <div class="body conbody">
+
+      <p class="p">
+        The Parquet format defines a set of data types whose names differ from the names of the corresponding
+        Impala data types. If you are preparing Parquet files using other Hadoop components such as Pig or
+        MapReduce, you might need to work with the type names defined by Parquet. The following figure lists the
+        Parquet-defined types and the equivalent types in Impala.
+      </p>
+
+      <p class="p">
+        <strong class="ph b">Primitive types:</strong>
+      </p>
+
+<pre class="pre codeblock"><code>BINARY -&gt; STRING
+BOOLEAN -&gt; BOOLEAN
+DOUBLE -&gt; DOUBLE
+FLOAT -&gt; FLOAT
+INT32 -&gt; INT
+INT64 -&gt; BIGINT
+INT96 -&gt; TIMESTAMP
+</code></pre>
+
+      <p class="p">
+        <strong class="ph b">Logical types:</strong>
+      </p>
+
+<pre class="pre codeblock"><code>BINARY + OriginalType UTF8 -&gt; STRING
+BINARY + OriginalType ENUM -&gt; STRING
+BINARY + OriginalType DECIMAL -&gt; DECIMAL
+</code></pre>
+
+      <p class="p">
+        <strong class="ph b">Complex types:</strong>
+      </p>
+
+      <p class="p">
+        For the complex types (<code class="ph codeph">ARRAY</code>, <code class="ph codeph">MAP</code>, and <code class="ph codeph">STRUCT</code>)
+        available in <span class="keyword">Impala 2.3</span> and higher, Impala supports queries
+        against those types only in Parquet tables.
+      </p>
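+
+      <p class="p">
+        For example, the following sketch, with hypothetical table and column names, defines
+        and queries complex types in a Parquet table. (The data files for such a table are
+        typically written by another engine, such as Hive, because Impala currently does not
+        insert into complex-type columns.)
+      </p>
+
+<pre class="pre codeblock"><code>create table complex_demo
+(
+  id bigint,
+  tags array&lt;string&gt;,
+  attrs map&lt;string,string&gt;
+)
+stored as parquet;
+
+-- Query the nested collection by joining the table with its array column.
+select c.id, t.item
+  from complex_demo c, c.tags t;
+</code></pre>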
+
+    </div>
+
+  </article>
+
+</article></main></body></html>

http://git-wip-us.apache.org/repos/asf/impala/blob/fae51ec2/docs/build3x/html/topics/impala_parquet_annotate_strings_utf8.html
----------------------------------------------------------------------
diff --git a/docs/build3x/html/topics/impala_parquet_annotate_strings_utf8.html b/docs/build3x/html/topics/impala_parquet_annotate_strings_utf8.html
new file mode 100644
index 0000000..f72b664
--- /dev/null
+++ b/docs/build3x/html/topics/impala_parquet_annotate_strings_utf8.html
@@ -0,0 +1,54 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2018"><meta name="DC.rights.owner" content="(C) Copyright 2018"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="parquet_annotate_strings_utf8"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>PARQUET_ANNOTATE_STRINGS_UTF8 Query Option (Impala 2.6 or higher only)</title></head><body id="parquet_annotate_strings_utf8"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">PARQUET_ANNOTATE_STRINGS_UTF8 Query Option (<span class="keyword">Impala 2.6</span> or higher only)</h1>
+
+
+
+  <div class="body conbody">
+
+    <p class="p">
+
+      Causes Impala <code class="ph codeph">INSERT</code> and <code class="ph codeph">CREATE TABLE AS SELECT</code> statements
+      to write Parquet files that use the UTF-8 annotation for <code class="ph codeph">STRING</code> columns.
+    </p>
+
+    <p class="p">
+        <strong class="ph b">Usage notes:</strong>
+      </p>
+    <p class="p">
+      By default, Impala represents a <code class="ph codeph">STRING</code> column in Parquet as an unannotated binary field.
+    </p>
+    <p class="p">
+      Impala always uses the UTF-8 annotation when writing <code class="ph codeph">CHAR</code> and <code class="ph codeph">VARCHAR</code>
+      columns to Parquet files. An alternative to using the query option is to cast <code class="ph codeph">STRING</code>
+      values to <code class="ph codeph">VARCHAR</code>.
+    </p>
+    <p class="p">
+      This option helps make Impala-written data more interoperable with other data processing engines.
+      Impala itself currently does not support all operations on UTF-8 data.
+      Although data processed by Impala is typically represented in ASCII, it is valid to designate the
+      data as UTF-8 when storing on disk, because ASCII is a subset of UTF-8.
+    </p>
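+    <p class="p">
+      For example, a minimal sketch of enabling the option for a session before writing a
+      Parquet table (the table names are illustrative only):
+    </p>
+<pre class="pre codeblock"><code>set PARQUET_ANNOTATE_STRINGS_UTF8=true;
+
+-- Subsequent writes in this session annotate STRING columns as UTF-8.
+create table annotated_copy stored as parquet
+  as select * from original_table;
+</code></pre>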
+    <p class="p">
+        <strong class="ph b">Type:</strong> Boolean; recognized values are 1 and 0, or <code class="ph codeph">true</code> and <code class="ph codeph">false</code>;
+        any other value is interpreted as <code class="ph codeph">false</code>
+      </p>
+    <p class="p">
+        <strong class="ph b">Default:</strong> <code class="ph codeph">false</code> (shown as 0 in output of <code class="ph codeph">SET</code> statement)
+      </p>
+
+    <p class="p">
+        <strong class="ph b">Added in:</strong> <span class="keyword">Impala 2.6.0</span>
+      </p>
+
+    <p class="p">
+        <strong class="ph b">Related information:</strong>
+      </p>
+    <p class="p">
+      <a class="xref" href="impala_parquet.html#parquet">Using the Parquet File Format with Impala Tables</a>
+    </p>
+
+  </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html>

http://git-wip-us.apache.org/repos/asf/impala/blob/fae51ec2/docs/build3x/html/topics/impala_parquet_array_resolution.html
----------------------------------------------------------------------
diff --git a/docs/build3x/html/topics/impala_parquet_array_resolution.html b/docs/build3x/html/topics/impala_parquet_array_resolution.html
new file mode 100644
index 0000000..831ac46
--- /dev/null
+++ b/docs/build3x/html/topics/impala_parquet_array_resolution.html
@@ -0,0 +1,180 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2018"><meta name="DC.rights.owner" content="(C) Copyright 2018"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="parquet_array_resolution"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>PARQUET_ARRAY_RESOLUTION Query Option (Impala 2.9 or higher only)</title></head><body id="parquet_array_resolution"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">
+    PARQUET_ARRAY_RESOLUTION Query Option (<span class="keyword">Impala 2.9</span> or higher only)
+  </h1>
+
+
+
+
+
+  <div class="body conbody">
+
+    <p class="p">
+      The <code class="ph codeph">PARQUET_ARRAY_RESOLUTION</code> query option controls the
+      behavior of index-based resolution for nested arrays in Parquet.
+    </p>
+
+    <p class="p">
+      In Parquet, you can represent an array using a 2-level or 3-level
+      representation. The modern, standard representation is 3-level. The legacy
+      2-level scheme is supported for compatibility with older Parquet files.
+      However, there is no reliable metadata within Parquet files to indicate
+      which encoding was used. It is even possible to have mixed encodings within
+      the same file if there are multiple arrays. The
+      <code class="ph codeph">PARQUET_ARRAY_RESOLTUTION</code> option controls the process of
+      resolution that is to match every column/field reference from a query to a
+      column in the Parquet file.</p>
+
+    <p class="p">
+      The supported values for the query option are:
+    </p>
+
+    <ul class="ul">
+      <li class="li">
+        <code class="ph codeph">THREE_LEVEL</code>: Assumes arrays are encoded with the 3-level
+        representation, and does not attempt the 2-level resolution.
+      </li>
+
+      <li class="li">
+        <code class="ph codeph">TWO_LEVEL</code>: Assumes arrays are encoded with the 2-level
+        representation, and does not attempt the 3-level resolution.
+      </li>
+
+      <li class="li">
+        <code class="ph codeph">TWO_LEVEL_THEN_THREE_LEVEL</code>: First tries to resolve
+        assuming a 2-level representation, and if unsuccessful, tries a 3-level
+        representation.
+      </li>
+    </ul>
+
+    <p class="p">
+      All of the above options resolve arrays encoded with a single level.
+    </p>
+
+    <p class="p">
+      A failure to resolve a column/field reference in a query with a given array
+      resolution policy does not necessarily result in a warning or error returned
+      by the query. A mismatch might be treated like a missing column (returns
+      NULL values), and it is not possible to reliably distinguish the 'bad
+      resolution' and 'legitimately missing column' cases.
+    </p>
+
+    <p class="p">
+      The name-based policy generally does not have the problem of ambiguous
+      array representations. You specify to use the name-based policy by setting
+      the <code class="ph codeph">PARQUET_FALLBACK_SCHEMA_RESOLUTION</code> query option to
+      <code class="ph codeph">NAME</code>.
+    </p>
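+
+    <p class="p">
+      For example, a minimal sketch of switching to the name-based policy for a session:
+    </p>
+
+<pre class="pre codeblock"><code>
+-- Field references are then matched by name, which generally avoids the
+-- ambiguity between the 2-level and 3-level array representations.
+SET PARQUET_FALLBACK_SCHEMA_RESOLUTION=NAME;
+</code></pre>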
+
+    <p class="p">
+      <strong class="ph b">Type:</strong> Enum of <code class="ph codeph">ONE_LEVEL</code>, <code class="ph codeph">TWO_LEVEL</code>,
+      <code class="ph codeph">THREE_LEVEL</code>
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Default:</strong> <code class="ph codeph">THREE_LEVEL</code>
+    </p>
+
+    <p class="p">
+        <strong class="ph b">Added in:</strong> <span class="keyword">Impala 2.9.0</span>
+      </p>
+
+    <p class="p">
+        <strong class="ph b">Examples:</strong>
+      </p>
+
+    <p class="p">
+      EXAMPLE A: The following Parquet schema of a file can be interpreted as either a
+      2-level or a 3-level representation:
+    </p>
+
+<pre class="pre codeblock"><code>
+ParquetSchemaExampleA {
+  optional group single_element_groups (LIST) {
+    repeated group single_element_group {
+      required int64 count;
+    }
+  }
+}
+</code></pre>
+
+    <p class="p">
+      The following table schema corresponds to a 2-level interpretation:
+    </p>
+
+<pre class="pre codeblock"><code>
+CREATE TABLE t (col1 array&lt;struct&lt;f1: bigint&gt;&gt;) STORED AS PARQUET;
+</code></pre>
+
+    <p class="p">
+      Successful query with a 2-level interpretation:
+    </p>
+
+<pre class="pre codeblock"><code>
+SET PARQUET_ARRAY_RESOLUTION=TWO_LEVEL;
+SELECT ITEM.f1 FROM t.col1;
+</code></pre>
+
+    <p class="p">
+      The following table schema corresponds to a 3-level interpretation:
+    </p>
+
+<pre class="pre codeblock"><code>
+CREATE TABLE t (col1 array&lt;bigint&gt;) STORED AS PARQUET;
+</code></pre>
+
+    <p class="p">
+      Successful query with a 3-level interpretation:
+    </p>
+
+<pre class="pre codeblock"><code>
+SET PARQUET_ARRAY_RESOLUTION=THREE_LEVEL;
+SELECT ITEM FROM t.col1;
+</code></pre>
+
+    <p class="p">
+      EXAMPLE B: The following Parquet schema of a file can only be successfully
+      interpreted as a 2-level representation:
+    </p>
+
+<pre class="pre codeblock"><code>
+ParquetSchemaExampleB {
+  required group list_of_ints (LIST) {
+    repeated int32 list_of_ints_tuple;
+  }
+}
+</code></pre>
+
+    <p class="p">
+      The following table schema corresponds to a 2-level interpretation:
+    </p>
+
+<pre class="pre codeblock"><code>
+CREATE TABLE t (col1 array&lt;int&gt;) STORED AS PARQUET;
+</code></pre>
+
+    <p class="p">
+      Successful query with a 2-level interpretation:
+    </p>
+
+<pre class="pre codeblock"><code>
+SET PARQUET_ARRAY_RESOLUTION=TWO_LEVEL;
+SELECT ITEM FROM t.col1;
+</code></pre>
+
+    <p class="p">
+      Unsuccessful query with a 3-level interpretation. The query returns
+      <code class="ph codeph">NULL</code>s as if the column was missing in the file:
+    </p>
+
+<pre class="pre codeblock"><code>
+SET PARQUET_ARRAY_RESOLUTION=THREE_LEVEL;
+SELECT ITEM FROM t.col1;
+</code></pre>
+
+  </div>
+
+<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html>

http://git-wip-us.apache.org/repos/asf/impala/blob/fae51ec2/docs/build3x/html/topics/impala_parquet_compression_codec.html
----------------------------------------------------------------------
diff --git a/docs/build3x/html/topics/impala_parquet_compression_codec.html b/docs/build3x/html/topics/impala_parquet_compression_codec.html
new file mode 100644
index 0000000..ac5551a
--- /dev/null
+++ b/docs/build3x/html/topics/impala_parquet_compression_codec.html
@@ -0,0 +1,17 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2018"><meta name="DC.rights.owner" content="(C) Copyright 2018"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 3.0.x"><meta name="version" content="Impala 3.0.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="parquet_compression_codec"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>PARQUET_COMPRESSION_CODEC Query Option</title></head><body id="parquet_compression_codec"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">PARQUET_COMPRESSION_CODEC Query Option</h1>
+
+
+
+  <div class="body conbody">
+
+    <p class="p">
+
+      Deprecated. Use <code class="ph codeph">COMPRESSION_CODEC</code> in Impala 2.0 and later. See
+      <a class="xref" href="impala_compression_codec.html#compression_codec">COMPRESSION_CODEC Query Option (Impala 2.0 or higher only)</a> for details.
+    </p>
+  </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html>