Posted to commits@impala.apache.org by jr...@apache.org on 2016/10/31 19:27:12 UTC

incubator-impala git commit: Upgrade impala_file_formats.xml to latest version - it contains some new IDs used as conref targets.

Repository: incubator-impala
Updated Branches:
  refs/heads/doc_prototype 0124ae32f -> e9e4f18cd


Upgrade impala_file_formats.xml to latest version - it contains some new IDs used as conref targets.


Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/e9e4f18c
Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/e9e4f18c
Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/e9e4f18c

Branch: refs/heads/doc_prototype
Commit: e9e4f18cd4d16a53b7c034a6d003292c50ce1bf4
Parents: 0124ae3
Author: John Russell <jr...@cloudera.com>
Authored: Mon Oct 31 12:27:08 2016 -0700
Committer: John Russell <jr...@cloudera.com>
Committed: Mon Oct 31 12:27:08 2016 -0700

----------------------------------------------------------------------
 docs/topics/impala_file_formats.xml | 236 ++++++++++++++++++++++++++++++-
 1 file changed, 235 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/e9e4f18c/docs/topics/impala_file_formats.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_file_formats.xml b/docs/topics/impala_file_formats.xml
index 64bf8a5..48b9e7c 100644
--- a/docs/topics/impala_file_formats.xml
+++ b/docs/topics/impala_file_formats.xml
@@ -29,8 +29,242 @@
       considerations for using each file format with Impala.
     </p>
 
-    
+    <p>
+      The file format used for an Impala table has significant performance consequences. Some file formats
+      include compression support that affects the size of data on disk and, consequently, the amount of I/O
+      and CPU resources required to deserialize the data. Because querying typically begins with reading and
+      decompressing data, these I/O and CPU requirements can be a limiting factor in query performance.
+      Compressing the data reduces the total number of bytes transferred from disk to memory, and therefore
+      the transfer time, at the cost of the CPU work needed to decompress the content.
+    </p>
+
+    <p>
+      Impala can query files encoded with most of the popular file formats and compression codecs used in Hadoop.
+      Impala can create tables in, and insert data into, some file formats but not others. For file formats
+      that Impala cannot write to, create the table in Hive, issue the <codeph>INVALIDATE METADATA <varname>table_name</varname></codeph>
+      statement in <codeph>impala-shell</codeph>, and then query the table through Impala. File formats can be
+      structured, in which case they may include metadata and built-in compression. The supported formats are:
+    </p>
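+
+    <p>
+      For example, for a file format that Impala can query but not write to, the workflow might look like
+      the following. (The table and column names here are illustrative only.)
+    </p>
+
+<codeblock>-- In Hive: create the table and load data into a format Impala cannot write.
+CREATE TABLE seq_table (x INT) STORED AS SEQUENCEFILE;
+INSERT INTO TABLE seq_table SELECT x FROM some_source_table;
+
+-- In impala-shell: make the new table visible to Impala, then query it.
+INVALIDATE METADATA seq_table;
+SELECT COUNT(*) FROM seq_table;</codeblock>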
+
+    <table>
+      <title>File Format Support in Impala</title>
+      <tgroup cols="5">
+        <colspec colname="1" colwidth="10*"/>
+        <colspec colname="2" colwidth="10*"/>
+        <colspec colname="3" colwidth="20*"/>
+        <colspec colname="4" colwidth="30*"/>
+        <colspec colname="5" colwidth="30*"/>
+        <thead>
+          <row>
+            <entry>
+              File Type
+            </entry>
+            <entry>
+              Format
+            </entry>
+            <entry>
+              Compression Codecs
+            </entry>
+            <entry>
+              Impala Can CREATE?
+            </entry>
+            <entry>
+              Impala Can INSERT?
+            </entry>
+          </row>
+        </thead>
+        <tbody>
+          <row id="parquet_support">
+            <entry>
+              <xref href="impala_parquet.xml#parquet">Parquet</xref>
+            </entry>
+            <entry>
+              Structured
+            </entry>
+            <entry>
+              Snappy, gzip; currently Snappy by default
+            </entry>
+            <entry>
+              Yes.
+            </entry>
+            <entry>
+              Yes: <codeph>CREATE TABLE</codeph>, <codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>, and query.
+            </entry>
+          </row>
+          <row id="txtfile_support">
+            <entry>
+              <xref href="impala_txtfile.xml#txtfile">Text</xref>
+            </entry>
+            <entry>
+              Unstructured
+            </entry>
+            <entry rev="2.0.0">
+              LZO, gzip, bzip2, Snappy
+            </entry>
+            <entry>
+              Yes. For <codeph>CREATE TABLE</codeph> with no <codeph>STORED AS</codeph> clause, the default file
+              format is uncompressed text, with values separated by ASCII <codeph>0x01</codeph> characters
+              (typically represented as Ctrl-A).
+            </entry>
+            <entry>
+              Yes: <codeph>CREATE TABLE</codeph>, <codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>, and query.
+              If LZO compression is used, you must create the table and load data in Hive. If other kinds of
+              compression are used, load the data through the <codeph>LOAD DATA</codeph> statement, through
+              Hive, or by copying the files manually in HDFS.
+
+<!--            <ph rev="2.0.0">Impala 2.0 and higher can write LZO-compressed text data; for earlier Impala releases,  you must create the table and load data in Hive.</ph> -->
+            </entry>
+          </row>
+          <row id="avro_support">
+            <entry>
+              <xref href="impala_avro.xml#avro">Avro</xref>
+            </entry>
+            <entry>
+              Structured
+            </entry>
+            <entry>
+              Snappy, gzip, deflate, bzip2
+            </entry>
+            <entry rev="1.4.0">
+              Yes, in Impala 1.4.0 and higher. Before that, create the table using Hive.
+            </entry>
+            <entry>
+              No. Import data by using <codeph>LOAD DATA</codeph> on data files already in the right format, or use
+              <codeph>INSERT</codeph> in Hive followed by <codeph>REFRESH <varname>table_name</varname></codeph> in Impala.
+            </entry>
+<!-- <entry rev="2.0.0">Yes, in Impala 2.0 and higher. For earlier Impala releases, load data through <codeph>LOAD DATA</codeph> on data files already in the right format, or use <codeph>INSERT</codeph> in Hive.</entry> -->
+          </row>
+          <row id="rcfile_support">
+            <entry>
+              <xref href="impala_rcfile.xml#rcfile">RCFile</xref>
+            </entry>
+            <entry>
+              Structured
+            </entry>
+            <entry>
+              Snappy, gzip, deflate, bzip2
+            </entry>
+            <entry>
+              Yes.
+            </entry>
+            <entry>
+              No. Import data by using <codeph>LOAD DATA</codeph> on data files already in the right format, or use
+              <codeph>INSERT</codeph> in Hive followed by <codeph>REFRESH <varname>table_name</varname></codeph> in Impala.
+            </entry>
+<!--
+            <entry rev="2.0.0">Yes, in Impala 2.0 and higher. For earlier Impala releases, load data through <codeph>LOAD DATA</codeph> on data files already in the right format, or use <codeph>INSERT</codeph> in Hive.</entry>
+            -->
+          </row>
+          <row id="sequencefile_support">
+            <entry>
+              <xref href="impala_seqfile.xml#seqfile">SequenceFile</xref>
+            </entry>
+            <entry>
+              Structured
+            </entry>
+            <entry>
+              Snappy, gzip, deflate, bzip2
+            </entry>
+            <entry>Yes.</entry>
+            <entry>
+              No. Import data by using <codeph>LOAD DATA</codeph> on data files already in the right format, or use
+              <codeph>INSERT</codeph> in Hive followed by <codeph>REFRESH <varname>table_name</varname></codeph> in Impala.
+            </entry>
+<!--
+            <entry rev="2.0.0">
+              Yes, in Impala 2.0 and higher. For earlier Impala releases, load data through <codeph>LOAD
+              DATA</codeph> on data files already in the right format, or use <codeph>INSERT</codeph> in Hive.
+            </entry>
+-->
+          </row>
+        </tbody>
+      </tgroup>
+    </table>
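+
+    <p>
+      For the formats that Impala can query but not insert into, the <codeph>LOAD DATA</codeph> and
+      <codeph>REFRESH</codeph> techniques mentioned in the table might look like the following.
+      (The path and table name are illustrative only.)
+    </p>
+
+<codeblock>-- Move data files that are already in the right format into the table's directory.
+LOAD DATA INPATH '/user/staging/rcfile_data' INTO TABLE rcfile_table;
+
+-- Or, after inserting data through Hive, pick up the new data in Impala.
+REFRESH rcfile_table;</codeblock>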
+
+    <p rev="DOCS-1370">
+      Impala can only query the file formats listed in the preceding table.
+      In particular, Impala does not support the ORC file format.
+    </p>
+
+    <p>
+      Impala supports the following compression codecs:
+    </p>
+
+    <ul>
+      <li rev="2.0.0">
+        Snappy. Recommended for its effective balance between compression ratio and decompression speed. Snappy
+        compression is very fast, but gzip provides greater space savings. Supported for text files in Impala 2.0
+        and higher.
+<!-- Not supported for text files. -->
+      </li>
+
+      <li rev="2.0.0">
+        Gzip. Recommended when the highest level of compression (and therefore the greatest disk-space savings)
+        is desired. Supported for text files in Impala 2.0 and higher.
+      </li>
+
+      <li>
+        Deflate. Not supported for text files.
+      </li>
+
+      <li rev="2.0.0">
+        Bzip2. Supported for text files in Impala 2.0 and higher.
+<!-- Not supported for text files. -->
+      </li>
+
+      <li>
+        <p rev="2.0.0">
+          LZO, for text files only. Impala can query LZO-compressed text tables, but currently cannot create
+          them or insert data into them; perform these operations in Hive.
+        </p>
+      </li>
+    </ul>
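+
+    <p>
+      As an illustration of how the codec is chosen at write time, the <codeph>COMPRESSION_CODEC</codeph>
+      query option (in Impala 2.0 and higher) controls the codec used by subsequent <codeph>INSERT</codeph>
+      statements into Parquet tables. (The table names here are illustrative only.)
+    </p>
+
+<codeblock>SET COMPRESSION_CODEC=gzip;   -- Favor space savings over decompression speed.
+INSERT INTO parquet_table SELECT * FROM text_table;
+
+SET COMPRESSION_CODEC=snappy; -- The default; balances compression ratio and speed.
+INSERT INTO parquet_table SELECT * FROM text_table;</codeblock>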
   </conbody>
 
+  <concept id="file_format_choosing">
+
+    <title>Choosing the File Format for a Table</title>
+    <prolog>
+      <metadata>
+        <data name="Category" value="Planning"/>
+      </metadata>
+    </prolog>
+
+    <conbody>
+
+      <p>
+        Different file formats and compression codecs work better for different data sets. While Impala typically
+        provides performance gains regardless of file format, choosing the proper format for your data can yield
+        further performance improvements. Use the following considerations to decide which combination of file
+        format and compression to use for a particular table:
+      </p>
+
+      <ul>
+        <li>
+          If you are working with existing files that are already in a supported file format, use the same format
+          for the Impala table where practical. If the original format does not yield acceptable query performance
+          or resource usage, consider creating a new Impala table with different file format or compression
+          characteristics, and doing a one-time conversion by copying the data to the new table using the
+          <codeph>INSERT</codeph> statement. Depending on the file format, you might run the
+          <codeph>INSERT</codeph> statement in <codeph>impala-shell</codeph> or in Hive.
+        </li>
+
+        <li>
+          Text files are convenient to produce through many different tools, and are human-readable for ease of
+          verification and debugging. Those characteristics are why text is the default format for an Impala
+          <codeph>CREATE TABLE</codeph> statement. When performance and resource usage are the primary
+          considerations, use one of the other file formats and consider using compression. A typical workflow
+          might involve bringing data into an Impala table by copying CSV or TSV files into the appropriate data
+          directory, and then using the <codeph>INSERT ... SELECT</codeph> syntax to copy the data into a table
+          using a different, more compact file format.
+        </li>
 
+        <li>
+          If your architecture involves storing data to be queried in memory, do not compress the data.
+          Compression provides no I/O savings, because the data does not need to be read from disk, but
+          decompressing it still carries a CPU cost.
+        </li>
+      </ul>
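+
+      <p>
+        For example, a one-time conversion from a text table to a more compact format might use a
+        <codeph>CREATE TABLE ... AS SELECT</codeph> statement in <codeph>impala-shell</codeph>.
+        (The table names here are illustrative only.)
+      </p>
+
+<codeblock>-- Copy all data from the original text table into a new Parquet table.
+CREATE TABLE parquet_copy STORED AS PARQUET AS SELECT * FROM csv_original;</codeblock>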
+    </conbody>
+  </concept>
 </concept>