Posted to commits@impala.apache.org by jr...@apache.org on 2016/10/31 19:27:12 UTC
incubator-impala git commit: Upgrade impala_file_formats.xml to
latest version - it contains some new IDs used as conref targets.
Repository: incubator-impala
Updated Branches:
refs/heads/doc_prototype 0124ae32f -> e9e4f18cd
Upgrade impala_file_formats.xml to latest version - it contains some new IDs used as conref targets.
Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/e9e4f18c
Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/e9e4f18c
Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/e9e4f18c
Branch: refs/heads/doc_prototype
Commit: e9e4f18cd4d16a53b7c034a6d003292c50ce1bf4
Parents: 0124ae3
Author: John Russell <jr...@cloudera.com>
Authored: Mon Oct 31 12:27:08 2016 -0700
Committer: John Russell <jr...@cloudera.com>
Committed: Mon Oct 31 12:27:08 2016 -0700
----------------------------------------------------------------------
docs/topics/impala_file_formats.xml | 236 ++++++++++++++++++++++++++++++-
1 file changed, 235 insertions(+), 1 deletion(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/e9e4f18c/docs/topics/impala_file_formats.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_file_formats.xml b/docs/topics/impala_file_formats.xml
index 64bf8a5..48b9e7c 100644
--- a/docs/topics/impala_file_formats.xml
+++ b/docs/topics/impala_file_formats.xml
@@ -29,8 +29,242 @@
considerations for using each file format with Impala.
</p>
-
+ <p>
+ The file format used for an Impala table has significant performance consequences. Some file formats include
+ compression support that affects the size of data on disk and, consequently, the amount of I/O and CPU
+ resources required to deserialize data. These I/O and CPU costs can be a limiting factor in query
+ performance, because a query often begins by moving and decompressing data. To reduce the I/O portion of
+ that cost, data is often compressed. With compressed data, fewer bytes are transferred from disk to
+ memory, which reduces the time taken to transfer the data. The tradeoff is the additional CPU time
+ spent decompressing the content.
+ </p>
+
+ <p>
+ Impala can query files encoded with most of the popular file formats and compression codecs used in Hadoop.
+ Impala can create and insert data into tables that use some file formats but not others; for file formats
+ that Impala cannot write to, create the table in Hive, issue the <codeph>INVALIDATE METADATA <varname>table_name</varname></codeph>
+ statement in <codeph>impala-shell</codeph>, and query the table through Impala. File formats can be
+ structured, in which case they may include metadata and built-in compression. Supported formats include:
+ </p>
+
+ <table>
+ <title>File Format Support in Impala</title>
+ <tgroup cols="5">
+ <colspec colname="1" colwidth="10*"/>
+ <colspec colname="2" colwidth="10*"/>
+ <colspec colname="3" colwidth="20*"/>
+ <colspec colname="4" colwidth="30*"/>
+ <colspec colname="5" colwidth="30*"/>
+ <thead>
+ <row>
+ <entry>
+ File Type
+ </entry>
+ <entry>
+ Format
+ </entry>
+ <entry>
+ Compression Codecs
+ </entry>
+ <entry>
+ Impala Can CREATE?
+ </entry>
+ <entry>
+ Impala Can INSERT?
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row id="parquet_support">
+ <entry>
+ <xref href="impala_parquet.xml#parquet">Parquet</xref>
+ </entry>
+ <entry>
+ Structured
+ </entry>
+ <entry>
+ Snappy, gzip; currently Snappy by default
+ </entry>
+ <entry>
+ Yes.
+ </entry>
+ <entry>
+ Yes: <codeph>CREATE TABLE</codeph>, <codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>, and query.
+ </entry>
+ </row>
+ <row id="txtfile_support">
+ <entry>
+ <xref href="impala_txtfile.xml#txtfile">Text</xref>
+ </entry>
+ <entry>
+ Unstructured
+ </entry>
+ <entry rev="2.0.0">
+ LZO, gzip, bzip2, Snappy
+ </entry>
+ <entry>
+ Yes. For <codeph>CREATE TABLE</codeph> with no <codeph>STORED AS</codeph> clause, the default file
+ format is uncompressed text, with values separated by ASCII <codeph>0x01</codeph> characters
+ (typically represented as Ctrl-A).
+ </entry>
+ <entry>
+ Yes: <codeph>CREATE TABLE</codeph>, <codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>, and query.
+ If LZO compression is used, you must create the table and load data in Hive. If other kinds of
+ compression are used, you must load data through <codeph>LOAD DATA</codeph> or Hive, or by copying
+ files manually into HDFS.
+
+<!-- <ph rev="2.0.0">Impala 2.0 and higher can write LZO-compressed text data; for earlier Impala releases, you must create the table and load data in Hive.</ph> -->
+ </entry>
+ </row>
+ <row id="avro_support">
+ <entry>
+ <xref href="impala_avro.xml#avro">Avro</xref>
+ </entry>
+ <entry>
+ Structured
+ </entry>
+ <entry>
+ Snappy, gzip, deflate, bzip2
+ </entry>
+ <entry rev="1.4.0">
+ Yes, in Impala 1.4.0 and higher. Before that, create the table using Hive.
+ </entry>
+ <entry>
+ No. Import data by using <codeph>LOAD DATA</codeph> on data files already in the right format, or use
+ <codeph>INSERT</codeph> in Hive followed by <codeph>REFRESH <varname>table_name</varname></codeph> in Impala.
+ </entry>
+<!-- <entry rev="2.0.0">Yes, in Impala 2.0 and higher. For earlier Impala releases, load data through <codeph>LOAD DATA</codeph> on data files already in the right format, or use <codeph>INSERT</codeph> in Hive.</entry> -->
+ </row>
+ <row id="rcfile_support">
+ <entry>
+ <xref href="impala_rcfile.xml#rcfile">RCFile</xref>
+ </entry>
+ <entry>
+ Structured
+ </entry>
+ <entry>
+ Snappy, gzip, deflate, bzip2
+ </entry>
+ <entry>
+ Yes.
+ </entry>
+ <entry>
+ No. Import data by using <codeph>LOAD DATA</codeph> on data files already in the right format, or use
+ <codeph>INSERT</codeph> in Hive followed by <codeph>REFRESH <varname>table_name</varname></codeph> in Impala.
+ </entry>
+<!--
+ <entry rev="2.0.0">Yes, in Impala 2.0 and higher. For earlier Impala releases, load data through <codeph>LOAD DATA</codeph> on data files already in the right format, or use <codeph>INSERT</codeph> in Hive.</entry>
+ -->
+ </row>
+ <row id="sequencefile_support">
+ <entry>
+ <xref href="impala_seqfile.xml#seqfile">SequenceFile</xref>
+ </entry>
+ <entry>
+ Structured
+ </entry>
+ <entry>
+ Snappy, gzip, deflate, bzip2
+ </entry>
+ <entry>Yes.</entry>
+ <entry>
+ No. Import data by using <codeph>LOAD DATA</codeph> on data files already in the right format, or use
+ <codeph>INSERT</codeph> in Hive followed by <codeph>REFRESH <varname>table_name</varname></codeph> in Impala.
+ </entry>
+<!--
+ <entry rev="2.0.0">
+ Yes, in Impala 2.0 and higher. For earlier Impala releases, load data through <codeph>LOAD
+ DATA</codeph> on data files already in the right format, or use <codeph>INSERT</codeph> in Hive.
+ </entry>
+-->
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
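+
+ <p>
+ For a format that Impala can query but cannot write to, the Hive-then-Impala workflow described before the
+ table might look like the following minimal sketch. The table and column names are hypothetical, and
+ SequenceFile is used only as one example of such a format:
+ </p>
+
+<codeblock>-- In Hive (hypothetical names; text_events is assumed to exist already):
+CREATE TABLE seq_events (id BIGINT, msg STRING) STORED AS SEQUENCEFILE;
+INSERT INTO TABLE seq_events SELECT id, msg FROM text_events;
+
+-- In impala-shell, make the new table visible to Impala, then query it:
+INVALIDATE METADATA seq_events;
+SELECT COUNT(*) FROM seq_events;
+</codeblock>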
+
+ <p rev="DOCS-1370">
+ Impala can only query the file formats listed in the preceding table.
+ In particular, Impala does not support the ORC file format.
+ </p>
+
+ <p>
+ Impala supports the following compression codecs:
+ </p>
+
+ <ul>
+ <li rev="2.0.0">
+ Snappy. Recommended for its effective balance between compression ratio and decompression speed. Snappy
+ compression is very fast, but gzip provides greater space savings. Supported for text files in Impala 2.0
+ and higher.
+<!-- Not supported for text files. -->
+ </li>
+
+ <li rev="2.0.0">
+ Gzip. Recommended when achieving the highest level of compression (and therefore greatest disk-space
+ savings) is desired. Supported for text files in Impala 2.0 and higher.
+ </li>
+
+ <li>
+ Deflate. Not supported for text files.
+ </li>
+
+ <li rev="2.0.0">
+ Bzip2. Supported for text files in Impala 2.0 and higher.
+<!-- Not supported for text files. -->
+ </li>
+
+ <li>
+ <p rev="2.0.0"> LZO, for text files only. Impala can query
+ LZO-compressed text tables, but currently cannot create them or insert
+ data into them; perform these operations in Hive. </p>
+ </li>
+ </ul>
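+
+ <p>
+ As a minimal sketch of choosing a codec when writing Parquet data from <codeph>impala-shell</codeph>, the
+ following assumes the <codeph>COMPRESSION_CODEC</codeph> query option available in Impala 2.0 and higher.
+ The table names are hypothetical:
+ </p>
+
+<codeblock>-- Hypothetical tables: text_events already holds data, parquet_events is an empty Parquet table.
+SET COMPRESSION_CODEC=gzip;    -- favor space savings over speed for the next INSERT
+INSERT INTO parquet_events SELECT * FROM text_events;
+SET COMPRESSION_CODEC=snappy;  -- switch back to the default codec
+</codeblock>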
</conbody>
+ <concept id="file_format_choosing">
+
+ <title>Choosing the File Format for a Table</title>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Planning"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ Different file formats and compression codecs work better for different data sets. While Impala typically
+ provides performance gains regardless of file format, choosing the proper format for your data can yield
+ further performance improvements. Use the following considerations to decide which combination of file
+ format and compression to use for a particular table:
+ </p>
+
+ <ul>
+ <li>
+ If you are working with existing files that are already in a supported file format, use the same format
+ for the Impala table where practical. If the original format does not yield acceptable query performance
+ or resource usage, consider creating a new Impala table with different file format or compression
+ characteristics, and doing a one-time conversion by copying the data to the new table using the
+ <codeph>INSERT</codeph> statement. Depending on the file format, you might run the
+ <codeph>INSERT</codeph> statement in <codeph>impala-shell</codeph> or in Hive.
+ </li>
+
+ <li>
+ Text files are convenient to produce through many different tools, and are human-readable for ease of
+ verification and debugging. Those characteristics are why text is the default format for an Impala
+ <codeph>CREATE TABLE</codeph> statement. When performance and resource usage are the primary
+ considerations, use one of the other file formats and consider using compression. A typical workflow
+ might involve bringing data into an Impala table by copying CSV or TSV files into the appropriate data
+ directory, and then using the <codeph>INSERT ... SELECT</codeph> syntax to copy the data into a table
+ using a different, more compact file format.
+ </li>
+ <li>
+ If your architecture involves storing data to be queried in memory, do not compress the data. There are no
+ I/O savings because the data does not need to be moved from disk, but there is still a CPU cost to
+ decompress the data.
+ </li>
+ </ul>
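+
+ <p>
+ The text-to-compact-format workflow described in the list above can be sketched as follows. This is an
+ illustration only: the table names, columns, and comma delimiter are assumptions, and Parquet is chosen
+ as one example of a more compact format:
+ </p>
+
+<codeblock>-- Text table whose data directory receives the CSV files (hypothetical names).
+CREATE TABLE sales_csv (id BIGINT, amount DOUBLE, region STRING)
+  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
+
+-- After copying the CSV files into the table's data directory:
+REFRESH sales_csv;
+
+-- One-time conversion into a table with the same columns, stored as Parquet.
+CREATE TABLE sales_parquet LIKE sales_csv STORED AS PARQUET;
+INSERT INTO sales_parquet SELECT * FROM sales_csv;
+</codeblock>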
+ </conbody>
+ </concept>
</concept>