Posted to commits@impala.apache.org by jb...@apache.org on 2017/04/12 18:25:14 UTC
[10/51] [partial] incubator-impala git commit: IMPALA-4181 [DOCS]
Publish rendered Impala documentation to ASF site
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_s3.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_s3.html b/docs/build/html/topics/impala_s3.html
new file mode 100644
index 0000000..79a4a69
--- /dev/null
+++ b/docs/build/html/topics/impala_s3.html
@@ -0,0 +1,775 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="s3"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Using Impala with the Amazon S3 Filesystem</title></head><body id="s3"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+ <h1 class="title topictitle1" id="ariaid-title1">Using Impala with the Amazon S3 Filesystem</h1>
+
+
+
+ <div class="body conbody">
+
+ <div class="note important note_important"><span class="note__title importanttitle">Important:</span>
+ <p class="p">
+ In <span class="keyword">Impala 2.6</span> and higher, Impala supports both queries (<code class="ph codeph">SELECT</code>)
+ and DML (<code class="ph codeph">INSERT</code>, <code class="ph codeph">LOAD DATA</code>, <code class="ph codeph">CREATE TABLE AS SELECT</code>)
+ for data residing on Amazon S3. With the inclusion of write support,
+
+ the Impala support for S3 is now considered ready for production use.
+ </p>
+ </div>
+
+ <p class="p">
+
+
+
+ You can use Impala to query data residing on the Amazon S3 filesystem. This capability allows convenient
+ access to a storage system that is remotely managed, accessible from anywhere, and integrated with various
+ cloud-based services. Impala can query files in any supported file format from S3. The S3 storage location
+ can be for an entire table, or individual partitions in a partitioned table.
+ </p>
+
+ <p class="p">
+ The default Impala tables use data files stored on HDFS, which are ideal for bulk loads and queries using
+ full-table scans. In contrast, queries against S3 data are less performant, making S3 suitable for holding
+ <span class="q">"cold"</span> data that is only queried occasionally, while more frequently accessed <span class="q">"hot"</span> data resides in
+ HDFS. In a partitioned table, you can set the <code class="ph codeph">LOCATION</code> attribute for individual partitions
+ to put some partitions on HDFS and others on S3, typically depending on the age of the data.
+ </p>
+
+ <p class="p toc inpage"></p>
+
+ </div>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title2" id="s3__s3_sql">
+ <h2 class="title topictitle2" id="ariaid-title2">How Impala SQL Statements Work with S3</h2>
+ <div class="body conbody">
+ <p class="p">
+ Impala SQL statements work with data on S3 as follows:
+ </p>
+ <ul class="ul">
+ <li class="li">
+ <p class="p">
+ The <a class="xref" href="impala_create_table.html#create_table">CREATE TABLE Statement</a>
+ or <a class="xref" href="impala_alter_table.html#alter_table">ALTER TABLE Statement</a>
+ can specify that a table resides on the S3 filesystem by
+ using an <code class="ph codeph">s3a://</code> prefix for the <code class="ph codeph">LOCATION</code>
+ property. <code class="ph codeph">ALTER TABLE</code> can also set the <code class="ph codeph">LOCATION</code>
+ property for an individual partition, so that some data in a table resides on
+ S3 and other data in the same table resides on HDFS.
+ </p>
+ </li>
+ <li class="li">
+ <p class="p">
+ Once a table or partition is designated as residing on S3, the <a class="xref" href="impala_select.html#select">SELECT Statement</a>
+ transparently accesses the data files from the appropriate storage layer.
+ </p>
+ </li>
+ <li class="li">
+ <p class="p">
+ If the S3 table is an internal table, the <a class="xref" href="impala_drop_table.html#drop_table">DROP TABLE Statement</a>
+ removes the corresponding data files from S3 when the table is dropped.
+ </p>
+ </li>
+ <li class="li">
+ <p class="p">
+ The <a class="xref" href="impala_truncate_table.html#truncate_table">TRUNCATE TABLE Statement (Impala 2.3 or higher only)</a> always removes the corresponding
+ data files from S3 when the table is truncated.
+ </p>
+ </li>
+ <li class="li">
+ <p class="p">
+ The <a class="xref" href="impala_load_data.html#load_data">LOAD DATA Statement</a> can move data files residing in HDFS into
+ an S3 table.
+ </p>
+ </li>
+ <li class="li">
+ <p class="p">
+ The <a class="xref" href="impala_insert.html#insert">INSERT Statement</a>, or the <code class="ph codeph">CREATE TABLE AS SELECT</code>
+ form of the <code class="ph codeph">CREATE TABLE</code> statement, can copy data from an HDFS table or another S3
+ table into an S3 table. The <a class="xref" href="impala_s3_skip_insert_staging.html#s3_skip_insert_staging">S3_SKIP_INSERT_STAGING Query Option (Impala 2.6 or higher only)</a>
+ chooses whether or not to use a fast code path for these write operations to S3,
+ with the tradeoff of potential inconsistency in the case of a failure during the statement.
+ </p>
+ </li>
+ </ul>
+ <p class="p">
+ For usage information about Impala SQL statements with S3 tables, see <a class="xref" href="impala_s3.html#s3_ddl">Creating Impala Databases, Tables, and Partitions for Data Stored on S3</a>
+ and <a class="xref" href="impala_s3.html#s3_dml">Using Impala DML Statements for S3 Data</a>.
+ </p>
+ </div>
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title3" id="s3__s3_creds">
+
+ <h2 class="title topictitle2" id="ariaid-title3">Specifying Impala Credentials to Access Data in S3</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+
+
+
+
+ To allow Impala to access data in S3, specify values for the following configuration settings in your
+ <span class="ph filepath">core-site.xml</span> file:
+ </p>
+
+
+<pre class="pre codeblock"><code>
+<property>
+<name>fs.s3a.access.key</name>
+<value><var class="keyword varname">your_access_key</var></value>
+</property>
+<property>
+<name>fs.s3a.secret.key</name>
+<value><var class="keyword varname">your_secret_key</var></value>
+</property>
+</code></pre>
+
+ <p class="p">
+ After specifying the credentials, restart both the Impala and
+ Hive services. (Restarting Hive is required because Impala queries, CREATE TABLE statements, and so on go
+ through the Hive metastore.)
+ </p>
+
+ <div class="note important note_important"><span class="note__title importanttitle">Important:</span>
+
+ <p class="p">
+ Although you can specify the access key ID and secret key as part of the <code class="ph codeph">s3a://</code> URL in the
+ <code class="ph codeph">LOCATION</code> attribute, doing so makes this sensitive information visible in many places, such
+ as <code class="ph codeph">DESCRIBE FORMATTED</code> output and Impala log files. Therefore, specify this information
+ centrally in the <span class="ph filepath">core-site.xml</span> file, and restrict read access to that file to only
+ trusted users.
+ </p>
+
+
+
+ </div>
+
+ </div>
+
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title4" id="s3__s3_etl">
+
+ <h2 class="title topictitle2" id="ariaid-title4">Loading Data into S3 for Impala Queries</h2>
+
+
+ <div class="body conbody">
+
+ <p class="p">
+ If your ETL pipeline involves moving data into S3 and then querying through Impala,
+ you can either use Impala DML statements to create, move, or copy the data, or
+ use the same data loading techniques as you would for non-Impala data.
+ </p>
+
+ </div>
+
+ <article class="topic concept nested2" aria-labelledby="ariaid-title5" id="s3_etl__s3_dml">
+ <h3 class="title topictitle3" id="ariaid-title5">Using Impala DML Statements for S3 Data</h3>
+ <div class="body conbody">
+ <p class="p">
+ In <span class="keyword">Impala 2.6</span> and higher, the Impala DML statements (<code class="ph codeph">INSERT</code>, <code class="ph codeph">LOAD DATA</code>,
+ and <code class="ph codeph">CREATE TABLE AS SELECT</code>) can write data into a table or partition that resides in the
+ Amazon Simple Storage Service (S3).
+ The syntax of the DML statements is the same as for any other tables, because the S3 location for tables and
+ partitions is specified by an <code class="ph codeph">s3a://</code> prefix in the
+ <code class="ph codeph">LOCATION</code> attribute of
+ <code class="ph codeph">CREATE TABLE</code> or <code class="ph codeph">ALTER TABLE</code> statements.
+ If you bring data into S3 using the normal S3 transfer mechanisms instead of Impala DML statements,
+ issue a <code class="ph codeph">REFRESH</code> statement for the table before using Impala to query the S3 data.
+ </p>
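+ <p class="p">
+ As a rough sketch, the following session writes to an S3 table with each of these statements. The table
+ names <code class="ph codeph">sales_hdfs</code>, <code class="ph codeph">sales_s3</code>, and
+ <code class="ph codeph">sales_s3_copy</code>, the S3 paths, and the HDFS staging path are hypothetical
+ and only illustrate the syntax; substitute your own.
+ </p>
+
+<pre class="pre codeblock"><code>-- All table names and paths below are hypothetical.
+[localhost:21000] > create table sales_s3 like sales_hdfs location 's3a://impala-demo/sales_s3';
+
+-- Copy rows from an HDFS table into the S3 table.
+[localhost:21000] > insert into sales_s3 select * from sales_hdfs;
+
+-- Move data files that already reside in HDFS into the S3 table.
+[localhost:21000] > load data inpath '/user/impala/staging/sales' into table sales_s3;
+
+-- Create and populate another S3 table in a single step.
+[localhost:21000] > create table sales_s3_copy location 's3a://impala-demo/sales_s3_copy'
+                  > as select * from sales_s3;
+</code></pre>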
+ <p class="p">
+ Because of differences between S3 and traditional filesystems, DML operations
+ for S3 tables can take longer than for tables on HDFS. For example, both the
+ <code class="ph codeph">LOAD DATA</code> statement and the final stage of the <code class="ph codeph">INSERT</code>
+ and <code class="ph codeph">CREATE TABLE AS SELECT</code> statements involve moving files from one directory
+ to another. (In the case of <code class="ph codeph">INSERT</code> and <code class="ph codeph">CREATE TABLE AS SELECT</code>,
+ the files are moved from a temporary staging directory to the final destination directory.)
+ Because S3 does not support a <span class="q">"rename"</span> operation for existing objects, in these cases Impala
+ actually copies the data files from one location to another and then removes the original files.
+ In <span class="keyword">Impala 2.6</span>, the <code class="ph codeph">S3_SKIP_INSERT_STAGING</code> query option provides a way
+ to speed up <code class="ph codeph">INSERT</code> statements for S3 tables and partitions, with the tradeoff
+ that a problem during statement execution could leave data in an inconsistent state.
+ It does not apply to <code class="ph codeph">INSERT OVERWRITE</code> or <code class="ph codeph">LOAD DATA</code> statements.
+ See <a class="xref" href="../shared/../topics/impala_s3_skip_insert_staging.html#s3_skip_insert_staging">S3_SKIP_INSERT_STAGING Query Option (Impala 2.6 or higher only)</a> for details.
+ </p>
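+ <p class="p">
+ For illustration only, a session might turn the option off for a load where consistency matters more
+ than speed, then turn it back on. The table names are hypothetical, and this sketch assumes that a value of
+ <code class="ph codeph">true</code> selects the fast path that skips the staging directory; check the linked
+ topic for the default in your release.
+ </p>
+
+<pre class="pre codeblock"><code>-- Hypothetical table names. Use the staging directory for this INSERT.
+[localhost:21000] > set s3_skip_insert_staging=false;
+[localhost:21000] > insert into sales_s3 select * from sales_hdfs;
+
+-- Return to the faster code path for later INSERT statements in this session.
+[localhost:21000] > set s3_skip_insert_staging=true;
+</code></pre>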
+ </div>
+ </article>
+
+ <article class="topic concept nested2" aria-labelledby="ariaid-title6" id="s3_etl__s3_manual_etl">
+ <h3 class="title topictitle3" id="ariaid-title6">Manually Loading Data into Impala Tables on S3</h3>
+ <div class="body conbody">
+ <p class="p">
+ As an alternative, or on earlier Impala releases without DML support for S3,
+ you can use the Amazon-provided methods to bring data files into S3 for querying through Impala. See
+ <a class="xref" href="http://aws.amazon.com/s3/" target="_blank">the Amazon S3 web site</a> for
+ details.
+ </p>
+
+ <div class="note important note_important"><span class="note__title importanttitle">Important:</span>
+ <div class="p">
+ For best compatibility with the S3 write support in <span class="keyword">Impala 2.6</span>
+ and higher:
+ <ul class="ul">
+ <li class="li">Use native Hadoop techniques to create data files in S3 for querying through Impala.</li>
+ <li class="li">Use the <code class="ph codeph">PURGE</code> clause of <code class="ph codeph">DROP TABLE</code> when dropping internal (managed) tables.</li>
+ </ul>
+ By default, when you drop an internal (managed) table, the data files are
+ moved to the HDFS trashcan. This operation is expensive for tables that
+ reside on the Amazon S3 filesystem. Therefore, for S3 tables, prefer to use
+ <code class="ph codeph">DROP TABLE <var class="keyword varname">table_name</var> PURGE</code> rather than the default <code class="ph codeph">DROP TABLE</code> statement.
+ The <code class="ph codeph">PURGE</code> clause makes Impala delete the data files immediately,
+ skipping the HDFS trashcan.
+ For the <code class="ph codeph">PURGE</code> clause to work effectively, you must originally create the
+ data files on S3 using one of the tools from the Hadoop ecosystem, such as
+ <code class="ph codeph">hadoop fs -cp</code>, or <code class="ph codeph">INSERT</code> in Impala or Hive.
+ </div>
+ </div>
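+ <p class="p">
+ A minimal sketch of the recommended form, using a hypothetical internal table named
+ <code class="ph codeph">sales_s3_copy</code> whose data files reside on S3:
+ </p>
+
+<pre class="pre codeblock"><code>-- Hypothetical table name. PURGE deletes the data files immediately instead of
+-- moving them to the trashcan.
+[localhost:21000] > drop table if exists sales_s3_copy purge;
+</code></pre>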
+
+ <p class="p">
+ Alternative file creation techniques (less compatible with the <code class="ph codeph">PURGE</code> clause) include:
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ The <a class="xref" href="https://console.aws.amazon.com/s3/home" target="_blank">Amazon AWS / S3
+ web interface</a> to upload from a web browser.
+ </li>
+
+ <li class="li">
+ The <a class="xref" href="http://aws.amazon.com/cli/" target="_blank">Amazon AWS CLI</a> to
+ manipulate files from the command line.
+ </li>
+
+ <li class="li">
+ Other S3-enabled software, such as
+ <a class="xref" href="http://s3tools.org/s3cmd" target="_blank">the S3Tools client software</a>.
+ </li>
+ </ul>
+
+ <p class="p">
+ After you upload data files to a location already mapped to an Impala table or partition, or if you delete
+ files in S3 from such a location, issue the <code class="ph codeph">REFRESH <var class="keyword varname">table_name</var></code>
+ statement to make Impala aware of the new set of data files.
+ </p>
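+ <p class="p">
+ As a sketch of that workflow, the following session uploads a file with the AWS CLI and then refreshes
+ the table. The local file name <code class="ph codeph">more_cities.dat</code> is hypothetical and is assumed
+ to be in the same file format as the table; the table <code class="ph codeph">usa_cities_s3</code> is assumed
+ to have its <code class="ph codeph">LOCATION</code> set to <code class="ph codeph">s3a://impala-demo/usa_cities</code>.
+ </p>
+
+<pre class="pre codeblock"><code>-- Hypothetical file name; the AWS CLI itself uses the s3:// prefix.
+[localhost:21000] > !aws s3 cp more_cities.dat s3://impala-demo/usa_cities/;
+
+-- Make Impala aware of the new data file.
+[localhost:21000] > refresh usa_cities_s3;
+</code></pre>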
+
+ </div>
+ </article>
+
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title7" id="s3__s3_ddl">
+
+ <h2 class="title topictitle2" id="ariaid-title7">Creating Impala Databases, Tables, and Partitions for Data Stored on S3</h2>
+
+
+ <div class="body conbody">
+
+ <p class="p">
+ Impala reads data for a table or partition from S3 based on the <code class="ph codeph">LOCATION</code> attribute for the
+ table or partition. Specify the S3 details in the <code class="ph codeph">LOCATION</code> clause of a <code class="ph codeph">CREATE
+ TABLE</code> or <code class="ph codeph">ALTER TABLE</code> statement. The notation for the <code class="ph codeph">LOCATION</code>
+ clause is <code class="ph codeph">s3a://<var class="keyword varname">bucket_name</var>/<var class="keyword varname">path/to/file</var></code>. The
+ filesystem prefix is always <code class="ph codeph">s3a://</code> because Impala does not support the <code class="ph codeph">s3://</code> or
+ <code class="ph codeph">s3n://</code> prefixes.
+ </p>
+
+ <p class="p">
+ For a partitioned table, either specify a separate <code class="ph codeph">LOCATION</code> clause for each new partition,
+ or specify a base <code class="ph codeph">LOCATION</code> for the table and set up a directory structure in S3 to mirror
+ the way Impala partitioned tables are structured in HDFS. Although, strictly speaking, S3 filenames do not
+ have directory paths, Impala treats S3 filenames with <code class="ph codeph">/</code> characters the same as HDFS
+ pathnames that include directories.
+ </p>
+
+ <p class="p">
+ You point a nonpartitioned table or an individual partition at S3 by specifying a single directory
+ path in S3, which could be any arbitrary directory. To replicate the structure of an entire Impala
+ partitioned table or database in S3 requires more care, with directories and subdirectories nested and
+ named to match the equivalent directory tree in HDFS. Consider setting up an empty staging area if
+ necessary in HDFS, and recording the complete directory structure so that you can replicate it in S3.
+
+ </p>
+
+ <p class="p">
+ For convenience when working with multiple tables with data files stored in S3, you can create a database
+ with a <code class="ph codeph">LOCATION</code> attribute pointing to an S3 path.
+ Specify a URL of the form <code class="ph codeph">s3a://<var class="keyword varname">bucket</var>/<var class="keyword varname">root/path/for/database</var></code>
+ for the <code class="ph codeph">LOCATION</code> attribute of the database.
+ For any table created inside that database, Impala
+ automatically creates a directory underneath the one specified by the database
+ <code class="ph codeph">LOCATION</code> attribute.
+ </p>
+
+ <p class="p">
+ For example, the following session creates a partitioned table where only a single partition resides on S3.
+ The partitions for years 2013 and 2014 are located on HDFS. The partition for year 2015 includes a
+ <code class="ph codeph">LOCATION</code> attribute with an <code class="ph codeph">s3a://</code> URL, and so refers to data residing on
+ S3, under a specific path underneath the bucket <code class="ph codeph">impala-demo</code>.
+ </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > create database db_on_hdfs;
+[localhost:21000] > use db_on_hdfs;
+[localhost:21000] > create table mostly_on_hdfs (x int) partitioned by (year int);
+[localhost:21000] > alter table mostly_on_hdfs add partition (year=2013);
+[localhost:21000] > alter table mostly_on_hdfs add partition (year=2014);
+[localhost:21000] > alter table mostly_on_hdfs add partition (year=2015)
+ > location 's3a://impala-demo/dir1/dir2/dir3/t1';
+</code></pre>
+
+ <p class="p">
+ The following session creates a database and two partitioned tables residing entirely on S3, one
+ partitioned by a single column and the other partitioned by multiple columns. Because a
+ <code class="ph codeph">LOCATION</code> attribute with an <code class="ph codeph">s3a://</code> URL is specified for the database, the
+ tables inside that database are automatically created on S3 underneath the database directory. To see the
+ names of the associated subdirectories, including the partition key values, we use an S3 client tool to
+ examine how the directory structure is organized on S3. For example, Impala partition directories such as
+ <code class="ph codeph">month=1</code> do not include leading zeroes, which sometimes appear in partition directories created
+ through Hive.
+ </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > create database db_on_s3 location 's3a://impala-demo/dir1/dir2/dir3';
+[localhost:21000] > use db_on_s3;
+
+[localhost:21000] > create table partitioned_on_s3 (x int) partitioned by (year int);
+[localhost:21000] > alter table partitioned_on_s3 add partition (year=2013);
+[localhost:21000] > alter table partitioned_on_s3 add partition (year=2014);
+[localhost:21000] > alter table partitioned_on_s3 add partition (year=2015);
+
+[localhost:21000] > !aws s3 ls s3://impala-demo/dir1/dir2/dir3 --recursive;
+2015-03-17 13:56:34 0 dir1/dir2/dir3/
+2015-03-17 16:43:28 0 dir1/dir2/dir3/partitioned_on_s3/
+2015-03-17 16:43:49 0 dir1/dir2/dir3/partitioned_on_s3/year=2013/
+2015-03-17 16:43:53 0 dir1/dir2/dir3/partitioned_on_s3/year=2014/
+2015-03-17 16:43:58 0 dir1/dir2/dir3/partitioned_on_s3/year=2015/
+
+[localhost:21000] > create table partitioned_multiple_keys (x int)
+ > partitioned by (year smallint, month tinyint, day tinyint);
+[localhost:21000] > alter table partitioned_multiple_keys
+ > add partition (year=2015,month=1,day=1);
+[localhost:21000] > alter table partitioned_multiple_keys
+ > add partition (year=2015,month=1,day=31);
+[localhost:21000] > alter table partitioned_multiple_keys
+ > add partition (year=2015,month=2,day=28);
+
+[localhost:21000] > !aws s3 ls s3://impala-demo/dir1/dir2/dir3 --recursive;
+2015-03-17 13:56:34 0 dir1/dir2/dir3/
+2015-03-17 16:47:13 0 dir1/dir2/dir3/partitioned_multiple_keys/
+2015-03-17 16:47:44 0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=1/day=1/
+2015-03-17 16:47:50 0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=1/day=31/
+2015-03-17 16:47:57 0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=2/day=28/
+2015-03-17 16:43:28 0 dir1/dir2/dir3/partitioned_on_s3/
+2015-03-17 16:43:49 0 dir1/dir2/dir3/partitioned_on_s3/year=2013/
+2015-03-17 16:43:53 0 dir1/dir2/dir3/partitioned_on_s3/year=2014/
+2015-03-17 16:43:58 0 dir1/dir2/dir3/partitioned_on_s3/year=2015/
+</code></pre>
+
+ <p class="p">
+ The <code class="ph codeph">CREATE DATABASE</code> and <code class="ph codeph">CREATE TABLE</code> statements create the associated
+ directory paths if they do not already exist. You can specify multiple levels of directories, and the
+ <code class="ph codeph">CREATE</code> statement creates all appropriate levels, similar to using <code class="ph codeph">mkdir
+ -p</code>.
+ </p>
+
+ <p class="p">
+ Use the standard S3 file upload methods to actually put the data files into the right locations. You can
+ also put the directory paths and data files in place before creating the associated Impala databases or
+ tables, and Impala automatically uses the data from the appropriate location after the associated databases
+ and tables are created.
+ </p>
+
+ <p class="p">
+ You can switch whether an existing table or partition points to data in HDFS or S3. For example, if you
+ have an Impala table or partition pointing to data files in HDFS or S3, and you later transfer those data
+ files to the other filesystem, use an <code class="ph codeph">ALTER TABLE</code> statement to adjust the
+ <code class="ph codeph">LOCATION</code> attribute of the corresponding table or partition to reflect that change. Because
+ Impala does not have an <code class="ph codeph">ALTER DATABASE</code> statement, this location-switching technique is not
+ practical for entire databases that have a custom <code class="ph codeph">LOCATION</code> attribute.
+ </p>
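+ <p class="p">
+ For example, assuming the data files for the <code class="ph codeph">year=2014</code> partition of the
+ <code class="ph codeph">mostly_on_hdfs</code> table have already been copied to S3, a statement along these
+ lines repoints that partition. The S3 path shown is hypothetical.
+ </p>
+
+<pre class="pre codeblock"><code>-- Hypothetical S3 path; the data files must already be in place there.
+[localhost:21000] > alter table mostly_on_hdfs partition (year=2014)
+                  > set location 's3a://impala-demo/dir1/dir2/dir3/t1_2014';
+</code></pre>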
+
+ </div>
+
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title8" id="s3__s3_internal_external">
+
+ <h2 class="title topictitle2" id="ariaid-title8">Internal and External Tables Located on S3</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ Just as with tables located on HDFS storage, you can designate S3-based tables as either internal (managed
+ by Impala) or external, by using the syntax <code class="ph codeph">CREATE TABLE</code> or <code class="ph codeph">CREATE EXTERNAL
+ TABLE</code> respectively. When you drop an internal table, the files associated with the table are
+ removed, even if they are on S3 storage. When you drop an external table, the files associated with the
+ table are left alone, and are still available for access by other tools or components. See
+ <a class="xref" href="impala_tables.html#tables">Overview of Impala Tables</a> for details.
+ </p>
+
+ <p class="p">
+ If the data on S3 is intended to be long-lived and accessed by other tools in addition to Impala, create
+ any associated S3 tables with the <code class="ph codeph">CREATE EXTERNAL TABLE</code> syntax, so that the files are not
+ deleted from S3 when the table is dropped.
+ </p>
+
+ <p class="p">
+ If the data on S3 is only needed for querying by Impala and can be safely discarded once the Impala
+ workflow is complete, create the associated S3 tables using the <code class="ph codeph">CREATE TABLE</code> syntax, so
+ that dropping the table also deletes the corresponding data files on S3.
+ </p>
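+ <p class="p">
+ The difference can be sketched as follows. The table name <code class="ph codeph">usa_cities_shared</code>
+ and the S3 path are hypothetical; the point is only that dropping the external table leaves the files on S3
+ in place.
+ </p>
+
+<pre class="pre codeblock"><code>-- Hypothetical table name and path. EXTERNAL means Impala does not own the data files.
+[localhost:21000] > create external table usa_cities_shared like usa_cities
+                  > location 's3a://impala-demo/shared/usa_cities';
+
+-- Removes only the table definition; the files under the S3 path remain.
+[localhost:21000] > drop table usa_cities_shared;
+</code></pre>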
+
+ <p class="p">
+ For example, this session creates a table in S3 with the same column layout as a table in HDFS, then
+ examines the S3 table and queries some data from it. The table in S3 works the same as a table in HDFS as
+ far as the expected file format of the data, table and column statistics, and other table properties. The
+ only indication that it is not an HDFS table is the <code class="ph codeph">s3a://</code> URL in the
+ <code class="ph codeph">LOCATION</code> property. Many data files can reside in the S3 directory, and their combined
+ contents form the table data. Because the data in this example is uploaded after the table is created, a
+ <code class="ph codeph">REFRESH</code> statement prompts Impala to update its cached information about the data files.
+ </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > create table usa_cities_s3 like usa_cities location 's3a://impala-demo/usa_cities';
+[localhost:21000] > desc usa_cities_s3;
++-------+----------+---------+
+| name | type | comment |
++-------+----------+---------+
+| id | smallint | |
+| city | string | |
+| state | string | |
++-------+----------+---------+
+
+-- Now from a web browser, upload the same data file(s) to S3 as in the HDFS table,
+-- under the relevant bucket and path. If you already have the data in S3, you would
+-- point the table LOCATION at an existing path.
+
+[localhost:21000] > refresh usa_cities_s3;
+[localhost:21000] > select count(*) from usa_cities_s3;
++----------+
+| count(*) |
++----------+
+| 289 |
++----------+
+[localhost:21000] > select distinct state from usa_cities_s3 limit 5;
++----------------------+
+| state |
++----------------------+
+| Louisiana |
+| Minnesota |
+| Georgia |
+| Alaska |
+| Ohio |
++----------------------+
+[localhost:21000] > desc formatted usa_cities_s3;
++------------------------------+------------------------------+---------+
+| name | type | comment |
++------------------------------+------------------------------+---------+
+| # col_name | data_type | comment |
+| | NULL | NULL |
+| id | smallint | NULL |
+| city | string | NULL |
+| state | string | NULL |
+| | NULL | NULL |
+| # Detailed Table Information | NULL | NULL |
+| Database: | s3_testing | NULL |
+| Owner: | jrussell | NULL |
+| CreateTime: | Mon Mar 16 11:36:25 PDT 2015 | NULL |
+| LastAccessTime: | UNKNOWN | NULL |
+| Protect Mode: | None | NULL |
+| Retention: | 0 | NULL |
+| Location: | s3a://impala-demo/usa_cities | NULL |
+| Table Type: | MANAGED_TABLE | NULL |
+...
++------------------------------+------------------------------+---------+
+</code></pre>
+
+
+
+ <p class="p">
+ In this case, we have already uploaded a Parquet file with a million rows of data to the
+ <code class="ph codeph">sample_data</code> directory underneath the <code class="ph codeph">impala-demo</code> bucket on S3. This
+ session creates a table with matching column settings pointing to the corresponding location in S3, then
+ queries the table. Because the data is already in place on S3 when the table is created, no
+ <code class="ph codeph">REFRESH</code> statement is required.
+ </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > create table sample_data_s3
+ > (id bigint, val int, zerofill string,
+ > name string, assertion boolean, city string, state string)
+ > stored as parquet location 's3a://impala-demo/sample_data';
+[localhost:21000] > select count(*) from sample_data_s3;
++----------+
+| count(*) |
++----------+
+| 1000000 |
++----------+
+[localhost:21000] > select count(*) howmany, assertion from sample_data_s3 group by assertion;
++---------+-----------+
+| howmany | assertion |
++---------+-----------+
+| 667149 | true |
+| 332851 | false |
++---------+-----------+
+</code></pre>
+
+ </div>
+
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title9" id="s3__s3_queries">
+
+ <h2 class="title topictitle2" id="ariaid-title9">Running and Tuning Impala Queries for Data Stored on S3</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ Once the appropriate <code class="ph codeph">LOCATION</code> attributes are set up at the table or partition level, you
+ query data stored in S3 exactly the same as data stored on HDFS or in HBase:
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ Queries against S3 data support all the same file formats as for HDFS data.
+ </li>
+
+ <li class="li">
+ Tables can be unpartitioned or partitioned. For partitioned tables, either manually construct paths in S3
+ corresponding to the HDFS directories representing partition key values, or use <code class="ph codeph">ALTER TABLE ...
+ ADD PARTITION</code> to set up the appropriate paths in S3.
+ </li>
+
+ <li class="li">
+ HDFS and HBase tables can be joined to S3 tables, or S3 tables can be joined with each other.
+ </li>
+
+ <li class="li">
+ Authorization using the Sentry framework to control access to databases, tables, or columns works the
+ same whether the data is in HDFS or in S3.
+ </li>
+
+ <li class="li">
+ The <span class="keyword cmdname">catalogd</span> daemon caches metadata for both HDFS and S3 tables. Use
+ <code class="ph codeph">REFRESH</code> and <code class="ph codeph">INVALIDATE METADATA</code> for S3 tables in the same situations
+ where you would issue those statements for HDFS tables.
+ </li>
+
+ <li class="li">
+ Queries against S3 tables are subject to the same kinds of admission control and resource management as
+ HDFS tables.
+ </li>
+
+ <li class="li">
+ Metadata about S3 tables is stored in the same metastore database as for HDFS tables.
+ </li>
+
+ <li class="li">
+ You can set up views referring to S3 tables, the same as for HDFS tables.
+ </li>
+
+ <li class="li">
+ The <code class="ph codeph">COMPUTE STATS</code>, <code class="ph codeph">SHOW TABLE STATS</code>, and <code class="ph codeph">SHOW COLUMN
+ STATS</code> statements work for S3 tables also.
+ </li>
+ </ul>
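+ <p class="p">
+ As a brief illustration of several of these points, the following sketch gathers statistics for an S3 table
+ and joins it to an HDFS table. It assumes the <code class="ph codeph">usa_cities</code> (HDFS) and
+ <code class="ph codeph">usa_cities_s3</code> (S3) tables from the earlier example; output is omitted.
+ </p>
+
+<pre class="pre codeblock"><code>-- Statistics statements work the same way for the S3 table.
+[localhost:21000] > compute stats usa_cities_s3;
+[localhost:21000] > show table stats usa_cities_s3;
+
+-- Join an HDFS table to an S3 table.
+[localhost:21000] > select count(*) from usa_cities h join usa_cities_s3 s on h.id = s.id;
+</code></pre>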
+
+ </div>
+
+ <article class="topic concept nested2" aria-labelledby="ariaid-title10" id="s3_queries__s3_performance">
+
+ <h3 class="title topictitle3" id="ariaid-title10">Understanding and Tuning Impala Query Performance for S3 Data</h3>
+
+
+ <div class="body conbody">
+
+ <p class="p">
+ Although Impala queries for data stored in S3 might be less performant than queries against the
+ equivalent data stored in HDFS, you can still do some tuning. Here are techniques you can use to
+ interpret explain plans and profiles for queries against S3 data, and tips to achieve the best
+ performance possible for such queries.
+ </p>
+
+ <p class="p">
+ All else being equal, performance is expected to be lower for queries running against data on S3 rather
+ than HDFS. The actual mechanics of the <code class="ph codeph">SELECT</code> statement are somewhat different when the
+ data is in S3. Although the work is still distributed across the datanodes of the cluster, Impala might
+ parallelize the work for a distributed query differently for data on HDFS and S3. S3 does not have the
+ same block notion as HDFS, so Impala uses heuristics to determine how to split up large S3 files for
+ processing in parallel. Because all hosts can access any S3 data file with equal efficiency, the
+ distribution of work might be different than for HDFS data, where the data blocks are physically read
+ using short-circuit local reads by hosts that contain the appropriate block replicas. Although the I/O to
+ read the S3 data might be spread evenly across the hosts of the cluster, the fact that all data is
+ initially retrieved across the network means that the overall query performance is likely to be lower for
+ S3 data than for HDFS data.
+ </p>
+
+ <p class="p">
+ In <span class="keyword">Impala 2.6</span> and higher, Impala queries are optimized for files stored in Amazon S3.
+ For Impala tables that use the file formats Parquet, RCFile, SequenceFile,
+ Avro, and uncompressed text, the setting <code class="ph codeph">fs.s3a.block.size</code>
+ in the <span class="ph filepath">core-site.xml</span> configuration file determines
+ how Impala divides the I/O work of reading the data files. This configuration
+ setting is specified in bytes. By default, this
+ value is 33554432 (32 MB), meaning that Impala parallelizes S3 read operations on the files
+ as if they were made up of 32 MB blocks. For example, if your S3 queries primarily access
+ Parquet files written by MapReduce or Hive, increase <code class="ph codeph">fs.s3a.block.size</code>
+ to 134217728 (128 MB) to match the row group size of those files. If most S3 queries involve
+ Parquet files written by Impala, increase <code class="ph codeph">fs.s3a.block.size</code>
+ to 268435456 (256 MB) to match the row group size produced by Impala.
+ </p>
+
+ <p class="p">
+ Because of differences between S3 and traditional filesystems, DML operations
+ for S3 tables can take longer than for tables on HDFS. For example, both the
+ <code class="ph codeph">LOAD DATA</code> statement and the final stage of the <code class="ph codeph">INSERT</code>
+ and <code class="ph codeph">CREATE TABLE AS SELECT</code> statements involve moving files from one directory
+ to another. (In the case of <code class="ph codeph">INSERT</code> and <code class="ph codeph">CREATE TABLE AS SELECT</code>,
+ the files are moved from a temporary staging directory to the final destination directory.)
+ Because S3 does not support a <span class="q">"rename"</span> operation for existing objects, in these cases Impala
+ actually copies the data files from one location to another and then removes the original files.
+ In <span class="keyword">Impala 2.6</span> and higher, the <code class="ph codeph">S3_SKIP_INSERT_STAGING</code> query option provides a way
+ to speed up <code class="ph codeph">INSERT</code> statements for S3 tables and partitions, with the tradeoff
+ that a problem during statement execution could leave data in an inconsistent state.
+ It does not apply to <code class="ph codeph">INSERT OVERWRITE</code> or <code class="ph codeph">LOAD DATA</code> statements.
+ See <a class="xref" href="../shared/../topics/impala_s3_skip_insert_staging.html#s3_skip_insert_staging">S3_SKIP_INSERT_STAGING Query Option (Impala 2.6 or higher only)</a> for details.
+ </p>
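+
+ <p class="p">
+ As a minimal sketch of that tradeoff, the following statements turn the option off for the current session
+ before a regular <code class="ph codeph">INSERT</code>, so that Impala writes through the staging directory
+ at the cost of a slower statement. The table names <code class="ph codeph">s3_sales</code> and
+ <code class="ph codeph">staging_sales</code> are hypothetical and used only for illustration.
+ </p>
+
+<pre class="pre codeblock"><code>-- Hypothetical tables; s3_sales is assumed to be located on S3.
+-- Favor consistency over speed for this session.
+SET S3_SKIP_INSERT_STAGING=false;
+INSERT INTO s3_sales SELECT * FROM staging_sales;
+</code></pre>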
+
+ <p class="p">
+ When optimizing aspects of complex queries, such as the join order, Impala treats tables on HDFS and
+ S3 the same way. Therefore, follow all the same tuning recommendations for S3 tables as for HDFS ones,
+ such as using the <code class="ph codeph">COMPUTE STATS</code> statement to help Impala construct accurate estimates of
+ row counts and cardinality. See <a class="xref" href="impala_performance.html#performance">Tuning Impala for Performance</a> for details.
+ </p>
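+
+ <p class="p">
+ For example, assuming a hypothetical S3-backed table named <code class="ph codeph">s3_sales</code>, statistics
+ can be gathered and inspected with the same statements used for an HDFS table. This is only a sketch;
+ the table name is an assumption.
+ </p>
+
+<pre class="pre codeblock"><code>-- s3_sales is a hypothetical table name.
+COMPUTE STATS s3_sales;
+-- Check that row counts and column statistics are now populated.
+SHOW TABLE STATS s3_sales;
+SHOW COLUMN STATS s3_sales;
+</code></pre>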
+
+ <p class="p">
+ In query profile reports, the numbers for <code class="ph codeph">BytesReadLocal</code>,
+ <code class="ph codeph">BytesReadShortCircuit</code>, <code class="ph codeph">BytesReadDataNodeCached</code>, and
+ <code class="ph codeph">BytesReadRemoteUnexpected</code> are blank because those metrics come from HDFS.
+ If you do see any indications that a query against an S3 table performed <span class="q">"remote read"</span>
+ operations, do not be alarmed. That is expected because, by definition, all the I/O for S3 tables involves
+ remote reads.
+ </p>
+
+ </div>
+
+ </article>
+
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title11" id="s3__s3_restrictions">
+
+ <h2 class="title topictitle2" id="ariaid-title11">Restrictions on Impala Support for S3</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ Impala requires that the default filesystem for the cluster be HDFS. You cannot use S3 as the only
+ filesystem in the cluster.
+ </p>
+
+ <p class="p">
+ Prior to <span class="keyword">Impala 2.6</span>, Impala could not perform DML operations (<code class="ph codeph">INSERT</code>,
+ <code class="ph codeph">LOAD DATA</code>, or <code class="ph codeph">CREATE TABLE AS SELECT</code>) where the destination is a table
+ or partition located on an S3 filesystem. This restriction is lifted in <span class="keyword">Impala 2.6</span> and higher.
+ </p>
+
+ <p class="p">
+ Impala does not support the old <code class="ph codeph">s3://</code> block-based and <code class="ph codeph">s3n://</code> filesystem
+ schemes, only <code class="ph codeph">s3a://</code>.
+ </p>
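+
+ <p class="p">
+ As an illustration, a table definition that points at S3 might look like the following sketch.
+ The table name, column list, and bucket path are hypothetical; only the <code class="ph codeph">s3a://</code>
+ prefix is taken from the restriction above.
+ </p>
+
+<pre class="pre codeblock"><code>-- Hypothetical table and bucket; note the fully qualified s3a:// location.
+CREATE EXTERNAL TABLE s3_logs (id BIGINT, msg STRING)
+  STORED AS PARQUET
+  LOCATION 's3a://example-bucket/logs/';
+</code></pre>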
+
+ <p class="p">
+ Although S3 is often used to store JSON-formatted data, the current Impala support for S3 does not include
+ directly querying JSON data. For Impala queries, use data files in one of the file formats listed in
+ <a class="xref" href="impala_file_formats.html#file_formats">How Impala Works with Hadoop File Formats</a>. If you have data in JSON format, you can prepare a
+ flattened version of that data for querying by Impala as part of your ETL cycle.
+ </p>
+
+ <p class="p">
+ You cannot use the <code class="ph codeph">ALTER TABLE ... SET CACHED</code> statement for tables or partitions that are
+ located in S3.
+ </p>
+
+ </div>
+
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title12" id="s3__s3_best_practices">
+ <h2 class="title topictitle2" id="ariaid-title12">Best Practices for Using Impala with S3</h2>
+
+ <div class="body conbody">
+ <p class="p">
+ The following guidelines represent best practices derived from testing and field experience with Impala on S3:
+ </p>
+ <ul class="ul">
+ <li class="li">
+ <p class="p">
+ Any reference to an S3 location must be fully qualified. (This rule applies when
+ S3 is not designated as the default filesystem.)
+ </p>
+ </li>
+ <li class="li">
+ <p class="p">
+ Set the safety valve <code class="ph codeph">fs.s3a.connection.maximum</code> to 1500 for <span class="keyword cmdname">impalad</span>.
+ </p>
+ </li>
+ <li class="li">
+ <p class="p">
+ Set the safety valve <code class="ph codeph">fs.s3a.block.size</code> to 134217728
+ (128 MB in bytes) if most Parquet files queried by Impala were written by Hive
+ or ParquetMR jobs. Set the block size to 268435456 (256 MB in bytes) if most Parquet
+ files queried by Impala were written by Impala.
+ </p>
+ </li>
+ <li class="li">
+ <p class="p">
+ <code class="ph codeph">DROP TABLE ... PURGE</code> is much faster than the default <code class="ph codeph">DROP TABLE</code>.
+ The same applies to <code class="ph codeph">ALTER TABLE ... DROP PARTITION PURGE</code>
+ versus the default <code class="ph codeph">DROP PARTITION</code> operation.
+ However, due to the eventually consistent nature of S3, the files for that
+ table or partition could remain for some unbounded time when using <code class="ph codeph">PURGE</code>.
+ The default <code class="ph codeph">DROP TABLE/PARTITION</code> is slow because Impala copies the files to the HDFS trash folder,
+ and Impala waits until all the data is moved. <code class="ph codeph">DROP TABLE/PARTITION ... PURGE</code> is a
+ fast delete operation, and the Impala statement finishes quickly even though the change might not
+ have propagated fully throughout S3.
+ </p>
+ </li>
+ <li class="li">
+ <p class="p">
+ <code class="ph codeph">INSERT</code> statements are faster than <code class="ph codeph">INSERT OVERWRITE</code> for S3.
+ The query option <code class="ph codeph">S3_SKIP_INSERT_STAGING</code>, which is set to <code class="ph codeph">true</code> by default,
+ skips the staging step for regular <code class="ph codeph">INSERT</code> (but not <code class="ph codeph">INSERT OVERWRITE</code>).
+ This makes the operation much faster, but consistency is not guaranteed: if a node fails during execution, the
+ table could end up with inconsistent data. Set this option to <code class="ph codeph">false</code> if stronger
+ consistency is required; however, this setting makes <code class="ph codeph">INSERT</code> operations slower.
+ </p>
+ </li>
+ <li class="li">
+ <p class="p">
+ Too many files in a table can make metadata loading and updating slow on S3.
+ If too many requests are made to S3, S3 has a back-off mechanism and
+ responds more slowly than usual. You might have many small files because of:
+ </p>
+ <ul class="ul">
+ <li class="li">
+ <p class="p">
+ Too many partitions due to over-granular partitioning. Prefer partitions with
+ many megabytes of data, so that even a query against a single partition can
+ be parallelized effectively.
+ </p>
+ </li>
+ <li class="li">
+ <p class="p">
+ Many small <code class="ph codeph">INSERT</code> queries. Prefer bulk
+ <code class="ph codeph">INSERT</code>s so that more data is written to fewer
+ files.
+ </p>
+ </li>
+ </ul>
+ </li>
+ </ul>
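+
+ <p class="p">
+ The <code class="ph codeph">PURGE</code> guidance in the list above can be sketched as follows. The table
+ names and the partition key column <code class="ph codeph">year</code> are hypothetical.
+ </p>
+
+<pre class="pre codeblock"><code>-- Hypothetical tables located on S3.
+-- Skip the copy to the HDFS trash folder when dropping a whole table.
+DROP TABLE s3_sales_old PURGE;
+-- The same clause applies when dropping a single partition.
+ALTER TABLE s3_sales DROP PARTITION (year=2016) PURGE;
+</code></pre>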
+
+ </div>
+ </article>
+
+
+</article></main></body></html>
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_s3_skip_insert_staging.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_s3_skip_insert_staging.html b/docs/build/html/topics/impala_s3_skip_insert_staging.html
new file mode 100644
index 0000000..53cf4e9
--- /dev/null
+++ b/docs/build/html/topics/impala_s3_skip_insert_staging.html
@@ -0,0 +1,78 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="s3_skip_insert_staging"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>S3_SKIP_INSERT_STAGING Query Option (Impala 2.6 or higher only)</title></head><body id="s3_skip_insert_staging"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+ <h1 class="title topictitle1" id="ariaid-title1">S3_SKIP_INSERT_STAGING Query Option (<span class="keyword">Impala 2.6</span> or higher only)</h1>
+
+
+
+ <div class="body conbody">
+
+ <p class="p">
+ Speeds up <code class="ph codeph">INSERT</code> operations on tables or partitions residing on the
+ Amazon S3 filesystem. The tradeoff is the possibility of inconsistent data left behind
+ if an error occurs partway through the operation.
+ </p>
+
+ <p class="p">
+ With this option disabled, Impala write operations to S3 tables and partitions involve a two-stage process.
+ Impala writes intermediate files to S3, then (because S3 does not provide a <span class="q">"rename"</span>
+ operation) those intermediate files are copied to their final location, making the process
+ more expensive than on a filesystem that supports renaming or moving files.
+ This query option makes Impala skip the intermediate files, and instead write the
+ new data directly to the final destination.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Usage notes:</strong>
+ </p>
+
+ <div class="note important note_important"><span class="note__title importanttitle">Important:</span>
+ <p class="p">
+ If a host that is participating in the <code class="ph codeph">INSERT</code> operation fails partway through
+ the query, you might be left with a table or partition that contains some but not all of the
+ expected data files. Therefore, this option is most appropriate for a development or test
+ environment where you have the ability to reconstruct the table if a problem during
+ <code class="ph codeph">INSERT</code> leaves the data in an inconsistent state.
+ </p>
+ </div>
+
+ <p class="p">
+ The timing of file deletion during an <code class="ph codeph">INSERT OVERWRITE</code> operation
+ makes it impractical to write new files to S3 and delete the old files in a single operation.
+ Therefore, this query option only affects regular <code class="ph codeph">INSERT</code> statements that add
+ to the existing data in a table, not <code class="ph codeph">INSERT OVERWRITE</code> statements.
+ Use <code class="ph codeph">TRUNCATE TABLE</code> if you need to remove all contents from an S3 table
+ before performing a fast <code class="ph codeph">INSERT</code> with this option enabled.
+ </p>
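+
+ <p class="p">
+ A minimal sketch of that workaround, assuming hypothetical tables named <code class="ph codeph">s3_sales</code>
+ (located on S3) and <code class="ph codeph">staging_sales</code>:
+ </p>
+
+<pre class="pre codeblock"><code>-- Hypothetical tables; this sequence stands in for INSERT OVERWRITE.
+-- Remove the existing contents first.
+TRUNCATE TABLE s3_sales;
+-- Make sure the staging step is skipped, then append the new data.
+SET S3_SKIP_INSERT_STAGING=true;
+INSERT INTO s3_sales SELECT * FROM staging_sales;
+</code></pre>
+
+ <p class="p">
+ Unlike <code class="ph codeph">INSERT OVERWRITE</code>, this sequence is not a single operation, so queries that run
+ between the two statements see an empty table.
+ </p>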
+
+ <p class="p">
+ Performance improvements with this option enabled can be substantial. The speed increase
+ might be more noticeable for non-partitioned tables than for partitioned tables.
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Type:</strong> Boolean; recognized values are 1 and 0, or <code class="ph codeph">true</code> and <code class="ph codeph">false</code>;
+ any other value is interpreted as <code class="ph codeph">false</code>
+ </p>
+ <p class="p">
+ <strong class="ph b">Default:</strong> <code class="ph codeph">true</code> (shown as 1 in output of <code class="ph codeph">SET</code> statement)
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Added in:</strong> <span class="keyword">Impala 2.6.0</span>
+ </p>
+
+ <p class="p">
+ <strong class="ph b">Related information:</strong>
+ </p>
+ <p class="p">
+ <a class="xref" href="impala_s3.html#s3">Using Impala with the Amazon S3 Filesystem</a>
+ </p>
+
+ </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html>
\ No newline at end of file