Posted to commits@impala.apache.org by dh...@apache.org on 2018/05/18 05:11:24 UTC
[02/11] impala git commit: IMPALA-6827: [DOCS] Updated the download link for the tutorial data
IMPALA-6827: [DOCS] Updated the download link for the tutorial data
Updated the link to download the Parquet airline files for tutorial.
Change-Id: I6823d1688169e0a6f09d5b552026bc18a3770828
Reviewed-on: http://gerrit.cloudera.org:8080/10393
Reviewed-by: Michael Brown <mi...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/04add98a
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/04add98a
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/04add98a
Branch: refs/heads/2.x
Commit: 04add98a341f3ae8e1e4e0613c82188eec5fc0d9
Parents: 99e379d
Author: Alex Rodoni <ar...@cloudera.com>
Authored: Mon May 14 13:30:42 2018 -0700
Committer: Impala Public Jenkins <im...@gerrit.cloudera.org>
Committed: Thu May 17 22:03:02 2018 +0000
----------------------------------------------------------------------
docs/topics/impala_tutorial.xml | 1110 ++++++++++++++--------------------
1 file changed, 466 insertions(+), 644 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/impala/blob/04add98a/docs/topics/impala_tutorial.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_tutorial.xml b/docs/topics/impala_tutorial.xml
index 01e53e2..72665ef 100644
--- a/docs/topics/impala_tutorial.xml
+++ b/docs/topics/impala_tutorial.xml
@@ -1484,81 +1484,6 @@ Returned 20 row(s) in 0.38s
Returned 10 row(s) in 0.39s</codeblock>
</conbody>
</concept>
-
- <concept audience="hidden" id="tut_connect">
-
- <title>Connecting to Impala through impala-shell</title>
-
- <conbody>
-
- <p>
- This tutorial provides some tips for connecting to Impala through the <cmdname>impala-shell</cmdname>
- interpreter, and gives usage tips for particular scenarios.
- </p>
-
- <ul>
- <li>
- <p>
- By default, <cmdname>impala-shell</cmdname> with no arguments connects to the current host, port
- 21000.
- </p>
- </li>
-
- <li>
- <p>
- <codeph>-i localhost</codeph> option gives you a shorter command prompt.
- </p>
- </li>
-
- <li>
- <p>
- The <codeph>--quiet</codeph> option suppresses some informational messages, such as the
- <codeph>Query:</codeph> line that echoes each command entered.
- </p>
- </li>
-
- <li>
- <codeph>CONNECT <varname>hostname</varname></codeph> lets you switch to another host (keeping the same
- default port 21000).
- </li>
-
- <li>
- <p>
- You could run <cmdname>impala-shell</cmdname> from a completely different host to keep login accounts
- off the actual Impala nodes.
- </p>
- </li>
-
- <li>
- <p>
- For load balancing, you would connect to different coordinator nodes for concurrent queries.
- </p>
- </li>
-
- <li>
- <p>
- For benchmarking, you would skip pretty printing with the <codeph>-B</codeph> option, or even skip
- all screen output with the <codeph>-o</codeph> option or shell redirection.
- </p>
- </li>
- </ul>
- </conbody>
- </concept>
- </concept>
-
- <concept audience="hidden" id="tut_mem_limit">
-
- <title>Effects of Memory Limits on Impala Queries</title>
- <prolog>
- <metadata>
- <data name="Category" value="Memory"/>
- </metadata>
- </prolog>
-
- <conbody>
-
- <p></p>
- </conbody>
</concept>
<concept id="tut_parquet_schemaless">
@@ -1591,319 +1516,311 @@ for purposes of this exercise, wait until after following the tutorial before ex
a real-life situation where you cannot rely on assumptions and assertions about the ranges and representations of
data values.
</p>
+ </conbody>
+ <concept id="download_hdfs">
+ <title>Download the Data Files into HDFS</title>
-<p>
-We will download Parquet files containing this data from the Ibis blog.
-First, we download and unpack the data files.
-There are 8 files totalling 1.4 GB.
-Each file is less than 256 MB.
-</p>
-
-<codeblock>$ wget -O airlines_parquet.tar.gz https://www.dropbox.com/s/ol9x51tqp6cv4yc/airlines_parquet.tar.gz?dl=0
-...
-Length: 1245204740 (1.2G) [application/octet-stream]
-Saving to: “airlines_parquet.tar.gz”
+<conbody>
+<p> First, we download and unpack the data files. There are 8 files totalling
+ 1.4 GB.</p>
-2015-08-12 17:14:24 (23.6 MB/s) - “airlines_parquet.tar.gz” saved [1245204740/1245204740]
+<codeblock>$ wget -O airlines_parquet.tar.gz https://home.apache.org/~arodoni/airlines_parquet.tar.gz
+$ wget https://home.apache.org/~arodoni/airlines_parquet.tar.gz.sha512
+$ shasum -a 512 -c airlines_parquet.tar.gz.sha512
+airlines_parquet.tar.gz: OK
$ tar xvzf airlines_parquet.tar.gz
-airlines_parquet/
-airlines_parquet/93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq
-airlines_parquet/93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq
-airlines_parquet/93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq
-airlines_parquet/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq
-airlines_parquet/93459d994898a9ba-77674173b331fa97_574780876_data.0.parq
-airlines_parquet/93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq
-airlines_parquet/93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq
-airlines_parquet/93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq
-$ cd airlines_parquet/
-$ du -kch *.parq
-253M 93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq
-65M 93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq
-156M 93459d994898a9ba-77674173b331fa97_574780876_data.0.parq
-240M 93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq
-253M 93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq
-16M 93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq
-177M 93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq
-213M 93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq
-1.4G total
-</codeblock>
-<p>
-Next, we put the Parquet data files in HDFS, all together in a single directory,
-with permissions on the directory and the files so that the <codeph>impala</codeph>
-user will be able to read them.
-</p>
-
-<note>
-After unpacking, we saw the largest Parquet file was 253 MB.
-When copying Parquet files into HDFS for Impala to use,
-for maximum query performance, make sure that each file resides in a single HDFS data block.
-Therefore, we pick a size larger than any single file and specify that as the block size, using the argument
-<codeph>-Ddfs.block.size=256m</codeph> on the <codeph>hdfs dfs -put</codeph> command.
-</note>
+$ cd airlines_parquet/
-<codeblock>$ hdfs dfs -mkdir -p hdfs://demo_host.example.com:8020/user/impala/staging/airlines
-$ hdfs dfs -Ddfs.block.size=256m -put *.parq /user/impala/staging/airlines
-$ hdfs dfs -ls /user/impala/staging
+$ du -kch *.parq
+253M 4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq
+14M 4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.1.parq
+253M 4345e5eef217aa1b-c8f16177f35fd984_501176748_data.0.parq
+64M 4345e5eef217aa1b-c8f16177f35fd984_501176748_data.1.parq
+184M 4345e5eef217aa1b-c8f16177f35fd985_1199995767_data.0.parq
+241M 4345e5eef217aa1b-c8f16177f35fd986_2086627597_data.0.parq
+212M 4345e5eef217aa1b-c8f16177f35fd987_1048668565_data.0.parq
+152M 4345e5eef217aa1b-c8f16177f35fd988_1432111844_data.0.parq
+1.4G total</codeblock>
+
+<p> Next, we put the Parquet data files in HDFS, all together in a single
+ directory, with permissions on the directory and the files so that the
+ <codeph>impala</codeph> user will be able to read them.</p>
+ <p>After unpacking, we saw the largest Parquet file was 253 MB. When
+ copying Parquet files into HDFS for Impala to use, for maximum query
+ performance, make sure that each file resides in a single HDFS data
+ block. Therefore, we pick a block size at least as large as any
+ single file, and specify it using the argument
+ <codeph>-Ddfs.block.size=253m</codeph> on the <codeph>hdfs dfs
+ -put</codeph> command. </p>
+
+<codeblock>$ sudo -u hdfs hdfs dfs -mkdir -p /user/impala/staging/airlines
+$ sudo -u hdfs hdfs dfs -Ddfs.block.size=253m -put *.parq /user/impala/staging/airlines
+$ sudo -u hdfs hdfs dfs -ls /user/impala/staging
Found 1 items
-drwxrwxrwx - hdfs supergroup 0 2015-08-12 13:52 /user/impala/staging/airlines
-$ hdfs dfs -ls hdfs://demo_host.example.com:8020/user/impala/staging/airlines
+
+$ sudo -u hdfs hdfs dfs -ls /user/impala/staging/airlines
Found 8 items
--rw-r--r-- 3 jrussell supergroup 265107489 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq
--rw-r--r-- 3 jrussell supergroup 67544715 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq
--rw-r--r-- 3 jrussell supergroup 162556490 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa97_574780876_data.0.parq
--rw-r--r-- 3 jrussell supergroup 251603518 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq
--rw-r--r-- 3 jrussell supergroup 265186603 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq
--rw-r--r-- 3 jrussell supergroup 16663754 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq
--rw-r--r-- 3 jrussell supergroup 185511677 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq
--rw-r--r-- 3 jrussell supergroup 222794621 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq
</codeblock>
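+<p> A note on permissions: because the commands above run as the
+  <codeph>hdfs</codeph> superuser, the uploaded files are owned by
+  <codeph>hdfs</codeph>. If the <codeph>impala</codeph> user cannot read
+  them on your cluster, one possible fix (a sketch; ownership and modes
+  vary by cluster) is:</p>
+
+<codeblock>$ sudo -u hdfs hdfs dfs -chown -R impala /user/impala/staging
+$ sudo -u hdfs hdfs dfs -chmod -R o+rx /user/impala/staging</codeblock>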
+</conbody>
+ </concept>
-<p>
-With the files in an accessible location in HDFS, we create a database table that uses the data in those files.
-The <codeph>CREATE EXTERNAL</codeph> syntax and the <codeph>LOCATION</codeph> attribute point Impala at the appropriate HDFS directory.
-The <codeph>LIKE PARQUET '<varname>path_to_any_parquet_file</varname>'</codeph> clause means we skip the list of column names and types;
-Impala automatically gets the column names and data types straight from the data files.
-(Currently, this technique only works for Parquet files.)
-We ignore the warning about lack of <codeph>READ_WRITE</codeph> access to the files in HDFS;
-the <codeph>impala</codeph> user can read the files, which will be sufficient for us to experiment with
-queries and perform some copy and transform operations into other tables.
-</p>
+ <concept id="create_tables">
+ <title>Create Database and Tables</title>
+ <conbody>
-<codeblock>$ impala-shell -i localhost
-Starting Impala Shell without Kerberos authentication
+<p> With the files in an accessible location in HDFS, we create a database
+ table that uses the data in those files:<ul>
+ <li>The <codeph>CREATE EXTERNAL</codeph> syntax and the
+ <codeph>LOCATION</codeph> attribute point Impala at the
+ appropriate HDFS directory.</li>
+ <li>The <codeph>LIKE PARQUET
+ '<varname>path_to_any_parquet_file</varname>'</codeph> clause
+ means we skip the list of column names and types; Impala
+ automatically gets the column names and data types straight from
+ the data files. (Currently, this technique only works for Parquet
+ files.) </li>
+ <li>Ignore the warning about lack of <codeph>READ_WRITE</codeph>
+ access to the files in HDFS; the <codeph>impala</codeph> user can
+ read the files, which will be sufficient for us to experiment with
+ queries and perform some copy and transform operations into other
+ tables. </li>
+ </ul></p>
-Connected to localhost:21000
-<ph conref="../shared/ImpalaVariables.xml#impala_vars/ImpaladBanner"/>
-Welcome to the Impala shell. Press TAB twice to see a list of available commands.
-...
-<ph conref="../shared/ImpalaVariables.xml#impala_vars/ShellBanner"/>
-[localhost:21000] > create database airline_data;
-[localhost:21000] > use airline_data;
-[localhost:21000] > create external table airlines_external
- > like parquet 'hdfs://demo_host.example.com:8020/user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq'
- > stored as parquet location 'hdfs://demo_host.example.com:8020/user/impala/staging/airlines';
-WARNINGS: Impala does not have READ_WRITE access to path 'hdfs://demo_host.example.com:8020/user/impala/staging'
+<codeblock>$ impala-shell
+> CREATE DATABASE airlines_data;
+ USE airlines_data;
+ CREATE EXTERNAL TABLE airlines_external
+  LIKE PARQUET '/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq'
+  STORED AS PARQUET LOCATION '/user/impala/staging/airlines';
+WARNINGS: Impala does not have READ_WRITE access to path 'hdfs://myhost.com:8020/user/impala/staging'
</codeblock>
+ </conbody>
+ </concept>
+ <concept id="examine_schema ">
+ <title>Examine Physical and Logical Schema</title>
+ <conbody>
-<p>
-With the table created, we examine its physical and logical characteristics to confirm that the data is really
-there and in a format and shape that we can work with.
-The <codeph>SHOW TABLE STATS</codeph> statement gives a very high-level summary of the table,
-showing how many files and how much total data it contains.
-Also, it confirms that the table is expecting all the associated data files to be in Parquet format.
-(The ability to work with all kinds of HDFS data files in different formats means that it is
-possible to have a mismatch between the format of the data files, and the format
-that the table expects the data files to be in.)
-The <codeph>SHOW FILES</codeph> statement confirms that the data in the table has the expected number,
-names, and sizes of the original Parquet files.
-The <codeph>DESCRIBE</codeph> statement (or its abbreviation <codeph>DESC</codeph>) confirms the names and types
-of the columns that Impala automatically created after reading that metadata from the Parquet file.
-The <codeph>DESCRIBE FORMATTED</codeph> statement prints out some extra detail along with the column definitions;
-the pieces we care about for this exercise are the containing database for the table,
-the location of the associated data files in HDFS, the fact that it's an external table so Impala will not
-delete the HDFS files when we finish the experiments and drop the table, and the fact that the
-table is set up to work exclusively with files in the Parquet format.
-</p>
-
-<codeblock>[localhost:21000] > show table stats airlines_external;
+<p> With the table created, we examine its physical and logical characteristics
+ to confirm that the data is really there and in a format and shape
+ that we can work with. <ul>
+ <li>The <codeph>SHOW TABLE STATS</codeph> statement gives a very
+ high-level summary of the table, showing how many files and how
+ much total data it contains. Also, it confirms that the table is
+ expecting all the associated data files to be in Parquet format.
+ (The ability to work with all kinds of HDFS data files in
+ different formats means that it is possible to have a mismatch
+ between the format of the data files, and the format that the
+ table expects the data files to be in.) </li>
+ <li>The <codeph>SHOW FILES</codeph> statement confirms that the data
+ in the table has the expected number, names, and sizes of the
+ original Parquet files.</li>
+ <li>The <codeph>DESCRIBE</codeph> statement (or its abbreviation
+ <codeph>DESC</codeph>) confirms the names and types of the
+ columns that Impala automatically created after reading that
+ metadata from the Parquet file. </li>
+ <li>The <codeph>DESCRIBE FORMATTED</codeph> statement prints out
+ some extra detail along with the column definitions. The pieces we
+ care about for this exercise are: <ul>
+ <li>The containing database for the table.</li>
+ <li>The location of the associated data files in HDFS.</li>
+ <li>The table is an external table so Impala will not delete the
+ HDFS files when we finish the experiments and drop the
+ table.</li>
+ <li>The table is set up to work exclusively with files in the
+ Parquet format.</li>
+ </ul></li>
+ </ul></p>
+
+<codeblock>> SHOW TABLE STATS airlines_external;
+-------+--------+--------+--------------+-------------------+---------+-------------------+
| #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |
+-------+--------+--------+--------------+-------------------+---------+-------------------+
| -1 | 8 | 1.34GB | NOT CACHED | NOT CACHED | PARQUET | false |
+-------+--------+--------+--------------+-------------------+---------+-------------------+
-[localhost:21000] > show files in airlines_external;
+
+> SHOW FILES IN airlines_external;
+----------------------------------------------------------------------------------------+----------+-----------+
| path | size | partition |
+----------------------------------------------------------------------------------------+----------+-----------+
-| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq | 252.83MB | |
-| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq | 64.42MB | |
-| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa97_574780876_data.0.parq | 155.03MB | |
-| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq | 239.95MB | |
-| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq | 252.90MB | |
-| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq | 15.89MB | |
-| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq | 176.92MB | |
-| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq | 212.47MB | |
+| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq | 252.99MB | |
+| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.1.parq | 13.43MB | |
+| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd984_501176748_data.0.parq | 252.84MB | |
+| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd984_501176748_data.1.parq | 63.92MB | |
+| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd985_1199995767_data.0.parq | 183.64MB | |
+| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd986_2086627597_data.0.parq | 240.04MB | |
+| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd987_1048668565_data.0.parq | 211.35MB | |
+| /user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd988_1432111844_data.0.parq | 151.46MB | |
+----------------------------------------------------------------------------------------+----------+-----------+
-[localhost:21000] > describe airlines_external;
+
+> DESCRIBE airlines_external;
+---------------------+--------+---------------------------------------------------+
| name | type | comment |
+---------------------+--------+---------------------------------------------------+
-| year | int | inferred from: optional int32 year |
-| month | int | inferred from: optional int32 month |
-| day | int | inferred from: optional int32 day |
-| dayofweek | int | inferred from: optional int32 dayofweek |
-| dep_time | int | inferred from: optional int32 dep_time |
-| crs_dep_time | int | inferred from: optional int32 crs_dep_time |
-| arr_time | int | inferred from: optional int32 arr_time |
-| crs_arr_time | int | inferred from: optional int32 crs_arr_time |
-| carrier | string | inferred from: optional binary carrier |
-| flight_num | int | inferred from: optional int32 flight_num |
-| tail_num | int | inferred from: optional int32 tail_num |
-| actual_elapsed_time | int | inferred from: optional int32 actual_elapsed_time |
-| crs_elapsed_time | int | inferred from: optional int32 crs_elapsed_time |
-| airtime | int | inferred from: optional int32 airtime |
-| arrdelay | int | inferred from: optional int32 arrdelay |
-| depdelay | int | inferred from: optional int32 depdelay |
-| origin | string | inferred from: optional binary origin |
-| dest | string | inferred from: optional binary dest |
-| distance | int | inferred from: optional int32 distance |
-| taxi_in | int | inferred from: optional int32 taxi_in |
-| taxi_out | int | inferred from: optional int32 taxi_out |
-| cancelled | int | inferred from: optional int32 cancelled |
-| cancellation_code | string | inferred from: optional binary cancellation_code |
-| diverted | int | inferred from: optional int32 diverted |
-| carrier_delay | int | inferred from: optional int32 carrier_delay |
-| weather_delay | int | inferred from: optional int32 weather_delay |
-| nas_delay | int | inferred from: optional int32 nas_delay |
-| security_delay | int | inferred from: optional int32 security_delay |
-| late_aircraft_delay | int | inferred from: optional int32 late_aircraft_delay |
+| year | int | Inferred from Parquet file. |
+| month | int | Inferred from Parquet file. |
+| day | int | Inferred from Parquet file. |
+| dayofweek | int | Inferred from Parquet file. |
+| dep_time | int | Inferred from Parquet file. |
+| crs_dep_time | int | Inferred from Parquet file. |
+| arr_time | int | Inferred from Parquet file. |
+| crs_arr_time | int | Inferred from Parquet file. |
+| carrier | string | Inferred from Parquet file. |
+| flight_num | int | Inferred from Parquet file. |
+| tail_num | int | Inferred from Parquet file. |
+| actual_elapsed_time | int | Inferred from Parquet file. |
+| crs_elapsed_time | int | Inferred from Parquet file. |
+| airtime | int | Inferred from Parquet file. |
+| arrdelay | int | Inferred from Parquet file. |
+| depdelay | int | Inferred from Parquet file. |
+| origin | string | Inferred from Parquet file. |
+| dest | string | Inferred from Parquet file. |
+| distance | int | Inferred from Parquet file. |
+| taxi_in | int | Inferred from Parquet file. |
+| taxi_out | int | Inferred from Parquet file. |
+| cancelled | int | Inferred from Parquet file. |
+| cancellation_code | string | Inferred from Parquet file. |
+| diverted | int | Inferred from Parquet file. |
+| carrier_delay | int | Inferred from Parquet file. |
+| weather_delay | int | Inferred from Parquet file. |
+| nas_delay | int | Inferred from Parquet file. |
+| security_delay | int | Inferred from Parquet file. |
+| late_aircraft_delay | int | Inferred from Parquet file. |
+---------------------+--------+---------------------------------------------------+
-[localhost:21000] > desc formatted airlines_external;
+
+> DESCRIBE FORMATTED airlines_external;
+------------------------------+-------------------------------
| name | type
+------------------------------+-------------------------------
...
| # Detailed Table Information | NULL
-| Database: | airline_data
-| Owner: | jrussell
+| Database: | airlines_data
+| Owner: | impala
...
| Location: | /user/impala/staging/airlines
| Table Type: | EXTERNAL_TABLE
...
| # Storage Information | NULL
-| SerDe Library: | parquet.hive.serde.ParquetHiveSerDe
-| InputFormat: | parquet.hive.DeprecatedParquetInputFormat
-| OutputFormat: | parquet.hive.DeprecatedParquetOutputFormat
+| SerDe Library: | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
+| InputFormat: | org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
+| OutputFormat: | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
...
</codeblock>
+ </conbody></concept>
+ <concept id="examine_data">
+ <title>Analyze Data</title>
+ <conbody>
-<p>
-Now that we are confident that the connections are solid between the Impala table and the
-underlying Parquet files, we run some initial queries to understand the characteristics
-of the data: the overall number of rows, and the ranges and how many
-different values are in certain columns.
-For convenience in understanding the magnitude of the <codeph>COUNT(*)</codeph>
-result, we run another query dividing the number of rows by 1 million, demonstrating that there are 123 million rows in the table.
-</p>
-
-<!-- I think this is the very longest line that could possibly fit without wrapping in the PDF output; 87 characters wide.
-[localhost:21000] > select count(*) / 1e6 as 'millions of rows' from airlines_external;
--->
+<p> Now that we are confident that the connections are solid between the Impala
+ table and the underlying Parquet files, we run some initial queries to
+ understand the characteristics of the data: the overall number of
+ rows, and the ranges and how many different values are in certain
+ columns. </p>
-<codeblock>[localhost:21000] > select count(*) from airlines_external;
+<codeblock>> SELECT COUNT(*) FROM airlines_external;
+-----------+
| count(*) |
+-----------+
| 123534969 |
+-----------+
-Fetched 1 row(s) in 1.32s
-[localhost:21000] > select count(*) / 1e6 as 'millions of rows' from airlines_external;
-+------------------+
-| millions of rows |
-+------------------+
-| 123.534969 |
-+------------------+
-Fetched 1 row(s) in 1.24s
</codeblock>
-<p> The <codeph>NDV()</codeph> function stands for <q>number of distinct
- values</q>, which for performance reasons is an estimate when there
- are lots of different values in the column, but is precise when the
- cardinality is less than 16 K. Use <codeph>NDV()</codeph> calls for this
- kind of exploration rather than <codeph>COUNT(DISTINCT
- <varname>colname</varname>)</codeph>, because Impala can evaluate
- multiple <codeph>NDV()</codeph> functions in a single query, but only a
- single instance of <codeph>COUNT DISTINCT</codeph>. Here we see that
- there are modest numbers of different airlines, flight numbers, and
- origin and destination airports. Two things jump out from this query:
- the number of <codeph>tail_num</codeph> values is much smaller than we
- might have expected, and there are more destination airports than origin
- airports. Let's dig further. What we find is that most
- <codeph>tail_num</codeph> values are <codeph>NULL</codeph>. It looks
- like this was an experimental column that wasn't filled in accurately.
- We make a mental note that if we use this data as a starting point,
- we'll ignore this column. We also find that certain airports are
- represented in the <codeph>ORIGIN</codeph> column but not the
- <codeph>DEST</codeph> column; now we know that we cannot rely on the
- assumption that those sets of airport codes are identical. </p>
-
-<note>
-A slight digression for some performance tuning. Notice how the first
-<codeph>SELECT DISTINCT DEST</codeph> query takes almost 40 seconds.
-We expect all queries on such a small data set, less than 2 GB, to
-take a few seconds at most. The reason is because the expression
-<codeph>NOT IN (SELECT origin FROM airlines_external)</codeph>
-produces an intermediate result set of 123 million rows, then
-runs 123 million comparisons on each data node against the tiny set of destination airports.
-The way the <codeph>NOT IN</codeph> operator works internally means that
-this intermediate result set with 123 million rows might be transmitted
-across the network to each data node in the cluster.
-Applying another <codeph>DISTINCT</codeph> inside the <codeph>NOT IN</codeph>
-subquery means that the intermediate result set is only 340 items,
-resulting in much less network traffic and fewer comparison operations.
-The more efficient query with the added <codeph>DISTINCT</codeph> is approximately 7 times as fast.
-</note>
-
-<codeblock>[localhost:21000] > select ndv(carrier), ndv(flight_num), ndv(tail_num),
- > ndv(origin), ndv(dest) from airlines_external;
+<p> The <codeph>NDV()</codeph> function returns the number of distinct values,
+ which, for performance reasons, is an estimate when there are lots of
+ different values in the column, but is precise when the cardinality is
+ less than 16 K. Use the <codeph>NDV()</codeph> function for this kind of
+ exploration rather than <codeph>COUNT(DISTINCT
+ <varname>colname</varname>)</codeph>, because Impala can evaluate
+ multiple <codeph>NDV()</codeph> functions in a single query, but only
+ a single instance of <codeph>COUNT DISTINCT</codeph>. </p>
+
+<codeblock>> SELECT NDV(carrier), NDV(flight_num), NDV(tail_num),
+ NDV(origin), NDV(dest) FROM airlines_external;
+--------------+-----------------+---------------+-------------+-----------+
| ndv(carrier) | ndv(flight_num) | ndv(tail_num) | ndv(origin) | ndv(dest) |
+--------------+-----------------+---------------+-------------+-----------+
-| 29 | 9086 | 3 | 340 | 347 |
+| 29 | 8463 | 3 | 342 | 349 |
+--------------+-----------------+---------------+-------------+-----------+
-[localhost:21000] > select tail_num, count(*) as howmany from airlines_external
- > group by tail_num;
+
+> SELECT tail_num, COUNT(*) AS howmany FROM airlines_external
+ GROUP BY tail_num;
+----------+-----------+
| tail_num | howmany |
+----------+-----------+
+| NULL | 123122001 |
| 715 | 1 |
| 0 | 406405 |
| 112 | 6562 |
-| NULL | 123122001 |
+----------+-----------+
-Fetched 1 row(s) in 5.18s
-[localhost:21000] > select distinct dest from airlines_external
- > where dest not in (select origin from airlines_external);
+
+> SELECT DISTINCT dest FROM airlines_external
+ WHERE dest NOT IN (SELECT origin FROM airlines_external);
+------+
| dest |
+------+
-| LBF |
| CBM |
-| RCA |
| SKA |
| LAR |
+| RCA |
+| LBF |
+------+
-Fetched 5 row(s) in 39.64s
-[localhost:21000] > select distinct dest from airlines_external
- > where dest not in (select distinct origin from airlines_external);
+
+> SELECT DISTINCT dest FROM airlines_external
+ WHERE dest NOT IN (SELECT DISTINCT origin FROM airlines_external);
+------+
| dest |
+------+
-| LBF |
-| RCA |
| CBM |
| SKA |
| LAR |
+| RCA |
+| LBF |
+------+
-Fetched 5 row(s) in 5.59s
-[localhost:21000] > select distinct origin from airlines_external
- > where origin not in (select distinct dest from airlines_external);
-Fetched 0 row(s) in 5.37s
-</codeblock>
+
+> SELECT DISTINCT origin FROM airlines_external
+ WHERE origin NOT IN (SELECT DISTINCT dest FROM airlines_external);
+Fetched 0 row(s) in 2.63s</codeblock>
+ <p>With the above queries, we see that there are modest numbers of
+ different airlines, flight numbers, and origin and destination
+ airports. Two things jump out from these queries: the number of
+ <codeph>tail_num</codeph> values is much smaller than we might have
+ expected, and there are more destination airports than origin
+ airports. Let's dig further. What we find is that most
+ <codeph>tail_num</codeph> values are <codeph>NULL</codeph>. It looks
+ like this was an experimental column that wasn't filled in accurately.
+ We make a mental note that if we use this data as a starting point,
+ we'll ignore this column. We also find that certain airports are
+ represented in the <codeph>ORIGIN</codeph> column but not the
+ <codeph>DEST</codeph> column; now we know that we cannot rely on the
+ assumption that those sets of airport codes are identical. </p>
+ <note> The first <codeph>SELECT DISTINCT DEST</codeph> query takes
+ almost 40 seconds. We expect all queries on such a small data set,
+ less than 2 GB, to take a few seconds at most. The reason is that
+ the expression <codeph>NOT IN (SELECT origin FROM
+ airlines_external)</codeph> produces an intermediate result set of
+ 123 million rows, then runs 123 million comparisons on each data node
+ against the tiny set of destination airports. The way the <codeph>NOT
+ IN</codeph> operator works internally means that this intermediate
+ result set with 123 million rows might be transmitted across the
+ network to each data node in the cluster. Applying another
+ <codeph>DISTINCT</codeph> inside the <codeph>NOT IN</codeph>
+ subquery means that the intermediate result set is only 340 items,
+ resulting in much less network traffic and fewer comparison
+ operations. The more efficient query with the added
+ <codeph>DISTINCT</codeph> is approximately 7 times as fast. </note>
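+<p> To see where the time goes before running such a query, you can check
+  its plan with the <codeph>EXPLAIN</codeph> statement, which prints the
+  query plan without executing the query. (A sketch; the exact plan output
+  varies by Impala version and cluster, so it is not shown here.)</p>
+
+<codeblock>> EXPLAIN SELECT DISTINCT dest FROM airlines_external
+  WHERE dest NOT IN (SELECT DISTINCT origin FROM airlines_external);</codeblock>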
<p> Next, we try doing a simple calculation, with results broken down by year.
- This reveals that some years have no data in the
- <codeph>AIRTIME</codeph> column. That means we might be able to use
- that column in queries involving certain date ranges, but we cannot
- count on it to always be reliable. The question of whether a column
- contains any <codeph>NULL</codeph> values, and if so what is their
- number, proportion, and distribution, comes up again and again when
- doing initial exploration of a data set. </p>
-
-<codeblock>[localhost:21000] > select year, sum(airtime) from airlines_external
- > group by year order by year desc;
+ This reveals that some years have no data in the
+ <codeph>airtime</codeph> column. That means we might be able to use
+ that column in queries involving certain date ranges, but we cannot
+ count on it to always be reliable. The question of whether a column
+ contains any <codeph>NULL</codeph> values, and if so what is their
+ number, proportion, and distribution, comes up again and again when
+ doing initial exploration of a data set. </p>
+
+<codeblock>> SELECT year, SUM(airtime) FROM airlines_external
+ GROUP BY year ORDER BY year DESC;
+------+--------------+
| year | sum(airtime) |
+------+--------------+
@@ -1932,34 +1849,37 @@ Fetched 0 row(s) in 5.37s
+------+--------------+
</codeblock>
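+<p> As a sketch of that kind of check:
+  <codeph>COUNT(<varname>colname</varname>)</codeph> skips
+  <codeph>NULL</codeph> values, so comparing it with
+  <codeph>COUNT(*)</codeph> for each year shows exactly which years are
+  missing <codeph>airtime</codeph> data:</p>
+
+<codeblock>> SELECT year, COUNT(*) - COUNT(airtime) AS missing_airtime
+  FROM airlines_external GROUP BY year ORDER BY year DESC;</codeblock>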
-<p>
-With the notion of <codeph>NULL</codeph> values in mind, let's come back to the <codeph>TAILNUM</codeph>
-column that we discovered had a lot of <codeph>NULL</codeph>s.
-Let's quantify the <codeph>NULL</codeph> and non-<codeph>NULL</codeph> values in that column for better understanding.
-First, we just count the overall number of rows versus the non-<codeph>NULL</codeph> values in that column.
-That initial result gives the appearance of relatively few non-<codeph>NULL</codeph> values, but we can break
-it down more clearly in a single query.
-Once we have the <codeph>COUNT(*)</codeph> and the <codeph>COUNT(<varname>colname</varname>)</codeph> numbers,
-we can encode that initial query in a <codeph>WITH</codeph> clause, then run a followon query that performs
-multiple arithmetic operations on those values.
-Seeing that only one-third of one percent of all rows have non-<codeph>NULL</codeph> values for the
-<codeph>TAILNUM</codeph> column clearly illustrates that that column is not of much use.
-</p>
-
-<codeblock>[localhost:21000] > select count(*) as 'rows', count(tail_num) as 'non-null tail numbers'
- > from airlines_external;
+<p> With the notion of <codeph>NULL</codeph> values in mind, let's come back to
+ the <codeph>tail_num</codeph> column that we discovered had a lot of
+ <codeph>NULL</codeph>s. Let's quantify the <codeph>NULL</codeph> and
+ non-<codeph>NULL</codeph> values in that column for better
+ understanding. First, we just count the overall number of rows versus
+ the non-<codeph>NULL</codeph> values in that column. That initial
+ result gives the appearance of relatively few
+ non-<codeph>NULL</codeph> values, but we can break it down more
+ clearly in a single query. Once we have the <codeph>COUNT(*)</codeph>
+ and the <codeph>COUNT(<varname>colname</varname>)</codeph> numbers, we
+ can encode that initial query in a <codeph>WITH</codeph> clause, then
+ run a follow-on query that performs multiple arithmetic operations on
+ those values. Seeing that only one-third of one percent of all rows
+ have non-<codeph>NULL</codeph> values for the
+ <codeph>tail_num</codeph> column clearly illustrates that the
+ column is not of much use. </p>
+
+<codeblock>> SELECT COUNT(*) AS 'rows', COUNT(tail_num) AS 'non-null tail numbers'
+ FROM airlines_external;
+-----------+-----------------------+
| rows | non-null tail numbers |
+-----------+-----------------------+
| 123534969 | 412968 |
+-----------+-----------------------+
-Fetched 1 row(s) in 1.51s
-[localhost:21000] > with t1 as
- > (select count(*) as 'rows', count(tail_num) as 'nonnull'
- > from airlines_external)
- > select `rows`, `nonnull`, `rows` - `nonnull` as 'nulls',
- > (`nonnull` / `rows`) * 100 as 'percentage non-null'
- > from t1;
+
+> WITH t1 AS
+ (SELECT COUNT(*) AS 'rows', COUNT(tail_num) AS 'nonnull'
+ FROM airlines_external)
+SELECT `rows`, `nonnull`, `rows` - `nonnull` AS 'nulls',
+ (`nonnull` / `rows`) * 100 AS 'percentage non-null'
+FROM t1;
+-----------+---------+-----------+---------------------+
| rows | nonnull | nulls | percentage non-null |
+-----------+---------+-----------+---------------------+
@@ -1967,26 +1887,28 @@ Fetched 1 row(s) in 1.51s
+-----------+---------+-----------+---------------------+
</codeblock>
-<p>
-By examining other columns using these techniques, we can form a mental picture of the way data is distributed
-throughout the table, and which columns are most significant for query purposes. For this tutorial, we focus mostly on
-the fields likely to hold discrete values, rather than columns such as <codeph>ACTUAL_ELAPSED_TIME</codeph>
-whose names suggest they hold measurements. We would dig deeper into those columns once we had a clear picture
-of which questions were worthwhile to ask, and what kinds of trends we might look for.
-For the final piece of initial exploration, let's look at the <codeph>YEAR</codeph> column.
-A simple <codeph>GROUP BY</codeph> query shows that it has a well-defined range, a manageable number of
-distinct values, and relatively even distribution of rows across the different years.
-</p>
-
-<codeblock>[localhost:21000] > select min(year), max(year), ndv(year) from airlines_external;
+<p> By examining other columns using these techniques, we can form a mental
+ picture of the way data is distributed throughout the table, and which
+ columns are most significant for query purposes. For this tutorial, we
+ focus mostly on the fields likely to hold discrete values, rather than
+ columns such as <codeph>actual_elapsed_time</codeph> whose names
+ suggest they hold measurements. We would dig deeper into those columns
+ once we had a clear picture of which questions were worthwhile to ask,
+ and what kinds of trends we might look for. For the final piece of
+ initial exploration, let's look at the <codeph>year</codeph> column. A
+ simple <codeph>GROUP BY</codeph> query shows that it has a
+ well-defined range, a manageable number of distinct values, and
+ relatively even distribution of rows across the different years. </p>
+
+<codeblock>> SELECT MIN(year), MAX(year), NDV(year) FROM airlines_external;
+-----------+-----------+-----------+
| min(year) | max(year) | ndv(year) |
+-----------+-----------+-----------+
| 1987 | 2008 | 22 |
+-----------+-----------+-----------+
-Fetched 1 row(s) in 2.03s
-[localhost:21000] > select year, count(*) howmany from airlines_external
- > group by year order by year desc;
+
+> SELECT year, COUNT(*) howmany FROM airlines_external
+ GROUP BY year ORDER BY year DESC;
+------+---------+
| year | howmany |
+------+---------+
@@ -2013,77 +1935,35 @@ Fetched 1 row(s) in 2.03s
| 1988 | 5202096 |
| 1987 | 1311826 |
+------+---------+
-Fetched 22 row(s) in 2.13s
</codeblock>
-<p>
-We could go quite far with the data in this initial raw format, just as we downloaded it from the web.
-If the data set proved to be useful and worth persisting in Impala for extensive queries,
-we might want to copy it to an internal table, letting Impala manage the data files and perhaps
-reorganizing a little for higher efficiency.
-In this next stage of the tutorial, we copy the original data into a partitioned table, still in Parquet format.
-Partitioning based on the <codeph>YEAR</codeph> column lets us run queries with clauses such as <codeph>WHERE year = 2001</codeph>
-or <codeph>WHERE year BETWEEN 1989 AND 1999</codeph>, which can dramatically cut down on I/O by
-ignoring all the data from years outside the desired range.
-Rather than reading all the data and then deciding which rows are in the matching years, Impala can
-zero in on only the data files from specific <codeph>YEAR</codeph> partitions.
-To do this, Impala physically reorganizes the data files, putting the rows from each year into
-data files in a separate HDFS directory for each <codeph>YEAR</codeph> value.
-Along the way, we'll also get rid of the <codeph>TAIL_NUM</codeph> column that proved to be almost entirely <codeph>NULL</codeph>.
-</p>
-
-<p>
-The first step is to create a new table with a layout very similar to the original <codeph>AIRLINES_EXTERNAL</codeph> table.
-We'll do that by reverse-engineering a <codeph>CREATE TABLE</codeph> statement for the first table,
-then tweaking it slightly to include a <codeph>PARTITION BY</codeph> clause for <codeph>YEAR</codeph>,
-and excluding the <codeph>TAIL_NUM</codeph> column.
-The <codeph>SHOW CREATE TABLE</codeph> statement gives us the starting point.
-</p>
-
-<codeblock>[localhost:21000] > show create table airlines_external;
-+-------------------------------------------------------------------------------------
-| result
-+-------------------------------------------------------------------------------------
-| CREATE EXTERNAL TABLE airline_data.airlines_external (
-| year INT COMMENT 'inferred from: optional int32 year',
-| month INT COMMENT 'inferred from: optional int32 month',
-| day INT COMMENT 'inferred from: optional int32 day',
-| dayofweek INT COMMENT 'inferred from: optional int32 dayofweek',
-| dep_time INT COMMENT 'inferred from: optional int32 dep_time',
-| crs_dep_time INT COMMENT 'inferred from: optional int32 crs_dep_time',
-| arr_time INT COMMENT 'inferred from: optional int32 arr_time',
-| crs_arr_time INT COMMENT 'inferred from: optional int32 crs_arr_time',
-| carrier STRING COMMENT 'inferred from: optional binary carrier',
-| flight_num INT COMMENT 'inferred from: optional int32 flight_num',
-| tail_num INT COMMENT 'inferred from: optional int32 tail_num',
-| actual_elapsed_time INT COMMENT 'inferred from: optional int32 actual_elapsed_time',
-| crs_elapsed_time INT COMMENT 'inferred from: optional int32 crs_elapsed_time',
-| airtime INT COMMENT 'inferred from: optional int32 airtime',
-| arrdelay INT COMMENT 'inferred from: optional int32 arrdelay',
-| depdelay INT COMMENT 'inferred from: optional int32 depdelay',
-| origin STRING COMMENT 'inferred from: optional binary origin',
-| dest STRING COMMENT 'inferred from: optional binary dest',
-| distance INT COMMENT 'inferred from: optional int32 distance',
-| taxi_in INT COMMENT 'inferred from: optional int32 taxi_in',
-| taxi_out INT COMMENT 'inferred from: optional int32 taxi_out',
-| cancelled INT COMMENT 'inferred from: optional int32 cancelled',
-| cancellation_code STRING COMMENT 'inferred from: optional binary cancellation_code',
-| diverted INT COMMENT 'inferred from: optional int32 diverted',
-| carrier_delay INT COMMENT 'inferred from: optional int32 carrier_delay',
-| weather_delay INT COMMENT 'inferred from: optional int32 weather_delay',
-| nas_delay INT COMMENT 'inferred from: optional int32 nas_delay',
-| security_delay INT COMMENT 'inferred from: optional int32 security_delay',
-| late_aircraft_delay INT COMMENT 'inferred from: optional int32 late_aircraft_delay'
-| )
-| STORED AS PARQUET
-| LOCATION 'hdfs://a1730.example.com:8020/user/impala/staging/airlines'
-| TBLPROPERTIES ('numFiles'='0', 'COLUMN_STATS_ACCURATE'='false',
-| 'transient_lastDdlTime'='1439425228', 'numRows'='-1', 'totalSize'='0',
-| 'rawDataSize'='-1')
-+-------------------------------------------------------------------------------------
-Fetched 1 row(s) in 0.03s
-[localhost:21000] > quit;
-</codeblock>
+<p> We could go quite far with the data in this initial raw format, just as we
+ downloaded it from the web. If the data set proved to be useful and
+ worth persisting in Impala for extensive queries, we might want to
+ copy it to an internal table, letting Impala manage the data files and
+ perhaps reorganizing a little for higher efficiency. In this next
+ stage of the tutorial, we copy the original data into a partitioned
+ table, still in Parquet format. Partitioning based on the
+ <codeph>year</codeph> column lets us run queries with clauses such
+ as <codeph>WHERE year = 2001</codeph> or <codeph>WHERE year BETWEEN
+ 1989 AND 1999</codeph>, which can dramatically cut down on I/O by
+ ignoring all the data from years outside the desired range. Rather
+ than reading all the data and then deciding which rows are in the
+ matching years, Impala can zero in on only the data files from
+ specific <codeph>year</codeph> partitions. To do this, Impala
+ physically reorganizes the data files, putting the rows from each year
+ into data files in a separate HDFS directory for each
+ <codeph>year</codeph> value. Along the way, we'll also get rid of
+ the <codeph>tail_num</codeph> column that proved to be almost entirely
+ <codeph>NULL</codeph>. </p>
+
+<p> The first step is to create a new table with a layout very similar to the
+ original <codeph>airlines_external</codeph> table. We'll do that by
+ reverse-engineering a <codeph>CREATE TABLE</codeph> statement for the
+ first table, then tweaking it slightly to include a <codeph>PARTITION
+ BY</codeph> clause for <codeph>year</codeph>, and excluding the
+ <codeph>tail_num</codeph> column. The <codeph>SHOW CREATE
+ TABLE</codeph> statement gives us the starting point. </p>
<p>
Although we could edit that output into a new SQL statement, all the ASCII box characters
@@ -2092,17 +1972,10 @@ to start with, we restart the <cmdname>impala-shell</cmdname> command with the
<codeph>-B</codeph> option, which turns off the box-drawing behavior.
</p>
-<codeblock>[localhost:21000] > quit;
-Goodbye jrussell
-$ impala-shell -i localhost -B -d airline_data;
-Starting Impala Shell without Kerberos authentication
-Connected to localhost:21000
-<ph conref="../shared/ImpalaVariables.xml#impala_vars/ImpaladBanner"/>
-Welcome to the Impala shell. Press TAB twice to see a list of available commands.
-...
-<ph conref="../shared/ImpalaVariables.xml#impala_vars/ShellBanner"/>
-[localhost:21000] > show create table airlines_external;
-"CREATE EXTERNAL TABLE airline_data.airlines_external (
+<codeblock>$ impala-shell -i localhost -B -d airlines_data;
+
+> SHOW CREATE TABLE airlines_external;
+"CREATE EXTERNAL TABLE airlines_data.airlines_external (
year INT COMMENT 'inferred from: optional int32 year',
month INT COMMENT 'inferred from: optional int32 month',
day INT COMMENT 'inferred from: optional int32 day',
@@ -2138,7 +2011,6 @@ LOCATION 'hdfs://a1730.example.com:8020/user/impala/staging/airlines'
TBLPROPERTIES ('numFiles'='0', 'COLUMN_STATS_ACCURATE'='false',
'transient_lastDdlTime'='1439425228', 'numRows'='-1', 'totalSize'='0',
'rawDataSize'='-1')"
-Fetched 1 row(s) in 0.01s
</codeblock>
<p>
@@ -2146,102 +2018,97 @@ After copying and pasting the <codeph>CREATE TABLE</codeph> statement into a tex
without the <codeph>-B</codeph> option, to switch back to regular
output.
</p>
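+<p> For example (a sketch; substitute your coordinator host for
+  <codeph>localhost</codeph>):</p>
+
+<codeblock>$ impala-shell -i localhost -d airlines_data</codeblock>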
-<p>
-Next we run the <codeph>CREATE TABLE</codeph> statement that we adapted from the <codeph>SHOW CREATE TABLE</codeph> output.
-We kept the <codeph>STORED AS PARQUET</codeph> clause because
-we want to rearrange the data somewhat but still keep it in the high-performance
-Parquet format. The <codeph>LOCATION</codeph> and <codeph>TBLPROPERTIES</codeph>
-clauses are not relevant for this new table, so we edit those out.
-Because we are going to partition the new table based on the <codeph>YEAR</codeph>
-column, we move that column name (and its type) into a new <codeph>PARTITIONED BY</codeph> clause.
-</p>
-
-<codeblock>[localhost:21000] > CREATE TABLE airline_data.airlines
- > (
- > month INT,
- > day INT,
- > dayofweek INT,
- > dep_time INT,
- > crs_dep_time INT,
- > arr_time INT,
- > crs_arr_time INT,
- > carrier STRING,
- > flight_num INT,
- > actual_elapsed_time INT,
- > crs_elapsed_time INT,
- > airtime INT,
- > arrdelay INT,
- > depdelay INT,
- > origin STRING,
- > dest STRING,
- > distance INT,
- > taxi_in INT,
- > taxi_out INT,
- > cancelled INT,
- > cancellation_code STRING,
- > diverted INT,
- > carrier_delay INT,
- > weather_delay INT,
- > nas_delay INT,
- > security_delay INT,
- > late_aircraft_delay INT
- > )
- > STORED AS PARQUET
- > PARTITIONED BY (year INT);
-Fetched 0 row(s) in 0.10s
+<p> Next we run the <codeph>CREATE TABLE</codeph> statement that we adapted from
+ the <codeph>SHOW CREATE TABLE</codeph> output. We kept the
+ <codeph>STORED AS PARQUET</codeph> clause because we want to
+ rearrange the data somewhat but still keep it in the high-performance
+ Parquet format. The <codeph>LOCATION</codeph> and
+ <codeph>TBLPROPERTIES</codeph> clauses are not relevant for this new
+ table, so we edit those out. Because we are going to partition the new
+ table based on the <codeph>year</codeph> column, we move that column
+ name (and its type) into a new <codeph>PARTITIONED BY</codeph> clause. </p>
+
+<codeblock>> CREATE TABLE airlines_data.airlines
+ (month INT,
+ day INT,
+ dayofweek INT,
+ dep_time INT,
+ crs_dep_time INT,
+ arr_time INT,
+ crs_arr_time INT,
+ carrier STRING,
+ flight_num INT,
+ actual_elapsed_time INT,
+ crs_elapsed_time INT,
+ airtime INT,
+ arrdelay INT,
+ depdelay INT,
+ origin STRING,
+ dest STRING,
+ distance INT,
+ taxi_in INT,
+ taxi_out INT,
+ cancelled INT,
+ cancellation_code STRING,
+ diverted INT,
+ carrier_delay INT,
+ weather_delay INT,
+ nas_delay INT,
+ security_delay INT,
+ late_aircraft_delay INT)
+PARTITIONED BY (year INT)
+STORED AS PARQUET;
</codeblock>
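+<p> As a quick sanity check (a sketch; output not shown here), running
+  <codeph>DESCRIBE</codeph> on the new table confirms that the
+  <codeph>year</codeph> partition column still appears in the schema,
+  listed last, and that <codeph>tail_num</codeph> is gone:</p>
+
+<codeblock>> DESCRIBE airlines_data.airlines;</codeblock>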
-<p>
-Next, we copy all the rows from the original table into this new one with
-an <codeph>INSERT</codeph> statement. (We edited the <codeph>CREATE TABLE</codeph>
-statement to make an <codeph>INSERT</codeph> statement with the column names
-in the same order.) The only change is to add a <codeph>PARTITION(year)</codeph>
-clause, and move the <codeph>YEAR</codeph> column to the very end of the
-<codeph>SELECT</codeph> list of the <codeph>INSERT</codeph> statement.
-Specifying <codeph>PARTITION(year)</codeph>, rather than a fixed value
-such as <codeph>PARTITION(year=2000)</codeph>, means that Impala figures
-out the partition value for each row based on the value of the very last column
-in the <codeph>SELECT</codeph> list.
-This is the first SQL statement that legitimately takes any substantial time,
-because the rows from different years are shuffled around the cluster;
-the rows that go into each partition are collected on one node, before being
-written to one or more new data files.
-</p>
-
-<codeblock>[localhost:21000] > INSERT INTO airline_data.airlines
- > PARTITION (year)
- > SELECT
- > month,
- > day,
- > dayofweek,
- > dep_time,
- > crs_dep_time,
- > arr_time,
- > crs_arr_time,
- > carrier,
- > flight_num,
- > actual_elapsed_time,
- > crs_elapsed_time,
- > airtime,
- > arrdelay,
- > depdelay,
- > origin,
- > dest,
- > distance,
- > taxi_in,
- > taxi_out,
- > cancelled,
- > cancellation_code,
- > diverted,
- > carrier_delay,
- > weather_delay,
- > nas_delay,
- > security_delay,
- > late_aircraft_delay,
- > year
- > FROM airline_data.airlines_external;
-Inserted 123534969 row(s) in 202.70s
-</codeblock>
+<p> Next, we copy all the rows from the original table into this new one with an
+ <codeph>INSERT</codeph> statement. (We edited the <codeph>CREATE
+ TABLE</codeph> statement to make an <codeph>INSERT</codeph>
+ statement with the column names in the same order.) The only change is
+ to add a <codeph>PARTITION(year)</codeph> clause, and move the
+ <codeph>year</codeph> column to the very end of the
+ <codeph>SELECT</codeph> list of the <codeph>INSERT</codeph>
+ statement. Specifying <codeph>PARTITION(year)</codeph>, rather than a
+ fixed value such as <codeph>PARTITION(year=2000)</codeph>, means that
+ Impala figures out the partition value for each row based on the value
+ of the very last column in the <codeph>SELECT</codeph> list. This is
+ the first SQL statement that legitimately takes any substantial time,
+ because the rows from different years are shuffled around the cluster;
+ the rows that go into each partition are collected on one node, before
+ being written to one or more new data files. </p>
+
+<codeblock>> INSERT INTO airlines_data.airlines
+ PARTITION (year)
+ SELECT
+ month,
+ day,
+ dayofweek,
+ dep_time,
+ crs_dep_time,
+ arr_time,
+ crs_arr_time,
+ carrier,
+ flight_num,
+ actual_elapsed_time,
+ crs_elapsed_time,
+ airtime,
+ arrdelay,
+ depdelay,
+ origin,
+ dest,
+ distance,
+ taxi_in,
+ taxi_out,
+ cancelled,
+ cancellation_code,
+ diverted,
+ carrier_delay,
+ weather_delay,
+ nas_delay,
+ security_delay,
+ late_aircraft_delay,
+ year
+ FROM airlines_data.airlines_external;</codeblock>
<p>
Once partitioning or join queries come into play, it's important to have statistics
@@ -2253,54 +2120,57 @@ are in place for each partition, and also illustrates how many files and how muc
is in each partition.
</p>
-<codeblock>[localhost:21000] > compute incremental stats airlines;
+<codeblock>> COMPUTE INCREMENTAL STATS airlines;
+-------------------------------------------+
| summary |
+-------------------------------------------+
| Updated 22 partition(s) and 27 column(s). |
+-------------------------------------------+
-[localhost:21000] > show table stats airlines;
-+-------+-----------+--------+----------+--------------+------------+---------+-------------------+
-| year | #Rows | #Files | Size | Bytes Cached | Cache Repl | Format | Incremental stats |
-+-------+-----------+--------+----------+--------------+------------+---------+-----
-| 1987 | 1311826 | 1 | 9.32MB | NOT CACHED | NOT CACHED | PARQUET | true
-| 1988 | 5202096 | 1 | 37.04MB | NOT CACHED | NOT CACHED | PARQUET | true
-| 1989 | 5041200 | 1 | 36.25MB | NOT CACHED | NOT CACHED | PARQUET | true
-| 1990 | 5270893 | 1 | 38.39MB | NOT CACHED | NOT CACHED | PARQUET | true
-| 1991 | 5076925 | 1 | 37.23MB | NOT CACHED | NOT CACHED | PARQUET | true
-| 1992 | 5092157 | 1 | 36.85MB | NOT CACHED | NOT CACHED | PARQUET | true
-| 1993 | 5070501 | 1 | 37.16MB | NOT CACHED | NOT CACHED | PARQUET | true
-| 1994 | 5180048 | 1 | 38.31MB | NOT CACHED | NOT CACHED | PARQUET | true
-| 1995 | 5327435 | 1 | 53.14MB | NOT CACHED | NOT CACHED | PARQUET | true
-| 1996 | 5351983 | 1 | 53.64MB | NOT CACHED | NOT CACHED | PARQUET | true
-| 1997 | 5411843 | 1 | 54.41MB | NOT CACHED | NOT CACHED | PARQUET | true
-| 1998 | 5384721 | 1 | 54.01MB | NOT CACHED | NOT CACHED | PARQUET | true
-| 1999 | 5527884 | 1 | 56.32MB | NOT CACHED | NOT CACHED | PARQUET | true
-| 2000 | 5683047 | 1 | 58.15MB | NOT CACHED | NOT CACHED | PARQUET | true
-| 2001 | 5967780 | 1 | 60.65MB | NOT CACHED | NOT CACHED | PARQUET | true
-| 2002 | 5271359 | 1 | 57.99MB | NOT CACHED | NOT CACHED | PARQUET | true
-| 2003 | 6488540 | 1 | 81.33MB | NOT CACHED | NOT CACHED | PARQUET | true
-| 2004 | 7129270 | 1 | 103.19MB | NOT CACHED | NOT CACHED | PARQUET | true
-| 2005 | 7140596 | 1 | 102.61MB | NOT CACHED | NOT CACHED | PARQUET | true
-| 2006 | 7141922 | 1 | 106.03MB | NOT CACHED | NOT CACHED | PARQUET | true
-| 2007 | 7453215 | 1 | 112.15MB | NOT CACHED | NOT CACHED | PARQUET | true
-| 2008 | 7009728 | 1 | 105.76MB | NOT CACHED | NOT CACHED | PARQUET | true
-| Total | 123534969 | 22 | 1.30GB | 0B | | |
-+-------+-----------+--------+----------+--------------+------------+---------+-----
+
+> SHOW TABLE STATS airlines;
++-------+-----------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------------------------+
+| year | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
++-------+-----------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------------------------+
+| 1987 | 1311826 | 1 | 11.75MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airlines_data.db/airlines/year=1987 |
+| 1988 | 5202096 | 1 | 44.04MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airlines_data.db/airlines/year=1988 |
+| 1989 | 5041200 | 1 | 46.07MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airlines_data.db/airlines/year=1989 |
+| 1990 | 5270893 | 1 | 46.25MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airlines_data.db/airlines/year=1990 |
+| 1991 | 5076925 | 1 | 46.77MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airlines_data.db/airlines/year=1991 |
+| 1992 | 5092157 | 1 | 48.21MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airlines_data.db/airlines/year=1992 |
+| 1993 | 5070501 | 1 | 47.46MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airlines_data.db/airlines/year=1993 |
+| 1994 | 5180048 | 1 | 47.47MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airlines_data.db/airlines/year=1994 |
+| 1995 | 5327435 | 1 | 62.40MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airlines_data.db/airlines/year=1995 |
+| 1996 | 5351983 | 1 | 62.93MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airlines_data.db/airlines/year=1996 |
+| 1997 | 5411843 | 1 | 65.05MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airlines_data.db/airlines/year=1997 |
+| 1998 | 5384721 | 1 | 62.21MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airlines_data.db/airlines/year=1998 |
+| 1999 | 5527884 | 1 | 65.10MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airlines_data.db/airlines/year=1999 |
+| 2000 | 5683047 | 1 | 67.68MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airlines_data.db/airlines/year=2000 |
+| 2001 | 5967780 | 1 | 74.03MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airlines_data.db/airlines/year=2001 |
+| 2002 | 5271359 | 1 | 74.00MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airlines_data.db/airlines/year=2002 |
+| 2003 | 6488540 | 1 | 99.35MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airlines_data.db/airlines/year=2003 |
+| 2004 | 7129270 | 1 | 123.29MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airlines_data.db/airlines/year=2004 |
+| 2005 | 7140596 | 1 | 120.72MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airlines_data.db/airlines/year=2005 |
+| 2006 | 7141922 | 1 | 121.88MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airlines_data.db/airlines/year=2006 |
+| 2007 | 7453215 | 1 | 130.87MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airlines_data.db/airlines/year=2007 |
+| 2008 | 7009728 | 1 | 123.14MB | NOT CACHED | NOT CACHED | PARQUET | true | hdfs://myhost.com:8020/user/hive/warehouse/airlines_data.db/airlines/year=2008 |
+| Total | 123534969 | 22 | 1.55GB | 0B | | | | |
++-------+-----------+--------+----------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------------------------------------+
</codeblock>
-<p>
-At this point, we go through a quick thought process to sanity check the partitioning we did.
-All the partitions have exactly one file, which is on the low side. A query that includes a
-clause <codeph>WHERE year=2004</codeph> will only read a single data block; that data block
-will be read and processed by a single data node; therefore, for a query targeting a single year,
-all the other nodes in the cluster will sit idle while all the work happens on a single machine.
-It's even possible that by chance (depending on HDFS replication factor and the way data blocks are distributed
-across the cluster), that multiple year partitions selected by a filter such as
-<codeph>WHERE year BETWEEN 1999 AND 2001</codeph> could all be read and processed by the same data node.
-The more data files each partition has, the more parallelism you can get and the less probability of <q>hotspots</q> occurring
-on particular nodes, therefore a bigger performance boost by having a big cluster.
-</p>
+<p> At this point, we sanity check the partitioning we did. All the
+  partitions have exactly one file, which is on the low side. A query
+  that includes a clause <codeph>WHERE year=2004</codeph> will only
+  read a single data block; that data block will be read and processed
+  by a single data node; therefore, for a query targeting a single
+  year, all the other nodes in the cluster sit idle while all the work
+  happens on a single machine. It's even possible that by chance
+  (depending on the HDFS replication factor and the way data blocks
+  are distributed across the cluster), multiple year partitions
+  selected by a filter such as
+  <codeph>WHERE year BETWEEN 1999 AND 2001</codeph> could all be read
+  and processed by the same data node. The more data files each
+  partition has, the more parallelism you can get and the lower the
+  probability of <q>hotspots</q> on particular nodes, and therefore
+  the bigger the performance boost from a large cluster. </p>
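+<p> As a rough sketch (not something we ran above), one way to get several
+  data files per partition is to lower the Parquet target file size
+  before reloading the data; the <codeph>INSERT</codeph> below assumes
+  the columns of <codeph>airlines_external</codeph> line up with
+  <codeph>airlines</codeph>, with <codeph>year</codeph> last: </p>
+<codeblock>> SET PARQUET_FILE_SIZE=67108864;  -- aim for roughly 64 MB per data file
+> INSERT OVERWRITE airlines PARTITION (year)
+    SELECT * FROM airlines_external;  -- each writer now emits several smaller files
+</codeblock>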
<p>
However, the more data files, the less data goes in each one. The overhead of dividing the work in a
@@ -2312,15 +2182,19 @@ The way to see how well it works in practice is to run the same queries against
flat table and the new partitioned table, and compare times.
</p>
-<p>
-Spoiler: in this case, with my particular 4-node cluster with its specific distribution of data blocks and my
-particular exploratory queries, queries against the partitioned table do consistently run faster than the same queries
-against the unpartitioned table. But I could not be sure that would be the case without some real measurements.
-Here are some queries I ran to draw that conclusion, first against <codeph>AIRLINES_EXTERNAL</codeph> (no partitioning),
-then against <codeph>AIRLINES</codeph> (partitioned by year). The <codeph>AIRLINES</codeph> queries are consistently faster.
-Changing the volume of data, changing the size of the cluster, running queries that did or didn't refer to the partition key
-columns, or other factors could change the results to favor one table layout or the other.
-</p>
+<p> Spoiler: in this case, with my particular 4-node cluster with its specific
+  distribution of data blocks and my particular exploratory queries,
+  queries against the partitioned table do consistently run faster than
+  the same queries against the unpartitioned table. But I could not be
+  sure that would be the case without some real measurements. Here are
+  some queries I ran to draw that conclusion, first against
+  <codeph>airlines_external</codeph> (no partitioning), then against
+  <codeph>airlines</codeph> (partitioned by year). The
+  <codeph>airlines</codeph> queries are consistently faster. Changing
+  the volume of data, changing the size of the cluster, running queries
+  that did or didn't refer to the partition key columns, or other
+  factors could change the results to favor one table layout or the
+  other. </p>
<note>
If you find the volume of each partition is only in the low tens of megabytes, consider lowering the granularity
@@ -2330,64 +2204,36 @@ even hundreds of megabytes per Parquet file, and the number of Parquet files in
higher than the number of data nodes.
</note>
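+<p> As a hypothetical illustration of coarser granularity (the
+  <codeph>airlines_by_decade</codeph> table and the decade bucketing
+  are invented for this sketch, and partitioned CTAS requires a recent
+  Impala release), you could group the years into decades: </p>
+<codeblock>> CREATE TABLE airlines_by_decade
+    PARTITIONED BY (decade)
+    STORED AS PARQUET
+  AS SELECT a.*, year - (year % 10) AS decade
+     FROM airlines_external a;  -- partition column must come last in the select list
+</codeblock>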
-<codeblock>[localhost:21000] > select sum(airtime) from airlines_external;
-+--------------+
-| sum(airtime) |
+<codeblock>> SELECT SUM(airtime) FROM airlines_external;
+--------------+
| 8662859484 |
+--------------+
-Fetched 1 row(s) in 2.02s
-[localhost:21000] > select sum(airtime) from airlines;
-+--------------+
-| sum(airtime) |
+
+> SELECT SUM(airtime) FROM airlines;
+--------------+
| 8662859484 |
+--------------+
-Fetched 1 row(s) in 1.21s
-[localhost:21000] > select sum(airtime) from airlines_external where year = 2005;
-+--------------+
-| sum(airtime) |
-+--------------+
-| 708204026 |
-+--------------+
-Fetched 1 row(s) in 2.61s
-[localhost:21000] > select sum(airtime) from airlines where year = 2005;
-+--------------+
-| sum(airtime) |
+> SELECT SUM(airtime) FROM airlines_external WHERE year = 2005;
+--------------+
| 708204026 |
+--------------+
-Fetched 1 row(s) in 1.19s
-[localhost:21000] > select sum(airtime) from airlines_external where dayofweek = 1;
+> SELECT SUM(airtime) FROM airlines WHERE year = 2005;
+--------------+
-| sum(airtime) |
-+--------------+
-| 1264945051 |
-+--------------+
-Fetched 1 row(s) in 2.82s
-[localhost:21000] > select sum(airtime) from airlines where dayofweek = 1;
-+--------------+
-| sum(airtime) |
-+--------------+
-| 1264945051 |
+| 708204026 |
+--------------+
-Fetched 1 row(s) in 1.61s
</codeblock>
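+<p> To turn impressions into real measurements, the
+  <cmdname>impala-shell</cmdname> <codeph>SUMMARY</codeph> command is
+  handy: run it right after a query to get per-operator timings for
+  the most recent statement (shown here with one of the queries from
+  above): </p>
+<codeblock>> SELECT SUM(airtime) FROM airlines WHERE year = 2005;
+> SUMMARY;
+</codeblock>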
-<p>
-Now we can finally do some serious analysis with this data set that, remember,
-a few minutes ago all we had were some raw data files and we didn't even know
-what columns they contained.
-Let's see whether the <q>air time</q> of a flight tends to be different depending on the
-day of the week. We can see that the average is a little higher on day number 6;
-perhaps Saturday is a busy flying day and planes have to circle for longer at the
-destination airport before landing.
-</p>
+<p> Now we can finally do some serious analysis with this data set that,
+  remember, a few minutes ago was nothing but raw data files whose
+  columns we did not even know. Let's see whether the
+  <codeph>airtime</codeph> of a flight tends to be different depending
+  on the day of the week. We can see that the average is a little higher
+  on day number 6; perhaps Saturday is a busy flying day and planes have
+  to circle for longer at the destination airport before landing. </p>
-<codeblock>[localhost:21000] > select dayofweek, avg(airtime) from airlines
- > group by dayofweek order by dayofweek;
+<codeblock>> SELECT dayofweek, AVG(airtime) FROM airlines
+ GROUP BY dayofweek ORDER BY dayofweek;
+-----------+-------------------+
| dayofweek | avg(airtime) |
+-----------+-------------------+
@@ -2399,7 +2245,6 @@ destination airport before landing.
| 6 | 105.3627448363705 |
| 7 | 103.4144351202054 |
+-----------+-------------------+
-Fetched 7 row(s) in 2.25s
</codeblock>
<p>
@@ -2410,8 +2255,8 @@ of <codeph>NULL</codeph> for this column in years 1987 to 1994 shows that querie
need to be restricted to a date range of 1995 and higher.
</p>
-<codeblock>[localhost:21000] > select year, dayofweek, avg(airtime) from airlines
- > group by year, dayofweek order by year desc, dayofweek;
+<codeblock>> SELECT year, dayofweek, AVG(airtime) FROM airlines
+ GROUP BY year, dayofweek ORDER BY year DESC, dayofweek;
+------+-----------+-------------------+
| year | dayofweek | avg(airtime) |
+------+-----------+-------------------+
@@ -2456,28 +2301,5 @@ need to be restricted to a date range of 1995 and higher.
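+<p> For example, building on that restriction (a small sketch rather than a
+  query from the original run), the day-of-week comparison can exclude
+  the years where <codeph>airtime</codeph> is <codeph>NULL</codeph>: </p>
+<codeblock>> SELECT dayofweek, AVG(airtime) FROM airlines
+    WHERE year >= 1995
+  GROUP BY dayofweek ORDER BY dayofweek;
+</codeblock>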
</conbody>
</concept>
-
- <concept audience="hidden" id="tut_synthesizing">
-
- <title>Synthesizing Data</title>
-
- <conbody>
-
-<!-- Scenarios:
- INSERT VALUES 1 row or a few rows, then duplicate the values with slight alterations.
- - Use row_number() and rank() to assign IDs to values from narrow tables.
- - Use sum() in analytic context, with sliding window, to derive other arbitrary numbers.
- Grab database, table, column, etc. names from metastore and play with those.
- Maybe take results of a 'find' and play with those.
- - Invent fields like 'username' based on /home or whatever subdirectories.
- - Perhaps also capture permissions, timestamps, etc. via 'ls -l'.
- - Illustrate narrow table with 'extension' field, wide table with 'jpg', 'Office', 'picture', 'XML', etc. fields
- based on case() of extension.
- Decompose some string values using regexp_extract.
- Set up views to get the logical layout right before making new physical copies of data.
--->
-
- <p></p>
- </conbody>
</concept>
</concept>