Posted to commits@hawq.apache.org by yo...@apache.org on 2017/02/22 00:39:26 UTC
incubator-hawq-docs git commit: HAWQ-1289 - add hive profile example
with partitioning and mixed data types (closes #90)
Repository: incubator-hawq-docs
Updated Branches:
refs/heads/develop 7a7a87a65 -> d0745d7e7
HAWQ-1289 - add hive profile example with partitioning and mixed data types (closes #90)
Project: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/commit/d0745d7e
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/tree/d0745d7e
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/diff/d0745d7e
Branch: refs/heads/develop
Commit: d0745d7e7924287a7ef57a30b83adcc60f1a6655
Parents: 7a7a87a
Author: Lisa Owen <lo...@pivotal.io>
Authored: Tue Feb 21 16:39:19 2017 -0800
Committer: David Yozie <yo...@apache.org>
Committed: Tue Feb 21 16:39:19 2017 -0800
----------------------------------------------------------------------
markdown/pxf/HivePXF.html.md.erb | 278 ++++++++++++++++++++++++++--------
1 file changed, 216 insertions(+), 62 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/d0745d7e/markdown/pxf/HivePXF.html.md.erb
----------------------------------------------------------------------
diff --git a/markdown/pxf/HivePXF.html.md.erb b/markdown/pxf/HivePXF.html.md.erb
index 51b317c..6101016 100644
--- a/markdown/pxf/HivePXF.html.md.erb
+++ b/markdown/pxf/HivePXF.html.md.erb
@@ -155,6 +155,8 @@ Create a Hive table to expose our sample data set.
hive> LOAD DATA LOCAL INPATH '/tmp/pxf_hive_datafile.txt'
INTO TABLE sales_info;
```
+
+ In examples later in this section, you will access the `sales_info` Hive table directly via PXF. You will also insert `sales_info` data into tables of other Hive file format types, and use PXF to access those directly as well.
3. Perform a query on `sales_info` to verify that the data was loaded successfully:
@@ -162,7 +164,21 @@ Create a Hive table to expose our sample data set.
hive> SELECT * FROM sales_info;
```
-In examples later in this section, you will access the `sales_info` Hive table directly via PXF. You will also insert `sales_info` data into tables of other Hive file format types, and use PXF to access those directly as well.
+### <a id="get_hdfs_file_location"></a>Determine the HDFS location of a Hive Table
+
+Should you need to access the underlying HDFS files of a Hive managed table, you can determine the table's location in HDFS with the Hive `DESCRIBE` command, for example:
+
+``` sql
+hive> DESCRIBE EXTENDED sales_info;
+Detailed Table Information
+...
+location:hdfs://<namenode>:<port>/apps/hive/warehouse/sales_info
+...
+```
+
+The `location` value identifies the HDFS file path of the table.
+
+
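If you script against this output, the HDFS path can be pulled from the `DESCRIBE EXTENDED` text. The following Python sketch is illustrative only; the helper name and the sample namenode/port are assumptions, not part of this commit:

``` python
import re

# Hypothetical helper: extract the HDFS path from `DESCRIBE EXTENDED` output.
# The "location:" field name follows the sample output shown above.
def parse_hdfs_location(describe_output: str) -> str:
    match = re.search(r"location:(\S+)", describe_output)
    if match is None:
        raise ValueError("no location field found in DESCRIBE output")
    return match.group(1)

sample = """Detailed Table Information
...
location:hdfs://namenode:8020/apps/hive/warehouse/sales_info
...
"""
print(parse_hdfs_location(sample))  # hdfs://namenode:8020/apps/hive/warehouse/sales_info
```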
## <a id="hcatalog"></a>Using PXF and HCatalog to Query Hive
@@ -240,7 +256,7 @@ To obtain a description of a Hive table with HCatalog integration, you can use t
```
``` sql
- postgres=# \d+ hcatalog.default.sales_info_rcfile;
+ postgres=# \d+ hcatalog.default.sales_info;
```
``` shell
@@ -327,7 +343,7 @@ Hive-plug-in-specific keywords and values used in the [CREATE EXTERNAL TABLE](..
## <a id="profile_hive"></a>Hive Profile
-Use the `Hive` profile with any Hive file storage format. With the `Hive` profile, you can also access heterogenous format data in a single table where each partition may be stored in a different file format. In both cases, the `Hive` profile will use the optimal `Hive*` profile for the underlying file format type.
+Use the `Hive` profile with any Hive file storage format. With the `Hive` profile, you can also access heterogeneous-format data in a single table where each partition may be stored in a different file format. In both cases, the `Hive` profile will use the optimal `Hive*` profile for the underlying file storage type. Refer to the [Partition Filtering](#partitionfiltering) discussion later in this topic for additional information on partitioning and the `Hive` profile.
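As a rough mental model, this delegation can be pictured as a lookup from file storage format to format-specific profile. The mapping below is an illustrative Python sketch, not PXF's actual dispatch code; only the `HiveText` and `HiveRC` associations are named in this topic, and the fallback behavior is an assumption:

``` python
# Illustrative only: format-specific profiles the Hive profile may delegate to.
OPTIMAL_PROFILE = {
    "TEXTFILE": "HiveText",
    "RCFILE": "HiveRC",
}

def profile_for(partition_format: str) -> str:
    # Assumed fallback: the generic Hive profile for other storage formats.
    return OPTIMAL_PROFILE.get(partition_format, "Hive")

print(profile_for("RCFILE"))  # HiveRC
```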
### <a id="profile_hive_using"></a>Example: Using the Hive Profile
@@ -361,7 +377,7 @@ Use the `Hive` profile to create a queryable HAWQ external table from the Hive `
Use the `HiveText` profile to query text format files.
-**Note**: When using the `HiveText` profile, you *must* specify a delimiter option in *both* the `LOCATION` and `FORMAT` clauses.
+**Note**: When using the `HiveText` profile, you **must** specify a delimiter option in both the `LOCATION` and `FORMAT` clauses.
### <a id="profile_hivetext_using"></a>Example: Using the HiveText Profile
@@ -399,7 +415,7 @@ Use the PXF `HiveText` profile to create a queryable HAWQ external table from th
The RCFile Hive format is used for row columnar formatted data. The `HiveRC` profile provides access to RCFile data.
-**Note**: When using the `HiveRC` profile, you *must* specify a delimiter option in *both* the `LOCATION` and `FORMAT` clauses.
+**Note**: When using the `HiveRC` profile, you **must** specify a delimiter option in both the `LOCATION` and `FORMAT` clauses.
### <a id="profile_hiverc_rcfiletbl_using"></a>Example: Using the HiveRC Profile
@@ -432,7 +448,7 @@ Use the `HiveRC` profile to query RCFile-formatted data in Hive tables.
hive> SELECT * FROM sales_info_rcfile;
```
-4. Use the PXF `HiveRC` profile to create a queryable HAWQ external table from the Hive `sales_info_rcfile` table created in the previous step. When using the `HiveRC` profile, you **must** specify a delimiter option in *both* the `LOCATION` and `FORMAT` clauses.:
+4. Use the PXF `HiveRC` profile to create a queryable HAWQ external table from the Hive `sales_info_rcfile` table created in the previous step. You **must** specify a delimiter option in both the `LOCATION` and `FORMAT` clauses:
``` sql
postgres=# CREATE EXTERNAL TABLE salesinfo_hivercprofile(location text, month text, num_orders int, total_sales float8)
@@ -611,76 +627,214 @@ postgres=# SHOW pxf_enable_filter_pushdown;
postgres=# SET pxf_enable_filter_pushdown=off;
```
-### <a id="example2"></a>Create Partitioned Hive Table
+### <a id="example_hive_part"></a>Example: Using the Hive Profile to Access Partitioned Homogeneous Data
-Create a Hive table `sales_part` with two partition columns, `delivery_state` and `delivery_city:`
+In this example, you will use the `Hive` profile to query a Hive table named `sales_part` that you partition on the `delivery_state` and `delivery_city` fields. You will then create a HAWQ external table to query `sales_part`, including specific examples illustrating filter pushdown.
-``` sql
-hive> CREATE TABLE sales_part (name string, type string, supplier_key int, price double)
- PARTITIONED BY (delivery_state string, delivery_city string)
- ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-```
+1. Create a Hive table named `sales_part` with two partition columns, `delivery_state` and `delivery_city`:
-Load data into this Hive table and add some partitions:
+ ``` sql
+ hive> CREATE TABLE sales_part (name string, type string, supplier_key int, price double)
+ PARTITIONED BY (delivery_state string, delivery_city string)
+ ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
+ ```
-``` sql
-hive> INSERT INTO TABLE sales_part
- PARTITION(delivery_state = 'CALIFORNIA', delivery_city = 'Fresno')
- VALUES ('block', 'widget', 33, 15.17);
-hive> INSERT INTO TABLE sales_part
- PARTITION(delivery_state = 'CALIFORNIA', delivery_city = 'Sacramento')
- VALUES ('cube', 'widget', 11, 1.17);
-hive> INSERT INTO TABLE sales_part
- PARTITION(delivery_state = 'NEVADA', delivery_city = 'Reno')
- VALUES ('dowel', 'widget', 51, 31.82);
-hive> INSERT INTO TABLE sales_part
- PARTITION(delivery_state = 'NEVADA', delivery_city = 'Las Vegas')
- VALUES ('px49', 'pipe', 52, 99.82);
-```
+2. Load data into this Hive table and add some partitions:
-The Hive storage directory structure for the `sales_part` table appears as follows:
+ ``` sql
+ hive> INSERT INTO TABLE sales_part
+ PARTITION(delivery_state = 'CALIFORNIA', delivery_city = 'Fresno')
+ VALUES ('block', 'widget', 33, 15.17);
+ hive> INSERT INTO TABLE sales_part
+ PARTITION(delivery_state = 'CALIFORNIA', delivery_city = 'Sacramento')
+ VALUES ('cube', 'widget', 11, 1.17);
+ hive> INSERT INTO TABLE sales_part
+ PARTITION(delivery_state = 'NEVADA', delivery_city = 'Reno')
+ VALUES ('dowel', 'widget', 51, 31.82);
+ hive> INSERT INTO TABLE sales_part
+ PARTITION(delivery_state = 'NEVADA', delivery_city = 'Las Vegas')
+ VALUES ('px49', 'pipe', 52, 99.82);
+ ```
+
+3. Query the `sales_part` table:
-``` pre
-$ sudo -u hdfs hdfs dfs -ls -R /apps/hive/warehouse/sales_part
-/apps/hive/warehouse/sales_part/delivery_state=CALIFORNIA/delivery_city=\u2019Fresno\u2019/
-/apps/hive/warehouse/sales_part/delivery_state=CALIFORNIA/delivery_city=Sacramento/
-/apps/hive/warehouse/sales_part/delivery_state=NEVADA/delivery_city=Reno/
-/apps/hive/warehouse/sales_part/delivery_state=NEVADA/delivery_city=\u2019Las Vegas\u2019/
-```
+ ``` sql
+ hive> SELECT * FROM sales_part;
+ ```
+
+ A `SELECT *` statement on a Hive partitioned table shows the partition fields at the end of the record.
-To define a HAWQ PXF table that will read this Hive table and take advantage of partition filter push-down, define the fields corresponding to the Hive partition fields at the end of the `CREATE EXTERNAL TABLE` attribute list. In HiveQL, a `SELECT *` statement on a partitioned table shows the partition fields at the end of the record.
+4. Examine the Hive/HDFS directory structure for the `sales_part` table:
-``` sql
-postgres=# CREATE EXTERNAL TABLE pxf_sales_part(
- item_name TEXT,
- item_type TEXT,
- supplier_key INTEGER,
- item_price DOUBLE PRECISION,
- delivery_state TEXT,
- delivery_city TEXT
-)
-LOCATION ('pxf://namenode:51200/sales_part?Profile=Hive')
-FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
-
-postgres=# SELECT * FROM pxf_sales_part;
-```
+ ``` shell
+ $ sudo -u hdfs hdfs dfs -ls -R /apps/hive/warehouse/sales_part
+ /apps/hive/warehouse/sales_part/delivery_state=CALIFORNIA/delivery_city=Fresno/
+ /apps/hive/warehouse/sales_part/delivery_state=CALIFORNIA/delivery_city=Sacramento/
+ /apps/hive/warehouse/sales_part/delivery_state=NEVADA/delivery_city=Reno/
+ /apps/hive/warehouse/sales_part/delivery_state=NEVADA/delivery_city=Las Vegas/
+ ```
-### <a id="example3"></a>Query Without Pushdown
+5. Create a PXF external table to read the partitioned `sales_part` Hive table. To take advantage of partition filter push-down, define fields corresponding to the Hive partition fields at the end of the `CREATE EXTERNAL TABLE` attribute list.
-In the following example, the HAWQ query filters the `delivery_city` partition `Sacramento`. The filter on� `item_name` is not pushed down, since it is not a partition column. It is performed on the HAWQ side after all the data on `Sacramento` is transferred for processing.
+ ``` shell
+ $ psql -d postgres
+ ```
-``` sql
-postgres=# SELECT * FROM pxf_sales_part WHERE delivery_city = 'Sacramento' AND item_name = 'cube';
-```
+ ``` sql
+ postgres=# CREATE EXTERNAL TABLE pxf_sales_part(
+ item_name TEXT, item_type TEXT,
+ supplier_key INTEGER, item_price DOUBLE PRECISION,
+ delivery_state TEXT, delivery_city TEXT)
+ LOCATION ('pxf://namenode:51200/sales_part?Profile=Hive')
+ FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
+ ```
+
+6. Query the table:
+
+ ``` sql
+ postgres=# SELECT * FROM pxf_sales_part;
+ ```
-### <a id="example4"></a>Query With Pushdown
+7. Perform another query (without pushdown) on `pxf_sales_part` to return records where `delivery_city` is `Sacramento` and `item_name` is `cube`:
-The following HAWQ query reads all the data under�`delivery_state` partition `CALIFORNIA`, regardless of the city.
+ ``` sql
+ postgres=# SELECT * FROM pxf_sales_part WHERE delivery_city = 'Sacramento' AND item_name = 'cube';
+ ```
-``` sql
-postgres=# SET pxf_enable_filter_pushdown=on;
-postgres=# SELECT * FROM pxf_sales_part WHERE delivery_state = 'CALIFORNIA';
-```
+ The query filters the `delivery_city` partition `Sacramento`. The filter on `item_name` is not pushed down, since it is not a partition column. It is performed on the HAWQ side after all the data in the `Sacramento` partition is transferred for processing.
+
+8. Query (with pushdown) for all records where `delivery_state` is `CALIFORNIA`:
+
+ ``` sql
+ postgres=# SET pxf_enable_filter_pushdown=on;
+ postgres=# SELECT * FROM pxf_sales_part WHERE delivery_state = 'CALIFORNIA';
+ ```
+
+ This query reads all of the data in the `CALIFORNIA` `delivery_state` partition, regardless of the city.
+
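The pruning idea behind filter pushdown can be sketched as follows: with the `key=value` directory layout listed earlier in this example, a filter on a partition column selects a subset of partition directories before any data is read. This Python fragment is a simplified illustration, not PXF's implementation:

``` python
# Partition directories matching the HDFS layout shown earlier in this example.
partition_dirs = [
    "/apps/hive/warehouse/sales_part/delivery_state=CALIFORNIA/delivery_city=Fresno",
    "/apps/hive/warehouse/sales_part/delivery_state=CALIFORNIA/delivery_city=Sacramento",
    "/apps/hive/warehouse/sales_part/delivery_state=NEVADA/delivery_city=Reno",
    "/apps/hive/warehouse/sales_part/delivery_state=NEVADA/delivery_city=Las Vegas",
]

def prune(dirs, **filters):
    # Keep a directory only if every filtered key=value appears as a path segment.
    return [d for d in dirs
            if all(f"{key}={value}" in d.split("/") for key, value in filters.items())]

# Only the two CALIFORNIA partitions would be scanned for this filter.
print(prune(partition_dirs, delivery_state="CALIFORNIA"))
```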
+### <a id="example_hive_part_multi"></a>Example: Using the Hive Profile to Access Partitioned Heterogeneous Data
+
+The `Hive` profile supports multiple data format types. This support enables you to query a partitioned Hive table that may be composed of data of different formats.
+
+In this example, you will use the `Hive` profile both directly and indirectly via PXF HCatalog integration to query a partitioned Hive external table. The table is composed of the HDFS data files associated with the `sales_info` (text format) and `sales_info_rcfile` (RC format) Hive tables you created in previous exercises. You will partition the data by year, assigning the data from `sales_info` to the year 2013, and the data from `sales_info_rcfile` to the year 2016. (For now, ignore the fact that the two tables contain the same data.)
+
+1. Create a Hive external table named `hive_multiformpart` that is partitioned by a string field named `year`:
+
+ ``` shell
+ $ HADOOP_USER_NAME=hdfs hive
+ ```
+
+ ``` sql
+ hive> CREATE EXTERNAL TABLE hive_multiformpart( location string, month string, number_of_orders int, total_sales double)
+ PARTITIONED BY( year string )
+ ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
+ ```
+
+2. Describe the `sales_info` and `sales_info_rcfile` tables, making note of each table's HDFS `location`:
+
+ ``` sql
+ hive> DESCRIBE EXTENDED sales_info;
+ hive> DESCRIBE EXTENDED sales_info_rcfile;
+ ```
+
+3. Create partitions in the `hive_multiformpart` table for the HDFS locations associated with each of the `sales_info` and `sales_info_rcfile` tables:
+
+ ``` sql
+ hive> ALTER TABLE hive_multiformpart ADD PARTITION (year = '2013') LOCATION 'hdfs://namenode:8020/apps/hive/warehouse/sales_info';
+ hive> ALTER TABLE hive_multiformpart ADD PARTITION (year = '2016') LOCATION 'hdfs://namenode:8020/apps/hive/warehouse/sales_info_rcfile';
+ ```
+
+4. Explicitly identify the file format of the partition associated with the `sales_info_rcfile` table:
+
+ ``` sql
+ hive> ALTER TABLE hive_multiformpart PARTITION (year='2016') SET FILEFORMAT RCFILE;
+ ```
+
+ You need not specify the file format of the partition associated with the `sales_info` table, as `TEXTFILE` format is the default.
+
+5. Query the `hive_multiformpart` table:
+
+ ``` sql
+ hive> SELECT * from hive_multiformpart;
+ ...
+ Bangalore Jul 271 8320.55 2016
+ Beijing Dec 100 4248.41 2016
+ Prague Jan 101 4875.33 2013
+ Rome Mar 87 1557.39 2013
+ ...
+ hive> SELECT * from hive_multiformpart WHERE year='2013';
+ hive> SELECT * from hive_multiformpart WHERE year='2016';
+ ```
+
+6. Show the partitions defined for the `hive_multiformpart` table and exit `hive`:
+
+ ``` sql
+ hive> SHOW PARTITIONS hive_multiformpart;
+ year=2013
+ year=2016
+ hive> quit;
+ ```
+
+7. Start the `psql` subsystem:
+
+ ``` shell
+ $ psql -d postgres
+ ```
+
+8. Use PXF HCatalog integration to query the Hive `hive_multiformpart` external table you created in the previous steps:
+
+ ``` sql
+ postgres=# SELECT * FROM hcatalog.default.hive_multiformpart;
+ ```
+
+ ``` shell
+ location | month | number_of_orders | total_sales | year
+ ---------------+-------+------------------+-------------+------
+ ...
+ Prague | Dec | 333 | 9894.77 | 2013
+ Bangalore | Jul | 271 | 8320.55 | 2013
+ Beijing | Dec | 100 | 4248.41 | 2013
+ Prague | Jan | 101 | 4875.33 | 2016
+ Rome | Mar | 87 | 1557.39 | 2016
+ Bangalore | May | 317 | 8936.99 | 2016
+ ...
+ ```
+
+9. Use the PXF `Hive` profile to create a readable HAWQ external table derived from the Hive `hive_multiformpart` external table you created in the previous steps:
+
+ ``` sql
+ postgres=# CREATE EXTERNAL TABLE pxf_multiformpart(location text, month text, num_orders int, total_sales float8, year text)
+ LOCATION ('pxf://namenode:51200/default.hive_multiformpart?PROFILE=Hive')
+ FORMAT 'CUSTOM' (formatter='pxfwritable_import');
+ ```
+
+10. Query the PXF external table:
+
+ ``` sql
+ postgres=# SELECT * FROM pxf_multiformpart;
+ ```
+
+ ``` shell
+ location | month | num_orders | total_sales | year
+ ---------------+-------+------------+-------------+--------
+ ....
+ Prague | Dec | 333 | 9894.77 | 2013
+ Bangalore | Jul | 271 | 8320.55 | 2013
+ Beijing | Dec | 100 | 4248.41 | 2013
+ Prague | Jan | 101 | 4875.33 | 2016
+ Rome | Mar | 87 | 1557.39 | 2016
+ Bangalore | May | 317 | 8936.99 | 2016
+ ....
+ ```
+
+11. Perform a second query to calculate the total number of orders placed in December of 2013:
+
+ ``` sql
+ postgres=# SELECT sum(num_orders) FROM pxf_multiformpart WHERE month='Dec' AND year='2013';
+ sum
+ -----
+ 433
+ ```
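The `sum` computed above can be mirrored outside of SQL. The following Python sketch uses the December 2013 rows from the sample query output; the row subset chosen here is illustrative:

``` python
# (month, year, num_orders) values taken from the sample output above.
rows = [
    ("Dec", "2013", 333),  # Prague
    ("Dec", "2013", 100),  # Beijing
    ("Jan", "2016", 101),  # Prague
    ("Mar", "2016", 87),   # Rome
]

# Equivalent of: SELECT sum(num_orders) ... WHERE month='Dec' AND year='2013';
total = sum(n for month, year, n in rows if month == "Dec" and year == "2013")
print(total)  # 433
```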
## <a id="topic_fdm_zrh_1s"></a>Using PXF with Hive Default Partitions