Posted to commits@hawq.apache.org by yo...@apache.org on 2017/03/20 21:19:49 UTC

incubator-hawq-docs git commit: HAWQ-1394 document HiveORC profile (closes #104)

Repository: incubator-hawq-docs
Updated Branches:
  refs/heads/develop ce67adabd -> 8c37d043a


HAWQ-1394 document HiveORC profile (closes #104)


Project: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/commit/8c37d043
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/tree/8c37d043
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/diff/8c37d043

Branch: refs/heads/develop
Commit: 8c37d043ab8b6ef59315a7bd92acb2c9b45dcedb
Parents: ce67ada
Author: Lisa Owen <lo...@pivotal.io>
Authored: Mon Mar 20 14:19:45 2017 -0700
Committer: David Yozie <yo...@apache.org>
Committed: Mon Mar 20 14:19:45 2017 -0700

----------------------------------------------------------------------
 markdown/pxf/HivePXF.html.md.erb | 183 ++++++++++++++++++++++++++++++++--
 1 file changed, 175 insertions(+), 8 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/8c37d043/markdown/pxf/HivePXF.html.md.erb
----------------------------------------------------------------------
diff --git a/markdown/pxf/HivePXF.html.md.erb b/markdown/pxf/HivePXF.html.md.erb
index bc4e9f6..345bbca 100644
--- a/markdown/pxf/HivePXF.html.md.erb
+++ b/markdown/pxf/HivePXF.html.md.erb
@@ -50,7 +50,7 @@ The PXF Hive plug-in supports several file formats and profiles for accessing th
 | TextFile | Flat file with data in comma-, tab-, or space-separated value format or JSON notation. | Hive, HiveText |
 | SequenceFile | Flat file consisting of binary key/value pairs. | Hive |
 | RCFile | Record columnar data consisting of binary key/value pairs; high row compression rate. | Hive, HiveRC |
-| ORCFile | Optimized row columnar data with stripe, footer, and postscript sections; reduces data size. | Hive |
+| ORCFile | Optimized row columnar data with stripe, footer, and postscript sections; reduces data size. | Hive, HiveORC |
 | Parquet | Compressed columnar data representation. | Hive |
 | Avro | JSON-defined, schema-based data serialization format. | Hive |
 
@@ -82,7 +82,7 @@ The following table summarizes external mapping rules for Hive primitive types.
 
 Hive supports complex data types including array, struct, map, and union. PXF maps each of these complex types to `text`.  While HAWQ does not natively support these types, you can create HAWQ functions or application code to extract subcomponents of these complex data types.
 
-An example using complex data types with the `Hive` profile is provided later in this topic.
+Examples using complex data types with the `Hive` and `HiveORC` profiles are provided later in this topic.
 
 
 ## <a id="hive_sampledataset"></a>Sample Data Set
@@ -316,7 +316,7 @@ HCatalog integration has the following limitations:
 
 In the previous section, you used HCatalog integration to query a Hive table. You can also create a PXF/HAWQ external table to access Hive table data. This Hive table access mechanism requires that you identify an appropriate Hive profile.
 
-The PXF Hive plug-in supports several Hive-related profiles. These include `Hive`, `HiveText`, and `HiveRC`. The `HiveText` and `HiveRC` profiles are specifically optimized for text and RC file formats, respectively. The `Hive` profile is optimized for all file storage types; use the `Hive` profile when the underlying Hive table is composed of multiple partitions with differing file formats.
+The PXF Hive plug-in supports several Hive-related profiles. These include `Hive`, `HiveText`, `HiveRC`, and `HiveORC`. The `HiveText` and `HiveRC` profiles are specifically optimized for text and RC file formats, respectively. The `HiveORC` profile is optimized for the ORC file format. The `Hive` profile is optimized for all file storage types; use the `Hive` profile when the underlying Hive table is composed of multiple partitions with differing file formats.
 
 Use the following syntax to create a HAWQ external table representing Hive data:
 
@@ -324,7 +324,7 @@ Use the following syntax to create a HAWQ external table representing Hive data:
 CREATE EXTERNAL TABLE <table_name>
     ( <column_name> <data_type> [, ...] | LIKE <other_table> )
 LOCATION ('pxf://<host>[:<port>]/<hive-db-name>.<hive-table-name>
-    ?PROFILE=Hive|HiveText|HiveRC[&DELIMITER=<delim>'])
+    ?PROFILE=Hive|HiveText|HiveRC|HiveORC[&DELIMITER=<delim>'])
 FORMAT 'CUSTOM|TEXT' (formatter='pxfwritable_import' | delimiter='<delim>')
 ```
 
@@ -336,9 +336,9 @@ Hive-plug-in-specific keywords and values used in the [CREATE EXTERNAL TABLE](..
 | \<port\>    | The PXF port. If \<port\> is omitted, PXF assumes \<host\> identifies a High Availability HDFS Nameservice and connects to the port number designated by the `pxf_service_port` server configuration parameter value. Default is 51200. |
 | \<hive-db-name\>    | The name of the Hive database. If omitted, defaults to the Hive database named `default`. |
 | \<hive-table-name\>    | The name of the Hive table. |
-| PROFILE    | The `PROFILE` keyword must specify one of the values `Hive`, `HiveText`, or `HiveRC`. |
+| PROFILE    | The `PROFILE` keyword must specify one of the values `Hive`, `HiveText`, `HiveRC`, or `HiveORC`. |
 | DELIMITER    | The `DELIMITER` clause is required for both the `HiveText` and `HiveRC` profiles and identifies the field delimiter used in the Hive data set.  \<delim\> must be a single ASCII character or specified in hexadecimal representation. |
-| FORMAT (`Hive` profile)   | The `FORMAT` clause must specify `CUSTOM`. The `CUSTOM` format supports only the built-in `pxfwritable_import` `formatter`.   |
+| FORMAT (`Hive` and `HiveORC` profiles)   | The `FORMAT` clause must specify `CUSTOM`. The `CUSTOM` format supports only the built-in `pxfwritable_import` `formatter`.   |
 | FORMAT (`HiveText` and `HiveRC` profiles) | The `FORMAT` clause must specify `TEXT`. The `delimiter` must be specified a second time in '\<delim\>'. |
 
 
@@ -401,7 +401,7 @@ Use the PXF `HiveText` profile to create a queryable HAWQ external table from th
 2. Query the external table:
 
     ``` sql
-    postgres=# SELECT * FROM salesinfo_hivetextprofile WHERE location="Beijing";
+    postgres=# SELECT * FROM salesinfo_hivetextprofile WHERE location='Beijing';
     ```
 
     ``` shell
@@ -475,6 +475,96 @@ Use the `HiveRC` profile to query RCFile-formatted data in Hive tables.
      ...
     ```
 
+## <a id="hiveorc-intro"></a>HiveORC Profile
+
+The Optimized Row Columnar (ORC) file format is a columnar file format that provides a highly efficient way to both store and access HDFS data. ORC format offers improvements over text and RCFile formats in terms of both compression and performance. HAWQ/PXF supports ORC version 1.2.1.
+
+ORC is type-aware and specifically designed for Hadoop workloads. ORC files store the type and encoding information for the data they contain. All columns within a single group of row data (also known as a stripe) are stored together on disk in ORC format files. The columnar nature of the ORC format enables read projection, helping avoid accessing unnecessary columns during a query.
+
+ORC also supports predicate pushdown with built-in indexes at the file, stripe, and row levels, moving the filter operation to the data loading phase.
+
+Refer to the [Apache ORC](https://orc.apache.org/docs/) and the Apache Hive [LanguageManual ORC](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC) websites for detailed information about the ORC file format.
+
+Use the `HiveORC` profile to access ORC format data. The `HiveORC` profile provides:
+
+- Enhanced query performance - Column projection information is leveraged to reduce disk I/O and data payload.
+
+- Optimized predicate pushdown - Predicate pushdown is optimized for the following combinations (see the query sketch following this list):
+    - `int2`, `int4`, `int8`, `float8`, `text`, `bpchar`, and `boolean` data types combined with the `=`, `>`, `<`, `>=`, `<=`, `IS NULL`, and `IS NOT NULL` operators
+    - `=`, `>`, `<`, `>=`, `<=`, `IS NULL`, and `IS NOT NULL` operators and comparisons between the integer types
+    - `=`, `>`, `<`, `>=`, `<=`, `IS NULL`, and `IS NOT NULL` operators and comparisons between the `float8` and `float4` types
+    - the `IN` operator on arrays of `int2`, `int4`, `int8`, `boolean`, and `text`
+
+- Complex type support - You can access Hive tables composed of array, map, struct, and union data types. PXF serializes each of these complex types to `text`.
+
+**Note**: The `HiveORC` profile currently supports access to data stored in ORC format only through a Hive mapped table.
+
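+The following is a minimal query sketch illustrating pushdown-eligible predicates. It uses the `salesinfo_hiveORCprofile` external table created in the example below, combining supported data types (`text`, `float8`) with supported operators:
+
+``` sql
+postgres=> SELECT location, total_sales FROM salesinfo_hiveORCprofile
+             WHERE location IS NOT NULL AND total_sales > 1000;
+```
+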
+### <a id="using-hiveorc-profile"></a>Example: Using the HiveORC Profile
+
+In the following example, you will create a Hive table stored in ORC format and use the `HiveORC` profile to query this Hive table.
+
+1. Create a Hive table with ORC file format:
+
+    ``` shell
+    $ HADOOP_USER_NAME=hdfs hive
+    ```
+
+    ``` sql
+    hive> CREATE TABLE sales_info_ORC (location string, month string,
+            number_of_orders int, total_sales double)
+          STORED AS ORC;
+    ```
+
+2. Insert the data from the `sales_info` table into `sales_info_ORC`:
+
+    ``` sql
+    hive> INSERT INTO TABLE sales_info_ORC SELECT * FROM sales_info;
+    ```
+
+    A copy of the sample data set is now stored in ORC format in `sales_info_ORC`.
+
+3. Perform a Hive query on `sales_info_ORC` to verify that the data was loaded successfully:
+
+    ``` sql
+    hive> SELECT * FROM sales_info_ORC;
+    ```
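+
+    If the data was loaded successfully, output similar to the following is displayed:
+
+    ``` pre
+    OK
+    Prague       Jan    101    4875.33
+    Rome         Mar    87     1557.39
+    Bangalore    May    317    8936.99
+    ...
+    ```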
+
+4. Start the `psql` subsystem and turn on timing:
+
+    ``` shell
+    $ psql -d postgres
+    ```
+
+    ``` sql
+    postgres=> \timing
+    Timing is on.
+    ```
+
+5. Use the PXF `HiveORC` profile to create a queryable HAWQ external table from the Hive table named `sales_info_ORC` you created in Step 1. The `FORMAT` clause must specify `'CUSTOM'`. The `HiveORC` `CUSTOM` format supports only the built-in `'pxfwritable_import'` `formatter`.
+
+    ``` sql
+    postgres=> CREATE EXTERNAL TABLE salesinfo_hiveORCprofile(location text, month text, number_of_orders int, total_sales float8)
+                 LOCATION ('pxf://namenode:51200/default.sales_info_ORC?PROFILE=HiveORC')
+                 FORMAT 'CUSTOM' (formatter='pxfwritable_import');
+    ```
+
+6. Query the external table:
+
+    ``` sql
+    postgres=> SELECT * FROM salesinfo_hiveORCprofile;
+    ```
+
+    ``` pre
+       location    | month | number_of_orders | total_sales 
+    ---------------+-------+------------------+-------------
+     Prague        | Jan   |              101 |     4875.33
+     Rome          | Mar   |               87 |     1557.39
+     Bangalore     | May   |              317 |     8936.99
+     ...
+
+    Time: 425.416 ms
+    ```
+
 ## <a id="topic_dbb_nz3_ts"></a>Accessing Parquet-Format Hive Tables
 
 The PXF `Hive` profile supports both non-partitioned and partitioned Hive tables that use the Parquet storage format in HDFS. Simply map the table columns using equivalent HAWQ data types. For example, if a Hive table is created using:
@@ -498,8 +588,9 @@ And query the HAWQ external table using:
 postgres=# SELECT fname,lname FROM pxf_parquet_table;
 ```
 
+## <a id="complex_dt_example"></a> Complex Data Types
 
-## <a id="complex_dt_example"></a>Hive Profile Complex Data Type Example
+### <a id="using-hive-profile-complex"></a>Example: Using the Hive Profile with Complex Data Types
 
 This example employs the `Hive` profile and the array and map complex types, specifically an array of integers and a string key/value pair map.
 
@@ -607,6 +698,82 @@ When specifying an array field in a Hive table, you must identify the terminator
 
     `intarray` and `propmap` are each serialized as text strings.
 
+### <a id="using-hiveorc-profile-complex"></a>Example: Using the HiveORC Profile with Complex Data Types
+
+In the following example, you will create an ORC-format copy of the `table_complextypes` Hive table you created in the previous exercise, and then use the `HiveORC` profile to query the complex types in this new table.
+
+1. Create a Hive table with ORC file format:
+
+    ``` shell
+    $ HADOOP_USER_NAME=hdfs hive
+    ```
+
+    ``` sql
+    hive> CREATE TABLE table_complextypes_ORC( index int, name string, intarray ARRAY<int>, propmap MAP<string, string>)
+            ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
+            COLLECTION ITEMS TERMINATED BY '%'
+            MAP KEYS TERMINATED BY ':'
+          STORED AS ORC;
+    ```
+
+2. Insert the data from the `table_complextypes` table into `table_complextypes_ORC`:
+
+    ``` sql
+    hive> INSERT INTO TABLE table_complextypes_ORC SELECT * FROM table_complextypes;
+    ```
+
+    A copy of the sample data set is now stored in ORC format in `table_complextypes_ORC`.
+
+3. Perform a Hive query on `table_complextypes_ORC` to verify that the data was loaded successfully:
+
+    ``` sql
+    hive> SELECT * FROM table_complextypes_ORC;
+    ```
+
+    ``` pre
+    OK
+    3       Prague       [1,2,3]    {"zone":"euro","status":"up"}
+    89      Rome         [4,5,6]    {"zone":"euro"}
+    400     Bangalore    [7,8,9]    {"zone":"apac","status":"pending"}
+    ...
+    ```
+
+4. Start the `psql` subsystem:
+
+    ``` shell
+    $ psql -d postgres
+    ```
+
+5. Use the PXF `HiveORC` profile to create a queryable HAWQ external table from the Hive table named `table_complextypes_ORC` you created in Step 1. The `FORMAT` clause must specify `'CUSTOM'`. The `HiveORC` `CUSTOM` format supports only the built-in `'pxfwritable_import'` `formatter`.
+
+    ``` sql
+    postgres=> CREATE EXTERNAL TABLE complextypes_hiveorc(index int, name text, intarray text, propmap text)
+                 LOCATION ('pxf://namenode:51200/default.table_complextypes_ORC?PROFILE=HiveORC')
+                 FORMAT 'CUSTOM' (formatter='pxfwritable_import');
+    ```
+
+    Notice that the integer array and map complex types are mapped to type `text`.
+
+6. Query the external table:
+
+    ``` sql
+    postgres=> SELECT * FROM complextypes_hiveorc;
+    ```
+
+    ``` pre
+     index |    name    | intarray |              propmap               
+    -------+------------+----------+------------------------------------
+         3 | Prague     | [1,2,3]  | {"zone":"euro","status":"up"}
+        89 | Rome       | [4,5,6]  | {"zone":"euro"}
+       400 | Bangalore  | [7,8,9]  | {"zone":"apac","status":"pending"}
+     ...
+
+    ```
+
+    `intarray` and `propmap` are each serialized as text strings.
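+
+    As noted earlier in this topic, you can create HAWQ functions or application code to extract subcomponents of these serialized complex types. The following is a minimal sketch using built-in string functions, assuming the serialized formats shown in the query output above (the column aliases are illustrative):
+
+    ``` sql
+    postgres=> SELECT name,
+                      substring(propmap from '"zone":"([^"]+)"') AS zone_value,
+                      string_to_array(btrim(intarray, '[]'), ',') AS int_values
+                 FROM complextypes_hiveorc;
+    ```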
+
 ## <a id="partitionfiltering"></a>Partition Filtering
 
 The PXF Hive plug-in supports the Hive partitioning feature and directory structure. This enables partition exclusion on selected HDFS files comprising the Hive table. To use the partition filtering feature to reduce network traffic and I/O, run a PXF query using a `WHERE` clause that refers to a specific partition in the partitioned Hive table.
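+
+For example, assuming a Hive table partitioned by a `delivery_state` column has been mapped to a HAWQ external table named `sales_part` (both names are illustrative), a query of the following form restricts processing to a single partition:
+
+``` sql
+postgres=# SELECT * FROM sales_part WHERE delivery_state = 'CALIFORNIA';
+```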