Posted to commits@hawq.apache.org by yo...@apache.org on 2016/10/27 23:33:05 UTC

[11/14] incubator-hawq-docs git commit: address david's comments

address david's comments


Project: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/commit/8ee05a3d
Tree: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/tree/8ee05a3d
Diff: http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/diff/8ee05a3d

Branch: refs/heads/develop
Commit: 8ee05a3d29d2b74c107cf200352ebd5b16963604
Parents: 2a38a03
Author: Lisa Owen <lo...@pivotal.io>
Authored: Thu Oct 27 12:08:05 2016 -0700
Committer: Lisa Owen <lo...@pivotal.io>
Committed: Thu Oct 27 12:08:05 2016 -0700

----------------------------------------------------------------------
 pxf/HivePXF.html.md.erb | 84 ++++++++++++++++++++++++--------------------
 1 file changed, 46 insertions(+), 38 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-hawq-docs/blob/8ee05a3d/pxf/HivePXF.html.md.erb
----------------------------------------------------------------------
diff --git a/pxf/HivePXF.html.md.erb b/pxf/HivePXF.html.md.erb
index da851db..7d77bc5 100644
--- a/pxf/HivePXF.html.md.erb
+++ b/pxf/HivePXF.html.md.erb
@@ -22,23 +22,19 @@ Before accessing Hive data with HAWQ and PXF, ensure that:
 
 ## <a id="topic_p2s_lvl_25"></a>Hive File Formats
 
-Hive supports several file formats:
+The PXF Hive plug-in supports several file formats and profiles for accessing these formats:
 
--   TextFile - flat file with data in comma-, tab-, or space-separated value format or JSON notation
--   SequenceFile - flat file consisting of binary key/value pairs
--   RCFile - record columnar data consisting of binary key/value pairs; high row compression rate
--   ORCFile - optimized row columnar data with stripe, footer, and postscript sections; reduces data size
--   Parquet - compressed columnar data representation
--   Avro - JSON-defined, schema-based data serialization format
+| File Format  | Description | Profile |
+|-------|---------------------------|-------|
+| TextFile | Flat file with data in comma-, tab-, or space-separated value format or JSON notation. | Hive, HiveText |
+| SequenceFile | Flat file consisting of binary key/value pairs. | Hive |
+| RCFile | Record columnar data consisting of binary key/value pairs; high row compression rate. | Hive, HiveRC |
+| ORCFile | Optimized row columnar data with stripe, footer, and postscript sections; reduces data size. | Hive |
+| Parquet | Compressed columnar data representation. | Hive |
+| Avro | JSON-defined, schema-based data serialization format. | Hive |
 
 Refer to [File Formats](https://cwiki.apache.org/confluence/display/Hive/FileFormats) for detailed information about the file formats supported by Hive.
 
-The PXF Hive plug-in supports the following profiles for accessing the Hive file formats listed above. These include:
-
-- `Hive`
-- `HiveText`
-- `HiveRC`
-
 ## <a id="topic_p2s_lvl_29"></a>Data Type Mapping
 
 ### <a id="hive_primdatatypes"></a>Primitive Data Types
@@ -76,10 +72,12 @@ An example using complex data types is provided later in this topic.
 
 Examples used in this topic will operate on a common data set. This simple data set models a retail sales operation and includes fields with the following names and data types:
 
-- location - text
-- month - text
-- number\_of\_orders - integer
-- total\_sales - double
+| Field Name  | Data Type |
+|-------|---------------------------|
+| location | text |
+| month | text |
+| number\_of\_orders | integer |
+| total\_sales | double |
 
 Prepare the sample data set for use:
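+
+A sketch of this preparation step, assuming a comma-delimited text file at a hypothetical path such as `/tmp/pxf_hive_datafile.txt` (the path and sample rows below are illustrative, not taken from the original):
+
+``` pre
+Prague,Jan,101,4875.33
+Rome,Mar,87,1557.39
+Bangalore,May,317,8936.99
+Beijing,Jul,411,11600.67
+```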
 
@@ -131,7 +129,6 @@ Create a Hive table to expose our sample data set.
     ```
 
     Notice that:
-
     - The `STORED AS textfile` subclause instructs Hive to create the table in Textfile (the default) format.  Hive Textfile format supports comma-, tab-, and space-separated values, as well as data specified in JSON notation.
     - The `DELIMITED FIELDS TERMINATED BY` subclause identifies the field delimiter within a data record (line). The `sales_info` table field delimiter is a comma (`,`).
 
@@ -142,7 +139,7 @@ Create a Hive table to expose our sample data set.
             INTO TABLE sales_info;
     ```
 
-3. Perform a query on `sales_info` to verify the data was loaded successfully:
+3. Perform a query on `sales_info` to verify that the data was loaded successfully:
 
     ``` sql
     hive> SELECT * FROM sales_info;
@@ -169,8 +166,8 @@ Hive-plug-in-specific keywords and values used in the [CREATE EXTERNAL TABLE](..
 | Keyword  | Value |
 |-------|-------------------------------------|
 | \<host\>[:\<port\>]    | The HDFS NameNode and port. |
-| \<hive-db-name\>    | Name of the Hive database. If omitted, defaults to the Hive database named `default`. |
-| \<hive-table-name\>    | Name of the Hive table. |
+| \<hive-db-name\>    | The name of the Hive database. If omitted, defaults to the Hive database named `default`. |
+| \<hive-table-name\>    | The name of the Hive table. |
 | PROFILE    | The `PROFILE` keyword must specify one of the values `Hive`, `HiveText`, or `HiveRC`. |
 | DELIMITER    | The `DELIMITER` clause is required for both the `HiveText` and `HiveRC` profiles and identifies the field delimiter used in the Hive data set. \<delim\> must be a single ASCII character or specified in hexadecimal representation. |
 | FORMAT (`Hive` profile)   | The `FORMAT` clause must specify `CUSTOM`. The `CUSTOM` format supports only the built-in `pxfwritable_import` `formatter`.   |
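+
+For example, a `HiveRC` external table definition assembled from these keywords might look like the following (a sketch; the `namenode` host, `51200` port, and HAWQ table name are placeholder values, not from the original):
+
+``` sql
+postgres=# CREATE EXTERNAL TABLE salesinfo_hivercprofile (location text, month text,
+             num_orders int, total_sales float8)
+           LOCATION ('pxf://namenode:51200/default.sales_info_rcfile?PROFILE=HiveRC&DELIMITER=\x2c')
+           FORMAT 'TEXT' (delimiter=E',');
+```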
@@ -181,6 +178,9 @@ Hive-plug-in-specific keywords and values used in the [CREATE EXTERNAL TABLE](..
 
 The `Hive` profile works with any Hive file format.
 
+While you can use the `Hive` profile to access any of these file formats, the `HiveText` and `HiveRC` profiles perform better for the file formats they target.
+
 ### <a id="profile_hive_using"></a>Example: Using the Hive Profile
 
 Use the `Hive` profile to create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier.
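+
+The steps that follow create an external table along these lines (a sketch; host and port are placeholders, not from the original):
+
+``` sql
+postgres=# CREATE EXTERNAL TABLE salesinfo_hiveprofile (location text, month text,
+             num_orders int, total_sales float8)
+           LOCATION ('pxf://namenode:51200/default.sales_info?PROFILE=Hive')
+           FORMAT 'CUSTOM' (formatter='pxfwritable_import');
+```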
@@ -230,9 +230,8 @@ Use the PXF `HiveText` profile to create a queryable HAWQ external table from th
     (You can safely ignore the "nonstandard use of escape in a string literal" warning and related messages.)
 
     Notice that:
-
     - The `LOCATION` subclause `DELIMITER` value is specified in hexadecimal format. `\x` is a prefix that instructs PXF to interpret the following characters as hexadecimal. `2c` is the hex value for the comma character.
-    - The `FORMAT` subclause `delimiter` value is specified as the single ascii comma character ','. `E` escapes the character.
+    - The `FORMAT` subclause `delimiter` value is specified as the single ASCII comma character `','`. `E` escapes the character.
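+
+    The statement these bullets describe takes roughly the following shape (a sketch; host, port, and table name are placeholders, not from the original):
+
+    ``` sql
+    postgres=# CREATE EXTERNAL TABLE salesinfo_hivetextprofile (location text, month text,
+                 num_orders int, total_sales float8)
+               LOCATION ('pxf://namenode:51200/default.sales_info?PROFILE=HiveText&DELIMITER=\x2c')
+               FORMAT 'TEXT' (delimiter=E',');
+    ```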
 
 2. Query the external table:
 
@@ -277,7 +276,7 @@ Use the `HiveRC` profile to query RCFile-formatted data in Hive tables. The `Hiv
 
     A copy of the sample data set is now stored in RCFile format in `sales_info_rcfile`. 
     
-3. Perform a Hive query on `sales_info_rcfile` to verify the data was loaded successfully:
+3. Perform a Hive query on `sales_info_rcfile` to verify that the data was loaded successfully:
 
     ``` sql
     hive> SELECT * FROM sales_info_rcfile;
@@ -330,18 +329,19 @@ postgres=# CREATE EXTERNAL TABLE pxf_parquet_table (fname text, lname text, cust
 
 The `HiveRC` and `HiveText` profiles are faster than the generic `Hive` profile.
 
-?? MORE HERE. ??
 
 ## <a id="complex_dt_example"></a>Complex Data Type Example
 
 This example employs the array and map complex types, specifically an array of integers and a map of string key/value pairs.
 
-The example data set includes fields with the following names and data types:
+The data schema for this example includes fields with the following names and data types:
 
-- index - int
-- name - string
-- intarray - array of integers
-- propmap - map of string key and value pairs
+| Field Name  | Data Type |
+|-------|---------------------------|
+| index | int |
+| name | string |
+| intarray | array of integers |
+| propmap | map of string key and value pairs |
 
 When specifying an array field in a Hive table, you must identify the terminator for each item in the collection. Similarly, the map key termination character must also be specified.
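+
+A Hive table definition for this schema might declare the terminators as in the following sketch (consistent with the field, collection, and map-key terminators called out in the steps below):
+
+``` sql
+hive> CREATE TABLE table_complextypes (index int, name string,
+        intarray ARRAY<int>, propmap MAP<string, string>)
+        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
+        COLLECTION ITEMS TERMINATED BY '%'
+        MAP KEYS TERMINATED BY ':'
+        STORED AS TEXTFILE;
+```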
 
@@ -381,7 +381,6 @@ When specifying an array field in a Hive table, you must identify the terminator
     ```
 
     Notice that:
-
     - `FIELDS TERMINATED BY` identifies a comma as the field terminator.
     - The `COLLECTION ITEMS TERMINATED BY` subclause specifies the percent sign as the collection items (array item, map key/value pair) terminator.
     - `MAP KEYS TERMINATED BY` identifies a colon as the terminator for map keys.
@@ -392,7 +391,7 @@ When specifying an array field in a Hive table, you must identify the terminator
     hive> LOAD DATA local INPATH '/tmp/pxf_hive_complex.txt' INTO TABLE table_complextypes;
     ```
 
-5. Perform a query on Hive table `table_complextypes` to verify the data was loaded successfully:
+5. Perform a query on Hive table `table_complextypes` to verify that the data was loaded successfully:
 
     ``` sql
     hive> SELECT * FROM table_complextypes;
@@ -460,7 +459,7 @@ The following diagram depicts how HAWQ integrates with HCatalog to query Hive ta
 3.  PXF queries Hive using table metadata that is stored in the HAWQ in-memory catalog tables. Table metadata is dropped at the end of the transaction.
 
 
-### <a id="topic_j1l_y55_c5"></a>Usage
+### <a id="topic_j1l_enabling"></a>Enabling HCatalog Integration
 
 To enable HCatalog query integration in HAWQ, perform the following steps:
 
@@ -476,7 +475,8 @@ To enable HCatalog query integration in HAWQ, perform the following steps:
     ``` sql
     postgres=# GRANT ALL ON PROTOCOL pxf TO "role";
     ``` 
-    
+
+### <a id="topic_j1l_y55_c5"></a>Usage
 
 To query a Hive table with HCatalog integration, query HCatalog directly from HAWQ. The query syntax is:
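+
+A sketch of that syntax, with `hive-db-name` and `hive-table-name` as placeholders:
+
+``` sql
+postgres=# SELECT * FROM hcatalog.hive-db-name.hive-table-name;
+```
+
+For example, `SELECT * FROM hcatalog.default.sales_info;` queries the `sales_info` table in the Hive `default` database.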
 
@@ -574,10 +574,18 @@ hive> CREATE TABLE sales_part (name string, type string, supplier_key int, price
 Load data into this Hive table and add some partitions:
 
 ``` sql
-hive> INSERT INTO TABLE sales_part PARTITION(delivery_state = 'CALIFORNIA', delivery_city = 'Fresno') VALUES ('block', 'widget', 33, 15.17);
-hive> INSERT INTO TABLE sales_part PARTITION(delivery_state = 'CALIFORNIA', delivery_city = 'Sacramento') VALUES ('cube', 'widget', 11, 1.17);
-hive> INSERT INTO TABLE sales_part PARTITION(delivery_state = 'NEVADA', delivery_city = 'Reno') VALUES ('dowel', 'widget', 51, 31.82);
-hive> INSERT INTO TABLE sales_part PARTITION(delivery_state = 'NEVADA', delivery_city = 'Las Vegas') VALUES ('px49', 'pipe', 52, 99.82);
+hive> INSERT INTO TABLE sales_part 
+        PARTITION(delivery_state = 'CALIFORNIA', delivery_city = 'Fresno') 
+        VALUES ('block', 'widget', 33, 15.17);
+hive> INSERT INTO TABLE sales_part 
+        PARTITION(delivery_state = 'CALIFORNIA', delivery_city = 'Sacramento') 
+        VALUES ('cube', 'widget', 11, 1.17);
+hive> INSERT INTO TABLE sales_part 
+        PARTITION(delivery_state = 'NEVADA', delivery_city = 'Reno') 
+        VALUES ('dowel', 'widget', 51, 31.82);
+hive> INSERT INTO TABLE sales_part 
+        PARTITION(delivery_state = 'NEVADA', delivery_city = 'Las Vegas') 
+        VALUES ('px49', 'pipe', 52, 99.82);
 ```
 
 The Hive storage directory structure for the `sales_part` table appears as follows:
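+
+A sketch of that layout, assuming a hypothetical Hive warehouse root of `/apps/hive/warehouse`; the partition subdirectories follow from the `INSERT` statements above:
+
+``` pre
+/apps/hive/warehouse/sales_part/delivery_state=CALIFORNIA/delivery_city=Fresno/
+/apps/hive/warehouse/sales_part/delivery_state=CALIFORNIA/delivery_city=Sacramento/
+/apps/hive/warehouse/sales_part/delivery_state=NEVADA/delivery_city=Reno/
+/apps/hive/warehouse/sales_part/delivery_state=NEVADA/delivery_city=Las Vegas/
+```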