You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hawq.apache.org by lisakowen <gi...@git.apache.org> on 2016/10/26 22:33:32 UTC

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

GitHub user lisakowen opened a pull request:

    https://github.com/apache/incubator-hawq-docs/pull/39

    HAWQ-1071 - add examples for HiveText and HiveRC plugins

    added examples, restructured content, added hive command line section.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lisakowen/incubator-hawq-docs feature/pxfhive-enhance

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-hawq-docs/pull/39.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #39
    
----
commit 0398a62fefd3627273927f938b4d082a25bf3003
Author: Lisa Owen <lo...@pivotal.io>
Date:   2016-09-26T21:37:04Z

    restructure PXF Hive pulug-in page; add more relevant examples

commit 457d703a3f5c057e241acf985fbc35da34f6a075
Author: Lisa Owen <lo...@pivotal.io>
Date:   2016-09-26T22:40:10Z

    PXF Hive plug-in mods

commit 822d7545e746490e55507866c62dca5ea2d5349a
Author: Lisa Owen <lo...@pivotal.io>
Date:   2016-10-03T22:19:03Z

    clean up some extra whitespace

commit 8c986b60b8db3edd77c10f23704cc9174c52a803
Author: Lisa Owen <lo...@pivotal.io>
Date:   2016-10-11T18:37:34Z

    include list of hive profile names in file format section

commit 150fa67857871d58ea05eb14c023215c932ab7b1
Author: Lisa Owen <lo...@pivotal.io>
Date:   2016-10-11T19:03:39Z

    link to CREATE EXTERNAL TABLE ref page

commit 5cdd8f8c35a51360fe3bfdedeff796bf1e0f31f3
Author: Lisa Owen <lo...@pivotal.io>
Date:   2016-10-11T20:27:17Z

    sql commands all caps

commit 67e8b9699c9eec64d04ce9e6048ffb385f7f3573
Author: Lisa Owen <lo...@pivotal.io>
Date:   2016-10-11T20:33:35Z

    use <> for optional args

commit 54b2c01a80d477cc093d7eb1ed2aa8c0bf762d36
Author: Lisa Owen <lo...@pivotal.io>
Date:   2016-10-22T00:16:24Z

    fix some duplicate ids

commit 284c3ec2db38e8d9020826e3bf292efad76c1819
Author: Lisa Owen <lo...@pivotal.io>
Date:   2016-10-26T15:38:37Z

    restructure to use numbered steps

commit 2a38a0322abda804cfd4fc8aa39f142f0d83ea11
Author: Lisa Owen <lo...@pivotal.io>
Date:   2016-10-26T17:20:28Z

    note/notice

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

Posted by dyozie <gi...@git.apache.org>.

Github user dyozie commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85366470
  
    --- Diff: pxf/HivePXF.html.md.erb ---
    @@ -2,121 +2,450 @@
     title: Accessing Hive Data
     ---
     
    -This topic describes how to access Hive data using PXF. You have several options for querying data stored in Hive. You can create external tables in PXF and then query those tables, or you can easily query Hive tables by using HAWQ and PXF's integration with HCatalog. HAWQ accesses Hive table metadata stored in HCatalog.
    +Apache Hive is a distributed data warehousing infrastructure.  Hive facilitates managing large data sets supporting multiple data formats, including comma-separated value (.csv), RC, ORC, and parquet. The PXF Hive plug-in reads data stored in Hive, as well as HDFS or HBase.
    +
    +This section describes how to use PXF to access Hive data. Options for querying data stored in Hive include:
    +
    +-  Creating an external table in PXF and querying that table
    +-  Querying Hive tables via PXF's integration with HCatalog
     
     ## <a id="installingthepxfhiveplugin"></a>Prerequisites
     
    -Check the following before using PXF to access Hive:
    +Before accessing Hive data with HAWQ and PXF, ensure that:
     
    --   The PXF HDFS plug-in is installed on all cluster nodes.
    +-   The PXF HDFS plug-in is installed on all cluster nodes. See [Installing PXF Plug-ins](InstallPXFPlugins.html) for PXF plug-in installation information.
     -   The PXF Hive plug-in is installed on all cluster nodes.
     -   The Hive JAR files and conf directory�are installed on all cluster nodes.
    --   Test PXF on HDFS before connecting to Hive or HBase.
    +-   You have tested PXF on HDFS.
     -   You are running the Hive Metastore service on a machine in your cluster.�
     -   You have set the `hive.metastore.uris`�property in the�`hive-site.xml` on the NameNode.
     
    +## <a id="topic_p2s_lvl_25"></a>Hive File Formats
    +
    +Hive supports several file formats:
    +
    +-   TextFile - flat file with data in comma-, tab-, or space-separated value format or JSON notation
    +-   SequenceFile - flat file consisting of binary key/value pairs
    +-   RCFile - record columnar data consisting of binary key/value pairs; high row compression rate
    +-   ORCFile - optimized row columnar data with stripe, footer, and postscript sections; reduces data size
    +-   Parquet - compressed columnar data representation
    +-   Avro - JSON-defined, schema-based data serialization format
    +
    +Refer to [File Formats](https://cwiki.apache.org/confluence/display/Hive/FileFormats) for detailed information about the file formats supported by Hive.
    +
    +The PXF Hive plug-in supports the following profiles for accessing the Hive file formats listed above. These include:
    +
    +- `Hive`
    +- `HiveText`
    +- `HiveRC`
    +
    +## <a id="topic_p2s_lvl_29"></a>Data Type Mapping
    +
    +### <a id="hive_primdatatypes"></a>Primitive Data Types
    +
    +To represent Hive data in HAWQ, map data values that use a primitive data type to HAWQ columns of the same type.
    +
    +The following table summarizes external mapping rules for Hive primitive types.
    +
    +| Hive Data Type  | Hawq Data Type |
    +|-------|---------------------------|
    +| boolean    | bool |
    +| int   | int4 |
    +| smallint   | int2 |
    +| tinyint   | int2 |
    +| bigint   | int8 |
    +| decimal  |  numeric  |
    +| float   | float4 |
    +| double   | float8 |
    +| string   | text |
    +| binary   | bytea |
    +| char   | bpchar |
    +| varchar   | varchar |
    +| timestamp   | timestamp |
    +| date   | date |
    +
    +
    +### <a id="topic_b4v_g3n_25"></a>Complex Data Types
    +
    +Hive supports complex data types including array, struct, map, and union. PXF maps each of these complex types to `text`.  While HAWQ does not natively support these types, you can create HAWQ functions or application code to extract subcomponents of these complex data types.
    +
    +An example using complex data types is provided later in this topic.
    +
    +
    +## <a id="hive_sampledataset"></a>Sample Data Set
    +
    +Examples used in this topic will operate on a common data set. This simple data set models a retail sales operation and includes fields with the following names and data types:
    +
    +- location - text
    +- month - text
    +- number\_of\_orders - integer
    +- total\_sales - double
    +
    +Prepare the sample data set for use:
    +
    +1. First, create a text file:
    +
    +    ```
    +    $ vi /tmp/pxf_hive_datafile.txt
    +    ```
    +
    +2. Add the following data to `pxf_hive_datafile.txt`; notice the use of the comma `,` to separate the four field values:
    +
    +    ```
    +    Prague,Jan,101,4875.33
    +    Rome,Mar,87,1557.39
    +    Bangalore,May,317,8936.99
    +    Beijing,Jul,411,11600.67
    +    San Francisco,Sept,156,6846.34
    +    Paris,Nov,159,7134.56
    +    San Francisco,Jan,113,5397.89
    +    Prague,Dec,333,9894.77
    +    Bangalore,Jul,271,8320.55
    +    Beijing,Dec,100,4248.41
    +    ```
    +
    +Make note of the path to `pxf_hive_datafile.txt`; you will use it in later exercises.
    +
    +
     ## <a id="hivecommandline"></a>Hive Command Line
     
    -To start the Hive command line and work directly on a Hive table:
    +The Hive command line is a subsystem similar to that of `psql`. To start the Hive command line:
     
     ``` shell
    -$ hive
    +$ HADOOP_USER_NAME=hdfs hive
     ```
     
    -Here is an example�of how to create and load data into�a sample Hive table from an existing file.
    +The default Hive database is named `default`. 
     
    -``` sql
    -hive> CREATE TABLE test (name string, type string, supplier_key int, full_price double) row format delimited fields terminated by ',';
    -hive> LOAD DATA local inpath '/local/path/data.txt' into table test;
    -```
    +### <a id="hivecommandline_createdb"></a>Example: Create a Hive Database
     
    -## <a id="topic_p2s_lvl_25"></a>Using PXF Tables to Query Hive
    +Create a Hive table to expose our sample data set.
     
    -Hive tables are defined in a specific way in PXF, regardless of the underlying file storage format. The PXF Hive plug-ins automatically detect source tables in the following formats:
    +1. Create a Hive table named `sales_info` in the `default` database:
     
    --   Text based
    --   SequenceFile
    --   RCFile
    --   ORCFile
    --   Parquet
    --   Avro
    +    ``` sql
    +    hive> CREATE TABLE sales_info (location string, month string,
    +            number_of_orders int, total_sales double)
    +            ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    +            STORED AS textfile;
    +    ```
     
    -The source table can also be a combination of these types. The PXF Hive plug-in�uses this information to query the data in runtime.
    +    Notice that:
     
    --   **[Syntax Example](../pxf/HivePXF.html#syntax2)**
    +    - The `STORED AS textfile` subclause instructs Hive to create the table in Textfile (the default) format.  Hive Textfile format supports comma-, tab-, and space-separated values, as well as data specified in JSON notation.
    --- End diff --
    
    There's a markdown formatting issue with this bullet list on the review site - either needs more indenting or correct spacing between the list and the previous para.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

Posted by dyozie <gi...@git.apache.org>.

Github user dyozie commented on a diff in the pull request:

https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85365540

--- Diff: pxf/HivePXF.html.md.erb ---
@@ -2,121 +2,450 @@
title: Accessing Hive Data
---

-This topic describes how to access Hive data using PXF. You have several options for querying data stored in Hive. You can create external tables in PXF and then query those tables, or you can easily query Hive tables by using HAWQ and PXF's integration with HCatalog. HAWQ accesses Hive table metadata stored in HCatalog.
+Apache Hive is a distributed data warehousing infrastructure. Hive facilitates managing large data sets supporting multiple data formats, including comma-separated value (.csv), RC, ORC, and parquet. The PXF Hive plug-in reads data stored in Hive, as well as HDFS or HBase.
+
+This section describes how to use PXF to access Hive data. Options for querying data stored in Hive include:
+
+- Creating an external table in PXF and querying that table
+- Querying Hive tables via PXF's integration with HCatalog

## <a id="installingthepxfhiveplugin"></a>Prerequisites

-Check the following before using PXF to access Hive:
+Before accessing Hive data with HAWQ and PXF, ensure that:

-- The PXF HDFS plug-in is installed on all cluster nodes.
+- The PXF HDFS plug-in is installed on all cluster nodes. See [Installing PXF Plug-ins](InstallPXFPlugins.html) for PXF plug-in installation information.
- The PXF Hive plug-in is installed on all cluster nodes.
- The Hive JAR files and conf directory�are installed on all cluster nodes.
-- Test PXF on HDFS before connecting to Hive or HBase.
+- You have tested PXF on HDFS.
- You are running the Hive Metastore service on a machine in your cluster.�
- You have set the `hive.metastore.uris`�property in the�`hive-site.xml` on the NameNode.

+## <a id="topic_p2s_lvl_25"></a>Hive File Formats
+
+Hive supports several file formats:
+
+- TextFile - flat file with data in comma-, tab-, or space-separated value format or JSON notation
+- SequenceFile - flat file consisting of binary key/value pairs
+- RCFile - record columnar data consisting of binary key/value pairs; high row compression rate
+- ORCFile - optimized row columnar data with stripe, footer, and postscript sections; reduces data size
+- Parquet - compressed columnar data representation
+- Avro - JSON-defined, schema-based data serialization format
--- End diff --

Just a suggestion, but I think this would read better as a 2-column term/definition table. You could even make it a 3-column table to describe which PXF plug-ins are used with each format.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

Posted by dyozie <gi...@git.apache.org>.

Github user dyozie commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85367789
  
    --- Diff: pxf/HivePXF.html.md.erb ---
    @@ -2,121 +2,450 @@
     title: Accessing Hive Data
     ---
     
    -This topic describes how to access Hive data using PXF. You have several options for querying data stored in Hive. You can create external tables in PXF and then query those tables, or you can easily query Hive tables by using HAWQ and PXF's integration with HCatalog. HAWQ accesses Hive table metadata stored in HCatalog.
    +Apache Hive is a distributed data warehousing infrastructure.  Hive facilitates managing large data sets supporting multiple data formats, including comma-separated value (.csv), RC, ORC, and parquet. The PXF Hive plug-in reads data stored in Hive, as well as HDFS or HBase.
    +
    +This section describes how to use PXF to access Hive data. Options for querying data stored in Hive include:
    +
    +-  Creating an external table in PXF and querying that table
    +-  Querying Hive tables via PXF's integration with HCatalog
     
     ## <a id="installingthepxfhiveplugin"></a>Prerequisites
     
    -Check the following before using PXF to access Hive:
    +Before accessing Hive data with HAWQ and PXF, ensure that:
     
    --   The PXF HDFS plug-in is installed on all cluster nodes.
    +-   The PXF HDFS plug-in is installed on all cluster nodes. See [Installing PXF Plug-ins](InstallPXFPlugins.html) for PXF plug-in installation information.
     -   The PXF Hive plug-in is installed on all cluster nodes.
     -   The Hive JAR files and conf directory�are installed on all cluster nodes.
    --   Test PXF on HDFS before connecting to Hive or HBase.
    +-   You have tested PXF on HDFS.
     -   You are running the Hive Metastore service on a machine in your cluster.�
     -   You have set the `hive.metastore.uris`�property in the�`hive-site.xml` on the NameNode.
     
    +## <a id="topic_p2s_lvl_25"></a>Hive File Formats
    +
    +Hive supports several file formats:
    +
    +-   TextFile - flat file with data in comma-, tab-, or space-separated value format or JSON notation
    +-   SequenceFile - flat file consisting of binary key/value pairs
    +-   RCFile - record columnar data consisting of binary key/value pairs; high row compression rate
    +-   ORCFile - optimized row columnar data with stripe, footer, and postscript sections; reduces data size
    +-   Parquet - compressed columnar data representation
    +-   Avro - JSON-defined, schema-based data serialization format
    +
    +Refer to [File Formats](https://cwiki.apache.org/confluence/display/Hive/FileFormats) for detailed information about the file formats supported by Hive.
    +
    +The PXF Hive plug-in supports the following profiles for accessing the Hive file formats listed above. These include:
    +
    +- `Hive`
    +- `HiveText`
    +- `HiveRC`
    +
    +## <a id="topic_p2s_lvl_29"></a>Data Type Mapping
    +
    +### <a id="hive_primdatatypes"></a>Primitive Data Types
    +
    +To represent Hive data in HAWQ, map data values that use a primitive data type to HAWQ columns of the same type.
    +
    +The following table summarizes external mapping rules for Hive primitive types.
    +
    +| Hive Data Type  | Hawq Data Type |
    +|-------|---------------------------|
    +| boolean    | bool |
    +| int   | int4 |
    +| smallint   | int2 |
    +| tinyint   | int2 |
    +| bigint   | int8 |
    +| decimal  |  numeric  |
    +| float   | float4 |
    +| double   | float8 |
    +| string   | text |
    +| binary   | bytea |
    +| char   | bpchar |
    +| varchar   | varchar |
    +| timestamp   | timestamp |
    +| date   | date |
    +
    +
    +### <a id="topic_b4v_g3n_25"></a>Complex Data Types
    +
    +Hive supports complex data types including array, struct, map, and union. PXF maps each of these complex types to `text`.  While HAWQ does not natively support these types, you can create HAWQ functions or application code to extract subcomponents of these complex data types.
    +
    +An example using complex data types is provided later in this topic.
    +
    +
    +## <a id="hive_sampledataset"></a>Sample Data Set
    +
    +Examples used in this topic will operate on a common data set. This simple data set models a retail sales operation and includes fields with the following names and data types:
    +
    +- location - text
    +- month - text
    +- number\_of\_orders - integer
    +- total\_sales - double
    +
    +Prepare the sample data set for use:
    +
    +1. First, create a text file:
    +
    +    ```
    +    $ vi /tmp/pxf_hive_datafile.txt
    +    ```
    +
    +2. Add the following data to `pxf_hive_datafile.txt`; notice the use of the comma `,` to separate the four field values:
    +
    +    ```
    +    Prague,Jan,101,4875.33
    +    Rome,Mar,87,1557.39
    +    Bangalore,May,317,8936.99
    +    Beijing,Jul,411,11600.67
    +    San Francisco,Sept,156,6846.34
    +    Paris,Nov,159,7134.56
    +    San Francisco,Jan,113,5397.89
    +    Prague,Dec,333,9894.77
    +    Bangalore,Jul,271,8320.55
    +    Beijing,Dec,100,4248.41
    +    ```
    +
    +Make note of the path to `pxf_hive_datafile.txt`; you will use it in later exercises.
    +
    +
     ## <a id="hivecommandline"></a>Hive Command Line
     
    -To start the Hive command line and work directly on a Hive table:
    +The Hive command line is a subsystem similar to that of `psql`. To start the Hive command line:
     
     ``` shell
    -$ hive
    +$ HADOOP_USER_NAME=hdfs hive
     ```
     
    -Here is an example�of how to create and load data into�a sample Hive table from an existing file.
    +The default Hive database is named `default`. 
     
    -``` sql
    -hive> CREATE TABLE test (name string, type string, supplier_key int, full_price double) row format delimited fields terminated by ',';
    -hive> LOAD DATA local inpath '/local/path/data.txt' into table test;
    -```
    +### <a id="hivecommandline_createdb"></a>Example: Create a Hive Database
     
    -## <a id="topic_p2s_lvl_25"></a>Using PXF Tables to Query Hive
    +Create a Hive table to expose our sample data set.
     
    -Hive tables are defined in a specific way in PXF, regardless of the underlying file storage format. The PXF Hive plug-ins automatically detect source tables in the following formats:
    +1. Create a Hive table named `sales_info` in the `default` database:
     
    --   Text based
    --   SequenceFile
    --   RCFile
    --   ORCFile
    --   Parquet
    --   Avro
    +    ``` sql
    +    hive> CREATE TABLE sales_info (location string, month string,
    +            number_of_orders int, total_sales double)
    +            ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    +            STORED AS textfile;
    +    ```
     
    -The source table can also be a combination of these types. The PXF Hive plug-in�uses this information to query the data in runtime.
    +    Notice that:
     
    --   **[Syntax Example](../pxf/HivePXF.html#syntax2)**
    +    - The `STORED AS textfile` subclause instructs Hive to create the table in Textfile (the default) format.  Hive Textfile format supports comma-, tab-, and space-separated values, as well as data specified in JSON notation.
    +    - The `DELIMITED FIELDS TERMINATED BY` subclause identifies the field delimiter within a data record (line). The `sales_info` table field delimiter is a comma (`,`).
     
    --   **[Hive Complex Types](../pxf/HivePXF.html#topic_b4v_g3n_25)**
    +2. Load the `pxf_hive_datafile.txt` sample data file into the `sales_info` table you just created:
     
    -### <a id="syntax2"></a>Syntax Example
    +    ``` sql
    +    hive> LOAD DATA local INPATH '/tmp/pxf_hive_datafile.txt'
    +            INTO TABLE sales_info;
    +    ```
    +
    +3. Perform a query on `sales_info` to verify the data was loaded successfully:
    +
    +    ``` sql
    +    hive> SELECT * FROM sales_info;
    +    ```
     
    -The following�PXF table definition is valid for any Hive file storage type.
    +In examples later in this section, you will access the `sales_info` Hive table directly via PXF. You will also insert `sales_info` data into tables of other Hive file format types, and use PXF to access those directly as well.
    +
    +## <a id="topic_p2s_lvl_28"></a>Querying External Hive Data
    +
    +The PXF Hive plug-in supports several Hive-related profiles. These include `Hive`, `HiveText`, and `HiveRC`.
    +
    +Use the following syntax to create a HAWQ external table representing Hive data:
     
     ``` sql
    -CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name 
    -    ( column_name data_type [, ...] | LIKE other_table )
    -LOCATION ('pxf://namenode[:port]/hive-db-name.hive-table-name?<pxf parameters>[&custom-option=value...]')FORMAT 'CUSTOM' (formatter='pxfwritable_import')
    +CREATE EXTERNAL TABLE <table_name>
    +    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
    +LOCATION ('pxf://<host>[:<port>]/<hive-db-name>.<hive-table-name>
    +    ?PROFILE=Hive|HiveText|HiveRC[&DELIMITER=<delim>'])
    +FORMAT 'CUSTOM|TEXT' (formatter='pxfwritable_import' | delimiter='<delim>')
     ```
     
    -where `<pxf parameters>` is:
    +Hive-plug-in-specific keywords and values used in the [CREATE EXTERNAL TABLE](../reference/sql/CREATE-EXTERNAL-TABLE.html) call are described below.
     
    -``` pre
    -   FRAGMENTER=fragmenter_class&ACCESSOR=accessor_class&RESOLVER=resolver_class]
    - | PROFILE=profile-name
    -```
    +| Keyword  | Value |
    +|-------|-------------------------------------|
    +| \<host\>[:<port\>]    | The HDFS NameNode and port. |
    +| \<hive-db-name\>    | Name of the Hive database. If omitted, defaults to the Hive database named `default`. |
    +| \<hive-table-name\>    | Name of the Hive table. |
    +| PROFILE    | The `PROFILE` keyword must specify one of the values `Hive`, `HiveText`, or `HiveRC`. |
    +| DELIMITER    | The `DELIMITER` clause is required for both the `HiveText` and `HiveRC` profiles and identifies the field delimiter used in the Hive data set.  \<delim\> must be a single ascii character or specified in hexadecimal representation. |
    +| FORMAT (`Hive` profile)   | The `FORMAT` clause must specify `CUSTOM`. The `CUSTOM` format supports only the built-in `pxfwritable_import` `formatter`.   |
    +| FORMAT (`HiveText` and `HiveRC` profiles) | The `FORMAT` clause must specify `TEXT`. The `delimiter` must be specified a second time in '\<delim\>'. |
     
     
    -If `hive-db-name` is omitted, pxf will default to the Hive `default` database.
    +## <a id="profile_hive"></a>Hive Profile
     
    -**Note:** The port is the connection port for the PXF service. If the port is omitted, PXF assumes that High Availability (HA) is enabled and connects to the HA name service port, 51200 by default. The HA name service port can be changed by setting the pxf\_service\_port configuration parameter.
    +The `Hive` profile works with any Hive file format.
     
    -PXF has three built-in profiles for Hive tables:
    +### <a id="profile_hive_using"></a>Example: Using the Hive Profile
     
    --   Hive
    --   HiveRC
    --   HiveText
    +Use the `Hive` profile to create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier.
     
    -The Hive profile works with any Hive storage type. 
    -The following example creates a readable HAWQ external table representing a Hive table named `accessories` in the `inventory` Hive database using the PXF Hive profile:
    +1. Create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier:
     
    -``` shell
    -$ psql -d postgres
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hiveprofile(location text, month text, num_orders int, total_sales float8)
    +                LOCATION ('pxf://namenode:51200/default.sales_info?PROFILE=Hive')
    +              FORMAT 'custom' (formatter='pxfwritable_import');
    +    ```
    +
    +2. Query the table:
    +
    +    ``` sql
    +    postgres=# SELECT * FROM salesinfo_hiveprofile;
    +    ```
    +
    +    ``` shell
    +       location    | month | num_orders | total_sales
    +    ---------------+-------+------------+-------------
    +     Prague        | Jan   |        101 |     4875.33
    +     Rome          | Mar   |         87 |     1557.39
    +     Bangalore     | May   |        317 |     8936.99
    +     ...
    +
    +    ```
    +
    +## <a id="profile_hivetext"></a>HiveText Profile
    +
    +Use the `HiveText` profile to query text formats. The `HiveText` profile is more performant than the `Hive` profile.
    --- End diff --
    
    It seems like maybe we just need a general statement saying that the Hive profile works with all file formats, but that the more specific profiles perform better for the specific file types?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

Posted by lisakowen <gi...@git.apache.org>.

Github user lisakowen commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85424776
  
    --- Diff: pxf/HivePXF.html.md.erb ---
    @@ -339,21 +601,21 @@ postgres=# CREATE EXTERNAL TABLE pxf_sales_part(
       delivery_state TEXT, 
       delivery_city TEXT
     )
    -LOCATION ('pxf://namenode_host:51200/sales_part?Profile=Hive')
    +LOCATION ('pxf://namenode:51200/sales_part?Profile=Hive')
     FORMAT 'custom' (FORMATTER='pxfwritable_import');
     
     postgres=# SELECT * FROM pxf_sales_part;
     ```
     
    -### <a id="example3"></a>Example
    +### <a id="example3"></a>Query Without Pushdown
     
     In the following example, the HAWQ query filters the `delivery_city` partition `Sacramento`. The filter on� `item_name` is not pushed down, since it is not a partition column. It is performed on the HAWQ side after all the data on `Sacramento` is transferred for processing.
     
     ``` sql
    -postgres=# SELECT * FROM pxf_sales_part WHERE delivery_city = 'Sacramento' AND item_name = 'shirt';
    +postgres=# SELECT * FROM pxf_sales_part WHERE delivery_city = 'Sacramento' AND item_name = 'cube';
     ```
     
    -### <a id="example4"></a>Example
    +### <a id="example4"></a>Query With Pushdown
    --- End diff --
    
    will also need to add this GUC to the documentation. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

Posted by lisakowen <gi...@git.apache.org>.

Github user lisakowen commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85407700
  
    --- Diff: pxf/HivePXF.html.md.erb ---
    @@ -2,121 +2,450 @@
     title: Accessing Hive Data
     ---
     
    -This topic describes how to access Hive data using PXF. You have several options for querying data stored in Hive. You can create external tables in PXF and then query those tables, or you can easily query Hive tables by using HAWQ and PXF's integration with HCatalog. HAWQ accesses Hive table metadata stored in HCatalog.
    +Apache Hive is a distributed data warehousing infrastructure.  Hive facilitates managing large data sets supporting multiple data formats, including comma-separated value (.csv), RC, ORC, and parquet. The PXF Hive plug-in reads data stored in Hive, as well as HDFS or HBase.
    +
    +This section describes how to use PXF to access Hive data. Options for querying data stored in Hive include:
    +
    +-  Creating an external table in PXF and querying that table
    +-  Querying Hive tables via PXF's integration with HCatalog
     
     ## <a id="installingthepxfhiveplugin"></a>Prerequisites
     
    -Check the following before using PXF to access Hive:
    +Before accessing Hive data with HAWQ and PXF, ensure that:
     
    --   The PXF HDFS plug-in is installed on all cluster nodes.
    +-   The PXF HDFS plug-in is installed on all cluster nodes. See [Installing PXF Plug-ins](InstallPXFPlugins.html) for PXF plug-in installation information.
     -   The PXF Hive plug-in is installed on all cluster nodes.
     -   The Hive JAR files and conf directory�are installed on all cluster nodes.
    --   Test PXF on HDFS before connecting to Hive or HBase.
    +-   You have tested PXF on HDFS.
     -   You are running the Hive Metastore service on a machine in your cluster.�
     -   You have set the `hive.metastore.uris`�property in the�`hive-site.xml` on the NameNode.
     
    +## <a id="topic_p2s_lvl_25"></a>Hive File Formats
    +
    +Hive supports several file formats:
    +
    +-   TextFile - flat file with data in comma-, tab-, or space-separated value format or JSON notation
    +-   SequenceFile - flat file consisting of binary key/value pairs
    +-   RCFile - record columnar data consisting of binary key/value pairs; high row compression rate
    +-   ORCFile - optimized row columnar data with stripe, footer, and postscript sections; reduces data size
    +-   Parquet - compressed columnar data representation
    +-   Avro - JSON-defined, schema-based data serialization format
    +
    +Refer to [File Formats](https://cwiki.apache.org/confluence/display/Hive/FileFormats) for detailed information about the file formats supported by Hive.
    +
    +The PXF Hive plug-in supports the following profiles for accessing the Hive file formats listed above. These include:
    +
    +- `Hive`
    +- `HiveText`
    +- `HiveRC`
    +
    +## <a id="topic_p2s_lvl_29"></a>Data Type Mapping
    +
    +### <a id="hive_primdatatypes"></a>Primitive Data Types
    +
    +To represent Hive data in HAWQ, map data values that use a primitive data type to HAWQ columns of the same type.
    +
    +The following table summarizes external mapping rules for Hive primitive types.
    +
    +| Hive Data Type  | Hawq Data Type |
    +|-------|---------------------------|
    +| boolean    | bool |
    +| int   | int4 |
    +| smallint   | int2 |
    +| tinyint   | int2 |
    +| bigint   | int8 |
    +| decimal  |  numeric  |
    +| float   | float4 |
    +| double   | float8 |
    +| string   | text |
    +| binary   | bytea |
    +| char   | bpchar |
    +| varchar   | varchar |
    +| timestamp   | timestamp |
    +| date   | date |
    +
    +
    +### <a id="topic_b4v_g3n_25"></a>Complex Data Types
    +
    +Hive supports complex data types including array, struct, map, and union. PXF maps each of these complex types to `text`.  While HAWQ does not natively support these types, you can create HAWQ functions or application code to extract subcomponents of these complex data types.
    +
    +An example using complex data types is provided later in this topic.
    +
    +
    +## <a id="hive_sampledataset"></a>Sample Data Set
    +
    +Examples used in this topic will operate on a common data set. This simple data set models a retail sales operation and includes fields with the following names and data types:
    +
    +- location - text
    +- month - text
    +- number\_of\_orders - integer
    +- total\_sales - double
    +
    +Prepare the sample data set for use:
    +
    +1. First, create a text file:
    +
    +    ```
    +    $ vi /tmp/pxf_hive_datafile.txt
    +    ```
    +
    +2. Add the following data to `pxf_hive_datafile.txt`; notice the use of the comma `,` to separate the four field values:
    +
    +    ```
    +    Prague,Jan,101,4875.33
    +    Rome,Mar,87,1557.39
    +    Bangalore,May,317,8936.99
    +    Beijing,Jul,411,11600.67
    +    San Francisco,Sept,156,6846.34
    +    Paris,Nov,159,7134.56
    +    San Francisco,Jan,113,5397.89
    +    Prague,Dec,333,9894.77
    +    Bangalore,Jul,271,8320.55
    +    Beijing,Dec,100,4248.41
    +    ```
    +
    +Make note of the path to `pxf_hive_datafile.txt`; you will use it in later exercises.
    +
    +
     ## <a id="hivecommandline"></a>Hive Command Line
     
    -To start the Hive command line and work directly on a Hive table:
    +The Hive command line is a subsystem similar to that of `psql`. To start the Hive command line:
     
     ``` shell
    -$ hive
    +$ HADOOP_USER_NAME=hdfs hive
     ```
     
    -Here is an example�of how to create and load data into�a sample Hive table from an existing file.
    +The default Hive database is named `default`. 
     
    -``` sql
    -hive> CREATE TABLE test (name string, type string, supplier_key int, full_price double) row format delimited fields terminated by ',';
    -hive> LOAD DATA local inpath '/local/path/data.txt' into table test;
    -```
    +### <a id="hivecommandline_createdb"></a>Example: Create a Hive Database
     
    -## <a id="topic_p2s_lvl_25"></a>Using PXF Tables to Query Hive
    +Create a Hive table to expose our sample data set.
     
    -Hive tables are defined in a specific way in PXF, regardless of the underlying file storage format. The PXF Hive plug-ins automatically detect source tables in the following formats:
    +1. Create a Hive table named `sales_info` in the `default` database:
     
    --   Text based
    --   SequenceFile
    --   RCFile
    --   ORCFile
    --   Parquet
    --   Avro
    +    ``` sql
    +    hive> CREATE TABLE sales_info (location string, month string,
    +            number_of_orders int, total_sales double)
    +            ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    +            STORED AS textfile;
    +    ```
     
    -The source table can also be a combination of these types. The PXF Hive plug-in�uses this information to query the data in runtime.
    +    Notice that:
     
    --   **[Syntax Example](../pxf/HivePXF.html#syntax2)**
    +    - The `STORED AS textfile` subclause instructs Hive to create the table in Textfile (the default) format.  Hive Textfile format supports comma-, tab-, and space-separated values, as well as data specified in JSON notation.
    +    - The `DELIMITED FIELDS TERMINATED BY` subclause identifies the field delimiter within a data record (line). The `sales_info` table field delimiter is a comma (`,`).
     
    --   **[Hive Complex Types](../pxf/HivePXF.html#topic_b4v_g3n_25)**
    +2. Load the `pxf_hive_datafile.txt` sample data file into the `sales_info` table you just created:
     
    -### <a id="syntax2"></a>Syntax Example
    +    ``` sql
    +    hive> LOAD DATA local INPATH '/tmp/pxf_hive_datafile.txt'
    +            INTO TABLE sales_info;
    +    ```
    +
    +3. Perform a query on `sales_info` to verify the data was loaded successfully:
    +
    +    ``` sql
    +    hive> SELECT * FROM sales_info;
    +    ```
     
    -The following�PXF table definition is valid for any Hive file storage type.
    +In examples later in this section, you will access the `sales_info` Hive table directly via PXF. You will also insert `sales_info` data into tables of other Hive file format types, and use PXF to access those directly as well.
    +
    +## <a id="topic_p2s_lvl_28"></a>Querying External Hive Data
    +
    +The PXF Hive plug-in supports several Hive-related profiles. These include `Hive`, `HiveText`, and `HiveRC`.
    +
    +Use the following syntax to create a HAWQ external table representing Hive data:
     
     ``` sql
    -CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name 
    -    ( column_name data_type [, ...] | LIKE other_table )
    -LOCATION ('pxf://namenode[:port]/hive-db-name.hive-table-name?<pxf parameters>[&custom-option=value...]')FORMAT 'CUSTOM' (formatter='pxfwritable_import')
    +CREATE EXTERNAL TABLE <table_name>
    +    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
    +LOCATION ('pxf://<host>[:<port>]/<hive-db-name>.<hive-table-name>
    +    ?PROFILE=Hive|HiveText|HiveRC[&DELIMITER=<delim>'])
    +FORMAT 'CUSTOM|TEXT' (formatter='pxfwritable_import' | delimiter='<delim>')
     ```
     
    -where `<pxf parameters>` is:
    +Hive-plug-in-specific keywords and values used in the [CREATE EXTERNAL TABLE](../reference/sql/CREATE-EXTERNAL-TABLE.html) call are described below.
     
    -``` pre
    -   FRAGMENTER=fragmenter_class&ACCESSOR=accessor_class&RESOLVER=resolver_class]
    - | PROFILE=profile-name
    -```
    +| Keyword  | Value |
    +|-------|-------------------------------------|
    +| \<host\>[:<port\>]    | The HDFS NameNode and port. |
    +| \<hive-db-name\>    | Name of the Hive database. If omitted, defaults to the Hive database named `default`. |
    +| \<hive-table-name\>    | Name of the Hive table. |
    +| PROFILE    | The `PROFILE` keyword must specify one of the values `Hive`, `HiveText`, or `HiveRC`. |
    +| DELIMITER    | The `DELIMITER` clause is required for both the `HiveText` and `HiveRC` profiles and identifies the field delimiter used in the Hive data set.  \<delim\> must be a single ascii character or specified in hexadecimal representation. |
    +| FORMAT (`Hive` profile)   | The `FORMAT` clause must specify `CUSTOM`. The `CUSTOM` format supports only the built-in `pxfwritable_import` `formatter`.   |
    +| FORMAT (`HiveText` and `HiveRC` profiles) | The `FORMAT` clause must specify `TEXT`. The `delimiter` must be specified a second time in '\<delim\>'. |
     
     
    -If `hive-db-name` is omitted, pxf will default to the Hive `default` database.
    +## <a id="profile_hive"></a>Hive Profile
     
    -**Note:** The port is the connection port for the PXF service. If the port is omitted, PXF assumes that High Availability (HA) is enabled and connects to the HA name service port, 51200 by default. The HA name service port can be changed by setting the pxf\_service\_port configuration parameter.
    +The `Hive` profile works with any Hive file format.
     
    -PXF has three built-in profiles for Hive tables:
    +### <a id="profile_hive_using"></a>Example: Using the Hive Profile
     
    --   Hive
    --   HiveRC
    --   HiveText
    +Use the `Hive` profile to create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier.
     
    -The Hive profile works with any Hive storage type. 
    -The following example creates a readable HAWQ external table representing a Hive table named `accessories` in the `inventory` Hive database using the PXF Hive profile:
    +1. Create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier:
     
    -``` shell
    -$ psql -d postgres
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hiveprofile(location text, month text, num_orders int, total_sales float8)
    +                LOCATION ('pxf://namenode:51200/default.sales_info?PROFILE=Hive')
    +              FORMAT 'custom' (formatter='pxfwritable_import');
    +    ```
    +
    +2. Query the table:
    +
    +    ``` sql
    +    postgres=# SELECT * FROM salesinfo_hiveprofile;
    +    ```
    +
    +    ``` shell
    +       location    | month | num_orders | total_sales
    +    ---------------+-------+------------+-------------
    +     Prague        | Jan   |        101 |     4875.33
    +     Rome          | Mar   |         87 |     1557.39
    +     Bangalore     | May   |        317 |     8936.99
    +     ...
    +
    +    ```
    +
    +## <a id="profile_hivetext"></a>HiveText Profile
    +
    +Use the `HiveText` profile to query text formats. The `HiveText` profile is more performant than the `Hive` profile.
    +
    +**Note**: When using the `HiveText` profile, you *must* specify a delimiter option in *both* the `LOCATION` and `FORMAT` clauses.
    +
    +### <a id="profile_hivetext_using"></a>Example: Using the HiveText Profile
    +
    +Use the PXF `HiveText` profile to create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier.
    +
    +1. Create the external table:
    +
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hivetextprofile(location text, month text, num_orders int, total_sales float8)
    +                 LOCATION ('pxf://namenode:51200/default.sales_info?PROFILE=HiveText&DELIMITER=\x2c')
    +               FORMAT 'TEXT' (delimiter=E',');
    +    ```
    +
    +    (You can safely ignore the "nonstandard use of escape in a string literal" warning and related messages.)
    +
    +    Notice that:
    +
    +    - The `LOCATION` subclause `DELIMITER` value is specified in hexadecimal format. `\x` is a prefix that instructs PXF to interpret the following characters as hexadecimal. `2c` is the hex value for the comma character.
    +    - The `FORMAT` subclause `delimiter` value is specified as the single ascii comma character ','. `E` escapes the character.
    +
    +2. Query the external table:
    +
    +    ``` sql
    +    postgres=# SELECT * FROM salesinfo_hivetextprofile where location="Beijing";
    +    ```
    +
    +    ``` shell
    +     location | month | num_orders | total_sales
    +    ----------+-------+------------+-------------
    +     Beijing  | Jul   |        411 |    11600.67
    +     Beijing  | Dec   |        100 |     4248.41
    +    (2 rows)
    +    ```
    +
    +## <a id="profile_hiverc"></a>HiveRC Profile
    +
    +The RCFile Hive format is used for row columnar formatted data. The `HiveRC` profile provides access to RCFile data.
    +
    +### <a id="profile_hiverc_rcfiletbl_using"></a>Example: Using the HiveRC Profile
    +
    +Use the `HiveRC` profile to query RCFile-formatted data in Hive tables. The `HiveRC` profile is more performant than the `Hive` profile for this file format type.
    +
    +1. Create a Hive table with RCFile format:
    +
    +    ``` shell
    +    $ HADOOP_USER_NAME=hdfs hive
    +    ```
    +
    +    ``` sql
    +    hive> CREATE TABLE sales_info_rcfile (location string, month string,
    +            number_of_orders int, total_sales double)
    +          ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    +          STORED AS rcfile;
    +    ```
    +
    +2. Insert the data from the `sales_info` table into `sales_info_rcfile`:
    +
    +    ``` sql
    +    hive> INSERT INTO TABLE sales_info_rcfile SELECT * FROM sales_info;
    +    ```
    +
    +    A copy of the sample data set is now stored in RCFile format in `sales_info_rcfile`. 
    +    
    +3. Perform a Hive query on `sales_info_rcfile` to verify the data was loaded successfully:
    +
    +    ``` sql
    +    hive> SELECT * FROM sales_info_rcfile;
    +    ```
    +
    +4. Use the PXF `HiveRC` profile to create a queryable HAWQ external table from the Hive `sales_info_rcfile` table created in the previous step. When using the `HiveRC` profile, you **must** specify a delimiter option in *both* the `LOCATION` and `FORMAT` clauses.:
    +
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hivercprofile(location text, month text, num_orders int, total_sales float8)
    +                 LOCATION ('pxf://namenode:51200/default.sales_info_rcfile?PROFILE=HiveRC&DELIMITER=\x2c')
    +               FORMAT 'TEXT' (delimiter=E',');
    +    ```
    +
    +    (Again, you can safely ignore the "nonstandard use of escape in a string literal" warning and related messages.)
    +
    +5. Query the external table:
    +
    +    ``` sql
    +    postgres=# SELECT location, total_sales FROM salesinfo_hivercprofile;
    +    ```
    +
    +    ``` shell
    +       location    | total_sales
    +    ---------------+-------------
    +     Prague        |     4875.33
    +     Rome          |     1557.39
    +     Bangalore     |     8936.99
    +     Beijing       |    11600.67
    +     ...
    +    ```
    +
    +## <a id="topic_dbb_nz3_ts"></a>Accessing Parquet-Format Hive Tables
    +
    +The PXF `Hive` profile supports both non-partitioned and partitioned Hive tables that use the Parquet storage format in HDFS. Simply map the table columns using equivalent HAWQ data types. For example, if a Hive table is created using:
    +
    +``` sql
    +hive> CREATE TABLE hive_parquet_table (fname string, lname string, custid int, acctbalance double)
    +        STORED AS parquet;
     ```
     
    +Define the HAWQ external table using:
    +
     ``` sql
    -postgres=# CREATE EXTERNAL TABLE hivetest(id int, newid int)
    -LOCATION ('pxf://namenode:51200/inventory.accessories?PROFILE=Hive')
    -FORMAT 'custom' (formatter='pxfwritable_import');
    +postgres=# CREATE EXTERNAL TABLE pxf_parquet_table (fname text, lname text, custid int, acctbalance double precision)
    +    LOCATION ('pxf://namenode:51200/hive-db-name.hive_parquet_table?profile=Hive')
    +    FORMAT 'CUSTOM' (formatter='pxfwritable_import');
     ```
     
    +## <a id="profileperf"></a>Profile Performance Considerations
     
    -Use HiveRC and HiveText to query RC and Text formats respectively. The HiveRC and HiveText profiles are faster than the generic Hive profile. When using the HiveRC and HiveText profiles, you must specify a DELIMITER option in the LOCATION clause. See [Using Profiles to Read and Write Data](ReadWritePXF.html#readingandwritingdatawithpxf) for more information on profiles.
    +The `HiveRC` and `HiveText` profiles are faster than the generic `Hive` profile.
     
    +?? MORE HERE. ??
    --- End diff --
    
    thanks, this is just the sort of information i was looking for.  it wasn't clear to me why one would ever use the Hive profile if the others were more performant.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

Posted by lisakowen <gi...@git.apache.org>.

Github user lisakowen commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85407620
  
    --- Diff: pxf/HivePXF.html.md.erb ---
    @@ -339,21 +601,21 @@ postgres=# CREATE EXTERNAL TABLE pxf_sales_part(
       delivery_state TEXT, 
       delivery_city TEXT
     )
    -LOCATION ('pxf://namenode_host:51200/sales_part?Profile=Hive')
    +LOCATION ('pxf://namenode:51200/sales_part?Profile=Hive')
     FORMAT 'custom' (FORMATTER='pxfwritable_import');
     
     postgres=# SELECT * FROM pxf_sales_part;
     ```
     
    -### <a id="example3"></a>Example
    +### <a id="example3"></a>Query Without Pushdown
     
     In the following example, the HAWQ query filters the `delivery_city` partition `Sacramento`. The filter on� `item_name` is not pushed down, since it is not a partition column. It is performed on the HAWQ side after all the data on `Sacramento` is transferred for processing.
     
     ``` sql
    -postgres=# SELECT * FROM pxf_sales_part WHERE delivery_city = 'Sacramento' AND item_name = 'shirt';
    +postgres=# SELECT * FROM pxf_sales_part WHERE delivery_city = 'Sacramento' AND item_name = 'cube';
     ```
     
    -### <a id="example4"></a>Example
    +### <a id="example4"></a>Query With Pushdown
    --- End diff --
    
    yes, this is good info to share with the user!  i checked out the code, and it looks like this GUC is on by default.  i will add some text to that effect in the appropriate section.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

Posted by dyozie <gi...@git.apache.org>.

Github user dyozie commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85368842
  
    --- Diff: pxf/HivePXF.html.md.erb ---
    @@ -2,121 +2,450 @@
     title: Accessing Hive Data
     ---
     
    -This topic describes how to access Hive data using PXF. You have several options for querying data stored in Hive. You can create external tables in PXF and then query those tables, or you can easily query Hive tables by using HAWQ and PXF's integration with HCatalog. HAWQ accesses Hive table metadata stored in HCatalog.
    +Apache Hive is a distributed data warehousing infrastructure.  Hive facilitates managing large data sets supporting multiple data formats, including comma-separated value (.csv), RC, ORC, and parquet. The PXF Hive plug-in reads data stored in Hive, as well as HDFS or HBase.
    +
    +This section describes how to use PXF to access Hive data. Options for querying data stored in Hive include:
    +
    +-  Creating an external table in PXF and querying that table
    +-  Querying Hive tables via PXF's integration with HCatalog
     
     ## <a id="installingthepxfhiveplugin"></a>Prerequisites
     
    -Check the following before using PXF to access Hive:
    +Before accessing Hive data with HAWQ and PXF, ensure that:
     
    --   The PXF HDFS plug-in is installed on all cluster nodes.
    +-   The PXF HDFS plug-in is installed on all cluster nodes. See [Installing PXF Plug-ins](InstallPXFPlugins.html) for PXF plug-in installation information.
     -   The PXF Hive plug-in is installed on all cluster nodes.
     -   The Hive JAR files and conf directory�are installed on all cluster nodes.
    --   Test PXF on HDFS before connecting to Hive or HBase.
    +-   You have tested PXF on HDFS.
     -   You are running the Hive Metastore service on a machine in your cluster.�
     -   You have set the `hive.metastore.uris`�property in the�`hive-site.xml` on the NameNode.
     
    +## <a id="topic_p2s_lvl_25"></a>Hive File Formats
    +
    +Hive supports several file formats:
    +
    +-   TextFile - flat file with data in comma-, tab-, or space-separated value format or JSON notation
    +-   SequenceFile - flat file consisting of binary key/value pairs
    +-   RCFile - record columnar data consisting of binary key/value pairs; high row compression rate
    +-   ORCFile - optimized row columnar data with stripe, footer, and postscript sections; reduces data size
    +-   Parquet - compressed columnar data representation
    +-   Avro - JSON-defined, schema-based data serialization format
    +
    +Refer to [File Formats](https://cwiki.apache.org/confluence/display/Hive/FileFormats) for detailed information about the file formats supported by Hive.
    +
    +The PXF Hive plug-in supports the following profiles for accessing the Hive file formats listed above. These include:
    +
    +- `Hive`
    +- `HiveText`
    +- `HiveRC`
    +
    +## <a id="topic_p2s_lvl_29"></a>Data Type Mapping
    +
    +### <a id="hive_primdatatypes"></a>Primitive Data Types
    +
    +To represent Hive data in HAWQ, map data values that use a primitive data type to HAWQ columns of the same type.
    +
    +The following table summarizes external mapping rules for Hive primitive types.
    +
    +| Hive Data Type  | Hawq Data Type |
    +|-------|---------------------------|
    +| boolean    | bool |
    +| int   | int4 |
    +| smallint   | int2 |
    +| tinyint   | int2 |
    +| bigint   | int8 |
    +| decimal  |  numeric  |
    +| float   | float4 |
    +| double   | float8 |
    +| string   | text |
    +| binary   | bytea |
    +| char   | bpchar |
    +| varchar   | varchar |
    +| timestamp   | timestamp |
    +| date   | date |
    +
    +
    +### <a id="topic_b4v_g3n_25"></a>Complex Data Types
    +
    +Hive supports complex data types including array, struct, map, and union. PXF maps each of these complex types to `text`.  While HAWQ does not natively support these types, you can create HAWQ functions or application code to extract subcomponents of these complex data types.
    +
    +An example using complex data types is provided later in this topic.
    +
    +
    +## <a id="hive_sampledataset"></a>Sample Data Set
    +
    +Examples used in this topic will operate on a common data set. This simple data set models a retail sales operation and includes fields with the following names and data types:
    +
    +- location - text
    +- month - text
    +- number\_of\_orders - integer
    +- total\_sales - double
    +
    +Prepare the sample data set for use:
    +
    +1. First, create a text file:
    +
    +    ```
    +    $ vi /tmp/pxf_hive_datafile.txt
    +    ```
    +
    +2. Add the following data to `pxf_hive_datafile.txt`; notice the use of the comma `,` to separate the four field values:
    +
    +    ```
    +    Prague,Jan,101,4875.33
    +    Rome,Mar,87,1557.39
    +    Bangalore,May,317,8936.99
    +    Beijing,Jul,411,11600.67
    +    San Francisco,Sept,156,6846.34
    +    Paris,Nov,159,7134.56
    +    San Francisco,Jan,113,5397.89
    +    Prague,Dec,333,9894.77
    +    Bangalore,Jul,271,8320.55
    +    Beijing,Dec,100,4248.41
    +    ```
    +
    +Make note of the path to `pxf_hive_datafile.txt`; you will use it in later exercises.
    +
    +
     ## <a id="hivecommandline"></a>Hive Command Line
     
    -To start the Hive command line and work directly on a Hive table:
    +The Hive command line is a subsystem similar to that of `psql`. To start the Hive command line:
     
     ``` shell
    -$ hive
    +$ HADOOP_USER_NAME=hdfs hive
     ```
     
    -Here is an example�of how to create and load data into�a sample Hive table from an existing file.
    +The default Hive database is named `default`. 
     
    -``` sql
    -hive> CREATE TABLE test (name string, type string, supplier_key int, full_price double) row format delimited fields terminated by ',';
    -hive> LOAD DATA local inpath '/local/path/data.txt' into table test;
    -```
    +### <a id="hivecommandline_createdb"></a>Example: Create a Hive Database
     
    -## <a id="topic_p2s_lvl_25"></a>Using PXF Tables to Query Hive
    +Create a Hive table to expose our sample data set.
     
    -Hive tables are defined in a specific way in PXF, regardless of the underlying file storage format. The PXF Hive plug-ins automatically detect source tables in the following formats:
    +1. Create a Hive table named `sales_info` in the `default` database:
     
    --   Text based
    --   SequenceFile
    --   RCFile
    --   ORCFile
    --   Parquet
    --   Avro
    +    ``` sql
    +    hive> CREATE TABLE sales_info (location string, month string,
    +            number_of_orders int, total_sales double)
    +            ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    +            STORED AS textfile;
    +    ```
     
    -The source table can also be a combination of these types. The PXF Hive plug-in�uses this information to query the data in runtime.
    +    Notice that:
     
    --   **[Syntax Example](../pxf/HivePXF.html#syntax2)**
    +    - The `STORED AS textfile` subclause instructs Hive to create the table in Textfile (the default) format.  Hive Textfile format supports comma-, tab-, and space-separated values, as well as data specified in JSON notation.
    +    - The `DELIMITED FIELDS TERMINATED BY` subclause identifies the field delimiter within a data record (line). The `sales_info` table field delimiter is a comma (`,`).
     
    --   **[Hive Complex Types](../pxf/HivePXF.html#topic_b4v_g3n_25)**
    +2. Load the `pxf_hive_datafile.txt` sample data file into the `sales_info` table you just created:
     
    -### <a id="syntax2"></a>Syntax Example
    +    ``` sql
    +    hive> LOAD DATA local INPATH '/tmp/pxf_hive_datafile.txt'
    +            INTO TABLE sales_info;
    +    ```
    +
    +3. Perform a query on `sales_info` to verify the data was loaded successfully:
    +
    +    ``` sql
    +    hive> SELECT * FROM sales_info;
    +    ```
     
    -The following�PXF table definition is valid for any Hive file storage type.
    +In examples later in this section, you will access the `sales_info` Hive table directly via PXF. You will also insert `sales_info` data into tables of other Hive file format types, and use PXF to access those directly as well.
    +
    +## <a id="topic_p2s_lvl_28"></a>Querying External Hive Data
    +
    +The PXF Hive plug-in supports several Hive-related profiles. These include `Hive`, `HiveText`, and `HiveRC`.
    +
    +Use the following syntax to create a HAWQ external table representing Hive data:
     
     ``` sql
    -CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name 
    -    ( column_name data_type [, ...] | LIKE other_table )
    -LOCATION ('pxf://namenode[:port]/hive-db-name.hive-table-name?<pxf parameters>[&custom-option=value...]')FORMAT 'CUSTOM' (formatter='pxfwritable_import')
    +CREATE EXTERNAL TABLE <table_name>
    +    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
    +LOCATION ('pxf://<host>[:<port>]/<hive-db-name>.<hive-table-name>
    +    ?PROFILE=Hive|HiveText|HiveRC[&DELIMITER=<delim>'])
    +FORMAT 'CUSTOM|TEXT' (formatter='pxfwritable_import' | delimiter='<delim>')
     ```
     
    -where `<pxf parameters>` is:
    +Hive-plug-in-specific keywords and values used in the [CREATE EXTERNAL TABLE](../reference/sql/CREATE-EXTERNAL-TABLE.html) call are described below.
     
    -``` pre
    -   FRAGMENTER=fragmenter_class&ACCESSOR=accessor_class&RESOLVER=resolver_class]
    - | PROFILE=profile-name
    -```
    +| Keyword  | Value |
    +|-------|-------------------------------------|
    +| \<host\>[:<port\>]    | The HDFS NameNode and port. |
    +| \<hive-db-name\>    | Name of the Hive database. If omitted, defaults to the Hive database named `default`. |
    +| \<hive-table-name\>    | Name of the Hive table. |
    +| PROFILE    | The `PROFILE` keyword must specify one of the values `Hive`, `HiveText`, or `HiveRC`. |
    +| DELIMITER    | The `DELIMITER` clause is required for both the `HiveText` and `HiveRC` profiles and identifies the field delimiter used in the Hive data set.  \<delim\> must be a single ascii character or specified in hexadecimal representation. |
    +| FORMAT (`Hive` profile)   | The `FORMAT` clause must specify `CUSTOM`. The `CUSTOM` format supports only the built-in `pxfwritable_import` `formatter`.   |
    +| FORMAT (`HiveText` and `HiveRC` profiles) | The `FORMAT` clause must specify `TEXT`. The `delimiter` must be specified a second time in '\<delim\>'. |
     
     
    -If `hive-db-name` is omitted, pxf will default to the Hive `default` database.
    +## <a id="profile_hive"></a>Hive Profile
     
    -**Note:** The port is the connection port for the PXF service. If the port is omitted, PXF assumes that High Availability (HA) is enabled and connects to the HA name service port, 51200 by default. The HA name service port can be changed by setting the pxf\_service\_port configuration parameter.
    +The `Hive` profile works with any Hive file format.
     
    -PXF has three built-in profiles for Hive tables:
    +### <a id="profile_hive_using"></a>Example: Using the Hive Profile
     
    --   Hive
    --   HiveRC
    --   HiveText
    +Use the `Hive` profile to create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier.
     
    -The Hive profile works with any Hive storage type. 
    -The following example creates a readable HAWQ external table representing a Hive table named `accessories` in the `inventory` Hive database using the PXF Hive profile:
    +1. Create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier:
     
    -``` shell
    -$ psql -d postgres
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hiveprofile(location text, month text, num_orders int, total_sales float8)
    +                LOCATION ('pxf://namenode:51200/default.sales_info?PROFILE=Hive')
    +              FORMAT 'custom' (formatter='pxfwritable_import');
    +    ```
    +
    +2. Query the table:
    +
    +    ``` sql
    +    postgres=# SELECT * FROM salesinfo_hiveprofile;
    +    ```
    +
    +    ``` shell
    +       location    | month | num_orders | total_sales
    +    ---------------+-------+------------+-------------
    +     Prague        | Jan   |        101 |     4875.33
    +     Rome          | Mar   |         87 |     1557.39
    +     Bangalore     | May   |        317 |     8936.99
    +     ...
    +
    +    ```
    +
    +## <a id="profile_hivetext"></a>HiveText Profile
    +
    +Use the `HiveText` profile to query text formats. The `HiveText` profile is more performant than the `Hive` profile.
    +
    +**Note**: When using the `HiveText` profile, you *must* specify a delimiter option in *both* the `LOCATION` and `FORMAT` clauses.
    +
    +### <a id="profile_hivetext_using"></a>Example: Using the HiveText Profile
    +
    +Use the PXF `HiveText` profile to create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier.
    +
    +1. Create the external table:
    +
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hivetextprofile(location text, month text, num_orders int, total_sales float8)
    +                 LOCATION ('pxf://namenode:51200/default.sales_info?PROFILE=HiveText&DELIMITER=\x2c')
    +               FORMAT 'TEXT' (delimiter=E',');
    +    ```
    +
    +    (You can safely ignore the "nonstandard use of escape in a string literal" warning and related messages.)
    +
    +    Notice that:
    +
    +    - The `LOCATION` subclause `DELIMITER` value is specified in hexadecimal format. `\x` is a prefix that instructs PXF to interpret the following characters as hexadecimal. `2c` is the hex value for the comma character.
    +    - The `FORMAT` subclause `delimiter` value is specified as the single ascii comma character ','. `E` escapes the character.
    +
    +2. Query the external table:
    +
    +    ``` sql
    +    postgres=# SELECT * FROM salesinfo_hivetextprofile where location="Beijing";
    +    ```
    +
    +    ``` shell
    +     location | month | num_orders | total_sales
    +    ----------+-------+------------+-------------
    +     Beijing  | Jul   |        411 |    11600.67
    +     Beijing  | Dec   |        100 |     4248.41
    +    (2 rows)
    +    ```
    +
    +## <a id="profile_hiverc"></a>HiveRC Profile
    +
    +The RCFile Hive format is used for row columnar formatted data. The `HiveRC` profile provides access to RCFile data.
    +
    +### <a id="profile_hiverc_rcfiletbl_using"></a>Example: Using the HiveRC Profile
    +
    +Use the `HiveRC` profile to query RCFile-formatted data in Hive tables. The `HiveRC` profile is more performant than the `Hive` profile for this file format type.
    +
    +1. Create a Hive table with RCFile format:
    +
    +    ``` shell
    +    $ HADOOP_USER_NAME=hdfs hive
    +    ```
    +
    +    ``` sql
    +    hive> CREATE TABLE sales_info_rcfile (location string, month string,
    +            number_of_orders int, total_sales double)
    +          ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    +          STORED AS rcfile;
    +    ```
    +
    +2. Insert the data from the `sales_info` table into `sales_info_rcfile`:
    +
    +    ``` sql
    +    hive> INSERT INTO TABLE sales_info_rcfile SELECT * FROM sales_info;
    +    ```
    +
    +    A copy of the sample data set is now stored in RCFile format in `sales_info_rcfile`. 
    +    
    +3. Perform a Hive query on `sales_info_rcfile` to verify the data was loaded successfully:
    +
    +    ``` sql
    +    hive> SELECT * FROM sales_info_rcfile;
    +    ```
    +
    +4. Use the PXF `HiveRC` profile to create a queryable HAWQ external table from the Hive `sales_info_rcfile` table created in the previous step. When using the `HiveRC` profile, you **must** specify a delimiter option in *both* the `LOCATION` and `FORMAT` clauses.:
    +
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hivercprofile(location text, month text, num_orders int, total_sales float8)
    +                 LOCATION ('pxf://namenode:51200/default.sales_info_rcfile?PROFILE=HiveRC&DELIMITER=\x2c')
    +               FORMAT 'TEXT' (delimiter=E',');
    +    ```
    +
    +    (Again, you can safely ignore the "nonstandard use of escape in a string literal" warning and related messages.)
    +
    +5. Query the external table:
    +
    +    ``` sql
    +    postgres=# SELECT location, total_sales FROM salesinfo_hivercprofile;
    +    ```
    +
    +    ``` shell
    +       location    | total_sales
    +    ---------------+-------------
    +     Prague        |     4875.33
    +     Rome          |     1557.39
    +     Bangalore     |     8936.99
    +     Beijing       |    11600.67
    +     ...
    +    ```
    +
    +## <a id="topic_dbb_nz3_ts"></a>Accessing Parquet-Format Hive Tables
    +
    +The PXF `Hive` profile supports both non-partitioned and partitioned Hive tables that use the Parquet storage format in HDFS. Simply map the table columns using equivalent HAWQ data types. For example, if a Hive table is created using:
    +
    +``` sql
    +hive> CREATE TABLE hive_parquet_table (fname string, lname string, custid int, acctbalance double)
    +        STORED AS parquet;
     ```
     
    +Define the HAWQ external table using:
    +
     ``` sql
    -postgres=# CREATE EXTERNAL TABLE hivetest(id int, newid int)
    -LOCATION ('pxf://namenode:51200/inventory.accessories?PROFILE=Hive')
    -FORMAT 'custom' (formatter='pxfwritable_import');
    +postgres=# CREATE EXTERNAL TABLE pxf_parquet_table (fname text, lname text, custid int, acctbalance double precision)
    +    LOCATION ('pxf://namenode:51200/hive-db-name.hive_parquet_table?profile=Hive')
    +    FORMAT 'CUSTOM' (formatter='pxfwritable_import');
     ```
     
    +## <a id="profileperf"></a>Profile Performance Considerations
     
    -Use HiveRC and HiveText to query RC and Text formats respectively. The HiveRC and HiveText profiles are faster than the generic Hive profile. When using the HiveRC and HiveText profiles, you must specify a DELIMITER option in the LOCATION clause. See [Using Profiles to Read and Write Data](ReadWritePXF.html#readingandwritingdatawithpxf) for more information on profiles.
    +The `HiveRC` and `HiveText` profiles are faster than the generic `Hive` profile.
     
    +?? MORE HERE. ??
     
    -### <a id="topic_b4v_g3n_25"></a>Hive Complex Types
    +## <a id="complex_dt_example"></a>Complex Data Type Example
     
    -PXF tables support Hive data types that are not primitive types. The supported Hive complex data types are array, struct, map, and union. This Hive `CREATE TABLE` statement, for example, creates a table with each of these complex data types:
    +This example will employ the array and map complex types, specifically an array of integers and a string key/value pair map.
     
    -``` sql
    -hive> CREATE TABLE sales_collections (
    -  item STRING,
    -  price FLOAT,
    -  properties ARRAY<STRING>,
    -  hash MAP<STRING,FLOAT>,
    -  delivery_address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>,
    -  others UNIONTYPE<FLOAT, BOOLEAN, STRING>
    -)  
    -ROW FORMAT DELIMITED FIELDS
    -TERMINATED BY '\001' COLLECTION ITEMS TERMINATED BY '\002' MAP KEYS TERMINATED BY '\003' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
    -hive> LOAD DATA LOCAL INPATH '/local/path/<some data file>' INTO TABLE sales_collection;
    -```
    +The example data set includes fields with the following names and data types:
    +
    +- index - int
    +- name - string
    +- intarray - array of integers
    +- propmap - map of string key and value pairs
    +
    +When specifying an array field in a Hive table, you must identify the terminator for each item in the collection. Similarly, the map key termination character must also be specified.
    +
    +1. Create a text file from which you will load the data set:
    +
    +    ```
    +    $ vi /tmp/pxf_hive_complex.txt
    +    ```
    +
    +2. Add the following data to `pxf_hive_complex.txt`.  The data uses a comma `,` to separate field values, the percent symbol `%` to separate collection items, and a `:` to terminate map key values:
    +
    +    ```
    +    3,Prague,1%2%3,zone:euro%status:up
    +    89,Rome,4%5%6,zone:euro
    +    400,Bangalore,7%8%9,zone:apac%status:pending
    +    183,Beijing,0%1%2,zone:apac
    +    94,Sacramento,3%4%5,zone:noam%status:down
    +    101,Paris,6%7%8,zone:euro%status:up
    +    56,Frankfurt,9%0%1,zone:euro
    +    202,Jakarta,2%3%4,zone:apac%status:up
    +    313,Sydney,5%6%7,zone:apac%status:pending
    +    76,Atlanta,8%9%0,zone:noam%status:down
    +    ```
    +
    +3. Create a Hive table to represent this data:
    +
    +    ``` shell
    +    $ HADOOP_USER_NAME=hdfs hive
    +    ```
    +
    +    ``` sql
    +    hive> CREATE TABLE table_complextypes( index int, name string, intarray ARRAY<int>, propmap MAP<string, string>)
    +             ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    +             COLLECTION ITEMS TERMINATED BY '%'
    +             MAP KEYS TERMINATED BY ':'
    +             STORED AS TEXTFILE;
    +    ```
    +
    +    Notice that:
    +
    --- End diff --
    
    Same MD formatting error.  This one also causes step 4 below to reset to step1.  Deleting the blank line should fix.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

Posted by kavinderd <gi...@git.apache.org>.

Github user kavinderd commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85376289
  
    --- Diff: pxf/HivePXF.html.md.erb ---
    @@ -2,121 +2,450 @@
     title: Accessing Hive Data
     ---
     
    -This topic describes how to access Hive data using PXF. You have several options for querying data stored in Hive. You can create external tables in PXF and then query those tables, or you can easily query Hive tables by using HAWQ and PXF's integration with HCatalog. HAWQ accesses Hive table metadata stored in HCatalog.
    +Apache Hive is a distributed data warehousing infrastructure.  Hive facilitates managing large data sets supporting multiple data formats, including comma-separated value (.csv), RC, ORC, and parquet. The PXF Hive plug-in reads data stored in Hive, as well as HDFS or HBase.
    +
    +This section describes how to use PXF to access Hive data. Options for querying data stored in Hive include:
    +
    +-  Creating an external table in PXF and querying that table
    +-  Querying Hive tables via PXF's integration with HCatalog
     
     ## <a id="installingthepxfhiveplugin"></a>Prerequisites
     
    -Check the following before using PXF to access Hive:
    +Before accessing Hive data with HAWQ and PXF, ensure that:
     
    --   The PXF HDFS plug-in is installed on all cluster nodes.
    +-   The PXF HDFS plug-in is installed on all cluster nodes. See [Installing PXF Plug-ins](InstallPXFPlugins.html) for PXF plug-in installation information.
     -   The PXF Hive plug-in is installed on all cluster nodes.
     -   The Hive JAR files and conf directory�are installed on all cluster nodes.
    --   Test PXF on HDFS before connecting to Hive or HBase.
    +-   You have tested PXF on HDFS.
     -   You are running the Hive Metastore service on a machine in your cluster.�
     -   You have set the `hive.metastore.uris`�property in the�`hive-site.xml` on the NameNode.
     
    +## <a id="topic_p2s_lvl_25"></a>Hive File Formats
    +
    +Hive supports several file formats:
    +
    +-   TextFile - flat file with data in comma-, tab-, or space-separated value format or JSON notation
    +-   SequenceFile - flat file consisting of binary key/value pairs
    +-   RCFile - record columnar data consisting of binary key/value pairs; high row compression rate
    +-   ORCFile - optimized row columnar data with stripe, footer, and postscript sections; reduces data size
    +-   Parquet - compressed columnar data representation
    +-   Avro - JSON-defined, schema-based data serialization format
    +
    +Refer to [File Formats](https://cwiki.apache.org/confluence/display/Hive/FileFormats) for detailed information about the file formats supported by Hive.
    +
    +The PXF Hive plug-in supports the following profiles for accessing the Hive file formats listed above. These include:
    +
    +- `Hive`
    +- `HiveText`
    +- `HiveRC`
    +
    +## <a id="topic_p2s_lvl_29"></a>Data Type Mapping
    +
    +### <a id="hive_primdatatypes"></a>Primitive Data Types
    +
    +To represent Hive data in HAWQ, map data values that use a primitive data type to HAWQ columns of the same type.
    +
    +The following table summarizes external mapping rules for Hive primitive types.
    +
    +| Hive Data Type  | Hawq Data Type |
    +|-------|---------------------------|
    +| boolean    | bool |
    +| int   | int4 |
    +| smallint   | int2 |
    +| tinyint   | int2 |
    +| bigint   | int8 |
    +| decimal  |  numeric  |
    +| float   | float4 |
    +| double   | float8 |
    +| string   | text |
    +| binary   | bytea |
    +| char   | bpchar |
    +| varchar   | varchar |
    +| timestamp   | timestamp |
    +| date   | date |
    +
    +
    +### <a id="topic_b4v_g3n_25"></a>Complex Data Types
    +
    +Hive supports complex data types including array, struct, map, and union. PXF maps each of these complex types to `text`.  While HAWQ does not natively support these types, you can create HAWQ functions or application code to extract subcomponents of these complex data types.
    +
    +An example using complex data types is provided later in this topic.
    +
    +
    +## <a id="hive_sampledataset"></a>Sample Data Set
    +
    +Examples used in this topic will operate on a common data set. This simple data set models a retail sales operation and includes fields with the following names and data types:
    +
    +- location - text
    +- month - text
    +- number\_of\_orders - integer
    +- total\_sales - double
    +
    +Prepare the sample data set for use:
    +
    +1. First, create a text file:
    +
    +    ```
    +    $ vi /tmp/pxf_hive_datafile.txt
    +    ```
    +
    +2. Add the following data to `pxf_hive_datafile.txt`; notice the use of the comma `,` to separate the four field values:
    +
    +    ```
    +    Prague,Jan,101,4875.33
    +    Rome,Mar,87,1557.39
    +    Bangalore,May,317,8936.99
    +    Beijing,Jul,411,11600.67
    +    San Francisco,Sept,156,6846.34
    +    Paris,Nov,159,7134.56
    +    San Francisco,Jan,113,5397.89
    +    Prague,Dec,333,9894.77
    +    Bangalore,Jul,271,8320.55
    +    Beijing,Dec,100,4248.41
    +    ```
    +
    +Make note of the path to `pxf_hive_datafile.txt`; you will use it in later exercises.
    +
    +
     ## <a id="hivecommandline"></a>Hive Command Line
     
    -To start the Hive command line and work directly on a Hive table:
    +The Hive command line is a subsystem similar to that of `psql`. To start the Hive command line:
     
     ``` shell
    -$ hive
    +$ HADOOP_USER_NAME=hdfs hive
     ```
     
    -Here is an example�of how to create and load data into�a sample Hive table from an existing file.
    +The default Hive database is named `default`. 
     
    -``` sql
    -hive> CREATE TABLE test (name string, type string, supplier_key int, full_price double) row format delimited fields terminated by ',';
    -hive> LOAD DATA local inpath '/local/path/data.txt' into table test;
    -```
    +### <a id="hivecommandline_createdb"></a>Example: Create a Hive Database
     
    -## <a id="topic_p2s_lvl_25"></a>Using PXF Tables to Query Hive
    +Create a Hive table to expose our sample data set.
     
    -Hive tables are defined in a specific way in PXF, regardless of the underlying file storage format. The PXF Hive plug-ins automatically detect source tables in the following formats:
    +1. Create a Hive table named `sales_info` in the `default` database:
     
    --   Text based
    --   SequenceFile
    --   RCFile
    --   ORCFile
    --   Parquet
    --   Avro
    +    ``` sql
    +    hive> CREATE TABLE sales_info (location string, month string,
    +            number_of_orders int, total_sales double)
    +            ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    +            STORED AS textfile;
    +    ```
     
    -The source table can also be a combination of these types. The PXF Hive plug-in�uses this information to query the data in runtime.
    +    Notice that:
     
    --   **[Syntax Example](../pxf/HivePXF.html#syntax2)**
    +    - The `STORED AS textfile` subclause instructs Hive to create the table in Textfile (the default) format.  Hive Textfile format supports comma-, tab-, and space-separated values, as well as data specified in JSON notation.
    +    - The `DELIMITED FIELDS TERMINATED BY` subclause identifies the field delimiter within a data record (line). The `sales_info` table field delimiter is a comma (`,`).
     
    --   **[Hive Complex Types](../pxf/HivePXF.html#topic_b4v_g3n_25)**
    +2. Load the `pxf_hive_datafile.txt` sample data file into the `sales_info` table you just created:
     
    -### <a id="syntax2"></a>Syntax Example
    +    ``` sql
    +    hive> LOAD DATA local INPATH '/tmp/pxf_hive_datafile.txt'
    +            INTO TABLE sales_info;
    +    ```
    +
    +3. Perform a query on `sales_info` to verify the data was loaded successfully:
    +
    +    ``` sql
    +    hive> SELECT * FROM sales_info;
    +    ```
     
    -The following�PXF table definition is valid for any Hive file storage type.
    +In examples later in this section, you will access the `sales_info` Hive table directly via PXF. You will also insert `sales_info` data into tables of other Hive file format types, and use PXF to access those directly as well.
    +
    +## <a id="topic_p2s_lvl_28"></a>Querying External Hive Data
    +
    +The PXF Hive plug-in supports several Hive-related profiles. These include `Hive`, `HiveText`, and `HiveRC`.
    +
    +Use the following syntax to create a HAWQ external table representing Hive data:
     
     ``` sql
    -CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name 
    -    ( column_name data_type [, ...] | LIKE other_table )
    -LOCATION ('pxf://namenode[:port]/hive-db-name.hive-table-name?<pxf parameters>[&custom-option=value...]')FORMAT 'CUSTOM' (formatter='pxfwritable_import')
    +CREATE EXTERNAL TABLE <table_name>
    +    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
    +LOCATION ('pxf://<host>[:<port>]/<hive-db-name>.<hive-table-name>
    +    ?PROFILE=Hive|HiveText|HiveRC[&DELIMITER=<delim>'])
    +FORMAT 'CUSTOM|TEXT' (formatter='pxfwritable_import' | delimiter='<delim>')
     ```
     
    -where `<pxf parameters>` is:
    +Hive-plug-in-specific keywords and values used in the [CREATE EXTERNAL TABLE](../reference/sql/CREATE-EXTERNAL-TABLE.html) call are described below.
     
    -``` pre
    -   FRAGMENTER=fragmenter_class&ACCESSOR=accessor_class&RESOLVER=resolver_class]
    - | PROFILE=profile-name
    -```
    +| Keyword  | Value |
    +|-------|-------------------------------------|
    +| \<host\>[:<port\>]    | The HDFS NameNode and port. |
    +| \<hive-db-name\>    | Name of the Hive database. If omitted, defaults to the Hive database named `default`. |
    +| \<hive-table-name\>    | Name of the Hive table. |
    +| PROFILE    | The `PROFILE` keyword must specify one of the values `Hive`, `HiveText`, or `HiveRC`. |
    +| DELIMITER    | The `DELIMITER` clause is required for both the `HiveText` and `HiveRC` profiles and identifies the field delimiter used in the Hive data set.  \<delim\> must be a single ascii character or specified in hexadecimal representation. |
    +| FORMAT (`Hive` profile)   | The `FORMAT` clause must specify `CUSTOM`. The `CUSTOM` format supports only the built-in `pxfwritable_import` `formatter`.   |
    +| FORMAT (`HiveText` and `HiveRC` profiles) | The `FORMAT` clause must specify `TEXT`. The `delimiter` must be specified a second time in '\<delim\>'. |
     
     
    -If `hive-db-name` is omitted, pxf will default to the Hive `default` database.
    +## <a id="profile_hive"></a>Hive Profile
     
    -**Note:** The port is the connection port for the PXF service. If the port is omitted, PXF assumes that High Availability (HA) is enabled and connects to the HA name service port, 51200 by default. The HA name service port can be changed by setting the pxf\_service\_port configuration parameter.
    +The `Hive` profile works with any Hive file format.
     
    -PXF has three built-in profiles for Hive tables:
    +### <a id="profile_hive_using"></a>Example: Using the Hive Profile
     
    --   Hive
    --   HiveRC
    --   HiveText
    +Use the `Hive` profile to create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier.
     
    -The Hive profile works with any Hive storage type. 
    -The following example creates a readable HAWQ external table representing a Hive table named `accessories` in the `inventory` Hive database using the PXF Hive profile:
    +1. Create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier:
     
    -``` shell
    -$ psql -d postgres
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hiveprofile(location text, month text, num_orders int, total_sales float8)
    +                LOCATION ('pxf://namenode:51200/default.sales_info?PROFILE=Hive')
    +              FORMAT 'custom' (formatter='pxfwritable_import');
    +    ```
    +
    +2. Query the table:
    +
    +    ``` sql
    +    postgres=# SELECT * FROM salesinfo_hiveprofile;
    +    ```
    +
    +    ``` shell
    +       location    | month | num_orders | total_sales
    +    ---------------+-------+------------+-------------
    +     Prague        | Jan   |        101 |     4875.33
    +     Rome          | Mar   |         87 |     1557.39
    +     Bangalore     | May   |        317 |     8936.99
    +     ...
    +
    +    ```
    +
    +## <a id="profile_hivetext"></a>HiveText Profile
    +
    +Use the `HiveText` profile to query text formats. The `HiveText` profile is more performant than the `Hive` profile.
    +
    +**Note**: When using the `HiveText` profile, you *must* specify a delimiter option in *both* the `LOCATION` and `FORMAT` clauses.
    +
    +### <a id="profile_hivetext_using"></a>Example: Using the HiveText Profile
    +
    +Use the PXF `HiveText` profile to create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier.
    +
    +1. Create the external table:
    +
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hivetextprofile(location text, month text, num_orders int, total_sales float8)
    +                 LOCATION ('pxf://namenode:51200/default.sales_info?PROFILE=HiveText&DELIMITER=\x2c')
    +               FORMAT 'TEXT' (delimiter=E',');
    +    ```
    +
    +    (You can safely ignore the "nonstandard use of escape in a string literal" warning and related messages.)
    +
    +    Notice that:
    +
    +    - The `LOCATION` subclause `DELIMITER` value is specified in hexadecimal format. `\x` is a prefix that instructs PXF to interpret the following characters as hexadecimal. `2c` is the hex value for the comma character.
    +    - The `FORMAT` subclause `delimiter` value is specified as the single ascii comma character ','. `E` escapes the character.
    +
    +2. Query the external table:
    +
    +    ``` sql
    +    postgres=# SELECT * FROM salesinfo_hivetextprofile where location="Beijing";
    +    ```
    +
    +    ``` shell
    +     location | month | num_orders | total_sales
    +    ----------+-------+------------+-------------
    +     Beijing  | Jul   |        411 |    11600.67
    +     Beijing  | Dec   |        100 |     4248.41
    +    (2 rows)
    +    ```
    +
    +## <a id="profile_hiverc"></a>HiveRC Profile
    +
    +The RCFile Hive format is used for row columnar formatted data. The `HiveRC` profile provides access to RCFile data.
    +
    +### <a id="profile_hiverc_rcfiletbl_using"></a>Example: Using the HiveRC Profile
    +
    +Use the `HiveRC` profile to query RCFile-formatted data in Hive tables. The `HiveRC` profile is more performant than the `Hive` profile for this file format type.
    +
    +1. Create a Hive table with RCFile format:
    +
    +    ``` shell
    +    $ HADOOP_USER_NAME=hdfs hive
    +    ```
    +
    +    ``` sql
    +    hive> CREATE TABLE sales_info_rcfile (location string, month string,
    +            number_of_orders int, total_sales double)
    +          ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    +          STORED AS rcfile;
    +    ```
    +
    +2. Insert the data from the `sales_info` table into `sales_info_rcfile`:
    +
    +    ``` sql
    +    hive> INSERT INTO TABLE sales_info_rcfile SELECT * FROM sales_info;
    +    ```
    +
    +    A copy of the sample data set is now stored in RCFile format in `sales_info_rcfile`. 
    +    
    +3. Perform a Hive query on `sales_info_rcfile` to verify the data was loaded successfully:
    +
    +    ``` sql
    +    hive> SELECT * FROM sales_info_rcfile;
    +    ```
    +
    +4. Use the PXF `HiveRC` profile to create a queryable HAWQ external table from the Hive `sales_info_rcfile` table created in the previous step. When using the `HiveRC` profile, you **must** specify a delimiter option in *both* the `LOCATION` and `FORMAT` clauses.:
    +
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hivercprofile(location text, month text, num_orders int, total_sales float8)
    +                 LOCATION ('pxf://namenode:51200/default.sales_info_rcfile?PROFILE=HiveRC&DELIMITER=\x2c')
    +               FORMAT 'TEXT' (delimiter=E',');
    +    ```
    +
    +    (Again, you can safely ignore the "nonstandard use of escape in a string literal" warning and related messages.)
    +
    +5. Query the external table:
    +
    +    ``` sql
    +    postgres=# SELECT location, total_sales FROM salesinfo_hivercprofile;
    +    ```
    +
    +    ``` shell
    +       location    | total_sales
    +    ---------------+-------------
    +     Prague        |     4875.33
    +     Rome          |     1557.39
    +     Bangalore     |     8936.99
    +     Beijing       |    11600.67
    +     ...
    +    ```
    +
    +## <a id="topic_dbb_nz3_ts"></a>Accessing Parquet-Format Hive Tables
    +
    +The PXF `Hive` profile supports both non-partitioned and partitioned Hive tables that use the Parquet storage format in HDFS. Simply map the table columns using equivalent HAWQ data types. For example, if a Hive table is created using:
    +
    +``` sql
    +hive> CREATE TABLE hive_parquet_table (fname string, lname string, custid int, acctbalance double)
    +        STORED AS parquet;
     ```
     
    +Define the HAWQ external table using:
    +
     ``` sql
    -postgres=# CREATE EXTERNAL TABLE hivetest(id int, newid int)
    -LOCATION ('pxf://namenode:51200/inventory.accessories?PROFILE=Hive')
    -FORMAT 'custom' (formatter='pxfwritable_import');
    +postgres=# CREATE EXTERNAL TABLE pxf_parquet_table (fname text, lname text, custid int, acctbalance double precision)
    +    LOCATION ('pxf://namenode:51200/hive-db-name.hive_parquet_table?profile=Hive')
    +    FORMAT 'CUSTOM' (formatter='pxfwritable_import');
     ```
     
    +## <a id="profileperf"></a>Profile Performance Considerations
     
    -Use HiveRC and HiveText to query RC and Text formats respectively. The HiveRC and HiveText profiles are faster than the generic Hive profile. When using the HiveRC and HiveText profiles, you must specify a DELIMITER option in the LOCATION clause. See [Using Profiles to Read and Write Data](ReadWritePXF.html#readingandwritingdatawithpxf) for more information on profiles.
    +The `HiveRC` and `HiveText` profiles are faster than the generic `Hive` profile.
     
    +?? MORE HERE. ??
    --- End diff --
    
    Maybe mention the tradeoffs between HiveRC/HiveText and the Hive plugin. The Hive plugin can access heterogenous file formats for a single table where each partition may be stored as a different file format. Whereas the HiveRC and HiveText strictly work on a single file format


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-hawq-docs/pull/39


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

Posted by dyozie <gi...@git.apache.org>.

Github user dyozie commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85371576
  
    --- Diff: pxf/HivePXF.html.md.erb ---
    @@ -151,184 +477,120 @@ To enable HCatalog query integration in HAWQ, perform the following steps:
         postgres=# GRANT ALL ON PROTOCOL pxf TO "role";
         ``` 
         
    -3.  To query a Hive table with HCatalog integration, simply query HCatalog directly from HAWQ. The query syntax is:
     
    -    ``` sql
    -    postgres=# SELECT * FROM hcatalog.hive-db-name.hive-table-name;
    -    ```
    +To query a Hive table with HCatalog integration, query HCatalog directly from HAWQ. The query syntax is:
    +
    +``` sql
    +postgres=# SELECT * FROM hcatalog.hive-db-name.hive-table-name;
    +```
     
    -    For example:
    +For example:
     
    -    ``` sql
    -    postgres=# SELECT * FROM hcatalog.default.sales;
    -    ```
    -
    -4.  To obtain a description of a Hive table with HCatalog integration, you can use the `psql` client interface.
    -    -   Within HAWQ, use either the `\d                                         hcatalog.hive-db-name.hive-table-name` or `\d+                                         hcatalog.hive-db-name.hive-table-name` commands to describe a single table. For example, from the `psql` client interface:
    -
    -        ``` shell
    -        $ psql -d postgres
    -        postgres=# \d hcatalog.default.test
    -
    -        PXF Hive Table "default.test"
    -            Column    |  Type  
    -        --------------+--------
    -         name         | text
    -         type         | text
    -         supplier_key | int4
    -         full_price   | float8 
    -        ```
    -    -   Use `\d hcatalog.hive-db-name.*` to describe the whole database schema. For example:
    -
    -        ``` shell
    -        postgres=# \d hcatalog.default.*
    -
    -        PXF Hive Table "default.test"
    -            Column    |  Type  
    -        --------------+--------
    -         type         | text
    -         name         | text
    -         supplier_key | int4
    -         full_price   | float8
    -
    -        PXF Hive Table "default.testabc"
    -         Column | Type 
    -        --------+------
    -         type   | text
    -         name   | text
    -        ```
    -    -   Use `\d hcatalog.*.*` to describe the whole schema:
    -
    -        ``` shell
    -        postgres=# \d hcatalog.*.*
    -
    -        PXF Hive Table "default.test"
    -            Column    |  Type  
    -        --------------+--------
    -         type         | text
    -         name         | text
    -         supplier_key | int4
    -         full_price   | float8
    -
    -        PXF Hive Table "default.testabc"
    -         Column | Type 
    -        --------+------
    -         type   | text
    -         name   | text
    -
    -        PXF Hive Table "userdb.test"
    -          Column  | Type 
    -        ----------+------
    -         address  | text
    -         username | text
    -         
    -        ```
    -
    -    **Note:** When using `\d` or `\d+` commands in the `psql` HAWQ client, `hcatalog` will not be listed as a database. If you use other `psql` compatible clients, `hcatalog` will be listed as a database with a size value of `-1` since `hcatalog` is not a real database in HAWQ.
    -
    -5.  Alternatively, you can use the **pxf\_get\_item\_fields** user-defined function (UDF) to obtain Hive table descriptions from other client interfaces or third-party applications. The UDF takes a PXF profile and a table pattern string as its input parameters.
    -
    -    **Note:** Currently the only supported input profile is `'Hive'`.
    -
    -    For example, the following statement returns a description of a specific table. The description includes path, itemname (table), fieldname, and fieldtype.
    +``` sql
    +postgres=# SELECT * FROM hcatalog.default.sales_info;
    +```
    +
    +To obtain a description of a Hive table with HCatalog integration, you can use the `psql` client interface.
    +
    +-   Within HAWQ, use either the `\d                                         hcatalog.hive-db-name.hive-table-name` or `\d+                                         hcatalog.hive-db-name.hive-table-name` commands to describe a single table. For example, from the `psql` client interface:
    +
    +    ``` shell
    +    $ psql -d postgres
    +    ```
     
         ``` sql
    -    postgres=# select * from pxf_get_item_fields('Hive','default.test');
    +    postgres=# \d hcatalog.default.sales_info_rcfile;
         ```
    -    
    -    ``` pre
    -      path   | itemname |  fieldname   | fieldtype 
    -    ---------+----------+--------------+-----------
    -     default | test     | name         | text
    -     default | test     | type         | text
    -     default | test     | supplier_key | int4
    -     default | test     | full_price   | float8
    -    (4 rows)
    +
    +    ``` shell
    +    PXF Hive Table "default.sales_info_rcfile"
    +          Column      |  Type
    +    ------------------+--------
    +     location         | text
    +     month            | text
    +     number_of_orders | int4
    +     total_sales      | float8
         ```
    +-   Use `\d hcatalog.hive-db-name.*` to describe the whole database schema, i.e. all tables in `hive-db-name`.
    +-   Use `\d hcatalog.*.*` to describe the whole schema, i.e. all databases and tables.
     
    -    The following statement returns table descriptions from the default database.
    +When using `\d` or `\d+` commands in the `psql` HAWQ client, `hcatalog` will not be listed as a database. If you use other `psql` compatible clients, `hcatalog` will be listed as a database with a size value of `-1` since `hcatalog` is not a real database in HAWQ.
    +
    +Alternatively, you can use the `pxf_get_item_fields` user-defined function (UDF) to obtain Hive table descriptions from other client interfaces or third-party applications. The UDF takes a PXF profile and a table pattern string as its input parameters.  **Note:** The only supported input profile at this time is `'Hive'`.
    +
    +- The following statement returns a description of a specific table. The description includes path, itemname (table), fieldname, and fieldtype.
     
         ``` sql
    -    postgres=# select * from pxf_get_item_fields('Hive','default.*');
    +    postgres=# SELECT * FROM pxf_get_item_fields('Hive','default.sales_info_rcfile');
         ```
    -    
    +
         ``` pre
    -      path   | itemname |  fieldname   | fieldtype 
    -    ---------+----------+--------------+-----------
    -     default | test     | name         | text
    -     default | test     | type         | text
    -     default | test     | supplier_key | int4
    -     default | test     | full_price   | float8
    -     default | testabc  | name         | text
    -     default | testabc  | type         | text
    -    (6 rows)
    +      path   |     itemname      |    fieldname     | fieldtype
    +    ---------+-------------------+------------------+-----------
    +     default | sales_info_rcfile | location         | text
    +     default | sales_info_rcfile | month            | text
    +     default | sales_info_rcfile | number_of_orders | int4
    +     default | sales_info_rcfile | total_sales      | float8
         ```
     
    -    The following statement returns a description of the entire schema.
    +- The following statement returns table descriptions from the default database.
     
         ``` sql
    -    postgres=# select * from pxf_get_item_fields('Hive', '*.*');
    +    postgres=# SELECT * FROM pxf_get_item_fields('Hive','default.*');
         ```
    -    
    -    ``` pre
    -      path   | itemname |  fieldname   | fieldtype 
    -    ---------+----------+--------------+-----------
    -     default | test     | name         | text
    -     default | test     | type         | text
    -     default | test     | supplier_key | int4
    -     default | test     | full_price   | float8
    -     default | testabc  | name         | text
    -     default | testabc  | type         | text
    -     userdb  | test     | username     | text
    -     userdb  | test     | address      | text
    -    (8 rows)
    +
    +- The following statement returns a description of the entire schema.
    +
    +    ``` sql
    +    postgres=# SELECT * FROM pxf_get_item_fields('Hive', '*.*');
         ```
     
     ### <a id="topic_r5k_pst_25"></a>Limitations
     
     HCatalog integration has the following limitations:
     
    --   HCatalog integration queries and describe commands do not support complex types; only primitive types are supported. Use PXF external tables to query complex types in Hive instead. (See [Hive Complex Types](#topic_b4v_g3n_25) for example.)
    --   Even for primitive types, HCatalog metadata descriptions produced by `\d` and` \d+` are converted to HAWQ types. For example, the Hive type `tinyint` is converted to HAWQ type `int2`. (See [Data Types](../reference/HAWQDataTypes.html) for a list of data types in HAWQ.)
    --   HAWQ reserves the database name "hcatalog" for system use. You cannot connect to or alter the system "hcatalog" database.
    +-   HCatalog integration queries and describe commands do not support complex types; only primitive types are supported. Use PXF external tables to query complex types in Hive. (See [Complex Types Example](#complex_dt_example).)
    +-   Even for primitive types, HCatalog metadata descriptions produced by `\d` and` \d+` are converted to HAWQ types. For example, the Hive type `tinyint` is converted to HAWQ type `int2`. (See [Data Type Mapping](#hive_primdatatypes).)
    +-   HAWQ reserves the database name `hcatalog` for system use. You cannot connect to or alter the system `hcatalog` database.
     
     ## <a id="partitionfiltering"></a>Partition Filtering
     
    -The PXF Hive plug-in uses the Hive partitioning feature and directory structure. This enables partition exclusion on HDFS files that�contain the Hive table.�To use�the partition filtering�feature to reduce network traffic and I/O, run a PXF query using a WHERE clause�that refers to a specific partition in the partitioned Hive table.
    +The PXF Hive plug-in supports the Hive partitioning feature and directory structure. This enables partition exclusion on selected HDFS files comprising the Hive table.�To use�the partition filtering�feature to reduce network traffic and I/O, run a PXF query using a `WHERE` clause�that refers to a specific partition in the partitioned Hive table.
     
    -To take advantage of PXF partition filtering push-down, name the partition fields in the external table. These names must be the same as the names stored in the Hive table. Otherwise, PXF ignores Partition filtering and the filtering is performed on the HAWQ side, impacting�performance.
    +To take advantage of PXF partition filtering push-down, the Hive and PXF partition field names should be the same. Otherwise, PXF ignores partition filtering and the filtering is performed on the HAWQ side, impacting�performance.
     
    -**Note:** The Hive plug-in only filters on partition columns, not on other table attributes.
    +**Note:** The Hive plug-in filters only on partition columns, not on other table attributes.
     
    -### <a id="example2"></a>Example
    +### <a id="example2"></a>Create Partitioned Hive Table
     
     Create a�Hive table `sales_part`�with two partition columns, `delivery_state` and `delivery_city:`
     
     ``` sql
    -hive> CREATE TABLE sales_part (name STRING, type STRING, supplier_key INT, price DOUBLE)
    -PARTITIONED BY (delivery_state STRING, delivery_city STRING)
    -ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    +hive> CREATE TABLE sales_part (name string, type string, supplier_key int, price double)
    +        PARTITIONED BY (delivery_state string, delivery_city string)
    +        ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
     ```
     
     Load data into this Hive table and�add some partitions:
     
     ``` sql
    -hive> LOAD DATA LOCAL INPATH '/local/path/data1.txt' INTO TABLE sales_part PARTITION(delivery_state = 'CALIFORNIA', delivery_city = 'San Francisco');
    -hive> LOAD DATA LOCAL INPATH '/local/path/data2.txt' INTO TABLE sales_part PARTITION(delivery_state = 'CALIFORNIA', delivery_city = 'Sacramento');
    -hive> LOAD DATA LOCAL INPATH '/local/path/data3.txt' INTO TABLE sales_part PARTITION(delivery_state = 'NEVADA'    , delivery_city = 'Reno');
    -hive> LOAD DATA LOCAL INPATH '/local/path/data4.txt' INTO TABLE sales_part PARTITION(delivery_state = 'NEVADA'    , delivery_city = 'Las Vegas');
    +hive> INSERT INTO TABLE sales_part PARTITION(delivery_state = 'CALIFORNIA', delivery_city = 'Fresno') VALUES ('block', 'widget', 33, 15.17);
    --- End diff --
    
    Might be nice to add line breaks and spacing as with the previous code block, so readers don't have to scroll horizontally.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

Posted by lisakowen <gi...@git.apache.org>.

Github user lisakowen commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85408122
  
    --- Diff: pxf/HivePXF.html.md.erb ---
    @@ -2,121 +2,450 @@
     title: Accessing Hive Data
     ---
     
    -This topic describes how to access Hive data using PXF. You have several options for querying data stored in Hive. You can create external tables in PXF and then query those tables, or you can easily query Hive tables by using HAWQ and PXF's integration with HCatalog. HAWQ accesses Hive table metadata stored in HCatalog.
    +Apache Hive is a distributed data warehousing infrastructure.  Hive facilitates managing large data sets supporting multiple data formats, including comma-separated value (.csv), RC, ORC, and parquet. The PXF Hive plug-in reads data stored in Hive, as well as HDFS or HBase.
    +
    +This section describes how to use PXF to access Hive data. Options for querying data stored in Hive include:
    +
    +-  Creating an external table in PXF and querying that table
    +-  Querying Hive tables via PXF's integration with HCatalog
     
     ## <a id="installingthepxfhiveplugin"></a>Prerequisites
     
    -Check the following before using PXF to access Hive:
    +Before accessing Hive data with HAWQ and PXF, ensure that:
     
    --   The PXF HDFS plug-in is installed on all cluster nodes.
    +-   The PXF HDFS plug-in is installed on all cluster nodes. See [Installing PXF Plug-ins](InstallPXFPlugins.html) for PXF plug-in installation information.
     -   The PXF Hive plug-in is installed on all cluster nodes.
     -   The Hive JAR files and conf directory�are installed on all cluster nodes.
    --   Test PXF on HDFS before connecting to Hive or HBase.
    +-   You have tested PXF on HDFS.
     -   You are running the Hive Metastore service on a machine in your cluster.�
     -   You have set the `hive.metastore.uris`�property in the�`hive-site.xml` on the NameNode.
     
    +## <a id="topic_p2s_lvl_25"></a>Hive File Formats
    +
    +Hive supports several file formats:
    +
    +-   TextFile - flat file with data in comma-, tab-, or space-separated value format or JSON notation
    +-   SequenceFile - flat file consisting of binary key/value pairs
    +-   RCFile - record columnar data consisting of binary key/value pairs; high row compression rate
    +-   ORCFile - optimized row columnar data with stripe, footer, and postscript sections; reduces data size
    +-   Parquet - compressed columnar data representation
    +-   Avro - JSON-defined, schema-based data serialization format
    +
    +Refer to [File Formats](https://cwiki.apache.org/confluence/display/Hive/FileFormats) for detailed information about the file formats supported by Hive.
    +
    +The PXF Hive plug-in supports the following profiles for accessing the Hive file formats listed above. These include:
    +
    +- `Hive`
    +- `HiveText`
    +- `HiveRC`
    +
    +## <a id="topic_p2s_lvl_29"></a>Data Type Mapping
    +
    +### <a id="hive_primdatatypes"></a>Primitive Data Types
    +
    +To represent Hive data in HAWQ, map data values that use a primitive data type to HAWQ columns of the same type.
    +
    +The following table summarizes external mapping rules for Hive primitive types.
    +
    +| Hive Data Type  | Hawq Data Type |
    +|-------|---------------------------|
    +| boolean    | bool |
    +| int   | int4 |
    +| smallint   | int2 |
    +| tinyint   | int2 |
    +| bigint   | int8 |
    +| decimal  |  numeric  |
    +| float   | float4 |
    +| double   | float8 |
    +| string   | text |
    +| binary   | bytea |
    +| char   | bpchar |
    +| varchar   | varchar |
    +| timestamp   | timestamp |
    +| date   | date |
    +
    +
    +### <a id="topic_b4v_g3n_25"></a>Complex Data Types
    +
    +Hive supports complex data types including array, struct, map, and union. PXF maps each of these complex types to `text`.  While HAWQ does not natively support these types, you can create HAWQ functions or application code to extract subcomponents of these complex data types.
    +
    +An example using complex data types is provided later in this topic.
    +
    +
    +## <a id="hive_sampledataset"></a>Sample Data Set
    +
    +Examples used in this topic will operate on a common data set. This simple data set models a retail sales operation and includes fields with the following names and data types:
    +
    +- location - text
    +- month - text
    +- number\_of\_orders - integer
    +- total\_sales - double
    +
    +Prepare the sample data set for use:
    +
    +1. First, create a text file:
    +
    +    ```
    +    $ vi /tmp/pxf_hive_datafile.txt
    +    ```
    +
    +2. Add the following data to `pxf_hive_datafile.txt`; notice the use of the comma `,` to separate the four field values:
    +
    +    ```
    +    Prague,Jan,101,4875.33
    +    Rome,Mar,87,1557.39
    +    Bangalore,May,317,8936.99
    +    Beijing,Jul,411,11600.67
    +    San Francisco,Sept,156,6846.34
    +    Paris,Nov,159,7134.56
    +    San Francisco,Jan,113,5397.89
    +    Prague,Dec,333,9894.77
    +    Bangalore,Jul,271,8320.55
    +    Beijing,Dec,100,4248.41
    +    ```
    +
    +Make note of the path to `pxf_hive_datafile.txt`; you will use it in later exercises.
    +
    +
     ## <a id="hivecommandline"></a>Hive Command Line
     
    -To start the Hive command line and work directly on a Hive table:
    +The Hive command line is a subsystem similar to that of `psql`. To start the Hive command line:
     
     ``` shell
    -$ hive
    +$ HADOOP_USER_NAME=hdfs hive
     ```
     
    -Here is an example�of how to create and load data into�a sample Hive table from an existing file.
    +The default Hive database is named `default`. 
     
    -``` sql
    -hive> CREATE TABLE test (name string, type string, supplier_key int, full_price double) row format delimited fields terminated by ',';
    -hive> LOAD DATA local inpath '/local/path/data.txt' into table test;
    -```
    +### <a id="hivecommandline_createdb"></a>Example: Create a Hive Database
     
    -## <a id="topic_p2s_lvl_25"></a>Using PXF Tables to Query Hive
    +Create a Hive table to expose our sample data set.
     
    -Hive tables are defined in a specific way in PXF, regardless of the underlying file storage format. The PXF Hive plug-ins automatically detect source tables in the following formats:
    +1. Create a Hive table named `sales_info` in the `default` database:
     
    --   Text based
    --   SequenceFile
    --   RCFile
    --   ORCFile
    --   Parquet
    --   Avro
    +    ``` sql
    +    hive> CREATE TABLE sales_info (location string, month string,
    +            number_of_orders int, total_sales double)
    +            ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    +            STORED AS textfile;
    +    ```
     
    -The source table can also be a combination of these types. The PXF Hive plug-in�uses this information to query the data in runtime.
    +    Notice that:
     
    --   **[Syntax Example](../pxf/HivePXF.html#syntax2)**
    +    - The `STORED AS textfile` subclause instructs Hive to create the table in Textfile (the default) format.  Hive Textfile format supports comma-, tab-, and space-separated values, as well as data specified in JSON notation.
    +    - The `DELIMITED FIELDS TERMINATED BY` subclause identifies the field delimiter within a data record (line). The `sales_info` table field delimiter is a comma (`,`).
     
    --   **[Hive Complex Types](../pxf/HivePXF.html#topic_b4v_g3n_25)**
    +2. Load the `pxf_hive_datafile.txt` sample data file into the `sales_info` table you just created:
     
    -### <a id="syntax2"></a>Syntax Example
    +    ``` sql
    +    hive> LOAD DATA local INPATH '/tmp/pxf_hive_datafile.txt'
    +            INTO TABLE sales_info;
    +    ```
    +
    +3. Perform a query on `sales_info` to verify the data was loaded successfully:
    +
    +    ``` sql
    +    hive> SELECT * FROM sales_info;
    +    ```
     
    -The following�PXF table definition is valid for any Hive file storage type.
    +In examples later in this section, you will access the `sales_info` Hive table directly via PXF. You will also insert `sales_info` data into tables of other Hive file format types, and use PXF to access those directly as well.
    +
    +## <a id="topic_p2s_lvl_28"></a>Querying External Hive Data
    +
    +The PXF Hive plug-in supports several Hive-related profiles. These include `Hive`, `HiveText`, and `HiveRC`.
    +
    +Use the following syntax to create a HAWQ external table representing Hive data:
     
     ``` sql
    -CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name 
    -    ( column_name data_type [, ...] | LIKE other_table )
    -LOCATION ('pxf://namenode[:port]/hive-db-name.hive-table-name?<pxf parameters>[&custom-option=value...]')FORMAT 'CUSTOM' (formatter='pxfwritable_import')
    +CREATE EXTERNAL TABLE <table_name>
    +    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
    +LOCATION ('pxf://<host>[:<port>]/<hive-db-name>.<hive-table-name>
    +    ?PROFILE=Hive|HiveText|HiveRC[&DELIMITER=<delim>'])
    +FORMAT 'CUSTOM|TEXT' (formatter='pxfwritable_import' | delimiter='<delim>')
     ```
     
    -where `<pxf parameters>` is:
    +Hive-plug-in-specific keywords and values used in the [CREATE EXTERNAL TABLE](../reference/sql/CREATE-EXTERNAL-TABLE.html) call are described below.
     
    -``` pre
    -   FRAGMENTER=fragmenter_class&ACCESSOR=accessor_class&RESOLVER=resolver_class]
    - | PROFILE=profile-name
    -```
    +| Keyword  | Value |
    +|-------|-------------------------------------|
    +| \<host\>[:<port\>]    | The HDFS NameNode and port. |
    +| \<hive-db-name\>    | Name of the Hive database. If omitted, defaults to the Hive database named `default`. |
    +| \<hive-table-name\>    | Name of the Hive table. |
    +| PROFILE    | The `PROFILE` keyword must specify one of the values `Hive`, `HiveText`, or `HiveRC`. |
    +| DELIMITER    | The `DELIMITER` clause is required for both the `HiveText` and `HiveRC` profiles and identifies the field delimiter used in the Hive data set.  \<delim\> must be a single ascii character or specified in hexadecimal representation. |
    +| FORMAT (`Hive` profile)   | The `FORMAT` clause must specify `CUSTOM`. The `CUSTOM` format supports only the built-in `pxfwritable_import` `formatter`.   |
    +| FORMAT (`HiveText` and `HiveRC` profiles) | The `FORMAT` clause must specify `TEXT`. The `delimiter` must be specified a second time in '\<delim\>'. |
     
     
    -If `hive-db-name` is omitted, pxf will default to the Hive `default` database.
    +## <a id="profile_hive"></a>Hive Profile
     
    -**Note:** The port is the connection port for the PXF service. If the port is omitted, PXF assumes that High Availability (HA) is enabled and connects to the HA name service port, 51200 by default. The HA name service port can be changed by setting the pxf\_service\_port configuration parameter.
    +The `Hive` profile works with any Hive file format.
     
    -PXF has three built-in profiles for Hive tables:
    +### <a id="profile_hive_using"></a>Example: Using the Hive Profile
     
    --   Hive
    --   HiveRC
    --   HiveText
    +Use the `Hive` profile to create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier.
     
    -The Hive profile works with any Hive storage type. 
    -The following example creates a readable HAWQ external table representing a Hive table named `accessories` in the `inventory` Hive database using the PXF Hive profile:
    +1. Create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier:
     
    -``` shell
    -$ psql -d postgres
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hiveprofile(location text, month text, num_orders int, total_sales float8)
    +                LOCATION ('pxf://namenode:51200/default.sales_info?PROFILE=Hive')
    +              FORMAT 'custom' (formatter='pxfwritable_import');
    +    ```
    +
    +2. Query the table:
    +
    +    ``` sql
    +    postgres=# SELECT * FROM salesinfo_hiveprofile;
    +    ```
    +
    +    ``` shell
    +       location    | month | num_orders | total_sales
    +    ---------------+-------+------------+-------------
    +     Prague        | Jan   |        101 |     4875.33
    +     Rome          | Mar   |         87 |     1557.39
    +     Bangalore     | May   |        317 |     8936.99
    +     ...
    +
    +    ```
    +
    +## <a id="profile_hivetext"></a>HiveText Profile
    +
    +Use the `HiveText` profile to query text formats. The `HiveText` profile is more performant than the `Hive` profile.
    +
    +**Note**: When using the `HiveText` profile, you *must* specify a delimiter option in *both* the `LOCATION` and `FORMAT` clauses.
    +
    +### <a id="profile_hivetext_using"></a>Example: Using the HiveText Profile
    +
    +Use the PXF `HiveText` profile to create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier.
    +
    +1. Create the external table:
    +
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hivetextprofile(location text, month text, num_orders int, total_sales float8)
    +                 LOCATION ('pxf://namenode:51200/default.sales_info?PROFILE=HiveText&DELIMITER=\x2c')
    +               FORMAT 'TEXT' (delimiter=E',');
    +    ```
    +
    +    (You can safely ignore the "nonstandard use of escape in a string literal" warning and related messages.)
    +
    +    Notice that:
    +
    +    - The `LOCATION` subclause `DELIMITER` value is specified in hexadecimal format. `\x` is a prefix that instructs PXF to interpret the following characters as hexadecimal. `2c` is the hex value for the comma character.
    +    - The `FORMAT` subclause `delimiter` value is specified as the single ascii comma character ','. `E` escapes the character.
    +
    +2. Query the external table:
    +
    +    ``` sql
    +    postgres=# SELECT * FROM salesinfo_hivetextprofile where location="Beijing";
    +    ```
    +
    +    ``` shell
    +     location | month | num_orders | total_sales
    +    ----------+-------+------------+-------------
    +     Beijing  | Jul   |        411 |    11600.67
    +     Beijing  | Dec   |        100 |     4248.41
    +    (2 rows)
    +    ```
    +
    +## <a id="profile_hiverc"></a>HiveRC Profile
    +
    +The RCFile Hive format is used for row columnar formatted data. The `HiveRC` profile provides access to RCFile data.
    +
    +### <a id="profile_hiverc_rcfiletbl_using"></a>Example: Using the HiveRC Profile
    +
    +Use the `HiveRC` profile to query RCFile-formatted data in Hive tables. The `HiveRC` profile is more performant than the `Hive` profile for this file format type.
    +
    +1. Create a Hive table with RCFile format:
    +
    +    ``` shell
    +    $ HADOOP_USER_NAME=hdfs hive
    +    ```
    +
    +    ``` sql
    +    hive> CREATE TABLE sales_info_rcfile (location string, month string,
    +            number_of_orders int, total_sales double)
    +          ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    +          STORED AS rcfile;
    +    ```
    +
    +2. Insert the data from the `sales_info` table into `sales_info_rcfile`:
    +
    +    ``` sql
    +    hive> INSERT INTO TABLE sales_info_rcfile SELECT * FROM sales_info;
    +    ```
    +
    +    A copy of the sample data set is now stored in RCFile format in `sales_info_rcfile`. 
    +    
    +3. Perform a Hive query on `sales_info_rcfile` to verify the data was loaded successfully:
    +
    +    ``` sql
    +    hive> SELECT * FROM sales_info_rcfile;
    +    ```
    +
    +4. Use the PXF `HiveRC` profile to create a queryable HAWQ external table from the Hive `sales_info_rcfile` table created in the previous step. When using the `HiveRC` profile, you **must** specify a delimiter option in *both* the `LOCATION` and `FORMAT` clauses.:
    +
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hivercprofile(location text, month text, num_orders int, total_sales float8)
    +                 LOCATION ('pxf://namenode:51200/default.sales_info_rcfile?PROFILE=HiveRC&DELIMITER=\x2c')
    +               FORMAT 'TEXT' (delimiter=E',');
    +    ```
    +
    +    (Again, you can safely ignore the "nonstandard use of escape in a string literal" warning and related messages.)
    +
    +5. Query the external table:
    +
    +    ``` sql
    +    postgres=# SELECT location, total_sales FROM salesinfo_hivercprofile;
    +    ```
    +
    +    ``` shell
    +       location    | total_sales
    +    ---------------+-------------
    +     Prague        |     4875.33
    +     Rome          |     1557.39
    +     Bangalore     |     8936.99
    +     Beijing       |    11600.67
    +     ...
    +    ```
    +
    +## <a id="topic_dbb_nz3_ts"></a>Accessing Parquet-Format Hive Tables
    +
    +The PXF `Hive` profile supports both non-partitioned and partitioned Hive tables that use the Parquet storage format in HDFS. Simply map the table columns using equivalent HAWQ data types. For example, if a Hive table is created using:
    +
    +``` sql
    +hive> CREATE TABLE hive_parquet_table (fname string, lname string, custid int, acctbalance double)
    +        STORED AS parquet;
     ```
     
    +Define the HAWQ external table using:
    +
     ``` sql
    -postgres=# CREATE EXTERNAL TABLE hivetest(id int, newid int)
    -LOCATION ('pxf://namenode:51200/inventory.accessories?PROFILE=Hive')
    -FORMAT 'custom' (formatter='pxfwritable_import');
    +postgres=# CREATE EXTERNAL TABLE pxf_parquet_table (fname text, lname text, custid int, acctbalance double precision)
    +    LOCATION ('pxf://namenode:51200/hive-db-name.hive_parquet_table?profile=Hive')
    +    FORMAT 'CUSTOM' (formatter='pxfwritable_import');
    --- End diff --
    
    yes, this section is more of a theoretical example as the "flow" is similar to the other file format types.  the select statement would make it more consistent with the flow other sections.  i will add it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

Posted by lisakowen <gi...@git.apache.org>.

Github user lisakowen commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85403929
  
    --- Diff: pxf/HivePXF.html.md.erb ---
    @@ -2,121 +2,450 @@
     title: Accessing Hive Data
     ---
     
    -This topic describes how to access Hive data using PXF. You have several options for querying data stored in Hive. You can create external tables in PXF and then query those tables, or you can easily query Hive tables by using HAWQ and PXF's integration with HCatalog. HAWQ accesses Hive table metadata stored in HCatalog.
    +Apache Hive is a distributed data warehousing infrastructure.  Hive facilitates managing large data sets supporting multiple data formats, including comma-separated value (.csv), RC, ORC, and parquet. The PXF Hive plug-in reads data stored in Hive, as well as HDFS or HBase.
    +
    +This section describes how to use PXF to access Hive data. Options for querying data stored in Hive include:
    +
    +-  Creating an external table in PXF and querying that table
    +-  Querying Hive tables via PXF's integration with HCatalog
     
     ## <a id="installingthepxfhiveplugin"></a>Prerequisites
     
    -Check the following before using PXF to access Hive:
    +Before accessing Hive data with HAWQ and PXF, ensure that:
     
    --   The PXF HDFS plug-in is installed on all cluster nodes.
    +-   The PXF HDFS plug-in is installed on all cluster nodes. See [Installing PXF Plug-ins](InstallPXFPlugins.html) for PXF plug-in installation information.
     -   The PXF Hive plug-in is installed on all cluster nodes.
     -   The Hive JAR files and conf directory�are installed on all cluster nodes.
    --   Test PXF on HDFS before connecting to Hive or HBase.
    +-   You have tested PXF on HDFS.
     -   You are running the Hive Metastore service on a machine in your cluster.�
     -   You have set the `hive.metastore.uris`�property in the�`hive-site.xml` on the NameNode.
     
    +## <a id="topic_p2s_lvl_25"></a>Hive File Formats
    +
    +Hive supports several file formats:
    +
    +-   TextFile - flat file with data in comma-, tab-, or space-separated value format or JSON notation
    +-   SequenceFile - flat file consisting of binary key/value pairs
    +-   RCFile - record columnar data consisting of binary key/value pairs; high row compression rate
    +-   ORCFile - optimized row columnar data with stripe, footer, and postscript sections; reduces data size
    +-   Parquet - compressed columnar data representation
    +-   Avro - JSON-defined, schema-based data serialization format
    +
    +Refer to [File Formats](https://cwiki.apache.org/confluence/display/Hive/FileFormats) for detailed information about the file formats supported by Hive.
    +
    +The PXF Hive plug-in supports the following profiles for accessing the Hive file formats listed above. These include:
    +
    +- `Hive`
    +- `HiveText`
    +- `HiveRC`
    +
    +## <a id="topic_p2s_lvl_29"></a>Data Type Mapping
    +
    +### <a id="hive_primdatatypes"></a>Primitive Data Types
    +
    +To represent Hive data in HAWQ, map data values that use a primitive data type to HAWQ columns of the same type.
    +
    +The following table summarizes external mapping rules for Hive primitive types.
    +
    +| Hive Data Type  | Hawq Data Type |
    +|-------|---------------------------|
    +| boolean    | bool |
    +| int   | int4 |
    +| smallint   | int2 |
    +| tinyint   | int2 |
    +| bigint   | int8 |
    +| decimal  |  numeric  |
    +| float   | float4 |
    +| double   | float8 |
    +| string   | text |
    +| binary   | bytea |
    +| char   | bpchar |
    +| varchar   | varchar |
    +| timestamp   | timestamp |
    +| date   | date |
    +
    +
    +### <a id="topic_b4v_g3n_25"></a>Complex Data Types
    +
    +Hive supports complex data types including array, struct, map, and union. PXF maps each of these complex types to `text`.  While HAWQ does not natively support these types, you can create HAWQ functions or application code to extract subcomponents of these complex data types.
    +
    +An example using complex data types is provided later in this topic.
    +
    +
    +## <a id="hive_sampledataset"></a>Sample Data Set
    +
    +Examples used in this topic will operate on a common data set. This simple data set models a retail sales operation and includes fields with the following names and data types:
    +
    +- location - text
    +- month - text
    +- number\_of\_orders - integer
    +- total\_sales - double
    +
    +Prepare the sample data set for use:
    +
    +1. First, create a text file:
    +
    +    ```
    +    $ vi /tmp/pxf_hive_datafile.txt
    +    ```
    +
    +2. Add the following data to `pxf_hive_datafile.txt`; notice the use of the comma `,` to separate the four field values:
    +
    +    ```
    +    Prague,Jan,101,4875.33
    +    Rome,Mar,87,1557.39
    +    Bangalore,May,317,8936.99
    +    Beijing,Jul,411,11600.67
    +    San Francisco,Sept,156,6846.34
    +    Paris,Nov,159,7134.56
    +    San Francisco,Jan,113,5397.89
    +    Prague,Dec,333,9894.77
    +    Bangalore,Jul,271,8320.55
    +    Beijing,Dec,100,4248.41
    +    ```
    +
    +Make note of the path to `pxf_hive_datafile.txt`; you will use it in later exercises.
    +
    +
     ## <a id="hivecommandline"></a>Hive Command Line
     
    -To start the Hive command line and work directly on a Hive table:
    +The Hive command line is a subsystem similar to that of `psql`. To start the Hive command line:
     
     ``` shell
    -$ hive
    +$ HADOOP_USER_NAME=hdfs hive
     ```
     
    -Here is an example�of how to create and load data into�a sample Hive table from an existing file.
    +The default Hive database is named `default`. 
     
    -``` sql
    -hive> CREATE TABLE test (name string, type string, supplier_key int, full_price double) row format delimited fields terminated by ',';
    -hive> LOAD DATA local inpath '/local/path/data.txt' into table test;
    -```
    +### <a id="hivecommandline_createdb"></a>Example: Create a Hive Database
     
    -## <a id="topic_p2s_lvl_25"></a>Using PXF Tables to Query Hive
    +Create a Hive table to expose our sample data set.
     
    -Hive tables are defined in a specific way in PXF, regardless of the underlying file storage format. The PXF Hive plug-ins automatically detect source tables in the following formats:
    +1. Create a Hive table named `sales_info` in the `default` database:
     
    --   Text based
    --   SequenceFile
    --   RCFile
    --   ORCFile
    --   Parquet
    --   Avro
    +    ``` sql
    +    hive> CREATE TABLE sales_info (location string, month string,
    +            number_of_orders int, total_sales double)
    +            ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    +            STORED AS textfile;
    +    ```
     
    -The source table can also be a combination of these types. The PXF Hive plug-in�uses this information to query the data in runtime.
    +    Notice that:
     
    --   **[Syntax Example](../pxf/HivePXF.html#syntax2)**
    +    - The `STORED AS textfile` subclause instructs Hive to create the table in Textfile (the default) format.  Hive Textfile format supports comma-, tab-, and space-separated values, as well as data specified in JSON notation.
    +    - The `DELIMITED FIELDS TERMINATED BY` subclause identifies the field delimiter within a data record (line). The `sales_info` table field delimiter is a comma (`,`).
     
    --   **[Hive Complex Types](../pxf/HivePXF.html#topic_b4v_g3n_25)**
    +2. Load the `pxf_hive_datafile.txt` sample data file into the `sales_info` table you just created:
     
    -### <a id="syntax2"></a>Syntax Example
    +    ``` sql
    +    hive> LOAD DATA local INPATH '/tmp/pxf_hive_datafile.txt'
    +            INTO TABLE sales_info;
    +    ```
    +
    +3. Perform a query on `sales_info` to verify the data was loaded successfully:
    +
    +    ``` sql
    +    hive> SELECT * FROM sales_info;
    +    ```
     
    -The following�PXF table definition is valid for any Hive file storage type.
    +In examples later in this section, you will access the `sales_info` Hive table directly via PXF. You will also insert `sales_info` data into tables of other Hive file format types, and use PXF to access those directly as well.
    +
    +## <a id="topic_p2s_lvl_28"></a>Querying External Hive Data
    +
    +The PXF Hive plug-in supports several Hive-related profiles. These include `Hive`, `HiveText`, and `HiveRC`.
    +
    +Use the following syntax to create a HAWQ external table representing Hive data:
     
     ``` sql
    -CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name 
    -    ( column_name data_type [, ...] | LIKE other_table )
    -LOCATION ('pxf://namenode[:port]/hive-db-name.hive-table-name?<pxf parameters>[&custom-option=value...]')FORMAT 'CUSTOM' (formatter='pxfwritable_import')
    +CREATE EXTERNAL TABLE <table_name>
    +    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
    +LOCATION ('pxf://<host>[:<port>]/<hive-db-name>.<hive-table-name>
    +    ?PROFILE=Hive|HiveText|HiveRC[&DELIMITER=<delim>'])
    +FORMAT 'CUSTOM|TEXT' (formatter='pxfwritable_import' | delimiter='<delim>')
     ```
     
    -where `<pxf parameters>` is:
    +Hive-plug-in-specific keywords and values used in the [CREATE EXTERNAL TABLE](../reference/sql/CREATE-EXTERNAL-TABLE.html) call are described below.
     
    -``` pre
    -   FRAGMENTER=fragmenter_class&ACCESSOR=accessor_class&RESOLVER=resolver_class]
    - | PROFILE=profile-name
    -```
    +| Keyword  | Value |
    +|-------|-------------------------------------|
    +| \<host\>[:<port\>]    | The HDFS NameNode and port. |
    +| \<hive-db-name\>    | Name of the Hive database. If omitted, defaults to the Hive database named `default`. |
    +| \<hive-table-name\>    | Name of the Hive table. |
    +| PROFILE    | The `PROFILE` keyword must specify one of the values `Hive`, `HiveText`, or `HiveRC`. |
    +| DELIMITER    | The `DELIMITER` clause is required for both the `HiveText` and `HiveRC` profiles and identifies the field delimiter used in the Hive data set.  \<delim\> must be a single ascii character or specified in hexadecimal representation. |
    +| FORMAT (`Hive` profile)   | The `FORMAT` clause must specify `CUSTOM`. The `CUSTOM` format supports only the built-in `pxfwritable_import` `formatter`.   |
    +| FORMAT (`HiveText` and `HiveRC` profiles) | The `FORMAT` clause must specify `TEXT`. The `delimiter` must be specified a second time in '\<delim\>'. |
     
     
    -If `hive-db-name` is omitted, pxf will default to the Hive `default` database.
    +## <a id="profile_hive"></a>Hive Profile
     
    -**Note:** The port is the connection port for the PXF service. If the port is omitted, PXF assumes that High Availability (HA) is enabled and connects to the HA name service port, 51200 by default. The HA name service port can be changed by setting the pxf\_service\_port configuration parameter.
    +The `Hive` profile works with any Hive file format.
     
    -PXF has three built-in profiles for Hive tables:
    +### <a id="profile_hive_using"></a>Example: Using the Hive Profile
     
    --   Hive
    --   HiveRC
    --   HiveText
    +Use the `Hive` profile to create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier.
     
    -The Hive profile works with any Hive storage type. 
    -The following example creates a readable HAWQ external table representing a Hive table named `accessories` in the `inventory` Hive database using the PXF Hive profile:
    +1. Create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier:
     
    -``` shell
    -$ psql -d postgres
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hiveprofile(location text, month text, num_orders int, total_sales float8)
    +                LOCATION ('pxf://namenode:51200/default.sales_info?PROFILE=Hive')
    +              FORMAT 'custom' (formatter='pxfwritable_import');
    +    ```
    +
    +2. Query the table:
    +
    +    ``` sql
    +    postgres=# SELECT * FROM salesinfo_hiveprofile;
    +    ```
    +
    +    ``` shell
    +       location    | month | num_orders | total_sales
    +    ---------------+-------+------------+-------------
    +     Prague        | Jan   |        101 |     4875.33
    +     Rome          | Mar   |         87 |     1557.39
    +     Bangalore     | May   |        317 |     8936.99
    +     ...
    +
    +    ```
    +
    +## <a id="profile_hivetext"></a>HiveText Profile
    +
    +Use the `HiveText` profile to query text formats. The `HiveText` profile is more performant than the `Hive` profile.
    +
    +**Note**: When using the `HiveText` profile, you *must* specify a delimiter option in *both* the `LOCATION` and `FORMAT` clauses.
    +
    +### <a id="profile_hivetext_using"></a>Example: Using the HiveText Profile
    +
    +Use the PXF `HiveText` profile to create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier.
    +
    +1. Create the external table:
    +
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hivetextprofile(location text, month text, num_orders int, total_sales float8)
    +                 LOCATION ('pxf://namenode:51200/default.sales_info?PROFILE=HiveText&DELIMITER=\x2c')
    +               FORMAT 'TEXT' (delimiter=E',');
    +    ```
    +
    +    (You can safely ignore the "nonstandard use of escape in a string literal" warning and related messages.)
    +
    +    Notice that:
    +
    +    - The `LOCATION` subclause `DELIMITER` value is specified in hexadecimal format. `\x` is a prefix that instructs PXF to interpret the following characters as hexadecimal. `2c` is the hex value for the comma character.
    +    - The `FORMAT` subclause `delimiter` value is specified as the single ascii comma character ','. `E` escapes the character.
    +
    +2. Query the external table:
    +
    +    ``` sql
    +    postgres=# SELECT * FROM salesinfo_hivetextprofile where location="Beijing";
    +    ```
    +
    +    ``` shell
    +     location | month | num_orders | total_sales
    +    ----------+-------+------------+-------------
    +     Beijing  | Jul   |        411 |    11600.67
    +     Beijing  | Dec   |        100 |     4248.41
    +    (2 rows)
    +    ```
    +
    +## <a id="profile_hiverc"></a>HiveRC Profile
    +
    +The RCFile Hive format is used for row columnar formatted data. The `HiveRC` profile provides access to RCFile data.
    +
    +### <a id="profile_hiverc_rcfiletbl_using"></a>Example: Using the HiveRC Profile
    +
    +Use the `HiveRC` profile to query RCFile-formatted data in Hive tables. The `HiveRC` profile is more performant than the `Hive` profile for this file format type.
    +
    +1. Create a Hive table with RCFile format:
    +
    +    ``` shell
    +    $ HADOOP_USER_NAME=hdfs hive
    +    ```
    +
    +    ``` sql
    +    hive> CREATE TABLE sales_info_rcfile (location string, month string,
    +            number_of_orders int, total_sales double)
    +          ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    +          STORED AS rcfile;
    +    ```
    +
    +2. Insert the data from the `sales_info` table into `sales_info_rcfile`:
    +
    +    ``` sql
    +    hive> INSERT INTO TABLE sales_info_rcfile SELECT * FROM sales_info;
    +    ```
    +
    +    A copy of the sample data set is now stored in RCFile format in `sales_info_rcfile`. 
    +    
    +3. Perform a Hive query on `sales_info_rcfile` to verify the data was loaded successfully:
    +
    +    ``` sql
    +    hive> SELECT * FROM sales_info_rcfile;
    +    ```
    +
    +4. Use the PXF `HiveRC` profile to create a queryable HAWQ external table from the Hive `sales_info_rcfile` table created in the previous step. When using the `HiveRC` profile, you **must** specify a delimiter option in *both* the `LOCATION` and `FORMAT` clauses.:
    +
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hivercprofile(location text, month text, num_orders int, total_sales float8)
    +                 LOCATION ('pxf://namenode:51200/default.sales_info_rcfile?PROFILE=HiveRC&DELIMITER=\x2c')
    +               FORMAT 'TEXT' (delimiter=E',');
    +    ```
    +
    +    (Again, you can safely ignore the "nonstandard use of escape in a string literal" warning and related messages.)
    +
    +5. Query the external table:
    +
    +    ``` sql
    +    postgres=# SELECT location, total_sales FROM salesinfo_hivercprofile;
    +    ```
    +
    +    ``` shell
    +       location    | total_sales
    +    ---------------+-------------
    +     Prague        |     4875.33
    +     Rome          |     1557.39
    +     Bangalore     |     8936.99
    +     Beijing       |    11600.67
    +     ...
    +    ```
    +
    +## <a id="topic_dbb_nz3_ts"></a>Accessing Parquet-Format Hive Tables
    +
    +The PXF `Hive` profile supports both non-partitioned and partitioned Hive tables that use the Parquet storage format in HDFS. Simply map the table columns using equivalent HAWQ data types. For example, if a Hive table is created using:
    +
    +``` sql
    +hive> CREATE TABLE hive_parquet_table (fname string, lname string, custid int, acctbalance double)
    +        STORED AS parquet;
     ```
     
    +Define the HAWQ external table using:
    +
     ``` sql
    -postgres=# CREATE EXTERNAL TABLE hivetest(id int, newid int)
    -LOCATION ('pxf://namenode:51200/inventory.accessories?PROFILE=Hive')
    -FORMAT 'custom' (formatter='pxfwritable_import');
    +postgres=# CREATE EXTERNAL TABLE pxf_parquet_table (fname text, lname text, custid int, acctbalance double precision)
    +    LOCATION ('pxf://namenode:51200/hive-db-name.hive_parquet_table?profile=Hive')
    +    FORMAT 'CUSTOM' (formatter='pxfwritable_import');
     ```
     
    +## <a id="profileperf"></a>Profile Performance Considerations
     
    -Use HiveRC and HiveText to query RC and Text formats respectively. The HiveRC and HiveText profiles are faster than the generic Hive profile. When using the HiveRC and HiveText profiles, you must specify a DELIMITER option in the LOCATION clause. See [Using Profiles to Read and Write Data](ReadWritePXF.html#readingandwritingdatawithpxf) for more information on profiles.
    +The `HiveRC` and `HiveText` profiles are faster than the generic `Hive` profile.
     
    +?? MORE HERE. ??
     
    -### <a id="topic_b4v_g3n_25"></a>Hive Complex Types
    +## <a id="complex_dt_example"></a>Complex Data Type Example
     
    -PXF tables support Hive data types that are not primitive types. The supported Hive complex data types are array, struct, map, and union. This Hive `CREATE TABLE` statement, for example, creates a table with each of these complex data types:
    +This example will employ the array and map complex types, specifically an array of integers and a string key/value pair map.
     
    -``` sql
    -hive> CREATE TABLE sales_collections (
    -  item STRING,
    -  price FLOAT,
    -  properties ARRAY<STRING>,
    -  hash MAP<STRING,FLOAT>,
    -  delivery_address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>,
    -  others UNIONTYPE<FLOAT, BOOLEAN, STRING>
    -)  
    -ROW FORMAT DELIMITED FIELDS
    -TERMINATED BY '\001' COLLECTION ITEMS TERMINATED BY '\002' MAP KEYS TERMINATED BY '\003' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
    -hive> LOAD DATA LOCAL INPATH '/local/path/<some data file>' INTO TABLE sales_collection;
    -```
    +The example data set includes fields with the following names and data types:
    +
    +- index - int
    +- name - string
    +- intarray - array of integers
    +- propmap - map of string key and value pairs
    +
    +When specifying an array field in a Hive table, you must identify the terminator for each item in the collection. Similarly, the map key termination character must also be specified.
    +
    +1. Create a text file from which you will load the data set:
    +
    +    ```
    +    $ vi /tmp/pxf_hive_complex.txt
    +    ```
    +
    +2. Add the following data to `pxf_hive_complex.txt`.  The data uses a comma `,` to separate field values, the percent symbol `%` to separate collection items, and a `:` to terminate map key values:
    +
    +    ```
    +    3,Prague,1%2%3,zone:euro%status:up
    +    89,Rome,4%5%6,zone:euro
    +    400,Bangalore,7%8%9,zone:apac%status:pending
    +    183,Beijing,0%1%2,zone:apac
    +    94,Sacramento,3%4%5,zone:noam%status:down
    +    101,Paris,6%7%8,zone:euro%status:up
    +    56,Frankfurt,9%0%1,zone:euro
    +    202,Jakarta,2%3%4,zone:apac%status:up
    +    313,Sydney,5%6%7,zone:apac%status:pending
    +    76,Atlanta,8%9%0,zone:noam%status:down
    +    ```
    +
    +3. Create a Hive table to represent this data:
    +
    +    ``` shell
    +    $ HADOOP_USER_NAME=hdfs hive
    +    ```
    +
    +    ``` sql
    +    hive> CREATE TABLE table_complextypes( index int, name string, intarray ARRAY<int>, propmap MAP<string, string>)
    +             ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    +             COLLECTION ITEMS TERMINATED BY '%'
    +             MAP KEYS TERMINATED BY ':'
    +             STORED AS TEXTFILE;
    +    ```
    +
    +    Notice that:
    +
    +    - `FIELDS TERMINATED BY` identifies a comma as the field terminator.
    +    - The `COLLECTION ITEMS TERMINATED BY` subclause specifies the percent sign as the collection items (array item, map key/value pair) terminator.
    +    - `MAP KEYS TERMINATED BY` identifies a colon as the terminator for map keys.
    +
    +4. Load the `pxf_hive_complex.txt` sample data file into the `table_complextypes` table you just created:
    +
    +    ``` sql
    +    hive> LOAD DATA local INPATH '/tmp/pxf_hive_complex.txt' INTO TABLE table_complextypes;
    +    ```
    +
    +5. Perform a query on Hive table `table_complextypes` to verify the data was loaded successfully:
    +
    +    ``` sql
    +    hive> SELECT * FROM table_complextypes;
    +    ```
    +
    +    ``` shell
    +    3	Prague	[1,2,3]	{"zone":"euro","status":"up"}
    +    89	Rome	[4,5,6]	{"zone":"euro"}
    +    400	Bangalore	[7,8,9]	{"zone":"apac","status":"pending"}
    +    ...
    +    ```
    +
    +6. Use the PXF `Hive` profile to create a queryable HAWQ external table representing the Hive `table_complextypes`:
     
    -You can use HAWQ functions or application code to extract the components of the complex data columns as needed.
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE complextypes_hiveprofile(index int, name text, intarray text, propmap text)
    +                 LOCATION ('pxf://namenode:51200/table_complextypes?PROFILE=Hive')
    +               FORMAT 'CUSTOM' (formatter='pxfwritable_import');
    +    ```
    +
    +    Notice that the integer array and map complex types are mapped to type text.
    +
    +7. Query the external table:
    +
    +    ``` sql
    +    postgres=# SELECT * FROM complextypes_hiveprofile;
    +    ```
    +
    +    ``` shell     
    +     index |    name    | intarray |              propmap
    +    -------+------------+----------+------------------------------------
    +         3 | Prague     | [1,2,3]  | {"zone":"euro","status":"up"}
    +        89 | Rome       | [4,5,6]  | {"zone":"euro"}
    +       400 | Bangalore  | [7,8,9]  | {"zone":"apac","status":"pending"}
    +       183 | Beijing    | [0,1,2]  | {"zone":"apac"}
    +        94 | Sacramento | [3,4,5]  | {"zone":"noam","status":"down"}
    +       101 | Paris      | [6,7,8]  | {"zone":"euro","status":"up"}
    +        56 | Frankfurt  | [9,0,1]  | {"zone":"euro"}
    +       202 | Jakarta    | [2,3,4]  | {"zone":"apac","status":"up"}
    +       313 | Sydney     | [5,6,7]  | {"zone":"apac","status":"pending"}
    +        76 | Atlanta    | [8,9,0]  | {"zone":"noam","status":"down"}
    +    (10 rows)
    +    ```
    +
    +    `intarray` and `propmap` are each text strings.
     
     ## <a id="hcatalog"></a>Using PXF and HCatalog to Query Hive
    --- End diff --
    
    true, but it is less PXF-related.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

Posted by dyozie <gi...@git.apache.org>.

Github user dyozie commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85370681
  
    --- Diff: pxf/HivePXF.html.md.erb ---
    @@ -151,184 +477,120 @@ To enable HCatalog query integration in HAWQ, perform the following steps:
         postgres=# GRANT ALL ON PROTOCOL pxf TO "role";
         ``` 
         
    -3.  To query a Hive table with HCatalog integration, simply query HCatalog directly from HAWQ. The query syntax is:
     
    -    ``` sql
    -    postgres=# SELECT * FROM hcatalog.hive-db-name.hive-table-name;
    -    ```
    +To query a Hive table with HCatalog integration, query HCatalog directly from HAWQ. The query syntax is:
    --- End diff --
    
    It's a bit awkward to drop out of the procedure and into free-form discussion of the various operations.  I think it might be better to put the previous 3-step procedure into a new subsection like "Enabling HCatalog Integration" and then putting the remaining non-procedural content into "Usage" ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

Posted by dyozie <gi...@git.apache.org>.

Github user dyozie commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85372086
  
    --- Diff: pxf/HivePXF.html.md.erb ---
    @@ -2,121 +2,450 @@
     title: Accessing Hive Data
     ---
     
    -This topic describes how to access Hive data using PXF. You have several options for querying data stored in Hive. You can create external tables in PXF and then query those tables, or you can easily query Hive tables by using HAWQ and PXF's integration with HCatalog. HAWQ accesses Hive table metadata stored in HCatalog.
    +Apache Hive is a distributed data warehousing infrastructure.  Hive facilitates managing large data sets supporting multiple data formats, including comma-separated value (.csv), RC, ORC, and parquet. The PXF Hive plug-in reads data stored in Hive, as well as HDFS or HBase.
    +
    +This section describes how to use PXF to access Hive data. Options for querying data stored in Hive include:
    +
    +-  Creating an external table in PXF and querying that table
    +-  Querying Hive tables via PXF's integration with HCatalog
     
     ## <a id="installingthepxfhiveplugin"></a>Prerequisites
     
    -Check the following before using PXF to access Hive:
    +Before accessing Hive data with HAWQ and PXF, ensure that:
     
    --   The PXF HDFS plug-in is installed on all cluster nodes.
    +-   The PXF HDFS plug-in is installed on all cluster nodes. See [Installing PXF Plug-ins](InstallPXFPlugins.html) for PXF plug-in installation information.
     -   The PXF Hive plug-in is installed on all cluster nodes.
     -   The Hive JAR files and conf directory�are installed on all cluster nodes.
    --   Test PXF on HDFS before connecting to Hive or HBase.
    +-   You have tested PXF on HDFS.
     -   You are running the Hive Metastore service on a machine in your cluster.�
     -   You have set the `hive.metastore.uris`�property in the�`hive-site.xml` on the NameNode.
     
    +## <a id="topic_p2s_lvl_25"></a>Hive File Formats
    +
    +Hive supports several file formats:
    +
    +-   TextFile - flat file with data in comma-, tab-, or space-separated value format or JSON notation
    +-   SequenceFile - flat file consisting of binary key/value pairs
    +-   RCFile - record columnar data consisting of binary key/value pairs; high row compression rate
    +-   ORCFile - optimized row columnar data with stripe, footer, and postscript sections; reduces data size
    +-   Parquet - compressed columnar data representation
    +-   Avro - JSON-defined, schema-based data serialization format
    +
    +Refer to [File Formats](https://cwiki.apache.org/confluence/display/Hive/FileFormats) for detailed information about the file formats supported by Hive.
    +
    +The PXF Hive plug-in supports the following profiles for accessing the Hive file formats listed above. These include:
    +
    +- `Hive`
    +- `HiveText`
    +- `HiveRC`
    +
    +## <a id="topic_p2s_lvl_29"></a>Data Type Mapping
    +
    +### <a id="hive_primdatatypes"></a>Primitive Data Types
    +
    +To represent Hive data in HAWQ, map data values that use a primitive data type to HAWQ columns of the same type.
    +
    +The following table summarizes external mapping rules for Hive primitive types.
    +
    +| Hive Data Type  | Hawq Data Type |
    +|-------|---------------------------|
    +| boolean    | bool |
    +| int   | int4 |
    +| smallint   | int2 |
    +| tinyint   | int2 |
    +| bigint   | int8 |
    +| decimal  |  numeric  |
    +| float   | float4 |
    +| double   | float8 |
    +| string   | text |
    +| binary   | bytea |
    +| char   | bpchar |
    +| varchar   | varchar |
    +| timestamp   | timestamp |
    +| date   | date |
    +
    +
    +### <a id="topic_b4v_g3n_25"></a>Complex Data Types
    +
    +Hive supports complex data types including array, struct, map, and union. PXF maps each of these complex types to `text`.  While HAWQ does not natively support these types, you can create HAWQ functions or application code to extract subcomponents of these complex data types.
    +
    +An example using complex data types is provided later in this topic.
    +
    +
    +## <a id="hive_sampledataset"></a>Sample Data Set
    +
    +Examples used in this topic will operate on a common data set. This simple data set models a retail sales operation and includes fields with the following names and data types:
    +
    +- location - text
    +- month - text
    +- number\_of\_orders - integer
    +- total\_sales - double
    +
    +Prepare the sample data set for use:
    +
    +1. First, create a text file:
    +
    +    ```
    +    $ vi /tmp/pxf_hive_datafile.txt
    +    ```
    +
    +2. Add the following data to `pxf_hive_datafile.txt`; notice the use of the comma `,` to separate the four field values:
    +
    +    ```
    +    Prague,Jan,101,4875.33
    +    Rome,Mar,87,1557.39
    +    Bangalore,May,317,8936.99
    +    Beijing,Jul,411,11600.67
    +    San Francisco,Sept,156,6846.34
    +    Paris,Nov,159,7134.56
    +    San Francisco,Jan,113,5397.89
    +    Prague,Dec,333,9894.77
    +    Bangalore,Jul,271,8320.55
    +    Beijing,Dec,100,4248.41
    +    ```
    +
    +Make note of the path to `pxf_hive_datafile.txt`; you will use it in later exercises.
    +
    +
     ## <a id="hivecommandline"></a>Hive Command Line
     
    -To start the Hive command line and work directly on a Hive table:
    +The Hive command line is a subsystem similar to that of `psql`. To start the Hive command line:
     
     ``` shell
    -$ hive
    +$ HADOOP_USER_NAME=hdfs hive
     ```
     
    -Here is an example�of how to create and load data into�a sample Hive table from an existing file.
    +The default Hive database is named `default`. 
     
    -``` sql
    -hive> CREATE TABLE test (name string, type string, supplier_key int, full_price double) row format delimited fields terminated by ',';
    -hive> LOAD DATA local inpath '/local/path/data.txt' into table test;
    -```
    +### <a id="hivecommandline_createdb"></a>Example: Create a Hive Database
     
    -## <a id="topic_p2s_lvl_25"></a>Using PXF Tables to Query Hive
    +Create a Hive table to expose our sample data set.
     
    -Hive tables are defined in a specific way in PXF, regardless of the underlying file storage format. The PXF Hive plug-ins automatically detect source tables in the following formats:
    +1. Create a Hive table named `sales_info` in the `default` database:
     
    --   Text based
    --   SequenceFile
    --   RCFile
    --   ORCFile
    --   Parquet
    --   Avro
    +    ``` sql
    +    hive> CREATE TABLE sales_info (location string, month string,
    +            number_of_orders int, total_sales double)
    +            ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    +            STORED AS textfile;
    +    ```
     
    -The source table can also be a combination of these types. The PXF Hive plug-in�uses this information to query the data in runtime.
    +    Notice that:
     
    --   **[Syntax Example](../pxf/HivePXF.html#syntax2)**
    +    - The `STORED AS textfile` subclause instructs Hive to create the table in Textfile (the default) format.  Hive Textfile format supports comma-, tab-, and space-separated values, as well as data specified in JSON notation.
    +    - The `DELIMITED FIELDS TERMINATED BY` subclause identifies the field delimiter within a data record (line). The `sales_info` table field delimiter is a comma (`,`).
     
    --   **[Hive Complex Types](../pxf/HivePXF.html#topic_b4v_g3n_25)**
    +2. Load the `pxf_hive_datafile.txt` sample data file into the `sales_info` table you just created:
     
    -### <a id="syntax2"></a>Syntax Example
    +    ``` sql
    +    hive> LOAD DATA local INPATH '/tmp/pxf_hive_datafile.txt'
    +            INTO TABLE sales_info;
    +    ```
    +
    +3. Perform a query on `sales_info` to verify the data was loaded successfully:
    --- End diff --
    
    Global edit to change "verify the data" -> "verify that the data"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

Posted by dyozie <gi...@git.apache.org>.

Github user dyozie commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85368752
  
    --- Diff: pxf/HivePXF.html.md.erb ---
    @@ -2,121 +2,450 @@
     title: Accessing Hive Data
     ---
     
    -This topic describes how to access Hive data using PXF. You have several options for querying data stored in Hive. You can create external tables in PXF and then query those tables, or you can easily query Hive tables by using HAWQ and PXF's integration with HCatalog. HAWQ accesses Hive table metadata stored in HCatalog.
    +Apache Hive is a distributed data warehousing infrastructure.  Hive facilitates managing large data sets supporting multiple data formats, including comma-separated value (.csv), RC, ORC, and parquet. The PXF Hive plug-in reads data stored in Hive, as well as HDFS or HBase.
    +
    +This section describes how to use PXF to access Hive data. Options for querying data stored in Hive include:
    +
    +-  Creating an external table in PXF and querying that table
    +-  Querying Hive tables via PXF's integration with HCatalog
     
     ## <a id="installingthepxfhiveplugin"></a>Prerequisites
     
    -Check the following before using PXF to access Hive:
    +Before accessing Hive data with HAWQ and PXF, ensure that:
     
    --   The PXF HDFS plug-in is installed on all cluster nodes.
    +-   The PXF HDFS plug-in is installed on all cluster nodes. See [Installing PXF Plug-ins](InstallPXFPlugins.html) for PXF plug-in installation information.
     -   The PXF Hive plug-in is installed on all cluster nodes.
     -   The Hive JAR files and conf directory�are installed on all cluster nodes.
    --   Test PXF on HDFS before connecting to Hive or HBase.
    +-   You have tested PXF on HDFS.
     -   You are running the Hive Metastore service on a machine in your cluster.�
     -   You have set the `hive.metastore.uris`�property in the�`hive-site.xml` on the NameNode.
     
    +## <a id="topic_p2s_lvl_25"></a>Hive File Formats
    +
    +Hive supports several file formats:
    +
    +-   TextFile - flat file with data in comma-, tab-, or space-separated value format or JSON notation
    +-   SequenceFile - flat file consisting of binary key/value pairs
    +-   RCFile - record columnar data consisting of binary key/value pairs; high row compression rate
    +-   ORCFile - optimized row columnar data with stripe, footer, and postscript sections; reduces data size
    +-   Parquet - compressed columnar data representation
    +-   Avro - JSON-defined, schema-based data serialization format
    +
    +Refer to [File Formats](https://cwiki.apache.org/confluence/display/Hive/FileFormats) for detailed information about the file formats supported by Hive.
    +
    +The PXF Hive plug-in supports the following profiles for accessing the Hive file formats listed above. These include:
    +
    +- `Hive`
    +- `HiveText`
    +- `HiveRC`
    +
    +## <a id="topic_p2s_lvl_29"></a>Data Type Mapping
    +
    +### <a id="hive_primdatatypes"></a>Primitive Data Types
    +
    +To represent Hive data in HAWQ, map data values that use a primitive data type to HAWQ columns of the same type.
    +
    +The following table summarizes external mapping rules for Hive primitive types.
    +
    +| Hive Data Type  | Hawq Data Type |
    +|-------|---------------------------|
    +| boolean    | bool |
    +| int   | int4 |
    +| smallint   | int2 |
    +| tinyint   | int2 |
    +| bigint   | int8 |
    +| decimal  |  numeric  |
    +| float   | float4 |
    +| double   | float8 |
    +| string   | text |
    +| binary   | bytea |
    +| char   | bpchar |
    +| varchar   | varchar |
    +| timestamp   | timestamp |
    +| date   | date |
    +
    +
    +### <a id="topic_b4v_g3n_25"></a>Complex Data Types
    +
    +Hive supports complex data types including array, struct, map, and union. PXF maps each of these complex types to `text`.  While HAWQ does not natively support these types, you can create HAWQ functions or application code to extract subcomponents of these complex data types.
    +
    +An example using complex data types is provided later in this topic.
    +
    +
    +## <a id="hive_sampledataset"></a>Sample Data Set
    +
    +Examples used in this topic will operate on a common data set. This simple data set models a retail sales operation and includes fields with the following names and data types:
    +
    +- location - text
    +- month - text
    +- number\_of\_orders - integer
    +- total\_sales - double
    +
    +Prepare the sample data set for use:
    +
    +1. First, create a text file:
    +
    +    ```
    +    $ vi /tmp/pxf_hive_datafile.txt
    +    ```
    +
    +2. Add the following data to `pxf_hive_datafile.txt`; notice the use of the comma `,` to separate the four field values:
    +
    +    ```
    +    Prague,Jan,101,4875.33
    +    Rome,Mar,87,1557.39
    +    Bangalore,May,317,8936.99
    +    Beijing,Jul,411,11600.67
    +    San Francisco,Sept,156,6846.34
    +    Paris,Nov,159,7134.56
    +    San Francisco,Jan,113,5397.89
    +    Prague,Dec,333,9894.77
    +    Bangalore,Jul,271,8320.55
    +    Beijing,Dec,100,4248.41
    +    ```
    +
    +Make note of the path to `pxf_hive_datafile.txt`; you will use it in later exercises.
    +
    +
     ## <a id="hivecommandline"></a>Hive Command Line
     
    -To start the Hive command line and work directly on a Hive table:
    +The Hive command line is a subsystem similar to that of `psql`. To start the Hive command line:
     
     ``` shell
    -$ hive
    +$ HADOOP_USER_NAME=hdfs hive
     ```
     
    -Here is an example�of how to create and load data into�a sample Hive table from an existing file.
    +The default Hive database is named `default`. 
     
    -``` sql
    -hive> CREATE TABLE test (name string, type string, supplier_key int, full_price double) row format delimited fields terminated by ',';
    -hive> LOAD DATA local inpath '/local/path/data.txt' into table test;
    -```
    +### <a id="hivecommandline_createdb"></a>Example: Create a Hive Database
     
    -## <a id="topic_p2s_lvl_25"></a>Using PXF Tables to Query Hive
    +Create a Hive table to expose our sample data set.
     
    -Hive tables are defined in a specific way in PXF, regardless of the underlying file storage format. The PXF Hive plug-ins automatically detect source tables in the following formats:
    +1. Create a Hive table named `sales_info` in the `default` database:
     
    --   Text based
    --   SequenceFile
    --   RCFile
    --   ORCFile
    --   Parquet
    --   Avro
    +    ``` sql
    +    hive> CREATE TABLE sales_info (location string, month string,
    +            number_of_orders int, total_sales double)
    +            ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    +            STORED AS textfile;
    +    ```
     
    -The source table can also be a combination of these types. The PXF Hive plug-in�uses this information to query the data in runtime.
    +    Notice that:
     
    --   **[Syntax Example](../pxf/HivePXF.html#syntax2)**
    +    - The `STORED AS textfile` subclause instructs Hive to create the table in Textfile (the default) format.  Hive Textfile format supports comma-, tab-, and space-separated values, as well as data specified in JSON notation.
    +    - The `DELIMITED FIELDS TERMINATED BY` subclause identifies the field delimiter within a data record (line). The `sales_info` table field delimiter is a comma (`,`).
     
    --   **[Hive Complex Types](../pxf/HivePXF.html#topic_b4v_g3n_25)**
    +2. Load the `pxf_hive_datafile.txt` sample data file into the `sales_info` table you just created:
     
    -### <a id="syntax2"></a>Syntax Example
    +    ``` sql
    +    hive> LOAD DATA local INPATH '/tmp/pxf_hive_datafile.txt'
    +            INTO TABLE sales_info;
    +    ```
    +
    +3. Perform a query on `sales_info` to verify the data was loaded successfully:
    +
    +    ``` sql
    +    hive> SELECT * FROM sales_info;
    +    ```
     
    -The following�PXF table definition is valid for any Hive file storage type.
    +In examples later in this section, you will access the `sales_info` Hive table directly via PXF. You will also insert `sales_info` data into tables of other Hive file format types, and use PXF to access those directly as well.
    +
    +## <a id="topic_p2s_lvl_28"></a>Querying External Hive Data
    +
    +The PXF Hive plug-in supports several Hive-related profiles. These include `Hive`, `HiveText`, and `HiveRC`.
    +
    +Use the following syntax to create a HAWQ external table representing Hive data:
     
     ``` sql
    -CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name 
    -    ( column_name data_type [, ...] | LIKE other_table )
    -LOCATION ('pxf://namenode[:port]/hive-db-name.hive-table-name?<pxf parameters>[&custom-option=value...]')FORMAT 'CUSTOM' (formatter='pxfwritable_import')
    +CREATE EXTERNAL TABLE <table_name>
    +    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
    +LOCATION ('pxf://<host>[:<port>]/<hive-db-name>.<hive-table-name>
    +    ?PROFILE=Hive|HiveText|HiveRC[&DELIMITER=<delim>'])
    +FORMAT 'CUSTOM|TEXT' (formatter='pxfwritable_import' | delimiter='<delim>')
     ```
     
    -where `<pxf parameters>` is:
    +Hive-plug-in-specific keywords and values used in the [CREATE EXTERNAL TABLE](../reference/sql/CREATE-EXTERNAL-TABLE.html) call are described below.
     
    -``` pre
    -   FRAGMENTER=fragmenter_class&ACCESSOR=accessor_class&RESOLVER=resolver_class]
    - | PROFILE=profile-name
    -```
    +| Keyword  | Value |
    +|-------|-------------------------------------|
    +| \<host\>[:<port\>]    | The HDFS NameNode and port. |
    +| \<hive-db-name\>    | Name of the Hive database. If omitted, defaults to the Hive database named `default`. |
    +| \<hive-table-name\>    | Name of the Hive table. |
    +| PROFILE    | The `PROFILE` keyword must specify one of the values `Hive`, `HiveText`, or `HiveRC`. |
    +| DELIMITER    | The `DELIMITER` clause is required for both the `HiveText` and `HiveRC` profiles and identifies the field delimiter used in the Hive data set.  \<delim\> must be a single ascii character or specified in hexadecimal representation. |
    +| FORMAT (`Hive` profile)   | The `FORMAT` clause must specify `CUSTOM`. The `CUSTOM` format supports only the built-in `pxfwritable_import` `formatter`.   |
    +| FORMAT (`HiveText` and `HiveRC` profiles) | The `FORMAT` clause must specify `TEXT`. The `delimiter` must be specified a second time in '\<delim\>'. |
     
     
    -If `hive-db-name` is omitted, pxf will default to the Hive `default` database.
    +## <a id="profile_hive"></a>Hive Profile
     
    -**Note:** The port is the connection port for the PXF service. If the port is omitted, PXF assumes that High Availability (HA) is enabled and connects to the HA name service port, 51200 by default. The HA name service port can be changed by setting the pxf\_service\_port configuration parameter.
    +The `Hive` profile works with any Hive file format.
     
    -PXF has three built-in profiles for Hive tables:
    +### <a id="profile_hive_using"></a>Example: Using the Hive Profile
     
    --   Hive
    --   HiveRC
    --   HiveText
    +Use the `Hive` profile to create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier.
     
    -The Hive profile works with any Hive storage type. 
    -The following example creates a readable HAWQ external table representing a Hive table named `accessories` in the `inventory` Hive database using the PXF Hive profile:
    +1. Create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier:
     
    -``` shell
    -$ psql -d postgres
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hiveprofile(location text, month text, num_orders int, total_sales float8)
    +                LOCATION ('pxf://namenode:51200/default.sales_info?PROFILE=Hive')
    +              FORMAT 'custom' (formatter='pxfwritable_import');
    +    ```
    +
    +2. Query the table:
    +
    +    ``` sql
    +    postgres=# SELECT * FROM salesinfo_hiveprofile;
    +    ```
    +
    +    ``` shell
    +       location    | month | num_orders | total_sales
    +    ---------------+-------+------------+-------------
    +     Prague        | Jan   |        101 |     4875.33
    +     Rome          | Mar   |         87 |     1557.39
    +     Bangalore     | May   |        317 |     8936.99
    +     ...
    +
    +    ```
    +
    +## <a id="profile_hivetext"></a>HiveText Profile
    +
    +Use the `HiveText` profile to query text formats. The `HiveText` profile is more performant than the `Hive` profile.
    +
    +**Note**: When using the `HiveText` profile, you *must* specify a delimiter option in *both* the `LOCATION` and `FORMAT` clauses.
    +
    +### <a id="profile_hivetext_using"></a>Example: Using the HiveText Profile
    +
    +Use the PXF `HiveText` profile to create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier.
    +
    +1. Create the external table:
    +
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hivetextprofile(location text, month text, num_orders int, total_sales float8)
    +                 LOCATION ('pxf://namenode:51200/default.sales_info?PROFILE=HiveText&DELIMITER=\x2c')
    +               FORMAT 'TEXT' (delimiter=E',');
    +    ```
    +
    +    (You can safely ignore the "nonstandard use of escape in a string literal" warning and related messages.)
    +
    +    Notice that:
    +
    +    - The `LOCATION` subclause `DELIMITER` value is specified in hexadecimal format. `\x` is a prefix that instructs PXF to interpret the following characters as hexadecimal. `2c` is the hex value for the comma character.
    +    - The `FORMAT` subclause `delimiter` value is specified as the single ascii comma character ','. `E` escapes the character.
    +
    +2. Query the external table:
    +
    +    ``` sql
    +    postgres=# SELECT * FROM salesinfo_hivetextprofile where location="Beijing";
    +    ```
    +
    +    ``` shell
    +     location | month | num_orders | total_sales
    +    ----------+-------+------------+-------------
    +     Beijing  | Jul   |        411 |    11600.67
    +     Beijing  | Dec   |        100 |     4248.41
    +    (2 rows)
    +    ```
    +
    +## <a id="profile_hiverc"></a>HiveRC Profile
    +
    +The RCFile Hive format is used for row columnar formatted data. The `HiveRC` profile provides access to RCFile data.
    +
    +### <a id="profile_hiverc_rcfiletbl_using"></a>Example: Using the HiveRC Profile
    +
    +Use the `HiveRC` profile to query RCFile-formatted data in Hive tables. The `HiveRC` profile is more performant than the `Hive` profile for this file format type.
    +
    +1. Create a Hive table with RCFile format:
    +
    +    ``` shell
    +    $ HADOOP_USER_NAME=hdfs hive
    +    ```
    +
    +    ``` sql
    +    hive> CREATE TABLE sales_info_rcfile (location string, month string,
    +            number_of_orders int, total_sales double)
    +          ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    +          STORED AS rcfile;
    +    ```
    +
    +2. Insert the data from the `sales_info` table into `sales_info_rcfile`:
    +
    +    ``` sql
    +    hive> INSERT INTO TABLE sales_info_rcfile SELECT * FROM sales_info;
    +    ```
    +
    +    A copy of the sample data set is now stored in RCFile format in `sales_info_rcfile`. 
    +    
    +3. Perform a Hive query on `sales_info_rcfile` to verify the data was loaded successfully:
    +
    +    ``` sql
    +    hive> SELECT * FROM sales_info_rcfile;
    +    ```
    +
    +4. Use the PXF `HiveRC` profile to create a queryable HAWQ external table from the Hive `sales_info_rcfile` table created in the previous step. When using the `HiveRC` profile, you **must** specify a delimiter option in *both* the `LOCATION` and `FORMAT` clauses.:
    +
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hivercprofile(location text, month text, num_orders int, total_sales float8)
    +                 LOCATION ('pxf://namenode:51200/default.sales_info_rcfile?PROFILE=HiveRC&DELIMITER=\x2c')
    +               FORMAT 'TEXT' (delimiter=E',');
    +    ```
    +
    +    (Again, you can safely ignore the "nonstandard use of escape in a string literal" warning and related messages.)
    +
    +5. Query the external table:
    +
    +    ``` sql
    +    postgres=# SELECT location, total_sales FROM salesinfo_hivercprofile;
    +    ```
    +
    +    ``` shell
    +       location    | total_sales
    +    ---------------+-------------
    +     Prague        |     4875.33
    +     Rome          |     1557.39
    +     Bangalore     |     8936.99
    +     Beijing       |    11600.67
    +     ...
    +    ```
    +
    +## <a id="topic_dbb_nz3_ts"></a>Accessing Parquet-Format Hive Tables
    +
    +The PXF `Hive` profile supports both non-partitioned and partitioned Hive tables that use the Parquet storage format in HDFS. Simply map the table columns using equivalent HAWQ data types. For example, if a Hive table is created using:
    +
    +``` sql
    +hive> CREATE TABLE hive_parquet_table (fname string, lname string, custid int, acctbalance double)
    +        STORED AS parquet;
     ```
     
    +Define the HAWQ external table using:
    +
     ``` sql
    -postgres=# CREATE EXTERNAL TABLE hivetest(id int, newid int)
    -LOCATION ('pxf://namenode:51200/inventory.accessories?PROFILE=Hive')
    -FORMAT 'custom' (formatter='pxfwritable_import');
    +postgres=# CREATE EXTERNAL TABLE pxf_parquet_table (fname text, lname text, custid int, acctbalance double precision)
    +    LOCATION ('pxf://namenode:51200/hive-db-name.hive_parquet_table?profile=Hive')
    +    FORMAT 'CUSTOM' (formatter='pxfwritable_import');
     ```
     
    +## <a id="profileperf"></a>Profile Performance Considerations
     
    -Use HiveRC and HiveText to query RC and Text formats respectively. The HiveRC and HiveText profiles are faster than the generic Hive profile. When using the HiveRC and HiveText profiles, you must specify a DELIMITER option in the LOCATION clause. See [Using Profiles to Read and Write Data](ReadWritePXF.html#readingandwritingdatawithpxf) for more information on profiles.
    +The `HiveRC` and `HiveText` profiles are faster than the generic `Hive` profile.
     
    +?? MORE HERE. ??
    --- End diff --
    
    Need to remove this comment.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

Posted by dyozie <gi...@git.apache.org>.

Github user dyozie commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85365959
  
    --- Diff: pxf/HivePXF.html.md.erb ---
    @@ -2,121 +2,450 @@
     title: Accessing Hive Data
     ---
     
    -This topic describes how to access Hive data using PXF. You have several options for querying data stored in Hive. You can create external tables in PXF and then query those tables, or you can easily query Hive tables by using HAWQ and PXF's integration with HCatalog. HAWQ accesses Hive table metadata stored in HCatalog.
    +Apache Hive is a distributed data warehousing infrastructure.  Hive facilitates managing large data sets supporting multiple data formats, including comma-separated value (.csv), RC, ORC, and parquet. The PXF Hive plug-in reads data stored in Hive, as well as HDFS or HBase.
    +
    +This section describes how to use PXF to access Hive data. Options for querying data stored in Hive include:
    +
    +-  Creating an external table in PXF and querying that table
    +-  Querying Hive tables via PXF's integration with HCatalog
     
     ## <a id="installingthepxfhiveplugin"></a>Prerequisites
     
    -Check the following before using PXF to access Hive:
    +Before accessing Hive data with HAWQ and PXF, ensure that:
     
    --   The PXF HDFS plug-in is installed on all cluster nodes.
    +-   The PXF HDFS plug-in is installed on all cluster nodes. See [Installing PXF Plug-ins](InstallPXFPlugins.html) for PXF plug-in installation information.
     -   The PXF Hive plug-in is installed on all cluster nodes.
     -   The Hive JAR files and conf directory�are installed on all cluster nodes.
    --   Test PXF on HDFS before connecting to Hive or HBase.
    +-   You have tested PXF on HDFS.
     -   You are running the Hive Metastore service on a machine in your cluster.�
     -   You have set the `hive.metastore.uris`�property in the�`hive-site.xml` on the NameNode.
     
    +## <a id="topic_p2s_lvl_25"></a>Hive File Formats
    +
    +Hive supports several file formats:
    +
    +-   TextFile - flat file with data in comma-, tab-, or space-separated value format or JSON notation
    +-   SequenceFile - flat file consisting of binary key/value pairs
    +-   RCFile - record columnar data consisting of binary key/value pairs; high row compression rate
    +-   ORCFile - optimized row columnar data with stripe, footer, and postscript sections; reduces data size
    +-   Parquet - compressed columnar data representation
    +-   Avro - JSON-defined, schema-based data serialization format
    +
    +Refer to [File Formats](https://cwiki.apache.org/confluence/display/Hive/FileFormats) for detailed information about the file formats supported by Hive.
    +
    +The PXF Hive plug-in supports the following profiles for accessing the Hive file formats listed above. These include:
    +
    +- `Hive`
    +- `HiveText`
    +- `HiveRC`
    +
    +## <a id="topic_p2s_lvl_29"></a>Data Type Mapping
    +
    +### <a id="hive_primdatatypes"></a>Primitive Data Types
    +
    +To represent Hive data in HAWQ, map data values that use a primitive data type to HAWQ columns of the same type.
    +
    +The following table summarizes external mapping rules for Hive primitive types.
    +
    +| Hive Data Type  | Hawq Data Type |
    +|-------|---------------------------|
    +| boolean    | bool |
    +| int   | int4 |
    +| smallint   | int2 |
    +| tinyint   | int2 |
    +| bigint   | int8 |
    +| decimal  |  numeric  |
    +| float   | float4 |
    +| double   | float8 |
    +| string   | text |
    +| binary   | bytea |
    +| char   | bpchar |
    +| varchar   | varchar |
    +| timestamp   | timestamp |
    +| date   | date |
    +
    +
    +### <a id="topic_b4v_g3n_25"></a>Complex Data Types
    +
    +Hive supports complex data types including array, struct, map, and union. PXF maps each of these complex types to `text`.  While HAWQ does not natively support these types, you can create HAWQ functions or application code to extract subcomponents of these complex data types.
    +
    +An example using complex data types is provided later in this topic.
    +
    +
    +## <a id="hive_sampledataset"></a>Sample Data Set
    +
    +Examples used in this topic will operate on a common data set. This simple data set models a retail sales operation and includes fields with the following names and data types:
    +
    +- location - text
    +- month - text
    +- number\_of\_orders - integer
    +- total\_sales - double
    --- End diff --
    
    Also consider term/definition table here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

Posted by dyozie <gi...@git.apache.org>.

Github user dyozie commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85367290
  
    --- Diff: pxf/HivePXF.html.md.erb ---
    @@ -2,121 +2,450 @@
     title: Accessing Hive Data
     ---
     
    -This topic describes how to access Hive data using PXF. You have several options for querying data stored in Hive. You can create external tables in PXF and then query those tables, or you can easily query Hive tables by using HAWQ and PXF's integration with HCatalog. HAWQ accesses Hive table metadata stored in HCatalog.
    +Apache Hive is a distributed data warehousing infrastructure.  Hive facilitates managing large data sets supporting multiple data formats, including comma-separated value (.csv), RC, ORC, and parquet. The PXF Hive plug-in reads data stored in Hive, as well as HDFS or HBase.
    +
    +This section describes how to use PXF to access Hive data. Options for querying data stored in Hive include:
    +
    +-  Creating an external table in PXF and querying that table
    +-  Querying Hive tables via PXF's integration with HCatalog
     
     ## <a id="installingthepxfhiveplugin"></a>Prerequisites
     
    -Check the following before using PXF to access Hive:
    +Before accessing Hive data with HAWQ and PXF, ensure that:
     
    --   The PXF HDFS plug-in is installed on all cluster nodes.
    +-   The PXF HDFS plug-in is installed on all cluster nodes. See [Installing PXF Plug-ins](InstallPXFPlugins.html) for PXF plug-in installation information.
     -   The PXF Hive plug-in is installed on all cluster nodes.
     -   The Hive JAR files and conf directory�are installed on all cluster nodes.
    --   Test PXF on HDFS before connecting to Hive or HBase.
    +-   You have tested PXF on HDFS.
     -   You are running the Hive Metastore service on a machine in your cluster.�
     -   You have set the `hive.metastore.uris`�property in the�`hive-site.xml` on the NameNode.
     
    +## <a id="topic_p2s_lvl_25"></a>Hive File Formats
    +
    +Hive supports several file formats:
    +
    +-   TextFile - flat file with data in comma-, tab-, or space-separated value format or JSON notation
    +-   SequenceFile - flat file consisting of binary key/value pairs
    +-   RCFile - record columnar data consisting of binary key/value pairs; high row compression rate
    +-   ORCFile - optimized row columnar data with stripe, footer, and postscript sections; reduces data size
    +-   Parquet - compressed columnar data representation
    +-   Avro - JSON-defined, schema-based data serialization format
    +
    +Refer to [File Formats](https://cwiki.apache.org/confluence/display/Hive/FileFormats) for detailed information about the file formats supported by Hive.
    +
    +The PXF Hive plug-in supports the following profiles for accessing the Hive file formats listed above. These include:
    +
    +- `Hive`
    +- `HiveText`
    +- `HiveRC`
    +
    +## <a id="topic_p2s_lvl_29"></a>Data Type Mapping
    +
    +### <a id="hive_primdatatypes"></a>Primitive Data Types
    +
    +To represent Hive data in HAWQ, map data values that use a primitive data type to HAWQ columns of the same type.
    +
    +The following table summarizes external mapping rules for Hive primitive types.
    +
    +| Hive Data Type  | Hawq Data Type |
    +|-------|---------------------------|
    +| boolean    | bool |
    +| int   | int4 |
    +| smallint   | int2 |
    +| tinyint   | int2 |
    +| bigint   | int8 |
    +| decimal  |  numeric  |
    +| float   | float4 |
    +| double   | float8 |
    +| string   | text |
    +| binary   | bytea |
    +| char   | bpchar |
    +| varchar   | varchar |
    +| timestamp   | timestamp |
    +| date   | date |
    +
    +
    +### <a id="topic_b4v_g3n_25"></a>Complex Data Types
    +
    +Hive supports complex data types including array, struct, map, and union. PXF maps each of these complex types to `text`.  While HAWQ does not natively support these types, you can create HAWQ functions or application code to extract subcomponents of these complex data types.
    +
    +An example using complex data types is provided later in this topic.
    +
    +
    +## <a id="hive_sampledataset"></a>Sample Data Set
    +
    +Examples used in this topic will operate on a common data set. This simple data set models a retail sales operation and includes fields with the following names and data types:
    +
    +- location - text
    +- month - text
    +- number\_of\_orders - integer
    +- total\_sales - double
    +
    +Prepare the sample data set for use:
    +
    +1. First, create a text file:
    +
    +    ```
    +    $ vi /tmp/pxf_hive_datafile.txt
    +    ```
    +
    +2. Add the following data to `pxf_hive_datafile.txt`; notice the use of the comma `,` to separate the four field values:
    +
    +    ```
    +    Prague,Jan,101,4875.33
    +    Rome,Mar,87,1557.39
    +    Bangalore,May,317,8936.99
    +    Beijing,Jul,411,11600.67
    +    San Francisco,Sept,156,6846.34
    +    Paris,Nov,159,7134.56
    +    San Francisco,Jan,113,5397.89
    +    Prague,Dec,333,9894.77
    +    Bangalore,Jul,271,8320.55
    +    Beijing,Dec,100,4248.41
    +    ```
    +
    +Make note of the path to `pxf_hive_datafile.txt`; you will use it in later exercises.
    +
    +
     ## <a id="hivecommandline"></a>Hive Command Line
     
    -To start the Hive command line and work directly on a Hive table:
    +The Hive command line is a subsystem similar to that of `psql`. To start the Hive command line:
     
     ``` shell
    -$ hive
    +$ HADOOP_USER_NAME=hdfs hive
     ```
     
    -Here is an example�of how to create and load data into�a sample Hive table from an existing file.
    +The default Hive database is named `default`. 
     
    -``` sql
    -hive> CREATE TABLE test (name string, type string, supplier_key int, full_price double) row format delimited fields terminated by ',';
    -hive> LOAD DATA local inpath '/local/path/data.txt' into table test;
    -```
    +### <a id="hivecommandline_createdb"></a>Example: Create a Hive Database
     
    -## <a id="topic_p2s_lvl_25"></a>Using PXF Tables to Query Hive
    +Create a Hive table to expose our sample data set.
     
    -Hive tables are defined in a specific way in PXF, regardless of the underlying file storage format. The PXF Hive plug-ins automatically detect source tables in the following formats:
    +1. Create a Hive table named `sales_info` in the `default` database:
     
    --   Text based
    --   SequenceFile
    --   RCFile
    --   ORCFile
    --   Parquet
    --   Avro
    +    ``` sql
    +    hive> CREATE TABLE sales_info (location string, month string,
    +            number_of_orders int, total_sales double)
    +            ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    +            STORED AS textfile;
    +    ```
     
    -The source table can also be a combination of these types. The PXF Hive plug-in�uses this information to query the data in runtime.
    +    Notice that:
     
    --   **[Syntax Example](../pxf/HivePXF.html#syntax2)**
    +    - The `STORED AS textfile` subclause instructs Hive to create the table in Textfile (the default) format.  Hive Textfile format supports comma-, tab-, and space-separated values, as well as data specified in JSON notation.
    +    - The `DELIMITED FIELDS TERMINATED BY` subclause identifies the field delimiter within a data record (line). The `sales_info` table field delimiter is a comma (`,`).
     
    --   **[Hive Complex Types](../pxf/HivePXF.html#topic_b4v_g3n_25)**
    +2. Load the `pxf_hive_datafile.txt` sample data file into the `sales_info` table you just created:
     
    -### <a id="syntax2"></a>Syntax Example
    +    ``` sql
    +    hive> LOAD DATA local INPATH '/tmp/pxf_hive_datafile.txt'
    +            INTO TABLE sales_info;
    +    ```
    +
    +3. Perform a query on `sales_info` to verify the data was loaded successfully:
    +
    +    ``` sql
    +    hive> SELECT * FROM sales_info;
    +    ```
     
    -The following�PXF table definition is valid for any Hive file storage type.
    +In examples later in this section, you will access the `sales_info` Hive table directly via PXF. You will also insert `sales_info` data into tables of other Hive file format types, and use PXF to access those directly as well.
    +
    +## <a id="topic_p2s_lvl_28"></a>Querying External Hive Data
    +
    +The PXF Hive plug-in supports several Hive-related profiles. These include `Hive`, `HiveText`, and `HiveRC`.
    +
    +Use the following syntax to create a HAWQ external table representing Hive data:
     
     ``` sql
    -CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name 
    -    ( column_name data_type [, ...] | LIKE other_table )
    -LOCATION ('pxf://namenode[:port]/hive-db-name.hive-table-name?<pxf parameters>[&custom-option=value...]')FORMAT 'CUSTOM' (formatter='pxfwritable_import')
    +CREATE EXTERNAL TABLE <table_name>
    +    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
    +LOCATION ('pxf://<host>[:<port>]/<hive-db-name>.<hive-table-name>
    +    ?PROFILE=Hive|HiveText|HiveRC[&DELIMITER=<delim>'])
    +FORMAT 'CUSTOM|TEXT' (formatter='pxfwritable_import' | delimiter='<delim>')
     ```
     
    -where `<pxf parameters>` is:
    +Hive-plug-in-specific keywords and values used in the [CREATE EXTERNAL TABLE](../reference/sql/CREATE-EXTERNAL-TABLE.html) call are described below.
     
    -``` pre
    -   FRAGMENTER=fragmenter_class&ACCESSOR=accessor_class&RESOLVER=resolver_class]
    - | PROFILE=profile-name
    -```
    +| Keyword  | Value |
    +|-------|-------------------------------------|
    +| \<host\>[:<port\>]    | The HDFS NameNode and port. |
    +| \<hive-db-name\>    | Name of the Hive database. If omitted, defaults to the Hive database named `default`. |
    --- End diff --
    
    Change this to "The name" in this row and the next, to keep the sentences parallel.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

Posted by dyozie <gi...@git.apache.org>.

Github user dyozie commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85367943
  
    --- Diff: pxf/HivePXF.html.md.erb ---
    @@ -2,121 +2,450 @@
     title: Accessing Hive Data
     ---
     
    -This topic describes how to access Hive data using PXF. You have several options for querying data stored in Hive. You can create external tables in PXF and then query those tables, or you can easily query Hive tables by using HAWQ and PXF's integration with HCatalog. HAWQ accesses Hive table metadata stored in HCatalog.
    +Apache Hive is a distributed data warehousing infrastructure.  Hive facilitates managing large data sets supporting multiple data formats, including comma-separated value (.csv), RC, ORC, and parquet. The PXF Hive plug-in reads data stored in Hive, as well as HDFS or HBase.
    +
    +This section describes how to use PXF to access Hive data. Options for querying data stored in Hive include:
    +
    +-  Creating an external table in PXF and querying that table
    +-  Querying Hive tables via PXF's integration with HCatalog
     
     ## <a id="installingthepxfhiveplugin"></a>Prerequisites
     
    -Check the following before using PXF to access Hive:
    +Before accessing Hive data with HAWQ and PXF, ensure that:
     
    --   The PXF HDFS plug-in is installed on all cluster nodes.
    +-   The PXF HDFS plug-in is installed on all cluster nodes. See [Installing PXF Plug-ins](InstallPXFPlugins.html) for PXF plug-in installation information.
     -   The PXF Hive plug-in is installed on all cluster nodes.
     -   The Hive JAR files and conf directory�are installed on all cluster nodes.
    --   Test PXF on HDFS before connecting to Hive or HBase.
    +-   You have tested PXF on HDFS.
     -   You are running the Hive Metastore service on a machine in your cluster.�
     -   You have set the `hive.metastore.uris`�property in the�`hive-site.xml` on the NameNode.
     
    +## <a id="topic_p2s_lvl_25"></a>Hive File Formats
    +
    +Hive supports several file formats:
    +
    +-   TextFile - flat file with data in comma-, tab-, or space-separated value format or JSON notation
    +-   SequenceFile - flat file consisting of binary key/value pairs
    +-   RCFile - record columnar data consisting of binary key/value pairs; high row compression rate
    +-   ORCFile - optimized row columnar data with stripe, footer, and postscript sections; reduces data size
    +-   Parquet - compressed columnar data representation
    +-   Avro - JSON-defined, schema-based data serialization format
    +
    +Refer to [File Formats](https://cwiki.apache.org/confluence/display/Hive/FileFormats) for detailed information about the file formats supported by Hive.
    +
    +The PXF Hive plug-in supports the following profiles for accessing the Hive file formats listed above. These include:
    +
    +- `Hive`
    +- `HiveText`
    +- `HiveRC`
    +
    +## <a id="topic_p2s_lvl_29"></a>Data Type Mapping
    +
    +### <a id="hive_primdatatypes"></a>Primitive Data Types
    +
    +To represent Hive data in HAWQ, map data values that use a primitive data type to HAWQ columns of the same type.
    +
    +The following table summarizes external mapping rules for Hive primitive types.
    +
    +| Hive Data Type  | Hawq Data Type |
    +|-------|---------------------------|
    +| boolean    | bool |
    +| int   | int4 |
    +| smallint   | int2 |
    +| tinyint   | int2 |
    +| bigint   | int8 |
    +| decimal  |  numeric  |
    +| float   | float4 |
    +| double   | float8 |
    +| string   | text |
    +| binary   | bytea |
    +| char   | bpchar |
    +| varchar   | varchar |
    +| timestamp   | timestamp |
    +| date   | date |
    +
    +
    +### <a id="topic_b4v_g3n_25"></a>Complex Data Types
    +
    +Hive supports complex data types including array, struct, map, and union. PXF maps each of these complex types to `text`.  While HAWQ does not natively support these types, you can create HAWQ functions or application code to extract subcomponents of these complex data types.
    +
    +An example using complex data types is provided later in this topic.
    +
    +
    +## <a id="hive_sampledataset"></a>Sample Data Set
    +
    +Examples used in this topic will operate on a common data set. This simple data set models a retail sales operation and includes fields with the following names and data types:
    +
    +- location - text
    +- month - text
    +- number\_of\_orders - integer
    +- total\_sales - double
    +
    +Prepare the sample data set for use:
    +
    +1. First, create a text file:
    +
    +    ```
    +    $ vi /tmp/pxf_hive_datafile.txt
    +    ```
    +
    +2. Add the following data to `pxf_hive_datafile.txt`; notice the use of the comma `,` to separate the four field values:
    +
    +    ```
    +    Prague,Jan,101,4875.33
    +    Rome,Mar,87,1557.39
    +    Bangalore,May,317,8936.99
    +    Beijing,Jul,411,11600.67
    +    San Francisco,Sept,156,6846.34
    +    Paris,Nov,159,7134.56
    +    San Francisco,Jan,113,5397.89
    +    Prague,Dec,333,9894.77
    +    Bangalore,Jul,271,8320.55
    +    Beijing,Dec,100,4248.41
    +    ```
    +
    +Make note of the path to `pxf_hive_datafile.txt`; you will use it in later exercises.
    +
    +
     ## <a id="hivecommandline"></a>Hive Command Line
     
    -To start the Hive command line and work directly on a Hive table:
    +The Hive command line is a subsystem similar to that of `psql`. To start the Hive command line:
     
     ``` shell
    -$ hive
    +$ HADOOP_USER_NAME=hdfs hive
     ```
     
    -Here is an example�of how to create and load data into�a sample Hive table from an existing file.
    +The default Hive database is named `default`. 
     
    -``` sql
    -hive> CREATE TABLE test (name string, type string, supplier_key int, full_price double) row format delimited fields terminated by ',';
    -hive> LOAD DATA local inpath '/local/path/data.txt' into table test;
    -```
    +### <a id="hivecommandline_createdb"></a>Example: Create a Hive Database
     
    -## <a id="topic_p2s_lvl_25"></a>Using PXF Tables to Query Hive
    +Create a Hive table to expose our sample data set.
     
    -Hive tables are defined in a specific way in PXF, regardless of the underlying file storage format. The PXF Hive plug-ins automatically detect source tables in the following formats:
    +1. Create a Hive table named `sales_info` in the `default` database:
     
    --   Text based
    --   SequenceFile
    --   RCFile
    --   ORCFile
    --   Parquet
    --   Avro
    +    ``` sql
    +    hive> CREATE TABLE sales_info (location string, month string,
    +            number_of_orders int, total_sales double)
    +            ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    +            STORED AS textfile;
    +    ```
     
    -The source table can also be a combination of these types. The PXF Hive plug-in�uses this information to query the data in runtime.
    +    Notice that:
     
    --   **[Syntax Example](../pxf/HivePXF.html#syntax2)**
    +    - The `STORED AS textfile` subclause instructs Hive to create the table in Textfile (the default) format.  Hive Textfile format supports comma-, tab-, and space-separated values, as well as data specified in JSON notation.
    +    - The `DELIMITED FIELDS TERMINATED BY` subclause identifies the field delimiter within a data record (line). The `sales_info` table field delimiter is a comma (`,`).
     
    --   **[Hive Complex Types](../pxf/HivePXF.html#topic_b4v_g3n_25)**
    +2. Load the `pxf_hive_datafile.txt` sample data file into the `sales_info` table you just created:
     
    -### <a id="syntax2"></a>Syntax Example
    +    ``` sql
    +    hive> LOAD DATA local INPATH '/tmp/pxf_hive_datafile.txt'
    +            INTO TABLE sales_info;
    +    ```
    +
    +3. Perform a query on `sales_info` to verify the data was loaded successfully:
    +
    +    ``` sql
    +    hive> SELECT * FROM sales_info;
    +    ```
     
    -The following�PXF table definition is valid for any Hive file storage type.
    +In examples later in this section, you will access the `sales_info` Hive table directly via PXF. You will also insert `sales_info` data into tables of other Hive file format types, and use PXF to access those directly as well.
    +
    +## <a id="topic_p2s_lvl_28"></a>Querying External Hive Data
    +
    +The PXF Hive plug-in supports several Hive-related profiles. These include `Hive`, `HiveText`, and `HiveRC`.
    +
    +Use the following syntax to create a HAWQ external table representing Hive data:
     
     ``` sql
    -CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name 
    -    ( column_name data_type [, ...] | LIKE other_table )
    -LOCATION ('pxf://namenode[:port]/hive-db-name.hive-table-name?<pxf parameters>[&custom-option=value...]')FORMAT 'CUSTOM' (formatter='pxfwritable_import')
    +CREATE EXTERNAL TABLE <table_name>
    +    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
    +LOCATION ('pxf://<host>[:<port>]/<hive-db-name>.<hive-table-name>
    +    ?PROFILE=Hive|HiveText|HiveRC[&DELIMITER=<delim>'])
    +FORMAT 'CUSTOM|TEXT' (formatter='pxfwritable_import' | delimiter='<delim>')
     ```
     
    -where `<pxf parameters>` is:
    +Hive-plug-in-specific keywords and values used in the [CREATE EXTERNAL TABLE](../reference/sql/CREATE-EXTERNAL-TABLE.html) call are described below.
     
    -``` pre
    -   FRAGMENTER=fragmenter_class&ACCESSOR=accessor_class&RESOLVER=resolver_class]
    - | PROFILE=profile-name
    -```
    +| Keyword  | Value |
    +|-------|-------------------------------------|
    +| \<host\>[:<port\>]    | The HDFS NameNode and port. |
    +| \<hive-db-name\>    | Name of the Hive database. If omitted, defaults to the Hive database named `default`. |
    +| \<hive-table-name\>    | Name of the Hive table. |
    +| PROFILE    | The `PROFILE` keyword must specify one of the values `Hive`, `HiveText`, or `HiveRC`. |
    +| DELIMITER    | The `DELIMITER` clause is required for both the `HiveText` and `HiveRC` profiles and identifies the field delimiter used in the Hive data set.  \<delim\> must be a single ascii character or specified in hexadecimal representation. |
    +| FORMAT (`Hive` profile)   | The `FORMAT` clause must specify `CUSTOM`. The `CUSTOM` format supports only the built-in `pxfwritable_import` `formatter`.   |
    +| FORMAT (`HiveText` and `HiveRC` profiles) | The `FORMAT` clause must specify `TEXT`. The `delimiter` must be specified a second time in '\<delim\>'. |
     
     
    -If `hive-db-name` is omitted, pxf will default to the Hive `default` database.
    +## <a id="profile_hive"></a>Hive Profile
     
    -**Note:** The port is the connection port for the PXF service. If the port is omitted, PXF assumes that High Availability (HA) is enabled and connects to the HA name service port, 51200 by default. The HA name service port can be changed by setting the pxf\_service\_port configuration parameter.
    +The `Hive` profile works with any Hive file format.
     
    -PXF has three built-in profiles for Hive tables:
    +### <a id="profile_hive_using"></a>Example: Using the Hive Profile
     
    --   Hive
    --   HiveRC
    --   HiveText
    +Use the `Hive` profile to create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier.
     
    -The Hive profile works with any Hive storage type. 
    -The following example creates a readable HAWQ external table representing a Hive table named `accessories` in the `inventory` Hive database using the PXF Hive profile:
    +1. Create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier:
     
    -``` shell
    -$ psql -d postgres
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hiveprofile(location text, month text, num_orders int, total_sales float8)
    +                LOCATION ('pxf://namenode:51200/default.sales_info?PROFILE=Hive')
    +              FORMAT 'custom' (formatter='pxfwritable_import');
    +    ```
    +
    +2. Query the table:
    +
    +    ``` sql
    +    postgres=# SELECT * FROM salesinfo_hiveprofile;
    +    ```
    +
    +    ``` shell
    +       location    | month | num_orders | total_sales
    +    ---------------+-------+------------+-------------
    +     Prague        | Jan   |        101 |     4875.33
    +     Rome          | Mar   |         87 |     1557.39
    +     Bangalore     | May   |        317 |     8936.99
    +     ...
    +
    +    ```
    +
    +## <a id="profile_hivetext"></a>HiveText Profile
    +
    +Use the `HiveText` profile to query text formats. The `HiveText` profile is more performant than the `Hive` profile.
    +
    +**Note**: When using the `HiveText` profile, you *must* specify a delimiter option in *both* the `LOCATION` and `FORMAT` clauses.
    +
    +### <a id="profile_hivetext_using"></a>Example: Using the HiveText Profile
    +
    +Use the PXF `HiveText` profile to create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier.
    +
    +1. Create the external table:
    +
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hivetextprofile(location text, month text, num_orders int, total_sales float8)
    +                 LOCATION ('pxf://namenode:51200/default.sales_info?PROFILE=HiveText&DELIMITER=\x2c')
    +               FORMAT 'TEXT' (delimiter=E',');
    +    ```
    +
    +    (You can safely ignore the "nonstandard use of escape in a string literal" warning and related messages.)
    +
    +    Notice that:
    +
    --- End diff --
    
    Again, a MarkDown formatting error with the bullet list.  Should be fixed by removing this blank line.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

Posted by kavinderd <gi...@git.apache.org>.

Github user kavinderd commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85376355
  
    --- Diff: pxf/HivePXF.html.md.erb ---
    @@ -2,121 +2,450 @@
     title: Accessing Hive Data
     ---
     
    -This topic describes how to access Hive data using PXF. You have several options for querying data stored in Hive. You can create external tables in PXF and then query those tables, or you can easily query Hive tables by using HAWQ and PXF's integration with HCatalog. HAWQ accesses Hive table metadata stored in HCatalog.
    +Apache Hive is a distributed data warehousing infrastructure.  Hive facilitates managing large data sets supporting multiple data formats, including comma-separated value (.csv), RC, ORC, and parquet. The PXF Hive plug-in reads data stored in Hive, as well as HDFS or HBase.
    +
    +This section describes how to use PXF to access Hive data. Options for querying data stored in Hive include:
    +
    +-  Creating an external table in PXF and querying that table
    +-  Querying Hive tables via PXF's integration with HCatalog
     
     ## <a id="installingthepxfhiveplugin"></a>Prerequisites
     
    -Check the following before using PXF to access Hive:
    +Before accessing Hive data with HAWQ and PXF, ensure that:
     
    --   The PXF HDFS plug-in is installed on all cluster nodes.
    +-   The PXF HDFS plug-in is installed on all cluster nodes. See [Installing PXF Plug-ins](InstallPXFPlugins.html) for PXF plug-in installation information.
     -   The PXF Hive plug-in is installed on all cluster nodes.
     -   The Hive JAR files and conf directory�are installed on all cluster nodes.
    --   Test PXF on HDFS before connecting to Hive or HBase.
    +-   You have tested PXF on HDFS.
     -   You are running the Hive Metastore service on a machine in your cluster.�
     -   You have set the `hive.metastore.uris`�property in the�`hive-site.xml` on the NameNode.
     
    +## <a id="topic_p2s_lvl_25"></a>Hive File Formats
    +
    +Hive supports several file formats:
    +
    +-   TextFile - flat file with data in comma-, tab-, or space-separated value format or JSON notation
    +-   SequenceFile - flat file consisting of binary key/value pairs
    +-   RCFile - record columnar data consisting of binary key/value pairs; high row compression rate
    +-   ORCFile - optimized row columnar data with stripe, footer, and postscript sections; reduces data size
    +-   Parquet - compressed columnar data representation
    +-   Avro - JSON-defined, schema-based data serialization format
    +
    +Refer to [File Formats](https://cwiki.apache.org/confluence/display/Hive/FileFormats) for detailed information about the file formats supported by Hive.
    +
    +The PXF Hive plug-in supports the following profiles for accessing the Hive file formats listed above. These include:
    +
    +- `Hive`
    +- `HiveText`
    +- `HiveRC`
    +
    +## <a id="topic_p2s_lvl_29"></a>Data Type Mapping
    +
    +### <a id="hive_primdatatypes"></a>Primitive Data Types
    +
    +To represent Hive data in HAWQ, map data values that use a primitive data type to HAWQ columns of the same type.
    +
    +The following table summarizes external mapping rules for Hive primitive types.
    +
    +| Hive Data Type  | Hawq Data Type |
    +|-------|---------------------------|
    +| boolean    | bool |
    +| int   | int4 |
    +| smallint   | int2 |
    +| tinyint   | int2 |
    +| bigint   | int8 |
    +| decimal  |  numeric  |
    +| float   | float4 |
    +| double   | float8 |
    +| string   | text |
    +| binary   | bytea |
    +| char   | bpchar |
    +| varchar   | varchar |
    +| timestamp   | timestamp |
    +| date   | date |
    +
    +
    +### <a id="topic_b4v_g3n_25"></a>Complex Data Types
    +
    +Hive supports complex data types including array, struct, map, and union. PXF maps each of these complex types to `text`.  While HAWQ does not natively support these types, you can create HAWQ functions or application code to extract subcomponents of these complex data types.
    +
    +An example using complex data types is provided later in this topic.
    +
    +
    +## <a id="hive_sampledataset"></a>Sample Data Set
    +
    +Examples used in this topic will operate on a common data set. This simple data set models a retail sales operation and includes fields with the following names and data types:
    +
    +- location - text
    +- month - text
    +- number\_of\_orders - integer
    +- total\_sales - double
    +
    +Prepare the sample data set for use:
    +
    +1. First, create a text file:
    +
    +    ```
    +    $ vi /tmp/pxf_hive_datafile.txt
    +    ```
    +
    +2. Add the following data to `pxf_hive_datafile.txt`; notice the use of the comma `,` to separate the four field values:
    +
    +    ```
    +    Prague,Jan,101,4875.33
    +    Rome,Mar,87,1557.39
    +    Bangalore,May,317,8936.99
    +    Beijing,Jul,411,11600.67
    +    San Francisco,Sept,156,6846.34
    +    Paris,Nov,159,7134.56
    +    San Francisco,Jan,113,5397.89
    +    Prague,Dec,333,9894.77
    +    Bangalore,Jul,271,8320.55
    +    Beijing,Dec,100,4248.41
    +    ```
    +
    +Make note of the path to `pxf_hive_datafile.txt`; you will use it in later exercises.
    +
    +
     ## <a id="hivecommandline"></a>Hive Command Line
     
    -To start the Hive command line and work directly on a Hive table:
    +The Hive command line is a subsystem similar to that of `psql`. To start the Hive command line:
     
     ``` shell
    -$ hive
    +$ HADOOP_USER_NAME=hdfs hive
     ```
     
    -Here is an example�of how to create and load data into�a sample Hive table from an existing file.
    +The default Hive database is named `default`. 
     
    -``` sql
    -hive> CREATE TABLE test (name string, type string, supplier_key int, full_price double) row format delimited fields terminated by ',';
    -hive> LOAD DATA local inpath '/local/path/data.txt' into table test;
    -```
    +### <a id="hivecommandline_createdb"></a>Example: Create a Hive Database
     
    -## <a id="topic_p2s_lvl_25"></a>Using PXF Tables to Query Hive
    +Create a Hive table to expose our sample data set.
     
    -Hive tables are defined in a specific way in PXF, regardless of the underlying file storage format. The PXF Hive plug-ins automatically detect source tables in the following formats:
    +1. Create a Hive table named `sales_info` in the `default` database:
     
    --   Text based
    --   SequenceFile
    --   RCFile
    --   ORCFile
    --   Parquet
    --   Avro
    +    ``` sql
    +    hive> CREATE TABLE sales_info (location string, month string,
    +            number_of_orders int, total_sales double)
    +            ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    +            STORED AS textfile;
    +    ```
     
    -The source table can also be a combination of these types. The PXF Hive plug-in�uses this information to query the data in runtime.
    +    Notice that:
     
    --   **[Syntax Example](../pxf/HivePXF.html#syntax2)**
    +    - The `STORED AS textfile` subclause instructs Hive to create the table in Textfile (the default) format.  Hive Textfile format supports comma-, tab-, and space-separated values, as well as data specified in JSON notation.
    +    - The `DELIMITED FIELDS TERMINATED BY` subclause identifies the field delimiter within a data record (line). The `sales_info` table field delimiter is a comma (`,`).
     
    --   **[Hive Complex Types](../pxf/HivePXF.html#topic_b4v_g3n_25)**
    +2. Load the `pxf_hive_datafile.txt` sample data file into the `sales_info` table you just created:
     
    -### <a id="syntax2"></a>Syntax Example
    +    ``` sql
    +    hive> LOAD DATA local INPATH '/tmp/pxf_hive_datafile.txt'
    +            INTO TABLE sales_info;
    +    ```
    +
    +3. Perform a query on `sales_info` to verify the data was loaded successfully:
    +
    +    ``` sql
    +    hive> SELECT * FROM sales_info;
    +    ```
     
    -The following�PXF table definition is valid for any Hive file storage type.
    +In examples later in this section, you will access the `sales_info` Hive table directly via PXF. You will also insert `sales_info` data into tables of other Hive file format types, and use PXF to access those directly as well.
    +
    +## <a id="topic_p2s_lvl_28"></a>Querying External Hive Data
    +
    +The PXF Hive plug-in supports several Hive-related profiles. These include `Hive`, `HiveText`, and `HiveRC`.
    +
    +Use the following syntax to create a HAWQ external table representing Hive data:
     
     ``` sql
    -CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name 
    -    ( column_name data_type [, ...] | LIKE other_table )
    -LOCATION ('pxf://namenode[:port]/hive-db-name.hive-table-name?<pxf parameters>[&custom-option=value...]')FORMAT 'CUSTOM' (formatter='pxfwritable_import')
    +CREATE EXTERNAL TABLE <table_name>
    +    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
    +LOCATION ('pxf://<host>[:<port>]/<hive-db-name>.<hive-table-name>
    +    ?PROFILE=Hive|HiveText|HiveRC[&DELIMITER=<delim>'])
    +FORMAT 'CUSTOM|TEXT' (formatter='pxfwritable_import' | delimiter='<delim>')
     ```
     
    -where `<pxf parameters>` is:
    +Hive-plug-in-specific keywords and values used in the [CREATE EXTERNAL TABLE](../reference/sql/CREATE-EXTERNAL-TABLE.html) call are described below.
     
    -``` pre
    -   FRAGMENTER=fragmenter_class&ACCESSOR=accessor_class&RESOLVER=resolver_class]
    - | PROFILE=profile-name
    -```
    +| Keyword  | Value |
    +|-------|-------------------------------------|
    +| \<host\>[:<port\>]    | The HDFS NameNode and port. |
    +| \<hive-db-name\>    | Name of the Hive database. If omitted, defaults to the Hive database named `default`. |
    +| \<hive-table-name\>    | Name of the Hive table. |
    +| PROFILE    | The `PROFILE` keyword must specify one of the values `Hive`, `HiveText`, or `HiveRC`. |
    +| DELIMITER    | The `DELIMITER` clause is required for both the `HiveText` and `HiveRC` profiles and identifies the field delimiter used in the Hive data set.  \<delim\> must be a single ascii character or specified in hexadecimal representation. |
    +| FORMAT (`Hive` profile)   | The `FORMAT` clause must specify `CUSTOM`. The `CUSTOM` format supports only the built-in `pxfwritable_import` `formatter`.   |
    +| FORMAT (`HiveText` and `HiveRC` profiles) | The `FORMAT` clause must specify `TEXT`. The `delimiter` must be specified a second time in '\<delim\>'. |
     
     
    -If `hive-db-name` is omitted, pxf will default to the Hive `default` database.
    +## <a id="profile_hive"></a>Hive Profile
     
    -**Note:** The port is the connection port for the PXF service. If the port is omitted, PXF assumes that High Availability (HA) is enabled and connects to the HA name service port, 51200 by default. The HA name service port can be changed by setting the pxf\_service\_port configuration parameter.
    +The `Hive` profile works with any Hive file format.
     
    -PXF has three built-in profiles for Hive tables:
    +### <a id="profile_hive_using"></a>Example: Using the Hive Profile
     
    --   Hive
    --   HiveRC
    --   HiveText
    +Use the `Hive` profile to create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier.
     
    -The Hive profile works with any Hive storage type. 
    -The following example creates a readable HAWQ external table representing a Hive table named `accessories` in the `inventory` Hive database using the PXF Hive profile:
    +1. Create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier:
     
    -``` shell
    -$ psql -d postgres
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hiveprofile(location text, month text, num_orders int, total_sales float8)
    +                LOCATION ('pxf://namenode:51200/default.sales_info?PROFILE=Hive')
    +              FORMAT 'custom' (formatter='pxfwritable_import');
    +    ```
    +
    +2. Query the table:
    +
    +    ``` sql
    +    postgres=# SELECT * FROM salesinfo_hiveprofile;
    +    ```
    +
    +    ``` shell
    +       location    | month | num_orders | total_sales
    +    ---------------+-------+------------+-------------
    +     Prague        | Jan   |        101 |     4875.33
    +     Rome          | Mar   |         87 |     1557.39
    +     Bangalore     | May   |        317 |     8936.99
    +     ...
    +
    +    ```
    +
    +## <a id="profile_hivetext"></a>HiveText Profile
    +
    +Use the `HiveText` profile to query text formats. The `HiveText` profile is more performant than the `Hive` profile.
    +
    +**Note**: When using the `HiveText` profile, you *must* specify a delimiter option in *both* the `LOCATION` and `FORMAT` clauses.
    +
    +### <a id="profile_hivetext_using"></a>Example: Using the HiveText Profile
    +
    +Use the PXF `HiveText` profile to create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier.
    +
    +1. Create the external table:
    +
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hivetextprofile(location text, month text, num_orders int, total_sales float8)
    +                 LOCATION ('pxf://namenode:51200/default.sales_info?PROFILE=HiveText&DELIMITER=\x2c')
    +               FORMAT 'TEXT' (delimiter=E',');
    +    ```
    +
    +    (You can safely ignore the "nonstandard use of escape in a string literal" warning and related messages.)
    +
    +    Notice that:
    +
    +    - The `LOCATION` subclause `DELIMITER` value is specified in hexadecimal format. `\x` is a prefix that instructs PXF to interpret the following characters as hexadecimal. `2c` is the hex value for the comma character.
    +    - The `FORMAT` subclause `delimiter` value is specified as the single ascii comma character ','. `E` escapes the character.
    +
    +2. Query the external table:
    +
    +    ``` sql
    +    postgres=# SELECT * FROM salesinfo_hivetextprofile where location="Beijing";
    +    ```
    +
    +    ``` shell
    +     location | month | num_orders | total_sales
    +    ----------+-------+------------+-------------
    +     Beijing  | Jul   |        411 |    11600.67
    +     Beijing  | Dec   |        100 |     4248.41
    +    (2 rows)
    +    ```
    +
    +## <a id="profile_hiverc"></a>HiveRC Profile
    +
    +The RCFile Hive format is used for row columnar formatted data. The `HiveRC` profile provides access to RCFile data.
    +
    +### <a id="profile_hiverc_rcfiletbl_using"></a>Example: Using the HiveRC Profile
    +
    +Use the `HiveRC` profile to query RCFile-formatted data in Hive tables. The `HiveRC` profile is more performant than the `Hive` profile for this file format type.
    +
    +1. Create a Hive table with RCFile format:
    +
    +    ``` shell
    +    $ HADOOP_USER_NAME=hdfs hive
    +    ```
    +
    +    ``` sql
    +    hive> CREATE TABLE sales_info_rcfile (location string, month string,
    +            number_of_orders int, total_sales double)
    +          ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    +          STORED AS rcfile;
    +    ```
    +
    +2. Insert the data from the `sales_info` table into `sales_info_rcfile`:
    +
    +    ``` sql
    +    hive> INSERT INTO TABLE sales_info_rcfile SELECT * FROM sales_info;
    +    ```
    +
    +    A copy of the sample data set is now stored in RCFile format in `sales_info_rcfile`. 
    +    
    +3. Perform a Hive query on `sales_info_rcfile` to verify the data was loaded successfully:
    +
    +    ``` sql
    +    hive> SELECT * FROM sales_info_rcfile;
    +    ```
    +
    +4. Use the PXF `HiveRC` profile to create a queryable HAWQ external table from the Hive `sales_info_rcfile` table created in the previous step. When using the `HiveRC` profile, you **must** specify a delimiter option in *both* the `LOCATION` and `FORMAT` clauses.:
    +
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hivercprofile(location text, month text, num_orders int, total_sales float8)
    +                 LOCATION ('pxf://namenode:51200/default.sales_info_rcfile?PROFILE=HiveRC&DELIMITER=\x2c')
    +               FORMAT 'TEXT' (delimiter=E',');
    +    ```
    +
    +    (Again, you can safely ignore the "nonstandard use of escape in a string literal" warning and related messages.)
    +
    +5. Query the external table:
    +
    +    ``` sql
    +    postgres=# SELECT location, total_sales FROM salesinfo_hivercprofile;
    +    ```
    +
    +    ``` shell
    +       location    | total_sales
    +    ---------------+-------------
    +     Prague        |     4875.33
    +     Rome          |     1557.39
    +     Bangalore     |     8936.99
    +     Beijing       |    11600.67
    +     ...
    +    ```
    +
    +## <a id="topic_dbb_nz3_ts"></a>Accessing Parquet-Format Hive Tables
    +
    +The PXF `Hive` profile supports both non-partitioned and partitioned Hive tables that use the Parquet storage format in HDFS. Simply map the table columns using equivalent HAWQ data types. For example, if a Hive table is created using:
    +
    +``` sql
    +hive> CREATE TABLE hive_parquet_table (fname string, lname string, custid int, acctbalance double)
    +        STORED AS parquet;
     ```
     
    +Define the HAWQ external table using:
    +
     ``` sql
    -postgres=# CREATE EXTERNAL TABLE hivetest(id int, newid int)
    -LOCATION ('pxf://namenode:51200/inventory.accessories?PROFILE=Hive')
    -FORMAT 'custom' (formatter='pxfwritable_import');
    +postgres=# CREATE EXTERNAL TABLE pxf_parquet_table (fname text, lname text, custid int, acctbalance double precision)
    +    LOCATION ('pxf://namenode:51200/hive-db-name.hive_parquet_table?profile=Hive')
    +    FORMAT 'CUSTOM' (formatter='pxfwritable_import');
    --- End diff --
    
    There's no `select` statement after the table creation. Is that intentional?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

Posted by dyozie <gi...@git.apache.org>.

Github user dyozie commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85369947
  
    --- Diff: pxf/HivePXF.html.md.erb ---
    @@ -2,121 +2,450 @@
     title: Accessing Hive Data
     ---
     
    -This topic describes how to access Hive data using PXF. You have several options for querying data stored in Hive. You can create external tables in PXF and then query those tables, or you can easily query Hive tables by using HAWQ and PXF's integration with HCatalog. HAWQ accesses Hive table metadata stored in HCatalog.
    +Apache Hive is a distributed data warehousing infrastructure.  Hive facilitates managing large data sets supporting multiple data formats, including comma-separated value (.csv), RC, ORC, and parquet. The PXF Hive plug-in reads data stored in Hive, as well as HDFS or HBase.
    +
    +This section describes how to use PXF to access Hive data. Options for querying data stored in Hive include:
    +
    +-  Creating an external table in PXF and querying that table
    +-  Querying Hive tables via PXF's integration with HCatalog
     
     ## <a id="installingthepxfhiveplugin"></a>Prerequisites
     
    -Check the following before using PXF to access Hive:
    +Before accessing Hive data with HAWQ and PXF, ensure that:
     
    --   The PXF HDFS plug-in is installed on all cluster nodes.
    +-   The PXF HDFS plug-in is installed on all cluster nodes. See [Installing PXF Plug-ins](InstallPXFPlugins.html) for PXF plug-in installation information.
     -   The PXF Hive plug-in is installed on all cluster nodes.
     -   The Hive JAR files and conf directory�are installed on all cluster nodes.
    --   Test PXF on HDFS before connecting to Hive or HBase.
    +-   You have tested PXF on HDFS.
     -   You are running the Hive Metastore service on a machine in your cluster.�
     -   You have set the `hive.metastore.uris`�property in the�`hive-site.xml` on the NameNode.
     
    +## <a id="topic_p2s_lvl_25"></a>Hive File Formats
    +
    +Hive supports several file formats:
    +
    +-   TextFile - flat file with data in comma-, tab-, or space-separated value format or JSON notation
    +-   SequenceFile - flat file consisting of binary key/value pairs
    +-   RCFile - record columnar data consisting of binary key/value pairs; high row compression rate
    +-   ORCFile - optimized row columnar data with stripe, footer, and postscript sections; reduces data size
    +-   Parquet - compressed columnar data representation
    +-   Avro - JSON-defined, schema-based data serialization format
    +
    +Refer to [File Formats](https://cwiki.apache.org/confluence/display/Hive/FileFormats) for detailed information about the file formats supported by Hive.
    +
    +The PXF Hive plug-in supports the following profiles for accessing the Hive file formats listed above. These include:
    +
    +- `Hive`
    +- `HiveText`
    +- `HiveRC`
    +
    +## <a id="topic_p2s_lvl_29"></a>Data Type Mapping
    +
    +### <a id="hive_primdatatypes"></a>Primitive Data Types
    +
    +To represent Hive data in HAWQ, map data values that use a primitive data type to HAWQ columns of the same type.
    +
    +The following table summarizes external mapping rules for Hive primitive types.
    +
    +| Hive Data Type  | Hawq Data Type |
    +|-------|---------------------------|
    +| boolean    | bool |
    +| int   | int4 |
    +| smallint   | int2 |
    +| tinyint   | int2 |
    +| bigint   | int8 |
    +| decimal  |  numeric  |
    +| float   | float4 |
    +| double   | float8 |
    +| string   | text |
    +| binary   | bytea |
    +| char   | bpchar |
    +| varchar   | varchar |
    +| timestamp   | timestamp |
    +| date   | date |
    +
    +
    +### <a id="topic_b4v_g3n_25"></a>Complex Data Types
    +
    +Hive supports complex data types including array, struct, map, and union. PXF maps each of these complex types to `text`.  While HAWQ does not natively support these types, you can create HAWQ functions or application code to extract subcomponents of these complex data types.
    +
    +An example using complex data types is provided later in this topic.
    +
    +
    +## <a id="hive_sampledataset"></a>Sample Data Set
    +
    +Examples used in this topic will operate on a common data set. This simple data set models a retail sales operation and includes fields with the following names and data types:
    +
    +- location - text
    +- month - text
    +- number\_of\_orders - integer
    +- total\_sales - double
    +
    +Prepare the sample data set for use:
    +
    +1. First, create a text file:
    +
    +    ```
    +    $ vi /tmp/pxf_hive_datafile.txt
    +    ```
    +
    +2. Add the following data to `pxf_hive_datafile.txt`; notice the use of the comma `,` to separate the four field values:
    +
    +    ```
    +    Prague,Jan,101,4875.33
    +    Rome,Mar,87,1557.39
    +    Bangalore,May,317,8936.99
    +    Beijing,Jul,411,11600.67
    +    San Francisco,Sept,156,6846.34
    +    Paris,Nov,159,7134.56
    +    San Francisco,Jan,113,5397.89
    +    Prague,Dec,333,9894.77
    +    Bangalore,Jul,271,8320.55
    +    Beijing,Dec,100,4248.41
    +    ```
    +
    +Make note of the path to `pxf_hive_datafile.txt`; you will use it in later exercises.
    +
    +
     ## <a id="hivecommandline"></a>Hive Command Line
     
    -To start the Hive command line and work directly on a Hive table:
    +The Hive command line is a subsystem similar to that of `psql`. To start the Hive command line:
     
     ``` shell
    -$ hive
    +$ HADOOP_USER_NAME=hdfs hive
     ```
     
    -Here is an example�of how to create and load data into�a sample Hive table from an existing file.
    +The default Hive database is named `default`. 
     
    -``` sql
    -hive> CREATE TABLE test (name string, type string, supplier_key int, full_price double) row format delimited fields terminated by ',';
    -hive> LOAD DATA local inpath '/local/path/data.txt' into table test;
    -```
    +### <a id="hivecommandline_createdb"></a>Example: Create a Hive Database
     
    -## <a id="topic_p2s_lvl_25"></a>Using PXF Tables to Query Hive
    +Create a Hive table to expose our sample data set.
     
    -Hive tables are defined in a specific way in PXF, regardless of the underlying file storage format. The PXF Hive plug-ins automatically detect source tables in the following formats:
    +1. Create a Hive table named `sales_info` in the `default` database:
     
    --   Text based
    --   SequenceFile
    --   RCFile
    --   ORCFile
    --   Parquet
    --   Avro
    +    ``` sql
    +    hive> CREATE TABLE sales_info (location string, month string,
    +            number_of_orders int, total_sales double)
    +            ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    +            STORED AS textfile;
    +    ```
     
    -The source table can also be a combination of these types. The PXF Hive plug-in�uses this information to query the data in runtime.
    +    Notice that:
     
    --   **[Syntax Example](../pxf/HivePXF.html#syntax2)**
    +    - The `STORED AS textfile` subclause instructs Hive to create the table in Textfile (the default) format.  Hive Textfile format supports comma-, tab-, and space-separated values, as well as data specified in JSON notation.
    +    - The `DELIMITED FIELDS TERMINATED BY` subclause identifies the field delimiter within a data record (line). The `sales_info` table field delimiter is a comma (`,`).
     
    --   **[Hive Complex Types](../pxf/HivePXF.html#topic_b4v_g3n_25)**
    +2. Load the `pxf_hive_datafile.txt` sample data file into the `sales_info` table you just created:
     
    -### <a id="syntax2"></a>Syntax Example
    +    ``` sql
    +    hive> LOAD DATA local INPATH '/tmp/pxf_hive_datafile.txt'
    +            INTO TABLE sales_info;
    +    ```
    +
    +3. Perform a query on `sales_info` to verify the data was loaded successfully:
    +
    +    ``` sql
    +    hive> SELECT * FROM sales_info;
    +    ```
     
    -The following�PXF table definition is valid for any Hive file storage type.
    +In examples later in this section, you will access the `sales_info` Hive table directly via PXF. You will also insert `sales_info` data into tables of other Hive file format types, and use PXF to access those directly as well.
    +
    +## <a id="topic_p2s_lvl_28"></a>Querying External Hive Data
    +
    +The PXF Hive plug-in supports several Hive-related profiles. These include `Hive`, `HiveText`, and `HiveRC`.
    +
    +Use the following syntax to create a HAWQ external table representing Hive data:
     
     ``` sql
    -CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name 
    -    ( column_name data_type [, ...] | LIKE other_table )
    -LOCATION ('pxf://namenode[:port]/hive-db-name.hive-table-name?<pxf parameters>[&custom-option=value...]')FORMAT 'CUSTOM' (formatter='pxfwritable_import')
    +CREATE EXTERNAL TABLE <table_name>
    +    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
    +LOCATION ('pxf://<host>[:<port>]/<hive-db-name>.<hive-table-name>
    +    ?PROFILE=Hive|HiveText|HiveRC[&DELIMITER=<delim>'])
    +FORMAT 'CUSTOM|TEXT' (formatter='pxfwritable_import' | delimiter='<delim>')
     ```
     
    -where `<pxf parameters>` is:
    +Hive-plug-in-specific keywords and values used in the [CREATE EXTERNAL TABLE](../reference/sql/CREATE-EXTERNAL-TABLE.html) call are described below.
     
    -``` pre
    -   FRAGMENTER=fragmenter_class&ACCESSOR=accessor_class&RESOLVER=resolver_class]
    - | PROFILE=profile-name
    -```
    +| Keyword  | Value |
    +|-------|-------------------------------------|
    +| \<host\>[:<port\>]    | The HDFS NameNode and port. |
    +| \<hive-db-name\>    | Name of the Hive database. If omitted, defaults to the Hive database named `default`. |
    +| \<hive-table-name\>    | Name of the Hive table. |
    +| PROFILE    | The `PROFILE` keyword must specify one of the values `Hive`, `HiveText`, or `HiveRC`. |
    +| DELIMITER    | The `DELIMITER` clause is required for both the `HiveText` and `HiveRC` profiles and identifies the field delimiter used in the Hive data set.  \<delim\> must be a single ascii character or specified in hexadecimal representation. |
    +| FORMAT (`Hive` profile)   | The `FORMAT` clause must specify `CUSTOM`. The `CUSTOM` format supports only the built-in `pxfwritable_import` `formatter`.   |
    +| FORMAT (`HiveText` and `HiveRC` profiles) | The `FORMAT` clause must specify `TEXT`. The `delimiter` must be specified a second time in '\<delim\>'. |
     
     
    -If `hive-db-name` is omitted, pxf will default to the Hive `default` database.
    +## <a id="profile_hive"></a>Hive Profile
     
    -**Note:** The port is the connection port for the PXF service. If the port is omitted, PXF assumes that High Availability (HA) is enabled and connects to the HA name service port, 51200 by default. The HA name service port can be changed by setting the pxf\_service\_port configuration parameter.
    +The `Hive` profile works with any Hive file format.
     
    -PXF has three built-in profiles for Hive tables:
    +### <a id="profile_hive_using"></a>Example: Using the Hive Profile
     
    --   Hive
    --   HiveRC
    --   HiveText
    +Use the `Hive` profile to create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier.
     
    -The Hive profile works with any Hive storage type. 
    -The following example creates a readable HAWQ external table representing a Hive table named `accessories` in the `inventory` Hive database using the PXF Hive profile:
    +1. Create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier:
     
    -``` shell
    -$ psql -d postgres
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hiveprofile(location text, month text, num_orders int, total_sales float8)
    +                LOCATION ('pxf://namenode:51200/default.sales_info?PROFILE=Hive')
    +              FORMAT 'custom' (formatter='pxfwritable_import');
    +    ```
    +
    +2. Query the table:
    +
    +    ``` sql
    +    postgres=# SELECT * FROM salesinfo_hiveprofile;
    +    ```
    +
    +    ``` shell
    +       location    | month | num_orders | total_sales
    +    ---------------+-------+------------+-------------
    +     Prague        | Jan   |        101 |     4875.33
    +     Rome          | Mar   |         87 |     1557.39
    +     Bangalore     | May   |        317 |     8936.99
    +     ...
    +
    +    ```
    +
    +## <a id="profile_hivetext"></a>HiveText Profile
    +
    +Use the `HiveText` profile to query text formats. The `HiveText` profile is more performant than the `Hive` profile.
    +
    +**Note**: When using the `HiveText` profile, you *must* specify a delimiter option in *both* the `LOCATION` and `FORMAT` clauses.
    +
    +### <a id="profile_hivetext_using"></a>Example: Using the HiveText Profile
    +
    +Use the PXF `HiveText` profile to create a queryable HAWQ external table from the Hive `sales_info` textfile format table created earlier.
    +
    +1. Create the external table:
    +
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hivetextprofile(location text, month text, num_orders int, total_sales float8)
    +                 LOCATION ('pxf://namenode:51200/default.sales_info?PROFILE=HiveText&DELIMITER=\x2c')
    +               FORMAT 'TEXT' (delimiter=E',');
    +    ```
    +
    +    (You can safely ignore the "nonstandard use of escape in a string literal" warning and related messages.)
    +
    +    Notice that:
    +
    +    - The `LOCATION` subclause `DELIMITER` value is specified in hexadecimal format. `\x` is a prefix that instructs PXF to interpret the following characters as hexadecimal. `2c` is the hex value for the comma character.
    +    - The `FORMAT` subclause `delimiter` value is specified as the single ascii comma character ','. `E` escapes the character.
    +
    +2. Query the external table:
    +
    +    ``` sql
    +    postgres=# SELECT * FROM salesinfo_hivetextprofile where location="Beijing";
    +    ```
    +
    +    ``` shell
    +     location | month | num_orders | total_sales
    +    ----------+-------+------------+-------------
    +     Beijing  | Jul   |        411 |    11600.67
    +     Beijing  | Dec   |        100 |     4248.41
    +    (2 rows)
    +    ```
    +
    +## <a id="profile_hiverc"></a>HiveRC Profile
    +
    +The RCFile Hive format is used for row columnar formatted data. The `HiveRC` profile provides access to RCFile data.
    +
    +### <a id="profile_hiverc_rcfiletbl_using"></a>Example: Using the HiveRC Profile
    +
    +Use the `HiveRC` profile to query RCFile-formatted data in Hive tables. The `HiveRC` profile is more performant than the `Hive` profile for this file format type.
    +
    +1. Create a Hive table with RCFile format:
    +
    +    ``` shell
    +    $ HADOOP_USER_NAME=hdfs hive
    +    ```
    +
    +    ``` sql
    +    hive> CREATE TABLE sales_info_rcfile (location string, month string,
    +            number_of_orders int, total_sales double)
    +          ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    +          STORED AS rcfile;
    +    ```
    +
    +2. Insert the data from the `sales_info` table into `sales_info_rcfile`:
    +
    +    ``` sql
    +    hive> INSERT INTO TABLE sales_info_rcfile SELECT * FROM sales_info;
    +    ```
    +
    +    A copy of the sample data set is now stored in RCFile format in `sales_info_rcfile`. 
    +    
    +3. Perform a Hive query on `sales_info_rcfile` to verify the data was loaded successfully:
    +
    +    ``` sql
    +    hive> SELECT * FROM sales_info_rcfile;
    +    ```
    +
    +4. Use the PXF `HiveRC` profile to create a queryable HAWQ external table from the Hive `sales_info_rcfile` table created in the previous step. When using the `HiveRC` profile, you **must** specify a delimiter option in *both* the `LOCATION` and `FORMAT` clauses.:
    +
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE salesinfo_hivercprofile(location text, month text, num_orders int, total_sales float8)
    +                 LOCATION ('pxf://namenode:51200/default.sales_info_rcfile?PROFILE=HiveRC&DELIMITER=\x2c')
    +               FORMAT 'TEXT' (delimiter=E',');
    +    ```
    +
    +    (Again, you can safely ignore the "nonstandard use of escape in a string literal" warning and related messages.)
    +
    +5. Query the external table:
    +
    +    ``` sql
    +    postgres=# SELECT location, total_sales FROM salesinfo_hivercprofile;
    +    ```
    +
    +    ``` shell
    +       location    | total_sales
    +    ---------------+-------------
    +     Prague        |     4875.33
    +     Rome          |     1557.39
    +     Bangalore     |     8936.99
    +     Beijing       |    11600.67
    +     ...
    +    ```
    +
    +## <a id="topic_dbb_nz3_ts"></a>Accessing Parquet-Format Hive Tables
    +
    +The PXF `Hive` profile supports both non-partitioned and partitioned Hive tables that use the Parquet storage format in HDFS. Simply map the table columns using equivalent HAWQ data types. For example, if a Hive table is created using:
    +
    +``` sql
    +hive> CREATE TABLE hive_parquet_table (fname string, lname string, custid int, acctbalance double)
    +        STORED AS parquet;
     ```
     
    +Define the HAWQ external table using:
    +
     ``` sql
    -postgres=# CREATE EXTERNAL TABLE hivetest(id int, newid int)
    -LOCATION ('pxf://namenode:51200/inventory.accessories?PROFILE=Hive')
    -FORMAT 'custom' (formatter='pxfwritable_import');
    +postgres=# CREATE EXTERNAL TABLE pxf_parquet_table (fname text, lname text, custid int, acctbalance double precision)
    +    LOCATION ('pxf://namenode:51200/hive-db-name.hive_parquet_table?profile=Hive')
    +    FORMAT 'CUSTOM' (formatter='pxfwritable_import');
     ```
     
    +## <a id="profileperf"></a>Profile Performance Considerations
     
    -Use HiveRC and HiveText to query RC and Text formats respectively. The HiveRC and HiveText profiles are faster than the generic Hive profile. When using the HiveRC and HiveText profiles, you must specify a DELIMITER option in the LOCATION clause. See [Using Profiles to Read and Write Data](ReadWritePXF.html#readingandwritingdatawithpxf) for more information on profiles.
    +The `HiveRC` and `HiveText` profiles are faster than the generic `Hive` profile.
     
    +?? MORE HERE. ??
     
    -### <a id="topic_b4v_g3n_25"></a>Hive Complex Types
    +## <a id="complex_dt_example"></a>Complex Data Type Example
     
    -PXF tables support Hive data types that are not primitive types. The supported Hive complex data types are array, struct, map, and union. This Hive `CREATE TABLE` statement, for example, creates a table with each of these complex data types:
    +This example will employ the array and map complex types, specifically an array of integers and a string key/value pair map.
     
    -``` sql
    -hive> CREATE TABLE sales_collections (
    -  item STRING,
    -  price FLOAT,
    -  properties ARRAY<STRING>,
    -  hash MAP<STRING,FLOAT>,
    -  delivery_address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>,
    -  others UNIONTYPE<FLOAT, BOOLEAN, STRING>
    -)  
    -ROW FORMAT DELIMITED FIELDS
    -TERMINATED BY '\001' COLLECTION ITEMS TERMINATED BY '\002' MAP KEYS TERMINATED BY '\003' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
    -hive> LOAD DATA LOCAL INPATH '/local/path/<some data file>' INTO TABLE sales_collection;
    -```
    +The example data set includes fields with the following names and data types:
    +
    +- index - int
    +- name - string
    +- intarray - array of integers
    +- propmap - map of string key and value pairs
    +
    +When specifying an array field in a Hive table, you must identify the terminator for each item in the collection. Similarly, the map key termination character must also be specified.
    +
    +1. Create a text file from which you will load the data set:
    +
    +    ```
    +    $ vi /tmp/pxf_hive_complex.txt
    +    ```
    +
    +2. Add the following data to `pxf_hive_complex.txt`.  The data uses a comma `,` to separate field values, the percent symbol `%` to separate collection items, and a `:` to terminate map key values:
    +
    +    ```
    +    3,Prague,1%2%3,zone:euro%status:up
    +    89,Rome,4%5%6,zone:euro
    +    400,Bangalore,7%8%9,zone:apac%status:pending
    +    183,Beijing,0%1%2,zone:apac
    +    94,Sacramento,3%4%5,zone:noam%status:down
    +    101,Paris,6%7%8,zone:euro%status:up
    +    56,Frankfurt,9%0%1,zone:euro
    +    202,Jakarta,2%3%4,zone:apac%status:up
    +    313,Sydney,5%6%7,zone:apac%status:pending
    +    76,Atlanta,8%9%0,zone:noam%status:down
    +    ```
    +
    +3. Create a Hive table to represent this data:
    +
    +    ``` shell
    +    $ HADOOP_USER_NAME=hdfs hive
    +    ```
    +
    +    ``` sql
    +    hive> CREATE TABLE table_complextypes( index int, name string, intarray ARRAY<int>, propmap MAP<string, string>)
    +             ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    +             COLLECTION ITEMS TERMINATED BY '%'
    +             MAP KEYS TERMINATED BY ':'
    +             STORED AS TEXTFILE;
    +    ```
    +
    +    Notice that:
    +
    +    - `FIELDS TERMINATED BY` identifies a comma as the field terminator.
    +    - The `COLLECTION ITEMS TERMINATED BY` subclause specifies the percent sign as the collection items (array item, map key/value pair) terminator.
    +    - `MAP KEYS TERMINATED BY` identifies a colon as the terminator for map keys.
    +
    +4. Load the `pxf_hive_complex.txt` sample data file into the `table_complextypes` table you just created:
    +
    +    ``` sql
    +    hive> LOAD DATA local INPATH '/tmp/pxf_hive_complex.txt' INTO TABLE table_complextypes;
    +    ```
    +
    +5. Perform a query on Hive table `table_complextypes` to verify the data was loaded successfully:
    +
    +    ``` sql
    +    hive> SELECT * FROM table_complextypes;
    +    ```
    +
    +    ``` shell
    +    3	Prague	[1,2,3]	{"zone":"euro","status":"up"}
    +    89	Rome	[4,5,6]	{"zone":"euro"}
    +    400	Bangalore	[7,8,9]	{"zone":"apac","status":"pending"}
    +    ...
    +    ```
    +
    +6. Use the PXF `Hive` profile to create a queryable HAWQ external table representing the Hive `table_complextypes`:
     
    -You can use HAWQ functions or application code to extract the components of the complex data columns as needed.
    +    ``` sql
    +    postgres=# CREATE EXTERNAL TABLE complextypes_hiveprofile(index int, name text, intarray text, propmap text)
    +                 LOCATION ('pxf://namenode:51200/table_complextypes?PROFILE=Hive')
    +               FORMAT 'CUSTOM' (formatter='pxfwritable_import');
    +    ```
    +
    +    Notice that the integer array and map complex types are mapped to type text.
    +
    +7. Query the external table:
    +
    +    ``` sql
    +    postgres=# SELECT * FROM complextypes_hiveprofile;
    +    ```
    +
    +    ``` shell     
    +     index |    name    | intarray |              propmap
    +    -------+------------+----------+------------------------------------
    +         3 | Prague     | [1,2,3]  | {"zone":"euro","status":"up"}
    +        89 | Rome       | [4,5,6]  | {"zone":"euro"}
    +       400 | Bangalore  | [7,8,9]  | {"zone":"apac","status":"pending"}
    +       183 | Beijing    | [0,1,2]  | {"zone":"apac"}
    +        94 | Sacramento | [3,4,5]  | {"zone":"noam","status":"down"}
    +       101 | Paris      | [6,7,8]  | {"zone":"euro","status":"up"}
    +        56 | Frankfurt  | [9,0,1]  | {"zone":"euro"}
    +       202 | Jakarta    | [2,3,4]  | {"zone":"apac","status":"up"}
    +       313 | Sydney     | [5,6,7]  | {"zone":"apac","status":"pending"}
    +        76 | Atlanta    | [8,9,0]  | {"zone":"noam","status":"down"}
    +    (10 rows)
    +    ```
    +
    +    `intarray` and `propmap` are each text strings.
     
     ## <a id="hcatalog"></a>Using PXF and HCatalog to Query Hive
    --- End diff --
    
    I'm wondering if we should put this method first, since it seems so much more convenient than the others?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-hawq-docs pull request #39: HAWQ-1071 - add examples for HiveText ...

Posted by kavinderd <gi...@git.apache.org>.

Github user kavinderd commented on a diff in the pull request:

    https://github.com/apache/incubator-hawq-docs/pull/39#discussion_r85378363
  
    --- Diff: pxf/HivePXF.html.md.erb ---
    @@ -339,21 +601,21 @@ postgres=# CREATE EXTERNAL TABLE pxf_sales_part(
       delivery_state TEXT, 
       delivery_city TEXT
     )
    -LOCATION ('pxf://namenode_host:51200/sales_part?Profile=Hive')
    +LOCATION ('pxf://namenode:51200/sales_part?Profile=Hive')
     FORMAT 'custom' (FORMATTER='pxfwritable_import');
     
     postgres=# SELECT * FROM pxf_sales_part;
     ```
     
    -### <a id="example3"></a>Example
    +### <a id="example3"></a>Query Without Pushdown
     
     In the following example, the HAWQ query filters the `delivery_city` partition `Sacramento`. The filter on� `item_name` is not pushed down, since it is not a partition column. It is performed on the HAWQ side after all the data on `Sacramento` is transferred for processing.
     
     ``` sql
    -postgres=# SELECT * FROM pxf_sales_part WHERE delivery_city = 'Sacramento' AND item_name = 'shirt';
    +postgres=# SELECT * FROM pxf_sales_part WHERE delivery_city = 'Sacramento' AND item_name = 'cube';
     ```
     
    -### <a id="example4"></a>Example
    +### <a id="example4"></a>Query With Pushdown
    --- End diff --
    
    Somewhere it should be stated that the HAWQ GUC `pxf_enable_filter_pushdown` needs to be turned on. If this is off no filter pushdown will occur regardless of the nature of the query.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---