Posted to issues@hawq.apache.org by "Kyle R Dunn (JIRA)" <ji...@apache.org> on 2017/02/14 17:55:41 UTC

[jira] [Comment Edited] (HAWQ-1234) Document HAWQ to PXF APIs

    [ https://issues.apache.org/jira/browse/HAWQ-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866238#comment-15866238 ] 

Kyle R Dunn edited comment on HAWQ-1234 at 2/14/17 5:55 PM:
------------------------------------------------------------

I did some initial exploration of the HAWQ -> PXF communication chain for a different purpose, and I'm pasting in what I've learned so far. PXF itself does not store metadata: either HAWQ provides it directly or HCatalog can be queried for it. I'm showing the latter. PXF expects the metadata about the data, along with some other pieces, to be provided as HTTP headers, which it appears to convert to a hashmap on the server side, as shown [here|https://github.com/apache/incubator-hawq/blob/master/pxf/pxf-service/src/main/java/org/apache/hawq/pxf/service/rest/RestResource.java#L52].
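
As a rough sketch of the header convention (this is not HAWQ or PXF code), the per-column headers could be assembled like this in Python. The helper name {{build_attr_headers}} is hypothetical. The header names are copied from the curl requests in this comment, and type code 25 is simply the value those requests send for {{text}} columns.

```python
def build_attr_headers(columns):
    """Return X-GP-ATTR-* headers for a list of (name, typecode, typename) tuples.

    Hypothetical helper: the header names mirror the curl requests in this
    comment; nothing here is taken from the HAWQ source.
    """
    headers = {"X-GP-ATTRS": str(len(columns))}
    for i, (name, typecode, typename) in enumerate(columns):
        headers[f"X-GP-ATTR-NAME{i}"] = name
        headers[f"X-GP-ATTR-TYPECODE{i}"] = str(typecode)
        headers[f"X-GP-ATTR-TYPENAME{i}"] = typename
    return headers


# The two text columns of the kdtest table.
headers = build_attr_headers([("key", 25, "text"), ("value", 25, "text")])
for name, value in headers.items():
    print(f"{name}: {value}")
```

Each printed line has the same shape as one {{-H}} argument to curl.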

Get the PXF server's protocol version
{code}
$ curl 'http://localhost:51200/pxf/ProtocolVersion'
{ "version": "v14"} 
{code}
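
The other requests in this comment use the returned version string as a path segment ({{/pxf/v14/...}}). A minimal Python sketch of that step follows. The function name {{versioned_url}} is hypothetical, and whether HAWQ itself builds its URLs this way is an assumption on my part.

```python
import json
from urllib.parse import urlencode


def versioned_url(base, version_body, resource, **params):
    """Build a PXF REST URL from the body returned by /pxf/ProtocolVersion.

    Hypothetical helper; the URL layout is copied from the curl examples.
    """
    version = json.loads(version_body)["version"]
    return f"{base}/pxf/{version}/{resource}?{urlencode(params)}"


url = versioned_url(
    "http://localhost:51200",
    '{ "version": "v14"}',  # body returned by the ProtocolVersion call
    "Metadata/getMetadata",
    profile="Hive",
    pattern="default.kdtest",
)
print(url)
```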

Get metadata from HCatalog for a Hive table called "kdtest" in the "default" database
{code}
$ curl -i -H "X-GP-SEGMENT-ID: -100005432" -H "X-GP-SEGMENT-COUNT: 0" -H "X-GP-XID: 2724107" -H "X-GP-ALIGNMENT: 8" -H "X-GP-URL-HOST: localhost" -H "X-GP-URL-PORT: 51200" -H "X-GP-URI: localhost:51200/" -H "X-GP-HAS-FILTER: 0" 'localhost:51200/pxf/v14/Metadata/getMetadata?profile=Hive&pattern=default.kdtest'
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Type: application/json
Content-Length: 132
Date: Tue, 14 Feb 2017 05:06:11 GMT

{"PXFMetadata":[{"item":{"path":"default","name":"kdtest"},"fields":[{"name":"key","type":"text"},{"name":"value","type":"text"}]}]}
{code}
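
The JSON body is easy to consume. Here is a small Python sketch that turns the response above into a column list per table. The function name {{list_columns}} is made up for illustration, and the field names are the ones visible in the response.

```python
import json

# Response body copied from the getMetadata call above.
response_body = (
    '{"PXFMetadata":[{"item":{"path":"default","name":"kdtest"},'
    '"fields":[{"name":"key","type":"text"},{"name":"value","type":"text"}]}]}'
)


def list_columns(body):
    """Return {"db.table": [(column, type), ...]} from a getMetadata response."""
    tables = {}
    for entry in json.loads(body)["PXFMetadata"]:
        item = entry["item"]
        qualified = f'{item["path"]}.{item["name"]}'
        tables[qualified] = [(f["name"], f["type"]) for f in entry["fields"]]
    return tables


tables = list_columns(response_body)
print(tables)
```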

Get the PXF "Fragments" for the above table's data (the request declares {{TEXT}} format, {{GPDBWritable}} is also valid). Note that the response lists fragment descriptors, not the rows themselves.
{code}
$ curl -i -H "X-GP-SEGMENT-ID: -100005432" -H "X-GP-SEGMENT-COUNT: 0" -H "X-GP-XID: 2724107" -H "X-GP-ALIGNMENT: 8" -H "X-GP-URL-HOST: localhost" -H "X-GP-URL-PORT: 51200" -H "X-GP-URI: pxf://localhost:51200/default.kdtest?Profile=Hive" -H "X-GP-HAS-FILTER: 0" -H "X-GP-FORMAT: TEXT" -H "X-GP-ATTRS: 2" -H "X-GP-ATTR-NAME0: key" -H "X-GP-ATTR-TYPECODE0: 25" -H "X-GP-ATTR-TYPENAME0: text" -H "X-GP-ATTR-NAME1: value" -H "X-GP-ATTR-TYPECODE1: 25" -H "X-GP-ATTR-TYPENAME1: text" -H "X-GP-Profile: Hive" -H "X-GP-DATA-DIR: default.kdtest" 'http://localhost:51200/pxf/v14/Fragmenter/getFragments?path=/apps/hive/warehouse/kdtest'

HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Type: application/json
Content-Length: 1305
Date: Tue, 14 Feb 2017 05:30:05 GMT

{"PXFFragments":[{"sourceName":"/apps/hive/warehouse/kdtest/hive-test-data.txt","index":0,"replicas":["10.215.181.12","10.215.181.11"],"metadata":"rO0ABXcQAAAAAAAAAAAAAAAAAAAAN3VyABNbTGphdmEubGFuZy5TdHJpbmc7rdJW5+kde0cCAAB4cAAAAAJ0AB1jbHBxbjFwZGhkYmRuMDIuaW5mb3NvbGNvLm5ldHQAHWNscHFuMXBkaGRiZG4wMS5pbmZvc29sY28ubmV0","userData":"b3JnLmFwYWNoZS5oYWRvb3AubWFwcmVkLlRleHRJbnB1dEZvcm1hdCFIVUREIW9yZy5hcGFjaGUuaGFkb29wLmhpdmUuc2VyZGUyLmxhenkuTGF6eVNpbXBsZVNlckRlIUhVREQhIwojTW9uIEZlYiAxMyAyMToyOTozNSBQU1QgMjAxNwpuYW1lPWRlZmF1bHQua2R0ZXN0Cm51bUZpbGVzPTEKZmllbGQuZGVsaW09LApjb2x1bW5zLnR5cGVzPXN0cmluZ1w6c3RyaW5nCnNlcmlhbGl6YXRpb24uZGRsPXN0cnVjdCBrZHRlc3QgeyBzdHJpbmcga2V5LCBzdHJpbmcgdmFsdWV9CmNvbHVtbnM9a2V5LHZhbHVlCnNlcmlhbGl6YXRpb24uZm9ybWF0PSwKY29sdW1ucy5jb21tZW50cz1cdTAwMDAKYnVja2V0X2NvdW50PS0xCnNlcmlhbGl6YXRpb24ubGliPW9yZy5hcGFjaGUuaGFkb29wLmhpdmUuc2VyZGUyLmxhenkuTGF6eVNpbXBsZVNlckRlCkNPTFVNTl9TVEFUU19BQ0NVUkFURT10cnVlCmZpbGUuaW5wdXRmb3JtYXQ9b3JnLmFwYWNoZS5oYWRvb3AubWFwcmVkLlRleHRJbnB1dEZvcm1hdAp0b3RhbFNpemU9NTUKZmlsZS5vdXRwdXRmb3JtYXQ9b3JnLmFwYWNoZS5oYWRvb3AuaGl2ZS5xbC5pby5IaXZlSWdub3JlS2V5VGV4dE91dHB1dEZvcm1hdApsb2NhdGlvbj1oZGZzXDovL2NscHFuMXBkaGRibW4wMS5pbmZvc29sY28ubmV0XDo4MDIwL2FwcHMvaGl2ZS93YXJlaG91c2Uva2R0ZXN0CnRyYW5zaWVudF9sYXN0RGRsVGltZT0xNDg3MDA2NDg4CiFIVUREISFITlBUISFIVUREIWZhbHNl"}]}
{code}
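
The {{userData}} field in the response above is base64 text. The Python sketch below decodes only the first 136 characters of that value to keep the example short. Treating {{!HUDD!}} as a field separator is my inference from the decoded text, not something I have confirmed in the PXF source.

```python
import base64

# First 136 characters of the "userData" value from the response above;
# the remainder is left out here for brevity.
USER_DATA_PREFIX = (
    "b3JnLmFwYWNoZS5oYWRvb3AubWFwcmVkLlRleHRJbnB1dEZvcm1hdCFIVURE"
    "IW9yZy5hcGFjaGUuaGFkb29wLmhpdmUuc2VyZGUyLmxhenkuTGF6eVNpbXBs"
    "ZVNlckRlIUhVREQh"
)

decoded = base64.b64decode(USER_DATA_PREFIX).decode("utf-8")

# Assumption: "!HUDD!" separates the fields; drop the empty trailing piece.
parts = [p for p in decoded.split("!HUDD!") if p]
print(parts)
```

The two pieces are the InputFormat and SerDe class names, which match the storage information that Hive reports for the table.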

The Hive table looks like this:
{code}
hive> describe formatted kdtest;
OK
# col_name              data_type               comment

key                     string
value                   string

# Detailed Table Information
Database:               default
Owner:                  kdunn
CreateTime:             Mon Feb 13 09:20:40 PST 2017
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://nowhere.com:8020/apps/hive/warehouse/kdtest
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   true
        numFiles                1
        totalSize               55
        transient_lastDdlTime   1487006488

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        field.delim             ,
        serialization.format    ,
Time taken: 0.373 seconds, Fetched: 31 row(s)
{code}

The data in it is this:
{code}
hive> select * from kdtest;
OK
somekey somevalue
1234    56789
hello   world
aloha   mondays
Time taken: 0.043 seconds, Fetched: 4 row(s)
{code}

The raw data was this:
{code}
$ cat /tmp/hive-test-data.txt
somekey,somevalue
1234,56789
hello,world
aloha,mondays
{code}
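
As a quick sanity check, the raw file can be tied back to the table details with a few lines of Python. This assumes every line, including the last, ends with a single newline, which the {{cat}} output alone does not prove.

```python
# Contents of /tmp/hive-test-data.txt as shown above, assuming each line
# (including the last one) ends with exactly one newline.
RAW = "somekey,somevalue\n1234,56789\nhello,world\naloha,mondays\n"

# Split on the same ',' delimiter the table uses for its fields.
rows = [tuple(line.split(",")) for line in RAW.splitlines()]
size = len(RAW.encode("utf-8"))

print(rows)
print(size)
```

Under that assumption the size comes out to 55 bytes, the same as {{totalSize}} in the {{describe formatted}} output, and the four rows match what {{select *}} returns.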

Hive DDL and DML:
{code}
hive> CREATE TABLE kdtest (key string, value string) row format delimited fields terminated by ',';
hive> LOAD DATA local inpath '/tmp/hive-test-data.txt' into table kdtest;
{code}


> Document HAWQ to PXF APIs
> -------------------------
>
>                 Key: HAWQ-1234
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1234
>             Project: Apache HAWQ
>          Issue Type: Sub-task
>          Components: PXF
>            Reporter: Roman Shaposhnik
>            Assignee: Roman Shaposhnik
>         Attachments: PXFAdvancedStatsplan.pdf
>
>
> It would be very useful to start documenting the HAWQ to PXF APIs. The right places to start are:
>    * libchurl (a thin libcurl wrapper that lets HAWQ C code make REST calls):
> https://github.com/apache/incubator-hawq/blob/master/src/include/access/libchurl.h
> https://github.com/apache/incubator-hawq/blob/master/src/backend/access/external/libchurl.c
>    * pxfmasterapi (mostly metadata calls made by the master):
> https://github.com/apache/incubator-hawq/blob/master/src/backend/access/external/pxfmasterapi.c
> Here you will find how HAWQ uses the PXF REST API to get external metadata, plus some logic to parse the JSON response.
>    * gpbridgeapi (segment calls to PXF):
> https://github.com/apache/incubator-hawq/blob/master/src/bin/gpfusion/gpbridgeapi.c
> Here you will find further examples (read and write calls) of how external data is accessed.
> A design doc on PXF's support for analyze (PXF's analyzer) is attached.


