Posted to dev@hawq.apache.org by Goden Yao <go...@apache.org> on 2016/09/25 20:27:05 UTC

PXF question with HAWQInputFormat to migrate data 1.x -> 2.x

+ dev mailing list, modified the title.

Hi Kyle.

Based on your description, your scenario is (as I understand it):
1. HAWQ 1.x cluster installed.
2. HAWQ 2.x cluster installed on the same nodes.
3. Data migration (ETL) from HAWQ 1.x files to HAWQ 2.x using PXF (from the
2.x installation).

Is that correct?
So you want to develop a custom PXF plugin that can read HAWQ 1.x Parquet
data as external tables on HDFS, then insert it into new HAWQ 2.x native tables?
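
If so, structurally the plugin would be the usual Fragmenter/Accessor/Resolver
trio, with the Resolver turning each HAWQ 1.x record into PXF OneField values
tagged with DataType OIDs. A very rough sketch of the Resolver half follows;
the class and method names are from memory, and the three-column layout is
purely illustrative, so treat this as pseudocode rather than working code:

import java.util.ArrayList;
import java.util.List;

import org.apache.hawq.pxf.api.OneField;
import org.apache.hawq.pxf.api.OneRow;
import org.apache.hawq.pxf.api.ReadResolver;
import org.apache.hawq.pxf.api.io.DataType;
import org.apache.hawq.pxf.api.utilities.InputData;
import org.apache.hawq.pxf.api.utilities.Plugin;

import com.pivotal.hawq.mapreduce.HAWQRecord;

// Sketch of the Resolver half of a hypothetical "Hawq1x" profile.
public class Hawq1xResolver extends Plugin implements ReadResolver {

    public Hawq1xResolver(InputData input) {
        super(input);
    }

    @Override
    public List<OneField> getFields(OneRow row) throws Exception {
        // The matching Accessor would hand over a HAWQRecord in the row's data slot.
        HAWQRecord record = (HAWQRecord) row.getData();
        List<OneField> fields = new ArrayList<OneField>();

        // Illustrative 3-column table (int, float8, text); a real plugin would
        // walk the HAWQ schema and switch on each field's type instead.
        fields.add(new OneField(DataType.INTEGER.getOID(), record.getInt(1)));
        fields.add(new OneField(DataType.FLOAT8.getOID(), record.getDouble(2)));
        fields.add(new OneField(DataType.TEXT.getOID(), record.getString(3)));
        return fields;
    }
}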

According to the 1.3 doc:
http://hdb.docs.pivotal.io/131/topics/HAWQInputFormatforMapReduce.html#hawqinputformatexample


1) To use *HAWQInputFormat*, it'll require you to also run HAWQ 1.x (as it
requires a database URL to access metadata), so this means you need to run
1.x and 2.x side by side. In theory, it should be doable, but
configuration-wise, no one has tried this.
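
From memory, the job setup in that doc's example is roughly the snippet below;
the db_url/username/password arguments are exactly what forces a live 1.x
instance to be around. All names and values here are quoted from memory or
made up for illustration, so double-check them against
com.pivotal.hawq.mapreduce.HAWQInputFormat and the 1.3 example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import com.pivotal.hawq.mapreduce.HAWQInputFormat;

public class Hawq1xReadJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "hawq1x-read");

        // Metadata is fetched from the running HAWQ 1.x master via its
        // database URL, which is why 1.x has to stay up for this code path.
        HAWQInputFormat.setInput(job.getConfiguration(),
                "localhost:5432/postgres",   // 1.x master URL (illustrative)
                "gpadmin", "",               // user / password (illustrative)
                "public.my_parquet_table");  // 1.x table name (illustrative)
        job.setInputFormatClass(HAWQInputFormat.class);

        // Mapper, output format, etc. as in the doc's example, then:
        // job.waitForCompletion(true);
    }
}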

2) If you run HAWQ side by side, this means PXF will run side by side as
well - you have to make sure there are no port conflicts or ambiguity about
which PXF version you are invoking.

That's all I can think of for now.
-Goden

On Fri, Sep 23, 2016 at 12:10 PM Kyle Dunn <kd...@pivotal.io> wrote:

> Glad to hear Resolver is the only other piece - should work out nicely.
>
> So I'm looking at bolting on HAWQInputFormat to PXF (which actually looks
> quite straightforward) and I just want to ensure as many column types are
> supported as possible. This is motivated by needing to be able to read
> orphaned HAWQ 1.x files with PXF in HDB/HAWQ 2.x. This will make "in-place"
> upgrades much simpler.
>
> Here is the list of datatypes HAWQInputFormat supports, and the potential
> mapping to PXF types:
>
> [image: pasted1]
>
>
>
> On Fri, Sep 23, 2016 at 12:51 PM Goden Yao <go...@apache.org> wrote:
>
>> Thanks for the wishes.
>> Are you talking about developing a new plugin (a new data source)?
>> Mapping data types has 2 parts:
>> 1. What PXF recognizes from HAWQ - this is
>> https://github.com/apache/incubator-hawq/blob/master/pxf/pxf-api/src/main/java/org/apache/hawq/pxf/api/io/DataType.java
>>
>> 2. What plugins recognize and want to convert to a HAWQ type (the Resolver).
>> sample:
>> https://github.com/apache/incubator-hawq/blob/master/pxf/pxf-hive/src/main/java/org/apache/hawq/pxf/plugins/hive/HiveResolver.java
>>
>>
>> Basically, 1 provides a type list, and 2 selects from that list to decide
>> which HAWQ-recognized type each data type should be converted to.
>>
>> If you're developing a new plugin with a new type mapping in HAWQ, you
>> need to do both 1 and 2.
>>
>> Which specific primitive type do you need that is not on the list?
>> BTW, you can also mail the dev mailing list so answers will be archived
>> publicly for everyone :)
>>
>> -Goden
>>
>>
>> On Fri, Sep 23, 2016 at 11:43 AM Kyle Dunn <kd...@pivotal.io> wrote:
>>
>>> Hey Goden -
>>>
>>> I'm looking at extending PXF for a new data source and noticed only a
>>> subset of the HAWQ-supported primitive datatypes are implemented in PXF. Is
>>> this as trivial as mapping a type to the corresponding OID in
>>> "api/io/DataType.java" or is there something more I'm missing?
>>>
>>> Hope the new adventure is starting well.
>>>
>>>
>>> -Kyle
>>> --
>>> *Kyle Dunn | Data Engineering | Pivotal*
>>> Direct: 303.905.3171 | Email: kdunn@pivotal.io
>>>
>> --
> *Kyle Dunn | Data Engineering | Pivotal*
> Direct: 303.905.3171 | Email: kdunn@pivotal.io
>

Re: PXF question with HAWQInputFormat to migrate data 1.x -> 2.x

Posted by Kyle Dunn <kd...@pivotal.io>.
Lei's approach seems like a clean, simple way to achieve this.

Just to reiterate, based on my understanding, this would imply building a
PXF plugin in HAWQ 2.0 that references the HAWQ 1.x HAWQInputFormat libraries.
Maybe the plugin could then use the file generated by "gpextract" (which needs
to be done in advance on the 1.x instance) in the LOCATION section of the
table, which would be passed along to the HAWQInputFormat init() methods to
parse out the actual HDFS file locations. As Lei pointed out, this avoids
needing multiple parallel instances and simplifies the overall design.
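
Very roughly, I'm picturing something like the sketch below for the Fragmenter
side of it. Everything here is a guess: the PXF class names are from memory,
the metadata-file overload of setInput() needs to be confirmed, and the
LOCATION layout is made up for illustration.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hawq.pxf.api.Fragment;
import org.apache.hawq.pxf.api.Fragmenter;
import org.apache.hawq.pxf.api.utilities.InputData;

import com.pivotal.hawq.mapreduce.HAWQInputFormat;

// Hypothetical Fragmenter that reads HAWQ 1.x files via gpextract metadata.
public class Hawq1xFragmenter extends Fragmenter {

    public Hawq1xFragmenter(InputData input) {
        super(input);
    }

    @Override
    public List<Fragment> getFragments() throws Exception {
        // The external table's LOCATION would carry the path of the
        // gpextract-generated YAML file (hypothetical layout).
        String metadataFile = inputData.getDataSource();

        Configuration conf = new Configuration();
        // The metadata-file flavor of setInput is what avoids needing a live
        // HAWQ 1.x master: table schema and file list come from the YAML.
        HAWQInputFormat.setInput(conf, metadataFile);

        // Next step (omitted in this sketch): ask HAWQInputFormat for its
        // splits and wrap each one as a PXF Fragment for the Accessor.
        List<Fragment> result = new ArrayList<Fragment>();
        return result;
    }
}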

Thoughts?

On Sun, Sep 25, 2016 at 6:05 PM Lei Chang <ch...@gmail.com> wrote:

>
> I think it might be possible to use the HAWQ 1.x MR InputFormat to develop
> a 2.0 PXF plugin. Then we do not need to run 2 versions together.
>
> Cheers
>
> Lei
>
>
>
> On Mon, Sep 26, 2016 at 4:27 AM +0800, "Goden Yao" <go...@apache.org>
> wrote:
>
> + dev mailing list, modified the title.
>>
>> Hi Kyle.
>>
>> Based on your description, your scenario is (as I understand it):
>> 1. HAWQ 1.x cluster installed.
>> 2. HAWQ 2.x cluster installed on the same nodes.
>> 3. Data migration (ETL) from HAWQ 1.x files to HAWQ 2.x using PXF (from
>> the 2.x installation).
>>
>> Is that correct?
>> So you want to develop a custom PXF plugin that can read HAWQ 1.x Parquet
>> data as external tables on HDFS, then insert it into new HAWQ 2.x native tables?
>>
>> According to the 1.3 doc:
>>
>> http://hdb.docs.pivotal.io/131/topics/HAWQInputFormatforMapReduce.html#hawqinputformatexample
>>
>>
>> 1) To use *HAWQInputFormat*, it'll require you to also run HAWQ 1.x (as it
>> requires a database URL to access metadata), so this means you need to run
>> 1.x and 2.x side by side. In theory, it should be doable, but
>> configuration-wise, no one has tried this.
>>
>> 2) If you run HAWQ side by side, this means PXF will run side by side as
>> well - you have to make sure there are no port conflicts or ambiguity about
>> which PXF version you are invoking.
>>
>> That's all I can think of for now.
>> -Goden
>>
>> On Fri, Sep 23, 2016 at 12:10 PM Kyle Dunn <kd...@pivotal.io> wrote:
>>
>>> Glad to hear Resolver is the only other piece - should work out nicely.
>>>
>>> So I'm looking at bolting on HAWQInputFormat to PXF (which actually
>>> looks quite straightforward) and I just want to ensure as many column types
>>> are supported as possible. This is motivated by needing to be able to read
>>> orphaned HAWQ 1.x files with PXF in HDB/HAWQ 2.x. This will make "in-place"
>>> upgrades much simpler.
>>>
>>> Here is the list of datatypes HAWQInputFormat supports, and the
>>> potential mapping to PXF types:
>>>
>>> [image: pasted1]
>>>
>>>
>>>
>>> On Fri, Sep 23, 2016 at 12:51 PM Goden Yao <go...@apache.org> wrote:
>>>
>>>> Thanks for the wishes.
>>>> Are you talking about developing a new plugin (a new data source)?
>>>> Mapping data types has 2 parts:
>>>> 1. What PXF recognizes from HAWQ - this is
>>>> https://github.com/apache/incubator-hawq/blob/master/pxf/pxf-api/src/main/java/org/apache/hawq/pxf/api/io/DataType.java
>>>>
>>>> 2. What plugins recognize and want to convert to a HAWQ type (the Resolver).
>>>> sample:
>>>> https://github.com/apache/incubator-hawq/blob/master/pxf/pxf-hive/src/main/java/org/apache/hawq/pxf/plugins/hive/HiveResolver.java
>>>>
>>>>
>>>> Basically, 1 provides a type list, and 2 selects from that list to decide
>>>> which HAWQ-recognized type each data type should be converted to.
>>>>
>>>> If you're developing a new plugin with a new type mapping in HAWQ, you
>>>> need to do both 1 and 2.
>>>>
>>>> Which specific primitive type do you need that is not on the list?
>>>> BTW, you can also mail the dev mailing list so answers will be archived
>>>> publicly for everyone :)
>>>>
>>>> -Goden
>>>>
>>>>
>>>> On Fri, Sep 23, 2016 at 11:43 AM Kyle Dunn <kd...@pivotal.io> wrote:
>>>>
>>>>> Hey Goden -
>>>>>
>>>>> I'm looking at extending PXF for a new data source and noticed only a
>>>>> subset of the HAWQ-supported primitive datatypes are implemented in PXF. Is
>>>>> this as trivial as mapping a type to the corresponding OID in
>>>>> "api/io/DataType.java" or is there something more I'm missing?
>>>>>
>>>>> Hope the new adventure is starting well.
>>>>>
>>>>>
>>>>> -Kyle
>>>>> --
>>>>> *Kyle Dunn | Data Engineering | Pivotal*
>>>>> Direct: 303.905.3171 | Email: kdunn@pivotal.io
>>>>>
>>>> --
>>> *Kyle Dunn | Data Engineering | Pivotal*
>>> Direct: 303.905.3171 | Email: kdunn@pivotal.io
>>>
>> --
*Kyle Dunn | Data Engineering | Pivotal*
Direct: 303.905.3171 | Email: kdunn@pivotal.io

Re: PXF question with HAWQInputFormat to migrate data 1.x -> 2.x

Posted by Lei Chang <ch...@gmail.com>.
I think it might be possible to use the HAWQ 1.x MR InputFormat to develop a 2.0 PXF plugin. Then we do not need to run 2 versions together.

Cheers
Lei