Posted to dev@beam.apache.org by Madhusudan Borkar <mb...@etouch.net> on 2017/05/10 22:05:59 UTC

[PROPOSAL] Apache Hive connector

Hi all,
Thank you for your responses to the earlier proposal. Taking all the
suggestions into account, we are making a new proposal for a Hive connector.
Please let us know your feedback.

[1]
https://docs.google.com/document/d/1aeQRLXjVr38Z03_zWkHO9YQhtnj0jHoCfhsSNm-wxtA/edit?usp=sharing

[2] https://issues.apache.org/jira/browse/BEAM-1158

Madhu Borkar

RE: [PROPOSAL] Apache Hive connector

Posted by Seshadri Raghunathan <sr...@etouch.net>.

Many thanks for your input. It does indeed work simply by *configuring* HadoopInputFormatIO!

Perhaps I will simply write an integration test case with this configuration, which could serve as a reference for reading from Hive using HCatalog.

I see existing integration tests for HadoopInputFormatIO (HIFIO) here:
https://github.com/apache/beam/blob/master/sdks/java/io/hadoop/jdk1.8-tests/src/test/java/org/apache/beam/sdk/io/hadoop/inputformat/integration/tests/
I will go ahead and write one for HCatalog.

Please let me know if you have any comments.

Thanks,
Seshadri

From: Eugene Kirpichov <kirpichov@google.com.invalid>
Date: Fri, May 12, 2017 at 2:43 PM
Subject: Re: [PROPOSAL] Apache Hive connector
To: dev@beam.apache.org


Hi!

Why do you need to override methods like computeSplitsIfNecessary at all?
Is HCatalogIO so different from other HadoopInputFormats that it cannot be
handled by the generic code of HadoopInputFormatIO? I looked at the
implementation in your commit and it seems identical, except for one line:
"HCatInputFormat.setInput(conf.getHadoopConfiguration(), database, table,
filter)". That line seems to simply specify the Configuration for
HadoopInputFormatIO, which can be done with
HadoopInputFormatIO.withConfiguration().

In other words, so far it seems that HCatalogIO can be implemented by
*configuring* HadoopInputFormatIO rather than extending it. Am I missing
something?

On Fri, May 12, 2017 at 12:11 PM Seshadri Raghunathan <sesh.cr@gmail.com>
wrote:

> Hi Eugene,
>
> In order to reuse HadoopInputFormatIO, this is what I am thinking -
>
> 1. Extend HadoopInputFormatBoundedSource to create - HCatalogBoundedSource
> 2. Override necessary methods in HCatalogBoundedSource to perform
> HCatalog-specific steps. ( overriding computeSplitsIfNecessary() method
> should be enough as I see it now )
> 3. Use HCatalogBoundedSource and HadoopInputFormatReader in HCatalog
> wrapper class to perform IO
>
> Initially I started this way but since it involves modifying
> HadoopInputFormatReader
> / HadoopInputFormatBoundedSource to make it public / extensible, I wasn't
> sure if this fits with Beam authoring guidelines and hence came up with the
> solution I shared in my earlier note.
>
> Please let me know your thoughts !
>
> HadoopInputFormatIO -
>
> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop/input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java#L172
>
> HadoopInputFormatBoundedSource -
>
> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop/input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java#L367
>
> HadoopInputFormatReader -
>
> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop/input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java#L584
>
> On Thu, May 11, 2017 at 4:57 PM, Seshadri Raghunathan <sesh.cr@gmail.com>
> wrote:
>
> > Thanks Eugene, that makes sense. This solution heavily borrows on
> > HadoopInputFormatIO with a tweak for HCatalog (and related parameters).
> > I will try to re-use HadoopInputFormatIO rather than the current approach.
> >
> > On Thu, May 11, 2017 at 4:44 PM, Eugene Kirpichov
> > <kirpichov@google.com.invalid> wrote:
> >
> >> Thanks Seshadri! This seems to have a great deal of copy-paste from
> >> HadoopInputFormatIO. Is it possible to instead implement this connector
> >> as a wrapper around it, rather than copy-paste?
> >>
> >> On Thu, May 11, 2017 at 4:41 PM Seshadri Raghunathan <sesh.cr@gmail.com>
> >> wrote:
> >>
> >> > Hi all,
> >> >
> >> > Here is a draft implementation of this proposal -
> >> >
> >> > https://github.com/seshadri-cr/beam/commit/78cdf8772f2cd5bb9cd018b1c99c3ad0854157c1
> >> >
> >> > Many thanks to Ismaël Mejía who helped in a high level review &
> >> > follow-up of this design / approach.
> >> >
> >> > Looking forward for further review/comments from wider community to
> >> > move forward on this proposal.
> >> >
> >> > Thanks,
> >> > Seshadri
> >> >
> >> > On Wed, May 10, 2017 at 3:05 PM, Madhusudan Borkar <mborkar@etouch.net>
> >> > wrote:
> >> >
> >> > > Hi all,
> >> > > Thank you for your response to the earlier proposal. Taking into
> >> > > account all the suggestions, we are making a new proposal for Hive
> >> > > connector.
> >> > > Please, let us know your feedback.
> >> > >
> >> > > [1]
> >> > > https://docs.google.com/document/d/1aeQRLXjVr38Z03_zWkHO9YQhtnj0jHoCfhsSNm-wxtA/edit?usp=sharing
> >> > >
> >> > > [2] https://issues.apache.org/jira/browse/BEAM-1158
> >> > >
> >> > > Madhu Borkar

Re: [PROPOSAL] Apache Hive connector

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.

Hi!

Why do you need to override methods like computeSplitsIfNecessary at all?
Is HCatalogIO so different from other HadoopInputFormats that it cannot be
handled by the generic code of HadoopInputFormatIO? I looked at the
implementation in your commit and it seems identical, except for one line:
"HCatInputFormat.setInput(conf.getHadoopConfiguration(), database, table,
filter)". That line seems to simply specify the Configuration for
HadoopInputFormatIO, which can be done with
HadoopInputFormatIO.withConfiguration().

In other words, so far it seems that HCatalogIO can be implemented by
*configuring* HadoopInputFormatIO rather than extending it. Am I missing
something?
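
The "configure, don't extend" idea above can be illustrated with a minimal,
dependency-free sketch. Everything in it is a hypothetical stand-in and not
Beam or Hive API: the plain Map plays the role of the Hadoop Configuration,
setInput() plays the role of HCatInputFormat.setInput(...), GenericRead plays
the role of HadoopInputFormatIO, and the "sketch.*" key names are invented for
illustration. It only shows that the HCatalog-specific step reduces to writing
entries into a configuration that a generic reader then consumes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins, NOT Beam or Hive classes: they only model the
// "configure, don't extend" idea using plain maps.
final class HCatConfigSketch {

  // Stand-in for HCatInputFormat.setInput(...): all it does is record which
  // table to read inside the configuration. The key names are made up.
  static void setInput(
      Map<String, String> conf, String database, String table, String filter) {
    conf.put("sketch.hcat.database", database);
    conf.put("sketch.hcat.table", table);
    if (filter != null) {
      conf.put("sketch.hcat.filter", filter);
    }
  }

  // Stand-in for a generic, format-agnostic read: it receives a finished
  // configuration and needs no HCatalog-specific overrides.
  static final class GenericRead {
    private final Map<String, String> conf;

    private GenericRead(Map<String, String> conf) {
      this.conf = conf;
    }

    static GenericRead withConfiguration(Map<String, String> conf) {
      // Defensive copy so later changes by the caller do not leak in.
      return new GenericRead(new LinkedHashMap<>(conf));
    }

    String describe() {
      String filter = conf.getOrDefault("sketch.hcat.filter", "<none>");
      return conf.get("sketch.inputformat.class")
          + " reading " + conf.get("sketch.hcat.database")
          + "." + conf.get("sketch.hcat.table")
          + " with filter " + filter;
    }
  }

  // Builds the configuration once, then hands it to the generic reader.
  static GenericRead hcatRead(String database, String table, String filter) {
    Map<String, String> conf = new LinkedHashMap<>();
    conf.put("sketch.inputformat.class", "HCatInputFormat");
    setInput(conf, database, table, filter);
    return GenericRead.withConfiguration(conf);
  }

  public static void main(String[] args) {
    System.out.println(hcatRead("default", "orders", "year=2017").describe());
  }
}
```

In this sketch the wrapper (hcatRead) contains no reading logic of its own,
which is the property the question above is probing for in the real connector.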

Re: [PROPOSAL] Apache Hive connector

Posted by Seshadri Raghunathan <se...@gmail.com>.

Hi Eugene,

To reuse HadoopInputFormatIO, this is what I am thinking:

1. Extend HadoopInputFormatBoundedSource to create HCatalogBoundedSource.
2. Override the necessary methods in HCatalogBoundedSource to perform
HCatalog-specific steps (overriding the computeSplitsIfNecessary() method
should be enough, as I see it now).
3. Use HCatalogBoundedSource and HadoopInputFormatReader in an HCatalog
wrapper class to perform IO.

I initially started this way, but since it involves modifying
HadoopInputFormatReader / HadoopInputFormatBoundedSource to make them
public / extensible, I wasn't sure whether this fits with the Beam authoring
guidelines, and hence came up with the solution I shared in my earlier note.

Please let me know your thoughts!

HadoopInputFormatIO -
https://github.com/apache/beam/blob/master/sdks/java/io/hadoop/input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java#L172

HadoopInputFormatBoundedSource -
https://github.com/apache/beam/blob/master/sdks/java/io/hadoop/input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java#L367

HadoopInputFormatReader -
https://github.com/apache/beam/blob/master/sdks/java/io/hadoop/input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java#L584
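
For illustration only, the three-step subclassing plan above might look
roughly like the following dependency-free sketch. All class names prefixed
with "Sketch" are hypothetical stand-ins and not the real Beam classes, and
the "hcat://" split strings are invented. The sketch mainly shows why the base
class and its computeSplitsIfNecessary() method would have to be visible and
overridable for the plan to work.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for HadoopInputFormatBoundedSource, NOT the real class.
class SketchBoundedSource {
  protected List<String> splits; // null until computed
  protected final String path;

  SketchBoundedSource(String path) {
    this.path = path;
  }

  // Must be non-private and non-final for step 2 to be possible, which is
  // why the real classes would need to be opened up for this plan.
  protected void computeSplitsIfNecessary() {
    if (splits == null) {
      splits = new ArrayList<>();
      splits.add(path + "#0");
    }
  }

  List<String> getSplits() {
    computeSplitsIfNecessary();
    return splits;
  }
}

// Step 1: extend the generic source.
class SketchHCatalogBoundedSource extends SketchBoundedSource {
  private final String database;
  private final String table;

  SketchHCatalogBoundedSource(String database, String table) {
    super(database + "." + table);
    this.database = database;
    this.table = table;
  }

  // Step 2: override only the split computation with a table-specific step.
  @Override
  protected void computeSplitsIfNecessary() {
    if (splits == null) {
      // Stand-in for resolving the table through HCatalog first.
      splits = new ArrayList<>();
      splits.add("hcat://" + database + "/" + table + "#0");
    }
  }
}

// Step 3: a wrapper that performs IO through the subclass.
final class SketchHCatalogIO {
  static List<String> read(String database, String table) {
    return new SketchHCatalogBoundedSource(database, table).getSplits();
  }
}
```

Calling SketchHCatalogIO.read("default", "orders") goes through the overridden
method, while the base class keeps its generic behaviour for other inputs.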

Re: [PROPOSAL] Apache Hive connector

Posted by Seshadri Raghunathan <se...@gmail.com>.

Thanks Eugene, that makes sense. This solution borrows heavily from
HadoopInputFormatIO, with a tweak for HCatalog (and related parameters).
I will try to re-use HadoopInputFormatIO rather than the current approach.

Re: [PROPOSAL] Apache Hive connector

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.

Thanks Seshadri! This seems to have a great deal of copy-paste from
HadoopInputFormatIO. Is it possible to instead implement this connector as
a wrapper around it, rather than copy-paste?

Re: [PROPOSAL] Apache Hive connector

Posted by Seshadri Raghunathan <se...@gmail.com>.

Hi all,

Here is a draft implementation of this proposal -
https://github.com/seshadri-cr/beam/commit/78cdf8772f2cd5bb9cd018b1c99c3ad0854157c1

Many thanks to Ismaël Mejía, who helped with a high-level review and
follow-up of this design / approach.

Looking forward to further review and comments from the wider community so
that we can move forward on this proposal.

Thanks,
Seshadri
