Posted to dev@beam.apache.org by Madhusudan Borkar <mb...@etouch.net> on 2017/04/25 17:45:28 UTC

[PROPOSAL] HiveIO - updated link to document

Please use the following link for the HiveIO proposal.

https://docs.google.com/document/d/1JOzihFiXkQjtv6rur8-vCixSK-nHhIoIij9MwJZ_Dp0/edit?usp=sharing

Madhu Borkar

Re: [PROPOSAL] HiveIO - updated link to document

Posted by Madhusudan Borkar <mb...@etouch.net>.
Thank you for your helpful comments. We will look into the HadoopIO code
that you suggested.
We are following incremental development, and we will consider a more
generic approach to support the other runners. Going forward, we will ask
questions when we need help.

Madhu Borkar

Re: [PROPOSAL] HiveIO - updated link to document

Posted by Ismaël Mejía <ie...@gmail.com>.
Hello,

I created the HiveIO JIRA and followed the initial discussions about
the best approach for HiveIO, so first I want to suggest that you read
the previous thread(s) on the mailing list:

https://www.mail-archive.com/dev@beam.incubator.apache.org/msg02313.html

The main idea I concluded from that thread is that a really valuable
part of accessing Hive from Beam is reading the records exposed via the
data catalog using HCatalog. This approach is far more interesting
because Beam can benefit from its multi-runner execution to process the
data exposed by Hive on all the different runners. This is not the case
if we invoke HiveQL (or SQL) queries over Hive via MapReduce. Note also
that you can do this today in Beam by using JdbcIO plus the specific
Hive JDBC configuration.
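
For example, something like this should work (an untested sketch: the
driver class is the standard HiveServer2 JDBC driver, but the host,
database, table, and query are just placeholders):

    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.values.PCollection;

    // Given an existing Pipeline 'pipeline', read one string column
    // from Hive through the HiveServer2 JDBC interface.
    PCollection<String> names = pipeline.apply(
        JdbcIO.<String>read()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "org.apache.hive.jdbc.HiveDriver",          // HiveServer2 driver
                "jdbc:hive2://hive-server:10000/default"))  // placeholder host/db
            .withQuery("SELECT name FROM my_table")         // placeholder query
            .withRowMapper(resultSet -> resultSet.getString(1))
            .withCoder(StringUtf8Coder.of()));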

It is probably a good idea to take a look at how the Flink connector
does this, because it is essentially the same idea that we want:
https://github.com/apache/flink/tree/master/flink-connectors/flink-hcatalog
(Note: try not to get confused by the class names in Flink vs. Hive,
because they are really similar.)

Also take a look at HadoopIO: since HCatInputFormat is a Hadoop
InputFormat class, you can make a simpler implementation by reusing the
code that is already there.
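
A rough, untested sketch of that reuse (the metastore URI, database,
and table are placeholders, and depending on the coders available you
may still need explicit key/value translation):

    import org.apache.beam.sdk.io.hadoop.inputformat.HadoopInputFormatIO;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hive.hcatalog.data.DefaultHCatRecord;
    import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

    // Point HCatalog at the metastore and tell HadoopInputFormatIO
    // which InputFormat and key/value classes to expect.
    Configuration conf = new Configuration();
    conf.set("hive.metastore.uris", "thrift://metastore-host:9083"); // placeholder
    conf.setClass("mapreduce.job.inputformat.class",
        HCatInputFormat.class, InputFormat.class);
    conf.setClass("key.class", WritableComparable.class, Object.class);
    conf.setClass("value.class", DefaultHCatRecord.class, Object.class);
    HCatInputFormat.setInput(conf, "default", "my_table"); // placeholders

    // Given an existing Pipeline 'pipeline':
    PCollection<KV<WritableComparable, DefaultHCatRecord>> records =
        pipeline.apply(
            HadoopInputFormatIO.<WritableComparable, DefaultHCatRecord>read()
                .withConfiguration(conf));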

So the idea, at least for the read part, would be to build a
PCollection<HCatRecord> from a Hadoop Configuration + the database
name + the table + optionally a filter, and this PCollection would
then be processed in Beam pipelines. The advantage of this approach is
that once the Beam SQL DSL is ready it will integrate perfectly with
this IO, so we can have SQL reading from Hive/HCatalog and processing
on whatever runner the users want.
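
To make that concrete, here is a purely hypothetical sketch of the
user-facing API; none of these HCatalogIO methods exist yet, they just
mirror the inputs listed above:

    import org.apache.beam.sdk.values.PCollection;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hive.hcatalog.data.HCatRecord;

    // Hypothetical HCatalogIO: the class and its builder methods are
    // only a proposal at this point, not an existing API.
    Configuration hcatConf = new Configuration();
    hcatConf.set("hive.metastore.uris", "thrift://metastore-host:9083"); // placeholder

    PCollection<HCatRecord> records = pipeline.apply(
        HCatalogIO.read()
            .withConfiguration(hcatConf)
            .withDatabase("default")              // database name
            .withTable("my_table")                // placeholder table name
            .withFilter("date=\"2017-04-25\"")); // optional partition filter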

Finally, if you agree with this approach, I think it probably makes
sense to rename the IO to HCatalogIO, as Flink does.

One extra thing: I have not yet looked at the write part, but I
suppose it should be something similar.

Regards,
Ismael.