Posted to user@beam.apache.org by Michael Luckey <ad...@gmail.com> on 2017/03/01 00:24:46 UTC

Recommendations on reading/writing avro to hdfs

Hi all,

we are currently using Beam on Spark, reading and writing Avro files to
HDFS.

Until now we have used HDFSFileSource for reading and HadoopIO for writing,
essentially reading and writing PCollection<AvroKey<GenericRecord>>.

With the changes introduced by
https://issues.apache.org/jira/browse/BEAM-1497 this no longer seems to be
directly supported by Beam, as the required AvroWrapperCoder has been
deleted.

Since we have to change our code anyway, we are wondering what the
recommended approach would be for reading/writing Avro files from/to HDFS
with Beam on Spark:

- use the new implementation of HDFSFileSource/HDFSFileSink
- use the Spark-provided HadoopIO (and probably reimplement AvroWrapperCoder
ourselves?)

What are the trade-offs here, possibly also considering the already planned
changes to IO? Do we gain anything from using the Spark HadoopIO, since our
underlying engine is currently Spark, or will it eventually be deprecated
and exist only for ‘historical’ reasons?

Any thoughts and advice here?

Regards,

michel

Re: Recommendations on reading/writing avro to hdfs

Posted by Michael Luckey <ad...@gmail.com>.
Hi Davor,

thanks for your reply. And do not worry, we knew in advance that our
solution would only be temporary.

So for the time being we will go with the HDFSFileSource/Sink approach and
look forward to the final solution once HDFSFileSystem is ready.
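
Roughly, the read side we have in mind looks like the sketch below. The
HDFSFileSource factory signature (file pattern, InputFormat, an explicit
Coder plus a conversion function) is only our assumption about the
refactored API and still needs to be verified against the current javadoc;
paths and schema are placeholders.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.mapred.AvroKey;
    import org.apache.avro.mapreduce.AvroKeyInputFormat;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.AvroCoder;
    import org.apache.beam.sdk.io.Read;
    import org.apache.beam.sdk.io.hdfs.HDFSFileSource;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.hadoop.io.NullWritable;

    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
    Schema schema = new Schema.Parser().parse(schemaJson);  // schemaJson: our record schema (placeholder)

    // Assumed factory of the refactored HDFSFileSource: we convert each
    // KV<AvroKey<GenericRecord>, NullWritable> ourselves and supply the coder
    // explicitly, so no AvroWrapperCoder should be needed anymore.
    PCollection<GenericRecord> records = p.apply(
        Read.from(
            HDFSFileSource.from(
                "hdfs://namenode/data/in/*.avro",               // placeholder input pattern
                AvroKeyInputFormat.class,
                AvroCoder.of(GenericRecord.class, schema),
                (KV<AvroKey<GenericRecord>, NullWritable> kv) -> kv.getKey().datum())));

    // The write side would mirror this via HDFSFileSink and Write.to(...),
    // again to be checked against the javadoc.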

Thanks again for your help!

On Tue, Mar 7, 2017 at 4:39 AM, Davor Bonaci <da...@apache.org> wrote:

> Hi Michael,
> Sorry about the inconvenience here; AvroWrapperCoder was indeed removed
> recently from the Hadoop/HDFS IO.
>
> I think the best approach would be to use HDFSFileSource; this is the only
> approach I can recommend today.
>
> Going forward, we are working on being able to read Avro files via AvroIO,
> regardless of which file system the files may be stored on. So, you'd do
> something like AvroIO.Read.from("hdfs://..."), just as you can today do
> AvroIO.Read.from("gs://...").
>
> Hope this helps!
>
> Davor
>
> On Tue, Feb 28, 2017 at 4:24 PM, Michael Luckey <ad...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> we are currently using Beam on Spark, reading and writing Avro files to
>> HDFS.
>>
>> Until now we have used HDFSFileSource for reading and HadoopIO for writing,
>> essentially reading and writing PCollection<AvroKey<GenericRecord>>.
>>
>> With the changes introduced by https://issues.apache.org/jira/browse/BEAM-1497
>> this no longer seems to be directly supported by Beam, as the required
>> AvroWrapperCoder has been deleted.
>>
>> Since we have to change our code anyway, we are wondering what the recommended
>> approach would be for reading/writing Avro files from/to HDFS with Beam on
>> Spark:
>>
>> - use the new implementation of HDFSFileSource/HDFSFileSink
>> - use the Spark-provided HadoopIO (and probably reimplement AvroWrapperCoder
>> ourselves?)
>>
>> What are the trade-offs here, possibly also considering the already planned
>> changes to IO? Do we gain anything from using the Spark HadoopIO, since our
>> underlying engine is currently Spark, or will it eventually be deprecated and
>> exist only for ‘historical’ reasons?
>>
>> Any thoughts and advice here?
>>
>> Regards,
>>
>> michel
>>
>
>

RE: Recommendations on reading/writing avro to hdfs

Posted by "Newport, Billy" <Bi...@gs.com>.
Going forward, how do you plan on handling partitioning in HDFS, both for reading and for writing?

I’m thinking about how to implement dynamic partitioning (I don’t know the partitions up front; I’d need to examine the input data before I know which partitions will be affected by the data).
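
To make that concrete, the pattern I have in mind is roughly the sketch below: derive the partition from each record, then group by it so every partition can later be written to its own directory. This is only a sketch on my side; the "partition_date" field is made up, and as far as I can tell Beam has no built-in dynamic-destination file write yet.

    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.SerializableFunction;
    import org.apache.beam.sdk.transforms.WithKeys;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    // Assuming "records" is the PCollection<GenericRecord> read from HDFS:
    // key every record by a partition derived from the data itself, then group,
    // so each resulting KV holds all records of one dynamically discovered partition.
    PCollection<KV<String, Iterable<GenericRecord>>> byPartition = records
        .apply(WithKeys.of(new SerializableFunction<GenericRecord, String>() {
          @Override
          public String apply(GenericRecord record) {
            return String.valueOf(record.get("partition_date"));  // made-up partition field
          }
        }))
        .apply(GroupByKey.<String, GenericRecord>create());

    // The open question is then how to write each group to its own HDFS path.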

Thanks


From: Davor Bonaci [mailto:davor@apache.org]
Sent: Monday, March 06, 2017 10:39 PM
To: user@beam.apache.org
Subject: Re: Recommendations on reading/writing avro to hdfs

Hi Michael,
Sorry about the inconvenience here; AvroWrapperCoder was indeed removed recently from the Hadoop/HDFS IO.

I think the best approach would be to use HDFSFileSource; this is the only approach I can recommend today.

Going forward, we are working on being able to read Avro files via AvroIO, regardless of which file system the files may be stored on. So, you'd do something like AvroIO.Read.from("hdfs://..."), just as you can today do AvroIO.Read.from("gs://...").

Hope this helps!

Davor

On Tue, Feb 28, 2017 at 4:24 PM, Michael Luckey <ad...@gmail.com> wrote:
Hi all,

we are currently using Beam on Spark, reading and writing Avro files to HDFS.

Until now we have used HDFSFileSource for reading and HadoopIO for writing, essentially reading and writing PCollection<AvroKey<GenericRecord>>.

With the changes introduced by https://issues.apache.org/jira/browse/BEAM-1497 this no longer seems to be directly supported by Beam, as the required AvroWrapperCoder has been deleted.

Since we have to change our code anyway, we are wondering what the recommended approach would be for reading/writing Avro files from/to HDFS with Beam on Spark:

- use the new implementation of HDFSFileSource/HDFSFileSink
- use the Spark-provided HadoopIO (and probably reimplement AvroWrapperCoder ourselves?)

What are the trade-offs here, possibly also considering the already planned changes to IO? Do we gain anything from using the Spark HadoopIO, since our underlying engine is currently Spark, or will it eventually be deprecated and exist only for ‘historical’ reasons?

Any thoughts and advice here?

Regards,

michel


Re: Recommendations on reading/writing avro to hdfs

Posted by Davor Bonaci <da...@apache.org>.
Hi Michael,
Sorry about the inconvenience here; AvroWrapperCoder was indeed removed
recently from the Hadoop/HDFS IO.

I think the best approach would be to use HDFSFileSource; this is the only
approach I can recommend today.

Going forward, we are working on being able to read Avro files via AvroIO,
regardless of which file system the files may be stored on. So, you'd do
something like AvroIO.Read.from("hdfs://..."), just as you can today do
AvroIO.Read.from("gs://...").
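
To make that concrete, the eventual pipeline could look roughly like the
sketch below (not working yet today, since the "hdfs://" support is the part
we are still building; paths and schema are just placeholders):

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.AvroIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
    Schema schema = new Schema.Parser().parse(schemaJson);   // placeholder schema

    // Read Avro from HDFS exactly the way you read it from GCS today.
    PCollection<GenericRecord> records = p.apply(
        AvroIO.Read.from("hdfs://namenode/data/in/*.avro")    // placeholder path
                   .withSchema(schema));

    // And write it back the same way.
    records.apply(
        AvroIO.Write.to("hdfs://namenode/data/out/part")      // placeholder prefix
                    .withSchema(schema));

    p.run();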

Hope this helps!

Davor

On Tue, Feb 28, 2017 at 4:24 PM, Michael Luckey <ad...@gmail.com> wrote:

> Hi all,
>
> we are currently using Beam on Spark, reading and writing Avro files to
> HDFS.
>
> Until now we have used HDFSFileSource for reading and HadoopIO for writing,
> essentially reading and writing PCollection<AvroKey<GenericRecord>>.
>
> With the changes introduced by https://issues.apache.org/jira/browse/BEAM-1497
> this no longer seems to be directly supported by Beam, as the required
> AvroWrapperCoder has been deleted.
>
> Since we have to change our code anyway, we are wondering what the recommended
> approach would be for reading/writing Avro files from/to HDFS with Beam on
> Spark:
>
> - use the new implementation of HDFSFileSource/HDFSFileSink
> - use the Spark-provided HadoopIO (and probably reimplement AvroWrapperCoder
> ourselves?)
>
> What are the trade-offs here, possibly also considering the already planned
> changes to IO? Do we gain anything from using the Spark HadoopIO, since our
> underlying engine is currently Spark, or will it eventually be deprecated and
> exist only for ‘historical’ reasons?
>
> Any thoughts and advice here?
>
> Regards,
>
> michel
>