Posted to dev@beam.apache.org by Kevin Sookocheff <ke...@gmail.com> on 2016/02/11 23:14:47 UTC

Show: Amazon S3 Sink

Hi everyone,

I've been interested in the project and have been doing small contributions
mostly around IO connectors for Dataflow.

I recently wrote up an article on batch writing to S3 on my blog (pasted
below). Would it be best to wait until the refactor and seeding of the Beam
project to work this code through proper review?

Also, any feedback and/or corrections on the article are welcome.


Article Link: http://sookocheff.com/post/beam/writing-a-beam-sink/

Thanks,

Kevin

Re: Show: Amazon S3 Sink

Posted by Kevin Sookocheff <ke...@gmail.com>.
Thanks for the response, Dan. While writing this prototype, almost all of
your questions came to my mind as well.


>    In general, it's unfortunate that -- as the code is set up right now --
> it made sense for you to make an "s3 sink" vs a "s3 channel
> interface".
>

I would have liked to use IOChannel and IOUtils as well, but currently they
heavily favour GCS and there isn't a clear path to breaking that dependency
(at least for someone as unfamiliar with that code as I am). It actually had
me questioning whether Sink is even the right way to do this. Room for
improvement and clarification in the SDK.


> *) Another thing we'd like to do better in Beam is handling of credentials.
> Suppose I want to read two different files from S3, using two different
> sets of credentials. The "read credentials from classpath" approach here
> probably won't work. This is another ripe area for some design as we move
> forward with Beam.
>

Agreed. The example that tests this uses the DirectPipelineRunner as a
kludge for reading the credentials from your local machine. You can also
potentially pass credentials in at run-time as PipelineOptions. A lot of
this depends on individual use cases, so something more flexible would be
appreciated.
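
For the PipelineOptions route, something roughly like this is what I had in
mind -- completely untested, the option names are just placeholders, and
passing secrets as plain options has its own problems:

import com.google.cloud.dataflow.sdk.options.Description;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;

// Placeholder options interface for passing AWS credentials at run-time.
public interface S3Options extends PipelineOptions {
  @Description("AWS access key id used by the S3 sink")
  String getAwsAccessKeyId();
  void setAwsAccessKeyId(String value);

  @Description("AWS secret access key used by the S3 sink")
  String getAwsSecretKey();
  void setAwsSecretKey(String value);
}

// In main():
//   S3Options options =
//       PipelineOptionsFactory.fromArgs(args).withValidation().as(S3Options.class);
// and the sink would read them back with
//   pipelineOptions.as(S3Options.class).getAwsAccessKeyId()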



> *) I think you acknowledged this in the post, but: I'm a little concerned
> about data loss here -- it looks like if the S3 copy fails, we do not fail
> the bundle?
>

In this prototype implementation, data loss is a possibility.


> *) It looks like you write the entire contents of the file locally, then
> copy to S3. Is there a reason not to write directly to a channel that
> writes to an S3 file?
>

The Sink API seems to assume a channel interface is available, yet I
haven't been able to find a way to create such an interface using the AWS
Java SDK. I'm sure it is possible; I just don't know how.

> *) In general, we'd like a better failure handling story in Beam.
>

Agreed. This needs work and the current implementation is nowhere near
production ready.


> Selfishly, I'd love it if we could use you as an early review guinea
> pig for Beam contributions :).
>

I would definitely like to be part of that! We are going to be using
Dataflow/Beam for a few use cases and I would like to expand that within
the organization as the new SDK improves. I would also be interested in
helping to do some of the development for some of these ideas but would
need some guidance from people more familiar with the project.

Hopefully we can touch base again on these issues in the coming months,

Kevin

Re: Show: Amazon S3 Sink

Posted by Dan Halperin <dh...@google.com.INVALID>.
Hi Kevin,

Thanks -- this is a nice post! I'm glad you were able to produce a sink for
S3 that is working for you.

I've been thinking about I/O for Dataflow, and thinking ahead to Beam, for
a while now. So here are some thoughts on your code and, mixed inline, a few
design/feature requests for Beam as a whole that we'll populate in JIRA
soon.

*) It looks like the S3 sink is roughly semantically equivalent to
TextIO.Write -- it writes strings to a file. Ideally, we could decouple the
destination file system (local disk, s3://, gs://, write to socket) from
the format of the files being written. Suppose that we had an S3
IOChannelFactory
<https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/util/IOChannelFactory.java>,
and TextIO.Write could use it -- would that have been enough?
   In general, it's unfortunate that -- as the code is set up right now --
it made sense for you to make an "s3 sink" vs a "s3 channel
interface". Improving IOChannelFactory's abstractions and making it generic
rather than Google Cloud Storage specific
<https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/util/GcsIOChannelFactory.java>
is one of our early Beam tasks.
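
To make that concrete, here is a very rough, untested sketch of the read
half of such a factory. It's written as a standalone class (rather than
implementing IOChannelFactory) since the exact method set may shift as we
make the abstraction generic, and the s3:// parsing is deliberately naive:

import com.amazonaws.services.s3.AmazonS3;

import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

// Rough sketch of an S3-backed channel factory. A real contribution would
// implement com.google.cloud.dataflow.sdk.util.IOChannelFactory so that
// TextIO could target s3:// paths directly; the WritableByteChannel needed
// for the create() half is sketched under the fourth point below.
public class S3ChannelFactory {
  private final AmazonS3 s3;

  public S3ChannelFactory(AmazonS3 s3) {
    this.s3 = s3;
  }

  // Open an existing object such as "s3://bucket/key" for reading.
  public ReadableByteChannel open(String spec) throws IOException {
    String[] bucketAndKey = parse(spec);
    return Channels.newChannel(
        s3.getObject(bucketAndKey[0], bucketAndKey[1]).getObjectContent());
  }

  // Deliberately naive "s3://bucket/key" parsing, for illustration only.
  private static String[] parse(String spec) {
    String rest = spec.substring("s3://".length());
    int slash = rest.indexOf('/');
    return new String[] { rest.substring(0, slash), rest.substring(slash + 1) };
  }
}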

*) Another thing we'd like to do better in Beam is handling of credentials.
Suppose I want to read two different files from S3, using two different
sets of credentials. The "read credentials from classpath" approach here
probably won't work. This is another ripe area for some design as we move
forward with Beam. In Dataflow, we essentially punted on this issue by
saying "well, the entire job needs one set of credentials" -- but it's been
something we know we need to fix going forward with Beam.

*) I think you acknowledged this in the post, but: I'm a little concerned
about data loss here -- it looks like if the S3 copy fails, we do not fail
the bundle? We'd want to wrap the S3 Put in a few retries, but we also want
the bundle to fail and be retried if we don't eventually succeed writing to
S3. Avoiding data loss is a primary requirement for us. [Caveat: I'm not an
S3 expert.]
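
Something as simple as this untested sketch (the helper is purely
illustrative) captures what I mean -- retry the put a few times, and if it
never succeeds, throw so that the bundle fails and the runner retries it:

import com.amazonaws.AmazonClientException;
import com.amazonaws.services.s3.AmazonS3;

import java.io.File;

// Illustrative helper: retry the S3 copy, then fail the bundle by throwing.
class S3PutWithRetry {
  private static final int MAX_ATTEMPTS = 3;

  static void putWithRetries(AmazonS3 s3, String bucket, String key, File file) {
    AmazonClientException lastFailure = null;
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      try {
        s3.putObject(bucket, key, file);
        return;  // success
      } catch (AmazonClientException e) {
        lastFailure = e;
        try {
          Thread.sleep(1000L * attempt);  // crude linear backoff
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          break;
        }
      }
    }
    // Propagating the exception out of the Sink.Writer makes the bundle fail
    // so the runner can retry it, instead of silently dropping data.
    throw new RuntimeException("Failed to upload " + key + " to S3", lastFailure);
  }
}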

*) It looks like you write the entire contents of the file locally, then
copy to S3. Is there a reason not to write directly to a channel that
writes to an S3 file? Seems like this would make the sink more scalable (S3
can hold bigger files than a hard drive, probably) and potentially faster
(a buffered write channel vs a file write + a network copy).
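
For instance, the AWS SDK's multipart-upload API looks like it could back a
streaming WritableByteChannel along these lines -- again an untested sketch,
and someone who actually knows S3 should sanity-check the part-size handling:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
import com.amazonaws.services.s3.model.PartETag;
import com.amazonaws.services.s3.model.UploadPartRequest;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.util.ArrayList;
import java.util.List;

// Sketch of a WritableByteChannel that streams to S3 via multipart upload,
// so the worker never needs to hold the whole file on local disk.
public class S3WritableByteChannel implements WritableByteChannel {
  // S3 requires every part except the last to be at least 5 MB.
  private static final int PART_SIZE = 5 * 1024 * 1024;

  private final AmazonS3 s3;
  private final String bucket;
  private final String key;
  private final String uploadId;
  private final List<PartETag> partETags = new ArrayList<>();
  private final byte[] buffer = new byte[PART_SIZE];
  private int buffered = 0;
  private int partNumber = 1;
  private boolean open = true;

  public S3WritableByteChannel(AmazonS3 s3, String bucket, String key) {
    this.s3 = s3;
    this.bucket = bucket;
    this.key = key;
    this.uploadId = s3.initiateMultipartUpload(
        new InitiateMultipartUploadRequest(bucket, key)).getUploadId();
  }

  @Override
  public int write(ByteBuffer src) throws IOException {
    int written = 0;
    while (src.hasRemaining()) {
      int chunk = Math.min(src.remaining(), PART_SIZE - buffered);
      src.get(buffer, buffered, chunk);
      buffered += chunk;
      written += chunk;
      if (buffered == PART_SIZE) {
        uploadPart(false);  // ship a full 5 MB part and reset the buffer
      }
    }
    return written;
  }

  private void uploadPart(boolean isLast) {
    UploadPartRequest request = new UploadPartRequest()
        .withBucketName(bucket)
        .withKey(key)
        .withUploadId(uploadId)
        .withPartNumber(partNumber++)
        .withInputStream(new ByteArrayInputStream(buffer, 0, buffered))
        .withPartSize(buffered)
        .withLastPart(isLast);
    partETags.add(s3.uploadPart(request).getPartETag());
    buffered = 0;
  }

  @Override
  public boolean isOpen() {
    return open;
  }

  @Override
  public void close() throws IOException {
    if (!open) {
      return;
    }
    open = false;
    if (buffered > 0 || partETags.isEmpty()) {
      uploadPart(true);  // flush the final (possibly short) part
    }
    s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
        bucket, key, uploadId, partETags));
  }
}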

*) In general, we'd like a better failure handling story in Beam. As an
example, suppose you changed the S3 sink as I suggested -- retry the S3
copy, cause bundle failure if it did not eventually succeed.
   The job will retry the bundle (in Cloud Dataflow, we retry up to 4
times), but it might still fail if the bundle fails persistently (say, the
temp file it creates is too large for the local drive). In that case, we may
leave some successfully-written files in S3 but have a job failure, and I
think we'd like to be able to remove all the files we successfully wrote to
S3 -- otherwise users will get angry at us for racking up storage fees.
   In Dataflow, for TextIO.Write, we can "cheat" by using the Cloud
Dataflow service to do that cleanup outside of Java-land. We clearly need
to add some way to do this cleanup to Beam.
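
Until then, a sink could at least attempt a best-effort cleanup of its own
output on failure -- along these lines (untested sketch, assuming everything
the job wrote lives under a single prefix):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

// Best-effort removal of partially-written output, e.g. after a job failure.
class S3OutputCleanup {
  static void deletePrefix(AmazonS3 s3, String bucket, String prefix) {
    ObjectListing listing = s3.listObjects(bucket, prefix);
    while (true) {
      for (S3ObjectSummary summary : listing.getObjectSummaries()) {
        s3.deleteObject(bucket, summary.getKey());
      }
      if (!listing.isTruncated()) {
        break;
      }
      listing = s3.listNextBatchOfObjects(listing);
    }
  }
}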

As for code review -- if you want to submit this as a contrib module to
Dataflow Java SDK, we'd be happy to take a look. (I've given you some of
the initial feedback in this mail.)
    Alternately, the refactoring and seeding of the Beam SDK is coming
soon. Selfishly, I'd love it if we could use you as an early review guinea
pig for Beam contributions :).

Thanks!
Dan

On Thu, Feb 11, 2016 at 2:14 PM, Kevin Sookocheff <
kevin.sookocheff@gmail.com> wrote:

> Hi everyone,
>
> I've been interested in the project and have been doing small contributions
> mostly around IO connectors for Dataflow.
>
> I recently wrote up an article on batch writing to S3 on my blog (pasted
> below). Would it be best to wait until the refactor and seeding of the Beam
> project to work this code through proper review?
>
> Also, any feedback and/or corrections on the article are welcome.
>
>
> Article Link: http://sookocheff.com/post/beam/writing-a-beam-sink/
>
> Thanks,
>
> Kevin
>

Re: Show: Amazon S3 Sink

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Kevin,

Good article and a potential contribution.

I'm also working on new IO connectors (JMS, MQTT, etc.), for now in
Dataflow (with Dan), but I will move directly to Beam once the code
donation is complete.

Regards
JB

On 02/11/2016 11:14 PM, Kevin Sookocheff wrote:
> Hi everyone,
>
> I've been interested in the project and have been doing small contributions
> mostly around IO connectors for Dataflow.
>
> I recently wrote up an article on batch writing to S3 on my blog (pasted
> below). Would it be best to wait until the refactor and seeding of the Beam
> project to work this code through proper review?
>
> Also, any feedback and/or corrections on the article are welcome.
>
>
> Article Link: http://sookocheff.com/post/beam/writing-a-beam-sink/
>
> Thanks,
>
> Kevin
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com