Posted to dev@beam.apache.org by Eugene Kirpichov <ki...@google.com.INVALID> on 2017/09/07 01:44:02 UTC

[PROPOSAL] FileIO.write: a modular replacement for FileBasedSink/WriteFiles

Hi,

Please take a look at the following proposal.

I believe that, together with the (already available) FileIO.match() and
FileIO.readMatches(), this proposal will let Beam users address all
use cases of file-based IO I'm aware of, which makes me quite excited.

http://s.apache.org/fileio-write

*We propose a new API for writing files in Beam: FileIO.write(). It is more
modular and cleaner to code against than FileBasedSink, and aims to
completely replace it.*

*FileIO.write() lets an IO author implement only logic and configuration
specific to a particular file format (e.g. Avro) and automatically get all
format-agnostic features, such as sharding, cleanup, windowed writes,
DynamicDestinations, compression, returning the successfully written
filenames, etc.*

TL;DR:

FileIO.write(FileSink<DestT, InputT> { open(dest), write(input), close() })
      .to(input → dest)
      .withFilenamePolicy(dest → prefix, shard pattern)
      .withEverythingElse() // like in WriteFiles
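
To make the TL;DR concrete, here is a minimal, self-contained Java sketch of
the split it describes: a format-specific sink with open/write/close, and a
format-agnostic driver that routes inputs to destinations and names the
outputs. This is not Beam code and not the API from the proposal document.
The names FileSinkSketch, LineSink and writeAll are hypothetical, the sink
writes into an in-memory StringBuilder instead of a real file, and the
"-00000-of-00001" suffix assumes a single shard per destination.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Supplier;

/** Illustrative model only; none of these types are Beam classes. */
final class FileSinkSketch {

  /** Format-specific part: the only piece an IO author would implement. */
  interface FileSink<DestT, InputT> {
    void open(DestT dest, StringBuilder out); // StringBuilder stands in for a file
    void write(InputT input);
    void close();
  }

  /** Hypothetical line-oriented sink: a header naming the destination, one line per record. */
  static final class LineSink implements FileSink<String, String> {
    private StringBuilder out;

    @Override
    public void open(String dest, StringBuilder out) {
      this.out = out;
      out.append("# ").append(dest).append('\n');
    }

    @Override
    public void write(String input) {
      out.append(input).append('\n');
    }

    @Override
    public void close() {
      out.append("# end\n");
    }
  }

  /** Format-agnostic part: groups inputs by destination and names each output. */
  static <DestT, InputT> Map<String, String> writeAll(
      List<InputT> inputs,
      Supplier<FileSink<DestT, InputT>> sinkFactory,
      Function<InputT, DestT> destinationFn,      // plays the role of .to(input -> dest)
      Function<DestT, String> filenamePrefixFn) { // plays the role of the filename policy
    // Group inputs by destination, preserving first-seen order.
    Map<DestT, List<InputT>> byDest = new LinkedHashMap<>();
    for (InputT input : inputs) {
      byDest.computeIfAbsent(destinationFn.apply(input), d -> new ArrayList<>()).add(input);
    }

    // Write one in-memory "file" per destination with a fresh sink instance.
    Map<String, String> files = new LinkedHashMap<>();
    for (Map.Entry<DestT, List<InputT>> entry : byDest.entrySet()) {
      StringBuilder out = new StringBuilder();
      FileSink<DestT, InputT> sink = sinkFactory.get();
      sink.open(entry.getKey(), out);
      for (InputT input : entry.getValue()) {
        sink.write(input);
      }
      sink.close();
      // Assumption: exactly one shard per destination.
      files.put(filenamePrefixFn.apply(entry.getKey()) + "-00000-of-00001", out.toString());
    }
    return files;
  }

  public static void main(String[] args) {
    Map<String, String> files = FileSinkSketch.<String, String>writeAll(
        Arrays.asList("apple", "avocado", "banana"),
        LineSink::new,
        word -> word.substring(0, 1), // destination = first letter
        dest -> "out/" + dest);       // filename prefix per destination
    files.forEach((name, contents) -> System.out.println(name + ":\n" + contents));
  }
}
```

In this sketch the sink never sees sharding or naming, which is the point of
the proposal: those concerns live in the driver and are shared by every
format.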

Re: [PROPOSAL] FileIO.write: a modular replacement for FileBasedSink/WriteFiles

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
PR out for review: https://github.com/apache/beam/pull/3817

Next steps are to clean it up (in this PR) and to implement sinks for Text,
XML and TFRecord (in subsequent PRs).

On Thu, Sep 7, 2017 at 9:57 AM Robert Bradshaw <ro...@google.com.invalid>
wrote:

> Huge +1.
>
> This brings things more in line with Python's FileBasedSink where one
> simply overrides write[_encoded]_record and, usually, open/close. We
> may want to consider aligning the APIs. (And, of course, bringing
> things like DynamicDestinations to Python.)

Re: [PROPOSAL] FileIO.write: a modular replacement for FileBasedSink/WriteFiles

Posted by Robert Bradshaw <ro...@google.com.INVALID>.
Huge +1.

This brings things more in line with Python's FileBasedSink where one
simply overrides write[_encoded]_record and, usually, open/close. We
may want to consider aligning the APIs. (And, of course, bringing
things like DynamicDestinations to Python.)

On Wed, Sep 6, 2017 at 9:24 PM, Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:
> Fantastic.
>
> Big +1 for this.
>
> Regards
> JB

Re: [PROPOSAL] FileIO.write: a modular replacement for FileBasedSink/WriteFiles

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Fantastic.

Big +1 for this.

Regards
JB

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com