Posted to dev@beam.apache.org by Eugene Kirpichov <ki...@google.com.INVALID> on 2017/09/07 01:44:02 UTC
[PROPOSAL] FileIO.write: a modular replacement for FileBasedSink/WriteFiles
Hi,
Please take a look at the following proposal.
I believe that, together with the (already available) FileIO.match() and
FileIO.readMatches(), this proposal will empower Beam users to address all
use cases of file-based IO I'm aware of, which makes me quite excited.
http://s.apache.org/fileio-write
*We propose a new API for writing files in Beam: FileIO.write(). It is more
modular and cleaner to code against than FileBasedSink, and aims to
completely replace it.*
*FileIO.write() lets an IO author implement only logic and configuration
specific to a particular file format (e.g. Avro) and automatically get all
format-agnostic features, such as sharding, cleanup, windowed writes,
DynamicDestinations, compression, returning the successfully written
filenames, etc.*
TL;DR:
FileIO.write(FileSink<DestT, InputT> { open(dest), write(input), close() })
    .to(input → dest)
    .withFilenamePolicy(dest → prefix, shard pattern)
    .withEverythingElse() // like in WriteFiles
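
To make the split in the TL;DR concrete, here is a minimal, self-contained sketch in plain Java. It is not the Beam API and is not taken from the proposal document or the PR: the interface shape (including the OutputStream argument to open), the names FileSinkSketch, LineSink and writeAll, and the in-memory "files" are all assumptions made for illustration. The idea it tries to show is that the format author writes only open/write/close, while a format-agnostic driver handles routing inputs to destinations and naming the outputs.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Supplier;

public class FileSinkSketch {

    /** Hypothetical format-specific part: the only code an IO author would write. */
    interface FileSink<DestT, InputT> {
        void open(DestT dest, OutputStream out) throws IOException;
        void write(InputT input) throws IOException;
        void close() throws IOException;
    }

    /** Example format (an assumption): a header naming the destination, then one line per element. */
    static class LineSink implements FileSink<String, String> {
        private OutputStream out;

        @Override
        public void open(String dest, OutputStream out) throws IOException {
            this.out = out;
            out.write(("# " + dest + "\n").getBytes(StandardCharsets.UTF_8));
        }

        @Override
        public void write(String input) throws IOException {
            out.write((input + "\n").getBytes(StandardCharsets.UTF_8));
        }

        @Override
        public void close() throws IOException {
            out.flush();
        }
    }

    /**
     * Hypothetical format-agnostic driver: maps each input to a destination,
     * maps each destination to a filename, and keeps one sink per filename.
     * Returns filename -> contents; in-memory buffers stand in for real files.
     */
    static <DestT, InputT> Map<String, String> writeAll(
            List<InputT> inputs,
            Supplier<FileSink<DestT, InputT>> sinkFactory,
            Function<InputT, DestT> destinationFn,
            Function<DestT, String> filenameFn) {
        Map<String, ByteArrayOutputStream> buffers = new LinkedHashMap<>();
        Map<String, FileSink<DestT, InputT>> sinks = new LinkedHashMap<>();
        try {
            for (InputT input : inputs) {
                DestT dest = destinationFn.apply(input);
                String filename = filenameFn.apply(dest);
                FileSink<DestT, InputT> sink = sinks.get(filename);
                if (sink == null) {
                    // First element for this destination: open a fresh sink.
                    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                    sink = sinkFactory.get();
                    sink.open(dest, buffer);
                    buffers.put(filename, buffer);
                    sinks.put(filename, sink);
                }
                sink.write(input);
            }
            for (FileSink<DestT, InputT> sink : sinks.values()) {
                sink.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> result = new LinkedHashMap<>();
        buffers.forEach((name, buffer) ->
                result.put(name, new String(buffer.toByteArray(), StandardCharsets.UTF_8)));
        return result;
    }

    public static void main(String[] args) {
        // Route words by first letter; name each output after its destination.
        Map<String, String> files = writeAll(
                List.of("apple", "avocado", "banana"),
                LineSink::new,
                word -> word.substring(0, 1),
                dest -> "out-" + dest + ".txt");
        files.forEach((name, contents) -> System.out.println(name + ":\n" + contents));
    }
}
```

In this sketch, destinationFn plays the role of .to(input → dest) and filenameFn the role of .withFilenamePolicy(...). Sharding, windowing, compression and cleanup are deliberately left out; in the proposal those are the features the framework would supply.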
Re: [PROPOSAL] FileIO.write: a modular replacement for FileBasedSink/WriteFiles
Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
PR out for review: https://github.com/apache/beam/pull/3817
Next steps are to clean it up (in this PR) and to implement sinks for Text,
XML and TFRecord (in subsequent PRs).
On Thu, Sep 7, 2017 at 9:57 AM Robert Bradshaw <ro...@google.com.invalid>
wrote:
> Huge +1.
>
> This brings things more in line with Python's FileBasedSink where one
> simply overrides write[_encoded]_record and, usually, open/close. We
> may want to consider aligning the APIs. (And, of course bringing
> things like DynamicDestinations to Python.)
Re: [PROPOSAL] FileIO.write: a modular replacement for FileBasedSink/WriteFiles
Posted by Robert Bradshaw <ro...@google.com.INVALID>.
Huge +1.
This brings things more in line with Python's FileBasedSink where one
simply overrides write[_encoded]_record and, usually, open/close. We
may want to consider aligning the APIs. (And, of course bringing
things like DynamicDestinations to Python.)
On Wed, Sep 6, 2017 at 9:24 PM, Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:
> Fantastic.
>
> Big +1 for this.
>
> Regards
> JB
Re: [PROPOSAL] FileIO.write: a modular replacement for FileBasedSink/WriteFiles
Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Fantastic.
Big +1 for this.
Regards
JB
--
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com