Posted to dev@beam.apache.org by Pei He <pe...@google.com.INVALID> on 2016/11/17 00:09:51 UTC

[PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

Hi,

I am working on BEAM-59
<https://issues.apache.org/jira/browse/BEAM-59> "IOChannelFactory
redesign". The goals are:

1. Support file-based IOs (TextIO, AvroIO) with user-defined file systems.

2. Support configuring any user-defined file system.

And, I drafted the design proposal in two parts to address them in order:

Part 1: IOChannelFactory Redesign
<https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#>

Summary:

Old API: WritableByteChannel create(String spec, String mimeType);

New API: WritableByteChannel create(URI uri, CreateOptions options);

Notable proposed changes:

   1. Include an options parameter in most methods to specify behaviors.
   2. Replace String with URI to include the scheme in file/directory
      locations.
   3. Require file systems to provide a SeekableByteChannel for reads.
   4. Add methods such as getMetadata(), rename(), etc.
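
To make the shape of the new create()/open() pair concrete, here is a minimal
sketch. It is not the interface from the design doc: SketchFileSystem,
LocalSketchFileSystem, CreateSketchDemo and the single-field CreateOptions are
hypothetical names made up for this illustration, and only a local-disk
implementation on top of java.nio.file is shown.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

/** Hypothetical options holder; the actual CreateOptions is specified in the design doc. */
final class CreateOptions {
  final String mimeType;

  CreateOptions(String mimeType) {
    this.mimeType = mimeType;
  }
}

/** Hypothetical two-method slice of the redesigned interface. */
interface SketchFileSystem {
  WritableByteChannel create(URI uri, CreateOptions options) throws IOException;

  SeekableByteChannel open(URI uri) throws IOException;
}

/** Local-disk implementation backed by java.nio.file. */
final class LocalSketchFileSystem implements SketchFileSystem {
  @Override
  public WritableByteChannel create(URI uri, CreateOptions options) throws IOException {
    // The MIME type is ignored here; local files carry no content type.
    return Files.newByteChannel(
        Paths.get(uri),
        StandardOpenOption.CREATE,
        StandardOpenOption.WRITE,
        StandardOpenOption.TRUNCATE_EXISTING);
  }

  @Override
  public SeekableByteChannel open(URI uri) throws IOException {
    return Files.newByteChannel(Paths.get(uri), StandardOpenOption.READ);
  }
}

final class CreateSketchDemo {
  /** Writes text through create(), then reads it back starting at the given byte offset. */
  static String roundTrip(String text, long offset) {
    try {
      Path tmp = Files.createTempFile("sketch", ".txt");
      try {
        SketchFileSystem fs = new LocalSketchFileSystem();
        try (WritableByteChannel out = fs.create(tmp.toUri(), new CreateOptions("text/plain"))) {
          ByteBuffer data = ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
          while (data.hasRemaining()) {
            out.write(data);
          }
        }
        try (SeekableByteChannel in = fs.open(tmp.toUri())) {
          // Seeking is what lets a source start reading in the middle of a file.
          in.position(offset);
          ByteBuffer buf = ByteBuffer.allocate((int) (in.size() - offset));
          while (buf.hasRemaining() && in.read(buf) >= 0) {
            // keep reading until the buffer is full or the channel is exhausted
          }
          return new String(buf.array(), 0, buf.position(), StandardCharsets.UTF_8);
        }
      } finally {
        Files.deleteIfExists(tmp);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(roundTrip("hello world", 6));
  }
}
```

The URI carries the scheme, so the caller never has to know which file system
backs it, and the seekable read channel covers the offset-based reads that
source splitting needs.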


Part 2: Configurable BeamFileSystem
<https://docs.google.com/document/d/1-7vo9nLRsEEzDGnb562PuL4q9mUiq_ZVpCAiyyJw8p8/edit#heading=h.p3gc3colc2cs>

Summary:

Old API: IOChannelUtils.getFactory(glob).match(glob);

New API: BeamFileSystems.getFileSystem(glob, config).match(glob);
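
One way to picture the configurable lookup is a registry keyed by URI scheme.
The sketch below is only an assumption about how such a lookup could work, not
the design in the Part 2 doc: SchemeRegistry, ConfiguredFileSystem and the
string-map config are hypothetical names, and the "file systems" just describe
themselves.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for a configured file system. */
interface ConfiguredFileSystem {
  String describe();
}

/** Hypothetical registry that maps a URI scheme to a file system factory. */
final class SchemeRegistry {
  private final Map<String, Function<Map<String, String>, ConfiguredFileSystem>> factories =
      new HashMap<>();

  void register(String scheme, Function<Map<String, String>, ConfiguredFileSystem> factory) {
    factories.put(scheme.toLowerCase(Locale.ROOT), factory);
  }

  /** Extracts the scheme before "://"; specs without one are treated as local files. */
  static String schemeOf(String glob) {
    int idx = glob.indexOf("://");
    return idx > 0 ? glob.substring(0, idx).toLowerCase(Locale.ROOT) : "file";
  }

  /** Picks the factory for the glob's scheme and builds a file system from the config. */
  ConfiguredFileSystem getFileSystem(String glob, Map<String, String> config) {
    String scheme = schemeOf(glob);
    Function<Map<String, String>, ConfiguredFileSystem> factory = factories.get(scheme);
    if (factory == null) {
      throw new IllegalArgumentException("No file system registered for scheme: " + scheme);
    }
    return factory.apply(config);
  }
}

final class SchemeRegistryDemo {
  /** Builds a registry with two illustrative entries. */
  static SchemeRegistry defaultRegistry() {
    SchemeRegistry registry = new SchemeRegistry();
    registry.register("file", config -> () -> "local");
    registry.register("gs", config -> () -> "gs project=" + config.get("project"));
    return registry;
  }

  public static void main(String[] args) {
    Map<String, String> config = new HashMap<>();
    config.put("project", "my-project");
    System.out.println(defaultRegistry().getFileSystem("gs://bucket/in-*", config).describe());
  }
}
```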


Looking for comments and feedback.

Thanks

--

Pei

Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Pei,

Reading the documents, for the part 1, I think that using Hadoop filesystem:

https://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/fs/FileSystem.html

would make more sense than introducing the BeamFileSystem interface.

It would allow us to directly support HDFS, FTP, Azure, S3 out of the
box (as Hadoop FileSystem provides sub-classes for those providers).

We could provide a GsFileSystem as sub-class of Hadoop Filesystem.

Part 2 is OK in terms of configuration.

Let me know if I can work with you on this (in terms of implementation).

Regards
JB

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Pei,

Thanks for sharing.

For the goals, I fully agree with you: as already discussed, the purpose
is to have "pluggable" filesystems that will allow us to easily work with
local, gs, hdfs, s3 filesystems (and even more).

After a quick first glance, it looks good to me. I will try to evaluate 
the impact later today.

IMHO, once this change is done, the HdfsIO (in the sdk/java/io) should 
be flagged as deprecated.

Regards
JB

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

Posted by Amit Sela <am...@gmail.com>.
FWIW I'm pretty sure this
<https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/util-hadoop>
is Google's gs hdfs connector, and I think this
<https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.6.0> should
work for s3, and Azure's is here
<https://hadoop.apache.org/docs/stable2/hadoop-azure/index.html>.
So going with Hadoop's FileSystem interface is already compatible with
hdfs, gs, s3, azure.

On Thu, Nov 17, 2016 at 9:19 PM Pei He <pe...@google.com.invalid> wrote:

> Hi JB,
> My proposals are based on the current IOChannelFactory, and how they are
> used in FileBasedSink.
>
> Let me spend more time investigating the Hadoop FileSystem interface.
> --
> Pei
>
> On Thu, Nov 17, 2016 at 1:21 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
>
> > By the way, Pei, for the record: why introduce BeamFileSystem rather
> > than use the Hadoop FileSystem interface?
> >
> > Thanks
> > Regards
> > JB
> >
> > On 11/17/2016 01:09 AM, Pei He wrote:
> >
> >> Hi,
> >>
> >> I am working on BEAM-59
> >> <https://issues.apache.org/jira/browse/BEAM-59> "IOChannelFactory
> >> redesign". The goals are:
> >>
> >> 1. Support file-based IOs (TextIO, AvorIO) with user-defined file
> system.
> >>
> >> 2. Support configuring any user-defined file system.
> >>
> >> And, I drafted the design proposal in two parts to address them in
> order:
> >>
> >> Part 1: IOChannelFactory Redesign
> >> <https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJ
> >> sVG3qel2lhdKTknmZ_7M/edit#>
> >>
> >> Summary:
> >>
> >> Old API: WritableByteChannel create(String spec, String mimeType);
> >>
> >> New API: WritableByteChannel create(URI uri, CreateOptions options);
> >>
> >> Noticeable proposed changes:
> >>
> >>
> >>    1.
> >>
> >>    Includes the options parameter in most methods to specify behaviors.
> >>    2.
> >>
> >>    Replace String with URI to include scheme for files/directories
> >>    locations.
> >>    3.
> >>
> >>    Require file systems to provide a SeekableByteChannel for read.
> >>    4.
> >>
> >>    Additional methods, such as getMetadata(), rename() e.t.c
> >>
> >>
> >> Part 2: Configurable BeamFileSystem
> >> <https://docs.google.com/document/d/1-7vo9nLRsEEzDGnb562PuL4
> >> q9mUiq_ZVpCAiyyJw8p8/edit#heading=h.p3gc3colc2cs>
> >>
> >> Summary:
> >>
> >> Old API: IOChannelUtils.getFactory(glob).match(glob);
> >>
> >> New API: BeamFileSystems.getFileSystem(glob, config).match(glob);
> >>
> >>
> >> Looking for comments and feedback.
> >>
> >> Thanks
> >>
> >> --
> >>
> >> Pei
> >>
> >>
> > --
> > Jean-Baptiste Onofré
> > jbonofre@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>

Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
Agreed with Kenn. I was the one who raised this point on the design doc,
and really I just want to make sure that pipeline authors have a way to let
their users use regular paths from command line and from String/FilePath -
it doesn't have to be an IOChannelFactory or FileSystems feature per se,
but the design needs to make sure there's some well-known way to do it, and
advertise it, including in the documentation of these classes.

Though I'm conflicted on whether it'd be ok to have, say,
TextIO.Read.from() only take a URI rather than a String (though under the
hood it would of course pass a URI to FileSystems APIs).
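
For illustration, one possible shape is a String overload that converts to a
URI and delegates to the URI overload. This is a sketch under assumptions, not
how TextIO works: HypotheticalRead and SpecToUri are made-up names, and the
rule that a scheme needs at least two characters (so that "C:" reads as a
Windows drive letter rather than a scheme) is only one way to resolve the
ambiguity.

```java
import java.net.URI;
import java.nio.file.Paths;
import java.util.regex.Pattern;

/** Hypothetical helper that turns a user-supplied spec into a URI. */
final class SpecToUri {
  // Two or more scheme characters followed by "://". A single letter such as
  // "C:" is treated as a Windows drive, not as a scheme.
  private static final Pattern HAS_SCHEME =
      Pattern.compile("^[A-Za-z][A-Za-z0-9+.-]+://.*");

  static URI toUri(String spec) {
    if (HAS_SCHEME.matcher(spec).matches()) {
      return URI.create(spec);
    }
    // No scheme: interpret the spec as a path on the local file system.
    return Paths.get(spec).toAbsolutePath().toUri();
  }
}

/** Hypothetical read transform with both overloads. */
final class HypotheticalRead {
  final URI source;

  private HypotheticalRead(URI source) {
    this.source = source;
  }

  static HypotheticalRead from(URI uri) {
    return new HypotheticalRead(uri);
  }

  /** Convenience overload: converts at the boundary, then delegates. */
  static HypotheticalRead from(String spec) {
    return from(SpecToUri.toUri(spec));
  }

  public static void main(String[] args) {
    System.out.println(from("gs://bucket/dir/input.txt").source);
    System.out.println(from("data/input.txt").source);
  }
}
```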

On Tue, Dec 13, 2016 at 1:15 PM Kenneth Knowles <kl...@google.com.invalid>
wrote:

> I don't think there is any conflict here.
>
> On Tue, Dec 13, 2016 at 12:34 PM, Pei He <pe...@google.com.invalid> wrote:
>
> > One design decision made during the previous design discussion [1] is
> > "Replacing FilePath with URI for resolving file paths". This has been
> > brought back to the dev@ mailing list in my previous email.
> >
>
> The direction of this argument, in my opinion, gets the burden of proof
> wrong.
>
> The original design document effectively proposed "instead of using URIs,
> let's make a Beam-specific abstraction" and [1] is just the natural comment
> "let's just use URI". This works for the internet, and gives interop with
> essentially all code, so you need a very special reason not to do it (and
> special cases generally manifest as custom URI schemes).
>
> > Comment [2] asked me to clarify the impact on Windows OS users because
> > users have to specify the path in the URI format, such as:
> > "file:///C:/home/input-*"
> > "C:/home/"
> >
>
> It is not really true that users have to do this. For the command line, it
> is the responsibility of the code that parses "--filesToStage
> C:\my\windows\path". Users should absolutely be able to specify paths like
> this on Windows, and it is not difficult and nothing your proposal needs to
> solve.
>
> With programmatic creation in Java code, the same principle applies: the
> environment-specific String/File/Path should be converted to a URI at the
> membrane. Making an API take a URI makes it completely obvious to a Java
> programmer that if they have a String/File/Path they need to convert it
> appropriately.
>
> Kenn
>
>
> > Using URIs in the API is to ensure Beam code is file systems agnostic.
> >
> > Another alternative is Java Path/File. It is used in the current
> > IOChannelFactory API, and it works poorly. For example, Path throws when
> > the path contains a file scheme or an asterisk:
> > new File("file:///C:/home/").toPath() throws in toPath().
> > Paths.get("C:/home/").resolve("output-*") throws in resolve().
> >
> > Any thoughts and suggestions are welcome.
> >
> > Thanks
> > --
> > Pei
> >
> > ---
> > [1]:
> > https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-
> > XJsVG3qel2lhdKTknmZ_7M/edit?disco=AAAAA30vtPU#heading=h.p3gc3colc2cs
> >
> > [2]:
> > https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-
> > XJsVG3qel2lhdKTknmZ_7M/edit?disco=AAAAA02O1cY
> >
> > On Tue, Dec 6, 2016 at 1:25 PM, Kenneth Knowles <kl...@google.com.invalid>
> > wrote:
> >
> > > Thanks for the thorough answers. It all sounds good to me.
> > >
> > > On Tue, Dec 6, 2016 at 12:57 PM, Pei He <pe...@google.com.invalid>
> > wrote:
> > >
> > > > Thanks Kenn for the feedback and questions.
> > > >
> > > > I responded inline.
> > > >
> > > > On Mon, Dec 5, 2016 at 7:49 PM, Kenneth Knowles
> <klk@google.com.invalid
> > >
> > > > wrote:
> > > >
> > > > > I really like this document. It is easy to read and informative.
> > > > > Three things not addressed by the document:
> > > > >
> > > > > 1. Major Beam use cases. I'm sure we have a few in the SDK that
> > > > > could be outlined in terms of the new API with pseudocode.
> > > >
> > > >
> > > > (I am writing pseudocode directly with the FileSystem interface to
> > > > demonstrate. However, clients will use the FileSystems utility. This
> > > > gives us a layer between the file system providers' interface and the
> > > > client interface. We can add utility functions to FileSystems for
> > > > common use patterns as needed.)
> > > >
> > > > Major Beam use cases are the following:
> > > > A. FileBasedSource:
> > > > // a. Get input URIs and file sizes from user-provided specs.
> > > > // Note: I updated match() to be a bulk operation after I sent my
> > > > // last email.
> > > > List<MatchResult> results = match(specList);
> > > > List<Metadata> inputMetadataList = FluentIterable.from(results)
> > > >     .transformAndConcat(
> > > >         new Function<MatchResult, Iterable<Metadata>>() {
> > > >           @Override
> > > >           public Iterable<Metadata> apply(MatchResult result) {
> > > >             return Arrays.asList(result.metadata());
> > > >           }
> > > >         })
> > > >     .toList();
> > > >
> > > > // b. Read from a start offset to support the source splitting.
> > > > SeekableByteChannel seekChannel = open(fileUri);
> > > > seekChannel.position(source.getStartOffset());
> > > > seekChannel.read(...);
> > > >
> > > > B. FileBasedSink:
> > > > // bulk rename temporary files to output files
> > > > rename(tempUris, outputUris);
> > > >
> > > > C. General file operations:
> > > > a. resolve paths
> > > > b. create file to write, open file to read (for example in tests).
> > > > c. bulk delete files/directories
> > > >
> > > >
> > > >
> > > > > 2. Related work. How does this differ from other filesystem APIs
> > > > > and why?
> > > >
> > > > We need three sets of functionalities:
> > > > 1. resolve paths.
> > > > 2. read and write channels.
> > > > 3. bulk files management operations(bulk delete/rename/match).
> > > >
> > > > And, they are available from Java nio, hadoop FileSystem APIs, and
> > > > other standard libraries such as java.net.URI.
> > > >
> > > > The current IOChannelFactory interface uses Java nio for (1) and (2),
> > > > and defines its own interface for (3).
> > > >
> > > > In my redesign, I made the following choices:
> > > > For (1), I replaced Java nio with URI, because it is standardized and
> > > > precise and doesn't require file system providers to implement an
> > > > additional Path interface.
> > > >
> > > > For (2), I kept the uses of Java nio (Writable/SeekableByteChannel),
> > > > since I don't see anything that needs to improve and I don't see any
> > > > better alternatives (hadoop's FSDataInput/OutputStream provide the same
> > > > functionality, but require additional dependencies).
> > > >
> > > > For (3), reasons that I didn't choose Java nio or hadoop are:
> > > > 1. Beam needs a bulk operations API for better performance; however,
> > > > Java nio and hadoop FileSystem are single-file based APIs.
> > > > 2. Have APIs that are file system agnostic. For example, we can use
> > > > URI instead of Path.
> > > > 3. Have APIs that are minimal and easy for file system providers to
> > > > implement.
> > > > 4. Introduce fewer dependencies.
> > > > 5. It is easy to build an adaptor based on Java nio or hadoop
> > > > interfaces.
> > > >
> > > > > 3. Discussion of non-Java languages. It would be good to know what
> > > > > classes in e.g. Python we might use in place of URI,
> > > > > SeekableByteChannel, etc.
> > > >
> > > > I don't want to mislead people here without a thorough investigation.
> > > > As you can see from your second question, that would require iterations
> > > > on design and prototyping.
> > > >
> > > > I didn't introduce any Java-specific requirements in the redesign.
> > > > Resolving paths, seeking with channels or streams, and file management
> > > > operations are language independent. And, I am pretty sure there are
> > > > Python libraries for that.
> > > >
> > > > However, I am happy to hear thoughts and get help from people working
> > > > on the Python SDK.
> > > >
> > > >
> > > > > On Mon, Dec 5, 2016 at 4:41 PM, Pei He <pe...@google.com.invalid>
> > > wrote:
> > > > >
> > > > > > I have received a lot of comments in "Part 1: IOChannelFactory
> > > > > > Redesign" [1]. And, I have updated the design based on the
> > feedback.
> > > > > >
> > > > > > Now, I feel it is close to being ready for implementation, and I
> > > > > > would like to summarize the changes:
> > > > > > 1. Replaced FilePath with URI for resolving file paths.
> > > > > > 2. Required match(String spec) to handle ambiguities in user-provided
> > > > > > strings (see the match() javadoc in the design doc for details).
> > > > > > 3. Changed Metadata to use the Future.get() paradigm, and removed
> > > > > > exception().
> > > > > > 4. Changed methods on the FileSystem interface to be protected
> > > > > > (visible for implementors), and created the FileSystems utility
> > > > > > (visible for callers).
> > > > > > 5. Simplified the FileSystem interface by moving operation options,
> > > > > > such as DeleteOptions and MatchOptions, to the FileSystems utility.
> > > > > > 6. Simplified the FileSystem interface by requiring certain behaviors,
> > > > > > such as creating recursively and throwing for missing files.
> > > > > >
> > > > > > Any thoughts / feedback?
> > > > > > --
> > > > > > Pei
> > > > > >
> > > > > > [1]
> > > > > > https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-
> > > > > > XJsVG3qel2lhdKTknmZ_7M/edit#
> > > > > >
> > > > > > On Wed, Nov 30, 2016 at 1:32 PM, Pei He <pe...@google.com>
> wrote:
> > > > > >
> > > > > > > Thanks JB for the feedback.
> > > > > > >
> > > > > > > Yes, we should provide a hadoop.fs.FileSystem adaptor. As you
> > > > > > > said, it will make a range of file systems available in Beam.
> > > > > > >
> > > > > > > And, people can choose to implement BeamFileSystem directly to
> > > > > > > get the best performance (for example, by providing bulk
> > > > > > > operations).
> > > > > > >
> > > > > > > --
> > > > > > > Pei
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Nov 29, 2016 at 11:11 AM, Jean-Baptiste Onofré <
> > > > > jb@nanthrax.net>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Hi Pei,
> > > > > > >>
> > > > > > > >> rethinking about that, I understand that the purpose of the
> > > > > > > >> Beam filesystem is to avoid bringing a bunch of dependencies
> > > > > > > >> into the core. That makes perfect sense.
> > > > > > > >>
> > > > > > > >> So, I agree that a Beam filesystem abstraction is fine.
> > > > > > > >>
> > > > > > > >> My point is that we should provide a HadoopFilesystem
> > > > > > > >> extension/plugin for Beam filesystem asap: that would help us
> > > > > > > >> to support a good range of filesystems quickly.
> > > > > > >>
> > > > > > >> Just my $0.01 ;)
> > > > > > >>
> > > > > > >> Regards
> > > > > > >> JB
> > > > > > >>
> > > > > > >>
> > > > > > >> On 11/17/2016 08:18 PM, Pei He wrote:
> > > > > > >>
> > > > > > >>> Hi JB,
> > > > > > >>> My proposals are based on the current IOChannelFactory, and
> how
> > > > they
> > > > > > are
> > > > > > >>> used in FileBasedSink.
> > > > > > >>>
> > > > > > >>> Let's me spend more time to investigate Hadoop FileSystem
> > > > interface.
> > > > > > >>> --
> > > > > > >>> Pei
> > > > > > >>>
> > > > > > >>> On Thu, Nov 17, 2016 at 1:21 AM, Jean-Baptiste Onofré <
> > > > > jb@nanthrax.net
> > > > > > >
> > > > > > >>> wrote:
> > > > > > >>>
> > > > > > >>> By the way, Pei, for the record: why introducing
> BeamFileSystem
> > > and
> > > > > not
> > > > > > >>>> using the Hadoop FileSystem interface ?
> > > > > > >>>>
> > > > > > >>>> Thanks
> > > > > > >>>> Regards
> > > > > > >>>> JB
> > > > > > >>>>
> > > > > > >>>> On 11/17/2016 01:09 AM, Pei He wrote:
> > > > > > >>>>
> > > > > > >>>> Hi,
> > > > > > >>>>>
> > > > > > >>>>> I am working on BEAM-59
> > > > > > >>>>> <https://issues.apache.org/jira/browse/BEAM-59>
> > > > "IOChannelFactory
> > > > > > >>>>> redesign". The goals are:
> > > > > > >>>>>
> > > > > > >>>>> 1. Support file-based IOs (TextIO, AvorIO) with
> user-defined
> > > file
> > > > > > >>>>> system.
> > > > > > >>>>>
> > > > > > >>>>> 2. Support configuring any user-defined file system.
> > > > > > >>>>>
> > > > > > >>>>> And, I drafted the design proposal in two parts to address
> > them
> > > > in
> > > > > > >>>>> order:
> > > > > > >>>>>
> > > > > > >>>>> Part 1: IOChannelFactory Redesign
> > > > > > >>>>> <
> https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJ
> > > > > > >>>>> sVG3qel2lhdKTknmZ_7M/edit#>
> > > > > > >>>>>
> > > > > > >>>>> Summary:
> > > > > > >>>>>
> > > > > > >>>>> Old API: WritableByteChannel create(String spec, String
> > > > mimeType);
> > > > > > >>>>>
> > > > > > >>>>> New API: WritableByteChannel create(URI uri, CreateOptions
> > > > > options);
> > > > > > >>>>>
> > > > > > >>>>> Noticeable proposed changes:
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>>    1.
> > > > > > >>>>>
> > > > > > >>>>>    Includes the options parameter in most methods to
> specify
> > > > > > behaviors.
> > > > > > >>>>>    2.
> > > > > > >>>>>
> > > > > > >>>>>    Replace String with URI to include scheme for
> > > > files/directories
> > > > > > >>>>>    locations.
> > > > > > >>>>>    3.
> > > > > > >>>>>
> > > > > > >>>>>    Require file systems to provide a SeekableByteChannel
> for
> > > > read.
> > > > > > >>>>>    4.
> > > > > > >>>>>
> > > > > > >>>>>    Additional methods, such as getMetadata(), rename()
> e.t.c
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>> Part 2: Configurable BeamFileSystem
> > > > > > >>>>> <
> https://docs.google.com/document/d/1-7vo9nLRsEEzDGnb562PuL4
> > > > > > >>>>> q9mUiq_ZVpCAiyyJw8p8/edit#heading=h.p3gc3colc2cs>
> > > > > > >>>>>
> > > > > > >>>>> Summary:
> > > > > > >>>>>
> > > > > > >>>>> Old API: IOChannelUtils.getFactory(glob).match(glob);
> > > > > > >>>>>
> > > > > > >>>>> New API: BeamFileSystems.getFileSystem(glob,
> > > > config).match(glob);
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>> Looking for comments and feedback.
> > > > > > >>>>>
> > > > > > >>>>> Thanks
> > > > > > >>>>>
> > > > > > >>>>> --
> > > > > > >>>>>
> > > > > > >>>>> Pei
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>> --
> > > > > > >>>> Jean-Baptiste Onofré
> > > > > > >>>> jbonofre@apache.org
> > > > > > >>>> http://blog.nanthrax.net
> > > > > > >>>> Talend - http://www.talend.com
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>
> > > > > > >> --
> > > > > > >> Jean-Baptiste Onofré
> > > > > > >> jbonofre@apache.org
> > > > > > >> http://blog.nanthrax.net
> > > > > > >> Talend - http://www.talend.com
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

Posted by Kenneth Knowles <kl...@google.com.INVALID>.
I don't think there is any conflict here.

On Tue, Dec 13, 2016 at 12:34 PM, Pei He <pe...@google.com.invalid> wrote:

> One design decision made during previous design discussion [1] is
> "Replacing
> FilePath with URI for resolving files paths". This has been brought back to
> dev@ mailing list in my previous email.
>

The direction of this argument, in my opinion, gets the burden of proof
wrong.

The original design document effectively proposed "instead of using URIs,
let's make a Beam-specific abstraction" and [1] is just the natural comment
"let's just use URI". This works for the internet, and gives interop with
essentially all code, so you need a very special reason not to do it (and
special cases generally manifest as custom URI schemes).

> Comment [2] asked me to clarify the impact on Windows OS users because
> users have to specify the path in the URI format, such as:
> "file:///C:/home/input-*"
> "C:/home/"
>

It is not really true that users have to do this. For the command line, it
is the responsibility of the code that parses "--filesToStage
C:\my\windows\path". Users should absolutely be able to specify paths like
this on Windows, and it is not difficult and nothing your proposal needs to
solve.

With programmatic creation in Java code, the same principle applies: the
environment-specific String/File/Path should be converted to a URI at the
membrane. Making an API take a URI makes it completely obvious to a Java
programmer that if they have a String/File/Path they need to convert it
appropriately.
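
As a small sketch of converting at the membrane, using only JDK calls
(File.toURI(), Path.toUri(), URI.resolve()); the BoundaryConversion wrapper
class and its method names are hypothetical and exist only to keep the example
self-contained:

```java
import java.io.File;
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

final class BoundaryConversion {
  /** File.toURI() resolves the path against the working directory and adds the file scheme. */
  static URI fromFile(File file) {
    return file.toURI();
  }

  /** Path.toUri() likewise yields an absolute URI for the default file system. */
  static URI fromPath(Path path) {
    return path.toAbsolutePath().toUri();
  }

  /** Resolves a glob under a directory URI; the directory URI should end with "/". */
  static URI resolveGlob(URI directory, String glob) {
    return directory.resolve(glob);
  }

  public static void main(String[] args) {
    URI fromFile = fromFile(new File("input.txt"));
    URI fromPath = fromPath(Paths.get("input.txt"));
    // An asterisk is a legal URI character, so a glob resolves without throwing.
    URI glob = resolveGlob(URI.create("file:///C:/home/"), "output-*");
    System.out.println(fromFile.getScheme() + " " + fromPath.getScheme() + " " + glob.getPath());
  }
}
```

Once the value is a URI, everything downstream can stay file system agnostic;
only this boundary code needs to know about the local platform's path syntax.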

Kenn


> Using URIs in the API is to ensure Beam code is file systems agnostic.
>
> Another alternative is Java Path/File. It is used in the current
> IOChannelFactory API, and it works poorly. For example, Path throws when
> the path contains a file scheme or an asterisk:
> new File("file:///C:/home/").toPath() throws in toPath().
> Paths.get("C:/home/").resolve("output-*") throws in resolve().
>
> Any thoughts and suggestions are welcome.
>
> Thanks
> --
> Pei
>
> ---
> [1]:
> https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-
> XJsVG3qel2lhdKTknmZ_7M/edit?disco=AAAAA30vtPU#heading=h.p3gc3colc2cs
>
> [2]:
> https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-
> XJsVG3qel2lhdKTknmZ_7M/edit?disco=AAAAA02O1cY
>
> On Tue, Dec 6, 2016 at 1:25 PM, Kenneth Knowles <kl...@google.com.invalid>
> wrote:
>
> > Thanks for the thorough answers. It all sounds good to me.
> >
> > On Tue, Dec 6, 2016 at 12:57 PM, Pei He <pe...@google.com.invalid>
> wrote:
> >
> > > Thanks Kenn for the feedback and questions.
> > >
> > > I responded inline.
> > >
> > > On Mon, Dec 5, 2016 at 7:49 PM, Kenneth Knowles <klk@google.com.invalid
> >
> > > wrote:
> > >
> > > > I really like this document. It is easy to read and informative.
> Three
> > > > things not addressed by the document:
> > > >
> > > > 1. Major Beam use cases. I'm sure we have a few in the SDK that could
> > be
> > > > outlined in terms of the new API with pseudocode.
> > >
> > >
> > > (I am writing pseudocode directly with FileSystem interface to
> > demonstrate.
> > > However, clients will use the utility FileSystems. This is for us to
> > have a
> > > layer between the file systems providers' interface and the client
> > > interface. We can add utility functions to FileSystems for common use
> > > patterns as needed.)
> > >
> > > Major Beam use cases are the following:
> > > A. FileBasedSource:
> > > // a. Get input URIs and file sizes from user-provided specs.
> > > // Note: I updated match() to be a bulk operation after I sent my
> > > // last email.
> > > List<MatchResult> results = match(specList);
> > > List<Metadata> inputMetadataList = FluentIterable.from(results)
> > >     .transformAndConcat(
> > >         new Function<MatchResult, Iterable<Metadata>>() {
> > >           @Override
> > >           public Iterable<Metadata> apply(MatchResult result) {
> > >             return Arrays.asList(result.metadata());
> > >           }
> > >         })
> > >     .toList();
> > >
> > > // b. Read from a start offset to support the source splitting.
> > > SeekableByteChannel seekChannel = open(fileUri);
> > > seekChannel.position(source.getStartOffset());
> > > seekChannel.read(...);
> > >
> > > B. FileBasedSink:
> > > // bulk rename temporary files to output files
> > > rename(tempUris, outputUris);
> > >
> > > C. General file operations:
> > > a. resolve paths
> > > b. create file to write, open file to read (for example in tests).
> > > c. bulk delete files/directories
> > >
> > >
> > >
> > > 2. Related work. How does this differ from other filesystem APIs and
> why?
> > >
> > > We need three sets of functionalities:
> > > 1. resolve paths.
> > > 2. read and write channels.
> > > 3. bulk files management operations(bulk delete/rename/match).
> > >
> > > And, they are available from Java nio, hadoop FileSystem APIs, and
> other
> > > standard library such as java.net.URI.
> > >
> > > Current IOChannelFactory interface uses Java nio for (1) and (2), and
> > > define its own interface for (3).
> > >
> > > In my redesign, I made the following choices:
> > > For (1), I replaced Java nio with URI, because it is standardized and
> > > precise and doesn't require additional implementation of a Path
> interface
> > > from file system providers.
> > >
> > > For (2), I kept the uses of Java nio (Writable/SeekableByteChannel),
> > since
> > > I don't see any things that need to improve and I don't see any better
> > > alternatives (hadoop's FSDataInput/OutputStream provide same
> > > functionalities, but requires additional dependencies).
> > >
> > > For (3), reasons that I didn't choose Java nio or hadoop are:
> > > 1. Beam needs bulk operations API for better performance, however Java
> > nio
> > > and hadoop FileSystems are single file based API.
> > > 2. Have APIs that are File systems agnostic. For example, we can use
> URI
> > > instead of Path.
> > > 3. Have APIs that are minimum, and easy to implement by file system
> > > providers.
> > > 4. Introducing less dependencies.
> > > 5. It is easy to build an adaptor based on Java nio or hadoop
> interfaces.
> > >
> > > 3. Discussion of non-Java languages. It would be good to know what
> > classes
> > > > in e.g. Python we might use in place of URI, SeekableByteChannel,
> etc.
> > >
> > > I don't want to mislead people here without a thorough investigation.
> You
> > > can see from your second question, that would require iterations on
> > design
> > > and prototyping.
> > >
> > > I didn't introduce any Java specific requirements in the redesign.
> > > Resolving paths, seeking with channels or streams, file management
> > > operations are languages independent. And, I pretty sure there are
> python
> > > libraries for that.
> > >
> > > However, I am happy to hear thoughts and get help from people working
> on
> > > the python sdk.
> > >
> > >

Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

Posted by Amir Bahmanyari <am...@yahoo.com.INVALID>.
How can I unsubscribe?
I will be away from this subject for some time and
will rejoin once I get back.
Thanks, colleagues.
Happy holidays!



Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

Posted by Pei He <pe...@google.com.INVALID>.
One design decision made during the previous design discussion [1] is
"Replacing FilePath with URI for resolving file paths". I brought this back
to the dev@ mailing list in my previous email.

Comment [2] asked me to clarify the impact on Windows users, because
users have to specify paths in URI format, such as:
"file:///C:/home/input-*"
"C:/home/"

The reason for using URIs in the API is to keep Beam code file system agnostic.

An alternative is Java Path/File. It is used in the current
IOChannelFactory API, and it works poorly. For example, on Windows Path
throws when the path contains a file scheme or an asterisk:
new File("file:///C:/home/").toPath() throws in toPath().
Paths.get("C:/home/").resolve("output-*") throws in resolve().
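
For comparison, here is a minimal sketch of the same two operations done with
java.net.URI from the JDK. The class and method names (UriResolveSketch,
schemeOf, resolvedPath) are made up for illustration and are not part of the
proposal or of Beam. It only shows that URI handling is string based, so it
accepts the file scheme and the asterisk on any platform.

```java
import java.net.URI;

/** Illustrative only: UriResolveSketch is a hypothetical name, not a Beam class. */
final class UriResolveSketch {

  /** Returns the scheme of a spec written in URI form. */
  static String schemeOf(String spec) {
    return URI.create(spec).getScheme();
  }

  /** Resolves a child name (which may contain a glob asterisk) against a directory URI. */
  static String resolvedPath(String directoryUri, String child) {
    // URI.resolve works on the string form only: it never consults the local
    // file system, and '*' is a legal URI character.
    return URI.create(directoryUri).resolve(child).getPath();
  }

  public static void main(String[] args) {
    System.out.println(schemeOf("file:///C:/home/input-*"));
    System.out.println(resolvedPath("file:///C:/home/", "output-*"));
  }
}
```

Note that the sketch compares getPath() rather than the full string form,
because the string form of a resolved file URI may not keep the "///" prefix.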

Any thoughts and suggestions are welcome.

Thanks
--
Pei

---
[1]:
https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit?disco=AAAAA30vtPU#heading=h.p3gc3colc2cs

[2]:
https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit?disco=AAAAA02O1cY


Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

Posted by Kenneth Knowles <kl...@google.com.INVALID>.
Thanks for the thorough answers. It all sounds good to me.

On Tue, Dec 6, 2016 at 12:57 PM, Pei He <pe...@google.com.invalid> wrote:

> Thanks Kenn for the feedback and questions.
>
> I responded inline.
>
> On Mon, Dec 5, 2016 at 7:49 PM, Kenneth Knowles <kl...@google.com.invalid>
> wrote:
>
> > I really like this document. It is easy to read and informative. Three
> > things not addressed by the document:
> >
> > 1. Major Beam use cases. I'm sure we have a few in the SDK that could be
> > outlined in terms of the new API with pseudocode.
>
>
> (I am writing the pseudocode directly against the FileSystem interface to
> demonstrate. Clients, however, will use the FileSystems utility, which gives
> us a layer between the file system providers' interface and the client
> interface. We can add utility functions to FileSystems for common usage
> patterns as needed.)
>
> The major Beam use cases are the following:
> A. FileBasedSource:
> // a. Get input URIs and file sizes from user-provided specs.
> // Note: I updated match() to be a bulk operation after I sent my last
> email.
> List<MatchResult> results = match(specList);
> List<Metadata> inputMetadataList = FluentIterable.from(results)
>     .transformAndConcat(
>         new Function<MatchResult, Iterable<Metadata>>() {
>           @Override
>           public Iterable<Metadata> apply(MatchResult result) {
>             return Arrays.asList(result.metadata());
>           }
>         })
>     .toList();
>
> // b. Read from a start offset to support source splitting.
> SeekableByteChannel seekChannel = open(fileUri);
> seekChannel.position(source.getStartOffset());
> seekChannel.read(...);
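
The offset read in (b) can be tried out with plain java.nio on a local file.
This is only a sketch under that assumption: OffsetReadSketch and its methods
are hypothetical names, and Files.newByteChannel stands in for the open(fileUri)
call of the proposed interface.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Illustrative only: OffsetReadSketch is a hypothetical name, not a Beam class. */
final class OffsetReadSketch {

  /** Reads up to maxBytes starting at startOffset, the way a split source would. */
  static String readFrom(Path file, long startOffset, int maxBytes) throws IOException {
    try (SeekableByteChannel channel = Files.newByteChannel(file, StandardOpenOption.READ)) {
      channel.position(startOffset); // jump straight to the first byte of the split
      ByteBuffer buffer = ByteBuffer.allocate(maxBytes);
      // read() may return fewer bytes than requested, so loop until full or EOF.
      while (buffer.hasRemaining() && channel.read(buffer) != -1) {
        // keep reading
      }
      buffer.flip();
      return StandardCharsets.UTF_8.decode(buffer).toString();
    }
  }

  /** Writes ten digits to a temp file and reads three bytes starting at offset 4. */
  static String demo() {
    try {
      Path file = Files.createTempFile("offset-read", ".txt");
      try {
        Files.write(file, "0123456789".getBytes(StandardCharsets.UTF_8));
        return readFrom(file, 4, 3);
      } finally {
        Files.deleteIfExists(file);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```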
>
> B. FileBasedSink:
> // bulk rename temporary files to output files
> rename(tempUris, outputUris);
>
> C. General file operations:
> a. resolve paths
> b. create file to write, open file to read (for example in tests).
> c. bulk delete files/directories
>
>
>
> > 2. Related work. How does this differ from other filesystem APIs and why?
>
> We need three sets of functionality:
> 1. Resolving paths.
> 2. Read and write channels.
> 3. Bulk file management operations (bulk delete/rename/match).
>
> They are available from Java NIO, the Hadoop FileSystem APIs, and other
> standard libraries such as java.net.URI.
>
> The current IOChannelFactory interface uses Java NIO for (1) and (2), and
> defines its own interface for (3).
>
> In my redesign, I made the following choices:
> For (1), I replaced Java NIO with URI, because it is standardized and
> precise, and doesn't require file system providers to implement an
> additional Path interface.
>
> For (2), I kept the use of Java NIO (Writable/SeekableByteChannel), since
> I don't see anything that needs to improve and I don't see any better
> alternatives (Hadoop's FSDataInput/OutputStream provide the same
> functionality, but require additional dependencies).
>
> For (3), the reasons I didn't choose Java NIO or Hadoop are:
> 1. Beam needs a bulk operations API for better performance, whereas the
> Java NIO and Hadoop FileSystem APIs are single-file based.
> 2. The APIs should be file system agnostic. For example, we can use URI
> instead of Path.
> 3. The APIs should be minimal and easy for file system providers to
> implement.
> 4. It introduces fewer dependencies.
> 5. It is easy to build an adaptor based on the Java NIO or Hadoop interfaces.
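
Points 1, 2 and 5 can be illustrated together. The sketch below assumes a
hypothetical provider-facing interface (BulkFileSystemSketch) with URI-based
bulk rename and delete, and an adaptor (NioBulkAdaptorSketch) that implements
it by looping over single-file java.nio calls. None of these names or
signatures come from the design doc.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Hypothetical provider-facing interface with bulk, URI-based operations. */
interface BulkFileSystemSketch {
  void rename(List<URI> sources, List<URI> destinations) throws IOException;

  void delete(List<URI> uris) throws IOException;
}

/** Adaptor that implements the bulk interface on top of single-file java.nio calls. */
final class NioBulkAdaptorSketch implements BulkFileSystemSketch {

  @Override
  public void rename(List<URI> sources, List<URI> destinations) throws IOException {
    if (sources.size() != destinations.size()) {
      throw new IllegalArgumentException("sources and destinations must pair up");
    }
    for (int i = 0; i < sources.size(); i++) {
      Files.move(Paths.get(sources.get(i)), Paths.get(destinations.get(i)));
    }
  }

  @Override
  public void delete(List<URI> uris) throws IOException {
    for (URI uri : uris) {
      Files.delete(Paths.get(uri)); // throws NoSuchFileException for a missing file
    }
  }

  /** Renames two temp files in one call and reports whether only the outputs remained. */
  static boolean demo() {
    try {
      Path dir = Files.createTempDirectory("bulk-sketch");
      Path temp0 = Files.createFile(dir.resolve("temp-0"));
      Path temp1 = Files.createFile(dir.resolve("temp-1"));
      Path out0 = dir.resolve("output-0");
      Path out1 = dir.resolve("output-1");

      BulkFileSystemSketch fs = new NioBulkAdaptorSketch();
      fs.rename(
          Arrays.asList(temp0.toUri(), temp1.toUri()),
          Arrays.asList(out0.toUri(), out1.toUri()));
      boolean renamed = Files.exists(out0) && Files.exists(out1)
          && !Files.exists(temp0) && !Files.exists(temp1);

      // Clean up through the same bulk interface.
      fs.delete(Arrays.asList(out0.toUri(), out1.toUri()));
      Files.delete(dir);
      return renamed && !Files.exists(out0);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

A provider with a native batch API could implement the same interface with a
single request per call instead of a loop, which is the performance argument
in point 1.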
>
> > 3. Discussion of non-Java languages. It would be good to know what classes
> > in e.g. Python we might use in place of URI, SeekableByteChannel, etc.
>
> I don't want to mislead people here without a thorough investigation. As you
> can see from your second question, that would require iterations on design
> and prototyping.
>
> I didn't introduce any Java-specific requirements in the redesign.
> Resolving paths, seeking with channels or streams, and file management
> operations are language independent, and I am pretty sure there are Python
> libraries for them.
>
> However, I am happy to hear thoughts and get help from people working on
> the Python SDK.
>
>
> > On Mon, Dec 5, 2016 at 4:41 PM, Pei He <pe...@google.com.invalid> wrote:
> >
> > > I have received a lot of comments on "Part 1: IOChannelFactory
> > > Redesign" [1], and I have updated the design based on the feedback.
> > >
> > > I feel it is now close to being ready for implementation, and I would
> > > like to summarize the changes:
> > > 1. Replaced FilePath with URI for resolving file paths.
> > > 2. Required match(String spec) to handle ambiguities in user-provided
> > > strings (see the match() Javadoc in the design doc for details).
> > > 3. Changed Metadata to use the Future.get() paradigm, and removed
> > > exception().
> > > 4. Changed the methods on the FileSystem interface to be protected
> > > (visible to implementors), and created a FileSystems utility (visible
> > > to callers).
> > > 5. Simplified the FileSystem interface by moving operation options, such
> > > as DeleteOptions and MatchOptions, to the FileSystems utility.
> > > 6. Simplified the FileSystem interface by requiring certain behaviors,
> > > such as creating recursively and throwing for missing files.
> > >
> > > Any thoughts / feedback?
> > > --
> > > Pei
> > >
> > > [1]
> > > https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-
> > > XJsVG3qel2lhdKTknmZ_7M/edit#
> > >
> > > On Wed, Nov 30, 2016 at 1:32 PM, Pei He <pe...@google.com> wrote:
> > >
> > > > Thanks JB for the feedback.
> > > >
> > > > Yes, we should provide a hadoop.fs.FileSystem adaptor. As you said,
> it
> > > > will make a range of file system available in Beam.
> > > >
> > > > And, people can choose to implement BeamFileSystem directly to get
> the
> > > > best performance (For example, providing bulk operations.)
> > > >
> > > > --
> > > > Pei
> > > >
> > > >
> > > >
> > > > On Tue, Nov 29, 2016 at 11:11 AM, Jean-Baptiste Onofré <
> > jb@nanthrax.net>
> > > > wrote:
> > > >
> > > >> Hi Pei,
> > > >>
> > > >> rethinking about that, I understand that the purpose of the Beam
> > > >> filesystem is to avoid to bring a bunch of dependencies into the
> core.
> > > That
> > > >> makes perfect sense.
> > > >>
> > > >> So, I agree that a Beam filesystem abstract is fine.
> > > >>
> > > >> My point is that we should provide a HadoopFilesystem
> extension/plugin
> > > >> for Beam filesystem asap: that would help us to support a good range
> > of
> > > >> filesystems quickly.
> > > >>
> > > >> Just my $0.01 ;)
> > > >>
> > > >> Regards
> > > >> JB
> > > >>
> > > >>
> > > >> On 11/17/2016 08:18 PM, Pei He wrote:
> > > >>
> > > >>> Hi JB,
> > > >>> My proposals are based on the current IOChannelFactory, and how
> they
> > > are
> > > >>> used in FileBasedSink.
> > > >>>
> > > >>> Let's me spend more time to investigate Hadoop FileSystem
> interface.
> > > >>> --
> > > >>> Pei
> > > >>>
> > > >>> On Thu, Nov 17, 2016 at 1:21 AM, Jean-Baptiste Onofré <
> > jb@nanthrax.net
> > > >
> > > >>> wrote:
> > > >>>
> > > >>> By the way, Pei, for the record: why introducing BeamFileSystem and
> > not
> > > >>>> using the Hadoop FileSystem interface ?
> > > >>>>
> > > >>>> Thanks
> > > >>>> Regards
> > > >>>> JB
> > > >>>>
> > > >>>> On 11/17/2016 01:09 AM, Pei He wrote:
> > > >>>>
> > > >>>> Hi,
> > > >>>>>
> > > >>>>> I am working on BEAM-59
> > > >>>>> <https://issues.apache.org/jira/browse/BEAM-59>
> "IOChannelFactory
> > > >>>>> redesign". The goals are:
> > > >>>>>
> > > >>>>> 1. Support file-based IOs (TextIO, AvorIO) with user-defined file
> > > >>>>> system.
> > > >>>>>
> > > >>>>> 2. Support configuring any user-defined file system.
> > > >>>>>
> > > >>>>> And, I drafted the design proposal in two parts to address them
> in
> > > >>>>> order:
> > > >>>>>
> > > >>>>> Part 1: IOChannelFactory Redesign
> > > >>>>> <https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJ
> > > >>>>> sVG3qel2lhdKTknmZ_7M/edit#>
> > > >>>>>
> > > >>>>> Summary:
> > > >>>>>
> > > >>>>> Old API: WritableByteChannel create(String spec, String
> mimeType);
> > > >>>>>
> > > >>>>> New API: WritableByteChannel create(URI uri, CreateOptions
> > options);
> > > >>>>>
> > > >>>>> Noticeable proposed changes:
> > > >>>>>
> > > >>>>>
> > > >>>>>    1.
> > > >>>>>
> > > >>>>>    Includes the options parameter in most methods to specify
> > > behaviors.
> > > >>>>>    2.
> > > >>>>>
> > > >>>>>    Replace String with URI to include scheme for
> files/directories
> > > >>>>>    locations.
> > > >>>>>    3.
> > > >>>>>
> > > >>>>>    Require file systems to provide a SeekableByteChannel for
> read.
> > > >>>>>    4.
> > > >>>>>
> > > >>>>>    Additional methods, such as getMetadata(), rename() e.t.c
> > > >>>>>
> > > >>>>>
> > > >>>>> Part 2: Configurable BeamFileSystem
> > > >>>>> <https://docs.google.com/document/d/1-7vo9nLRsEEzDGnb562PuL4
> > > >>>>> q9mUiq_ZVpCAiyyJw8p8/edit#heading=h.p3gc3colc2cs>
> > > >>>>>
> > > >>>>> Summary:
> > > >>>>>
> > > >>>>> Old API: IOChannelUtils.getFactory(glob).match(glob);
> > > >>>>>
> > > >>>>> New API: BeamFileSystems.getFileSystem(glob,
> config).match(glob);
> > > >>>>>
> > > >>>>>
> > > >>>>> Looking for comments and feedback.
> > > >>>>>
> > > >>>>> Thanks
> > > >>>>>
> > > >>>>> --
> > > >>>>>
> > > >>>>> Pei
> > > >>>>>
> > > >>>>>
> > > >>>>> --
> > > >>>> Jean-Baptiste Onofré
> > > >>>> jbonofre@apache.org
> > > >>>> http://blog.nanthrax.net
> > > >>>> Talend - http://www.talend.com
> > > >>>>
> > > >>>>
> > > >>>
> > > >> --
> > > >> Jean-Baptiste Onofré
> > > >> jbonofre@apache.org
> > > >> http://blog.nanthrax.net
> > > >> Talend - http://www.talend.com
> > > >>
> > > >
> > > >
> > >
> >
>

Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

Posted by Pei He <pe...@google.com.INVALID>.
Thanks Kenn for the feedback and questions.

I responded inline.

On Mon, Dec 5, 2016 at 7:49 PM, Kenneth Knowles <kl...@google.com.invalid>
wrote:

> I really like this document. It is easy to read and informative. Three
> things not addressed by the document:
>
> 1. Major Beam use cases. I'm sure we have a few in the SDK that could be
> outlined in terms of the new API with pseudocode.


(I am writing the pseudocode directly against the FileSystem interface to
demonstrate. Clients, however, will use the FileSystems utility. This gives
us a layer between the file system providers' interface and the client
interface, and we can add utility functions to FileSystems for common usage
patterns as needed.)

The major Beam use cases are the following:
A. FileBasedSource:
// a. Get input URIs and file sizes from user-provided specs.
// Note: I updated match() to be a bulk operation after I sent my last email.
List<MatchResult> results = match(specList);
List<Metadata> inputMetadataList = FluentIterable.from(results)
    .transformAndConcat(
        new Function<MatchResult, Iterable<Metadata>>() {
          @Override
          public Iterable<Metadata> apply(MatchResult result) {
            return Arrays.asList(result.metadata());
          }
        })
    .toList();

// b. Read from a start offset to support source splitting.
SeekableByteChannel seekChannel = open(fileUri);
seekChannel.position(source.getStartOffset());
seekChannel.read(...);

B. FileBasedSink:
// bulk rename temporary files to output files
rename(tempUris, outputUris);

C. General file operations:
a. resolve paths
b. create file to write, open file to read (for example in tests).
c. bulk delete files/directories
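
As a rough illustration of use case A.b, the sketch below reads a byte range
from a start offset through a SeekableByteChannel. It uses the local file
system via java.nio.file.Files rather than the proposed FileSystem interface;
the class name OffsetReadSketch and the method readRange are hypothetical and
do not come from the design doc.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper; stands in for a split reader of a file-based source. */
final class OffsetReadSketch {

  /** Reads up to {@code length} bytes starting at {@code startOffset}. */
  static String readRange(Path file, long startOffset, int length) throws IOException {
    try (SeekableByteChannel channel = Files.newByteChannel(file, StandardOpenOption.READ)) {
      // Jump to the start of this split instead of reading from byte 0.
      channel.position(startOffset);
      ByteBuffer buffer = ByteBuffer.allocate(length);
      // read() may fill the buffer only partially, so loop until full or EOF.
      while (buffer.hasRemaining() && channel.read(buffer) != -1) {
        // Keep reading.
      }
      buffer.flip();
      return StandardCharsets.UTF_8.decode(buffer).toString();
    }
  }

  public static void main(String[] args) throws IOException {
    Path file = Files.createTempFile("offset-read", ".txt");
    Files.write(file, "header|record-1|record-2".getBytes(StandardCharsets.UTF_8));
    // Skip the 7-byte "header|" prefix and read the next 8 bytes.
    System.out.println(readRange(file, 7, 8));
  }
}
```

A real source would align the range to record boundaries; the sketch only
shows the seek-then-read pattern that SeekableByteChannel makes possible.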



> 2. Related work. How does this differ from other filesystem APIs and why?

We need three sets of functionality:
1. Resolving paths.
2. Read and write channels.
3. Bulk file management operations (bulk delete/rename/match).

These are available from Java nio, the Hadoop FileSystem APIs, and other
standard libraries such as java.net.URI.

The current IOChannelFactory interface uses Java nio for (1) and (2), and
defines its own interface for (3).

In my redesign, I made the following choices:
For (1), I replaced Java nio with URI, because it is standardized and
precise, and it doesn't require file system providers to implement an
additional Path interface.
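
To make the URI choice concrete, here is a minimal sketch of resolving an
output file name against a directory URI with plain java.net.URI. The class
name UriResolveSketch, the method resolveFile, and the gs://bucket/output
location are made up for illustration; the design doc may handle directories
differently.

```java
import java.net.URI;

/** Hypothetical helper showing scheme-aware path resolution with java.net.URI. */
final class UriResolveSketch {

  /** Resolves {@code fileName} against {@code directory}, treating it as a directory. */
  static URI resolveFile(URI directory, String fileName) {
    String base = directory.toString();
    // URI.resolve() replaces the last path segment unless the base ends with "/".
    URI directoryUri = base.endsWith("/") ? directory : URI.create(base + "/");
    // fileName must already be a valid (escaped) relative URI reference.
    return directoryUri.resolve(fileName);
  }

  public static void main(String[] args) {
    URI output = resolveFile(URI.create("gs://bucket/output"), "part-00000");
    System.out.println(output.getScheme()); // gs
    System.out.println(output);             // gs://bucket/output/part-00000
  }
}
```

The scheme travels with the location, so the same call works for any
provider without a provider-specific Path implementation.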

For (2), I kept the use of Java nio (Writable/SeekableByteChannel), since
I don't see anything that needs to improve, and I don't see any better
alternatives (Hadoop's FSDataInput/OutputStream provide the same
functionality, but require additional dependencies).

For (3), the reasons I didn't choose Java nio or Hadoop are:
1. Beam needs a bulk operations API for better performance, whereas the
Java nio and Hadoop FileSystem APIs are single-file based.
2. The APIs should be file system agnostic. For example, we can use URI
instead of Path.
3. The APIs should be minimal and easy for file system providers to
implement.
4. It introduces fewer dependencies.
5. It is easy to build an adaptor on top of the Java nio or Hadoop interfaces.
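
Point 5 can be sketched as follows: a small bulk-oriented contract and an
adaptor that implements it with single-file java.nio calls. The names
BulkFileOperations and NioBulkAdapter and both method signatures are
assumptions made for this example, not the API from the design doc, and the
adaptor only handles file: URIs.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical bulk-oriented contract; method names are illustrative only. */
interface BulkFileOperations {
  void rename(List<URI> sources, List<URI> destinations) throws IOException;

  void delete(List<URI> uris) throws IOException;
}

/** Adaptor that implements the bulk contract on top of single-file java.nio calls. */
final class NioBulkAdapter implements BulkFileOperations {

  @Override
  public void rename(List<URI> sources, List<URI> destinations) throws IOException {
    if (sources.size() != destinations.size()) {
      throw new IllegalArgumentException("sources and destinations must have the same size");
    }
    for (int i = 0; i < sources.size(); i++) {
      Path destination = Paths.get(destinations.get(i));
      // Create missing parent directories before moving the file.
      Files.createDirectories(destination.getParent());
      Files.move(Paths.get(sources.get(i)), destination, StandardCopyOption.REPLACE_EXISTING);
    }
  }

  @Override
  public void delete(List<URI> uris) throws IOException {
    for (URI uri : uris) {
      // Files.delete throws NoSuchFileException when the file is missing.
      Files.delete(Paths.get(uri));
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("bulk-adapter");
    List<URI> tempUris = new ArrayList<>();
    List<URI> outputUris = new ArrayList<>();
    for (int shard = 0; shard < 3; shard++) {
      tempUris.add(Files.createFile(dir.resolve("temp-" + shard)).toUri());
      outputUris.add(dir.resolve("output").resolve("part-" + shard).toUri());
    }
    BulkFileOperations fileSystem = new NioBulkAdapter();
    fileSystem.rename(tempUris, outputUris);
    System.out.println(Files.exists(Paths.get(outputUris.get(0))));
    fileSystem.delete(outputUris);
  }
}
```

The adaptor simply loops, so it gains nothing in performance; a provider
backed by a service with batch requests could implement the same contract
with one call per batch, which is the motivation given in point 1.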

> 3. Discussion of non-Java languages. It would be good to know what classes
> in e.g. Python we might use in place of URI, SeekableByteChannel, etc.

I don't want to mislead people here without a thorough investigation. As you
can see from your second question, that would require iterations of design
and prototyping.

I didn't introduce any Java-specific requirements in the redesign.
Resolving paths, seeking with channels or streams, and file management
operations are language independent, and I am pretty sure there are Python
libraries for them.

However, I am happy to hear thoughts and get help from people working on
the Python SDK.



Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

Posted by Kenneth Knowles <kl...@google.com.INVALID>.
I really like this document. It is easy to read and informative. Three
things not addressed by the document:

1. Major Beam use cases. I'm sure we have a few in the SDK that could be
outlined in terms of the new API with pseudocode.
2. Related work. How does this differ from other filesystem APIs and why?
3. Discussion of non-Java languages. It would be good to know what classes
in e.g. Python we might use in place of URI, SeekableByteChannel, etc.


Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

Posted by Pei He <pe...@google.com.INVALID>.
I have received a lot of comments on "Part 1: IOChannelFactory
Redesign" [1], and I have updated the design based on the feedback.

Now I feel it is close to being ready for implementation, and I would like to
summarize the changes:
1. Replaced FilePath with URI for resolving file paths.
2. Required match(String spec) to handle ambiguities in user-provided
strings (see the match() Javadoc in the design doc for details).
3. Changed Metadata to use the Future.get() paradigm, and removed exception().
4. Changed the methods on the FileSystem interface to be protected (visible to
implementors), and created a FileSystems utility (visible to callers).
5. Simplified the FileSystem interface by moving operation options, such as
DeleteOptions and MatchOptions, to the FileSystems utility.
6. Simplified the FileSystem interface by requiring certain behaviors, such as
creating recursively and throwing for missing files.
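
Changes 4 to 6 can be pictured with the sketch below: the provider-facing
class has one protected method with a fixed behavior (it throws for missing
files), and the caller-facing utility interprets the options. All names here
(SketchFileSystem, SketchFileSystems, InMemoryFileSystem, DeleteOption, the
mem scheme) are hypothetical stand-ins, not the classes from the design doc.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.NoSuchFileException;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Provider-facing contract: no options, always throws for a missing file. */
abstract class SketchFileSystem {
  protected abstract void delete(List<URI> uris) throws IOException;
}

/** Toy provider used only to exercise the sketch. */
final class InMemoryFileSystem extends SketchFileSystem {
  final Set<URI> files = new HashSet<>();

  @Override
  protected void delete(List<URI> uris) throws IOException {
    for (URI uri : uris) {
      if (!files.remove(uri)) {
        throw new NoSuchFileException(uri.toString());
      }
    }
  }
}

/** Caller-facing utility: picks the provider by scheme and interprets options. */
final class SketchFileSystems {
  enum DeleteOption { IGNORE_MISSING_FILES }

  private static final Map<String, SketchFileSystem> REGISTRY = new HashMap<>();

  static void register(String scheme, SketchFileSystem fileSystem) {
    REGISTRY.put(scheme, fileSystem);
  }

  static void delete(List<URI> uris, DeleteOption... options) throws IOException {
    if (uris.isEmpty()) {
      return;
    }
    // Simplification: all URIs in one call are assumed to share a scheme.
    SketchFileSystem fileSystem = REGISTRY.get(uris.get(0).getScheme());
    if (fileSystem == null) {
      throw new IllegalArgumentException("No file system registered for " + uris.get(0));
    }
    if (!Arrays.asList(options).contains(DeleteOption.IGNORE_MISSING_FILES)) {
      fileSystem.delete(uris); // One bulk call; a missing file propagates as an exception.
      return;
    }
    for (URI uri : uris) {
      try {
        fileSystem.delete(Collections.singletonList(uri));
      } catch (NoSuchFileException e) {
        // The caller asked to ignore missing files.
      }
    }
  }

  public static void main(String[] args) throws IOException {
    InMemoryFileSystem memory = new InMemoryFileSystem();
    URI present = URI.create("mem://bucket/a");
    URI missing = URI.create("mem://bucket/b");
    memory.files.add(present);
    register("mem", memory);
    delete(Arrays.asList(present, missing), DeleteOption.IGNORE_MISSING_FILES);
    System.out.println(memory.files.isEmpty());
  }
}
```

Falling back to per-file deletes when the option is set is a shortcut taken
for brevity; it gives up the bulk call and is not necessarily how the
utility in the design doc would behave.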

Any thoughts / feedback?
--
Pei

[1]
https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#


Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

Posted by Pei He <pe...@google.com.INVALID>.
Thanks JB for the feedback.

Yes, we should provide a hadoop.fs.FileSystem adaptor. As you said, it will
make a range of file systems available in Beam.

People can also choose to implement BeamFileSystem directly to get the best
performance (for example, by providing bulk operations).

--
Pei




Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Pei,

Thinking about this again, I understand that the purpose of the Beam
filesystem is to avoid bringing a bunch of dependencies into the core.
That makes perfect sense.

So, I agree that a Beam filesystem abstraction is fine.

My point is that we should provide a HadoopFilesystem extension/plugin
for the Beam filesystem asap: that would help us support a good range of
filesystems quickly.

Just my $0.01 ;)

Regards
JB


-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

Posted by Pei He <pe...@google.com.INVALID>.
Hi JB,
My proposals are based on the current IOChannelFactory and how it is
used in FileBasedSink.

Let me spend more time investigating the Hadoop FileSystem interface.
--
Pei


Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
By the way, Pei, for the record: why introduce BeamFileSystem rather than
use the Hadoop FileSystem interface?

Thanks
Regards
JB


-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com