Posted to dev@beam.apache.org by Aviem Zur <av...@gmail.com> on 2017/01/30 08:24:15 UTC

TextIO binary file

Hi,

While trying to use TextIO to write/read a binary file rather than String
lines from a textual file I ran into an issue - the delimiter TextIO uses
seems to be hardcoded '\n'.
See `findSeparatorBounds` -
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024

The use case is to have a file of objects, encoded into bytes using a
coder. However, '\n' is not a good delimiter here, as you can imagine.
A similar pattern is found in Spark's `saveAsObjectFile`
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
where they use a more appropriate delimiter, to avoid such issues.

I did not find any unit tests which use TextIO to read anything other than
Strings.
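
To make the delimiter problem concrete, here is a minimal, self-contained sketch. It does not use Beam or TextIO; the class and method names are hypothetical. It writes fixed-width binary integers separated by '\n' and then splits on '\n' the way a line-oriented reader would. Any value whose encoding contains the byte 0x0A is cut in the wrong place.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not part of Beam.
final class NewlineCollisionDemo {

    /** Encodes each int as 4 big-endian bytes followed by a '\n' delimiter. */
    static byte[] writeDelimited(int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * (Integer.BYTES + 1));
        for (int v : values) {
            buf.putInt(v);
            buf.put((byte) '\n');
        }
        return buf.array();
    }

    /** Splits on every '\n' byte, as a line-based reader would. */
    static List<byte[]> splitOnNewline(byte[] data) {
        List<byte[]> records = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {
                records.add(Arrays.copyOfRange(data, start, i));
                start = i + 1;
            }
        }
        if (start < data.length) {
            // Trailing bytes without a final delimiter.
            records.add(Arrays.copyOfRange(data, start, data.length));
        }
        return records;
    }

    public static void main(String[] args) {
        // 10 is 0x0000000A and 266 is 0x0000010A: both contain the byte 0x0A.
        int[] values = {1, 10, 266};
        List<byte[]> records = splitOnNewline(writeDelimited(values));
        System.out.println("written=" + values.length + " read=" + records.size());
    }
}
```

With the values above, three records are written but five fragments are read back, none of which can be trusted to be a whole record.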

Re: TextIO binary file

Posted by Dan Halperin <dh...@google.com.INVALID>.
Stas' comment is the right one. The "canonical" use of TextIO is using
something like a TextualIntegerCoder
<https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/TextualIntegerCoder.java>,
but that should almost certainly be replaced with TextIO.Read |
ParDo.of(Parse integer). The `withCoder` functions need to get removed or
replaced.

For "holding a file of arbitrary records" -- simply producing a
delimiter-separated TextIO is probably not a good choice. Specifically,
splitting is broken when the delimiter might appear in the output (e.g.,
when using almost any coder). A better option is to design a file format to
hold arbitrary records. E.g., an Avro file where each record is just a
byte[].

Dan
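
As a rough illustration of what "a file format to hold arbitrary records" involves at minimum, the sketch below frames each record with a 4-byte length prefix instead of a delimiter. It is only a sketch with hypothetical names, not Beam or Avro code, and it ignores the splitting concern: a reader has to start from the beginning of the data. As far as I know, Avro container files address that part with sync markers between blocks.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of Beam or Avro.
final class LengthPrefixedRecords {

    /** Writes each record as a 4-byte big-endian length followed by its bytes. */
    static byte[] write(List<byte[]> records) {
        int size = 0;
        for (byte[] record : records) {
            size += Integer.BYTES + record.length;
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        for (byte[] record : records) {
            buf.putInt(record.length);
            buf.put(record);
        }
        return buf.array();
    }

    /** Reads records back by trusting the length prefix, never the content. */
    static List<byte[]> read(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        List<byte[]> records = new ArrayList<>();
        while (buf.hasRemaining()) {
            int length = buf.getInt();
            byte[] record = new byte[length];
            buf.get(record);
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        List<byte[]> in = new ArrayList<>();
        in.add(new byte[] {0x0A, 0x0A});   // payload made only of '\n' bytes
        in.add(new byte[0]);               // empty record
        in.add(new byte[] {1, 2, 3});
        List<byte[]> out = read(write(in));
        System.out.println("records read: " + out.size());
    }
}
```

Because the reader never inspects the payload, records may contain any byte value, including '\n'. The cost is that the file is no longer line-oriented text and cannot be split at arbitrary offsets.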

On Mon, Jan 30, 2017 at 2:52 AM, Aviem Zur <av...@gmail.com> wrote:

> The Javadoc of TextIO states:
>
> * <p>By default, {@link TextIO.Read} returns a {@link PCollection} of
> {@link String Strings},
>  * each corresponding to one line of an input UTF-8 text file. To convert
> directly from the raw
>  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to another
> object of type {@code T},
>  * supply a {@code Coder<T>} using {@link TextIO.Read#withCoder(Coder)}.
>
> However, as I stated, `withCoder` doesn't seem to have tests, and probably
> won't work given the hard-coded '\n' delimiter.
>
> On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
>
> > Hi Aviem,
> >
> > TextIO is not designed to write/read binary file: it's pure Text, so
> > String.
> >
> > Regards
> > JB
> >
> >
> > --
> > Jean-Baptiste Onofré
> > jbonofre@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>

Re: TextIO binary file

Posted by Aviem Zur <av...@gmail.com>.
Looks like Eugene addressed this in the following ticket:
https://issues.apache.org/jira/browse/BEAM-1354
Just added a bullet regarding updating the javadoc.

On Tue, Jan 31, 2017 at 12:47 PM Aviem Zur <av...@gmail.com> wrote:

> So if I understand correctly, the general agreement is that TextIO should
> not support anything but lines from files as strings.
> I'll go ahead and file a ticket saying that the Javadoc should be changed to
> reflect this and that the `withCoder` method should be removed.
>
> Is there merit for Beam to supply an IO which does allow writing objects
> to a file using Beam coders and Beam FS (To write these files to
> GS/Hadoop/Local)?
>
> On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov
> <ki...@google.com.invalid> wrote:
>
> P.S. Note that this point (about coders) is also mentioned in the
> now-being-reviewed PTransform Style Guide
> https://github.com/apache/beam-site/pull/134
> currently staged at
>
> http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders
>
>
> On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath <ch...@apache.org>
> wrote:
>
> > +1 to what Eugene said.
> >
> > I've seen a number of Python SDK users incorrectly assuming that
> > coder.decode() is needed when developing their own file-based sources
> > (since many users usually refer to text source first). Probably coder
> > parameter should not be configurable for text source/sink and they should
> > be updated to only read/write UTF-8 encoded strings.
> >
> > - Cham
> >
> > On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
> > <ki...@google.com.invalid> wrote:
> >
> > > > The use of Coder in TextIO is a long standing design issue because
> > > > coders are not intended to be used for general purpose converting
> > > > things from and to bytes; their only proper use is letting the runner
> > > > materialize and restore objects if the runner thinks it's necessary. IMO
> > > > it should have been called LineIO, document that it reads lines of text
> > > > as String, and not have a withCoder parameter at all.
> > >
> > > The proper way to address your use case is to write a custom
> > > FileBasedSource.

Re: TextIO binary file

Posted by Robert Bradshaw <ro...@google.com.INVALID>.
I agree documentation should be better around this.

If we decided to go down the route of adding FileObjectIO, I would suggest
making it a PTransform that wraps an AvroIO rather than a new file format
(unless it can be demonstrated that there are significant performance
concerns). In this case one simply needs T <-> POJO, not T <-> byte[],
though byte[] is (essentially) a POJO so that could be supported as well.

On Mon, Feb 6, 2017 at 5:37 AM, Aviem Zur <av...@gmail.com> wrote:

> I believe these are actually several different use cases with different
> paths (Some do not exist today):
>
> User wants to output results of a pipeline to be used in a different
> pipeline:
> (This is indeed a use case in my organization. Also, Spark sees this as a
> use case as well with its `saveAsObjectFile`)
>
>    1. User's objects are Avro-serializable.
>       1. [Today] User figures out they can use AvroIO - not very user
>          friendly, user has to be a little more savvy.
>          1. [Suggestion #1] Improve documentation for this use case.
>          2. [Suggestion #2] User uses a more user-friendly FileObjectIO.
>             1. FileObjectIO, seeing the objects are Avro-serializable,
>                delegates to AvroIO under the hood.
>    2. User's objects are not necessarily Avro-serializable.
>       1. [Today] User writes a custom IO, from scratch, including lower
>          level IO operations and tests.
>       2. [Suggestion #3] User uses a more user-friendly FileObjectIO and
>          provides the proper coder.
>          1. User uses an existing coder, compatible with their object, for
>             example - SerializableCoder for serializable objects.
>          2. No existing coder is compatible with their object, user writes
>             a custom coder.
>          3. * Regarding writing several objects to one stream - this issue
>             is already acknowledged and addressed in `IterableLikeCoder`s.
>             Some coders do not require delimiters (like AvroCoder), others,
>             like ProtoCoder, already handle writing with delimiters. So it
>             seems that the implicit contract for coders already handles
>             this. I did not find the appropriate javadoc that declares this
>             contract.
>
>
> On Mon, Feb 6, 2017 at 12:13 AM Eugene Kirpichov
> <ki...@google.com.invalid> wrote:
>
> > Hmm, do you have a concrete use case in mind, where all these
> circumstances
> > come together?
> > - There is a need to write data from one Beam pipeline, and read it from
> > another Beam pipeline
> > - These pipelines have to be kept separate, rather than merged into one
> > pipeline
> > - It is okay that the data can not be parsed by anything except a Beam
> > pipeline using a compatible SDK version
> > - The objects to be serialized are not POJOs
> > - The user has chosen and was sufficiently savvy to develop a Coder for
> > these objects (as opposed to representing them as POJOs and using AvroIO,
> > which I think requires a rather less savvy user)
> >
> > It just seems like a really exotic set of circumstances to me. If the
> > problem is that people don't realize that the easiest way to serialize
> > their data is to make it a POJO and use AvroIO, we can solve that by
> > improving documentation about coders.
> >
> > On Sun, Feb 5, 2017 at 1:41 PM Aviem Zur <av...@gmail.com> wrote:
> >
> > > AvroIO is great for POJOs. But for use cases with more complex,
> > > serializable objects, or objects which are compatible with some coder,
> > > it falls short.
> > >
> > > Also, for less savvy users to know they need to use AvroIO might be a
> > > stretch.
> > > Some simpler API along the lines of ObjectFile might be more user
> > > friendly (even if for optimization it uses Avro under the hood for
> > > POJOs).
> > >
> > > On Sun, Feb 5, 2017, 22:00 Eugene Kirpichov
> <kirpichov@google.com.invalid
> > >
> > > wrote:
> > >
> > > > OK, I see what you mean; however I still think this can be solved
> > > > without introducing a new "Beam object file" (or whatever) file format,
> > > > and without thereby introducing additional use cases and compatibility
> > > > constraints on coders.
> > > >
> > > > I asked before in the thread why not just use AvroIO (it can serialize
> > > > arbitrary POJOs using reflection); I skimmed the thread and it doesn't
> > > > seem like that got answered properly. I also like Dan's suggestion to
> > > > use AvroIO to serialize byte[] arrays, and then you can do whatever you
> > > > want with them (e.g. use another serialization library, say, Kryo, or
> > > > Java serialization, etc.)
> > > >
> > > > On Sun, Feb 5, 2017 at 11:37 AM Aviem Zur <av...@gmail.com>
> wrote:
> > > >
> > > > > I agree that these files will serve no use outside of Beam
> pipelines.
> > > > >
> > > > > The rationale was that you might want to have one pipeline write
> > output
> > > > to
> > > > > files and then have a different pipeline that uses those files as
> > > inputs.
> > > > >
> > > > > Say one team in your organization creates a pipeline and a
> different
> > > team
> > > > > utilizes those files as input for a different pipeline. The
> contract
> > > > > between them is the file, in a Beam-readable format.
> > > > > This is similar to Spark's `saveAsObjectFile`:
> > > > > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > > >
> > > > > The merit for something like this in my eyes is to not burden the
> > user
> > > > with
> > > > > writing a custom IO
> > > > >
> > > > > On Tue, Jan 31, 2017 at 10:23 PM Eugene Kirpichov
> > > > > <ki...@google.com.invalid> wrote:
> > > > >
> > > > > +1 to Robert. Either this will be a Beam-specific file format (and
> > then
> > > > > nothing except Beam will be able to read it - which I doubt is what
> > you
> > > > > want), or it is an existing well-known file format and then we
> should
> > > > just
> > > > > develop an IO for it.
> > > > > Note that any file format that involves encoding elements with a
> > Coder
> > > is
> > > > > Beam-specific, because wire format of coders is Beam-specific.
> > > > >
> > > > > On Tue, Jan 31, 2017 at 12:20 PM Robert Bradshaw
> > > > > <ro...@google.com.invalid> wrote:
> > > > >
> > > > > > On Tue, Jan 31, 2017 at 12:04 PM, Aviem Zur <av...@gmail.com>
> > > > wrote:
> > > > > > > +1 on what Stas said.
> > > > > > > I think there is value in not having the user write a custom IO
> > > for a
> > > > > > > protocol they use which is not covered by Beam IOs. Plus having
> > > them
> > > > > deal
> > > > > > > with not only the encoding but also the IO part is not ideal.
> > > > > > > I think having a basic FileIO that can write to the Filesystems
> > > > > supported
> > > > > > > by Beam (GS/HDFS/Local/...) which you can use any coder with,
> > > > including
> > > > > > > your own custom coder, can be beneficial.
> > > > > >
> > > > > > What would the format of the file be? Just the concatenation of the
> > > > > > elements encoded according to the coder? Or is there a delimiter
> > > > > > needed to separate records, in which case how does one ensure the
> > > > > > delimiter does not also appear in the middle of an encoded element?
> > > > > > At this point you're developing a file format, and might as well
> > > > > > stick with one of the standard ones. https://xkcd.com/927
> > > > > >
> > > > > > > On Tue, Jan 31, 2017 at 7:56 PM Stas Levin <
> staslevin@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > I believe the motivation is to have an abstraction that allows
> > one
> > > to
> > > > > > write
> > > > > > > stuff to a file in a way that is agnostic to the coder.
> > > > > > > If one needs to write a non-Avro protocol to a file, and this
> > > > > particular
> > > > > > > protocol does not meet the assumption made by TextIO, one might
> > > need
> > > > to
> > > > > > > duplicate the file IO related code from AvroIO.
> > > > > > >
> > > > > > > On Tue, Jan 31, 2017 at 6:50 PM Eugene Kirpichov
> > > > > > > <ki...@google.com.invalid> wrote:
> > > > > > >
> > > > > > >> Could you clarify why it would be useful to write objects to
> > files
> > > > > using
> > > > > > >> Beam coders, as opposed to just using e.g. AvroIO?
> > > > > > >>
> > > > > > >> Coders (should) make no promise as to what their wire format
> is,
> > > so
> > > > > such
> > > > > > >> files could be read back only by other Beam pipelines using
> the
> > > same
> > > > > IO.
> > > > > > >>
> > > > > > >> On Tue, Jan 31, 2017 at 2:48 AM Aviem Zur <aviemzur@gmail.com
> >
> > > > wrote:
> > > > > > >>
> > > > > > >> > So If I understand the general agreement is that TextIO
> should
> > > not
> > > > > > >> support
> > > > > > >> > anything but lines from files as strings.
> > > > > > >> > I'll go ahead and file a ticket that says the Javadoc should
> > be
> > > > > > changed
> > > > > > >> to
> > > > > > >> > reflect this and `withCoder` method should be removed.
> > > > > > >> >
> > > > > > >> > Is there merit for Beam to supply an IO which does allow
> > writing
> > > > > > objects
> > > > > > >> to
> > > > > > >> > a file using Beam coders and Beam FS (To write these files
> to
> > > > > > >> > GS/Hadoop/Local)?
> > > > > > >> >
> > > > > > >> > On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov
> > > > > > >> > <ki...@google.com.invalid> wrote:
> > > > > > >> >
> > > > > > >> > P.S. Note that this point (about coders) is also mentioned
> in
> > > the
> > > > > > >> > now-being-reviewed PTransform Style Guide
> > > > > > >> > https://github.com/apache/beam-site/pull/134
> > > > > > >> > currently staged at
> > > > > > >> >
> > > > > > >> > http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders
> > > > > > >> >
> > > > > > >> > On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath <
> > > > > > chamikara@apache.org
> > > > > > >> >
> > > > > > >> > wrote:
> > > > > > >> >
> > > > > > >> > > +1 to what Eugene said.
> > > > > > >> > >
> > > > > > >> > > I've seen a number of Python SDK users incorrectly
> assuming
> > > that
> > > > > > >> > > coder.decode() is needed when developing their own
> > file-based
> > > > > > sources
> > > > > > >> > > (since many users usually refer to text source first).
> > > Probably
> > > > > > coder
> > > > > > >> > > parameter should not be configurable for text source/sink
> > and
> > > > they
> > > > > > >> should
> > > > > > >> > > be updated to only read/write UTF-8 encoded strings.
> > > > > > >> > >
> > > > > > >> > > - Cham
> > > > > > >> > >
> > > > > > >> > > On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
> > > > > > >> > > <ki...@google.com.invalid> wrote:
> > > > > > >> > >
> > > > > > >> > > > The use of Coder in TextIO is a long standing design
> issue
> > > > > because
> > > > > > >> > coders
> > > > > > >> > > > are not intended to be used for general purpose
> converting
> > > > > things
> > > > > > >> from
> > > > > > >> > > and
> > > > > > >> > > > to bytes, their only proper use is letting the runner
> > > > > materialize
> > > > > > > and
> > > > > > >> > > > restore objects if the runner thinks it's necessary. IMO
> > it
> > > > > should
> > > > > > >> have
> > > > > > >> > > > been called LineIO, document that it reads lines of text
> > as
> > > > > > String,
> > > > > > >> and
> > > > > > >> > > not
> > > > > > >> > > > have a withCoder parameter at all.
> > > > > > >> > > >
> > > > > > >> > > > The proper way to address your use case is to write a
> > custom
> > > > > > >> > > > FileBasedSource.

Re: TextIO binary file

Posted by Aviem Zur <av...@gmail.com>.
I believe these are actually several different use cases with different
paths (Some do not exist today):

User wants to output results of a pipeline to be used in a different
pipeline:
(This is indeed a use case in my organization. Also, Spark sees this as a
use case as well with its `saveAsObjectFile`)

   1. User's objects are Avro-serializable.
      1. [Today] User figures out they can use AvroIO - not very user
         friendly, user has to be a little more savvy.
         1. [Suggestion #1] Improve documentation for this use case.
         2. [Suggestion #2] User uses a more user-friendly FileObjectIO.
            1. FileObjectIO, seeing the objects are Avro-serializable,
               delegates to AvroIO under the hood.
   2. User's objects are not necessarily Avro-serializable.
      1. [Today] User writes a custom IO, from scratch, including lower level
         IO operations and tests.
      2. [Suggestion #3] User uses a more user-friendly FileObjectIO and
         provides the proper coder.
         1. User uses an existing coder, compatible with their object, for
            example - SerializableCoder for serializable objects.
         2. No existing coder is compatible with their object, user writes
            a custom coder.
         3. * Regarding writing several objects to one stream - this issue
            is already acknowledged and addressed in `IterableLikeCoder`s.
            Some coders do not require delimiters (like AvroCoder), others,
            like ProtoCoder, already handle writing with delimiters. So it
            seems that the implicit contract for coders already handles this.
            I did not find the appropriate javadoc that declares this
            contract.
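
To illustrate the last point about writing several objects to one stream, here is a small sketch of the idea that an encoder can behave differently when its value is followed by more data. It is not Beam's Coder API; the class and method names are hypothetical, and the varint length prefix is only one possible way to make a value self-delimiting.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not Beam's Coder API.
final class NestedStringEncoding {

    /** "Outer" encoding: the value owns the whole stream, so raw UTF-8 suffices. */
    static byte[] encodeOuter(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    /** "Nested" encoding: a varint length prefix lets another value follow. */
    static void encodeNested(String value, ByteArrayOutputStream out) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        writeVarInt(utf8.length, out);
        out.write(utf8, 0, utf8.length);
    }

    /** Decodes a stream that consists only of nested-encoded strings. */
    static List<String> decodeAllNested(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        List<String> values = new ArrayList<>();
        while (buf.hasRemaining()) {
            byte[] utf8 = new byte[readVarInt(buf)];
            buf.get(utf8);
            values.add(new String(utf8, StandardCharsets.UTF_8));
        }
        return values;
    }

    /** Writes a non-negative int, 7 bits per byte, low-order group first. */
    private static void writeVarInt(int value, ByteArrayOutputStream out) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    private static int readVarInt(ByteBuffer buf) {
        int result = 0;
        int shift = 0;
        while (true) {
            byte b = buf.get();
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        encodeNested("first\nline", out);
        encodeNested("second", out);
        System.out.println(decodeAllNested(out.toByteArray()).size());
    }
}
```

Concatenating the outer encodings of two strings would lose the boundary between them, while the nested form keeps it. Whether every coder provides such a guarantee is exactly the contract I could not find documented.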


On Mon, Feb 6, 2017 at 12:13 AM Eugene Kirpichov
<ki...@google.com.invalid> wrote:

> Hmm, do you have a concrete use case in mind, where all these circumstances
> come together?
> - There is a need to write data from one Beam pipeline, and read it from
> another Beam pipeline
> - These pipelines have to be kept separate, rather than merged into one
> pipeline
> - It is okay that the data can not be parsed by anything except a Beam
> pipeline using a compatible SDK version
> - The objects to be serialized are not POJOs
> - The user has chosen and was sufficiently savvy to develop a Coder for
> these objects (as opposed to representing them as POJOs and using AvroIO,
> which I think requires a rather less savvy user)
>
> It just seems like a really exotic set of circumstances to me. If the
> problem is that people don't realize that the easiest way to serialize
> their data is to make it a POJO and use AvroIO, we can solve that by
> improving documentation about coders.
>
> On Sun, Feb 5, 2017 at 1:41 PM Aviem Zur <av...@gmail.com> wrote:
>
> > AvroIO is great for POJOs. But for use cases with more complex,
> > serializable objects, or objects which are compatible with some coder,
> > it falls short.
> >
> > Also, for less savvy users to know they need to use AvroIO might be a
> > stretch.
> > Some simpler API along the lines of ObjectFile might be more user
> > friendly (even if for optimization it uses Avro under the hood for
> > POJOs).
> >
> > On Sun, Feb 5, 2017, 22:00 Eugene Kirpichov <kirpichov@google.com.invalid
> >
> > wrote:
> >
> > > OK, I see what you mean; however I still think this can be solved
> > > without introducing a new "Beam object file" (or whatever) file format,
> > > and without thereby introducing additional use cases and compatibility
> > > constraints on coders.
> > >
> > > I asked before in the thread why not just use AvroIO (it can serialize
> > > arbitrary POJOs using reflection); I skimmed the thread and it doesn't
> > > seem like that got answered properly. I also like Dan's suggestion to
> > > use AvroIO to serialize byte[] arrays, and then you can do whatever you
> > > want with them (e.g. use another serialization library, say, Kryo, or
> > > Java serialization, etc.)
> > >
> > > On Sun, Feb 5, 2017 at 11:37 AM Aviem Zur <av...@gmail.com> wrote:
> > >
> > > > I agree that these files will serve no use outside of Beam pipelines.
> > > >
> > > > The rationale was that you might want to have one pipeline write
> output
> > > to
> > > > files and then have a different pipeline that uses those files as
> > inputs.
> > > >
> > > > Say one team in your organization creates a pipeline and a different
> > team
> > > > utilizes those files as input for a different pipeline. The contract
> > > > between them is the file, in a Beam-readable format.
> > > > This is similar to Spark's `saveAsObjectFile`:
> > > > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > >
> > > > The merit for something like this in my eyes is to not burden the
> user
> > > with
> > > > writing a custom IO
> > > >
> > > > On Tue, Jan 31, 2017 at 10:23 PM Eugene Kirpichov
> > > > <ki...@google.com.invalid> wrote:
> > > >
> > > > +1 to Robert. Either this will be a Beam-specific file format (and
> then
> > > > nothing except Beam will be able to read it - which I doubt is what
> you
> > > > want), or it is an existing well-known file format and then we should
> > > just
> > > > develop an IO for it.
> > > > Note that any file format that involves encoding elements with a
> Coder
> > is
> > > > Beam-specific, because wire format of coders is Beam-specific.
> > > >
> > > > On Tue, Jan 31, 2017 at 12:20 PM Robert Bradshaw
> > > > <ro...@google.com.invalid> wrote:
> > > >
> > > > > On Tue, Jan 31, 2017 at 12:04 PM, Aviem Zur <av...@gmail.com>
> > > wrote:
> > > > > > +1 on what Stas said.
> > > > > > I think there is value in not having the user write a custom IO
> > for a
> > > > > > protocol they use which is not covered by Beam IOs. Plus having
> > them
> > > > deal
> > > > > > with not only the encoding but also the IO part is not ideal.
> > > > > > I think having a basic FileIO that can write to the Filesystems
> > > > supported
> > > > > > by Beam (GS/HDFS/Local/...) which you can use any coder with,
> > > including
> > > > > > your own custom coder, can be beneficial.
> > > > >
> > > > > What would the format of the file be? Just the concatenation of the
> > > > > elements encoded according to the coder? Or is there a delimiter
> > > > > needed to separate records, in which case how does one ensure the
> > > > > delimiter does not also appear in the middle of an encoded element?
> > > > > At this point you're developing a file format, and might as well
> > > > > stick with one of the standard ones. https://xkcd.com/927
> > > > >
> > > > > > On Tue, Jan 31, 2017 at 7:56 PM Stas Levin <st...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > I believe the motivation is to have an abstraction that allows
> one
> > to
> > > > > write
> > > > > > stuff to a file in a way that is agnostic to the coder.
> > > > > > If one needs to write a non-Avro protocol to a file, and this
> > > > particular
> > > > > > protocol does not meet the assumption made by TextIO, one might
> > need
> > > to
> > > > > > duplicate the file IO related code from AvroIO.
> > > > > >
> > > > > > On Tue, Jan 31, 2017 at 6:50 PM Eugene Kirpichov
> > > > > > <ki...@google.com.invalid> wrote:
> > > > > >
> > > > > > >> Could you clarify why it would be useful to write objects to files using
> > > > > > >> Beam coders, as opposed to just using e.g. AvroIO?
> > > > > > >>
> > > > > > >> Coders (should) make no promise as to what their wire format is, so such
> > > > > > >> files could be read back only by other Beam pipelines using the same IO.
> > > > > >>
> > > > > > >> On Tue, Jan 31, 2017 at 2:48 AM Aviem Zur <av...@gmail.com> wrote:
> > > > > >>
> > > > > > >> > So If I understand the general agreement is that TextIO should not support
> > > > > > >> > anything but lines from files as strings.
> > > > > > >> > I'll go ahead and file a ticket that says the Javadoc should be changed to
> > > > > > >> > reflect this and `withCoder` method should be removed.
> > > > > > >> >
> > > > > > >> > Is there merit for Beam to supply an IO which does allow writing objects to
> > > > > > >> > a file using Beam coders and Beam FS (To write these files to
> > > > > > >> > GS/Hadoop/Local)?
> > > > > >> >
> > > > > >> > On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov
> > > > > >> > <ki...@google.com.invalid> wrote:
> > > > > >> >
> > > > > > >> > P.S. Note that this point (about coders) is also mentioned in the
> > > > > > >> > now-being-reviewed PTransform Style Guide
> > > > > > >> > https://github.com/apache/beam-site/pull/134
> > > > > > >> > currently staged at
> > > > > > >> > http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders
> > > > > >> >
> > > > > >> >
> > > > > > >> > On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath <chamikara@apache.org> wrote:
> > > > > >> >
> > > > > >> > > +1 to what Eugene said.
> > > > > >> > >
> > > > > > >> > > I've seen a number of Python SDK users incorrectly assuming that
> > > > > > >> > > coder.decode() is needed when developing their own file-based sources
> > > > > > >> > > (since many users usually refer to text source first). Probably coder
> > > > > > >> > > parameter should not be configurable for text source/sink and they should
> > > > > > >> > > be updated to only read/write UTF-8 encoded strings.
> > > > > >> > >
> > > > > >> > > - Cham
> > > > > >> > >
> > > > > >> > > On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
> > > > > >> > > <ki...@google.com.invalid> wrote:
> > > > > >> > >
> > > > > > >> > > > The use of Coder in TextIO is a long standing design issue because coders
> > > > > > >> > > > are not intended to be used for general purpose converting things from and
> > > > > > >> > > > to bytes, their only proper use is letting the runner materialize and
> > > > > > >> > > > restore objects if the runner thinks it's necessary. IMO it should have
> > > > > > >> > > > been called LineIO, document that it reads lines of text as String, and not
> > > > > > >> > > > have a withCoder parameter at all.
> > > > > > >> > > >
> > > > > > >> > > > The proper way to address your use case is to write a custom
> > > > > > >> > > > FileBasedSource.
> > > > > > >> > > > On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur <aviemzur@gmail.com> wrote:
> > > > > >> > > >
> > > > > >> > > > > The Javadoc of TextIO states:
> > > > > >> > > > >
> > > > > > >> > > > > * <p>By default, {@link TextIO.Read} returns a {@link PCollection} of
> > > > > > >> > > > > {@link String Strings},
> > > > > > >> > > > >  * each corresponding to one line of an input UTF-8 text file. To convert
> > > > > > >> > > > > directly from the raw
> > > > > > >> > > > >  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to another
> > > > > > >> > > > > object of type {@code T},
> > > > > > >> > > > >  * supply a {@code Coder<T>} using {@link TextIO.Read#withCoder(Coder)}.
> > > > > > >> > > > >
> > > > > > >> > > > > However, as I stated, `withCoder` doesn't seem to have tests, and probably
> > > > > > >> > > > > won't work given the hard-coded '\n' delimiter.
> > > > > >> > > > >
> > > > > > >> > > > > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <jb@nanthrax.net> wrote:
> > > > > >> > > > >
> > > > > >> > > > > > Hi Aviem,
> > > > > >> > > > > >
> > > > > > >> > > > > > TextIO is not designed to write/read binary file: it's pure Text, so
> > > > > > >> > > > > > String.
> > > > > >> > > > > >
> > > > > >> > > > > > Regards
> > > > > >> > > > > > JB
> > > > > >> > > > > >
> > > > > >> > > > > > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > > > >> > > > > > > Hi,
> > > > > >> > > > > > >
> > > > > > >> > > > > > > While trying to use TextIO to write/read a binary file rather than String
> > > > > > >> > > > > > > lines from a textual file I ran into an issue - the delimiter TextIO uses
> > > > > > >> > > > > > > seems to be hardcoded '\n'.
> > > > > > >> > > > > > > See `findSeparatorBounds` -
> > > > > > >> > > > > > > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > > > > >> > > > > > >
> > > > > > >> > > > > > > The use case is to have a file of objects, encoded into bytes using a
> > > > > > >> > > > > > > coder. However, '\n' is not a good delimiter here, as you can imagine.
> > > > > > >> > > > > > > A similar pattern is found in Spark's `saveAsObjectFile`
> > > > > > >> > > > > > > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > > > > >> > > > > > > where
> > > > > > >> > > > > > > they use a more appropriate delimiter, to avoid such issues.
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > I did not find any unit tests which use TextIO to read anything other than
> > > > > > >> > > > > > > Strings.
> > > > > >> > > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > > --
> > > > > >> > > > > > Jean-Baptiste Onofré
> > > > > >> > > > > > jbonofre@apache.org
> > > > > >> > > > > > http://blog.nanthrax.net
> > > > > >> > > > > > Talend - http://www.talend.com
> > > > > >> > > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
>
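
To make the delimiter question quoted above concrete, here is a minimal
Python sketch. It assumes a hypothetical fixed-width integer encoding as a
stand-in for a Beam coder; the names `encode`, `write_delimited`,
`read_delimited`, `write_length_prefixed` and `read_length_prefixed` are
illustrative only and are not Beam APIs. It suggests that splitting on '\n'
corrupts records whose encoded bytes contain 0x0A, while length-prefixed
framing round-trips them.

```python
import struct


def encode(n: int) -> bytes:
    # Hypothetical stand-in for a coder: 4-byte big-endian signed integer.
    return struct.pack(">i", n)


def write_delimited(records):
    # Newline-terminated records, the layout a line-oriented reader expects.
    return b"".join(r + b"\n" for r in records)


def read_delimited(blob):
    # Split on every 0x0A byte, including ones inside an encoded record.
    return blob.split(b"\n")[:-1]


def write_length_prefixed(records):
    # Each record is preceded by its length as a 4-byte unsigned integer.
    return b"".join(struct.pack(">I", len(r)) + r for r in records)


def read_length_prefixed(blob):
    out, pos = [], 0
    while pos < len(blob):
        (size,) = struct.unpack_from(">I", blob, pos)
        pos += 4
        out.append(blob[pos:pos + size])
        pos += size
    return out


# 10 and 266 both encode to bytes that contain 0x0A ('\n').
records = [encode(v) for v in (1, 10, 266)]

delimited_ok = read_delimited(write_delimited(records)) == records
prefixed_ok = read_length_prefixed(write_length_prefixed(records)) == records
print("delimiter round-trip intact:", delimited_ok)
print("length-prefix round-trip intact:", prefixed_ok)
```

Length prefixes fix record framing, but they do not by themselves let a
reader start in the middle of a file. That is one reason the thread points
toward an existing container format such as Avro, whose files carry sync
markers between blocks.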

Re: TextIO binary file

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
Hmm, do you have a concrete use case in mind, where all these circumstances
come together?
- There is a need to write data from one Beam pipeline, and read it from
another Beam pipeline
- These pipelines have to be kept separate, rather than merged into one
pipeline
- It is okay that the data can not be parsed by anything except a Beam
pipeline using a compatible SDK version
- The objects to be serialized are not POJOs
- The user has chosen and was sufficiently savvy to develop a Coder for
these objects (as opposed to representing them as POJOs and using AvroIO,
which I think requires a rather less savvy user)

It just seems like a really exotic set of circumstances to me. If the
problem is that people don't realize that the easiest way to serialize
their data is to make it a POJO and use AvroIO, we can solve that by improving
documentation about coders.

On Sun, Feb 5, 2017 at 1:41 PM Aviem Zur <av...@gmail.com> wrote:

> AvroIO is great for POJOs. But for use cases with more complex,
> serializable objects, or objects which are compatible with some coder it
> falls short.
>
> Also, for less savvy users to know they need to use AvroIO might be a
> stretch.
> Some simpler API along the lines of ObjectFile might be more user
> friendly (even if for optimization it uses Avro under the hood for POJOs).

Re: TextIO binary file

Posted by Aviem Zur <av...@gmail.com>.
AvroIO is great for POJOs. But for use cases with more complex,
serializable objects, or objects which are compatible with some coder it
falls short.

Also, for less savvy users to know they need to use AvroIO might be a
stretch.
Some simpler API along the lines of ObjectFile might be more user
friendly (even if for optimization it uses Avro under the hood for POJOs).

On Sun, Feb 5, 2017, 22:00 Eugene Kirpichov <ki...@google.com.invalid>
wrote:

> OK, I see what you mean; however I still think this can be solved without
> introducing a new "Beam object file" (or whatever) file format, and without
> thereby introducing additional use cases and compatibility constraints on
> coders.
>
> I asked before in the thread why not just use AvroIO (it can serialize
> arbitrary POJOs using reflection); I skimmed the thread and it doesn't seem
> like that got answered properly. I also like Dan's suggestion to use AvroIO
> to serialize byte[] arrays and you can do whatever you want with them (e.g.
> use another serialization library, say, Kryo, or Java serialization, etc.)

Re: TextIO binary file

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
OK, I see what you mean; however I still think this can be solved without
introducing a new "Beam object file" (or whatever) file format, and without
thereby introducing additional use cases and compatibility constraints on
coders.

I asked before in the thread why not just use AvroIO (it can serialize
arbitrary POJOs using reflection); I skimmed the thread and it doesn't seem
like that got answered properly. I also like Dan's suggestion to use AvroIO
to serialize byte[] arrays and you can do whatever you want with them (e.g.
use another serialization library, say, Kryo, or Java serialization, etc.)
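
A minimal Python sketch of the layering suggested above, under stated
assumptions: `json` stands in for whichever serialization library is chosen
(Kryo, Java serialization, etc.), and a plain list of `bytes` stands in for an
Avro file whose records are opaque byte arrays. The helper names `to_record`
and `from_record` are hypothetical, and nothing here is a Beam or Avro API.

```python
import json


def to_record(obj: dict) -> bytes:
    # Hypothetical serializer: any library that turns an object into bytes fits here.
    return json.dumps(obj, sort_keys=True).encode("utf-8")


def from_record(record: bytes) -> dict:
    # Inverse of to_record; the container never looks inside the bytes.
    return json.loads(record.decode("utf-8"))


events = [{"user": "a", "score": 1}, {"user": "b\nc", "score": 10}]

# Stand-in for a container file of byte[] records: the container owns the
# record boundaries, the serializer owns the contents of each record.
container = [to_record(e) for e in events]
restored = [from_record(r) for r in container]
print(restored == events)
```

The point of the split is that record framing becomes the container format's
job, so the choice of serializer does not need to avoid any particular byte
value.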

On Sun, Feb 5, 2017 at 11:37 AM Aviem Zur <av...@gmail.com> wrote:

> I agree that these files will serve no use outside of Beam pipelines.
>
> The rationale was that you might want to have one pipeline write output to
> files and then have a different pipeline that uses those files as inputs.
>
> Say one team in your organization creates a pipeline and a different team
> utilizes those files as input for a different pipeline. The contract
> between them is the file, in a Beam-readable format.
> This is similar to Spark's `saveAsObjectFile`
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
>
> The merit for something like this in my eyes is to not burden the user with
> writing a custom IO
>
> On Tue, Jan 31, 2017 at 10:23 PM Eugene Kirpichov
> <ki...@google.com.invalid> wrote:
>
> +1 to Robert. Either this will be a Beam-specific file format (and then
> nothing except Beam will be able to read it - which I doubt is what you
> want), or it is an existing well-known file format and then we should just
> develop an IO for it.
> Note that any file format that involves encoding elements with a Coder is
> Beam-specific, because wire format of coders is Beam-specific.
>
> On Tue, Jan 31, 2017 at 12:20 PM Robert Bradshaw
> <ro...@google.com.invalid> wrote:
>
> > On Tue, Jan 31, 2017 at 12:04 PM, Aviem Zur <av...@gmail.com> wrote:
> > > +1 on what Stas said.
> > > I think there is value in not having the user write a custom IO for a
> > > protocol they use which is not covered by Beam IOs. Plus having them deal
> > > with not only the encoding but also the IO part is not ideal.
> > > I think having a basic FileIO that can write to the Filesystems supported
> > > by Beam (GS/HDFS/Local/...) which you can use any coder with, including
> > > your own custom coder, can be beneficial.
> >
> > What would the format of the file be? Just the concatenation of the
> > elements encoded according to the coder? Or is there a delimiter
> > needed to separate records. In which case how does one ensure the
> > delimiter does not also appear in the middle of an encoded element? At
> > this point you're developing a file format, and might as well stick
> > with one of the standard ones. https://xkcd.com/927
> >
> > > On Tue, Jan 31, 2017 at 7:56 PM Stas Levin <st...@gmail.com>
> wrote:
> > >
> > > I believe the motivation is to have an abstraction that allows one to
> > write
> > > stuff to a file in a way that is agnostic to the coder.
> > > If one needs to write a non-Avro protocol to a file, and this
> particular
> > > protocol does not meet the assumption made by TextIO, one might need to
> > > duplicate the file IO related code from AvroIO.
> > >
> > > On Tue, Jan 31, 2017 at 6:50 PM Eugene Kirpichov
> > > <ki...@google.com.invalid> wrote:
> > >
> > >> Could you clarify why it would be useful to write objects to files
> using
> > >> Beam coders, as opposed to just using e.g. AvroIO?
> > >>
> > >> Coders (should) make no promise as to what their wire format is, so
> such
> > >> files could be read back only by other Beam pipelines using the same
> IO.
> > >>
> > >> On Tue, Jan 31, 2017 at 2:48 AM Aviem Zur <av...@gmail.com> wrote:
> > >>
> > >> > So If I understand the general agreement is that TextIO should not
> > >> support
> > >> > anything but lines from files as strings.
> > >> > I'll go ahead and file a ticket that says the Javadoc should be
> > changed
> > >> to
> > >> > reflect this and `withCoder` method should be removed.
> > >> >
> > >> > Is there merit for Beam to supply an IO which does allow writing
> > objects
> > >> to
> > >> > a file using Beam coders and Beam FS (To write these files to
> > >> > GS/Hadoop/Local)?
> > >> >
> > >> > On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov
> > >> > <ki...@google.com.invalid> wrote:
> > >> >
> > >> > P.S. Note that this point (about coders) is also mentioned in the
> > >> > now-being-reviewed PTransform Style Guide
> > >> > https://github.com/apache/beam-site/pull/134
> > >> > currently staged at
> > >> >
> > >> >
> > >>
> > >
> >
>
> http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders
> > >> >
> > >> >
> > >> > On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath <
> > chamikara@apache.org
> > >> >
> > >> > wrote:
> > >> >
> > >> > > +1 to what Eugene said.
> > >> > >
> > >> > > I've seen a number of Python SDK users incorrectly assuming that
> > >> > > coder.decode() is needed when developing their own file-based
> > sources
> > >> > > (since many users usually refer to text source first). Probably
> > coder
> > >> > > parameter should not be configurable for text source/sink and they
> > >> should
> > >> > > be updated to only read/write UTF-8 encoded strings.
> > >> > >
> > >> > > - Cham
> > >> > >
> > >> > > On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
> > >> > > <ki...@google.com.invalid> wrote:
> > >> > >
> > >> > > > The use of Coder in TextIO is a long standing design issue
> because
> > >> > coders
> > >> > > > are not intended to be used for general purpose converting
> things
> > >> from
> > >> > > and
> > >> > > > to bytes, their only proper use is letting the runner
> materialize
> > > and
> > >> > > > restore objects if the runner thinks it's necessary. IMO it
> should
> > >> have
> > >> > > > been called LineIO, document that it reads lines of text as
> > String,
> > >> and
> > >> > > not
> > >> > > > have a withCoder parameter at all.
> > >> > > >
> > >> > > > The proper way to address your use case is to write a custom
> > >> > > > FileBasedSource.
> > >> > > > On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur <av...@gmail.com>
> > >> wrote:
> > >> > > >
> > >> > > > > The Javadoc of TextIO states:
> > >> > > > >
> > >> > > > > * <p>By default, {@link TextIO.Read} returns a {@link
> > PCollection}
> > >> of
> > >> > > > > {@link String Strings},
> > >> > > > >  * each corresponding to one line of an input UTF-8 text file.
> > To
> > >> > > convert
> > >> > > > > directly from the raw
> > >> > > > >  * bytes (split into lines delimited by '\n', '\r', or '\r\n')
> > to
> > >> > > another
> > >> > > > > object of type {@code T},
> > >> > > > >  * supply a {@code Coder<T>} using {@link
> > >> > > TextIO.Read#withCoder(Coder)}.
> > >> > > > >
> > >> > > > > However, as I stated, `withCoder` doesn't seem to have tests,
> > and
> > >> > > > probably
> > >> > > > > won't work given the hard-coded '\n' delimiter.
> > >> > > > >
> > >> > > > > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <
> > >> > jb@nanthrax.net
> > >> > > >
> > >> > > > > wrote:
> > >> > > > >
> > >> > > > > > Hi Aviem,
> > >> > > > > >
> > >> > > > > > TextIO is not designed to write/read binary file: it's pure
> > > Text,
> > >> > so
> > >> > > > > > String.
> > >> > > > > >
> > >> > > > > > Regards
> > >> > > > > > JB
> > >> > > > > >
> > >> > > > > > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > >> > > > > > > Hi,
> > >> > > > > > >
> > >> > > > > > > While trying to use TextIO to write/read a binary file
> > rather
> > >> > than
> > >> > > > > String
> > >> > > > > > > lines from a textual file I ran into an issue - the
> > delimiter
> > >> > > TextIO
> > >> > > > > uses
> > >> > > > > > > seems to be hardcoded '\n'.
> > >> > > > > > > See `findSeparatorBounds` -
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >> >
> > >>
> > >
> >
>
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > >> > > > > > >
> > >> > > > > > > The use case is to have a file of objects, encoded into
> > bytes
> > >> > > using a
> > >> > > > > > > coder. However, '\n' is not a good delimiter here, as you
> > can
> > >> > > > imagine.
> > >> > > > > > > A similar pattern is found in Spark's `saveAsObjectFile`
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >> >
> > >>
> > >
> >
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > >> > > > > > > where
> > >> > > > > > > they use a more appropriate delimiter, to avoid such
> issues.
> > >> > > > > > >
> > >> > > > > > > I did not find any unit tests which use TextIO to read
> > > anything
> > >> > > other
> > >> > > > > > than
> > >> > > > > > > Strings.
> > >> > > > > > >
> > >> > > > > >
> > >> > > > > > --
> > >> > > > > > Jean-Baptiste Onofré
> > >> > > > > > jbonofre@apache.org
> > >> > > > > > http://blog.nanthrax.net
> > >> > > > > > Talend - http://www.talend.com
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
>

Re: TextIO binary file

Posted by Aviem Zur <av...@gmail.com>.
I agree that these files will serve no use outside of Beam pipelines.

The rationale was that you might want to have one pipeline write output to
files and then have a different pipeline that uses those files as inputs.

Say one team in your organization creates a pipeline and a different team
utilizes those files as input for a different pipeline. The contract
between them is the file, in a Beam-readable format.
This is similar to Spark's `saveAsObjectFile`:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512

The merit of something like this, in my eyes, is that it does not burden
the user with writing a custom IO.

On Tue, Jan 31, 2017 at 10:23 PM Eugene Kirpichov
<ki...@google.com.invalid> wrote:

+1 to Robert. Either this will be a Beam-specific file format (and then
nothing except Beam will be able to read it - which I doubt is what you
want), or it is an existing well-known file format and then we should just
develop an IO for it.
Note that any file format that involves encoding elements with a Coder is
Beam-specific, because wire format of coders is Beam-specific.

Re: TextIO binary file

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
+1 to Robert. Either this will be a Beam-specific file format (and then
nothing except Beam will be able to read it - which I doubt is what you
want), or it is an existing well-known file format and then we should just
develop an IO for it.
Note that any file format that involves encoding elements with a Coder is
Beam-specific, because the wire format of coders is Beam-specific.
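
To illustrate why the choice of encoder becomes part of the file format, here
is a minimal sketch in plain Java. The class `WireFormatDemo` and its methods
are hypothetical and are not Beam's coder implementations; they only show that
the same value can have two unrelated byte representations, so a reader has to
know exactly which encoder the writer used.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical encoders for illustration; these are not Beam's coders.
final class WireFormatDemo {

  /** Encodes an int as 4 big-endian bytes. */
  static byte[] encodeFixed(int value) {
    return ByteBuffer.allocate(Integer.BYTES).putInt(value).array();
  }

  /** Encodes an int as its decimal digits in UTF-8. */
  static byte[] encodeTextual(int value) {
    return Integer.toString(value).getBytes(StandardCharsets.UTF_8);
  }

  /** Decodes bytes produced by encodeFixed. */
  static int decodeFixed(byte[] bytes) {
    return ByteBuffer.wrap(bytes).getInt();
  }

  /** Decodes bytes produced by encodeTextual. */
  static int decodeTextual(byte[] bytes) {
    return Integer.parseInt(new String(bytes, StandardCharsets.UTF_8));
  }

  public static void main(String[] args) {
    // Same value, two different byte sequences on disk.
    System.out.println(Arrays.toString(encodeFixed(10)));
    System.out.println(Arrays.toString(encodeTextual(10)));
  }
}
```

A file written with one of these encoders is only meaningful to a reader that
applies the matching decoder, which is the sense in which such a file would be
tied to the producing framework.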

On Tue, Jan 31, 2017 at 12:20 PM Robert Bradshaw
<ro...@google.com.invalid> wrote:

> On Tue, Jan 31, 2017 at 12:04 PM, Aviem Zur <av...@gmail.com> wrote:
> > +1 on what Stas said.
> > I think there is value in not having the user write a custom IO for a
> > protocol they use which is not covered by Beam IOs. Plus having them deal
> > with not only the encoding but also the IO part is not ideal.
> > I think having a basic FileIO that can write to the Filesystems supported
> > by Beam (GS/HDFS/Local/...) which you can use any coder with, including
> > your own custom coder, can be beneficial.
>
> What would the format of the file be? Just the concatenation of the
> elements encoded according to the coder? Or is there a delimiter
> needed to separate records. In which case how does one ensure the
> delimiter does not also appear in the middle of an encoded element? At
> this point you're developing a file format, and might as well stick
> with one of the standard ones. https://xkcd.com/927

Re: TextIO binary file

Posted by Robert Bradshaw <ro...@google.com.INVALID>.
On Tue, Jan 31, 2017 at 12:04 PM, Aviem Zur <av...@gmail.com> wrote:
> +1 on what Stas said.
> I think there is value in not having the user write a custom IO for a
> protocol they use which is not covered by Beam IOs. Plus having them deal
> with not only the encoding but also the IO part is not ideal.
> I think having a basic FileIO that can write to the Filesystems supported
> by Beam (GS/HDFS/Local/...) which you can use any coder with, including
> your own custom coder, can be beneficial.

What would the format of the file be? Just the concatenation of the
elements encoded according to the coder? Or is a delimiter
needed to separate records? If so, how does one ensure the
this point you're developing a file format, and might as well stick
with one of the standard ones. https://xkcd.com/927
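
The delimiter question above can be made concrete with a small sketch in plain
Java. Everything here is an assumption for illustration: the class
`LengthPrefixedRecords` is hypothetical and not part of Beam, and records are
treated as opaque byte arrays. It shows one common alternative to a delimiter,
a length prefix, and uses payloads that contain '\n' to show why a
newline-separated file would mis-split them.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; it is not part of Beam.
final class LengthPrefixedRecords {

  /** Frames each record as a 4-byte big-endian length followed by the payload. */
  static byte[] write(List<byte[]> records) {
    int total = 0;
    for (byte[] record : records) {
      total += Integer.BYTES + record.length;
    }
    ByteBuffer out = ByteBuffer.allocate(total);
    for (byte[] record : records) {
      out.putInt(record.length); // tells the reader where this record ends
      out.put(record);           // payload may contain any byte, including '\n'
    }
    return out.array();
  }

  /** Reads the framed records back in order. */
  static List<byte[]> read(byte[] file) {
    ByteBuffer in = ByteBuffer.wrap(file);
    List<byte[]> records = new ArrayList<>();
    while (in.hasRemaining()) {
      byte[] record = new byte[in.getInt()];
      in.get(record);
      records.add(record);
    }
    return records;
  }

  public static void main(String[] args) {
    // The int 10 as 4 big-endian bytes ends in 0x0A, which is '\n'.
    byte[] ten = ByteBuffer.allocate(Integer.BYTES).putInt(10).array();
    byte[] text = {'a', '\n', 'b'};
    List<byte[]> roundTrip = read(write(Arrays.asList(ten, text)));
    System.out.println(roundTrip.size() + " records read back");
  }
}
```

Note that this only addresses sequential reading. A reader that starts in the
middle of such a file cannot tell where a record begins, so splitting is still
unsolved; container formats such as Avro handle that with sync markers, which
is one more reason to prefer a standard format over inventing one.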

Re: TextIO binary file

Posted by Aviem Zur <av...@gmail.com>.
+1 on what Stas said.
I think there is value in not having the user write a custom IO for a
protocol they use which is not covered by Beam IOs. Plus having them deal
with not only the encoding but also the IO part is not ideal.
I think having a basic FileIO that can write to the Filesystems supported
by Beam (GS/HDFS/Local/...) which you can use any coder with, including
your own custom coder, can be beneficial.

On Tue, Jan 31, 2017 at 7:56 PM Stas Levin <st...@gmail.com> wrote:

I believe the motivation is to have an abstraction that allows one to write
stuff to a file in a way that is agnostic to the coder.
If one needs to write a non-Avro protocol to a file, and this particular
protocol does not meet the assumption made by TextIO, one might need to
duplicate the file IO related code from AvroIO.

On Tue, Jan 31, 2017 at 6:50 PM Eugene Kirpichov
<ki...@google.com.invalid> wrote:

> Could you clarify why it would be useful to write objects to files using
> Beam coders, as opposed to just using e.g. AvroIO?
>
> Coders (should) make no promise as to what their wire format is, so such
> files could be read back only by other Beam pipelines using the same IO.
>
> On Tue, Jan 31, 2017 at 2:48 AM Aviem Zur <av...@gmail.com> wrote:
>
> > So if I understand correctly, the general agreement is that TextIO should
> > not support anything but lines from files as strings.
> > I'll go ahead and file a ticket that says the Javadoc should be changed
> > to reflect this and the `withCoder` method should be removed.
> >
> > Is there merit for Beam to supply an IO which does allow writing objects
> > to a file using Beam coders and Beam FS (to write these files to
> > GS/Hadoop/Local)?
> >
> > On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov
> > <ki...@google.com.invalid> wrote:
> >
> > P.S. Note that this point (about coders) is also mentioned in the
> > now-being-reviewed PTransform Style Guide
> > https://github.com/apache/beam-site/pull/134
> > currently staged at
> > http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders
> >
> > On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath <chamikara@apache.org>
> > wrote:
> >
> > > +1 to what Eugene said.
> > >
> > > I've seen a number of Python SDK users incorrectly assuming that
> > > coder.decode() is needed when developing their own file-based sources
> > > (since many users usually refer to text source first). Probably the coder
> > > parameter should not be configurable for the text source/sink, and they
> > > should be updated to only read/write UTF-8 encoded strings.
> > >
> > > - Cham
> > >
> > > On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
> > > <ki...@google.com.invalid> wrote:
> > >
> > > > The use of Coder in TextIO is a long-standing design issue, because
> > > > coders are not intended for general-purpose conversion of things from
> > > > and to bytes; their only proper use is letting the runner materialize
> > > > and restore objects if the runner thinks it's necessary. IMO it should
> > > > have been called LineIO, document that it reads lines of text as String,
> > > > and not have a withCoder parameter at all.
> > > >
> > > > The proper way to address your use case is to write a custom
> > > > FileBasedSource.
> > > >
> > > > On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur <av...@gmail.com> wrote:
> > > >
> > > > > The Javadoc of TextIO states:
> > > > >
> > > > > * <p>By default, {@link TextIO.Read} returns a {@link PCollection}
> of
> > > > > {@link String Strings},
> > > > >  * each corresponding to one line of an input UTF-8 text file. To
> > > convert
> > > > > directly from the raw
> > > > >  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to
> > > another
> > > > > object of type {@code T},
> > > > >  * supply a {@code Coder<T>} using {@link
> > > TextIO.Read#withCoder(Coder)}.
> > > > >
> > > > > However, as I stated, `withCoder` doesn't seem to have tests, and
> > > > probably
> > > > > won't work given the hard-coded '\n' delimiter.
> > > > >
> > > > > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <
> > jb@nanthrax.net
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Aviem,
> > > > > >
> > > > > > TextIO is not designed to write/read binary file: it's pure
Text,
> > so
> > > > > > String.
> > > > > >
> > > > > > Regards
> > > > > > JB
> > > > > >
> > > > > > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > While trying to use TextIO to write/read a binary file rather
> > than
> > > > > String
> > > > > > > lines from a textual file I ran into an issue - the delimiter
> > > TextIO
> > > > > uses
> > > > > > > seems to be hardcoded '\n'.
> > > > > > > See `findSeparatorBounds` -
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
>
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > > > > > >
> > > > > > > The use case is to have a file of objects, encoded into bytes
> > > using a
> > > > > > > coder. However, '\n' is not a good delimiter here, as you can
> > > > imagine.
> > > > > > > A similar pattern is found in Spark's `saveAsObjectFile`
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
>
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > > > > > where
> > > > > > > they use a more appropriate delimiter, to avoid such issues.
> > > > > > >
> > > > > > > I did not find any unit tests which use TextIO to read
anything
> > > other
> > > > > > than
> > > > > > > Strings.
> > > > > > >
> > > > > >
> > > > > > --
> > > > > > Jean-Baptiste Onofré
> > > > > > jbonofre@apache.org
> > > > > > http://blog.nanthrax.net
> > > > > > Talend - http://www.talend.com
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: TextIO binary file

Posted by Stas Levin <st...@gmail.com>.
I believe the motivation is to have an abstraction that allows one to write
stuff to a file in a way that is agnostic to the coder.
If one needs to write a non-Avro protocol to a file, and this particular
protocol does not meet the assumption made by TextIO, one might need to
duplicate the file IO related code from AvroIO.

On Tue, Jan 31, 2017 at 6:50 PM Eugene Kirpichov
<ki...@google.com.invalid> wrote:

> Could you clarify why it would be useful to write objects to files using
> Beam coders, as opposed to just using e.g. AvroIO?
>
> Coders (should) make no promise as to what their wire format is, so such
> files could be read back only by other Beam pipelines using the same IO.
>
> On Tue, Jan 31, 2017 at 2:48 AM Aviem Zur <av...@gmail.com> wrote:
>
> > So If I understand the general agreement is that TextIO should not
> support
> > anything but lines from files as strings.
> > I'll go ahead and file a ticket that says the Javadoc should be changed
> to
> > reflect this and `withCoder` method should be removed.
> >
> > Is there merit for Beam to supply an IO which does allow writing objects
> to
> > a file using Beam coders and Beam FS (To write these files to
> > GS/Hadoop/Local)?
> >
> > On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov
> > <ki...@google.com.invalid> wrote:
> >
> > P.S. Note that this point (about coders) is also mentioned in the
> > now-being-reviewed PTransform Style Guide
> > https://github.com/apache/beam-site/pull/134
> > currently staged at
> >
> >
> http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders
> >
> >
> > On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath <chamikara@apache.org
> >
> > wrote:
> >
> > > +1 to what Eugene said.
> > >
> > > I've seen a number of Python SDK users incorrectly assuming that
> > > coder.decode() is needed when developing their own file-based sources
> > > (since many users usually refer to text source first). Probably coder
> > > parameter should not be configurable for text source/sink and they
> should
> > > be updated to only read/write UTF-8 encoded strings.
> > >
> > > - Cham
> > >
> > > On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
> > > <ki...@google.com.invalid> wrote:
> > >
> > > > The use of Coder in TextIO is a long standing design issue because
> > coders
> > > > are not intended to be used for general purpose converting things
> from
> > > and
> > > > to bytes, their only proper use is letting the runner materialize and
> > > > restore objects if the runner thinks it's necessary. IMO it should
> have
> > > > been called LineIO, document that it reads lines of text as String,
> and
> > > not
> > > > have a withCoder parameter at all.
> > > >
> > > > The proper way to address your use case is to write a custom
> > > > FileBasedSource.
> > > > On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur <av...@gmail.com>
> wrote:
> > > >
> > > > > The Javadoc of TextIO states:
> > > > >
> > > > > * <p>By default, {@link TextIO.Read} returns a {@link PCollection}
> of
> > > > > {@link String Strings},
> > > > >  * each corresponding to one line of an input UTF-8 text file. To
> > > convert
> > > > > directly from the raw
> > > > >  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to
> > > another
> > > > > object of type {@code T},
> > > > >  * supply a {@code Coder<T>} using {@link
> > > TextIO.Read#withCoder(Coder)}.
> > > > >
> > > > > However, as I stated, `withCoder` doesn't seem to have tests, and
> > > > probably
> > > > > won't work given the hard-coded '\n' delimiter.
> > > > >
> > > > > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <
> > jb@nanthrax.net
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Aviem,
> > > > > >
> > > > > > TextIO is not designed to write/read binary file: it's pure Text,
> > so
> > > > > > String.
> > > > > >
> > > > > > Regards
> > > > > > JB
> > > > > >
> > > > > > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > While trying to use TextIO to write/read a binary file rather
> > than
> > > > > String
> > > > > > > lines from a textual file I ran into an issue - the delimiter
> > > TextIO
> > > > > uses
> > > > > > > seems to be hardcoded '\n'.
> > > > > > > See `findSeparatorBounds` -
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > > > > > >
> > > > > > > The use case is to have a file of objects, encoded into bytes
> > > using a
> > > > > > > coder. However, '\n' is not a good delimiter here, as you can
> > > > imagine.
> > > > > > > A similar pattern is found in Spark's `saveAsObjectFile`
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > > > > > where
> > > > > > > they use a more appropriate delimiter, to avoid such issues.
> > > > > > >
> > > > > > > I did not find any unit tests which use TextIO to read anything
> > > other
> > > > > > than
> > > > > > > Strings.
> > > > > > >
> > > > > >
> > > > > > --
> > > > > > Jean-Baptiste Onofré
> > > > > > jbonofre@apache.org
> > > > > > http://blog.nanthrax.net
> > > > > > Talend - http://www.talend.com
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: TextIO binary file

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
Could you clarify why it would be useful to write objects to files using
Beam coders, as opposed to just using e.g. AvroIO?

Coders (should) make no promise as to what their wire format is, so such
files could be read back only by other Beam pipelines using the same IO.
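
To illustrate the wire-format point, here is a minimal plain-Python sketch (it does not use any Beam API; `encode_fixed`, `encode_varint` and `decode_fixed` are hypothetical stand-ins for two versions of a coder). Both encodings are valid ways to serialize the same non-negative integer, but a reader written against one cannot interpret bytes produced by the other:

```python
import struct


def encode_fixed(value):
    """Encode an integer as 8 big-endian bytes (hypothetical coder, version A)."""
    return struct.pack(">q", value)


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint (hypothetical coder, version B)."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_fixed(data):
    """Decode exactly 8 big-endian bytes; raises struct.error on any other length."""
    (value,) = struct.unpack(">q", data)
    return value


value = 300
fixed = encode_fixed(value)    # 8 bytes
varint = encode_varint(value)  # 2 bytes

# Same logical value, different bytes on disk.
print(fixed.hex(), varint.hex())
```

A file written with one of these could only be read back by code that happens to agree on the same encoding, which is the concern raised above.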

On Tue, Jan 31, 2017 at 2:48 AM Aviem Zur <av...@gmail.com> wrote:

> So If I understand the general agreement is that TextIO should not support
> anything but lines from files as strings.
> I'll go ahead and file a ticket that says the Javadoc should be changed to
> reflect this and `withCoder` method should be removed.
>
> Is there merit for Beam to supply an IO which does allow writing objects to
> a file using Beam coders and Beam FS (To write these files to
> GS/Hadoop/Local)?
>
> On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov
> <ki...@google.com.invalid> wrote:
>
> P.S. Note that this point (about coders) is also mentioned in the
> now-being-reviewed PTransform Style Guide
> https://github.com/apache/beam-site/pull/134
> currently staged at
>
> http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders
>
>
> On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath <ch...@apache.org>
> wrote:
>
> > +1 to what Eugene said.
> >
> > I've seen a number of Python SDK users incorrectly assuming that
> > coder.decode() is needed when developing their own file-based sources
> > (since many users usually refer to text source first). Probably coder
> > parameter should not be configurable for text source/sink and they should
> > be updated to only read/write UTF-8 encoded strings.
> >
> > - Cham
> >
> > On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
> > <ki...@google.com.invalid> wrote:
> >
> > > The use of Coder in TextIO is a long standing design issue because
> coders
> > > are not intended to be used for general purpose converting things from
> > and
> > > to bytes, their only proper use is letting the runner materialize and
> > > restore objects if the runner thinks it's necessary. IMO it should have
> > > been called LineIO, document that it reads lines of text as String, and
> > not
> > > have a withCoder parameter at all.
> > >
> > > The proper way to address your use case is to write a custom
> > > FileBasedSource.
> > > On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur <av...@gmail.com> wrote:
> > >
> > > > The Javadoc of TextIO states:
> > > >
> > > > * <p>By default, {@link TextIO.Read} returns a {@link PCollection} of
> > > > {@link String Strings},
> > > >  * each corresponding to one line of an input UTF-8 text file. To
> > convert
> > > > directly from the raw
> > > >  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to
> > another
> > > > object of type {@code T},
> > > >  * supply a {@code Coder<T>} using {@link
> > TextIO.Read#withCoder(Coder)}.
> > > >
> > > > However, as I stated, `withCoder` doesn't seem to have tests, and
> > > probably
> > > > won't work given the hard-coded '\n' delimiter.
> > > >
> > > > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <
> jb@nanthrax.net
> > >
> > > > wrote:
> > > >
> > > > > Hi Aviem,
> > > > >
> > > > > TextIO is not designed to write/read binary file: it's pure Text,
> so
> > > > > String.
> > > > >
> > > > > Regards
> > > > > JB
> > > > >
> > > > > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > > > > Hi,
> > > > > >
> > > > > > While trying to use TextIO to write/read a binary file rather
> than
> > > > String
> > > > > > lines from a textual file I ran into an issue - the delimiter
> > TextIO
> > > > uses
> > > > > > seems to be hardcoded '\n'.
> > > > > > See `findSeparatorBounds` -
> > > > > >
> > > > >
> > > >
> > >
> >
>
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > > > > >
> > > > > > The use case is to have a file of objects, encoded into bytes
> > using a
> > > > > > coder. However, '\n' is not a good delimiter here, as you can
> > > imagine.
> > > > > > A similar pattern is found in Spark's `saveAsObjectFile`
> > > > > >
> > > > >
> > > >
> > >
> >
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > > > > where
> > > > > > they use a more appropriate delimiter, to avoid such issues.
> > > > > >
> > > > > > I did not find any unit tests which use TextIO to read anything
> > other
> > > > > than
> > > > > > Strings.
> > > > > >
> > > > >
> > > > > --
> > > > > Jean-Baptiste Onofré
> > > > > jbonofre@apache.org
> > > > > http://blog.nanthrax.net
> > > > > Talend - http://www.talend.com
> > > > >
> > > >
> > >
> >
>

Re: TextIO binary file

Posted by Aviem Zur <av...@gmail.com>.
So, if I understand correctly, the general agreement is that TextIO should
not support anything but lines from files as strings.
I'll go ahead and file a ticket saying that the Javadoc should be changed to
reflect this and that the `withCoder` method should be removed.

Is there merit for Beam to supply an IO which does allow writing objects to
a file using Beam coders and Beam FS (To write these files to
GS/Hadoop/Local)?

On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov
<ki...@google.com.invalid> wrote:

P.S. Note that this point (about coders) is also mentioned in the
now-being-reviewed PTransform Style Guide
https://github.com/apache/beam-site/pull/134
currently staged at
http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders


On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath <ch...@apache.org>
wrote:

> +1 to what Eugene said.
>
> I've seen a number of Python SDK users incorrectly assuming that
> coder.decode() is needed when developing their own file-based sources
> (since many users usually refer to text source first). Probably coder
> parameter should not be configurable for text source/sink and they should
> be updated to only read/write UTF-8 encoded strings.
>
> - Cham
>
> On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
> <ki...@google.com.invalid> wrote:
>
> > The use of Coder in TextIO is a long standing design issue because coders
> > are not intended to be used for general purpose converting things from and
> > to bytes, their only proper use is letting the runner materialize and
> > restore objects if the runner thinks it's necessary. IMO it should have
> > been called LineIO, document that it reads lines of text as String, and not
> > have a withCoder parameter at all.
> >
> > The proper way to address your use case is to write a custom
> > FileBasedSource.
> > On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur <av...@gmail.com> wrote:
> >
> > > The Javadoc of TextIO states:
> > >
> > > * <p>By default, {@link TextIO.Read} returns a {@link PCollection} of
> > > {@link String Strings},
> > >  * each corresponding to one line of an input UTF-8 text file. To
> convert
> > > directly from the raw
> > >  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to
> another
> > > object of type {@code T},
> > >  * supply a {@code Coder<T>} using {@link
> TextIO.Read#withCoder(Coder)}.
> > >
> > > However, as I stated, `withCoder` doesn't seem to have tests, and
> > probably
> > > won't work given the hard-coded '\n' delimiter.
> > >
> > > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <jb@nanthrax.net
> >
> > > wrote:
> > >
> > > > Hi Aviem,
> > > >
> > > > TextIO is not designed to write/read binary file: it's pure Text, so
> > > > String.
> > > >
> > > > Regards
> > > > JB
> > > >
> > > > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > > > Hi,
> > > > >
> > > > > While trying to use TextIO to write/read a binary file rather than String
> > > > > lines from a textual file I ran into an issue - the delimiter TextIO uses
> > > > > seems to be hardcoded '\n'.
> > > > > See `findSeparatorBounds` -
> > > > > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > > > >
> > > > > The use case is to have a file of objects, encoded into bytes using a
> > > > > coder. However, '\n' is not a good delimiter here, as you can imagine.
> > > > > A similar pattern is found in Spark's `saveAsObjectFile`
> > > > > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > > > where
> > > > > they use a more appropriate delimiter, to avoid such issues.
> > > > >
> > > > > I did not find any unit tests which use TextIO to read anything
> other
> > > > than
> > > > > Strings.
> > > > >
> > > >
> > > > --
> > > > Jean-Baptiste Onofré
> > > > jbonofre@apache.org
> > > > http://blog.nanthrax.net
> > > > Talend - http://www.talend.com
> > > >
> > >
> >
>

Re: TextIO binary file

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
P.S. Note that this point (about coders) is also mentioned in the
now-being-reviewed PTransform Style Guide
https://github.com/apache/beam-site/pull/134
currently staged at
http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders


On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath <ch...@apache.org>
wrote:

> +1 to what Eugene said.
>
> I've seen a number of Python SDK users incorrectly assuming that
> coder.decode() is needed when developing their own file-based sources
> (since many users usually refer to text source first). Probably coder
> parameter should not be configurable for text source/sink and they should
> be updated to only read/write UTF-8 encoded strings.
>
> - Cham
>
> On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
> <ki...@google.com.invalid> wrote:
>
> > The use of Coder in TextIO is a long standing design issue because coders
> > are not intended to be used for general purpose converting things from
> and
> > to bytes, their only proper use is letting the runner materialize and
> > restore objects if the runner thinks it's necessary. IMO it should have
> > been called LineIO, document that it reads lines of text as String, and
> not
> > have a withCoder parameter at all.
> >
> > The proper way to address your use case is to write a custom
> > FileBasedSource.
> > On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur <av...@gmail.com> wrote:
> >
> > > The Javadoc of TextIO states:
> > >
> > > * <p>By default, {@link TextIO.Read} returns a {@link PCollection} of
> > > {@link String Strings},
> > >  * each corresponding to one line of an input UTF-8 text file. To
> convert
> > > directly from the raw
> > >  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to
> another
> > > object of type {@code T},
> > >  * supply a {@code Coder<T>} using {@link
> TextIO.Read#withCoder(Coder)}.
> > >
> > > However, as I stated, `withCoder` doesn't seem to have tests, and
> > probably
> > > won't work given the hard-coded '\n' delimiter.
> > >
> > > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <jb@nanthrax.net
> >
> > > wrote:
> > >
> > > > Hi Aviem,
> > > >
> > > > TextIO is not designed to write/read binary file: it's pure Text, so
> > > > String.
> > > >
> > > > Regards
> > > > JB
> > > >
> > > > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > > > Hi,
> > > > >
> > > > > While trying to use TextIO to write/read a binary file rather than
> > > String
> > > > > lines from a textual file I ran into an issue - the delimiter
> TextIO
> > > uses
> > > > > seems to be hardcoded '\n'.
> > > > > See `findSeparatorBounds` -
> > > > >
> > > >
> > >
> >
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > > > >
> > > > > The use case is to have a file of objects, encoded into bytes
> using a
> > > > > coder. However, '\n' is not a good delimiter here, as you can
> > imagine.
> > > > > A similar pattern is found in Spark's `saveAsObjectFile`
> > > > >
> > > >
> > >
> >
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > > > where
> > > > > they use a more appropriate delimiter, to avoid such issues.
> > > > >
> > > > > I did not find any unit tests which use TextIO to read anything
> other
> > > > than
> > > > > Strings.
> > > > >
> > > >
> > > > --
> > > > Jean-Baptiste Onofré
> > > > jbonofre@apache.org
> > > > http://blog.nanthrax.net
> > > > Talend - http://www.talend.com
> > > >
> > >
> >
>

Re: TextIO binary file

Posted by Chamikara Jayalath <ch...@apache.org>.
+1 to what Eugene said.

I've seen a number of Python SDK users incorrectly assuming that
coder.decode() is needed when developing their own file-based sources
(since many users refer to the text source first). The coder parameter
should probably not be configurable for the text source/sink, and they
should be updated to only read/write UTF-8 encoded strings.
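
As a rough illustration of that separation, the plain-Python sketch below (not Beam code; `read_lines` and `parse_int` are hypothetical names) keeps the "text source" responsible only for producing UTF-8 strings, and does the conversion to another type in a separate step afterwards:

```python
def read_lines(data):
    """Decode raw bytes as UTF-8 and yield one string per line.

    This is the only job of the text source in this sketch: no coder involved.
    """
    for line in data.decode("utf-8").splitlines():
        yield line


def parse_int(line):
    """Separate parsing step, applied after the lines have been read."""
    return int(line.strip())


raw = "1\n22\n333\n".encode("utf-8")

# Read strings first, then parse them in a distinct step.
numbers = [parse_int(line) for line in read_lines(raw)]
print(numbers)
```

In a pipeline, the parsing function would be the body of a transform applied to the collection of strings, rather than a coder handed to the source.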

- Cham

On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
<ki...@google.com.invalid> wrote:

> The use of Coder in TextIO is a long standing design issue because coders
> are not intended to be used for general purpose converting things from and
> to bytes, their only proper use is letting the runner materialize and
> restore objects if the runner thinks it's necessary. IMO it should have
> been called LineIO, document that it reads lines of text as String, and not
> have a withCoder parameter at all.
>
> The proper way to address your use case is to write a custom
> FileBasedSource.
> On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur <av...@gmail.com> wrote:
>
> > The Javadoc of TextIO states:
> >
> > * <p>By default, {@link TextIO.Read} returns a {@link PCollection} of
> > {@link String Strings},
> >  * each corresponding to one line of an input UTF-8 text file. To convert
> > directly from the raw
> >  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to another
> > object of type {@code T},
> >  * supply a {@code Coder<T>} using {@link TextIO.Read#withCoder(Coder)}.
> >
> > However, as I stated, `withCoder` doesn't seem to have tests, and
> probably
> > won't work given the hard-coded '\n' delimiter.
> >
> > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
> > wrote:
> >
> > > Hi Aviem,
> > >
> > > TextIO is not designed to write/read binary file: it's pure Text, so
> > > String.
> > >
> > > Regards
> > > JB
> > >
> > > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > > Hi,
> > > >
> > > > While trying to use TextIO to write/read a binary file rather than
> > String
> > > > lines from a textual file I ran into an issue - the delimiter TextIO
> > uses
> > > > seems to be hardcoded '\n'.
> > > > See `findSeparatorBounds` -
> > > >
> > >
> >
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > > >
> > > > The use case is to have a file of objects, encoded into bytes using a
> > > > coder. However, '\n' is not a good delimiter here, as you can
> imagine.
> > > > A similar pattern is found in Spark's `saveAsObjectFile`
> > > >
> > >
> >
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > > where
> > > > they use a more appropriate delimiter, to avoid such issues.
> > > >
> > > > I did not find any unit tests which use TextIO to read anything other
> > > than
> > > > Strings.
> > > >
> > >
> > > --
> > > Jean-Baptiste Onofré
> > > jbonofre@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> >
>

Re: TextIO binary file

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
The use of Coder in TextIO is a long-standing design issue: coders are not
intended for general-purpose conversion of things to and from bytes. Their
only proper use is letting the runner materialize and restore objects if the
runner thinks it's necessary. IMO it should have been called LineIO, should
document that it reads lines of text as String, and should not have a
withCoder parameter at all.

The proper way to address your use case is to write a custom
FileBasedSource.
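
One possible shape for the record format such a custom source could read is length-prefixed framing. The sketch below is plain Python under assumed conventions (a 4-byte big-endian length before each record; `write_records` and `read_records` are hypothetical helpers, not Beam APIs). Because the reader never scans for a delimiter, record payloads may contain any bytes, including '\n':

```python
import io
import struct


def write_records(stream, records):
    """Write each record as a 4-byte big-endian length followed by its bytes."""
    for record in records:
        stream.write(struct.pack(">I", len(record)))
        stream.write(record)


def read_records(stream):
    """Yield records written by write_records until the stream is exhausted."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of file
        if len(header) < 4:
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack(">I", header)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated record")
        yield payload


records = [b"plain", b"with\nnewline", b"", b"\x00\x0a\xff"]
buffer = io.BytesIO()
write_records(buffer, records)
buffer.seek(0)
round_tripped = list(read_records(buffer))
print(round_tripped == records)
```

Note that this simple framing can only be read from the start of the file. A source that needs to split a file at arbitrary offsets would need something extra, such as the sync markers used by Avro container files.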
On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur <av...@gmail.com> wrote:

> The Javadoc of TextIO states:
>
> * <p>By default, {@link TextIO.Read} returns a {@link PCollection} of
> {@link String Strings},
>  * each corresponding to one line of an input UTF-8 text file. To convert
> directly from the raw
>  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to another
> object of type {@code T},
>  * supply a {@code Coder<T>} using {@link TextIO.Read#withCoder(Coder)}.
>
> However, as I stated, `withCoder` doesn't seem to have tests, and probably
> won't work given the hard-coded '\n' delimiter.
>
> On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
>
> > Hi Aviem,
> >
> > TextIO is not designed to write/read binary file: it's pure Text, so
> > String.
> >
> > Regards
> > JB
> >
> > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > Hi,
> > >
> > > While trying to use TextIO to write/read a binary file rather than
> String
> > > lines from a textual file I ran into an issue - the delimiter TextIO
> uses
> > > seems to be hardcoded '\n'.
> > > See `findSeparatorBounds` -
> > >
> >
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > >
> > > The use case is to have a file of objects, encoded into bytes using a
> > > coder. However, '\n' is not a good delimiter here, as you can imagine.
> > > A similar pattern is found in Spark's `saveAsObjectFile`
> > >
> >
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > where
> > > they use a more appropriate delimiter, to avoid such issues.
> > >
> > > I did not find any unit tests which use TextIO to read anything other
> > than
> > > Strings.
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbonofre@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>

Re: TextIO binary file

Posted by Aviem Zur <av...@gmail.com>.
The Javadoc of TextIO states:

* <p>By default, {@link TextIO.Read} returns a {@link PCollection} of
{@link String Strings},
 * each corresponding to one line of an input UTF-8 text file. To convert
directly from the raw
 * bytes (split into lines delimited by '\n', '\r', or '\r\n') to another
object of type {@code T},
 * supply a {@code Coder<T>} using {@link TextIO.Read#withCoder(Coder)}.

However, as I stated, `withCoder` doesn't seem to have tests, and probably
won't work given the hard-coded '\n' delimiter.
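
A small plain-Python sketch of why a newline delimiter is a problem for encoded objects (this is not the TextIO implementation; the 4-byte big-endian integer encoding is a hypothetical stand-in for a coder). Any encoded value that happens to contain the byte 0x0A gets cut in two when the file is split back into "lines":

```python
import struct


def write_newline_delimited(records):
    """Join already-encoded records with a newline byte, as a line-based sink would."""
    return b"".join(record + b"\n" for record in records)


def read_newline_delimited(data):
    """Split on the newline byte, as a line-based source would."""
    return data.split(b"\n")[:-1]  # drop the empty piece after the final newline


# Hypothetical coder: 4-byte big-endian integers. 10 and 266 both contain 0x0A.
values = [1, 10, 266]
encoded = [struct.pack(">i", v) for v in values]

data = write_newline_delimited(encoded)
decoded = read_newline_delimited(data)

print(len(encoded), "records written,", len(decoded), "records read back")
```

Only the first record survives the round trip here; the other two are split at the 0x0A byte inside their payload.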

On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Hi Aviem,
>
> TextIO is not designed to write/read binary file: it's pure Text, so
> String.
>
> Regards
> JB
>
> On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > Hi,
> >
> > While trying to use TextIO to write/read a binary file rather than String
> > lines from a textual file I ran into an issue - the delimiter TextIO uses
> > seems to be hardcoded '\n'.
> > See `findSeparatorBounds` -
> >
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> >
> > The use case is to have a file of objects, encoded into bytes using a
> > coder. However, '\n' is not a good delimiter here, as you can imagine.
> > A similar pattern is found in Spark's `saveAsObjectFile`
> >
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > where
> > they use a more appropriate delimiter, to avoid such issues.
> >
> > I did not find any unit tests which use TextIO to read anything other
> than
> > Strings.
> >
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

Re: TextIO binary file

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Aviem,

TextIO is not designed to write/read binary files: it's pure text, so String.

Regards
JB

On 01/30/2017 09:24 AM, Aviem Zur wrote:
> Hi,
>
> While trying to use TextIO to write/read a binary file rather than String
> lines from a textual file I ran into an issue - the delimiter TextIO uses
> seems to be hardcoded '\n'.
> See `findSeparatorBounds` -
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
>
> The use case is to have a file of objects, encoded into bytes using a
> coder. However, '\n' is not a good delimiter here, as you can imagine.
> A similar pattern is found in Spark's `saveAsObjectFile`
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> where
> they use a more appropriate delimiter, to avoid such issues.
>
> I did not find any unit tests which use TextIO to read anything other than
> Strings.
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: TextIO binary file

Posted by Stas Levin <st...@gmail.com>.
Dear Aviem,

That's a good point. TextIO seems to make a few "string oriented"
assumptions; see TextIO.Write's header and footer support
<https://github.com/apache/beam/pull/918>, and "IO design pattern: Decouple
Parsers and Coders" <https://issues.apache.org/jira/browse/BEAM-73>
(BEAM-73), which came up during its development.

IMHO it would be nice to make TextIO purely about textual content, and
perhaps go as far as removing the ability to pass in a coder. To support
encoded files, one could consider something like a FileIO which takes a coder
and writes/reads the encoded/decoded content to/from a file. For example,
AvroIO could be thought of as a FileIO with an AvroCoder.

IO authors, does this sit well with what you had in mind?

-Stas



On Mon, Jan 30, 2017 at 10:24 AM Aviem Zur <av...@gmail.com> wrote:

> Hi,
>
> While trying to use TextIO to write/read a binary file rather than String
> lines from a textual file I ran into an issue - the delimiter TextIO uses
> seems to be hardcoded '\n'.
> See `findSeparatorBounds` -
>
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
>
> The use case is to have a file of objects, encoded into bytes using a
> coder. However, '\n' is not a good delimiter here, as you can imagine.
> A similar pattern is found in Spark's `saveAsObjectFile`
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> where
> they use a more appropriate delimiter, to avoid such issues.
>
> I did not find any unit tests which use TextIO to read anything other than
> Strings.
>