
Posted to dev@beam.apache.org by Amit Sela <am...@gmail.com> on 2016/12/13 09:56:38 UTC

Re: PCollection to PCollection Conversion

It seems that there were a lot of good points raised here, and I tend to
agree that something as trivial and lean as "ToString" should be a part of
core.
I'm particularly fond of makeString(prefix, toString, suffix) in various
combinations (Scala-like).
For "fromString", I think JB has a good point leveraging JAXB and Jackson -
though I think this should be in extensions as it is not as lean as
toString.
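
For illustration only, here is a minimal plain-Java sketch of the Scala-like
makeString idea. The class name, the helper name, and the extra delimiter
argument are assumptions made for this example, not an existing Beam API; in
a pipeline the helper would presumably be called from a function passed to
MapElements.via.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

final class MakeStringSketch {
  // Hypothetical helper (not part of Beam): joins the toString() of each
  // item with the delimiter, wrapped in the given prefix and suffix.
  static <T> String makeString(
      Iterable<T> items, String prefix, String delimiter, String suffix) {
    StringJoiner joiner = new StringJoiner(delimiter, prefix, suffix);
    for (T item : items) {
      // String.valueOf tolerates null items, unlike item.toString().
      joiner.add(String.valueOf(item));
    }
    return joiner.toString();
  }

  public static void main(String[] args) {
    List<String> items = Arrays.asList("one", "two", "three");
    System.out.println(makeString(items, "[", ", ", "]")); // [one, two, three]
    System.out.println(makeString(items, "", ",", ""));    // one,two,three
  }
}
```

An empty prefix and suffix with a "," delimiter gives the CSV-style line
discussed further down the thread.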

Thanks,
Amit

On Wed, Nov 30, 2016 at 5:13 AM Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Hi Jesse,
>
> yes, I started something there (using JAXB and Jackson). Let me polish
> and push.
>
> Regards
> JB
>
> On 11/29/2016 10:00 PM, Jesse Anderson wrote:
> > I went through the string conversions. Do you have an example of writing
> > out XML/JSON/etc too?
> >
> > On Tue, Nov 29, 2016 at 3:46 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
> > wrote:
> >
> >> Hi Jesse,
> >>
> >>
> >>
> https://github.com/jbonofre/incubator-beam/tree/DATAFORMAT/sdks/java/extensions/dataformat
> >>
> >> it's very simple and stupid and of course not complete at all (I have
> >> other commits but not merged as they need some polishing), but as I
> >> said, it's a base of discussion.
> >>
> >> Regards
> >> JB
> >>
> >> On 11/29/2016 09:23 PM, Jesse Anderson wrote:
> >>> @jb Sounds good. Just let us know once you've pushed.
> >>>
> >>> On Tue, Nov 29, 2016 at 2:54 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
> >>> wrote:
> >>>
> >>>> Good point Eugene.
> >>>>
> >>>> Right now, it's a DoFn collection to experiment a bit (a pure
> >>>> extension). It's pretty stupid ;)
> >>>>
> >>>> But, you are right, depending the direction of such extension, it
> could
> >>>> cover more use cases (even if it's not my first intention ;)).
> >>>>
> >>>> Let me push the branch (pretty small) as an illustration, and in the
> >>>> mean time, I'm preparing a document (more focused on the use cases).
> >>>>
> >>>> WDYT ?
> >>>>
> >>>> Regards
> >>>> JB
> >>>>
> >>>> On 11/29/2016 08:47 PM, Eugene Kirpichov wrote:
> >>>>> Hi JB,
> >>>>> Depending on the scope of what you want to ultimately accomplish with
> >>>> this
> >>>>> extension, I think it may make sense to write a proposal document and
> >>>>> discuss it.
> >>>>> If it's just a collection of utility DoFn's for various well-defined
> >>>>> source/target format pairs, then that's probably not needed, but if
> >> it's
> >>>>> anything more, then I think it is.
> >>>>> That will help avoid a lot of churn if people propose reasonable
> >>>>> significant changes.
> >>>>>
> >>>>> On Tue, Nov 29, 2016 at 11:15 AM Jean-Baptiste Onofré <
> jb@nanthrax.net
> >>>
> >>>>> wrote:
> >>>>>
> >>>>>> By the way Jesse, I gonna push my DATAFORMAT branch on my github
> and I
> >>>>>> will post on the dev mailing list when done.
> >>>>>>
> >>>>>> Regards
> >>>>>> JB
> >>>>>>
> >>>>>> On 11/29/2016 07:01 PM, Jesse Anderson wrote:
> >>>>>>> I want to bring this thread back up since we've had time to think
> >> about
> >>>>>> it
> >>>>>>> more and make a plan.
> >>>>>>>
> >>>>>>> I think a format-specific converter will be more time consuming
> task
> >>>> than
> >>>>>>> we originally thought. It'd have to be a writer that takes another
> >>>> writer
> >>>>>>> as a parameter.
> >>>>>>>
> >>>>>>> I think a string converter can be done as a simple transform.
> >>>>>>>
> >>>>>>> I think we should start with a simple string converter and plan
> for a
> >>>>>>> format-specific writer.
> >>>>>>>
> >>>>>>> What are your thoughts?
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>> Jesse
> >>>>>>>
> >>>>>>> On Thu, Nov 10, 2016 at 10:33 AM Jesse Anderson <
> >> jesse@smokinghand.com
> >>>>>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>> I was thinking about what the outputs would look like last night. I
> >>>>>>> realized that more complex formats like JSON and XML may or may not
> >>>>>> output
> >>>>>>> the data in a valid format.
> >>>>>>>
> >>>>>>> Doing a direct conversion on unbounded collections would work just
> >>>> fine.
> >>>>>>> They're self-contained. For writing out bounded collections, that's
> >>>> where
> >>>>>>> we'll hit the issues. This changes the uber conversion transform
> >> into a
> >>>>>>> transform that needs to be a writer.
> >>>>>>>
> >>>>>>> If a transform executes a JSON conversion on a per element basis,
> >> we'd
> >>>>>> get
> >>>>>>> this:
> >>>>>>> {
> >>>>>>> "key": "value"
> >>>>>>> }, {
> >>>>>>> "key": "value"
> >>>>>>> },
> >>>>>>>
> >>>>>>> That isn't valid JSON.
> >>>>>>>
> >>>>>>> The conversion transform would need to do several things when
> >>>>>> writing
> >>>>>>> out a file. It would need to add brackets for an array. Now we
> have:
> >>>>>>> [
> >>>>>>> {
> >>>>>>> "key": "value"
> >>>>>>> }, {
> >>>>>>> "key": "value"
> >>>>>>> },
> >>>>>>> ]
> >>>>>>>
> >>>>>>> We still don't have valid JSON. We have to remove the last comma or
> >>>> have
> >>>>>>> the uber transform start putting in the commas, except for the last
> >>>>>> element.
> >>>>>>>
> >>>>>>> [
> >>>>>>> {
> >>>>>>> "key": "value"
> >>>>>>> }, {
> >>>>>>> "key": "value"
> >>>>>>> }
> >>>>>>> ]
> >>>>>>>
> >>>>>>> Only by doing this do we have valid JSON.
> >>>>>>>
> >>>>>>> I'd argue we'd have a similar issue with XML. Some parsers require
> a
> >>>> root
> >>>>>>> element for everything. The uber transform would have to put the
> root
> >>>>>>> element tags at the beginning and end of the file.
> >>>>>>>
> >>>>>>> On Wed, Nov 9, 2016 at 11:36 PM Manu Zhang <
> owenzhang1990@gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> I would love to see a lean core and abundant Transforms at the same
> >>>> time.
> >>>>>>>
> >>>>>>> Maybe we can look at what Confluent <
> https://github.com/confluentinc
> >>>
> >>>>>> does
> >>>>>>> for kafka-connect. They have official extensions support for JDBC,
> >> HDFS
> >>>>>> and
> >>>>>>> ElasticSearch under https://github.com/confluentinc. They put them
> >>>> along
> >>>>>>> with other community extensions on
> >>>>>>> https://www.confluent.io/product/connectors/ for visibility.
> >>>>>>>
> >>>>>>> Although not a commercial company, can we have a GitHub user like
> >>>>>>> beam-community to host projects we build around beam but not
> suitable
> >>>> for
> >>>>>>> https://github.com/apache/incubator-beam. In the future, we may
> have
> >>>>>>> beam-algebra like http://github.com/twitter/algebird for algebra
> >>>>>> operations
> >>>>>>> and beam-ml / beam-dl for machine learning / deep learning. Also,
> >> there
> >>>>>>> will be beam related projects elsewhere maintained by other
> >>>>>>> communities. We can put all of them on the beam-website or like
> spark
> >>>>>>> packages as mentioned by Amit.
> >>>>>>>
> >>>>>>> My $0.02
> >>>>>>> Manu
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles
> >> <klk@google.com.invalid
> >>>>>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> On this point from Amit and Ismaël, I agree: we could benefit
> from a
> >>>>>> place
> >>>>>>>> for miscellaneous non-core helper transformations.
> >>>>>>>>
> >>>>>>>> We have sdks/java/extensions but it is organized as separate
> >>>> artifacts.
> >>>>>> I
> >>>>>>>> think that is fine, considering the nature of Join and SortValues.
> >> But
> >>>>>> for
> >>>>>>>> simpler transforms, Importing one artifact per tiny transform is
> too
> >>>>>> much
> >>>>>>>> overhead. It also seems unlikely that we will have enough
> >> commonality
> >>>>>>> among
> >>>>>>>> the transforms to call the artifact anything other than [some
> >> synonym
> >>>>>> for]
> >>>>>>>> "miscellaneous".
> >>>>>>>>
> >>>>>>>> I wouldn't want to take this too far - even though the SDK has many
> >>>>>>> transforms*
> >>>>>>>> that are not required for the model [1], I like that the SDK
> >> artifact
> >>>>>> has
> >>>>>>>> everything a user might need in their "getting started" phase of
> >> use.
> >>>>>> This
> >>>>>>>> user-friendliness (the user doesn't care that ParDo is core and
> Sum
> >> is
> >>>>>>> not)
> >>>>>>>> plus the difficulty of judging which transforms go where, are
> >> probably
> >>>>>> why
> >>>>>>>> we have them mostly all in one place.
> >>>>>>>>
> >>>>>>>> Models to look at, off the top of my head, include Pig's PiggyBank
> >> and
> >>>>>>>> Apex's Malhar. These have different levels of support implied.
> >> Others?
> >>>>>>>>
> >>>>>>>> Kenn
> >>>>>>>>
> >>>>>>>> [1] ApproximateQuantiles, ApproximateUnique, Count, Distinct,
> >> Filter,
> >>>>>>>> FlatMapElements, Keys, Latest, MapElements, Max, Mean, Min,
> Values,
> >>>>>>> KvSwap,
> >>>>>>>> Partition, Regex, Sample, Sum, Top, Values, WithKeys,
> WithTimestamps
> >>>>>>>>
> >>>>>>>> * at least they are separate classes and not methods on
> PCollection
> >>>> :-)
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Wed, Nov 9, 2016 at 6:03 AM, Ismaël Mejía <ie...@gmail.com>
> >>>> wrote:
> >>>>>>>>
> >>>>>>>>> Nice discussion, and thanks Jesse for bringing this subject
> back.
> >>>>>>>>>
> >>>>>>>>> I agree 100% with Amit and the idea of having a home for those
> >>>>>>> transforms
> >>>>>>>>> that are not core enough to be part of the sdk, but that we all
> end
> >>>> up
> >>>>>>>>> re-writing somehow.
> >>>>>>>>>
> >>>>>>>>> This is a needed improvement to be more developer friendly, but
> >> also
> >>>> as
> >>>>>>> a
> >>>>>>>>> reference of good practices of Beam development, and for this
> >> reason
> >>>> I
> >>>>>>>>> agree with JB that at this moment it would be better for these
> >>>>>>> transforms
> >>>>>>>>> to reside in the Beam repository at least for visibility reasons.
> >>>>>>>>>
> >>>>>>>>> One additional question is if these transforms represent a
> >> different
> >>>>>> DSL
> >>>>>>>> or
> >>>>>>>>> if those could be grouped with the current extensions (e.g. Join
> >> and
> >>>>>>>>> SortValues) into something more general that we as a community
> >> could
> >>>>>>>>> maintain, but well even if it is not the case, it would be really
> >>>> nice
> >>>>>>> to
> >>>>>>>>> start working on something like this.
> >>>>>>>>>
> >>>>>>>>> Ismaël Mejía
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré <
> >>>> jb@nanthrax.net
> >>>>>>>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Related to spark-package, we also have Apache Bahir to host
> >>>>>>>>>> connectors/transforms for Spark and Flink.
> >>>>>>>>>>
> >>>>>>>>>> IMHO, right now, Beam should host this, not sure if it makes
> sense
> >>>>>>>>>> directly in the core.
> >>>>>>>>>>
> >>>>>>>>>> It reminds me the "Integration" DSL we discussed in the
> technical
> >>>>>>>> vision
> >>>>>>>>>> document.
> >>>>>>>>>>
> >>>>>>>>>> Regards
> >>>>>>>>>> JB
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 11/09/2016 11:17 AM, Amit Sela wrote:
> >>>>>>>>>>
> >>>>>>>>>>> I think Jesse has a very good point on one hand, while Luke's
> and
> >>>>>>>>>>> Kenneth's
> >>>>>>>>>>> worries about committing users to specific implementations is
> in
> >>>>>>>> place.
> >>>>>>>>>>>
> >>>>>>>>>>> The Spark community has a 3rd party repository for useful
> >> libraries
> >>>>>>>> that
> >>>>>>>>>>> for various reasons are not a part of the Apache Spark project:
> >>>>>>>>>>> https://spark-packages.org/.
> >>>>>>>>>>>
> >>>>>>>>>>> Maybe a "common-transformations" package would serve both users
> >>>> quick
> >>>>>>>>>>> ramp-up and ease-of-use while keeping Beam more "enabling" ?
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles
> >>>>>>> <klk@google.com.invalid
> >>>>>>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> It seems useful for small scale debugging / demoing to have
> >>>>>>>>>>>> Dump.toString(). I think it should be named to clearly
> indicate
> >>>> its
> >>>>>>>>>>>> limited
> >>>>>>>>>>>> scope. Maybe other stuff could go in the Dump namespace, but
> >>>>>>>>>>>> "Dump.toJson()" would be for humans to read - so it should be
> >>>> pretty
> >>>>>>>>>>>> printed, not treated as a machine-to-machine wire format.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The broader question of representing data in JSON or XML, etc,
> >> is
> >>>>>>>>> already
> >>>>>>>>>>>> the subject of many mature libraries which are already easy to
> >> use
> >>>>>>>> with
> >>>>>>>>>>>> Beam.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The more esoteric practice of implicit or semi-implicit
> >> coercions
> >>>>>>>> seems
> >>>>>>>>>>>> like it is also already addressed in many ways elsewhere.
> >>>>>>>>>>>> Transform.via(TypeConverter) is basically the same as
> >>>>>>>>>>>> MapElements.via(<lambda>) and also easy to use with Beam.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In both of the last cases, there are many reasonable
> approaches,
> >>>> and
> >>>>>>>> we
> >>>>>>>>>>>> shouldn't commit our users to one of them.
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik
> >>>>>>>> <lcwik@google.com.invalid
> >>>>>>>>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> The suggestions you give seem good except for the XML
> cases.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Might want to have the XML be a document per line similar to
> >> the
> >>>>>>>> JSON
> >>>>>>>>>>>>> examples you have been giving.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson <
> >>>>>>>>> jesse@smokinghand.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> @lukasz Agreed there would have to be KV handling. I was more thinking
> >>>> think
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> that
> >>>>>>>>>>>>
> >>>>>>>>>>>>> whatever the addition, it shouldn't just handle KV. It should
> >>>>>>> handle
> >>>>>>>>>>>>>> Iterables, Lists, Sets, and KVs.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> For JSON and XML, I wonder if we'd be able to give someone
> >>>>>>>> something
> >>>>>>>>>>>>>> general purpose enough that you would just end up writing
> your
> >>>> own
> >>>>>>>>> code
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> handle it anyway.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Here are some ideas on what it could look like with a method
> >> and
> >>>>>>>> the
> >>>>>>>>>>>>>> resulting string output:
> >>>>>>>>>>>>>> *Stringify.toJSON()*
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> With KV:
> >>>>>>>>>>>>>> {"key": "value"}
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> With Iterables:
> >>>>>>>>>>>>>> ["one", "two", "three"]
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> *Stringify.toXML("rootelement")*
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> With KV:
> >>>>>>>>>>>>>> <rootelement key=value />
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> With Iterables:
> >>>>>>>>>>>>>> <rootelement>
> >>>>>>>>>>>>>>   <item>one</item>
> >>>>>>>>>>>>>>   <item>two</item>
> >>>>>>>>>>>>>>   <item>three</item>
> >>>>>>>>>>>>>> </rootelement>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> *Stringify.toDelimited(",")*
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> With KV:
> >>>>>>>>>>>>>> key,value
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> With Iterables:
> >>>>>>>>>>>>>> one,two,three
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Do you think that would strike a good balance between
> reusable
> >>>>>>> code
> >>>>>>>>> and
> >>>>>>>>>>>>>> writing your own for more difficult formatting?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Jesse
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik
> >>>>>>>> <lcwik@google.com.invalid
> >>>>>>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Jesse, I believe if one format gets special treatment in
> >> TextIO,
> >>>>>>>>> people
> >>>>>>>>>>>>>> will then ask why JSON, XML, ... aren't also supported.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Also, the example that you provide is using the fact that
> the
> >>>>>>> input
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> format
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> is an Iterable<Item>. You had posted a question about using
> KV
> >>>>>>> with
> >>>>>>>>>>>>>> TextIO.Write which wouldn't align with the proposed input
> >> format
> >>>>>>>> and
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> still
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> would require to write a type conversion function, this time
> >>>> from
> >>>>>>>> KV
> >>>>>>>>> to
> >>>>>>>>>>>>>> Iterable<Item> instead of KV to string.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson <
> >>>>>>>>> jesse@smokinghand.com>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Lukasz,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I don't think you'd need complicated logic for
> TextIO.Write.
> >>>> For
> >>>>>>>> CSV
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> call would look like:
> >>>>>>>>>>>>>>> Stringify.to("", ",", "\n");
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Where the arguments would be Stringify.to(prefix,
> delimiter,
> >>>>>>>>> suffix).
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The code would be something like:
> >>>>>>>>>>>>>>> StringBuffer buffer = new StringBuffer(prefix);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> for (Item item : list) {
> >>>>>>>>>>>>>>>   buffer.append(item.toString());
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>   if(notLast) {
> >>>>>>>>>>>>>>>     buffer.append(delimiter);
> >>>>>>>>>>>>>>>   }
> >>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> buffer.append(suffix);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> c.output(buffer.toString());
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> That would allow you to do the basic CSV, TSV, and other
> >>>> formats
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>> without
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> complicated logic. The same sort of thing could be done for
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>> TextIO.Write.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Jesse
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik
> >>>>>>>>> <lcwik@google.com.invalid
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The conversion from object to string will have uses outside
> >> of
> >>>>>>>> just
> >>>>>>>>>>>>>>>> TextIO.Write so it seems logical that we would want to
> have
> >> a
> >>>>>>>> ParDo
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> do
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> conversion.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Text file formats have a lot of variance, even if you
> >> consider
> >>>>>>>> the
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> subset
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> of CSV like formats where it could have fixed width fields,
> >> or
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> escaping
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> quoting around other fields, or headers that should be
> >> placed
> >>>> at
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> the
> >>>>>>>>>>>>
> >>>>>>>>>>>>> top.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Having all these format conversions within TextIO.Write
> >> seems
> >>>>>>>> like
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> a
> >>>>>>>>>>>>
> >>>>>>>>>>>>> lot
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> logic to contain in that transform which should just focus
> >> on
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> writing
> >>>>>>>>>>>>
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> files.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 8:15 AM, Jesse Anderson <
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> jesse@smokinghand.com>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> This is a thread moved over from the user mailing list.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I think there needs to be a way to convert a
> >> PCollection<KV>
> >>>> to
> >>>>>>>>>>>>>>>>> PCollection<String>.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> To do a minimal WordCount, you have to manually convert
> the
> >>>> KV
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> to a
> >>>>>>>>>>>>
> >>>>>>>>>>>>> String:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>         p
> >>>>>>>>>>>>>>>>>
> >>  .apply(TextIO.Read.from("playing_cards.tsv"))
> >>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> >>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> >>>>>>>>>>>>>>>>> *                .apply(MapElements.via((KV<String, Long>
> >>>>>>> count)
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> ->*
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> *                            count.getKey() + ":" +
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> count.getValue()*
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> *                        ).withOutputType(
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> TypeDescriptors.strings()))*
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>                 .apply(TextIO.Write.to
> >>>> ("output/stringcounts"));
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> This code really should be something like:
> >>>>>>>>>>>>>>>>>         p
> >>>>>>>>>>>>>>>>>
> >>  .apply(TextIO.Read.from("playing_cards.tsv"))
> >>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> >>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> >>>>>>>>>>>>>>>>> *                .apply(ToString.stringify())*
> >>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to
> >>>>>> ("output/stringcounts"));
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> To summarize the discussion:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>    - JA: Add a method to StringDelegateCoder to output
> any
> >> KV
> >>>>>>> or
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> list
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>    - JA and DH: Add a SimpleFunction that takes a type and
> >> runs
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> toString()
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>    on it:
> >>>>>>>>>>>>>>>>>    class ToStringFn<InputT> extends
> SimpleFunction<InputT,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> String>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> {
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>        public String apply(InputT input) {
> >>>>>>>>>>>>>>>>>            return input.toString();
> >>>>>>>>>>>>>>>>>        }
> >>>>>>>>>>>>>>>>>    }
> >>>>>>>>>>>>>>>>>    - JB: Add a general purpose type converter like in
> >> Apache
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Camel.
> >>>>>>>>>>>>
> >>>>>>>>>>>>>    - JA: Add Object support to TextIO.Write that would write
> >> out
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>    toString of any Object.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> My thoughts:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Is converting to a PCollection<String> mostly needed when
> >>>>>>> you're
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> using
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> TextIO.Write? Will a general purpose transform only work in
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> certain
> >>>>>>>>>>>>
> >>>>>>>>>>>>> cases
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> and you'll normally have to write custom code to format the
> >>>> strings
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> way
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> you want them?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> IMHO, it's yes to both. I'd prefer to add Object support
> to
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> TextIO.Write
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> a SimpleFunction that takes a delimiter as an argument.
> >>>> Making
> >>>>>>> a
> >>>>>>>>>>>>>>>>> SimpleFunction that's able to specify a delimiter (and
> >>>> perhaps
> >>>>>>> a
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> prefix
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> suffix) should cover the majority of formats and cases.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Jesse
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Jean-Baptiste Onofré
> >>>>>>>>>> jbonofre@apache.org
> >>>>>>>>>> http://blog.nanthrax.net
> >>>>>>>>>> Talend - http://www.talend.com
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Jean-Baptiste Onofré
> >>>>>> jbonofre@apache.org
> >>>>>> http://blog.nanthrax.net
> >>>>>> Talend - http://www.talend.com
> >>>>>>
> >>>>>
> >>>>
> >>>> --
> >>>> Jean-Baptiste Onofré
> >>>> jbonofre@apache.org
> >>>> http://blog.nanthrax.net
> >>>> Talend - http://www.talend.com
> >>>>
> >>>
> >>
> >> --
> >> Jean-Baptiste Onofré
> >> jbonofre@apache.org
> >> http://blog.nanthrax.net
> >> Talend - http://www.talend.com
> >>
> >
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

Re: PCollection to PCollection Conversion

Posted by Ben Chambers <bc...@google.com.INVALID>.

+1 to Dan's point that MapElements.via is not that hard to use and going
too far down this path leads to significant complexity.

Pipelines should generally prefer to deal with structured data as much as
possible. As has been discussed on this thread, though, sometimes it is
necessary to convert things to strings (either for writing to a sink or
debugging). In these cases, it seems like having a simple "ToString"
transform is useful.

But if we go beyond that, we open up a can of worms. There are many ways to
convert something to a string, and I don't think we can reasonably identify
all of them. It also makes usability harder, since the user has to dig
through our documentation and find the built-in transform offering the
functionality they want.

If we instead provide the most common case (simple toString transform) and
after that suggest using MapElements.via (maybe even in the Javadoc),
users will be guided to a useful tool (MapElements.via) and then be able
to reuse all the normal Java tools such as String.format that they are
already familiar with.
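
To make that concrete, here is a small sketch of the kind of formatter a
user would write. It uses Map.Entry as a stand-in for Beam's KV so that it
runs without the SDK on the classpath; the class and field names are
invented for the example. In a real pipeline the same String.format call
would sit inside the function handed to MapElements.via.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.function.Function;

final class FormatCountSketch {
  // Map.Entry<String, Long> stands in for KV<String, Long> (an assumption
  // made so the example compiles without Beam).
  static final Function<Map.Entry<String, Long>, String> FORMAT_COUNT =
      entry -> String.format("%s: %d", entry.getKey(), entry.getValue());

  public static void main(String[] args) {
    // Prints "hearts: 13"
    System.out.println(FORMAT_COUNT.apply(new SimpleEntry<>("hearts", 13L)));
  }
}
```

Changing the output format is then a matter of editing the format string,
with no need to look for a matching built-in transform.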

On Thu, Dec 29, 2016 at 1:46 PM Dan Halperin <dh...@google.com.invalid>
wrote:

> On Thu, Dec 29, 2016 at 1:36 PM, Jesse Anderson <je...@smokinghand.com>
> wrote:
>
> > I prefer JB's take. I think there should be three overloaded methods on
> the
> > class. I like Vikas' name ToString. The methods for a simple conversion
> > should be:
> >
> > ToString.strings() - Outputs the .toString() of the objects in the
> > PCollection
> > ToString.strings(String delimiter) - Outputs the .toString() of KVs,
> Lists,
> > etc with the delimiter between every entry
> > ToString.formatted(String format) - Outputs the formatted
> > <https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html>
> > string
> > with the object passed in. For objects made up of different parts like
> KVs,
> > each one is passed in as separate toString() of a varargs.
> >
>
> Riffing a little, with some types:
>
> ToString.<T>of() -- PTransform<T, String> that is equivalent to a ParDo
> that takes in a T and outputs T.toString().
>
> ToString.<K,V>kv(String delimiter) -- PTransform<KV<K, V>, String> that is
> equivalent to a ParDo that takes in a KV<K,V> and outputs
> kv.getKey().toString() + delimiter + kv.getValue().toString()
>
> ToString.<T>iterable(String delimiter) -- PTransform<? extends Iterable<T>,
> String> that is equivalent to a ParDo that takes in an Iterable<T> and
> outputs the iterable[0] + delimiter + iterable[1] + delimiter + ... +
> delimiter + iterable[N-1]
>
> ToString.<T>custom(SerializableFunction<T, String> formatter) ?
>
> The last one is just MapElements.via, except you don't need to set the
> output type.
>
> I don't see a way to make the generic .formatted() that you propose that
> just works with anything "made of different parts".
>
> I think adding too many overrides beyond "of" and "custom" is opening
> up a Pandora's Box. The KV one might want to have left and right
> delimiters, might want to take custom formatters for K and V, etc. etc. The
> iterable one might want to have a special configuration for an empty
> iterable. So I'm inclined towards simplicity with the awareness that
> MapElements.via is just not that hard to use.
>
> Dan
>
>
> >
> > I think doing these three methods would cover every simple and advanced
> > "simple conversions." As JB says, we'll need other specific converters
> for
> > other formats like XML.
> >
> > I'd really like to see this class in the next version of Beam. What does
> > everyone think of the class name, methods name, and method operations so
> we
> > can have Vikas finish up?
> >
> > Thanks,
> >
> > Jesse
> >
> > On Wed, Dec 28, 2016 at 12:28 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
> > wrote:
> >
> > > Hi Vikas,
> > >
> > > did you take a look on:
> > >
> > >
> > > https://github.com/jbonofre/beam/tree/DATAFORMAT/sdks/
> > java/extensions/dataformat
> > >
> > > You can see KV2String and ToString could be part of this extension.
> > > I'm also using JAXB for XML and Jackson for JSON
> > > marshalling/unmarshalling. I'm planning to deal with Avro
> > (IndexedRecord).
> > >
> > > Regards
> > > JB
> > >
> > > On 12/28/2016 08:37 PM, Vikas Kedigehalli wrote:
> > > > Hi All,
> > > >
> > > >   Not being aware of the discussion here, I sent out a PR
> > > > <https://github.com/apache/beam/pull/1704> but JB and others
> directed
> > > me to
> > > > this thread. Having converted PCollection<T> to PCollection<String>
> > > several
> > > > times, I feel something like 'ToString' transform is common enough to
> > be
> > > > part of the core. What do you all think?
> > > >
> > > > Also, if someone else is already working on or interested in tackling
> > > this,
> > > > then I am happy to discard the PR.
> > > >
> > > > Regards,
> > > > Vikas
> > > >
> > > > On Tue, Dec 13, 2016 at 1:56 AM, Amit Sela <am...@gmail.com>
> > wrote:
> > > >
> > > >> It seems that there were a lot of good points raised here, and I
> tend
> > to
> > > >> agree that something as trivial and lean as "ToString" should be a
> > part
> > > of
> > > >> core.
> > > >> I'm particularly fond of makeString(prefix, toString, suffix) in
> > various
> > > >> combinations (Scala-like).
> > > >> For "fromString", I think JB has a good point leveraging JAXB and
> > > Jackson -
> > > >> though I think this should be in extensions as it is not as lean as
> > > >> toString.
> > > >>
> > > >> Thanks,
> > > >> Amit
> > > >>
> > > >> On Wed, Nov 30, 2016 at 5:13 AM Jean-Baptiste Onofré <
> jb@nanthrax.net
> > >
> > > >> wrote:
> > > >>
> > > >>> Hi Jesse,
> > > >>>
> > > >>> yes, I started something there (using JAXB and Jackson). Let me
> > polish
> > > >>> and push.
> > > >>>
> > > >>> Regards
> > > >>> JB
> > > >>>
> > > >>> On 11/29/2016 10:00 PM, Jesse Anderson wrote:
> > > >>>> I went through the string conversions. Do you have an example of
> > > >> writing
> > > >>>> out XML/JSON/etc too?
> > > >>>>
> > > >>>> On Tue, Nov 29, 2016 at 3:46 PM Jean-Baptiste Onofré <
> > jb@nanthrax.net
> > > >
> > > >>>> wrote:
> > > >>>>
> > > >>>>> Hi Jesse,
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>> https://github.com/jbonofre/incubator-beam/tree/
> > DATAFORMAT/sdks/java/
> > > >> extensions/dataformat
> > > >>>>>
> > > >>>>> it's very simple and stupid and of course not complete at all (I
> > have
> > > >>>>> other commits but not merged as they need some polishing), but
> as I
> > > >>>>> said, it's a base of discussion.
> > > >>>>>
> > > >>>>> Regards
> > > >>>>> JB
> > > >>>>>
> > > >>>>> On 11/29/2016 09:23 PM, Jesse Anderson wrote:
> > > >>>>>> @jb Sounds good. Just let us know once you've pushed.
> > > >>>>>>
> > > >>>>>> On Tue, Nov 29, 2016 at 2:54 PM Jean-Baptiste Onofré <
> > > >> jb@nanthrax.net>
> > > >>>>>> wrote:
> > > >>>>>>
> > > >>>>>>> Good point Eugene.
> > > >>>>>>>
> > > >>>>>>> Right now, it's a DoFn collection to experiment a bit (a pure
> > > >>>>>>> extension). It's pretty stupid ;)
> > > >>>>>>>
> > > >>>>>>> But, you are right, depending the direction of such extension,
> it
> > > >>> could
> > > >>>>>>> cover more use cases (even if it's not my first intention ;)).
> > > >>>>>>>
> > > >>>>>>> Let me push the branch (pretty small) as an illustration, and
> in
> > > the
> > > >>>>>>> mean time, I'm preparing a document (more focused on the use
> > > cases).
> > > >>>>>>>
> > > >>>>>>> WDYT ?
> > > >>>>>>>
> > > >>>>>>> Regards
> > > >>>>>>> JB
> > > >>>>>>>
> > > >>>>>>> On 11/29/2016 08:47 PM, Eugene Kirpichov wrote:
> > > >>>>>>>> Hi JB,
> > > >>>>>>>> Depending on the scope of what you want to ultimately
> accomplish
> > > >> with
> > > >>>>>>> this
> > > >>>>>>>> extension, I think it may make sense to write a proposal
> > document
> > > >> and
> > > >>>>>>>> discuss it.
> > > >>>>>>>> If it's just a collection of utility DoFn's for various
> > > >> well-defined
> > > >>>>>>>> source/target format pairs, then that's probably not needed,
> but
> > > if
> > > >>>>> it's
> > > >>>>>>>> anything more, then I think it is.
> > > >>>>>>>> That will help avoid a lot of churn if people propose
> reasonable
> > > >>>>>>>> significant changes.
> > > >>>>>>>>
> > > >>>>>>>> On Tue, Nov 29, 2016 at 11:15 AM Jean-Baptiste Onofré <
> > > >>> jb@nanthrax.net
> > > >>>>>>
> > > >>>>>>>> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> By the way Jesse, I'm going to push my DATAFORMAT branch to my GitHub
> > > >>>>>>>>> and I will post on the dev mailing list when done.
> > > >>>>>>>>>
> > > >>>>>>>>> Regards
> > > >>>>>>>>> JB
> > > >>>>>>>>>
> > > >>>>>>>>> On 11/29/2016 07:01 PM, Jesse Anderson wrote:
> > > >>>>>>>>>> I want to bring this thread back up since we've had time to
> > > think
> > > >>>>> about
> > > >>>>>>>>> it
> > > >>>>>>>>>> more and make a plan.
> > > >>>>>>>>>>
> > > >>>>>>>>>> I think a format-specific converter will be a more time-consuming task
> > > >>>>>>>>>> than we originally thought. It'd have to be a writer that takes another
> > > >>>>>>>>>> writer as a parameter.
> > > >>>>>>>>>>
> > > >>>>>>>>>> I think a string converter can be done as a simple
> transform.
> > > >>>>>>>>>>
> > > >>>>>>>>>> I think we should start with a simple string converter and
> > plan
> > > >>> for a
> > > >>>>>>>>>> format-specific writer.
> > > >>>>>>>>>>
> > > >>>>>>>>>> What are your thoughts?
> > > >>>>>>>>>>
> > > >>>>>>>>>> Thanks,
> > > >>>>>>>>>>
> > > >>>>>>>>>> Jesse
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Thu, Nov 10, 2016 at 10:33 AM Jesse Anderson <
> > > >>>>> jesse@smokinghand.com
> > > >>>>>>>>
> > > >>>>>>>>>> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> I was thinking about what the outputs would look like last
> > > >> night. I
> > > >>>>>>>>>> realized that more complex formats like JSON and XML may or
> > may
> > > >> not
> > > >>>>>>>>> output
> > > >>>>>>>>>> the data in a valid format.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Doing a direct conversion on unbounded collections would
> work
> > > >> just
> > > >>>>>>> fine.
> > > >>>>>>>>>> They're self-contained. For writing out bounded collections,
> > > >> that's
> > > >>>>>>> where
> > > >>>>>>>>>> we'll hit the issues. This changes the uber conversion
> > transform
> > > >>>>> into a
> > > >>>>>>>>>> transform that needs to be a writer.
> > > >>>>>>>>>>
> > > >>>>>>>>>> If a transform executes a JSON conversion on a per element
> > > basis,
> > > >>>>> we'd
> > > >>>>>>>>> get
> > > >>>>>>>>>> this:
> > > >>>>>>>>>> {
> > > >>>>>>>>>> "key": "value"
> > > >>>>>>>>>> }, {
> > > >>>>>>>>>> "key": "value"
> > > >>>>>>>>>> },
> > > >>>>>>>>>>
> > > >>>>>>>>>> That isn't valid JSON.
> > > >>>>>>>>>>
> > > >>>>>>>>>> The conversion transform would need to do several things when writing
> > > >>>>>>>>>> out a file. It would need to add brackets for an array. Now we have:
> > > >>>>>>>>>> [
> > > >>>>>>>>>> {
> > > >>>>>>>>>> "key": "value"
> > > >>>>>>>>>> }, {
> > > >>>>>>>>>> "key": "value"
> > > >>>>>>>>>> },
> > > >>>>>>>>>> ]
> > > >>>>>>>>>>
> > > >>>>>>>>>> We still don't have valid JSON. We have to remove the last
> > comma
> > > >> or
> > > >>>>>>> have
> > > >>>>>>>>>> the uber transform start putting in the commas, except for
> the
> > > >> last
> > > >>>>>>>>> element.
> > > >>>>>>>>>>
> > > >>>>>>>>>> [
> > > >>>>>>>>>> {
> > > >>>>>>>>>> "key": "value"
> > > >>>>>>>>>> }, {
> > > >>>>>>>>>> "key": "value"
> > > >>>>>>>>>> }
> > > >>>>>>>>>> ]
> > > >>>>>>>>>>
> > > >>>>>>>>>> Only by doing this do we have valid JSON.
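
As a rough illustration of the bracket-and-comma handling described above, here is a minimal plain-Java sketch. It does not use Beam at all. The class name JsonArrayJoin and the method toJsonArray are made up for this example, and it assumes each element has already been serialized to a JSON object string.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper, for illustration only (not part of Beam).
class JsonArrayJoin {
    // Wraps already-serialized JSON objects in brackets and puts a comma
    // between elements only, so there is no trailing comma.
    static String toJsonArray(List<String> jsonObjects) {
        StringJoiner joiner = new StringJoiner(", ", "[", "]");
        for (String json : jsonObjects) {
            joiner.add(json);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        List<String> elements =
            Arrays.asList("{\"key\": \"value\"}", "{\"key\": \"value\"}");
        System.out.println(toJsonArray(elements));
    }
}
```

The point of the sketch is that the brackets and separators belong to the collection as a whole rather than to any single element, which is why a purely per-element conversion cannot produce them.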
> > > >>>>>>>>>>
> > > >>>>>>>>>> I'd argue we'd have a similar issue with XML. Some parsers
> > > >> require
> > > >>> a
> > > >>>>>>> root
> > > >>>>>>>>>> element for everything. The uber transform would have to put
> > the
> > > >>> root
> > > >>>>>>>>>> element tags at the beginning and end of the file.
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Wed, Nov 9, 2016 at 11:36 PM Manu Zhang <
> > > >>> owenzhang1990@gmail.com>
> > > >>>>>>>>> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> I would love to see a lean core and abundant Transforms at
> the
> > > >> same
> > > >>>>>>> time.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Maybe we can look at what Confluent <
> > > >>> https://github.com/confluentinc
> > > >>>>>>
> > > >>>>>>>>> does
> > > >>>>>>>>>> for kafka-connect. They have official extensions support for
> > > >> JDBC,
> > > >>>>> HDFS
> > > >>>>>>>>> and
> > > >>>>>>>>>> ElasticSearch under https://github.com/confluentinc. They
> put
> > > >> them
> > > >>>>>>> along
> > > >>>>>>>>>> with other community extensions on
> > > >>>>>>>>>> https://www.confluent.io/product/connectors/ for
> visibility.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Although we are not a commercial company, can we have a GitHub user like
> > > >>>>>>>>>> beam-community to host projects we build around Beam that are not suitable
> > > >>>>>>>>>> for https://github.com/apache/incubator-beam? In the future, we may have
> > > >>>>>>>>>> beam-algebra (like http://github.com/twitter/algebird) for algebra
> > > >>>>>>>>>> operations and beam-ml / beam-dl for machine learning / deep learning.
> > > >>>>>>>>>> Also, there will be Beam-related projects elsewhere maintained by other
> > > >>>>>>>>>> communities. We can list all of them on the Beam website, or somewhere
> > > >>>>>>>>>> like Spark Packages, as mentioned by Amit.
> > > >>>>>>>>>>
> > > >>>>>>>>>> My $0.02
> > > >>>>>>>>>> Manu
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles
> > > >>>>> <klk@google.com.invalid
> > > >>>>>>>>
> > > >>>>>>>>>> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>>> On this point from Amit and Ismaël, I agree: we could
> benefit
> > > >>> from a
> > > >>>>>>>>> place
> > > >>>>>>>>>>> for miscellaneous non-core helper transformations.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> We have sdks/java/extensions but it is organized as separate artifacts. I
> > > >>>>>>>>>>> think that is fine, considering the nature of Join and SortValues. But for
> > > >>>>>>>>>>> simpler transforms, importing one artifact per tiny transform is too much
> > > >>>>>>>>>>> overhead. It also seems unlikely that we will have enough commonality among
> > > >>>>>>>>>>> the transforms to call the artifact anything other than [some synonym for]
> > > >>>>>>>>>>> "miscellaneous".
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> I wouldn't want to take this too far - even though the SDK has many
> > > >>>>>>>>>>> transforms* that are not required for the model [1], I like that the SDK
> > > >>>>>>>>>>> artifact has everything a user might need in their "getting started" phase
> > > >>>>>>>>>>> of use. This user-friendliness (the user doesn't care that ParDo is core and
> > > >>>>>>>>>>> Sum is not), plus the difficulty of judging which transforms go where, are
> > > >>>>>>>>>>> probably why we have them mostly all in one place.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Models to look at, off the top of my head, include Pig's
> > > >> PiggyBank
> > > >>>>> and
> > > >>>>>>>>>>> Apex's Malhar. These have different levels of support
> > implied.
> > > >>>>> Others?
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Kenn
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> [1] ApproximateQuantiles, ApproximateUnique, Count, Distinct, Filter,
> > > >>>>>>>>>>> FlatMapElements, Keys, Latest, MapElements, Max, Mean, Min, Values,
> > > >>>>>>>>>>> KvSwap, Partition, Regex, Sample, Sum, Top, WithKeys, WithTimestamps
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> * at least they are separate classes and not methods on
> > > >>> PCollection
> > > >>>>>>> :-)
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On Wed, Nov 9, 2016 at 6:03 AM, Ismaël Mejía <
> > > iemejia@gmail.com
> > > >>>
> > > >>>>>>> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> Nice discussion, and thanks Jesse for bringing this
> subject
> > > >>> back.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> I agree 100% with Amit and the idea of having a home for
> > those
> > > >>>>>>>>>> transforms
> > > >>>>>>>>>>>> that are not core enough to be part of the sdk, but that
> we
> > > all
> > > >>> end
> > > >>>>>>> up
> > > >>>>>>>>>>>> re-writing somehow.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> This is a needed improvement to be more developer
> friendly,
> > > but
> > > >>>>> also
> > > >>>>>>> as
> > > >>>>>>>>>> a
> > > >>>>>>>>>>>> reference of good practices of Beam development, and for
> > this
> > > >>>>> reason
> > > >>>>>>> I
> > > >>>>>>>>>>>> agree with JB that at this moment it would be better for
> > these
> > > >>>>>>>>>> transforms
> > > >>>>>>>>>>>> to reside in the Beam repository at least for visibility
> > > >> reasons.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> One additional question is if these transforms represent a
> > > >>>>> different
> > > >>>>>>>>> DSL
> > > >>>>>>>>>>> or
> > > >>>>>>>>>>>> if those could be grouped with the current extensions
> (e.g.
> > > >> Join
> > > >>>>> and
> > > >>>>>>>>>>>> SortValues) into something more general that we as a
> > community
> > > >>>>> could
> > > >>>>>>>>>>>> maintain, but well even if it is not the case, it would be
> > > >> really
> > > >>>>>>> nice
> > > >>>>>>>>>> to
> > > >>>>>>>>>>>> start working on something like this.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Ismaël Mejía
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré <
> > > >>>>>>> jb@nanthrax.net
> > > >>>>>>>>>>
> > > >>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> Related to spark-package, we also have Apache Bahir to
> host
> > > >>>>>>>>>>>>> connectors/transforms for Spark and Flink.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> IMHO, right now, Beam should host this, though I'm not sure it makes
> > > >>>>>>>>>>>>> sense directly in the core.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> It reminds me of the "Integration" DSL we discussed in the technical
> > > >>>>>>>>>>>>> vision document.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Regards
> > > >>>>>>>>>>>>> JB
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> On 11/09/2016 11:17 AM, Amit Sela wrote:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> I think Jesse has a very good point on one hand, while Luke's and
> > > >>>>>>>>>>>>>> Kenneth's worries about committing users to specific implementations
> > > >>>>>>>>>>>>>> are valid.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> The Spark community has a 3rd party repository for
> useful
> > > >>>>> libraries
> > > >>>>>>>>>>> that
> > > >>>>>>>>>>>>>> for various reasons are not a part of the Apache Spark
> > > >> project:
> > > >>>>>>>>>>>>>> https://spark-packages.org/.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Maybe a "common-transformations" package would serve both users' quick
> > > >>>>>>>>>>>>>> ramp-up and ease of use while keeping Beam more "enabling"?
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles
> > > >>>>>>>>>> <klk@google.com.invalid
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> It seems useful for small scale debugging / demoing to
> > have
> > > >>>>>>>>>>>>>>> Dump.toString(). I think it should be named to clearly
> > > >>> indicate
> > > >>>>>>> its
> > > >>>>>>>>>>>>>>> limited
> > > >>>>>>>>>>>>>>> scope. Maybe other stuff could go in the Dump
> namespace,
> > > but
> > > >>>>>>>>>>>>>>> "Dump.toJson()" would be for humans to read - so it
> > should
> > > >> be
> > > >>>>>>> pretty
> > > >>>>>>>>>>>>>>> printed, not treated as a machine-to-machine wire
> format.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> The broader question of representing data in JSON or
> XML,
> > > >> etc,
> > > >>>>> is
> > > >>>>>>>>>>>> already
> > > >>>>>>>>>>>>>>> the subject of many mature libraries which are already
> > easy
> > > >> to
> > > >>>>> use
> > > >>>>>>>>>>> with
> > > >>>>>>>>>>>>>>> Beam.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> The more esoteric practice of implicit or semi-implicit
> > > >>>>> coercions
> > > >>>>>>>>>>> seems
> > > >>>>>>>>>>>>>>> like it is also already addressed in many ways
> elsewhere.
> > > >>>>>>>>>>>>>>> Transform.via(TypeConverter) is basically the same as
> > > >>>>>>>>>>>>>>> MapElements.via(<lambda>) and also easy to use with
> Beam.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> In both of the last cases, there are many reasonable
> > > >>> approaches,
> > > >>>>>>> and
> > > >>>>>>>>>>> we
> > > >>>>>>>>>>>>>>> shouldn't commit our users to one of them.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik
> > > >>>>>>>>>>> <lcwik@google.com.invalid
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> The suggestions you give seem good except for the XML cases.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Might want to have the XML be a document per line
> > similar
> > > >> to
> > > >>>>> the
> > > >>>>>>>>>>> JSON
> > > >>>>>>>>>>>>>>>> examples you have been giving.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson <
> > > >>>>>>>>>>>> jesse@smokinghand.com>
> > > >>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> @lukasz Agreed there would have to be KV handling. I was more thinking
> > > >>>>>>>>>>>>>>>>> that whatever the addition, it shouldn't just handle KV. It should
> > > >>>>>>>>>>>>>>>>> handle Iterables, Lists, Sets, and KVs.
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> For JSON and XML, I wonder if we'd be able to give someone something
> > > >>>>>>>>>>>>>>>>> general purpose enough, or if you would just end up writing your own
> > > >>>>>>>>>>>>>>>>> code to handle it anyway.
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Here are some ideas on what it could look like with a
> > > >> method
> > > >>>>> and
> > > >>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>> resulting string output:
> > > >>>>>>>>>>>>>>>>> *Stringify.toJSON()*
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> With KV:
> > > >>>>>>>>>>>>>>>>> {"key": "value"}
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> With Iterables:
> > > >>>>>>>>>>>>>>>>> ["one", "two", "three"]
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> *Stringify.toXML("rootelement")*
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> With KV:
> > > >>>>>>>>>>>>>>>>> <rootelement key=value />
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> With Iterables:
> > > >>>>>>>>>>>>>>>>> <rootelement>
> > > >>>>>>>>>>>>>>>>>   <item>one</item>
> > > >>>>>>>>>>>>>>>>>   <item>two</item>
> > > >>>>>>>>>>>>>>>>>   <item>three</item>
> > > >>>>>>>>>>>>>>>>> </rootelement>
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> *Stringify.toDelimited(",")*
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> With KV:
> > > >>>>>>>>>>>>>>>>> key,value
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> With Iterables:
> > > >>>>>>>>>>>>>>>>> one,two,three
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Do you think that would strike a good balance between
> > > >>> reusable
> > > >>>>>>>>>> code
> > > >>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>>> writing your own for more difficult formatting?
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Jesse
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik
> > > >>>>>>>>>>> <lcwik@google.com.invalid
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Jesse, I believe if one format gets special treatment in TextIO, people
> > > >>>>>>>>>>>>>>>>> will then ask why JSON, XML, ... are not also supported.
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Also, the example that you provide relies on the fact that the input
> > > >>>>>>>>>>>>>>>>> format is an Iterable<Item>. You had posted a question about using KV
> > > >>>>>>>>>>>>>>>>> with TextIO.Write, which wouldn't align with the proposed input format
> > > >>>>>>>>>>>>>>>>> and would still require writing a type conversion function, this time
> > > >>>>>>>>>>>>>>>>> from KV to Iterable<Item> instead of KV to String.
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson <
> > > >>>>>>>>>>>> jesse@smokinghand.com>
> > > >>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Lukasz,
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> I don't think you'd need complicated logic for TextIO.Write. For CSV
> > > >>>>>>>>>>>>>>>>>> the call would look like:
> > > >>>>>>>>>>>>>>>>>> Stringify.to("", ",", "\n");
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Where the arguments would be Stringify.to(prefix, delimiter, suffix).
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> The code would be something like:
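> > > >>>>>>>>>>>>>>>>>> StringBuilder buffer = new StringBuilder(prefix);
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> // Iterate explicitly (java.util.Iterator) so we know whether
> > > >>>>>>>>>>>>>>>>>> // another item follows the current one.
> > > >>>>>>>>>>>>>>>>>> Iterator<Item> items = list.iterator();
> > > >>>>>>>>>>>>>>>>>> while (items.hasNext()) {
> > > >>>>>>>>>>>>>>>>>>   buffer.append(items.next().toString());
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>   // Append the delimiter after every item except the last.
> > > >>>>>>>>>>>>>>>>>>   if (items.hasNext()) {
> > > >>>>>>>>>>>>>>>>>>     buffer.append(delimiter);
> > > >>>>>>>>>>>>>>>>>>   }
> > > >>>>>>>>>>>>>>>>>> }
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> buffer.append(suffix);
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> c.output(buffer.toString());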
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> That would allow you to do the basic CSV, TSV, and other formats without
> > > >>>>>>>>>>>>>>>>>> complicated logic. The same sort of thing could be done for TextIO.Write.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Jesse
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik
> > > >>>>>>>>>>>> <lcwik@google.com.invalid
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> The conversion from object to string will have uses outside of just
> > > >>>>>>>>>>>>>>>>>>> TextIO.Write so it seems logical that we would want to have a ParDo do
> > > >>>>>>>>>>>>>>>>>>> the conversion.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Text file formats have a lot of variance, even if you consider the
> > > >>>>>>>>>>>>>>>>>>> subset of CSV like formats where it could have fixed width fields, or
> > > >>>>>>>>>>>>>>>>>>> escaping and quoting around other fields, or headers that should be
> > > >>>>>>>>>>>>>>>>>>> placed at the top.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Having all these format conversions within TextIO.Write seems like a
> > > >>>>>>>>>>>>>>>>>>> lot of logic to contain in that transform which should just focus on
> > > >>>>>>>>>>>>>>>>>>> writing to files.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 8:15 AM, Jesse Anderson <
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> jesse@smokinghand.com>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> This is a thread moved over from the user mailing
> > list.
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> I think there needs to be a way to convert a PCollection<KV> to a
> > > >>>>>>>>>>>>>>>>>>>> PCollection<String>.
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> To do a minimal WordCount, you have to manually convert the KV to a
> > > >>>>>>>>>>>>>>>>>>>> String:
> > > >>>>>>>>>>>>>>>>>>>>         p
> > > >>>>>>>>>>>>>>>>>>>>                 .apply(TextIO.Read.from("playing_cards.tsv"))
> > > >>>>>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> > > >>>>>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> > > >>>>>>>>>>>>>>>>>>>> *                .apply(MapElements.via((KV<String, Long> count) ->*
> > > >>>>>>>>>>>>>>>>>>>> *                            count.getKey() + ":" + count.getValue()*
> > > >>>>>>>>>>>>>>>>>>>> *                        ).withOutputType(TypeDescriptors.strings()))*
> > > >>>>>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to("output/stringcounts"));
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> This code really should be something like:
> > > >>>>>>>>>>>>>>>>>>>>         p
> > > >>>>>>>>>>>>>>>>>>>>                 .apply(TextIO.Read.from("playing_cards.tsv"))
> > > >>>>>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> > > >>>>>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> > > >>>>>>>>>>>>>>>>>>>> *                .apply(ToString.stringify())*
> > > >>>>>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to("output/stringcounts"));
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> To summarize the discussion:
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>    - JA: Add a method to StringDelegateCoder to output any KV or list
> > > >>>>>>>>>>>>>>>>>>>>    - JA and DH: Add a SimpleFunction that takes any type and runs
> > > >>>>>>>>>>>>>>>>>>>>    toString() on it:
> > > >>>>>>>>>>>>>>>>>>>>    class ToStringFn<InputT> extends SimpleFunction<InputT, String> {
> > > >>>>>>>>>>>>>>>>>>>>        // An instance method: a static method cannot use InputT.
> > > >>>>>>>>>>>>>>>>>>>>        @Override
> > > >>>>>>>>>>>>>>>>>>>>        public String apply(InputT input) {
> > > >>>>>>>>>>>>>>>>>>>>            return input.toString();
> > > >>>>>>>>>>>>>>>>>>>>        }
> > > >>>>>>>>>>>>>>>>>>>>    }
> > > >>>>>>>>>>>>>>>>>>>>    - JB: Add a general purpose type converter like in Apache Camel.
> > > >>>>>>>>>>>>>>>>>>>>    - JA: Add Object support to TextIO.Write that would write out the
> > > >>>>>>>>>>>>>>>>>>>>    toString of any Object.
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> My thoughts:
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Is converting to a PCollection<String> mostly needed when you're using
> > > >>>>>>>>>>>>>>>>>>>> TextIO.Write? Will a general purpose transform only work in certain
> > > >>>>>>>>>>>>>>>>>>>> cases, so that you'll normally have to write custom code to format the
> > > >>>>>>>>>>>>>>>>>>>> strings the way you want them?
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> IMHO, it's yes to both. I'd prefer to add Object support to
> > > >>>>>>>>>>>>>>>>>>>> TextIO.Write or a SimpleFunction that takes a delimiter as an argument.
> > > >>>>>>>>>>>>>>>>>>>> Making a SimpleFunction that's able to specify a delimiter (and perhaps
> > > >>>>>>>>>>>>>>>>>>>> a prefix and suffix) should cover the majority of formats and cases.
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Jesse
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> --
> > > >>>>>>>>>>>>> Jean-Baptiste Onofré
> > > >>>>>>>>>>>>> jbonofre@apache.org
> > > >>>>>>>>>>>>> http://blog.nanthrax.net
> > > >>>>>>>>>>>>> Talend - http://www.talend.com

Re: PCollection to PCollection Conversion

Posted by Vikas Kedigehalli <vi...@gmail.com>.

Thanks all for the inputs. Nice to see such active discussion on this. I
agree with the proposal, will update the PR.

-Vikas

On Thu, Dec 29, 2016 at 4:03 PM, Jesse Anderson <je...@smokinghand.com>
wrote:

> Sounds good to me too. @vikas can you start modifying the PR's code to
> clean it up and make it more future-proof for now? That is, make `ToString`
> itself not a PTransform; instead, ToString.create() returns ToString.Default,
> which is a private class implementing what ToString is now (PTransform<T,
> String>, wrapping MapElements).
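
A minimal sketch of the shape being proposed here, written in plain Java rather than against the Beam SDK. java.util.function.Function stands in for PTransform<T, String>, and the names ToStringSketch and Default are assumptions for illustration only; this is not the code in the PR.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java stand-in: Function<T, String> plays the role of PTransform<T, String>.
final class ToStringSketch {
    private ToStringSketch() {} // holder for factory methods only, never instantiated

    // Factory method. Callers only see the returned interface type, so other
    // factories (delimited, formatted, ...) could be added next to it later.
    static <T> Function<T, String> create() {
        return new Default<>();
    }

    // Private implementation: calls toString() on each element.
    private static final class Default<T> implements Function<T, String> {
        @Override
        public String apply(T input) {
            return input.toString();
        }
    }

    public static void main(String[] args) {
        List<Integer> counts = Arrays.asList(1, 2, 3);
        List<String> strings = counts.stream()
            .map(ToStringSketch.<Integer>create())
            .collect(Collectors.toList());
        System.out.println(strings); // prints [1, 2, 3]
    }
}
```

Because the outer class is not itself the transform, the implementation class can be renamed or split later without changing what callers write.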
>
> On Thu, Dec 29, 2016 at 4:00 PM Ben Chambers <bchambers@google.com.invalid
> >
> wrote:
>
> > Dan's proposal to move forward with a simple (future-proofed) version of
> > the ToString transform and Javadoc, and add specific features via
> follow-up
> > PRs.
> >
> > On Thu, Dec 29, 2016 at 3:53 PM Jesse Anderson <je...@smokinghand.com>
> > wrote:
> >
> > > @Ben which idea do you like?
> > >
> > > On Thu, Dec 29, 2016 at 3:20 PM Ben Chambers
> > <bchambers@google.com.invalid
> > > >
> > > wrote:
> > >
> > > > I like that idea, with the caveat that we should probably come up
> with
> > a
> > > > better name. Perhaps "ToString.elements()" and ToString.Elements or
> > > > something? Calling one the "default" and using "create" for it seems
> > > > moderately non-future proof.
> > > >
> > > > On Thu, Dec 29, 2016 at 3:17 PM Dan Halperin
> > <dhalperi@google.com.invalid
> > > >
> > > > wrote:
> > > >
> > > > > On Thu, Dec 29, 2016 at 2:10 PM, Jesse Anderson <
> > jesse@smokinghand.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > I agree MapElements isn't hard to use. I think there is a demand
> > for
> > > > this
> > > > > > built-in conversion.
> > > > > >
> > > > > > My thought on the formatter is that, worst case, we could do
> > runtime
> > > > type
> > > > > > checking. It would be ugly and not as performant, but it should
> > work.
> > > > As
> > > > > > we've said, we'd point them to MapElements for better code. We'd
> > > write
> > > > > the
> > > > > > JavaDoc accordingly.
> > > > > >
> > > > >
> > > > > I think it will be good to see these proposals in PR form. I would
> > stay
> > > > far
> > > > > away from reflection and varargs if possible, but properly-typed
> bits
> > > of
> > > > > code (possibly exposed as SerializableFunctions in ToString?) would
> > > > > probably make sense.
> > > > >
> > > > > In the short-term, I can't find anyone arguing against a
> > > > ToString.create()
> > > > > that simply does input.toString().
> > > > >
> > > > > To get started, how about we ask Vikas to clean up the PR to be
> more
> > > > > future-proof for now? Aka make `ToString` itself not a PTransform,
> > but
> > > > > instead ToString.create() returns ToString.Default which is a
> private
> > > > class
> > > > > implementing what ToString is now (PTransform<T, String>, wrapping
> > > > > MapElements).
> > > > >
> > > > > Then we can send PRs adding new features to that.
> > > > >
> > > > > IME and to Ben's point, these will mostly be used in development.
> > Some
> > > of
> > > > > > our assumptions will break down when programmers aren't the ones
> > > using
> > > > > > Beam. I can see from the user traffic already that not everyone
> > using
> > > > > Beam
> > > > > > is a programmer and they'll need classes like this to be
> > productive.
> > > > >
> > > > >
> > > > > > On Thu, Dec 29, 2016 at 1:46 PM Dan Halperin
> > > > <dhalperi@google.com.invalid
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > On Thu, Dec 29, 2016 at 1:36 PM, Jesse Anderson <
> > > jesse@smokinghand.com
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > I prefer JB's take. I think there should be three overloaded
> > > methods
> > > > on
> > > > > > the
> > > > > > > class. I like Vikas' name ToString. The methods for a simple
> > > > conversion
> > > > > > > should be:
> > > > > > >
> > > > > > > ToString.strings() - Outputs the .toString() of the objects in
> > the
> > > > > > > PCollection
> > > > > > > ToString.strings(String delimiter) - Outputs the .toString() of
> > > KVs,
> > > > > > Lists,
> > > > > > > etc with the delimiter between every entry
> > > > > > > ToString.formatted(String format) - Outputs the formatted
> > > > > > > <
> > > https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html>
> > > > > > > string
> > > > > > > with the object passed in. For objects made up of different
> parts
> > > > like
> > > > > > KVs,
> > > > > > > each one is passed in as separate toString() of a varargs.
> > > > > > >
> > > > > >
> > > > > > Riffing a little, with some types:
> > > > > >
> > > > > > ToString.<T>of() -- PTransform<T, String> that is equivalent to a
> > > ParDo
> > > > > > that takes in a T and outputs T.toString().
> > > > > >
> > > > > > ToString.<K,V>kv(String delimiter) -- PTransform<KV<K, V>,
> String>
> > > that
> > > > > is
> > > > > > equivalent to a ParDo that takes in a KV<K,V> and outputs
> > > > > > kv.getKey().toString() + delimiter + kv.getValue().toString()
> > > > > >
> > > > > > ToString.<T>iterable(String delimiter) -- PTransform<? extends
> > > > > Iterable<T>,
> > > > > > String> that is equivalent to a ParDo that takes in an
> Iterable<T>
> > > and
> > > > > > outputs the iterable[0] + delimiter + iterable[1] + delimiter +
> > ... +
> > > > > > delimiter + iterable[N-1]
> > > > > >
> > > > > > ToString.<T>custom(SerializableFunction<T, String> formatter) ?
> > > > > >
> > > > > > > The last one is just MapElements.via, except you don't need to set
> > the
> > > > > > output type.
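The per-element behaviour of the of / kv / iterable variants described above can be sketched in plain Java. This is only an illustration under assumptions: ToStringSketch is a hypothetical name, java.util.function.Function stands in for the function a transform would wrap, and Map.Entry stands in for Beam's KV. None of this is the actual Beam API.

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.function.Function;

// Hypothetical plain-Java stand-in: each method returns the per-element
// function that the corresponding proposed transform would apply.
final class ToStringSketch {
  private ToStringSketch() {}

  // Mirrors the proposed of(): element -> element.toString().
  static <T> Function<T, String> of() {
    return element -> element.toString();
  }

  // Mirrors the proposed kv(delimiter): key + delimiter + value.
  // Map.Entry is used here in place of Beam's KV.
  static <K, V> Function<Map.Entry<K, V>, String> kv(String delimiter) {
    return kv -> kv.getKey().toString() + delimiter + kv.getValue().toString();
  }

  // Mirrors the proposed iterable(delimiter): items joined by the delimiter.
  static <T> Function<Iterable<T>, String> iterable(String delimiter) {
    return items -> {
      StringJoiner joiner = new StringJoiner(delimiter);
      for (T item : items) {
        joiner.add(String.valueOf(item));
      }
      return joiner.toString();
    };
  }
}
```

In this sketch an empty iterable produces an empty string, which is one of the configuration questions raised in the thread.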
> > > > > >
> > > > > > I don't see a way to make the generic .formatted() that you
> propose
> > > > that
> > > > > > just works with anything "made of different parts".
> > > > > >
> > > > > > > I think adding too many overrides beyond "of" and "custom" is opening
> > > > > > > up a Pandora's Box. The KV one might want to have left and right
> > > > > > delimiters, might want to take custom formatters for K and V,
> etc.
> > > etc.
> > > > > The
> > > > > > iterable one might want to have a special configuration for an
> > empty
> > > > > > iterable. So I'm inclined towards simplicity with the awareness
> > that
> > > > > > MapElements.via is just not that hard to use.
> > > > > >
> > > > > > Dan
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > I think doing these three methods would cover every simple and
> > > > advanced
> > > > > > > "simple conversions." As JB says, we'll need other specific
> > > > converters
> > > > > > for
> > > > > > > other formats like XML.
> > > > > > >
> > > > > > > I'd really like to see this class in the next version of Beam.
> > What
> > > > > does
> > > > > > > everyone think of the class name, methods name, and method
> > > operations
> > > > > so
> > > > > > we
> > > > > > > can have Vikas finish up?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jesse
> > > > > > >
> > > > > > > On Wed, Dec 28, 2016 at 12:28 PM Jean-Baptiste Onofré <
> > > > jb@nanthrax.net
> > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Vikas,
> > > > > > > >
> > > > > > > > > did you take a look at:
> > > > > > > >
> > > > > > > >
> > > > > > > > https://github.com/jbonofre/beam/tree/DATAFORMAT/sdks/
> > > > > > > java/extensions/dataformat
> > > > > > > >
> > > > > > > > You can see KV2String and ToString could be part of this
> > > extension.
> > > > > > > > I'm also using JAXB for XML and Jackson for JSON
> > > > > > > > marshalling/unmarshalling. I'm planning to deal with Avro
> > > > > > > (IndexedRecord).
> > > > > > > >
> > > > > > > > Regards
> > > > > > > > JB
> > > > > > > >
> > > > > > > > On 12/28/2016 08:37 PM, Vikas Kedigehalli wrote:
> > > > > > > > > Hi All,
> > > > > > > > >
> > > > > > > > >   Not being aware of the discussion here, I sent out a PR
> > > > > > > > > <https://github.com/apache/beam/pull/1704> but JB and
> others
> > > > > > directed
> > > > > > > > me to
> > > > > > > > > this thread. Having converted PCollection<T> to
> > > > PCollection<String>
> > > > > > > > several
> > > > > > > > > times, I feel something like 'ToString' transform is common
> > > > enough
> > > > > to
> > > > > > > be
> > > > > > > > > part of the core. What do you all think?
> > > > > > > > >
> > > > > > > > > Also, if someone else is already working on or interested
> in
> > > > > tackling
> > > > > > > > this,
> > > > > > > > > then I am happy to discard the PR.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Vikas
> > > > > > > > >
> > > > > > > > > On Tue, Dec 13, 2016 at 1:56 AM, Amit Sela <
> > > amitsela33@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > >> It seems that there were a lot of good points raised here,
> > > and I
> > > > > > tend
> > > > > > > to
> > > > > > > > >> agree that something as trivial and lean as "ToString"
> > should
> > > > be a
> > > > > > > part
> > > > > > > > of
> > > > > > > > > >> core.
> > > > > > > > >> I'm particularly fond of makeString(prefix, toString,
> > suffix)
> > > in
> > > > > > > various
> > > > > > > > >> combinations (Scala-like).
> > > > > > > > >> For "fromString", I think JB has a good point leveraging
> > JAXB
> > > > and
> > > > > > > > Jackson -
> > > > > > > > >> though I think this should be in extensions as it is not
> as
> > > lean
> > > > > as
> > > > > > > > >> toString.
> > > > > > > > >>
> > > > > > > > >> Thanks,
> > > > > > > > >> Amit
> > > > > > > > >>
> > > > > > > > >> On Wed, Nov 30, 2016 at 5:13 AM Jean-Baptiste Onofré <
> > > > > > jb@nanthrax.net
> > > > > > > >
> > > > > > > > >> wrote:
> > > > > > > > >>
> > > > > > > > >>> Hi Jesse,
> > > > > > > > >>>
> > > > > > > > >>> yes, I started something there (using JAXB and Jackson).
> > Let
> > > me
> > > > > > > polish
> > > > > > > > >>> and push.
> > > > > > > > >>>
> > > > > > > > >>> Regards
> > > > > > > > >>> JB
> > > > > > > > >>>
> > > > > > > > >>> On 11/29/2016 10:00 PM, Jesse Anderson wrote:
> > > > > > > > >>>> I went through the string conversions. Do you have an
> > > example
> > > > of
> > > > > > > > >> writing
> > > > > > > > >>>> out XML/JSON/etc too?
> > > > > > > > >>>>
> > > > > > > > >>>> On Tue, Nov 29, 2016 at 3:46 PM Jean-Baptiste Onofré <
> > > > > > > jb@nanthrax.net
> > > > > > > > >
> > > > > > > > >>>> wrote:
> > > > > > > > >>>>
> > > > > > > > >>>>> Hi Jesse,
> > > > > > > > >>>>>
> > > > > > > > >>>>>
> > > > > > > > >>>>>
> > > > > > > > >>> https://github.com/jbonofre/incubator-beam/tree/
> > > > > > > DATAFORMAT/sdks/java/
> > > > > > > > >> extensions/dataformat
> > > > > > > > >>>>>
> > > > > > > > >>>>> it's very simple and stupid and of course not complete
> at
> > > all
> > > > > (I
> > > > > > > have
> > > > > > > > >>>>> other commits but not merged as they need some
> > polishing),
> > > > but
> > > > > as
> > > > > > I
> > > > > > > > >>>>> said, it's a base of discussion.
> > > > > > > > >>>>>
> > > > > > > > >>>>> Regards
> > > > > > > > >>>>> JB
> > > > > > > > >>>>>
> > > > > > > > >>>>> On 11/29/2016 09:23 PM, Jesse Anderson wrote:
> > > > > > > > >>>>>> @jb Sounds good. Just let us know once you've pushed.
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> On Tue, Nov 29, 2016 at 2:54 PM Jean-Baptiste Onofré <
> > > > > > > > >> jb@nanthrax.net>
> > > > > > > > >>>>>> wrote:
> > > > > > > > >>>>>>
> > > > > > > > >>>>>>> Good point Eugene.
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> Right now, it's a DoFn collection to experiment a bit
> > (a
> > > > pure
> > > > > > > > >>>>>>> extension). It's pretty stupid ;)
> > > > > > > > >>>>>>>
> > > > > > > > > >>>>>>> But, you are right, depending on the direction of such
> > > > > extension,
> > > > > > it
> > > > > > > > >>> could
> > > > > > > > >>>>>>> cover more use cases (even if it's not my first
> > intention
> > > > > ;)).
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> Let me push the branch (pretty small) as an
> > illustration,
> > > > and
> > > > > > in
> > > > > > > > the
> > > > > > > > >>>>>>> mean time, I'm preparing a document (more focused on
> > the
> > > > use
> > > > > > > > cases).
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> WDYT ?
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> Regards
> > > > > > > > >>>>>>> JB
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> On 11/29/2016 08:47 PM, Eugene Kirpichov wrote:
> > > > > > > > >>>>>>>> Hi JB,
> > > > > > > > >>>>>>>> Depending on the scope of what you want to
> ultimately
> > > > > > accomplish
> > > > > > > > >> with
> > > > > > > > >>>>>>> this
> > > > > > > > >>>>>>>> extension, I think it may make sense to write a
> > proposal
> > > > > > > document
> > > > > > > > >> and
> > > > > > > > >>>>>>>> discuss it.
> > > > > > > > >>>>>>>> If it's just a collection of utility DoFn's for
> > various
> > > > > > > > >> well-defined
> > > > > > > > >>>>>>>> source/target format pairs, then that's probably not
> > > > needed,
> > > > > > but
> > > > > > > > if
> > > > > > > > >>>>> it's
> > > > > > > > >>>>>>>> anything more, then I think it is.
> > > > > > > > >>>>>>>> That will help avoid a lot of churn if people
> propose
> > > > > > reasonable
> > > > > > > > >>>>>>>> significant changes.
> > > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>> On Tue, Nov 29, 2016 at 11:15 AM Jean-Baptiste
> Onofré
> > <
> > > > > > > > >>> jb@nanthrax.net
> > > > > > > > >>>>>>
> > > > > > > > >>>>>>>> wrote:
> > > > > > > > >>>>>>>>
> > > > > > > > > >>>>>>>>> By the way Jesse, I'm going to push my DATAFORMAT branch
> > on
> > > my
> > > > > > > github
> > > > > > > > >>> and I
> > > > > > > > >>>>>>>>> will post on the dev mailing list when done.
> > > > > > > > >>>>>>>>>
> > > > > > > > >>>>>>>>> Regards
> > > > > > > > >>>>>>>>> JB
> > > > > > > > >>>>>>>>>
> > > > > > > > >>>>>>>>> On 11/29/2016 07:01 PM, Jesse Anderson wrote:
> > > > > > > > >>>>>>>>>> I want to bring this thread back up since we've
> had
> > > time
> > > > > to
> > > > > > > > think
> > > > > > > > >>>>> about
> > > > > > > > >>>>>>>>> it
> > > > > > > > >>>>>>>>>> more and make a plan.
> > > > > > > > >>>>>>>>>>
> > > > > > > > > >>>>>>>>>> I think a format-specific converter will be a more
> > time
> > > > > > > consuming
> > > > > > > > >>> task
> > > > > > > > >>>>>>> than
> > > > > > > > >>>>>>>>>> we originally thought. It'd have to be a writer
> that
> > > > takes
> > > > > > > > >> another
> > > > > > > > >>>>>>> writer
> > > > > > > > >>>>>>>>>> as a parameter.
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> I think a string converter can be done as a simple
> > > > > > transform.
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> I think we should start with a simple string
> > converter
> > > > and
> > > > > > > plan
> > > > > > > > >>> for a
> > > > > > > > >>>>>>>>>> format-specific writer.
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> What are your thoughts?
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> Thanks,
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> Jesse
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> On Thu, Nov 10, 2016 at 10:33 AM Jesse Anderson <
> > > > > > > > >>>>> jesse@smokinghand.com
> > > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>>>> wrote:
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> I was thinking about what the outputs would look
> > like
> > > > last
> > > > > > > > >> night. I
> > > > > > > > >>>>>>>>>> realized that more complex formats like JSON and
> XML
> > > may
> > > > > or
> > > > > > > may
> > > > > > > > >> not
> > > > > > > > >>>>>>>>> output
> > > > > > > > >>>>>>>>>> the data in a valid format.
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> Doing a direct conversion on unbounded collections
> > > would
> > > > > > work
> > > > > > > > >> just
> > > > > > > > >>>>>>> fine.
> > > > > > > > >>>>>>>>>> They're self-contained. For writing out bounded
> > > > > collections,
> > > > > > > > >> that's
> > > > > > > > >>>>>>> where
> > > > > > > > >>>>>>>>>> we'll hit the issues. This changes the uber
> > conversion
> > > > > > > transform
> > > > > > > > >>>>> into a
> > > > > > > > >>>>>>>>>> transform that needs to be a writer.
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> If a transform executes a JSON conversion on a per
> > > > element
> > > > > > > > basis,
> > > > > > > > >>>>> we'd
> > > > > > > > >>>>>>>>> get
> > > > > > > > >>>>>>>>>> this:
> > > > > > > > >>>>>>>>>> {
> > > > > > > > >>>>>>>>>> "key": "value"
> > > > > > > > >>>>>>>>>> }, {
> > > > > > > > >>>>>>>>>> "key": "value"
> > > > > > > > >>>>>>>>>> },
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> That isn't valid JSON.
> > > > > > > > >>>>>>>>>>
> > > > > > > > > >>>>>>>>>> The conversion transform would need to do
> > several
> > > > > > things
> > > > > > > > >> when
> > > > > > > > >>>>>>>>> writing
> > > > > > > > >>>>>>>>>> out a file. It would need to add brackets for an
> > > array.
> > > > > Now
> > > > > > we
> > > > > > > > >>> have:
> > > > > > > > >>>>>>>>>> [
> > > > > > > > >>>>>>>>>> {
> > > > > > > > >>>>>>>>>> "key": "value"
> > > > > > > > >>>>>>>>>> }, {
> > > > > > > > >>>>>>>>>> "key": "value"
> > > > > > > > >>>>>>>>>> },
> > > > > > > > >>>>>>>>>> ]
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> We still don't have valid JSON. We have to remove
> > the
> > > > last
> > > > > > > comma
> > > > > > > > >> or
> > > > > > > > >>>>>>> have
> > > > > > > > >>>>>>>>>> the uber transform start putting in the commas,
> > except
> > > > for
> > > > > > the
> > > > > > > > >> last
> > > > > > > > >>>>>>>>> element.
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> [
> > > > > > > > >>>>>>>>>> {
> > > > > > > > >>>>>>>>>> "key": "value"
> > > > > > > > >>>>>>>>>> }, {
> > > > > > > > >>>>>>>>>> "key": "value"
> > > > > > > > >>>>>>>>>> }
> > > > > > > > >>>>>>>>>> ]
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> Only by doing this do we have valid JSON.
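The bracket-and-comma handling walked through above can be sketched as a small plain-Java helper. JsonArraySketch and toJsonArray are hypothetical names, and the sketch assumes each input string is already a valid JSON value; it does not validate or escape anything. Note that it needs to see all elements of one output together, which is why this is a writer concern rather than a per-element conversion.

```java
import java.util.StringJoiner;

// Hypothetical helper: assembles already-serialized JSON values into one
// valid JSON array.
final class JsonArraySketch {
  private JsonArraySketch() {}

  // Assumes every element is itself valid JSON; no validation is performed.
  static String toJsonArray(Iterable<String> jsonElements) {
    // "[" and "]" wrap the output once; "," goes only BETWEEN elements,
    // so there is no trailing comma to remove afterwards.
    StringJoiner array = new StringJoiner(",", "[", "]");
    for (String element : jsonElements) {
      array.add(element);
    }
    return array.toString();
  }
}
```

With no elements at all, the helper returns an empty array rather than an empty string.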
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> I'd argue we'd have a similar issue with XML. Some
> > > > parsers
> > > > > > > > >> require
> > > > > > > > >>> a
> > > > > > > > >>>>>>> root
> > > > > > > > >>>>>>>>>> element for everything. The uber transform would
> > have
> > > to
> > > > > put
> > > > > > > the
> > > > > > > > >>> root
> > > > > > > > >>>>>>>>>> element tags at the beginning and end of the file.
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> On Wed, Nov 9, 2016 at 11:36 PM Manu Zhang <
> > > > > > > > >>> owenzhang1990@gmail.com>
> > > > > > > > >>>>>>>>> wrote:
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> I would love to see a lean core and abundant
> > > Transforms
> > > > at
> > > > > > the
> > > > > > > > >> same
> > > > > > > > >>>>>>> time.
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> Maybe we can look at what Confluent <
> > > > > > > > >>> https://github.com/confluentinc
> > > > > > > > >>>>>>
> > > > > > > > >>>>>>>>> does
> > > > > > > > >>>>>>>>>> for kafka-connect. They have official extensions
> > > support
> > > > > for
> > > > > > > > >> JDBC,
> > > > > > > > >>>>> HDFS
> > > > > > > > >>>>>>>>> and
> > > > > > > > >>>>>>>>>> ElasticSearch under https://github.com/
> confluentinc
> > .
> > > > They
> > > > > > put
> > > > > > > > >> them
> > > > > > > > >>>>>>> along
> > > > > > > > >>>>>>>>>> with other community extensions on
> > > > > > > > >>>>>>>>>> https://www.confluent.io/product/connectors/ for
> > > > > > visibility.
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> Although not a commercial company, can we have a
> > > GitHub
> > > > > user
> > > > > > > > like
> > > > > > > > >>>>>>>>>> beam-community to host projects we build around
> beam
> > > but
> > > > > not
> > > > > > > > >>> suitable
> > > > > > > > >>>>>>> for
> > > > > > > > >>>>>>>>>> https://github.com/apache/incubator-beam. In the
> > > > future,
> > > > > we
> > > > > > > may
> > > > > > > > >>> have
> > > > > > > > >>>>>>>>>> beam-algebra like
> > http://github.com/twitter/algebird
> > > > for
> > > > > > > > algebra
> > > > > > > > >>>>>>>>> operations
> > > > > > > > >>>>>>>>>> and beam-ml / beam-dl for machine learning / deep
> > > > > learning.
> > > > > > > > Also,
> > > > > > > > >>>>> there
> > > > > > > > > >>>>>>>>>> will be Beam-related projects elsewhere
> > > maintained
> > > > by
> > > > > > > other
> > > > > > > > >>>>>>>>>> communities. We can put all of them on the
> > > beam-website
> > > > or
> > > > > > > like
> > > > > > > > >>> spark
> > > > > > > > >>>>>>>>>> packages as mentioned by Amit.
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> My $0.02
> > > > > > > > >>>>>>>>>> Manu
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles
> > > > > > > > >>>>> <klk@google.com.invalid
> > > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>>>> wrote:
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>>> On this point from Amit and Ismaël, I agree: we
> > could
> > > > > > benefit
> > > > > > > > >>> from a
> > > > > > > > >>>>>>>>> place
> > > > > > > > >>>>>>>>>>> for miscellaneous non-core helper
> transformations.
> > > > > > > > >>>>>>>>>>>
> > > > > > > > >>>>>>>>>>> We have sdks/java/extensions but it is organized
> as
> > > > > > separate
> > > > > > > > >>>>>>> artifacts.
> > > > > > > > >>>>>>>>> I
> > > > > > > > >>>>>>>>>>> think that is fine, considering the nature of
> Join
> > > and
> > > > > > > > >> SortValues.
> > > > > > > > >>>>> But
> > > > > > > > >>>>>>>>> for
> > > > > > > > > >>>>>>>>>>> simpler transforms, importing one artifact per
> tiny
> > > > > > transform
> > > > > > > > is
> > > > > > > > >>> too
> > > > > > > > >>>>>>>>> much
> > > > > > > > >>>>>>>>>>> overhead. It also seems unlikely that we will
> have
> > > > enough
> > > > > > > > >>>>> commonality
> > > > > > > > >>>>>>>>>> among
> > > > > > > > >>>>>>>>>>> the transforms to call the artifact anything
> other
> > > than
> > > > > > [some
> > > > > > > > >>>>> synonym
> > > > > > > > >>>>>>>>> for]
> > > > > > > > >>>>>>>>>>> "miscellaneous".
> > > > > > > > >>>>>>>>>>>
> > > > > > > > >>>>>>>>>>> I wouldn't want to take this too far - even
> though
> > > the
> > > > > SDK has
> > > > > > > many
> > > > > > > > >>>>>>>>>> transforms*
> > > > > > > > >>>>>>>>>>> that are not required for the model [1], I like
> > that
> > > > the
> > > > > > SDK
> > > > > > > > >>>>> artifact
> > > > > > > > >>>>>>>>> has
> > > > > > > > >>>>>>>>>>> everything a user might need in their "getting
> > > started"
> > > > > > phase
> > > > > > > > of
> > > > > > > > >>>>> use.
> > > > > > > > >>>>>>>>> This
> > > > > > > > >>>>>>>>>>> user-friendliness (the user doesn't care that
> ParDo
> > > is
> > > > > core
> > > > > > > and
> > > > > > > > >>> Sum
> > > > > > > > >>>>> is
> > > > > > > > >>>>>>>>>> not)
> > > > > > > > >>>>>>>>>>> plus the difficulty of judging which transforms
> go
> > > > where,
> > > > > > are
> > > > > > > > >>>>> probably
> > > > > > > > >>>>>>>>> why
> > > > > > > > >>>>>>>>>>> we have them mostly all in one place.
> > > > > > > > >>>>>>>>>>>
> > > > > > > > >>>>>>>>>>> Models to look at, off the top of my head,
> include
> > > > Pig's
> > > > > > > > >> PiggyBank
> > > > > > > > >>>>> and
> > > > > > > > >>>>>>>>>>> Apex's Malhar. These have different levels of
> > support
> > > > > > > implied.
> > > > > > > > >>>>> Others?
> > > > > > > > >>>>>>>>>>>
> > > > > > > > >>>>>>>>>>> Kenn
> > > > > > > > >>>>>>>>>>>
> > > > > > > > >>>>>>>>>>> [1] ApproximateQuantiles, ApproximateUnique,
> Count,
> > > > > > Distinct,
> > > > > > > > >>>>> Filter,
> > > > > > > > >>>>>>>>>>> FlatMapElements, Keys, Latest, MapElements, Max,
> > > Mean,
> > > > > Min,
> > > > > > > > >>> Values,
> > > > > > > > >>>>>>>>>> KvSwap,
> > > > > > > > >>>>>>>>>>> Partition, Regex, Sample, Sum, Top, Values,
> > WithKeys,
> > > > > > > > >>> WithTimestamps
> > > > > > > > >>>>>>>>>>>
> > > > > > > > >>>>>>>>>>> * at least they are separate classes and not
> > methods
> > > on
> > > > > > > > >>> PCollection
> > > > > > > > >>>>>>> :-)
> > > > > > > > >>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>
> > > > > > > > >>>>>>>>>>> On Wed, Nov 9, 2016 at 6:03 AM, Ismaël Mejía <
> > > > > > > > iemejia@gmail.com
> > > > > > > > >>>
> > > > > > > > >>>>>>> wrote:
> > > > > > > > >>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>> Nice discussion, and thanks Jesse for bringing
> > this
> > > > > > subject
> > > > > > > > >>> back.
> > > > > > > > >>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>> I agree 100% with Amit and the idea of having a
> > home
> > > > for
> > > > > > > those
> > > > > > > > >>>>>>>>>> transforms
> > > > > > > > >>>>>>>>>>>> that are not core enough to be part of the sdk,
> > but
> > > > that
> > > > > > we
> > > > > > > > all
> > > > > > > > >>> end
> > > > > > > > >>>>>>> up
> > > > > > > > >>>>>>>>>>>> re-writing somehow.
> > > > > > > > >>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>> This is a needed improvement to be more
> developer
> > > > > > friendly,
> > > > > > > > but
> > > > > > > > >>>>> also
> > > > > > > > >>>>>>> as
> > > > > > > > >>>>>>>>>> a
> > > > > > > > >>>>>>>>>>>> reference of good practices of Beam development,
> > and
> > > > for
> > > > > > > this
> > > > > > > > >>>>> reason
> > > > > > > > >>>>>>> I
> > > > > > > > >>>>>>>>>>>> agree with JB that at this moment it would be
> > better
> > > > for
> > > > > > > these
> > > > > > > > >>>>>>>>>> transforms
> > > > > > > > >>>>>>>>>>>> to reside in the Beam repository at least for
> > > > visibility
> > > > > > > > >> reasons.
> > > > > > > > >>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>> One additional question is if these transforms
> > > > > represent a
> > > > > > > > >>>>> different
> > > > > > > > >>>>>>>>> DSL
> > > > > > > > >>>>>>>>>>> or
> > > > > > > > >>>>>>>>>>>> if those could be grouped with the current
> > > extensions
> > > > > > (e.g.
> > > > > > > > >> Join
> > > > > > > > >>>>> and
> > > > > > > > >>>>>>>>>>>> SortValues) into something more general that we
> > as a
> > > > > > > community
> > > > > > > > >>>>> could
> > > > > > > > >>>>>>>>>>>> maintain, but well even if it is not the case,
> it
> > > > would
> > > > > be
> > > > > > > > >> really
> > > > > > > > >>>>>>> nice
> > > > > > > > >>>>>>>>>> to
> > > > > > > > >>>>>>>>>>>> start working on something like this.
> > > > > > > > >>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>> Ismaël Mejía
> > > > > > > > >>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>> On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste
> > > Onofré
> > > > <
> > > > > > > > >>>>>>> jb@nanthrax.net
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>>>> wrote:
> > > > > > > > >>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>> Related to spark-package, we also have Apache
> > Bahir
> > > > to
> > > > > > host
> > > > > > > > >>>>>>>>>>>>> connectors/transforms for Spark and Flink.
> > > > > > > > >>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>> IMHO, right now, Beam should host this, not
> sure
> > if
> > > > it
> > > > > > > makes
> > > > > > > > >>> sense
> > > > > > > > >>>>>>>>>>>>> directly in the core.
> > > > > > > > >>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>> It reminds me of the "Integration" DSL we
> discussed
> > in
> > > > the
> > > > > > > > >>> technical
> > > > > > > > >>>>>>>>>>> vision
> > > > > > > > >>>>>>>>>>>>> document.
> > > > > > > > >>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>> Regards
> > > > > > > > >>>>>>>>>>>>> JB
> > > > > > > > >>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>> On 11/09/2016 11:17 AM, Amit Sela wrote:
> > > > > > > > >>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>> I think Jesse has a very good point on one
> hand,
> > > > while
> > > > > > > > Luke's
> > > > > > > > >>> and
> > > > > > > > >>>>>>>>>>>>>> Kenneth's
> > > > > > > > >>>>>>>>>>>>>> worries about committing users to specific
> > > > > > implementations
> > > > > > > > > are
> > > > > > > > >>> in
> > > > > > > > >>>>>>>>>>> place.
> > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>> The Spark community has a 3rd party repository
> > for
> > > > > > useful
> > > > > > > > >>>>> libraries
> > > > > > > > >>>>>>>>>>> that
> > > > > > > > >>>>>>>>>>>>>> for various reasons are not a part of the
> Apache
> > > > Spark
> > > > > > > > >> project:
> > > > > > > > >>>>>>>>>>>>>> https://spark-packages.org/.
> > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>> Maybe a "common-transformations" package would
> > > serve
> > > > > > both
> > > > > > > > >> users
> > > > > > > > >>>>>>> quick
> > > > > > > > >>>>>>>>>>>>>> ramp-up and ease-of-use while keeping Beam
> more
> > > > > > > "enabling" ?
> > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles
> > > > > > > > >>>>>>>>>> <klk@google.com.invalid
> > > > > > > > >>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>> wrote:
> > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>> It seems useful for small scale debugging /
> > > demoing
> > > > to
> > > > > > > have
> > > > > > > > >>>>>>>>>>>>>>> Dump.toString(). I think it should be named
> to
> > > > > clearly
> > > > > > > > >>> indicate
> > > > > > > > >>>>>>> its
> > > > > > > > >>>>>>>>>>>>>>> limited
> > > > > > > > >>>>>>>>>>>>>>> scope. Maybe other stuff could go in the Dump
> > > > > > namespace,
> > > > > > > > but
> > > > > > > > >>>>>>>>>>>>>>> "Dump.toJson()" would be for humans to read -
> > so
> > > it
> > > > > > > should
> > > > > > > > >> be
> > > > > > > > >>>>>>> pretty
> > > > > > > > >>>>>>>>>>>>>>> printed, not treated as a machine-to-machine
> > wire
> > > > > > format.
> > > > > > > > >>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>> The broader question of representing data in
> > JSON
> > > > or
> > > > > > XML,
> > > > > > > > >> etc,
> > > > > > > > >>>>> is
> > > > > > > > >>>>>>>>>>>> already
> > > > > > > > >>>>>>>>>>>>>>> the subject of many mature libraries which
> are
> > > > > already
> > > > > > > easy
> > > > > > > > >> to
> > > > > > > > >>>>> use
> > > > > > > > >>>>>>>>>>> with
> > > > > > > > >>>>>>>>>>>>>>> Beam.
> > > > > > > > >>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>> The more esoteric practice of implicit or
> > > > > semi-implicit
> > > > > > > > >>>>> coercions
> > > > > > > > >>>>>>>>>>> seems
> > > > > > > > >>>>>>>>>>>>>>> like it is also already addressed in many
> ways
> > > > > > elsewhere.
> > > > > > > > >>>>>>>>>>>>>>> Transform.via(TypeConverter) is basically the
> > > same
> > > > as
> > > > > > > > >>>>>>>>>>>>>>> MapElements.via(<lambda>) and also easy to
> use
> > > with
> > > > > > Beam.
> > > > > > > > >>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>> In both of the last cases, there are many
> > > > reasonable
> > > > > > > > >>> approaches,
> > > > > > > > >>>>>>> and
> > > > > > > > >>>>>>>>>>> we
> > > > > > > > >>>>>>>>>>>>>>> shouldn't commit our users to one of them.
> > > > > > > > >>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik
> > > > > > > > >>>>>>>>>>> <lcwik@google.com.invalid
> > > > > > > > >>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>> wrote:
> > > > > > > > >>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>> The suggestions you give seem good except for
> > the
> > > > the
> > > > > > XML
> > > > > > > > >>> cases.
> > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>> Might want to have the XML be a document per
> > > line
> > > > > > > similar
> > > > > > > > >> to
> > > > > > > > >>>>> the
> > > > > > > > >>>>>>>>>>> JSON
> > > > > > > > >>>>>>>>>>>>>>>> examples you have been giving.
> > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 12:00 PM, Jesse
> > Anderson
> > > <
> > > > > > > > >>>>>>>>>>>> jesse@smokinghand.com>
> > > > > > > > >>>>>>>>>>>>>>>> wrote:
> > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>> @lukasz Agreed there would have to be KV
> > > > handling. I
> > > > > > was
> > > > > > > > >> more
> > > > > > > > > >>>>>>> thinking
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>> that
> > > > > > > > >>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>> whatever the addition, it shouldn't just
> > handle
> > > > KV.
> > > > > It
> > > > > > > > >> should
> > > > > > > > >>>>>>>>>> handle
> > > > > > > > >>>>>>>>>>>>>>>>> Iterables, Lists, Sets, and KVs.
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> For JSON and XML, I wonder if we'd be able
> to
> > > > give
> > > > > > > > someone
> > > > > > > > >>>>>>>>>>> something
> > > > > > > > >>>>>>>>>>>>>>>>> general purpose enough that you would just
> > end
> > > up
> > > > > > > writing
> > > > > > > > >>> your
> > > > > > > > >>>>>>> own
> > > > > > > > >>>>>>>>>>>> code
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>> to
> > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> handle it anyway.
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> Here are some ideas on what it could look
> > like
> > > > > with a
> > > > > > > > >> method
> > > > > > > > >>>>> and
> > > > > > > > >>>>>>>>>>> the
> > > > > > > > >>>>>>>>>>>>>>>>> resulting string output:
> > > > > > > > >>>>>>>>>>>>>>>>> *Stringify.toJSON()*
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> With KV:
> > > > > > > > >>>>>>>>>>>>>>>>> {"key": "value"}
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> With Iterables:
> > > > > > > > >>>>>>>>>>>>>>>>> ["one", "two", "three"]
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> *Stringify.toXML("rootelement")*
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> With KV:
> > > > > > > > > >>>>>>>>>>>>>>>>> <rootelement key="value" />
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> With Iterables:
> > > > > > > > >>>>>>>>>>>>>>>>> <rootelement>
> > > > > > > > >>>>>>>>>>>>>>>>>   <item>one</item>
> > > > > > > > >>>>>>>>>>>>>>>>>   <item>two</item>
> > > > > > > > >>>>>>>>>>>>>>>>>   <item>three</item>
> > > > > > > > >>>>>>>>>>>>>>>>> </rootelement>
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> *Stringify.toDelimited(",")*
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> With KV:
> > > > > > > > >>>>>>>>>>>>>>>>> key,value
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> With Iterables:
> > > > > > > > >>>>>>>>>>>>>>>>> one,two,three
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> Do you think that would strike a good
> balance
> > > > > between
> > > > > > > > >>> reusable
> > > > > > > > >>>>>>>>>> code
> > > > > > > > >>>>>>>>>>>> and
> > > > > > > > >>>>>>>>>>>>>>>>> writing your own for more difficult
> > formatting?
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> Thanks,
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> Jesse
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik
> > > > > > > > >>>>>>>>>>> <lcwik@google.com.invalid
> > > > > > > > >>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> wrote:
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> Jesse, I believe if one format gets special
> > > > > treatment
> > > > > > > in
> > > > > > > > >>>>> TextIO,
> > > > > > > > >>>>>>>>>>>> people
> > > > > > > > > >>>>>>>>>>>>>>>>> will then ask why JSON, XML, ... are not also supported.
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> Also, the example that you provide is using
> > the
> > > > > fact
> > > > > > > that
> > > > > > > > >>> the
> > > > > > > > >>>>>>>>>> input
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>> format
> > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> is an Iterable<Item>. You had posted a
> > question
> > > > > about
> > > > > > > > >> using
> > > > > > > > >>> KV
> > > > > > > > >>>>>>>>>> with
> > > > > > > > >>>>>>>>>>>>>>>>> TextIO.Write which wouldn't align with the
> > > > proposed
> > > > > > > input
> > > > > > > > >>>>> format
> > > > > > > > >>>>>>>>>>> and
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>> still
> > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> would require to write a type conversion
> > > > function,
> > > > > > this
> > > > > > > > >> time
> > > > > > > > >>>>>>> from
> > > > > > > > >>>>>>>>>>> KV
> > > > > > > > >>>>>>>>>>>> to
> > > > > > > > >>>>>>>>>>>>>>>>> Iterable<Item> instead of KV to string.
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:50 AM, Jesse
> > Anderson
> > > <
> > > > > > > > >>>>>>>>>>>> jesse@smokinghand.com>
> > > > > > > > >>>>>>>>>>>>>>>>> wrote:
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> Lukasz,
> > > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>> I don't think you'd need complicated logic
> > for
> > > > > > > > >>> TextIO.Write.
> > > > > > > > >>>>>>> For
> > > > > > > > >>>>>>>>>>> CSV
> > > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> the
> > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> call would look like:
> > > > > > > > >>>>>>>>>>>>>>>>>> Stringify.to("", ",", "\n");
> > > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>> Where the arguments would be
> > > > Stringify.to(prefix,
> > > > > > > > >>> delimiter,
> > > > > > > > >>>>>>>>>>>> suffix).
> > > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>> The code would be something like:
> > > > > > > > >>>>>>>>>>>>>>>>>> StringBuffer buffer = new
> > > StringBuffer(prefix);
> > > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>> for (Item item : list) {
> > > > > > > > >>>>>>>>>>>>>>>>>>   buffer.append(item.toString());
> > > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>   if(notLast) {
> > > > > > > > >>>>>>>>>>>>>>>>>>     buffer.append(delimiter);
> > > > > > > > >>>>>>>>>>>>>>>>>>   }
> > > > > > > > >>>>>>>>>>>>>>>>>> }
> > > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>> buffer.append(suffix);
> > > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>> c.output(buffer.toString());
> > > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>> That would allow you to do the basic CSV,
> > TSV,
> > > > and
> > > > > > > other
> > > > > > > > >>>>>>> formats
> > > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> without
> > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> complicated logic. The same sort of thing
> > could
> > > > be
> > > > > > done
> > > > > > > > >> for
> > > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> TextIO.Write.
> > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>> Thanks,
> > > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>> Jesse
> > > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:30 AM Lukasz
> Cwik
> > > > > > > > >>>>>>>>>>>> <lcwik@google.com.invalid
> > > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>> wrote:
> > > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>> The conversion from object to string will
> > have
> > > > > uses
> > > > > > > > >> outside
> > > > > > > > >>>>> of
> > > > > > > > >>>>>>>>>>> just
> > > > > > > > >>>>>>>>>>>>>>>>>>> TextIO.Write so it seems logical that we
> > > would
> > > > > want
> > > > > > > to
> > > > > > > > >>> have
> > > > > > > > >>>>> a
> > > > > > > > >>>>>>>>>>> ParDo
> > > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>> do
> > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> the
> > > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>> conversion.
> > > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>> Text file formats have a lot of variance,
> > > even
> > > > if
> > > > > > you
> > > > > > > > >>>>> consider
> > > > > > > > >>>>>>>>>>> the
> > > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>> subset
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>> of CSV like formats where it could have
> > fixed
> > > > > width
> > > > > > > > >> fields,
> > > > > > > > >>>>> or
> > > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>> escaping
> > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> and
> > > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>> quoting around other fields, or headers
> > that
> > > > > should
> > > > > > > be
> > > > > > > > >>>>> placed
> > > > > > > > >>>>>>> at
> > > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>> the
> > > > > > > > >>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>> top.
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>> Having all these format conversions
> within
> > > > > > > TextIO.Write
> > > > > > > > >>>>> seems
> > > > > > > > >>>>>>>>>>> like
> > > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>> a
> > > > > > > > >>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>> lot
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>> of
> > > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>> logic to contain in that transform which
> > > should
> > > > > > just
> > > > > > > > >> focus
> > > > > > > > >>>>> on
> > > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>> writing
> > > > > > > > >>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>> to
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>> files.
> > > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 8:15 AM, Jesse
> > > Anderson
> > > > <
> > > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>> jesse@smokinghand.com>
> > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> wrote:
> > > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>> This is a thread moved over from the user
> > > > mailing
> > > > > > > list.
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>>> I think there needs to be a way to
> > convert a
> > > > > > > > >>>>> PCollection<KV>
> > > > > > > > >>>>>>> to
> > > > > > > > >>>>>>>>>>>>>>>>>>>> PCollection<String>.
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>>> To do a minimal WordCount, you have to
> > > > manually
> > > > > > > > convert
> > > > > > > > >>> the
> > > > > > > > >>>>>>> KV
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>> to a
> > > > > > > > >>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>> String:
> > > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>>>         p
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>  .apply(TextIO.Read.from("playing_cards.tsv"))
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> >  .apply(Regex.split("\\W+"))
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
>  .apply(Count.perElement())
> > > > > > > > >>>>>>>>>>>>>>>>>>>> *
> > .apply(MapElements.via((KV<
> > > > > > String,
> > > > > > > > >> Long>
> > > > > > > > >>>>>>>>>> count)
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>> ->*
> > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> *
> count.getKey() +
> > > > ":" +
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>> count.getValue()*
> > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> *                        ).withOutputType(
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>> TypeDescriptors.strings()))*
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to
> > > > > > > > >>>>>>> ("output/stringcounts"));
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>>> This code really should be something
> like:
> > > > > > > > >>>>>>>>>>>>>>>>>>>>         p
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>  .apply(TextIO.Read.from("playing_cards.tsv"))
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> >  .apply(Regex.split("\\W+"))
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
>  .apply(Count.perElement())
> > > > > > > > >>>>>>>>>>>>>>>>>>>> *
> > > .apply(ToString.stringify())*
> > > > > > > > >>>>>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to
> > > > > > > > >>>>>>>>> ("output/stringcounts"));
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>>> To summarize the discussion:
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>>>    - JA: Add a method to
> > StringDelegateCoder
> > > > to
> > > > > > > output
> > > > > > > > >>> any
> > > > > > > > >>>>> KV
> > > > > > > > >>>>>>>>>> or
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>> list
> > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>    - JA and DH: Add a SimpleFunction that
> > takes
> > > > a
> > > > > > type
> > > > > > > > >> and
> > > > > > > > >>>>> runs
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>> toString()
> > > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>>>    on it:
> > > > > > > > >>>>>>>>>>>>>>>>>>>>    class ToStringFn<InputT> extends
> > > > > > > > >>> SimpleFunction<InputT,
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>> String>
> > > > > > > > >>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>> {
> > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>        public String apply(InputT
> > > input) {
> > > > > > > > >>>>>>>>>>>>>>>>>>>>            return input.toString();
> > > > > > > > >>>>>>>>>>>>>>>>>>>>        }
> > > > > > > > >>>>>>>>>>>>>>>>>>>>    }
> > > > > > > > >>>>>>>>>>>>>>>>>>>>    - JB: Add a general purpose type
> > > converter
> > > > > like
> > > > > > > in
> > > > > > > > >>>>> Apache
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>> Camel.
> > > > > > > > >>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>    - JA: Add Object support to TextIO.Write
> > that
> > > > > would
> > > > > > > > >> write
> > > > > > > > >>>>> out
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>> the
> > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>    toString of any Object.
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>>> My thoughts:
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>>> Is converting to a PCollection<String>
> > > mostly
> > > > > > needed
> > > > > > > > >> when
> > > > > > > > >>>>>>>>>> you're
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>> using
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>> TextIO.Write? Will a general purpose
> > transform
> > > > > only
> > > > > > > work
> > > > > > > > >> in
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>> certain
> > > > > > > > >>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>> cases
> > > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>> and you'll normally have to write custom
> > code to
> > > > > > format
> > > > > > > > the
> > > > > > > > >>>>>>> strings
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>> the
> > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> way
> > > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>> you want them?
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>>> IMHO, it's yes to both. I'd prefer to
> add
> > > > Object
> > > > > > > > >> support
> > > > > > > > >>> to
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>> TextIO.Write
> > > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>> or
> > > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>>> a SimpleFunction that takes a delimiter
> as
> > > an
> > > > > > > > argument.
> > > > > > > > >>>>>>> Making
> > > > > > > > >>>>>>>>>> a
> > > > > > > > >>>>>>>>>>>>>>>>>>>> SimpleFunction that's able to specify a
> > > > > delimiter
> > > > > > > (and
> > > > > > > > >>>>>>> perhaps
> > > > > > > > >>>>>>>>>> a
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>> prefix
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>> and
> > > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>>> suffix) should cover the majority of
> > formats
> > > > and
> > > > > > > > cases.
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>>> Thanks,
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>>> Jesse
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>> --
> > > > > > > > >>>>>>>>>>>>> Jean-Baptiste Onofré
> > > > > > > > >>>>>>>>>>>>> jbonofre@apache.org
> > > > > > > > >>>>>>>>>>>>> http://blog.nanthrax.net
> > > > > > > > >>>>>>>>>>>>> Talend - http://www.talend.com
> > > > > > > > >>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>
> > > > > > > > >>>>>>>>>
>
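
The prefix/delimiter/suffix idea quoted in the thread above (Stringify.to("", ",", "\n")
for iterables, and key + delimiter + value for KVs) can be sketched in plain Java. This
is only an illustration under stated assumptions: DelimitedSketch, join and kv are
hypothetical names, java.util.Map.Entry stands in for Beam's KV, and nothing here is
taken from the actual PR.

```java
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper; not part of Beam. Map.Entry stands in for Beam's KV. */
final class DelimitedSketch {
  private DelimitedSketch() {}

  /** Joins the toString() of each item as prefix + a + delimiter + b + ... + suffix. */
  static String join(Iterable<?> items, String prefix, String delimiter, String suffix) {
    // StringJoiner only places the delimiter between elements, so no "notLast" check is needed.
    StringJoiner joiner = new StringJoiner(delimiter, prefix, suffix);
    for (Object item : items) {
      joiner.add(String.valueOf(item));
    }
    return joiner.toString();
  }

  /** Formats a key/value pair as key + delimiter + value. */
  static String kv(Map.Entry<?, ?> entry, String delimiter) {
    return entry.getKey() + delimiter + entry.getValue();
  }
}
```

With these assumptions, join(Arrays.asList("one", "two", "three"), "", ",", "\n") gives
"one,two,three" followed by a newline, and an empty iterable gives just prefix + suffix.
Escaping, quoting and headers are deliberately left out, which is the limitation Lukasz
points out for real CSV output.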

Re: PCollection to PCollection Conversion

Posted by Jesse Anderson <je...@smokinghand.com>.

Sounds good to me too. @vikas, can you start modifying the PR's code along
the lines Dan suggested? That is, make `ToString` itself not a PTransform;
instead, ToString.create() returns ToString.Default, a private class
implementing what ToString is now (PTransform<T, String>, wrapping
MapElements).
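
A minimal sketch of that shape, so we are all picturing the same thing. It assumes
java.util.function.Function as a stand-in for PTransform so it runs without Beam on the
classpath; ToStringSketch is a hypothetical name, and the real class would wrap
MapElements instead.

```java
import java.util.function.Function;

/** Hypothetical stand-in for ToString; Function replaces PTransform for illustration. */
final class ToStringSketch {
  private ToStringSketch() {} // not a transform itself; only factory methods are exposed

  /** Returns the default converter, which simply calls toString() on each element. */
  static <T> Function<T, String> create() {
    return new Default<>();
  }

  /** Private implementation; more variants can be added later behind new factory methods. */
  private static final class Default<T> implements Function<T, String> {
    @Override
    public String apply(T input) {
      return input.toString();
    }
  }
}
```

Because callers only ever see the factory method, later PRs can add other factories
(delimiters, custom formatters) without changing what existing callers compile against.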

On Thu, Dec 29, 2016 at 4:00 PM Ben Chambers <bc...@google.com.invalid>
wrote:

> Dan's proposal to move forward with a simple (future-proofed) version of
> the ToString transform and Javadoc, and add specific features via follow-up
> PRs.
>
> On Thu, Dec 29, 2016 at 3:53 PM Jesse Anderson <je...@smokinghand.com>
> wrote:
>
> > @Ben which idea do you like?
> >
> > On Thu, Dec 29, 2016 at 3:20 PM Ben Chambers
> <bchambers@google.com.invalid
> > >
> > wrote:
> >
> > > I like that idea, with the caveat that we should probably come up with
> a
> > > better name. Perhaps "ToString.elements()" and ToString.Elements or
> > > something? Calling one the "default" and using "create" for it seems
> > > moderately non-future proof.
> > >
> > > On Thu, Dec 29, 2016 at 3:17 PM Dan Halperin
> <dhalperi@google.com.invalid
> > >
> > > wrote:
> > >
> > > > On Thu, Dec 29, 2016 at 2:10 PM, Jesse Anderson <
> jesse@smokinghand.com
> > >
> > > > wrote:
> > > >
> > > > > I agree MapElements isn't hard to use. I think there is a demand
> for
> > > this
> > > > > built-in conversion.
> > > > >
> > > > > My thought on the formatter is that, worst case, we could do
> runtime
> > > type
> > > > > checking. It would be ugly and not as performant, but it should
> work.
> > > As
> > > > > we've said, we'd point them to MapElements for better code. We'd
> > write
> > > > the
> > > > > JavaDoc accordingly.
> > > > >
> > > >
> > > > I think it will be good to see these proposals in PR form. I would
> stay
> > > far
> > > > away from reflection and varargs if possible, but properly-typed bits
> > of
> > > > code (possibly exposed as SerializableFunctions in ToString?) would
> > > > probably make sense.
> > > >
> > > > In the short-term, I can't find anyone arguing against a
> > > ToString.create()
> > > > that simply does input.toString().
> > > >
> > > > To get started, how about we ask Vikas to clean up the PR to be more
> > > > future-proof for now? Aka make `ToString` itself not a PTransform,
> but
> > > > instead ToString.create() returns ToString.Default which is a private
> > > class
> > > > implementing what ToString is now (PTransform<T, String>, wrapping
> > > > MapElements).
> > > >
> > > > Then we can send PRs adding new features to that.
> > > >
> > > > IME and to Ben's point, these will mostly be used in development.
> Some
> > of
> > > > > our assumptions will break down when programmers aren't the ones
> > using
> > > > > Beam. I can see from the user traffic already that not everyone
> using
> > > > Beam
> > > > > is a programmer and they'll need classes like this to be
> productive.
> > > >
> > > >
> > > > > On Thu, Dec 29, 2016 at 1:46 PM Dan Halperin
> > > <dhalperi@google.com.invalid
> > > > >
> > > > > wrote:
> > > > >
> > > > > On Thu, Dec 29, 2016 at 1:36 PM, Jesse Anderson <
> > jesse@smokinghand.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > I prefer JB's take. I think there should be three overloaded
> > methods
> > > on
> > > > > the
> > > > > > class. I like Vikas' name ToString. The methods for a simple
> > > conversion
> > > > > > should be:
> > > > > >
> > > > > > ToString.strings() - Outputs the .toString() of the objects in
> the
> > > > > > PCollection
> > > > > > ToString.strings(String delimiter) - Outputs the .toString() of
> > KVs,
> > > > > Lists,
> > > > > > etc with the delimiter between every entry
> > > > > > ToString.formatted(String format) - Outputs the formatted
> > > > > > <
> > https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html>
> > > > > > string
> > > > > > with the object passed in. For objects made up of different parts
> > > like
> > > > > KVs,
> > > > > > each one is passed in as separate toString() of a varargs.
> > > > > >
> > > > >
> > > > > Riffing a little, with some types:
> > > > >
> > > > > ToString.<T>of() -- PTransform<T, String> that is equivalent to a
> > ParDo
> > > > > that takes in a T and outputs T.toString().
> > > > >
> > > > > ToString.<K,V>kv(String delimiter) -- PTransform<KV<K, V>, String>
> > that
> > > > is
> > > > > equivalent to a ParDo that takes in a KV<K,V> and outputs
> > > > > kv.getKey().toString() + delimiter + kv.getValue().toString()
> > > > >
> > > > > ToString.<T>iterable(String delimiter) -- PTransform<? extends
> > > > Iterable<T>,
> > > > > String> that is equivalent to a ParDo that takes in an Iterable<T>
> > and
> > > > > outputs the iterable[0] + delimiter + iterable[1] + delimiter +
> ... +
> > > > > delimiter + iterable[N-1]
> > > > >
> > > > > ToString.<T>custom(SerializableFunction<T, String> formatter) ?
> > > > >
> > > > > > The last one is just MapElements.via, except you don't need to set
> the
> > > > > output type.
> > > > >
> > > > > I don't see a way to make the generic .formatted() that you propose
> > > that
> > > > > just works with anything "made of different parts".
> > > > >
> > > > > > I think adding too many overrides beyond "of" and "custom" is
> > > > > opening
> > > > > > up a Pandora's Box. The KV one might want to have left and right
> > > > > delimiters, might want to take custom formatters for K and V, etc.
> > etc.
> > > > The
> > > > > iterable one might want to have a special configuration for an
> empty
> > > > > iterable. So I'm inclined towards simplicity with the awareness
> that
> > > > > MapElements.via is just not that hard to use.
> > > > >
> > > > > Dan
> > > > >
> > > > >
> > > > > >
> > > > > > I think doing these three methods would cover every simple and
> > > advanced
> > > > > > "simple conversions." As JB says, we'll need other specific
> > > converters
> > > > > for
> > > > > > other formats like XML.
> > > > > >
> > > > > > I'd really like to see this class in the next version of Beam.
> What
> > > > does
> > > > > > everyone think of the class name, methods name, and method
> > operations
> > > > so
> > > > > we
> > > > > > can have Vikas finish up?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jesse
> > > > > >
> > > > > > On Wed, Dec 28, 2016 at 12:28 PM Jean-Baptiste Onofré <
> > > jb@nanthrax.net
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Vikas,
> > > > > > >
> > > > > > > > did you take a look at:
> > > > > > >
> > > > > > >
> > > > > > > https://github.com/jbonofre/beam/tree/DATAFORMAT/sdks/
> > > > > > java/extensions/dataformat
> > > > > > >
> > > > > > > You can see KV2String and ToString could be part of this
> > extension.
> > > > > > > I'm also using JAXB for XML and Jackson for JSON
> > > > > > > marshalling/unmarshalling. I'm planning to deal with Avro
> > > > > > (IndexedRecord).
> > > > > > >
> > > > > > > Regards
> > > > > > > JB
> > > > > > >
> > > > > > > On 12/28/2016 08:37 PM, Vikas Kedigehalli wrote:
> > > > > > > > Hi All,
> > > > > > > >
> > > > > > > >   Not being aware of the discussion here, I sent out a PR
> > > > > > > > <https://github.com/apache/beam/pull/1704> but JB and others
> > > > > directed
> > > > > > > me to
> > > > > > > > this thread. Having converted PCollection<T> to
> > > PCollection<String>
> > > > > > > several
> > > > > > > > times, I feel something like 'ToString' transform is common
> > > enough
> > > > to
> > > > > > be
> > > > > > > > part of the core. What do you all think?
> > > > > > > >
> > > > > > > > Also, if someone else is already working on or interested in
> > > > tackling
> > > > > > > this,
> > > > > > > > then I am happy to discard the PR.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Vikas
> > > > > > > >
> > > > > > > > On Tue, Dec 13, 2016 at 1:56 AM, Amit Sela <
> > amitsela33@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > >> It seems that there were a lot of good points raised here,
> > and I
> > > > > tend
> > > > > > to
> > > > > > > >> agree that something as trivial and lean as "ToString"
> should
> > > be a
> > > > > > part
> > > > > > > of
> > > > > > > > >> core.
> > > > > > > >> I'm particularly fond of makeString(prefix, toString,
> suffix)
> > in
> > > > > > various
> > > > > > > >> combinations (Scala-like).
> > > > > > > >> For "fromString", I think JB has a good point leveraging
> JAXB
> > > and
> > > > > > > Jackson -
> > > > > > > >> though I think this should be in extensions as it is not as
> > lean
> > > > as
> > > > > > > >> toString.
> > > > > > > >>
> > > > > > > >> Thanks,
> > > > > > > >> Amit
> > > > > > > >>
> > > > > > > >> On Wed, Nov 30, 2016 at 5:13 AM Jean-Baptiste Onofré <
> > > > > jb@nanthrax.net
> > > > > > >
> > > > > > > >> wrote:
> > > > > > > >>
> > > > > > > >>> Hi Jesse,
> > > > > > > >>>
> > > > > > > >>> yes, I started something there (using JAXB and Jackson).
> Let
> > me
> > > > > > polish
> > > > > > > >>> and push.
> > > > > > > >>>
> > > > > > > >>> Regards
> > > > > > > >>> JB
> > > > > > > >>>
> > > > > > > >>> On 11/29/2016 10:00 PM, Jesse Anderson wrote:
> > > > > > > >>>> I went through the string conversions. Do you have an
> > example
> > > of
> > > > > > > >> writing
> > > > > > > >>>> out XML/JSON/etc too?
> > > > > > > >>>>
> > > > > > > >>>> On Tue, Nov 29, 2016 at 3:46 PM Jean-Baptiste Onofré <
> > > > > > jb@nanthrax.net
> > > > > > > >
> > > > > > > >>>> wrote:
> > > > > > > >>>>
> > > > > > > >>>>> Hi Jesse,
> > > > > > > >>>>>
> > > > > > > >>>>>
> > > > > > > >>>>>
> > > > > > > >>> https://github.com/jbonofre/incubator-beam/tree/
> > > > > > DATAFORMAT/sdks/java/
> > > > > > > >> extensions/dataformat
> > > > > > > >>>>>
> > > > > > > >>>>> it's very simple and stupid and of course not complete at
> > all
> > > > (I
> > > > > > have
> > > > > > > >>>>> other commits but not merged as they need some
> polishing),
> > > but
> > > > as
> > > > > I
> > > > > > > >>>>> said, it's a base of discussion.
> > > > > > > >>>>>
> > > > > > > >>>>> Regards
> > > > > > > >>>>> JB
> > > > > > > >>>>>
> > > > > > > >>>>> On 11/29/2016 09:23 PM, Jesse Anderson wrote:
> > > > > > > >>>>>> @jb Sounds good. Just let us know once you've pushed.
> > > > > > > >>>>>>
> > > > > > > >>>>>> On Tue, Nov 29, 2016 at 2:54 PM Jean-Baptiste Onofré <
> > > > > > > >> jb@nanthrax.net>
> > > > > > > >>>>>> wrote:
> > > > > > > >>>>>>
> > > > > > > >>>>>>> Good point Eugene.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Right now, it's a DoFn collection to experiment a bit
> (a
> > > pure
> > > > > > > >>>>>>> extension). It's pretty stupid ;)
> > > > > > > >>>>>>>
> > > > > > > > >>>>>>> But, you are right, depending on the direction of such
> > > > extension,
> > > > > it
> > > > > > > >>> could
> > > > > > > >>>>>>> cover more use cases (even if it's not my first
> intention
> > > > ;)).
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Let me push the branch (pretty small) as an
> illustration,
> > > and
> > > > > in
> > > > > > > the
> > > > > > > >>>>>>> mean time, I'm preparing a document (more focused on
> the
> > > use
> > > > > > > cases).
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> WDYT ?
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Regards
> > > > > > > >>>>>>> JB
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> On 11/29/2016 08:47 PM, Eugene Kirpichov wrote:
> > > > > > > >>>>>>>> Hi JB,
> > > > > > > >>>>>>>> Depending on the scope of what you want to ultimately
> > > > > accomplish
> > > > > > > >> with
> > > > > > > >>>>>>> this
> > > > > > > >>>>>>>> extension, I think it may make sense to write a
> proposal
> > > > > > document
> > > > > > > >> and
> > > > > > > >>>>>>>> discuss it.
> > > > > > > >>>>>>>> If it's just a collection of utility DoFn's for
> various
> > > > > > > >> well-defined
> > > > > > > >>>>>>>> source/target format pairs, then that's probably not
> > > needed,
> > > > > but
> > > > > > > if
> > > > > > > >>>>> it's
> > > > > > > >>>>>>>> anything more, then I think it is.
> > > > > > > >>>>>>>> That will help avoid a lot of churn if people propose
> > > > > reasonable
> > > > > > > >>>>>>>> significant changes.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> On Tue, Nov 29, 2016 at 11:15 AM Jean-Baptiste Onofré
> <
> > > > > > > >>> jb@nanthrax.net
> > > > > > > >>>>>>
> > > > > > > >>>>>>>> wrote:
> > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>>> By the way Jesse, I'm going to push my DATAFORMAT branch
> on
> > my
> > > > > > github
> > > > > > > >>> and I
> > > > > > > >>>>>>>>> will post on the dev mailing list when done.
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> Regards
> > > > > > > >>>>>>>>> JB
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> On 11/29/2016 07:01 PM, Jesse Anderson wrote:
> > > > > > > >>>>>>>>>> I want to bring this thread back up since we've had
> > time
> > > > to
> > > > > > > think
> > > > > > > >>>>> about
> > > > > > > >>>>>>>>> it
> > > > > > > >>>>>>>>>> more and make a plan.
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> I think a format-specific converter will be more
> time
> > > > > > consuming
> > > > > > > >>> task
> > > > > > > >>>>>>> than
> > > > > > > >>>>>>>>>> we originally thought. It'd have to be a writer that
> > > takes
> > > > > > > >> another
> > > > > > > >>>>>>> writer
> > > > > > > >>>>>>>>>> as a parameter.
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> I think a string converter can be done as a simple
> > > > > transform.
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> I think we should start with a simple string
> converter
> > > and
> > > > > > plan
> > > > > > > >>> for a
> > > > > > > >>>>>>>>>> format-specific writer.
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> What are your thoughts?
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> Thanks,
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> Jesse
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> On Thu, Nov 10, 2016 at 10:33 AM Jesse Anderson <
> > > > > > > >>>>> jesse@smokinghand.com
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>>>> wrote:
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> I was thinking about what the outputs would look
> like
> > > last
> > > > > > > >> night. I
> > > > > > > >>>>>>>>>> realized that more complex formats like JSON and XML
> > may
> > > > or
> > > > > > may
> > > > > > > >> not
> > > > > > > >>>>>>>>> output
> > > > > > > >>>>>>>>>> the data in a valid format.
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> Doing a direct conversion on unbounded collections
> > would
> > > > > work
> > > > > > > >> just
> > > > > > > >>>>>>> fine.
> > > > > > > >>>>>>>>>> They're self-contained. For writing out bounded
> > > > collections,
> > > > > > > >> that's
> > > > > > > >>>>>>> where
> > > > > > > >>>>>>>>>> we'll hit the issues. This changes the uber
> conversion
> > > > > > transform
> > > > > > > >>>>> into a
> > > > > > > >>>>>>>>>> transform that needs to be a writer.
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> If a transform executes a JSON conversion on a per
> > > element
> > > > > > > basis,
> > > > > > > >>>>> we'd
> > > > > > > >>>>>>>>> get
> > > > > > > >>>>>>>>>> this:
> > > > > > > >>>>>>>>>> {
> > > > > > > >>>>>>>>>> "key": "value"
> > > > > > > >>>>>>>>>> }, {
> > > > > > > >>>>>>>>>> "key": "value"
> > > > > > > >>>>>>>>>> },
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> That isn't valid JSON.
> > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> The conversion transform would need to do
> several
> > > > > things
> > > > > > > >> when
> > > > > > > >>>>>>>>> writing
> > > > > > > >>>>>>>>>> out a file. It would need to add brackets for an
> > array.
> > > > Now
> > > > > we
> > > > > > > >>> have:
> > > > > > > >>>>>>>>>> [
> > > > > > > >>>>>>>>>> {
> > > > > > > >>>>>>>>>> "key": "value"
> > > > > > > >>>>>>>>>> }, {
> > > > > > > >>>>>>>>>> "key": "value"
> > > > > > > >>>>>>>>>> },
> > > > > > > >>>>>>>>>> ]
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> We still don't have valid JSON. We have to remove
> the
> > > last
> > > > > > comma
> > > > > > > >> or
> > > > > > > >>>>>>> have
> > > > > > > >>>>>>>>>> the uber transform start putting in the commas,
> except
> > > for
> > > > > the
> > > > > > > >> last
> > > > > > > >>>>>>>>> element.
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> [
> > > > > > > >>>>>>>>>> {
> > > > > > > >>>>>>>>>> "key": "value"
> > > > > > > >>>>>>>>>> }, {
> > > > > > > >>>>>>>>>> "key": "value"
> > > > > > > >>>>>>>>>> }
> > > > > > > >>>>>>>>>> ]
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> Only by doing this do we have valid JSON.
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> I'd argue we'd have a similar issue with XML. Some
> > > parsers
> > > > > > > >> require
> > > > > > > >>> a
> > > > > > > >>>>>>> root
> > > > > > > >>>>>>>>>> element for everything. The uber transform would
> have
> > to
> > > > put
> > > > > > the
> > > > > > > >>> root
> > > > > > > >>>>>>>>>> element tags at the beginning and end of the file.
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> On Wed, Nov 9, 2016 at 11:36 PM Manu Zhang <
> > > > > > > >>> owenzhang1990@gmail.com>
> > > > > > > >>>>>>>>> wrote:
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> I would love to see a lean core and abundant
> > Transforms
> > > at
> > > > > the
> > > > > > > >> same
> > > > > > > >>>>>>> time.
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> Maybe we can look at what Confluent <
> > > > > > > >>> https://github.com/confluentinc
> > > > > > > >>>>>>
> > > > > > > >>>>>>>>> does
> > > > > > > >>>>>>>>>> for kafka-connect. They have official extensions
> > support
> > > > for
> > > > > > > >> JDBC,
> > > > > > > >>>>> HDFS
> > > > > > > >>>>>>>>> and
> > > > > > > >>>>>>>>>> ElasticSearch under https://github.com/confluentinc
> .
> > > They
> > > > > put
> > > > > > > >> them
> > > > > > > >>>>>>> along
> > > > > > > >>>>>>>>>> with other community extensions on
> > > > > > > >>>>>>>>>> https://www.confluent.io/product/connectors/ for
> > > > > visibility.
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> Although not a commercial company, can we have a
> > GitHub
> > > > user
> > > > > > > like
> > > > > > > >>>>>>>>>> beam-community to host projects we build around beam
> > but
> > > > not
> > > > > > > >>> suitable
> > > > > > > >>>>>>> for
> > > > > > > >>>>>>>>>> https://github.com/apache/incubator-beam. In the
> > > future,
> > > > we
> > > > > > may
> > > > > > > >>> have
> > > > > > > >>>>>>>>>> beam-algebra like
> http://github.com/twitter/algebird
> > > for
> > > > > > > algebra
> > > > > > > >>>>>>>>> operations
> > > > > > > >>>>>>>>>> and beam-ml / beam-dl for machine learning / deep
> > > > learning.
> > > > > > > Also,
> > > > > > > >>>>> there
> > > > > > > > >>>>>>>>>> will be beam related projects elsewhere
> > maintained
> > > by
> > > > > > other
> > > > > > > >>>>>>>>>> communities. We can put all of them on the
> > beam-website
> > > or
> > > > > > like
> > > > > > > >>> spark
> > > > > > > >>>>>>>>>> packages as mentioned by Amit.
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> My $0.02
> > > > > > > >>>>>>>>>> Manu
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles
> > > > > > > >>>>> <klk@google.com.invalid
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>>>> wrote:
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>>> On this point from Amit and Ismaël, I agree: we
> could
> > > > > benefit
> > > > > > > >>> from a
> > > > > > > >>>>>>>>> place
> > > > > > > >>>>>>>>>>> for miscellaneous non-core helper transformations.
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>> We have sdks/java/extensions but it is organized as
> > > > > separate
> > > > > > > >>>>>>> artifacts.
> > > > > > > >>>>>>>>> I
> > > > > > > >>>>>>>>>>> think that is fine, considering the nature of Join
> > and
> > > > > > > >> SortValues.
> > > > > > > >>>>> But
> > > > > > > >>>>>>>>> for
> > > > > > > > >>>>>>>>>>> simpler transforms, importing one artifact per tiny
> > > > > transform
> > > > > > > is
> > > > > > > >>> too
> > > > > > > >>>>>>>>> much
> > > > > > > >>>>>>>>>>> overhead. It also seems unlikely that we will have
> > > enough
> > > > > > > >>>>> commonality
> > > > > > > >>>>>>>>>> among
> > > > > > > >>>>>>>>>>> the transforms to call the artifact anything other
> > than
> > > > > [some
> > > > > > > >>>>> synonym
> > > > > > > >>>>>>>>> for]
> > > > > > > >>>>>>>>>>> "miscellaneous".
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>> I wouldn't want to take this too far - even though
> > the
> > > > SDK has
> > > > > > many
> > > > > > > >>>>>>>>>> transforms*
> > > > > > > >>>>>>>>>>> that are not required for the model [1], I like
> that
> > > the
> > > > > SDK
> > > > > > > >>>>> artifact
> > > > > > > >>>>>>>>> has
> > > > > > > >>>>>>>>>>> everything a user might need in their "getting
> > started"
> > > > > phase
> > > > > > > of
> > > > > > > >>>>> use.
> > > > > > > >>>>>>>>> This
> > > > > > > >>>>>>>>>>> user-friendliness (the user doesn't care that ParDo
> > is
> > > > core
> > > > > > and
> > > > > > > >>> Sum
> > > > > > > >>>>> is
> > > > > > > >>>>>>>>>> not)
> > > > > > > >>>>>>>>>>> plus the difficulty of judging which transforms go
> > > where,
> > > > > are
> > > > > > > >>>>> probably
> > > > > > > >>>>>>>>> why
> > > > > > > >>>>>>>>>>> we have them mostly all in one place.
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>> Models to look at, off the top of my head, include
> > > Pig's
> > > > > > > >> PiggyBank
> > > > > > > >>>>> and
> > > > > > > >>>>>>>>>>> Apex's Malhar. These have different levels of
> support
> > > > > > implied.
> > > > > > > >>>>> Others?
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>> Kenn
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>> [1] ApproximateQuantiles, ApproximateUnique, Count,
> > > > > Distinct,
> > > > > > > >>>>> Filter,
> > > > > > > >>>>>>>>>>> FlatMapElements, Keys, Latest, MapElements, Max,
> > Mean,
> > > > Min,
> > > > > > > >>> Values,
> > > > > > > >>>>>>>>>> KvSwap,
> > > > > > > >>>>>>>>>>> Partition, Regex, Sample, Sum, Top, Values,
> WithKeys,
> > > > > > > >>> WithTimestamps
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>> * at least they are separate classes and not
> methods
> > on
> > > > > > > >>> PCollection
> > > > > > > >>>>>>> :-)
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>> On Wed, Nov 9, 2016 at 6:03 AM, Ismaël Mejía <
> > > > > > > iemejia@gmail.com
> > > > > > > >>>
> > > > > > > >>>>>>> wrote:
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> Nice discussion, and thanks Jesse for bringing
> this
> > > > > subject
> > > > > > > >>> back.
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> I agree 100% with Amit and the idea of having a
> home
> > > for
> > > > > > those
> > > > > > > >>>>>>>>>> transforms
> > > > > > > >>>>>>>>>>>> that are not core enough to be part of the sdk,
> but
> > > that
> > > > > we
> > > > > > > all
> > > > > > > >>> end
> > > > > > > >>>>>>> up
> > > > > > > >>>>>>>>>>>> re-writing somehow.
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> This is a needed improvement to be more developer
> > > > > friendly,
> > > > > > > but
> > > > > > > >>>>> also
> > > > > > > >>>>>>> as
> > > > > > > >>>>>>>>>> a
> > > > > > > >>>>>>>>>>>> reference of good practices of Beam development,
> and
> > > for
> > > > > > this
> > > > > > > >>>>> reason
> > > > > > > >>>>>>> I
> > > > > > > >>>>>>>>>>>> agree with JB that at this moment it would be
> better
> > > for
> > > > > > these
> > > > > > > >>>>>>>>>> transforms
> > > > > > > >>>>>>>>>>>> to reside in the Beam repository at least for
> > > visibility
> > > > > > > >> reasons.
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> One additional question is if these transforms
> > > > represent a
> > > > > > > >>>>> different
> > > > > > > >>>>>>>>> DSL
> > > > > > > >>>>>>>>>>> or
> > > > > > > >>>>>>>>>>>> if those could be grouped with the current
> > extensions
> > > > > (e.g.
> > > > > > > >> Join
> > > > > > > >>>>> and
> > > > > > > >>>>>>>>>>>> SortValues) into something more general that we
> as a
> > > > > > community
> > > > > > > >>>>> could
> > > > > > > >>>>>>>>>>>> maintain, but well even if it is not the case, it
> > > would
> > > > be
> > > > > > > >> really
> > > > > > > >>>>>>> nice
> > > > > > > >>>>>>>>>> to
> > > > > > > >>>>>>>>>>>> start working on something like this.
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> Ismaël Mejía
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste
> > Onofré
> > > <
> > > > > > > >>>>>>> jb@nanthrax.net
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>>>> wrote:
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> Related to spark-package, we also have Apache
> Bahir
> > > to
> > > > > host
> > > > > > > >>>>>>>>>>>>> connectors/transforms for Spark and Flink.
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> IMHO, right now, Beam should host this, not sure
> if
> > > it
> > > > > > makes
> > > > > > > >>> sense
> > > > > > > >>>>>>>>>>>>> directly in the core.
> > > > > > > >>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>> It reminds me of the "Integration" DSL we discussed
> in
> > > the
> > > > > > > >>> technical
> > > > > > > >>>>>>>>>>> vision
> > > > > > > >>>>>>>>>>>>> document.
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> Regards
> > > > > > > >>>>>>>>>>>>> JB
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> On 11/09/2016 11:17 AM, Amit Sela wrote:
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> I think Jesse has a very good point on one hand,
> > > while
> > > > > > > Luke's
> > > > > > > >>> and
> > > > > > > >>>>>>>>>>>>>> Kenneth's
> > > > > > > >>>>>>>>>>>>>> worries about committing users to specific
> > > > > implementations
> > > > > > > is
> > > > > > > >>> in
> > > > > > > >>>>>>>>>>> place.
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> The Spark community has a 3rd party repository
> for
> > > > > useful
> > > > > > > >>>>> libraries
> > > > > > > >>>>>>>>>>> that
> > > > > > > >>>>>>>>>>>>>> for various reasons are not a part of the Apache
> > > Spark
> > > > > > > >> project:
> > > > > > > >>>>>>>>>>>>>> https://spark-packages.org/.
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> Maybe a "common-transformations" package would
> > serve
> > > > > both
> > > > > > > >> users
> > > > > > > >>>>>>> quick
> > > > > > > >>>>>>>>>>>>>> ramp-up and ease-of-use while keeping Beam more
> > > > > > "enabling" ?
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles
> > > > > > > >>>>>>>>>> <klk@google.com.invalid
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> wrote:
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> It seems useful for small scale debugging /
> > demoing
> > > to
> > > > > > have
> > > > > > > >>>>>>>>>>>>>>> Dump.toString(). I think it should be named to
> > > > clearly
> > > > > > > >>> indicate
> > > > > > > >>>>>>> its
> > > > > > > >>>>>>>>>>>>>>> limited
> > > > > > > >>>>>>>>>>>>>>> scope. Maybe other stuff could go in the Dump
> > > > > namespace,
> > > > > > > but
> > > > > > > >>>>>>>>>>>>>>> "Dump.toJson()" would be for humans to read -
> so
> > it
> > > > > > should
> > > > > > > >> be
> > > > > > > >>>>>>> pretty
> > > > > > > >>>>>>>>>>>>>>> printed, not treated as a machine-to-machine
> wire
> > > > > format.
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>> The broader question of representing data in
> JSON
> > > or
> > > > > XML,
> > > > > > > >> etc,
> > > > > > > >>>>> is
> > > > > > > >>>>>>>>>>>> already
> > > > > > > >>>>>>>>>>>>>>> the subject of many mature libraries which are
> > > > already
> > > > > > easy
> > > > > > > >> to
> > > > > > > >>>>> use
> > > > > > > >>>>>>>>>>> with
> > > > > > > >>>>>>>>>>>>>>> Beam.
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>> The more esoteric practice of implicit or
> > > > semi-implicit
> > > > > > > >>>>> coercions
> > > > > > > >>>>>>>>>>> seems
> > > > > > > >>>>>>>>>>>>>>> like it is also already addressed in many ways
> > > > > elsewhere.
> > > > > > > >>>>>>>>>>>>>>> Transform.via(TypeConverter) is basically the
> > same
> > > as
> > > > > > > >>>>>>>>>>>>>>> MapElements.via(<lambda>) and also easy to use
> > with
> > > > > Beam.
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>> In both of the last cases, there are many
> > > reasonable
> > > > > > > >>> approaches,
> > > > > > > >>>>>>> and
> > > > > > > >>>>>>>>>>> we
> > > > > > > >>>>>>>>>>>>>>> shouldn't commit our users to one of them.
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik
> > > > > > > >>>>>>>>>>> <lcwik@google.com.invalid
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>> wrote:
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>> The suggestions you give seem good except for
> the
> > > > > XML
> > > > > > > >>> cases.
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> Might want to have the XML be a document per
> > line
> > > > > > similar
> > > > > > > >> to
> > > > > > > >>>>> the
> > > > > > > >>>>>>>>>>> JSON
> > > > > > > >>>>>>>>>>>>>>>> examples you have been giving.
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 12:00 PM, Jesse
> Anderson
> > <
> > > > > > > >>>>>>>>>>>> jesse@smokinghand.com>
> > > > > > > >>>>>>>>>>>>>>>> wrote:
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> @lukasz Agreed there would have to be KV
> > > handling. I
> > > > > was
> > > > > > > >> more
> > > > > > > > >>>>>>> thinking
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> that
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> whatever the addition, it shouldn't just
> handle
> > > KV.
> > > > It
> > > > > > > >> should
> > > > > > > >>>>>>>>>> handle
> > > > > > > >>>>>>>>>>>>>>>>> Iterables, Lists, Sets, and KVs.
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> For JSON and XML, I wonder if we'd be able to
> > > give
> > > > > > > someone
> > > > > > > >>>>>>>>>>> something
> > > > > > > >>>>>>>>>>>>>>>>> general purpose enough that you would just
> end
> > up
> > > > > > writing
> > > > > > > >>> your
> > > > > > > >>>>>>> own
> > > > > > > >>>>>>>>>>>> code
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> to
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> handle it anyway.
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> Here are some ideas on what it could look
> like
> > > > with a
> > > > > > > >> method
> > > > > > > >>>>> and
> > > > > > > >>>>>>>>>>> the
> > > > > > > >>>>>>>>>>>>>>>>> resulting string output:
> > > > > > > >>>>>>>>>>>>>>>>> *Stringify.toJSON()*
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> With KV:
> > > > > > > >>>>>>>>>>>>>>>>> {"key": "value"}
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> With Iterables:
> > > > > > > >>>>>>>>>>>>>>>>> ["one", "two", "three"]
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> *Stringify.toXML("rootelement")*
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> With KV:
> > > > > > > >>>>>>>>>>>>>>>>> <rootelement key=value />
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> With Iterables:
> > > > > > > >>>>>>>>>>>>>>>>> <rootelement>
> > > > > > > >>>>>>>>>>>>>>>>>   <item>one</item>
> > > > > > > >>>>>>>>>>>>>>>>>   <item>two</item>
> > > > > > > >>>>>>>>>>>>>>>>>   <item>three</item>
> > > > > > > >>>>>>>>>>>>>>>>> </rootelement>
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> *Stringify.toDelimited(",")*
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> With KV:
> > > > > > > >>>>>>>>>>>>>>>>> key,value
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> With Iterables:
> > > > > > > >>>>>>>>>>>>>>>>> one,two,three
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> Do you think that would strike a good balance
> > > > between
> > > > > > > >>> reusable
> > > > > > > >>>>>>>>>> code
> > > > > > > >>>>>>>>>>>> and
> > > > > > > >>>>>>>>>>>>>>>>> writing your own for more difficult
> formatting?
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> Thanks,
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> Jesse
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik
> > > > > > > >>>>>>>>>>> <lcwik@google.com.invalid
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> wrote:
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> Jesse, I believe if one format gets special
> > > > treatment
> > > > > > in
> > > > > > > >>>>> TextIO,
> > > > > > > >>>>>>>>>>>> people
> > > > > > > > >>>>>>>>>>>>>>>>> will then ask why JSON, XML, ... are not also
> > > > > > > > >> supported.
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> Also, the example that you provide is using
> the
> > > > fact
> > > > > > that
> > > > > > > >>> the
> > > > > > > >>>>>>>>>> input
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> format
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> is an Iterable<Item>. You had posted a
> question
> > > > about
> > > > > > > >> using
> > > > > > > >>> KV
> > > > > > > >>>>>>>>>> with
> > > > > > > >>>>>>>>>>>>>>>>> TextIO.Write which wouldn't align with the
> > > proposed
> > > > > > input
> > > > > > > >>>>> format
> > > > > > > >>>>>>>>>>> and
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> still
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> would require to write a type conversion
> > > function,
> > > > > this
> > > > > > > >> time
> > > > > > > >>>>>>> from
> > > > > > > >>>>>>>>>>> KV
> > > > > > > >>>>>>>>>>>> to
> > > > > > > >>>>>>>>>>>>>>>>> Iterable<Item> instead of KV to string.
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:50 AM, Jesse
> Anderson
> > <
> > > > > > > >>>>>>>>>>>> jesse@smokinghand.com>
> > > > > > > >>>>>>>>>>>>>>>>> wrote:
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> Lukasz,
> > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>> I don't think you'd need complicated logic
> for
> > > > > > > >>> TextIO.Write.
> > > > > > > >>>>>>> For
> > > > > > > >>>>>>>>>>> CSV
> > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> the
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> call would look like:
> > > > > > > >>>>>>>>>>>>>>>>>> Stringify.to("", ",", "\n");
> > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>> Where the arguments would be
> > > Stringify.to(prefix,
> > > > > > > >>> delimiter,
> > > > > > > >>>>>>>>>>>> suffix).
> > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>> The code would be something like:
> > > > > > > >>>>>>>>>>>>>>>>>> StringBuffer buffer = new
> > StringBuffer(prefix);
> > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>> for (Item item : list) {
> > > > > > > >>>>>>>>>>>>>>>>>>   buffer.append(item.toString());
> > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>   if(notLast) {
> > > > > > > >>>>>>>>>>>>>>>>>>     buffer.append(delimiter);
> > > > > > > >>>>>>>>>>>>>>>>>>   }
> > > > > > > >>>>>>>>>>>>>>>>>> }
> > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>> buffer.append(suffix);
> > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>> c.output(buffer.toString());
> > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>> That would allow you to do the basic CSV,
> TSV,
> > > and
> > > > > > other
> > > > > > > >>>>>>> formats
> > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> without
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> complicated logic. The same sort of thing
> could
> > > be
> > > > > done
> > > > > > > >> for
> > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> TextIO.Write.
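A minimal runnable version of the prefix/delimiter/suffix idea quoted above, in plain Java. DelimitedSketch and join are hypothetical names chosen for illustration, not an existing Beam API; the undefined notLast check from the quoted pseudocode is replaced here by asking the iterator whether more items follow.

```java
import java.util.Iterator;

// Hypothetical helper, not part of Beam: formats one element made of several items.
final class DelimitedSketch {
    private DelimitedSketch() {}

    // Emits prefix, then each item's string form separated by delimiter, then suffix.
    static String join(Iterable<?> items, String prefix, String delimiter, String suffix) {
        StringBuilder buffer = new StringBuilder(prefix);
        Iterator<?> it = items.iterator();
        while (it.hasNext()) {
            // String.valueOf tolerates null items, rendering them as "null".
            buffer.append(String.valueOf(it.next()));
            // Only add the delimiter when another item follows.
            if (it.hasNext()) {
                buffer.append(delimiter);
            }
        }
        buffer.append(suffix);
        return buffer.toString();
    }
}
```

With arguments ("", ",", "\n") this gives a CSV line and with ("", "\t", "\n") a TSV line. Quoting and escaping of field values are deliberately left out, which is the limitation raised elsewhere in this thread.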
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>> Thanks,
> > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>> Jesse
> > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik
> > > > > > > >>>>>>>>>>>> <lcwik@google.com.invalid
> > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> wrote:
> > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>> The conversion from object to string will
> have
> > > > uses
> > > > > > > >> outside
> > > > > > > >>>>> of
> > > > > > > >>>>>>>>>>> just
> > > > > > > >>>>>>>>>>>>>>>>>>> TextIO.Write so it seems logical that we
> > would
> > > > want
> > > > > > to
> > > > > > > >>> have
> > > > > > > >>>>> a
> > > > > > > >>>>>>>>>>> ParDo
> > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>> do
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> the
> > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>> conversion.
> > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>> Text file formats have a lot of variance,
> > even
> > > if
> > > > > you
> > > > > > > >>>>> consider
> > > > > > > >>>>>>>>>>> the
> > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>> subset
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>> of CSV like formats where it could have
> fixed
> > > > width
> > > > > > > >> fields,
> > > > > > > >>>>> or
> > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>> escaping
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> and
> > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>> quoting around other fields, or headers
> that
> > > > should
> > > > > > be
> > > > > > > >>>>> placed
> > > > > > > >>>>>>> at
> > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>> the
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> top.
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>> Having all these format conversions within
> > > > > > TextIO.Write
> > > > > > > >>>>> seems
> > > > > > > >>>>>>>>>>> like
> > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>> a
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> lot
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>> of
> > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>> logic to contain in that transform which
> > should
> > > > > just
> > > > > > > >> focus
> > > > > > > >>>>> on
> > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>> writing
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> to
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>> files.
> > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 8:15 AM, Jesse
> > Anderson
> > > <
> > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>> jesse@smokinghand.com>
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> wrote:
> > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>> This is a thread moved over from the user
> > > mailing
> > > > > > list.
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>>> I think there needs to be a way to
> convert a
> > > > > > > >>>>> PCollection<KV>
> > > > > > > >>>>>>> to
> > > > > > > > >>>>>>>>>>>>>>>>>>>> PCollection<String>.
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>>> To do a minimal WordCount, you have to
> > > manually
> > > > > > > convert
> > > > > > > >>> the
> > > > > > > >>>>>>> KV
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>> to a
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> String:
> > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>>>         p
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>  .apply(TextIO.Read.from("playing_cards.tsv"))
> > > > > > > >>>>>>>>>>>>>>>>>>>>
>  .apply(Regex.split("\\W+"))
> > > > > > > >>>>>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> > > > > > > >>>>>>>>>>>>>>>>>>>> *
> .apply(MapElements.via((KV<
> > > > > String,
> > > > > > > >> Long>
> > > > > > > >>>>>>>>>> count)
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>> ->*
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> *                            count.getKey() +
> > > ":" +
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>> count.getValue()*
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> *                        ).withOutputType(
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>> TypeDescriptors.strings()))*
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to
> > > > > > > >>>>>>> ("output/stringcounts"));
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>>> This code really should be something like:
> > > > > > > >>>>>>>>>>>>>>>>>>>>         p
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>  .apply(TextIO.Read.from("playing_cards.tsv"))
> > > > > > > >>>>>>>>>>>>>>>>>>>>
>  .apply(Regex.split("\\W+"))
> > > > > > > >>>>>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> > > > > > > >>>>>>>>>>>>>>>>>>>> *
> > .apply(ToString.stringify())*
> > > > > > > >>>>>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to
> > > > > > > >>>>>>>>> ("output/stringcounts"));
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>>> To summarize the discussion:
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>>>    - JA: Add a method to
> StringDelegateCoder
> > > to
> > > > > > output
> > > > > > > >>> any
> > > > > > > >>>>> KV
> > > > > > > >>>>>>>>>> or
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>> list
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>    - JA and DH: Add a SimpleFunction that
> takes
> > > a
> > > > > type
> > > > > > > >> and
> > > > > > > >>>>> runs
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>> toString()
> > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>>>    on it:
> > > > > > > >>>>>>>>>>>>>>>>>>>>    class ToStringFn<InputT> extends
> > > > > > > >>> SimpleFunction<InputT,
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>> String>
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> {
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>>        public String apply(InputT
> > input) {
> > > > > > > >>>>>>>>>>>>>>>>>>>>            return input.toString();
> > > > > > > >>>>>>>>>>>>>>>>>>>>        }
> > > > > > > >>>>>>>>>>>>>>>>>>>>    }
> > > > > > > >>>>>>>>>>>>>>>>>>>>    - JB: Add a general purpose type
> > converter
> > > > like
> > > > > > in
> > > > > > > >>>>> Apache
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>> Camel.
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>    - JA: Add Object support to TextIO.Write
> that
> > > > would
> > > > > > > >> write
> > > > > > > >>>>> out
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>> the
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>    toString of any Object.
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>>> My thoughts:
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>>> Is converting to a PCollection<String>
> > mostly
> > > > > needed
> > > > > > > >> when
> > > > > > > >>>>>>>>>> you're
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>> using
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>> TextIO.Write? Will a general purpose
> transform
> > > > only
> > > > > > work
> > > > > > > >> in
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>> certain
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> cases
> > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>> and you'll normally have to write custom
> code to
> > > > > format
> > > > > > > the
> > > > > > > >>>>>>> strings
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>> the
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> way
> > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>> you want them?
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>>> IMHO, it's yes to both. I'd prefer to add
> > > Object
> > > > > > > >> support
> > > > > > > >>> to
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>> TextIO.Write
> > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>> or
> > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>>> a SimpleFunction that takes a delimiter as
> > an
> > > > > > > argument.
> > > > > > > >>>>>>> Making
> > > > > > > >>>>>>>>>> a
> > > > > > > >>>>>>>>>>>>>>>>>>>> SimpleFunction that's able to specify a
> > > > delimiter
> > > > > > (and
> > > > > > > >>>>>>> perhaps
> > > > > > > >>>>>>>>>> a
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>> prefix
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>> and
> > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>>> suffix) should cover the majority of
> formats
> > > and
> > > > > > > cases.
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>>> Thanks,
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>>> Jesse
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> --
> > > > > > > >>>>>>>>>>>>> Jean-Baptiste Onofré
> > > > > > > >>>>>>>>>>>>> jbonofre@apache.org
> > > > > > > >>>>>>>>>>>>> http://blog.nanthrax.net
> > > > > > > >>>>>>>>>>>>> Talend - http://www.talend.com
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>

Re: PCollection to PCollection Conversion

Posted by Ben Chambers <bc...@google.com.INVALID>.

Dan's proposal to move forward with a simple (future-proofed) version of
the ToString transform and Javadoc, and add specific features via follow-up
PRs.

On Thu, Dec 29, 2016 at 3:53 PM Jesse Anderson <je...@smokinghand.com>
wrote:

> @Ben which idea do you like?
>
> On Thu, Dec 29, 2016 at 3:20 PM Ben Chambers <bchambers@google.com.invalid
> >
> wrote:
>
> > I like that idea, with the caveat that we should probably come up with a
> > better name. Perhaps "ToString.elements()" and ToString.Elements or
> > something? Calling one the "default" and using "create" for it seems
> > moderately non-future proof.
> >
> > On Thu, Dec 29, 2016 at 3:17 PM Dan Halperin <dhalperi@google.com.invalid
> >
> > wrote:
> >
> > > On Thu, Dec 29, 2016 at 2:10 PM, Jesse Anderson <jesse@smokinghand.com
> >
> > > wrote:
> > >
> > > > I agree MapElements isn't hard to use. I think there is a demand for
> > this
> > > > built-in conversion.
> > > >
> > > > My thought on the formatter is that, worst case, we could do runtime
> > type
> > > > checking. It would be ugly and not as performant, but it should work.
> > As
> > > > we've said, we'd point them to MapElements for better code. We'd
> write
> > > the
> > > > JavaDoc accordingly.
> > > >
> > >
> > > I think it will be good to see these proposals in PR form. I would stay
> > far
> > > away from reflection and varargs if possible, but properly-typed bits
> of
> > > code (possibly exposed as SerializableFunctions in ToString?) would
> > > probably make sense.
> > >
> > > In the short-term, I can't find anyone arguing against a
> > ToString.create()
> > > that simply does input.toString().
> > >
> > > To get started, how about we ask Vikas to clean up the PR to be more
> > > future-proof for now? Aka make `ToString` itself not a PTransform,  but
> > > instead ToString.create() returns ToString.Default which is a private
> > class
> > > implementing what ToString is now (PTransform<T, String>, wrapping
> > > MapElements).
> > >
> > > Then we can send PRs adding new features to that.
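The future-proofing suggested above (a holder class with a factory method that returns a private implementation) can be sketched without any Beam dependency. Everything here is an assumption for illustration: FutureProofToStringSketch is a hypothetical name, java.util.function.Function stands in for a PTransform, and elements() is used as the factory name only because it was floated in this thread.

```java
import java.util.function.Function;

// Hypothetical sketch: the outer class is only a namespace for factory methods,
// so new variants can be added later without changing what existing callers see.
final class FutureProofToStringSketch {
    private FutureProofToStringSketch() {}

    // Factory for the plain conversion; callers never name the implementing class.
    static <T> Function<T, String> elements() {
        return new Default<>();
    }

    // Private implementation, free to change or be replaced in follow-up work.
    private static final class Default<T> implements Function<T, String> {
        @Override
        public String apply(T input) {
            return input.toString();
        }
    }
}
```

Because callers depend only on the factory's return type, adding further factories later would not be a breaking change.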
> > >
> > > IME and to Ben's point, these will mostly be used in development. Some
> of
> > > > our assumptions will break down when programmers aren't the ones
> using
> > > > Beam. I can see from the user traffic already that not everyone using
> > > Beam
> > > > is a programmer and they'll need classes like this to be productive.
> > >
> > >
> > > > On Thu, Dec 29, 2016 at 1:46 PM Dan Halperin
> > <dhalperi@google.com.invalid
> > > >
> > > > wrote:
> > > >
> > > > On Thu, Dec 29, 2016 at 1:36 PM, Jesse Anderson <
> jesse@smokinghand.com
> > >
> > > > wrote:
> > > >
> > > > > I prefer JB's take. I think there should be three overloaded
> methods
> > on
> > > > the
> > > > > class. I like Vikas' name ToString. The methods for a simple
> > conversion
> > > > > should be:
> > > > >
> > > > > ToString.strings() - Outputs the .toString() of the objects in the
> > > > > PCollection
> > > > > ToString.strings(String delimiter) - Outputs the .toString() of
> KVs,
> > > > Lists,
> > > > > etc with the delimiter between every entry
> > > > > ToString.formatted(String format) - Outputs the formatted
> > > > > <
> https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html>
> > > > > string
> > > > > with the object passed in. For objects made up of different parts
> > like
> > > > KVs,
> > > > > each one is passed in as separate toString() of a varargs.
> > > > >
> > > >
> > > > Riffing a little, with some types:
> > > >
> > > > ToString.<T>of() -- PTransform<T, String> that is equivalent to a
> ParDo
> > > > that takes in a T and outputs T.toString().
> > > >
> > > > ToString.<K,V>kv(String delimiter) -- PTransform<KV<K, V>, String>
> that
> > > is
> > > > equivalent to a ParDo that takes in a KV<K,V> and outputs
> > > > kv.getKey().toString() + delimiter + kv.getValue().toString()
> > > >
> > > > ToString.<T>iterable(String delimiter) -- PTransform<? extends
> > > Iterable<T>,
> > > > String> that is equivalent to a ParDo that takes in an Iterable<T>
> and
> > > > outputs the iterable[0] + delimiter + iterable[1] + delimiter + ... +
> > > > delimiter + iterable[N-1]
> > > >
> > > > ToString.<T>custom(SerializableFunction<T, String> formatter) ?
> > > >
> > > > > The last one is just MapElements.via, except you don't need to set the
> > > > output type.
> > > >
> > > > I don't see a way to make the generic .formatted() that you propose
> > that
> > > > just works with anything "made of different parts".
> > > >
> > > > > I think adding too many overrides beyond "of" and "custom" is
> > > opening
> > > > > up a Pandora's Box. The KV one might want to have left and right
> > > > delimiters, might want to take custom formatters for K and V, etc.
> etc.
> > > The
> > > > iterable one might want to have a special configuration for an empty
> > > > iterable. So I'm inclined towards simplicity with the awareness that
> > > > MapElements.via is just not that hard to use.
> > > >
> > > > Dan
> > > >
> > > >
> > > > >
> > > > > I think doing these three methods would cover every simple and
> > advanced
> > > > > "simple conversions." As JB says, we'll need other specific
> > converters
> > > > for
> > > > > other formats like XML.
> > > > >
> > > > > I'd really like to see this class in the next version of Beam. What
> > > does
> > > > > everyone think of the class name, methods name, and method
> operations
> > > so
> > > > we
> > > > > can have Vikas finish up?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jesse
> > > > >
> > > > > On Wed, Dec 28, 2016 at 12:28 PM Jean-Baptiste Onofré <
> > jb@nanthrax.net
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Vikas,
> > > > > >
> > > > > > > did you take a look at:
> > > > > >
> > > > > >
> > > > > > https://github.com/jbonofre/beam/tree/DATAFORMAT/sdks/
> > > > > java/extensions/dataformat
> > > > > >
> > > > > > You can see KV2String and ToString could be part of this
> extension.
> > > > > > I'm also using JAXB for XML and Jackson for JSON
> > > > > > marshalling/unmarshalling. I'm planning to deal with Avro
> > > > > (IndexedRecord).
> > > > > >
> > > > > > Regards
> > > > > > JB
> > > > > >
> > > > > > On 12/28/2016 08:37 PM, Vikas Kedigehalli wrote:
> > > > > > > Hi All,
> > > > > > >
> > > > > > >   Not being aware of the discussion here, I sent out a PR
> > > > > > > <https://github.com/apache/beam/pull/1704> but JB and others
> > > > directed
> > > > > > me to
> > > > > > > this thread. Having converted PCollection<T> to
> > PCollection<String>
> > > > > > several
> > > > > > > times, I feel something like 'ToString' transform is common
> > enough
> > > to
> > > > > be
> > > > > > > part of the core. What do you all think?
> > > > > > >
> > > > > > > Also, if someone else is already working on or interested in
> > > tackling
> > > > > > this,
> > > > > > > then I am happy to discard the PR.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Vikas
> > > > > > >
> > > > > > > On Tue, Dec 13, 2016 at 1:56 AM, Amit Sela <
> amitsela33@gmail.com
> > >
> > > > > wrote:
> > > > > > >
> > > > > > >> It seems that there were a lot of good points raised here,
> and I
> > > > tend
> > > > > to
> > > > > > >> agree that something as trivial and lean as "ToString" should
> > be a
> > > > > part
> > > > > > of
> > > > > > > >> core.
> > > > > > >> I'm particularly fond of makeString(prefix, toString, suffix)
> in
> > > > > various
> > > > > > >> combinations (Scala-like).
> > > > > > >> For "fromString", I think JB has a good point leveraging JAXB
> > and
> > > > > > Jackson -
> > > > > > >> though I think this should be in extensions as it is not as
> lean
> > > as
> > > > > > >> toString.
> > > > > > >>
> > > > > > >> Thanks,
> > > > > > >> Amit
> > > > > > >>
> > > > > > >> On Wed, Nov 30, 2016 at 5:13 AM Jean-Baptiste Onofré <
> > > > jb@nanthrax.net
> > > > > >
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >>> Hi Jesse,
> > > > > > >>>
> > > > > > >>> yes, I started something there (using JAXB and Jackson). Let
> me
> > > > > polish
> > > > > > >>> and push.
> > > > > > >>>
> > > > > > >>> Regards
> > > > > > >>> JB
> > > > > > >>>
> > > > > > >>> On 11/29/2016 10:00 PM, Jesse Anderson wrote:
> > > > > > >>>> I went through the string conversions. Do you have an
> example
> > of
> > > > > > >> writing
> > > > > > >>>> out XML/JSON/etc too?
> > > > > > >>>>
> > > > > > >>>> On Tue, Nov 29, 2016 at 3:46 PM Jean-Baptiste Onofré <
> > > > > jb@nanthrax.net
> > > > > > >
> > > > > > >>>> wrote:
> > > > > > >>>>
> > > > > > >>>>> Hi Jesse,
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>> https://github.com/jbonofre/incubator-beam/tree/
> > > > > DATAFORMAT/sdks/java/
> > > > > > >> extensions/dataformat
> > > > > > >>>>>
> > > > > > >>>>> it's very simple and stupid and of course not complete at
> all
> > > (I
> > > > > have
> > > > > > >>>>> other commits but not merged as they need some polishing),
> > but
> > > as
> > > > I
> > > > > > >>>>> said, it's a base of discussion.
> > > > > > >>>>>
> > > > > > >>>>> Regards
> > > > > > >>>>> JB
> > > > > > >>>>>
> > > > > > >>>>> On 11/29/2016 09:23 PM, Jesse Anderson wrote:
> > > > > > >>>>>> @jb Sounds good. Just let us know once you've pushed.
> > > > > > >>>>>>
> > > > > > >>>>>> On Tue, Nov 29, 2016 at 2:54 PM Jean-Baptiste Onofré <
> > > > > > >> jb@nanthrax.net>
> > > > > > >>>>>> wrote:
> > > > > > >>>>>>
> > > > > > >>>>>>> Good point Eugene.
> > > > > > >>>>>>>
> > > > > > >>>>>>> Right now, it's a DoFn collection to experiment a bit (a
> > pure
> > > > > > >>>>>>> extension). It's pretty stupid ;)
> > > > > > >>>>>>>
> > > > > > > >>>>>>> But, you are right, depending on the direction of such
> > > extension,
> > > > it
> > > > > > >>> could
> > > > > > >>>>>>> cover more use cases (even if it's not my first intention
> > > ;)).
> > > > > > >>>>>>>
> > > > > > >>>>>>> Let me push the branch (pretty small) as an illustration,
> > and
> > > > in
> > > > > > the
> > > > > > >>>>>>> mean time, I'm preparing a document (more focused on the
> > use
> > > > > > cases).
> > > > > > >>>>>>>
> > > > > > >>>>>>> WDYT ?
> > > > > > >>>>>>>
> > > > > > >>>>>>> Regards
> > > > > > >>>>>>> JB
> > > > > > >>>>>>>
> > > > > > >>>>>>> On 11/29/2016 08:47 PM, Eugene Kirpichov wrote:
> > > > > > >>>>>>>> Hi JB,
> > > > > > >>>>>>>> Depending on the scope of what you want to ultimately
> > > > accomplish
> > > > > > >> with
> > > > > > >>>>>>> this
> > > > > > >>>>>>>> extension, I think it may make sense to write a proposal
> > > > > document
> > > > > > >> and
> > > > > > >>>>>>>> discuss it.
> > > > > > >>>>>>>> If it's just a collection of utility DoFn's for various
> > > > > > >> well-defined
> > > > > > >>>>>>>> source/target format pairs, then that's probably not
> > needed,
> > > > but
> > > > > > if
> > > > > > >>>>> it's
> > > > > > >>>>>>>> anything more, then I think it is.
> > > > > > >>>>>>>> That will help avoid a lot of churn if people propose
> > > > reasonable
> > > > > > >>>>>>>> significant changes.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> On Tue, Nov 29, 2016 at 11:15 AM Jean-Baptiste Onofré <
> > > > > > >>> jb@nanthrax.net
> > > > > > >>>>>>
> > > > > > >>>>>>>> wrote:
> > > > > > >>>>>>>>
> > > > > > > >>>>>>>>> By the way Jesse, I'm going to push my DATAFORMAT branch on
> my
> > > > > github
> > > > > > >>> and I
> > > > > > >>>>>>>>> will post on the dev mailing list when done.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> Regards
> > > > > > >>>>>>>>> JB
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> On 11/29/2016 07:01 PM, Jesse Anderson wrote:
> > > > > > >>>>>>>>>> I want to bring this thread back up since we've had
> time
> > > to
> > > > > > think
> > > > > > >>>>> about
> > > > > > >>>>>>>>> it
> > > > > > >>>>>>>>>> more and make a plan.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> I think a format-specific converter will be more time
> > > > > consuming
> > > > > > >>> task
> > > > > > >>>>>>> than
> > > > > > >>>>>>>>>> we originally thought. It'd have to be a writer that
> > takes
> > > > > > >> another
> > > > > > >>>>>>> writer
> > > > > > >>>>>>>>>> as a parameter.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> I think a string converter can be done as a simple
> > > > transform.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> I think we should start with a simple string converter
> > and
> > > > > plan
> > > > > > >>> for a
> > > > > > >>>>>>>>>> format-specific writer.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> What are your thoughts?
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> Thanks,
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> Jesse
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> On Thu, Nov 10, 2016 at 10:33 AM Jesse Anderson <
> > > > > > >>>>> jesse@smokinghand.com
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>>> wrote:
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> I was thinking about what the outputs would look like
> > last
> > > > > > >> night. I
> > > > > > >>>>>>>>>> realized that more complex formats like JSON and XML
> may
> > > or
> > > > > may
> > > > > > >> not
> > > > > > >>>>>>>>> output
> > > > > > >>>>>>>>>> the data in a valid format.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> Doing a direct conversion on unbounded collections
> would
> > > > work
> > > > > > >> just
> > > > > > >>>>>>> fine.
> > > > > > >>>>>>>>>> They're self-contained. For writing out bounded
> > > collections,
> > > > > > >> that's
> > > > > > >>>>>>> where
> > > > > > >>>>>>>>>> we'll hit the issues. This changes the uber conversion
> > > > > transform
> > > > > > >>>>> into a
> > > > > > >>>>>>>>>> transform that needs to be a writer.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> If a transform executes a JSON conversion on a per
> > element
> > > > > > basis,
> > > > > > >>>>> we'd
> > > > > > >>>>>>>>> get
> > > > > > >>>>>>>>>> this:
> > > > > > >>>>>>>>>> {
> > > > > > >>>>>>>>>> "key": "value"
> > > > > > >>>>>>>>>> }, {
> > > > > > >>>>>>>>>> "key": "value"
> > > > > > >>>>>>>>>> },
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> That isn't valid JSON.
> > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> The conversion transform would need to do several
> > > > things
> > > > > > >> when
> > > > > > >>>>>>>>> writing
> > > > > > >>>>>>>>>> out a file. It would need to add brackets for an
> array.
> > > Now
> > > > we
> > > > > > >>> have:
> > > > > > >>>>>>>>>> [
> > > > > > >>>>>>>>>> {
> > > > > > >>>>>>>>>> "key": "value"
> > > > > > >>>>>>>>>> }, {
> > > > > > >>>>>>>>>> "key": "value"
> > > > > > >>>>>>>>>> },
> > > > > > >>>>>>>>>> ]
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> We still don't have valid JSON. We have to remove the
> > last
> > > > > comma
> > > > > > >> or
> > > > > > >>>>>>> have
> > > > > > >>>>>>>>>> the uber transform start putting in the commas, except
> > for
> > > > the
> > > > > > >> last
> > > > > > >>>>>>>>> element.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> [
> > > > > > >>>>>>>>>> {
> > > > > > >>>>>>>>>> "key": "value"
> > > > > > >>>>>>>>>> }, {
> > > > > > >>>>>>>>>> "key": "value"
> > > > > > >>>>>>>>>> }
> > > > > > >>>>>>>>>> ]
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> Only by doing this do we have valid JSON.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> I'd argue we'd have a similar issue with XML. Some
> > parsers
> > > > > > >> require
> > > > > > >>> a
> > > > > > >>>>>>> root
> > > > > > >>>>>>>>>> element for everything. The uber transform would have
> to
> > > put
> > > > > the
> > > > > > >>> root
> > > > > > >>>>>>>>>> element tags at the beginning and end of the file.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> On Wed, Nov 9, 2016 at 11:36 PM Manu Zhang <
> > > > > > >>> owenzhang1990@gmail.com>
> > > > > > >>>>>>>>> wrote:
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> I would love to see a lean core and abundant
> Transforms
> > at
> > > > the
> > > > > > >> same
> > > > > > >>>>>>> time.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> Maybe we can look at what Confluent <
> > > > > > >>> https://github.com/confluentinc
> > > > > > >>>>>>
> > > > > > >>>>>>>>> does
> > > > > > >>>>>>>>>> for kafka-connect. They have official extensions
> support
> > > for
> > > > > > >> JDBC,
> > > > > > >>>>> HDFS
> > > > > > >>>>>>>>> and
> > > > > > >>>>>>>>>> ElasticSearch under https://github.com/confluentinc.
> > They
> > > > put
> > > > > > >> them
> > > > > > >>>>>>> along
> > > > > > >>>>>>>>>> with other community extensions on
> > > > > > >>>>>>>>>> https://www.confluent.io/product/connectors/ for
> > > > visibility.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> Although not a commercial company, can we have a
> GitHub
> > > user
> > > > > > like
> > > > > > >>>>>>>>>> beam-community to host projects we build around beam
> but
> > > not
> > > > > > >>> suitable
> > > > > > >>>>>>> for
> > > > > > >>>>>>>>>> https://github.com/apache/incubator-beam. In the
> > future,
> > > we
> > > > > may
> > > > > > >>> have
> > > > > > >>>>>>>>>> beam-algebra like http://github.com/twitter/algebird
> > for
> > > > > > algebra
> > > > > > >>>>>>>>> operations
> > > > > > >>>>>>>>>> and beam-ml / beam-dl for machine learning / deep
> > > learning.
> > > > > > Also,
> > > > > > >>>>> there
> > > > > > > >>>>>>>>>> will be beam related projects elsewhere
> maintained
> > by
> > > > > other
> > > > > > >>>>>>>>>> communities. We can put all of them on the
> beam-website
> > or
> > > > > like
> > > > > > >>> spark
> > > > > > >>>>>>>>>> packages as mentioned by Amit.
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> My $0.02
> > > > > > >>>>>>>>>> Manu
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles
> > > > > > >>>>> <klk@google.com.invalid
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>>> wrote:
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>>> On this point from Amit and Ismaël, I agree: we could
> > > > benefit
> > > > > > >>> from a
> > > > > > >>>>>>>>> place
> > > > > > >>>>>>>>>>> for miscellaneous non-core helper transformations.
> > > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>> We have sdks/java/extensions but it is organized as
> > > > separate
> > > > > > >>>>>>> artifacts.
> > > > > > >>>>>>>>> I
> > > > > > >>>>>>>>>>> think that is fine, considering the nature of Join
> and
> > > > > > >> SortValues.
> > > > > > >>>>> But
> > > > > > >>>>>>>>> for
> > > > > > > >>>>>>>>>>> simpler transforms, importing one artifact per tiny
> > > > transform
> > > > > > is
> > > > > > >>> too
> > > > > > >>>>>>>>> much
> > > > > > >>>>>>>>>>> overhead. It also seems unlikely that we will have
> > enough
> > > > > > >>>>> commonality
> > > > > > >>>>>>>>>> among
> > > > > > >>>>>>>>>>> the transforms to call the artifact anything other
> than
> > > > [some
> > > > > > >>>>> synonym
> > > > > > >>>>>>>>> for]
> > > > > > >>>>>>>>>>> "miscellaneous".
> > > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>> I wouldn't want to take this too far - even though
> the
> > > SDK has
> > > > > many
> > > > > > >>>>>>>>>> transforms*
> > > > > > >>>>>>>>>>> that are not required for the model [1], I like that
> > the
> > > > SDK
> > > > > > >>>>> artifact
> > > > > > >>>>>>>>> has
> > > > > > >>>>>>>>>>> everything a user might need in their "getting
> started"
> > > > phase
> > > > > > of
> > > > > > >>>>> use.
> > > > > > >>>>>>>>> This
> > > > > > >>>>>>>>>>> user-friendliness (the user doesn't care that ParDo
> is
> > > core
> > > > > and
> > > > > > >>> Sum
> > > > > > >>>>> is
> > > > > > >>>>>>>>>> not)
> > > > > > >>>>>>>>>>> plus the difficulty of judging which transforms go
> > where,
> > > > are
> > > > > > >>>>> probably
> > > > > > >>>>>>>>> why
> > > > > > >>>>>>>>>>> we have them mostly all in one place.
> > > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>> Models to look at, off the top of my head, include
> > Pig's
> > > > > > >> PiggyBank
> > > > > > >>>>> and
> > > > > > >>>>>>>>>>> Apex's Malhar. These have different levels of support
> > > > > implied.
> > > > > > >>>>> Others?
> > > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>> Kenn
> > > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>> [1] ApproximateQuantiles, ApproximateUnique, Count,
> > > > Distinct,
> > > > > > >>>>> Filter,
> > > > > > >>>>>>>>>>> FlatMapElements, Keys, Latest, MapElements, Max,
> Mean,
> > > Min,
> > > > > > >>> Values,
> > > > > > >>>>>>>>>> KvSwap,
> > > > > > >>>>>>>>>>> Partition, Regex, Sample, Sum, Top, Values, WithKeys,
> > > > > > >>> WithTimestamps
> > > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>> * at least they are separate classes and not methods
> on
> > > > > > >>> PCollection
> > > > > > >>>>>>> :-)
> > > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>> On Wed, Nov 9, 2016 at 6:03 AM, Ismaël Mejía <
> > > > > > iemejia@gmail.com
> > > > > > >>>
> > > > > > >>>>>>> wrote:
> > > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>>> Nice discussion, and thanks Jesse for bringing this
> > > > subject
> > > > > > >>> back.
> > > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>> I agree 100% with Amit and the idea of having a home
> > for
> > > > > those
> > > > > > >>>>>>>>>> transforms
> > > > > > >>>>>>>>>>>> that are not core enough to be part of the sdk, but
> > that
> > > > we
> > > > > > all
> > > > > > >>> end
> > > > > > >>>>>>> up
> > > > > > >>>>>>>>>>>> re-writing somehow.
> > > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>> This is a needed improvement to be more developer
> > > > friendly,
> > > > > > but
> > > > > > >>>>> also
> > > > > > >>>>>>> as
> > > > > > >>>>>>>>>> a
> > > > > > >>>>>>>>>>>> reference of good practices of Beam development, and
> > for
> > > > > this
> > > > > > >>>>> reason
> > > > > > >>>>>>> I
> > > > > > >>>>>>>>>>>> agree with JB that at this moment it would be better
> > for
> > > > > these
> > > > > > >>>>>>>>>> transforms
> > > > > > >>>>>>>>>>>> to reside in the Beam repository at least for
> > visibility
> > > > > > >> reasons.
> > > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>> One additional question is if these transforms
> > > represent a
> > > > > > >>>>> different
> > > > > > >>>>>>>>> DSL
> > > > > > >>>>>>>>>>> or
> > > > > > >>>>>>>>>>>> if those could be grouped with the current
> extensions
> > > > (e.g.
> > > > > > >> Join
> > > > > > >>>>> and
> > > > > > >>>>>>>>>>>> SortValues) into something more general that we as a
> > > > > community
> > > > > > >>>>> could
> > > > > > >>>>>>>>>>>> maintain, but well even if it is not the case, it
> > would
> > > be
> > > > > > >> really
> > > > > > >>>>>>> nice
> > > > > > >>>>>>>>>> to
> > > > > > >>>>>>>>>>>> start working on something like this.
> > > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>> Ismaël Mejía
> > > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>> On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste
> Onofré
> > <
> > > > > > >>>>>>> jb@nanthrax.net
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>>>> wrote:
> > > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> Related to spark-package, we also have Apache Bahir
> > to
> > > > host
> > > > > > >>>>>>>>>>>>> connectors/transforms for Spark and Flink.
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> IMHO, right now, Beam should host this, not sure if
> > it
> > > > > makes
> > > > > > >>> sense
> > > > > > >>>>>>>>>>>>> directly in the core.
> > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> It reminds me of the "Integration" DSL we discussed in
> > the
> > > > > > >>> technical
> > > > > > >>>>>>>>>>> vision
> > > > > > >>>>>>>>>>>>> document.
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> Regards
> > > > > > >>>>>>>>>>>>> JB
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> On 11/09/2016 11:17 AM, Amit Sela wrote:
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>> I think Jesse has a very good point on one hand,
> > while
> > > > > > Luke's
> > > > > > >>> and
> > > > > > >>>>>>>>>>>>>> Kenneth's
> > > > > > >>>>>>>>>>>>>> worries about committing users to specific
> > > > implementations
> > > > > > is
> > > > > > >>> in
> > > > > > >>>>>>>>>>> place.
> > > > > > >>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>> The Spark community has a 3rd party repository for
> > > > useful
> > > > > > >>>>> libraries
> > > > > > >>>>>>>>>>> that
> > > > > > >>>>>>>>>>>>>> for various reasons are not a part of the Apache
> > Spark
> > > > > > >> project:
> > > > > > >>>>>>>>>>>>>> https://spark-packages.org/.
> > > > > > >>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>> Maybe a "common-transformations" package would
> serve
> > > > both
> > > > > > >> users
> > > > > > >>>>>>> quick
> > > > > > >>>>>>>>>>>>>> ramp-up and ease-of-use while keeping Beam more
> > > > > "enabling" ?
> > > > > > >>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles
> > > > > > >>>>>>>>>> <klk@google.com.invalid
> > > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>> wrote:
> > > > > > >>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>> It seems useful for small scale debugging /
> demoing
> > to
> > > > > have
> > > > > > >>>>>>>>>>>>>>> Dump.toString(). I think it should be named to
> > > clearly
> > > > > > >>> indicate
> > > > > > >>>>>>> its
> > > > > > >>>>>>>>>>>>>>> limited
> > > > > > >>>>>>>>>>>>>>> scope. Maybe other stuff could go in the Dump
> > > > namespace,
> > > > > > but
> > > > > > >>>>>>>>>>>>>>> "Dump.toJson()" would be for humans to read - so
> it
> > > > > should
> > > > > > >> be
> > > > > > >>>>>>> pretty
> > > > > > >>>>>>>>>>>>>>> printed, not treated as a machine-to-machine wire
> > > > format.
> > > > > > >>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>> The broader question of representing data in JSON
> > or
> > > > XML,
> > > > > > >> etc,
> > > > > > >>>>> is
> > > > > > >>>>>>>>>>>> already
> > > > > > >>>>>>>>>>>>>>> the subject of many mature libraries which are
> > > already
> > > > > easy
> > > > > > >> to
> > > > > > >>>>> use
> > > > > > >>>>>>>>>>> with
> > > > > > >>>>>>>>>>>>>>> Beam.
> > > > > > >>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>> The more esoteric practice of implicit or
> > > semi-implicit
> > > > > > >>>>> coercions
> > > > > > >>>>>>>>>>> seems
> > > > > > >>>>>>>>>>>>>>> like it is also already addressed in many ways
> > > > elsewhere.
> > > > > > >>>>>>>>>>>>>>> Transform.via(TypeConverter) is basically the
> same
> > as
> > > > > > >>>>>>>>>>>>>>> MapElements.via(<lambda>) and also easy to use
> with
> > > > Beam.
> > > > > > >>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>> In both of the last cases, there are many
> > reasonable
> > > > > > >>> approaches,
> > > > > > >>>>>>> and
> > > > > > >>>>>>>>>>> we
> > > > > > >>>>>>>>>>>>>>> shouldn't commit our users to one of them.
> > > > > > >>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik
> > > > > > >>>>>>>>>>> <lcwik@google.com.invalid
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>> wrote:
> > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>> The suggestions you give seem good except for the
> > > > XML
> > > > > > >>> cases.
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> Might want to have the XML be a document per
> line
> > > > > similar
> > > > > > >> to
> > > > > > >>>>> the
> > > > > > >>>>>>>>>>> JSON
> > > > > > >>>>>>>>>>>>>>>> examples you have been giving.
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson
> <
> > > > > > >>>>>>>>>>>> jesse@smokinghand.com>
> > > > > > >>>>>>>>>>>>>>>> wrote:
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> @lukasz Agreed there would have to be KV
> > handling. I
> > > > was
> > > > > > >> more
> > > > > > > >>>>>>> thinking
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> that
> > > > > > >>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> whatever the addition, it shouldn't just handle
> > KV.
> > > It
> > > > > > >> should
> > > > > > >>>>>>>>>> handle
> > > > > > >>>>>>>>>>>>>>>>> Iterables, Lists, Sets, and KVs.
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> For JSON and XML, I wonder if we'd be able to
> > give
> > > > > > someone
> > > > > > >>>>>>>>>>> something
> > > > > > >>>>>>>>>>>>>>>>> general purpose enough that you would just end
> up
> > > > > writing
> > > > > > >>> your
> > > > > > >>>>>>> own
> > > > > > >>>>>>>>>>>> code
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> to
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> handle it anyway.
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> Here are some ideas on what it could look like
> > > with a
> > > > > > >> method
> > > > > > >>>>> and
> > > > > > >>>>>>>>>>> the
> > > > > > >>>>>>>>>>>>>>>>> resulting string output:
> > > > > > >>>>>>>>>>>>>>>>> *Stringify.toJSON()*
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> With KV:
> > > > > > >>>>>>>>>>>>>>>>> {"key": "value"}
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> With Iterables:
> > > > > > >>>>>>>>>>>>>>>>> ["one", "two", "three"]
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> *Stringify.toXML("rootelement")*
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> With KV:
> > > > > > > >>>>>>>>>>>>>>>>> <rootelement key="value" />
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> With Iterables:
> > > > > > >>>>>>>>>>>>>>>>> <rootelement>
> > > > > > >>>>>>>>>>>>>>>>>   <item>one</item>
> > > > > > >>>>>>>>>>>>>>>>>   <item>two</item>
> > > > > > >>>>>>>>>>>>>>>>>   <item>three</item>
> > > > > > >>>>>>>>>>>>>>>>> </rootelement>
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> *Stringify.toDelimited(",")*
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> With KV:
> > > > > > >>>>>>>>>>>>>>>>> key,value
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> With Iterables:
> > > > > > >>>>>>>>>>>>>>>>> one,two,three
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> Do you think that would strike a good balance
> > > between
> > > > > > >>> reusable
> > > > > > >>>>>>>>>> code
> > > > > > >>>>>>>>>>>> and
> > > > > > >>>>>>>>>>>>>>>>> writing your own for more difficult formatting?
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> Thanks,
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> Jesse
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik
> > > > > > >>>>>>>>>>> <lcwik@google.com.invalid
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> wrote:
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> Jesse, I believe if one format gets special
> > > treatment
> > > > > in
> > > > > > >>>>> TextIO,
> > > > > > >>>>>>>>>>>> people
> > > > > > > >>>>>>>>>>>>>>>>> will then ask why JSON, XML, ... are not also
> > > > > > > >> supported.
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> Also, the example that you provide is using the
> > > fact
> > > > > that
> > > > > > >>> the
> > > > > > >>>>>>>>>> input
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> format
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> is an Iterable<Item>. You had posted a question
> > > about
> > > > > > >> using
> > > > > > >>> KV
> > > > > > >>>>>>>>>> with
> > > > > > >>>>>>>>>>>>>>>>> TextIO.Write which wouldn't align with the
> > proposed
> > > > > input
> > > > > > >>>>> format
> > > > > > >>>>>>>>>>> and
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> still
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> would require writing a type conversion
> > function,
> > > > this
> > > > > > >> time
> > > > > > >>>>>>> from
> > > > > > >>>>>>>>>>> KV
> > > > > > >>>>>>>>>>>> to
> > > > > > >>>>>>>>>>>>>>>>> Iterable<Item> instead of KV to string.
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson
> <
> > > > > > >>>>>>>>>>>> jesse@smokinghand.com>
> > > > > > >>>>>>>>>>>>>>>>> wrote:
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> Lukasz,
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>> I don't think you'd need complicated logic for
> > > > > > >>> TextIO.Write.
> > > > > > >>>>>>> For
> > > > > > >>>>>>>>>>> CSV
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> the
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> call would look like:
> > > > > > >>>>>>>>>>>>>>>>>> Stringify.to("", ",", "\n");
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>> Where the arguments would be
> > Stringify.to(prefix,
> > > > > > >>> delimiter,
> > > > > > >>>>>>>>>>>> suffix).
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>> The code would be something like:
> > > > > > >>>>>>>>>>>>>>>>>> StringBuffer buffer = new
> StringBuffer(prefix);
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>> for (Item item : list) {
> > > > > > >>>>>>>>>>>>>>>>>>   buffer.append(item.toString());
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>   if(notLast) {
> > > > > > >>>>>>>>>>>>>>>>>>     buffer.append(delimiter);
> > > > > > >>>>>>>>>>>>>>>>>>   }
> > > > > > >>>>>>>>>>>>>>>>>> }
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>> buffer.append(suffix);
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>> c.output(buffer.toString());
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>> That would allow you to do the basic CSV, TSV,
> > and
> > > > > other
> > > > > > >>>>>>> formats
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> without
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> complicated logic. The same sort of thing could
> > be
> > > > done
> > > > > > >> for
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> TextIO.Write.
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>> Thanks,
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>> Jesse
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik
> > > > > > >>>>>>>>>>>> <lcwik@google.com.invalid
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> wrote:
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>> The conversion from object to string will have uses outside of just
> > > > > > >>>>>>>>>>>>>>>>>>> TextIO.Write so it seems logical that we would want to have a ParDo do
> > > > > > >>>>>>>>>>>>>>>>>>> the conversion.
> > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>> Text file formats have a lot of variance, even if you consider the
> > > > > > >>>>>>>>>>>>>>>>>>> subset of CSV like formats where it could have fixed width fields, or
> > > > > > >>>>>>>>>>>>>>>>>>> escaping and quoting around other fields, or headers that should be
> > > > > > >>>>>>>>>>>>>>>>>>> placed at the top.
> > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>> Having all these format conversions within TextIO.Write seems like a
> > > > > > >>>>>>>>>>>>>>>>>>> lot of logic to contain in that transform which should just focus on
> > > > > > >>>>>>>>>>>>>>>>>>> writing to files.
> > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 8:15 AM, Jesse Anderson <jesse@smokinghand.com>
> > > > > > >>>>>>>>>>>>>>>>>>> wrote:
> > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>> This is a thread moved over from the user mailing list.
> > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>> I think there needs to be a way to convert a PCollection<KV> to a
> > > > > > >>>>>>>>>>>>>>>>>>>> PCollection<String>.
> > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>> To do a minimal WordCount, you have to manually convert the KV to a
> > > > > > >>>>>>>>>>>>>>>>>>>> String:
> > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>         p
> > > > > > >>>>>>>>>>>>>>>>>>>>                 .apply(TextIO.Read.from("playing_cards.tsv"))
> > > > > > >>>>>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> > > > > > >>>>>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> > > > > > >>>>>>>>>>>>>>>>>>>> *                .apply(MapElements.via((KV<String, Long> count) ->*
> > > > > > >>>>>>>>>>>>>>>>>>>> *                            count.getKey() + ":" + count.getValue()*
> > > > > > >>>>>>>>>>>>>>>>>>>> *                        ).withOutputType(TypeDescriptors.strings()))*
> > > > > > >>>>>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to("output/stringcounts"));
> > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>> This code really should be something like:
> > > > > > >>>>>>>>>>>>>>>>>>>>         p
> > > > > > >>>>>>>>>>>>>>>>>>>>                 .apply(TextIO.Read.from("playing_cards.tsv"))
> > > > > > >>>>>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> > > > > > >>>>>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> > > > > > >>>>>>>>>>>>>>>>>>>> *                .apply(ToString.stringify())*
> > > > > > >>>>>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to("output/stringcounts"));
> > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>> To summarize the discussion:
> > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>    - JA: Add a method to StringDelegateCoder to output any KV or list
> > > > > > >>>>>>>>>>>>>>>>>>>>    - JA and DH: Add a SimpleFunction that takes a type and runs
> > > > > > >>>>>>>>>>>>>>>>>>>>    toString() on it:
> > > > > > >>>>>>>>>>>>>>>>>>>>    class ToStringFn<InputT> extends SimpleFunction<InputT, String> {
> > > > > > >>>>>>>>>>>>>>>>>>>>        @Override
> > > > > > >>>>>>>>>>>>>>>>>>>>        public String apply(InputT input) {
> > > > > > >>>>>>>>>>>>>>>>>>>>            return input.toString();
> > > > > > >>>>>>>>>>>>>>>>>>>>        }
> > > > > > >>>>>>>>>>>>>>>>>>>>    }
> > > > > > >>>>>>>>>>>>>>>>>>>>    - JB: Add a general purpose type converter like in Apache Camel.
> > > > > > >>>>>>>>>>>>>>>>>>>>    - JA: Add Object support to TextIO.Write that would write out the
> > > > > > >>>>>>>>>>>>>>>>>>>>    toString of any Object.
> > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>> My thoughts:
> > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>> Is converting to a PCollection<String> mostly needed when you're using
> > > > > > >>>>>>>>>>>>>>>>>>>> TextIO.Write? Will a general purpose transform only work in certain
> > > > > > >>>>>>>>>>>>>>>>>>>> cases and you'll normally have to write custom code to format the
> > > > > > >>>>>>>>>>>>>>>>>>>> strings the way you want them?
> > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>> IMHO, it's yes to both. I'd prefer to add Object support to
> > > > > > >>>>>>>>>>>>>>>>>>>> TextIO.Write or a SimpleFunction that takes a delimiter as an argument.
> > > > > > >>>>>>>>>>>>>>>>>>>> Making a SimpleFunction that's able to specify a delimiter (and perhaps
> > > > > > >>>>>>>>>>>>>>>>>>>> a prefix and suffix) should cover the majority of formats and cases.
> > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>> Thanks,
> > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>> Jesse
> > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>> --
> > > > > > >>>>>>>>>>>>> Jean-Baptiste Onofré
> > > > > > >>>>>>>>>>>>> jbonofre@apache.org
> > > > > > >>>>>>>>>>>>> http://blog.nanthrax.net
> > > > > > >>>>>>>>>>>>> Talend - http://www.talend.com
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> --
> > > > > > >>>>>>>>> Jean-Baptiste Onofré
> > > > > > >>>>>>>>> jbonofre@apache.org
> > > > > > >>>>>>>>> http://blog.nanthrax.net
> > > > > > >>>>>>>>> Talend - http://www.talend.com
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>
> > > > > > >>>>>>>
> > > > > > >>>>>>> --
> > > > > > >>>>>>> Jean-Baptiste Onofré
> > > > > > >>>>>>> jbonofre@apache.org
> > > > > > >>>>>>> http://blog.nanthrax.net
> > > > > > >>>>>>> Talend - http://www.talend.com
> > > > > > >>>>>>>
> > > > > > >>>>>>
> > > > > > >>>>>
> > > > > > >>>>> --
> > > > > > >>>>> Jean-Baptiste Onofré
> > > > > > >>>>> jbonofre@apache.org
> > > > > > >>>>> http://blog.nanthrax.net
> > > > > > >>>>> Talend - http://www.talend.com
> > > > > > >>>>>
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>> --
> > > > > > >>> Jean-Baptiste Onofré
> > > > > > >>> jbonofre@apache.org
> > > > > > >>> http://blog.nanthrax.net
> > > > > > >>> Talend - http://www.talend.com
> > > > > > >>>
> > > > > > >>
> > > > > > >
> > > > > >
> > > > > > --
> > > > > > Jean-Baptiste Onofré
> > > > > > jbonofre@apache.org
> > > > > > http://blog.nanthrax.net
> > > > > > Talend - http://www.talend.com
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: PCollection to PCollection Conversion

Posted by Jesse Anderson <je...@smokinghand.com>.

@Ben which idea do you like?

On Thu, Dec 29, 2016 at 3:20 PM Ben Chambers <bc...@google.com.invalid>
wrote:

> I like that idea, with the caveat that we should probably come up with a
> better name. Perhaps "ToString.elements()" and ToString.Elements or
> something? Calling one the "default" and using "create" for it seems
> moderately non-future proof.
>
> On Thu, Dec 29, 2016 at 3:17 PM Dan Halperin <dh...@google.com.invalid>
> wrote:
>
> > On Thu, Dec 29, 2016 at 2:10 PM, Jesse Anderson <je...@smokinghand.com>
> > wrote:
> >
> > > I agree MapElements isn't hard to use. I think there is a demand for
> this
> > > built-in conversion.
> > >
> > > My thought on the formatter is that, worst case, we could do runtime
> type
> > > checking. It would be ugly and not as performant, but it should work.
> As
> > > we've said, we'd point them to MapElements for better code. We'd write
> > the
> > > JavaDoc accordingly.
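The runtime type checking mentioned above, applied to the ToString.formatted(String format) idea from this thread, might look roughly like the sketch below. It is plain Java with no Beam dependency: FormattedToString is a hypothetical name, java.util.function.Function stands in for a Beam function type, and Map.Entry stands in for KV.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for the proposed ToString.formatted(String format). */
final class FormattedToString {
    private FormattedToString() {}

    /**
     * Returns a function that splits its input into parts at runtime and
     * passes the string form of each part to String.format.
     */
    static Function<Object, String> formatted(String format) {
        return input -> {
            List<Object> parts = new ArrayList<>();
            if (input instanceof Map.Entry) {
                // Map.Entry stands in for Beam's KV: key first, then value.
                Map.Entry<?, ?> entry = (Map.Entry<?, ?>) input;
                parts.add(String.valueOf(entry.getKey()));
                parts.add(String.valueOf(entry.getValue()));
            } else if (input instanceof Iterable) {
                for (Object element : (Iterable<?>) input) {
                    parts.add(String.valueOf(element));
                }
            } else {
                parts.add(String.valueOf(input));
            }
            // A format string that expects more parts than the input provides
            // only fails here, at runtime (MissingFormatArgumentException).
            return String.format(format, parts.toArray());
        };
    }
}
```

The instanceof chain is the "ugly" part: nothing ties the format string to the element type at compile time, which is the trade-off against a properly typed MapElements call.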
> > >
> >
> > I think it will be good to see these proposals in PR form. I would stay
> far
> > away from reflection and varargs if possible, but properly-typed bits of
> > code (possibly exposed as SerializableFunctions in ToString?) would
> > probably make sense.
> >
> > In the short-term, I can't find anyone arguing against a
> ToString.create()
> > that simply does input.toString().
> >
> > To get started, how about we ask Vikas to clean up the PR to be more
> > future-proof for now? Aka make `ToString` itself not a PTransform,  but
> > instead ToString.create() returns ToString.Default which is a private
> class
> > implementing what ToString is now (PTransform<T, String>, wrapping
> > MapElements).
> >
> > Then we can send PRs adding new features to that.
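As a rough illustration of the shape described above (a namespace class whose factory method returns a hidden default implementation), here is a minimal sketch in plain Java. The Transform interface is a hypothetical stand-in for Beam's PTransform, and a List stands in for a PCollection; only the names ToString, create() and Default come from the thread.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical stand-in for Beam's PTransform over a PCollection. */
interface Transform<InT, OutT> {
    List<OutT> expand(List<InT> input);
}

/** Namespace class: not itself a transform, only a home for factories. */
final class ToString {
    private ToString() {}

    /** Returns the default conversion, which calls toString() on each element. */
    static <T> Transform<T, String> create() {
        return new Default<>();
    }

    /** Private implementation; callers only ever see the Transform interface. */
    private static final class Default<T> implements Transform<T, String> {
        @Override
        public List<String> expand(List<T> input) {
            // Stands in for wrapping MapElements.via(...) in the real SDK.
            Function<T, String> fn = Object::toString;
            return input.stream().map(fn).collect(Collectors.toList());
        }
    }
}
```

Because callers depend only on the factory method, further variants can later be added as additional static methods without changing existing call sites.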
> >
> > IME and to Ben's point, these will mostly be used in development. Some of
> > > our assumptions will break down when programmers aren't the ones using
> > > Beam. I can see from the user traffic already that not everyone using
> > Beam
> > > is a programmer and they'll need classes like this to be productive.
> >
> >
> > > On Thu, Dec 29, 2016 at 1:46 PM Dan Halperin
> <dhalperi@google.com.invalid
> > >
> > > wrote:
> > >
> > > On Thu, Dec 29, 2016 at 1:36 PM, Jesse Anderson <jesse@smokinghand.com
> >
> > > wrote:
> > >
> > > > I prefer JB's take. I think there should be three overloaded methods
> on
> > > the
> > > > class. I like Vikas' name ToString. The methods for a simple
> conversion
> > > > should be:
> > > >
> > > > ToString.strings() - Outputs the .toString() of the objects in the
> > > > PCollection
> > > > ToString.strings(String delimiter) - Outputs the .toString() of KVs,
> > > Lists,
> > > > etc with the delimiter between every entry
> > > > ToString.formatted(String format) - Outputs the formatted
> > > > <https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html>
> > > > string
> > > > with the object passed in. For objects made up of different parts
> like
> > > KVs,
> > > > each one is passed in as separate toString() of a varargs.
> > > >
> > >
> > > Riffing a little, with some types:
> > >
> > > ToString.<T>of() -- PTransform<T, String> that is equivalent to a ParDo
> > > that takes in a T and outputs T.toString().
> > >
> > > ToString.<K,V>kv(String delimiter) -- PTransform<KV<K, V>, String> that
> > is
> > > equivalent to a ParDo that takes in a KV<K,V> and outputs
> > > kv.getKey().toString() + delimiter + kv.getValue().toString()
> > >
> > > ToString.<T>iterable(String delimiter) -- PTransform<? extends
> > Iterable<T>,
> > > String> that is equivalent to a ParDo that takes in an Iterable<T> and
> > > outputs the iterable[0] + delimiter + iterable[1] + delimiter + ... +
> > > delimiter + iterable[N-1]
> > >
> > > ToString.<T>custom(SerializableFunction<T, String> formatter) ?
> > >
> > > > The last one is just MapElements.via, except you don't need to set the
> > > output type.
> > >
> > > I don't see a way to make the generic .formatted() that you propose
> that
> > > just works with anything "made of different parts".
> > >
> > > > I think adding too many overrides beyond "of" and "custom" is opening
> > > > up a Pandora's Box. The KV one might want to have left and right
> > > > delimiters, might want to take custom formatters for K and V, etc. The
> > > > iterable one might want to have a special configuration for an empty
> > > > iterable. So I'm inclined towards simplicity with the awareness that
> > > > MapElements.via is just not that hard to use.
> > >
> > > Dan
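The four typed variants riffed on above can be sketched as plain Java function factories. This is only an approximation under stated assumptions: ToStringFns is a hypothetical name, java.util.function.Function stands in for PTransform and SerializableFunction, and Map.Entry stands in for KV.

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.function.Function;

/** Hypothetical helpers mirroring the of/kv/iterable/custom variants. */
final class ToStringFns {
    private ToStringFns() {}

    /** Counterpart of ToString.<T>of(): element -> element.toString(). */
    static <T> Function<T, String> of() {
        return Object::toString;
    }

    /** Counterpart of ToString.<K, V>kv(delimiter); Map.Entry stands in for KV. */
    static <K, V> Function<Map.Entry<K, V>, String> kv(String delimiter) {
        return kv -> kv.getKey().toString() + delimiter + kv.getValue().toString();
    }

    /** Counterpart of ToString.<T>iterable(delimiter). */
    static <T> Function<Iterable<? extends T>, String> iterable(String delimiter) {
        return elements -> {
            // StringJoiner adds the delimiter only between elements.
            StringJoiner joiner = new StringJoiner(delimiter);
            for (T element : elements) {
                joiner.add(element.toString());
            }
            return joiner.toString();
        };
    }

    /** Counterpart of ToString.<T>custom(formatter): the caller supplies the formatter. */
    static <T> Function<T, String> custom(Function<T, String> formatter) {
        return formatter;
    }
}
```

In this sketch an empty iterable simply yields an empty string, which is one of the cases flagged above as possibly needing its own configuration.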
> > >
> > >
> > > >
> > > > I think doing these three methods would cover every simple and
> advanced
> > > > "simple conversions." As JB says, we'll need other specific
> converters
> > > for
> > > > other formats like XML.
> > > >
> > > > I'd really like to see this class in the next version of Beam. What
> > does
> > > > everyone think of the class name, methods name, and method operations
> > so
> > > we
> > > > can have Vikas finish up?
> > > >
> > > > Thanks,
> > > >
> > > > Jesse
> > > >
> > > > On Wed, Dec 28, 2016 at 12:28 PM Jean-Baptiste Onofré <
> jb@nanthrax.net
> > >
> > > > wrote:
> > > >
> > > > > Hi Vikas,
> > > > >
> > > > > > did you take a look at:
> > > > >
> > > > >
> > > > > https://github.com/jbonofre/beam/tree/DATAFORMAT/sdks/
> > > > java/extensions/dataformat
> > > > >
> > > > > You can see KV2String and ToString could be part of this extension.
> > > > > I'm also using JAXB for XML and Jackson for JSON
> > > > > marshalling/unmarshalling. I'm planning to deal with Avro
> > > > (IndexedRecord).
> > > > >
> > > > > Regards
> > > > > JB
> > > > >
> > > > > On 12/28/2016 08:37 PM, Vikas Kedigehalli wrote:
> > > > > > Hi All,
> > > > > >
> > > > > >   Not being aware of the discussion here, I sent out a PR
> > > > > > <https://github.com/apache/beam/pull/1704> but JB and others
> > > directed
> > > > > me to
> > > > > > this thread. Having converted PCollection<T> to
> PCollection<String>
> > > > > several
> > > > > > times, I feel something like 'ToString' transform is common
> enough
> > to
> > > > be
> > > > > > part of the core. What do you all think?
> > > > > >
> > > > > > Also, if someone else is already working on or interested in
> > tackling
> > > > > this,
> > > > > > then I am happy to discard the PR.
> > > > > >
> > > > > > Regards,
> > > > > > Vikas
> > > > > >
> > > > > > On Tue, Dec 13, 2016 at 1:56 AM, Amit Sela <amitsela33@gmail.com
> >
> > > > wrote:
> > > > > >
> > > > > >> It seems that there were a lot of good points raised here, and I
> > > tend
> > > > to
> > > > > >> agree that something as trivial and lean as "ToString" should
> be a
> > > > part
> > > > > of
> > > > > > >> core.
> > > > > >> I'm particularly fond of makeString(prefix, toString, suffix) in
> > > > various
> > > > > >> combinations (Scala-like).
> > > > > >> For "fromString", I think JB has a good point leveraging JAXB
> and
> > > > > Jackson -
> > > > > >> though I think this should be in extensions as it is not as lean
> > as
> > > > > >> toString.
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Amit
> > > > > >>
> > > > > >> On Wed, Nov 30, 2016 at 5:13 AM Jean-Baptiste Onofré <
> > > jb@nanthrax.net
> > > > >
> > > > > >> wrote:
> > > > > >>
> > > > > >>> Hi Jesse,
> > > > > >>>
> > > > > >>> yes, I started something there (using JAXB and Jackson). Let me
> > > > polish
> > > > > >>> and push.
> > > > > >>>
> > > > > >>> Regards
> > > > > >>> JB
> > > > > >>>
> > > > > >>> On 11/29/2016 10:00 PM, Jesse Anderson wrote:
> > > > > >>>> I went through the string conversions. Do you have an example
> of
> > > > > >> writing
> > > > > >>>> out XML/JSON/etc too?
> > > > > >>>>
> > > > > >>>> On Tue, Nov 29, 2016 at 3:46 PM Jean-Baptiste Onofré <
> > > > jb@nanthrax.net
> > > > > >
> > > > > >>>> wrote:
> > > > > >>>>
> > > > > >>>>> Hi Jesse,
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>> https://github.com/jbonofre/incubator-beam/tree/
> > > > DATAFORMAT/sdks/java/
> > > > > >> extensions/dataformat
> > > > > >>>>>
> > > > > >>>>> it's very simple and stupid and of course not complete at all
> > (I
> > > > have
> > > > > >>>>> other commits but not merged as they need some polishing),
> but
> > as
> > > I
> > > > > >>>>> said, it's a base of discussion.
> > > > > >>>>>
> > > > > >>>>> Regards
> > > > > >>>>> JB
> > > > > >>>>>
> > > > > >>>>> On 11/29/2016 09:23 PM, Jesse Anderson wrote:
> > > > > >>>>>> @jb Sounds good. Just let us know once you've pushed.
> > > > > >>>>>>
> > > > > >>>>>> On Tue, Nov 29, 2016 at 2:54 PM Jean-Baptiste Onofré <
> > > > > >> jb@nanthrax.net>
> > > > > >>>>>> wrote:
> > > > > >>>>>>
> > > > > >>>>>>> Good point Eugene.
> > > > > >>>>>>>
> > > > > >>>>>>> Right now, it's a DoFn collection to experiment a bit (a
> pure
> > > > > >>>>>>> extension). It's pretty stupid ;)
> > > > > >>>>>>>
> > > > > >>>>>>> But, you are right, depending the direction of such
> > extension,
> > > it
> > > > > >>> could
> > > > > >>>>>>> cover more use cases (even if it's not my first intention
> > ;)).
> > > > > >>>>>>>
> > > > > >>>>>>> Let me push the branch (pretty small) as an illustration,
> and
> > > in
> > > > > the
> > > > > >>>>>>> mean time, I'm preparing a document (more focused on the
> use
> > > > > cases).
> > > > > >>>>>>>
> > > > > >>>>>>> WDYT ?
> > > > > >>>>>>>
> > > > > >>>>>>> Regards
> > > > > >>>>>>> JB
> > > > > >>>>>>>
> > > > > >>>>>>> On 11/29/2016 08:47 PM, Eugene Kirpichov wrote:
> > > > > >>>>>>>> Hi JB,
> > > > > >>>>>>>> Depending on the scope of what you want to ultimately
> > > accomplish
> > > > > >> with
> > > > > >>>>>>> this
> > > > > >>>>>>>> extension, I think it may make sense to write a proposal
> > > > document
> > > > > >> and
> > > > > >>>>>>>> discuss it.
> > > > > >>>>>>>> If it's just a collection of utility DoFn's for various
> > > > > >> well-defined
> > > > > >>>>>>>> source/target format pairs, then that's probably not
> needed,
> > > but
> > > > > if
> > > > > >>>>> it's
> > > > > >>>>>>>> anything more, then I think it is.
> > > > > >>>>>>>> That will help avoid a lot of churn if people propose
> > > reasonable
> > > > > >>>>>>>> significant changes.
> > > > > >>>>>>>>
> > > > > >>>>>>>> On Tue, Nov 29, 2016 at 11:15 AM Jean-Baptiste Onofré <
> > > > > >>> jb@nanthrax.net
> > > > > >>>>>>
> > > > > >>>>>>>> wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>>> By the way Jesse, I gonna push my DATAFORMAT branch on my
> > > > github
> > > > > >>> and I
> > > > > >>>>>>>>> will post on the dev mailing list when done.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Regards
> > > > > >>>>>>>>> JB
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> On 11/29/2016 07:01 PM, Jesse Anderson wrote:
> > > > > >>>>>>>>>> I want to bring this thread back up since we've had time
> > to
> > > > > think
> > > > > >>>>> about
> > > > > >>>>>>>>> it
> > > > > >>>>>>>>>> more and make a plan.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> I think a format-specific converter will be more time
> > > > consuming
> > > > > >>> task
> > > > > >>>>>>> than
> > > > > >>>>>>>>>> we originally thought. It'd have to be a writer that
> takes
> > > > > >> another
> > > > > >>>>>>> writer
> > > > > >>>>>>>>>> as a parameter.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> I think a string converter can be done as a simple
> > > transform.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> I think we should start with a simple string converter
> and
> > > > plan
> > > > > >>> for a
> > > > > >>>>>>>>>> format-specific writer.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> What are your thoughts?
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Thanks,
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Jesse
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> On Thu, Nov 10, 2016 at 10:33 AM Jesse Anderson <
> > > > > >>>>> jesse@smokinghand.com
> > > > > >>>>>>>>
> > > > > >>>>>>>>>> wrote:
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> I was thinking about what the outputs would look like
> last
> > > > > >> night. I
> > > > > >>>>>>>>>> realized that more complex formats like JSON and XML may
> > or
> > > > may
> > > > > >> not
> > > > > >>>>>>>>> output
> > > > > >>>>>>>>>> the data in a valid format.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Doing a direct conversion on unbounded collections would
> > > work
> > > > > >> just
> > > > > >>>>>>> fine.
> > > > > >>>>>>>>>> They're self-contained. For writing out bounded
> > collections,
> > > > > >> that's
> > > > > >>>>>>> where
> > > > > >>>>>>>>>> we'll hit the issues. This changes the uber conversion
> > > > transform
> > > > > >>>>> into a
> > > > > >>>>>>>>>> transform that needs to be a writer.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> If a transform executes a JSON conversion on a per
> element
> > > > > basis,
> > > > > >>>>> we'd
> > > > > >>>>>>>>> get
> > > > > >>>>>>>>>> this:
> > > > > >>>>>>>>>> {
> > > > > >>>>>>>>>> "key": "value"
> > > > > >>>>>>>>>> }, {
> > > > > >>>>>>>>>> "key": "value"
> > > > > >>>>>>>>>> },
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> That isn't valid JSON.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> The conversion transform would need to know do several
> > > things
> > > > > >> when
> > > > > >>>>>>>>> writing
> > > > > >>>>>>>>>> out a file. It would need to add brackets for an array.
> > Now
> > > we
> > > > > >>> have:
> > > > > >>>>>>>>>> [
> > > > > >>>>>>>>>> {
> > > > > >>>>>>>>>> "key": "value"
> > > > > >>>>>>>>>> }, {
> > > > > >>>>>>>>>> "key": "value"
> > > > > >>>>>>>>>> },
> > > > > >>>>>>>>>> ]
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> We still don't have valid JSON. We have to remove the
> last
> > > > comma
> > > > > >> or
> > > > > >>>>>>> have
> > > > > >>>>>>>>>> the uber transform start putting in the commas, except
> for
> > > the
> > > > > >> last
> > > > > >>>>>>>>> element.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> [
> > > > > >>>>>>>>>> {
> > > > > >>>>>>>>>> "key": "value"
> > > > > >>>>>>>>>> }, {
> > > > > >>>>>>>>>> "key": "value"
> > > > > >>>>>>>>>> }
> > > > > >>>>>>>>>> ]
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Only by doing this do we have valid JSON.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> I'd argue we'd have a similar issue with XML. Some
> parsers
> > > > > >> require
> > > > > >>> a
> > > > > >>>>>>> root
> > > > > >>>>>>>>>> element for everything. The uber transform would have to
> > put
> > > > the
> > > > > >>> root
> > > > > >>>>>>>>>> element tags at the beginning and end of the file.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> On Wed, Nov 9, 2016 at 11:36 PM Manu Zhang <
> > > > > >>> owenzhang1990@gmail.com>
> > > > > >>>>>>>>> wrote:
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> I would love to see a lean core and abundant Transforms
> at
> > > the
> > > > > >> same
> > > > > >>>>>>> time.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Maybe we can look at what Confluent <
> > > > > >>> https://github.com/confluentinc
> > > > > >>>>>>
> > > > > >>>>>>>>> does
> > > > > >>>>>>>>>> for kafka-connect. They have official extensions support
> > for
> > > > > >> JDBC,
> > > > > >>>>> HDFS
> > > > > >>>>>>>>> and
> > > > > >>>>>>>>>> ElasticSearch under https://github.com/confluentinc.
> They
> > > put
> > > > > >> them
> > > > > >>>>>>> along
> > > > > >>>>>>>>>> with other community extensions on
> > > > > >>>>>>>>>> https://www.confluent.io/product/connectors/ for
> > > visibility.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Although not a commercial company, can we have a GitHub
> > user
> > > > > like
> > > > > >>>>>>>>>> beam-community to host projects we build around beam but
> > not
> > > > > >>> suitable
> > > > > >>>>>>> for
> > > > > >>>>>>>>>> https://github.com/apache/incubator-beam. In the
> future,
> > we
> > > > may
> > > > > >>> have
> > > > > >>>>>>>>>> beam-algebra like http://github.com/twitter/algebird
> for
> > > > > algebra
> > > > > >>>>>>>>> operations
> > > > > >>>>>>>>>> and beam-ml / beam-dl for machine learning / deep
> > learning.
> > > > > Also,
> > > > > >>>>> there
> > > > > >>>>>>>>>> will will be beam related projects elsewhere maintained
> by
> > > > other
> > > > > >>>>>>>>>> communities. We can put all of them on the beam-website
> or
> > > > like
> > > > > >>> spark
> > > > > >>>>>>>>>> packages as mentioned by Amit.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> My $0.02
> > > > > >>>>>>>>>> Manu
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles
> > > > > >>>>> <klk@google.com.invalid
> > > > > >>>>>>>>
> > > > > >>>>>>>>>> wrote:
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>> On this point from Amit and Ismaël, I agree: we could
> > > benefit
> > > > > >>> from a
> > > > > >>>>>>>>> place
> > > > > >>>>>>>>>>> for miscellaneous non-core helper transformations.
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> We have sdks/java/extensions but it is organized as
> > > separate
> > > > > >>>>>>> artifacts.
> > > > > >>>>>>>>> I
> > > > > >>>>>>>>>>> think that is fine, considering the nature of Join and
> > > > > >> SortValues.
> > > > > >>>>> But
> > > > > >>>>>>>>> for
> > > > > >>>>>>>>>>> simpler transforms, Importing one artifact per tiny
> > > transform
> > > > > is
> > > > > >>> too
> > > > > >>>>>>>>> much
> > > > > >>>>>>>>>>> overhead. It also seems unlikely that we will have
> enough
> > > > > >>>>> commonality
> > > > > >>>>>>>>>> among
> > > > > >>>>>>>>>>> the transforms to call the artifact anything other than
> > > [some
> > > > > >>>>> synonym
> > > > > >>>>>>>>> for]
> > > > > >>>>>>>>>>> "miscellaneous".
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> I wouldn't want to take this too far - even though the
> > SDK
> > > > many
> > > > > >>>>>>>>>> transforms*
> > > > > >>>>>>>>>>> that are not required for the model [1], I like that
> the
> > > SDK
> > > > > >>>>> artifact
> > > > > >>>>>>>>> has
> > > > > >>>>>>>>>>> everything a user might need in their "getting started"
> > > phase
> > > > > of
> > > > > >>>>> use.
> > > > > >>>>>>>>> This
> > > > > >>>>>>>>>>> user-friendliness (the user doesn't care that ParDo is
> > core
> > > > and
> > > > > >>> Sum
> > > > > >>>>> is
> > > > > >>>>>>>>>> not)
> > > > > >>>>>>>>>>> plus the difficulty of judging which transforms go
> where,
> > > are
> > > > > >>>>> probably
> > > > > >>>>>>>>> why
> > > > > >>>>>>>>>>> we have them mostly all in one place.
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Models to look at, off the top of my head, include
> Pig's
> > > > > >> PiggyBank
> > > > > >>>>> and
> > > > > >>>>>>>>>>> Apex's Malhar. These have different levels of support
> > > > implied.
> > > > > >>>>> Others?
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Kenn
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> [1] ApproximateQuantiles, ApproximateUnique, Count,
> > > Distinct,
> > > > > >>>>> Filter,
> > > > > >>>>>>>>>>> FlatMapElements, Keys, Latest, MapElements, Max, Mean,
> > Min,
> > > > > >>> Values,
> > > > > >>>>>>>>>> KvSwap,
> > > > > >>>>>>>>>>> Partition, Regex, Sample, Sum, Top, Values, WithKeys,
> > > > > >>> WithTimestamps
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> * at least they are separate classes and not methods on
> > > > > >>> PCollection
> > > > > >>>>>>> :-)
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> On Wed, Nov 9, 2016 at 6:03 AM, Ismaël Mejía <
> > > > > iemejia@gmail.com
> > > > > >>>
> > > > > >>>>>>> wrote:
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>> Nice discussion, and thanks Jesse for bringing this
> > > subject
> > > > > >>> back.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> I agree 100% with Amit and the idea of having a home
> for
> > > > those
> > > > > >>>>>>>>>> transforms
> > > > > >>>>>>>>>>>> that are not core enough to be part of the sdk, but
> that
> > > we
> > > > > all
> > > > > >>> end
> > > > > >>>>>>> up
> > > > > >>>>>>>>>>>> re-writing somehow.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> This is a needed improvement to be more developer
> > > friendly,
> > > > > but
> > > > > >>>>> also
> > > > > >>>>>>> as
> > > > > >>>>>>>>>> a
> > > > > >>>>>>>>>>>> reference of good practices of Beam development, and
> for
> > > > this
> > > > > >>>>> reason
> > > > > >>>>>>> I
> > > > > >>>>>>>>>>>> agree with JB that at this moment it would be better
> for
> > > > these
> > > > > >>>>>>>>>> transforms
> > > > > >>>>>>>>>>>> to reside in the Beam repository at least for
> visibility
> > > > > >> reasons.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> One additional question is if these transforms
> > represent a
> > > > > >>>>> different
> > > > > >>>>>>>>> DSL
> > > > > >>>>>>>>>>> or
> > > > > >>>>>>>>>>>> if those could be grouped with the current extensions
> > > (e.g.
> > > > > >> Join
> > > > > >>>>> and
> > > > > >>>>>>>>>>>> SortValues) into something more general that we as a
> > > > community
> > > > > >>>>> could
> > > > > >>>>>>>>>>>> maintain, but well even if it is not the case, it
> would
> > be
> > > > > >> really
> > > > > >>>>>>> nice
> > > > > >>>>>>>>>> to
> > > > > >>>>>>>>>>>> start working on something like this.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> Ismaël Mejía
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré
> <
> > > > > >>>>>>> jb@nanthrax.net
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Related to spark-package, we also have Apache Bahir
> to
> > > host
> > > > > >>>>>>>>>>>>> connectors/transforms for Spark and Flink.
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> IMHO, right now, Beam should host this, not sure if
> it
> > > > makes
> > > > > >>> sense
> > > > > >>>>>>>>>>>>> directly in the core.
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> It reminds me the "Integration" DSL we discussed in
> the
> > > > > >>> technical
> > > > > >>>>>>>>>>> vision
> > > > > >>>>>>>>>>>>> document.
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Regards
> > > > > >>>>>>>>>>>>> JB
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> On 11/09/2016 11:17 AM, Amit Sela wrote:
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> I think Jesse has a very good point on one hand,
> while
> > > > > Luke's
> > > > > >>> and
> > > > > >>>>>>>>>>>>>> Kenneth's
> > > > > >>>>>>>>>>>>>> worries about committing users to specific
> > > implementations
> > > > > is
> > > > > >>> in
> > > > > >>>>>>>>>>> place.
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> The Spark community has a 3rd party repository for
> > > useful
> > > > > >>>>> libraries
> > > > > >>>>>>>>>>> that
> > > > > >>>>>>>>>>>>>> for various reasons are not a part of the Apache
> Spark
> > > > > >> project:
> > > > > >>>>>>>>>>>>>> https://spark-packages.org/.
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Maybe a "common-transformations" package would serve
> > > both
> > > > > >> users
> > > > > >>>>>>> quick
> > > > > >>>>>>>>>>>>>> ramp-up and ease-of-use while keeping Beam more
> > > > "enabling" ?
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles
> > > > > >>>>>>>>>> <klk@google.com.invalid
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> It seems useful for small scale debugging / demoing
> to
> > > > have
> > > > > >>>>>>>>>>>>>>> Dump.toString(). I think it should be named to
> > clearly
> > > > > >>> indicate
> > > > > >>>>>>> its
> > > > > >>>>>>>>>>>>>>> limited
> > > > > >>>>>>>>>>>>>>> scope. Maybe other stuff could go in the Dump
> > > namespace,
> > > > > but
> > > > > >>>>>>>>>>>>>>> "Dump.toJson()" would be for humans to read - so it
> > > > should
> > > > > >> be
> > > > > >>>>>>> pretty
> > > > > >>>>>>>>>>>>>>> printed, not treated as a machine-to-machine wire
> > > format.
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> The broader question of representing data in JSON
> or
> > > XML,
> > > > > >> etc,
> > > > > >>>>> is
> > > > > >>>>>>>>>>>> already
> > > > > >>>>>>>>>>>>>>> the subject of many mature libraries which are
> > already
> > > > easy
> > > > > >> to
> > > > > >>>>> use
> > > > > >>>>>>>>>>> with
> > > > > >>>>>>>>>>>>>>> Beam.
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> The more esoteric practice of implicit or
> > semi-implicit
> > > > > >>>>> coercions
> > > > > >>>>>>>>>>> seems
> > > > > >>>>>>>>>>>>>>> like it is also already addressed in many ways
> > > elsewhere.
> > > > > >>>>>>>>>>>>>>> Transform.via(TypeConverter) is basically the same
> as
> > > > > >>>>>>>>>>>>>>> MapElements.via(<lambda>) and also easy to use with
> > > Beam.
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> In both of the last cases, there are many
> reasonable
> > > > > >>> approaches,
> > > > > >>>>>>> and
> > > > > >>>>>>>>>>> we
> > > > > >>>>>>>>>>>>>>> shouldn't commit our users to one of them.
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik
> > > > > >>>>>>>>>>> <lcwik@google.com.invalid
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>> The suggestions you give seem good except for the XML cases.
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> Might want to have the XML be a document per line
> > > > similar
> > > > > >> to
> > > > > >>>>> the
> > > > > >>>>>>>>>>> JSON
> > > > > >>>>>>>>>>>>>>>> examples you have been giving.
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson <
> > > > > >>>>>>>>>>>> jesse@smokinghand.com>
> > > > > >>>>>>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> @lukasz Agreed there would have to be KV handling. I was more
> > > > > > >>>>>>>>>>>>>>>>> thinking that whatever the addition, it shouldn't just handle KV.
> > > > > > >>>>>>>>>>>>>>>>> It should handle Iterables, Lists, Sets, and KVs.
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> For JSON and XML, I wonder if we'd be able to
> give
> > > > > someone
> > > > > >>>>>>>>>>> something
> > > > > >>>>>>>>>>>>>>>>> general purpose enough that you would just end up
> > > > writing
> > > > > >>> your
> > > > > >>>>>>> own
> > > > > >>>>>>>>>>>> code
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> to
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> handle it anyway.
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> Here are some ideas on what it could look like
> > with a
> > > > > >> method
> > > > > >>>>> and
> > > > > >>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>>>> resulting string output:
> > > > > >>>>>>>>>>>>>>>>> *Stringify.toJSON()*
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> With KV:
> > > > > >>>>>>>>>>>>>>>>> {"key": "value"}
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> With Iterables:
> > > > > >>>>>>>>>>>>>>>>> ["one", "two", "three"]
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> *Stringify.toXML("rootelement")*
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> With KV:
> > > > > >>>>>>>>>>>>>>>>> <rootelement key=value />
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> With Iterables:
> > > > > >>>>>>>>>>>>>>>>> <rootelement>
> > > > > >>>>>>>>>>>>>>>>>   <item>one</item>
> > > > > >>>>>>>>>>>>>>>>>   <item>two</item>
> > > > > >>>>>>>>>>>>>>>>>   <item>three</item>
> > > > > >>>>>>>>>>>>>>>>> </rootelement>
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> *Stringify.toDelimited(",")*
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> With KV:
> > > > > >>>>>>>>>>>>>>>>> key,value
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> With Iterables:
> > > > > >>>>>>>>>>>>>>>>> one,two,three
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> Do you think that would strike a good balance
> > between
> > > > > >>> reusable
> > > > > >>>>>>>>>> code
> > > > > >>>>>>>>>>>> and
> > > > > >>>>>>>>>>>>>>>>> writing your own for more difficult formatting?
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> Thanks,
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> Jesse
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik
> > > > > >>>>>>>>>>> <lcwik@google.com.invalid
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>> Jesse, I believe if one format gets special treatment in TextIO,
> > > > > > >>>>>>>>>>>>>>>>> people will then ask why JSON, XML, ... are not also supported.
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> Also, the example that you provide is using the
> > fact
> > > > that
> > > > > >>> the
> > > > > >>>>>>>>>> input
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> format
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> is an Iterable<Item>. You had posted a question
> > about
> > > > > >> using
> > > > > >>> KV
> > > > > >>>>>>>>>> with
> > > > > >>>>>>>>>>>>>>>>> TextIO.Write which wouldn't align with the
> proposed
> > > > input
> > > > > >>>>> format
> > > > > >>>>>>>>>>> and
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> still
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> would require to write a type conversion
> function,
> > > this
> > > > > >> time
> > > > > >>>>>>> from
> > > > > >>>>>>>>>>> KV
> > > > > >>>>>>>>>>>> to
> > > > > >>>>>>>>>>>>>>>>> Iterable<Item> instead of KV to string.
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson <
> > > > > >>>>>>>>>>>> jesse@smokinghand.com>
> > > > > >>>>>>>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> Lukasz,
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> I don't think you'd need complicated logic for
> > > > > >>> TextIO.Write.
> > > > > >>>>>>> For
> > > > > >>>>>>>>>>> CSV
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> call would look like:
> > > > > >>>>>>>>>>>>>>>>>> Stringify.to("", ",", "\n");
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> Where the arguments would be
> Stringify.to(prefix,
> > > > > >>> delimiter,
> > > > > >>>>>>>>>>>> suffix).
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> The code would be something like:
> > > > > >>>>>>>>>>>>>>>>>> StringBuffer buffer = new StringBuffer(prefix);
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>> Iterator<Item> items = list.iterator();
> > > > > > >>>>>>>>>>>>>>>>>> while (items.hasNext()) {
> > > > > > >>>>>>>>>>>>>>>>>>   buffer.append(items.next().toString());
> > > > > > >>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>   if (items.hasNext()) {
> > > > > > >>>>>>>>>>>>>>>>>>     buffer.append(delimiter);
> > > > > > >>>>>>>>>>>>>>>>>>   }
> > > > > > >>>>>>>>>>>>>>>>>> }
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> buffer.append(suffix);
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> c.output(buffer.toString());
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> That would allow you to do the basic CSV, TSV,
> and
> > > > other
> > > > > >>>>>>> formats
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> without
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> complicated logic. The same sort of thing could
> be
> > > done
> > > > > >> for
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> TextIO.Write.
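A minimal, runnable sketch of the prefix/delimiter/suffix idea described above, written as plain Java with no Beam dependency. The class name `StringifySketch` and the extra `items` parameter are assumptions made for illustration; they are not an existing Beam API. It relies on `java.util.StringJoiner`, which places the delimiter only between entries.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper, not part of Beam: joins the toString() of each item
// with a delimiter and wraps the result in a prefix and a suffix.
final class StringifySketch {
  private StringifySketch() {}

  static String to(String prefix, String delimiter, String suffix, Iterable<?> items) {
    // StringJoiner inserts the delimiter only between entries,
    // so no "is this the last element?" check is needed.
    StringJoiner joiner = new StringJoiner(delimiter, prefix, suffix);
    for (Object item : items) {
      joiner.add(String.valueOf(item));
    }
    return joiner.toString();
  }

  public static void main(String[] args) {
    List<String> row = Arrays.asList("one", "two", "three");
    System.out.print(to("", ",", "\n", row));   // one CSV line
    System.out.print(to("", "\t", "\n", row));  // one TSV line
  }
}
```

Inside a pipeline, the same loop body would sit in whatever function the transform wraps; only the joining logic is shown here.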
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> Thanks,
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> Jesse
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik
> > > > > >>>>>>>>>>>> <lcwik@google.com.invalid
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> The conversion from object to string will have
> > uses
> > > > > >> outside
> > > > > >>>>> of
> > > > > >>>>>>>>>>> just
> > > > > >>>>>>>>>>>>>>>>>>> TextIO.Write so it seems logical that we would
> > want
> > > > to
> > > > > >>> have
> > > > > >>>>> a
> > > > > >>>>>>>>>>> ParDo
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> do
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>> conversion.
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>> Text file formats have a lot of variance, even
> if
> > > you
> > > > > >>>>> consider
> > > > > >>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> subset
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> of CSV like formats where it could have fixed
> > width
> > > > > >> fields,
> > > > > >>>>> or
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> escaping
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> and
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>> quoting around other fields, or headers that
> > should
> > > > be
> > > > > >>>>> placed
> > > > > >>>>>>> at
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> top.
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>> Having all these format conversions within
> > > > TextIO.Write
> > > > > >>>>> seems
> > > > > >>>>>>>>>>> like
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> a
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> lot
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> of
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>> logic to contain in that transform which should
> > > just
> > > > > >> focus
> > > > > >>>>> on
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> writing
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> to
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> files.
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 8:15 AM, Jesse Anderson
> <
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> jesse@smokinghand.com>
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>> This is a thread moved over from the user
> mailing
> > > > list.
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>> I think there needs to be a way to convert a PCollection<KV> to
> > > > > > >>>>>>>>>>>>>>>>>>>> PCollection<String>.
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>> To do a minimal WordCount, you have to
> manually
> > > > > convert
> > > > > >>> the
> > > > > >>>>>>> KV
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>> to a
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> String:
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>         p
> > > > > > >>>>>>>>>>>>>>>>>>>>                 .apply(TextIO.Read.from("playing_cards.tsv"))
> > > > > > >>>>>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> > > > > > >>>>>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> > > > > > >>>>>>>>>>>>>>>>>>>> *                .apply(MapElements.via((KV<String, Long> count) ->*
> > > > > > >>>>>>>>>>>>>>>>>>>> *                            count.getKey() + ":" + count.getValue()*
> > > > > > >>>>>>>>>>>>>>>>>>>> *                        ).withOutputType(TypeDescriptors.strings()))*
> > > > > > >>>>>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to("output/stringcounts"));
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>> This code really should be something like:
> > > > > > >>>>>>>>>>>>>>>>>>>>         p
> > > > > > >>>>>>>>>>>>>>>>>>>>                 .apply(TextIO.Read.from("playing_cards.tsv"))
> > > > > > >>>>>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> > > > > > >>>>>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> > > > > > >>>>>>>>>>>>>>>>>>>> *                .apply(ToString.stringify())*
> > > > > > >>>>>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to("output/stringcounts"));
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>> To summarize the discussion:
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>>    - JA: Add a method to StringDelegateCoder to output any KV or list
> > > > > > >>>>>>>>>>>>>>>>>>>>    - JA and DH: Add a SimpleFunction that takes any type and runs
> > > > > > >>>>>>>>>>>>>>>>>>>>    toString() on it:
> > > > > > >>>>>>>>>>>>>>>>>>>>    class ToStringFn<InputT> extends SimpleFunction<InputT, String> {
> > > > > > >>>>>>>>>>>>>>>>>>>>        @Override
> > > > > > >>>>>>>>>>>>>>>>>>>>        public String apply(InputT input) {
> > > > > > >>>>>>>>>>>>>>>>>>>>            return input.toString();
> > > > > > >>>>>>>>>>>>>>>>>>>>        }
> > > > > > >>>>>>>>>>>>>>>>>>>>    }
> > > > > > >>>>>>>>>>>>>>>>>>>>    - JB: Add a general purpose type converter like in Apache Camel.
> > > > > > >>>>>>>>>>>>>>>>>>>>    - JA: Add Object support to TextIO.Write that would write out the
> > > > > > >>>>>>>>>>>>>>>>>>>>    toString of any Object.
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>> My thoughts:
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>>>>> Is converting to a PCollection<String> mostly needed when you're
> > > > > > >>>>>>>>>>>>>>>>>>>> using TextIO.Write? Will a general purpose transform only work in
> > > > > > >>>>>>>>>>>>>>>>>>>> certain cases and you'll normally have to write custom code to format
> > > > > > >>>>>>>>>>>>>>>>>>>> the strings the way you want them?
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>> IMHO, it's yes to both. I'd prefer to add
> Object
> > > > > >> support
> > > > > >>> to
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>> TextIO.Write
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>> or
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>> a SimpleFunction that takes a delimiter as an
> > > > > argument.
> > > > > >>>>>>> Making
> > > > > >>>>>>>>>> a
> > > > > >>>>>>>>>>>>>>>>>>>> SimpleFunction that's able to specify a
> > delimiter
> > > > (and
> > > > > >>>>>>> perhaps
> > > > > >>>>>>>>>> a
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>> prefix
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> and
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>> suffix) should cover the majority of formats
> and
> > > > > cases.
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>> Thanks,
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>> Jesse
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> --
> > > > > >>>>>>>>>>>>> Jean-Baptiste Onofré
> > > > > >>>>>>>>>>>>> jbonofre@apache.org
> > > > > >>>>>>>>>>>>> http://blog.nanthrax.net
> > > > > >>>>>>>>>>>>> Talend - http://www.talend.com
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> --
> > > > > >>>>>>>>> Jean-Baptiste Onofré
> > > > > >>>>>>>>> jbonofre@apache.org
> > > > > >>>>>>>>> http://blog.nanthrax.net
> > > > > >>>>>>>>> Talend - http://www.talend.com
> > > > > >>>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>> --
> > > > > >>>>>>> Jean-Baptiste Onofré
> > > > > >>>>>>> jbonofre@apache.org
> > > > > >>>>>>> http://blog.nanthrax.net
> > > > > >>>>>>> Talend - http://www.talend.com
> > > > > >>>>>>>
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>>> --
> > > > > >>>>> Jean-Baptiste Onofré
> > > > > >>>>> jbonofre@apache.org
> > > > > >>>>> http://blog.nanthrax.net
> > > > > >>>>> Talend - http://www.talend.com
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>
> > > > > >>> --
> > > > > >>> Jean-Baptiste Onofré
> > > > > >>> jbonofre@apache.org
> > > > > >>> http://blog.nanthrax.net
> > > > > >>> Talend - http://www.talend.com
> > > > > >>>
> > > > > >>
> > > > > >
> > > > >
> > > > > --
> > > > > Jean-Baptiste Onofré
> > > > > jbonofre@apache.org
> > > > > http://blog.nanthrax.net
> > > > > Talend - http://www.talend.com
> > > > >
> > > >
> > >
> >
>

Re: PCollection to PCollection Conversion

Posted by Ben Chambers <bc...@google.com.INVALID>.

I like that idea, with the caveat that we should probably come up with a
better name. Perhaps "ToString.elements()" and ToString.Elements or
something? Calling one the "default" and using "create" for it seems
moderately non-future proof.
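
A minimal sketch of the naming shape suggested here, assuming a factory method `elements()` on a non-instantiable holder class that returns a nested `Elements` class. To stay compilable without the Beam SDK, `java.util.function.Function<T, String>` stands in for the real PTransform; the class name `ToStringSketch` is hypothetical and this is not the code from the PR.

```java
import java.util.function.Function;

// Plain-Java stand-in: Function<T, String> plays the role of a Beam
// PTransform so the sketch compiles without the Beam SDK.
final class ToStringSketch {
  // Not a transform itself, only a namespace for factory methods.
  private ToStringSketch() {}

  // Named factory; other factories could later sit beside it
  // without one of them being singled out as "the default".
  static <T> Elements<T> elements() {
    return new Elements<>();
  }

  // Nested class returned by elements(); calls toString() on each input.
  static final class Elements<T> implements Function<T, String> {
    private Elements() {}

    @Override
    public String apply(T input) {
      return input.toString();
    }
  }

  public static void main(String[] args) {
    Function<Integer, String> fn = ToStringSketch.elements();
    System.out.println(fn.apply(42));
  }
}
```

Because callers only ever see `ToStringSketch.elements()`, the nested class can change or gain siblings later without breaking them.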

On Thu, Dec 29, 2016 at 3:17 PM Dan Halperin <dh...@google.com.invalid>
wrote:

> On Thu, Dec 29, 2016 at 2:10 PM, Jesse Anderson <je...@smokinghand.com>
> wrote:
>
> > I agree MapElements isn't hard to use. I think there is a demand for this
> > built-in conversion.
> >
> > My thought on the formatter is that, worst case, we could do runtime type
> > checking. It would be ugly and not as performant, but it should work. As
> > we've said, we'd point them to MapElements for better code. We'd write
> the
> > JavaDoc accordingly.
> >
>
> I think it will be good to see these proposals in PR form. I would stay far
> away from reflection and varargs if possible, but properly-typed bits of
> code (possibly exposed as SerializableFunctions in ToString?) would
> probably make sense.
>
> In the short-term, I can't find anyone arguing against a ToString.create()
> that simply does input.toString().
>
> To get started, how about we ask Vikas to clean up the PR to be more
> future-proof for now? Aka make `ToString` itself not a PTransform,  but
> instead ToString.create() returns ToString.Default which is a private class
> implementing what ToString is now (PTransform<T, String>, wrapping
> MapElements).
>
> Then we can send PRs adding new features to that.
>
> IME and to Ben's point, these will mostly be used in development. Some of
> > our assumptions will break down when programmers aren't the ones using
> > Beam. I can see from the user traffic already that not everyone using
> Beam
> > is a programmer and they'll need classes like this to be productive.
>
>
> > On Thu, Dec 29, 2016 at 1:46 PM Dan Halperin <dhalperi@google.com.invalid
> >
> > wrote:
> >
> > On Thu, Dec 29, 2016 at 1:36 PM, Jesse Anderson <je...@smokinghand.com>
> > wrote:
> >
> > > I prefer JB's take. I think there should be three overloaded methods on
> > the
> > > class. I like Vikas' name ToString. The methods for a simple conversion
> > > should be:
> > >
> > > ToString.strings() - Outputs the .toString() of the objects in the
> > > PCollection
> > > ToString.strings(String delimiter) - Outputs the .toString() of KVs,
> > Lists,
> > > etc with the delimiter between every entry
> > > ToString.formatted(String format) - Outputs the formatted
> > > <https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html>
> > > string
> > > with the object passed in. For objects made up of different parts like
> > KVs,
> > > each one is passed in as separate toString() of a varargs.
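A rough sketch of what the `formatted(String format)` variant could look like for the key/value case only, assuming the key and the value are passed to `String.format` as two separate string arguments. `java.util.Map.Entry` stands in for Beam's KV so the code runs without the SDK; `FormattedSketch` is a hypothetical name, not an existing class.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical illustration: Map.Entry stands in for Beam's KV.
final class FormattedSketch {
  private FormattedSketch() {}

  // Passes toString() of the key and of the value as two format arguments.
  static String formatted(String format, Map.Entry<?, ?> kv) {
    return String.format(
        format, String.valueOf(kv.getKey()), String.valueOf(kv.getValue()));
  }

  public static void main(String[] args) {
    Map.Entry<String, Long> count = new SimpleEntry<>("ace", 4L);
    System.out.println(formatted("%s: %s", count));
  }
}
```

This only works because the method knows an entry has exactly two parts; a version for arbitrary types would need some way to discover the parts at runtime.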
> > >
> >
> > Riffing a little, with some types:
> >
> > ToString.<T>of() -- PTransform<T, String> that is equivalent to a ParDo
> > that takes in a T and outputs T.toString().
> >
> > ToString.<K,V>kv(String delimiter) -- PTransform<KV<K, V>, String> that
> is
> > equivalent to a ParDo that takes in a KV<K,V> and outputs
> > kv.getKey().toString() + delimiter + kv.getValue().toString()
> >
> > ToString.<T>iterable(String delimiter) -- PTransform<? extends
> Iterable<T>,
> > String> that is equivalent to a ParDo that takes in an Iterable<T> and
> > outputs the iterable[0] + delimiter + iterable[1] + delimiter + ... +
> > delimiter + iterable[N-1]
> >
> > ToString.<T>custom(SerializableFunction<T, String> formatter) ?
> >
> > The last one is just MapElements.via, except you don't need to set the
> > output type.
> >
> > I don't see a way to make the generic .formatted() that you propose that
> > just works with anything "made of different parts".
> >
> > I think adding too many methods beyond "of" and "custom" is opening
> > up a Pandora's Box. The KV one might want to have left and right
> > delimiters, might want to take custom formatters for K and V, etc. The
> > iterable one might want to have a special configuration for an empty
> > iterable. So I'm inclined towards simplicity with the awareness that
> > MapElements.via is just not that hard to use.
> >
> > Dan
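
The four variants riffed on above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not Beam code: `java.util.function.Function` stands in for PTransform, `java.util.Map.Entry` stands in for KV, and `ToStringVariantsSketch` is a hypothetical name.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

// Hypothetical stand-ins: Function for PTransform, Map.Entry for KV.
final class ToStringVariantsSketch {
  private ToStringVariantsSketch() {}

  // of(): toString() of each element.
  static <T> Function<T, String> of() {
    return input -> input.toString();
  }

  // kv(delimiter): key + delimiter + value.
  static <K, V> Function<Map.Entry<K, V>, String> kv(String delimiter) {
    return e -> e.getKey().toString() + delimiter + e.getValue().toString();
  }

  // iterable(delimiter): elements joined with the delimiter between them.
  static <T> Function<Iterable<T>, String> iterable(String delimiter) {
    return it -> StreamSupport.stream(it.spliterator(), false)
        .map(Object::toString)
        .collect(Collectors.joining(delimiter));
  }

  // custom(formatter): the caller supplies the whole conversion.
  static <T> Function<T, String> custom(Function<T, String> formatter) {
    return formatter;
  }

  public static void main(String[] args) {
    Map.Entry<String, Long> count = new SimpleEntry<>("ace", 4L);
    Function<Map.Entry<String, Long>, String> kvFn = kv(":");
    Function<Iterable<String>, String> joinFn = iterable(",");
    System.out.println(kvFn.apply(count));
    System.out.println(joinFn.apply(Arrays.asList("one", "two", "three")));
  }
}
```

Note that `custom` adds nothing over passing the function directly, which matches the observation that it is just MapElements.via without the output type.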
> >
> >
> > >
> > > I think doing these three methods would cover every simple and advanced
> > > "simple conversions." As JB says, we'll need other specific converters
> > for
> > > other formats like XML.
> > >
> > > > I'd really like to see this class in the next version of Beam. What does
> > > > everyone think of the class name, method names, and method operations so
> > > > we can have Vikas finish up?
> > >
> > > Thanks,
> > >
> > > Jesse
> > >
> > > On Wed, Dec 28, 2016 at 12:28 PM Jean-Baptiste Onofré <jb@nanthrax.net
> >
> > > wrote:
> > >
> > > > Hi Vikas,
> > > >
> > > > > did you take a look at:
> > > >
> > > >
> > > > > https://github.com/jbonofre/beam/tree/DATAFORMAT/sdks/java/extensions/dataformat
> > > >
> > > > You can see KV2String and ToString could be part of this extension.
> > > > I'm also using JAXB for XML and Jackson for JSON
> > > > marshalling/unmarshalling. I'm planning to deal with Avro
> > > (IndexedRecord).
> > > >
> > > > Regards
> > > > JB
> > > >
> > > > On 12/28/2016 08:37 PM, Vikas Kedigehalli wrote:
> > > > > Hi All,
> > > > >
> > > > >   Not being aware of the discussion here, I sent out a PR
> > > > > <https://github.com/apache/beam/pull/1704> but JB and others
> > directed
> > > > me to
> > > > > this thread. Having converted PCollection<T> to PCollection<String>
> > > > several
> > > > > times, I feel something like 'ToString' transform is common enough
> to
> > > be
> > > > > part of the core. What do you all think?
> > > > >
> > > > > Also, if someone else is already working on or interested in
> tackling
> > > > this,
> > > > > then I am happy to discard the PR.
> > > > >
> > > > > Regards,
> > > > > Vikas
> > > > >
> > > > > On Tue, Dec 13, 2016 at 1:56 AM, Amit Sela <am...@gmail.com>
> > > wrote:
> > > > >
> > > > >> It seems that there were a lot of good points raised here, and I
> > tend
> > > to
> > > > >> agree that something as trivial and lean as "ToString" should be a
> > > part
> > > > of
> > > > > >> core.
> > > > >> I'm particularly fond of makeString(prefix, toString, suffix) in
> > > various
> > > > >> combinations (Scala-like).
> > > > >> For "fromString", I think JB has a good point leveraging JAXB and
> > > > Jackson -
> > > > >> though I think this should be in extensions as it is not as lean
> as
> > > > >> toString.
> > > > >>
> > > > >> Thanks,
> > > > >> Amit
> > > > >>
> > > > >> On Wed, Nov 30, 2016 at 5:13 AM Jean-Baptiste Onofré <
> > jb@nanthrax.net
> > > >
> > > > >> wrote:
> > > > >>
> > > > >>> Hi Jesse,
> > > > >>>
> > > > >>> yes, I started something there (using JAXB and Jackson). Let me
> > > polish
> > > > >>> and push.
> > > > >>>
> > > > >>> Regards
> > > > >>> JB
> > > > >>>
> > > > >>> On 11/29/2016 10:00 PM, Jesse Anderson wrote:
> > > > >>>> I went through the string conversions. Do you have an example of
> > > > >> writing
> > > > >>>> out XML/JSON/etc too?
> > > > >>>>
> > > > >>>> On Tue, Nov 29, 2016 at 3:46 PM Jean-Baptiste Onofré <
> > > jb@nanthrax.net
> > > > >
> > > > >>>> wrote:
> > > > >>>>
> > > > >>>>> Hi Jesse,
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > > >>> https://github.com/jbonofre/incubator-beam/tree/DATAFORMAT/sdks/java/extensions/dataformat
> > > > >>>>>
> > > > >>>>> it's very simple and stupid and of course not complete at all
> (I
> > > have
> > > > >>>>> other commits but not merged as they need some polishing), but
> as
> > I
> > > > >>>>> said, it's a base of discussion.
> > > > >>>>>
> > > > >>>>> Regards
> > > > >>>>> JB
> > > > >>>>>
> > > > >>>>> On 11/29/2016 09:23 PM, Jesse Anderson wrote:
> > > > >>>>>> @jb Sounds good. Just let us know once you've pushed.
> > > > >>>>>>
> > > > >>>>>> On Tue, Nov 29, 2016 at 2:54 PM Jean-Baptiste Onofré <
> > > > >> jb@nanthrax.net>
> > > > >>>>>> wrote:
> > > > >>>>>>
> > > > >>>>>>> Good point Eugene.
> > > > >>>>>>>
> > > > >>>>>>> Right now, it's a DoFn collection to experiment a bit (a pure
> > > > >>>>>>> extension). It's pretty stupid ;)
> > > > >>>>>>>
> > > > > >>>>>>> But, you are right, depending on the direction of such extension, it
> > > > > >>>>>>> could cover more use cases (even if it's not my first intention ;)).
> > > > >>>>>>>
> > > > >>>>>>> Let me push the branch (pretty small) as an illustration, and
> > in
> > > > the
> > > > >>>>>>> mean time, I'm preparing a document (more focused on the use
> > > > cases).
> > > > >>>>>>>
> > > > >>>>>>> WDYT ?
> > > > >>>>>>>
> > > > >>>>>>> Regards
> > > > >>>>>>> JB
> > > > >>>>>>>
> > > > >>>>>>> On 11/29/2016 08:47 PM, Eugene Kirpichov wrote:
> > > > >>>>>>>> Hi JB,
> > > > >>>>>>>> Depending on the scope of what you want to ultimately
> > accomplish
> > > > >> with
> > > > >>>>>>> this
> > > > >>>>>>>> extension, I think it may make sense to write a proposal
> > > document
> > > > >> and
> > > > >>>>>>>> discuss it.
> > > > >>>>>>>> If it's just a collection of utility DoFn's for various
> > > > >> well-defined
> > > > >>>>>>>> source/target format pairs, then that's probably not needed,
> > but
> > > > if
> > > > >>>>> it's
> > > > >>>>>>>> anything more, then I think it is.
> > > > >>>>>>>> That will help avoid a lot of churn if people propose
> > reasonable
> > > > >>>>>>>> significant changes.
> > > > >>>>>>>>
> > > > >>>>>>>> On Tue, Nov 29, 2016 at 11:15 AM Jean-Baptiste Onofré <
> > > > >>> jb@nanthrax.net
> > > > >>>>>>
> > > > >>>>>>>> wrote:
> > > > >>>>>>>>
> > > > > >>>>>>>>> By the way Jesse, I'm going to push my DATAFORMAT branch to my
> > > > > >>>>>>>>> GitHub and I will post on the dev mailing list when done.
> > > > >>>>>>>>>
> > > > >>>>>>>>> Regards
> > > > >>>>>>>>> JB
> > > > >>>>>>>>>
> > > > >>>>>>>>> On 11/29/2016 07:01 PM, Jesse Anderson wrote:
> > > > >>>>>>>>>> I want to bring this thread back up since we've had time
> to
> > > > think
> > > > >>>>> about
> > > > >>>>>>>>> it
> > > > >>>>>>>>>> more and make a plan.
> > > > >>>>>>>>>>
> > > > > >>>>>>>>>> I think a format-specific converter will be a more time-consuming
> > > > > >>>>>>>>>> task than we originally thought. It'd have to be a writer that takes
> > > > > >>>>>>>>>> another writer as a parameter.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> I think a string converter can be done as a simple
> > transform.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> I think we should start with a simple string converter and
> > > plan
> > > > >>> for a
> > > > >>>>>>>>>> format-specific writer.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> What are your thoughts?
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Thanks,
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Jesse
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> On Thu, Nov 10, 2016 at 10:33 AM Jesse Anderson <
> > > > >>>>> jesse@smokinghand.com
> > > > >>>>>>>>
> > > > >>>>>>>>>> wrote:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> I was thinking about what the outputs would look like last
> > > > >> night. I
> > > > >>>>>>>>>> realized that more complex formats like JSON and XML may
> or
> > > may
> > > > >> not
> > > > >>>>>>>>> output
> > > > >>>>>>>>>> the data in a valid format.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Doing a direct conversion on unbounded collections would
> > work
> > > > >> just
> > > > >>>>>>> fine.
> > > > >>>>>>>>>> They're self-contained. For writing out bounded
> collections,
> > > > >> that's
> > > > >>>>>>> where
> > > > >>>>>>>>>> we'll hit the issues. This changes the uber conversion
> > > transform
> > > > >>>>> into a
> > > > >>>>>>>>>> transform that needs to be a writer.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> If a transform executes a JSON conversion on a per element
> > > > basis,
> > > > >>>>> we'd
> > > > >>>>>>>>> get
> > > > >>>>>>>>>> this:
> > > > >>>>>>>>>> {
> > > > >>>>>>>>>> "key": "value"
> > > > >>>>>>>>>> }, {
> > > > >>>>>>>>>> "key": "value"
> > > > >>>>>>>>>> },
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> That isn't valid JSON.
> > > > >>>>>>>>>>
> > > > > >>>>>>>>>> The conversion transform would need to do several things when
> > > > > >>>>>>>>>> writing out a file. It would need to add brackets for an array. Now
> > > > > >>>>>>>>>> we have:
> > > > >>>>>>>>>> [
> > > > >>>>>>>>>> {
> > > > >>>>>>>>>> "key": "value"
> > > > >>>>>>>>>> }, {
> > > > >>>>>>>>>> "key": "value"
> > > > >>>>>>>>>> },
> > > > >>>>>>>>>> ]
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> We still don't have valid JSON. We have to remove the last
> > > comma
> > > > >> or
> > > > >>>>>>> have
> > > > >>>>>>>>>> the uber transform start putting in the commas, except for
> > the
> > > > >> last
> > > > >>>>>>>>> element.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> [
> > > > >>>>>>>>>> {
> > > > >>>>>>>>>> "key": "value"
> > > > >>>>>>>>>> }, {
> > > > >>>>>>>>>> "key": "value"
> > > > >>>>>>>>>> }
> > > > >>>>>>>>>> ]
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Only by doing this do we have valid JSON.
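The framing steps walked through above (open bracket, commas between elements but not after the last, close bracket) can be sketched as follows. This is a minimal in-memory illustration: it assumes every element is already a valid JSON value and that all elements are available to a single writer, which a distributed pipeline does not give you for free. `JsonArraySketch` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical illustration of framing per-element JSON into one valid array.
final class JsonArraySketch {
  private JsonArraySketch() {}

  // Assumes each string is already a valid JSON value.
  static String toJsonArray(Iterable<String> jsonElements) {
    // The joiner writes the separator only between elements, so there is
    // no trailing comma to strip before the closing bracket.
    StringJoiner joiner = new StringJoiner(",\n", "[\n", "\n]");
    joiner.setEmptyValue("[]");
    for (String element : jsonElements) {
      joiner.add(element);
    }
    return joiner.toString();
  }

  public static void main(String[] args) {
    List<String> elements =
        Arrays.asList("{\"key\": \"value\"}", "{\"key\": \"value\"}");
    System.out.println(toJsonArray(elements));
  }
}
```

The same shape would apply to the XML case, with the root element's opening and closing tags as prefix and suffix and no separator between elements.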
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> I'd argue we'd have a similar issue with XML. Some parsers
> > > > >> require
> > > > >>> a
> > > > >>>>>>> root
> > > > >>>>>>>>>> element for everything. The uber transform would have to
> put
> > > the
> > > > >>> root
> > > > >>>>>>>>>> element tags at the beginning and end of the file.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> On Wed, Nov 9, 2016 at 11:36 PM Manu Zhang <
> > > > >>> owenzhang1990@gmail.com>
> > > > >>>>>>>>> wrote:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> I would love to see a lean core and abundant Transforms at
> > the
> > > > >> same
> > > > >>>>>>> time.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Maybe we can look at what Confluent <
> > > > >>> https://github.com/confluentinc
> > > > >>>>>>
> > > > >>>>>>>>> does
> > > > >>>>>>>>>> for kafka-connect. They have official extensions support
> for
> > > > >> JDBC,
> > > > >>>>> HDFS
> > > > >>>>>>>>> and
> > > > >>>>>>>>>> ElasticSearch under https://github.com/confluentinc. They
> > put
> > > > >> them
> > > > >>>>>>> along
> > > > >>>>>>>>>> with other community extensions on
> > > > >>>>>>>>>> https://www.confluent.io/product/connectors/ for
> > visibility.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Although not a commercial company, can we have a GitHub
> user
> > > > like
> > > > >>>>>>>>>> beam-community to host projects we build around beam but
> not
> > > > >>> suitable
> > > > >>>>>>> for
> > > > >>>>>>>>>> https://github.com/apache/incubator-beam. In the future,
> we
> > > may
> > > > >>> have
> > > > >>>>>>>>>> beam-algebra like http://github.com/twitter/algebird for
> > > > algebra
> > > > >>>>>>>>> operations
> > > > >>>>>>>>>> and beam-ml / beam-dl for machine learning / deep
> learning.
> > > > Also,
> > > > >>>>> there
> > > > >>>>>>>>>> will be beam related projects elsewhere maintained by
> > > other
> > > > >>>>>>>>>> communities. We can put all of them on the beam-website or
> > > like
> > > > >>> spark
> > > > >>>>>>>>>> packages as mentioned by Amit.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> My $0.02
> > > > >>>>>>>>>> Manu
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles
> > > > >>>>> <klk@google.com.invalid
> > > > >>>>>>>>
> > > > >>>>>>>>>> wrote:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>> On this point from Amit and Ismaël, I agree: we could
> > benefit
> > > > >>> from a
> > > > >>>>>>>>> place
> > > > >>>>>>>>>>> for miscellaneous non-core helper transformations.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> We have sdks/java/extensions but it is organized as
> > separate
> > > > >>>>>>> artifacts.
> > > > >>>>>>>>> I
> > > > >>>>>>>>>>> think that is fine, considering the nature of Join and
> > > > >> SortValues.
> > > > >>>>> But
> > > > >>>>>>>>> for
> > > > >>>>>>>>>>> simpler transforms, importing one artifact per tiny
> > > > >>>>>>>>>>> transform is too much
> > > > >>>>>>>>>>> overhead. It also seems unlikely that we will have enough
> > > > >>>>> commonality
> > > > >>>>>>>>>> among
> > > > >>>>>>>>>>> the transforms to call the artifact anything other than
> > [some
> > > > >>>>> synonym
> > > > >>>>>>>>> for]
> > > > >>>>>>>>>>> "miscellaneous".
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> I wouldn't want to take this too far - even though the
> > > > >>>>>>>>>>> SDK has many transforms*
> > > > >>>>>>>>>>> that are not required for the model [1], I like that the
> > SDK
> > > > >>>>> artifact
> > > > >>>>>>>>> has
> > > > >>>>>>>>>>> everything a user might need in their "getting started"
> > phase
> > > > of
> > > > >>>>> use.
> > > > >>>>>>>>> This
> > > > >>>>>>>>>>> user-friendliness (the user doesn't care that ParDo is
> core
> > > and
> > > > >>> Sum
> > > > >>>>> is
> > > > >>>>>>>>>> not)
> > > > >>>>>>>>>>> plus the difficulty of judging which transforms go where,
> > are
> > > > >>>>> probably
> > > > >>>>>>>>> why
> > > > >>>>>>>>>>> we have them mostly all in one place.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Models to look at, off the top of my head, include Pig's
> > > > >> PiggyBank
> > > > >>>>> and
> > > > >>>>>>>>>>> Apex's Malhar. These have different levels of support
> > > implied.
> > > > >>>>> Others?
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Kenn
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> [1] ApproximateQuantiles, ApproximateUnique, Count,
> > Distinct,
> > > > >>>>> Filter,
> > > > >>>>>>>>>>> FlatMapElements, Keys, Latest, MapElements, Max, Mean,
> Min,
> > > > >>> Values,
> > > > >>>>>>>>>> KvSwap,
> > > > >>>>>>>>>>> Partition, Regex, Sample, Sum, Top, Values, WithKeys,
> > > > >>> WithTimestamps
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> * at least they are separate classes and not methods on
> > > > >>> PCollection
> > > > >>>>>>> :-)
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> On Wed, Nov 9, 2016 at 6:03 AM, Ismaël Mejía <
> > > > iemejia@gmail.com
> > > > >>>
> > > > >>>>>>> wrote:
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> Nice discussion, and thanks Jesse for bringing this
> > subject
> > > > >>> back.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> I agree 100% with Amit and the idea of having a home for
> > > those
> > > > >>>>>>>>>> transforms
> > > > >>>>>>>>>>>> that are not core enough to be part of the sdk, but that
> > we
> > > > all
> > > > >>> end
> > > > >>>>>>> up
> > > > >>>>>>>>>>>> re-writing somehow.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> This is a needed improvement to be more developer
> > friendly,
> > > > but
> > > > >>>>> also
> > > > >>>>>>> as
> > > > >>>>>>>>>> a
> > > > >>>>>>>>>>>> reference of good practices of Beam development, and for
> > > this
> > > > >>>>> reason
> > > > >>>>>>> I
> > > > >>>>>>>>>>>> agree with JB that at this moment it would be better for
> > > these
> > > > >>>>>>>>>> transforms
> > > > >>>>>>>>>>>> to reside in the Beam repository at least for visibility
> > > > >> reasons.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> One additional question is if these transforms
> represent a
> > > > >>>>> different
> > > > >>>>>>>>> DSL
> > > > >>>>>>>>>>> or
> > > > >>>>>>>>>>>> if those could be grouped with the current extensions
> > (e.g.
> > > > >> Join
> > > > >>>>> and
> > > > >>>>>>>>>>>> SortValues) into something more general that we as a
> > > community
> > > > >>>>> could
> > > > >>>>>>>>>>>> maintain, but well even if it is not the case, it would
> be
> > > > >> really
> > > > >>>>>>> nice
> > > > >>>>>>>>>> to
> > > > >>>>>>>>>>>> start working on something like this.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Ismaël Mejía
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré <
> > > > >>>>>>> jb@nanthrax.net
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Related to spark-package, we also have Apache Bahir to
> > host
> > > > >>>>>>>>>>>>> connectors/transforms for Spark and Flink.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> IMHO, right now, Beam should host this, not sure if it
> > > makes
> > > > >>> sense
> > > > >>>>>>>>>>>>> directly in the core.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> It reminds me of the "Integration" DSL we discussed in the
> > > > >>>>>>>>>>>>> technical vision document.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Regards
> > > > >>>>>>>>>>>>> JB
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> On 11/09/2016 11:17 AM, Amit Sela wrote:
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> I think Jesse has a very good point on one hand, while
> > > > >>>>>>>>>>>>>> Luke's and Kenneth's worries about committing users to
> > > > >>>>>>>>>>>>>> specific implementations are in place.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> The Spark community has a 3rd party repository for
> > useful
> > > > >>>>> libraries
> > > > >>>>>>>>>>> that
> > > > >>>>>>>>>>>>>> for various reasons are not a part of the Apache Spark
> > > > >> project:
> > > > >>>>>>>>>>>>>> https://spark-packages.org/.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Maybe a "common-transformations" package would serve
> > both
> > > > >> users
> > > > >>>>>>> quick
> > > > >>>>>>>>>>>>>> ramp-up and ease-of-use while keeping Beam more
> > > "enabling" ?
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles
> > > > >>>>>>>>>> <klk@google.com.invalid
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> It seems useful for small scale debugging / demoing to
> > > have
> > > > >>>>>>>>>>>>>>> Dump.toString(). I think it should be named to
> clearly
> > > > >>> indicate
> > > > >>>>>>> its
> > > > >>>>>>>>>>>>>>> limited
> > > > >>>>>>>>>>>>>>> scope. Maybe other stuff could go in the Dump
> > namespace,
> > > > but
> > > > >>>>>>>>>>>>>>> "Dump.toJson()" would be for humans to read - so it
> > > should
> > > > >> be
> > > > >>>>>>> pretty
> > > > >>>>>>>>>>>>>>> printed, not treated as a machine-to-machine wire
> > format.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> The broader question of representing data in JSON or
> > XML,
> > > > >> etc,
> > > > >>>>> is
> > > > >>>>>>>>>>>> already
> > > > >>>>>>>>>>>>>>> the subject of many mature libraries which are
> already
> > > easy
> > > > >> to
> > > > >>>>> use
> > > > >>>>>>>>>>> with
> > > > >>>>>>>>>>>>>>> Beam.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> The more esoteric practice of implicit or
> semi-implicit
> > > > >>>>> coercions
> > > > >>>>>>>>>>> seems
> > > > >>>>>>>>>>>>>>> like it is also already addressed in many ways
> > elsewhere.
> > > > >>>>>>>>>>>>>>> Transform.via(TypeConverter) is basically the same as
> > > > >>>>>>>>>>>>>>> MapElements.via(<lambda>) and also easy to use with
> > Beam.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> In both of the last cases, there are many reasonable
> > > > >>> approaches,
> > > > >>>>>>> and
> > > > >>>>>>>>>>> we
> > > > >>>>>>>>>>>>>>> shouldn't commit our users to one of them.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik
> > > > >>>>>>>>>>> <lcwik@google.com.invalid
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> The suggestions you give seem good except for the
> > > > >>>>>>>>>>>>>>> XML cases.
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> Might want to have the XML be a document per line
> > > similar
> > > > >> to
> > > > >>>>> the
> > > > >>>>>>>>>>> JSON
> > > > >>>>>>>>>>>>>>>> examples you have been giving.
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson <
> > > > >>>>>>>>>>>> jesse@smokinghand.com>
> > > > >>>>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> @lukasz Agreed there would have to be KV handling. I
> > > > >>>>>>>>>>>>>>>> was thinking more that whatever the addition, it
> > > > >>>>>>>>>>>>>>>> shouldn't just handle KV. It should handle
> > > > >>>>>>>>>>>>>>>> Iterables, Lists, Sets, and KVs.
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> For JSON and XML, I wonder if we'd be able to give
> > > > someone
> > > > >>>>>>>>>>> something
> > > > >>>>>>>>>>>>>>>>> general purpose enough that you would just end up
> > > writing
> > > > >>> your
> > > > >>>>>>> own
> > > > >>>>>>>>>>>> code
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> handle it anyway.
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Here are some ideas on what it could look like
> with a
> > > > >> method
> > > > >>>>> and
> > > > >>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>> resulting string output:
> > > > >>>>>>>>>>>>>>>>> *Stringify.toJSON()*
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> With KV:
> > > > >>>>>>>>>>>>>>>>> {"key": "value"}
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> With Iterables:
> > > > >>>>>>>>>>>>>>>>> ["one", "two", "three"]
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> *Stringify.toXML("rootelement")*
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> With KV:
> > > > >>>>>>>>>>>>>>>>> <rootelement key=value />
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> With Iterables:
> > > > >>>>>>>>>>>>>>>>> <rootelement>
> > > > >>>>>>>>>>>>>>>>>   <item>one</item>
> > > > >>>>>>>>>>>>>>>>>   <item>two</item>
> > > > >>>>>>>>>>>>>>>>>   <item>three</item>
> > > > >>>>>>>>>>>>>>>>> </rootelement>
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> *Stringify.toDelimited(",")*
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> With KV:
> > > > >>>>>>>>>>>>>>>>> key,value
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> With Iterables:
> > > > >>>>>>>>>>>>>>>>> one,two,three
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Do you think that would strike a good balance
> between
> > > > >>> reusable
> > > > >>>>>>>>>> code
> > > > >>>>>>>>>>>> and
> > > > >>>>>>>>>>>>>>>>> writing your own for more difficult formatting?
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Jesse
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik
> > > > >>>>>>>>>>> <lcwik@google.com.invalid
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Jesse, I believe if one format gets special treatment in
> > > > >>>>>>>>>>>>>>>>> TextIO, people will then ask why JSON, XML, ... are not
> > > > >>>>>>>>>>>>>>>>> also supported.
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Also, the example that you provide is using the
> fact
> > > that
> > > > >>> the
> > > > >>>>>>>>>> input
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> format
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> is an Iterable<Item>. You had posted a question
> about
> > > > >> using
> > > > >>> KV
> > > > >>>>>>>>>> with
> > > > >>>>>>>>>>>>>>>>> TextIO.Write which wouldn't align with the proposed
> > > input
> > > > >>>>> format
> > > > >>>>>>>>>>> and
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> still
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> would require to write a type conversion function,
> > this
> > > > >> time
> > > > >>>>>>> from
> > > > >>>>>>>>>>> KV
> > > > >>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>>>>> Iterable<Item> instead of KV to string.
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson <
> > > > >>>>>>>>>>>> jesse@smokinghand.com>
> > > > >>>>>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Lukasz,
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> I don't think you'd need complicated logic for
> > > > >>> TextIO.Write.
> > > > >>>>>>> For
> > > > >>>>>>>>>>> CSV
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> call would look like:
> > > > >>>>>>>>>>>>>>>>>> Stringify.to("", ",", "\n");
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> Where the arguments would be Stringify.to(prefix,
> > > > >>> delimiter,
> > > > >>>>>>>>>>>> suffix).
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> The code would be something like:
> > > > >>>>>>>>>>>>>>>>>> StringBuffer buffer = new StringBuffer(prefix);
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> for (Item item : list) {
> > > > >>>>>>>>>>>>>>>>>>   buffer.append(item.toString());
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>   if(notLast) {
> > > > >>>>>>>>>>>>>>>>>>     buffer.append(delimiter);
> > > > >>>>>>>>>>>>>>>>>>   }
> > > > >>>>>>>>>>>>>>>>>> }
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> buffer.append(suffix);
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> c.output(buffer.toString());
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> That would allow you to do the basic CSV, TSV, and
> > > other
> > > > >>>>>>> formats
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> without
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> complicated logic. The same sort of thing could be
> > done
> > > > >> for
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> TextIO.Write.
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> Jesse
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik
> > > > >>>>>>>>>>>> <lcwik@google.com.invalid
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> The conversion from object to string will have
> uses
> > > > >> outside
> > > > >>>>> of
> > > > >>>>>>>>>>> just
> > > > >>>>>>>>>>>>>>>>>>> TextIO.Write so it seems logical that we would
> want
> > > to
> > > > >>> have
> > > > >>>>> a
> > > > >>>>>>>>>>> ParDo
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> do
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> conversion.
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> Text file formats have a lot of variance, even if
> > you
> > > > >>>>> consider
> > > > >>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> subset
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> of CSV like formats where it could have fixed
> width
> > > > >> fields,
> > > > >>>>> or
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> escaping
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> and
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> quoting around other fields, or headers that
> should
> > > be
> > > > >>>>> placed
> > > > >>>>>>> at
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> top.
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> Having all these format conversions within
> > > TextIO.Write
> > > > >>>>> seems
> > > > >>>>>>>>>>> like
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> a
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> lot
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> of
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> logic to contain in that transform which should
> > just
> > > > >> focus
> > > > >>>>> on
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> writing
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> files.
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 8:15 AM, Jesse Anderson <
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> jesse@smokinghand.com>
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> This is a thread moved over from the user mailing
> > > list.
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> I think there needs to be a way to convert a
> > > > >>>>>>>>>>>>>>>>>>>> PCollection<KV> to a PCollection<String>.
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> To do a minimal WordCount, you have to manually
> > > > convert
> > > > >>> the
> > > > >>>>>>> KV
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> to a
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> String:
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>         p
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>  .apply(TextIO.Read.from("playing_cards.tsv"))
> > > > >>>>>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> > > > >>>>>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> > > > >>>>>>>>>>>>>>>>>>>> *                .apply(MapElements.via((KV<
> > String,
> > > > >> Long>
> > > > >>>>>>>>>> count)
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> ->*
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> *                            count.getKey() + ":" +
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> count.getValue()*
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> *                        ).withOutputType(
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> TypeDescriptors.strings()))*
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to
> > > > >>>>>>> ("output/stringcounts"));
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> This code really should be something like:
> > > > >>>>>>>>>>>>>>>>>>>>         p
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>  .apply(TextIO.Read.from("playing_cards.tsv"))
> > > > >>>>>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> > > > >>>>>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> > > > >>>>>>>>>>>>>>>>>>>> *                .apply(ToString.stringify())*
> > > > >>>>>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to
> > > > >>>>>>>>> ("output/stringcounts"));
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> To summarize the discussion:
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>    - JA: Add a method to StringDelegateCoder to output
> > > > >>>>>>>>>>>>>>>>>>>>    any KV or list
> > > > >>>>>>>>>>>>>>>>>>>>    - JA and DH: Add a SimpleFunction that takes any type
> > > > >>>>>>>>>>>>>>>>>>>>    and runs toString() on it:
> > > > >>>>>>>>>>>>>>>>>>>>    class ToStringFn<InputT> extends SimpleFunction<InputT, String> {
> > > > >>>>>>>>>>>>>>>>>>>>        @Override
> > > > >>>>>>>>>>>>>>>>>>>>        public String apply(InputT input) {
> > > > >>>>>>>>>>>>>>>>>>>>            return input.toString();
> > > > >>>>>>>>>>>>>>>>>>>>        }
> > > > >>>>>>>>>>>>>>>>>>>>    }
> > > > >>>>>>>>>>>>>>>>>>>>    - JB: Add a general purpose type converter like in
> > > > >>>>>>>>>>>>>>>>>>>>    Apache Camel.
> > > > >>>>>>>>>>>>>>>>>>>>    - JA: Add Object support to TextIO.Write that would
> > > > >>>>>>>>>>>>>>>>>>>>    write out the toString of any Object.
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> My thoughts:
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> Is converting to a PCollection<String> mostly needed
> > > > >>>>>>>>>>>>>>>>>>>> when you're using TextIO.Write? Will a general purpose
> > > > >>>>>>>>>>>>>>>>>>>> transform only work in certain cases and you'll
> > > > >>>>>>>>>>>>>>>>>>>> normally have to write custom code to format the
> > > > >>>>>>>>>>>>>>>>>>>> strings the way you want them?
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> IMHO, it's yes to both. I'd prefer to add Object
> > > > >> support
> > > > >>> to
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> TextIO.Write
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> or
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> a SimpleFunction that takes a delimiter as an
> > > > argument.
> > > > >>>>>>> Making
> > > > >>>>>>>>>> a
> > > > >>>>>>>>>>>>>>>>>>>> SimpleFunction that's able to specify a
> delimiter
> > > (and
> > > > >>>>>>> perhaps
> > > > >>>>>>>>>> a
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> prefix
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> and
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> suffix) should cover the majority of formats and
> > > > cases.
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> Jesse
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> --
> > > > >>>>>>>>>>>>> Jean-Baptiste Onofré
> > > > >>>>>>>>>>>>> jbonofre@apache.org
> > > > >>>>>>>>>>>>> http://blog.nanthrax.net
> > > > >>>>>>>>>>>>> Talend - http://www.talend.com
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> --
> > > > >>>>>>>>> Jean-Baptiste Onofré
> > > > >>>>>>>>> jbonofre@apache.org
> > > > >>>>>>>>> http://blog.nanthrax.net
> > > > >>>>>>>>> Talend - http://www.talend.com
> > > > >>>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> --
> > > > >>>>>>> Jean-Baptiste Onofré
> > > > >>>>>>> jbonofre@apache.org
> > > > >>>>>>> http://blog.nanthrax.net
> > > > >>>>>>> Talend - http://www.talend.com
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>> --
> > > > >>>>> Jean-Baptiste Onofré
> > > > >>>>> jbonofre@apache.org
> > > > >>>>> http://blog.nanthrax.net
> > > > >>>>> Talend - http://www.talend.com
> > > > >>>>>
> > > > >>>>
> > > > >>>
> > > > >>> --
> > > > >>> Jean-Baptiste Onofré
> > > > >>> jbonofre@apache.org
> > > > >>> http://blog.nanthrax.net
> > > > >>> Talend - http://www.talend.com
> > > > >>>
> > > > >>
> > > > >
> > > >
> > > > --
> > > > Jean-Baptiste Onofré
> > > > jbonofre@apache.org
> > > > http://blog.nanthrax.net
> > > > Talend - http://www.talend.com
> > > >
> > >
> >
>

Re: PCollection to PCollection Conversion

Posted by Dan Halperin <dh...@google.com.INVALID>.

On Thu, Dec 29, 2016 at 2:10 PM, Jesse Anderson <je...@smokinghand.com>
wrote:

> I agree MapElements isn't hard to use. I think there is a demand for this
> built-in conversion.
>
> My thought on the formatter is that, worst case, we could do runtime type
> checking. It would be ugly and not as performant, but it should work. As
> we've said, we'd point them to MapElements for better code. We'd write the
> JavaDoc accordingly.
>

I think it will be good to see these proposals in PR form. I would stay far
away from reflection and varargs if possible, but properly-typed bits of
code (possibly exposed as SerializableFunctions in ToString?) would
probably make sense.

In the short-term, I can't find anyone arguing against a ToString.create()
that simply does input.toString().

To get started, how about we ask Vikas to clean up the PR to be more
future-proof for now? Aka make `ToString` itself not a PTransform,  but
instead ToString.create() returns ToString.Default which is a private class
implementing what ToString is now (PTransform<T, String>, wrapping
MapElements).

Then we can send PRs adding new features to that.
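
As a rough illustration of the shape described above (a factory method that
returns a hidden default implementation, with typed variants added later), here
is a minimal plain-Java sketch. It does not use the Beam API: Function<T, String>
stands in for PTransform<T, String> applied per element, Map.Entry stands in for
KV, and the names ToStringSketch, kv and iterable are hypothetical.

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.function.Function;

// Plain-Java stand-in, NOT the Beam API. Function<T, String> plays the role of
// a per-element PTransform<T, String>; Map.Entry plays the role of KV.
final class ToStringSketch {
  private ToStringSketch() {} // factory methods only; never instantiated

  // Counterpart of the proposed ToString.create(): callers only see the
  // Function interface, so the implementation class can change later.
  static <T> Function<T, String> create() {
    return new Default<>();
  }

  // Hypothetical later addition: key + delimiter + value.
  static <K, V> Function<Map.Entry<K, V>, String> kv(String delimiter) {
    return e -> e.getKey().toString() + delimiter + e.getValue().toString();
  }

  // Hypothetical later addition: elements joined by the delimiter, with no
  // trailing delimiter. An empty iterable yields the empty string.
  static <T> Function<Iterable<T>, String> iterable(String delimiter) {
    return items -> {
      StringJoiner joiner = new StringJoiner(delimiter);
      for (T item : items) {
        joiner.add(item.toString());
      }
      return joiner.toString();
    };
  }

  // Private default implementation: simply calls toString() on the input.
  private static final class Default<T> implements Function<T, String> {
    @Override
    public String apply(T input) {
      return input.toString();
    }
  }
}
```

Because create() exposes only the interface type, new factory methods can be
added without changing existing callers, which is the future-proofing the
paragraph above asks for.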

> IME and to Ben's point, these will mostly be used in development. Some of
> our assumptions will break down when programmers aren't the ones using
> Beam. I can see from the user traffic already that not everyone using Beam
> is a programmer and they'll need classes like this to be productive.


> On Thu, Dec 29, 2016 at 1:46 PM Dan Halperin <dh...@google.com.invalid>
> wrote:
>
> On Thu, Dec 29, 2016 at 1:36 PM, Jesse Anderson <je...@smokinghand.com>
> wrote:
>
> > I prefer JB's take. I think there should be three overloaded methods on
> the
> > class. I like Vikas' name ToString. The methods for a simple conversion
> > should be:
> >
> > ToString.strings() - Outputs the .toString() of the objects in the
> > PCollection
> > ToString.strings(String delimiter) - Outputs the .toString() of KVs,
> Lists,
> > etc with the delimiter between every entry
> > ToString.formatted(String format) - Outputs the formatted
> > <https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html>
> > string
> > with the object passed in. For objects made up of different parts like
> KVs,
> > each one is passed in as separate toString() of a varargs.
> >
>
> Riffing a little, with some types:
>
> ToString.<T>of() -- PTransform<T, String> that is equivalent to a ParDo
> that takes in a T and outputs T.toString().
>
> ToString.<K,V>kv(String delimiter) -- PTransform<KV<K, V>, String> that is
> equivalent to a ParDo that takes in a KV<K,V> and outputs
> kv.getKey().toString() + delimiter + kv.getValue().toString()
>
> ToString.<T>iterable(String delimiter) -- PTransform<? extends Iterable<T>,
> String> that is equivalent to a ParDo that takes in an Iterable<T> and
> outputs the iterable[0] + delimiter + iterable[1] + delimiter + ... +
> delimiter + iterable[N-1]
>
> ToString.<T>custom(SerializableFunction<T, String> formatter) ?
>
> The last one is just MapElements.via, except you don't need to set the
> output type.
>
> I don't see a way to make the generic .formatted() that you propose that
> just works with anything "made of different parts".
>
> I think that adding too many overrides beyond "of" and "custom" is opening
> up a Pandora's Box. The KV one might want to have left and right
> delimiters, might want to take custom formatters for K and V, etc. etc. The
> iterable one might want to have a special configuration for an empty
> iterable. So I'm inclined towards simplicity with the awareness that
> MapElements.via is just not that hard to use.
>
> Dan
>
>
> >
> > I think doing these three methods would cover every simple and advanced
> > "simple conversions." As JB says, we'll need other specific converters
> for
> > other formats like XML.
> >
> > I'd really like to see this class in the next version of Beam. What does
> > everyone think of the class name, methods name, and method operations so
> we
> > can have Vikas finish up?
> >
> > Thanks,
> >
> > Jesse
> >
> > On Wed, Dec 28, 2016 at 12:28 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
> > wrote:
> >
> > > Hi Vikas,
> > >
> > > did you take a look at:
> > >
> > >
> > > https://github.com/jbonofre/beam/tree/DATAFORMAT/sdks/
> > java/extensions/dataformat
> > >
> > > You can see KV2String and ToString could be part of this extension.
> > > I'm also using JAXB for XML and Jackson for JSON
> > > marshalling/unmarshalling. I'm planning to deal with Avro
> > (IndexedRecord).
> > >
> > > Regards
> > > JB
> > >
> > > On 12/28/2016 08:37 PM, Vikas Kedigehalli wrote:
> > > > Hi All,
> > > >
> > > >   Not being aware of the discussion here, I sent out a PR
> > > > <https://github.com/apache/beam/pull/1704> but JB and others
> directed
> > > me to
> > > > this thread. Having converted PCollection<T> to PCollection<String>
> > > several
> > > > times, I feel something like 'ToString' transform is common enough to
> > be
> > > > part of the core. What do you all think?
> > > >
> > > > Also, if someone else is already working on or interested in tackling
> > > this,
> > > > then I am happy to discard the PR.
> > > >
> > > > Regards,
> > > > Vikas
> > > >
> > > > On Tue, Dec 13, 2016 at 1:56 AM, Amit Sela <am...@gmail.com>
> > wrote:
> > > >
> > > >> It seems that there were a lot of good points raised here, and I
> tend
> > to
> > > >> agree that something as trivial and lean as "ToString" should be a
> > part
> > > of
> > > > >> core.
> > > >> I'm particularly fond of makeString(prefix, toString, suffix) in
> > various
> > > >> combinations (Scala-like).
> > > >> For "fromString", I think JB has a good point leveraging JAXB and
> > > Jackson -
> > > >> though I think this should be in extensions as it is not as lean as
> > > >> toString.
> > > >>
> > > >> Thanks,
> > > >> Amit
> > > >>
> > > >> On Wed, Nov 30, 2016 at 5:13 AM Jean-Baptiste Onofré <
> jb@nanthrax.net
> > >
> > > >> wrote:
> > > >>
> > > >>> Hi Jesse,
> > > >>>
> > > >>> yes, I started something there (using JAXB and Jackson). Let me
> > polish
> > > >>> and push.
> > > >>>
> > > >>> Regards
> > > >>> JB
> > > >>>
> > > >>> On 11/29/2016 10:00 PM, Jesse Anderson wrote:
> > > >>>> I went through the string conversions. Do you have an example of
> > > >> writing
> > > >>>> out XML/JSON/etc too?
> > > >>>>
> > > >>>> On Tue, Nov 29, 2016 at 3:46 PM Jean-Baptiste Onofré <
> > jb@nanthrax.net
> > > >
> > > >>>> wrote:
> > > >>>>
> > > >>>>> Hi Jesse,
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>> https://github.com/jbonofre/incubator-beam/tree/
> > DATAFORMAT/sdks/java/
> > > >> extensions/dataformat
> > > >>>>>
> > > >>>>> it's very simple and stupid and of course not complete at all (I
> > have
> > > >>>>> other commits but not merged as they need some polishing), but as
> I
> > > >>>>> said, it's a base of discussion.
> > > >>>>>
> > > >>>>> Regards
> > > >>>>> JB
> > > >>>>>
> > > >>>>> On 11/29/2016 09:23 PM, Jesse Anderson wrote:
> > > >>>>>> @jb Sounds good. Just let us know once you've pushed.
> > > >>>>>>
> > > >>>>>> On Tue, Nov 29, 2016 at 2:54 PM Jean-Baptiste Onofré <
> > > >> jb@nanthrax.net>
> > > >>>>>> wrote:
> > > >>>>>>
> > > >>>>>>> Good point Eugene.
> > > >>>>>>>
> > > >>>>>>> Right now, it's a DoFn collection to experiment a bit (a pure
> > > >>>>>>> extension). It's pretty stupid ;)
> > > >>>>>>>
> > > >>>>>>> But, you are right, depending the direction of such extension,
> it
> > > >>> could
> > > >>>>>>> cover more use cases (even if it's not my first intention ;)).
> > > >>>>>>>
> > > >>>>>>> Let me push the branch (pretty small) as an illustration, and
> in
> > > the
> > > >>>>>>> mean time, I'm preparing a document (more focused on the use
> > > cases).
> > > >>>>>>>
> > > >>>>>>> WDYT ?
> > > >>>>>>>
> > > >>>>>>> Regards
> > > >>>>>>> JB
> > > >>>>>>>
> > > >>>>>>> On 11/29/2016 08:47 PM, Eugene Kirpichov wrote:
> > > >>>>>>>> Hi JB,
> > > >>>>>>>> Depending on the scope of what you want to ultimately
> accomplish
> > > >> with
> > > >>>>>>> this
> > > >>>>>>>> extension, I think it may make sense to write a proposal
> > document
> > > >> and
> > > >>>>>>>> discuss it.
> > > >>>>>>>> If it's just a collection of utility DoFn's for various
> > > >> well-defined
> > > >>>>>>>> source/target format pairs, then that's probably not needed,
> but
> > > if
> > > >>>>> it's
> > > >>>>>>>> anything more, then I think it is.
> > > >>>>>>>> That will help avoid a lot of churn if people propose
> reasonable
> > > >>>>>>>> significant changes.
> > > >>>>>>>>
> > > >>>>>>>> On Tue, Nov 29, 2016 at 11:15 AM Jean-Baptiste Onofré <
> > > >>> jb@nanthrax.net
> > > >>>>>>
> > > >>>>>>>> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> By the way Jesse, I gonna push my DATAFORMAT branch on my
> > github
> > > >>> and I
> > > >>>>>>>>> will post on the dev mailing list when done.
> > > >>>>>>>>>
> > > >>>>>>>>> Regards
> > > >>>>>>>>> JB
> > > >>>>>>>>>
> > > >>>>>>>>> On 11/29/2016 07:01 PM, Jesse Anderson wrote:
> > > >>>>>>>>>> I want to bring this thread back up since we've had time to
> > > think
> > > >>>>> about
> > > >>>>>>>>> it
> > > >>>>>>>>>> more and make a plan.
> > > >>>>>>>>>>
> > > >>>>>>>>>> I think a format-specific converter will be more time
> > consuming
> > > >>> task
> > > >>>>>>> than
> > > >>>>>>>>>> we originally thought. It'd have to be a writer that takes
> > > >> another
> > > >>>>>>> writer
> > > >>>>>>>>>> as a parameter.
> > > >>>>>>>>>>
> > > >>>>>>>>>> I think a string converter can be done as a simple
> transform.
> > > >>>>>>>>>>
> > > >>>>>>>>>> I think we should start with a simple string converter and
> > plan
> > > >>> for a
> > > >>>>>>>>>> format-specific writer.
> > > >>>>>>>>>>
> > > >>>>>>>>>> What are your thoughts?
> > > >>>>>>>>>>
> > > >>>>>>>>>> Thanks,
> > > >>>>>>>>>>
> > > >>>>>>>>>> Jesse
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Thu, Nov 10, 2016 at 10:33 AM Jesse Anderson <
> > > >>>>> jesse@smokinghand.com
> > > >>>>>>>>
> > > >>>>>>>>>> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> I was thinking about what the outputs would look like last
> > > >> night. I
> > > >>>>>>>>>> realized that more complex formats like JSON and XML may or
> > may
> > > >> not
> > > >>>>>>>>> output
> > > >>>>>>>>>> the data in a valid format.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Doing a direct conversion on unbounded collections would
> work
> > > >> just
> > > >>>>>>> fine.
> > > >>>>>>>>>> They're self-contained. For writing out bounded collections,
> > > >> that's
> > > >>>>>>> where
> > > >>>>>>>>>> we'll hit the issues. This changes the uber conversion
> > transform
> > > >>>>> into a
> > > >>>>>>>>>> transform that needs to be a writer.
> > > >>>>>>>>>>
> > > >>>>>>>>>> If a transform executes a JSON conversion on a per element
> > > basis,
> > > >>>>> we'd
> > > >>>>>>>>> get
> > > >>>>>>>>>> this:
> > > >>>>>>>>>> {
> > > >>>>>>>>>> "key": "value"
> > > >>>>>>>>>> }, {
> > > >>>>>>>>>> "key": "value"
> > > >>>>>>>>>> },
> > > >>>>>>>>>>
> > > >>>>>>>>>> That isn't valid JSON.
> > > >>>>>>>>>>
> > > > >>>>>>>>>> The conversion transform would need to know to do several
> things
> > > >> when
> > > >>>>>>>>> writing
> > > >>>>>>>>>> out a file. It would need to add brackets for an array. Now
> we
> > > >>> have:
> > > >>>>>>>>>> [
> > > >>>>>>>>>> {
> > > >>>>>>>>>> "key": "value"
> > > >>>>>>>>>> }, {
> > > >>>>>>>>>> "key": "value"
> > > >>>>>>>>>> },
> > > >>>>>>>>>> ]
> > > >>>>>>>>>>
> > > >>>>>>>>>> We still don't have valid JSON. We have to remove the last
> > comma
> > > >> or
> > > >>>>>>> have
> > > >>>>>>>>>> the uber transform start putting in the commas, except for
> the
> > > >> last
> > > >>>>>>>>> element.
> > > >>>>>>>>>>
> > > >>>>>>>>>> [
> > > >>>>>>>>>> {
> > > >>>>>>>>>> "key": "value"
> > > >>>>>>>>>> }, {
> > > >>>>>>>>>> "key": "value"
> > > >>>>>>>>>> }
> > > >>>>>>>>>> ]
> > > >>>>>>>>>>
> > > >>>>>>>>>> Only by doing this do we have valid JSON.
> > > >>>>>>>>>>
> > > >>>>>>>>>> I'd argue we'd have a similar issue with XML. Some parsers
> > > >> require
> > > >>> a
> > > >>>>>>> root
> > > >>>>>>>>>> element for everything. The uber transform would have to put
> > the
> > > >>> root
> > > >>>>>>>>>> element tags at the beginning and end of the file.
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Wed, Nov 9, 2016 at 11:36 PM Manu Zhang <
> > > >>> owenzhang1990@gmail.com>
> > > >>>>>>>>> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> I would love to see a lean core and abundant Transforms at
> the
> > > >> same
> > > >>>>>>> time.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Maybe we can look at what Confluent <
> > > >>> https://github.com/confluentinc
> > > >>>>>>
> > > >>>>>>>>> does
> > > >>>>>>>>>> for kafka-connect. They have official extensions support for
> > > >> JDBC,
> > > >>>>> HDFS
> > > >>>>>>>>> and
> > > >>>>>>>>>> ElasticSearch under https://github.com/confluentinc. They
> put
> > > >> them
> > > >>>>>>> along
> > > >>>>>>>>>> with other community extensions on
> > > >>>>>>>>>> https://www.confluent.io/product/connectors/ for
> visibility.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Although not a commercial company, can we have a GitHub user
> > > like
> > > >>>>>>>>>> beam-community to host projects we build around beam but not
> > > >>> suitable
> > > >>>>>>> for
> > > >>>>>>>>>> https://github.com/apache/incubator-beam. In the future, we
> > may
> > > >>> have
> > > >>>>>>>>>> beam-algebra like http://github.com/twitter/algebird for
> > > algebra
> > > >>>>>>>>> operations
> > > >>>>>>>>>> and beam-ml / beam-dl for machine learning / deep learning.
> > > Also,
> > > >>>>> there
> > > > >>>>>>>>>> will be beam related projects elsewhere maintained by
> > other
> > > >>>>>>>>>> communities. We can put all of them on the beam-website or
> > like
> > > >>> spark
> > > >>>>>>>>>> packages as mentioned by Amit.
> > > >>>>>>>>>>
> > > >>>>>>>>>> My $0.02
> > > >>>>>>>>>> Manu
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles
> > > >>>>> <klk@google.com.invalid
> > > >>>>>>>>
> > > >>>>>>>>>> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>>> On this point from Amit and Ismaël, I agree: we could
> benefit
> > > >>> from a
> > > >>>>>>>>> place
> > > >>>>>>>>>>> for miscellaneous non-core helper transformations.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> We have sdks/java/extensions but it is organized as
> separate
> > > >>>>>>> artifacts.
> > > >>>>>>>>> I
> > > >>>>>>>>>>> think that is fine, considering the nature of Join and
> > > >> SortValues.
> > > >>>>> But
> > > >>>>>>>>> for
> > > >>>>>>>>>>> simpler transforms, Importing one artifact per tiny
> transform
> > > is
> > > >>> too
> > > >>>>>>>>> much
> > > >>>>>>>>>>> overhead. It also seems unlikely that we will have enough
> > > >>>>> commonality
> > > >>>>>>>>>> among
> > > >>>>>>>>>>> the transforms to call the artifact anything other than
> [some
> > > >>>>> synonym
> > > >>>>>>>>> for]
> > > >>>>>>>>>>> "miscellaneous".
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> I wouldn't want to take this too far - even though the SDK
> > many
> > > >>>>>>>>>> transforms*
> > > >>>>>>>>>>> that are not required for the model [1], I like that the
> SDK
> > > >>>>> artifact
> > > >>>>>>>>> has
> > > >>>>>>>>>>> everything a user might need in their "getting started"
> phase
> > > of
> > > >>>>> use.
> > > >>>>>>>>> This
> > > >>>>>>>>>>> user-friendliness (the user doesn't care that ParDo is core
> > and
> > > >>> Sum
> > > >>>>> is
> > > >>>>>>>>>> not)
> > > >>>>>>>>>>> plus the difficulty of judging which transforms go where,
> are
> > > >>>>> probably
> > > >>>>>>>>> why
> > > >>>>>>>>>>> we have them mostly all in one place.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Models to look at, off the top of my head, include Pig's
> > > >> PiggyBank
> > > >>>>> and
> > > >>>>>>>>>>> Apex's Malhar. These have different levels of support
> > implied.
> > > >>>>> Others?
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Kenn
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> [1] ApproximateQuantiles, ApproximateUnique, Count,
> Distinct,
> > > >>>>> Filter,
> > > >>>>>>>>>>> FlatMapElements, Keys, Latest, MapElements, Max, Mean, Min,
> > > >>> Values,
> > > >>>>>>>>>> KvSwap,
> > > >>>>>>>>>>> Partition, Regex, Sample, Sum, Top, Values, WithKeys,
> > > >>> WithTimestamps
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> * at least they are separate classes and not methods on
> > > >>> PCollection
> > > >>>>>>> :-)
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On Wed, Nov 9, 2016 at 6:03 AM, Ismaël Mejía <
> > > iemejia@gmail.com
> > > >>>
> > > >>>>>>> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> Nice discussion, and thanks Jesse for bringing this
> subject
> > > >>> back.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> I agree 100% with Amit and the idea of having a home for
> > those
> > > >>>>>>>>>> transforms
> > > >>>>>>>>>>>> that are not core enough to be part of the sdk, but that
> we
> > > all
> > > >>> end
> > > >>>>>>> up
> > > >>>>>>>>>>>> re-writing somehow.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> This is a needed improvement to be more developer
> friendly,
> > > but
> > > >>>>> also
> > > >>>>>>> as
> > > >>>>>>>>>> a
> > > >>>>>>>>>>>> reference of good practices of Beam development, and for
> > this
> > > >>>>> reason
> > > >>>>>>> I
> > > >>>>>>>>>>>> agree with JB that at this moment it would be better for
> > these
> > > >>>>>>>>>> transforms
> > > >>>>>>>>>>>> to reside in the Beam repository at least for visibility
> > > >> reasons.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> One additional question is if these transforms represent a
> > > >>>>> different
> > > >>>>>>>>> DSL
> > > >>>>>>>>>>> or
> > > >>>>>>>>>>>> if those could be grouped with the current extensions
> (e.g.
> > > >> Join
> > > >>>>> and
> > > >>>>>>>>>>>> SortValues) into something more general that we as a
> > community
> > > >>>>> could
> > > >>>>>>>>>>>> maintain, but well even if it is not the case, it would be
> > > >> really
> > > >>>>>>> nice
> > > >>>>>>>>>> to
> > > >>>>>>>>>>>> start working on something like this.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Ismaël Mejía
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré <
> > > >>>>>>> jb@nanthrax.net
> > > >>>>>>>>>>
> > > >>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> Related to spark-package, we also have Apache Bahir to
> host
> > > >>>>>>>>>>>>> connectors/transforms for Spark and Flink.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> IMHO, right now, Beam should host this, not sure if it
> > makes
> > > >>> sense
> > > >>>>>>>>>>>>> directly in the core.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> It reminds me the "Integration" DSL we discussed in the
> > > >>> technical
> > > >>>>>>>>>>> vision
> > > >>>>>>>>>>>>> document.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Regards
> > > >>>>>>>>>>>>> JB
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> On 11/09/2016 11:17 AM, Amit Sela wrote:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> I think Jesse has a very good point on one hand, while
> > > Luke's
> > > >>> and
> > > >>>>>>>>>>>>>> Kenneth's
> > > >>>>>>>>>>>>>> worries about committing users to specific
> implementations
> > > is
> > > >>> in
> > > >>>>>>>>>>> place.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> The Spark community has a 3rd party repository for
> useful
> > > >>>>> libraries
> > > >>>>>>>>>>> that
> > > >>>>>>>>>>>>>> for various reasons are not a part of the Apache Spark
> > > >> project:
> > > >>>>>>>>>>>>>> https://spark-packages.org/.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Maybe a "common-transformations" package would serve
> both
> > > >> users
> > > >>>>>>> quick
> > > >>>>>>>>>>>>>> ramp-up and ease-of-use while keeping Beam more
> > "enabling" ?
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles
> > > >>>>>>>>>> <klk@google.com.invalid
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> It seems useful for small scale debugging / demoing to
> > have
> > > >>>>>>>>>>>>>>> Dump.toString(). I think it should be named to clearly
> > > >>> indicate
> > > >>>>>>> its
> > > >>>>>>>>>>>>>>> limited
> > > >>>>>>>>>>>>>>> scope. Maybe other stuff could go in the Dump
> namespace,
> > > but
> > > >>>>>>>>>>>>>>> "Dump.toJson()" would be for humans to read - so it
> > should
> > > >> be
> > > >>>>>>> pretty
> > > >>>>>>>>>>>>>>> printed, not treated as a machine-to-machine wire
> format.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> The broader question of representing data in JSON or
> XML,
> > > >> etc,
> > > >>>>> is
> > > >>>>>>>>>>>> already
> > > >>>>>>>>>>>>>>> the subject of many mature libraries which are already
> > easy
> > > >> to
> > > >>>>> use
> > > >>>>>>>>>>> with
> > > >>>>>>>>>>>>>>> Beam.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> The more esoteric practice of implicit or semi-implicit
> > > >>>>> coercions
> > > >>>>>>>>>>> seems
> > > >>>>>>>>>>>>>>> like it is also already addressed in many ways
> elsewhere.
> > > >>>>>>>>>>>>>>> Transform.via(TypeConverter) is basically the same as
> > > >>>>>>>>>>>>>>> MapElements.via(<lambda>) and also easy to use with
> Beam.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> In both of the last cases, there are many reasonable
> > > >>> approaches,
> > > >>>>>>> and
> > > >>>>>>>>>>> we
> > > >>>>>>>>>>>>>>> shouldn't commit our users to one of them.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik
> > > >>>>>>>>>>> <lcwik@google.com.invalid
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> The suggestions you give seem good except for the
> XML
> > > >>> cases.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Might want to have the XML be a document per line
> > similar
> > > >> to
> > > >>>>> the
> > > >>>>>>>>>>> JSON
> > > >>>>>>>>>>>>>>>> examples you have been giving.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson <
> > > >>>>>>>>>>>> jesse@smokinghand.com>
> > > >>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> @lukasz Agreed there would have to be KV handling. I
> was
> > > >> more
> > > >>>>>>> think
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> that
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> whatever the addition, it shouldn't just handle KV. It
> > > >> should
> > > >>>>>>>>>> handle
> > > >>>>>>>>>>>>>>>>> Iterables, Lists, Sets, and KVs.
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> For JSON and XML, I wonder if we'd be able to give
> > > someone
> > > >>>>>>>>>>> something
> > > >>>>>>>>>>>>>>>>> general purpose enough that you would just end up
> > writing
> > > >>> your
> > > >>>>>>> own
> > > >>>>>>>>>>>> code
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> handle it anyway.
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Here are some ideas on what it could look like with a
> > > >> method
> > > >>>>> and
> > > >>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>> resulting string output:
> > > >>>>>>>>>>>>>>>>> *Stringify.toJSON()*
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> With KV:
> > > >>>>>>>>>>>>>>>>> {"key": "value"}
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> With Iterables:
> > > >>>>>>>>>>>>>>>>> ["one", "two", "three"]
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> *Stringify.toXML("rootelement")*
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> With KV:
> > > >>>>>>>>>>>>>>>>> <rootelement key=value />
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> With Iterables:
> > > >>>>>>>>>>>>>>>>> <rootelement>
> > > >>>>>>>>>>>>>>>>>   <item>one</item>
> > > >>>>>>>>>>>>>>>>>   <item>two</item>
> > > >>>>>>>>>>>>>>>>>   <item>three</item>
> > > >>>>>>>>>>>>>>>>> </rootelement>
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> *Stringify.toDelimited(",")*
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> With KV:
> > > >>>>>>>>>>>>>>>>> key,value
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> With Iterables:
> > > >>>>>>>>>>>>>>>>> one,two,three
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Do you think that would strike a good balance between
> > > >>> reusable
> > > >>>>>>>>>> code
> > > >>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>>> writing your own for more difficult formatting?
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Jesse
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik
> > > >>>>>>>>>>> <lcwik@google.com.invalid
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Jesse, I believe if one format gets special treatment
> > in
> > > >>>>> TextIO,
> > > >>>>>>>>>>>> people
> > > > >>>>>>>>>>>>>>>>> will then ask why JSON, XML, ... are not also
> > > >> supported.
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Also, the example that you provide is using the fact
> > that
> > > >>> the
> > > >>>>>>>>>> input
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> format
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> is an Iterable<Item>. You had posted a question about
> > > >> using
> > > >>> KV
> > > >>>>>>>>>> with
> > > >>>>>>>>>>>>>>>>> TextIO.Write which wouldn't align with the proposed
> > input
> > > >>>>> format
> > > >>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> still
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> would require to write a type conversion function,
> this
> > > >> time
> > > >>>>>>> from
> > > >>>>>>>>>>> KV
> > > >>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>>> Iterable<Item> instead of KV to string.
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson <
> > > >>>>>>>>>>>> jesse@smokinghand.com>
> > > >>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Lukasz,
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> I don't think you'd need complicated logic for
> > > >>> TextIO.Write.
> > > >>>>>>> For
> > > >>>>>>>>>>> CSV
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> call would look like:
> > > >>>>>>>>>>>>>>>>>> Stringify.to("", ",", "\n");
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Where the arguments would be Stringify.to(prefix,
> > > >>> delimiter,
> > > >>>>>>>>>>>> suffix).
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> The code would be something like:
> > > >>>>>>>>>>>>>>>>>> StringBuffer buffer = new StringBuffer(prefix);
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> for (Item item : list) {
> > > >>>>>>>>>>>>>>>>>>   buffer.append(item.toString());
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>   if(notLast) {
> > > >>>>>>>>>>>>>>>>>>     buffer.append(delimiter);
> > > >>>>>>>>>>>>>>>>>>   }
> > > >>>>>>>>>>>>>>>>>> }
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> buffer.append(suffix);
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> c.output(buffer.toString());
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> That would allow you to do the basic CSV, TSV, and
> > other
> > > >>>>>>> formats
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> without
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> complicated logic. The same sort of thing could be
> done
> > > >> for
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> TextIO.Write.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Jesse
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik
> > > >>>>>>>>>>>> <lcwik@google.com.invalid
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> The conversion from object to string will have uses
> > > >> outside
> > > >>>>> of
> > > >>>>>>>>>>> just
> > > >>>>>>>>>>>>>>>>>>> TextIO.Write so it seems logical that we would want
> > to
> > > >>> have
> > > >>>>> a
> > > >>>>>>>>>>> ParDo
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> do
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> conversion.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Text file formats have a lot of variance, even if
> you
> > > >>>>> consider
> > > >>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> subset
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> of CSV like formats where it could have fixed width
> > > >> fields,
> > > >>>>> or
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> escaping
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> quoting around other fields, or headers that should
> > be
> > > >>>>> placed
> > > >>>>>>> at
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> top.
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Having all these format conversions within
> > TextIO.Write
> > > >>>>> seems
> > > >>>>>>>>>>> like
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> a
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> lot
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> of
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> logic to contain in that transform which should
> just
> > > >> focus
> > > >>>>> on
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> writing
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> files.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 8:15 AM, Jesse Anderson <
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> jesse@smokinghand.com>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> This is a thread moved over from the user mailing
> > list.
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> I think there needs to be a way to convert a
> > > >>>>> PCollection<KV>
> > > >>>>>>> to
> > > >>>>>>>>>>>>>>>>>>>> PCollection<String> Conversion.
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> To do a minimal WordCount, you have to manually
> > > convert
> > > >>> the
> > > >>>>>>> KV
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> to a
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> String:
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>         p
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>  .apply(TextIO.Read.from("playing_cards.tsv"))
> > > >>>>>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> > > >>>>>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> > > >>>>>>>>>>>>>>>>>>>> *                .apply(MapElements.via((KV<
> String,
> > > >> Long>
> > > >>>>>>>>>> count)
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> ->*
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> *                            count.getKey() + ":" +
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> count.getValue()*
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> *                        ).withOutputType(
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> TypeDescriptors.strings()))*
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to
> > > >>>>>>> ("output/stringcounts"));
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> This code really should be something like:
> > > >>>>>>>>>>>>>>>>>>>>         p
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>  .apply(TextIO.Read.from("playing_cards.tsv"))
> > > >>>>>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> > > >>>>>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> > > >>>>>>>>>>>>>>>>>>>> *                .apply(ToString.stringify())*
> > > >>>>>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to
> > > >>>>>>>>> ("output/stringcounts"));
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> To summarize the discussion:
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>    - JA: Add a method to StringDelegateCoder to
> > output
> > > >>> any
> > > >>>>> KV
> > > >>>>>>>>>> or
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> list
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>    - JA and DH: Add a SimpleFunction that takes an
> type
> > > >> and
> > > >>>>> runs
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> toString()
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>    on it:
> > > >>>>>>>>>>>>>>>>>>>>    class ToStringFn<InputT> extends
> > > >>> SimpleFunction<InputT,
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> String>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> {
> > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>        public String apply(InputT input) {
> > > >>>>>>>>>>>>>>>>>>>>            return input.toString();
> > > >>>>>>>>>>>>>>>>>>>>        }
> > > >>>>>>>>>>>>>>>>>>>>    }
> > > >>>>>>>>>>>>>>>>>>>>    - JB: Add a general purpose type converter like
> > in
> > > >>>>> Apache
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Camel.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>    - JA: Add Object support to TextIO.Write that would
> > > >> write
> > > >>>>> out
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>    toString of any Object.
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> My thoughts:
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Is converting to a PCollection<String> mostly
> needed
> > > >> when
> > > >>>>>>>>>> you're
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> using
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> TextIO.Write? Will a general purpose transform only
> > work
> > > >> in
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> certain
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> cases
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> and you'll normally have to write custom code
> format
> > > the
> > > >>>>>>> strings
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> way
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> you want them?
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> IMHO, it's yes to both. I'd prefer to add Object
> > > >> support
> > > >>> to
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> TextIO.Write
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> or
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> a SimpleFunction that takes a delimiter as an
> > > argument.
> > > >>>>>>> Making
> > > >>>>>>>>>> a
> > > >>>>>>>>>>>>>>>>>>>> SimpleFunction that's able to specify a delimiter
> > (and
> > > >>>>>>> perhaps
> > > >>>>>>>>>> a
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> prefix
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> suffix) should cover the majority of formats and
> > > cases.
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Jesse
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> --
> > > >>>>>>>>>>>>> Jean-Baptiste Onofré
> > > >>>>>>>>>>>>> jbonofre@apache.org
> > > >>>>>>>>>>>>> http://blog.nanthrax.net
> > > >>>>>>>>>>>>> Talend - http://www.talend.com
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> --
> > > >>>>>>>>> Jean-Baptiste Onofré
> > > >>>>>>>>> jbonofre@apache.org
> > > >>>>>>>>> http://blog.nanthrax.net
> > > >>>>>>>>> Talend - http://www.talend.com
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>> --
> > > >>>>>>> Jean-Baptiste Onofré
> > > >>>>>>> jbonofre@apache.org
> > > >>>>>>> http://blog.nanthrax.net
> > > >>>>>>> Talend - http://www.talend.com
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>> --
> > > >>>>> Jean-Baptiste Onofré
> > > >>>>> jbonofre@apache.org
> > > >>>>> http://blog.nanthrax.net
> > > >>>>> Talend - http://www.talend.com
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>> --
> > > >>> Jean-Baptiste Onofré
> > > >>> jbonofre@apache.org
> > > >>> http://blog.nanthrax.net
> > > >>> Talend - http://www.talend.com
> > > >>>
> > > >>
> > > >
> > >
> > > --
> > > Jean-Baptiste Onofré
> > > jbonofre@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> >
>

Re: PCollection to PCollection Conversion

Posted by Jesse Anderson <je...@smokinghand.com>.

I agree MapElements isn't hard to use. I think there is a demand for this
built-in conversion.

My thought on the formatter is that, worst case, we could do runtime type
checking. It would be ugly and not as performant, but it should work. As
we've said, we'd point them to MapElements for better code. We'd write the
JavaDoc accordingly.
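
To make the runtime type checking idea concrete, here is a minimal sketch of
what a formatted(String format) conversion could do per element. It is plain
Java, not Beam API: java.util.function.Function stands in for the transform,
Map.Entry stands in for KV, and the class name FormattedSketch is
hypothetical. Each part is passed to String.format as its own string, as
described for ToString.formatted above.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only; not part of the Beam SDK.
final class FormattedSketch {

    // Returns a function that splits the input into parts at runtime and
    // feeds those parts to String.format as varargs.
    public static Function<Object, String> formatted(String format) {
        return input -> {
            List<Object> parts = new ArrayList<>();
            if (input instanceof Map.Entry) {
                // Stand-in for KV: key and value become two arguments.
                Map.Entry<?, ?> kv = (Map.Entry<?, ?>) input;
                parts.add(String.valueOf(kv.getKey()));
                parts.add(String.valueOf(kv.getValue()));
            } else if (input instanceof Iterable) {
                // Each item of an Iterable becomes one argument.
                for (Object item : (Iterable<?>) input) {
                    parts.add(String.valueOf(item));
                }
            } else {
                // Anything else is a single argument.
                parts.add(String.valueOf(input));
            }
            return String.format(format, parts.toArray());
        };
    }

    public static void main(String[] args) {
        System.out.println(
            formatted("%s -> %s").apply(new SimpleImmutableEntry<>("ace", 4L)));
        System.out.println(
            formatted("%s|%s|%s").apply(Arrays.asList("a", "b", "c")));
    }
}
```

The instanceof chain is the ugly part, and a format string with more
specifiers than the element has parts only fails when the element is
processed, which is why the JavaDoc would point people to MapElements for
anything non-trivial.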

IME and to Ben's point, these will mostly be used in development. Some of
our assumptions will break down when programmers aren't the ones using
Beam. I can see from the user traffic already that not everyone using Beam
is a programmer and they'll need classes like this to be productive.

On Thu, Dec 29, 2016 at 1:46 PM Dan Halperin <dh...@google.com.invalid>
wrote:

On Thu, Dec 29, 2016 at 1:36 PM, Jesse Anderson <je...@smokinghand.com>
wrote:

> I prefer JB's take. I think there should be three overloaded methods on the
> class. I like Vikas' name ToString. The methods for a simple conversion
> should be:
>
> ToString.strings() - Outputs the .toString() of the objects in the
> PCollection
> ToString.strings(String delimiter) - Outputs the .toString() of KVs, Lists,
> etc with the delimiter between every entry
> ToString.formatted(String format) - Outputs the formatted
> <https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html>
> string
> with the object passed in. For objects made up of different parts like KVs,
> each one is passed in as separate toString() of a varargs.
>

Riffing a little, with some types:

ToString.<T>of() -- PTransform<T, String> that is equivalent to a ParDo
that takes in a T and outputs T.toString().

ToString.<K,V>kv(String delimiter) -- PTransform<KV<K, V>, String> that is
equivalent to a ParDo that takes in a KV<K,V> and outputs
kv.getKey().toString() + delimiter + kv.getValue().toString()

ToString.<T>iterable(String delimiter) -- PTransform<? extends Iterable<T>,
String> that is equivalent to a ParDo that takes in an Iterable<T> and
outputs the iterable[0] + delimiter + iterable[1] + delimiter + ... +
delimiter + iterable[N-1]

ToString.<T>custom(SerializableFunction<T, String> formatter) ?

The last one is just MapElements.via, except you don't need to set the
output type.
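
To make the shapes above concrete, here is a minimal sketch of the per-element
logic each variant would run inside its ParDo. It is plain Java rather than
Beam code: java.util.Map.Entry stands in for KV, and the class name
ToStringShapesSketch is hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch of the per-element functions; not the Beam ToString class.
final class ToStringShapesSketch {
  private ToStringShapesSketch() {}

  /** of(): the element's own toString(). */
  static <T> String of(T element) {
    return element.toString();
  }

  /** kv(delimiter): key + delimiter + value. Map.Entry stands in for KV. */
  static <K, V> String kv(Map.Entry<K, V> kv, String delimiter) {
    return kv.getKey().toString() + delimiter + kv.getValue().toString();
  }

  /** iterable(delimiter): items joined with the delimiter between them. */
  static <T> String iterable(Iterable<T> items, String delimiter) {
    StringJoiner joiner = new StringJoiner(delimiter);
    for (T item : items) {
      joiner.add(item.toString());
    }
    return joiner.toString();
  }

  public static void main(String[] args) {
    System.out.println(of(42));
    System.out.println(kv(new SimpleEntry<>("ace", 4L), ":"));
    System.out.println(iterable(Arrays.asList("one", "two", "three"), ","));
    // An empty iterable yields an empty string in this sketch.
    System.out.println(iterable(Collections.<String>emptyList(), ",").isEmpty());
  }
}
```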

I don't see a way to make the generic .formatted() that you propose that
just works with anything "made of different parts".

I think adding too many overrides beyond "of" and "custom" is opening
up a Pandora's Box. The KV one might want to have left and right
delimiters, might want to take custom formatters for K and V, etc. etc. The
iterable one might want to have a special configuration for an empty
iterable. So I'm inclined towards simplicity with the awareness that
MapElements.via is just not that hard to use.

Dan


>
> I think doing these three methods would cover every simple and advanced
> "simple conversions." As JB says, we'll need other specific converters for
> other formats like XML.
>
> I'd really like to see this class in the next version of Beam. What does
> everyone think of the class name, method names, and method operations so we
> can have Vikas finish up?
>
> Thanks,
>
> Jesse
>
> On Wed, Dec 28, 2016 at 12:28 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
>
> > Hi Vikas,
> >
> > did you take a look on:
> >
> >
> > https://github.com/jbonofre/beam/tree/DATAFORMAT/sdks/java/extensions/dataformat
> >
> > You can see KV2String and ToString could be part of this extension.
> > I'm also using JAXB for XML and Jackson for JSON
> > marshalling/unmarshalling. I'm planning to deal with Avro
> (IndexedRecord).
> >
> > Regards
> > JB
> >
> > On 12/28/2016 08:37 PM, Vikas Kedigehalli wrote:
> > > Hi All,
> > >
> > >   Not being aware of the discussion here, I sent out a PR
> > > <https://github.com/apache/beam/pull/1704> but JB and others directed
> > me to
> > > this thread. Having converted PCollection<T> to PCollection<String>
> > several
> > > times, I feel something like 'ToString' transform is common enough to
> be
> > > part of the core. What do you all think?
> > >
> > > Also, if someone else is already working on or interested in tackling
> > this,
> > > then I am happy to discard the PR.
> > >
> > > Regards,
> > > Vikas
> > >
> > > On Tue, Dec 13, 2016 at 1:56 AM, Amit Sela <am...@gmail.com>
> wrote:
> > >
> > >> It seems that there were a lot of good points raised here, and I tend
> to
> > >> agree that something as trivial and lean as "ToString" should be a
> part
> > of
> > > >> core.
> > >> I'm particularly fond of makeString(prefix, toString, suffix) in
> various
> > >> combinations (Scala-like).
> > >> For "fromString", I think JB has a good point leveraging JAXB and
> > Jackson -
> > >> though I think this should be in extensions as it is not as lean as
> > >> toString.
> > >>
> > >> Thanks,
> > >> Amit
> > >>
> > >> On Wed, Nov 30, 2016 at 5:13 AM Jean-Baptiste Onofré <jb@nanthrax.net
> >
> > >> wrote:
> > >>
> > >>> Hi Jesse,
> > >>>
> > >>> yes, I started something there (using JAXB and Jackson). Let me
> polish
> > >>> and push.
> > >>>
> > >>> Regards
> > >>> JB
> > >>>
> > >>> On 11/29/2016 10:00 PM, Jesse Anderson wrote:
> > >>>> I went through the string conversions. Do you have an example of
> > >> writing
> > >>>> out XML/JSON/etc too?
> > >>>>
> > >>>> On Tue, Nov 29, 2016 at 3:46 PM Jean-Baptiste Onofré <
> jb@nanthrax.net
> > >
> > >>>> wrote:
> > >>>>
> > >>>>> Hi Jesse,
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> https://github.com/jbonofre/incubator-beam/tree/DATAFORMAT/sdks/java/extensions/dataformat
> > >>>>>
> > >>>>> it's very simple and stupid and of course not complete at all (I
> have
> > >>>>> other commits but not merged as they need some polishing), but as
I
> > >>>>> said, it's a base of discussion.
> > >>>>>
> > >>>>> Regards
> > >>>>> JB
> > >>>>>
> > >>>>> On 11/29/2016 09:23 PM, Jesse Anderson wrote:
> > >>>>>> @jb Sounds good. Just let us know once you've pushed.
> > >>>>>>
> > >>>>>> On Tue, Nov 29, 2016 at 2:54 PM Jean-Baptiste Onofré <
> > >> jb@nanthrax.net>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Good point Eugene.
> > >>>>>>>
> > >>>>>>> Right now, it's a DoFn collection to experiment a bit (a pure
> > >>>>>>> extension). It's pretty stupid ;)
> > >>>>>>>
> > >>>>>>> But, you are right, depending the direction of such extension,
it
> > >>> could
> > >>>>>>> cover more use cases (even if it's not my first intention ;)).
> > >>>>>>>
> > >>>>>>> Let me push the branch (pretty small) as an illustration, and in
> > the
> > >>>>>>> mean time, I'm preparing a document (more focused on the use
> > cases).
> > >>>>>>>
> > >>>>>>> WDYT ?
> > >>>>>>>
> > >>>>>>> Regards
> > >>>>>>> JB
> > >>>>>>>
> > >>>>>>> On 11/29/2016 08:47 PM, Eugene Kirpichov wrote:
> > >>>>>>>> Hi JB,
> > >>>>>>>> Depending on the scope of what you want to ultimately
accomplish
> > >> with
> > >>>>>>> this
> > >>>>>>>> extension, I think it may make sense to write a proposal
> document
> > >> and
> > >>>>>>>> discuss it.
> > >>>>>>>> If it's just a collection of utility DoFn's for various
> > >> well-defined
> > >>>>>>>> source/target format pairs, then that's probably not needed,
but
> > if
> > >>>>> it's
> > >>>>>>>> anything more, then I think it is.
> > >>>>>>>> That will help avoid a lot of churn if people propose
reasonable
> > >>>>>>>> significant changes.
> > >>>>>>>>
> > >>>>>>>> On Tue, Nov 29, 2016 at 11:15 AM Jean-Baptiste Onofré <
> > >>> jb@nanthrax.net
> > >>>>>>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> By the way Jesse, I gonna push my DATAFORMAT branch on my
> github
> > >>> and I
> > >>>>>>>>> will post on the dev mailing list when done.
> > >>>>>>>>>
> > >>>>>>>>> Regards
> > >>>>>>>>> JB
> > >>>>>>>>>
> > >>>>>>>>> On 11/29/2016 07:01 PM, Jesse Anderson wrote:
> > >>>>>>>>>> I want to bring this thread back up since we've had time to
> > think
> > >>>>> about
> > >>>>>>>>> it
> > >>>>>>>>>> more and make a plan.
> > >>>>>>>>>>
> > >>>>>>>>>> I think a format-specific converter will be more time
> consuming
> > >>> task
> > >>>>>>> than
> > >>>>>>>>>> we originally thought. It'd have to be a writer that takes
> > >> another
> > >>>>>>> writer
> > >>>>>>>>>> as a parameter.
> > >>>>>>>>>>
> > >>>>>>>>>> I think a string converter can be done as a simple transform.
> > >>>>>>>>>>
> > >>>>>>>>>> I think we should start with a simple string converter and
> plan
> > >>> for a
> > >>>>>>>>>> format-specific writer.
> > >>>>>>>>>>
> > >>>>>>>>>> What are your thoughts?
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks,
> > >>>>>>>>>>
> > >>>>>>>>>> Jesse
> > >>>>>>>>>>
> > >>>>>>>>>> On Thu, Nov 10, 2016 at 10:33 AM Jesse Anderson <
> > >>>>> jesse@smokinghand.com
> > >>>>>>>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> I was thinking about what the outputs would look like last
> > >> night. I
> > >>>>>>>>>> realized that more complex formats like JSON and XML may or
> may
> > >> not
> > >>>>>>>>> output
> > >>>>>>>>>> the data in a valid format.
> > >>>>>>>>>>
> > >>>>>>>>>> Doing a direct conversion on unbounded collections would work
> > >> just
> > >>>>>>> fine.
> > >>>>>>>>>> They're self-contained. For writing out bounded collections,
> > >> that's
> > >>>>>>> where
> > >>>>>>>>>> we'll hit the issues. This changes the uber conversion
> transform
> > >>>>> into a
> > >>>>>>>>>> transform that needs to be a writer.
> > >>>>>>>>>>
> > >>>>>>>>>> If a transform executes a JSON conversion on a per element
> > basis,
> > >>>>> we'd
> > >>>>>>>>> get
> > >>>>>>>>>> this:
> > >>>>>>>>>> {
> > >>>>>>>>>> "key": "value"
> > >>>>>>>>>> }, {
> > >>>>>>>>>> "key": "value"
> > >>>>>>>>>> },
> > >>>>>>>>>>
> > >>>>>>>>>> That isn't valid JSON.
> > >>>>>>>>>>
> > >>>>>>>>>> The conversion transform would need to do several things
> > >> when
> > >>>>>>>>> writing
> > >>>>>>>>>> out a file. It would need to add brackets for an array. Now
we
> > >>> have:
> > >>>>>>>>>> [
> > >>>>>>>>>> {
> > >>>>>>>>>> "key": "value"
> > >>>>>>>>>> }, {
> > >>>>>>>>>> "key": "value"
> > >>>>>>>>>> },
> > >>>>>>>>>> ]
> > >>>>>>>>>>
> > >>>>>>>>>> We still don't have valid JSON. We have to remove the last
> comma
> > >> or
> > >>>>>>> have
> > >>>>>>>>>> the uber transform start putting in the commas, except for
the
> > >> last
> > >>>>>>>>> element.
> > >>>>>>>>>>
> > >>>>>>>>>> [
> > >>>>>>>>>> {
> > >>>>>>>>>> "key": "value"
> > >>>>>>>>>> }, {
> > >>>>>>>>>> "key": "value"
> > >>>>>>>>>> }
> > >>>>>>>>>> ]
> > >>>>>>>>>>
> > >>>>>>>>>> Only by doing this do we have valid JSON.
> > >>>>>>>>>>
> > >>>>>>>>>> I'd argue we'd have a similar issue with XML. Some parsers
> > >> require
> > >>> a
> > >>>>>>> root
> > >>>>>>>>>> element for everything. The uber transform would have to put
> the
> > >>> root
> > >>>>>>>>>> element tags at the beginning and end of the file.
> > >>>>>>>>>>
> > >>>>>>>>>> On Wed, Nov 9, 2016 at 11:36 PM Manu Zhang <
> > >>> owenzhang1990@gmail.com>
> > >>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> I would love to see a lean core and abundant Transforms at
the
> > >> same
> > >>>>>>> time.
> > >>>>>>>>>>
> > >>>>>>>>>> Maybe we can look at what Confluent <
> > >>> https://github.com/confluentinc
> > >>>>>>
> > >>>>>>>>> does
> > >>>>>>>>>> for kafka-connect. They have official extensions support for
> > >> JDBC,
> > >>>>> HDFS
> > >>>>>>>>> and
> > >>>>>>>>>> ElasticSearch under https://github.com/confluentinc. They put
> > >> them
> > >>>>>>> along
> > >>>>>>>>>> with other community extensions on
> > >>>>>>>>>> https://www.confluent.io/product/connectors/ for visibility.
> > >>>>>>>>>>
> > >>>>>>>>>> Although not a commercial company, can we have a GitHub user
> > like
> > >>>>>>>>>> beam-community to host projects we build around beam but not
> > >>> suitable
> > >>>>>>> for
> > >>>>>>>>>> https://github.com/apache/incubator-beam. In the future, we
> may
> > >>> have
> > >>>>>>>>>> beam-algebra like http://github.com/twitter/algebird for
> > algebra
> > >>>>>>>>> operations
> > >>>>>>>>>> and beam-ml / beam-dl for machine learning / deep learning.
> > Also,
> > >>>>> there
> > >>>>>>>>>> will be beam related projects elsewhere maintained by
> other
> > >>>>>>>>>> communities. We can put all of them on the beam-website or
> like
> > >>> spark
> > >>>>>>>>>> packages as mentioned by Amit.
> > >>>>>>>>>>
> > >>>>>>>>>> My $0.02
> > >>>>>>>>>> Manu
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles
> > >>>>> <klk@google.com.invalid
> > >>>>>>>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> On this point from Amit and Ismaël, I agree: we could
benefit
> > >>> from a
> > >>>>>>>>> place
> > >>>>>>>>>>> for miscellaneous non-core helper transformations.
> > >>>>>>>>>>>
> > >>>>>>>>>>> We have sdks/java/extensions but it is organized as separate
> > >>>>>>> artifacts.
> > >>>>>>>>> I
> > >>>>>>>>>>> think that is fine, considering the nature of Join and
> > >> SortValues.
> > >>>>> But
> > >>>>>>>>> for
> > >>>>>>>>>>> simpler transforms, importing one artifact per tiny
transform
> > is
> > >>> too
> > >>>>>>>>> much
> > >>>>>>>>>>> overhead. It also seems unlikely that we will have enough
> > >>>>> commonality
> > >>>>>>>>>> among
> > >>>>>>>>>>> the transforms to call the artifact anything other than
[some
> > >>>>> synonym
> > >>>>>>>>> for]
> > >>>>>>>>>>> "miscellaneous".
> > >>>>>>>>>>>
> > >>>>>>>>>>> I wouldn't want to take this too far - even though the SDK has
> many
> > >>>>>>>>>> transforms*
> > >>>>>>>>>>> that are not required for the model [1], I like that the SDK
> > >>>>> artifact
> > >>>>>>>>> has
> > >>>>>>>>>>> everything a user might need in their "getting started"
phase
> > of
> > >>>>> use.
> > >>>>>>>>> This
> > >>>>>>>>>>> user-friendliness (the user doesn't care that ParDo is core
> and
> > >>> Sum
> > >>>>> is
> > >>>>>>>>>> not)
> > >>>>>>>>>>> plus the difficulty of judging which transforms go where,
are
> > >>>>> probably
> > >>>>>>>>> why
> > >>>>>>>>>>> we have them mostly all in one place.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Models to look at, off the top of my head, include Pig's
> > >> PiggyBank
> > >>>>> and
> > >>>>>>>>>>> Apex's Malhar. These have different levels of support
> implied.
> > >>>>> Others?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Kenn
> > >>>>>>>>>>>
> > >>>>>>>>>>> [1] ApproximateQuantiles, ApproximateUnique, Count,
Distinct,
> > >>>>> Filter,
> > >>>>>>>>>>> FlatMapElements, Keys, Latest, MapElements, Max, Mean, Min,
> > >>> Values,
> > >>>>>>>>>> KvSwap,
> > >>>>>>>>>>> Partition, Regex, Sample, Sum, Top, Values, WithKeys,
> > >>> WithTimestamps
> > >>>>>>>>>>>
> > >>>>>>>>>>> * at least they are separate classes and not methods on
> > >>> PCollection
> > >>>>>>> :-)
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, Nov 9, 2016 at 6:03 AM, Ismaël Mejía <
> > iemejia@gmail.com
> > >>>
> > >>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Nice discussion, and thanks Jesse for bringing this
subject
> > >>> back.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I agree 100% with Amit and the idea of having a home for
> those
> > >>>>>>>>>> transforms
> > >>>>>>>>>>>> that are not core enough to be part of the sdk, but that we
> > all
> > >>> end
> > >>>>>>> up
> > >>>>>>>>>>>> re-writing somehow.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> This is a needed improvement to be more developer friendly,
> > but
> > >>>>> also
> > >>>>>>> as
> > >>>>>>>>>> a
> > >>>>>>>>>>>> reference of good practices of Beam development, and for
> this
> > >>>>> reason
> > >>>>>>> I
> > >>>>>>>>>>>> agree with JB that at this moment it would be better for
> these
> > >>>>>>>>>> transforms
> > >>>>>>>>>>>> to reside in the Beam repository at least for visibility
> > >> reasons.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> One additional question is if these transforms represent a
> > >>>>> different
> > >>>>>>>>> DSL
> > >>>>>>>>>>> or
> > >>>>>>>>>>>> if those could be grouped with the current extensions (e.g.
> > >> Join
> > >>>>> and
> > >>>>>>>>>>>> SortValues) into something more general that we as a
> community
> > >>>>> could
> > >>>>>>>>>>>> maintain, but well even if it is not the case, it would be
> > >> really
> > >>>>>>> nice
> > >>>>>>>>>> to
> > >>>>>>>>>>>> start working on something like this.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Ismaël Mejía
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré <
> > >>>>>>> jb@nanthrax.net
> > >>>>>>>>>>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Related to spark-package, we also have Apache Bahir to
host
> > >>>>>>>>>>>>> connectors/transforms for Spark and Flink.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> IMHO, right now, Beam should host this, not sure if it
> makes
> > >>> sense
> > >>>>>>>>>>>>> directly in the core.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> It reminds me of the "Integration" DSL we discussed in the
> > >>> technical
> > >>>>>>>>>>> vision
> > >>>>>>>>>>>>> document.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Regards
> > >>>>>>>>>>>>> JB
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On 11/09/2016 11:17 AM, Amit Sela wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I think Jesse has a very good point on one hand, while
> > Luke's
> > >>> and
> > >>>>>>>>>>>>>> Kenneth's
> > >>>>>>>>>>>>>> worries about committing users to specific
implementations
> > is
> > >>> in
> > >>>>>>>>>>> place.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> The Spark community has a 3rd party repository for useful
> > >>>>> libraries
> > >>>>>>>>>>> that
> > >>>>>>>>>>>>>> for various reasons are not a part of the Apache Spark
> > >> project:
> > >>>>>>>>>>>>>> https://spark-packages.org/.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Maybe a "common-transformations" package would serve both
> > >> users
> > >>>>>>> quick
> > >>>>>>>>>>>>>> ramp-up and ease-of-use while keeping Beam more
> "enabling" ?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles
> > >>>>>>>>>> <klk@google.com.invalid
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> It seems useful for small scale debugging / demoing to
> have
> > >>>>>>>>>>>>>>> Dump.toString(). I think it should be named to clearly
> > >>> indicate
> > >>>>>>> its
> > >>>>>>>>>>>>>>> limited
> > >>>>>>>>>>>>>>> scope. Maybe other stuff could go in the Dump namespace,
> > but
> > >>>>>>>>>>>>>>> "Dump.toJson()" would be for humans to read - so it
> should
> > >> be
> > >>>>>>> pretty
> > >>>>>>>>>>>>>>> printed, not treated as a machine-to-machine wire
format.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> The broader question of representing data in JSON or
XML,
> > >> etc,
> > >>>>> is
> > >>>>>>>>>>>> already
> > >>>>>>>>>>>>>>> the subject of many mature libraries which are already
> easy
> > >> to
> > >>>>> use
> > >>>>>>>>>>> with
> > >>>>>>>>>>>>>>> Beam.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> The more esoteric practice of implicit or semi-implicit
> > >>>>> coercions
> > >>>>>>>>>>> seems
> > >>>>>>>>>>>>>>> like it is also already addressed in many ways
elsewhere.
> > >>>>>>>>>>>>>>> Transform.via(TypeConverter) is basically the same as
> > >>>>>>>>>>>>>>> MapElements.via(<lambda>) and also easy to use with
Beam.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> In both of the last cases, there are many reasonable
> > >>> approaches,
> > >>>>>>> and
> > >>>>>>>>>>> we
> > >>>>>>>>>>>>>>> shouldn't commit our users to one of them.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik
> > >>>>>>>>>>> <lcwik@google.com.invalid
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> The suggestions you give seem good except for the
XML
> > >>> cases.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Might want to have the XML be a document per line
> similar
> > >> to
> > >>>>> the
> > >>>>>>>>>>> JSON
> > >>>>>>>>>>>>>>>> examples you have been giving.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson <
> > >>>>>>>>>>>> jesse@smokinghand.com>
> > >>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> @lukasz Agreed there would have to be KV handling. I
was
> > >> more
> > >>>>>>> thinking
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> that
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> whatever the addition, it shouldn't just handle KV. It
> > >> should
> > >>>>>>>>>> handle
> > >>>>>>>>>>>>>>>>> Iterables, Lists, Sets, and KVs.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> For JSON and XML, I wonder if we'd be able to give
> > someone
> > >>>>>>>>>>> something
> > >>>>>>>>>>>>>>>>> general purpose enough that you would just end up
> writing
> > >>> your
> > >>>>>>> own
> > >>>>>>>>>>>> code
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> handle it anyway.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Here are some ideas on what it could look like with a
> > >> method
> > >>>>> and
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>> resulting string output:
> > >>>>>>>>>>>>>>>>> *Stringify.toJSON()*
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> With KV:
> > >>>>>>>>>>>>>>>>> {"key": "value"}
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> With Iterables:
> > >>>>>>>>>>>>>>>>> ["one", "two", "three"]
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> *Stringify.toXML("rootelement")*
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> With KV:
> > >>>>>>>>>>>>>>>>> <rootelement key=value />
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> With Iterables:
> > >>>>>>>>>>>>>>>>> <rootelement>
> > >>>>>>>>>>>>>>>>>   <item>one</item>
> > >>>>>>>>>>>>>>>>>   <item>two</item>
> > >>>>>>>>>>>>>>>>>   <item>three</item>
> > >>>>>>>>>>>>>>>>> </rootelement>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> *Stringify.toDelimited(",")*
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> With KV:
> > >>>>>>>>>>>>>>>>> key,value
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> With Iterables:
> > >>>>>>>>>>>>>>>>> one,two,three
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Do you think that would strike a good balance between
> > >>> reusable
> > >>>>>>>>>> code
> > >>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>> writing your own for more difficult formatting?
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Jesse
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik
> > >>>>>>>>>>> <lcwik@google.com.invalid
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Jesse, I believe if one format gets special treatment
> in
> > >>>>> TextIO,
> > >>>>>>>>>>>> people
> > >>>>>>>>>>>>>>>>> will then ask why JSON, XML, ... are not also
> > >> supported.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Also, the example that you provide is using the fact
> that
> > >>> the
> > >>>>>>>>>> input
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> format
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> is an Iterable<Item>. You had posted a question about
> > >> using
> > >>> KV
> > >>>>>>>>>> with
> > >>>>>>>>>>>>>>>>> TextIO.Write which wouldn't align with the proposed
> input
> > >>>>> format
> > >>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> still
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> would require to write a type conversion function,
this
> > >> time
> > >>>>>>> from
> > >>>>>>>>>>> KV
> > >>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>> Iterable<Item> instead of KV to string.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson <
> > >>>>>>>>>>>> jesse@smokinghand.com>
> > >>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Lukasz,
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> I don't think you'd need complicated logic for
> > >>> TextIO.Write.
> > >>>>>>> For
> > >>>>>>>>>>> CSV
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> call would look like:
> > >>>>>>>>>>>>>>>>>> Stringify.to("", ",", "\n");
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Where the arguments would be Stringify.to(prefix,
> > >>> delimiter,
> > >>>>>>>>>>>> suffix).
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> The code would be something like:
> > >>>>>>>>>>>>>>>>>> StringBuffer buffer = new StringBuffer(prefix);
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> // Append the delimiter before every item except the first.
> > >>>>>>>>>>>>>>>>>> boolean first = true;
> > >>>>>>>>>>>>>>>>>> for (Item item : list) {
> > >>>>>>>>>>>>>>>>>>   if (!first) {
> > >>>>>>>>>>>>>>>>>>     buffer.append(delimiter);
> > >>>>>>>>>>>>>>>>>>   }
> > >>>>>>>>>>>>>>>>>>   buffer.append(item.toString());
> > >>>>>>>>>>>>>>>>>>   first = false;
> > >>>>>>>>>>>>>>>>>> }
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> buffer.append(suffix);
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> c.output(buffer.toString());
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> That would allow you to do the basic CSV, TSV, and
> other
> > >>>>>>> formats
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> without
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> complicated logic. The same sort of thing could be
done
> > >> for
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> TextIO.Write.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Jesse
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik
> > >>>>>>>>>>>> <lcwik@google.com.invalid
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> The conversion from object to string will have uses
> > >> outside
> > >>>>> of
> > >>>>>>>>>>> just
> > >>>>>>>>>>>>>>>>>>> TextIO.Write so it seems logical that we would want
> to
> > >>> have
> > >>>>> a
> > >>>>>>>>>>> ParDo
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> do
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> conversion.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Text file formats have a lot of variance, even if
you
> > >>>>> consider
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> subset
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> of CSV like formats where it could have fixed width
> > >> fields,
> > >>>>> or
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> escaping
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> quoting around other fields, or headers that should
> be
> > >>>>> placed
> > >>>>>>> at
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> top.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Having all these format conversions within
> TextIO.Write
> > >>>>> seems
> > >>>>>>>>>>> like
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> a
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> lot
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> logic to contain in that transform which should just
> > >> focus
> > >>>>> on
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> writing
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> files.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 8:15 AM, Jesse Anderson <
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> jesse@smokinghand.com>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> This is a thread moved over from the user mailing
> list.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> I think there needs to be a way to convert a
> > >>>>> PCollection<KV>
> > >>>>>>> to
> > >>>>>>>>>>>>>>>>>>>> PCollection<String>.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> To do a minimal WordCount, you have to manually
> > convert
> > >>> the
> > >>>>>>> KV
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> to a
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> String:
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>         p
> > >>>>>>>>>>>>>>>>>>>>                 .apply(TextIO.Read.from("playing_cards.tsv"))
> > >>>>>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> > >>>>>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> > >>>>>>>>>>>>>>>>>>>> *                .apply(MapElements.via((KV<String, Long> count) ->*
> > >>>>>>>>>>>>>>>>>>>> *                            count.getKey() + ":" + count.getValue()*
> > >>>>>>>>>>>>>>>>>>>> *                        ).withOutputType(TypeDescriptors.strings()))*
> > >>>>>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to("output/stringcounts"));
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> This code really should be something like:
> > >>>>>>>>>>>>>>>>>>>>         p
> > >>>>>>>>>>>>>>>>>>>>                 .apply(TextIO.Read.from("playing_cards.tsv"))
> > >>>>>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> > >>>>>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> > >>>>>>>>>>>>>>>>>>>> *                .apply(ToString.stringify())*
> > >>>>>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to("output/stringcounts"));
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> To summarize the discussion:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>    - JA: Add a method to StringDelegateCoder to
> output
> > >>> any
> > >>>>> KV
> > >>>>>>>>>> or
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> list
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>    - JA and DH: Add a SimpleFunction that takes a type and runs
> > >>>>>>>>>>>>>>>>>>>>    toString() on it:
> > >>>>>>>>>>>>>>>>>>>>    class ToStringFn<InputT> extends SimpleFunction<InputT, String> {
> > >>>>>>>>>>>>>>>>>>>>        @Override
> > >>>>>>>>>>>>>>>>>>>>        public String apply(InputT input) {
> > >>>>>>>>>>>>>>>>>>>>            return input.toString();
> > >>>>>>>>>>>>>>>>>>>>        }
> > >>>>>>>>>>>>>>>>>>>>    }
> > >>>>>>>>>>>>>>>>>>>>    - JB: Add a general purpose type converter like
> in
> > >>>>> Apache
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Camel.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>    - JA: Add Object support to TextIO.Write that would
> > >> write
> > >>>>> out
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>    toString of any Object.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> My thoughts:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Is converting to a PCollection<String> mostly
needed
> > >> when
> > >>>>>>>>>> you're
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> using
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> TextIO.Write? Will a general purpose transform only
> work
> > >> in
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> certain
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> cases
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> and you'll normally have to write custom code to format
> > the
> > >>>>>>> strings
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> way
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> you want them?
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> IMHO, it's yes to both. I'd prefer to add Object
> > >> support
> > >>> to
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> TextIO.Write
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> or
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> a SimpleFunction that takes a delimiter as an
> > argument.
> > >>>>>>> Making
> > >>>>>>>>>> a
> > >>>>>>>>>>>>>>>>>>>> SimpleFunction that's able to specify a delimiter
> (and
> > >>>>>>> perhaps
> > >>>>>>>>>> a
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> prefix
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> suffix) should cover the majority of formats and
> > cases.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Jesse
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> --
> > >>>>>>>>>>>>> Jean-Baptiste Onofré
> > >>>>>>>>>>>>> jbonofre@apache.org
> > >>>>>>>>>>>>> http://blog.nanthrax.net
> > >>>>>>>>>>>>> Talend - http://www.talend.com
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> --
> > >>>>>>>>> Jean-Baptiste Onofré
> > >>>>>>>>> jbonofre@apache.org
> > >>>>>>>>> http://blog.nanthrax.net
> > >>>>>>>>> Talend - http://www.talend.com
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>> --
> > >>>>>>> Jean-Baptiste Onofré
> > >>>>>>> jbonofre@apache.org
> > >>>>>>> http://blog.nanthrax.net
> > >>>>>>> Talend - http://www.talend.com
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>> --
> > >>>>> Jean-Baptiste Onofré
> > >>>>> jbonofre@apache.org
> > >>>>> http://blog.nanthrax.net
> > >>>>> Talend - http://www.talend.com
> > >>>>>
> > >>>>
> > >>>
> > >>> --
> > >>> Jean-Baptiste Onofré
> > >>> jbonofre@apache.org
> > >>> http://blog.nanthrax.net
> > >>> Talend - http://www.talend.com
> > >>>
> > >>
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbonofre@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>

Re: PCollection to PCollection Conversion

Posted by Dan Halperin <dh...@google.com.INVALID>.

On Thu, Dec 29, 2016 at 1:36 PM, Jesse Anderson <je...@smokinghand.com>
wrote:

> I prefer JB's take. I think there should be three overloaded methods on the
> class. I like Vikas' name ToString. The methods for a simple conversion
> should be:
>
> ToString.strings() - Outputs the .toString() of the objects in the
> PCollection
> ToString.strings(String delimiter) - Outputs the .toString() of KVs, Lists,
> etc with the delimiter between every entry
> ToString.formatted(String format) - Outputs the formatted
> <https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html>
> string
> with the object passed in. For objects made up of different parts like KVs,
> each one is passed in as separate toString() of a varargs.
>

Riffing a little, with some types:

ToString.<T>of() -- PTransform<T, String> that is equivalent to a ParDo
that takes in a T and outputs T.toString().

ToString.<K,V>kv(String delimiter) -- PTransform<KV<K, V>, String> that is
equivalent to a ParDo that takes in a KV<K,V> and outputs
kv.getKey().toString() + delimiter + kv.getValue().toString()

ToString.<T>iterable(String delimiter) -- PTransform<? extends Iterable<T>,
String> that is equivalent to a ParDo that takes in an Iterable<T> and
outputs the iterable[0] + delimiter + iterable[1] + delimiter + ... +
delimiter + iterable[N-1]

ToString.<T>custom(SerializableFunction<T, String> formatter) ?

The last one is just MapElements.via, except you don't need to set the
output type.
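
As a rough illustration of the per-element logic described above, here is a
minimal sketch in plain Java. It deliberately avoids the Beam API: the class
name ToStringSketch is hypothetical, java.util.function.Function stands in
for the body of the ParDo, and Map.Entry stands in for Beam's KV. It is only
meant to show what each proposed variant would emit for one element.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Map;
import java.util.StringJoiner;
import java.util.function.Function;

// Hypothetical stand-ins for the per-element logic; this is NOT the Beam API.
final class ToStringSketch {
  private ToStringSketch() {}

  // Counterpart of the proposed ToString.<T>of(): element -> element.toString().
  static <T> Function<T, String> of() {
    return element -> element.toString();
  }

  // Counterpart of the proposed ToString.<K,V>kv(delimiter).
  // Map.Entry is used here as an assumption in place of Beam's KV.
  static <K, V> Function<Map.Entry<K, V>, String> kv(String delimiter) {
    return kv -> kv.getKey().toString() + delimiter + kv.getValue().toString();
  }

  // Counterpart of the proposed ToString.<T>iterable(delimiter).
  // StringJoiner places the delimiter only between entries, never after the last.
  static <T> Function<Iterable<T>, String> iterable(String delimiter) {
    return elements -> {
      StringJoiner joiner = new StringJoiner(delimiter);
      for (T element : elements) {
        joiner.add(element.toString());
      }
      return joiner.toString();
    };
  }

  public static void main(String[] args) {
    // prints "42"
    System.out.println(ToStringSketch.<Integer>of().apply(42));
    // prints "ace:4"
    System.out.println(
        ToStringSketch.<String, Long>kv(":").apply(new SimpleEntry<>("ace", 4L)));
    // prints "one,two,three"
    System.out.println(
        ToStringSketch.<String>iterable(",").apply(Arrays.asList("one", "two", "three")));
  }
}
```

Note that in this sketch an empty iterable yields an empty string, which is
one of the behaviors a real transform would have to decide on explicitly.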

I don't see a way to make the generic .formatted() that you propose that
just works with anything "made of different parts".

I think adding too many overloads beyond "of" and "custom" is opening up a
Pandora's box. The KV one might want to have left and right delimiters,
might want to take custom formatters for K and V, etc. The iterable one
might want to have a special configuration for an empty iterable. So I'm
inclined towards simplicity, with the awareness that MapElements.via is
just not that hard to use.

Dan


>
> I think doing these three methods would cover every simple and advanced
> "simple conversions." As JB says, we'll need other specific converters for
> other formats like XML.
>
> I'd really like to see this class in the next version of Beam. What does
> everyone think of the class name, methods name, and method operations so we
> can have Vikas finish up?
>
> Thanks,
>
> Jesse
>
> On Wed, Dec 28, 2016 at 12:28 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
>
> > Hi Vikas,
> >
> > did you take a look on:
> >
> >
> > https://github.com/jbonofre/beam/tree/DATAFORMAT/sdks/
> java/extensions/dataformat
> >
> > You can see KV2String and ToString could be part of this extension.
> > I'm also using JAXB for XML and Jackson for JSON
> > marshalling/unmarshalling. I'm planning to deal with Avro
> (IndexedRecord).
> >
> > Regards
> > JB
> >
> > On 12/28/2016 08:37 PM, Vikas Kedigehalli wrote:
> > > Hi All,
> > >
> > >   Not being aware of the discussion here, I sent out a PR
> > > <https://github.com/apache/beam/pull/1704> but JB and others directed
> > me to
> > > this thread. Having converted PCollection<T> to PCollection<String>
> > several
> > > times, I feel something like 'ToString' transform is common enough to
> be
> > > part of the core. What do you all think?
> > >
> > > Also, if someone else is already working on or interested in tackling
> > this,
> > > then I am happy to discard the PR.
> > >
> > > Regards,
> > > Vikas
> > >
> > > On Tue, Dec 13, 2016 at 1:56 AM, Amit Sela <am...@gmail.com>
> wrote:
> > >
> > >> It seems that there were a lot of good points raised here, and I tend
> to
> > >> agree that something as trivial and lean as "ToString" should be a
> part
> > of
> > >> core.
> > >> I'm particularly fond of makeString(prefix, toString, suffix) in
> various
> > >> combinations (Scala-like).
> > >> For "fromString", I think JB has a good point leveraging JAXB and
> > Jackson -
> > >> though I think this should be in extensions as it is not as lean as
> > >> toString.
> > >>
> > >> Thanks,
> > >> Amit
> > >>
> > >> On Wed, Nov 30, 2016 at 5:13 AM Jean-Baptiste Onofré <jb@nanthrax.net
> >
> > >> wrote:
> > >>
> > >>> Hi Jesse,
> > >>>
> > >>> yes, I started something there (using JAXB and Jackson). Let me
> polish
> > >>> and push.
> > >>>
> > >>> Regards
> > >>> JB
> > >>>
> > >>> On 11/29/2016 10:00 PM, Jesse Anderson wrote:
> > >>>> I went through the string conversions. Do you have an example of
> > >> writing
> > >>>> out XML/JSON/etc too?
> > >>>>
> > >>>> On Tue, Nov 29, 2016 at 3:46 PM Jean-Baptiste Onofré <
> jb@nanthrax.net
> > >
> > >>>> wrote:
> > >>>>
> > >>>>> Hi Jesse,
> > >>>>>
> > >>>>>
> > >>>>>
> > >>> https://github.com/jbonofre/incubator-beam/tree/
> DATAFORMAT/sdks/java/
> > >> extensions/dataformat
> > >>>>>
> > >>>>> it's very simple and stupid and of course not complete at all (I
> have
> > >>>>> other commits but not merged as they need some polishing), but as I
> > >>>>> said, it's a base of discussion.
> > >>>>>
> > >>>>> Regards
> > >>>>> JB
> > >>>>>
> > >>>>> On 11/29/2016 09:23 PM, Jesse Anderson wrote:
> > >>>>>> @jb Sounds good. Just let us know once you've pushed.
> > >>>>>>
> > >>>>>> On Tue, Nov 29, 2016 at 2:54 PM Jean-Baptiste Onofré <
> > >> jb@nanthrax.net>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Good point Eugene.
> > >>>>>>>
> > >>>>>>> Right now, it's a DoFn collection to experiment a bit (a pure
> > >>>>>>> extension). It's pretty stupid ;)
> > >>>>>>>
> > >>>>>>> But, you are right, depending the direction of such extension, it
> > >>> could
> > >>>>>>> cover more use cases (even if it's not my first intention ;)).
> > >>>>>>>
> > >>>>>>> Let me push the branch (pretty small) as an illustration, and in
> > the
> > >>>>>>> mean time, I'm preparing a document (more focused on the use
> > cases).
> > >>>>>>>
> > >>>>>>> WDYT ?
> > >>>>>>>
> > >>>>>>> Regards
> > >>>>>>> JB
> > >>>>>>>
> > >>>>>>> On 11/29/2016 08:47 PM, Eugene Kirpichov wrote:
> > >>>>>>>> Hi JB,
> > >>>>>>>> Depending on the scope of what you want to ultimately accomplish
> > >> with
> > >>>>>>> this
> > >>>>>>>> extension, I think it may make sense to write a proposal
> document
> > >> and
> > >>>>>>>> discuss it.
> > >>>>>>>> If it's just a collection of utility DoFn's for various
> > >> well-defined
> > >>>>>>>> source/target format pairs, then that's probably not needed, but
> > if
> > >>>>> it's
> > >>>>>>>> anything more, then I think it is.
> > >>>>>>>> That will help avoid a lot of churn if people propose reasonable
> > >>>>>>>> significant changes.
> > >>>>>>>>
> > >>>>>>>> On Tue, Nov 29, 2016 at 11:15 AM Jean-Baptiste Onofré <
> > >>> jb@nanthrax.net
> > >>>>>>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> By the way Jesse, I gonna push my DATAFORMAT branch on my
> github
> > >>> and I
> > >>>>>>>>> will post on the dev mailing list when done.
> > >>>>>>>>>
> > >>>>>>>>> Regards
> > >>>>>>>>> JB
> > >>>>>>>>>
> > >>>>>>>>> On 11/29/2016 07:01 PM, Jesse Anderson wrote:
> > >>>>>>>>>> I want to bring this thread back up since we've had time to
> > think
> > >>>>> about
> > >>>>>>>>> it
> > >>>>>>>>>> more and make a plan.
> > >>>>>>>>>>
> > >>>>>>>>>> I think a format-specific converter will be more time
> consuming
> > >>> task
> > >>>>>>> than
> > >>>>>>>>>> we originally thought. It'd have to be a writer that takes
> > >> another
> > >>>>>>> writer
> > >>>>>>>>>> as a parameter.
> > >>>>>>>>>>
> > >>>>>>>>>> I think a string converter can be done as a simple transform.
> > >>>>>>>>>>
> > >>>>>>>>>> I think we should start with a simple string converter and
> plan
> > >>> for a
> > >>>>>>>>>> format-specific writer.
> > >>>>>>>>>>
> > >>>>>>>>>> What are your thoughts?
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks,
> > >>>>>>>>>>
> > >>>>>>>>>> Jesse
> > >>>>>>>>>>
> > >>>>>>>>>> On Thu, Nov 10, 2016 at 10:33 AM Jesse Anderson <
> > >>>>> jesse@smokinghand.com
> > >>>>>>>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> I was thinking about what the outputs would look like last
> > >> night. I
> > >>>>>>>>>> realized that more complex formats like JSON and XML may or
> may
> > >> not
> > >>>>>>>>> output
> > >>>>>>>>>> the data in a valid format.
> > >>>>>>>>>>
> > >>>>>>>>>> Doing a direct conversion on unbounded collections would work
> > >> just
> > >>>>>>> fine.
> > >>>>>>>>>> They're self-contained. For writing out bounded collections,
> > >> that's
> > >>>>>>> where
> > >>>>>>>>>> we'll hit the issues. This changes the uber conversion
> transform
> > >>>>> into a
> > >>>>>>>>>> transform that needs to be a writer.
> > >>>>>>>>>>
> > >>>>>>>>>> If a transform executes a JSON conversion on a per element
> > basis,
> > >>>>> we'd
> > >>>>>>>>> get
> > >>>>>>>>>> this:
> > >>>>>>>>>> {
> > >>>>>>>>>> "key": "value"
> > >>>>>>>>>> }, {
> > >>>>>>>>>> "key": "value"
> > >>>>>>>>>> },
> > >>>>>>>>>>
> > >>>>>>>>>> That isn't valid JSON.
> > >>>>>>>>>>
> > >>>>>>>>>> The conversion transform would need to know do several things
> > >> when
> > >>>>>>>>> writing
> > >>>>>>>>>> out a file. It would need to add brackets for an array. Now we
> > >>> have:
> > >>>>>>>>>> [
> > >>>>>>>>>> {
> > >>>>>>>>>> "key": "value"
> > >>>>>>>>>> }, {
> > >>>>>>>>>> "key": "value"
> > >>>>>>>>>> },
> > >>>>>>>>>> ]
> > >>>>>>>>>>
> > >>>>>>>>>> We still don't have valid JSON. We have to remove the last
> comma
> > >> or
> > >>>>>>> have
> > >>>>>>>>>> the uber transform start putting in the commas, except for the
> > >> last
> > >>>>>>>>> element.
> > >>>>>>>>>>
> > >>>>>>>>>> [
> > >>>>>>>>>> {
> > >>>>>>>>>> "key": "value"
> > >>>>>>>>>> }, {
> > >>>>>>>>>> "key": "value"
> > >>>>>>>>>> }
> > >>>>>>>>>> ]
> > >>>>>>>>>>
> > >>>>>>>>>> Only by doing this do we have valid JSON.
> > >>>>>>>>>>
> > >>>>>>>>>> I'd argue we'd have a similar issue with XML. Some parsers
> > >> require
> > >>> a
> > >>>>>>> root
> > >>>>>>>>>> element for everything. The uber transform would have to put
> the
> > >>> root
> > >>>>>>>>>> element tags at the beginning and end of the file.
> > >>>>>>>>>>
> > >>>>>>>>>> On Wed, Nov 9, 2016 at 11:36 PM Manu Zhang <
> > >>> owenzhang1990@gmail.com>
> > >>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> I would love to see a lean core and abundant Transforms at the
> > >> same
> > >>>>>>> time.
> > >>>>>>>>>>
> > >>>>>>>>>> Maybe we can look at what Confluent <
> > >>> https://github.com/confluentinc
> > >>>>>>
> > >>>>>>>>> does
> > >>>>>>>>>> for kafka-connect. They have official extensions support for
> > >> JDBC,
> > >>>>> HDFS
> > >>>>>>>>> and
> > >>>>>>>>>> ElasticSearch under https://github.com/confluentinc. They put
> > >> them
> > >>>>>>> along
> > >>>>>>>>>> with other community extensions on
> > >>>>>>>>>> https://www.confluent.io/product/connectors/ for visibility.
> > >>>>>>>>>>
> > >>>>>>>>>> Although not a commercial company, can we have a GitHub user
> > like
> > >>>>>>>>>> beam-community to host projects we build around beam but not
> > >>> suitable
> > >>>>>>> for
> > >>>>>>>>>> https://github.com/apache/incubator-beam. In the future, we
> may
> > >>> have
> > >>>>>>>>>> beam-algebra like http://github.com/twitter/algebird for
> > algebra
> > >>>>>>>>> operations
> > >>>>>>>>>> and beam-ml / beam-dl for machine learning / deep learning.
> > Also,
> > >>>>> there
> > >>>>>>>>>> will will be beam related projects elsewhere maintained by
> other
> > >>>>>>>>>> communities. We can put all of them on the beam-website or
> like
> > >>> spark
> > >>>>>>>>>> packages as mentioned by Amit.
> > >>>>>>>>>>
> > >>>>>>>>>> My $0.02
> > >>>>>>>>>> Manu
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles
> > >>>>> <klk@google.com.invalid
> > >>>>>>>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> On this point from Amit and Ismaël, I agree: we could benefit
> > >>> from a
> > >>>>>>>>> place
> > >>>>>>>>>>> for miscellaneous non-core helper transformations.
> > >>>>>>>>>>>
> > >>>>>>>>>>> We have sdks/java/extensions but it is organized as separate
> > >>>>>>> artifacts.
> > >>>>>>>>> I
> > >>>>>>>>>>> think that is fine, considering the nature of Join and
> > >> SortValues.
> > >>>>> But
> > >>>>>>>>> for
> > >>>>>>>>>>> simpler transforms, Importing one artifact per tiny transform
> > is
> > >>> too
> > >>>>>>>>> much
> > >>>>>>>>>>> overhead. It also seems unlikely that we will have enough
> > >>>>> commonality
> > >>>>>>>>>> among
> > >>>>>>>>>>> the transforms to call the artifact anything other than [some
> > >>>>> synonym
> > >>>>>>>>> for]
> > >>>>>>>>>>> "miscellaneous".
> > >>>>>>>>>>>
> > >>>>>>>>>>> I wouldn't want to take this too far - even though the SDK
> many
> > >>>>>>>>>> transforms*
> > >>>>>>>>>>> that are not required for the model [1], I like that the SDK
> > >>>>> artifact
> > >>>>>>>>> has
> > >>>>>>>>>>> everything a user might need in their "getting started" phase
> > of
> > >>>>> use.
> > >>>>>>>>> This
> > >>>>>>>>>>> user-friendliness (the user doesn't care that ParDo is core
> and
> > >>> Sum
> > >>>>> is
> > >>>>>>>>>> not)
> > >>>>>>>>>>> plus the difficulty of judging which transforms go where, are
> > >>>>> probably
> > >>>>>>>>> why
> > >>>>>>>>>>> we have them mostly all in one place.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Models to look at, off the top of my head, include Pig's
> > >> PiggyBank
> > >>>>> and
> > >>>>>>>>>>> Apex's Malhar. These have different levels of support
> implied.
> > >>>>> Others?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Kenn
> > >>>>>>>>>>>
> > >>>>>>>>>>> [1] ApproximateQuantiles, ApproximateUnique, Count, Distinct,
> > >>>>> Filter,
> > >>>>>>>>>>> FlatMapElements, Keys, Latest, MapElements, Max, Mean, Min,
> > >>> Values,
> > >>>>>>>>>> KvSwap,
> > >>>>>>>>>>> Partition, Regex, Sample, Sum, Top, Values, WithKeys,
> > >>> WithTimestamps
> > >>>>>>>>>>>
> > >>>>>>>>>>> * at least they are separate classes and not methods on
> > >>> PCollection
> > >>>>>>> :-)
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, Nov 9, 2016 at 6:03 AM, Ismaël Mejía <
> > iemejia@gmail.com
> > >>>
> > >>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Nice discussion, and thanks Jesse for bringing this subject
> > >>> back.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I agree 100% with Amit and the idea of having a home for
> those
> > >>>>>>>>>> transforms
> > >>>>>>>>>>>> that are not core enough to be part of the sdk, but that we
> > all
> > >>> end
> > >>>>>>> up
> > >>>>>>>>>>>> re-writing somehow.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> This is a needed improvement to be more developer friendly,
> > but
> > >>>>> also
> > >>>>>>> as
> > >>>>>>>>>> a
> > >>>>>>>>>>>> reference of good practices of Beam development, and for
> this
> > >>>>> reason
> > >>>>>>> I
> > >>>>>>>>>>>> agree with JB that at this moment it would be better for
> these
> > >>>>>>>>>> transforms
> > >>>>>>>>>>>> to reside in the Beam repository at least for visibility
> > >> reasons.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> One additional question is if these transforms represent a
> > >>>>> different
> > >>>>>>>>> DSL
> > >>>>>>>>>>> or
> > >>>>>>>>>>>> if those could be grouped with the current extensions (e.g.
> > >> Join
> > >>>>> and
> > >>>>>>>>>>>> SortValues) into something more general that we as a
> community
> > >>>>> could
> > >>>>>>>>>>>> maintain, but well even if it is not the case, it would be
> > >> really
> > >>>>>>> nice
> > >>>>>>>>>> to
> > >>>>>>>>>>>> start working on something like this.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Ismaël Mejía
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré <
> > >>>>>>> jb@nanthrax.net
> > >>>>>>>>>>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Related to spark-package, we also have Apache Bahir to host
> > >>>>>>>>>>>>> connectors/transforms for Spark and Flink.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> IMHO, right now, Beam should host this, not sure if it
> makes
> > >>> sense
> > >>>>>>>>>>>>> directly in the core.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> It reminds me the "Integration" DSL we discussed in the
> > >>> technical
> > >>>>>>>>>>> vision
> > >>>>>>>>>>>>> document.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Regards
> > >>>>>>>>>>>>> JB
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On 11/09/2016 11:17 AM, Amit Sela wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I think Jesse has a very good point on one hand, while
> > Luke's
> > >>> and
> > >>>>>>>>>>>>>> Kenneth's
> > >>>>>>>>>>>>>> worries about committing users to specific implementations
> > is
> > >>> in
> > >>>>>>>>>>> place.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> The Spark community has a 3rd party repository for useful
> > >>>>> libraries
> > >>>>>>>>>>> that
> > >>>>>>>>>>>>>> for various reasons are not a part of the Apache Spark
> > >> project:
> > >>>>>>>>>>>>>> https://spark-packages.org/.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Maybe a "common-transformations" package would serve both
> > >> users
> > >>>>>>> quick
> > >>>>>>>>>>>>>> ramp-up and ease-of-use while keeping Beam more
> "enabling" ?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles
> > >>>>>>>>>> <klk@google.com.invalid
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> It seems useful for small scale debugging / demoing to
> have
> > >>>>>>>>>>>>>>> Dump.toString(). I think it should be named to clearly
> > >>> indicate
> > >>>>>>> its
> > >>>>>>>>>>>>>>> limited
> > >>>>>>>>>>>>>>> scope. Maybe other stuff could go in the Dump namespace,
> > but
> > >>>>>>>>>>>>>>> "Dump.toJson()" would be for humans to read - so it
> should
> > >> be
> > >>>>>>> pretty
> > >>>>>>>>>>>>>>> printed, not treated as a machine-to-machine wire format.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> The broader question of representing data in JSON or XML,
> > >> etc,
> > >>>>> is
> > >>>>>>>>>>>> already
> > >>>>>>>>>>>>>>> the subject of many mature libraries which are already
> easy
> > >> to
> > >>>>> use
> > >>>>>>>>>>> with
> > >>>>>>>>>>>>>>> Beam.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> The more esoteric practice of implicit or semi-implicit
> > >>>>> coercions
> > >>>>>>>>>>> seems
> > >>>>>>>>>>>>>>> like it is also already addressed in many ways elsewhere.
> > >>>>>>>>>>>>>>> Transform.via(TypeConverter) is basically the same as
> > >>>>>>>>>>>>>>> MapElements.via(<lambda>) and also easy to use with Beam.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> In both of the last cases, there are many reasonable
> > >>> approaches,
> > >>>>>>> and
> > >>>>>>>>>>> we
> > >>>>>>>>>>>>>>> shouldn't commit our users to one of them.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik
> > >>>>>>>>>>> <lcwik@google.com.invalid
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> The suggestions you give seem good except for the the XML
> > >>> cases.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Might want to have the XML be a document per line
> similar
> > >> to
> > >>>>> the
> > >>>>>>>>>>> JSON
> > >>>>>>>>>>>>>>>> examples you have been giving.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson <
> > >>>>>>>>>>>> jesse@smokinghand.com>
> > >>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> @lukasz Agreed there would have to be KV handling. I was
> > >> more
> > >>>>>>> think
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> that
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> whatever the addition, it shouldn't just handle KV. It
> > >> should
> > >>>>>>>>>> handle
> > >>>>>>>>>>>>>>>>> Iterables, Lists, Sets, and KVs.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> For JSON and XML, I wonder if we'd be able to give
> > someone
> > >>>>>>>>>>> something
> > >>>>>>>>>>>>>>>>> general purpose enough that you would just end up
> writing
> > >>> your
> > >>>>>>> own
> > >>>>>>>>>>>> code
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> handle it anyway.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Here are some ideas on what it could look like with a
> > >> method
> > >>>>> and
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>> resulting string output:
> > >>>>>>>>>>>>>>>>> *Stringify.toJSON()*
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> With KV:
> > >>>>>>>>>>>>>>>>> {"key": "value"}
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> With Iterables:
> > >>>>>>>>>>>>>>>>> ["one", "two", "three"]
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> *Stringify.toXML("rootelement")*
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> With KV:
> > >>>>>>>>>>>>>>>>> <rootelement key=value />
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> With Iterables:
> > >>>>>>>>>>>>>>>>> <rootelement>
> > >>>>>>>>>>>>>>>>>   <item>one</item>
> > >>>>>>>>>>>>>>>>>   <item>two</item>
> > >>>>>>>>>>>>>>>>>   <item>three</item>
> > >>>>>>>>>>>>>>>>> </rootelement>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> *Stringify.toDelimited(",")*
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> With KV:
> > >>>>>>>>>>>>>>>>> key,value
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> With Iterables:
> > >>>>>>>>>>>>>>>>> one,two,three
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Do you think that would strike a good balance between
> > >>> reusable
> > >>>>>>>>>> code
> > >>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>> writing your own for more difficult formatting?
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Jesse
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik
> > >>>>>>>>>>> <lcwik@google.com.invalid
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Jesse, I believe if one format gets special treatment
> in
> > >>>>> TextIO,
> > >>>>>>>>>>>> people
> > >>>>>>>>>>>>>>>>> will then ask why doesn't JSON, XML, ... also not
> > >> supported.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Also, the example that you provide is using the fact
> that
> > >>> the
> > >>>>>>>>>> input
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> format
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> is an Iterable<Item>. You had posted a question about
> > >> using
> > >>> KV
> > >>>>>>>>>> with
> > >>>>>>>>>>>>>>>>> TextIO.Write which wouldn't align with the proposed
> input
> > >>>>> format
> > >>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> still
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> would require to write a type conversion function, this
> > >> time
> > >>>>>>> from
> > >>>>>>>>>>> KV
> > >>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>> Iterable<Item> instead of KV to string.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson <
> > >>>>>>>>>>>> jesse@smokinghand.com>
> > >>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Lukasz,
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> I don't think you'd need complicated logic for
> > >>> TextIO.Write.
> > >>>>>>> For
> > >>>>>>>>>>> CSV
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> call would look like:
> > >>>>>>>>>>>>>>>>>> Stringify.to("", ",", "\n");
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Where the arguments would be Stringify.to(prefix,
> > >>> delimiter,
> > >>>>>>>>>>>> suffix).
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> The code would be something like:
> > >>>>>>>>>>>>>>>>>> StringBuffer buffer = new StringBuffer(prefix);
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> for (Item item : list) {
> > >>>>>>>>>>>>>>>>>>   buffer.append(item.toString());
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>   if(notLast) {
> > >>>>>>>>>>>>>>>>>>     buffer.append(delimiter);
> > >>>>>>>>>>>>>>>>>>   }
> > >>>>>>>>>>>>>>>>>> }
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> buffer.append(suffix);
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> c.output(buffer.toString());
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> That would allow you to do the basic CSV, TSV, and
> other
> > >>>>>>> formats
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> without
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> complicated logic. The same sort of thing could be done
> > >> for
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> TextIO.Write.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Jesse
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik
> > >>>>>>>>>>>> <lcwik@google.com.invalid
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> The conversion from object to string will have uses
> > >> outside
> > >>>>> of
> > >>>>>>>>>>> just
> > >>>>>>>>>>>>>>>>>>> TextIO.Write so it seems logical that we would want
> to
> > >>> have
> > >>>>> a
> > >>>>>>>>>>> ParDo
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> do
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> conversion.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Text file formats have a lot of variance, even if you
> > >>>>> consider
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> subset
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> of CSV like formats where it could have fixed width
> > >> fields,
> > >>>>> or
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> escaping
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> quoting around other fields, or headers that should
> be
> > >>>>> placed
> > >>>>>>> at
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> top.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Having all these format conversions within
> TextIO.Write
> > >>>>> seems
> > >>>>>>>>>>> like
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> a
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> lot
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> logic to contain in that transform which should just
> > >> focus
> > >>>>> on
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> writing
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> files.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 8:15 AM, Jesse Anderson <
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> jesse@smokinghand.com>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> This is a thread moved over from the user mailing
> list.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> I think there needs to be a way to convert a
> > >>>>> PCollection<KV>
> > >>>>>>> to
> > >>>>>>>>>>>>>>>>>>>> PCollection<String> Conversion.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> To do a minimal WordCount, you have to manually
> > convert
> > >>> the
> > >>>>>>> KV
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> to a
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> String:
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>         p
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>  .apply(TextIO.Read.from("playing_cards.tsv"))
> > >>>>>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> > >>>>>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> > >>>>>>>>>>>>>>>>>>>> *                .apply(MapElements.via((KV<String,
> > >> Long>
> > >>>>>>>>>> count)
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> ->*
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> *                            count.getKey() + ":" +
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> count.getValue()*
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> *                        ).withOutputType(
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> TypeDescriptors.strings()))*
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to
> > >>>>>>> ("output/stringcounts"));
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> This code really should be something like:
> > >>>>>>>>>>>>>>>>>>>>         p
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>  .apply(TextIO.Read.from("playing_cards.tsv"))
> > >>>>>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> > >>>>>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> > >>>>>>>>>>>>>>>>>>>> *                .apply(ToString.stringify())*
> > >>>>>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to
> > >>>>>>>>> ("output/stringcounts"));
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> To summarize the discussion:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>    - JA: Add a method to StringDelegateCoder to
> output
> > >>> any
> > >>>>> KV
> > >>>>>>>>>> or
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> list
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>    - JA and DH: Add a SimpleFunction that takes an type
> > >> and
> > >>>>> runs
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> toString()
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>    on it:
> > >>>>>>>>>>>>>>>>>>>>    class ToStringFn<InputT> extends
> > >>> SimpleFunction<InputT,
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> String>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> {
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>        public static String apply(InputT input) {
> > >>>>>>>>>>>>>>>>>>>>            return input.toString();
> > >>>>>>>>>>>>>>>>>>>>        }
> > >>>>>>>>>>>>>>>>>>>>    }
> > >>>>>>>>>>>>>>>>>>>>    - JB: Add a general purpose type converter like
> in
> > >>>>> Apache
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Camel.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>    - JA: Add Object support to TextIO.Write that would
> > >> write
> > >>>>> out
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>    toString of any Object.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> My thoughts:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Is converting to a PCollection<String> mostly needed
> > >> when
> > >>>>>>>>>> you're
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> using
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> TextIO.Write? Will a general purpose transform only
> work
> > >> in
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> certain
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> cases
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> and you'll normally have to write custom code format
> > the
> > >>>>>>> strings
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> way
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> you want them?
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> IMHO, it's yes to both. I'd prefer to add Object
> > >> support
> > >>> to
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> TextIO.Write
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> or
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> a SimpleFunction that takes a delimiter as an
> > argument.
> > >>>>>>> Making
> > >>>>>>>>>> a
> > >>>>>>>>>>>>>>>>>>>> SimpleFunction that's able to specify a delimiter
> (and
> > >>>>>>> perhaps
> > >>>>>>>>>> a
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> prefix
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> suffix) should cover the majority of formats and
> > cases.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Jesse
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>

Re: PCollection to PCollection Conversion

Posted by Jesse Anderson <je...@smokinghand.com>.

I prefer JB's take. I think there should be three overloaded methods on the
class. I like Vikas' name ToString. The methods for a simple conversion
should be:

ToString.strings() - Outputs the .toString() of each object in the
PCollection
ToString.strings(String delimiter) - Outputs the .toString() of KVs, Lists,
etc. with the delimiter between every entry
ToString.formatted(String format) - Outputs the formatted
<https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html> string
with the object passed in. For objects made up of different parts like KVs,
the toString() of each part is passed in as a separate varargs argument.

I think these three methods would cover both the basic and the more advanced
"simple conversions." As JB says, we'll need other specific converters for
other formats like XML.
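
A minimal sketch of how the three proposed methods might behave on a single
element, written as plain Java functions rather than Beam transforms. The
class name ToStringSketch is hypothetical, and a KV is modelled here as a
two-element list or array because Beam's KV type is not on the classpath in
this sketch; none of this is the API from the PR.

```java
import java.util.Arrays;
import java.util.function.Function;

// Hypothetical stand-in for the proposed ToString class (not Beam code).
final class ToStringSketch {

    // strings(): the toString() of each element.
    static <T> Function<T, String> strings() {
        return element -> String.valueOf(element);
    }

    // strings(delimiter): the toString() of every part, joined by the delimiter.
    static Function<Iterable<?>, String> strings(String delimiter) {
        return parts -> {
            StringBuilder out = new StringBuilder();
            boolean first = true;
            for (Object part : parts) {
                if (!first) {
                    out.append(delimiter);
                }
                out.append(String.valueOf(part));
                first = false;
            }
            return out.toString();
        };
    }

    // formatted(format): each part's toString() is passed to
    // java.util.Formatter (via String.format) as a separate varargs argument.
    static Function<Object[], String> formatted(String format) {
        return parts -> {
            Object[] asStrings =
                Arrays.stream(parts).map(p -> String.valueOf(p)).toArray();
            return String.format(format, asStrings);
        };
    }

    public static void main(String[] args) {
        System.out.println(strings().apply(42));
        System.out.println(strings(",").apply(Arrays.asList("key", 3)));
        System.out.println(formatted("%s:%s").apply(new Object[] {"king", 4L}));
    }
}
```

In a pipeline, each of these functions would presumably be wrapped in a
MapElements-style transform so that it runs once per element.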

I'd really like to see this class in the next version of Beam. What does
everyone think of the class name, method names, and method operations so we
can have Vikas finish up?

Thanks,

Jesse

On Wed, Dec 28, 2016 at 12:28 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Hi Vikas,
>
> did you take a look on:
>
>
> https://github.com/jbonofre/beam/tree/DATAFORMAT/sdks/java/extensions/dataformat
>
> You can see KV2String and ToString could be part of this extension.
> I'm also using JAXB for XML and Jackson for JSON
> marshalling/unmarshalling. I'm planning to deal with Avro (IndexedRecord).
>
> Regards
> JB
>
> On 12/28/2016 08:37 PM, Vikas Kedigehalli wrote:
> > Hi All,
> >
> >   Not being aware of the discussion here, I sent out a PR
> > <https://github.com/apache/beam/pull/1704> but JB and others directed
> me to
> > this thread. Having converted PCollection<T> to PCollection<String>
> several
> > times, I feel something like 'ToString' transform is common enough to be
> > part of the core. What do you all think?
> >
> > Also, if someone else is already working on or interested in tackling
> this,
> > then I am happy to discard the PR.
> >
> > Regards,
> > Vikas
> >
> > On Tue, Dec 13, 2016 at 1:56 AM, Amit Sela <am...@gmail.com> wrote:
> >
> >> It seems that there were a lot of good points raised here, and I tend to
> >> agree that something as trivial and lean as "ToString" should be a part
> of
> >> core.
> >> I'm particularly fond of makeString(prefix, toString, suffix) in various
> >> combinations (Scala-like).
> >> For "fromString", I think JB has a good point leveraging JAXB and
> Jackson -
> >> though I think this should be in extensions as it is not as lean as
> >> toString.
> >>
> >> Thanks,
> >> Amit
> >>
> >> On Wed, Nov 30, 2016 at 5:13 AM Jean-Baptiste Onofré <jb...@nanthrax.net>
> >> wrote:
> >>
> >>> Hi Jesse,
> >>>
> >>> yes, I started something there (using JAXB and Jackson). Let me polish
> >>> and push.
> >>>
> >>> Regards
> >>> JB
> >>>
> >>> On 11/29/2016 10:00 PM, Jesse Anderson wrote:
> >>>> I went through the string conversions. Do you have an example of
> >> writing
> >>>> out XML/JSON/etc too?
> >>>>
> >>>> On Tue, Nov 29, 2016 at 3:46 PM Jean-Baptiste Onofré <jb@nanthrax.net
> >
> >>>> wrote:
> >>>>
> >>>>> Hi Jesse,
> >>>>>
> >>>>>
> >>>>>
> >>> https://github.com/jbonofre/incubator-beam/tree/DATAFORMAT/sdks/java/
> >> extensions/dataformat
> >>>>>
> >>>>> it's very simple and stupid and of course not complete at all (I have
> >>>>> other commits but not merged as they need some polishing), but as I
> >>>>> said, it's a base of discussion.
> >>>>>
> >>>>> Regards
> >>>>> JB
> >>>>>
> >>>>> On 11/29/2016 09:23 PM, Jesse Anderson wrote:
> >>>>>> @jb Sounds good. Just let us know once you've pushed.
> >>>>>>
> >>>>>> On Tue, Nov 29, 2016 at 2:54 PM Jean-Baptiste Onofré <
> >> jb@nanthrax.net>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Good point Eugene.
> >>>>>>>
> >>>>>>> Right now, it's a DoFn collection to experiment a bit (a pure
> >>>>>>> extension). It's pretty stupid ;)
> >>>>>>>
> >>>>>>> But, you are right, depending the direction of such extension, it
> >>> could
> >>>>>>> cover more use cases (even if it's not my first intention ;)).
> >>>>>>>
> >>>>>>> Let me push the branch (pretty small) as an illustration, and in
> the
> >>>>>>> mean time, I'm preparing a document (more focused on the use
> cases).
> >>>>>>>
> >>>>>>> WDYT ?
> >>>>>>>
> >>>>>>> Regards
> >>>>>>> JB
> >>>>>>>
> >>>>>>> On 11/29/2016 08:47 PM, Eugene Kirpichov wrote:
> >>>>>>>> Hi JB,
> >>>>>>>> Depending on the scope of what you want to ultimately accomplish
> >> with
> >>>>>>> this
> >>>>>>>> extension, I think it may make sense to write a proposal document
> >> and
> >>>>>>>> discuss it.
> >>>>>>>> If it's just a collection of utility DoFn's for various
> >> well-defined
> >>>>>>>> source/target format pairs, then that's probably not needed, but
> if
> >>>>> it's
> >>>>>>>> anything more, then I think it is.
> >>>>>>>> That will help avoid a lot of churn if people propose reasonable
> >>>>>>>> significant changes.
> >>>>>>>>
> >>>>>>>> On Tue, Nov 29, 2016 at 11:15 AM Jean-Baptiste Onofré <
> >>> jb@nanthrax.net
> >>>>>>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> By the way Jesse, I gonna push my DATAFORMAT branch on my github
> >>> and I
> >>>>>>>>> will post on the dev mailing list when done.
> >>>>>>>>>
> >>>>>>>>> Regards
> >>>>>>>>> JB
> >>>>>>>>>
> >>>>>>>>> On 11/29/2016 07:01 PM, Jesse Anderson wrote:
> >>>>>>>>>> I want to bring this thread back up since we've had time to
> think
> >>>>> about
> >>>>>>>>> it
> >>>>>>>>>> more and make a plan.
> >>>>>>>>>>
> >>>>>>>>>> I think a format-specific converter will be more time consuming
> >>> task
> >>>>>>> than
> >>>>>>>>>> we originally thought. It'd have to be a writer that takes
> >> another
> >>>>>>> writer
> >>>>>>>>>> as a parameter.
> >>>>>>>>>>
> >>>>>>>>>> I think a string converter can be done as a simple transform.
> >>>>>>>>>>
> >>>>>>>>>> I think we should start with a simple string converter and plan
> >>> for a
> >>>>>>>>>> format-specific writer.
> >>>>>>>>>>
> >>>>>>>>>> What are your thoughts?
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>>
> >>>>>>>>>> Jesse
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Nov 10, 2016 at 10:33 AM Jesse Anderson <
> >>>>> jesse@smokinghand.com
> >>>>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> I was thinking about what the outputs would look like last
> >> night. I
> >>>>>>>>>> realized that more complex formats like JSON and XML may or may
> >> not
> >>>>>>>>> output
> >>>>>>>>>> the data in a valid format.
> >>>>>>>>>>
> >>>>>>>>>> Doing a direct conversion on unbounded collections would work
> >> just
> >>>>>>> fine.
> >>>>>>>>>> They're self-contained. For writing out bounded collections,
> >> that's
> >>>>>>> where
> >>>>>>>>>> we'll hit the issues. This changes the uber conversion transform
> >>>>> into a
> >>>>>>>>>> transform that needs to be a writer.
> >>>>>>>>>>
> >>>>>>>>>> If a transform executes a JSON conversion on a per element
> basis,
> >>>>> we'd
> >>>>>>>>> get
> >>>>>>>>>> this:
> >>>>>>>>>> {
> >>>>>>>>>> "key": "value"
> >>>>>>>>>> }, {
> >>>>>>>>>> "key": "value"
> >>>>>>>>>> },
> >>>>>>>>>>
> >>>>>>>>>> That isn't valid JSON.
> >>>>>>>>>>
> >>>>>>>>>> The conversion transform would need to know do several things
> >> when
> >>>>>>>>> writing
> >>>>>>>>>> out a file. It would need to add brackets for an array. Now we
> >>> have:
> >>>>>>>>>> [
> >>>>>>>>>> {
> >>>>>>>>>> "key": "value"
> >>>>>>>>>> }, {
> >>>>>>>>>> "key": "value"
> >>>>>>>>>> },
> >>>>>>>>>> ]
> >>>>>>>>>>
> >>>>>>>>>> We still don't have valid JSON. We have to remove the last comma
> >> or
> >>>>>>> have
> >>>>>>>>>> the uber transform start putting in the commas, except for the
> >> last
> >>>>>>>>> element.
> >>>>>>>>>>
> >>>>>>>>>> [
> >>>>>>>>>> {
> >>>>>>>>>> "key": "value"
> >>>>>>>>>> }, {
> >>>>>>>>>> "key": "value"
> >>>>>>>>>> }
> >>>>>>>>>> ]
> >>>>>>>>>>
> >>>>>>>>>> Only by doing this do we have valid JSON.
> >>>>>>>>>>
> >>>>>>>>>> I'd argue we'd have a similar issue with XML. Some parsers
> >> require
> >>> a
> >>>>>>> root
> >>>>>>>>>> element for everything. The uber transform would have to put the
> >>> root
> >>>>>>>>>> element tags at the beginning and end of the file.
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Nov 9, 2016 at 11:36 PM Manu Zhang <
> >>> owenzhang1990@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> I would love to see a lean core and abundant Transforms at the
> >> same
> >>>>>>> time.
> >>>>>>>>>>
> >>>>>>>>>> Maybe we can look at what Confluent <
> >>> https://github.com/confluentinc
> >>>>>>
> >>>>>>>>> does
> >>>>>>>>>> for kafka-connect. They have official extensions support for
> >> JDBC,
> >>>>> HDFS
> >>>>>>>>> and
> >>>>>>>>>> ElasticSearch under https://github.com/confluentinc. They put
> >> them
> >>>>>>> along
> >>>>>>>>>> with other community extensions on
> >>>>>>>>>> https://www.confluent.io/product/connectors/ for visibility.
> >>>>>>>>>>
> >>>>>>>>>> Although not a commercial company, can we have a GitHub user
> like
> >>>>>>>>>> beam-community to host projects we build around beam but not
> >>> suitable
> >>>>>>> for
> >>>>>>>>>> https://github.com/apache/incubator-beam. In the future, we may
> >>> have
> >>>>>>>>>> beam-algebra like http://github.com/twitter/algebird for
> algebra
> >>>>>>>>> operations
> >>>>>>>>>> and beam-ml / beam-dl for machine learning / deep learning.
> Also,
> >>>>> there
> >>>>>>>>>> will be beam related projects elsewhere maintained by other
> >>>>>>>>>> communities. We can put all of them on the beam-website or like
> >>> spark
> >>>>>>>>>> packages as mentioned by Amit.
> >>>>>>>>>>
> >>>>>>>>>> My $0.02
> >>>>>>>>>> Manu
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles
> >>>>> <klk@google.com.invalid
> >>>>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> On this point from Amit and Ismaël, I agree: we could benefit
> >>> from a
> >>>>>>>>> place
> >>>>>>>>>>> for miscellaneous non-core helper transformations.
> >>>>>>>>>>>
> >>>>>>>>>>> We have sdks/java/extensions but it is organized as separate
> >>>>>>> artifacts.
> >>>>>>>>> I
> >>>>>>>>>>> think that is fine, considering the nature of Join and
> >> SortValues.
> >>>>> But
> >>>>>>>>> for
> >>>>>>>>>>> simpler transforms, Importing one artifact per tiny transform
> is
> >>> too
> >>>>>>>>> much
> >>>>>>>>>>> overhead. It also seems unlikely that we will have enough
> >>>>> commonality
> >>>>>>>>>> among
> >>>>>>>>>>> the transforms to call the artifact anything other than [some
> >>>>> synonym
> >>>>>>>>> for]
> >>>>>>>>>>> "miscellaneous".
> >>>>>>>>>>>
> >>>>>>>>>>> I wouldn't want to take this too far - even though the SDK has many
> >>>>>>>>>> transforms*
> >>>>>>>>>>> that are not required for the model [1], I like that the SDK
> >>>>> artifact
> >>>>>>>>> has
> >>>>>>>>>>> everything a user might need in their "getting started" phase
> of
> >>>>> use.
> >>>>>>>>> This
> >>>>>>>>>>> user-friendliness (the user doesn't care that ParDo is core and
> >>> Sum
> >>>>> is
> >>>>>>>>>> not)
> >>>>>>>>>>> plus the difficulty of judging which transforms go where, are
> >>>>> probably
> >>>>>>>>> why
> >>>>>>>>>>> we have them mostly all in one place.
> >>>>>>>>>>>
> >>>>>>>>>>> Models to look at, off the top of my head, include Pig's
> >> PiggyBank
> >>>>> and
> >>>>>>>>>>> Apex's Malhar. These have different levels of support implied.
> >>>>> Others?
> >>>>>>>>>>>
> >>>>>>>>>>> Kenn
> >>>>>>>>>>>
> >>>>>>>>>>> [1] ApproximateQuantiles, ApproximateUnique, Count, Distinct,
> >>>>> Filter,
> >>>>>>>>>>> FlatMapElements, Keys, Latest, MapElements, Max, Mean, Min,
> >>> Values,
> >>>>>>>>>> KvSwap,
> >>>>>>>>>>> Partition, Regex, Sample, Sum, Top, Values, WithKeys,
> >>> WithTimestamps
> >>>>>>>>>>>
> >>>>>>>>>>> * at least they are separate classes and not methods on
> >>> PCollection
> >>>>>>> :-)
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Nov 9, 2016 at 6:03 AM, Ismaël Mejía <
> iemejia@gmail.com
> >>>
> >>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Nice discussion, and thanks Jesse for bringing this subject
> >>> back.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I agree 100% with Amit and the idea of having a home for those
> >>>>>>>>>> transforms
> >>>>>>>>>>>> that are not core enough to be part of the sdk, but that we
> all
> >>> end
> >>>>>>> up
> >>>>>>>>>>>> re-writing somehow.
> >>>>>>>>>>>>
> >>>>>>>>>>>> This is a needed improvement to be more developer friendly,
> but
> >>>>> also
> >>>>>>> as
> >>>>>>>>>> a
> >>>>>>>>>>>> reference of good practices of Beam development, and for this
> >>>>> reason
> >>>>>>> I
> >>>>>>>>>>>> agree with JB that at this moment it would be better for these
> >>>>>>>>>> transforms
> >>>>>>>>>>>> to reside in the Beam repository at least for visibility
> >> reasons.
> >>>>>>>>>>>>
> >>>>>>>>>>>> One additional question is if these transforms represent a
> >>>>> different
> >>>>>>>>> DSL
> >>>>>>>>>>> or
> >>>>>>>>>>>> if those could be grouped with the current extensions (e.g.
> >> Join
> >>>>> and
> >>>>>>>>>>>> SortValues) into something more general that we as a community
> >>>>> could
> >>>>>>>>>>>> maintain, but well even if it is not the case, it would be
> >> really
> >>>>>>> nice
> >>>>>>>>>> to
> >>>>>>>>>>>> start working on something like this.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Ismaël Mejía
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré <
> >>>>>>> jb@nanthrax.net
> >>>>>>>>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Related to spark-package, we also have Apache Bahir to host
> >>>>>>>>>>>>> connectors/transforms for Spark and Flink.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> IMHO, right now, Beam should host this, not sure if it makes
> >>> sense
> >>>>>>>>>>>>> directly in the core.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> It reminds me the "Integration" DSL we discussed in the
> >>> technical
> >>>>>>>>>>> vision
> >>>>>>>>>>>>> document.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Regards
> >>>>>>>>>>>>> JB
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 11/09/2016 11:17 AM, Amit Sela wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> I think Jesse has a very good point on one hand, while
> Luke's
> >>> and
> >>>>>>>>>>>>>> Kenneth's
> >>>>>>>>>>>>>> worries about committing users to specific implementations
> is
> >>> in
> >>>>>>>>>>> place.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The Spark community has a 3rd party repository for useful
> >>>>> libraries
> >>>>>>>>>>> that
> >>>>>>>>>>>>>> for various reasons are not a part of the Apache Spark
> >> project:
> >>>>>>>>>>>>>> https://spark-packages.org/.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Maybe a "common-transformations" package would serve both
> >> users
> >>>>>>> quick
> >>>>>>>>>>>>>> ramp-up and ease-of-use while keeping Beam more "enabling" ?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles
> >>>>>>>>>> <klk@google.com.invalid
> >>>>>>>>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> It seems useful for small scale debugging / demoing to have
> >>>>>>>>>>>>>>> Dump.toString(). I think it should be named to clearly
> >>> indicate
> >>>>>>> its
> >>>>>>>>>>>>>>> limited
> >>>>>>>>>>>>>>> scope. Maybe other stuff could go in the Dump namespace,
> but
> >>>>>>>>>>>>>>> "Dump.toJson()" would be for humans to read - so it should
> >> be
> >>>>>>> pretty
> >>>>>>>>>>>>>>> printed, not treated as a machine-to-machine wire format.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The broader question of representing data in JSON or XML,
> >> etc,
> >>>>> is
> >>>>>>>>>>>> already
> >>>>>>>>>>>>>>> the subject of many mature libraries which are already easy
> >> to
> >>>>> use
> >>>>>>>>>>> with
> >>>>>>>>>>>>>>> Beam.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The more esoteric practice of implicit or semi-implicit
> >>>>> coercions
> >>>>>>>>>>> seems
> >>>>>>>>>>>>>>> like it is also already addressed in many ways elsewhere.
> >>>>>>>>>>>>>>> Transform.via(TypeConverter) is basically the same as
> >>>>>>>>>>>>>>> MapElements.via(<lambda>) and also easy to use with Beam.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> In both of the last cases, there are many reasonable
> >>> approaches,
> >>>>>>> and
> >>>>>>>>>>> we
> >>>>>>>>>>>>>>> shouldn't commit our users to one of them.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik
> >>>>>>>>>>> <lcwik@google.com.invalid
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The suggestions you give seem good except for the XML
> >>> cases.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Might want to have the XML be a document per line similar
> >> to
> >>>>> the
> >>>>>>>>>>> JSON
> >>>>>>>>>>>>>>>> examples you have been giving.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson <
> >>>>>>>>>>>> jesse@smokinghand.com>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> @lukasz Agreed there would have to be KV handling. I was
> >> more
> >>>>>>> think
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> whatever the addition, it shouldn't just handle KV. It
> >> should
> >>>>>>>>>> handle
> >>>>>>>>>>>>>>>>> Iterables, Lists, Sets, and KVs.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> For JSON and XML, I wonder if we'd be able to give
> someone
> >>>>>>>>>>> something
> >>>>>>>>>>>>>>>>> general purpose enough that you would just end up writing
> >>> your
> >>>>>>> own
> >>>>>>>>>>>> code
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> handle it anyway.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Here are some ideas on what it could look like with a
> >> method
> >>>>> and
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> resulting string output:
> >>>>>>>>>>>>>>>>> *Stringify.toJSON()*
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> With KV:
> >>>>>>>>>>>>>>>>> {"key": "value"}
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> With Iterables:
> >>>>>>>>>>>>>>>>> ["one", "two", "three"]
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> *Stringify.toXML("rootelement")*
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> With KV:
> >>>>>>>>>>>>>>>>> <rootelement key=value />
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> With Iterables:
> >>>>>>>>>>>>>>>>> <rootelement>
> >>>>>>>>>>>>>>>>>   <item>one</item>
> >>>>>>>>>>>>>>>>>   <item>two</item>
> >>>>>>>>>>>>>>>>>   <item>three</item>
> >>>>>>>>>>>>>>>>> </rootelement>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> *Stringify.toDelimited(",")*
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> With KV:
> >>>>>>>>>>>>>>>>> key,value
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> With Iterables:
> >>>>>>>>>>>>>>>>> one,two,three
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Do you think that would strike a good balance between
> >>> reusable
> >>>>>>>>>> code
> >>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>> writing your own for more difficult formatting?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Jesse
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik
> >>>>>>>>>>> <lcwik@google.com.invalid
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Jesse, I believe if one format gets special treatment in
> >>>>> TextIO,
> >>>>>>>>>>>> people
> >>>>>>>>>>>>>>>>> will then ask why doesn't JSON, XML, ... also not
> >> supported.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Also, the example that you provide is using the fact that
> >>> the
> >>>>>>>>>> input
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> format
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> is an Iterable<Item>. You had posted a question about
> >> using
> >>> KV
> >>>>>>>>>> with
> >>>>>>>>>>>>>>>>> TextIO.Write which wouldn't align with the proposed input
> >>>>> format
> >>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> still
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> would require to write a type conversion function, this
> >> time
> >>>>>>> from
> >>>>>>>>>>> KV
> >>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>> Iterable<Item> instead of KV to string.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson <
> >>>>>>>>>>>> jesse@smokinghand.com>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Lukasz,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I don't think you'd need complicated logic for
> >>> TextIO.Write.
> >>>>>>> For
> >>>>>>>>>>> CSV
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> call would look like:
> >>>>>>>>>>>>>>>>>> Stringify.to("", ",", "\n");
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Where the arguments would be Stringify.to(prefix,
> >>> delimiter,
> >>>>>>>>>>>> suffix).
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> The code would be something like:
> >>>>>>>>>>>>>>>>>> StringBuffer buffer = new StringBuffer(prefix);
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> for (Item item : list) {
> >>>>>>>>>>>>>>>>>>   buffer.append(item.toString());
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>   if(notLast) {
> >>>>>>>>>>>>>>>>>>     buffer.append(delimiter);
> >>>>>>>>>>>>>>>>>>   }
> >>>>>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> buffer.append(suffix);
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> c.output(buffer.toString());
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> That would allow you to do the basic CSV, TSV, and other
> >>>>>>> formats
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> without
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> complicated logic. The same sort of thing could be done
> >> for
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> TextIO.Write.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Jesse
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik
> >>>>>>>>>>>> <lcwik@google.com.invalid
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> The conversion from object to string will have uses
> >> outside
> >>>>> of
> >>>>>>>>>>> just
> >>>>>>>>>>>>>>>>>>> TextIO.Write so it seems logical that we would want to
> >>> have
> >>>>> a
> >>>>>>>>>>> ParDo
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> do
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> conversion.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Text file formats have a lot of variance, even if you
> >>>>> consider
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> subset
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> of CSV like formats where it could have fixed width
> >> fields,
> >>>>> or
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> escaping
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> quoting around other fields, or headers that should be
> >>>>> placed
> >>>>>>> at
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> top.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Having all these format conversions within TextIO.Write
> >>>>> seems
> >>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> lot
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> logic to contain in that transform which should just
> >> focus
> >>>>> on
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> writing
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> files.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 8:15 AM, Jesse Anderson <
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> jesse@smokinghand.com>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> This is a thread moved over from the user mailing list.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I think there needs to be a way to convert a
> >>>>> PCollection<KV>
> >>>>>>> to
> >>>>>>>>>>>>>>>>>>>> PCollection<String> Conversion.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> To do a minimal WordCount, you have to manually
> convert
> >>> the
> >>>>>>> KV
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> to a
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> String:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>         p
> >>>>>>>>>>>>>>>>>>>>
> >>>>>  .apply(TextIO.Read.from("playing_cards.tsv"))
> >>>>>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> >>>>>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> >>>>>>>>>>>>>>>>>>>> *                .apply(MapElements.via((KV<String,
> >> Long>
> >>>>>>>>>> count)
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> ->*
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> *                            count.getKey() + ":" +
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> count.getValue()*
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> *                        ).withOutputType(
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> TypeDescriptors.strings()))*
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to
> >>>>>>> ("output/stringcounts"));
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> This code really should be something like:
> >>>>>>>>>>>>>>>>>>>>         p
> >>>>>>>>>>>>>>>>>>>>
> >>>>>  .apply(TextIO.Read.from("playing_cards.tsv"))
> >>>>>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> >>>>>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> >>>>>>>>>>>>>>>>>>>> *                .apply(ToString.stringify())*
> >>>>>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to
> >>>>>>>>> ("output/stringcounts"));
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> To summarize the discussion:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>    - JA: Add a method to StringDelegateCoder to output
> >>> any
> >>>>> KV
> >>>>>>>>>> or
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> list
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>    - JA and DH: Add a SimpleFunction that takes an type
> >> and
> >>>>> runs
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> toString()
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>    on it:
> >>>>>>>>>>>>>>>>>>>>    class ToStringFn<InputT> extends SimpleFunction<InputT, String> {
> >>>>>>>>>>>>>>>>>>>>        @Override
> >>>>>>>>>>>>>>>>>>>>        public String apply(InputT input) {
> >>>>>>>>>>>>>>>>>>>>            return input.toString();
> >>>>>>>>>>>>>>>>>>>>        }
> >>>>>>>>>>>>>>>>>>>>    }
> >>>>>>>>>>>>>>>>>>>>    - JB: Add a general purpose type converter like in
> >>>>> Apache
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Camel.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>    - JA: Add Object support to TextIO.Write that would
> >> write
> >>>>> out
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>    toString of any Object.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> My thoughts:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Is converting to a PCollection<String> mostly needed
> >> when
> >>>>>>>>>> you're
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> using
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> TextIO.Write? Will a general purpose transform only work
> >> in
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> certain
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> cases
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> and you'll normally have to write custom code format
> the
> >>>>>>> strings
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> way
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> you want them?
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> IMHO, it's yes to both. I'd prefer to add Object
> >> support
> >>> to
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> TextIO.Write
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> a SimpleFunction that takes a delimiter as an
> argument.
> >>>>>>> Making
> >>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>> SimpleFunction that's able to specify a delimiter (and
> >>>>>>> perhaps
> >>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> prefix
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> suffix) should cover the majority of formats and
> cases.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Jesse
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> --
> >>>>>>>>>>>>> Jean-Baptiste Onofré
> >>>>>>>>>>>>> jbonofre@apache.org
> >>>>>>>>>>>>> http://blog.nanthrax.net
> >>>>>>>>>>>>> Talend - http://www.talend.com
>

Re: PCollection to PCollection Conversion

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.

Hi Vikas,

did you take a look at:

https://github.com/jbonofre/beam/tree/DATAFORMAT/sdks/java/extensions/dataformat

You can see KV2String and ToString could be part of this extension.
I'm also using JAXB for XML and Jackson for JSON
marshalling/unmarshalling. I'm planning to deal with Avro (IndexedRecord).

Regards
JB

On 12/28/2016 08:37 PM, Vikas Kedigehalli wrote:
> Hi All,
>
>   Not being aware of the discussion here, I sent out a PR
> <https://github.com/apache/beam/pull/1704> but JB and others directed me to
> this thread. Having converted PCollection<T> to PCollection<String> several
> times, I feel something like 'ToString' transform is common enough to be
> part of the core. What do you all think?
>
> Also, if someone else is already working on or interested in tackling this,
> then I am happy to discard the PR.
>
> Regards,
> Vikas
>
> On Tue, Dec 13, 2016 at 1:56 AM, Amit Sela <am...@gmail.com> wrote:
>
>> It seems that there were a lot of good points raised here, and I tend to
>> agree that something as trivial and lean as "ToString" should be a part of
>> core.
>> I'm particularly fond of makeString(prefix, toString, suffix) in various
>> combinations (Scala-like).
>> For "fromString", I think JB has a good point leveraging JAXB and Jackson -
>> though I think this should be in extensions as it is not as lean as
>> toString.
>>
>> Thanks,
>> Amit
>>
>> On Wed, Nov 30, 2016 at 5:13 AM Jean-Baptiste Onofré <jb...@nanthrax.net>
>> wrote:
>>
>>> Hi Jesse,
>>>
>>> yes, I started something there (using JAXB and Jackson). Let me polish
>>> and push.
>>>
>>> Regards
>>> JB
>>>
>>> On 11/29/2016 10:00 PM, Jesse Anderson wrote:
>>>> I went through the string conversions. Do you have an example of
>> writing
>>>> out XML/JSON/etc too?
>>>>
>>>> On Tue, Nov 29, 2016 at 3:46 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
>>>> wrote:
>>>>
>>>>> Hi Jesse,
>>>>>
>>>>>
>>>>>
>>> https://github.com/jbonofre/incubator-beam/tree/DATAFORMAT/sdks/java/
>> extensions/dataformat
>>>>>
>>>>> it's very simple and stupid and of course not complete at all (I have
>>>>> other commits but not merged as they need some polishing), but as I
>>>>> said, it's a base of discussion.
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 11/29/2016 09:23 PM, Jesse Anderson wrote:
>>>>>> @jb Sounds good. Just let us know once you've pushed.
>>>>>>
>>>>>> On Tue, Nov 29, 2016 at 2:54 PM Jean-Baptiste Onofré <
>> jb@nanthrax.net>
>>>>>> wrote:
>>>>>>
>>>>>>> Good point Eugene.
>>>>>>>
>>>>>>> Right now, it's a DoFn collection to experiment a bit (a pure
>>>>>>> extension). It's pretty stupid ;)
>>>>>>>
>>>>>>> But, you are right, depending on the direction of such an extension, it
>>> could
>>>>>>> cover more use cases (even if it's not my first intention ;)).
>>>>>>>
>>>>>>> Let me push the branch (pretty small) as an illustration, and in the
>>>>>>> mean time, I'm preparing a document (more focused on the use cases).
>>>>>>>
>>>>>>> WDYT ?
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>> On 11/29/2016 08:47 PM, Eugene Kirpichov wrote:
>>>>>>>> Hi JB,
>>>>>>>> Depending on the scope of what you want to ultimately accomplish
>> with
>>>>>>> this
>>>>>>>> extension, I think it may make sense to write a proposal document
>> and
>>>>>>>> discuss it.
>>>>>>>> If it's just a collection of utility DoFn's for various
>> well-defined
>>>>>>>> source/target format pairs, then that's probably not needed, but if
>>>>> it's
>>>>>>>> anything more, then I think it is.
>>>>>>>> That will help avoid a lot of churn if people propose reasonable
>>>>>>>> significant changes.
>>>>>>>>
>>>>>>>> On Tue, Nov 29, 2016 at 11:15 AM Jean-Baptiste Onofré <
>>> jb@nanthrax.net
>>>>>>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> By the way Jesse, I'm going to push my DATAFORMAT branch on my GitHub
>>> and I
>>>>>>>>> will post on the dev mailing list when done.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> JB
>>>>>>>>>
>>>>>>>>> On 11/29/2016 07:01 PM, Jesse Anderson wrote:
>>>>>>>>>> I want to bring this thread back up since we've had time to think
>>>>> about
>>>>>>>>> it
>>>>>>>>>> more and make a plan.
>>>>>>>>>>
>>>>>>>>>> I think a format-specific converter will be a more time-consuming
>>> task
>>>>>>> than
>>>>>>>>>> we originally thought. It'd have to be a writer that takes
>> another
>>>>>>> writer
>>>>>>>>>> as a parameter.
>>>>>>>>>>
>>>>>>>>>> I think a string converter can be done as a simple transform.
>>>>>>>>>>
>>>>>>>>>> I think we should start with a simple string converter and plan
>>> for a
>>>>>>>>>> format-specific writer.
>>>>>>>>>>
>>>>>>>>>> What are your thoughts?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Jesse
>>>>>>>>>>
>>>>>>>>>> On Thu, Nov 10, 2016 at 10:33 AM Jesse Anderson <
>>>>> jesse@smokinghand.com
>>>>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> I was thinking about what the outputs would look like last
>> night. I
>>>>>>>>>> realized that more complex formats like JSON and XML may or may
>> not
>>>>>>>>> output
>>>>>>>>>> the data in a valid format.
>>>>>>>>>>
>>>>>>>>>> Doing a direct conversion on unbounded collections would work
>> just
>>>>>>> fine.
>>>>>>>>>> They're self-contained. For writing out bounded collections,
>> that's
>>>>>>> where
>>>>>>>>>> we'll hit the issues. This changes the uber conversion transform
>>>>> into a
>>>>>>>>>> transform that needs to be a writer.
>>>>>>>>>>
>>>>>>>>>> If a transform executes a JSON conversion on a per element basis,
>>>>> we'd
>>>>>>>>> get
>>>>>>>>>> this:
>>>>>>>>>> {
>>>>>>>>>> "key": "value"
>>>>>>>>>> }, {
>>>>>>>>>> "key": "value"
>>>>>>>>>> },
>>>>>>>>>>
>>>>>>>>>> That isn't valid JSON.
>>>>>>>>>>
>>>>>>>>>> The conversion transform would need to do several things
>> when
>>>>>>>>> writing
>>>>>>>>>> out a file. It would need to add brackets for an array. Now we
>>> have:
>>>>>>>>>> [
>>>>>>>>>> {
>>>>>>>>>> "key": "value"
>>>>>>>>>> }, {
>>>>>>>>>> "key": "value"
>>>>>>>>>> },
>>>>>>>>>> ]
>>>>>>>>>>
>>>>>>>>>> We still don't have valid JSON. We have to remove the last comma
>> or
>>>>>>> have
>>>>>>>>>> the uber transform start putting in the commas, except for the
>> last
>>>>>>>>> element.
>>>>>>>>>>
>>>>>>>>>> [
>>>>>>>>>> {
>>>>>>>>>> "key": "value"
>>>>>>>>>> }, {
>>>>>>>>>> "key": "value"
>>>>>>>>>> }
>>>>>>>>>> ]
>>>>>>>>>>
>>>>>>>>>> Only by doing this do we have valid JSON.
>>>>>>>>>>
>>>>>>>>>> I'd argue we'd have a similar issue with XML. Some parsers
>> require
>>> a
>>>>>>> root
>>>>>>>>>> element for everything. The uber transform would have to put the
>>> root
>>>>>>>>>> element tags at the beginning and end of the file.
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 9, 2016 at 11:36 PM Manu Zhang <
>>> owenzhang1990@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> I would love to see a lean core and abundant Transforms at the
>> same
>>>>>>> time.
>>>>>>>>>>
>>>>>>>>>> Maybe we can look at what Confluent <
>>> https://github.com/confluentinc
>>>>>>
>>>>>>>>> does
>>>>>>>>>> for kafka-connect. They have official extensions support for
>> JDBC,
>>>>> HDFS
>>>>>>>>> and
>>>>>>>>>> ElasticSearch under https://github.com/confluentinc. They put
>> them
>>>>>>> along
>>>>>>>>>> with other community extensions on
>>>>>>>>>> https://www.confluent.io/product/connectors/ for visibility.
>>>>>>>>>>
>>>>>>>>>> Although not a commercial company, can we have a GitHub user like
>>>>>>>>>> beam-community to host projects we build around beam but not
>>> suitable
>>>>>>> for
>>>>>>>>>> https://github.com/apache/incubator-beam. In the future, we may
>>> have
>>>>>>>>>> beam-algebra like http://github.com/twitter/algebird for algebra
>>>>>>>>> operations
>>>>>>>>>> and beam-ml / beam-dl for machine learning / deep learning. Also,
>>>>> there
>>>>>>>>>> will be beam related projects elsewhere maintained by other
>>>>>>>>>> communities. We can put all of them on the beam-website or like
>>> spark
>>>>>>>>>> packages as mentioned by Amit.
>>>>>>>>>>
>>>>>>>>>> My $0.02
>>>>>>>>>> Manu
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles
>>>>> <klk@google.com.invalid
>>>>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> On this point from Amit and Ismaël, I agree: we could benefit
>>> from a
>>>>>>>>> place
>>>>>>>>>>> for miscellaneous non-core helper transformations.
>>>>>>>>>>>
>>>>>>>>>>> We have sdks/java/extensions but it is organized as separate
>>>>>>> artifacts.
>>>>>>>>> I
>>>>>>>>>>> think that is fine, considering the nature of Join and
>> SortValues.
>>>>> But
>>>>>>>>> for
>>>>>>>>>>> simpler transforms, importing one artifact per tiny transform is
>>> too
>>>>>>>>> much
>>>>>>>>>>> overhead. It also seems unlikely that we will have enough
>>>>> commonality
>>>>>>>>>> among
>>>>>>>>>>> the transforms to call the artifact anything other than [some
>>>>> synonym
>>>>>>>>> for]
>>>>>>>>>>> "miscellaneous".
>>>>>>>>>>>
>>>>>>>>>>> I wouldn't want to take this too far - even though the SDK has many
>>>>>>>>>> transforms*
>>>>>>>>>>> that are not required for the model [1], I like that the SDK
>>>>> artifact
>>>>>>>>> has
>>>>>>>>>>> everything a user might need in their "getting started" phase of
>>>>> use.
>>>>>>>>> This
>>>>>>>>>>> user-friendliness (the user doesn't care that ParDo is core and
>>> Sum
>>>>> is
>>>>>>>>>> not)
>>>>>>>>>>> plus the difficulty of judging which transforms go where, are
>>>>> probably
>>>>>>>>> why
>>>>>>>>>>> we have them mostly all in one place.
>>>>>>>>>>>
>>>>>>>>>>> Models to look at, off the top of my head, include Pig's
>> PiggyBank
>>>>> and
>>>>>>>>>>> Apex's Malhar. These have different levels of support implied.
>>>>> Others?
>>>>>>>>>>>
>>>>>>>>>>> Kenn
>>>>>>>>>>>
>>>>>>>>>>> [1] ApproximateQuantiles, ApproximateUnique, Count, Distinct,
>>>>> Filter,
>>>>>>>>>>> FlatMapElements, Keys, Latest, MapElements, Max, Mean, Min,
>>> Values,
>>>>>>>>>> KvSwap,
>>>>>>>>>>> Partition, Regex, Sample, Sum, Top, Values, WithKeys,
>>> WithTimestamps
>>>>>>>>>>>
>>>>>>>>>>> * at least they are separate classes and not methods on
>>> PCollection
>>>>>>> :-)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Nov 9, 2016 at 6:03 AM, Ismaël Mejía <iemejia@gmail.com
>>>
>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Nice discussion, and thanks Jesse for bringing this subject
>>> back.
>>>>>>>>>>>>
>>>>>>>>>>>> I agree 100% with Amit and the idea of having a home for those
>>>>>>>>>> transforms
>>>>>>>>>>>> that are not core enough to be part of the sdk, but that we all
>>> end
>>>>>>> up
>>>>>>>>>>>> re-writing somehow.
>>>>>>>>>>>>
>>>>>>>>>>>> This is a needed improvement to be more developer friendly, but
>>>>> also
>>>>>>> as
>>>>>>>>>> a
>>>>>>>>>>>> reference of good practices of Beam development, and for this
>>>>> reason
>>>>>>> I
>>>>>>>>>>>> agree with JB that at this moment it would be better for these
>>>>>>>>>> transforms
>>>>>>>>>>>> to reside in the Beam repository at least for visibility
>> reasons.
>>>>>>>>>>>>
>>>>>>>>>>>> One additional question is if these transforms represent a
>>>>> different
>>>>>>>>> DSL
>>>>>>>>>>> or
>>>>>>>>>>>> if those could be grouped with the current extensions (e.g.
>> Join
>>>>> and
>>>>>>>>>>>> SortValues) into something more general that we as a community
>>>>> could
>>>>>>>>>>>> maintain, but well even if it is not the case, it would be
>> really
>>>>>>> nice
>>>>>>>>>> to
>>>>>>>>>>>> start working on something like this.
>>>>>>>>>>>>
>>>>>>>>>>>> Ismaël Mejía
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré <
>>>>>>> jb@nanthrax.net
>>>>>>>>>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Related to spark-package, we also have Apache Bahir to host
>>>>>>>>>>>>> connectors/transforms for Spark and Flink.
>>>>>>>>>>>>>
>>>>>>>>>>>>> IMHO, right now, Beam should host this, not sure if it makes
>>> sense
>>>>>>>>>>>>> directly in the core.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It reminds me of the "Integration" DSL we discussed in the
>>> technical
>>>>>>>>>>> vision
>>>>>>>>>>>>> document.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards
>>>>>>>>>>>>> JB
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 11/09/2016 11:17 AM, Amit Sela wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think Jesse has a very good point on one hand, while Luke's
>>> and
>>>>>>>>>>>>>> Kenneth's
>>>>>>>>>>>>>> worries about committing users to specific implementations is
>>> in
>>>>>>>>>>> place.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The Spark community has a 3rd party repository for useful
>>>>> libraries
>>>>>>>>>>> that
>>>>>>>>>>>>>> for various reasons are not a part of the Apache Spark
>> project:
>>>>>>>>>>>>>> https://spark-packages.org/.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Maybe a "common-transformations" package would serve both
>> users
>>>>>>> quick
>>>>>>>>>>>>>> ramp-up and ease-of-use while keeping Beam more "enabling" ?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles
>>>>>>>>>> <klk@google.com.invalid
>>>>>>>>>>>>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It seems useful for small scale debugging / demoing to have
>>>>>>>>>>>>>>> Dump.toString(). I think it should be named to clearly
>>> indicate
>>>>>>> its
>>>>>>>>>>>>>>> limited
>>>>>>>>>>>>>>> scope. Maybe other stuff could go in the Dump namespace, but
>>>>>>>>>>>>>>> "Dump.toJson()" would be for humans to read - so it should
>> be
>>>>>>> pretty
>>>>>>>>>>>>>>> printed, not treated as a machine-to-machine wire format.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The broader question of representing data in JSON or XML,
>> etc,
>>>>> is
>>>>>>>>>>>> already
>>>>>>>>>>>>>>> the subject of many mature libraries which are already easy
>> to
>>>>> use
>>>>>>>>>>> with
>>>>>>>>>>>>>>> Beam.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The more esoteric practice of implicit or semi-implicit
>>>>> coercions
>>>>>>>>>>> seems
>>>>>>>>>>>>>>> like it is also already addressed in many ways elsewhere.
>>>>>>>>>>>>>>> Transform.via(TypeConverter) is basically the same as
>>>>>>>>>>>>>>> MapElements.via(<lambda>) and also easy to use with Beam.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In both of the last cases, there are many reasonable
>>> approaches,
>>>>>>> and
>>>>>>>>>>> we
>>>>>>>>>>>>>>> shouldn't commit our users to one of them.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik
>>>>>>>>>>> <lcwik@google.com.invalid
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The suggestions you give seem good except for the XML
>>> cases.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Might want to have the XML be a document per line similar
>> to
>>>>> the
>>>>>>>>>>> JSON
>>>>>>>>>>>>>>>> examples you have been giving.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson <
>>>>>>>>>>>> jesse@smokinghand.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @lukasz Agreed there would have to be KV handling. I was
>> more
>>>>>>> thinking
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> whatever the addition, it shouldn't just handle KV. It
>> should
>>>>>>>>>> handle
>>>>>>>>>>>>>>>>> Iterables, Lists, Sets, and KVs.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For JSON and XML, I wonder if we'd be able to give someone
>>>>>>>>>>> something
>>>>>>>>>>>>>>>>> general purpose enough that you would just end up writing
>>> your
>>>>>>> own
>>>>>>>>>>>> code
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> handle it anyway.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Here are some ideas on what it could look like with a
>> method
>>>>> and
>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> resulting string output:
>>>>>>>>>>>>>>>>> *Stringify.toJSON()*
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> With KV:
>>>>>>>>>>>>>>>>> {"key": "value"}
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> With Iterables:
>>>>>>>>>>>>>>>>> ["one", "two", "three"]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *Stringify.toXML("rootelement")*
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> With KV:
>>>>>>>>>>>>>>>>> <rootelement key=value />
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> With Iterables:
>>>>>>>>>>>>>>>>> <rootelement>
>>>>>>>>>>>>>>>>>   <item>one</item>
>>>>>>>>>>>>>>>>>   <item>two</item>
>>>>>>>>>>>>>>>>>   <item>three</item>
>>>>>>>>>>>>>>>>> </rootelement>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *Stringify.toDelimited(",")*
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> With KV:
>>>>>>>>>>>>>>>>> key,value
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> With Iterables:
>>>>>>>>>>>>>>>>> one,two,three
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Do you think that would strike a good balance between
>>> reusable
>>>>>>>>>> code
>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> writing your own for more difficult formatting?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Jesse
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik
>>>>>>>>>>> <lcwik@google.com.invalid
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Jesse, I believe if one format gets special treatment in
>>>>> TextIO,
>>>>>>>>>>>> people
>>>>>>>>>>>>>>>>> will then ask why JSON, XML, ... are also not
>> supported.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Also, the example that you provide is using the fact that
>>> the
>>>>>>>>>> input
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> format
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> is an Iterable<Item>. You had posted a question about
>> using
>>> KV
>>>>>>>>>> with
>>>>>>>>>>>>>>>>> TextIO.Write which wouldn't align with the proposed input
>>>>> format
>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> still
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> would require to write a type conversion function, this
>> time
>>>>>>> from
>>>>>>>>>>> KV
>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> Iterable<Item> instead of KV to string.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson <
>>>>>>>>>>>> jesse@smokinghand.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Lukasz,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I don't think you'd need complicated logic for
>>> TextIO.Write.
>>>>>>> For
>>>>>>>>>>> CSV
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> call would look like:
>>>>>>>>>>>>>>>>>> Stringify.to("", ",", "\n");
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Where the arguments would be Stringify.to(prefix,
>>> delimiter,
>>>>>>>>>>>> suffix).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The code would be something like:
>>>>>>>>>>>>>>>>>> StringBuffer buffer = new StringBuffer(prefix);
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> for (Item item : list) {
>>>>>>>>>>>>>>>>>>   buffer.append(item.toString());
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>   if(notLast) {
>>>>>>>>>>>>>>>>>>     buffer.append(delimiter);
>>>>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> buffer.append(suffix);
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> c.output(buffer.toString());
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> That would allow you to do the basic CSV, TSV, and other
>>>>>>> formats
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> without
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> complicated logic. The same sort of thing could be done
>> for
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> TextIO.Write.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Jesse
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik
>>>>>>>>>>>> <lcwik@google.com.invalid
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The conversion from object to string will have uses
>> outside
>>>>> of
>>>>>>>>>>> just
>>>>>>>>>>>>>>>>>>> TextIO.Write so it seems logical that we would want to
>>> have
>>>>> a
>>>>>>>>>>> ParDo
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> conversion.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Text file formats have a lot of variance, even if you
>>>>> consider
>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> subset
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> of CSV like formats where it could have fixed width
>> fields,
>>>>> or
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> escaping
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> quoting around other fields, or headers that should be
>>>>> placed
>>>>>>> at
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> top.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Having all these format conversions within TextIO.Write
>>>>> seems
>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> lot
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> logic to contain in that transform which should just
>> focus
>>>>> on
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> writing
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> files.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 8:15 AM, Jesse Anderson <
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> jesse@smokinghand.com>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This is a thread moved over from the user mailing list.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I think there needs to be a way to convert a
>>>>> PCollection<KV>
>>>>>>> to
>>>>>>>>>>>>>>>>>>>> PCollection<String>.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> To do a minimal WordCount, you have to manually convert
>>> the
>>>>>>> KV
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> to a
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> String:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>         p
>>>>>>>>>>>>>>>>>>>>
>>>>>  .apply(TextIO.Read.from("playing_cards.tsv"))
>>>>>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
>>>>>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
>>>>>>>>>>>>>>>>>>>> *                .apply(MapElements.via((KV<String,
>> Long>
>>>>>>>>>> count)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ->*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *                            count.getKey() + ":" +
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> count.getValue()*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *                        ).withOutputType(
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> TypeDescriptors.strings()))*
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to
>>>>>>> ("output/stringcounts"));
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> This code really should be something like:
>>>>>>>>>>>>>>>>>>>>         p
>>>>>>>>>>>>>>>>>>>>
>>>>>  .apply(TextIO.Read.from("playing_cards.tsv"))
>>>>>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
>>>>>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
>>>>>>>>>>>>>>>>>>>> *                .apply(ToString.stringify())*
>>>>>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to
>>>>>>>>> ("output/stringcounts"));
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> To summarize the discussion:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>    - JA: Add a method to StringDelegateCoder to output
>>> any
>>>>> KV
>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> list
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    - JA and DH: Add a SimpleFunction that takes a type
>> and
>>>>> runs
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> toString()
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>    on it:
>>>>>>>>>>>>>>>>>>>>    class ToStringFn<InputT> extends
>>> SimpleFunction<InputT,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> String>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>        public String apply(InputT input) {
>>>>>>>>>>>>>>>>>>>>            return input.toString();
>>>>>>>>>>>>>>>>>>>>        }
>>>>>>>>>>>>>>>>>>>>    }
>>>>>>>>>>>>>>>>>>>>    - JB: Add a general purpose type converter like in
>>>>> Apache
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Camel.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    - JA: Add Object support to TextIO.Write that would
>> write
>>>>> out
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    toString of any Object.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> My thoughts:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Is converting to a PCollection<String> mostly needed
>> when
>>>>>>>>>> you're
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> using
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> TextIO.Write? Will a general purpose transform only work
>> in
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> certain
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> cases
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> and you'll normally have to write custom code to format the
>>>>>>> strings
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> way
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> you want them?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> IMHO, it's yes to both. I'd prefer to add Object
>> support
>>> to
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> TextIO.Write
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> a SimpleFunction that takes a delimiter as an argument.
>>>>>>> Making
>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>> SimpleFunction that's able to specify a delimiter (and
>>>>>>> perhaps
>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> prefix
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> suffix) should cover the majority of formats and cases.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Jesse
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Jean-Baptiste Onofré
>>>>>>>>>>>>> jbonofre@apache.org
>>>>>>>>>>>>> http://blog.nanthrax.net
>>>>>>>>>>>>> Talend - http://www.talend.com
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jean-Baptiste Onofré
>>>>>>>>> jbonofre@apache.org
>>>>>>>>> http://blog.nanthrax.net
>>>>>>>>> Talend - http://www.talend.com
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jean-Baptiste Onofré
>>>>>>> jbonofre@apache.org
>>>>>>> http://blog.nanthrax.net
>>>>>>> Talend - http://www.talend.com
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Jean-Baptiste Onofré
>>>>> jbonofre@apache.org
>>>>> http://blog.nanthrax.net
>>>>> Talend - http://www.talend.com
>>>>>
>>>>
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbonofre@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: PCollection to PCollection Conversion

Posted by Vikas Kedigehalli <vi...@gmail.com>.

Hi All,

  Not being aware of the discussion here, I sent out a PR
<https://github.com/apache/beam/pull/1704> but JB and others directed me to
this thread. Having converted PCollection<T> to PCollection<String> several
times, I feel something like a 'ToString' transform is common enough to be
part of the core. What do you all think?
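
For illustration only, here is a minimal plain-Java sketch of the logic such a
transform could wrap. It is not the code from the PR and it assumes nothing
about the final API: the names ToStringSketch, elementToString and join are
hypothetical, and the Beam wiring (for example through MapElements) is left out
so the snippet stands alone. The join method follows the prefix/delimiter/suffix
idea raised earlier in the thread.

```java
import java.util.Arrays;
import java.util.Iterator;

/** Hypothetical helper for discussion; not part of the Beam SDK. */
final class ToStringSketch {

  /** Per-element conversion: what a ToString-style transform would apply to each element. */
  static <T> String elementToString(T element) {
    // String.valueOf avoids a NullPointerException on null elements.
    return String.valueOf(element);
  }

  /** Builds prefix + e1 + delimiter + e2 + ... + suffix, with no trailing delimiter. */
  static String join(String prefix, String delimiter, String suffix, Iterable<?> elements) {
    StringBuilder sb = new StringBuilder(prefix);
    Iterator<?> it = elements.iterator();
    while (it.hasNext()) {
      sb.append(elementToString(it.next()));
      if (it.hasNext()) {
        // Only add the delimiter between elements, never after the last one.
        sb.append(delimiter);
      }
    }
    return sb.append(suffix).toString();
  }

  public static void main(String[] args) {
    // A CSV-style line: empty prefix, comma delimiter, newline suffix.
    System.out.print(join("", ",", "\n", Arrays.asList("one", "two", "three")));
  }
}
```

In a pipeline, elementToString would be the body of the mapping function, and
join would only matter for elements that are themselves iterables.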

Also, if someone else is already working on or interested in tackling this,
then I am happy to discard the PR.

Regards,
Vikas

On Tue, Dec 13, 2016 at 1:56 AM, Amit Sela <am...@gmail.com> wrote:

> It seems that there were a lot of good points raised here, and I tend to
> agree that something as trivial and lean as "ToString" should be a part of
> core.
> I'm particularly fond of makeString(prefix, toString, suffix) in various
> combinations (Scala-like).
> For "fromString", I think JB has a good point leveraging JAXB and Jackson -
> though I think this should be in extensions as it is not as lean as
> toString.
>
> Thanks,
> Amit
>
> On Wed, Nov 30, 2016 at 5:13 AM Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
>
> > Hi Jesse,
> >
> > yes, I started something there (using JAXB and Jackson). Let me polish
> > and push.
> >
> > Regards
> > JB
> >
> > On 11/29/2016 10:00 PM, Jesse Anderson wrote:
> > > I went through the string conversions. Do you have an example of
> writing
> > > out XML/JSON/etc too?
> > >
> > > On Tue, Nov 29, 2016 at 3:46 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
> > > wrote:
> > >
> > >> Hi Jesse,
> > >>
> > >>
> > >>
> > https://github.com/jbonofre/incubator-beam/tree/DATAFORMAT/sdks/java/
> extensions/dataformat
> > >>
> > >> it's very simple and stupid and of course not complete at all (I have
> > >> other commits but not merged as they need some polishing), but as I
> > >> said, it's a base of discussion.
> > >>
> > >> Regards
> > >> JB
> > >>
> > >> On 11/29/2016 09:23 PM, Jesse Anderson wrote:
> > >>> @jb Sounds good. Just let us know once you've pushed.
> > >>>
> > >>> On Tue, Nov 29, 2016 at 2:54 PM Jean-Baptiste Onofré <
> jb@nanthrax.net>
> > >>> wrote:
> > >>>
> > >>>> Good point Eugene.
> > >>>>
> > >>>> Right now, it's a DoFn collection to experiment a bit (a pure
> > >>>> extension). It's pretty stupid ;)
> > >>>>
> > >>>> But, you are right, depending on the direction of such an extension, it
> > could
> > >>>> cover more use cases (even if it's not my first intention ;)).
> > >>>>
> > >>>> Let me push the branch (pretty small) as an illustration, and in the
> > >>>> mean time, I'm preparing a document (more focused on the use cases).
> > >>>>
> > >>>> WDYT ?
> > >>>>
> > >>>> Regards
> > >>>> JB
> > >>>>
> > >>>> On 11/29/2016 08:47 PM, Eugene Kirpichov wrote:
> > >>>>> Hi JB,
> > >>>>> Depending on the scope of what you want to ultimately accomplish
> with
> > >>>> this
> > >>>>> extension, I think it may make sense to write a proposal document
> and
> > >>>>> discuss it.
> > >>>>> If it's just a collection of utility DoFn's for various
> well-defined
> > >>>>> source/target format pairs, then that's probably not needed, but if
> > >> it's
> > >>>>> anything more, then I think it is.
> > >>>>> That will help avoid a lot of churn if people propose reasonable
> > >>>>> significant changes.
> > >>>>>
> > >>>>> On Tue, Nov 29, 2016 at 11:15 AM Jean-Baptiste Onofré <
> > jb@nanthrax.net
> > >>>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> By the way Jesse, I'm going to push my DATAFORMAT branch on my GitHub
> > and I
> > >>>>>> will post on the dev mailing list when done.
> > >>>>>>
> > >>>>>> Regards
> > >>>>>> JB
> > >>>>>>
> > >>>>>> On 11/29/2016 07:01 PM, Jesse Anderson wrote:
> > >>>>>>> I want to bring this thread back up since we've had time to think
> > >> about
> > >>>>>> it
> > >>>>>>> more and make a plan.
> > >>>>>>>
> > >>>>>>> I think a format-specific converter will be a more time-consuming
> > task
> > >>>> than
> > >>>>>>> we originally thought. It'd have to be a writer that takes
> another
> > >>>> writer
> > >>>>>>> as a parameter.
> > >>>>>>>
> > >>>>>>> I think a string converter can be done as a simple transform.
> > >>>>>>>
> > >>>>>>> I think we should start with a simple string converter and plan
> > for a
> > >>>>>>> format-specific writer.
> > >>>>>>>
> > >>>>>>> What are your thoughts?
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>>
> > >>>>>>> Jesse
> > >>>>>>>
> > >>>>>>> On Thu, Nov 10, 2016 at 10:33 AM Jesse Anderson <
> > >> jesse@smokinghand.com
> > >>>>>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>> I was thinking about what the outputs would look like last
> night. I
> > >>>>>>> realized that more complex formats like JSON and XML may or may
> not
> > >>>>>> output
> > >>>>>>> the data in a valid format.
> > >>>>>>>
> > >>>>>>> Doing a direct conversion on unbounded collections would work
> just
> > >>>> fine.
> > >>>>>>> They're self-contained. For writing out bounded collections,
> that's
> > >>>> where
> > >>>>>>> we'll hit the issues. This changes the uber conversion transform
> > >> into a
> > >>>>>>> transform that needs to be a writer.
> > >>>>>>>
> > >>>>>>> If a transform executes a JSON conversion on a per element basis,
> > >> we'd
> > >>>>>> get
> > >>>>>>> this:
> > >>>>>>> {
> > >>>>>>> "key": "value"
> > >>>>>>> }, {
> > >>>>>>> "key": "value"
> > >>>>>>> },
> > >>>>>>>
> > >>>>>>> That isn't valid JSON.
> > >>>>>>>
> > >>>>>>> The conversion transform would need to do several things
> when
> > >>>>>> writing
> > >>>>>>> out a file. It would need to add brackets for an array. Now we
> > have:
> > >>>>>>> [
> > >>>>>>> {
> > >>>>>>> "key": "value"
> > >>>>>>> }, {
> > >>>>>>> "key": "value"
> > >>>>>>> },
> > >>>>>>> ]
> > >>>>>>>
> > >>>>>>> We still don't have valid JSON. We have to remove the last comma
> or
> > >>>> have
> > >>>>>>> the uber transform start putting in the commas, except for the
> last
> > >>>>>> element.
> > >>>>>>>
> > >>>>>>> [
> > >>>>>>> {
> > >>>>>>> "key": "value"
> > >>>>>>> }, {
> > >>>>>>> "key": "value"
> > >>>>>>> }
> > >>>>>>> ]
> > >>>>>>>
> > >>>>>>> Only by doing this do we have valid JSON.
> > >>>>>>>
> > >>>>>>> I'd argue we'd have a similar issue with XML. Some parsers
> require
> > a
> > >>>> root
> > >>>>>>> element for everything. The uber transform would have to put the
> > root
> > >>>>>>> element tags at the beginning and end of the file.
> > >>>>>>>
> > >>>>>>> On Wed, Nov 9, 2016 at 11:36 PM Manu Zhang <
> > owenzhang1990@gmail.com>
> > >>>>>> wrote:
> > >>>>>>>
> > >>>>>>> I would love to see a lean core and abundant Transforms at the
> same
> > >>>> time.
> > >>>>>>>
> > >>>>>>> Maybe we can look at what Confluent <
> > https://github.com/confluentinc
> > >>>
> > >>>>>> does
> > >>>>>>> for kafka-connect. They have official extensions support for
> JDBC,
> > >> HDFS
> > >>>>>> and
> > >>>>>>> ElasticSearch under https://github.com/confluentinc. They put
> them
> > >>>> along
> > >>>>>>> with other community extensions on
> > >>>>>>> https://www.confluent.io/product/connectors/ for visibility.
> > >>>>>>>
> > >>>>>>> Although not a commercial company, can we have a GitHub user like
> > >>>>>>> beam-community to host projects we build around beam but not
> > suitable
> > >>>> for
> > >>>>>>> https://github.com/apache/incubator-beam. In the future, we may
> > have
> > >>>>>>> beam-algebra like http://github.com/twitter/algebird for algebra
> > >>>>>> operations
> > >>>>>>> and beam-ml / beam-dl for machine learning / deep learning. Also,
> > >> there
> > >>>>>>> will be beam related projects elsewhere maintained by other
> > >>>>>>> communities. We can put all of them on the beam-website or like
> > spark
> > >>>>>>> packages as mentioned by Amit.
> > >>>>>>>
> > >>>>>>> My $0.02
> > >>>>>>> Manu
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles
> > >> <klk@google.com.invalid
> > >>>>>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> On this point from Amit and Ismaël, I agree: we could benefit
> > from a
> > >>>>>> place
> > >>>>>>>> for miscellaneous non-core helper transformations.
> > >>>>>>>>
> > >>>>>>>> We have sdks/java/extensions but it is organized as separate
> > >>>> artifacts.
> > >>>>>> I
> > >>>>>>>> think that is fine, considering the nature of Join and
> SortValues.
> > >> But
> > >>>>>> for
> > >>>>>>>> simpler transforms, importing one artifact per tiny transform is
> > too
> > >>>>>> much
> > >>>>>>>> overhead. It also seems unlikely that we will have enough
> > >> commonality
> > >>>>>>> among
> > >>>>>>>> the transforms to call the artifact anything other than [some
> > >> synonym
> > >>>>>> for]
> > >>>>>>>> "miscellaneous".
> > >>>>>>>>
> > >>>>>>>> I wouldn't want to take this too far - even though the SDK has many
> > >>>>>>> transforms*
> > >>>>>>>> that are not required for the model [1], I like that the SDK
> > >> artifact
> > >>>>>> has
> > >>>>>>>> everything a user might need in their "getting started" phase of
> > >> use.
> > >>>>>> This
> > >>>>>>>> user-friendliness (the user doesn't care that ParDo is core and
> > Sum
> > >> is
> > >>>>>>> not)
> > >>>>>>>> plus the difficulty of judging which transforms go where, are
> > >> probably
> > >>>>>> why
> > >>>>>>>> we have them mostly all in one place.
> > >>>>>>>>
> > >>>>>>>> Models to look at, off the top of my head, include Pig's
> PiggyBank
> > >> and
> > >>>>>>>> Apex's Malhar. These have different levels of support implied.
> > >> Others?
> > >>>>>>>>
> > >>>>>>>> Kenn
> > >>>>>>>>
> > >>>>>>>> [1] ApproximateQuantiles, ApproximateUnique, Count, Distinct,
> > >> Filter,
> > >>>>>>>> FlatMapElements, Keys, Latest, MapElements, Max, Mean, Min,
> > Values,
> > >>>>>>> KvSwap,
> > >>>>>>>> Partition, Regex, Sample, Sum, Top, Values, WithKeys,
> > WithTimestamps
> > >>>>>>>>
> > >>>>>>>> * at least they are separate classes and not methods on
> > PCollection
> > >>>> :-)
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Wed, Nov 9, 2016 at 6:03 AM, Ismaël Mejía <iemejia@gmail.com
> >
> > >>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Nice discussion, and thanks Jesse for bringing this subject
> > back.
> > >>>>>>>>>
> > >>>>>>>>> I agree 100% with Amit and the idea of having a home for those
> > >>>>>>> transforms
> > >>>>>>>>> that are not core enough to be part of the sdk, but that we all
> > end
> > >>>> up
> > >>>>>>>>> re-writing somehow.
> > >>>>>>>>>
> > >>>>>>>>> This is a needed improvement to be more developer friendly, but
> > >> also
> > >>>> as
> > >>>>>>> a
> > >>>>>>>>> reference of good practices of Beam development, and for this
> > >> reason
> > >>>> I
> > >>>>>>>>> agree with JB that at this moment it would be better for these
> > >>>>>>> transforms
> > >>>>>>>>> to reside in the Beam repository at least for visibility
> reasons.
> > >>>>>>>>>
> > >>>>>>>>> One additional question is if these transforms represent a
> > >> different
> > >>>>>> DSL
> > >>>>>>>> or
> > >>>>>>>>> if those could be grouped with the current extensions (e.g.
> Join
> > >> and
> > >>>>>>>>> SortValues) into something more general that we as a community
> > >> could
> > >>>>>>>>> maintain, but well even if it is not the case, it would be
> really
> > >>>> nice
> > >>>>>>> to
> > >>>>>>>>> start working on something like this.
> > >>>>>>>>>
> > >>>>>>>>> Ismaël Mejía
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré <
> > >>>> jb@nanthrax.net
> > >>>>>>>
> > >>>>>>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Related to spark-package, we also have Apache Bahir to host
> > >>>>>>>>>> connectors/transforms for Spark and Flink.
> > >>>>>>>>>>
> > >>>>>>>>>> IMHO, right now, Beam should host this, not sure if it makes
> > sense
> > >>>>>>>>>> directly in the core.
> > >>>>>>>>>>
> > >>>>>>>>>> It reminds me of the "Integration" DSL we discussed in the
> > technical
> > >>>>>>>> vision
> > >>>>>>>>>> document.
> > >>>>>>>>>>
> > >>>>>>>>>> Regards
> > >>>>>>>>>> JB
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On 11/09/2016 11:17 AM, Amit Sela wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> I think Jesse has a very good point on one hand, while Luke's and
> > >>>>>>>>>>> Kenneth's worries about committing users to specific implementations
> > >>>>>>>>>>> are valid on the other.
> > >>>>>>>>>>>
> > >>>>>>>>>>> The Spark community has a 3rd party repository for useful
> > >> libraries
> > >>>>>>>> that
> > >>>>>>>>>>> for various reasons are not a part of the Apache Spark
> project:
> > >>>>>>>>>>> https://spark-packages.org/.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Maybe a "common-transformations" package would serve both
> users
> > >>>> quick
> > >>>>>>>>>>> ramp-up and ease-of-use while keeping Beam more "enabling" ?
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles
> > >>>>>>> <klk@google.com.invalid
> > >>>>>>>>>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> It seems useful for small scale debugging / demoing to have
> > >>>>>>>>>>>> Dump.toString(). I think it should be named to clearly
> > indicate
> > >>>> its
> > >>>>>>>>>>>> limited
> > >>>>>>>>>>>> scope. Maybe other stuff could go in the Dump namespace, but
> > >>>>>>>>>>>> "Dump.toJson()" would be for humans to read - so it should
> be
> > >>>> pretty
> > >>>>>>>>>>>> printed, not treated as a machine-to-machine wire format.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The broader question of representing data in JSON or XML,
> etc,
> > >> is
> > >>>>>>>>> already
> > >>>>>>>>>>>> the subject of many mature libraries which are already easy
> to
> > >> use
> > >>>>>>>> with
> > >>>>>>>>>>>> Beam.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The more esoteric practice of implicit or semi-implicit
> > >> coercions
> > >>>>>>>> seems
> > >>>>>>>>>>>> like it is also already addressed in many ways elsewhere.
> > >>>>>>>>>>>> Transform.via(TypeConverter) is basically the same as
> > >>>>>>>>>>>> MapElements.via(<lambda>) and also easy to use with Beam.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> In both of the last cases, there are many reasonable
> > approaches,
> > >>>> and
> > >>>>>>>> we
> > >>>>>>>>>>>> shouldn't commit our users to one of them.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik
> > >>>>>>>> <lcwik@google.com.invalid
> > >>>>>>>>>>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> The suggestions you give seem good except for the XML
> > cases.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Might want to have the XML be a document per line similar
> to
> > >> the
> > >>>>>>>> JSON
> > >>>>>>>>>>>>> examples you have been giving.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson <
> > >>>>>>>>> jesse@smokinghand.com>
> > >>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> @lukasz Agreed there would have to be KV handling. I was more thinking
> > >>>>>>>>>>>>>> that whatever the addition, it shouldn't just handle KV. It should
> > >>>>>>>>>>>>>> handle Iterables, Lists, Sets, and KVs.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> For JSON and XML, I wonder if we'd be able to give someone something
> > >>>>>>>>>>>>>> general purpose enough, or if you would just end up writing your own
> > >>>>>>>>>>>>>> code to handle it anyway.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Here are some ideas on what it could look like with a
> method
> > >> and
> > >>>>>>>> the
> > >>>>>>>>>>>>>> resulting string output:
> > >>>>>>>>>>>>>> *Stringify.toJSON()*
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> With KV:
> > >>>>>>>>>>>>>> {"key": "value"}
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> With Iterables:
> > >>>>>>>>>>>>>> ["one", "two", "three"]
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> *Stringify.toXML("rootelement")*
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> With KV:
> > >>>>>>>>>>>>>> <rootelement key="value" />
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> With Iterables:
> > >>>>>>>>>>>>>> <rootelement>
> > >>>>>>>>>>>>>>   <item>one</item>
> > >>>>>>>>>>>>>>   <item>two</item>
> > >>>>>>>>>>>>>>   <item>three</item>
> > >>>>>>>>>>>>>> </rootelement>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> *Stringify.toDelimited(",")*
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> With KV:
> > >>>>>>>>>>>>>> key,value
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> With Iterables:
> > >>>>>>>>>>>>>> one,two,three
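
The delimited case above can be sketched in plain Java. This is only an illustration under stated assumptions: DelimitedStrings, joinItems and joinEntry are hypothetical names that are not part of Beam, and java.util.Map.Entry stands in for Beam's KV so the snippet has no dependencies.

```java
import java.util.Map;

/** Hypothetical helper, not a Beam API; mirrors the Stringify.toDelimited(",") idea. */
class DelimitedStrings {

  /** Joins the string form of each item with the delimiter, wrapped in prefix and suffix. */
  static String joinItems(Iterable<?> items, String prefix, String delimiter, String suffix) {
    StringBuilder out = new StringBuilder(prefix);
    boolean first = true;
    for (Object item : items) {
      // Emit the delimiter before every item except the first.
      if (!first) {
        out.append(delimiter);
      }
      out.append(item);
      first = false;
    }
    return out.append(suffix).toString();
  }

  /** Formats a key/value pair; Map.Entry is an assumption standing in for KV. */
  static String joinEntry(Map.Entry<?, ?> kv, String delimiter) {
    return kv.getKey() + delimiter + kv.getValue();
  }
}
```

Calling joinItems on a list of "one", "two", "three" with an empty prefix and suffix and "," as the delimiter gives the Iterable output shown above; joinEntry covers the KV case. Neither method escapes a delimiter that occurs inside a value.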
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Do you think that would strike a good balance between
> > reusable
> > >>>>>>> code
> > >>>>>>>>> and
> > >>>>>>>>>>>>>> writing your own for more difficult formatting?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Jesse
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik
> > >>>>>>>> <lcwik@google.com.invalid
> > >>>>>>>>>>
> > >>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Jesse, I believe if one format gets special treatment in TextIO, people
> > >>>>>>>>>>>>>> will then ask why JSON, XML, ... are not also supported.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Also, the example that you provide is using the fact that the input
> > >>>>>>>>>>>>>> format is an Iterable<Item>. You had posted a question about using KV
> > >>>>>>>>>>>>>> with TextIO.Write, which wouldn't align with the proposed input format
> > >>>>>>>>>>>>>> and would still require writing a type conversion function, this time
> > >>>>>>>>>>>>>> from KV to Iterable<Item> instead of KV to string.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson <
> > >>>>>>>>> jesse@smokinghand.com>
> > >>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Lukasz,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> I don't think you'd need complicated logic for
> > TextIO.Write.
> > >>>> For
> > >>>>>>>> CSV
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> call would look like:
> > >>>>>>>>>>>>>>> Stringify.to("", ",", "\n");
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Where the arguments would be Stringify.to(prefix,
> > delimiter,
> > >>>>>>>>> suffix).
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> The code would be something like:
> > >>>>>>>>>>>>>>> StringBuilder buffer = new StringBuilder(prefix);
> > >>>>>>>>>>>>>>> boolean first = true;
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> for (Item item : list) {
> > >>>>>>>>>>>>>>>   // Emit the delimiter before every item except the first.
> > >>>>>>>>>>>>>>>   if (!first) {
> > >>>>>>>>>>>>>>>     buffer.append(delimiter);
> > >>>>>>>>>>>>>>>   }
> > >>>>>>>>>>>>>>>   buffer.append(item.toString());
> > >>>>>>>>>>>>>>>   first = false;
> > >>>>>>>>>>>>>>> }
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> buffer.append(suffix);
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> c.output(buffer.toString());
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> That would allow you to do the basic CSV, TSV, and other formats
> > >>>>>>>>>>>>>>> without complicated logic. The same sort of thing could be done for
> > >>>>>>>>>>>>>>> TextIO.Write.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Jesse
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik
> > >>>>>>>>> <lcwik@google.com.invalid
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> The conversion from object to string will have uses outside of just
> > >>>>>>>>>>>>>>>> TextIO.Write so it seems logical that we would want to have a ParDo do
> > >>>>>>>>>>>>>>>> the conversion.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Text file formats have a lot of variance, even if you consider the
> > >>>>>>>>>>>>>>>> subset of CSV like formats where it could have fixed width fields, or
> > >>>>>>>>>>>>>>>> escaping and quoting around other fields, or headers that should be
> > >>>>>>>>>>>>>>>> placed at the top.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Having all these format conversions within TextIO.Write seems like a
> > >>>>>>>>>>>>>>>> lot of logic to contain in that transform which should just focus on
> > >>>>>>>>>>>>>>>> writing to files.
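
To make the escaping and quoting point concrete, here is a minimal sketch of one common CSV convention (the one described in RFC 4180): a field containing a comma, a double quote, or a line break is wrapped in double quotes, and embedded double quotes are doubled. CsvLine, quote and format are hypothetical names for illustration only; other CSV dialects, fixed-width fields, and header rows would each need different logic, which is the variance described above.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper, not a Beam API; handles only RFC 4180 style quoting. */
class CsvLine {

  /** Quotes a field only when it contains a comma, a double quote, or a line break. */
  static String quote(String field) {
    boolean needsQuotes = field.contains(",") || field.contains("\"")
        || field.contains("\n") || field.contains("\r");
    if (!needsQuotes) {
      return field;
    }
    // Double any embedded quotes, then wrap the whole field in quotes.
    return "\"" + field.replace("\"", "\"\"") + "\"";
  }

  /** Formats one record as a comma-separated line without a trailing line break. */
  static String format(List<String> fields) {
    return fields.stream().map(CsvLine::quote).collect(Collectors.joining(","));
  }
}
```

Even this small case already needs per-field logic that a plain delimiter join does not have, which may be an argument for keeping it out of the write transform itself.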
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 8:15 AM, Jesse Anderson <
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> jesse@smokinghand.com>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> This is a thread moved over from the user mailing list.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I think there needs to be a way to convert a PCollection<KV> to
> > >>>>>>>>>>>>>>>>> PCollection<String>.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> To do a minimal WordCount, you have to manually convert the KV to a
> > >>>>>>>>>>>>>>>>> String:
> > >>>>>>>>>>>>>>>>>         p
> > >>>>>>>>>>>>>>>>>                 .apply(TextIO.Read.from("playing_cards.tsv"))
> > >>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> > >>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> > >>>>>>>>>>>>>>>>>                 .apply(MapElements.via((KV<String, Long> count) ->
> > >>>>>>>>>>>>>>>>>                             count.getKey() + ":" + count.getValue()
> > >>>>>>>>>>>>>>>>>                         ).withOutputType(TypeDescriptors.strings()))
> > >>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to("output/stringcounts"));
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> This code really should be something like:
> > >>>>>>>>>>>>>>>>>         p
> > >>>>>>>>>>>>>>>>>                 .apply(TextIO.Read.from("playing_cards.tsv"))
> > >>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> > >>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> > >>>>>>>>>>>>>>>>>                 .apply(ToString.stringify())
> > >>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to("output/stringcounts"));
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> To summarize the discussion:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>    - JA: Add a method to StringDelegateCoder to output any KV or
> > >>>>>>>>>>>>>>>>>    list
> > >>>>>>>>>>>>>>>>>    - JA and DH: Add a SimpleFunction that takes a type and runs
> > >>>>>>>>>>>>>>>>>    toString() on it:
> > >>>>>>>>>>>>>>>>>    class ToStringFn<InputT> extends SimpleFunction<InputT, String> {
> > >>>>>>>>>>>>>>>>>        @Override
> > >>>>>>>>>>>>>>>>>        public String apply(InputT input) {
> > >>>>>>>>>>>>>>>>>            return input.toString();
> > >>>>>>>>>>>>>>>>>        }
> > >>>>>>>>>>>>>>>>>    }
> > >>>>>>>>>>>>>>>>>    - JB: Add a general purpose type converter like in Apache Camel.
> > >>>>>>>>>>>>>>>>>    - JA: Add Object support to TextIO.Write that would write out the
> > >>>>>>>>>>>>>>>>>    toString of any Object.
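
For readers who want to try the ToStringFn idea without a Beam dependency, the following is a rough stand-alone analogue. It assumes java.util.function.Function as a stand-in for Beam's SimpleFunction, and mapAll is a hypothetical helper that plays the role MapElements would play in a pipeline. Note that apply is an instance method: a static method cannot refer to the class's type parameter InputT.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Stand-alone analogue of the proposed ToStringFn; Function replaces SimpleFunction here. */
class ToStringFn<InputT> implements Function<InputT, String> {

  /** Returns the string form of the input; String.valueOf also tolerates null. */
  @Override
  public String apply(InputT input) {
    return String.valueOf(input);
  }

  /** Hypothetical helper: applies ToStringFn to every element of a list. */
  static <T> List<String> mapAll(List<T> inputs) {
    return inputs.stream().map(new ToStringFn<T>()).collect(Collectors.toList());
  }
}
```

The output depends entirely on each element's own toString(), so the result for a KV or a custom type is whatever that type chooses to print.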
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> My thoughts:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Is converting to a PCollection<String> mostly needed when you're using
> > >>>>>>>>>>>>>>>>> TextIO.Write? Will a general purpose transform only work in certain
> > >>>>>>>>>>>>>>>>> cases and you'll normally have to write custom code to format the
> > >>>>>>>>>>>>>>>>> strings the way you want them?
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> IMHO, it's yes to both. I'd prefer to add Object support to
> > >>>>>>>>>>>>>>>>> TextIO.Write or a SimpleFunction that takes a delimiter as an
> > >>>>>>>>>>>>>>>>> argument. Making a SimpleFunction that's able to specify a delimiter
> > >>>>>>>>>>>>>>>>> (and perhaps a prefix and suffix) should cover the majority of formats
> > >>>>>>>>>>>>>>>>> and cases.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Jesse
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>> --
> > >>>>>>>>>> Jean-Baptiste Onofré
> > >>>>>>>>>> jbonofre@apache.org
> > >>>>>>>>>> http://blog.nanthrax.net
> > >>>>>>>>>> Talend - http://www.talend.com
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>> --
> > >>>>>> Jean-Baptiste Onofré
> > >>>>>> jbonofre@apache.org
> > >>>>>> http://blog.nanthrax.net
> > >>>>>> Talend - http://www.talend.com
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>> --
> > >>>> Jean-Baptiste Onofré
> > >>>> jbonofre@apache.org
> > >>>> http://blog.nanthrax.net
> > >>>> Talend - http://www.talend.com
> > >>>>
> > >>>
> > >>
> > >> --
> > >> Jean-Baptiste Onofré
> > >> jbonofre@apache.org
> > >> http://blog.nanthrax.net
> > >> Talend - http://www.talend.com
> > >>
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbonofre@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>