Posted to dev@beam.apache.org by Romain Manni-Bucau <rm...@gmail.com> on 2018/03/11 18:46:56 UTC

(java) stream & beam?

Hi guys,

I don't know if you have already experimented with the java Stream API as a
replacement for the pipeline API, but I did some tests:
https://github.com/rmannibucau/jbeam

It is far from complete but it already shows where it fails (beam doesn't
have a way to reduce on the caller machine for instance, coder handling is
not that trivial, lambdas don't work well with the default Stream API,
etc.).

However it is interesting to see that having such an API is pretty natural
compared to the pipeline API, so I wonder if beam should work on its own
Stream API (surely with another name, for obvious reasons ;)).

Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

Re: (java) stream & beam?

Posted by Romain Manni-Bucau <rm...@gmail.com>.
Yep, as long as we can pass lambdas I guess it is fine, or we would have to
use proxies to hide the mutation, but I don't think we need to be that
purist to move to a more expressive DSL.



Re: (java) stream & beam?

Posted by Ben Chambers <bj...@gmail.com>.
The CombineFn API has three type parameters (input, accumulator, and
output) and methods that approximately correspond to those parts of the
Collector:

CombineFn.createAccumulator = supplier
CombineFn.addInput = accumulator
CombineFn.mergeAccumulators = combiner
CombineFn.extractOutput = finisher

That said, the Collector API has some minimal, cosmetic differences, such
as: CombineFn.addInput may either mutate the accumulator or return it,
whereas the Collector accumulator method is a BiConsumer, meaning it must
mutate in place.
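
For illustration, a minimal side-by-side sketch of that correspondence
(SumFn and summingInts are invented names for the example; the
accumulator-coder wiring a real CombineFn would need is omitted):

import java.util.stream.Collector;
import org.apache.beam.sdk.transforms.Combine;

// Beam side: addInput may mutate its accumulator and/or return it, and
// mergeAccumulators takes an Iterable rather than being a binary operator.
class SumFn extends Combine.CombineFn<Integer, int[], Integer> {
  @Override public int[] createAccumulator() { return new int[1]; }    // ~ supplier
  @Override public int[] addInput(int[] acc, Integer in) {             // ~ accumulator
    acc[0] += in;
    return acc;
  }
  @Override public int[] mergeAccumulators(Iterable<int[]> accs) {     // ~ combiner
    int[] merged = new int[1];
    for (int[] a : accs) merged[0] += a[0];
    return merged;
  }
  @Override public Integer extractOutput(int[] acc) { return acc[0]; } // ~ finisher
}

// Stream side: the accumulator is a BiConsumer, so it can only mutate.
Collector<Integer, int[], Integer> summingInts = Collector.of(
    () -> new int[1],                       // supplier
    (acc, in) -> acc[0] += in,              // accumulator
    (a, b) -> { a[0] += b[0]; return a; },  // combiner
    acc -> acc[0]);                         // finisher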


Re: (java) stream & beam?

Posted by Romain Manni-Bucau <rm...@gmail.com>.
It misses the collect split in 3 (supplier, combiner, aggregator) but
globally I agree.

I'd just take java Stream, remove the "client" methods or make them
big-data friendly if possible, ensure all hooks are serializable to avoid
hacks, and add an unwrap to be able to access the pipeline in case we need
a very custom thing, and, for me, we are done.
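
As a concrete reading of those two ingredients (a sketch only; SerFunction
and BeamStream are invented names here, and Beam's existing
SerializableFunction already plays the first role):

import java.io.Serializable;
import java.util.function.Function;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.values.PCollection;

// A serializable twin of java.util.function.Function, so a lambda can be
// shipped to workers without the (Function & Serializable) cast trick.
interface SerFunction<A, B> extends Function<A, B>, Serializable {}

// The escape hatch: anything the stream-like layer doesn't cover drops
// back to the raw Beam objects.
interface BeamStream<T> {
  PCollection<T> unwrap();
  Pipeline pipeline();
}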


Re: (java) stream & beam?

Posted by Ben Chambers <bc...@apache.org>.
I think the existing rationale (not introducing lots of special fluent
methods) makes sense. However, if we look at the Java Stream API, we
probably wouldn't need to introduce *a lot* of fluent builders to get most
of the functionality. Specifically, if we focus on map, flatMap, and
collect from the Stream API, and a few extensions, we get something like:

* collection.map(DoFn) for applying a ParDo
* collection.map(SerializableFn) for Java 8 lambda shorthand
* collection.flatMap(SerializableFn) for Java 8 lambda shorthand
* collection.collect(CombineFn) for applying a CombineFn
* collection.apply(PTransform) for applying a composite transform. Note
that PTransforms could also use serializable lambdas for definition.

(Note that GroupByKey doesn't even show up here -- it could, but that could
also be a way of wrapping a collector, as in the Java 8
Collectors.groupingBy [1].)

With this, we could write code like:

collection
  .map(myDoFn)
  .map((s) -> s.toString())
  .collect(new IntegerCombineFn())
  .apply(GroupByKey.create());

That said, my two concerns are:
(1) having two similar but different Java APIs. If we have a more
idiomatic way of writing pipelines in Java, we should make that the
standard. Otherwise, users will be confused by seeing "Beam" examples
written in multiple, incompatible syntaxes.
(2) making sure the above is truly idiomatic Java and that it doesn't
create any conflicts with the cross-language Beam programming model. I
don't think it does. We have (I believe) chosen to make the Python and Go
SDKs idiomatic for those languages where possible.

If this work is focused on making the Java SDK more idiomatic (and thus
easier for Java users to learn), it seems like a good thing. We should just
make sure it doesn't scope-creep into defining an entirely new DSL or SDK.
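
To make the shape concrete, a rough sketch of such a wrapper (the Fluent
class is hypothetical; it only delegates to existing transforms, and the
TypeDescriptor parameters are needed because lambdas erase the type
information coder inference relies on):

import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptor;

final class Fluent<T> {
  private final PCollection<T> pc;
  Fluent(PCollection<T> pc) { this.pc = pc; }

  // map/flatMap/collect are only sugar over apply(...).
  <O> Fluent<O> map(TypeDescriptor<O> out, SerializableFunction<T, O> fn) {
    return new Fluent<>(pc.apply(MapElements.into(out).via(fn)));
  }

  <O> Fluent<O> flatMap(TypeDescriptor<O> out,
      SerializableFunction<T, Iterable<O>> fn) {
    return new Fluent<>(pc.apply(FlatMapElements.into(out).via(fn)));
  }

  <O> Fluent<O> collect(Combine.CombineFn<T, ?, O> fn) {
    return new Fluent<>(pc.apply(Combine.globally(fn)));
  }

  // The full power of the SDK stays reachable.
  <O> Fluent<O> apply(PTransform<PCollection<T>, PCollection<O>> transform) {
    return new Fluent<>(pc.apply(transform));
  }

  PCollection<T> unwrap() { return pc; }
}

The unwrap() method is the escape hatch Romain asked for above.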

[1]
https://docs.oracle.com/javase/8/docs/api/java/util/stream/Collectors.html#groupingBy-java.util.function.Function-

On Tue, Mar 13, 2018 at 11:06 AM Romain Manni-Bucau <rm...@gmail.com>
wrote:

> Yep
>
> I know the rationale and it makes sense, but it also increases the entry
> steps for users and is not that smooth in IDEs, in particular for custom
> code.
>
> So I really think it makes sense to build a user-friendly api on top of
> the beam core dev one.
>
>
> On 13 Mar 2018 18:35, "Aljoscha Krettek" <al...@apache.org> wrote:
>
>>
>> https://beam.apache.org/blog/2016/05/27/where-is-my-pcollection-dot-map.html
>>
>> On 11. Mar 2018, at 22:21, Romain Manni-Bucau <rm...@gmail.com>
>> wrote:
>>
>>
>>
>> On 12 Mar 2018 00:16, "Reuven Lax" <re...@google.com> wrote:
>>
>> I think it would be interesting to see what a Java stream-based API would
>> look like. As I mentioned elsewhere, we are not limited to having only one
>> API for Beam.
>>
>> If I remember correctly, a Java stream API was considered for Dataflow
>> back at the very beginning. I don't completely remember why it was
>> rejected, but I suspect at least part of the reason might have been that
>> Java streams were considered too new and untested back then.
>>
>>
>> Coders are broken - type variables don't have bounds except Object - and
>> reducers are not trivial to impl generally, I guess.
>>
>> However, being close to this api can help a lot, so +1 to trying a java
>> dsl on top of the current api. Would also be neat to integrate it with
>> CompletionStage :).
>>
>>
>>
>> Reuven
>>
>>
>> On Sun, Mar 11, 2018 at 2:29 PM Romain Manni-Bucau <rm...@gmail.com>
>> wrote:
>>
>>>
>>>
>>> On 11 Mar 2018 21:18, "Jean-Baptiste Onofré" <jb...@nanthrax.net>
>>> wrote:
>>>
>>> Hi Romain,
>>>
>>> I remember we discussed the way to express pipelines a while ago.
>>>
>>> I was a fan of a "DSL" like the one we have in Camel: instead of using
>>> apply(), use a dedicated form (.map(), .reduce(), etc.; AFAIR, it's the
>>> approach in Flume).
>>> However, we agreed that the apply() syntax gives a more flexible
>>> approach.
>>>
>>> Using Java Stream is interesting, but I'm afraid we would have the same
>>> issue as the one we identified discussing the "fluent Java SDK".
>>> However, we can have a Stream API DSL on top of the SDK, IMHO.
>>>
>>>
>>> Agree, and a beam stream interface (copying the JDK API but making
>>> lambdas serializable to avoid the need for casts).
>>>
>>> On my side I think it enables users to discover the api. If you check my
>>> PoC impl you quickly see the steps needed to do simple things like a
>>> map, which is a first-class citizen.
>>>
>>> Also curious whether we could impl reduce with the pipeline result, i.e.
>>> get an output of a batch from the runner (client) jvm. I see how to do
>>> it for longs - with metrics - but not for collect().
>>>
>>>
>>> Regards
>>> JB

Re: (java) stream & beam?

Posted by Romain Manni-Bucau <rm...@gmail.com>.
On 15 Mar 2018 07:50, "Robert Bradshaw" <ro...@google.com> wrote:

> On Wed, Mar 14, 2018 at 11:04 PM Romain Manni-Bucau <rm...@gmail.com>
> wrote:
>
>> On 15 Mar 2018 06:52, "Robert Bradshaw" <ro...@google.com> wrote:
>>
>>> The stream API was looked at way back when we were designing the API;
>>> one of the primary reasons it was not further pursued at the time was
>>> the demand for Java 7 compatibility. It is also much more natural with
>>> lambdas, but unfortunately the Java compiler discards types in this
>>> case, making coder inference impossible. It is still interesting to
>>> explore, and I've been toying with using this wrapping method for other
>>> applications (specifically, giving a Pandas Dataframe API to
>>> PCollections in Python).
>>>
>>> There's a higher-level question lingering here about making things more
>>> fluent by putting methods on PCollections in our primary API. It was
>>> somewhat of an experiment to go the very pure approach of *everything*
>>> being expressed as a PTransform, and this is not without its
>>> disadvantages, and (gasp) may be worth revisiting. In particular, some
>>> things that have changed in the meantime are:
>>>
>>> * The Java SDK is no longer *the* definition of the model. The model
>>> has been (mostly) formalized in the portability work, and the general
>>> Beam concepts and notion of PTransform are much more widely fleshed out
>>> and understood.
>>
>> This is wrong for all java users, which are still the mainstream. It is
>> important to keep that in mind, and even if I know the portable API is
>> something important for you,
>
> I think you misunderstood me. My point is that it is now much easier to
> disentangle the essence of the Beam model (reified in part in the
> portable API) from the Java API itself (which may evolve more
> independently, whereas formerly syntactic sugar here would be conflated
> with core concepts).

Oh ok. Agree.

>> it is something which should stay on top of runners and their api, which
>> means java for all but one.
>>
>> All that to say that the most common default is java.
>
> I don't think it'll be that way for long; scala alone might give Java a
> run for its money.

Scala will probably need its own api, but scala also generally goes with
the best-of-breed approach, which is the opposite of beam by design (vendor
portability gives much more important guarantees while not always being the
best), so let's see how it goes :).

>> However I agree each language should have its natural API and should
>> absolutely not just port over the same API. Goal being indeed to respect
>> its own philosophy.
>>
>> Conclusion: java needs a most expressive stream-like API.
>>
>> There is another way to see it: catching up API debt compared to
>> concurrent APIs.
>
>>> * Java 8's lambdas, etc. allow for much more succinct representation of
>>> operations, which makes the relative ratio of boilerplate of using
>>> apply that much higher. This is one of the struggles we had with the
>>> Python API: pcoll.apply(Map(lambda ...)) made the "apply" feel *very*
>>> redundant. pcoll | Map(...) is at least closer to pcoll.map(...).
>>> * With over two years of experience with the 100% pure approach, we
>>> still haven't "gotten used to it" enough that adding such methods isn't
>>> appealing. (Note that by design adding such methods later is always
>>> easier than taking them away, which was one justification for starting
>>> at the extreme point.)
>>>
>>> Even if we go this route, there's no need to remove apply, and
>>>
>>> pcoll
>>>     .map(...)
>>>     .apply(...)
>>>     .flatMap(...)
>>>
>>> flows fairly well (with map/flatMap being syntactic sugar for apply).
>>
>> Agree, but the issue with that is you lose the natural approach and it
>> is harder to rework it, whereas having an api on top of "apply" lets you
>> keep both concerns split.
>
> Having multiple APIs is undesirable; best to have one unless there are
> hard constraints that prevent it (e.g. if the two would be jarringly
> inconsistent, or one is forced by an interface, etc.)
>
>> Also the pcollection api is what is complex (coders, sides, ...) and
>> what I hope we can hide behind another API.
>
> I'd like to simplify things as well.
>
>>> I think we would also have to still use apply for parameterless
>>> operations like gbk that place constraints on the element types. I
>>> don't see how to do combinePerKey either (though, asymmetrically,
>>> globalCombine is fine).
>>>
>>> The largest fear I have is feature creep. There would have to be a very
>>> clear line of what's in and what's not, likely with what's in being a
>>> very short list (which is probably OK and would give the biggest gain,
>>> but not much discoverability). The criteria can't be primitives (gbk is
>>> problematic, and the most natural map isn't really the full ParDo
>>> primitive--in fact the full ParDo might be "advanced" enough to merit
>>> requiring apply).
>>
>> Is the previous proposal an issue (jet api)?
>
> On first glance, StreamStage doesn't sound to me like a PCollection
> (mixes the notion of operations and values), and methods like
> flatMapUsingContext and hashJoin2 seem far down the slippery slope. But I
> haven't spent that much time looking at it.

Hazelcast has some concepts making it way faster (like spark etc.) when
used, since it hosts the data and the execution and you can do data
affinity. This part doesn't apply to us, but overall their api is nice and
smooth to discover.

>>> Who knows, though I still think we made the right decision to attempt
>>> apply-only at the time; maybe I'll have to flesh this out into a new
>>> blog post that is a rebuttal to my original one :).
>>
>> Maybe for part of the users, clearly not for the ones I met the last 3
>> months (what they said opening their IDE is censured ;)).

Re: (java) stream & beam?

Posted by Robert Bradshaw <ro...@google.com>.
Yes. The very original Python API didn't have GBK, just a
lambda-parameterized groupBy.
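
In a Java fluent layer the same trick could look like this (a sketch only,
extending the hypothetical Fluent wrapper sketched earlier in the thread,
and built from existing transforms):

import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptor;

// Sugar over WithKeys + GroupByKey: the key-extractor lambda removes the
// KV element-type constraint that keeps raw GBK from being fluent.
<K> Fluent<KV<K, Iterable<T>>> groupBy(
    TypeDescriptor<K> keyType, SerializableFunction<T, K> keyFn) {
  return new Fluent<>(pc
      .apply(WithKeys.of(keyFn).withKeyType(keyType))
      .apply(GroupByKey.create()));
}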


Re: (java) stream & beam?

Posted by Romain Manni-Bucau <rm...@gmail.com>.
GBK can be fluent if you pass a key-extractor lambda ;)


Re: (java) stream & beam?

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Big +1

Regards
JB


Re: (java) stream & beam?

Posted by Reuven Lax <re...@google.com>.
BTW, while it's true that raw GBK can't be fluent (due to the constraint on
the element type), once we have schema support we can introduce
groupByField, and that can be fluent.
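
Hypothetically, such a schema-aware grouping could read like this (nothing
below existed at the time of this thread; groupByField/aggregateField and
the fluent receiver are invented for illustration, with Row as the schema'd
element type):

// The key is a named field rather than part of the element type, so no
// KV<K, V> constraint leaks into the signature.
PCollection<Row> orders = ...;
orders
    .groupByField("userId")
    .aggregateField("amount", Sum.ofDoubles(), "totalSpent");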


On Wed, Mar 14, 2018 at 11:50 PM Robert Bradshaw <ro...@google.com>
wrote:

> On Wed, Mar 14, 2018 at 11:04 PM Romain Manni-Bucau <rmannibucau@gmail.com
> >
> wrote:
>
> > Le 15 mars 2018 06:52, "Robert Bradshaw" <ro...@google.com> a écrit :
>
> >> The stream API was looked at way back when we were designing the API;
> one of the primary reasons it was not further pursued at the time was the
> demand for Java 7 compatibility. It is also much more natural with lambdas,
> but unfortunately the Java compiler discards types in this case, making
> coder inference impossible. Still is interesting to explore, and I've been
> toying with using this wrapping method for other applications
> (specifically, giving a Pandas Dataframe API to PCollections in Python).
>
> >> There's a higher level question lingering here about making things more
> fluent by putting methods on PCollections in our primary API. It was
> somewhat of an experiment to go the very pure approach of *everything*
> being expressed a PTransform, and this is not without its disadvantages,
> and (gasp) may be worth revisiting. In particular, some things that have
> changed in the meantime are
>
> >> * The Java SDK is no longer *the* definition of the model. The model has
> been (mostly) formalized in the portability work, and the general Beam
> concepts and notion of PTransform are much more widely fleshed out and
> understood.
>
> > This is wrong for all java users which are still the mainstream. It is
> important to keep that in mind and even if I know portable API is something
> important for you,
>
> I think you miss-understood me. My point is that it is now much easier to
> disentangle the essence of the Beam model (reified in part in the portable
> API) from the Java API itself (which may evolve more independently, whereas
> formerly syntactic sugar here would be conflated with core concepts).
>
> > it is solething which should stay on top of runners and their api which
> means java for all but one.
>
> > All that to say that the most common default is java.
>
> I don't think it'll be that way for long; scala alone might give Java a run
> for its money.
>
> > However I agree each language should have its natural API and should
> absolutely not just port over the same API. Goal being indeed to respect
> its own philosophy.
>
> > Conclusion: java needs a most expressive stream like API.
>
> > There is another way to see it: catching up API debt compared to
> concurrent API.
>
>
> >> * Java 8's lambdas, etc. allows for much more succinct representation of
> operations, which makes the relative ratio of boilerplate of using apply
> that much higher. This is one of the struggles we had with the Python API,
> pcoll.apply(Map(lambda ...)) made the "apply" feel *very* redundant. pcoll
> | Map(...) is at least closer to pcoll.map(...).
> >> * With over two years of experience with the 100% pure approach, we
> still haven't "gotten used to it" enough that adding such methods isn't
> appealing. (Note that by design adding such methods later is always easier
> than taking them away, which was one justification for starting at the
> extreme point).
>
> >> Even if we go this route, there's no need to remove apply, and
>
> >> pcoll
> >>      .map(...)
> >>      .apply(...)
> >>      .flatMap(...)
>
> >> flows fairly well (with map/flatMap being syntactic sugar to apply).
>
> >> Agree but the issue with that is you loose the natural approach and it
> is harder to rework it whereas having an api on top of "apply" let you keep
> both concerns split.
>
> Having multiple APIs undesirable, best to have one unless there are hard
> constraints that prevent it (e.g. if the two would be jarringly
> inconsistent, or one is forced by an interface, etc.)
>
> >> Also pcollection api is what is complex (coders, sides, ...) and what I
> hope we can hide behind another API.
>
> I'd like to simplify things as well.
>
> >> I think we would also have to still use apply for parameterless
> operations like gbk that place constraints on the element types. I don't
> see how to do combinePerKey either (though, asymmetrically, globalCombine
> is fine).
>
> >> The largest fear I have is feature creep. There would have to be a very
> clear line of what's in and what's not, likely with what's in being a very
> short list (which is probably OK and would give the biggest gain, but not
> much discoverability). The criteria can't be primitives (gbk is
> problematic, and the most natural map isn't really the full ParDo
> primitive--in fact the full ParDo might be "advanced" enough to merit
> requiring apply).
>
> > Is the previous proposal an issue (jet api)?
>
> On first glance, StreamStage doesn't sound to me like a PCollection (mixes
> the notion of operations and values), and methods like flatMapUsingContext
> and hashJoin2 seem far down the slippery slope. But I haven't spent that
> much time looking at it.
>
> >> Who knows, though I still think we made the right decision to attempt
> apply-only at the time, maybe I'll have to flesh this out into a new blog
> post that is a rebuttal to my original one :).
>
> > Maybe for part of the users, clearly not for the ones I met in the last 3 months
> (what they said opening their IDE is censored ;)).
>

Re: (java) stream & beam?

Posted by Robert Bradshaw <ro...@google.com>.
On Wed, Mar 14, 2018 at 11:04 PM Romain Manni-Bucau <rm...@gmail.com>
wrote:

> On 15 Mar 2018 06:52, "Robert Bradshaw" <ro...@google.com> wrote:

>> The stream API was looked at way back when we were designing the API;
one of the primary reasons it was not further pursued at the time was the
demand for Java 7 compatibility. It is also much more natural with lambdas,
but unfortunately the Java compiler discards types in this case, making
coder inference impossible. It is still interesting to explore, and I've been
toying with using this wrapping method for other applications
(specifically, giving a Pandas Dataframe API to PCollections in Python).

>> There's a higher level question lingering here about making things more
fluent by putting methods on PCollections in our primary API. It was
somewhat of an experiment to go the very pure approach of *everything*
being expressed as a PTransform, and this is not without its disadvantages,
and (gasp) may be worth revisiting. In particular, some things that have
changed in the meantime are

>> * The Java SDK is no longer *the* definition of the model. The model has
been (mostly) formalized in the portability work, and the general Beam
concepts and notion of PTransform are much more widely fleshed out and
understood.

> This is wrong for all Java users, who are still the mainstream. It is
important to keep that in mind and even if I know portable API is something
important for you,

I think you misunderstood me. My point is that it is now much easier to
disentangle the essence of the Beam model (reified in part in the portable
API) from the Java API itself (which may evolve more independently, whereas
formerly syntactic sugar here would be conflated with core concepts).

> it is something which should stay on top of runners and their API, which
means Java for all but one.

> All that to say that the most common default is Java.

I don't think it'll be that way for long; scala alone might give Java a run
for its money.

> However I agree each language should have its natural API and should
absolutely not just port over the same API. Goal being indeed to respect
its own philosophy.

> Conclusion: Java needs a more expressive, stream-like API.

> There is another way to see it: catching up on API debt compared to
competing APIs.


>> * Java 8's lambdas, etc. allow for much more succinct representation of
operations, which makes the relative ratio of boilerplate of using apply
that much higher. This is one of the struggles we had with the Python API,
pcoll.apply(Map(lambda ...)) made the "apply" feel *very* redundant. pcoll
| Map(...) is at least closer to pcoll.map(...).
>> * With over two years of experience with the 100% pure approach, we
still haven't "gotten used to it" enough that adding such methods isn't
appealing. (Note that by design adding such methods later is always easier
than taking them away, which was one justification for starting at the
extreme point).

>> Even if we go this route, there's no need to remove apply, and

>> pcoll
>>      .map(...)
>>      .apply(...)
>>      .flatMap(...)

>> flows fairly well (with map/flatMap being syntactic sugar to apply).

>> Agree, but the issue with that is you lose the natural approach and it
is harder to rework it, whereas having an API on top of "apply" lets you keep
both concerns split.

Having multiple APIs is undesirable; best to have one unless there are hard
constraints that prevent it (e.g. if the two would be jarringly
inconsistent, or one is forced by an interface, etc.)

>> Also the PCollection API is what is complex (coders, sides, ...) and what I
hope we can hide behind another API.

I'd like to simplify things as well.

>> I think we would also have to still use apply for parameterless
operations like gbk that place constraints on the element types. I don't
see how to do combinePerKey either (though, asymmetrically, globalCombine
is fine).

>> The largest fear I have is feature creep. There would have to be a very
clear line of what's in and what's not, likely with what's in being a very
short list (which is probably OK and would give the biggest gain, but not
much discoverability). The criteria can't be primitives (gbk is
problematic, and the most natural map isn't really the full ParDo
primitive--in fact the full ParDo might be "advanced" enough to merit
requiring apply).

> Is the previous proposal an issue (jet api)?

On first glance, StreamStage doesn't sound to me like a PCollection (mixes
the notion of operations and values), and methods like flatMapUsingContext
and hashJoin2 seem far down the slippery slope. But I haven't spent that
much time looking at it.

>> Who knows, though I still think we made the right decision to attempt
apply-only at the time, maybe I'll have to flesh this out into a new blog
post that is a rebuttal to my original one :).

> Maybe for part of the users, clearly not for the ones I met in the last 3 months
(what they said opening their IDE is censored ;)).

Re: (java) stream & beam?

Posted by Romain Manni-Bucau <rm...@gmail.com>.
On 15 Mar 2018 06:52, "Robert Bradshaw" <ro...@google.com> wrote:

The stream API was looked at way back when we were designing the API; one
of the primary reasons it was not further pursued at the time was the
demand for Java 7 compatibility. It is also much more natural with lambdas,
but unfortunately the Java compiler discards types in this case, making
coder inference impossible. It is still interesting to explore, and I've been
toying with using this wrapping method for other applications
(specifically, giving a Pandas Dataframe API to PCollections in Python).

There's a higher level question lingering here about making things more
fluent by putting methods on PCollections in our primary API. It was
somewhat of an experiment to go the very pure approach of *everything*
being expressed as a PTransform, and this is not without its disadvantages,
and (gasp) may be worth revisiting. In particular, some things that have
changed in the meantime are

* The Java SDK is no longer *the* definition of the model. The model has
been (mostly) formalized in the portability work, and the general Beam
concepts and notion of PTransform are much more widely fleshed out and
understood.


This is wrong for all Java users, who are still the mainstream. It is
important to keep that in mind and even if I know the portable API is something
important for you, it is something which should stay on top of runners and
their API, which means Java for all but one.

All that to say that the most common default is Java.

However I agree each language should have its natural API and should
absolutely not just port over the same API. Goal being indeed to respect
its own philosophy.

Conclusion: Java needs a more expressive, stream-like API.

There is another way to see it: catching up on API debt compared to competing
APIs.


* Java 8's lambdas, etc. allow for much more succinct representation of
operations, which makes the relative ratio of boilerplate of using apply
that much higher. This is one of the struggles we had with the Python API,
pcoll.apply(Map(lambda ...)) made the "apply" feel *very* redundant. pcoll
| Map(...) is at least closer to pcoll.map(...).
* With over two years of experience with the 100% pure approach, we still
haven't "gotten used to it" enough that adding such methods isn't
appealing. (Note that by design adding such methods later is always easier
than taking them away, which was one justification for starting at the
extreme point).

Even if we go this route, there's no need to remove apply, and

pcoll
    .map(...)
    .apply(...)
    .flatMap(...)

flows fairly well (with map/flatMap being syntactic sugar to apply).


Agree, but the issue with that is you lose the natural approach and it is
harder to rework it, whereas having an API on top of "apply" lets you keep
both concerns split.

Also the PCollection API is what is complex (coders, sides, ...) and what I
hope we can hide behind another API.



I think we would also have to still use apply for parameterless operations
like gbk that place constraints on the element types. I don't see how to do
combinePerKey either (though, asymmetrically, globalCombine is fine).

The largest fear I have is feature creep. There would have to be a very
clear line of what's in and what's not, likely with what's in being a very
short list (which is probably OK and would give the biggest gain, but not
much discoverability). The criteria can't be primitives (gbk is
problematic, and the most natural map isn't really the full ParDo
primitive--in fact the full ParDo might be "advanced" enough to merit
requiring apply).


Is the previous proposal an issue (jet api)?


Who knows, though I still think we made the right decision to attempt
apply-only at the time, maybe I'll have to flesh this out into a new blog
post that is a rebuttal to my original one :).


Maybe for part of the users, clearly not for the ones I met in the last 3 months
(what they said opening their IDE is censored ;)).


- Robert




On Wed, Mar 14, 2018 at 1:28 AM Romain Manni-Bucau <rm...@gmail.com>
wrote:

> Hi Jan,
>
> The wrapping is almost exactly what I had in mind (I would pass the
> expected Class to support a bit more like in most jre or javax API but
> that's a detail) but I would really try to align it on java stream just to
> keep the dev comfortable:
> https://github.com/hazelcast/hazelcast-jet/blob/9c4ea86a59ae3b899498f389b5459d67c2b4cdcd/hazelcast-jet-core/src/main/java/com/hazelcast/jet/pipeline/StreamStage.java
>
>
> Romain Manni-Bucau
> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> <https://rmannibucau.metawerx.net/> | Old Blog
> <http://rmannibucau.wordpress.com> | Github
> <https://github.com/rmannibucau> | LinkedIn
> <https://www.linkedin.com/in/rmannibucau> | Book
> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>
> 2018-03-14 9:03 GMT+01:00 Jan Lukavský <je...@seznam.cz>:
>
>> Hi all,
>>
>> there are actually some steps taken in this direction - a few emails
>> already went to this channel about donation of Euphoria API (
>> https://github.com/seznam/euphoria) to Apache Beam. SGA has already been
>> signed, currently there is work in progress for porting all Euphoria's
>> features to Beam. The idea is that Euphoria could be this "user friendly"
>> layer on top of Beam. In our proof-of-concept this works like this:
>>
>>    // create input
>>    String raw = "hi there hi hi sue bob hi sue ZOW bob";
>>    List<String> words = Arrays.asList(raw.split(" "));
>>
>>    Pipeline pipeline = Pipeline.create(options());
>>
>>    // create input PCollection
>>    PCollection<String> input = pipeline.apply(
>>        Create.of(words)).setTypeDescriptor(TypeDescriptor.of(String.class));
>>
>>    // holder of mapping between Euphoria and Beam
>>    BeamFlow flow = BeamFlow.create(pipeline);
>>
>>    // lift this PCollection to Euphoria API
>>    Dataset<String> dataset = flow.wrapped(input);
>>
>>    // do something with the data
>>    Dataset<Pair<String, Long>> output = CountByKey.of(dataset)
>>        .keyBy(e -> e)
>>        .output();
>>
>>    // convert Euphoria API back to Beam
>>    PCollection<Pair<String, Long>> beamOut = flow.unwrapped(output);
>>
>>    // do whatever with the resulting PCollection
>>    PAssert.that(beamOut)
>>        .containsInAnyOrder(
>>            Pair.of("hi", 4L),
>>            Pair.of("there", 1L),
>>            Pair.of("sue", 2L),
>>            Pair.of("ZOW", 1L),
>>            Pair.of("bob", 2L));
>>
>>    // run, forrest, run
>>    pipeline.run();
>> I'm aware that this is not the "stream" API this thread was about, but
>> Euphoria also has a "fluent" package -
>> https://github.com/seznam/euphoria/tree/master/euphoria-fluent. This is by no means a complete or
>> production ready API, but it could help solve this dichotomy between
>> whether to keep the Beam API as is, or introduce some more user-friendly API. As
>> I said, there is work in progress on this, so if anyone could spare some
>> time and give us a helping hand with this porting, it would be just awesome.
>> :-)
>>
>> Jan
>>
>>
>> On 03/13/2018 07:06 PM, Romain Manni-Bucau wrote:
>>
>> Yep
>>
>> I know the rationale and it makes sense, but it also raises the entry
>> barrier for users and is not that smooth in IDEs, in particular for custom
>> code.
>>
>> So I really think it makes sense to build a user-friendly API on top of
>> the Beam core dev one.
>>
>>
>> On 13 Mar 2018 18:35, "Aljoscha Krettek" <al...@apache.org> wrote:
>>
>>> https://beam.apache.org/blog/2016/05/27/where-is-my-pcollection-dot-map.html
>>>
>>> On 11. Mar 2018, at 22:21, Romain Manni-Bucau <rm...@gmail.com>
>>> wrote:
>>>
>>>
>>>
>>> On 12 Mar 2018 00:16, "Reuven Lax" <re...@google.com> wrote:
>>>
>>> I think it would be interesting to see what a Java stream-based API
>>> would look like. As I mentioned elsewhere, we are not limited to having
>>> only one API for Beam.
>>>
>>> If I remember correctly, a Java stream API was considered for Dataflow
>>> back at the very beginning. I don't completely remember why it was
>>> rejected, but I suspect at least part of the reason might have been that
>>> Java streams were considered too new and untested back then.
>>>
>>>
>>> Coders are broken - type variables don't have bounds except Object - and
>>> reducers are not trivial to implement generally, I guess.
>>>
>>> However, being close to this API can help a lot, so +1 to try to have a
>>> Java DSL on top of the current API. Would also be neat to integrate it with
>>> CompletionStage :).
>>>
>>>
>>>
>>> Reuven
>>>
>>>
>>> On Sun, Mar 11, 2018 at 2:29 PM Romain Manni-Bucau <
>>> rmannibucau@gmail.com> wrote:
>>>
>>>>
>>>>
>>>> On 11 Mar 2018 21:18, "Jean-Baptiste Onofré" <jb...@nanthrax.net>
>>>> wrote:
>>>>
>>>> Hi Romain,
>>>>
>>>> I remember we discussed the way to express pipelines a while
>>>> ago.
>>>>
>>>> I was a fan of a "DSL" similar to the one we have in Camel: instead of
>>>> using apply(), use a dedicated form (like .map(), .reduce(), etc.; AFAIR,
>>>> it's the approach in Flume).
>>>> However, we agreed that apply() syntax gives a more flexible approach.
>>>>
>>>> Using Java Stream is interesting but I'm afraid we would have the same
>>>> issue as the one we identified discussing "fluent Java SDK". However, we
>>>> can have a Stream API DSL on top of the SDK IMHO.
>>>>
>>>>
>>>> Agreed, and a Beam stream interface (copying the JDK API but making
>>>> lambdas serializable to avoid the need for a cast).
>>>>
>>>> On my side I think it enables users to discover the API. If you check my
>>>> PoC impl you quickly see the steps needed to do simple things like a map,
>>>> which is a first-class citizen.
>>>>
>>>> Also curious if we could implement reduce with the pipeline result, i.e.
>>>> get an output of a batch from the runner (client) JVM. I see how to do it
>>>> for longs - with metrics - but not for collect().
>>>>
>>>>
>>>> Regards
>>>> JB
>>>>
>>>>
>>>> On 11/03/2018 19:46, Romain Manni-Bucau wrote:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> don't know if you already experienced using java Stream API as a
>>>>> replacement for pipeline API but did some tests:
>>>>> https://github.com/rmannibucau/jbeam
>>>>>
>>>>> It is far from complete but already shows where it fails (Beam
>>>>> doesn't have a way to reduce in the caller machine for instance, coder
>>>>> handling is not that trivial, lambdas are not working well with the
>>>>> default Stream API, etc.).
>>>>>
>>>>> However it is interesting to see that having such an API is pretty
>>>>> natural compared to the pipeline API,
>>>>> so I wonder if Beam should work on its own Stream API (with surely
>>>>> another name for obvious reasons ;)).
>>>>>
>>>>> Romain Manni-Bucau
>>>>> @rmannibucau <https://twitter.com/rmannibucau> | Blog <
>>>>> https://rmannibucau.metawerx.net/> | Old Blog <
>>>>> http://rmannibucau.wordpress.com> | Github <https://github.com/
>>>>> rmannibucau> | LinkedIn <https://www.linkedin.com/in/rmannibucau> |
>>>>> Book <https://www.packtpub.com/application-development/java-
>>>>> ee-8-high-performance>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>

Re: (java) stream & beam?

Posted by Robert Bradshaw <ro...@google.com>.
The stream API was looked at way back when we were designing the API; one
of the primary reasons it was not further pursued at the time was the
demand for Java 7 compatibility. It is also much more natural with lambdas,
but unfortunately the Java compiler discards types in this case, making
coder inference impossible. It is still interesting to explore, and I've been
toying with using this wrapping method for other applications
(specifically, giving a Pandas Dataframe API to PCollections in Python).
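
To make that concrete, a minimal sketch of the erasure problem (assuming a
PCollection<Integer> named input and Beam's MapElements transform; the
explicit TypeDescriptor in the second form is the usual workaround):

    // An anonymous SimpleFunction subclass keeps its type arguments at
    // runtime, so the output coder can be inferred reflectively:
    PCollection<String> viaClass = input.apply(
        MapElements.via(new SimpleFunction<Integer, String>() {
          @Override
          public String apply(Integer i) {
            return i.toString();
          }
        }));

    // A lambda compiles to a method with erased generics, so the output
    // type (and hence the coder) has to be supplied explicitly:
    PCollection<String> viaLambda = input.apply(
        MapElements.into(TypeDescriptors.strings())
            .via((Integer i) -> i.toString()));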

There's a higher level question lingering here about making things more
fluent by putting methods on PCollections in our primary API. It was
somewhat of an experiment to go the very pure approach of *everything*
being expressed as a PTransform, and this is not without its disadvantages,
and (gasp) may be worth revisiting. In particular, some things that have
changed in the meantime are

* The Java SDK is no longer *the* definition of the model. The model has
been (mostly) formalized in the portability work, and the general Beam
concepts and notion of PTransform are much more widely fleshed out and
understood.
* Java 8's lambdas, etc. allow for much more succinct representation of
operations, which makes the relative ratio of boilerplate of using apply
that much higher. This is one of the struggles we had with the Python API,
pcoll.apply(Map(lambda ...)) made the "apply" feel *very* redundant. pcoll
| Map(...) is at least closer to pcoll.map(...).
* With over two years of experience with the 100% pure approach, we still
haven't "gotten used to it" enough that adding such methods isn't
appealing. (Note that by design adding such methods later is always easier
than taking them away, which was one justification for starting at the
extreme point).

Even if we go this route, there's no need to remove apply, and

pcoll
    .map(...)
    .apply(...)
    .flatMap(...)

flows fairly well (with map/flatMap being syntactic sugar to apply).
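
A rough sketch of what that sugar could look like (hypothetical methods, not
the actual Beam API; each simply delegates to apply, and the TypeDescriptor
parameter sidesteps the lambda type-erasure issue above):

    // Hypothetical members on PCollection<T>:
    public <O> PCollection<O> map(
        TypeDescriptor<O> outType, SerializableFunction<T, O> fn) {
      return apply(MapElements.into(outType).via(fn));
    }

    public <O> PCollection<O> flatMap(
        TypeDescriptor<O> outType, SerializableFunction<T, Iterable<O>> fn) {
      return apply(FlatMapElements.into(outType).via(fn));
    }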

I think we would also have to still use apply for parameterless operations
like gbk that place constraints on the element types. I don't see how to do
combinePerKey either (though, asymmetrically, globalCombine is fine).
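
For illustration, a global combine reads naturally as a (hypothetical) method
because it places no constraint on the element type, which is exactly what a
per-key combine would need:

    // Hypothetical member on PCollection<T>, delegating to apply:
    public <O> PCollection<O> combine(Combine.CombineFn<T, ?, O> fn) {
      return apply(Combine.globally(fn));
    }
    // No combinePerKey equivalent is possible: it only makes sense when T
    // is some KV<K, V>, which a method on PCollection<T> cannot require.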

The largest fear I have is feature creep. There would have to be a very
clear line of what's in and what's not, likely with what's in being a very
short list (which is probably OK and would give the biggest gain, but not
much discoverability). The criteria can't be primitives (gbk is
problematic, and the most natural map isn't really the full ParDo
primitive--in fact the full ParDo might be "advanced" enough to merit
requiring apply).

Who knows, though I still think we made the right decision to attempt
apply-only at the time, maybe I'll have to flesh this out into a new blog
post that is a rebuttal to my original one :).

- Robert




On Wed, Mar 14, 2018 at 1:28 AM Romain Manni-Bucau <rm...@gmail.com>
wrote:

> Hi Jan,
>
> The wrapping is almost exactly what I had in mind (I would pass the
> expected Class to support a bit more like in most jre or javax API but
> that's a detail) but I would really try to align it on java stream just to
> keep the dev comfortable:
> https://github.com/hazelcast/hazelcast-jet/blob/9c4ea86a59ae3b899498f389b5459d67c2b4cdcd/hazelcast-jet-core/src/main/java/com/hazelcast/jet/pipeline/StreamStage.java
>
>
> Romain Manni-Bucau
> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> <https://rmannibucau.metawerx.net/> | Old Blog
> <http://rmannibucau.wordpress.com> | Github
> <https://github.com/rmannibucau> | LinkedIn
> <https://www.linkedin.com/in/rmannibucau> | Book
> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>
> 2018-03-14 9:03 GMT+01:00 Jan Lukavský <je...@seznam.cz>:
>
>> Hi all,
>>
>> there are actually some steps taken in this direction - a few emails
>> already went to this channel about donation of Euphoria API (
>> https://github.com/seznam/euphoria) to Apache Beam. SGA has already been
>> signed, currently there is work in progress for porting all Euphoria's
>> features to Beam. The idea is that Euphoria could be this "user friendly"
>> layer on top of Beam. In our proof-of-concept this works like this:
>>
>>    // create input
>>    String raw = "hi there hi hi sue bob hi sue ZOW bob";
>>    List<String> words = Arrays.asList(raw.split(" "));
>>
>>    Pipeline pipeline = Pipeline.create(options());
>>
>>    // create input PCollection
>>    PCollection<String> input = pipeline.apply(
>>        Create.of(words)).setTypeDescriptor(TypeDescriptor.of(String.class));
>>
>>    // holder of mapping between Euphoria and Beam
>>    BeamFlow flow = BeamFlow.create(pipeline);
>>
>>    // lift this PCollection to Euphoria API
>>    Dataset<String> dataset = flow.wrapped(input);
>>
>>    // do something with the data
>>    Dataset<Pair<String, Long>> output = CountByKey.of(dataset)
>>        .keyBy(e -> e)
>>        .output();
>>
>>    // convert Euphoria API back to Beam
>>    PCollection<Pair<String, Long>> beamOut = flow.unwrapped(output);
>>
>>    // do whatever with the resulting PCollection
>>    PAssert.that(beamOut)
>>        .containsInAnyOrder(
>>            Pair.of("hi", 4L),
>>            Pair.of("there", 1L),
>>            Pair.of("sue", 2L),
>>            Pair.of("ZOW", 1L),
>>            Pair.of("bob", 2L));
>>
>>    // run, forrest, run
>>    pipeline.run();
>> I'm aware that this is not the "stream" API this thread was about, but
>> Euphoria also has a "fluent" package -
>> https://github.com/seznam/euphoria/tree/master/euphoria-fluent. This is
>> by no means a complete or production ready API, but it could help solve
>> this dichotomy between whether to keep the Beam API as is, or introduce some
>> more user-friendly API. As I said, there is work in progress on this, so if
>> anyone could spare some time and give us a helping hand with this porting, it
>> would be just awesome. :-)
>>
>> Jan
>>
>>
>> On 03/13/2018 07:06 PM, Romain Manni-Bucau wrote:
>>
>> Yep
>>
>> I know the rationale and it makes sense, but it also raises the entry
>> barrier for users and is not that smooth in IDEs, in particular for custom
>> code.
>>
>> So I really think it makes sense to build a user-friendly API on top of
>> the Beam core dev one.
>>
>>
>> On 13 Mar 2018 18:35, "Aljoscha Krettek" <al...@apache.org> wrote:
>>
>>>
>>> https://beam.apache.org/blog/2016/05/27/where-is-my-pcollection-dot-map.html
>>>
>>> On 11. Mar 2018, at 22:21, Romain Manni-Bucau <rm...@gmail.com>
>>> wrote:
>>>
>>>
>>>
>>> On 12 Mar 2018 00:16, "Reuven Lax" <re...@google.com> wrote:
>>>
>>> I think it would be interesting to see what a Java stream-based API
>>> would look like. As I mentioned elsewhere, we are not limited to having
>>> only one API for Beam.
>>>
>>> If I remember correctly, a Java stream API was considered for Dataflow
>>> back at the very beginning. I don't completely remember why it was
>>> rejected, but I suspect at least part of the reason might have been that
>>> Java streams were considered too new and untested back then.
>>>
>>>
>>> Coders are broken - type variables don't have bounds except Object - and
>>> reducers are not trivial to implement generally, I guess.
>>>
>>> However, being close to this API can help a lot, so +1 to try to have a
>>> Java DSL on top of the current API. Would also be neat to integrate it with
>>> CompletionStage :).
>>>
>>>
>>>
>>> Reuven
>>>
>>>
>>> On Sun, Mar 11, 2018 at 2:29 PM Romain Manni-Bucau <
>>> rmannibucau@gmail.com> wrote:
>>>
>>>>
>>>>
>>>> On 11 Mar 2018 21:18, "Jean-Baptiste Onofré" <jb...@nanthrax.net>
>>>> wrote:
>>>>
>>>> Hi Romain,
>>>>
>>>> I remember we discussed the way to express pipelines a while
>>>> ago.
>>>>
>>>> I was a fan of a "DSL" similar to the one we have in Camel: instead of
>>>> using apply(), use a dedicated form (like .map(), .reduce(), etc.; AFAIR,
>>>> it's the approach in Flume).
>>>> However, we agreed that apply() syntax gives a more flexible approach.
>>>>
>>>> Using Java Stream is interesting but I'm afraid we would have the same
>>>> issue as the one we identified discussing "fluent Java SDK". However, we
>>>> can have a Stream API DSL on top of the SDK IMHO.
>>>>
>>>>
>>>> Agreed, and a Beam stream interface (copying the JDK API but making
>>>> lambdas serializable to avoid the need for a cast).
>>>>
>>>> On my side I think it enables users to discover the API. If you check my
>>>> PoC impl you quickly see the steps needed to do simple things like a map,
>>>> which is a first-class citizen.
>>>>
>>>> Also curious if we could implement reduce with the pipeline result, i.e.
>>>> get an output of a batch from the runner (client) JVM. I see how to do it
>>>> for longs - with metrics - but not for collect().
>>>>
>>>>
>>>> Regards
>>>> JB
>>>>
>>>>
>>>> On 11/03/2018 19:46, Romain Manni-Bucau wrote:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> don't know if you already experienced using java Stream API as a
>>>>> replacement for pipeline API but did some tests:
>>>>> https://github.com/rmannibucau/jbeam
>>>>>
>>>>> It is far from complete but already shows where it fails (Beam
>>>>> doesn't have a way to reduce in the caller machine for instance, coder
>>>>> handling is not that trivial, lambdas are not working well with the
>>>>> default Stream API, etc.).
>>>>>
>>>>> However it is interesting to see that having such an API is pretty
>>>>> natural compared to the pipeline API,
>>>>> so I wonder if Beam should work on its own Stream API (with surely
>>>>> another name for obvious reasons ;)).
>>>>>
>>>>> Romain Manni-Bucau
>>>>> @rmannibucau <https://twitter.com/rmannibucau> | Blog <
>>>>> https://rmannibucau.metawerx.net/> | Old Blog <
>>>>> http://rmannibucau.wordpress.com> | Github <
>>>>> https://github.com/rmannibucau> | LinkedIn <
>>>>> https://www.linkedin.com/in/rmannibucau> | Book <
>>>>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>>>>> >
>>>>>
>>>>
>>>>
>>>
>>>
>>
>

Re: (java) stream & beam?

Posted by Romain Manni-Bucau <rm...@gmail.com>.
Hi Jan,

The wrapping is almost exactly what I had in mind (I would pass the
expected Class to support a bit more like in most jre or javax API but
that's a detail) but I would really try to align it on java stream just to
keep the dev comfortable:
https://github.com/hazelcast/hazelcast-jet/blob/9c4ea86a59ae3b899498f389b5459d67c2b4cdcd/hazelcast-jet-core/src/main/java/com/hazelcast/jet/pipeline/StreamStage.java


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-03-14 9:03 GMT+01:00 Jan Lukavský <je...@seznam.cz>:

> Hi all,
>
> there are actually some steps taken in this direction - a few emails already
> went to this channel about donation of Euphoria API (
> https://github.com/seznam/euphoria) to Apache Beam. SGA has already been
> signed, currently there is work in progress for porting all Euphoria's
> features to Beam. The idea is that Euphoria could be this "user friendly"
> layer on top of Beam. In our proof-of-concept this works like this:
>
>    // create input
>    String raw = "hi there hi hi sue bob hi sue ZOW bob";
>    List<String> words = Arrays.asList(raw.split(" "));
>
>    Pipeline pipeline = Pipeline.create(options());
>
>    // create input PCollection
>    PCollection<String> input = pipeline.apply(
>        Create.of(words)).setTypeDescriptor(TypeDescriptor.of(String.class));
>
>    // holder of mapping between Euphoria and Beam
>    BeamFlow flow = BeamFlow.create(pipeline);
>
>    // lift this PCollection to Euphoria API
>    Dataset<String> dataset = flow.wrapped(input);
>
>    // do something with the data
>    Dataset<Pair<String, Long>> output = CountByKey.of(dataset)
>        .keyBy(e -> e)
>        .output();
>
>    // convert Euphoria API back to Beam
>    PCollection<Pair<String, Long>> beamOut = flow.unwrapped(output);
>
>    // do whatever with the resulting PCollection
>    PAssert.that(beamOut)
>        .containsInAnyOrder(
>            Pair.of("hi", 4L),
>            Pair.of("there", 1L),
>            Pair.of("sue", 2L),
>            Pair.of("ZOW", 1L),
>            Pair.of("bob", 2L));
>
>    // run, forrest, run
>    pipeline.run();
> I'm aware that this is not the "stream" API this thread was about, but
> Euphoria also has a "fluent" package -
> https://github.com/seznam/euphoria/tree/master/euphoria-fluent. This is by no means a complete or
> production ready API, but it could help solve this dichotomy between
> whether to keep the Beam API as is, or introduce some more user-friendly API. As
> I said, there is work in progress on this, so if anyone could spare some
> time and give us a helping hand with this porting, it would be just awesome.
> :-)
>
> Jan
>
>
> On 03/13/2018 07:06 PM, Romain Manni-Bucau wrote:
>
> Yep
>
> I know the rationale and it makes sense, but it also raises the entry
> barrier for users and is not that smooth in IDEs, in particular for custom
> code.
>
> So I really think it makes sense to build a user-friendly API on top of
> the Beam core dev one.
>
>
> On 13 Mar 2018 18:35, "Aljoscha Krettek" <al...@apache.org> wrote:
>
>> https://beam.apache.org/blog/2016/05/27/where-is-my-pcollection-dot-map.html
>>
>> On 11. Mar 2018, at 22:21, Romain Manni-Bucau <rm...@gmail.com>
>> wrote:
>>
>>
>>
>> On 12 Mar 2018 00:16, "Reuven Lax" <re...@google.com> wrote:
>>
>> I think it would be interesting to see what a Java stream-based API would
>> look like. As I mentioned elsewhere, we are not limited to having only one
>> API for Beam.
>>
>> If I remember correctly, a Java stream API was considered for Dataflow
>> back at the very beginning. I don't completely remember why it was
>> rejected, but I suspect at least part of the reason might have been that
>> Java streams were considered too new and untested back then.
>>
>>
>> Coders are broken - type variables don't have bounds except Object - and
>> reducers are not trivial to implement generally, I guess.
>>
>> However, being close to this API can help a lot, so +1 to try to have a
>> Java DSL on top of the current API. Would also be neat to integrate it with
>> CompletionStage :).
>>
>>
>>
>> Reuven
>>
>>
>> On Sun, Mar 11, 2018 at 2:29 PM Romain Manni-Bucau <rm...@gmail.com>
>> wrote:
>>
>>>
>>>
>>> On 11 Mar 2018 21:18, "Jean-Baptiste Onofré" <jb...@nanthrax.net>
>>> wrote:
>>>
>>> Hi Romain,
>>>
>>> I remember we discussed the way to express pipelines a while ago.
>>>
>>> I was a fan of a "DSL" similar to the one we have in Camel: instead of
>>> using apply(), use a dedicated form (like .map(), .reduce(), etc.; AFAIR,
>>> it's the approach in Flume).
>>> However, we agreed that apply() syntax gives a more flexible approach.
>>>
>>> Using Java Stream is interesting but I'm afraid we would have the same
>>> issue as the one we identified discussing "fluent Java SDK". However, we
>>> can have a Stream API DSL on top of the SDK IMHO.
>>>
>>>
>>> Agreed, and a Beam stream interface (copying the JDK API but making
>>> lambdas serializable to avoid the need for a cast).
>>>
>>> On my side I think it enables users to discover the API. If you check my
>>> PoC impl you quickly see the steps needed to do simple things like a map,
>>> which is a first-class citizen.
>>>
>>> Also curious if we could implement reduce with the pipeline result, i.e.
>>> get an output of a batch from the runner (client) JVM. I see how to do it
>>> for longs - with metrics - but not for collect().
>>>
>>>
>>> Regards
>>> JB
>>>
>>>
>>> On 11/03/2018 19:46, Romain Manni-Bucau wrote:
>>>
>>>> Hi guys,
>>>>
>>>> don't know if you already experienced using java Stream API as a
>>>> replacement for pipeline API but did some tests:
>>>> https://github.com/rmannibucau/jbeam
>>>>
>>>> It is far from complete but already shows where it fails (Beam doesn't
>>>> have a way to reduce in the caller machine for instance, coder handling is
>>>> not that trivial, lambdas are not working well with the default Stream API,
>>>> etc.).
>>>>
>>>> However it is interesting to see that having such an API is pretty
>>>> natural compared to the pipeline API,
>>>> so I wonder if Beam should work on its own Stream API (with surely
>>>> another name for obvious reasons ;)).
>>>>
>>>> Romain Manni-Bucau
>>>> @rmannibucau <https://twitter.com/rmannibucau> | Blog <
>>>> https://rmannibucau.metawerx.net/> | Old Blog <
>>>> http://rmannibucau.wordpress.com> | Github <
>>>> https://github.com/rmannibucau> | LinkedIn <
>>>> https://www.linkedin.com/in/rmannibucau> | Book <
>>>> https://www.packtpub.com/application-development/java-ee-8-
>>>> high-performance>
>>>>
>>>
>>>
>>
>>
>

Re: (java) stream & beam?

Posted by Jan Lukavský <je...@seznam.cz>.
Hi all,

there are actually some steps taken in this direction - a few emails
already went to this channel about donation of Euphoria API 
(https://github.com/seznam/euphoria) to Apache Beam. SGA has already 
been signed, currently there is work in progress for porting all 
Euphoria's features to Beam. The idea is that Euphoria could be this 
"user friendly" layer on top of Beam. In our proof-of-concept this works 
like this:

    // create input
    String raw = "hi there hi hi sue bob hi sue ZOW bob";
    List<String> words = Arrays.asList(raw.split(" "));

    Pipeline pipeline = Pipeline.create(options());

    // create input PCollection
    PCollection<String> input = pipeline.apply(
        Create.of(words)).setTypeDescriptor(TypeDescriptor.of(String.class));

    // holder of mapping between Euphoria and Beam
    BeamFlow flow = BeamFlow.create(pipeline);

    // lift this PCollection to Euphoria API
    Dataset<String> dataset = flow.wrapped(input);

    // do something with the data
    Dataset<Pair<String, Long>> output = CountByKey.of(dataset)
        .keyBy(e -> e)
        .output();

    // convert Euphoria API back to Beam
    PCollection<Pair<String, Long>> beamOut = flow.unwrapped(output);

    // do whatever with the resulting PCollection
    PAssert.that(beamOut)
        .containsInAnyOrder(
            Pair.of("hi", 4L),
            Pair.of("there", 1L),
            Pair.of("sue", 2L),
            Pair.of("ZOW", 1L),
            Pair.of("bob", 2L));

    // run, forrest, run
    pipeline.run();

I'm aware that this is not the "stream" API this thread was about, but 
Euphoria also has a "fluent" package - 
https://github.com/seznam/euphoria/tree/master/euphoria-fluent. This is 
by no means a complete or production ready API, but it could help solve 
this dichotomy between whether to keep the Beam API as is, or introduce some
more user-friendly API. As I said, there is work in progress on this, so
if anyone could spare some time and give us a helping hand with this
porting, it would be just awesome. :-)

Jan

On 03/13/2018 07:06 PM, Romain Manni-Bucau wrote:
> Yep
>
> I know the rationale and it makes sense, but it also raises the entry
> barrier for users and is not that smooth in IDEs, in particular
> for custom code.
>
> So I really think it makes sense to build a user-friendly API on top
> of the Beam core dev one.
>
>
> On 13 Mar 2018 18:35, "Aljoscha Krettek" <aljoscha@apache.org
> <ma...@apache.org>> wrote:
>
>     https://beam.apache.org/blog/2016/05/27/where-is-my-pcollection-dot-map.html
>     <https://beam.apache.org/blog/2016/05/27/where-is-my-pcollection-dot-map.html>
>
>>     On 11. Mar 2018, at 22:21, Romain Manni-Bucau
>>     <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
>>
>>
>>
>>     On 12 Mar 2018 00:16, "Reuven Lax" <relax@google.com
>>     <ma...@google.com>> wrote:
>>
>>         I think it would be interesting to see what a Java
>>         stream-based API would look like. As I mentioned elsewhere,
>>         we are not limited to having only one API for Beam.
>>
>>         If I remember correctly, a Java stream API was considered for
>>         Dataflow back at the very beginning. I don't completely
>>         remember why it was rejected, but I suspect at least part of
>>         the reason might have been that Java streams were considered
>>         too new and untested back then.
>>
>>
>>     Coders are broken - type variables don't have bounds except Object
>>     - and reducers are not trivial to implement generally, I guess.
>>
>>     However, being close to this API can help a lot, so +1 to try to
>>     have a Java DSL on top of the current API. Would also be neat to
>>     integrate it with CompletionStage :).
>>
>>
>>
>>         Reuven
>>
>>
>>         On Sun, Mar 11, 2018 at 2:29 PM Romain Manni-Bucau
>>         <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
>>
>>
>>
>>             On 11 Mar 2018 21:18, "Jean-Baptiste Onofré"
>>             <jb@nanthrax.net <ma...@nanthrax.net>> wrote:
>>
>>                 Hi Romain,
>>
>>                 I remember we discussed the way to express
>>                 pipelines a while ago.
>>
>>                 I was a fan of a "DSL" similar to the one we have in
>>                 Camel: instead of using apply(), use a dedicated form
>>                 (like .map(), .reduce(), etc.; AFAIR, it's the
>>                 approach in Flume).
>>                 However, we agreed that apply() syntax gives a more
>>                 flexible approach.
>>
>>                 Using Java Stream is interesting but I'm afraid we
>>                 would have the same issue as the one we identified
>>                 discussing "fluent Java SDK". However, we can have a
>>                 Stream API DSL on top of the SDK IMHO.
>>
>>
>>             Agreed, and a Beam stream interface (copying the JDK API
>>             but making lambdas serializable to avoid the need for a cast).
>>
>>             On my side I think it enables users to discover the API.
>>             If you check my PoC impl you quickly see the steps needed
>>             to do simple things like a map, which is a first-class citizen.
>>
>>             Also curious if we could implement reduce with the pipeline
>>             result, i.e. get an output of a batch from the runner (client)
>>             JVM. I see how to do it for longs - with metrics - but not for
>>             collect().
>>
>>
>>                 Regards
>>                 JB
>>
>>
>>                 On 11/03/2018 19:46, Romain Manni-Bucau wrote:
>>
>>                     Hi guys,
>>
>>                     don't know if you already experienced using java
>>                     Stream API as a replacement for pipeline API but
>>                     did some tests:
>>                     https://github.com/rmannibucau/jbeam
>>                     <https://github.com/rmannibucau/jbeam>
>>
>>                     It is far from complete but already shows where
>>                     it fails (Beam doesn't have a way to reduce in
>>                     the caller machine for instance, coder handling
>>                     is not that trivial, lambdas are not working well
>>                     with the default Stream API, etc.).
>>
>>                     However it is interesting to see that having such
>>                     an API is pretty natural compared to the pipeline API,
>>                     so I wonder if Beam should work on its own Stream
>>                     API (with surely another name for obvious reasons
>>                     ;)).
>>
>>                     Romain Manni-Bucau
>>                     @rmannibucau <https://twitter.com/rmannibucau
>>                     <https://twitter.com/rmannibucau>> | Blog
>>                     <https://rmannibucau.metawerx.net/
>>                     <https://rmannibucau.metawerx.net/>> | Old Blog
>>                     <http://rmannibucau.wordpress.com
>>                     <http://rmannibucau.wordpress.com/>> | Github
>>                     <https://github.com/rmannibucau
>>                     <https://github.com/rmannibucau>> | LinkedIn
>>                     <https://www.linkedin.com/in/rmannibucau
>>                     <https://www.linkedin.com/in/rmannibucau>> | Book
>>                     <https://www.packtpub.com/application-development/java-ee-8-high-performance
>>                     <https://www.packtpub.com/application-development/java-ee-8-high-performance>>
>>
>>
>>
>


Re: (java) stream & beam?

Posted by Romain Manni-Bucau <rm...@gmail.com>.
Yep

I know the rationale and it makes sense, but it also raises the entry
barrier for users and is not that smooth in IDEs, in particular for custom
code.

So I really think it makes sense to build a user-friendly API on top of
the Beam core dev one.


On 13 Mar 2018 18:35, "Aljoscha Krettek" <al...@apache.org> wrote:

> https://beam.apache.org/blog/2016/05/27/where-is-my-pcollection-dot-map.html
>
> On 11. Mar 2018, at 22:21, Romain Manni-Bucau <rm...@gmail.com>
> wrote:
>
>
>
> On 12 Mar 2018 00:16, "Reuven Lax" <re...@google.com> wrote:
>
> I think it would be interesting to see what a Java stream-based API would
> look like. As I mentioned elsewhere, we are not limited to having only one
> API for Beam.
>
> If I remember correctly, a Java stream API was considered for Dataflow
> back at the very beginning. I don't completely remember why it was
> rejected, but I suspect at least part of the reason might have been that
> Java streams were considered too new and untested back then.
>
>
> Coders are broken - type variables don't have bounds except Object - and
> reducers are not trivial to implement generally, I guess.
>
> However, being close to this API can help a lot, so +1 to try to have a
> Java DSL on top of the current API. Would also be neat to integrate it with
> CompletionStage :).
>
>
>
> Reuven
>
>
> On Sun, Mar 11, 2018 at 2:29 PM Romain Manni-Bucau <rm...@gmail.com>
> wrote:
>
>>
>>
>> On 11 Mar 2018 21:18, "Jean-Baptiste Onofré" <jb...@nanthrax.net> wrote:
>>
>> Hi Romain,
>>
>> I remember we discussed the way to express pipelines a while ago.
>>
>> I was a fan of a "DSL" similar to the one we have in Camel: instead of
>> using apply(), use a dedicated form (like .map(), .reduce(), etc.; AFAIR,
>> it's the approach in Flume).
>> However, we agreed that apply() syntax gives a more flexible approach.
>>
>> Using Java Stream is interesting but I'm afraid we would have the same
>> issue as the one we identified discussing "fluent Java SDK". However, we
>> can have a Stream API DSL on top of the SDK IMHO.
>>
>>
>> Agreed, and a Beam stream interface (copying the JDK API but making
>> lambdas serializable to avoid the need for a cast).
>>
>> On my side I think it enables users to discover the API. If you check my
>> PoC impl you quickly see the steps needed to do simple things like a map,
>> which is a first-class citizen.
>>
>> Also curious if we could implement reduce with the pipeline result, i.e.
>> get an output of a batch from the runner (client) JVM. I see how to do it
>> for longs - with metrics - but not for collect().
>>
>>
>> Regards
>> JB
>>
>>
>> On 11/03/2018 19:46, Romain Manni-Bucau wrote:
>>
>>> Hi guys,
>>>
>>> don't know if you already experienced using java Stream API as a
>>> replacement for pipeline API but did some tests:
>>> https://github.com/rmannibucau/jbeam
>>>
>>> It is far from complete but already shows where it fails (Beam doesn't
>>> have a way to reduce in the caller machine for instance, coder handling is
>>> not that trivial, lambdas are not working well with the default Stream API,
>>> etc.).
>>>
>>> However it is interesting to see that having such an API is pretty
>>> natural compared to the pipeline API,
>>> so I wonder if Beam should work on its own Stream API (with surely another
>>> name for obvious reasons ;)).
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau <https://twitter.com/rmannibucau> | Blog <
>>> https://rmannibucau.metawerx.net/> | Old Blog <
>>> http://rmannibucau.wordpress.com> | Github <
>>> https://github.com/rmannibucau> | LinkedIn <
>>> https://www.linkedin.com/in/rmannibucau> | Book <
>>> https://www.packtpub.com/application-development/java-ee-8-
>>> high-performance>
>>>
>>
>>
>
>

Re: (java) stream & beam?

Posted by Aljoscha Krettek <al...@apache.org>.
https://beam.apache.org/blog/2016/05/27/where-is-my-pcollection-dot-map.html

> On 11. Mar 2018, at 22:21, Romain Manni-Bucau <rm...@gmail.com> wrote:
> 
> 
> 
> On 12 Mar 2018 00:16, "Reuven Lax" <relax@google.com <ma...@google.com>> wrote:
> I think it would be interesting to see what a Java stream-based API would look like. As I mentioned elsewhere, we are not limited to having only one API for Beam.
> 
> If I remember correctly, a Java stream API was considered for Dataflow back at the very beginning. I don't completely remember why it was rejected, but I suspect at least part of the reason might have been that Java streams were considered too new and untested back then.
> 
> Coders are broken - type variables don't have bounds except Object - and reducers are not trivial to implement generally, I guess.
> 
> However, being close to this API can help a lot, so +1 to try to have a Java DSL on top of the current API. Would also be neat to integrate it with CompletionStage :).
> 
> 
> 
> Reuven
> 
> 
> On Sun, Mar 11, 2018 at 2:29 PM Romain Manni-Bucau <rmannibucau@gmail.com <ma...@gmail.com>> wrote:
> 
> 
> On 11 Mar 2018 21:18, "Jean-Baptiste Onofré" <jb@nanthrax.net <ma...@nanthrax.net>> wrote:
> Hi Romain,
> 
> I remember we discussed the way to express pipelines a while ago.
> 
> I was a fan of a "DSL" similar to the one we have in Camel: instead of using apply(), use a dedicated form (like .map(), .reduce(), etc.; AFAIR, it's the approach in Flume).
> However, we agreed that apply() syntax gives a more flexible approach.
> 
> Using Java Stream is interesting but I'm afraid we would have the same issue as the one we identified discussing "fluent Java SDK". However, we can have a Stream API DSL on top of the SDK IMHO.
> 
> Agreed, and a Beam stream interface (copying the JDK API but making lambdas serializable to avoid the need for a cast).
> 
> On my side I think it enables users to discover the API. If you check my PoC impl you quickly see the steps needed to do simple things like a map, which is a first-class citizen.
> 
> Also curious if we could implement reduce with the pipeline result, i.e. get an output of a batch from the runner (client) JVM. I see how to do it for longs - with metrics - but not for collect().
> 
> 
> Regards
> JB
> 
> 
> On 11/03/2018 19:46, Romain Manni-Bucau wrote:
> Hi guys,
> 
> don't know if you already experienced using java Stream API as a replacement for pipeline API but did some tests: https://github.com/rmannibucau/jbeam <https://github.com/rmannibucau/jbeam>
> 
> It is far from complete but already shows where it fails (Beam doesn't have a way to reduce in the caller machine for instance, coder handling is not that trivial, lambdas are not working well with the default Stream API, etc.).
> 
> However it is interesting to see that having such an API is pretty natural compared to the pipeline API,
> so I wonder if Beam should work on its own Stream API (with surely another name for obvious reasons ;)).
> 
> Romain Manni-Bucau
> @rmannibucau <https://twitter.com/rmannibucau <https://twitter.com/rmannibucau>> | Blog <https://rmannibucau.metawerx.net/ <https://rmannibucau.metawerx.net/>> | Old Blog <http://rmannibucau.wordpress.com <http://rmannibucau.wordpress.com/>> | Github <https://github.com/rmannibucau <https://github.com/rmannibucau>> | LinkedIn <https://www.linkedin.com/in/rmannibucau <https://www.linkedin.com/in/rmannibucau>> | Book <https://www.packtpub.com/application-development/java-ee-8-high-performance <https://www.packtpub.com/application-development/java-ee-8-high-performance>>
> 
> 


Re: (java) stream & beam?

Posted by Romain Manni-Bucau <rm...@gmail.com>.
On 12 Mar 2018 00:16, "Reuven Lax" <re...@google.com> wrote:

I think it would be interesting to see what a Java stream-based API would
look like. As I mentioned elsewhere, we are not limited to having only one
API for Beam.

If I remember correctly, a Java stream API was considered for Dataflow back
at the very beginning. I don't completely remember why it was rejected, but
I suspect at least part of the reason might have been that Java streams
were considered too new and untested back then.


Coders are broken - type variables don't have bounds except Object - and
reducers are not trivial to implement generally, I guess.

However, being close to this API can help a lot, so +1 to try to have a
Java DSL on top of the current API. Would also be neat to integrate it with
CompletionStage :).



Reuven


On Sun, Mar 11, 2018 at 2:29 PM Romain Manni-Bucau <rm...@gmail.com>
wrote:

>
>
> On 11 Mar 2018 21:18, "Jean-Baptiste Onofré" <jb...@nanthrax.net> wrote:
>
> Hi Romain,
>
> I remember we discussed the way to express pipelines a while ago.
>
> I was a fan of a "DSL" similar to the one we have in Camel: instead of
> using apply(), use a dedicated form (like .map(), .reduce(), etc.; AFAIR,
> it's the approach in Flume).
> However, we agreed that apply() syntax gives a more flexible approach.
>
> Using Java Stream is interesting but I'm afraid we would have the same
> issue as the one we identified discussing "fluent Java SDK". However, we
> can have a Stream API DSL on top of the SDK IMHO.
>
>
> Agreed, and a Beam stream interface (copying the JDK API but making
> lambdas serializable to avoid the need for a cast).
>
> On my side I think it enables users to discover the API. If you check my
> PoC impl you quickly see the steps needed to do simple things like a map,
> which is a first-class citizen.
>
> Also curious if we could implement reduce with the pipeline result, i.e.
> get an output of a batch from the runner (client) JVM. I see how to do it
> for longs - with metrics - but not for collect().
>
>
> Regards
> JB
>
>
> On 11/03/2018 19:46, Romain Manni-Bucau wrote:
>
>> Hi guys,
>>
>> don't know if you already experienced using java Stream API as a
>> replacement for pipeline API but did some tests:
>> https://github.com/rmannibucau/jbeam
>>
>> It is far from complete but already shows where it fails (Beam doesn't
>> have a way to reduce in the caller machine for instance, coder handling is
>> not that trivial, lambdas are not working well with the default Stream API,
>> etc.).
>>
>> However it is interesting to see that having such an API is pretty
>> natural compared to the pipeline API,
>> so I wonder if Beam should work on its own Stream API (with surely another
>> name for obvious reasons ;)).
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> | Blog <
>> https://rmannibucau.metawerx.net/> | Old Blog <
>> http://rmannibucau.wordpress.com> | Github <https://github.com/
>> rmannibucau> | LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
>> <https://www.packtpub.com/application-development/java-
>> ee-8-high-performance>
>>
>
>

Re: (java) stream & beam?

Posted by Reuven Lax <re...@google.com>.
I think it would be interesting to see what a Java stream-based API would
look like. As I mentioned elsewhere, we are not limited to having only one
API for Beam.

If I remember correctly, a Java stream API was considered for Dataflow back
at the very beginning. I don't completely remember why it was rejected, but
I suspect at least part of the reason might have been that Java streams
were considered too new and untested back then.

Reuven


On Sun, Mar 11, 2018 at 2:29 PM Romain Manni-Bucau <rm...@gmail.com>
wrote:

>
>
> On 11 Mar 2018 21:18, "Jean-Baptiste Onofré" <jb...@nanthrax.net> wrote:
>
> Hi Romain,
>
> I remember we discussed the way to express pipelines a while ago.
>
> I was a fan of a "DSL" similar to the one we have in Camel: instead of
> using apply(), use a dedicated form (like .map(), .reduce(), etc.; AFAIR,
> it's the approach in Flume).
> However, we agreed that apply() syntax gives a more flexible approach.
>
> Using Java Stream is interesting but I'm afraid we would have the same
> issue as the one we identified discussing "fluent Java SDK". However, we
> can have a Stream API DSL on top of the SDK IMHO.
>
>
> Agreed, and a Beam stream interface (copying the JDK API but making
> lambdas serializable to avoid the need for a cast).
>
> On my side I think it enables users to discover the API. If you check my
> PoC impl you quickly see the steps needed to do simple things like a map,
> which is a first-class citizen.
>
> Also curious if we could implement reduce with the pipeline result, i.e.
> get an output of a batch from the runner (client) JVM. I see how to do it
> for longs - with metrics - but not for collect().
>
>
> Regards
> JB
>
>
> On 11/03/2018 19:46, Romain Manni-Bucau wrote:
>
>> Hi guys,
>>
>> don't know if you already experienced using java Stream API as a
>> replacement for pipeline API but did some tests:
>> https://github.com/rmannibucau/jbeam
>>
>> It is far from complete but already shows where it fails (Beam doesn't
>> have a way to reduce in the caller machine for instance, coder handling is
>> not that trivial, lambdas are not working well with the default Stream API,
>> etc.).
>>
>> However it is interesting to see that having such an API is pretty
>> natural compared to the pipeline API,
>> so I wonder if Beam should work on its own Stream API (with surely another
>> name for obvious reasons ;)).
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> | Blog <
>> https://rmannibucau.metawerx.net/> | Old Blog <
>> http://rmannibucau.wordpress.com> | Github <
>> https://github.com/rmannibucau> | LinkedIn <
>> https://www.linkedin.com/in/rmannibucau> | Book <
>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>> >
>>
>
>

Re: (java) stream & beam?

Posted by Romain Manni-Bucau <rm...@gmail.com>.
On 11 Mar 2018 21:18, "Jean-Baptiste Onofré" <jb...@nanthrax.net> wrote:

Hi Romain,

I remember we discussed the way to express pipelines a while ago.

I was a fan of a "DSL" similar to the one we have in Camel: instead of using
apply(), use a dedicated form (like .map(), .reduce(), etc.; AFAIR, it's the
approach in Flume).
However, we agreed that apply() syntax gives a more flexible approach.

Using Java Stream is interesting but I'm afraid we would have the same
issue as the one we identified discussing "fluent Java SDK". However, we
can have a Stream API DSL on top of the SDK IMHO.


Agreed, and a Beam stream interface (copying the JDK API but making lambdas
serializable to avoid the need for a cast).
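
(For illustration, the standard trick, sketched with hypothetical names:)

    // A functional interface extending both the JDK shape and Serializable;
    // lambdas assigned to it are serializable with no cast needed:
    public interface SerFunction<T, R>
        extends java.util.function.Function<T, R>, java.io.Serializable {}

    SerFunction<String, Integer> len = String::length;

    // versus the intersection cast required with the plain JDK type:
    java.util.function.Function<String, Integer> len2 =
        (java.util.function.Function<String, Integer> & java.io.Serializable)
            String::length;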

On my side I think it enables users to discover the API. If you check my PoC
impl you quickly see the steps needed to do simple things like a map, which
is a first-class citizen.

Also curious if we could implement reduce with the pipeline result, i.e. get
an output of a batch from the runner (client) JVM. I see how to do it for
longs - with metrics - but not for collect().
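
(Sketch of the longs-via-metrics case, using the Beam metrics query API as of
the 2.x SDK; the counter name is illustrative:)

    // After run(), a counter's value can be pulled back into the client JVM:
    PipelineResult result = pipeline.run();
    result.waitUntilFinish();
    MetricQueryResults snapshot = result.metrics().queryMetrics(
        MetricsFilter.builder()
            .addNameFilter(MetricNameFilter.named("my-ns", "my-counter"))
            .build());
    for (MetricResult<Long> counter : snapshot.counters()) {
      System.out.println(counter.committed());
    }
    // Nothing comparable exists for arbitrary elements, hence no collect().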


Regards
JB


On 11/03/2018 19:46, Romain Manni-Bucau wrote:

> Hi guys,
>
> don't know if you already experienced using java Stream API as a
> replacement for pipeline API but did some tests:
> https://github.com/rmannibucau/jbeam
>
> It is far from complete but already shows where it fails (Beam doesn't
> have a way to reduce in the caller machine for instance, coder handling is
> not that trivial, lambdas are not working well with the default Stream API,
> etc.).
>
> However it is interesting to see that having such an API is pretty natural
> compared to the pipeline API,
> so I wonder if Beam should work on its own Stream API (with surely another
> name for obvious reasons ;)).
>
> Romain Manni-Bucau
> @rmannibucau <https://twitter.com/rmannibucau> | Blog <
> https://rmannibucau.metawerx.net/> | Old Blog <
> http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
> LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book <
> https://www.packtpub.com/application-development/java-ee-8-
> high-performance>
>

Re: (java) stream & beam?

Posted by Reuven Lax <re...@google.com>.
A "fluent" API isn't completely incompatible with our current apply-based
API. We could easily add fluent member functions to PCollections which are
syntactic sugar (i.e. delegate to apply). We would need to be disciplined
though, as there will be a tendency for everyone to ask for their transform
to be added as well (this would be a lot saner in a language that supported
mixin methods). This does have some advantages in cleaner user code and
more discoverable transform (i.e. IDE autocomplete and dropdowns work).
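
To make the sugar concrete, a hypothetical wrapper (not real Beam API)
where each fluent call is pure delegation to apply():

import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptor;

// Hypothetical sugar: fluent calls that simply delegate to apply().
final class FluentCollection<T> {
  private final PCollection<T> delegate;

  FluentCollection(PCollection<T> delegate) {
    this.delegate = delegate;
  }

  <O> FluentCollection<O> map(TypeDescriptor<O> outType,
                              SerializableFunction<T, O> fn) {
    return new FluentCollection<>(
        delegate.apply(MapElements.into(outType).via(fn)));
  }

  // escape hatch back to the raw SDK
  PCollection<T> underlying() {
    return delegate;
  }
}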

One potential concern would be losing some type safety. E.g. today if I
have a PCollection<Long>, I can't apply GroupByKey to it - the Java type
system will only allow me to do this if I have a PCollection<KV>. If,
however, groupByKey were a method on PCollection, then we couldn't stop it
from being called.
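
Concretely (real Beam API; the commented-out line is the compile-time
rejection we would lose with a member method):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class GroupByKeyTypeSafety {
  public static void main(String[] args) {
    // not run: this sketch only illustrates the compile-time check
    Pipeline p = Pipeline.create();

    PCollection<KV<String, Long>> kvs =
        p.apply(Create.of(KV.of("a", 1L), KV.of("b", 2L)));
    // fine: the element type is KV, so GroupByKey is applicable
    kvs.apply(GroupByKey.<String, Long>create());

    PCollection<Long> longs = p.apply(Create.of(1L, 2L, 3L));
    // longs.apply(GroupByKey.<String, Long>create());  // rejected at
    // compile time: GroupByKey needs a PCollection<KV<K, V>>. A
    // longs.groupByKey() member method could only fail at runtime.
  }
}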



Re: (java) stream & beam?

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Romain,

I remember we discussed the way to express pipelines a while ago.

I was a fan of a "DSL" similar to the one we have in Camel: instead of
using apply(), use a dedicated form (like .map(), .reduce(), etc.; AFAIR,
it's the approach in Flume).
However, we agreed that the apply() syntax gives a more flexible approach.

Using the Java Stream API is interesting, but I'm afraid we would hit the
same issue as the one we identified when discussing a "fluent Java SDK".
However, we can have a Stream API DSL on top of the SDK IMHO.
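
A minimal sketch of the contrast (the apply() form uses real Beam API;
the fluent form in the comment is hypothetical, just to show how a
dedicated DSL could read):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class StyleComparison {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // today's apply() style:
    p.apply(Create.of("a", "bb", "ccc"))
     .apply(MapElements.into(TypeDescriptors.integers())
                       .via((String s) -> s.length()))
     .apply(Count.perElement());

    // hypothetical dedicated-form DSL (not real Beam API):
    // p.streamOf("a", "bb", "ccc").map(String::length).countPerElement();

    p.run().waitUntilFinish();
  }
}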

Regards
JB
