Posted to dev@beam.apache.org by Etienne Chauchot <ec...@gmail.com> on 2017/05/12 08:55:15 UTC

Re: [DISCUSSION] using NexMark for Beam

Hi guys,

I wanted to let you know that I have just submitted a PR around NexMark. 
This is a port of the NexMark queries to Beam, to be used as integration 
tests.
This can also be used for A/B testing (non-regression or performance 
comparison between two versions of the same engine or of the same runner).

This is a continuation of the previous PR (#99) from Mark Shields.
The code has changed quite a bit: some queries now use new Beam APIs, 
and there were some big refactorings. More importantly, we can now run 
all the queries on all the runners.

Nevertheless, there are still some open issues in Nexmark 
(https://github.com/iemejia/beam/issues) and in Beam upstream (see the 
issue links in https://issues.apache.org/jira/browse/BEAM-160).

I wanted to submit the PR before our (Ismaël's and my) NexMark talk at 
ApacheCon. The PR is not perfect, but it is in good enough shape to share.

Best,

Etienne



On 22/03/2017 at 04:51, Kenneth Knowles wrote:
> This is great! Having a variety of realistic-ish pipelines running on all
> runners complements the validation suite and IO IT work.
>
> If I recall, some of these involve heavy and esoteric uses of state, so
> definitely give me a ping if you hit any trouble.
>
> Kenn
>
> On Tue, Mar 21, 2017 at 9:38 AM, Etienne Chauchot <ec...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> Ismael and I are working on upgrading the Nexmark implementation for Beam.
>> See https://github.com/iemejia/beam/tree/BEAM-160-nexmark and
>> https://issues.apache.org/jira/browse/BEAM-160. We are continuing the
>> work done by Mark Shields. See https://github.com/apache/beam/pull/366
>> for the original PR.
>>
>> The PR contains queries that have a wide coverage of the Beam model and
>> that represent a realistic end user use case (some come from client
>> experience on Google Cloud Dataflow).
>>
>> So far, we have upgraded the implementation to the latest Beam snapshot,
>> and we are able to execute a good subset of the queries on the different
>> runners. To do so, we upgraded the Nexmark drivers: the direct driver
>> (upgraded from inProcessDriver) and the Flink driver, and we added a new
>> one for Spark.
>>
>> There is still a good amount of work to do, and we would like to know
>> whether you think this contribution could eventually have its place in
>> Beam.
>>
>> The benefits of having Nexmark on Beam that we have seen so far are:
>>
>> - Rich batch/streaming test
>>
>> - A-B testing of runners or runtimes (non-regression, performance
>> comparison between versions ...)
>>
>> - Integration testing (sdk/runners, runner/runtime, ...)
>>
>> - Validate the Beam capability matrix
>>
>> - It can be used as part of the ongoing PerfKit work (if there is any
>> interest).
>>
>> As a final note, we are tracking the issues in the same repo. If anyone
>> is interested in contributing or has more ideas, you are welcome :)
>>
>> Etienne
>>
>>

Re: [DISCUSSION] using NexMark for Beam

Posted by Davor Bonaci <da...@apache.org>.

Thrilled to see this!

Hopefully we'll get to automating the execution fully in Jenkins, and have
the results readily available to the whole community. (Happy to help
myself!)

On Sun, May 14, 2017 at 5:29 AM, Reuven Lax <re...@google.com.invalid>
wrote:

> Great to hear! A couple of comments:
>
> When Query 10 was written, the file-based sinks did not support unbounded
> input. Now that FileBasedSink in Beam supports windowed output files, I
> think we should just rip out the custom IO code in Query 10 and replace it
> with AvroIO - this is closer to what real Beam users will do, and it will
> also make it support HDFS.
>
> Query 10 also tests some subtle semantics around late data - notably that
> if an element from a source is not late, elements resulting from processing
> that element are not late. Essentially this is a correctness test for
> watermarks, and should apply to all runners IMO.
>
> WinningBids.java (used in Query6) uses a fairly awkward (and
> computationally expensive) custom merging window function - largely because
> Mark was trying to avoid using the state API as much as he could (at the
> time there was no public state API). IMO we should rewrite WinningBids to
> use state. This should result in both cleaner code and a more efficient
> query.
>
> Reuven
>
> On Sun, May 14, 2017 at 3:09 AM, Ismaël Mejía <ie...@gmail.com> wrote:
>
> > Hello,
> >
> > Thanks Etienne for opening the Pull Request and starting the
> > discussion for the review process. I also want to thank publicly all
> > the people that somehow contributed to this:
> >
> > - Mark Shields and the original people at Google who worked on Nexmark,
> > for contributing this in the first place.
> > - Etienne, because his work and constant help really improved the
> > status of the queries (your work on query 3 was really nice), and also
> > for the hard work of helping me test all the queries with all the
> > runners and pinging the runner maintainers for fixes.
> > - Aviem/Amit for all the help in solving the issues with the Spark
> > runner, whose support is now almost feature complete (even in
> > streaming!).
> > - Aljoscha/Jinsong for the fix to merge IntervalWindowFn and for
> > quickly adding the support for metrics.
> > - Thomas Groh and Kenneth for fixing some needed parts in the Direct
> > Runner and answering our questions on the State/Timer API.
> > - JB and the Talend crew for all the feedback and help to run in our
> > benchmark cluster.
> > - And of course the rest of the Beam community :)
> >
> > Some comments:
> >
> > - This does not need to have a feature branch since we have been
> > working on this in a fork for months now and with the stable API we
> > can simply do a traditional PR review. Of course the review is a bit
> > bigger so we expect it to take some time, but I hope we can get some
> > quick progress once FSR is out.
> >
> > - We need a hand from the Google guys. For the moment we have tested
> > all the queries on all the runners except the Dataflow runner,
> > because we don't have access to it (well, we have, but not with the
> > freedom that you guys have to run the benchmark at will). If we can
> > get some access, that would be nice; if this is not possible, it
> > would be nice if some of you could help us test/report any issues
> > on this runner.
> >
> > - We also have to decide the future of some features, this is probably
> > independent of the current PR and part of the evolution of Nexmark on
> > Beam:
> >
> > -- There are still some pending things that can be improved even after
> > the review, once in master. For example, for the moment we only have
> > synthetic sources, but the original version also took data from Pubsub.
> > We have to define the correct scope for this and, if appropriate, also
> > add other sources, e.g. Kafka, HDFS.
> >
> > -- Query 10 is really oriented to testing Google Runner/IOs specific
> > features, so we have to decide what to do with this one, maybe
> > mirroring it with Kafka/HDFS to have something equivalent in the
> > Apache world.
> >
> > This is all for now, I am really glad that this is finally happening
> > and I hope this soon gets merged.
> >
> > Ismaël
> >
> > On Fri, May 12, 2017 at 6:07 PM, Lukasz Cwik <lc...@google.com.invalid>
> > wrote:
> > > I think these are valuable enough that we should get them into
> > > apache/master
> > >
> > > On Fri, May 12, 2017 at 4:34 AM, Jean-Baptiste Onofré <jb@nanthrax.net>
> > >
> > >> Hi,
> > >>
> > >> PR or even a feature branch could work. Up to you.
> > >>
> > >> Regards
> > >> JB
> > >>
> > >>
> > >> --
> > >> Jean-Baptiste Onofré
> > >> jbonofre@apache.org
> > >> http://blog.nanthrax.net
> > >> Talend - http://www.talend.com
> > >>
> >
>

Re: [DISCUSSION] using NexMark for Beam

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.

I think adding a new query for window merging makes sense and would be easier to follow and maintain.

Regards
JB

On Sep 19, 2017, at 18:09, Reuven Lax <re...@google.com.INVALID> wrote:
>On Tue, Sep 19, 2017 at 7:29 AM, Etienne Chauchot <ec...@gmail.com>
>wrote:
>
>> Hi all,
>>
>> I'm resuming my work on Nexmark a bit, starting with some maintenance
>> on the tickets.
>>
>> @Reuven: I have some comments inline below.
>>
>> On 14/05/2017 at 14:29, Reuven Lax wrote:
>>
>>> Great to hear! A couple of comments:
>>>
>>> When Query 10 was written, the file-based sinks did not support
>>> unbounded input. Now that FileBasedSink in Beam supports windowed
>>> output files, I think we should just rip out the custom IO code in
>>> Query 10 and replace it with AvroIO - this is closer to what real
>>> Beam users will do, and it will also make it support HDFS.
>>>
>> +1: I updated this ticket: https://issues.apache.org/jira/browse/BEAM-2856
>>
>>>
>>> Query 10 also tests some subtle semantics around late data - notably
>>> that if an element from a source is not late, elements resulting from
>>> processing that element are not late. Essentially this is a correctness
>>> test for watermarks, and should apply to all runners IMO.
>>>
>> Yes, I agree, but there are some ValidatesRunner tests around this,
>> right? If not, we should create some IMHO.
>>
>>>
>>> WinningBids.java (used in Query6) uses a fairly awkward (and
>>> computationally expensive) custom merging window function - largely
>>> because Mark was trying to avoid using the state API as much as he
>>> could (at the time there was no public state API). IMO we should
>>> rewrite WinningBids to use state. This should result in both cleaner
>>> code and a more efficient query.
>>>
>> I agree that this query is a bit awkward, but it is the only one in the
>> query set that illustrates custom window merging. Query 3 already
>> illustrates the use of the state API (I migrated it to the state API
>> after Mark released it). Even though there is now a ValidatesRunner test
>> on custom window merging ([1]), I believe it could be useful to keep
>> WinningBids as it is, to serve as a benchmark of custom window merging
>> in the runners.
>>
>
>My memory was that this was an awkward use of merging windows, as the
>merge function was very expensive (building maps, etc.). As such, the
>cost of the WinningBids merge function dominated, so it really just
>served as a benchmark of how often windows were merged (i.e. merge is
>called very often in streaming runners and much less often in batch
>runners).
>
>I wonder if we're best off introducing a new query that more explicitly
>tests merging windows, with a more reasonable merging window fn.
>
>
>> WDYT?
>>
>> [1] https://github.com/apache/beam/blob/c65aca07faf7b8c4dabe6cae7b5b52286d2b25b1/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/windowing/WindowTest.java#L591
>>
>> Best,
>> Etienne
>>
>>
>>> Reuven
>>>

Re: [DISCUSSION] using NexMark for Beam

Posted by Kenneth Knowles <kl...@google.com.INVALID>.

IIRC it is also exciting in that the final window is smaller than the union
of its component windows. I think we have a JIRA open to decide if that is
even allowed.

On Wed, Sep 20, 2017 at 12:15 AM, Etienne Chauchot <ec...@gmail.com>
wrote:

> Hi,
> Indeed, the query builds a lot of maps, so it is expensive. I totally
> agree with your point: adding a query with a simple merge, such as the one
> done in the ValidatesRunner test below, is a good idea. I'll add 2
> tickets, one for the migration of WinningBids to the state API and one for
> the creation of query 13 that illustrates a simple custom window merge.
>
> Etienne
>
>

Re: [DISCUSSION] using NexMark for Beam

Posted by Etienne Chauchot <ec...@gmail.com>.

Hi
Indeed, the query builds a lot of maps, so it is expensive. I totally
agree with your point: adding a query with a simple merge, such as the
one done in the ValidatesRunner test below, is a good idea. I'll add
2 tickets, one for the migration of WinningBids to the state API and one
for the creation of query 13, which illustrates a simple custom window
merge.

Etienne

On 19/09/2017 at 18:09, Reuven Lax wrote:
> On Tue, Sep 19, 2017 at 7:29 AM, Etienne Chauchot <ec...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I'm resuming my work on Nexmark a bit, starting to do some maintenance on
>> the tickets
>>
>> @Reuven: I have some comments inline below.
>>
>> On 14/05/2017 at 14:29, Reuven Lax wrote:
>>
>>> Great to hear! A couple of comments:
>>>
>>> When Query 10 was written, the file-based sinks did not support
>>> unbounded input. Now that FileBasedSink in Beam supports windowed
>>> output files, I think we should just rip out the custom IO code in
>>> Query 10 and replace it with AvroIO - this is closer to what real Beam
>>> users will do, and it will also make it support HDFS.
>>>
>> +1: I updated this ticket https://issues.apache.org/jira/browse/BEAM-2856
>>
>>> Query 10 also tests some subtle semantics around late data - notably that
>>> if an element from a source is not late, elements resulting from
>>> processing
>>> that element are not late. Essentially this is a correctness test for
>>> watermarks, and should apply to all runners IMO.
>>>
>> Yes I agree, but there are some ValidatesRunner tests around this, right?
>> If not, we should create some IMHO.
>>
>>> WinningBids.java (used in Query6) uses a fairly awkward (and
>>> computationally expensive) custom merging window function - largely
>>> because Mark was trying to avoid using the state API as much as he
>>> could (at the time there was no public state API). IMO we should
>>> rewrite WinningBids to use state. This should result in both cleaner
>>> code and a more efficient query.
>>>
>> I agree that this query is a bit awkward. But it is the only one in the
>> query set that illustrates custom window merging. There is already query 3
>> that illustrates the use of the state API (I migrated it to use the state
>> API after Mark released it). Even if there is now a ValidatesRunner test
>> on custom window merging ([1]), I believe it could be useful to keep
>> WinningBids as it is to serve as a benchmark of custom window merging in
>> the runners.
>>
> My memory was that this was an awkward use of merging windows, as the
> merge function was very expensive (building maps, etc.). As such, the cost
> of the WinningBids merge function dominated, so it really just served as a
> benchmark of how often windows were merged (i.e. merge is called very
> often in streaming runners and much less often in batch runners).
>
> I wonder if we're best off introducing a new query that more explicitly
> tests merging windows, with a more reasonable merging window fn.

>> WDYT?
>>
>> [1] https://github.com/apache/beam/blob/c65aca07faf7b8c4dabe6cae
>> 7b5b52286d2b25b1/sdks/java/core/src/test/java/org/apache/
>> beam/sdk/transforms/windowing/WindowTest.java#L591
>>
>> Best,
>> Etienne

Re: [DISCUSSION] using NexMark for Beam

Posted by Reuven Lax <re...@google.com.INVALID>.

On Tue, Sep 19, 2017 at 7:29 AM, Etienne Chauchot <ec...@gmail.com>
wrote:

> Hi all,
>
> I'm resuming my work on Nexmark a bit, starting to do some maintenance on
> the tickets
>
> @Reuven: I have some comments inline below.
>
> On 14/05/2017 at 14:29, Reuven Lax wrote:
>
>> Great to hear! A couple of comments:
>>
>> When Query 10 was written, the file-based sinks did not support
>> unbounded input. Now that FileBasedSink in Beam supports windowed
>> output files, I think we should just rip out the custom IO code in
>> Query 10 and replace it with AvroIO - this is closer to what real Beam
>> users will do, and it will also make it support HDFS.
>>
> +1: I updated this ticket https://issues.apache.org/jira/browse/BEAM-2856
>
>>
>> Query 10 also tests some subtle semantics around late data - notably that
>> if an element from a source is not late, elements resulting from
>> processing
>> that element are not late. Essentially this is a correctness test for
>> watermarks, and should apply to all runners IMO.
>>
> Yes I agree, but there are some ValidatesRunner tests around this, right?
> If not, we should create some IMHO.
>
>>
>> WinningBids.java (used in Query6) uses a fairly awkward (and
>> computationally expensive) custom merging window function - largely
>> because Mark was trying to avoid using the state API as much as he
>> could (at the time there was no public state API). IMO we should
>> rewrite WinningBids to use state. This should result in both cleaner
>> code and a more efficient query.
>>
> I agree that this query is a bit awkward. But it is the only one in the
> query set that illustrates custom window merging. There is already query 3
> that illustrates the use of the state API (I migrated it to use the state
> API after Mark released it). Even if there is now a ValidatesRunner test on
> custom window merging ([1]), I believe it could be useful to keep
> WinningBids as it is to serve as a benchmark of custom window merging in
> the runners.
>

My memory was that this was an awkward use of merging windows, as the
merge function was very expensive (building maps, etc.). As such, the cost
of the WinningBids merge function dominated, so it really just served as a
benchmark of how often windows were merged (i.e. merge is called very
often in streaming runners and much less often in batch runners).

I wonder if we're best off introducing a new query that more explicitly
tests merging windows, with a more-reasonable merging window fn.
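
To make "a more reasonable merging window fn" concrete, here is a minimal
sketch of a cheap merge: sort the windows by start and sweep once, so each
merge call costs O(n log n) and builds no maps. This is plain Java, not the
Beam WindowFn API; the class, the Window type and the merge method are
hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration only: plain Java, not the Beam WindowFn API. */
final class SimpleWindowMerge {

  /** A half-open time interval [start, end), e.g. in milliseconds. */
  static final class Window {
    final long start;
    final long end;

    Window(long start, long end) {
      this.start = start;
      this.end = end;
    }

    /** True when the two half-open intervals share at least one instant. */
    boolean intersects(Window other) {
      return start < other.end && other.start < end;
    }

    /** Smallest window covering both this window and the other one. */
    Window span(Window other) {
      return new Window(Math.min(start, other.start), Math.max(end, other.end));
    }

    @Override
    public String toString() {
      return "[" + start + ", " + end + ")";
    }
  }

  /** Merges overlapping windows: sort by start, then a single sweep. */
  static List<Window> merge(List<Window> windows) {
    List<Window> sorted = new ArrayList<>(windows);
    sorted.sort(Comparator.comparingLong((Window w) -> w.start));
    List<Window> merged = new ArrayList<>();
    for (Window w : sorted) {
      int last = merged.size() - 1;
      if (last >= 0 && merged.get(last).intersects(w)) {
        // Overlaps the window being built: grow it instead of adding a new one.
        merged.set(last, merged.get(last).span(w));
      } else {
        merged.add(w);
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    List<Window> input =
        Arrays.asList(new Window(5, 15), new Window(0, 10), new Window(20, 30));
    System.out.println(merge(input));
  }
}
```

With a merge this cheap, a benchmark would mostly measure the runner's own
window-merging machinery rather than the cost of the merge function.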


> WDYT?
>
> [1] https://github.com/apache/beam/blob/c65aca07faf7b8c4dabe6cae
> 7b5b52286d2b25b1/sdks/java/core/src/test/java/org/apache/
> beam/sdk/transforms/windowing/WindowTest.java#L591
>
> Best,
> Etienne

Re: [DISCUSSION] using NexMark for Beam

Posted by Etienne Chauchot <ec...@gmail.com>.

Hi all,

I'm resuming my work on Nexmark a bit, starting with some maintenance
on the tickets.

@Reuven: I have some comments inline below.

On 14/05/2017 at 14:29, Reuven Lax wrote:
> Great to hear! A couple of comments:
>
> When Query 10 was written, the file-based sinks did not support unbounded
> input. Now that FileBasedSink in Beam supports windowed output files, I
> think we should just rip out the custom IO code in Query 10 and replace it
> with AvroIO - this is closer to what real Beam users will do, and it will
> also make it support HDFS.
+1: I updated this ticket https://issues.apache.org/jira/browse/BEAM-2856
>
> Query 10 also tests some subtle semantics around late data - notably that
> if an element from a source is not late, elements resulting from processing
> that element are not late. Essentially this is a correctness test for
> watermarks, and should apply to all runners IMO.
Yes I agree, but there are some ValidatesRunner tests around this, right?
If not, we should create some IMHO.
>
> WinningBids.java (used in Query6) uses a fairly awkward (and
> computationally expensive) custom merging window function - largely because
> Mark was trying to avoid using the state API as much as he could (at the
> time there was no public state API). IMO we should rewrite WinningBids to
> use state. This should result in both cleaner code and a more efficient
> query.
I agree that this query is a bit awkward. But it is the only one in the
query set that illustrates custom window merging. There is already query
3 that illustrates the use of the state API (I migrated it to use the
state API after Mark released it). Even if there is now a ValidatesRunner
test on custom window merging ([1]), I believe it could be useful to keep
WinningBids as it is to serve as a benchmark of custom window merging in
the runners.

WDYT?

[1] 
https://github.com/apache/beam/blob/c65aca07faf7b8c4dabe6cae7b5b52286d2b25b1/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/windowing/WindowTest.java#L591

Best,
Etienne

Re: [DISCUSSION] using NexMark for Beam

Posted by Reuven Lax <re...@google.com.INVALID>.

Great to hear! A couple of comments:

When Query 10 was written, the file-based sinks did not support unbounded
input. Now that FileBasedSink in Beam supports windowed output files, I
think we should just rip out the custom IO code in Query 10 and replace it
with AvroIO - this is closer to what real Beam users will do, and it will
also make it support HDFS.

Query 10 also tests some subtle semantics around late data - notably that
if an element from a source is not late, elements resulting from processing
that element are not late. Essentially this is a correctness test for
watermarks, and should apply to all runners IMO.

WinningBids.java (used in Query6) uses a fairly awkward (and
computationally expensive) custom merging window function - largely because
Mark was trying to avoid using the state API as much as he could (at the
time there was no public state API). IMO we should rewrite WinningBids to
use state. This should result in both cleaner code and a more efficient
query.
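
As a rough illustration of why state helps here, the plain-Python sketch
below keeps one small piece of state per auction (the best valid bid seen
so far) and reads it out at expiry, instead of merging windows. It is not
the Beam state API and not the existing WinningBids code. The class
`WinningBidState`, the function `winning_bids`, the tie-breaking rule
(earlier bid wins on equal price) and the rule that a bid at or after the
expiry time is rejected are all assumptions made for illustration.

```python
class WinningBidState:
    """Per-auction state: remembers only the best valid bid seen so far."""

    def __init__(self, reserve, expires):
        self.reserve = reserve
        self.expires = expires
        self.best = None  # (price, bid_time, bidder) or None

    def on_bid(self, price, bid_time, bidder):
        """Update the state with one bid; invalid bids are ignored."""
        if price < self.reserve or bid_time >= self.expires:
            return
        # Higher price wins; on equal price the earlier bid wins (assumed).
        if self.best is None or (price, -bid_time) > (self.best[0], -self.best[1]):
            self.best = (price, bid_time, bidder)

    def on_expiry(self):
        """Stand-in for a timer firing at auction expiry: emit the winner."""
        return self.best


def winning_bids(auctions, bids):
    """auctions: {auction_id: (reserve, expires)}.
    bids: iterable of (auction_id, price, bid_time, bidder).
    """
    states = {
        auction_id: WinningBidState(reserve, expires)
        for auction_id, (reserve, expires) in auctions.items()
    }
    for auction_id, price, bid_time, bidder in bids:
        state = states.get(auction_id)
        if state is not None:  # bids for unknown auctions are skipped
            state.on_bid(price, bid_time, bidder)
    return {auction_id: s.on_expiry() for auction_id, s in states.items()}
```

Each bid costs a constant amount of work and each auction stores a single
tuple, which is the kind of saving one would hope for from a state-based
rewrite.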

Reuven

On Sun, May 14, 2017 at 3:09 AM, Ismaël Mejía <ie...@gmail.com> wrote:

> Hello,
>
> Thanks Etienne for opening the Pull Request and starting the
> discussion for the review process. I also want to thank publicly all
> the people that somehow contributed to this:
>
> - Mark Shields and the original people at google who worked at nexmark
> for contributing this in the first place.
> - Etienne because his work and constant help really improved the
> status of the queries, your work on query 3 was really nice, and also
> for the hard work of helping me test all the queries with all the
> runners and ping the runner maintainers for fixes.
> - Aviem/Amit for all the help to solve the issues with the spark
> runner whose support is now almost feature complete (even in
> streaming!).
> - Aljoscha/Jinsong for the fix to merge IntervalWindowFn and for
> quickly adding the support for metrics.
> - Thomas Groh and Kenneth for fixing some needed parts in Direct
> Runner + answering our questions on the State/Timer API.
> - JB and the talend crew for all the feedback and help to run in our
> benchmark cluster.
> - And of course the rest of the Beam community :)
>
> Some comments:
>
> - This does not need to have a feature branch since we have been
> working on this in a fork for months now and with the stable API we
> can simply do a traditional PR review. Of course the review is a bit
> bigger so we expect it to take some time, but I hope we can get some
> quick progress once FSR is out.
>
> - We need a hand from the google guys, for the moment we have tested
> all the queries in all the runners, but not in the Dataflow runner
> because we don't have access to it (well we have but not with the
> freedom that you guys have to run the benchmark at will), so if we can
> get some access that would be nice or if this is not possible, it
> would be nice if some of you guys help us test/report any given issue
> on this runner,
>
> - We also have to decide the future of some features, this is probably
> independent of the current PR and part of the evolution of Nexmark on
> Beam:
>
> -- There are still some pending things that can be improved even after
> the review once in master, e.g. we have for the moment only synthetic
> sources but the original version took also data from Pubsub, we have
> to define the correct scope for this and given the case also add other
> sources, e.g. Kafka, HDFS.
>
> -- Query 10 is really oriented to testing Google Runner/IOs specific
> features, so we have to decide what to do with this one, maybe
> mirroring it with Kafka/HDFS to have something equivalent in the
> Apache world.
>
> This is all for now, I am really glad that this is finally happening
> and I hope this soon gets merged.
>
> Ismaël
>
> On Fri, May 12, 2017 at 6:07 PM, Lukasz Cwik <lc...@google.com.invalid>
> wrote:
> > I think these are valuable enough that we should get them into
> apache/master
> >
> > On Fri, May 12, 2017 at 4:34 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
> > wrote:
> >
> >> Hi,
> >>
> >> PR or even a feature branch could work. Up to you.
> >>
> >> Regards
> >> JB
> >>
> >>
> >> On 05/12/2017 10:55 AM, Etienne Chauchot wrote:
> >>
> >>> Hi guys,
> >>>
> >>> I wanted to let you know that I have just submitted a PR around
> NexMark.
> >>> This is
> >>> a port of the NexMark queries to Beam, to be used as integration tests.
> >>> This can also be used as A-B testing (no-regression or performance
> >>> comparison
> >>> between 2 versions of the same engine or of the same runner)
> >>>
> >>> This a continuation of the previous PR (#99) from Mark Shields.
> >>> The code has changed quite a bit: some queries have changed to use new
> >>> Beam APIs
> >>> and there where some big refactorings. More important, we can now run
> all
> >>> the
> >>> queries in all the runners.
> >>>
> >>> Nevertheless, there are still some open issues in Nexmark
> >>> (https://github.com/iemejia/beam/issues) and in Beam upstream (see
> issue
> >>> links
> >>> in https://issues.apache.org/jira/browse/BEAM-160)
> >>>
> >>> I wanted to submit the PR before our (Ismaël and I) NexMark talk at the
> >>> ApacheCon. The PR is not perfect but it is in a good shape to share it.
> >>>
> >>> Best,
> >>>
> >>> Etienne
> >>>
> >>>
> >>>
> >>> On 22/03/2017 at 04:51, Kenneth Knowles wrote:
> >>>
> >>>> This is great! Having a variety of realistic-ish pipelines running on
> all
> >>>> runners complements the validation suite and IO IT work.
> >>>>
> >>>> If I recall, some of these involve heavy and esoteric uses of state,
> so
> >>>> definitely give me a ping if you hit any trouble.
> >>>>
> >>>> Kenn
> >>>>
> >>>> On Tue, Mar 21, 2017 at 9:38 AM, Etienne Chauchot <
> echauchot@gmail.com>
> >>>> wrote:
> >>>>
> >>>> Hi all,
> >>>>>
> >>>>> Ismael and I are working on upgrading the Nexmark implementation for
> >>>>> Beam.
> >>>>> See https://github.com/iemejia/beam/tree/BEAM-160-nexmark and
> >>>>> https://issues.apache.org/jira/browse/BEAM-160. We are continuing
> the
> >>>>> work done by Mark Shields. See https://github.com/apache/
> beam/pull/366
> >>>>> for the original PR.
> >>>>>
> >>>>> The PR contains queries that have a wide coverage of the Beam model
> and
> >>>>> that represent a realistic end user use case (some come from client
> >>>>> experience on Google Cloud Dataflow).
> >>>>>
> >>>>> So far, we have upgraded the implementation to the latest Beam
> snapshot.
> >>>>> And we are able to execute a good subset of the queries in the
> different
> >>>>> runners. We upgraded the nexmark drivers to do so: direct driver
> >>>>> (upgraded
> >>>>> from inProcessDriver) and flink driver and we added a new one for
> spark.
> >>>>>
> >>>>> There is still a good amount of work to do and we would like to know
> if
> >>>>> you think that this contribution can have its place into Beam
> >>>>> eventually.
> >>>>>
> >>>>> The interests of having Nexmark on Beam that we have seen so far are:
> >>>>>
> >>>>> - Rich batch/streaming test
> >>>>>
> >>>>> - A-B testing of runners or runtimes (non-regression, performance
> >>>>> comparison between versions ...)
> >>>>>
> >>>>> - Integration testing (sdk/runners, runner/runtime, ...)
> >>>>>
> >>>>> - Validate beam capability matrix
> >>>>>
> >>>>> - It can be used as part of the ongoing PerfKit work (if there is any
> >>>>> interest).
> >>>>>
> >>>>> As a final note, we are tracking the issues in the same repo. If
> someone
> >>>>> is interested in contributing, or have more ideas, you are welcome :)
> >>>>>
> >>>>> Etienne
> >>>>>
> >>>>>
> >>>>>
> >>>
> >> --
> >> Jean-Baptiste Onofré
> >> jbonofre@apache.org
> >> http://blog.nanthrax.net
> >> Talend - http://www.talend.com
> >>
>

Re: [DISCUSSION] using NexMark for Beam

Posted by Ismaël Mejía <ie...@gmail.com>.

Hello,

Thanks Etienne for opening the Pull Request and starting the
discussion for the review process. I also want to publicly thank all
the people who contributed to this in some way:

- Mark Shields and the original people at Google who worked on Nexmark,
for contributing this in the first place.
- Etienne, because his work and constant help really improved the
status of the queries (your work on query 3 was really nice), and also
for the hard work of helping me test all the queries with all the
runners and pinging the runner maintainers for fixes.
- Aviem/Amit for all the help solving the issues with the Spark
runner, whose support is now almost feature complete (even in
streaming!).
- Aljoscha/Jinsong for the fix to merge IntervalWindowFn and for
quickly adding the support for metrics.
- Thomas Groh and Kenneth for fixing some needed parts in the Direct
Runner and answering our questions on the State/Timer API.
- JB and the Talend crew for all the feedback and help running on our
benchmark cluster.
- And of course the rest of the Beam community :)

Some comments:

- This does not need a feature branch, since we have been
working on this in a fork for months now, and with the stable API we
can simply do a traditional PR review. Of course the review is a bit
bigger, so we expect it to take some time, but I hope we can make some
quick progress once FSR is out.

- We need a hand from the Google guys. For the moment we have tested
all the queries in all the runners except the Dataflow runner,
because we don't have access to it (well, we have, but not with the
freedom that you guys have to run the benchmark at will). If we can
get some access, that would be nice; if this is not possible, it
would be nice if some of you could help us test and report any issues
on this runner.

- We also have to decide the future of some features. This is probably
independent of the current PR and part of the evolution of Nexmark on
Beam:

-- There are still some pending things that can be improved even after
the review, once in master. E.g. for the moment we have only synthetic
sources, but the original version also took data from Pubsub. We have
to define the correct scope for this and, if needed, also add other
sources, e.g. Kafka, HDFS.

-- Query 10 is really oriented to testing features specific to the Google
runner and IOs, so we have to decide what to do with this one, maybe
mirroring it with Kafka/HDFS to have something equivalent in the
Apache world.

This is all for now. I am really glad that this is finally happening,
and I hope this gets merged soon.

Ismaël

On Fri, May 12, 2017 at 6:07 PM, Lukasz Cwik <lc...@google.com.invalid> wrote:
> I think these are valuable enough that we should get them into apache/master
>
> On Fri, May 12, 2017 at 4:34 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
>
>> Hi,
>>
>> PR or even a feature branch could work. Up to you.
>>
>> Regards
>> JB
>>
>>
>> On 05/12/2017 10:55 AM, Etienne Chauchot wrote:
>>
>>> Hi guys,
>>>
>>> I wanted to let you know that I have just submitted a PR around NexMark.
>>> This is
>>> a port of the NexMark queries to Beam, to be used as integration tests.
>>> This can also be used as A-B testing (no-regression or performance
>>> comparison
>>> between 2 versions of the same engine or of the same runner)
>>>
>>> This a continuation of the previous PR (#99) from Mark Shields.
>>> The code has changed quite a bit: some queries have changed to use new
>>> Beam APIs
>>> and there where some big refactorings. More important, we can now run all
>>> the
>>> queries in all the runners.
>>>
>>> Nevertheless, there are still some open issues in Nexmark
>>> (https://github.com/iemejia/beam/issues) and in Beam upstream (see issue
>>> links
>>> in https://issues.apache.org/jira/browse/BEAM-160)
>>>
>>> I wanted to submit the PR before our (Ismaël and I) NexMark talk at the
>>> ApacheCon. The PR is not perfect but it is in a good shape to share it.
>>>
>>> Best,
>>>
>>> Etienne
>>>
>>>
>>>
>>> On 22/03/2017 at 04:51, Kenneth Knowles wrote:
>>>
>>>> This is great! Having a variety of realistic-ish pipelines running on all
>>>> runners complements the validation suite and IO IT work.
>>>>
>>>> If I recall, some of these involve heavy and esoteric uses of state, so
>>>> definitely give me a ping if you hit any trouble.
>>>>
>>>> Kenn
>>>>
>>>> On Tue, Mar 21, 2017 at 9:38 AM, Etienne Chauchot <ec...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi all,
>>>>>
>>>>> Ismael and I are working on upgrading the Nexmark implementation for
>>>>> Beam.
>>>>> See https://github.com/iemejia/beam/tree/BEAM-160-nexmark and
>>>>> https://issues.apache.org/jira/browse/BEAM-160. We are continuing the
>>>>> work done by Mark Shields. See https://github.com/apache/beam/pull/366
>>>>> for the original PR.
>>>>>
>>>>> The PR contains queries that have a wide coverage of the Beam model and
>>>>> that represent a realistic end user use case (some come from client
>>>>> experience on Google Cloud Dataflow).
>>>>>
>>>>> So far, we have upgraded the implementation to the latest Beam snapshot.
>>>>> And we are able to execute a good subset of the queries in the different
>>>>> runners. We upgraded the nexmark drivers to do so: direct driver
>>>>> (upgraded
>>>>> from inProcessDriver) and flink driver and we added a new one for spark.
>>>>>
>>>>> There is still a good amount of work to do and we would like to know if
>>>>> you think that this contribution can have its place into Beam
>>>>> eventually.
>>>>>
>>>>> The interests of having Nexmark on Beam that we have seen so far are:
>>>>>
>>>>> - Rich batch/streaming test
>>>>>
>>>>> - A-B testing of runners or runtimes (non-regression, performance
>>>>> comparison between versions ...)
>>>>>
>>>>> - Integration testing (sdk/runners, runner/runtime, ...)
>>>>>
>>>>> - Validate beam capability matrix
>>>>>
>>>>> - It can be used as part of the ongoing PerfKit work (if there is any
>>>>> interest).
>>>>>
>>>>> As a final note, we are tracking the issues in the same repo. If someone
>>>>> is interested in contributing, or have more ideas, you are welcome :)
>>>>>
>>>>> Etienne
>>>>>
>>>>>
>>>>>
>>>
>> --
>> Jean-Baptiste Onofré
>> jbonofre@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>

Re: [DISCUSSION] using NexMark for Beam

Posted by Lukasz Cwik <lc...@google.com.INVALID>.

I think these are valuable enough that we should get them into apache/master

On Fri, May 12, 2017 at 4:34 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Hi,
>
> PR or even a feature branch could work. Up to you.
>
> Regards
> JB
>
>
> On 05/12/2017 10:55 AM, Etienne Chauchot wrote:
>
>> Hi guys,
>>
>> I wanted to let you know that I have just submitted a PR around NexMark.
>> This is
>> a port of the NexMark queries to Beam, to be used as integration tests.
>> This can also be used as A-B testing (no-regression or performance
>> comparison
>> between 2 versions of the same engine or of the same runner)
>>
>> This a continuation of the previous PR (#99) from Mark Shields.
>> The code has changed quite a bit: some queries have changed to use new
>> Beam APIs
>> and there where some big refactorings. More important, we can now run all
>> the
>> queries in all the runners.
>>
>> Nevertheless, there are still some open issues in Nexmark
>> (https://github.com/iemejia/beam/issues) and in Beam upstream (see issue
>> links
>> in https://issues.apache.org/jira/browse/BEAM-160)
>>
>> I wanted to submit the PR before our (Ismaël and I) NexMark talk at the
>> ApacheCon. The PR is not perfect but it is in a good shape to share it.
>>
>> Best,
>>
>> Etienne
>>
>>
>>
>> On 22/03/2017 at 04:51, Kenneth Knowles wrote:
>>
>>> This is great! Having a variety of realistic-ish pipelines running on all
>>> runners complements the validation suite and IO IT work.
>>>
>>> If I recall, some of these involve heavy and esoteric uses of state, so
>>> definitely give me a ping if you hit any trouble.
>>>
>>> Kenn
>>>
>>> On Tue, Mar 21, 2017 at 9:38 AM, Etienne Chauchot <ec...@gmail.com>
>>> wrote:
>>>
>>> Hi all,
>>>>
>>>> Ismael and I are working on upgrading the Nexmark implementation for
>>>> Beam.
>>>> See https://github.com/iemejia/beam/tree/BEAM-160-nexmark and
>>>> https://issues.apache.org/jira/browse/BEAM-160. We are continuing the
>>>> work done by Mark Shields. See https://github.com/apache/beam/pull/366
>>>> for the original PR.
>>>>
>>>> The PR contains queries that have a wide coverage of the Beam model and
>>>> that represent a realistic end user use case (some come from client
>>>> experience on Google Cloud Dataflow).
>>>>
>>>> So far, we have upgraded the implementation to the latest Beam snapshot.
>>>> And we are able to execute a good subset of the queries in the different
>>>> runners. We upgraded the nexmark drivers to do so: direct driver
>>>> (upgraded
>>>> from inProcessDriver) and flink driver and we added a new one for spark.
>>>>
>>>> There is still a good amount of work to do and we would like to know if
>>>> you think that this contribution can have its place into Beam
>>>> eventually.
>>>>
>>>> The interests of having Nexmark on Beam that we have seen so far are:
>>>>
>>>> - Rich batch/streaming test
>>>>
>>>> - A-B testing of runners or runtimes (non-regression, performance
>>>> comparison between versions ...)
>>>>
>>>> - Integration testing (sdk/runners, runner/runtime, ...)
>>>>
>>>> - Validate beam capability matrix
>>>>
>>>> - It can be used as part of the ongoing PerfKit work (if there is any
>>>> interest).
>>>>
>>>> As a final note, we are tracking the issues in the same repo. If someone
>>>> is interested in contributing, or have more ideas, you are welcome :)
>>>>
>>>> Etienne
>>>>
>>>>
>>>>
>>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

Re: [DISCUSSION] using NexMark for Beam

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.

Hi,

PR or even a feature branch could work. Up to you.

Regards
JB

On 05/12/2017 10:55 AM, Etienne Chauchot wrote:
> Hi guys,
>
> I wanted to let you know that I have just submitted a PR around NexMark. This is
> a port of the NexMark queries to Beam, to be used as integration tests.
> This can also be used as A-B testing (no-regression or performance comparison
> between 2 versions of the same engine or of the same runner)
>
> This a continuation of the previous PR (#99) from Mark Shields.
> The code has changed quite a bit: some queries have changed to use new Beam APIs
> and there where some big refactorings. More important, we can now run all the
> queries in all the runners.
>
> Nevertheless, there are still some open issues in Nexmark
> (https://github.com/iemejia/beam/issues) and in Beam upstream (see issue links
> in https://issues.apache.org/jira/browse/BEAM-160)
>
> I wanted to submit the PR before our (Ismaël and I) NexMark talk at the
> ApacheCon. The PR is not perfect but it is in a good shape to share it.
>
> Best,
>
> Etienne
>
>
>
> On 22/03/2017 at 04:51, Kenneth Knowles wrote:
>> This is great! Having a variety of realistic-ish pipelines running on all
>> runners complements the validation suite and IO IT work.
>>
>> If I recall, some of these involve heavy and esoteric uses of state, so
>> definitely give me a ping if you hit any trouble.
>>
>> Kenn
>>
>> On Tue, Mar 21, 2017 at 9:38 AM, Etienne Chauchot <ec...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> Ismael and I are working on upgrading the Nexmark implementation for Beam.
>>> See https://github.com/iemejia/beam/tree/BEAM-160-nexmark and
>>> https://issues.apache.org/jira/browse/BEAM-160. We are continuing the
>>> work done by Mark Shields. See https://github.com/apache/beam/pull/366
>>> for the original PR.
>>>
>>> The PR contains queries that have a wide coverage of the Beam model and
>>> that represent a realistic end user use case (some come from client
>>> experience on Google Cloud Dataflow).
>>>
>>> So far, we have upgraded the implementation to the latest Beam snapshot.
>>> And we are able to execute a good subset of the queries in the different
>>> runners. We upgraded the nexmark drivers to do so: direct driver (upgraded
>>> from inProcessDriver) and flink driver and we added a new one for spark.
>>>
>>> There is still a good amount of work to do and we would like to know if
>>> you think that this contribution can have its place into Beam eventually.
>>>
>>> The interests of having Nexmark on Beam that we have seen so far are:
>>>
>>> - Rich batch/streaming test
>>>
>>> - A-B testing of runners or runtimes (non-regression, performance
>>> comparison between versions ...)
>>>
>>> - Integration testing (sdk/runners, runner/runtime, ...)
>>>
>>> - Validate beam capability matrix
>>>
>>> - It can be used as part of the ongoing PerfKit work (if there is any
>>> interest).
>>>
>>> As a final note, we are tracking the issues in the same repo. If someone
>>> is interested in contributing, or have more ideas, you are welcome :)
>>>
>>> Etienne
>>>
>>>
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: [DISCUSSION] using NexMark for Beam

Posted by Etienne Chauchot <ec...@gmail.com>.

@Reuven,

Tell me if I can help with that.

Etienne

On 15/09/2017 at 06:44, Reuven Lax wrote:
> It's being worked on. Turns out there are some modifications still needed
> to the NexMark queries.
>
> Reuven
>
> On Thu, Sep 14, 2017 at 9:33 PM, Pei HE <pe...@apache.org> wrote:
>
>> Could any Googlers help to run NexMark on Dataflow streaming and share the
>> numbers with the community?
>> --
>> Pei
>>
>> On Fri, Aug 25, 2017 at 11:28 PM, Lukasz Cwik <lc...@google.com.invalid>
>> wrote:
>>
>>> Etienne, cut some JIRAs for improvements like ValidatesRunner for the
>>> Nexmark suite that you think are worthy. Some of them might be good
>>> 'starter' tasks as well.
>>>
>>> On Fri, Aug 25, 2017 at 1:43 AM, Etienne Chauchot <ec...@gmail.com>
>>> wrote:
>>>
>>>> Hi guys,
>>>>
>>>> There is also some points to discuss:
>>>>
>>>> - I think some of the tests in this test suite should be generalized as
>>>> validatesRunner tests like it was done for example for custom window
>>>> merging (https://github.com/apache/beam/blob/5181e619f17e1f69fabe8d5
>>>> bdfc7a3a6a2142cde/sdks/java/core/src/test/java/org/apache/
>>>> beam/sdk/transforms/windowing/WindowTest.java#L591)
>>>>
>>>> - We have run almost no tests on Dataflow, so if someone could run the
>>>> test suite on dataflow, he's very welcome. All needed information are still
>>>> in the README, but I'll move these info to the website.
>>>>
>>>> - other points?
>>>>
>>>> WDYT?
>>>>
>>>> Best,
>>>>
>>>> Etienne
>>>>
>>>>
>>>>
>>>> On 24/08/2017 at 18:35, Lukasz Cwik wrote:
>>>>
>>>>> Yeah, was looking forward to this.
>>>>>
>>>>> On Thu, Aug 24, 2017 at 9:20 AM, Tyler Akidau <takidau@google.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> Awesome news, thank you! :-D
>>>>>>
>>>>>> On Thu, Aug 24, 2017 at 12:40 AM Etienne Chauchot <echauchot@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I wanted to let you know that the Nexmark PR is merged into master. Feel
>>>>>>> free to use it (e.g. performance testing, release testing ...).
>>>>>>>
>>>>>>> Etienne
>>>>>>>
>>>>>>> On 12/05/2017 at 10:55, Etienne Chauchot wrote:
>>>>>>>
>>>>>>>> Hi guys,
>>>>>>>>
>>>>>>>> I wanted to let you know that I have just submitted a PR around
>>>>>>>> NexMark. This is a port of the NexMark queries to Beam, to be used
>> as
>>>>>>>> integration tests.
>>>>>>>> This can also be used as A-B testing (no-regression or performance
>>>>>>>> comparison between 2 versions of the same engine or of the same
>>> runner)
>>>>>>>> This a continuation of the previous PR (#99) from Mark Shields.
>>>>>>>> The code has changed quite a bit: some queries have changed to use
>>> new
>>>>>>>> Beam APIs and there where some big refactorings. More important, we
>>>>>>>> can now run all the queries in all the runners.
>>>>>>>>
>>>>>>>> Nevertheless, there are still some open issues in Nexmark
>>>>>>>> (https://github.com/iemejia/beam/issues) and in Beam upstream (see
>>>>>>>> issue links in https://issues.apache.org/jira/browse/BEAM-160)
>>>>>>>>
>>>>>>>> I wanted to submit the PR before our (Ismaël and I) NexMark talk at
>>>>>>>> the ApacheCon. The PR is not perfect but it is in a good shape to
>>>>>>>> share it.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Etienne
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 22/03/2017 at 04:51, Kenneth Knowles wrote:
>>>>>>>>
>>>>>>>>> This is great! Having a variety of realistic-ish pipelines running
>>> on
>>>>>>>>> all
>>>>>>>>> runners complements the validation suite and IO IT work.
>>>>>>>>>
>>>>>>>>> If I recall, some of these involve heavy and esoteric uses of
>> state,
>>>>>>>> so
>>>>>>> definitely give me a ping if you hit any trouble.
>>>>>>>>> Kenn
>>>>>>>>>
>>>>>>>>> On Tue, Mar 21, 2017 at 9:38 AM, Etienne Chauchot <
>>>>>>>>>
>>>>>>>> echauchot@gmail.com>
>>>>>>> wrote:
>>>>>>>>> Hi all,
>>>>>>>>>> Ismael and I are working on upgrading the Nexmark implementation
>>> for
>>>>>>>>>> Beam.
>>>>>>>>>> See https://github.com/iemejia/beam/tree/BEAM-160-nexmark and
>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-160. We are
>> continuing
>>>>>>>>> the
>>>>>>> work done by Mark Shields. See https://github.com/apache/
>>>>>>>>> beam/pull/366
>>>>>>> for the original PR.
>>>>>>>>>> The PR contains queries that have a wide coverage of the Beam
>> model
>>>>>>>>> and
>>>>>>> that represent a realistic end user use case (some come from client
>>>>>>>>>> experience on Google Cloud Dataflow).
>>>>>>>>>>
>>>>>>>>>> So far, we have upgraded the implementation to the latest Beam
>>>>>>>>>> snapshot.
>>>>>>>>>> And we are able to execute a good subset of the queries in the
>>>>>>>>>> different
>>>>>>>>>> runners. We upgraded the nexmark drivers to do so: direct driver
>>>>>>>>>> (upgraded
>>>>>>>>>> from inProcessDriver) and flink driver and we added a new one for
>>>>>>>>>> spark.
>>>>>>>>>>
>>>>>>>>>> There is still a good amount of work to do and we would like to
>>> know
>>>>>>>>> if
>>>>>>> you think that this contribution can have its place into Beam
>>>>>>>>>> eventually.
>>>>>>>>>>
>>>>>>>>>> The interests of having Nexmark on Beam that we have seen so far
>>> are:
>>>>>>>>>> - Rich batch/streaming test
>>>>>>>>>>
>>>>>>>>>> - A-B testing of runners or runtimes (non-regression, performance
>>>>>>>>>> comparison between versions ...)
>>>>>>>>>>
>>>>>>>>>> - Integration testing (sdk/runners, runner/runtime, ...)
>>>>>>>>>>
>>>>>>>>>> - Validate beam capability matrix
>>>>>>>>>>
>>>>>>>>>> - It can be used as part of the ongoing PerfKit work (if there is
>>> any
>>>>>>>>>> interest).
>>>>>>>>>>
>>>>>>>>>> As a final note, we are tracking the issues in the same repo. If
>>>>>>>>>> someone
>>>>>>>>>> is interested in contributing, or have more ideas, you are
>> welcome
>>> :)
>>>>>>>>>> Etienne
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>

Re: [DISCUSSION] using NexMark for Beam

Posted by Reuven Lax <re...@google.com.INVALID>.

It's being worked on. Turns out there are some modifications still needed
to the NexMark queries.

Reuven

On Thu, Sep 14, 2017 at 9:33 PM, Pei HE <pe...@apache.org> wrote:

> Could any Googlers help to run NexMark on Dataflow streaming and share the
> numbers with the community?
> --
> Pei
>
> On Fri, Aug 25, 2017 at 11:28 PM, Lukasz Cwik <lc...@google.com.invalid>
> wrote:
>
> > Etienne, cut some JIRAs for improvements like ValidatesRunner for the
> > Nexmark suite that you think are worthy. Some of them might be good
> > 'starter' tasks as well.
> >
> > On Fri, Aug 25, 2017 at 1:43 AM, Etienne Chauchot <ec...@gmail.com>
> > wrote:
> >
> > > Hi guys,
> > >
> > > There is also some points to discuss:
> > >
> > > - I think some of the tests in this test suite should be generalized as
> > > validatesRunner tests like it was done for example for custom window
> > > merging (https://github.com/apache/beam/blob/5181e619f17e1f69fabe8d5
> > > bdfc7a3a6a2142cde/sdks/java/core/src/test/java/org/apache/
> > > beam/sdk/transforms/windowing/WindowTest.java#L591)
> > >
> > > > - We have run almost no tests on Dataflow, so if someone could run the
> > > > test suite on dataflow, he's very welcome. All needed information are still
> > > > in the README, but I'll move these info to the website.
> > >
> > > - other points?
> > >
> > > WDYT?
> > >
> > > Best,
> > >
> > > Etienne
> > >
> > >
> > >
> > > > On 24/08/2017 at 18:35, Lukasz Cwik wrote:
> > >
> > >> Yeah, was looking forward to this.
> > >>
> > > >> On Thu, Aug 24, 2017 at 9:20 AM, Tyler Akidau <takidau@google.com.invalid>
> > > >> wrote:
> > > >>
> > > >>> Awesome news, thank you! :-D
> > > >>>
> > > >>> On Thu, Aug 24, 2017 at 12:40 AM Etienne Chauchot <echauchot@gmail.com>
> > > >>> wrote:
> > > >>>
> > > >>>> Hi all,
> > > >>>>
> > > >>>> I wanted to let you know that the Nexmark PR is merged into master. Feel
> > > >>>> free to use it (e.g. performance testing, release testing ...).
> > >>>>
> > >>>> Etienne
> > >>>>
> > > >>>> On 12/05/2017 at 10:55, Etienne Chauchot wrote:
> > >>>>
> > >>>>> Hi guys,
> > >>>>>
> > >>>>> I wanted to let you know that I have just submitted a PR around
> > >>>>> NexMark. This is a port of the NexMark queries to Beam, to be used
> as
> > >>>>> integration tests.
> > >>>>> This can also be used as A-B testing (no-regression or performance
> > >>>>> comparison between 2 versions of the same engine or of the same
> > runner)
> > >>>>>
> > >>>>> This a continuation of the previous PR (#99) from Mark Shields.
> > >>>>> The code has changed quite a bit: some queries have changed to use
> > new
> > >>>>> Beam APIs and there where some big refactorings. More important, we
> > >>>>> can now run all the queries in all the runners.
> > >>>>>
> > >>>>> Nevertheless, there are still some open issues in Nexmark
> > >>>>> (https://github.com/iemejia/beam/issues) and in Beam upstream (see
> > >>>>> issue links in https://issues.apache.org/jira/browse/BEAM-160)
> > >>>>>
> > >>>>> I wanted to submit the PR before our (Ismaël and I) NexMark talk at
> > >>>>> the ApacheCon. The PR is not perfect but it is in a good shape to
> > >>>>> share it.
> > >>>>>
> > >>>>> Best,
> > >>>>>
> > >>>>> Etienne
> > >>>>>
> > >>>>>
> > >>>>>
> > > >>>>> On 22/03/2017 at 04:51, Kenneth Knowles wrote:
> > >>>>>
> > >>>>>> This is great! Having a variety of realistic-ish pipelines running
> > on
> > >>>>>> all
> > >>>>>> runners complements the validation suite and IO IT work.
> > >>>>>>
> > >>>>>> If I recall, some of these involve heavy and esoteric uses of
> state,
> > >>>>>>
> > >>>>> so
> > >>>
> > >>>> definitely give me a ping if you hit any trouble.
> > >>>>>>
> > >>>>>> Kenn
> > >>>>>>
> > >>>>>> On Tue, Mar 21, 2017 at 9:38 AM, Etienne Chauchot <
> > >>>>>>
> > >>>>> echauchot@gmail.com>
> > >>>
> > >>>> wrote:
> > >>>>>>
> > >>>>>> Hi all,
> > >>>>>>>
> > >>>>>>> Ismael and I are working on upgrading the Nexmark implementation
> > for
> > >>>>>>> Beam.
> > >>>>>>> See https://github.com/iemejia/beam/tree/BEAM-160-nexmark and
> > >>>>>>> https://issues.apache.org/jira/browse/BEAM-160. We are
> continuing
> > >>>>>>>
> > >>>>>> the
> > >>>
> > >>>> work done by Mark Shields. See https://github.com/apache/
> > >>>>>>>
> > >>>>>> beam/pull/366
> > >>>
> > >>>> for the original PR.
> > >>>>>>>
> > >>>>>>> The PR contains queries that have a wide coverage of the Beam
> model
> > >>>>>>>
> > >>>>>> and
> > >>>
> > >>>> that represent a realistic end user use case (some come from client
> > >>>>>>> experience on Google Cloud Dataflow).
> > >>>>>>>
> > >>>>>>> So far, we have upgraded the implementation to the latest Beam
> > >>>>>>> snapshot.
> > >>>>>>> And we are able to execute a good subset of the queries in the
> > >>>>>>> different
> > >>>>>>> runners. We upgraded the nexmark drivers to do so: direct driver
> > >>>>>>> (upgraded
> > >>>>>>> from inProcessDriver) and flink driver and we added a new one for
> > >>>>>>> spark.
> > >>>>>>>
> > >>>>>>> There is still a good amount of work to do and we would like to
> > know
> > >>>>>>>
> > >>>>>> if
> > >>>
> > >>>> you think that this contribution can have its place into Beam
> > >>>>>>> eventually.
> > >>>>>>>
> > >>>>>>> The interests of having Nexmark on Beam that we have seen so far
> > are:
> > >>>>>>>
> > >>>>>>> - Rich batch/streaming test
> > >>>>>>>
> > >>>>>>> - A-B testing of runners or runtimes (non-regression, performance
> > >>>>>>> comparison between versions ...)
> > >>>>>>>
> > >>>>>>> - Integration testing (sdk/runners, runner/runtime, ...)
> > >>>>>>>
> > >>>>>>> - Validate beam capability matrix
> > >>>>>>>
> > >>>>>>> - It can be used as part of the ongoing PerfKit work (if there is
> > any
> > >>>>>>> interest).
> > >>>>>>>
> > >>>>>>> As a final note, we are tracking the issues in the same repo. If
> > >>>>>>> someone
> > >>>>>>> is interested in contributing, or have more ideas, you are
> welcome
> > :)
> > >>>>>>>
> > >>>>>>> Etienne
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>
> > >
> >
>

Re: [DISCUSSION] using NexMark for Beam

Posted by Pei HE <pe...@apache.org>.

Could any Googlers help to run NexMark on Dataflow streaming and share the
numbers with the community?
--
Pei

On Fri, Aug 25, 2017 at 11:28 PM, Lukasz Cwik <lc...@google.com.invalid>
wrote:

> Etienne, cut some JIRAs for improvements like ValidatesRunner for the
> Nexmark suite that you think are worthy. Some of them might be good
> 'starter' tasks as well.
>
> On Fri, Aug 25, 2017 at 1:43 AM, Etienne Chauchot <ec...@gmail.com>
> wrote:
>
> > Hi guys,
> >
> > There are also some points to discuss:
> >
> > - I think some of the tests in this test suite should be generalized as
> > ValidatesRunner tests, as was done, for example, for custom window
> > merging
> > (https://github.com/apache/beam/blob/5181e619f17e1f69fabe8d5bdfc7a3a6a2142cde/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/windowing/WindowTest.java#L591)
> >
> > - We have run almost no tests on Dataflow, so if someone could run the
> > test suite on Dataflow, that would be very welcome. All the needed
> > information is still in the README, but I'll move it to the website.
> >
> > - other points?
> >
> > WDYT?
> >
> > Best,
> >
> > Etienne
> >
> >
> >
> > Le 24/08/2017 à 18:35, Lukasz Cwik a écrit :
> >
> >> Yeah, was looking forward to this.
> >>
> >> On Thu, Aug 24, 2017 at 9:20 AM, Tyler Akidau <takidau@google.com.invalid>
> >> wrote:
> >>
> >> Awesome news, thank you! :-D
> >>>
> >>> On Thu, Aug 24, 2017 at 12:40 AM Etienne Chauchot <echauchot@gmail.com>
> >>> wrote:
> >>>
> >>> Hi all,
> >>>>
> >>>> I wanted to let you know that the Nexmark PR is merged into master. Feel
> >>>> free to use it (e.g. performance testing, release testing ...).
> >>>>
> >>>> Etienne
> >>>>
> >>>> Le 12/05/2017 à 10:55, Etienne Chauchot a écrit :
> >>>>
> >>>>> Hi guys,
> >>>>>
> >>>>> I wanted to let you know that I have just submitted a PR around
> >>>>> NexMark. This is a port of the NexMark queries to Beam, to be used as
> >>>>> integration tests.
> >>>>> This can also be used as A-B testing (no-regression or performance
> >>>>> comparison between 2 versions of the same engine or of the same runner)
> >>>>>
> >>>>> This is a continuation of the previous PR (#99) from Mark Shields.
> >>>>> The code has changed quite a bit: some queries have changed to use new
> >>>>> Beam APIs and there were some big refactorings. More importantly, we
> >>>>> can now run all the queries in all the runners.
> >>>>>
> >>>>> Nevertheless, there are still some open issues in Nexmark
> >>>>> (https://github.com/iemejia/beam/issues) and in Beam upstream (see
> >>>>> issue links in https://issues.apache.org/jira/browse/BEAM-160)
> >>>>>
> >>>>> I wanted to submit the PR before our (Ismaël and I) NexMark talk at
> >>>>> the ApacheCon. The PR is not perfect but it is in a good shape to
> >>>>> share it.
> >>>>>
> >>>>> Best,
> >>>>>
> >>>>> Etienne
> >>>>>
> >>>>>
> >>>>>
> >>>>> Le 22/03/2017 à 04:51, Kenneth Knowles a écrit :
> >>>>>
> >>>>>> This is great! Having a variety of realistic-ish pipelines running on
> >>>>>> all
> >>>>>> runners complements the validation suite and IO IT work.
> >>>>>>
> >>>>>> If I recall, some of these involve heavy and esoteric uses of state, so
> >>>>>> definitely give me a ping if you hit any trouble.
> >>>>>>
> >>>>>> Kenn
> >>>>>>
> >>>>>> On Tue, Mar 21, 2017 at 9:38 AM, Etienne Chauchot <echauchot@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>> Hi all,
> >>>>>>>
> >>>>>>> Ismael and I are working on upgrading the Nexmark implementation for
> >>>>>>> Beam.
> >>>>>>> See https://github.com/iemejia/beam/tree/BEAM-160-nexmark and
> >>>>>>> https://issues.apache.org/jira/browse/BEAM-160. We are continuing the
> >>>>>>> work done by Mark Shields. See https://github.com/apache/beam/pull/366
> >>>>>>> for the original PR.
> >>>>>>>
> >>>>>>> The PR contains queries that have a wide coverage of the Beam model and
> >>>>>>> that represent a realistic end user use case (some come from client
> >>>>>>> experience on Google Cloud Dataflow).
> >>>>>>>
> >>>>>>> So far, we have upgraded the implementation to the latest Beam
> >>>>>>> snapshot.
> >>>>>>> And we are able to execute a good subset of the queries in the
> >>>>>>> different
> >>>>>>> runners. We upgraded the nexmark drivers to do so: direct driver
> >>>>>>> (upgraded
> >>>>>>> from inProcessDriver) and flink driver and we added a new one for
> >>>>>>> spark.
> >>>>>>>
> >>>>>>> There is still a good amount of work to do and we would like to know if
> >>>>>>> you think that this contribution can have its place into Beam
> >>>>>>> eventually.
> >>>>>>>
> >>>>>>> The interests of having Nexmark on Beam that we have seen so far are:
> >>>>>>>
> >>>>>>> - Rich batch/streaming test
> >>>>>>>
> >>>>>>> - A-B testing of runners or runtimes (non-regression, performance
> >>>>>>> comparison between versions ...)
> >>>>>>>
> >>>>>>> - Integration testing (sdk/runners, runner/runtime, ...)
> >>>>>>>
> >>>>>>> - Validate beam capability matrix
> >>>>>>>
> >>>>>>> - It can be used as part of the ongoing PerfKit work (if there is any
> >>>>>>> interest).
> >>>>>>>
> >>>>>>> As a final note, we are tracking the issues in the same repo. If
> >>>>>>> someone
> >>>>>>> is interested in contributing, or have more ideas, you are welcome :)
> >>>>>>>
> >>>>>>> Etienne
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>
> >
>

Re: [DISCUSSION] using NexMark for Beam

Posted by Lukasz Cwik <lc...@google.com.INVALID>.

Etienne, cut some JIRAs for improvements like ValidatesRunner for the
Nexmark suite that you think are worthy. Some of them might be good
'starter' tasks as well.

On Fri, Aug 25, 2017 at 1:43 AM, Etienne Chauchot <ec...@gmail.com>
wrote:

> Hi guys,
>
> There are also some points to discuss:
>
> - I think some of the tests in this test suite should be generalized as
> ValidatesRunner tests, as was done, for example, for custom window
> merging
> (https://github.com/apache/beam/blob/5181e619f17e1f69fabe8d5bdfc7a3a6a2142cde/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/windowing/WindowTest.java#L591)
>
> - We have run almost no tests on Dataflow, so if someone could run the
> test suite on Dataflow, that would be very welcome. All the needed
> information is still in the README, but I'll move it to the website.
>
> - other points?
>
> WDYT?
>
> Best,
>
> Etienne
>
>
>
> Le 24/08/2017 à 18:35, Lukasz Cwik a écrit :
>
>> Yeah, was looking forward to this.
>>
>> On Thu, Aug 24, 2017 at 9:20 AM, Tyler Akidau <takidau@google.com.invalid>
>> wrote:
>>
>> Awesome news, thank you! :-D
>>>
>>> On Thu, Aug 24, 2017 at 12:40 AM Etienne Chauchot <ec...@gmail.com>
>>> wrote:
>>>
>>> Hi all,
>>>>
>>>> I wanted to let you know that the Nexmark PR is merged into master. Feel
>>>> free to use it (e.g. performance testing, release testing ...).
>>>>
>>>> Etienne
>>>>
>>>> Le 12/05/2017 à 10:55, Etienne Chauchot a écrit :
>>>>
>>>>> Hi guys,
>>>>>
>>>>> I wanted to let you know that I have just submitted a PR around
>>>>> NexMark. This is a port of the NexMark queries to Beam, to be used as
>>>>> integration tests.
>>>>> This can also be used as A-B testing (no-regression or performance
>>>>> comparison between 2 versions of the same engine or of the same runner)
>>>>>
>>>>> This is a continuation of the previous PR (#99) from Mark Shields.
>>>>> The code has changed quite a bit: some queries have changed to use new
>>>>> Beam APIs and there were some big refactorings. More importantly, we
>>>>> can now run all the queries in all the runners.
>>>>>
>>>>> Nevertheless, there are still some open issues in Nexmark
>>>>> (https://github.com/iemejia/beam/issues) and in Beam upstream (see
>>>>> issue links in https://issues.apache.org/jira/browse/BEAM-160)
>>>>>
>>>>> I wanted to submit the PR before our (Ismaël and I) NexMark talk at
>>>>> the ApacheCon. The PR is not perfect but it is in a good shape to
>>>>> share it.
>>>>>
>>>>> Best,
>>>>>
>>>>> Etienne
>>>>>
>>>>>
>>>>>
>>>>> Le 22/03/2017 à 04:51, Kenneth Knowles a écrit :
>>>>>
>>>>>> This is great! Having a variety of realistic-ish pipelines running on
>>>>>> all
>>>>>> runners complements the validation suite and IO IT work.
>>>>>>
>>>>>> If I recall, some of these involve heavy and esoteric uses of state, so
>>>>>> definitely give me a ping if you hit any trouble.
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>> On Tue, Mar 21, 2017 at 9:38 AM, Etienne Chauchot <echauchot@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>>
>>>>>>> Ismael and I are working on upgrading the Nexmark implementation for
>>>>>>> Beam.
>>>>>>> See https://github.com/iemejia/beam/tree/BEAM-160-nexmark and
>>>>>>> https://issues.apache.org/jira/browse/BEAM-160. We are continuing the
>>>>>>> work done by Mark Shields. See https://github.com/apache/beam/pull/366
>>>>>>> for the original PR.
>>>>>>>
>>>>>>> The PR contains queries that have a wide coverage of the Beam model and
>>>>>>> that represent a realistic end user use case (some come from client
>>>>>>> experience on Google Cloud Dataflow).
>>>>>>>
>>>>>>> So far, we have upgraded the implementation to the latest Beam
>>>>>>> snapshot.
>>>>>>> And we are able to execute a good subset of the queries in the
>>>>>>> different
>>>>>>> runners. We upgraded the nexmark drivers to do so: direct driver
>>>>>>> (upgraded
>>>>>>> from inProcessDriver) and flink driver and we added a new one for
>>>>>>> spark.
>>>>>>>
>>>>>>> There is still a good amount of work to do and we would like to know if
>>>>>>> you think that this contribution can have its place into Beam
>>>>>>> eventually.
>>>>>>>
>>>>>>> The interests of having Nexmark on Beam that we have seen so far are:
>>>>>>>
>>>>>>> - Rich batch/streaming test
>>>>>>>
>>>>>>> - A-B testing of runners or runtimes (non-regression, performance
>>>>>>> comparison between versions ...)
>>>>>>>
>>>>>>> - Integration testing (sdk/runners, runner/runtime, ...)
>>>>>>>
>>>>>>> - Validate beam capability matrix
>>>>>>>
>>>>>>> - It can be used as part of the ongoing PerfKit work (if there is any
>>>>>>> interest).
>>>>>>>
>>>>>>> As a final note, we are tracking the issues in the same repo. If
>>>>>>> someone
>>>>>>> is interested in contributing, or have more ideas, you are welcome :)
>>>>>>>
>>>>>>> Etienne
>>>>>>>
>>>>>>>
>>>>>>>
>>>>
>

Re: [DISCUSSION] using NexMark for Beam

Posted by Etienne Chauchot <ec...@gmail.com>.

Hi guys,

There are also some points to discuss:

- I think some of the tests in this test suite should be generalized as
ValidatesRunner tests, as was done, for example, for custom window
merging
(https://github.com/apache/beam/blob/5181e619f17e1f69fabe8d5bdfc7a3a6a2142cde/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/windowing/WindowTest.java#L591)

- We have run almost no tests on Dataflow, so if someone could run the
test suite on Dataflow, that would be very welcome. All the needed
information is still in the README, but I'll move it to the website.

- other points?

WDYT?

Best,

Etienne


Le 24/08/2017 à 18:35, Lukasz Cwik a écrit :
> Yeah, was looking forward to this.
>
> On Thu, Aug 24, 2017 at 9:20 AM, Tyler Akidau <ta...@google.com.invalid>
> wrote:
>
>> Awesome news, thank you! :-D
>>
>> On Thu, Aug 24, 2017 at 12:40 AM Etienne Chauchot <ec...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I wanted to let you know that the Nexmark PR is merged into master. Feel
>>> free to use it (e.g. performance testing, release testing ...).
>>>
>>> Etienne
>>>
>>> Le 12/05/2017 à 10:55, Etienne Chauchot a écrit :
>>>> Hi guys,
>>>>
>>>> I wanted to let you know that I have just submitted a PR around
>>>> NexMark. This is a port of the NexMark queries to Beam, to be used as
>>>> integration tests.
>>>> This can also be used as A-B testing (no-regression or performance
>>>> comparison between 2 versions of the same engine or of the same runner)
>>>>
>>>> This is a continuation of the previous PR (#99) from Mark Shields.
>>>> The code has changed quite a bit: some queries have changed to use new
>>>> Beam APIs and there were some big refactorings. More importantly, we
>>>> can now run all the queries in all the runners.
>>>>
>>>> Nevertheless, there are still some open issues in Nexmark
>>>> (https://github.com/iemejia/beam/issues) and in Beam upstream (see
>>>> issue links in https://issues.apache.org/jira/browse/BEAM-160)
>>>>
>>>> I wanted to submit the PR before our (Ismaël and I) NexMark talk at
>>>> the ApacheCon. The PR is not perfect but it is in a good shape to
>>>> share it.
>>>>
>>>> Best,
>>>>
>>>> Etienne
>>>>
>>>>
>>>>
>>>> Le 22/03/2017 à 04:51, Kenneth Knowles a écrit :
>>>>> This is great! Having a variety of realistic-ish pipelines running on
>>>>> all
>>>>> runners complements the validation suite and IO IT work.
>>>>>
>>>>> If I recall, some of these involve heavy and esoteric uses of state, so
>>>>> definitely give me a ping if you hit any trouble.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Tue, Mar 21, 2017 at 9:38 AM, Etienne Chauchot <echauchot@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Ismael and I are working on upgrading the Nexmark implementation for
>>>>>> Beam.
>>>>>> See https://github.com/iemejia/beam/tree/BEAM-160-nexmark and
>>>>>> https://issues.apache.org/jira/browse/BEAM-160. We are continuing the
>>>>>> work done by Mark Shields. See https://github.com/apache/beam/pull/366
>>>>>> for the original PR.
>>>>>>
>>>>>> The PR contains queries that have a wide coverage of the Beam model and
>>>>>> that represent a realistic end user use case (some come from client
>>>>>> experience on Google Cloud Dataflow).
>>>>>>
>>>>>> So far, we have upgraded the implementation to the latest Beam
>>>>>> snapshot.
>>>>>> And we are able to execute a good subset of the queries in the
>>>>>> different
>>>>>> runners. We upgraded the nexmark drivers to do so: direct driver
>>>>>> (upgraded
>>>>>> from inProcessDriver) and flink driver and we added a new one for
>>>>>> spark.
>>>>>>
>>>>>> There is still a good amount of work to do and we would like to know if
>>>>>> you think that this contribution can have its place into Beam
>>>>>> eventually.
>>>>>>
>>>>>> The interests of having Nexmark on Beam that we have seen so far are:
>>>>>>
>>>>>> - Rich batch/streaming test
>>>>>>
>>>>>> - A-B testing of runners or runtimes (non-regression, performance
>>>>>> comparison between versions ...)
>>>>>>
>>>>>> - Integration testing (sdk/runners, runner/runtime, ...)
>>>>>>
>>>>>> - Validate beam capability matrix
>>>>>>
>>>>>> - It can be used as part of the ongoing PerfKit work (if there is any
>>>>>> interest).
>>>>>>
>>>>>> As a final note, we are tracking the issues in the same repo. If
>>>>>> someone
>>>>>> is interested in contributing, or have more ideas, you are welcome :)
>>>>>>
>>>>>> Etienne
>>>>>>
>>>>>>
>>>

Re: [DISCUSSION] using NexMark for Beam

Posted by Lukasz Cwik <lc...@google.com.INVALID>.

Yeah, was looking forward to this.

On Thu, Aug 24, 2017 at 9:20 AM, Tyler Akidau <ta...@google.com.invalid>
wrote:

> Awesome news, thank you! :-D
>
> On Thu, Aug 24, 2017 at 12:40 AM Etienne Chauchot <ec...@gmail.com>
> wrote:
>
> > Hi all,
> >
> > I wanted to let you know that the Nexmark PR is merged into master. Feel
> > free to use it (e.g. performance testing, release testing ...).
> >
> > Etienne
> >
> > Le 12/05/2017 à 10:55, Etienne Chauchot a écrit :
> > > Hi guys,
> > >
> > > I wanted to let you know that I have just submitted a PR around
> > > NexMark. This is a port of the NexMark queries to Beam, to be used as
> > > integration tests.
> > > This can also be used as A-B testing (no-regression or performance
> > > comparison between 2 versions of the same engine or of the same runner)
> > >
> > > This is a continuation of the previous PR (#99) from Mark Shields.
> > > The code has changed quite a bit: some queries have changed to use new
> > > Beam APIs and there were some big refactorings. More importantly, we
> > > can now run all the queries in all the runners.
> > >
> > > Nevertheless, there are still some open issues in Nexmark
> > > (https://github.com/iemejia/beam/issues) and in Beam upstream (see
> > > issue links in https://issues.apache.org/jira/browse/BEAM-160)
> > >
> > > I wanted to submit the PR before our (Ismaël and I) NexMark talk at
> > > the ApacheCon. The PR is not perfect but it is in a good shape to
> > > share it.
> > >
> > > Best,
> > >
> > > Etienne
> > >
> > >
> > >
> > > Le 22/03/2017 à 04:51, Kenneth Knowles a écrit :
> > >> This is great! Having a variety of realistic-ish pipelines running on
> > >> all
> > >> runners complements the validation suite and IO IT work.
> > >>
> > >> If I recall, some of these involve heavy and esoteric uses of state, so
> > >> definitely give me a ping if you hit any trouble.
> > >>
> > >> Kenn
> > >>
> > >> On Tue, Mar 21, 2017 at 9:38 AM, Etienne Chauchot <echauchot@gmail.com>
> > >> wrote:
> > >>
> > >>> Hi all,
> > >>>
> > >>> Ismael and I are working on upgrading the Nexmark implementation for
> > >>> Beam.
> > >>> See https://github.com/iemejia/beam/tree/BEAM-160-nexmark and
> > >>> https://issues.apache.org/jira/browse/BEAM-160. We are continuing the
> > >>> work done by Mark Shields. See https://github.com/apache/beam/pull/366
> > >>> for the original PR.
> > >>>
> > >>> The PR contains queries that have a wide coverage of the Beam model and
> > >>> that represent a realistic end user use case (some come from client
> > >>> experience on Google Cloud Dataflow).
> > >>>
> > >>> So far, we have upgraded the implementation to the latest Beam
> > >>> snapshot.
> > >>> And we are able to execute a good subset of the queries in the
> > >>> different
> > >>> runners. We upgraded the nexmark drivers to do so: direct driver
> > >>> (upgraded
> > >>> from inProcessDriver) and flink driver and we added a new one for
> > >>> spark.
> > >>>
> > >>> There is still a good amount of work to do and we would like to know if
> > >>> you think that this contribution can have its place into Beam
> > >>> eventually.
> > >>>
> > >>> The interests of having Nexmark on Beam that we have seen so far are:
> > >>>
> > >>> - Rich batch/streaming test
> > >>>
> > >>> - A-B testing of runners or runtimes (non-regression, performance
> > >>> comparison between versions ...)
> > >>>
> > >>> - Integration testing (sdk/runners, runner/runtime, ...)
> > >>>
> > >>> - Validate beam capability matrix
> > >>>
> > >>> - It can be used as part of the ongoing PerfKit work (if there is any
> > >>> interest).
> > >>>
> > >>> As a final note, we are tracking the issues in the same repo. If
> > >>> someone
> > >>> is interested in contributing, or have more ideas, you are welcome :)
> > >>>
> > >>> Etienne
> > >>>
> > >>>
> > >
> >
> >
>

Re: [DISCUSSION] using NexMark for Beam

Posted by Tyler Akidau <ta...@google.com.INVALID>.

Awesome news, thank you! :-D

On Thu, Aug 24, 2017 at 12:40 AM Etienne Chauchot <ec...@gmail.com>
wrote:

> Hi all,
>
> I wanted to let you know that the Nexmark PR is merged into master. Feel
> free to use it (e.g. performance testing, release testing ...).
>
> Etienne
>
> Le 12/05/2017 à 10:55, Etienne Chauchot a écrit :
> > Hi guys,
> >
> > I wanted to let you know that I have just submitted a PR around
> > NexMark. This is a port of the NexMark queries to Beam, to be used as
> > integration tests.
> > This can also be used as A-B testing (no-regression or performance
> > comparison between 2 versions of the same engine or of the same runner)
> >
> > This is a continuation of the previous PR (#99) from Mark Shields.
> > The code has changed quite a bit: some queries have changed to use new
> > Beam APIs and there were some big refactorings. More importantly, we
> > can now run all the queries in all the runners.
> >
> > Nevertheless, there are still some open issues in Nexmark
> > (https://github.com/iemejia/beam/issues) and in Beam upstream (see
> > issue links in https://issues.apache.org/jira/browse/BEAM-160)
> >
> > I wanted to submit the PR before our (Ismaël and I) NexMark talk at
> > the ApacheCon. The PR is not perfect but it is in a good shape to
> > share it.
> >
> > Best,
> >
> > Etienne
> >
> >
> >
> > Le 22/03/2017 à 04:51, Kenneth Knowles a écrit :
> >> This is great! Having a variety of realistic-ish pipelines running on
> >> all
> >> runners complements the validation suite and IO IT work.
> >>
> >> If I recall, some of these involve heavy and esoteric uses of state, so
> >> definitely give me a ping if you hit any trouble.
> >>
> >> Kenn
> >>
> >> On Tue, Mar 21, 2017 at 9:38 AM, Etienne Chauchot <ec...@gmail.com>
> >> wrote:
> >>
> >>> Hi all,
> >>>
> >>> Ismael and I are working on upgrading the Nexmark implementation for
> >>> Beam.
> >>> See https://github.com/iemejia/beam/tree/BEAM-160-nexmark and
> >>> https://issues.apache.org/jira/browse/BEAM-160. We are continuing the
> >>> work done by Mark Shields. See https://github.com/apache/beam/pull/366
> >>> for the original PR.
> >>>
> >>> The PR contains queries that have a wide coverage of the Beam model and
> >>> that represent a realistic end user use case (some come from client
> >>> experience on Google Cloud Dataflow).
> >>>
> >>> So far, we have upgraded the implementation to the latest Beam
> >>> snapshot.
> >>> And we are able to execute a good subset of the queries in the
> >>> different
> >>> runners. We upgraded the nexmark drivers to do so: direct driver
> >>> (upgraded
> >>> from inProcessDriver) and flink driver and we added a new one for
> >>> spark.
> >>>
> >>> There is still a good amount of work to do and we would like to know if
> >>> you think that this contribution can have its place into Beam
> >>> eventually.
> >>>
> >>> The interests of having Nexmark on Beam that we have seen so far are:
> >>>
> >>> - Rich batch/streaming test
> >>>
> >>> - A-B testing of runners or runtimes (non-regression, performance
> >>> comparison between versions ...)
> >>>
> >>> - Integration testing (sdk/runners, runner/runtime, ...)
> >>>
> >>> - Validate beam capability matrix
> >>>
> >>> - It can be used as part of the ongoing PerfKit work (if there is any
> >>> interest).
> >>>
> >>> As a final note, we are tracking the issues in the same repo. If
> >>> someone
> >>> is interested in contributing, or have more ideas, you are welcome :)
> >>>
> >>> Etienne
> >>>
> >>>
> >
>
>

Re: [DISCUSSION] using NexMark for Beam

Posted by Etienne Chauchot <ec...@gmail.com>.

Hi all,

I wanted to let you know that the Nexmark PR is merged into master. Feel 
free to use it (e.g. performance testing, release testing ...).

Etienne

Le 12/05/2017 à 10:55, Etienne Chauchot a écrit :
> Hi guys,
>
> I wanted to let you know that I have just submitted a PR around 
> NexMark. This is a port of the NexMark queries to Beam, to be used as 
> integration tests.
> This can also be used as A-B testing (no-regression or performance 
> comparison between 2 versions of the same engine or of the same runner)
>
> This is a continuation of the previous PR (#99) from Mark Shields.
> The code has changed quite a bit: some queries have changed to use new
> Beam APIs and there were some big refactorings. More importantly, we
> can now run all the queries in all the runners.
>
> Nevertheless, there are still some open issues in Nexmark 
> (https://github.com/iemejia/beam/issues) and in Beam upstream (see 
> issue links in https://issues.apache.org/jira/browse/BEAM-160)
>
> I wanted to submit the PR before our (Ismaël and I) NexMark talk at 
> the ApacheCon. The PR is not perfect but it is in a good shape to 
> share it.
>
> Best,
>
> Etienne
>
>
>
> Le 22/03/2017 à 04:51, Kenneth Knowles a écrit :
>> This is great! Having a variety of realistic-ish pipelines running on 
>> all
>> runners complements the validation suite and IO IT work.
>>
>> If I recall, some of these involve heavy and esoteric uses of state, so
>> definitely give me a ping if you hit any trouble.
>>
>> Kenn
>>
>> On Tue, Mar 21, 2017 at 9:38 AM, Etienne Chauchot <ec...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> Ismael and I are working on upgrading the Nexmark implementation for 
>>> Beam.
>>> See https://github.com/iemejia/beam/tree/BEAM-160-nexmark and
>>> https://issues.apache.org/jira/browse/BEAM-160. We are continuing the
>>> work done by Mark Shields. See https://github.com/apache/beam/pull/366
>>> for the original PR.
>>>
>>> The PR contains queries that have a wide coverage of the Beam model and
>>> that represent a realistic end user use case (some come from client
>>> experience on Google Cloud Dataflow).
>>>
>>> So far, we have upgraded the implementation to the latest Beam 
>>> snapshot.
>>> And we are able to execute a good subset of the queries in the 
>>> different
>>> runners. We upgraded the nexmark drivers to do so: direct driver 
>>> (upgraded
>>> from inProcessDriver) and flink driver and we added a new one for 
>>> spark.
>>>
>>> There is still a good amount of work to do and we would like to know if
>>> you think that this contribution can have its place into Beam 
>>> eventually.
>>>
>>> The interests of having Nexmark on Beam that we have seen so far are:
>>>
>>> - Rich batch/streaming test
>>>
>>> - A-B testing of runners or runtimes (non-regression, performance
>>> comparison between versions ...)
>>>
>>> - Integration testing (sdk/runners, runner/runtime, ...)
>>>
>>> - Validate beam capability matrix
>>>
>>> - It can be used as part of the ongoing PerfKit work (if there is any
>>> interest).
>>>
>>> As a final note, we are tracking the issues in the same repo. If 
>>> someone
>>> is interested in contributing, or have more ideas, you are welcome :)
>>>
>>> Etienne
>>>
>>>
>