Posted to dev@beam.apache.org by Daniel Oliveira <da...@google.com> on 2019/04/25 23:01:33 UTC

Removing Java Reference Runner code

Hey everyone,

I made a preliminary PR for removing all the Java Reference Runner code (
PR-8380 <https://github.com/apache/beam/pull/8380>) since I wanted to see
if it could be done easily. It seems to be working fine, so I wanted to
open up this discussion to make sure people are still in agreement on
getting rid of this code and that people don't have any concerns.

For those who need additional context about this, this previous thread
<https://lists.apache.org/thread.html/b235f8ee55a737ea399756edd80b1218ed34d3439f7b0ed59bfa8e40@%3Cdev.beam.apache.org%3E>
is where we discussed deprecating the Java Reference Runner (in some places
it's called the ULR or Universal Local Runner, but it's the same thing).
Then there's this thread
<https://lists.apache.org/thread.html/0b68efce9b7f2c5297b32d09e5d903e9b354199fe2ce446fbcd240bc@%3Cdev.beam.apache.org%3E>
where we discussed removing the code from the repo since it's been
deprecated.

If no one has any objections to trying to remove the code I'll have someone
review the PR I wrote and start a vote to have it merged.

Thanks,
Daniel Oliveira

Re: Removing Java Reference Runner code

Posted by Daniel Oliveira <da...@google.com>.

Hey Kenn,

I'm not 100% sure. Robert (+Robert Bradshaw <ro...@google.com>) could
answer your question accurately. Last I checked (about 2 months ago) there
was no such target, but I don't think there's anything preventing one from
being written.

On Fri, Apr 26, 2019 at 12:05 PM Kenneth Knowles <ke...@apache.org> wrote:

> Good points. Distilling one single item: can I, today, run the Java SDK's
> ValidatesRunner suite against the Python ULR + Java SDK Harness with a
> single Gradle command?
>
> Kenn
>
> On Fri, Apr 26, 2019 at 9:54 AM Anton Kedin <ke...@google.com> wrote:
>
>> If there are no plans to invest in ULR then it makes sense to remove it.
>>
>> Going forward, however, I think we should try to document the higher
>> level approach we're taking with runners (and portability) now that we have
>> something working and can reflect on it. For example, a couple of things that
>> are not 100% clear to me:
>>  - if the focus is on the Python runner for portability efforts, how does
>> the Java SDK (and other languages) tie into this? E.g. how do we run, test,
>> measure, and develop things (pipelines, aspects of the SDK, runner)?
>>  - what's our approach to developing new features? Should we make sure
>> the Python runner supports them as early as possible (e.g. schemas and SQL)?
>>  - the Java DirectRunner is still there:
>>     - it is still the primary tool for Java SDK development purposes, and
>> as Kenn mentioned in the linked threads it adds value by making sure users
>> don't rely on implementation details of specific runners. Do we have a
>> similar story for portable scenarios?
>>     - I assume that extra validations in the DirectRunner have an impact on
>> performance in various ways (potentially non-deterministic). While this
>> doesn't matter in some cases, it might in others. Having a local runner
>> that is (better) optimized for execution would probably make more sense for
>> perf measurements, integration tests, and maybe even local production jobs.
>> Is this something potentially worth looking into?
>>
>> Regards,
>> Anton
>>
>>
>> On Fri, Apr 26, 2019 at 4:41 AM Maximilian Michels <mx...@apache.org>
>> wrote:
>>
>>> Thanks for following up with this. I have mixed feelings to see the
>>> portable Java DirectRunner go, but I'm in favor of this change because
>>> it removes a lot of code that we do not really make use of.
>>>
>>> -Max
>>>
>>> On 26.04.19 02:58, Kenneth Knowles wrote:
>>> > Thanks for providing all this background on the PR. It is very easy to
>>> > see where it came from. Definitely nice to have less code and fewer
>>> > things that can break. Perhaps lazy consensus is enough.
>>> >
>>> > Kenn

Re: Removing Java Reference Runner code

Posted by Mikhail Gryzykhin <mi...@google.com>.

+1 to removing it overall. We already removed all the tests for the ULR, and
when we did that, the tests were red. Removing the code base is a natural next
step.

It is a valid point that we should have a way to run portable pipelines
locally with the Python ULR.

I don't believe that a Java person working with the Java SDK should actually
need to debug the worker in most cases. If we have a situation where SDK devs
have to debug the runner regularly, we should improve runner logging and error
reporting. This can be a great exercise in improving testability, as well
as a good requirement if we want to eventually split the mono-repo.

--Mikhail

On Fri, Apr 26, 2019 at 12:36 PM Boyuan Zhang <bo...@google.com> wrote:

> Another concern from me is, will it be difficult for a Java person (who
> is developing the Java SDK) to figure out what's going on in the Python ULR
> when debugging?

Re: Removing Java Reference Runner code

Posted by Daniel Oliveira <da...@google.com>.

It sounds like no one has any objections specifically to removing this
code. I'll get someone to review the PR and I'll start a vote to merge it
as soon as it's approved.

On Mon, Apr 29, 2019 at 3:39 AM Robert Bradshaw <ro...@google.com> wrote:

> I'd imagine that most users will continue to debug their pipelines
> using a direct runner, and even if the portable runner is used it can
> be run in "loopback" mode where the pipeline-submitting process also
> acts as the worker(s), so one can output print statements, set
> breakpoints, etc. as if it were all in-process (unless there's
> actually something strange with the runner <-> SDK API itself).
>
> Similarly, for development, many (most) features (IO, SQL, schemas)
> are runner-agnostic, though of course this is not always the case
> especially if there are fundamental changes to the model (e.g. one
> that comes to mind is retractions).
>
> That's not to say there isn't also value in testing your code on a
> portable runner that will more faithfully represent production
> environments, but at this level of integration test (e.g. using docker
> and all) I don't think having Python is that high of a barrier.
>
> As for a gradle command to run JVR tests on the Python ULR, I don't
> think that's currently available, but it should be.

Re: Removing Java Reference Runner code

Posted by Robert Bradshaw <ro...@google.com>.

I'd imagine that most users will continue to debug their pipelines
using a direct runner, and even if the portable runner is used it can
be run in "loopback" mode where the pipeline-submitting process also
acts as the worker(s), so one can output print statements, set
breakpoints, etc. as if it were all in-process (unless there's
actually something strange with the runner <-> SDK API itself).

Similarly, for development, many (most) features (IO, SQL, schemas)
are runner-agnostic, though of course this is not always the case
especially if there are fundamental changes to the model (e.g. one
that comes to mind is retractions).

That's not to say there isn't also value in testing your code on a
portable runner that will more faithfully represent production
environments, but at this level of integration test (e.g. using docker
and all) I don't think having Python is that high of a barrier.

As for a gradle command to run JVR tests on the Python ULR, I don't
think that's currently available, but it should be.



On Sat, Apr 27, 2019 at 4:53 AM Daniel Oliveira <da...@google.com> wrote:
>
> Hey Boyuan,
>
> I think that's a good question. Mikhail's mostly right, that the user shouldn't need to know how the Python ULR works for their debugging. This is actually more of an issue with portability itself anyway. Even when I was coding Java pipelines on the Java ULR, if something went wrong in the runner it was still really difficult to debug. Hopefully the only people that will need to do that painful exercise are Beam devs doing development work on the runners. If an average user is having a problem, the runner's logs and error messages should be effective enough that the user shouldn't care what language the runner is using or how it's implemented.

Re: Removing Java Reference Runner code

Posted by Daniel Oliveira <da...@google.com>.

Hey Boyuan,

I think that's a good question. Mikhail's mostly right, that the user
shouldn't need to know how the Python ULR works for their debugging. This
is actually more of an issue with portability itself anyway. Even when I
was coding Java pipelines on the Java ULR, if something went wrong in the
runner it was still really difficult to debug. Hopefully the only people
that will need to do that painful exercise are Beam devs doing development
work on the runners. If an average user is having a problem, the runner's
logs and error messages should be effective enough that the user shouldn't
care what language the runner is using or how it's implemented.

On Fri, Apr 26, 2019 at 12:36 PM Boyuan Zhang <bo...@google.com> wrote:

> Another concern from me is, will it be difficult for a Java person (who
> is developing the Java SDK) to figure out what's going on in the Python ULR
> when debugging?

Re: Removing Java Reference Runner code

Posted by Boyuan Zhang <bo...@google.com>.

Another concern from me is, will it be difficult for a Java person (who
is developing the Java SDK) to figure out what's going on in the Python ULR
when debugging?

On Fri, Apr 26, 2019 at 12:05 PM Kenneth Knowles <ke...@apache.org> wrote:

> Good points. Distilling one single item: can I, today, run the Java SDK's
> ValidatesRunner suite against the Python ULR + Java SDK Harness with a
> single Gradle command?
>
> Kenn

Re: Removing Java Reference Runner code

Posted by Kenneth Knowles <ke...@apache.org>.

Good points. Distilling one single item: can I, today, run the Java SDK's
suite of ValidatesRunner tests against the Python ULR + Java SDK Harness,
in a single Gradle command?

Kenn

On Fri, Apr 26, 2019 at 9:54 AM Anton Kedin <ke...@google.com> wrote:

> If there is no plans to invest in ULR then it makes sense to remove it.
>
> Going forward, however, I think we should try to document the higher level
> approach we're taking with runners (and portability) now that we have
> something working and can reflect on it. For example, couple of things that
> are not 100% clear to me:
>  - if the focus is on python runner for portability efforts, how does java
> SDK (and other languages) tie into this? E.g. how do we run, test, measure,
> and develop things (pipelines, aspects of the SDK, runner);
>  - what's our approach to developing new features, should we make sure
> python runner supports them as early as possible (e.g. schemas and SQL)?
>  - java DirectRunner is still there:
>     - it is still the primary tool for java SDK development purposes, and
> as Kenn mentioned in the linked threads it adds value by making sure users
> don't rely on implementation details of specific runners. Do we have a
> similar story for portable scenarios?
>     - I assume that extra validations in the DirectRunner have impact on
> performance in various ways (potentially non-deterministic). While this
> doesn't matter in some cases, it might do in others. Having a local runner
> that is (better) optimized for execution would probably make more sense for
> perf measurements, integration tests, and maybe even local production jobs.
> Is this something potentially worth looking into?
>
> Regards,
> Anton
>
>
> On Fri, Apr 26, 2019 at 4:41 AM Maximilian Michels <mx...@apache.org> wrote:
>
>> Thanks for following up with this. I have mixed feelings to see the
>> portable Java DirectRunner go, but I'm in favor of this change because
>> it removes a lot of code that we do not really make use of.
>>
>> -Max
>>
>> On 26.04.19 02:58, Kenneth Knowles wrote:
>> > Thanks for providing all this background on the PR. It is very easy to
>> > see where it came from. Definitely nice to have less code and fewer
>> > things that can break. Perhaps lazy consensus is enough.
>> >
>> > Kenn
>> >
>> > On Thu, Apr 25, 2019 at 4:01 PM Daniel Oliveira <danoliveira@google.com
>> > <ma...@google.com>> wrote:
>> >
>> >     Hey everyone,
>> >
>> >     I made a preliminary PR for removing all the Java Reference Runner
>> >     code (PR-8380 <https://github.com/apache/beam/pull/8380>) since I
>> >     wanted to see if it could be done easily. It seems to be working
>> >     fine, so I wanted to open up this discussion to make sure people are
>> >     still in agreement on getting rid of this code and that people don't
>> >     have any concerns.
>> >
>> >     For those who need additional context about this, this previous
>> >     thread
>> >     <
>> https://lists.apache.org/thread.html/b235f8ee55a737ea399756edd80b1218ed34d3439f7b0ed59bfa8e40@%3Cdev.beam.apache.org%3E
>> >
>> >     is where we discussed deprecating the Java Reference Runner (in some
>> >     places it's called the ULR or Universal Local Runner, but it's the
>> >     same thing). Then there's this thread
>> >     <
>> https://lists.apache.org/thread.html/0b68efce9b7f2c5297b32d09e5d903e9b354199fe2ce446fbcd240bc@%3Cdev.beam.apache.org%3E
>> >
>> >     where we discussed removing the code from the repo since it's been
>> >     deprecated.
>> >
>> >     If no one has any objections to trying to remove the code I'll have
>> >     someone review the PR I wrote and start a vote to have it merged.
>> >
>> >     Thanks,
>> >     Daniel Oliveira
>> >
>>
>

Re: Removing Java Reference Runner code

Posted by Daniel Oliveira <da...@google.com>.

Good questions, Anton. I can't give *definitive* answers to any of these,
but I can at least explain how I've been interpreting the move to the
Python version.

>  - if the focus is on python runner for portability efforts, how does java
> SDK (and other languages) tie into this? E.g. how do we run, test, measure,
> and develop things (pipelines, aspects of the SDK, runner);

You should be able to run anything that worked with the Java ULR on the
Python one. Thanks to Portability, the Runner and SDK can be completely
independent. For example, when I was working on the Java ULR I got it
running the Python ValidatesRunner tests that are currently used to test
the Python ULR. The reverse should hold true. I don't want to go into too
much depth on how it and other local portable runners are used, but the
short version is that you would start the runner as a separate process on
your machine and then indicate the runner you're using and the port it's on
in your Pipeline Options.
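
As a rough illustration of that last step, here is a minimal sketch for the
Python SDK. It assumes the runner process is already listening on
localhost:8099 (the host and port are illustrative, not something stated in
this thread) and that the SDK's PortableRunner is selected with the
--runner, --job_endpoint and --environment_type options. The helper name
portable_runner_flags is hypothetical.

```python
def portable_runner_flags(host="localhost", port=8099, environment_type="LOOPBACK"):
    """Return command-line flags pointing a Python SDK pipeline at a portable
    runner (job server) that is already running as a separate process.

    The default host, port and environment type are assumptions chosen for
    illustration only.
    """
    if not 0 < port < 65536:
        raise ValueError("port out of range: %d" % port)
    return [
        "--runner=PortableRunner",                   # submit over the portable job API
        "--job_endpoint=%s:%d" % (host, port),       # where the runner process listens
        "--environment_type=%s" % environment_type,  # how the SDK harness is started
    ]


if __name__ == "__main__":
    flags = portable_runner_flags()
    print(" ".join(flags))
    # With apache_beam installed and a runner listening on that port, the
    # flags would be handed to the pipeline roughly like this:
    #   import apache_beam as beam
    #   from apache_beam.options.pipeline_options import PipelineOptions
    #   with beam.Pipeline(options=PipelineOptions(flags)) as p:
    #       p | beam.Create(["a", "b"]) | beam.Map(print)
```

The same idea should carry over to a Java pipeline, where the equivalent
options are passed on the command line instead; the exact flag names there
are best checked against the SDK's portable pipeline options.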

The main obstacle I see is that recommending a Python runner for people
running Java pipelines is counterintuitive. It would require users to have
Python installed on their machine just to test their Java code, which is a
difficult situation to explain.

>  - what's our approach to developing new features, should we make sure
> python runner supports them as early as possible (e.g. schemas and SQL)?
>

That was the original hope with the Java ULR: that it would be a good place
to start implementing and iterating on new features without having to
implement them in a more complex runner. Of course we never actually
reached that goal, but we might be able to with the Python ULR since it's
so much further along in development.

>  - java DirectRunner is still there:
>     - it is still the primary tool for java SDK development purposes, and
> as Kenn mentioned in the linked threads it adds value by making sure users
> don't rely on implementation details of specific runners. Do we have a
> similar story for portable scenarios?
>

I think a long-term goal when it comes to portable runners is that we only
have one local runner in one language that all developers use across
multiple SDKs. In that sense, yes, the Python ULR would have a similar
story, covering all SDKs but only portable pipelines.

But we've had differing ideas about this and how far it should go. For
example, is this runner supposed to be good for debugging, or just for
running already validated pipelines? Do we still want non-portable local
runners for each SDK for performance or debug reasons? Questions like that
haven't really been answered. I think in one of the threads I linked to in
the OP there was some discussion about this if you want to see.

>     - I assume that extra validations in the DirectRunner have impact on
> performance in various ways (potentially non-deterministic). While this
> doesn't matter in some cases, it might do in others. Having a local runner
> that is (better) optimized for execution would probably make more sense for
> perf measurements, integration tests, and maybe even local production jobs.
> Is this something potentially worth looking into?
>

As I mentioned above, there are no specific plans, so it's mainly
something that's up for community discussion.

My personal opinion is that it's worth looking into, but I think a basic
implementation of portable features is more important first. Once
portability is at the point where it's reached parity with non-portable
pipelines feature-wise, then we can start thinking about having runners
with more niche uses.

On Fri, Apr 26, 2019 at 9:54 AM Anton Kedin <ke...@google.com> wrote:

> If there is no plans to invest in ULR then it makes sense to remove it.
>
> Going forward, however, I think we should try to document the higher level
> approach we're taking with runners (and portability) now that we have
> something working and can reflect on it. For example, couple of things that
> are not 100% clear to me:
>  - if the focus is on python runner for portability efforts, how does java
> SDK (and other languages) tie into this? E.g. how do we run, test, measure,
> and develop things (pipelines, aspects of the SDK, runner);
>  - what's our approach to developing new features, should we make sure
> python runner supports them as early as possible (e.g. schemas and SQL)?
>  - java DirectRunner is still there:
>     - it is still the primary tool for java SDK development purposes, and
> as Kenn mentioned in the linked threads it adds value by making sure users
> don't rely on implementation details of specific runners. Do we have a
> similar story for portable scenarios?
>     - I assume that extra validations in the DirectRunner have impact on
> performance in various ways (potentially non-deterministic). While this
> doesn't matter in some cases, it might do in others. Having a local runner
> that is (better) optimized for execution would probably make more sense for
> perf measurements, integration tests, and maybe even local production jobs.
> Is this something potentially worth looking into?
>
> Regards,
> Anton
>
>
> On Fri, Apr 26, 2019 at 4:41 AM Maximilian Michels <mx...@apache.org> wrote:
>
>> Thanks for following up with this. I have mixed feelings to see the
>> portable Java DirectRunner go, but I'm in favor of this change because
>> it removes a lot of code that we do not really make use of.
>>
>> -Max
>>
>> On 26.04.19 02:58, Kenneth Knowles wrote:
>> > Thanks for providing all this background on the PR. It is very easy to
>> > see where it came from. Definitely nice to have less code and fewer
>> > things that can break. Perhaps lazy consensus is enough.
>> >
>> > Kenn
>> >
>> > On Thu, Apr 25, 2019 at 4:01 PM Daniel Oliveira <danoliveira@google.com
>> > <ma...@google.com>> wrote:
>> >
>> >     Hey everyone,
>> >
>> >     I made a preliminary PR for removing all the Java Reference Runner
>> >     code (PR-8380 <https://github.com/apache/beam/pull/8380>) since I
>> >     wanted to see if it could be done easily. It seems to be working
>> >     fine, so I wanted to open up this discussion to make sure people are
>> >     still in agreement on getting rid of this code and that people don't
>> >     have any concerns.
>> >
>> >     For those who need additional context about this, this previous
>> >     thread
>> >     <
>> https://lists.apache.org/thread.html/b235f8ee55a737ea399756edd80b1218ed34d3439f7b0ed59bfa8e40@%3Cdev.beam.apache.org%3E
>> >
>> >     is where we discussed deprecating the Java Reference Runner (in some
>> >     places it's called the ULR or Universal Local Runner, but it's the
>> >     same thing). Then there's this thread
>> >     <
>> https://lists.apache.org/thread.html/0b68efce9b7f2c5297b32d09e5d903e9b354199fe2ce446fbcd240bc@%3Cdev.beam.apache.org%3E
>> >
>> >     where we discussed removing the code from the repo since it's been
>> >     deprecated.
>> >
>> >     If no one has any objections to trying to remove the code I'll have
>> >     someone review the PR I wrote and start a vote to have it merged.
>> >
>> >     Thanks,
>> >     Daniel Oliveira
>> >
>>
>

Re: Removing Java Reference Runner code

Posted by Anton Kedin <ke...@google.com>.

If there are no plans to invest in ULR then it makes sense to remove it.

Going forward, however, I think we should try to document the higher level
approach we're taking with runners (and portability) now that we have
something working and can reflect on it. For example, a couple of things that
are not 100% clear to me:
 - if the focus is on python runner for portability efforts, how does java
SDK (and other languages) tie into this? E.g. how do we run, test, measure,
and develop things (pipelines, aspects of the SDK, runner);
 - what's our approach to developing new features, should we make sure
python runner supports them as early as possible (e.g. schemas and SQL)?
 - java DirectRunner is still there:
    - it is still the primary tool for java SDK development purposes, and
as Kenn mentioned in the linked threads it adds value by making sure users
don't rely on implementation details of specific runners. Do we have a
similar story for portable scenarios?
    - I assume that extra validations in the DirectRunner have impact on
performance in various ways (potentially non-deterministic). While this
doesn't matter in some cases, it might do in others. Having a local runner
that is (better) optimized for execution would probably make more sense for
perf measurements, integration tests, and maybe even local production jobs.
Is this something potentially worth looking into?

Regards,
Anton


On Fri, Apr 26, 2019 at 4:41 AM Maximilian Michels <mx...@apache.org> wrote:

> Thanks for following up with this. I have mixed feelings to see the
> portable Java DirectRunner go, but I'm in favor of this change because
> it removes a lot of code that we do not really make use of.
>
> -Max
>
> On 26.04.19 02:58, Kenneth Knowles wrote:
> > Thanks for providing all this background on the PR. It is very easy to
> > see where it came from. Definitely nice to have less code and fewer
> > things that can break. Perhaps lazy consensus is enough.
> >
> > Kenn
> >
> > On Thu, Apr 25, 2019 at 4:01 PM Daniel Oliveira <danoliveira@google.com
> > <ma...@google.com>> wrote:
> >
> >     Hey everyone,
> >
> >     I made a preliminary PR for removing all the Java Reference Runner
> >     code (PR-8380 <https://github.com/apache/beam/pull/8380>) since I
> >     wanted to see if it could be done easily. It seems to be working
> >     fine, so I wanted to open up this discussion to make sure people are
> >     still in agreement on getting rid of this code and that people don't
> >     have any concerns.
> >
> >     For those who need additional context about this, this previous
> >     thread
> >     <
> https://lists.apache.org/thread.html/b235f8ee55a737ea399756edd80b1218ed34d3439f7b0ed59bfa8e40@%3Cdev.beam.apache.org%3E
> >
> >     is where we discussed deprecating the Java Reference Runner (in some
> >     places it's called the ULR or Universal Local Runner, but it's the
> >     same thing). Then there's this thread
> >     <
> https://lists.apache.org/thread.html/0b68efce9b7f2c5297b32d09e5d903e9b354199fe2ce446fbcd240bc@%3Cdev.beam.apache.org%3E
> >
> >     where we discussed removing the code from the repo since it's been
> >     deprecated.
> >
> >     If no one has any objections to trying to remove the code I'll have
> >     someone review the PR I wrote and start a vote to have it merged.
> >
> >     Thanks,
> >     Daniel Oliveira
> >
>

Re: Removing Java Reference Runner code

Posted by Maximilian Michels <mx...@apache.org>.

Thanks for following up with this. I have mixed feelings about seeing the
portable Java DirectRunner go, but I'm in favor of this change because
it removes a lot of code that we do not really make use of.

-Max

On 26.04.19 02:58, Kenneth Knowles wrote:
> Thanks for providing all this background on the PR. It is very easy to 
> see where it came from. Definitely nice to have less code and fewer 
> things that can break. Perhaps lazy consensus is enough.
> 
> Kenn
> 
> On Thu, Apr 25, 2019 at 4:01 PM Daniel Oliveira <danoliveira@google.com 
> <ma...@google.com>> wrote:
> 
>     Hey everyone,
> 
>     I made a preliminary PR for removing all the Java Reference Runner
>     code (PR-8380 <https://github.com/apache/beam/pull/8380>) since I
>     wanted to see if it could be done easily. It seems to be working
>     fine, so I wanted to open up this discussion to make sure people are
>     still in agreement on getting rid of this code and that people don't
>     have any concerns.
> 
>     For those who need additional context about this, this previous
>     thread
>     <https://lists.apache.org/thread.html/b235f8ee55a737ea399756edd80b1218ed34d3439f7b0ed59bfa8e40@%3Cdev.beam.apache.org%3E>
>     is where we discussed deprecating the Java Reference Runner (in some
>     places it's called the ULR or Universal Local Runner, but it's the
>     same thing). Then there's this thread
>     <https://lists.apache.org/thread.html/0b68efce9b7f2c5297b32d09e5d903e9b354199fe2ce446fbcd240bc@%3Cdev.beam.apache.org%3E>
>     where we discussed removing the code from the repo since it's been
>     deprecated.
> 
>     If no one has any objections to trying to remove the code I'll have
>     someone review the PR I wrote and start a vote to have it merged.
> 
>     Thanks,
>     Daniel Oliveira
>

Re: Removing Java Reference Runner code

Posted by Kenneth Knowles <ke...@apache.org>.

Thanks for providing all this background on the PR. It is very easy to see
where it came from. Definitely nice to have less code and fewer things that
can break. Perhaps lazy consensus is enough.

Kenn

On Thu, Apr 25, 2019 at 4:01 PM Daniel Oliveira <da...@google.com>
wrote:

> Hey everyone,
>
> I made a preliminary PR for removing all the Java Reference Runner code (
> PR-8380 <https://github.com/apache/beam/pull/8380>) since I wanted to see
> if it could be done easily. It seems to be working fine, so I wanted to
> open up this discussion to make sure people are still in agreement on
> getting rid of this code and that people don't have any concerns.
>
> For those who need additional context about this, this previous thread
> <https://lists.apache.org/thread.html/b235f8ee55a737ea399756edd80b1218ed34d3439f7b0ed59bfa8e40@%3Cdev.beam.apache.org%3E>
> is where we discussed deprecating the Java Reference Runner (in some places
> it's called the ULR or Universal Local Runner, but it's the same thing).
> Then there's this thread
> <https://lists.apache.org/thread.html/0b68efce9b7f2c5297b32d09e5d903e9b354199fe2ce446fbcd240bc@%3Cdev.beam.apache.org%3E>
> where we discussed removing the code from the repo since it's been
> deprecated.
>
> If no one has any objections to trying to remove the code I'll have
> someone review the PR I wrote and start a vote to have it merged.
>
> Thanks,
> Daniel Oliveira
>