You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@beam.apache.org by Romain Manni-Bucau <rm...@gmail.com> on 2018/04/10 16:33:23 UTC

org.apache.beam.sdk.values.TupleTag#genId and stacktraces?

Hi

does org.apache.beam.sdk.values.TupleTag#genId need to get the
stacktrace or can we use any id generator (like
UUID.random().toString())? Using traces is quite slow under load and
environments where the root stack is not just the "next" level so
skipping it would be nice.

Romain Manni-Bucau
@rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book

Re: org.apache.beam.sdk.values.TupleTag#genId and stacktraces?

Posted by Thomas Groh <tg...@google.com>.

It may be reasonable to port most of those TupleTags to have an explicit,
rather than generated ID, which will remove the need to inspect the stack
trace.

However, as mentioned, the constructor shouldn't provide an unstable ID, as
otherwise most pipelines won't work on production runners.

On Tue, Apr 10, 2018 at 10:09 AM Romain Manni-Bucau <rm...@gmail.com>
wrote:

> Oops cross post sorry.
>
> Issue i hit on this thread is it is used a lot in tests abd it slows down
> tests for nothing like with generatesequence ones
>
> Le 10 avr. 2018 19:00, "Romain Manni-Bucau" <rm...@gmail.com> a
> écrit :
>
>>
>>
>> Le 10 avr. 2018 18:40, "Robert Bradshaw" <ro...@google.com> a écrit :
>>
>> These values should be, inasmuch as possible, stable across VMs. How slow
>> is slow? Doesn't this happen only once per VM startup?
>>
>>
>> Once per jvm and idea launches a jvm per test and the daemon does save
>> enough time, you still go through the whole project and check all upstream
>> deps it seems.
>>
>> It is <1s with maven vs 5-6s with gradle.
>>
>>
>> On Tue, Apr 10, 2018 at 9:33 AM Romain Manni-Bucau <rm...@gmail.com>
>> wrote:
>>
>>> Hi
>>>
>>> does org.apache.beam.sdk.values.TupleTag#genId need to get the
>>> stacktrace or can we use any id generator (like
>>> UUID.random().toString())? Using traces is quite slow under load and
>>> environments where the root stack is not just the "next" level so
>>> skipping it would be nice.
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
>>>
>>
>>

Re: org.apache.beam.sdk.values.TupleTag#genId and stacktraces?

Posted by Reuven Lax <re...@google.com>.

I think it's best for the users to pick their own identifier for TupleTags.
As part of the snapshot/update proposal, we want to be able update
pipelines across refactorings. We tell people to supply a stable name for
transforms (in apply) specifically to make this doable (so we can match up
transforms between the old and new graphs). However if side inputs have
names that are unstable across refactoring, this becomes impossible to do
(impossible even for the user, since the user has no insight into these
auto-generated ids).

Reuven

On Wed, Apr 11, 2018 at 4:58 AM Romain Manni-Bucau <rm...@gmail.com>
wrote:

>
>
> Le 10 avr. 2018 22:59, "Robert Bradshaw" <ro...@google.com> a écrit :
>
> On Tue, Apr 10, 2018 at 1:49 PM Romain Manni-Bucau <rm...@gmail.com>
> wrote:
>
>>
>> Le 10 avr. 2018 21:25, "Robert Bradshaw" <ro...@google.com> a écrit :
>>
>> On Tue, Apr 10, 2018 at 12:10 PM Romain Manni-Bucau <
>> rmannibucau@gmail.com> wrote:
>>
>>> This is interesting cause it leads to "why do the workers need to do it
>>> again instead of reusing the computed one?". Technically the answer is
>>> trivial but in terms of design I think beam tends to abuse static init
>>> block - even in dofn api - which easily lead to issues when we will want to
>>> support more than a main (thinking to OSGi for instance).
>>>
>>> So:
>>>
>>> 1. Why not using a standard programming model not cinit based? (Perf are
>>> not a valid answer indeed)
>>>
>>
>> The Java language (as far as I know) doesn't have the ability to prohibit
>> assigning static values (such as TupleTags) as static members. We can,
>> however, detect this (which is what the current code does). It doesn't seem
>> to me that code like
>>
>> public class MyDoFn {
>>     public static final TupleTag<String> SOME_OUTPUT_TAG = new
>> TupleTag<>();
>>     ...
>> }
>>
>> is "bad practice," especially as this tag will need to be referenced in
>> multiple places.
>>
>>
>> It is as soon as you dont run in a flat classpath env. In flat cp it is
>> acceptable and dont have much side effects...but beam doesnt know where it
>> runs ;).
>>
>
> The problem is, people *will* write this.
>
>
>
> This is ok and if thegenid/default constructor is deprecated it is ok.
>
> Note however the dofn api encourages this pattern for state and timers
> instead of passing them as parameters or field (not static) injections
> which would be saner.
>
>
>
>> 2. GenId should probably be deprecated and considered a bad practise
>>>
>>
>> Is the proposal that we require the user to manually provide unique
>> identifiers everywhere? Or for static case like above? (Note that
>> accidentally re-using identifiers can lead to subtle incorrect pipeline
>> results.)
>>
>>
>> Yep.
>>
>
> Yep to which?
>
>
>
> Force a value. Static or not is a user choice we should just advice to not
> be static.
>
>
>
>> And ensure we can serialize a tupletag with an already uuid-generated id
>> for instance.
>>
>
> Yes, we already do this.
>
>
>>
>>
>> This looks like a detail but for beam 3 we should ensure we drop the
>>> legacy bringing bad practises in our user code.
>>>
>>> Le 10 avr. 2018 20:15, "Ben Chambers" <bc...@apache.org> a écrit :
>>>
>>>> I believe it doesn't need to be stable across refactoring, only across
>>>> all workers executing a specific version of the code. Specifically, it is
>>>> used as follows:
>>>>
>>>> 1. Create a pipeline on the user's machine. It walks the stack until
>>>> the static initializer block, which provides an ID.
>>>> 2. Send the pipeline to many worker machines.
>>>> 3. Each worker machine walks the stack until the static initializer
>>>> block (on the same version of the code), receiving the same ID.
>>>>
>>>> This ensures that the tupletag is the same on all the workers, as well
>>>> as on the user's machine, which is critical since it used as an identifier
>>>> across these machines.
>>>>
>>>> Assigning a UUID would work if all of the machines agreed on the same
>>>> tuple ID, which could be accomplished with serialization. Serialization,
>>>> however, doesn't work well with static initializers, since those will have
>>>> been called to initialize the class at load time.
>>>>
>>>> On Tue, Apr 10, 2018 at 10:27 AM Romain Manni-Bucau <
>>>> rmannibucau@gmail.com> wrote:
>>>>
>>>>> Well issue is more about all the existing tests currently.
>>>>>
>>>>> Out of curiosity: how walking the stack is stable since the stack can
>>>>> change? Stop condition is the static block of a class which can use method
>>>>> so refactoring and therefore is not stable. Should it be deprecated?
>>>>>
>>>>>
>>>>> Le 10 avr. 2018 19:17, "Robert Bradshaw" <ro...@google.com> a
>>>>> écrit :
>>>>>
>>>>> If it's too slow perhaps you could use the constructor where you pass
>>>>> an explicit id (though in my experience walking the stack isn't that slow).
>>>>>
>>>>> On Tue, Apr 10, 2018 at 10:09 AM Romain Manni-Bucau <
>>>>> rmannibucau@gmail.com> wrote:
>>>>>
>>>>>> Oops cross post sorry.
>>>>>>
>>>>>> Issue i hit on this thread is it is used a lot in tests abd it slows
>>>>>> down tests for nothing like with generatesequence ones
>>>>>>
>>>>>> Le 10 avr. 2018 19:00, "Romain Manni-Bucau" <rm...@gmail.com>
>>>>>> a écrit :
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Le 10 avr. 2018 18:40, "Robert Bradshaw" <ro...@google.com> a
>>>>>>> écrit :
>>>>>>>
>>>>>>> These values should be, inasmuch as possible, stable across VMs. How
>>>>>>> slow is slow? Doesn't this happen only once per VM startup?
>>>>>>>
>>>>>>>
>>>>>>> Once per jvm and idea launches a jvm per test and the daemon does
>>>>>>> save enough time, you still go through the whole project and check all
>>>>>>> upstream deps it seems.
>>>>>>>
>>>>>>> It is <1s with maven vs 5-6s with gradle.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Apr 10, 2018 at 9:33 AM Romain Manni-Bucau <
>>>>>>> rmannibucau@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> does org.apache.beam.sdk.values.TupleTag#genId need to get the
>>>>>>>> stacktrace or can we use any id generator (like
>>>>>>>> UUID.random().toString())? Using traces is quite slow under load and
>>>>>>>> environments where the root stack is not just the "next" level so
>>>>>>>> skipping it would be nice.
>>>>>>>>
>>>>>>>> Romain Manni-Bucau
>>>>>>>> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>
>

Re: org.apache.beam.sdk.values.TupleTag#genId and stacktraces?

Posted by Romain Manni-Bucau <rm...@gmail.com>.

Le 10 avr. 2018 22:59, "Robert Bradshaw" <ro...@google.com> a écrit :

On Tue, Apr 10, 2018 at 1:49 PM Romain Manni-Bucau <rm...@gmail.com>
wrote:

>
> Le 10 avr. 2018 21:25, "Robert Bradshaw" <ro...@google.com> a écrit :
>
> On Tue, Apr 10, 2018 at 12:10 PM Romain Manni-Bucau <rm...@gmail.com>
> wrote:
>
>> This is interesting cause it leads to "why do the workers need to do it
>> again instead of reusing the computed one?". Technically the answer is
>> trivial but in terms of design I think beam tends to abuse static init
>> block - even in dofn api - which easily lead to issues when we will want to
>> support more than a main (thinking to OSGi for instance).
>>
>> So:
>>
>> 1. Why not using a standard programming model not cinit based? (Perf are
>> not a valid answer indeed)
>>
>
> The Java language (as far as I know) doesn't have the ability to prohibit
> assigning static values (such as TupleTags) as static members. We can,
> however, detect this (which is what the current code does). It doesn't seem
> to me that code like
>
> public class MyDoFn {
>     public static final TupleTag<String> SOME_OUTPUT_TAG = new
> TupleTag<>();
>     ...
> }
>
> is "bad practice," especially as this tag will need to be referenced in
> multiple places.
>
>
> It is as soon as you dont run in a flat classpath env. In flat cp it is
> acceptable and dont have much side effects...but beam doesnt know where it
> runs ;).
>

The problem is, people *will* write this.



This is ok and if thegenid/default constructor is deprecated it is ok.

Note however the dofn api encourages this pattern for state and timers
instead of passing them as parameters or field (not static) injections
which would be saner.



> 2. GenId should probably be deprecated and considered a bad practise
>>
>
> Is the proposal that we require the user to manually provide unique
> identifiers everywhere? Or for static case like above? (Note that
> accidentally re-using identifiers can lead to subtle incorrect pipeline
> results.)
>
>
> Yep.
>

Yep to which?



Force a value. Static or not is a user choice we should just advice to not
be static.



> And ensure we can serialize a tupletag with an already uuid-generated id
> for instance.
>

Yes, we already do this.


>
>
> This looks like a detail but for beam 3 we should ensure we drop the
>> legacy bringing bad practises in our user code.
>>
>> Le 10 avr. 2018 20:15, "Ben Chambers" <bc...@apache.org> a écrit :
>>
>>> I believe it doesn't need to be stable across refactoring, only across
>>> all workers executing a specific version of the code. Specifically, it is
>>> used as follows:
>>>
>>> 1. Create a pipeline on the user's machine. It walks the stack until the
>>> static initializer block, which provides an ID.
>>> 2. Send the pipeline to many worker machines.
>>> 3. Each worker machine walks the stack until the static initializer
>>> block (on the same version of the code), receiving the same ID.
>>>
>>> This ensures that the tupletag is the same on all the workers, as well
>>> as on the user's machine, which is critical since it used as an identifier
>>> across these machines.
>>>
>>> Assigning a UUID would work if all of the machines agreed on the same
>>> tuple ID, which could be accomplished with serialization. Serialization,
>>> however, doesn't work well with static initializers, since those will have
>>> been called to initialize the class at load time.
>>>
>>> On Tue, Apr 10, 2018 at 10:27 AM Romain Manni-Bucau <
>>> rmannibucau@gmail.com> wrote:
>>>
>>>> Well issue is more about all the existing tests currently.
>>>>
>>>> Out of curiosity: how walking the stack is stable since the stack can
>>>> change? Stop condition is the static block of a class which can use method
>>>> so refactoring and therefore is not stable. Should it be deprecated?
>>>>
>>>>
>>>> Le 10 avr. 2018 19:17, "Robert Bradshaw" <ro...@google.com> a
>>>> écrit :
>>>>
>>>> If it's too slow perhaps you could use the constructor where you pass
>>>> an explicit id (though in my experience walking the stack isn't that slow).
>>>>
>>>> On Tue, Apr 10, 2018 at 10:09 AM Romain Manni-Bucau <
>>>> rmannibucau@gmail.com> wrote:
>>>>
>>>>> Oops cross post sorry.
>>>>>
>>>>> Issue i hit on this thread is it is used a lot in tests abd it slows
>>>>> down tests for nothing like with generatesequence ones
>>>>>
>>>>> Le 10 avr. 2018 19:00, "Romain Manni-Bucau" <rm...@gmail.com> a
>>>>> écrit :
>>>>>
>>>>>>
>>>>>>
>>>>>> Le 10 avr. 2018 18:40, "Robert Bradshaw" <ro...@google.com> a
>>>>>> écrit :
>>>>>>
>>>>>> These values should be, inasmuch as possible, stable across VMs. How
>>>>>> slow is slow? Doesn't this happen only once per VM startup?
>>>>>>
>>>>>>
>>>>>> Once per jvm and idea launches a jvm per test and the daemon does
>>>>>> save enough time, you still go through the whole project and check all
>>>>>> upstream deps it seems.
>>>>>>
>>>>>> It is <1s with maven vs 5-6s with gradle.
>>>>>>
>>>>>>
>>>>>> On Tue, Apr 10, 2018 at 9:33 AM Romain Manni-Bucau <
>>>>>> rmannibucau@gmail.com> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> does org.apache.beam.sdk.values.TupleTag#genId need to get the
>>>>>>> stacktrace or can we use any id generator (like
>>>>>>> UUID.random().toString())? Using traces is quite slow under load and
>>>>>>> environments where the root stack is not just the "next" level so
>>>>>>> skipping it would be nice.
>>>>>>>
>>>>>>> Romain Manni-Bucau
>>>>>>> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
>>>>>>>
>>>>>>
>>>>>>
>>>>
>

Re: org.apache.beam.sdk.values.TupleTag#genId and stacktraces?

Posted by Robert Bradshaw <ro...@google.com>.

On Tue, Apr 10, 2018 at 1:49 PM Romain Manni-Bucau <rm...@gmail.com>
wrote:

>
> Le 10 avr. 2018 21:25, "Robert Bradshaw" <ro...@google.com> a écrit :
>
> On Tue, Apr 10, 2018 at 12:10 PM Romain Manni-Bucau <rm...@gmail.com>
> wrote:
>
>> This is interesting cause it leads to "why do the workers need to do it
>> again instead of reusing the computed one?". Technically the answer is
>> trivial but in terms of design I think beam tends to abuse static init
>> block - even in dofn api - which easily lead to issues when we will want to
>> support more than a main (thinking to OSGi for instance).
>>
>> So:
>>
>> 1. Why not using a standard programming model not cinit based? (Perf are
>> not a valid answer indeed)
>>
>
> The Java language (as far as I know) doesn't have the ability to prohibit
> assigning static values (such as TupleTags) as static members. We can,
> however, detect this (which is what the current code does). It doesn't seem
> to me that code like
>
> public class MyDoFn {
>     public static final TupleTag<String> SOME_OUTPUT_TAG = new
> TupleTag<>();
>     ...
> }
>
> is "bad practice," especially as this tag will need to be referenced in
> multiple places.
>
>
> It is as soon as you dont run in a flat classpath env. In flat cp it is
> acceptable and dont have much side effects...but beam doesnt know where it
> runs ;).
>

The problem is, people *will* write this.


> 2. GenId should probably be deprecated and considered a bad practise
>>
>
> Is the proposal that we require the user to manually provide unique
> identifiers everywhere? Or for static case like above? (Note that
> accidentally re-using identifiers can lead to subtle incorrect pipeline
> results.)
>
>
> Yep.
>

Yep to which?


> And ensure we can serialize a tupletag with an already uuid-generated id
> for instance.
>

Yes, we already do this.


>
>
> This looks like a detail but for beam 3 we should ensure we drop the
>> legacy bringing bad practises in our user code.
>>
>> Le 10 avr. 2018 20:15, "Ben Chambers" <bc...@apache.org> a écrit :
>>
>>> I believe it doesn't need to be stable across refactoring, only across
>>> all workers executing a specific version of the code. Specifically, it is
>>> used as follows:
>>>
>>> 1. Create a pipeline on the user's machine. It walks the stack until the
>>> static initializer block, which provides an ID.
>>> 2. Send the pipeline to many worker machines.
>>> 3. Each worker machine walks the stack until the static initializer
>>> block (on the same version of the code), receiving the same ID.
>>>
>>> This ensures that the tupletag is the same on all the workers, as well
>>> as on the user's machine, which is critical since it used as an identifier
>>> across these machines.
>>>
>>> Assigning a UUID would work if all of the machines agreed on the same
>>> tuple ID, which could be accomplished with serialization. Serialization,
>>> however, doesn't work well with static initializers, since those will have
>>> been called to initialize the class at load time.
>>>
>>> On Tue, Apr 10, 2018 at 10:27 AM Romain Manni-Bucau <
>>> rmannibucau@gmail.com> wrote:
>>>
>>>> Well issue is more about all the existing tests currently.
>>>>
>>>> Out of curiosity: how walking the stack is stable since the stack can
>>>> change? Stop condition is the static block of a class which can use method
>>>> so refactoring and therefore is not stable. Should it be deprecated?
>>>>
>>>>
>>>> Le 10 avr. 2018 19:17, "Robert Bradshaw" <ro...@google.com> a
>>>> écrit :
>>>>
>>>> If it's too slow perhaps you could use the constructor where you pass
>>>> an explicit id (though in my experience walking the stack isn't that slow).
>>>>
>>>> On Tue, Apr 10, 2018 at 10:09 AM Romain Manni-Bucau <
>>>> rmannibucau@gmail.com> wrote:
>>>>
>>>>> Oops cross post sorry.
>>>>>
>>>>> Issue i hit on this thread is it is used a lot in tests abd it slows
>>>>> down tests for nothing like with generatesequence ones
>>>>>
>>>>> Le 10 avr. 2018 19:00, "Romain Manni-Bucau" <rm...@gmail.com> a
>>>>> écrit :
>>>>>
>>>>>>
>>>>>>
>>>>>> Le 10 avr. 2018 18:40, "Robert Bradshaw" <ro...@google.com> a
>>>>>> écrit :
>>>>>>
>>>>>> These values should be, inasmuch as possible, stable across VMs. How
>>>>>> slow is slow? Doesn't this happen only once per VM startup?
>>>>>>
>>>>>>
>>>>>> Once per jvm and idea launches a jvm per test and the daemon does
>>>>>> save enough time, you still go through the whole project and check all
>>>>>> upstream deps it seems.
>>>>>>
>>>>>> It is <1s with maven vs 5-6s with gradle.
>>>>>>
>>>>>>
>>>>>> On Tue, Apr 10, 2018 at 9:33 AM Romain Manni-Bucau <
>>>>>> rmannibucau@gmail.com> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> does org.apache.beam.sdk.values.TupleTag#genId need to get the
>>>>>>> stacktrace or can we use any id generator (like
>>>>>>> UUID.random().toString())? Using traces is quite slow under load and
>>>>>>> environments where the root stack is not just the "next" level so
>>>>>>> skipping it would be nice.
>>>>>>>
>>>>>>> Romain Manni-Bucau
>>>>>>> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
>>>>>>>
>>>>>>
>>>>>>
>>>>
>

Re: org.apache.beam.sdk.values.TupleTag#genId and stacktraces?

Posted by Romain Manni-Bucau <rm...@gmail.com>.

Le 10 avr. 2018 21:25, "Robert Bradshaw" <ro...@google.com> a écrit :

On Tue, Apr 10, 2018 at 12:10 PM Romain Manni-Bucau <rm...@gmail.com>
wrote:

> This is interesting cause it leads to "why do the workers need to do it
> again instead of reusing the computed one?". Technically the answer is
> trivial but in terms of design I think beam tends to abuse static init
> block - even in dofn api - which easily lead to issues when we will want to
> support more than a main (thinking to OSGi for instance).
>
> So:
>
> 1. Why not using a standard programming model not cinit based? (Perf are
> not a valid answer indeed)
>

The Java language (as far as I know) doesn't have the ability to prohibit
assigning static values (such as TupleTags) as static members. We can,
however, detect this (which is what the current code does). It doesn't seem
to me that code like

public class MyDoFn {
    public static final TupleTag<String> SOME_OUTPUT_TAG = new TupleTag<>();
    ...
}

is "bad practice," especially as this tag will need to be referenced in
multiple places.


It is as soon as you dont run in a flat classpath env. In flat cp it is
acceptable and dont have much side effects...but beam doesnt know where it
runs ;).



> 2. GenId should probably be deprecated and considered a bad practise
>

Is the proposal that we require the user to manually provide unique
identifiers everywhere? Or for static case like above? (Note that
accidentally re-using identifiers can lead to subtle incorrect pipeline
results.)


Yep. And ensure we can serialize a tupletag with an already uuid-generated
id for instance.


This looks like a detail but for beam 3 we should ensure we drop the legacy
> bringing bad practises in our user code.
>
> Le 10 avr. 2018 20:15, "Ben Chambers" <bc...@apache.org> a écrit :
>
>> I believe it doesn't need to be stable across refactoring, only across
>> all workers executing a specific version of the code. Specifically, it is
>> used as follows:
>>
>> 1. Create a pipeline on the user's machine. It walks the stack until the
>> static initializer block, which provides an ID.
>> 2. Send the pipeline to many worker machines.
>> 3. Each worker machine walks the stack until the static initializer block
>> (on the same version of the code), receiving the same ID.
>>
>> This ensures that the tupletag is the same on all the workers, as well as
>> on the user's machine, which is critical since it used as an identifier
>> across these machines.
>>
>> Assigning a UUID would work if all of the machines agreed on the same
>> tuple ID, which could be accomplished with serialization. Serialization,
>> however, doesn't work well with static initializers, since those will have
>> been called to initialize the class at load time.
>>
>> On Tue, Apr 10, 2018 at 10:27 AM Romain Manni-Bucau <
>> rmannibucau@gmail.com> wrote:
>>
>>> Well issue is more about all the existing tests currently.
>>>
>>> Out of curiosity: how walking the stack is stable since the stack can
>>> change? Stop condition is the static block of a class which can use method
>>> so refactoring and therefore is not stable. Should it be deprecated?
>>>
>>>
>>> Le 10 avr. 2018 19:17, "Robert Bradshaw" <ro...@google.com> a écrit :
>>>
>>> If it's too slow perhaps you could use the constructor where you pass an
>>> explicit id (though in my experience walking the stack isn't that slow).
>>>
>>> On Tue, Apr 10, 2018 at 10:09 AM Romain Manni-Bucau <
>>> rmannibucau@gmail.com> wrote:
>>>
>>>> Oops cross post sorry.
>>>>
>>>> Issue i hit on this thread is it is used a lot in tests abd it slows
>>>> down tests for nothing like with generatesequence ones
>>>>
>>>> Le 10 avr. 2018 19:00, "Romain Manni-Bucau" <rm...@gmail.com> a
>>>> écrit :
>>>>
>>>>>
>>>>>
>>>>> Le 10 avr. 2018 18:40, "Robert Bradshaw" <ro...@google.com> a
>>>>> écrit :
>>>>>
>>>>> These values should be, inasmuch as possible, stable across VMs. How
>>>>> slow is slow? Doesn't this happen only once per VM startup?
>>>>>
>>>>>
>>>>> Once per jvm and idea launches a jvm per test and the daemon does save
>>>>> enough time, you still go through the whole project and check all upstream
>>>>> deps it seems.
>>>>>
>>>>> It is <1s with maven vs 5-6s with gradle.
>>>>>
>>>>>
>>>>> On Tue, Apr 10, 2018 at 9:33 AM Romain Manni-Bucau <
>>>>> rmannibucau@gmail.com> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> does org.apache.beam.sdk.values.TupleTag#genId need to get the
>>>>>> stacktrace or can we use any id generator (like
>>>>>> UUID.random().toString())? Using traces is quite slow under load and
>>>>>> environments where the root stack is not just the "next" level so
>>>>>> skipping it would be nice.
>>>>>>
>>>>>> Romain Manni-Bucau
>>>>>> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
>>>>>>
>>>>>
>>>>>
>>>

Re: org.apache.beam.sdk.values.TupleTag#genId and stacktraces?

Posted by Robert Bradshaw <ro...@google.com>.

On Tue, Apr 10, 2018 at 12:10 PM Romain Manni-Bucau <rm...@gmail.com>
wrote:

> This is interesting cause it leads to "why do the workers need to do it
> again instead of reusing the computed one?". Technically the answer is
> trivial but in terms of design I think beam tends to abuse static init
> block - even in dofn api - which easily lead to issues when we will want to
> support more than a main (thinking to OSGi for instance).
>
> So:
>
> 1. Why not using a standard programming model not cinit based? (Perf are
> not a valid answer indeed)
>

The Java language (as far as I know) doesn't have the ability to prohibit
assigning static values (such as TupleTags) as static members. We can,
however, detect this (which is what the current code does). It doesn't seem
to me that code like

public class MyDoFn {
    public static final TupleTag<String> SOME_OUTPUT_TAG = new TupleTag<>();
    ...
}

is "bad practice," especially as this tag will need to be referenced in
multiple places.


> 2. GenId should probably be deprecated and considered a bad practise
>

Is the proposal that we require the user to manually provide unique
identifiers everywhere? Or for static case like above? (Note that
accidentally re-using identifiers can lead to subtle incorrect pipeline
results.)

This looks like a detail but for beam 3 we should ensure we drop the legacy
> bringing bad practises in our user code.
>
> Le 10 avr. 2018 20:15, "Ben Chambers" <bc...@apache.org> a écrit :
>
>> I believe it doesn't need to be stable across refactoring, only across
>> all workers executing a specific version of the code. Specifically, it is
>> used as follows:
>>
>> 1. Create a pipeline on the user's machine. It walks the stack until the
>> static initializer block, which provides an ID.
>> 2. Send the pipeline to many worker machines.
>> 3. Each worker machine walks the stack until the static initializer block
>> (on the same version of the code), receiving the same ID.
>>
>> This ensures that the tupletag is the same on all the workers, as well as
>> on the user's machine, which is critical since it used as an identifier
>> across these machines.
>>
>> Assigning a UUID would work if all of the machines agreed on the same
>> tuple ID, which could be accomplished with serialization. Serialization,
>> however, doesn't work well with static initializers, since those will have
>> been called to initialize the class at load time.
>>
>> On Tue, Apr 10, 2018 at 10:27 AM Romain Manni-Bucau <
>> rmannibucau@gmail.com> wrote:
>>
>>> Well issue is more about all the existing tests currently.
>>>
>>> Out of curiosity: how walking the stack is stable since the stack can
>>> change? Stop condition is the static block of a class which can use method
>>> so refactoring and therefore is not stable. Should it be deprecated?
>>>
>>>
>>> Le 10 avr. 2018 19:17, "Robert Bradshaw" <ro...@google.com> a écrit :
>>>
>>> If it's too slow perhaps you could use the constructor where you pass an
>>> explicit id (though in my experience walking the stack isn't that slow).
>>>
>>> On Tue, Apr 10, 2018 at 10:09 AM Romain Manni-Bucau <
>>> rmannibucau@gmail.com> wrote:
>>>
>>>> Oops cross post sorry.
>>>>
>>>> Issue i hit on this thread is it is used a lot in tests abd it slows
>>>> down tests for nothing like with generatesequence ones
>>>>
>>>> Le 10 avr. 2018 19:00, "Romain Manni-Bucau" <rm...@gmail.com> a
>>>> écrit :
>>>>
>>>>>
>>>>>
>>>>> Le 10 avr. 2018 18:40, "Robert Bradshaw" <ro...@google.com> a
>>>>> écrit :
>>>>>
>>>>> These values should be, inasmuch as possible, stable across VMs. How
>>>>> slow is slow? Doesn't this happen only once per VM startup?
>>>>>
>>>>>
>>>>> Once per jvm and idea launches a jvm per test and the daemon does save
>>>>> enough time, you still go through the whole project and check all upstream
>>>>> deps it seems.
>>>>>
>>>>> It is <1s with maven vs 5-6s with gradle.
>>>>>
>>>>>
>>>>> On Tue, Apr 10, 2018 at 9:33 AM Romain Manni-Bucau <
>>>>> rmannibucau@gmail.com> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> does org.apache.beam.sdk.values.TupleTag#genId need to get the
>>>>>> stacktrace or can we use any id generator (like
>>>>>> UUID.random().toString())? Using traces is quite slow under load and
>>>>>> environments where the root stack is not just the "next" level so
>>>>>> skipping it would be nice.
>>>>>>
>>>>>> Romain Manni-Bucau
>>>>>> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
>>>>>>
>>>>>
>>>>>
>>>

Re: org.apache.beam.sdk.values.TupleTag#genId and stacktraces?

Posted by Romain Manni-Bucau <rm...@gmail.com>.

This is interesting cause it leads to "why do the workers need to do it
again instead of reusing the computed one?". Technically the answer is
trivial but in terms of design I think beam tends to abuse static init
block - even in dofn api - which easily lead to issues when we will want to
support more than a main (thinking to OSGi for instance).

So:

1. Why not using a standard programming model not cinit based? (Perf are
not a valid answer indeed)
2. GenId should probably be deprecated and considered a bad practise

This looks like a detail but for beam 3 we should ensure we drop the legacy
bringing bad practises in our user code.

Le 10 avr. 2018 20:15, "Ben Chambers" <bc...@apache.org> a écrit :

> I believe it doesn't need to be stable across refactoring, only across all
> workers executing a specific version of the code. Specifically, it is used
> as follows:
>
> 1. Create a pipeline on the user's machine. It walks the stack until the
> static initializer block, which provides an ID.
> 2. Send the pipeline to many worker machines.
> 3. Each worker machine walks the stack until the static initializer block
> (on the same version of the code), receiving the same ID.
>
> This ensures that the tupletag is the same on all the workers, as well as
> on the user's machine, which is critical since it used as an identifier
> across these machines.
>
> Assigning a UUID would work if all of the machines agreed on the same
> tuple ID, which could be accomplished with serialization. Serialization,
> however, doesn't work well with static initializers, since those will have
> been called to initialize the class at load time.
>
> On Tue, Apr 10, 2018 at 10:27 AM Romain Manni-Bucau <rm...@gmail.com>
> wrote:
>
>> Well issue is more about all the existing tests currently.
>>
>> Out of curiosity: how walking the stack is stable since the stack can
>> change? Stop condition is the static block of a class which can use method
>> so refactoring and therefore is not stable. Should it be deprecated?
>>
>>
>> Le 10 avr. 2018 19:17, "Robert Bradshaw" <ro...@google.com> a écrit :
>>
>> If it's too slow perhaps you could use the constructor where you pass an
>> explicit id (though in my experience walking the stack isn't that slow).
>>
>> On Tue, Apr 10, 2018 at 10:09 AM Romain Manni-Bucau <
>> rmannibucau@gmail.com> wrote:
>>
>>> Oops cross post sorry.
>>>
>>> Issue i hit on this thread is it is used a lot in tests abd it slows
>>> down tests for nothing like with generatesequence ones
>>>
>>> Le 10 avr. 2018 19:00, "Romain Manni-Bucau" <rm...@gmail.com> a
>>> écrit :
>>>
>>>>
>>>>
>>>> Le 10 avr. 2018 18:40, "Robert Bradshaw" <ro...@google.com> a
>>>> écrit :
>>>>
>>>> These values should be, inasmuch as possible, stable across VMs. How
>>>> slow is slow? Doesn't this happen only once per VM startup?
>>>>
>>>>
>>>> Once per jvm and idea launches a jvm per test and the daemon does save
>>>> enough time, you still go through the whole project and check all upstream
>>>> deps it seems.
>>>>
>>>> It is <1s with maven vs 5-6s with gradle.
>>>>
>>>>
>>>> On Tue, Apr 10, 2018 at 9:33 AM Romain Manni-Bucau <
>>>> rmannibucau@gmail.com> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> does org.apache.beam.sdk.values.TupleTag#genId need to get the
>>>>> stacktrace or can we use any id generator (like
>>>>> UUID.random().toString())? Using traces is quite slow under load and
>>>>> environments where the root stack is not just the "next" level so
>>>>> skipping it would be nice.
>>>>>
>>>>> Romain Manni-Bucau
>>>>> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
>>>>>
>>>>
>>>>
>>

Re: org.apache.beam.sdk.values.TupleTag#genId and stacktraces?

Posted by Ben Chambers <bc...@apache.org>.

I believe it doesn't need to be stable across refactoring, only across all
workers executing a specific version of the code. Specifically, it is used
as follows:

1. Create a pipeline on the user's machine. It walks the stack until the
static initializer block, which provides an ID.
2. Send the pipeline to many worker machines.
3. Each worker machine walks the stack until the static initializer block
(on the same version of the code), receiving the same ID.

This ensures that the tupletag is the same on all the workers, as well as
on the user's machine, which is critical since it used as an identifier
across these machines.

Assigning a UUID would work if all of the machines agreed on the same tuple
ID, which could be accomplished with serialization. Serialization, however,
doesn't work well with static initializers, since those will have been
called to initialize the class at load time.

On Tue, Apr 10, 2018 at 10:27 AM Romain Manni-Bucau <rm...@gmail.com>
wrote:

> Well issue is more about all the existing tests currently.
>
> Out of curiosity: how walking the stack is stable since the stack can
> change? Stop condition is the static block of a class which can use method
> so refactoring and therefore is not stable. Should it be deprecated?
>
>
> Le 10 avr. 2018 19:17, "Robert Bradshaw" <ro...@google.com> a écrit :
>
> If it's too slow perhaps you could use the constructor where you pass an
> explicit id (though in my experience walking the stack isn't that slow).
>
> On Tue, Apr 10, 2018 at 10:09 AM Romain Manni-Bucau <rm...@gmail.com>
> wrote:
>
>> Oops cross post sorry.
>>
>> Issue i hit on this thread is it is used a lot in tests abd it slows down
>> tests for nothing like with generatesequence ones
>>
>> Le 10 avr. 2018 19:00, "Romain Manni-Bucau" <rm...@gmail.com> a
>> écrit :
>>
>>>
>>>
>>> Le 10 avr. 2018 18:40, "Robert Bradshaw" <ro...@google.com> a écrit :
>>>
>>> These values should be, inasmuch as possible, stable across VMs. How
>>> slow is slow? Doesn't this happen only once per VM startup?
>>>
>>>
>>> Once per jvm and idea launches a jvm per test and the daemon does save
>>> enough time, you still go through the whole project and check all upstream
>>> deps it seems.
>>>
>>> It is <1s with maven vs 5-6s with gradle.
>>>
>>>
>>> On Tue, Apr 10, 2018 at 9:33 AM Romain Manni-Bucau <
>>> rmannibucau@gmail.com> wrote:
>>>
>>>> Hi
>>>>
>>>> does org.apache.beam.sdk.values.TupleTag#genId need to get the
>>>> stacktrace or can we use any id generator (like
>>>> UUID.random().toString())? Using traces is quite slow under load and
>>>> environments where the root stack is not just the "next" level so
>>>> skipping it would be nice.
>>>>
>>>> Romain Manni-Bucau
>>>> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
>>>>
>>>
>>>
>

Re: org.apache.beam.sdk.values.TupleTag#genId and stacktraces?

Posted by Romain Manni-Bucau <rm...@gmail.com>.

Well issue is more about all the existing tests currently.

Out of curiosity: how walking the stack is stable since the stack can
change? Stop condition is the static block of a class which can use method
so refactoring and therefore is not stable. Should it be deprecated?

Le 10 avr. 2018 19:17, "Robert Bradshaw" <ro...@google.com> a écrit :

If it's too slow perhaps you could use the constructor where you pass an
explicit id (though in my experience walking the stack isn't that slow).

On Tue, Apr 10, 2018 at 10:09 AM Romain Manni-Bucau <rm...@gmail.com>
wrote:

> Oops cross post sorry.
>
> Issue i hit on this thread is it is used a lot in tests abd it slows down
> tests for nothing like with generatesequence ones
>
> Le 10 avr. 2018 19:00, "Romain Manni-Bucau" <rm...@gmail.com> a
> écrit :
>
>>
>>
>> Le 10 avr. 2018 18:40, "Robert Bradshaw" <ro...@google.com> a écrit :
>>
>> These values should be, inasmuch as possible, stable across VMs. How slow
>> is slow? Doesn't this happen only once per VM startup?
>>
>>
>> Once per jvm and idea launches a jvm per test and the daemon does save
>> enough time, you still go through the whole project and check all upstream
>> deps it seems.
>>
>> It is <1s with maven vs 5-6s with gradle.
>>
>>
>> On Tue, Apr 10, 2018 at 9:33 AM Romain Manni-Bucau <rm...@gmail.com>
>> wrote:
>>
>>> Hi
>>>
>>> does org.apache.beam.sdk.values.TupleTag#genId need to get the
>>> stacktrace or can we use any id generator (like
>>> UUID.random().toString())? Using traces is quite slow under load and
>>> environments where the root stack is not just the "next" level so
>>> skipping it would be nice.
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
>>>
>>
>>

Re: org.apache.beam.sdk.values.TupleTag#genId and stacktraces?

Posted by Robert Bradshaw <ro...@google.com>.

If it's too slow perhaps you could use the constructor where you pass an
explicit id (though in my experience walking the stack isn't that slow).

On Tue, Apr 10, 2018 at 10:09 AM Romain Manni-Bucau <rm...@gmail.com>
wrote:

> Oops cross post sorry.
>
> Issue i hit on this thread is it is used a lot in tests abd it slows down
> tests for nothing like with generatesequence ones
>
> Le 10 avr. 2018 19:00, "Romain Manni-Bucau" <rm...@gmail.com> a
> écrit :
>
>>
>>
>> Le 10 avr. 2018 18:40, "Robert Bradshaw" <ro...@google.com> a écrit :
>>
>> These values should be, inasmuch as possible, stable across VMs. How slow
>> is slow? Doesn't this happen only once per VM startup?
>>
>>
>> Once per jvm and idea launches a jvm per test and the daemon does save
>> enough time, you still go through the whole project and check all upstream
>> deps it seems.
>>
>> It is <1s with maven vs 5-6s with gradle.
>>
>>
>> On Tue, Apr 10, 2018 at 9:33 AM Romain Manni-Bucau <rm...@gmail.com>
>> wrote:
>>
>>> Hi
>>>
>>> does org.apache.beam.sdk.values.TupleTag#genId need to get the
>>> stacktrace or can we use any id generator (like
>>> UUID.random().toString())? Using traces is quite slow under load and
>>> environments where the root stack is not just the "next" level so
>>> skipping it would be nice.
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
>>>
>>
>>

Re: org.apache.beam.sdk.values.TupleTag#genId and stacktraces?

Posted by Romain Manni-Bucau <rm...@gmail.com>.

Oops cross post sorry.

Issue i hit on this thread is it is used a lot in tests abd it slows down
tests for nothing like with generatesequence ones

Le 10 avr. 2018 19:00, "Romain Manni-Bucau" <rm...@gmail.com> a
écrit :

>
>
> Le 10 avr. 2018 18:40, "Robert Bradshaw" <ro...@google.com> a écrit :
>
> These values should be, inasmuch as possible, stable across VMs. How slow
> is slow? Doesn't this happen only once per VM startup?
>
>
> Once per jvm and idea launches a jvm per test and the daemon does save
> enough time, you still go through the whole project and check all upstream
> deps it seems.
>
> It is <1s with maven vs 5-6s with gradle.
>
>
> On Tue, Apr 10, 2018 at 9:33 AM Romain Manni-Bucau <rm...@gmail.com>
> wrote:
>
>> Hi
>>
>> does org.apache.beam.sdk.values.TupleTag#genId need to get the
>> stacktrace or can we use any id generator (like
>> UUID.random().toString())? Using traces is quite slow under load and
>> environments where the root stack is not just the "next" level so
>> skipping it would be nice.
>>
>> Romain Manni-Bucau
>> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
>>
>
>

Re: org.apache.beam.sdk.values.TupleTag#genId and stacktraces?

Posted by Romain Manni-Bucau <rm...@gmail.com>.

Le 10 avr. 2018 18:40, "Robert Bradshaw" <ro...@google.com> a écrit :

These values should be, inasmuch as possible, stable across VMs. How slow
is slow? Doesn't this happen only once per VM startup?

Once per jvm and idea launches a jvm per test and the daemon does save
enough time, you still go through the whole project and check all upstream
deps it seems.

It is <1s with maven vs 5-6s with gradle.

On Tue, Apr 10, 2018 at 9:33 AM Romain Manni-Bucau <rm...@gmail.com>
wrote:

> Hi
>
> does org.apache.beam.sdk.values.TupleTag#genId need to get the
> stacktrace or can we use any id generator (like
> UUID.random().toString())? Using traces is quite slow under load and
> environments where the root stack is not just the "next" level so
> skipping it would be nice.
>
> Romain Manni-Bucau
> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
>

Re: org.apache.beam.sdk.values.TupleTag#genId and stacktraces?

Posted by Thomas Groh <tg...@google.com>.

In fact, this is explicitly to work with `static final` TupleTags, and
using a non-stable isn't feasible.

A static final TupleTag won't be serialized in the closure of an object
that uses it - it will be instantiated independently in any other
ClassLoader, such as on a remote JVM. If you use a constant TupleTag during
pipeline construction, and again during runtime, they must have matching
identifiers, or the system can't correlate the two objects. Use of
something like UUID.random() would remove our ability to use any constant
values.

On Tue, Apr 10, 2018 at 9:40 AM Robert Bradshaw <ro...@google.com> wrote:

> These values should be, inasmuch as possible, stable across VMs. How slow
> is slow? Doesn't this happen only once per VM startup?
>
> On Tue, Apr 10, 2018 at 9:33 AM Romain Manni-Bucau <rm...@gmail.com>
> wrote:
>
>> Hi
>>
>> does org.apache.beam.sdk.values.TupleTag#genId need to get the
>> stacktrace or can we use any id generator (like
>> UUID.random().toString())? Using traces is quite slow under load and
>> environments where the root stack is not just the "next" level so
>> skipping it would be nice.
>>
>> Romain Manni-Bucau
>> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
>>
>

Re: org.apache.beam.sdk.values.TupleTag#genId and stacktraces?

Posted by Robert Bradshaw <ro...@google.com>.

These values should be, inasmuch as possible, stable across VMs. How slow
is slow? Doesn't this happen only once per VM startup?

On Tue, Apr 10, 2018 at 9:33 AM Romain Manni-Bucau <rm...@gmail.com>
wrote:

> Hi
>
> does org.apache.beam.sdk.values.TupleTag#genId need to get the
> stacktrace or can we use any id generator (like
> UUID.random().toString())? Using traces is quite slow under load and
> environments where the root stack is not just the "next" level so
> skipping it would be nice.
>
> Romain Manni-Bucau
> @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
>