You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Charles Chen <cc...@google.com> on 2018/06/05 23:55:53 UTC

Re: Existing transactionality inconsistency in the Beam Java State API

Thanks everyone for commenting and contributing to the discussion.  There
appears to be enough consensus on these points to start an initial
implementation.  Specifically, I'd like to highlight from the doc (
https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b
):

*With respect to existing state data and transactionality: *We will go with
presenting a consistent view of state data, in that after a write, a read
will always return a value that reflects the result of that write.  When a
read returning an iterator object is done, an implicit snapshot of the
underlying value at that point will be taken, and any subsequent mutation
of the state will not change the values contained in, or invalidate, a
previously returned iterator.  We will use state.prefetch() and
state.prefetch_key(key) to suggest that the runner prefetch relevant data;
this will supercede and replace the existing state.readLater() methods from
the Java API, since the semantics of saying "prefetch" are much clearer to
the user.

*With respect to state for merging windows: *We will first implement proposed
design 3
<https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.wm9yk8l65u>
by excluding an implementation of non-combinable state types like the
non-combinable ValueState, since this is the most forwards-compatible
option.  At a later point, we will either obtain community consensus to
remove non-combinable ValueState from the Java SDK as well, or turn to proposed
design 1
<https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.cyfcmrgayhn>
by implementing non-combinable state types in the Python SDK and reject
jobs that use non-combinable state types with merging windows.

Best,
Charles

On Fri, May 25, 2018 at 10:56 AM Lukasz Cwik <lc...@google.com> wrote:

> Great, I was confused in the description that was provided and then the
> follow up by Ben. I think its worthwhile to describe the differences with
> actual examples of what happens.
>
> On Fri, May 25, 2018 at 10:54 AM Kenneth Knowles <kl...@google.com> wrote:
>
>> I think the return value of read() should always be an immutable value.
>>
>> Kenn
>>
>> On Fri, May 25, 2018 at 10:44 AM Lukasz Cwik <lc...@google.com> wrote:
>>
>>> Kenn, in the second example where we are creating views whenever read()
>>> is called, is it that the view's underlying data is immutable. For example:
>>> Iterable<String> values = state.read();
>>> state.append("newValue");
>>> If I iterate over values, does values now contain "newValues"?
>>>
>>>
>>> On Thu, May 24, 2018 at 10:38 AM Kenneth Knowles <kl...@google.com> wrote:
>>>
>>>> I see what you mean but I don't agree that futures imply anything other
>>>> than "it is a value that you have to force", with deliberately many
>>>> possible implementations. When at the point in 1 and you've got
>>>>
>>>>     interface ReadableState<T> {
>>>>         T read()
>>>>     }
>>>>
>>>> and you want to improve performance, both approaches "void readLater()"
>>>> and "StateFuture<T> read()" are natural evolutions. They both gain the same
>>>> 10x and they both support the "unchanging committed state plus buffered
>>>> mutations" implementation well. And snapshots are essentially free for that
>>>> implementation if the buffered mutations are stored in a decent data
>>>> structure.
>>>>
>>>> My recollection was that futures were seen as more cumbersome. They
>>>> affect the types even for simple uses. The only appealing future API was
>>>> Guava's, but we didn't want that on the API surface. And we did not intend
>>>> for these to be used in complex ways, so the usability & perf benefits of a
>>>> future-based API weren't really realized anyhow.
>>>>
>>>> The only reason I belabor this is that if we ever wanted to support
>>>> more complex use cases, such as concurrent use of state, then my preference
>>>> would flip. I wouldn't want to make XYZState a synchronized monitor. At
>>>> that point I'd favor using a snapshots-are-free concurrent data structure
>>>> under the hood of a future-based API.
>>>>
>>>> Since there is really only one implementation in mind for this, maybe
>>>> only one that works reasonably, we should just document it as such. The
>>>> docs on ReadableState do make it sound like writes will invalidate the
>>>> usefulness of readLater, even though that isn't the case for the intended
>>>> implementation strategy.
>>>>
>>>> Kenn
>>>>
>>>> On Thu, May 24, 2018 at 9:40 AM Ben Chambers <bc...@apache.org>
>>>> wrote:
>>>>
>>>>> I think Kenn's second option accurately reflects my memory of the
>>>>> original intentions:
>>>>>
>>>>> 1. I remember we we considered either using the Future interface or
>>>>> calling the ReadableState interface a future, and explicitly said "no,
>>>>> future implies asynchrony and that the value returned by `get` won't change
>>>>> over multiple calls, but we want the latest value each time". So, I
>>>>> remember us explicitly considering and rejecting Future, thus the name
>>>>> "ReadableState".
>>>>>
>>>>> 2. The intuition behind the implementation was analogous to a
>>>>> mutable-reference cell in languages like ML / Scheme / etc. The
>>>>> ReadableState is just a pointer to the the reference cell. Calling read
>>>>> returns the value currently in the cell. If we have 100 ReadableStates
>>>>> pointing at the same cell, they all get the same value regardless of when
>>>>> they were created. This avoids needing to duplicate/snapshot values at any
>>>>> point in time.
>>>>>
>>>>> 3. ReadLater was added, as noted by Charles, to suggest prefetching
>>>>> the associated value. This was added after benchmarks showed 10x (if I
>>>>> remember correctly) performance improvements in things like
>>>>> GroupAlsoByWindows by minimizing round-trips asking for more state. The
>>>>> intuition being -- if we need to make an RPC to load one state value, we
>>>>> are better off making an RPC to load all the values we need.
>>>>>
>>>>> Overall, I too lean towards maintaining the second interpretation
>>>>> since it seems to be consistent and I believe we had additional reasons for
>>>>> preferring it over futures.
>>>>>
>>>>> Given the confusion, I think strengthening the class documentation
>>>>> makes sense -- I note the only hint of the current behavior is that
>>>>> ReadableState indicates it gets the *current* value (emphasis mine). We
>>>>> should emphasize that and perhaps even mention that the ReadableState
>>>>> should be understood as just a reference or handle to the underlying state,
>>>>> and thus its value will reflect the latest write.
>>>>>
>>>>> Charles, if it helps, the plan I remember regarding prefetching was
>>>>> something like:
>>>>>
>>>>> interface ReadableMapState<K, V> {
>>>>>    ReadableState<V> get(K key);
>>>>>    ReadableState<Iterable<V>> getIterable();
>>>>>    ReadableState<Map<K, V>> get();
>>>>>    // ... more things ...
>>>>> }
>>>>>
>>>>> Then prefetching a value is `mapState.get(key).readLater()` and
>>>>> prefetching the entire map is `mapState.get().readLater()`, etc.
>>>>>
>>>>> On Wed, May 23, 2018 at 7:13 PM Charles Chen <cc...@google.com> wrote:
>>>>>
>>>>>> Thanks Kenn.  I think there are two issues to highlight: (1) the API
>>>>>> should allow for some sort of prefetching / batching / background I/O for
>>>>>> state; and (2) it should be clear what the semantics are for reading (e.g.
>>>>>> so we don't have confusing read after write behavior).
>>>>>>
>>>>>> The approach I'm leaning towards for (1) is to allow a
>>>>>> state.prefetch() method (to prefetch a value, iterable or [entire] map
>>>>>> state) and maybe something like state.prefetch_key(key) to prefetch a
>>>>>> specific KV in the map.  Issue (2) seems to be okay in either of Kenn's
>>>>>> positions.
>>>>>>
>>>>>> On Wed, May 23, 2018 at 5:33 PM Robert Bradshaw <ro...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for laying this out so well, Kenn. I'm also leaning towards
>>>>>>> the second option, despite its drawbacks. (In particular, readLater
>>>>>>> should not influence what's returned at read(), it's just a hint.)
>>>>>>>
>>>>>>> On Wed, May 23, 2018 at 4:43 PM Kenneth Knowles <kl...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Great idea to bring it to dev@. I think it is better to focus here
>>>>>>>> than long doc comment threads.
>>>>>>>>
>>>>>>>> I had strong opinions that I think were a bit confused and wrong.
>>>>>>>> Sorry for that. I stated this position:
>>>>>>>>
>>>>>>>>  - XYZState class is a handle to a mutable location
>>>>>>>>  - its methods like isEmpty() or contents() should return immutable
>>>>>>>> future values (implicitly means their contents are semantically frozen when
>>>>>>>> they are created)
>>>>>>>>  - the fact that you created the future is a hint that all
>>>>>>>> necessary fetching/computation should be kicked off
>>>>>>>>  - later forced with get()
>>>>>>>>  - when it was designed, pure async style was not a viable option
>>>>>>>>
>>>>>>>> I see now that the actual position of some of its original
>>>>>>>> designers is:
>>>>>>>>
>>>>>>>>  - XYZState class is a view on a mutable location
>>>>>>>>  - its methods return new views on that mutable location
>>>>>>>>  - calling readLater() is a hint that some fetching/computation
>>>>>>>> should be kicked off
>>>>>>>>  - later read() will combine whatever readLater() did with
>>>>>>>> additional local info to give the current value
>>>>>>>>  - async style not applicable nor desirable as per Beam's focus on
>>>>>>>> naive straight-line coding + autoscaling
>>>>>>>>
>>>>>>>> These are both internally consistent I think. In fact, I like the
>>>>>>>> second perspective better than the one I have been promoting. There are
>>>>>>>> some weaknesses: readLater() is pretty tightly coupled to a particular
>>>>>>>> implementation style, and futures are decades old so you can get good APIs
>>>>>>>> and performance without inventing anything. But I still like the non-future
>>>>>>>> version a little better.
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>> On Wed, May 23, 2018 at 4:05 PM Charles Chen <cc...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> During the design of the Beam Python State API, we noticed some
>>>>>>>>> transactionality inconsistencies in the existing Beam Java State API (these
>>>>>>>>> are the unresolved bugs BEAM-2980
>>>>>>>>> <https://issues.apache.org/jira/browse/BEAM-2980> and BEAM-2975
>>>>>>>>> <https://issues.apache.org/jira/browse/BEAM-2975>).  We are
>>>>>>>>> therefore having a discussion about this API which can have implications
>>>>>>>>> for its future development in all Beam languages:
>>>>>>>>> https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b
>>>>>>>>>
>>>>>>>>> If you have an opinion on the possible design approaches, it would
>>>>>>>>> be very helpful to bring up in the doc or on this thread.  Thanks!
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Charles
>>>>>>>>>
>>>>>>>>

Re: Existing transactionality inconsistency in the Beam Java State API

Posted by Lukasz Cwik <lc...@google.com>.
Sounds great and thanks for the conclusion summary.

On Tue, Jun 5, 2018 at 4:56 PM Charles Chen <cc...@google.com> wrote:

> Thanks everyone for commenting and contributing to the discussion.  There
> appears to be enough consensus on these points to start an initial
> implementation.  Specifically, I'd like to highlight from the doc (
> https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b
> ):
>
> *With respect to existing state data and transactionality: *We will go
> with presenting a consistent view of state data, in that after a write, a
> read will always return a value that reflects the result of that write.
> When a read returning an iterator object is done, an implicit snapshot of
> the underlying value at that point will be taken, and any subsequent
> mutation of the state will not change the values contained in, or
> invalidate, a previously returned iterator.  We will use state.prefetch()
> and state.prefetch_key(key) to suggest that the runner prefetch relevant
> data; this will supercede and replace the existing state.readLater()
> methods from the Java API, since the semantics of saying "prefetch" are
> much clearer to the user.
>
> *With respect to state for merging windows: *We will first implement proposed
> design 3
> <https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.wm9yk8l65u>
> by excluding an implementation of non-combinable state types like the
> non-combinable ValueState, since this is the most forwards-compatible
> option.  At a later point, we will either obtain community consensus to
> remove non-combinable ValueState from the Java SDK as well, or turn to proposed
> design 1
> <https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.cyfcmrgayhn>
> by implementing non-combinable state types in the Python SDK and reject
> jobs that use non-combinable state types with merging windows.
>
> Best,
> Charles
>
> On Fri, May 25, 2018 at 10:56 AM Lukasz Cwik <lc...@google.com> wrote:
>
>> Great, I was confused in the description that was provided and then the
>> follow up by Ben. I think its worthwhile to describe the differences with
>> actual examples of what happens.
>>
>> On Fri, May 25, 2018 at 10:54 AM Kenneth Knowles <kl...@google.com> wrote:
>>
>>> I think the return value of read() should always be an immutable value.
>>>
>>> Kenn
>>>
>>> On Fri, May 25, 2018 at 10:44 AM Lukasz Cwik <lc...@google.com> wrote:
>>>
>>>> Kenn, in the second example where we are creating views whenever read()
>>>> is called, is it that the view's underlying data is immutable. For example:
>>>> Iterable<String> values = state.read();
>>>> state.append("newValue");
>>>> If I iterate over values, does values now contain "newValues"?
>>>>
>>>>
>>>> On Thu, May 24, 2018 at 10:38 AM Kenneth Knowles <kl...@google.com>
>>>> wrote:
>>>>
>>>>> I see what you mean but I don't agree that futures imply anything
>>>>> other than "it is a value that you have to force", with deliberately many
>>>>> possible implementations. When at the point in 1 and you've got
>>>>>
>>>>>     interface ReadableState<T> {
>>>>>         T read()
>>>>>     }
>>>>>
>>>>> and you want to improve performance, both approaches "void
>>>>> readLater()" and "StateFuture<T> read()" are natural evolutions. They both
>>>>> gain the same 10x and they both support the "unchanging committed state
>>>>> plus buffered mutations" implementation well. And snapshots are essentially
>>>>> free for that implementation if the buffered mutations are stored in a
>>>>> decent data structure.
>>>>>
>>>>> My recollection was that futures were seen as more cumbersome. They
>>>>> affect the types even for simple uses. The only appealing future API was
>>>>> Guava's, but we didn't want that on the API surface. And we did not intend
>>>>> for these to be used in complex ways, so the usability & perf benefits of a
>>>>> future-based API weren't really realized anyhow.
>>>>>
>>>>> The only reason I belabor this is that if we ever wanted to support
>>>>> more complex use cases, such as concurrent use of state, then my preference
>>>>> would flip. I wouldn't want to make XYZState a synchronized monitor. At
>>>>> that point I'd favor using a snapshots-are-free concurrent data structure
>>>>> under the hood of a future-based API.
>>>>>
>>>>> Since there is really only one implementation in mind for this, maybe
>>>>> only one that works reasonably, we should just document it as such. The
>>>>> docs on ReadableState do make it sound like writes will invalidate the
>>>>> usefulness of readLater, even though that isn't the case for the intended
>>>>> implementation strategy.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Thu, May 24, 2018 at 9:40 AM Ben Chambers <bc...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> I think Kenn's second option accurately reflects my memory of the
>>>>>> original intentions:
>>>>>>
>>>>>> 1. I remember we we considered either using the Future interface or
>>>>>> calling the ReadableState interface a future, and explicitly said "no,
>>>>>> future implies asynchrony and that the value returned by `get` won't change
>>>>>> over multiple calls, but we want the latest value each time". So, I
>>>>>> remember us explicitly considering and rejecting Future, thus the name
>>>>>> "ReadableState".
>>>>>>
>>>>>> 2. The intuition behind the implementation was analogous to a
>>>>>> mutable-reference cell in languages like ML / Scheme / etc. The
>>>>>> ReadableState is just a pointer to the the reference cell. Calling read
>>>>>> returns the value currently in the cell. If we have 100 ReadableStates
>>>>>> pointing at the same cell, they all get the same value regardless of when
>>>>>> they were created. This avoids needing to duplicate/snapshot values at any
>>>>>> point in time.
>>>>>>
>>>>>> 3. ReadLater was added, as noted by Charles, to suggest prefetching
>>>>>> the associated value. This was added after benchmarks showed 10x (if I
>>>>>> remember correctly) performance improvements in things like
>>>>>> GroupAlsoByWindows by minimizing round-trips asking for more state. The
>>>>>> intuition being -- if we need to make an RPC to load one state value, we
>>>>>> are better off making an RPC to load all the values we need.
>>>>>>
>>>>>> Overall, I too lean towards maintaining the second interpretation
>>>>>> since it seems to be consistent and I believe we had additional reasons for
>>>>>> preferring it over futures.
>>>>>>
>>>>>> Given the confusion, I think strengthening the class documentation
>>>>>> makes sense -- I note the only hint of the current behavior is that
>>>>>> ReadableState indicates it gets the *current* value (emphasis mine). We
>>>>>> should emphasize that and perhaps even mention that the ReadableState
>>>>>> should be understood as just a reference or handle to the underlying state,
>>>>>> and thus its value will reflect the latest write.
>>>>>>
>>>>>> Charles, if it helps, the plan I remember regarding prefetching was
>>>>>> something like:
>>>>>>
>>>>>> interface ReadableMapState<K, V> {
>>>>>>    ReadableState<V> get(K key);
>>>>>>    ReadableState<Iterable<V>> getIterable();
>>>>>>    ReadableState<Map<K, V>> get();
>>>>>>    // ... more things ...
>>>>>> }
>>>>>>
>>>>>> Then prefetching a value is `mapState.get(key).readLater()` and
>>>>>> prefetching the entire map is `mapState.get().readLater()`, etc.
>>>>>>
>>>>>> On Wed, May 23, 2018 at 7:13 PM Charles Chen <cc...@google.com> wrote:
>>>>>>
>>>>>>> Thanks Kenn.  I think there are two issues to highlight: (1) the API
>>>>>>> should allow for some sort of prefetching / batching / background I/O for
>>>>>>> state; and (2) it should be clear what the semantics are for reading (e.g.
>>>>>>> so we don't have confusing read after write behavior).
>>>>>>>
>>>>>>> The approach I'm leaning towards for (1) is to allow a
>>>>>>> state.prefetch() method (to prefetch a value, iterable or [entire] map
>>>>>>> state) and maybe something like state.prefetch_key(key) to prefetch a
>>>>>>> specific KV in the map.  Issue (2) seems to be okay in either of Kenn's
>>>>>>> positions.
>>>>>>>
>>>>>>> On Wed, May 23, 2018 at 5:33 PM Robert Bradshaw <ro...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks for laying this out so well, Kenn. I'm also leaning towards
>>>>>>>> the second option, despite its drawbacks. (In particular, readLater
>>>>>>>> should not influence what's returned at read(), it's just a hint.)
>>>>>>>>
>>>>>>>> On Wed, May 23, 2018 at 4:43 PM Kenneth Knowles <kl...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Great idea to bring it to dev@. I think it is better to focus
>>>>>>>>> here than long doc comment threads.
>>>>>>>>>
>>>>>>>>> I had strong opinions that I think were a bit confused and wrong.
>>>>>>>>> Sorry for that. I stated this position:
>>>>>>>>>
>>>>>>>>>  - XYZState class is a handle to a mutable location
>>>>>>>>>  - its methods like isEmpty() or contents() should return
>>>>>>>>> immutable future values (implicitly means their contents are semantically
>>>>>>>>> frozen when they are created)
>>>>>>>>>  - the fact that you created the future is a hint that all
>>>>>>>>> necessary fetching/computation should be kicked off
>>>>>>>>>  - later forced with get()
>>>>>>>>>  - when it was designed, pure async style was not a viable option
>>>>>>>>>
>>>>>>>>> I see now that the actual position of some of its original
>>>>>>>>> designers is:
>>>>>>>>>
>>>>>>>>>  - XYZState class is a view on a mutable location
>>>>>>>>>  - its methods return new views on that mutable location
>>>>>>>>>  - calling readLater() is a hint that some fetching/computation
>>>>>>>>> should be kicked off
>>>>>>>>>  - later read() will combine whatever readLater() did with
>>>>>>>>> additional local info to give the current value
>>>>>>>>>  - async style not applicable nor desirable as per Beam's focus on
>>>>>>>>> naive straight-line coding + autoscaling
>>>>>>>>>
>>>>>>>>> These are both internally consistent I think. In fact, I like the
>>>>>>>>> second perspective better than the one I have been promoting. There are
>>>>>>>>> some weaknesses: readLater() is pretty tightly coupled to a particular
>>>>>>>>> implementation style, and futures are decades old so you can get good APIs
>>>>>>>>> and performance without inventing anything. But I still like the non-future
>>>>>>>>> version a little better.
>>>>>>>>>
>>>>>>>>> Kenn
>>>>>>>>>
>>>>>>>>> On Wed, May 23, 2018 at 4:05 PM Charles Chen <cc...@google.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> During the design of the Beam Python State API, we noticed some
>>>>>>>>>> transactionality inconsistencies in the existing Beam Java State API (these
>>>>>>>>>> are the unresolved bugs BEAM-2980
>>>>>>>>>> <https://issues.apache.org/jira/browse/BEAM-2980> and BEAM-2975
>>>>>>>>>> <https://issues.apache.org/jira/browse/BEAM-2975>).  We are
>>>>>>>>>> therefore having a discussion about this API which can have implications
>>>>>>>>>> for its future development in all Beam languages:
>>>>>>>>>> https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b
>>>>>>>>>>
>>>>>>>>>> If you have an opinion on the possible design approaches, it
>>>>>>>>>> would be very helpful to bring up in the doc or on this thread.  Thanks!
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Charles
>>>>>>>>>>
>>>>>>>>>