You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@beam.apache.org by Charles Chen <cc...@google.com> on 2018/05/23 23:05:23 UTC

Existing transactionality inconsistency in the Beam Java State API

During the design of the Beam Python State API, we noticed some
transactionality inconsistencies in the existing Beam Java State API (these
are the unresolved bugs BEAM-2980
<https://issues.apache.org/jira/browse/BEAM-2980> and BEAM-2975
<https://issues.apache.org/jira/browse/BEAM-2975>).  We are therefore
having a discussion about this API which can have implications for its
future development in all Beam languages:
https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b

If you have an opinion on the possible design approaches, it would be very
helpful to bring up in the doc or on this thread.  Thanks!

Best,
Charles

Re: Existing transactionality inconsistency in the Beam Java State API

Posted by Lukasz Cwik <lc...@google.com>.

Sounds great and thanks for the conclusion summary.

On Tue, Jun 5, 2018 at 4:56 PM Charles Chen <cc...@google.com> wrote:

> Thanks everyone for commenting and contributing to the discussion.  There
> appears to be enough consensus on these points to start an initial
> implementation.  Specifically, I'd like to highlight from the doc (
> https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b
> ):
>
> *With respect to existing state data and transactionality: *We will go
> with presenting a consistent view of state data, in that after a write, a
> read will always return a value that reflects the result of that write.
> When a read returning an iterator object is done, an implicit snapshot of
> the underlying value at that point will be taken, and any subsequent
> mutation of the state will not change the values contained in, or
> invalidate, a previously returned iterator.  We will use state.prefetch()
> and state.prefetch_key(key) to suggest that the runner prefetch relevant
> data; this will supercede and replace the existing state.readLater()
> methods from the Java API, since the semantics of saying "prefetch" are
> much clearer to the user.
>
> *With respect to state for merging windows: *We will first implement proposed
> design 3
> <https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.wm9yk8l65u>
> by excluding an implementation of non-combinable state types like the
> non-combinable ValueState, since this is the most forwards-compatible
> option.  At a later point, we will either obtain community consensus to
> remove non-combinable ValueState from the Java SDK as well, or turn to proposed
> design 1
> <https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.cyfcmrgayhn>
> by implementing non-combinable state types in the Python SDK and reject
> jobs that use non-combinable state types with merging windows.
>
> Best,
> Charles
>
> On Fri, May 25, 2018 at 10:56 AM Lukasz Cwik <lc...@google.com> wrote:
>
>> Great, I was confused in the description that was provided and then the
>> follow up by Ben. I think its worthwhile to describe the differences with
>> actual examples of what happens.
>>
>> On Fri, May 25, 2018 at 10:54 AM Kenneth Knowles <kl...@google.com> wrote:
>>
>>> I think the return value of read() should always be an immutable value.
>>>
>>> Kenn
>>>
>>> On Fri, May 25, 2018 at 10:44 AM Lukasz Cwik <lc...@google.com> wrote:
>>>
>>>> Kenn, in the second example where we are creating views whenever read()
>>>> is called, is it that the view's underlying data is immutable. For example:
>>>> Iterable<String> values = state.read();
>>>> state.append("newValue");
>>>> If I iterate over values, does values now contain "newValues"?
>>>>
>>>>
>>>> On Thu, May 24, 2018 at 10:38 AM Kenneth Knowles <kl...@google.com>
>>>> wrote:
>>>>
>>>>> I see what you mean but I don't agree that futures imply anything
>>>>> other than "it is a value that you have to force", with deliberately many
>>>>> possible implementations. When at the point in 1 and you've got
>>>>>
>>>>>     interface ReadableState<T> {
>>>>>         T read()
>>>>>     }
>>>>>
>>>>> and you want to improve performance, both approaches "void
>>>>> readLater()" and "StateFuture<T> read()" are natural evolutions. They both
>>>>> gain the same 10x and they both support the "unchanging committed state
>>>>> plus buffered mutations" implementation well. And snapshots are essentially
>>>>> free for that implementation if the buffered mutations are stored in a
>>>>> decent data structure.
>>>>>
>>>>> My recollection was that futures were seen as more cumbersome. They
>>>>> affect the types even for simple uses. The only appealing future API was
>>>>> Guava's, but we didn't want that on the API surface. And we did not intend
>>>>> for these to be used in complex ways, so the usability & perf benefits of a
>>>>> future-based API weren't really realized anyhow.
>>>>>
>>>>> The only reason I belabor this is that if we ever wanted to support
>>>>> more complex use cases, such as concurrent use of state, then my preference
>>>>> would flip. I wouldn't want to make XYZState a synchronized monitor. At
>>>>> that point I'd favor using a snapshots-are-free concurrent data structure
>>>>> under the hood of a future-based API.
>>>>>
>>>>> Since there is really only one implementation in mind for this, maybe
>>>>> only one that works reasonably, we should just document it as such. The
>>>>> docs on ReadableState do make it sound like writes will invalidate the
>>>>> usefulness of readLater, even though that isn't the case for the intended
>>>>> implementation strategy.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Thu, May 24, 2018 at 9:40 AM Ben Chambers <bc...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> I think Kenn's second option accurately reflects my memory of the
>>>>>> original intentions:
>>>>>>
>>>>>> 1. I remember we we considered either using the Future interface or
>>>>>> calling the ReadableState interface a future, and explicitly said "no,
>>>>>> future implies asynchrony and that the value returned by `get` won't change
>>>>>> over multiple calls, but we want the latest value each time". So, I
>>>>>> remember us explicitly considering and rejecting Future, thus the name
>>>>>> "ReadableState".
>>>>>>
>>>>>> 2. The intuition behind the implementation was analogous to a
>>>>>> mutable-reference cell in languages like ML / Scheme / etc. The
>>>>>> ReadableState is just a pointer to the the reference cell. Calling read
>>>>>> returns the value currently in the cell. If we have 100 ReadableStates
>>>>>> pointing at the same cell, they all get the same value regardless of when
>>>>>> they were created. This avoids needing to duplicate/snapshot values at any
>>>>>> point in time.
>>>>>>
>>>>>> 3. ReadLater was added, as noted by Charles, to suggest prefetching
>>>>>> the associated value. This was added after benchmarks showed 10x (if I
>>>>>> remember correctly) performance improvements in things like
>>>>>> GroupAlsoByWindows by minimizing round-trips asking for more state. The
>>>>>> intuition being -- if we need to make an RPC to load one state value, we
>>>>>> are better off making an RPC to load all the values we need.
>>>>>>
>>>>>> Overall, I too lean towards maintaining the second interpretation
>>>>>> since it seems to be consistent and I believe we had additional reasons for
>>>>>> preferring it over futures.
>>>>>>
>>>>>> Given the confusion, I think strengthening the class documentation
>>>>>> makes sense -- I note the only hint of the current behavior is that
>>>>>> ReadableState indicates it gets the *current* value (emphasis mine). We
>>>>>> should emphasize that and perhaps even mention that the ReadableState
>>>>>> should be understood as just a reference or handle to the underlying state,
>>>>>> and thus its value will reflect the latest write.
>>>>>>
>>>>>> Charles, if it helps, the plan I remember regarding prefetching was
>>>>>> something like:
>>>>>>
>>>>>> interface ReadableMapState<K, V> {
>>>>>>    ReadableState<V> get(K key);
>>>>>>    ReadableState<Iterable<V>> getIterable();
>>>>>>    ReadableState<Map<K, V>> get();
>>>>>>    // ... more things ...
>>>>>> }
>>>>>>
>>>>>> Then prefetching a value is `mapState.get(key).readLater()` and
>>>>>> prefetching the entire map is `mapState.get().readLater()`, etc.
>>>>>>
>>>>>> On Wed, May 23, 2018 at 7:13 PM Charles Chen <cc...@google.com> wrote:
>>>>>>
>>>>>>> Thanks Kenn.  I think there are two issues to highlight: (1) the API
>>>>>>> should allow for some sort of prefetching / batching / background I/O for
>>>>>>> state; and (2) it should be clear what the semantics are for reading (e.g.
>>>>>>> so we don't have confusing read after write behavior).
>>>>>>>
>>>>>>> The approach I'm leaning towards for (1) is to allow a
>>>>>>> state.prefetch() method (to prefetch a value, iterable or [entire] map
>>>>>>> state) and maybe something like state.prefetch_key(key) to prefetch a
>>>>>>> specific KV in the map.  Issue (2) seems to be okay in either of Kenn's
>>>>>>> positions.
>>>>>>>
>>>>>>> On Wed, May 23, 2018 at 5:33 PM Robert Bradshaw <ro...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks for laying this out so well, Kenn. I'm also leaning towards
>>>>>>>> the second option, despite its drawbacks. (In particular, readLater
>>>>>>>> should not influence what's returned at read(), it's just a hint.)
>>>>>>>>
>>>>>>>> On Wed, May 23, 2018 at 4:43 PM Kenneth Knowles <kl...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Great idea to bring it to dev@. I think it is better to focus
>>>>>>>>> here than long doc comment threads.
>>>>>>>>>
>>>>>>>>> I had strong opinions that I think were a bit confused and wrong.
>>>>>>>>> Sorry for that. I stated this position:
>>>>>>>>>
>>>>>>>>>  - XYZState class is a handle to a mutable location
>>>>>>>>>  - its methods like isEmpty() or contents() should return
>>>>>>>>> immutable future values (implicitly means their contents are semantically
>>>>>>>>> frozen when they are created)
>>>>>>>>>  - the fact that you created the future is a hint that all
>>>>>>>>> necessary fetching/computation should be kicked off
>>>>>>>>>  - later forced with get()
>>>>>>>>>  - when it was designed, pure async style was not a viable option
>>>>>>>>>
>>>>>>>>> I see now that the actual position of some of its original
>>>>>>>>> designers is:
>>>>>>>>>
>>>>>>>>>  - XYZState class is a view on a mutable location
>>>>>>>>>  - its methods return new views on that mutable location
>>>>>>>>>  - calling readLater() is a hint that some fetching/computation
>>>>>>>>> should be kicked off
>>>>>>>>>  - later read() will combine whatever readLater() did with
>>>>>>>>> additional local info to give the current value
>>>>>>>>>  - async style not applicable nor desirable as per Beam's focus on
>>>>>>>>> naive straight-line coding + autoscaling
>>>>>>>>>
>>>>>>>>> These are both internally consistent I think. In fact, I like the
>>>>>>>>> second perspective better than the one I have been promoting. There are
>>>>>>>>> some weaknesses: readLater() is pretty tightly coupled to a particular
>>>>>>>>> implementation style, and futures are decades old so you can get good APIs
>>>>>>>>> and performance without inventing anything. But I still like the non-future
>>>>>>>>> version a little better.
>>>>>>>>>
>>>>>>>>> Kenn
>>>>>>>>>
>>>>>>>>> On Wed, May 23, 2018 at 4:05 PM Charles Chen <cc...@google.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> During the design of the Beam Python State API, we noticed some
>>>>>>>>>> transactionality inconsistencies in the existing Beam Java State API (these
>>>>>>>>>> are the unresolved bugs BEAM-2980
>>>>>>>>>> <https://issues.apache.org/jira/browse/BEAM-2980> and BEAM-2975
>>>>>>>>>> <https://issues.apache.org/jira/browse/BEAM-2975>).  We are
>>>>>>>>>> therefore having a discussion about this API which can have implications
>>>>>>>>>> for its future development in all Beam languages:
>>>>>>>>>> https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b
>>>>>>>>>>
>>>>>>>>>> If you have an opinion on the possible design approaches, it
>>>>>>>>>> would be very helpful to bring up in the doc or on this thread.  Thanks!
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Charles
>>>>>>>>>>
>>>>>>>>>

Re: Existing transactionality inconsistency in the Beam Java State API

Posted by Charles Chen <cc...@google.com>.

Thanks everyone for commenting and contributing to the discussion.  There
appears to be enough consensus on these points to start an initial
implementation.  Specifically, I'd like to highlight from the doc (
https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b
):

*With respect to existing state data and transactionality: *We will go with
presenting a consistent view of state data, in that after a write, a read
will always return a value that reflects the result of that write.  When a
read returning an iterator object is done, an implicit snapshot of the
underlying value at that point will be taken, and any subsequent mutation
of the state will not change the values contained in, or invalidate, a
previously returned iterator.  We will use state.prefetch() and
state.prefetch_key(key) to suggest that the runner prefetch relevant data;
this will supercede and replace the existing state.readLater() methods from
the Java API, since the semantics of saying "prefetch" are much clearer to
the user.

*With respect to state for merging windows: *We will first implement proposed
design 3
<https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.wm9yk8l65u>
by excluding an implementation of non-combinable state types like the
non-combinable ValueState, since this is the most forwards-compatible
option.  At a later point, we will either obtain community consensus to
remove non-combinable ValueState from the Java SDK as well, or turn to proposed
design 1
<https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.cyfcmrgayhn>
by implementing non-combinable state types in the Python SDK and reject
jobs that use non-combinable state types with merging windows.

Best,
Charles

On Fri, May 25, 2018 at 10:56 AM Lukasz Cwik <lc...@google.com> wrote:

> Great, I was confused in the description that was provided and then the
> follow up by Ben. I think its worthwhile to describe the differences with
> actual examples of what happens.
>
> On Fri, May 25, 2018 at 10:54 AM Kenneth Knowles <kl...@google.com> wrote:
>
>> I think the return value of read() should always be an immutable value.
>>
>> Kenn
>>
>> On Fri, May 25, 2018 at 10:44 AM Lukasz Cwik <lc...@google.com> wrote:
>>
>>> Kenn, in the second example where we are creating views whenever read()
>>> is called, is it that the view's underlying data is immutable. For example:
>>> Iterable<String> values = state.read();
>>> state.append("newValue");
>>> If I iterate over values, does values now contain "newValues"?
>>>
>>>
>>> On Thu, May 24, 2018 at 10:38 AM Kenneth Knowles <kl...@google.com> wrote:
>>>
>>>> I see what you mean but I don't agree that futures imply anything other
>>>> than "it is a value that you have to force", with deliberately many
>>>> possible implementations. When at the point in 1 and you've got
>>>>
>>>>     interface ReadableState<T> {
>>>>         T read()
>>>>     }
>>>>
>>>> and you want to improve performance, both approaches "void readLater()"
>>>> and "StateFuture<T> read()" are natural evolutions. They both gain the same
>>>> 10x and they both support the "unchanging committed state plus buffered
>>>> mutations" implementation well. And snapshots are essentially free for that
>>>> implementation if the buffered mutations are stored in a decent data
>>>> structure.
>>>>
>>>> My recollection was that futures were seen as more cumbersome. They
>>>> affect the types even for simple uses. The only appealing future API was
>>>> Guava's, but we didn't want that on the API surface. And we did not intend
>>>> for these to be used in complex ways, so the usability & perf benefits of a
>>>> future-based API weren't really realized anyhow.
>>>>
>>>> The only reason I belabor this is that if we ever wanted to support
>>>> more complex use cases, such as concurrent use of state, then my preference
>>>> would flip. I wouldn't want to make XYZState a synchronized monitor. At
>>>> that point I'd favor using a snapshots-are-free concurrent data structure
>>>> under the hood of a future-based API.
>>>>
>>>> Since there is really only one implementation in mind for this, maybe
>>>> only one that works reasonably, we should just document it as such. The
>>>> docs on ReadableState do make it sound like writes will invalidate the
>>>> usefulness of readLater, even though that isn't the case for the intended
>>>> implementation strategy.
>>>>
>>>> Kenn
>>>>
>>>> On Thu, May 24, 2018 at 9:40 AM Ben Chambers <bc...@apache.org>
>>>> wrote:
>>>>
>>>>> I think Kenn's second option accurately reflects my memory of the
>>>>> original intentions:
>>>>>
>>>>> 1. I remember we we considered either using the Future interface or
>>>>> calling the ReadableState interface a future, and explicitly said "no,
>>>>> future implies asynchrony and that the value returned by `get` won't change
>>>>> over multiple calls, but we want the latest value each time". So, I
>>>>> remember us explicitly considering and rejecting Future, thus the name
>>>>> "ReadableState".
>>>>>
>>>>> 2. The intuition behind the implementation was analogous to a
>>>>> mutable-reference cell in languages like ML / Scheme / etc. The
>>>>> ReadableState is just a pointer to the the reference cell. Calling read
>>>>> returns the value currently in the cell. If we have 100 ReadableStates
>>>>> pointing at the same cell, they all get the same value regardless of when
>>>>> they were created. This avoids needing to duplicate/snapshot values at any
>>>>> point in time.
>>>>>
>>>>> 3. ReadLater was added, as noted by Charles, to suggest prefetching
>>>>> the associated value. This was added after benchmarks showed 10x (if I
>>>>> remember correctly) performance improvements in things like
>>>>> GroupAlsoByWindows by minimizing round-trips asking for more state. The
>>>>> intuition being -- if we need to make an RPC to load one state value, we
>>>>> are better off making an RPC to load all the values we need.
>>>>>
>>>>> Overall, I too lean towards maintaining the second interpretation
>>>>> since it seems to be consistent and I believe we had additional reasons for
>>>>> preferring it over futures.
>>>>>
>>>>> Given the confusion, I think strengthening the class documentation
>>>>> makes sense -- I note the only hint of the current behavior is that
>>>>> ReadableState indicates it gets the *current* value (emphasis mine). We
>>>>> should emphasize that and perhaps even mention that the ReadableState
>>>>> should be understood as just a reference or handle to the underlying state,
>>>>> and thus its value will reflect the latest write.
>>>>>
>>>>> Charles, if it helps, the plan I remember regarding prefetching was
>>>>> something like:
>>>>>
>>>>> interface ReadableMapState<K, V> {
>>>>>    ReadableState<V> get(K key);
>>>>>    ReadableState<Iterable<V>> getIterable();
>>>>>    ReadableState<Map<K, V>> get();
>>>>>    // ... more things ...
>>>>> }
>>>>>
>>>>> Then prefetching a value is `mapState.get(key).readLater()` and
>>>>> prefetching the entire map is `mapState.get().readLater()`, etc.
>>>>>
>>>>> On Wed, May 23, 2018 at 7:13 PM Charles Chen <cc...@google.com> wrote:
>>>>>
>>>>>> Thanks Kenn.  I think there are two issues to highlight: (1) the API
>>>>>> should allow for some sort of prefetching / batching / background I/O for
>>>>>> state; and (2) it should be clear what the semantics are for reading (e.g.
>>>>>> so we don't have confusing read after write behavior).
>>>>>>
>>>>>> The approach I'm leaning towards for (1) is to allow a
>>>>>> state.prefetch() method (to prefetch a value, iterable or [entire] map
>>>>>> state) and maybe something like state.prefetch_key(key) to prefetch a
>>>>>> specific KV in the map.  Issue (2) seems to be okay in either of Kenn's
>>>>>> positions.
>>>>>>
>>>>>> On Wed, May 23, 2018 at 5:33 PM Robert Bradshaw <ro...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for laying this out so well, Kenn. I'm also leaning towards
>>>>>>> the second option, despite its drawbacks. (In particular, readLater
>>>>>>> should not influence what's returned at read(), it's just a hint.)
>>>>>>>
>>>>>>> On Wed, May 23, 2018 at 4:43 PM Kenneth Knowles <kl...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Great idea to bring it to dev@. I think it is better to focus here
>>>>>>>> than long doc comment threads.
>>>>>>>>
>>>>>>>> I had strong opinions that I think were a bit confused and wrong.
>>>>>>>> Sorry for that. I stated this position:
>>>>>>>>
>>>>>>>>  - XYZState class is a handle to a mutable location
>>>>>>>>  - its methods like isEmpty() or contents() should return immutable
>>>>>>>> future values (implicitly means their contents are semantically frozen when
>>>>>>>> they are created)
>>>>>>>>  - the fact that you created the future is a hint that all
>>>>>>>> necessary fetching/computation should be kicked off
>>>>>>>>  - later forced with get()
>>>>>>>>  - when it was designed, pure async style was not a viable option
>>>>>>>>
>>>>>>>> I see now that the actual position of some of its original
>>>>>>>> designers is:
>>>>>>>>
>>>>>>>>  - XYZState class is a view on a mutable location
>>>>>>>>  - its methods return new views on that mutable location
>>>>>>>>  - calling readLater() is a hint that some fetching/computation
>>>>>>>> should be kicked off
>>>>>>>>  - later read() will combine whatever readLater() did with
>>>>>>>> additional local info to give the current value
>>>>>>>>  - async style not applicable nor desirable as per Beam's focus on
>>>>>>>> naive straight-line coding + autoscaling
>>>>>>>>
>>>>>>>> These are both internally consistent I think. In fact, I like the
>>>>>>>> second perspective better than the one I have been promoting. There are
>>>>>>>> some weaknesses: readLater() is pretty tightly coupled to a particular
>>>>>>>> implementation style, and futures are decades old so you can get good APIs
>>>>>>>> and performance without inventing anything. But I still like the non-future
>>>>>>>> version a little better.
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>> On Wed, May 23, 2018 at 4:05 PM Charles Chen <cc...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> During the design of the Beam Python State API, we noticed some
>>>>>>>>> transactionality inconsistencies in the existing Beam Java State API (these
>>>>>>>>> are the unresolved bugs BEAM-2980
>>>>>>>>> <https://issues.apache.org/jira/browse/BEAM-2980> and BEAM-2975
>>>>>>>>> <https://issues.apache.org/jira/browse/BEAM-2975>).  We are
>>>>>>>>> therefore having a discussion about this API which can have implications
>>>>>>>>> for its future development in all Beam languages:
>>>>>>>>> https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b
>>>>>>>>>
>>>>>>>>> If you have an opinion on the possible design approaches, it would
>>>>>>>>> be very helpful to bring up in the doc or on this thread.  Thanks!
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Charles
>>>>>>>>>
>>>>>>>>

Re: Existing transactionality inconsistency in the Beam Java State API

Posted by Lukasz Cwik <lc...@google.com>.

Great, I was confused in the description that was provided and then the
follow up by Ben. I think its worthwhile to describe the differences with
actual examples of what happens.

On Fri, May 25, 2018 at 10:54 AM Kenneth Knowles <kl...@google.com> wrote:

> I think the return value of read() should always be an immutable value.
>
> Kenn
>
> On Fri, May 25, 2018 at 10:44 AM Lukasz Cwik <lc...@google.com> wrote:
>
>> Kenn, in the second example where we are creating views whenever read()
>> is called, is it that the view's underlying data is immutable. For example:
>> Iterable<String> values = state.read();
>> state.append("newValue");
>> If I iterate over values, does values now contain "newValues"?
>>
>>
>> On Thu, May 24, 2018 at 10:38 AM Kenneth Knowles <kl...@google.com> wrote:
>>
>>> I see what you mean but I don't agree that futures imply anything other
>>> than "it is a value that you have to force", with deliberately many
>>> possible implementations. When at the point in 1 and you've got
>>>
>>>     interface ReadableState<T> {
>>>         T read()
>>>     }
>>>
>>> and you want to improve performance, both approaches "void readLater()"
>>> and "StateFuture<T> read()" are natural evolutions. They both gain the same
>>> 10x and they both support the "unchanging committed state plus buffered
>>> mutations" implementation well. And snapshots are essentially free for that
>>> implementation if the buffered mutations are stored in a decent data
>>> structure.
>>>
>>> My recollection was that futures were seen as more cumbersome. They
>>> affect the types even for simple uses. The only appealing future API was
>>> Guava's, but we didn't want that on the API surface. And we did not intend
>>> for these to be used in complex ways, so the usability & perf benefits of a
>>> future-based API weren't really realized anyhow.
>>>
>>> The only reason I belabor this is that if we ever wanted to support more
>>> complex use cases, such as concurrent use of state, then my preference
>>> would flip. I wouldn't want to make XYZState a synchronized monitor. At
>>> that point I'd favor using a snapshots-are-free concurrent data structure
>>> under the hood of a future-based API.
>>>
>>> Since there is really only one implementation in mind for this, maybe
>>> only one that works reasonably, we should just document it as such. The
>>> docs on ReadableState do make it sound like writes will invalidate the
>>> usefulness of readLater, even though that isn't the case for the intended
>>> implementation strategy.
>>>
>>> Kenn
>>>
>>> On Thu, May 24, 2018 at 9:40 AM Ben Chambers <bc...@apache.org>
>>> wrote:
>>>
>>>> I think Kenn's second option accurately reflects my memory of the
>>>> original intentions:
>>>>
>>>> 1. I remember we we considered either using the Future interface or
>>>> calling the ReadableState interface a future, and explicitly said "no,
>>>> future implies asynchrony and that the value returned by `get` won't change
>>>> over multiple calls, but we want the latest value each time". So, I
>>>> remember us explicitly considering and rejecting Future, thus the name
>>>> "ReadableState".
>>>>
>>>> 2. The intuition behind the implementation was analogous to a
>>>> mutable-reference cell in languages like ML / Scheme / etc. The
>>>> ReadableState is just a pointer to the the reference cell. Calling read
>>>> returns the value currently in the cell. If we have 100 ReadableStates
>>>> pointing at the same cell, they all get the same value regardless of when
>>>> they were created. This avoids needing to duplicate/snapshot values at any
>>>> point in time.
>>>>
>>>> 3. ReadLater was added, as noted by Charles, to suggest prefetching the
>>>> associated value. This was added after benchmarks showed 10x (if I remember
>>>> correctly) performance improvements in things like GroupAlsoByWindows by
>>>> minimizing round-trips asking for more state. The intuition being -- if we
>>>> need to make an RPC to load one state value, we are better off making an
>>>> RPC to load all the values we need.
>>>>
>>>> Overall, I too lean towards maintaining the second interpretation since
>>>> it seems to be consistent and I believe we had additional reasons for
>>>> preferring it over futures.
>>>>
>>>> Given the confusion, I think strengthening the class documentation
>>>> makes sense -- I note the only hint of the current behavior is that
>>>> ReadableState indicates it gets the *current* value (emphasis mine). We
>>>> should emphasize that and perhaps even mention that the ReadableState
>>>> should be understood as just a reference or handle to the underlying state,
>>>> and thus its value will reflect the latest write.
>>>>
>>>> Charles, if it helps, the plan I remember regarding prefetching was
>>>> something like:
>>>>
>>>> interface ReadableMapState<K, V> {
>>>>    ReadableState<V> get(K key);
>>>>    ReadableState<Iterable<V>> getIterable();
>>>>    ReadableState<Map<K, V>> get();
>>>>    // ... more things ...
>>>> }
>>>>
>>>> Then prefetching a value is `mapState.get(key).readLater()` and
>>>> prefetching the entire map is `mapState.get().readLater()`, etc.
>>>>
>>>> On Wed, May 23, 2018 at 7:13 PM Charles Chen <cc...@google.com> wrote:
>>>>
>>>>> Thanks Kenn.  I think there are two issues to highlight: (1) the API
>>>>> should allow for some sort of prefetching / batching / background I/O for
>>>>> state; and (2) it should be clear what the semantics are for reading (e.g.
>>>>> so we don't have confusing read after write behavior).
>>>>>
>>>>> The approach I'm leaning towards for (1) is to allow a
>>>>> state.prefetch() method (to prefetch a value, iterable or [entire] map
>>>>> state) and maybe something like state.prefetch_key(key) to prefetch a
>>>>> specific KV in the map.  Issue (2) seems to be okay in either of Kenn's
>>>>> positions.
>>>>>
>>>>> On Wed, May 23, 2018 at 5:33 PM Robert Bradshaw <ro...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks for laying this out so well, Kenn. I'm also leaning towards
>>>>>> the second option, despite its drawbacks. (In particular, readLater
>>>>>> should not influence what's returned at read(), it's just a hint.)
>>>>>>
>>>>>> On Wed, May 23, 2018 at 4:43 PM Kenneth Knowles <kl...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Great idea to bring it to dev@. I think it is better to focus here
>>>>>>> than long doc comment threads.
>>>>>>>
>>>>>>> I had strong opinions that I think were a bit confused and wrong.
>>>>>>> Sorry for that. I stated this position:
>>>>>>>
>>>>>>>  - XYZState class is a handle to a mutable location
>>>>>>>  - its methods like isEmpty() or contents() should return immutable
>>>>>>> future values (implicitly means their contents are semantically frozen when
>>>>>>> they are created)
>>>>>>>  - the fact that you created the future is a hint that all necessary
>>>>>>> fetching/computation should be kicked off
>>>>>>>  - later forced with get()
>>>>>>>  - when it was designed, pure async style was not a viable option
>>>>>>>
>>>>>>> I see now that the actual position of some of its original designers
>>>>>>> is:
>>>>>>>
>>>>>>>  - XYZState class is a view on a mutable location
>>>>>>>  - its methods return new views on that mutable location
>>>>>>>  - calling readLater() is a hint that some fetching/computation
>>>>>>> should be kicked off
>>>>>>>  - later read() will combine whatever readLater() did with
>>>>>>> additional local info to give the current value
>>>>>>>  - async style not applicable nor desirable as per Beam's focus on
>>>>>>> naive straight-line coding + autoscaling
>>>>>>>
>>>>>>> These are both internally consistent I think. In fact, I like the
>>>>>>> second perspective better than the one I have been promoting. There are
>>>>>>> some weaknesses: readLater() is pretty tightly coupled to a particular
>>>>>>> implementation style, and futures are decades old so you can get good APIs
>>>>>>> and performance without inventing anything. But I still like the non-future
>>>>>>> version a little better.
>>>>>>>
>>>>>>> Kenn
>>>>>>>
>>>>>>> On Wed, May 23, 2018 at 4:05 PM Charles Chen <cc...@google.com> wrote:
>>>>>>>
>>>>>>>> During the design of the Beam Python State API, we noticed some
>>>>>>>> transactionality inconsistencies in the existing Beam Java State API (these
>>>>>>>> are the unresolved bugs BEAM-2980
>>>>>>>> <https://issues.apache.org/jira/browse/BEAM-2980> and BEAM-2975
>>>>>>>> <https://issues.apache.org/jira/browse/BEAM-2975>).  We are
>>>>>>>> therefore having a discussion about this API which can have implications
>>>>>>>> for its future development in all Beam languages:
>>>>>>>> https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b
>>>>>>>>
>>>>>>>> If you have an opinion on the possible design approaches, it would
>>>>>>>> be very helpful to bring up in the doc or on this thread.  Thanks!
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Charles
>>>>>>>>
>>>>>>>

Re: Existing transactionality inconsistency in the Beam Java State API

Posted by Kenneth Knowles <kl...@google.com>.

I think the return value of read() should always be an immutable value.

Kenn

On Fri, May 25, 2018 at 10:44 AM Lukasz Cwik <lc...@google.com> wrote:

> Kenn, in the second example where we are creating views whenever read() is
> called, is it that the view's underlying data is immutable. For example:
> Iterable<String> values = state.read();
> state.append("newValue");
> If I iterate over values, does values now contain "newValues"?
>
>
> On Thu, May 24, 2018 at 10:38 AM Kenneth Knowles <kl...@google.com> wrote:
>
>> I see what you mean but I don't agree that futures imply anything other
>> than "it is a value that you have to force", with deliberately many
>> possible implementations. When at the point in 1 and you've got
>>
>>     interface ReadableState<T> {
>>         T read()
>>     }
>>
>> and you want to improve performance, both approaches "void readLater()"
>> and "StateFuture<T> read()" are natural evolutions. They both gain the same
>> 10x and they both support the "unchanging committed state plus buffered
>> mutations" implementation well. And snapshots are essentially free for that
>> implementation if the buffered mutations are stored in a decent data
>> structure.
>>
>> My recollection was that futures were seen as more cumbersome. They
>> affect the types even for simple uses. The only appealing future API was
>> Guava's, but we didn't want that on the API surface. And we did not intend
>> for these to be used in complex ways, so the usability & perf benefits of a
>> future-based API weren't really realized anyhow.
>>
>> The only reason I belabor this is that if we ever wanted to support more
>> complex use cases, such as concurrent use of state, then my preference
>> would flip. I wouldn't want to make XYZState a synchronized monitor. At
>> that point I'd favor using a snapshots-are-free concurrent data structure
>> under the hood of a future-based API.
>>
>> Since there is really only one implementation in mind for this, maybe
>> only one that works reasonably, we should just document it as such. The
>> docs on ReadableState do make it sound like writes will invalidate the
>> usefulness of readLater, even though that isn't the case for the intended
>> implementation strategy.
>>
>> Kenn
>>
>> On Thu, May 24, 2018 at 9:40 AM Ben Chambers <bc...@apache.org>
>> wrote:
>>
>>> I think Kenn's second option accurately reflects my memory of the
>>> original intentions:
>>>
>>> 1. I remember we we considered either using the Future interface or
>>> calling the ReadableState interface a future, and explicitly said "no,
>>> future implies asynchrony and that the value returned by `get` won't change
>>> over multiple calls, but we want the latest value each time". So, I
>>> remember us explicitly considering and rejecting Future, thus the name
>>> "ReadableState".
>>>
>>> 2. The intuition behind the implementation was analogous to a
>>> mutable-reference cell in languages like ML / Scheme / etc. The
>>> ReadableState is just a pointer to the the reference cell. Calling read
>>> returns the value currently in the cell. If we have 100 ReadableStates
>>> pointing at the same cell, they all get the same value regardless of when
>>> they were created. This avoids needing to duplicate/snapshot values at any
>>> point in time.
>>>
>>> 3. ReadLater was added, as noted by Charles, to suggest prefetching the
>>> associated value. This was added after benchmarks showed 10x (if I remember
>>> correctly) performance improvements in things like GroupAlsoByWindows by
>>> minimizing round-trips asking for more state. The intuition being -- if we
>>> need to make an RPC to load one state value, we are better off making an
>>> RPC to load all the values we need.
>>>
>>> Overall, I too lean towards maintaining the second interpretation since
>>> it seems to be consistent and I believe we had additional reasons for
>>> preferring it over futures.
>>>
>>> Given the confusion, I think strengthening the class documentation makes
>>> sense -- I note the only hint of the current behavior is that ReadableState
>>> indicates it gets the *current* value (emphasis mine). We should emphasize
>>> that and perhaps even mention that the ReadableState should be understood
>>> as just a reference or handle to the underlying state, and thus its value
>>> will reflect the latest write.
>>>
>>> Charles, if it helps, the plan I remember regarding prefetching was
>>> something like:
>>>
>>> interface ReadableMapState<K, V> {
>>>    ReadableState<V> get(K key);
>>>    ReadableState<Iterable<V>> getIterable();
>>>    ReadableState<Map<K, V>> get();
>>>    // ... more things ...
>>> }
>>>
>>> Then prefetching a value is `mapState.get(key).readLater()` and
>>> prefetching the entire map is `mapState.get().readLater()`, etc.
>>>
>>> On Wed, May 23, 2018 at 7:13 PM Charles Chen <cc...@google.com> wrote:
>>>
>>>> Thanks Kenn.  I think there are two issues to highlight: (1) the API
>>>> should allow for some sort of prefetching / batching / background I/O for
>>>> state; and (2) it should be clear what the semantics are for reading (e.g.
>>>> so we don't have confusing read after write behavior).
>>>>
>>>> The approach I'm leaning towards for (1) is to allow a state.prefetch()
>>>> method (to prefetch a value, iterable or [entire] map state) and maybe
>>>> something like state.prefetch_key(key) to prefetch a specific KV in the
>>>> map.  Issue (2) seems to be okay in either of Kenn's positions.
>>>>
>>>> On Wed, May 23, 2018 at 5:33 PM Robert Bradshaw <ro...@google.com>
>>>> wrote:
>>>>
>>>>> Thanks for laying this out so well, Kenn. I'm also leaning towards the
>>>>> second option, despite its drawbacks. (In particular, readLater
>>>>> should not influence what's returned at read(), it's just a hint.)
>>>>>
>>>>> On Wed, May 23, 2018 at 4:43 PM Kenneth Knowles <kl...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Great idea to bring it to dev@. I think it is better to focus here
>>>>>> than long doc comment threads.
>>>>>>
>>>>>> I had strong opinions that I think were a bit confused and wrong.
>>>>>> Sorry for that. I stated this position:
>>>>>>
>>>>>>  - XYZState class is a handle to a mutable location
>>>>>>  - its methods like isEmpty() or contents() should return immutable
>>>>>> future values (implicitly means their contents are semantically frozen when
>>>>>> they are created)
>>>>>>  - the fact that you created the future is a hint that all necessary
>>>>>> fetching/computation should be kicked off
>>>>>>  - later forced with get()
>>>>>>  - when it was designed, pure async style was not a viable option
>>>>>>
>>>>>> I see now that the actual position of some of its original designers
>>>>>> is:
>>>>>>
>>>>>>  - XYZState class is a view on a mutable location
>>>>>>  - its methods return new views on that mutable location
>>>>>>  - calling readLater() is a hint that some fetching/computation
>>>>>> should be kicked off
>>>>>>  - later read() will combine whatever readLater() did with additional
>>>>>> local info to give the current value
>>>>>>  - async style not applicable nor desirable as per Beam's focus on
>>>>>> naive straight-line coding + autoscaling
>>>>>>
>>>>>> These are both internally consistent I think. In fact, I like the
>>>>>> second perspective better than the one I have been promoting. There are
>>>>>> some weaknesses: readLater() is pretty tightly coupled to a particular
>>>>>> implementation style, and futures are decades old so you can get good APIs
>>>>>> and performance without inventing anything. But I still like the non-future
>>>>>> version a little better.
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>> On Wed, May 23, 2018 at 4:05 PM Charles Chen <cc...@google.com> wrote:
>>>>>>
>>>>>>> During the design of the Beam Python State API, we noticed some
>>>>>>> transactionality inconsistencies in the existing Beam Java State API (these
>>>>>>> are the unresolved bugs BEAM-2980
>>>>>>> <https://issues.apache.org/jira/browse/BEAM-2980> and BEAM-2975
>>>>>>> <https://issues.apache.org/jira/browse/BEAM-2975>).  We are
>>>>>>> therefore having a discussion about this API which can have implications
>>>>>>> for its future development in all Beam languages:
>>>>>>> https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b
>>>>>>>
>>>>>>> If you have an opinion on the possible design approaches, it would
>>>>>>> be very helpful to bring up in the doc or on this thread.  Thanks!
>>>>>>>
>>>>>>> Best,
>>>>>>> Charles
>>>>>>>
>>>>>>

Re: Existing transactionality inconsistency in the Beam Java State API

Posted by Lukasz Cwik <lc...@google.com>.

Kenn, in the second example where we are creating views whenever read() is
called, is it that the view's underlying data is immutable. For example:
Iterable<String> values = state.read();
state.append("newValue");
If I iterate over values, does values now contain "newValues"?


On Thu, May 24, 2018 at 10:38 AM Kenneth Knowles <kl...@google.com> wrote:

> I see what you mean but I don't agree that futures imply anything other
> than "it is a value that you have to force", with deliberately many
> possible implementations. When at the point in 1 and you've got
>
>     interface ReadableState<T> {
>         T read()
>     }
>
> and you want to improve performance, both approaches "void readLater()"
> and "StateFuture<T> read()" are natural evolutions. They both gain the same
> 10x and they both support the "unchanging committed state plus buffered
> mutations" implementation well. And snapshots are essentially free for that
> implementation if the buffered mutations are stored in a decent data
> structure.
>
> My recollection was that futures were seen as more cumbersome. They affect
> the types even for simple uses. The only appealing future API was Guava's,
> but we didn't want that on the API surface. And we did not intend for these
> to be used in complex ways, so the usability & perf benefits of a
> future-based API weren't really realized anyhow.
>
> The only reason I belabor this is that if we ever wanted to support more
> complex use cases, such as concurrent use of state, then my preference
> would flip. I wouldn't want to make XYZState a synchronized monitor. At
> that point I'd favor using a snapshots-are-free concurrent data structure
> under the hood of a future-based API.
>
> Since there is really only one implementation in mind for this, maybe only
> one that works reasonably, we should just document it as such. The docs on
> ReadableState do make it sound like writes will invalidate the usefulness
> of readLater, even though that isn't the case for the intended
> implementation strategy.
>
> Kenn
>
> On Thu, May 24, 2018 at 9:40 AM Ben Chambers <bc...@apache.org> wrote:
>
>> I think Kenn's second option accurately reflects my memory of the
>> original intentions:
>>
>> 1. I remember we we considered either using the Future interface or
>> calling the ReadableState interface a future, and explicitly said "no,
>> future implies asynchrony and that the value returned by `get` won't change
>> over multiple calls, but we want the latest value each time". So, I
>> remember us explicitly considering and rejecting Future, thus the name
>> "ReadableState".
>>
>> 2. The intuition behind the implementation was analogous to a
>> mutable-reference cell in languages like ML / Scheme / etc. The
>> ReadableState is just a pointer to the the reference cell. Calling read
>> returns the value currently in the cell. If we have 100 ReadableStates
>> pointing at the same cell, they all get the same value regardless of when
>> they were created. This avoids needing to duplicate/snapshot values at any
>> point in time.
>>
>> 3. ReadLater was added, as noted by Charles, to suggest prefetching the
>> associated value. This was added after benchmarks showed 10x (if I remember
>> correctly) performance improvements in things like GroupAlsoByWindows by
>> minimizing round-trips asking for more state. The intuition being -- if we
>> need to make an RPC to load one state value, we are better off making an
>> RPC to load all the values we need.
>>
>> Overall, I too lean towards maintaining the second interpretation since
>> it seems to be consistent and I believe we had additional reasons for
>> preferring it over futures.
>>
>> Given the confusion, I think strengthening the class documentation makes
>> sense -- I note the only hint of the current behavior is that ReadableState
>> indicates it gets the *current* value (emphasis mine). We should emphasize
>> that and perhaps even mention that the ReadableState should be understood
>> as just a reference or handle to the underlying state, and thus its value
>> will reflect the latest write.
>>
>> Charles, if it helps, the plan I remember regarding prefetching was
>> something like:
>>
>> interface ReadableMapState<K, V> {
>>    ReadableState<V> get(K key);
>>    ReadableState<Iterable<V>> getIterable();
>>    ReadableState<Map<K, V>> get();
>>    // ... more things ...
>> }
>>
>> Then prefetching a value is `mapState.get(key).readLater()` and
>> prefetching the entire map is `mapState.get().readLater()`, etc.
>>
>> On Wed, May 23, 2018 at 7:13 PM Charles Chen <cc...@google.com> wrote:
>>
>>> Thanks Kenn.  I think there are two issues to highlight: (1) the API
>>> should allow for some sort of prefetching / batching / background I/O for
>>> state; and (2) it should be clear what the semantics are for reading (e.g.
>>> so we don't have confusing read after write behavior).
>>>
>>> The approach I'm leaning towards for (1) is to allow a state.prefetch()
>>> method (to prefetch a value, iterable or [entire] map state) and maybe
>>> something like state.prefetch_key(key) to prefetch a specific KV in the
>>> map.  Issue (2) seems to be okay in either of Kenn's positions.
>>>
>>> On Wed, May 23, 2018 at 5:33 PM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>> Thanks for laying this out so well, Kenn. I'm also leaning towards the
>>>> second option, despite its drawbacks. (In particular, readLater should
>>>> not influence what's returned at read(), it's just a hint.)
>>>>
>>>> On Wed, May 23, 2018 at 4:43 PM Kenneth Knowles <kl...@google.com> wrote:
>>>>
>>>>> Great idea to bring it to dev@. I think it is better to focus here
>>>>> than long doc comment threads.
>>>>>
>>>>> I had strong opinions that I think were a bit confused and wrong.
>>>>> Sorry for that. I stated this position:
>>>>>
>>>>>  - XYZState class is a handle to a mutable location
>>>>>  - its methods like isEmpty() or contents() should return immutable
>>>>> future values (implicitly means their contents are semantically frozen when
>>>>> they are created)
>>>>>  - the fact that you created the future is a hint that all necessary
>>>>> fetching/computation should be kicked off
>>>>>  - later forced with get()
>>>>>  - when it was designed, pure async style was not a viable option
>>>>>
>>>>> I see now that the actual position of some of its original designers
>>>>> is:
>>>>>
>>>>>  - XYZState class is a view on a mutable location
>>>>>  - its methods return new views on that mutable location
>>>>>  - calling readLater() is a hint that some fetching/computation should
>>>>> be kicked off
>>>>>  - later read() will combine whatever readLater() did with additional
>>>>> local info to give the current value
>>>>>  - async style not applicable nor desirable as per Beam's focus on
>>>>> naive straight-line coding + autoscaling
>>>>>
>>>>> These are both internally consistent I think. In fact, I like the
>>>>> second perspective better than the one I have been promoting. There are
>>>>> some weaknesses: readLater() is pretty tightly coupled to a particular
>>>>> implementation style, and futures are decades old so you can get good APIs
>>>>> and performance without inventing anything. But I still like the non-future
>>>>> version a little better.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Wed, May 23, 2018 at 4:05 PM Charles Chen <cc...@google.com> wrote:
>>>>>
>>>>>> During the design of the Beam Python State API, we noticed some
>>>>>> transactionality inconsistencies in the existing Beam Java State API (these
>>>>>> are the unresolved bugs BEAM-2980
>>>>>> <https://issues.apache.org/jira/browse/BEAM-2980> and BEAM-2975
>>>>>> <https://issues.apache.org/jira/browse/BEAM-2975>).  We are
>>>>>> therefore having a discussion about this API which can have implications
>>>>>> for its future development in all Beam languages:
>>>>>> https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b
>>>>>>
>>>>>> If you have an opinion on the possible design approaches, it would be
>>>>>> very helpful to bring up in the doc or on this thread.  Thanks!
>>>>>>
>>>>>> Best,
>>>>>> Charles
>>>>>>
>>>>>

Re: Existing transactionality inconsistency in the Beam Java State API

Posted by Kenneth Knowles <kl...@google.com>.

I see what you mean but I don't agree that futures imply anything other
than "it is a value that you have to force", with deliberately many
possible implementations. When at the point in 1 and you've got

    interface ReadableState<T> {
        T read()
    }

and you want to improve performance, both approaches "void readLater()" and
"StateFuture<T> read()" are natural evolutions. They both gain the same 10x
and they both support the "unchanging committed state plus buffered
mutations" implementation well. And snapshots are essentially free for that
implementation if the buffered mutations are stored in a decent data
structure.

My recollection was that futures were seen as more cumbersome. They affect
the types even for simple uses. The only appealing future API was Guava's,
but we didn't want that on the API surface. And we did not intend for these
to be used in complex ways, so the usability & perf benefits of a
future-based API weren't really realized anyhow.

The only reason I belabor this is that if we ever wanted to support more
complex use cases, such as concurrent use of state, then my preference
would flip. I wouldn't want to make XYZState a synchronized monitor. At
that point I'd favor using a snapshots-are-free concurrent data structure
under the hood of a future-based API.

Since there is really only one implementation in mind for this, maybe only
one that works reasonably, we should just document it as such. The docs on
ReadableState do make it sound like writes will invalidate the usefulness
of readLater, even though that isn't the case for the intended
implementation strategy.

Kenn

On Thu, May 24, 2018 at 9:40 AM Ben Chambers <bc...@apache.org> wrote:

> I think Kenn's second option accurately reflects my memory of the original
> intentions:
>
> 1. I remember we we considered either using the Future interface or
> calling the ReadableState interface a future, and explicitly said "no,
> future implies asynchrony and that the value returned by `get` won't change
> over multiple calls, but we want the latest value each time". So, I
> remember us explicitly considering and rejecting Future, thus the name
> "ReadableState".
>
> 2. The intuition behind the implementation was analogous to a
> mutable-reference cell in languages like ML / Scheme / etc. The
> ReadableState is just a pointer to the the reference cell. Calling read
> returns the value currently in the cell. If we have 100 ReadableStates
> pointing at the same cell, they all get the same value regardless of when
> they were created. This avoids needing to duplicate/snapshot values at any
> point in time.
>
> 3. ReadLater was added, as noted by Charles, to suggest prefetching the
> associated value. This was added after benchmarks showed 10x (if I remember
> correctly) performance improvements in things like GroupAlsoByWindows by
> minimizing round-trips asking for more state. The intuition being -- if we
> need to make an RPC to load one state value, we are better off making an
> RPC to load all the values we need.
>
> Overall, I too lean towards maintaining the second interpretation since it
> seems to be consistent and I believe we had additional reasons for
> preferring it over futures.
>
> Given the confusion, I think strengthening the class documentation makes
> sense -- I note the only hint of the current behavior is that ReadableState
> indicates it gets the *current* value (emphasis mine). We should emphasize
> that and perhaps even mention that the ReadableState should be understood
> as just a reference or handle to the underlying state, and thus its value
> will reflect the latest write.
>
> Charles, if it helps, the plan I remember regarding prefetching was
> something like:
>
> interface ReadableMapState<K, V> {
>    ReadableState<V> get(K key);
>    ReadableState<Iterable<V>> getIterable();
>    ReadableState<Map<K, V>> get();
>    // ... more things ...
> }
>
> Then prefetching a value is `mapState.get(key).readLater()` and
> prefetching the entire map is `mapState.get().readLater()`, etc.
>
> On Wed, May 23, 2018 at 7:13 PM Charles Chen <cc...@google.com> wrote:
>
>> Thanks Kenn.  I think there are two issues to highlight: (1) the API
>> should allow for some sort of prefetching / batching / background I/O for
>> state; and (2) it should be clear what the semantics are for reading (e.g.
>> so we don't have confusing read after write behavior).
>>
>> The approach I'm leaning towards for (1) is to allow a state.prefetch()
>> method (to prefetch a value, iterable or [entire] map state) and maybe
>> something like state.prefetch_key(key) to prefetch a specific KV in the
>> map.  Issue (2) seems to be okay in either of Kenn's positions.
>>
>> On Wed, May 23, 2018 at 5:33 PM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> Thanks for laying this out so well, Kenn. I'm also leaning towards the
>>> second option, despite its drawbacks. (In particular, readLater should
>>> not influence what's returned at read(), it's just a hint.)
>>>
>>> On Wed, May 23, 2018 at 4:43 PM Kenneth Knowles <kl...@google.com> wrote:
>>>
>>>> Great idea to bring it to dev@. I think it is better to focus here
>>>> than long doc comment threads.
>>>>
>>>> I had strong opinions that I think were a bit confused and wrong. Sorry
>>>> for that. I stated this position:
>>>>
>>>>  - XYZState class is a handle to a mutable location
>>>>  - its methods like isEmpty() or contents() should return immutable
>>>> future values (implicitly means their contents are semantically frozen when
>>>> they are created)
>>>>  - the fact that you created the future is a hint that all necessary
>>>> fetching/computation should be kicked off
>>>>  - later forced with get()
>>>>  - when it was designed, pure async style was not a viable option
>>>>
>>>> I see now that the actual position of some of its original designers is:
>>>>
>>>>  - XYZState class is a view on a mutable location
>>>>  - its methods return new views on that mutable location
>>>>  - calling readLater() is a hint that some fetching/computation should
>>>> be kicked off
>>>>  - later read() will combine whatever readLater() did with additional
>>>> local info to give the current value
>>>>  - async style not applicable nor desirable as per Beam's focus on
>>>> naive straight-line coding + autoscaling
>>>>
>>>> These are both internally consistent I think. In fact, I like the
>>>> second perspective better than the one I have been promoting. There are
>>>> some weaknesses: readLater() is pretty tightly coupled to a particular
>>>> implementation style, and futures are decades old so you can get good APIs
>>>> and performance without inventing anything. But I still like the non-future
>>>> version a little better.
>>>>
>>>> Kenn
>>>>
>>>> On Wed, May 23, 2018 at 4:05 PM Charles Chen <cc...@google.com> wrote:
>>>>
>>>>> During the design of the Beam Python State API, we noticed some
>>>>> transactionality inconsistencies in the existing Beam Java State API (these
>>>>> are the unresolved bugs BEAM-2980
>>>>> <https://issues.apache.org/jira/browse/BEAM-2980> and BEAM-2975
>>>>> <https://issues.apache.org/jira/browse/BEAM-2975>).  We are therefore
>>>>> having a discussion about this API which can have implications for its
>>>>> future development in all Beam languages:
>>>>> https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b
>>>>>
>>>>> If you have an opinion on the possible design approaches, it would be
>>>>> very helpful to bring up in the doc or on this thread.  Thanks!
>>>>>
>>>>> Best,
>>>>> Charles
>>>>>
>>>>

Re: Existing transactionality inconsistency in the Beam Java State API

Posted by Ben Chambers <bc...@apache.org>.

I think Kenn's second option accurately reflects my memory of the original
intentions:

1. I remember we we considered either using the Future interface or calling
the ReadableState interface a future, and explicitly said "no, future
implies asynchrony and that the value returned by `get` won't change over
multiple calls, but we want the latest value each time". So, I remember us
explicitly considering and rejecting Future, thus the name "ReadableState".

2. The intuition behind the implementation was analogous to a
mutable-reference cell in languages like ML / Scheme / etc. The
ReadableState is just a pointer to the the reference cell. Calling read
returns the value currently in the cell. If we have 100 ReadableStates
pointing at the same cell, they all get the same value regardless of when
they were created. This avoids needing to duplicate/snapshot values at any
point in time.

3. ReadLater was added, as noted by Charles, to suggest prefetching the
associated value. This was added after benchmarks showed 10x (if I remember
correctly) performance improvements in things like GroupAlsoByWindows by
minimizing round-trips asking for more state. The intuition being -- if we
need to make an RPC to load one state value, we are better off making an
RPC to load all the values we need.

Overall, I too lean towards maintaining the second interpretation since it
seems to be consistent and I believe we had additional reasons for
preferring it over futures.

Given the confusion, I think strengthening the class documentation makes
sense -- I note the only hint of the current behavior is that ReadableState
indicates it gets the *current* value (emphasis mine). We should emphasize
that and perhaps even mention that the ReadableState should be understood
as just a reference or handle to the underlying state, and thus its value
will reflect the latest write.

Charles, if it helps, the plan I remember regarding prefetching was
something like:

interface ReadableMapState<K, V> {
   ReadableState<V> get(K key);
   ReadableState<Iterable<V>> getIterable();
   ReadableState<Map<K, V>> get();
   // ... more things ...
}

Then prefetching a value is `mapState.get(key).readLater()` and prefetching
the entire map is `mapState.get().readLater()`, etc.

On Wed, May 23, 2018 at 7:13 PM Charles Chen <cc...@google.com> wrote:

> Thanks Kenn.  I think there are two issues to highlight: (1) the API
> should allow for some sort of prefetching / batching / background I/O for
> state; and (2) it should be clear what the semantics are for reading (e.g.
> so we don't have confusing read after write behavior).
>
> The approach I'm leaning towards for (1) is to allow a state.prefetch()
> method (to prefetch a value, iterable or [entire] map state) and maybe
> something like state.prefetch_key(key) to prefetch a specific KV in the
> map.  Issue (2) seems to be okay in either of Kenn's positions.
>
> On Wed, May 23, 2018 at 5:33 PM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> Thanks for laying this out so well, Kenn. I'm also leaning towards the
>> second option, despite its drawbacks. (In particular, readLater should
>> not influence what's returned at read(), it's just a hint.)
>>
>> On Wed, May 23, 2018 at 4:43 PM Kenneth Knowles <kl...@google.com> wrote:
>>
>>> Great idea to bring it to dev@. I think it is better to focus here than
>>> long doc comment threads.
>>>
>>> I had strong opinions that I think were a bit confused and wrong. Sorry
>>> for that. I stated this position:
>>>
>>>  - XYZState class is a handle to a mutable location
>>>  - its methods like isEmpty() or contents() should return immutable
>>> future values (implicitly means their contents are semantically frozen when
>>> they are created)
>>>  - the fact that you created the future is a hint that all necessary
>>> fetching/computation should be kicked off
>>>  - later forced with get()
>>>  - when it was designed, pure async style was not a viable option
>>>
>>> I see now that the actual position of some of its original designers is:
>>>
>>>  - XYZState class is a view on a mutable location
>>>  - its methods return new views on that mutable location
>>>  - calling readLater() is a hint that some fetching/computation should
>>> be kicked off
>>>  - later read() will combine whatever readLater() did with additional
>>> local info to give the current value
>>>  - async style not applicable nor desirable as per Beam's focus on naive
>>> straight-line coding + autoscaling
>>>
>>> These are both internally consistent I think. In fact, I like the second
>>> perspective better than the one I have been promoting. There are some
>>> weaknesses: readLater() is pretty tightly coupled to a particular
>>> implementation style, and futures are decades old so you can get good APIs
>>> and performance without inventing anything. But I still like the non-future
>>> version a little better.
>>>
>>> Kenn
>>>
>>> On Wed, May 23, 2018 at 4:05 PM Charles Chen <cc...@google.com> wrote:
>>>
>>>> During the design of the Beam Python State API, we noticed some
>>>> transactionality inconsistencies in the existing Beam Java State API (these
>>>> are the unresolved bugs BEAM-2980
>>>> <https://issues.apache.org/jira/browse/BEAM-2980> and BEAM-2975
>>>> <https://issues.apache.org/jira/browse/BEAM-2975>).  We are therefore
>>>> having a discussion about this API which can have implications for its
>>>> future development in all Beam languages:
>>>> https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b
>>>>
>>>> If you have an opinion on the possible design approaches, it would be
>>>> very helpful to bring up in the doc or on this thread.  Thanks!
>>>>
>>>> Best,
>>>> Charles
>>>>
>>>

Re: Existing transactionality inconsistency in the Beam Java State API

Posted by Charles Chen <cc...@google.com>.

Thanks Kenn.  I think there are two issues to highlight: (1) the API should
allow for some sort of prefetching / batching / background I/O for state;
and (2) it should be clear what the semantics are for reading (e.g. so we
don't have confusing read after write behavior).

The approach I'm leaning towards for (1) is to allow a state.prefetch()
method (to prefetch a value, iterable or [entire] map state) and maybe
something like state.prefetch_key(key) to prefetch a specific KV in the
map.  Issue (2) seems to be okay in either of Kenn's positions.

On Wed, May 23, 2018 at 5:33 PM Robert Bradshaw <ro...@google.com> wrote:

> Thanks for laying this out so well, Kenn. I'm also leaning towards the
> second option, despite its drawbacks. (In particular, readLater should
> not influence what's returned at read(), it's just a hint.)
>
> On Wed, May 23, 2018 at 4:43 PM Kenneth Knowles <kl...@google.com> wrote:
>
>> Great idea to bring it to dev@. I think it is better to focus here than
>> long doc comment threads.
>>
>> I had strong opinions that I think were a bit confused and wrong. Sorry
>> for that. I stated this position:
>>
>>  - XYZState class is a handle to a mutable location
>>  - its methods like isEmpty() or contents() should return immutable
>> future values (implicitly means their contents are semantically frozen when
>> they are created)
>>  - the fact that you created the future is a hint that all necessary
>> fetching/computation should be kicked off
>>  - later forced with get()
>>  - when it was designed, pure async style was not a viable option
>>
>> I see now that the actual position of some of its original designers is:
>>
>>  - XYZState class is a view on a mutable location
>>  - its methods return new views on that mutable location
>>  - calling readLater() is a hint that some fetching/computation should be
>> kicked off
>>  - later read() will combine whatever readLater() did with additional
>> local info to give the current value
>>  - async style not applicable nor desirable as per Beam's focus on naive
>> straight-line coding + autoscaling
>>
>> These are both internally consistent I think. In fact, I like the second
>> perspective better than the one I have been promoting. There are some
>> weaknesses: readLater() is pretty tightly coupled to a particular
>> implementation style, and futures are decades old so you can get good APIs
>> and performance without inventing anything. But I still like the non-future
>> version a little better.
>>
>> Kenn
>>
>> On Wed, May 23, 2018 at 4:05 PM Charles Chen <cc...@google.com> wrote:
>>
>>> During the design of the Beam Python State API, we noticed some
>>> transactionality inconsistencies in the existing Beam Java State API (these
>>> are the unresolved bugs BEAM-2980
>>> <https://issues.apache.org/jira/browse/BEAM-2980> and BEAM-2975
>>> <https://issues.apache.org/jira/browse/BEAM-2975>).  We are therefore
>>> having a discussion about this API which can have implications for its
>>> future development in all Beam languages:
>>> https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b
>>>
>>> If you have an opinion on the possible design approaches, it would be
>>> very helpful to bring up in the doc or on this thread.  Thanks!
>>>
>>> Best,
>>> Charles
>>>
>>

Re: Existing transactionality inconsistency in the Beam Java State API

Posted by Robert Bradshaw <ro...@google.com>.

Thanks for laying this out so well, Kenn. I'm also leaning towards the
second option, despite its drawbacks. (In particular, readLater should not
influence what's returned at read(), it's just a hint.)

On Wed, May 23, 2018 at 4:43 PM Kenneth Knowles <kl...@google.com> wrote:

> Great idea to bring it to dev@. I think it is better to focus here than
> long doc comment threads.
>
> I had strong opinions that I think were a bit confused and wrong. Sorry
> for that. I stated this position:
>
>  - XYZState class is a handle to a mutable location
>  - its methods like isEmpty() or contents() should return immutable future
> values (implicitly means their contents are semantically frozen when they
> are created)
>  - the fact that you created the future is a hint that all necessary
> fetching/computation should be kicked off
>  - later forced with get()
>  - when it was designed, pure async style was not a viable option
>
> I see now that the actual position of some of its original designers is:
>
>  - XYZState class is a view on a mutable location
>  - its methods return new views on that mutable location
>  - calling readLater() is a hint that some fetching/computation should be
> kicked off
>  - later read() will combine whatever readLater() did with additional
> local info to give the current value
>  - async style not applicable nor desirable as per Beam's focus on naive
> straight-line coding + autoscaling
>
> These are both internally consistent I think. In fact, I like the second
> perspective better than the one I have been promoting. There are some
> weaknesses: readLater() is pretty tightly coupled to a particular
> implementation style, and futures are decades old so you can get good APIs
> and performance without inventing anything. But I still like the non-future
> version a little better.
>
> Kenn
>
> On Wed, May 23, 2018 at 4:05 PM Charles Chen <cc...@google.com> wrote:
>
>> During the design of the Beam Python State API, we noticed some
>> transactionality inconsistencies in the existing Beam Java State API (these
>> are the unresolved bugs BEAM-2980
>> <https://issues.apache.org/jira/browse/BEAM-2980> and BEAM-2975
>> <https://issues.apache.org/jira/browse/BEAM-2975>).  We are therefore
>> having a discussion about this API which can have implications for its
>> future development in all Beam languages:
>> https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b
>>
>> If you have an opinion on the possible design approaches, it would be
>> very helpful to bring up in the doc or on this thread.  Thanks!
>>
>> Best,
>> Charles
>>
>

Re: Existing transactionality inconsistency in the Beam Java State API

Posted by Kenneth Knowles <kl...@google.com>.

Great idea to bring it to dev@. I think it is better to focus here than
long doc comment threads.

I had strong opinions that I think were a bit confused and wrong. Sorry for
that. I stated this position:

 - XYZState class is a handle to a mutable location
 - its methods like isEmpty() or contents() should return immutable future
values (implicitly means their contents are semantically frozen when they
are created)
 - the fact that you created the future is a hint that all necessary
fetching/computation should be kicked off
 - later forced with get()
 - when it was designed, pure async style was not a viable option

I see now that the actual position of some of its original designers is:

 - XYZState class is a view on a mutable location
 - its methods return new views on that mutable location
 - calling readLater() is a hint that some fetching/computation should be
kicked off
 - later read() will combine whatever readLater() did with additional local
info to give the current value
 - async style not applicable nor desirable as per Beam's focus on naive
straight-line coding + autoscaling

These are both internally consistent I think. In fact, I like the second
perspective better than the one I have been promoting. There are some
weaknesses: readLater() is pretty tightly coupled to a particular
implementation style, and futures are decades old so you can get good APIs
and performance without inventing anything. But I still like the non-future
version a little better.

Kenn

On Wed, May 23, 2018 at 4:05 PM Charles Chen <cc...@google.com> wrote:

> During the design of the Beam Python State API, we noticed some
> transactionality inconsistencies in the existing Beam Java State API (these
> are the unresolved bugs BEAM-2980
> <https://issues.apache.org/jira/browse/BEAM-2980> and BEAM-2975
> <https://issues.apache.org/jira/browse/BEAM-2975>).  We are therefore
> having a discussion about this API which can have implications for its
> future development in all Beam languages:
> https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b
>
> If you have an opinion on the possible design approaches, it would be very
> helpful to bring up in the doc or on this thread.  Thanks!
>
> Best,
> Charles
>