You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by jincheng sun <su...@gmail.com> on 2019/11/06 09:08:17 UTC

[DISCUSS] Avoid redundant encoding and decoding between runner and harness

Hi all,

I am trying to make some improvements of portability framework to make it
usable in other projects. However, we find that the coder between runner
and harness can only be FullWindowedValueCoder. This means each time when
sending a WindowedValue, we have to encode/decode timestamp, windows and
pan infos. In some circumstances(such as using the portability framework in
Flink), only values are needed between runner and harness. So, it would be
nice if we can configure the coder and avoid redundant encoding and
decoding between runner and harness to improve the performance.

There are two approaches to solve this issue:

    Approach 1:  Support ValueOnlyWindowedValueCoder between runner and
harness.
    Approach 2:  Add a "constant" window coder that embeds all the
windowing information as part of the coder that should be used to wrap the
value during decoding.

More details can be found here [1].

As of the shortcomings of “Approach 2” which still need to encode/decode
timestamp and pane infos, we tend to choose “Approach 1” which brings
better performance and is more thorough.

Welcome any feedback :)

Best,
Jincheng

[1]
https://docs.google.com/document/d/1TTKZC6ppVozG5zV5RiRKXse6qnJl-EsHGb_LkUfoLxY/edit?usp=sharing

Re: [DISCUSS] Avoid redundant encoding and decoding between runner and harness

Posted by jincheng sun <su...@gmail.com>.
Thank you all for your feedbacks.

@Kenn, @Robert I got what you mean now. I think it's good to make
window/timestamp/paneinfo values configurable while also avoid redundant
encoding and decoding.

Besides, I think the idea from @Luke is good, that we can replace
WindowedValue<T> with T in some cases to reduce the overhead of
understanding. But maybe we can solve it in a seperate issue?

I will update the PR according to option 2 (Add a windowed value coder
whose window/timestamp/paneinfo values are specified as constants). Thanks
again.

Best,
Jincheng

Kenneth Knowles <ke...@apache.org>于2019年11月9日 周六02:59写道:

>
>
> On Fri, Nov 8, 2019 at 9:23 AM Luke Cwik <lc...@google.com> wrote:
>
>>
>>
>> On Thu, Nov 7, 2019 at 7:36 PM Kenneth Knowles <ke...@apache.org> wrote:
>>
>>>
>>>
>>> On Thu, Nov 7, 2019 at 9:19 AM Luke Cwik <lc...@google.com> wrote:
>>>
>>>> I did suggest one other alternative on Jincheng's PR[1] which was to
>>>> allow windowless values to be sent across the gRPC port. The SDK would then
>>>> be responsible for ensuring that the execution didn't access any properties
>>>> that required knowledge of the timestamp, pane or window. This is different
>>>> then adding the ValueOnlyWindowedValueCoder as a model coder because it
>>>> allows SDKs to pass around raw values to functions without any windowing
>>>> overhead which could be useful for things like the side input window
>>>> mapping or window merging functions we have.
>>>>
>>>
>>> When you say "pass around" what does it mean? If it is over the wire,
>>> there is already no overhead to ValueOnlyWindowedValueCoder. So do you mean
>>> the overhead of having the layer of boxing of WindowedValue? I would assume
>>> all non-value components of the WindowedValue from
>>> ValueOnlyWindowedValueCoder are pointers to a single shared immutable
>>> instance carried by the coder instance.
>>>
>>
>> I was referring to the layer of boxing of WindowedValue. My concern
>> wasn't the performance overhead of passing around a wrapper object but the
>> cognitive overhead of understanding why everything needs to be wrapped in a
>> windowed value. Since you have been working on SQL for some time, this
>> would be analogous to executing a UDF and making all the machinery around
>> it take WindowedValue<T> instead of T.
>>
>>
>>> I think raw values can already be passed to functions, no? The main
>>> thing is that elements in a PCollection always have a window, timestamp,
>>> and paneinfo. Not all values are elements. Is there a specific case you
>>> have in mind? I would not expect WindowMappingFn or window merging fn to be
>>> passing "elements" but just values of the appropriate type for the function.
>>>
>>
>> This is about the machinery around WindowMappingFn/WindowMergingFn. For
>> example the implementation around WindowMappingFn takes a
>> WindowedValue<Window> and unwraps it forwarding it to the WindowMappingFn
>> and then takes the result and wraps it in a WindowedValue<Window> and
>> returns that to the runner.
>>
>
> I'm not familiar with this, but it sounds like it should not be necessary
> and is an implementation detail. Is there a model change necessary to avoid
> the unboxing/boxing? I would be surprised.
>
> Kenn
>
>
>>
>>
>>>
>>>
>>>> On Thu, Nov 7, 2019 at 8:48 AM Robert Bradshaw <ro...@google.com>
>>>> wrote:
>>>>
>>>>> I think there is some misunderstanding about what is meant by option
>>>>> 2. What Kenn (I think) and I are proposing is not a WindowedValueCoder
>>>>> whose window/timestamp/paneinfo coders are parameterized to be
>>>>> constant coders, but a WindowedValueCoder whose
>>>>> window/timestamp/paneinfo values are specified as constants in the
>>>>> coder.
>>>>>
>>>>> Let's call this NewValueOnlyWindowedValueCoder, and is parameterized
>>>>> by a window, timestamp, and pane info instance
>>>>>
>>>>> The existing ValueOnlyWindowedValueCoder is literally
>>>>> NewValueOnlyWindowedValueCoder(GlobalWindow, MIN_TIMESTAMP,
>>>>> PaneInfo.NO_FIRING). Note in particular that using the existing
>>>>> ValueOnlyWindowedValueCoder would give the wrong timestamp and pane
>>>>> info if it is use for the result of a GBK, which I think is the loss
>>>>> of consistency referred to here.
>>>>>
>>>>
>>> Yes, this is exactly what I am proposing and sounds like what "approach
>>> 2" is. I think this approach "just works". It is simpler and more efficient
>>> than "WindowedValueCoder.of(ValueCoder, WindowCoder, TimestampCoder,
>>> PaneInfoCoder)" which would require some overhead even if WindowCoder,
>>> TimestampCoder and PaneInfoCoder were constant coders consuming zero bytes.
>>>
>>> Kenn
>>>
>>>
>>>
>>>> On Thu, Nov 7, 2019 at 1:03 AM jincheng sun <su...@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> > Thanks for your feedback and the valuable comments, Kenn & Robert!
>>>>> >
>>>>> > I think your comments are more comprehensive and enlighten me a lot.
>>>>> The two proposals which I mentioned above are to reuse the existing coder
>>>>> (FullWindowedValueCoder and ValueOnlyWindowedValueCoder). Now, with your
>>>>> comments, I think we can further abstract 'FullWindowedValueCoder' and
>>>>> 'ValueOnlyWindowedValueCoder', that is, we can rename
>>>>> 'FullWindowedValueCoder' to 'WindowedValueCoder', and make the coders of
>>>>> window/timestamp/pane configurable. Then we can remove
>>>>> 'ValueOnlyWindowedValueCoder' and need to add a mount of constant coders
>>>>> for window/timestamp/pane.
>>>>> >
>>>>> > I have replied your comments on the doc, and quick feedback as
>>>>> following:
>>>>> >
>>>>> > Regarding to "Approach 2: probably no SDK harness work / compatible
>>>>> with existing Beam model so no risk of introducing inconsistency",if we
>>>>> "just puts default window/timestamp/pane info on elements" and don't change
>>>>> the original coder, then the performance is not optimized. If we want to
>>>>> get the best performance, then the default coder of Window/timestamp/pane
>>>>> should be constant coder. In this case the SDK harnesses need to be aware
>>>>> of the constant coder and there will be some development work in the SDK
>>>>> harness. Besides, the SDK harness also needs to make the coders for
>>>>> window/timestamp/pane configurable and this will introduce some related
>>>>> changes, such as updating WindowedValueCoder._get_component_coders, etc.
>>>>> >
>>>>> > Regarding to "Approach 1: option a: if the SDK harness has to
>>>>> understand 'values without windows' then very large changes and high risk
>>>>> of introducing inconsistency (I eliminated many of these inconsistencies)",
>>>>> we only need to add ValueOnlyWindowedValueCoder to the StandardCoders and
>>>>> all the SDK harness should be aware of this coder. There is no much changes
>>>>> actually.
>>>>> >
>>>>> > Please feel free to correct me if there is anyting incorrect. :)
>>>>> >
>>>>> > Besides, I'm not quite clear about the consistency issues you meant
>>>>> here. Could you please give me some hints about this?
>>>>> >
>>>>> > Best,
>>>>> > Jincheng
>>>>> >
>>>>> > Robert Bradshaw <ro...@google.com> 于2019年11月7日周四 上午3:38写道:
>>>>> >>
>>>>> >> Yes, the portability framework is designed to support this, and
>>>>> >> possibly even more efficient transfers of data than
>>>>> element-by-element
>>>>> >> as per the wire coder specified in the IO port operators. I left
>>>>> some
>>>>> >> comments on the doc as well, and would also prefer approach 2.
>>>>> >>
>>>>> >> On Wed, Nov 6, 2019 at 11:03 AM Kenneth Knowles <ke...@apache.org>
>>>>> wrote:
>>>>> >> >
>>>>> >> > I think the portability framework is designed for this. The
>>>>> runner controls the coder on the grpc ports and the runner controls the
>>>>> process bundle descriptor.
>>>>> >> >
>>>>> >> > I commented on the doc. I think what is missing is analysis of
>>>>> scope of SDK harness changes and risk to model consistency
>>>>> >> >
>>>>> >> >     Approach 2: probably no SDK harness work / compatible with
>>>>> existing Beam model so no risk of introducing inconsistency
>>>>> >> >
>>>>> >> >     Approach 1: what are all the details?
>>>>> >> >         option a: if the SDK harness has to understand "values
>>>>> without windows" then very large changes and high risk of introducing
>>>>> inconsistency (I eliminated many of these inconsistencies)
>>>>> >> >         option b: if the coder just puts default
>>>>> window/timestamp/pane info on elements, then it is the same as approach 2,
>>>>> no work / no risk
>>>>> >> >
>>>>> >> > Kenn
>>>>> >> >
>>>>> >> > On Wed, Nov 6, 2019 at 1:09 AM jincheng sun <
>>>>> sunjincheng121@gmail.com> wrote:
>>>>> >> >>
>>>>> >> >> Hi all,
>>>>> >> >>
>>>>> >> >> I am trying to make some improvements of portability framework
>>>>> to make it usable in other projects. However, we find that the coder
>>>>> between runner and harness can only be FullWindowedValueCoder. This means
>>>>> each time when sending a WindowedValue, we have to encode/decode timestamp,
>>>>> windows and pan infos. In some circumstances(such as using the portability
>>>>> framework in Flink), only values are needed between runner and harness. So,
>>>>> it would be nice if we can configure the coder and avoid redundant encoding
>>>>> and decoding between runner and harness to improve the performance.
>>>>> >> >>
>>>>> >> >> There are two approaches to solve this issue:
>>>>> >> >>
>>>>> >> >>     Approach 1:  Support ValueOnlyWindowedValueCoder between
>>>>> runner and harness.
>>>>> >> >>     Approach 2:  Add a "constant" window coder that embeds all
>>>>> the windowing information as part of the coder that should be used to wrap
>>>>> the value during decoding.
>>>>> >> >>
>>>>> >> >> More details can be found here [1].
>>>>> >> >>
>>>>> >> >> As of the shortcomings of “Approach 2” which still need to
>>>>> encode/decode timestamp and pane infos, we tend to choose “Approach 1”
>>>>> which brings better performance and is more thorough.
>>>>> >> >>
>>>>> >> >> Welcome any feedback :)
>>>>> >> >>
>>>>> >> >> Best,
>>>>> >> >> Jincheng
>>>>> >> >>
>>>>> >> >> [1]
>>>>> https://docs.google.com/document/d/1TTKZC6ppVozG5zV5RiRKXse6qnJl-EsHGb_LkUfoLxY/edit?usp=sharing
>>>>> >> >>
>>>>>
>>>>

Re: [DISCUSS] Avoid redundant encoding and decoding between runner and harness

Posted by Kenneth Knowles <ke...@apache.org>.
On Fri, Nov 8, 2019 at 9:23 AM Luke Cwik <lc...@google.com> wrote:

>
>
> On Thu, Nov 7, 2019 at 7:36 PM Kenneth Knowles <ke...@apache.org> wrote:
>
>>
>>
>> On Thu, Nov 7, 2019 at 9:19 AM Luke Cwik <lc...@google.com> wrote:
>>
>>> I did suggest one other alternative on Jincheng's PR[1] which was to
>>> allow windowless values to be sent across the gRPC port. The SDK would then
>>> be responsible for ensuring that the execution didn't access any properties
>>> that required knowledge of the timestamp, pane or window. This is different
>>> then adding the ValueOnlyWindowedValueCoder as a model coder because it
>>> allows SDKs to pass around raw values to functions without any windowing
>>> overhead which could be useful for things like the side input window
>>> mapping or window merging functions we have.
>>>
>>
>> When you say "pass around" what does it mean? If it is over the wire,
>> there is already no overhead to ValueOnlyWindowedValueCoder. So do you mean
>> the overhead of having the layer of boxing of WindowedValue? I would assume
>> all non-value components of the WindowedValue from
>> ValueOnlyWindowedValueCoder are pointers to a single shared immutable
>> instance carried by the coder instance.
>>
>
> I was referring to the layer of boxing of WindowedValue. My concern wasn't
> the performance overhead of passing around a wrapper object but the
> cognitive overhead of understanding why everything needs to be wrapped in a
> windowed value. Since you have been working on SQL for some time, this
> would be analogous to executing a UDF and making all the machinery around
> it take WindowedValue<T> instead of T.
>
>
>> I think raw values can already be passed to functions, no? The main thing
>> is that elements in a PCollection always have a window, timestamp, and
>> paneinfo. Not all values are elements. Is there a specific case you have in
>> mind? I would not expect WindowMappingFn or window merging fn to be passing
>> "elements" but just values of the appropriate type for the function.
>>
>
> This is about the machinery around WindowMappingFn/WindowMergingFn. For
> example the implementation around WindowMappingFn takes a
> WindowedValue<Window> and unwraps it forwarding it to the WindowMappingFn
> and then takes the result and wraps it in a WindowedValue<Window> and
> returns that to the runner.
>

I'm not familiar with this, but it sounds like it should not be necessary
and is an implementation detail. Is there a model change necessary to avoid
the unboxing/boxing? I would be surprised.

Kenn


>
>
>>
>>
>>> On Thu, Nov 7, 2019 at 8:48 AM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>> I think there is some misunderstanding about what is meant by option
>>>> 2. What Kenn (I think) and I are proposing is not a WindowedValueCoder
>>>> whose window/timestamp/paneinfo coders are parameterized to be
>>>> constant coders, but a WindowedValueCoder whose
>>>> window/timestamp/paneinfo values are specified as constants in the
>>>> coder.
>>>>
>>>> Let's call this NewValueOnlyWindowedValueCoder, and is parameterized
>>>> by a window, timestamp, and pane info instance
>>>>
>>>> The existing ValueOnlyWindowedValueCoder is literally
>>>> NewValueOnlyWindowedValueCoder(GlobalWindow, MIN_TIMESTAMP,
>>>> PaneInfo.NO_FIRING). Note in particular that using the existing
>>>> ValueOnlyWindowedValueCoder would give the wrong timestamp and pane
>>>> info if it is use for the result of a GBK, which I think is the loss
>>>> of consistency referred to here.
>>>>
>>>
>> Yes, this is exactly what I am proposing and sounds like what "approach
>> 2" is. I think this approach "just works". It is simpler and more efficient
>> than "WindowedValueCoder.of(ValueCoder, WindowCoder, TimestampCoder,
>> PaneInfoCoder)" which would require some overhead even if WindowCoder,
>> TimestampCoder and PaneInfoCoder were constant coders consuming zero bytes.
>>
>> Kenn
>>
>>
>>
>>> On Thu, Nov 7, 2019 at 1:03 AM jincheng sun <su...@gmail.com>
>>>> wrote:
>>>> >
>>>> > Thanks for your feedback and the valuable comments, Kenn & Robert!
>>>> >
>>>> > I think your comments are more comprehensive and enlighten me a lot.
>>>> The two proposals which I mentioned above are to reuse the existing coder
>>>> (FullWindowedValueCoder and ValueOnlyWindowedValueCoder). Now, with your
>>>> comments, I think we can further abstract 'FullWindowedValueCoder' and
>>>> 'ValueOnlyWindowedValueCoder', that is, we can rename
>>>> 'FullWindowedValueCoder' to 'WindowedValueCoder', and make the coders of
>>>> window/timestamp/pane configurable. Then we can remove
>>>> 'ValueOnlyWindowedValueCoder' and need to add a mount of constant coders
>>>> for window/timestamp/pane.
>>>> >
>>>> > I have replied your comments on the doc, and quick feedback as
>>>> following:
>>>> >
>>>> > Regarding to "Approach 2: probably no SDK harness work / compatible
>>>> with existing Beam model so no risk of introducing inconsistency",if we
>>>> "just puts default window/timestamp/pane info on elements" and don't change
>>>> the original coder, then the performance is not optimized. If we want to
>>>> get the best performance, then the default coder of Window/timestamp/pane
>>>> should be constant coder. In this case the SDK harnesses need to be aware
>>>> of the constant coder and there will be some development work in the SDK
>>>> harness. Besides, the SDK harness also needs to make the coders for
>>>> window/timestamp/pane configurable and this will introduce some related
>>>> changes, such as updating WindowedValueCoder._get_component_coders, etc.
>>>> >
>>>> > Regarding to "Approach 1: option a: if the SDK harness has to
>>>> understand 'values without windows' then very large changes and high risk
>>>> of introducing inconsistency (I eliminated many of these inconsistencies)",
>>>> we only need to add ValueOnlyWindowedValueCoder to the StandardCoders and
>>>> all the SDK harness should be aware of this coder. There is no much changes
>>>> actually.
>>>> >
>>>> > Please feel free to correct me if there is anyting incorrect. :)
>>>> >
>>>> > Besides, I'm not quite clear about the consistency issues you meant
>>>> here. Could you please give me some hints about this?
>>>> >
>>>> > Best,
>>>> > Jincheng
>>>> >
>>>> > Robert Bradshaw <ro...@google.com> 于2019年11月7日周四 上午3:38写道:
>>>> >>
>>>> >> Yes, the portability framework is designed to support this, and
>>>> >> possibly even more efficient transfers of data than
>>>> element-by-element
>>>> >> as per the wire coder specified in the IO port operators. I left some
>>>> >> comments on the doc as well, and would also prefer approach 2.
>>>> >>
>>>> >> On Wed, Nov 6, 2019 at 11:03 AM Kenneth Knowles <ke...@apache.org>
>>>> wrote:
>>>> >> >
>>>> >> > I think the portability framework is designed for this. The runner
>>>> controls the coder on the grpc ports and the runner controls the process
>>>> bundle descriptor.
>>>> >> >
>>>> >> > I commented on the doc. I think what is missing is analysis of
>>>> scope of SDK harness changes and risk to model consistency
>>>> >> >
>>>> >> >     Approach 2: probably no SDK harness work / compatible with
>>>> existing Beam model so no risk of introducing inconsistency
>>>> >> >
>>>> >> >     Approach 1: what are all the details?
>>>> >> >         option a: if the SDK harness has to understand "values
>>>> without windows" then very large changes and high risk of introducing
>>>> inconsistency (I eliminated many of these inconsistencies)
>>>> >> >         option b: if the coder just puts default
>>>> window/timestamp/pane info on elements, then it is the same as approach 2,
>>>> no work / no risk
>>>> >> >
>>>> >> > Kenn
>>>> >> >
>>>> >> > On Wed, Nov 6, 2019 at 1:09 AM jincheng sun <
>>>> sunjincheng121@gmail.com> wrote:
>>>> >> >>
>>>> >> >> Hi all,
>>>> >> >>
>>>> >> >> I am trying to make some improvements of portability framework to
>>>> make it usable in other projects. However, we find that the coder between
>>>> runner and harness can only be FullWindowedValueCoder. This means each time
>>>> when sending a WindowedValue, we have to encode/decode timestamp, windows
>>>> and pan infos. In some circumstances(such as using the portability
>>>> framework in Flink), only values are needed between runner and harness. So,
>>>> it would be nice if we can configure the coder and avoid redundant encoding
>>>> and decoding between runner and harness to improve the performance.
>>>> >> >>
>>>> >> >> There are two approaches to solve this issue:
>>>> >> >>
>>>> >> >>     Approach 1:  Support ValueOnlyWindowedValueCoder between
>>>> runner and harness.
>>>> >> >>     Approach 2:  Add a "constant" window coder that embeds all
>>>> the windowing information as part of the coder that should be used to wrap
>>>> the value during decoding.
>>>> >> >>
>>>> >> >> More details can be found here [1].
>>>> >> >>
>>>> >> >> As of the shortcomings of “Approach 2” which still need to
>>>> encode/decode timestamp and pane infos, we tend to choose “Approach 1”
>>>> which brings better performance and is more thorough.
>>>> >> >>
>>>> >> >> Welcome any feedback :)
>>>> >> >>
>>>> >> >> Best,
>>>> >> >> Jincheng
>>>> >> >>
>>>> >> >> [1]
>>>> https://docs.google.com/document/d/1TTKZC6ppVozG5zV5RiRKXse6qnJl-EsHGb_LkUfoLxY/edit?usp=sharing
>>>> >> >>
>>>>
>>>

Re: [DISCUSS] Avoid redundant encoding and decoding between runner and harness

Posted by Luke Cwik <lc...@google.com>.
On Thu, Nov 7, 2019 at 7:36 PM Kenneth Knowles <ke...@apache.org> wrote:

>
>
> On Thu, Nov 7, 2019 at 9:19 AM Luke Cwik <lc...@google.com> wrote:
>
>> I did suggest one other alternative on Jincheng's PR[1] which was to
>> allow windowless values to be sent across the gRPC port. The SDK would then
>> be responsible for ensuring that the execution didn't access any properties
>> that required knowledge of the timestamp, pane or window. This is different
>> then adding the ValueOnlyWindowedValueCoder as a model coder because it
>> allows SDKs to pass around raw values to functions without any windowing
>> overhead which could be useful for things like the side input window
>> mapping or window merging functions we have.
>>
>
> When you say "pass around" what does it mean? If it is over the wire,
> there is already no overhead to ValueOnlyWindowedValueCoder. So do you mean
> the overhead of having the layer of boxing of WindowedValue? I would assume
> all non-value components of the WindowedValue from
> ValueOnlyWindowedValueCoder are pointers to a single shared immutable
> instance carried by the coder instance.
>

I was referring to the layer of boxing of WindowedValue. My concern wasn't
the performance overhead of passing around a wrapper object but the
cognitive overhead of understanding why everything needs to be wrapped in a
windowed value. Since you have been working on SQL for some time, this
would be analogous to executing a UDF and making all the machinery around
it take WindowedValue<T> instead of T.


> I think raw values can already be passed to functions, no? The main thing
> is that elements in a PCollection always have a window, timestamp, and
> paneinfo. Not all values are elements. Is there a specific case you have in
> mind? I would not expect WindowMappingFn or window merging fn to be passing
> "elements" but just values of the appropriate type for the function.
>

This is about the machinery around WindowMappingFn/WindowMergingFn. For
example the implementation around WindowMappingFn takes a
WindowedValue<Window> and unwraps it forwarding it to the WindowMappingFn
and then takes the result and wraps it in a WindowedValue<Window> and
returns that to the runner.


>
>
>> On Thu, Nov 7, 2019 at 8:48 AM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> I think there is some misunderstanding about what is meant by option
>>> 2. What Kenn (I think) and I are proposing is not a WindowedValueCoder
>>> whose window/timestamp/paneinfo coders are parameterized to be
>>> constant coders, but a WindowedValueCoder whose
>>> window/timestamp/paneinfo values are specified as constants in the
>>> coder.
>>>
>>> Let's call this NewValueOnlyWindowedValueCoder, and is parameterized
>>> by a window, timestamp, and pane info instance
>>>
>>> The existing ValueOnlyWindowedValueCoder is literally
>>> NewValueOnlyWindowedValueCoder(GlobalWindow, MIN_TIMESTAMP,
>>> PaneInfo.NO_FIRING). Note in particular that using the existing
>>> ValueOnlyWindowedValueCoder would give the wrong timestamp and pane
>>> info if it is use for the result of a GBK, which I think is the loss
>>> of consistency referred to here.
>>>
>>
> Yes, this is exactly what I am proposing and sounds like what "approach 2"
> is. I think this approach "just works". It is simpler and more efficient
> than "WindowedValueCoder.of(ValueCoder, WindowCoder, TimestampCoder,
> PaneInfoCoder)" which would require some overhead even if WindowCoder,
> TimestampCoder and PaneInfoCoder were constant coders consuming zero bytes.
>
> Kenn
>
>
>
>> On Thu, Nov 7, 2019 at 1:03 AM jincheng sun <su...@gmail.com>
>>> wrote:
>>> >
>>> > Thanks for your feedback and the valuable comments, Kenn & Robert!
>>> >
>>> > I think your comments are more comprehensive and enlighten me a lot.
>>> The two proposals which I mentioned above are to reuse the existing coder
>>> (FullWindowedValueCoder and ValueOnlyWindowedValueCoder). Now, with your
>>> comments, I think we can further abstract 'FullWindowedValueCoder' and
>>> 'ValueOnlyWindowedValueCoder', that is, we can rename
>>> 'FullWindowedValueCoder' to 'WindowedValueCoder', and make the coders of
>>> window/timestamp/pane configurable. Then we can remove
>>> 'ValueOnlyWindowedValueCoder' and need to add a mount of constant coders
>>> for window/timestamp/pane.
>>> >
>>> > I have replied your comments on the doc, and quick feedback as
>>> following:
>>> >
>>> > Regarding to "Approach 2: probably no SDK harness work / compatible
>>> with existing Beam model so no risk of introducing inconsistency",if we
>>> "just puts default window/timestamp/pane info on elements" and don't change
>>> the original coder, then the performance is not optimized. If we want to
>>> get the best performance, then the default coder of Window/timestamp/pane
>>> should be constant coder. In this case the SDK harnesses need to be aware
>>> of the constant coder and there will be some development work in the SDK
>>> harness. Besides, the SDK harness also needs to make the coders for
>>> window/timestamp/pane configurable and this will introduce some related
>>> changes, such as updating WindowedValueCoder._get_component_coders, etc.
>>> >
>>> > Regarding to "Approach 1: option a: if the SDK harness has to
>>> understand 'values without windows' then very large changes and high risk
>>> of introducing inconsistency (I eliminated many of these inconsistencies)",
>>> we only need to add ValueOnlyWindowedValueCoder to the StandardCoders and
>>> all the SDK harness should be aware of this coder. There is no much changes
>>> actually.
>>> >
>>> > Please feel free to correct me if there is anyting incorrect. :)
>>> >
>>> > Besides, I'm not quite clear about the consistency issues you meant
>>> here. Could you please give me some hints about this?
>>> >
>>> > Best,
>>> > Jincheng
>>> >
>>> > Robert Bradshaw <ro...@google.com> 于2019年11月7日周四 上午3:38写道:
>>> >>
>>> >> Yes, the portability framework is designed to support this, and
>>> >> possibly even more efficient transfers of data than element-by-element
>>> >> as per the wire coder specified in the IO port operators. I left some
>>> >> comments on the doc as well, and would also prefer approach 2.
>>> >>
>>> >> On Wed, Nov 6, 2019 at 11:03 AM Kenneth Knowles <ke...@apache.org>
>>> wrote:
>>> >> >
>>> >> > I think the portability framework is designed for this. The runner
>>> controls the coder on the grpc ports and the runner controls the process
>>> bundle descriptor.
>>> >> >
>>> >> > I commented on the doc. I think what is missing is analysis of
>>> scope of SDK harness changes and risk to model consistency
>>> >> >
>>> >> >     Approach 2: probably no SDK harness work / compatible with
>>> existing Beam model so no risk of introducing inconsistency
>>> >> >
>>> >> >     Approach 1: what are all the details?
>>> >> >         option a: if the SDK harness has to understand "values
>>> without windows" then very large changes and high risk of introducing
>>> inconsistency (I eliminated many of these inconsistencies)
>>> >> >         option b: if the coder just puts default
>>> window/timestamp/pane info on elements, then it is the same as approach 2,
>>> no work / no risk
>>> >> >
>>> >> > Kenn
>>> >> >
>>> >> > On Wed, Nov 6, 2019 at 1:09 AM jincheng sun <
>>> sunjincheng121@gmail.com> wrote:
>>> >> >>
>>> >> >> Hi all,
>>> >> >>
>>> >> >> I am trying to make some improvements of portability framework to
>>> make it usable in other projects. However, we find that the coder between
>>> runner and harness can only be FullWindowedValueCoder. This means each time
>>> when sending a WindowedValue, we have to encode/decode timestamp, windows
>>> and pan infos. In some circumstances(such as using the portability
>>> framework in Flink), only values are needed between runner and harness. So,
>>> it would be nice if we can configure the coder and avoid redundant encoding
>>> and decoding between runner and harness to improve the performance.
>>> >> >>
>>> >> >> There are two approaches to solve this issue:
>>> >> >>
>>> >> >>     Approach 1:  Support ValueOnlyWindowedValueCoder between
>>> runner and harness.
>>> >> >>     Approach 2:  Add a "constant" window coder that embeds all the
>>> windowing information as part of the coder that should be used to wrap the
>>> value during decoding.
>>> >> >>
>>> >> >> More details can be found here [1].
>>> >> >>
>>> >> >> As of the shortcomings of “Approach 2” which still need to
>>> encode/decode timestamp and pane infos, we tend to choose “Approach 1”
>>> which brings better performance and is more thorough.
>>> >> >>
>>> >> >> Welcome any feedback :)
>>> >> >>
>>> >> >> Best,
>>> >> >> Jincheng
>>> >> >>
>>> >> >> [1]
>>> https://docs.google.com/document/d/1TTKZC6ppVozG5zV5RiRKXse6qnJl-EsHGb_LkUfoLxY/edit?usp=sharing
>>> >> >>
>>>
>>

Re: [DISCUSS] Avoid redundant encoding and decoding between runner and harness

Posted by Kenneth Knowles <ke...@apache.org>.
On Thu, Nov 7, 2019 at 9:19 AM Luke Cwik <lc...@google.com> wrote:

> I did suggest one other alternative on Jincheng's PR[1] which was to allow
> windowless values to be sent across the gRPC port. The SDK would then be
> responsible for ensuring that the execution didn't access any properties
> that required knowledge of the timestamp, pane or window. This is different
> then adding the ValueOnlyWindowedValueCoder as a model coder because it
> allows SDKs to pass around raw values to functions without any windowing
> overhead which could be useful for things like the side input window
> mapping or window merging functions we have.
>

When you say "pass around" what does it mean? If it is over the wire, there
is already no overhead to ValueOnlyWindowedValueCoder. So do you mean the
overhead of having the layer of boxing of WindowedValue? I would assume all
non-value components of the WindowedValue from ValueOnlyWindowedValueCoder
are pointers to a single shared immutable instance carried by the coder
instance.

I think raw values can already be passed to functions, no? The main thing
is that elements in a PCollection always have a window, timestamp, and
paneinfo. Not all values are elements. Is there a specific case you have in
mind? I would not expect WindowMappingFn or window merging fn to be passing
"elements" but just values of the appropriate type for the function.


> On Thu, Nov 7, 2019 at 8:48 AM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> I think there is some misunderstanding about what is meant by option
>> 2. What Kenn (I think) and I are proposing is not a WindowedValueCoder
>> whose window/timestamp/paneinfo coders are parameterized to be
>> constant coders, but a WindowedValueCoder whose
>> window/timestamp/paneinfo values are specified as constants in the
>> coder.
>>
>> Let's call this NewValueOnlyWindowedValueCoder, and is parameterized
>> by a window, timestamp, and pane info instance
>>
>> The existing ValueOnlyWindowedValueCoder is literally
>> NewValueOnlyWindowedValueCoder(GlobalWindow, MIN_TIMESTAMP,
>> PaneInfo.NO_FIRING). Note in particular that using the existing
>> ValueOnlyWindowedValueCoder would give the wrong timestamp and pane
>> info if it is use for the result of a GBK, which I think is the loss
>> of consistency referred to here.
>>
>
Yes, this is exactly what I am proposing and sounds like what "approach 2"
is. I think this approach "just works". It is simpler and more efficient
than "WindowedValueCoder.of(ValueCoder, WindowCoder, TimestampCoder,
PaneInfoCoder)" which would require some overhead even if WindowCoder,
TimestampCoder and PaneInfoCoder were constant coders consuming zero bytes.

Kenn



> On Thu, Nov 7, 2019 at 1:03 AM jincheng sun <su...@gmail.com>
>> wrote:
>> >
>> > Thanks for your feedback and the valuable comments, Kenn & Robert!
>> >
>> > I think your comments are more comprehensive and enlighten me a lot.
>> The two proposals which I mentioned above are to reuse the existing coder
>> (FullWindowedValueCoder and ValueOnlyWindowedValueCoder). Now, with your
>> comments, I think we can further abstract 'FullWindowedValueCoder' and
>> 'ValueOnlyWindowedValueCoder', that is, we can rename
>> 'FullWindowedValueCoder' to 'WindowedValueCoder', and make the coders of
>> window/timestamp/pane configurable. Then we can remove
>> 'ValueOnlyWindowedValueCoder' and need to add a mount of constant coders
>> for window/timestamp/pane.
>> >
>> > I have replied your comments on the doc, and quick feedback as
>> following:
>> >
>> > Regarding to "Approach 2: probably no SDK harness work / compatible
>> with existing Beam model so no risk of introducing inconsistency",if we
>> "just puts default window/timestamp/pane info on elements" and don't change
>> the original coder, then the performance is not optimized. If we want to
>> get the best performance, then the default coder of Window/timestamp/pane
>> should be constant coder. In this case the SDK harnesses need to be aware
>> of the constant coder and there will be some development work in the SDK
>> harness. Besides, the SDK harness also needs to make the coders for
>> window/timestamp/pane configurable and this will introduce some related
>> changes, such as updating WindowedValueCoder._get_component_coders, etc.
>> >
>> > Regarding to "Approach 1: option a: if the SDK harness has to
>> understand 'values without windows' then very large changes and high risk
>> of introducing inconsistency (I eliminated many of these inconsistencies)",
>> we only need to add ValueOnlyWindowedValueCoder to the StandardCoders and
>> all the SDK harness should be aware of this coder. There is no much changes
>> actually.
>> >
>> > Please feel free to correct me if there is anyting incorrect. :)
>> >
>> > Besides, I'm not quite clear about the consistency issues you meant
>> here. Could you please give me some hints about this?
>> >
>> > Best,
>> > Jincheng
>> >
>> > Robert Bradshaw <ro...@google.com> 于2019年11月7日周四 上午3:38写道:
>> >>
>> >> Yes, the portability framework is designed to support this, and
>> >> possibly even more efficient transfers of data than element-by-element
>> >> as per the wire coder specified in the IO port operators. I left some
>> >> comments on the doc as well, and would also prefer approach 2.
>> >>
>> >> On Wed, Nov 6, 2019 at 11:03 AM Kenneth Knowles <ke...@apache.org>
>> wrote:
>> >> >
>> >> > I think the portability framework is designed for this. The runner
>> controls the coder on the grpc ports and the runner controls the process
>> bundle descriptor.
>> >> >
>> >> > I commented on the doc. I think what is missing is analysis of scope
>> of SDK harness changes and risk to model consistency
>> >> >
>> >> >     Approach 2: probably no SDK harness work / compatible with
>> existing Beam model so no risk of introducing inconsistency
>> >> >
>> >> >     Approach 1: what are all the details?
>> >> >         option a: if the SDK harness has to understand "values
>> without windows" then very large changes and high risk of introducing
>> inconsistency (I eliminated many of these inconsistencies)
>> >> >         option b: if the coder just puts default
>> window/timestamp/pane info on elements, then it is the same as approach 2,
>> no work / no risk
>> >> >
>> >> > Kenn
>> >> >
>> >> > On Wed, Nov 6, 2019 at 1:09 AM jincheng sun <
>> sunjincheng121@gmail.com> wrote:
>> >> >>
>> >> >> Hi all,
>> >> >>
>> >> >> I am trying to make some improvements of portability framework to
>> make it usable in other projects. However, we find that the coder between
>> runner and harness can only be FullWindowedValueCoder. This means each time
>> when sending a WindowedValue, we have to encode/decode timestamp, windows
>> and pan infos. In some circumstances(such as using the portability
>> framework in Flink), only values are needed between runner and harness. So,
>> it would be nice if we can configure the coder and avoid redundant encoding
>> and decoding between runner and harness to improve the performance.
>> >> >>
>> >> >> There are two approaches to solve this issue:
>> >> >>
>> >> >>     Approach 1:  Support ValueOnlyWindowedValueCoder between runner
>> and harness.
>> >> >>     Approach 2:  Add a "constant" window coder that embeds all the
>> windowing information as part of the coder that should be used to wrap the
>> value during decoding.
>> >> >>
>> >> >> More details can be found here [1].
>> >> >>
>> >> >> As of the shortcomings of “Approach 2” which still need to
>> encode/decode timestamp and pane infos, we tend to choose “Approach 1”
>> which brings better performance and is more thorough.
>> >> >>
>> >> >> Welcome any feedback :)
>> >> >>
>> >> >> Best,
>> >> >> Jincheng
>> >> >>
>> >> >> [1]
>> https://docs.google.com/document/d/1TTKZC6ppVozG5zV5RiRKXse6qnJl-EsHGb_LkUfoLxY/edit?usp=sharing
>> >> >>
>>
>

Re: [DISCUSS] Avoid redundant encoding and decoding between runner and harness

Posted by Luke Cwik <lc...@google.com>.
I did suggest one other alternative on Jincheng's PR[1] which was to allow
windowless values to be sent across the gRPC port. The SDK would then be
responsible for ensuring that the execution didn't access any properties
that required knowledge of the timestamp, pane or window. This is different
then adding the ValueOnlyWindowedValueCoder as a model coder because it
allows SDKs to pass around raw values to functions without any windowing
overhead which could be useful for things like the side input window
mapping or window merging functions we have.

1: https://github.com/apache/beam/pull/9979

On Thu, Nov 7, 2019 at 8:48 AM Robert Bradshaw <ro...@google.com> wrote:

> I think there is some misunderstanding about what is meant by option
> 2. What Kenn (I think) and I are proposing is not a WindowedValueCoder
> whose window/timestamp/paneinfo coders are parameterized to be
> constant coders, but a WindowedValueCoder whose
> window/timestamp/paneinfo values are specified as constants in the
> coder.
>
> Let's call this NewValueOnlyWindowedValueCoder, and is parameterized
> by a window, timestamp, and pane info instance
>
> The existing ValueOnlyWindowedValueCoder is literally
> NewValueOnlyWindowedValueCoder(GlobalWindow, MIN_TIMESTAMP,
> PaneInfo.NO_FIRING). Note in particular that using the existing
> ValueOnlyWindowedValueCoder would give the wrong timestamp and pane
> info if it is use for the result of a GBK, which I think is the loss
> of consistency referred to here.
>
> On Thu, Nov 7, 2019 at 1:03 AM jincheng sun <su...@gmail.com>
> wrote:
> >
> > Thanks for your feedback and the valuable comments, Kenn & Robert!
> >
> > I think your comments are more comprehensive and enlighten me a lot. The
> two proposals which I mentioned above are to reuse the existing coder
> (FullWindowedValueCoder and ValueOnlyWindowedValueCoder). Now, with your
> comments, I think we can further abstract 'FullWindowedValueCoder' and
> 'ValueOnlyWindowedValueCoder', that is, we can rename
> 'FullWindowedValueCoder' to 'WindowedValueCoder', and make the coders of
> window/timestamp/pane configurable. Then we can remove
> 'ValueOnlyWindowedValueCoder' and need to add a mount of constant coders
> for window/timestamp/pane.
> >
> > I have replied your comments on the doc, and quick feedback as following:
> >
> > Regarding to "Approach 2: probably no SDK harness work / compatible with
> existing Beam model so no risk of introducing inconsistency",if we "just
> puts default window/timestamp/pane info on elements" and don't change the
> original coder, then the performance is not optimized. If we want to get
> the best performance, then the default coder of Window/timestamp/pane
> should be constant coder. In this case the SDK harnesses need to be aware
> of the constant coder and there will be some development work in the SDK
> harness. Besides, the SDK harness also needs to make the coders for
> window/timestamp/pane configurable and this will introduce some related
> changes, such as updating WindowedValueCoder._get_component_coders, etc.
> >
> > Regarding to "Approach 1: option a: if the SDK harness has to understand
> 'values without windows' then very large changes and high risk of
> introducing inconsistency (I eliminated many of these inconsistencies)", we
> only need to add ValueOnlyWindowedValueCoder to the StandardCoders and all
> the SDK harness should be aware of this coder. There is no much changes
> actually.
> >
> > Please feel free to correct me if there is anyting incorrect. :)
> >
> > Besides, I'm not quite clear about the consistency issues you meant
> here. Could you please give me some hints about this?
> >
> > Best,
> > Jincheng
> >
> > Robert Bradshaw <ro...@google.com> 于2019年11月7日周四 上午3:38写道:
> >>
> >> Yes, the portability framework is designed to support this, and
> >> possibly even more efficient transfers of data than element-by-element
> >> as per the wire coder specified in the IO port operators. I left some
> >> comments on the doc as well, and would also prefer approach 2.
> >>
> >> On Wed, Nov 6, 2019 at 11:03 AM Kenneth Knowles <ke...@apache.org>
> wrote:
> >> >
> >> > I think the portability framework is designed for this. The runner
> controls the coder on the grpc ports and the runner controls the process
> bundle descriptor.
> >> >
> >> > I commented on the doc. I think what is missing is analysis of scope
> of SDK harness changes and risk to model consistency
> >> >
> >> >     Approach 2: probably no SDK harness work / compatible with
> existing Beam model so no risk of introducing inconsistency
> >> >
> >> >     Approach 1: what are all the details?
> >> >         option a: if the SDK harness has to understand "values
> without windows" then very large changes and high risk of introducing
> inconsistency (I eliminated many of these inconsistencies)
> >> >         option b: if the coder just puts default
> window/timestamp/pane info on elements, then it is the same as approach 2,
> no work / no risk
> >> >
> >> > Kenn
> >> >
> >> > On Wed, Nov 6, 2019 at 1:09 AM jincheng sun <su...@gmail.com>
> wrote:
> >> >>
> >> >> Hi all,
> >> >>
> >> >> I am trying to make some improvements of portability framework to
> make it usable in other projects. However, we find that the coder between
> runner and harness can only be FullWindowedValueCoder. This means each time
> when sending a WindowedValue, we have to encode/decode timestamp, windows
> and pan infos. In some circumstances(such as using the portability
> framework in Flink), only values are needed between runner and harness. So,
> it would be nice if we can configure the coder and avoid redundant encoding
> and decoding between runner and harness to improve the performance.
> >> >>
> >> >> There are two approaches to solve this issue:
> >> >>
> >> >>     Approach 1:  Support ValueOnlyWindowedValueCoder between runner
> and harness.
> >> >>     Approach 2:  Add a "constant" window coder that embeds all the
> windowing information as part of the coder that should be used to wrap the
> value during decoding.
> >> >>
> >> >> More details can be found here [1].
> >> >>
> >> >> As of the shortcomings of “Approach 2” which still need to
> encode/decode timestamp and pane infos, we tend to choose “Approach 1”
> which brings better performance and is more thorough.
> >> >>
> >> >> Welcome any feedback :)
> >> >>
> >> >> Best,
> >> >> Jincheng
> >> >>
> >> >> [1]
> https://docs.google.com/document/d/1TTKZC6ppVozG5zV5RiRKXse6qnJl-EsHGb_LkUfoLxY/edit?usp=sharing
> >> >>
>

Re: [DISCUSS] Avoid redundant encoding and decoding between runner and harness

Posted by Robert Bradshaw <ro...@google.com>.
I think there is some misunderstanding about what is meant by option
2. What Kenn (I think) and I are proposing is not a WindowedValueCoder
whose window/timestamp/paneinfo coders are parameterized to be
constant coders, but a WindowedValueCoder whose
window/timestamp/paneinfo values are specified as constants in the
coder.

Let's call this NewValueOnlyWindowedValueCoder, and is parameterized
by a window, timestamp, and pane info instance

The existing ValueOnlyWindowedValueCoder is literally
NewValueOnlyWindowedValueCoder(GlobalWindow, MIN_TIMESTAMP,
PaneInfo.NO_FIRING). Note in particular that using the existing
ValueOnlyWindowedValueCoder would give the wrong timestamp and pane
info if it is use for the result of a GBK, which I think is the loss
of consistency referred to here.

On Thu, Nov 7, 2019 at 1:03 AM jincheng sun <su...@gmail.com> wrote:
>
> Thanks for your feedback and the valuable comments, Kenn & Robert!
>
> I think your comments are more comprehensive and enlighten me a lot. The two proposals which I mentioned above are to reuse the existing coder (FullWindowedValueCoder and ValueOnlyWindowedValueCoder). Now, with your comments, I think we can further abstract 'FullWindowedValueCoder' and 'ValueOnlyWindowedValueCoder', that is, we can rename 'FullWindowedValueCoder' to 'WindowedValueCoder', and make the coders of window/timestamp/pane configurable. Then we can remove 'ValueOnlyWindowedValueCoder' and need to add a mount of constant coders for window/timestamp/pane.
>
> I have replied your comments on the doc, and quick feedback as following:
>
> Regarding to "Approach 2: probably no SDK harness work / compatible with existing Beam model so no risk of introducing inconsistency",if we "just puts default window/timestamp/pane info on elements" and don't change the original coder, then the performance is not optimized. If we want to get the best performance, then the default coder of Window/timestamp/pane should be constant coder. In this case the SDK harnesses need to be aware of the constant coder and there will be some development work in the SDK harness. Besides, the SDK harness also needs to make the coders for window/timestamp/pane configurable and this will introduce some related changes, such as updating WindowedValueCoder._get_component_coders, etc.
>
> Regarding to "Approach 1: option a: if the SDK harness has to understand 'values without windows' then very large changes and high risk of introducing inconsistency (I eliminated many of these inconsistencies)", we only need to add ValueOnlyWindowedValueCoder to the StandardCoders and all the SDK harness should be aware of this coder. There is no much changes actually.
>
> Please feel free to correct me if there is anyting incorrect. :)
>
> Besides, I'm not quite clear about the consistency issues you meant here. Could you please give me some hints about this?
>
> Best,
> Jincheng
>
> Robert Bradshaw <ro...@google.com> 于2019年11月7日周四 上午3:38写道:
>>
>> Yes, the portability framework is designed to support this, and
>> possibly even more efficient transfers of data than element-by-element
>> as per the wire coder specified in the IO port operators. I left some
>> comments on the doc as well, and would also prefer approach 2.
>>
>> On Wed, Nov 6, 2019 at 11:03 AM Kenneth Knowles <ke...@apache.org> wrote:
>> >
>> > I think the portability framework is designed for this. The runner controls the coder on the grpc ports and the runner controls the process bundle descriptor.
>> >
>> > I commented on the doc. I think what is missing is analysis of scope of SDK harness changes and risk to model consistency
>> >
>> >     Approach 2: probably no SDK harness work / compatible with existing Beam model so no risk of introducing inconsistency
>> >
>> >     Approach 1: what are all the details?
>> >         option a: if the SDK harness has to understand "values without windows" then very large changes and high risk of introducing inconsistency (I eliminated many of these inconsistencies)
>> >         option b: if the coder just puts default window/timestamp/pane info on elements, then it is the same as approach 2, no work / no risk
>> >
>> > Kenn
>> >
>> > On Wed, Nov 6, 2019 at 1:09 AM jincheng sun <su...@gmail.com> wrote:
>> >>
>> >> Hi all,
>> >>
>> >> I am trying to make some improvements of portability framework to make it usable in other projects. However, we find that the coder between runner and harness can only be FullWindowedValueCoder. This means each time when sending a WindowedValue, we have to encode/decode timestamp, windows and pan infos. In some circumstances(such as using the portability framework in Flink), only values are needed between runner and harness. So, it would be nice if we can configure the coder and avoid redundant encoding and decoding between runner and harness to improve the performance.
>> >>
>> >> There are two approaches to solve this issue:
>> >>
>> >>     Approach 1:  Support ValueOnlyWindowedValueCoder between runner and harness.
>> >>     Approach 2:  Add a "constant" window coder that embeds all the windowing information as part of the coder that should be used to wrap the value during decoding.
>> >>
>> >> More details can be found here [1].
>> >>
>> >> As of the shortcomings of “Approach 2” which still need to encode/decode timestamp and pane infos, we tend to choose “Approach 1” which brings better performance and is more thorough.
>> >>
>> >> Welcome any feedback :)
>> >>
>> >> Best,
>> >> Jincheng
>> >>
>> >> [1] https://docs.google.com/document/d/1TTKZC6ppVozG5zV5RiRKXse6qnJl-EsHGb_LkUfoLxY/edit?usp=sharing
>> >>

Re: [DISCUSS] Avoid redundant encoding and decoding between runner and harness

Posted by jincheng sun <su...@gmail.com>.
Thanks for your feedback and the valuable comments, Kenn & Robert!

I think your comments are more comprehensive and enlighten me a lot. The
two proposals which I mentioned above are to reuse the existing coder
(FullWindowedValueCoder and ValueOnlyWindowedValueCoder). Now, with your
comments, I think we can further abstract 'FullWindowedValueCoder' and
'ValueOnlyWindowedValueCoder', that is, we can rename
'FullWindowedValueCoder' to 'WindowedValueCoder', and make the coders of
window/timestamp/pane configurable. Then we can remove
'ValueOnlyWindowedValueCoder' and need to add a mount of constant coders
for window/timestamp/pane.

I have replied your comments on the doc, and quick feedback as following:

Regarding to "Approach 2: probably no SDK harness work / compatible with
existing Beam model so no risk of introducing inconsistency",if we "just
puts default window/timestamp/pane info on elements" and don't change the
original coder, then the performance is not optimized. If we want to get
the best performance, then the default coder of Window/timestamp/pane
should be constant coder. In this case the SDK harnesses need to be aware
of the constant coder and there will be some development work in the SDK
harness. Besides, the SDK harness also needs to make the coders for
window/timestamp/pane configurable and this will introduce some related
changes, such as updating WindowedValueCoder._get_component_coders, etc.

Regarding to "Approach 1: option a: if the SDK harness has to understand
'values without windows' then very large changes and high risk of
introducing inconsistency (I eliminated many of these inconsistencies)", we
only need to add ValueOnlyWindowedValueCoder to the StandardCoders and all
the SDK harness should be aware of this coder. There is no much changes
actually.

Please feel free to correct me if there is anyting incorrect. :)

Besides, I'm not quite clear about the consistency issues you meant here.
Could you please give me some hints about this?

Best,
Jincheng

Robert Bradshaw <ro...@google.com> 于2019年11月7日周四 上午3:38写道:

> Yes, the portability framework is designed to support this, and
> possibly even more efficient transfers of data than element-by-element
> as per the wire coder specified in the IO port operators. I left some
> comments on the doc as well, and would also prefer approach 2.
>
> On Wed, Nov 6, 2019 at 11:03 AM Kenneth Knowles <ke...@apache.org> wrote:
> >
> > I think the portability framework is designed for this. The runner
> controls the coder on the grpc ports and the runner controls the process
> bundle descriptor.
> >
> > I commented on the doc. I think what is missing is analysis of scope of
> SDK harness changes and risk to model consistency
> >
> >     Approach 2: probably no SDK harness work / compatible with existing
> Beam model so no risk of introducing inconsistency
> >
> >     Approach 1: what are all the details?
> >         option a: if the SDK harness has to understand "values without
> windows" then very large changes and high risk of introducing inconsistency
> (I eliminated many of these inconsistencies)
> >         option b: if the coder just puts default window/timestamp/pane
> info on elements, then it is the same as approach 2, no work / no risk
> >
> > Kenn
> >
> > On Wed, Nov 6, 2019 at 1:09 AM jincheng sun <su...@gmail.com>
> wrote:
> >>
> >> Hi all,
> >>
> >> I am trying to make some improvements of portability framework to make
> it usable in other projects. However, we find that the coder between runner
> and harness can only be FullWindowedValueCoder. This means each time when
> sending a WindowedValue, we have to encode/decode timestamp, windows and
> pan infos. In some circumstances(such as using the portability framework in
> Flink), only values are needed between runner and harness. So, it would be
> nice if we can configure the coder and avoid redundant encoding and
> decoding between runner and harness to improve the performance.
> >>
> >> There are two approaches to solve this issue:
> >>
> >>     Approach 1:  Support ValueOnlyWindowedValueCoder between runner and
> harness.
> >>     Approach 2:  Add a "constant" window coder that embeds all the
> windowing information as part of the coder that should be used to wrap the
> value during decoding.
> >>
> >> More details can be found here [1].
> >>
> >> As of the shortcomings of “Approach 2” which still need to
> encode/decode timestamp and pane infos, we tend to choose “Approach 1”
> which brings better performance and is more thorough.
> >>
> >> Welcome any feedback :)
> >>
> >> Best,
> >> Jincheng
> >>
> >> [1]
> https://docs.google.com/document/d/1TTKZC6ppVozG5zV5RiRKXse6qnJl-EsHGb_LkUfoLxY/edit?usp=sharing
> >>
>

Re: [DISCUSS] Avoid redundant encoding and decoding between runner and harness

Posted by Robert Bradshaw <ro...@google.com>.
Yes, the portability framework is designed to support this, and
possibly even more efficient transfers of data than element-by-element
as per the wire coder specified in the IO port operators. I left some
comments on the doc as well, and would also prefer approach 2.

On Wed, Nov 6, 2019 at 11:03 AM Kenneth Knowles <ke...@apache.org> wrote:
>
> I think the portability framework is designed for this. The runner controls the coder on the grpc ports and the runner controls the process bundle descriptor.
>
> I commented on the doc. I think what is missing is analysis of scope of SDK harness changes and risk to model consistency
>
>     Approach 2: probably no SDK harness work / compatible with existing Beam model so no risk of introducing inconsistency
>
>     Approach 1: what are all the details?
>         option a: if the SDK harness has to understand "values without windows" then very large changes and high risk of introducing inconsistency (I eliminated many of these inconsistencies)
>         option b: if the coder just puts default window/timestamp/pane info on elements, then it is the same as approach 2, no work / no risk
>
> Kenn
>
> On Wed, Nov 6, 2019 at 1:09 AM jincheng sun <su...@gmail.com> wrote:
>>
>> Hi all,
>>
>> I am trying to make some improvements of portability framework to make it usable in other projects. However, we find that the coder between runner and harness can only be FullWindowedValueCoder. This means each time when sending a WindowedValue, we have to encode/decode timestamp, windows and pan infos. In some circumstances(such as using the portability framework in Flink), only values are needed between runner and harness. So, it would be nice if we can configure the coder and avoid redundant encoding and decoding between runner and harness to improve the performance.
>>
>> There are two approaches to solve this issue:
>>
>>     Approach 1:  Support ValueOnlyWindowedValueCoder between runner and harness.
>>     Approach 2:  Add a "constant" window coder that embeds all the windowing information as part of the coder that should be used to wrap the value during decoding.
>>
>> More details can be found here [1].
>>
>> As of the shortcomings of “Approach 2” which still need to encode/decode timestamp and pane infos, we tend to choose “Approach 1” which brings better performance and is more thorough.
>>
>> Welcome any feedback :)
>>
>> Best,
>> Jincheng
>>
>> [1] https://docs.google.com/document/d/1TTKZC6ppVozG5zV5RiRKXse6qnJl-EsHGb_LkUfoLxY/edit?usp=sharing
>>

Re: [DISCUSS] Avoid redundant encoding and decoding between runner and harness

Posted by Kenneth Knowles <ke...@apache.org>.
I think the portability framework is designed for this. The runner controls
the coder on the grpc ports and the runner controls the process bundle
descriptor.

I commented on the doc. I think what is missing is analysis of scope of SDK
harness changes and risk to model consistency

    Approach 2: probably no SDK harness work / compatible with existing
Beam model so no risk of introducing inconsistency

    Approach 1: what are all the details?
        option a: if the SDK harness has to understand "values without
windows" then very large changes and high risk of introducing inconsistency
(I eliminated many of these inconsistencies)
        option b: if the coder just puts default window/timestamp/pane info
on elements, then it is the same as approach 2, no work / no risk

Kenn

On Wed, Nov 6, 2019 at 1:09 AM jincheng sun <su...@gmail.com>
wrote:

> Hi all,
>
> I am trying to make some improvements of portability framework to make it
> usable in other projects. However, we find that the coder between runner
> and harness can only be FullWindowedValueCoder. This means each time when
> sending a WindowedValue, we have to encode/decode timestamp, windows and
> pan infos. In some circumstances(such as using the portability framework in
> Flink), only values are needed between runner and harness. So, it would be
> nice if we can configure the coder and avoid redundant encoding and
> decoding between runner and harness to improve the performance.
>
> There are two approaches to solve this issue:
>
>     Approach 1:  Support ValueOnlyWindowedValueCoder between runner and
> harness.
>     Approach 2:  Add a "constant" window coder that embeds all the
> windowing information as part of the coder that should be used to wrap the
> value during decoding.
>
> More details can be found here [1].
>
> As of the shortcomings of “Approach 2” which still need to encode/decode
> timestamp and pane infos, we tend to choose “Approach 1” which brings
> better performance and is more thorough.
>
> Welcome any feedback :)
>
> Best,
> Jincheng
>
> [1]
> https://docs.google.com/document/d/1TTKZC6ppVozG5zV5RiRKXse6qnJl-EsHGb_LkUfoLxY/edit?usp=sharing
>
>