Posted to dev@beam.apache.org by Reza Rokni <re...@google.com> on 2020/01/16 02:52:07 UTC

Re: [Proposal] Slowly Changing Dimensions and Distributed Map Side Inputs (in Dataflow)

+1 to this proposal; this is a very common pattern requirement from users.
The following current workaround has seen a lot of traction:

https://beam.apache.org/documentation/patterns/side-inputs/#slowly-updating-global-window-side-inputs
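For readers unfamiliar with that pattern, its core idea can be sketched runner-agnostically in plain Python (illustrative names only, not Beam API): a periodic re-read of the source produces a sequence of snapshots, and each main-input element is enriched against the latest snapshot available at its event time.

```python
import bisect

# Illustrative sketch (not Beam API): each periodic re-read of the
# side-input source produces a snapshot, keyed by the time at which
# that refresh became available.
refresh_times = [0, 60, 120]           # e.g. a re-read every 60 seconds
snapshots = {
    0:   {"id1": "v1"},
    60:  {"id1": "v2"},
    120: {"id1": "v3"},
}

def snapshot_for(event_time):
    """Pick the latest refresh at or before the element's event time."""
    i = bisect.bisect_right(refresh_times, event_time) - 1
    return snapshots[refresh_times[max(i, 0)]]

def enrich(event):
    """Join one main-input element against its matching snapshot."""
    table = snapshot_for(event["ts"])
    return {**event, "dim": table.get(event["key"])}
```

An element with ts=75 would be joined against the refresh taken at time 60; the real pattern achieves the same mapping via windowing and triggers on a global-window side input.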

Making this process simpler for users and available out of the box would be
a great win!

I would also mention that ideally we will also cover large distributed side
inputs, but many of the core cases come down to side inputs that do fit in
memory. It may be worth prioritizing the work so that the smaller
side-input tables take precedence, unless of course the work covers both
cases in the same way.
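For the in-memory case the semantics are simple: the side input is effectively a broadcast map, and the enrichment is a hash lookup per element. A minimal sketch (illustrative names, not Beam API):

```python
def enrich_stream(events, side_table, key_fn):
    """Hash-join each event against an in-memory side-input table.

    Events whose key is missing from the table pass through with a
    None dimension rather than being dropped.
    """
    for event in events:
        enriched = dict(event)
        enriched["dim"] = side_table.get(key_fn(event))
        yield enriched

# Example: enrich click events with user attributes.
users = {"u1": "gold", "u2": "silver"}        # small table, fits in memory
clicks = [{"user": "u1", "page": "a"}, {"user": "u3", "page": "b"}]
result = list(enrich_stream(clicks, users, lambda e: e["user"]))
# result[0]["dim"] == "gold"; result[1]["dim"] is None
```

The distributed-map case replaces the in-memory dict with remote lookups, which is where the consistency and performance questions in the doc come in.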

Cheers

Reza

On Thu, 19 Dec 2019 at 07:14, Kenneth Knowles <ke...@apache.org> wrote:

> I do think that the implementation concerns around larger side inputs are
> relevant to most runners. Ideally there would be no model change necessary.
> Triggers are harder and bring in consistency concerns, which are even more
> likely to be relevant to all runners.
>
> Kenn
>
> On Wed, Dec 18, 2019 at 11:23 AM Luke Cwik <lc...@google.com> wrote:
>
>> Most of the doc is about how to support distributed side inputs in
>> Dataflow. It doesn't really cover how the Beam model's triggers
>> (accumulating, discarding, retraction) impact what the "contents" of a
>> PCollection are at a point in time, or how this proposal for a limited set
>> of side-input shapes can work to support larger side inputs in Dataflow.
>>
>> On Tue, Dec 17, 2019 at 2:28 AM Jan Lukavský <je...@seznam.cz> wrote:
>>
>>> Hi Mikhail,
>>> On 12/17/19 10:43 AM, Mikhail Gryzykhin wrote:
>>>
>>> inline
>>>
>>> On Tue, Dec 17, 2019 at 12:59 AM Jan Lukavský <je...@seznam.cz> wrote:
>>>
>>>> Hi,
>>>>
>>>> I actually thought that the proposal refers to Dataflow only. If this
>>>> is supposed to be general, can we remove the Dataflow/Windmill-specific
>>>> parts and replace them with generic ones?
>>>>
>>> I'll look into rephrasing the doc to keep Dataflow/Windmill as an example.
>>>
>>> Cool, thanks!
>>>
>>> I'd have two more questions:
>>>>
>>>>  a) The proposal is named "Slowly changing"; why is the rate of change
>>>> essential to the proposal? Once running on event time, that should not
>>>> matter, or what am I missing?
>>>>
>>> Within this proposal, it is suggested to make a full snapshot of the data
>>> on every re-read. This is generally expensive, and setting the refresh
>>> timer to a short interval might cause issues. Otherwise the rate of change
>>> is not essential.
>>>
>>> Understood. This relates to table-stream duality, where the requirements
>>> might relax once you don't have to convert the table to a stream by
>>> re-reading it, but can instead retrieve updates as you go (an example
>>> would be reading directly from Kafka or any other "commit log"
>>> abstraction).
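The distinction Jan draws can be sketched in plain Python (illustrative, not tied to any Beam API): with a commit-log source, the side-input table is maintained incrementally from a stream of updates instead of being rebuilt from a full snapshot on every refresh.

```python
def apply_changelog(table, updates):
    """Incrementally maintain table state from a commit-log stream.

    Each update is a (key, value) pair; a value of None is treated as
    a tombstone that deletes the key.
    """
    for key, value in updates:
        if value is None:
            table.pop(key, None)   # tombstone: remove the key
        else:
            table[key] = value     # upsert
    return table

# Example: three small updates replace a full table re-read.
state = {"k1": 1, "k2": 2}
apply_changelog(state, [("k1", 10), ("k3", 3), ("k2", None)])
# state is now {"k1": 10, "k3": 3}
```

The cost per refresh is then proportional to the number of changes, not the table size, which is why the "slowly changing" qualifier matters less for this style of source.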
>>>
>>>  b) The description says: 'User wants to solve a stream enrichment
>>>> problem. In brief request sounds like: ”I want to enrich each event in this
>>>> stream by corresponding data from given table.”'. That is understandable,
>>>> but would it be better to enable the user to express this intent directly
>>>> (via Join operation)? The actual implementation might be runner (and
>>>> input!) specific. The analogy is that when doing group-by-key operation,
>>>> runner can choose hash grouping or sort-merge grouping, but that is not
>>>> (directly) expressed in user code. I'm not saying that we should not have
>>>> low-level transforms, just asking if it would be better to leave this
>>>> decision to the runner (at least in some cases). It might be the case that
>>>> we want to make core SDK as low level as possible (and as reasonable), I
>>>> just want to make sure that that is really the intent.
>>>>
>>> The idea is to add a basic operation with as small a change as possible to
>>> the current API.
>>> The ultimate goal is to have a Join/GBK operator that will choose the
>>> proper strategy. However, I don't think we yet have the proper tools or a
>>> clear enough view of how to choose the best strategy.
>>>
>>> OK, cool. That is where I would find it very useful to have some sort of
>>> "goals" that we are targeting. I agree that some pieces of the puzzle are
>>> missing as of now, but it would be good to know what those pieces are and
>>> what needs to be done to fulfill our goals. That is probably not related
>>> to the discussion of this proposal, though, but more to the concept of a
>>> BIP or similar.
>>>
>>> Thanks for the explanation.
>>>
>>> Thanks for the proposal!
>>>>
>>>> Jan
>>>> On 12/17/19 12:01 AM, Kenneth Knowles wrote:
>>>>
>>>> I want to highlight that this design definitely works for more runners
>>>> than just Dataflow. I see two pieces of it that I want to bring onto the
>>>> thread:
>>>>
>>>> 1. A new kind of "unbounded source" which is a periodic refresh of a
>>>> bounded source, used as a side input. Each main-input element has a
>>>> window that maps to a specific refresh of the side input.
>>>> 2. Distributed map side inputs: supporting very large lookup tables, but
>>>> with consistency challenges. Even the part about the "windmill API"
>>>> probably applies to other runners.
>>>>
>>>> So I hope the title and "Objective" section do not cause people to stop
>>>> reading.
>>>>
>>>> Kenn
>>>>
>>>> On Mon, Dec 16, 2019 at 11:36 AM Mikhail Gryzykhin <mi...@google.com>
>>>> wrote:
>>>>
>>>>> +some people explicitly
>>>>>
>>>>> Can you please check on the doc and comment if it looks fine?
>>>>>
>>>>> Thank you,
>>>>> --Mikhail
>>>>>
>>>>> On Tue, Dec 10, 2019 at 1:43 PM Mikhail Gryzykhin <mi...@google.com>
>>>>> wrote:
>>>>>
>>>>>> "Good news, everyone-"
>>>>>> ―Farnsworth
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> Recently, I was looking into relaxing the limitations on side inputs
>>>>>> in the Dataflow runner. As part of that, I came up with a design
>>>>>> proposal for standardizing the slowly changing dimensions use case in
>>>>>> Beam, along with the relevant changes to add support for distributed
>>>>>> map side inputs.
>>>>>>
>>>>>> Please review and comment on the design doc [1].
>>>>>>
>>>>>> Thank you,
>>>>>> Mikhail.
>>>>>>
>>>>>> -----
>>>>>>
>>>>>> [1]
>>>>>> https://docs.google.com/document/d/1LDY_CtsOJ8Y_zNv1QtkP6AGFrtzkj1q5EW_gSChOIvg
>>>>>>
>>>>>>

-- 

This email may be confidential and privileged. If you received this
communication by mistake, please don't forward it to anyone else, please
erase all copies and attachments, and please let me know that it has gone
to the wrong person.

The above terms reflect a potential business arrangement, are provided
solely as a basis for further discussion, and are not intended to be and do
not constitute a legally binding obligation. No legally binding obligations
will be created, implied, or inferred until an agreement in final form is
executed in writing by all parties involved.

Re: [Proposal] Slowly Changing Dimensions and Distributed Map Side Inputs (in Dataflow)

Posted by Mikhail Gryzykhin <mi...@google.com>.
UPD:
I have updated the doc with API suggestions; please check the relevant
section of the doc [1].

--Mikhail

[1]
https://docs.google.com/document/d/1LDY_CtsOJ8Y_zNv1QtkP6AGFrtzkj1q5EW_gSChOIvg/edit#heading=h.5e78hch3k732
