Posted to commits@beam.apache.org by "Ben Chambers (JIRA)" <ji...@apache.org> on 2017/01/10 23:13:58 UTC

[jira] [Created] (BEAM-1261) State API should allow state to be managed in different windows

Ben Chambers created BEAM-1261:
----------------------------------

             Summary: State API should allow state to be managed in different windows
                 Key: BEAM-1261
                 URL: https://issues.apache.org/jira/browse/BEAM-1261
             Project: Beam
          Issue Type: Bug
          Components: beam-model, sdk-java-core
            Reporter: Ben Chambers
            Assignee: Kenneth Knowles


For example, even if the elements are being processed in fixed windows of an hour, it may be desirable for the state to "roll over" between windows (or be available to all windows).

It will also be necessary to figure out when this state should be deleted (TTL? maximum retention?).

Another problem is how to deal with out-of-order data. If data comes in for the 10:00 AM window, should its state changes be visible to the data in the 9:00 AM window?
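
To make the "roll over" question concrete, here is a minimal sketch in plain Python. It is a toy model, not the Beam State API: the helper names (window_start, count_per_window, count_rolling) and the hourly window size are assumptions made for illustration. It contrasts a counter scoped to each hourly window with a counter that carries over between windows.

```python
from collections import defaultdict

HOUR = 3600  # hypothetical fixed window size, in seconds


def window_start(timestamp, size=HOUR):
    """Return the start of the fixed window containing `timestamp`."""
    return timestamp - (timestamp % size)


def count_per_window(events):
    """State keyed by (key, window): every hourly window starts from zero."""
    state = defaultdict(int)
    for key, ts in events:
        state[(key, window_start(ts))] += 1
    return dict(state)


def count_rolling(events):
    """State keyed by key only: the counter rolls over between windows.

    Records the last counter value observed while processing each window.
    """
    state = defaultdict(int)
    seen = {}
    for key, ts in events:
        state[key] += 1
        seen[(key, window_start(ts))] = state[key]
    return seen


# Two events in the 9:00 window, one in the 10:00 window.
events = [("user", 9 * HOUR + 10), ("user", 9 * HOUR + 20), ("user", 10 * HOUR + 5)]
per_window = count_per_window(events)
rolling = count_rolling(events)
```

With per-window state the 10:00 window sees a count of 1; with rolling state it sees 3. The rolling variant never resets, which is exactly why a deletion policy (TTL or maximum retention) would be needed.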



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Re: [jira] [Created] (BEAM-1261) State API should allow state to be managed in different windows

Posted by Kenneth Knowles <kl...@google.com.INVALID>.
I think WindowMappingFn (https://issues.apache.org/jira/browse/BEAM-260 /
https://s.apache.org/beam-windowmappingfn-1-pager) is a good fit for this.
There are details to shake out.

One big thing it does not address well (because it is focused only on GC
thresholds) is specifically which windows need their state accessible from
which others, and hence how much parallelism is available and how much
communication there is between windows. Today it is somewhat moot because
we don't use that parallelism.
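
As a rough illustration of what a window mapping could provide here, the sketch below maps hourly main-input windows onto a daily state window and derives a garbage-collection time from it. This is plain Python, not the WindowMappingFn proposal itself: IntervalWindow, map_to_state_window, gc_time and the allowed_lateness parameter are hypothetical names, and the mapping rule (pick the state window containing the last instant of the main window) is an assumption.

```python
from dataclasses import dataclass

HOUR = 3600
DAY = 24 * HOUR


@dataclass(frozen=True)
class IntervalWindow:
    start: int  # inclusive, in seconds
    end: int    # exclusive, in seconds


def fixed_window(timestamp, size):
    """Return the fixed window of length `size` containing `timestamp`."""
    start = timestamp - (timestamp % size)
    return IntervalWindow(start, start + size)


def map_to_state_window(main_window, state_size=DAY):
    """Assumed mapping: the state window containing the main window's last instant."""
    return fixed_window(main_window.end - 1, state_size)


def gc_time(state_window, allowed_lateness=0):
    """Assumed GC threshold: end of the state window plus allowed lateness."""
    return state_window.end + allowed_lateness


nine = fixed_window(9 * HOUR, HOUR)
ten = fixed_window(10 * HOUR, HOUR)
shared = map_to_state_window(nine)
```

Both hourly windows map to the same daily state window, so the mapping answers "when can this state be dropped?" but says nothing about which main-input windows may run in parallel against it, which is the gap described above.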

On Wed, Jan 11, 2017 at 10:03 AM, Lukasz Cwik <lc...@google.com.invalid>
wrote:

> Bundle processing order is indeterminate. Wouldn't accessing user state of
> a different window lead to indeterminate state information? This seems to
> be even weaker than what you get from side inputs that are triggered
> multiple times.

Re: [jira] [Created] (BEAM-1261) State API should allow state to be managed in different windows

Posted by Lukasz Cwik <lc...@google.com.INVALID>.
Bundle processing order is indeterminate. Wouldn't accessing user state of
a different window lead to indeterminate state information? This seems to
be even weaker than what you get from side inputs that are triggered
multiple times.
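
The concern can be shown with a small plain-Python sketch (a toy model, not Beam code; the run function and the bundle tuples are made up for illustration). Two bundles from different windows update one shared counter, and each window records the value it observed. Enumerating both processing orders shows that the observed values depend on the order.

```python
from itertools import permutations


def run(bundles):
    """Process bundles in the given order against one counter shared by all windows.

    Each bundle is (window_label, increment). The result maps each window to
    the counter value it observed right after applying its own increment.
    """
    shared = 0
    observed = {}
    for window, increment in bundles:
        shared += increment
        observed[window] = shared
    return observed


bundles = [("09:00", 1), ("10:00", 10)]

# Collect the distinct outcomes over every possible processing order.
outcomes = {tuple(sorted(run(order).items())) for order in permutations(bundles)}
```

If the 09:00 bundle runs first, the windows observe 1 and 11; if the 10:00 bundle runs first, they observe 11 and 10. With state scoped to a single window, both orders would give the same per-window result.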

On Wed, Jan 11, 2017 at 10:01 AM, Tyler Akidau <ta...@apache.org> wrote:

> On Wed, Jan 11, 2017 at 9:43 AM Robert Bradshaw
> <ro...@google.com.invalid>
> wrote:
>
> > On Wed, Jan 11, 2017 at 8:59 AM, Lukasz Cwik <lc...@google.com.invalid>
> > wrote:
> > > I was under the impression that user state was scoped to a ParDo and
> > > was not shareable across multiple ParDos. Wouldn't rewindowing require
> > > the usage of multiple ParDos and hence not allow for state to be shared?
> >
> > No, you'd do something like
> >
> > pc.apply(WindowInto(grouping_windowing))
> >   .apply(GroupByKey())
> >   .apply(WindowInto(state_windowing))
> >   .apply(ParDo(state_using_dofn))
> >
> > You could reify the window after GroupByKey if you need to inspect it.
> >
> > However, I'm liking the idea of being able to associate different
> > WindowFns with particular state tags similar to side inputs (though
> > the default would be the windowing of the main input).
> >
>
> Can you expand upon what you mean by this? I'm not sure I understand what
> you're getting at yet.
>
> -Tyler

Re: [jira] [Created] (BEAM-1261) State API should allow state to be managed in different windows

Posted by Tyler Akidau <ta...@apache.org>.
On Wed, Jan 11, 2017 at 9:43 AM Robert Bradshaw <ro...@google.com.invalid>
wrote:

> On Wed, Jan 11, 2017 at 8:59 AM, Lukasz Cwik <lc...@google.com.invalid>
> wrote:
> > I was under the impression that user state was scoped to a ParDo and was
> > not shareable across multiple ParDos. Wouldn't rewindowing require the
> > usage of multiple ParDos and hence not allow for state to be shared?
>
> No, you'd do something like
>
> pc.apply(WindowInto(grouping_windowing))
>   .apply(GroupByKey())
>   .apply(WindowInto(state_windowing))
>   .apply(ParDo(state_using_dofn))
>
> You could reify the window after GroupByKey if you need to inspect it.
>
> However, I'm liking the idea of being able to associate different
> WindowFns with particular state tags similar to side inputs (though
> the default would be the windowing of the main input).
>

Can you expand upon what you mean by this? I'm not sure I understand what
you're getting at yet.

-Tyler



Re: [jira] [Created] (BEAM-1261) State API should allow state to be managed in different windows

Posted by Robert Bradshaw <ro...@google.com.INVALID>.
On Wed, Jan 11, 2017 at 8:59 AM, Lukasz Cwik <lc...@google.com.invalid> wrote:
> I was under the impression that user state was scoped to a ParDo and was
> not shareable across multiple ParDos. Wouldn't rewindowing require the
> usage of multiple ParDos and hence not allow for state to be shared?

No, you'd do something like

pc.apply(WindowInto(grouping_windowing))
  .apply(GroupByKey())
  .apply(WindowInto(state_windowing))
  .apply(ParDo(state_using_dofn))

You could reify the window after GroupByKey if you need to inspect it.
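
The shape of that pipeline can be mimicked in plain Python to see what the stateful step would observe. This is a toy simulation, not Beam code: the function names, the hourly grouping windows, the daily state windows, and the rule that a grouped output is timestamped at the end of its grouping window are all assumptions made for the sketch.

```python
from collections import defaultdict

HOUR = 3600
DAY = 24 * HOUR


def window_start(timestamp, size):
    """Return the start of the fixed window of length `size` containing `timestamp`."""
    return timestamp - (timestamp % size)


def group_by_key_and_window(elements, size):
    """Toy GroupByKey: collect values per (key, grouping window)."""
    groups = defaultdict(list)
    for key, value, ts in elements:
        groups[(key, window_start(ts, size))].append(value)
    return groups


def stateful_sum(groups, grouping_size, state_size):
    """Toy stateful step: a running sum keyed by (key, state window)."""
    state = defaultdict(int)
    outputs = []
    # Sorting fixes a processing order for the demo; a real runner gives no
    # such ordering guarantee.
    for (key, start), values in sorted(groups.items()):
        # Assumption: the grouped output sits at the end of its grouping window.
        output_ts = start + grouping_size - 1
        state_window = window_start(output_ts, state_size)
        state[(key, state_window)] += sum(values)
        outputs.append((key, start, state[(key, state_window)]))
    return outputs


elements = [("k", 1, 9 * HOUR + 5), ("k", 2, 9 * HOUR + 50), ("k", 4, 10 * HOUR + 1)]
grouped = group_by_key_and_window(elements, HOUR)
results = stateful_sum(grouped, HOUR, DAY)
```

Grouping still happens per hour, but because both hourly groups fall into the same daily state window, the 10:00 group sees the sum carried over from 9:00.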

However, I'm liking the idea of being able to associate different
WindowFns with particular state tags similar to side inputs (though
the default would be the windowing of the main input).
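
One possible reading of this idea, sketched in plain Python: each state tag is declared with an optional window function, and tags without one fall back to the windowing of the main input. StateTable, declare, add and fixed are hypothetical names for this illustration and do not correspond to any existing Beam API.

```python
HOUR = 3600
DAY = 24 * HOUR


def fixed(size):
    """Return a toy window function mapping a timestamp to its window start."""
    return lambda ts: ts - (ts % size)


class StateTable:
    """Toy state store in which each tag may carry its own window function."""

    def __init__(self, main_window_fn):
        self.main_window_fn = main_window_fn
        self.tag_window_fns = {}
        self.cells = {}

    def declare(self, tag, window_fn=None):
        # Default: use the windowing of the main input.
        self.tag_window_fns[tag] = window_fn or self.main_window_fn

    def add(self, tag, key, ts, amount):
        """Add `amount` to the cell for (tag, key, window chosen by the tag)."""
        cell = (tag, key, self.tag_window_fns[tag](ts))
        self.cells[cell] = self.cells.get(cell, 0) + amount
        return self.cells[cell]


table = StateTable(main_window_fn=fixed(HOUR))
table.declare("hourly_count")                       # follows the main input
table.declare("daily_count", window_fn=fixed(DAY))  # has its own windowing

timestamps = (9 * HOUR, 10 * HOUR)
hourly = [table.add("hourly_count", "k", ts, 1) for ts in timestamps]
daily = [table.add("daily_count", "k", ts, 1) for ts in timestamps]
```

The hourly tag restarts in each main-input window, while the daily tag accumulates across both, all within a single stateful step.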


Re: [jira] [Created] (BEAM-1261) State API should allow state to be managed in different windows

Posted by Lukasz Cwik <lc...@google.com.INVALID>.
I was under the impression that user state was scoped to a ParDo and was
not shareable across multiple ParDos. Wouldn't rewindowing require the
usage of multiple ParDos and hence not allow for state to be shared?

On Tue, Jan 10, 2017 at 10:51 PM, Robert Bradshaw <
robertwb@google.com.invalid> wrote:

> Possibly this could be handled by rewindowing and the current semantics. If
> not, maybe treat state like a side input with its own windowing and window
> mapping fn.

Re: [jira] [Created] (BEAM-1261) State API should allow state to be managed in different windows

Posted by Robert Bradshaw <ro...@google.com.INVALID>.
Possibly this could be handled by rewindowing and the current semantics. If
not, maybe treat state like a side input with its own windowing and window
mapping fn.

On Jan 10, 2017 3:14 PM, "Ben Chambers (JIRA)" <ji...@apache.org> wrote:

> Ben Chambers created BEAM-1261:
> ----------------------------------
>
>              Summary: State API should allow state to be managed in
> different windows
>                  Key: BEAM-1261
>                  URL: https://issues.apache.org/jira/browse/BEAM-1261
>              Project: Beam
>           Issue Type: Bug
>           Components: beam-model, sdk-java-core
>             Reporter: Ben Chambers
>             Assignee: Kenneth Knowles
>
>
> For example, even if the elements are being processed in fixed windows of
> an hour, it may be desirable for the state to "roll over" between windows
> (or be available to all windows).
>
> It will also be necessary to figure out when this state should be deleted
> (TTL? maximum retention?)
>
> Another problem is how to deal with out of order data. If data comes in
> from the 10:00 AM window, should its state changes be visible to the data
> in the 9:00 AM window?