Posted to dev@beam.apache.org by Aljoscha Krettek <al...@apache.org> on 2016/05/03 10:58:37 UTC

Re: [DISCUSS] Adding Some Sort of SideInputRunner

I'm afraid I have yet another question. What's the interplay between the
state that holds the buffered main-input elements and possible per-key
state that might be used by the DoFn? I guess I'm not seeing all the parts,
but my problem is that one part (the buffering) requires a different type
of state scope than the other part (key-scoped state access in the DoFn),
while they both seem to be using the same StateInternals from the step
context. How does that work?

Cheers,
Aljoscha

On Thu, 28 Apr 2016 at 20:05 Kenneth Knowles <kl...@google.com.invalid> wrote:

> On Thu, Apr 28, 2016 at 10:19 AM, Aljoscha Krettek <al...@apache.org>
> wrote:
>
> > No worries :-) and thanks for the detailed answers!
> >
> > I still have one question, though: you wrote that "The side input is
> > considered ready when there has been any data output/added to the
> > PCollection that it is being read as a side input. So the upstream
> > trigger controls this." How does this work with side inputs that consist
> > of multiple elements, e.g. ListPCollectionView and MapPCollectionView? For
> > them, do we also consider the side input as ready once the first element
> > arrives? That's why I was wondering about the triggers being responsible
> > for deciding when a side input is ready.
> >
>
> Yes, just as you describe. The side input window becomes ready once it has
> any data. So, combining your items 2.5 and 3, you have a situation where
> main input elements may be combined with only a speculative subset of the
> side input data. They will not be reprocessed once more up-to-date side
> input values become known. Beyond this initial period of waiting for the
> very first firing of the side input window, there are no consistency
> restrictions/guarantees on main input vs side input windows or triggerings.
> It may be that, for a given runner, updating the side input with the new
> value happens at high latency, so all the main input elements are processed
> and gone before the update goes through. It is a bit of a dangerous area
> for users. I'm pretty interested in ideas in this space.
>
> Kenn
>
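
A minimal sketch of the readiness behaviour described in the quoted reply, assuming a single main-input window and a single side-input window. The class and method names (BufferingSideInputSketch, onMainElement, onSideInputFiring) are hypothetical and are not part of the Beam SDK or of any runner. Main-input elements are buffered until the side input has fired once; after that they are processed against whatever contents are current, and nothing is reprocessed when a later firing arrives.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Beam implementation.
final class BufferingSideInputSketch {
    private final List<String> buffered = new ArrayList<>();
    private final List<String> output = new ArrayList<>();
    // null means the side input window has not had any firing yet.
    private List<String> sideInput = null;

    // A main-input element waits only while the side input has no data at all.
    void onMainElement(String element) {
        if (sideInput == null) {
            buffered.add(element);
        } else {
            output.add(element + " with " + sideInput);
        }
    }

    // Any firing makes the side input "ready", even a speculative, partial one.
    void onSideInputFiring(List<String> contents) {
        boolean firstFiring = (sideInput == null);
        sideInput = new ArrayList<>(contents);
        if (firstFiring) {
            // Flush the buffer once; later firings do not reprocess old elements.
            for (String element : buffered) {
                output.add(element + " with " + sideInput);
            }
            buffered.clear();
        }
    }

    int bufferedCount() {
        return buffered.size();
    }

    List<String> output() {
        return new ArrayList<>(output);
    }
}
```

In this sketch an element that arrived before the first firing ends up paired with the first (possibly incomplete) side-input contents, which is the "speculative subset" situation discussed above.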

Re: [DISCUSS] Adding Some Sort of SideInputRunner

Posted by Aljoscha Krettek <al...@apache.org>.
Maybe, I'll try and figure something out. :-)

My problem was that the doc for StateInternals explicitly states that
access to state is always implicitly scoped to the key being processed. In
my understanding this was always the key of an element but it seems that it
can also be a more abstract key, such as the sharding key. The fact that
this could be the case was hidden away in code outside the SDK, it seems.

Thanks for your help!

On Tue, 3 May 2016 at 19:40 Kenneth Knowles <kl...@google.com.invalid> wrote:

> I think the answer to your questions might be StateNamespace.
>
> The lowest level of state is always key-scoped, while the StateNamespace
> indicates whether it is global to the key, further scoped to a particular
> window, or even scoped to a particular trigger. When the DoFn needs a side
> input, the key might actually be gone from the user's point of view. It is
> up to the StepContext to provide an appropriately-scoped StateInternals,
> usually by some consistent sharding key such as the key from the upstream
> GBK.
>
> I don't want to go too much into state accessed in the DoFn as I haven't
> yet got a chance to prepare and publish the design doc for that, and I want
> everyone to have access to it for any discussion.
>
> Does this help?

Re: [DISCUSS] Adding Some Sort of SideInputRunner

Posted by Kenneth Knowles <kl...@google.com.INVALID>.
I think the answer to your questions might be StateNamespace.

The lowest level of state is always key-scoped, while the StateNamespace
indicates whether it is global to the key, further scoped to a particular
window, or even scoped to a particular trigger. When the DoFn needs a side
input, the key might actually be gone from the user's point of view. It is
up to the StepContext to provide an appropriately-scoped StateInternals,
usually by some consistent sharding key such as the key from the upstream
GBK.
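
To make the scoping idea concrete, here is a rough sketch under my own assumptions. Namespace and KeyedStateTable are hypothetical names, not the real StateNamespace or StateInternals classes, and the layout is only one way to picture it. Every cell is addressed by (key, namespace, tag), so a buffer kept under a window namespace and some other state kept under the global namespace can share the same key-scoped table without colliding. The key here can be a sharding key rather than a user-visible element key.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a namespace: global to the key, per window,
// or per window and trigger.
final class Namespace {
    private final String path;

    private Namespace(String path) {
        this.path = path;
    }

    static Namespace global() {
        return new Namespace("global");
    }

    static Namespace window(String windowId) {
        return new Namespace("window:" + windowId);
    }

    static Namespace windowAndTrigger(String windowId, int triggerIndex) {
        return new Namespace("window:" + windowId + "/trigger:" + triggerIndex);
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof Namespace && ((Namespace) other).path.equals(path);
    }

    @Override
    public int hashCode() {
        return path.hashCode();
    }
}

// Hypothetical key-scoped table: key -> namespace -> tag -> values.
final class KeyedStateTable {
    private final Map<String, Map<Namespace, Map<String, List<String>>>> cells = new HashMap<>();

    void add(String key, Namespace namespace, String tag, String value) {
        cells.computeIfAbsent(key, k -> new HashMap<>())
            .computeIfAbsent(namespace, n -> new HashMap<>())
            .computeIfAbsent(tag, t -> new ArrayList<>())
            .add(value);
    }

    // Returns a copy so callers cannot mutate the stored cell.
    List<String> read(String key, Namespace namespace, String tag) {
        Map<Namespace, Map<String, List<String>>> byNamespace =
            cells.getOrDefault(key, Collections.emptyMap());
        Map<String, List<String>> byTag =
            byNamespace.getOrDefault(namespace, Collections.emptyMap());
        return new ArrayList<>(byTag.getOrDefault(tag, Collections.emptyList()));
    }
}
```

With this picture, the buffered main-input elements would live under something like (shardingKey, window namespace, "buffer"), and reading the same tag under a different namespace or key simply finds nothing.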

I don't want to go too much into state accessed in the DoFn as I haven't
yet got a chance to prepare and publish the design doc for that, and I want
everyone to have access to it for any discussion.

Does this help?
