Posted to user@beam.apache.org by Edward Hartwell Goose <ed...@gmail.com> on 2017/08/28 14:38:01 UTC

Long windows / lagged events

Hi,

I'm starting to learn about Apache Beam, and I'm curious whether our data
sets fit into the model.

We record a set of events per user, broadly simplified down to purchases
and shares. In its simplest form: someone buys something and then posts
about it on Facebook at some point afterwards.

The events could occur potentially weeks apart - e.g. I purchase something
today, 2 weeks later I have a good experience with the product and then
share it on Facebook.

I'd like to identify the "influencing" event that triggered the share,
which is most likely the most recent purchase prior to that share. For
instance:

T0: Purchase 1
T1: Purchase 2
T2: Purchase 3
T3: Share 1
T4: Purchase 4
T5: Share 2

I believe that events T0 and T1 likely also influence T3, but I'd like to
broadly attribute T3 to T2 (the most recent prior purchase), and ideally
pass the result to some sort of Combiner to be merged with other data.
Perhaps something like this at a first pass:

User X, Event T3, Influenced by Purchase 3 at T2
User X, Event T5, Influenced by Purchase 4 at T4
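
To make the rule concrete, here's roughly the logic I have in mind as a
plain Python sketch (the event fields are invented for illustration; ours
differ):

def attribute(events):
    # 'events' is one user's events, sorted by timestamp; each event is
    # a dict with an (invented) 'type' field.
    last_purchase = None
    for event in events:
        if event["type"] == "purchase":
            last_purchase = event
        elif event["type"] == "share" and last_purchase is not None:
            # Attribute the share to the most recent prior purchase.
            yield (event, last_purchase)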

I'd read that if the window is long (e.g. > 24 hours), a lot of data has
to be buffered until the window closes, and that this causes problems. I'd
be happy with a cutoff somewhere in the region of a few months, but it
certainly needs to be longer than 24 hours.

For extra bonus points, I'd like to be able to say something like this too:
User X, Event T3, Total Prior Purchases = £X, Total Number of Purchases = 3

Is it possible to do that with Beam? Or is there an alternative way of
solving that problem?

If it's relevant, I'd most likely be using the batch processing model to
start, and our dataset size is ~30-50 million users with around 100 million
events (i.e. most users generate a small number of events).

Thanks,
Ed

Re: Long windows / lagged events

Posted by Lukasz Cwik <lc...@google.com>.
Using a bounded (batch-style) pipeline, you should be able to just group
all events by user, ignore windowing completely, and produce any
information you need, since you'll have a global view of each user's
events. This scales well because a user's data is only held until it is
processed and can then be garbage collected. Building this pipeline should
be fast, and it should let you answer the question of when data becomes
irrelevant for a user.
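
For example, here is a minimal batch sketch with the Python SDK. The input
source, the JSON field names ('user', 'type', 'timestamp', 'amount'), and
the output path are all assumptions, not something from your setup:

import json
import apache_beam as beam

def attribute_shares(user_and_events):
    # After GroupByKey each element is (user, iterable of events).
    # Attribute each share to the most recent prior purchase, and carry
    # running purchase totals for the "bonus points" output.
    user, events = user_and_events
    last_purchase, total_spend, num_purchases = None, 0, 0
    for e in sorted(events, key=lambda e: e['timestamp']):
        if e['type'] == 'purchase':
            last_purchase = e
            total_spend += e['amount']
            num_purchases += 1
        elif e['type'] == 'share' and last_purchase is not None:
            yield {'user': user,
                   'share_at': e['timestamp'],
                   'influenced_by': last_purchase['timestamp'],
                   'prior_spend': total_spend,
                   'prior_purchases': num_purchases}

with beam.Pipeline() as p:
    (p
     | beam.io.ReadFromText('gs://my-bucket/events.jsonl')  # hypothetical input
     | beam.Map(json.loads)
     | beam.Map(lambda e: (e['user'], e))
     | beam.GroupByKey()
     | beam.FlatMap(attribute_shares)
     | beam.Map(json.dumps)
     | beam.io.WriteToText('gs://my-bucket/attributions'))  # hypothetical output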

Using what you learn from that, you should be able to build a pipeline
over an unbounded data source. You can use a really long window (such as a
calendar window (month/year), or a session window with a gap duration of
many months) together with an early-firing trigger to produce intermediate
output. If you store the per-user data in state
(see https://beam.apache.org/blog/2017/02/13/stateful-processing.html),
you can compute both kinds of output (the attribution records and the
running totals). The windows will finish according to the windowing
strategy you choose, at which point you can emit a final record saying
that the user has become inactive, if you so choose.
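
As a rough sketch of those two ingredients in the Python SDK (the
durations, firing cadence, and field names below are assumptions, not
recommendations):

import apache_beam as beam
from apache_beam import window
from apache_beam.coders import PickleCoder
from apache_beam.transforms import trigger
from apache_beam.transforms.userstate import ReadModifyWriteStateSpec

# (1) A long session window with an early-firing trigger, for the
# aggregation path (e.g. a per-user Combine per window).
long_sessions = beam.WindowInto(
    window.Sessions(gap_size=90 * 24 * 60 * 60),       # ~3-month gap
    trigger=trigger.AfterWatermark(
        early=trigger.AfterProcessingTime(10 * 60)),   # early firings ~10-minutely
    accumulation_mode=trigger.AccumulationMode.ACCUMULATING)

# (2) Per-user state for attributing each share as it arrives. Note that
# user state requires a non-merging window (e.g. the global window or a
# long fixed window), so this runs alongside, not inside, the sessions.
class AttributeFn(beam.DoFn):
    LAST_PURCHASE = ReadModifyWriteStateSpec('last_purchase', PickleCoder())

    def process(self, element,
                last_purchase=beam.DoFn.StateParam(LAST_PURCHASE)):
        user, event = element  # input must be keyed by user
        if event['type'] == 'purchase':
            last_purchase.write(event)
        elif event['type'] == 'share':
            prev = last_purchase.read()
            if prev is not None:
                yield {'user': user,
                       'share_at': event['timestamp'],
                       'influenced_by': prev['timestamp']}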
