You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flink.apache.org by Aljoscha Krettek <al...@apache.org> on 2015/05/12 18:30:27 UTC

[DISCUSS] Access to Time and Window in Streaming Operations

Hi,
I'll try to make it quick this time. I think we need to make
information about the event time of an element and information about
windows in which it resides accessible to the user. A simple example
would be the aggregation of some user behaviour, for example:

in = clickSource()

analysedData = in
  .window(10 minutes).every(5 minutes)
  .groupBy("userId")
  .filter(is something interesting)
  .sum("something")

analysedData.storeToMySystem()

Now the results of this window aggregation tell me that at some point,
there was some window and in this window some attribute summed up to
this. This might not be very helpful. What might be helpful is the
information that there occurred a spike in something at 12:45 on
Wednesday. Therefore, I think we need to make this information
available somehow.

I only have some rough Ideas about how this might work, but I would
first like to discuss whether others even think this necessary. So
fire away...

Cheers,
Aljoscha

Re: [DISCUSS] Access to Time and Window in Streaming Operations

Posted by Gyula Fóra <gy...@gmail.com>.
Hi,
This was the exact need that motivated me to rework the windowing and
introduce the StreamWindow abstraction which can hold any metadata that
represents the current window.

At this moment it only contains a unique id but this could be extended
easily.

When the user created a windoweddatastream by applying some windowed
tranformation he can call .getDiscretizedStream() which will return a
stream of StreamWindows from which the user can extract any metadata
afterwards.

So this is practically something we can add already easily and no need to
rewrite any logic.

Cheers,
Gyula

On Tuesday, May 12, 2015, Aljoscha Krettek <al...@apache.org> wrote:

> Hi,
> I'll try to make it quick this time. I think we need to make
> information about the event time of an element and information about
> windows in which it resides accessible to the user. A simple example
> would be the aggregation of some user behaviour, for example:
>
> in = clickSource()
>
> analysedData = in
>   .window(10 minutes).every(5 minutes)
>   .groupBy("userId")
>   .filter(is something interesting)
>   .sum("something")
>
> analysedData.storeToMySystem()
>
> Now the results of this window aggregation tell me that at some point,
> there was some window and in this window some attribute summed up to
> this. This might not be very helpful. What might be helpful is the
> information that there occurred a spike in something at 12:45 on
> Wednesday. Therefore, I think we need to make this information
> available somehow.
>
> I only have some rough Ideas about how this might work, but I would
> first like to discuss whether others even think this necessary. So
> fire away...
>
> Cheers,
> Aljoscha
>

Re: [DISCUSS] Access to Time and Window in Streaming Operations

Posted by Bruno Cadonna <ca...@informatik.hu-berlin.de>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi,

Aljoscha originally wanted to discuss whether it is necessary

<quote>
to make information about the event time of an element and information
about windows in which it resides accessible to the user.
</quote>

I guess, we all agree that this information is necessary to timestamp
the results and that timestamping the result is important. Don't we?

Cheers,
Bruno

On 13.05.2015 11:45, Stephan Ewen wrote:
> Okay, I thought that this thread is about how to make timestamps
> and window information of a record accessible to the user code. 
> This involves how represent the information in which window a
> record is, whether to attach it to that records, etc
> 
> The semantics of what happens if you have multiple windows after
> another (on time and other metrics) is orthogonal to that, as far
> as I can tell. This deals with the question how and if timestamps
> are re-assigned, how punctuations and watermarks propagate...
> 
> On Wed, May 13, 2015 at 11:40 AM, Gyula Fóra <gy...@apache.org>
> wrote:
> 
>> I think this is that thread :)
>> 
>> But as I said it is just a matter of what we want to add, and we
>> can already do it.
>> 
>> On Wed, May 13, 2015 at 11:37 AM, Stephan Ewen <se...@apache.org>
>> wrote:
>> 
>>> This is a pretty central question, actually (timestamping the
>>> results of windows). Let us kick off a separate thread for
>>> this...
>>> 
>>> On Wed, May 13, 2015 at 9:20 AM, Bruno Cadonna < 
>>> cadonna@informatik.hu-berlin.de> wrote:
>>> 
> Hi,
> 
> I think timestamping results of a time window operator is
> essential. Without timestamps in the results, it is not possible to
> execute two time window operators one after the other.
> 
> Cheers, Bruno
> 
> On 12.05.2015 18:30, Aljoscha Krettek wrote:
>>>>>> Hi, I'll try to make it quick this time. I think we need
>>>>>> to make information about the event time of an element
>>>>>> and information about windows in which it resides
>>>>>> accessible to the user. A simple example would be the
>>>>>> aggregation of some user behaviour, for example:
>>>>>> 
>>>>>> in = clickSource()
>>>>>> 
>>>>>> analysedData = in .window(10 minutes).every(5 minutes) 
>>>>>> .groupBy("userId") .filter(is something interesting) 
>>>>>> .sum("something")
>>>>>> 
>>>>>> analysedData.storeToMySystem()
>>>>>> 
>>>>>> Now the results of this window aggregation tell me that
>>>>>> at some point, there was some window and in this window
>>>>>> some attribute summed up to this. This might not be very
>>>>>> helpful. What might be helpful is the information that
>>>>>> there occurred a spike in something at 12:45 on
>>>>>> Wednesday. Therefore, I think we need to make this 
>>>>>> information available somehow.
>>>>>> 
>>>>>> I only have some rough Ideas about how this might work,
>>>>>> but I would first like to discuss whether others even
>>>>>> think this necessary. So fire away...
>>>>>> 
>>>>>> Cheers, Aljoscha
>>>>>> 
> 
>>>> 
>>> 
>> 
> 

- -- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Dr. Bruno Cadonna
  Postdoctoral Researcher

  Databases and Information Systems
  Department of Computer Science
  Humboldt-Universität zu Berlin

  http://www.informatik.hu-berlin.de/~cadonnab

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)

iQEcBAEBAgAGBQJVUzbZAAoJEKdCIJx7flKwwEAH/jrhs2h2Ii1cdRWO6E0/h86Y
fHztPR7aYaNjNP9ya7TnYnCDyeAvgOITYEJ5DyjpbXu1yaqJ4ZKi8HF5EZQpamRe
ZGA/3gTLf+JGtI9swTHCPIVFh6BMTJ5R1L8lx4SplqHCjgqulT+TrpCEKyP1t9BA
cTpL6NjtqEXBOk9pPg3IkBwTG3IPR3IlOyxIjIDsYw8O75Csp/Gb9k7QWzlN8g9B
jxzOAyi5ASDBJtvJVupzSkAT/HY9ugo6+kCJGp+BWFqxyfnl0u0tVYE22r1INWjL
x+qPAvF0EePMDYFeOXpRAqPnz6/R8Z8QidhsEGjiVqjctUyiXxgJYLN3RL/DYqI=
=d4Zq
-----END PGP SIGNATURE-----

Re: [DISCUSS] Access to Time and Window in Streaming Operations

Posted by Stephan Ewen <se...@apache.org>.
Okay, I thought that this thread is about how to make timestamps and window
information of a record accessible to the user code.
This involves how represent the information in which window a record is,
whether to attach it to that records, etc

The semantics of what happens if you have multiple windows after another
(on time and other metrics) is orthogonal to that, as far as I can tell.
This deals with the question how and if timestamps are re-assigned, how
punctuations and watermarks propagate...

On Wed, May 13, 2015 at 11:40 AM, Gyula Fóra <gy...@apache.org> wrote:

> I think this is that thread :)
>
> But as I said it is just a matter of what we want to add, and we can
> already do it.
>
> On Wed, May 13, 2015 at 11:37 AM, Stephan Ewen <se...@apache.org> wrote:
>
> > This is a pretty central question, actually (timestamping the results of
> > windows). Let us kick off a separate thread for this...
> >
> > On Wed, May 13, 2015 at 9:20 AM, Bruno Cadonna <
> > cadonna@informatik.hu-berlin.de> wrote:
> >
> > > -----BEGIN PGP SIGNED MESSAGE-----
> > > Hash: SHA1
> > >
> > > Hi,
> > >
> > > I think timestamping results of a time window operator is essential.
> > > Without timestamps in the results, it is not possible to execute two
> > > time window operators one after the other.
> > >
> > > Cheers,
> > > Bruno
> > >
> > > On 12.05.2015 18:30, Aljoscha Krettek wrote:
> > > > Hi, I'll try to make it quick this time. I think we need to make
> > > > information about the event time of an element and information
> > > > about windows in which it resides accessible to the user. A simple
> > > > example would be the aggregation of some user behaviour, for
> > > > example:
> > > >
> > > > in = clickSource()
> > > >
> > > > analysedData = in .window(10 minutes).every(5 minutes)
> > > > .groupBy("userId") .filter(is something interesting)
> > > > .sum("something")
> > > >
> > > > analysedData.storeToMySystem()
> > > >
> > > > Now the results of this window aggregation tell me that at some
> > > > point, there was some window and in this window some attribute
> > > > summed up to this. This might not be very helpful. What might be
> > > > helpful is the information that there occurred a spike in something
> > > > at 12:45 on Wednesday. Therefore, I think we need to make this
> > > > information available somehow.
> > > >
> > > > I only have some rough Ideas about how this might work, but I
> > > > would first like to discuss whether others even think this
> > > > necessary. So fire away...
> > > >
> > > > Cheers, Aljoscha
> > > >
> > >
> > > - --
> > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > >
> > >   Dr. Bruno Cadonna
> > >   Postdoctoral Researcher
> > >
> > >   Databases and Information Systems
> > >   Department of Computer Science
> > >   Humboldt-Universität zu Berlin
> > >
> > >   http://www.informatik.hu-berlin.de/~cadonnab
> > >
> > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > -----BEGIN PGP SIGNATURE-----
> > > Version: GnuPG v1.4.11 (GNU/Linux)
> > >
> > > iQEcBAEBAgAGBQJVUvstAAoJEKdCIJx7flKwHBcIAIeyrcNuGUVqGYHuQmoEPj2r
> > > nFqSW6DZVwoLmqip300kfOdCWiwcnN881xOi9M0+JEaXPo0RyEfCvhsVJuJU3TS7
> > > /9WYSWtlk81UOkkBtJef4YnQUpB4fwIQGymgwx7fHNGCaTdd5qQ9YqAuGrhpPol0
> > > ZEcuZgQJSFGoz9k0kSkzit/2Td6xdkXesKq73ZZDNqVJvyUBu1oZyBs6DE2MHYo/
> > > waWM5iI1G/mKvZR9WaoYcTLi0I2D0zIcEp62zRial18I1D+GCcxYVjqB2dS1eS12
> > > eXMGBcvA+/WFW3oz2Na5f44Us7pb/RA65UminiacoOnVE7n9eLwy3miFUyb+ssE=
> > > =o4iT
> > > -----END PGP SIGNATURE-----
> > >
> >
>

Re: [DISCUSS] Access to Time and Window in Streaming Operations

Posted by Gyula Fóra <gy...@apache.org>.
I think this is that thread :)

But as I said it is just a matter of what we want to add, and we can
already do it.

On Wed, May 13, 2015 at 11:37 AM, Stephan Ewen <se...@apache.org> wrote:

> This is a pretty central question, actually (timestamping the results of
> windows). Let us kick off a separate thread for this...
>
> On Wed, May 13, 2015 at 9:20 AM, Bruno Cadonna <
> cadonna@informatik.hu-berlin.de> wrote:
>
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > Hi,
> >
> > I think timestamping results of a time window operator is essential.
> > Without timestamps in the results, it is not possible to execute two
> > time window operators one after the other.
> >
> > Cheers,
> > Bruno
> >
> > On 12.05.2015 18:30, Aljoscha Krettek wrote:
> > > Hi, I'll try to make it quick this time. I think we need to make
> > > information about the event time of an element and information
> > > about windows in which it resides accessible to the user. A simple
> > > example would be the aggregation of some user behaviour, for
> > > example:
> > >
> > > in = clickSource()
> > >
> > > analysedData = in .window(10 minutes).every(5 minutes)
> > > .groupBy("userId") .filter(is something interesting)
> > > .sum("something")
> > >
> > > analysedData.storeToMySystem()
> > >
> > > Now the results of this window aggregation tell me that at some
> > > point, there was some window and in this window some attribute
> > > summed up to this. This might not be very helpful. What might be
> > > helpful is the information that there occurred a spike in something
> > > at 12:45 on Wednesday. Therefore, I think we need to make this
> > > information available somehow.
> > >
> > > I only have some rough Ideas about how this might work, but I
> > > would first like to discuss whether others even think this
> > > necessary. So fire away...
> > >
> > > Cheers, Aljoscha
> > >
> >
> > - --
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> >   Dr. Bruno Cadonna
> >   Postdoctoral Researcher
> >
> >   Databases and Information Systems
> >   Department of Computer Science
> >   Humboldt-Universität zu Berlin
> >
> >   http://www.informatik.hu-berlin.de/~cadonnab
> >
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > -----BEGIN PGP SIGNATURE-----
> > Version: GnuPG v1.4.11 (GNU/Linux)
> >
> > iQEcBAEBAgAGBQJVUvstAAoJEKdCIJx7flKwHBcIAIeyrcNuGUVqGYHuQmoEPj2r
> > nFqSW6DZVwoLmqip300kfOdCWiwcnN881xOi9M0+JEaXPo0RyEfCvhsVJuJU3TS7
> > /9WYSWtlk81UOkkBtJef4YnQUpB4fwIQGymgwx7fHNGCaTdd5qQ9YqAuGrhpPol0
> > ZEcuZgQJSFGoz9k0kSkzit/2Td6xdkXesKq73ZZDNqVJvyUBu1oZyBs6DE2MHYo/
> > waWM5iI1G/mKvZR9WaoYcTLi0I2D0zIcEp62zRial18I1D+GCcxYVjqB2dS1eS12
> > eXMGBcvA+/WFW3oz2Na5f44Us7pb/RA65UminiacoOnVE7n9eLwy3miFUyb+ssE=
> > =o4iT
> > -----END PGP SIGNATURE-----
> >
>

Re: [DISCUSS] Access to Time and Window in Streaming Operations

Posted by Stephan Ewen <se...@apache.org>.
This is a pretty central question, actually (timestamping the results of
windows). Let us kick off a separate thread for this...

On Wed, May 13, 2015 at 9:20 AM, Bruno Cadonna <
cadonna@informatik.hu-berlin.de> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi,
>
> I think timestamping results of a time window operator is essential.
> Without timestamps in the results, it is not possible to execute two
> time window operators one after the other.
>
> Cheers,
> Bruno
>
> On 12.05.2015 18:30, Aljoscha Krettek wrote:
> > Hi, I'll try to make it quick this time. I think we need to make
> > information about the event time of an element and information
> > about windows in which it resides accessible to the user. A simple
> > example would be the aggregation of some user behaviour, for
> > example:
> >
> > in = clickSource()
> >
> > analysedData = in .window(10 minutes).every(5 minutes)
> > .groupBy("userId") .filter(is something interesting)
> > .sum("something")
> >
> > analysedData.storeToMySystem()
> >
> > Now the results of this window aggregation tell me that at some
> > point, there was some window and in this window some attribute
> > summed up to this. This might not be very helpful. What might be
> > helpful is the information that there occurred a spike in something
> > at 12:45 on Wednesday. Therefore, I think we need to make this
> > information available somehow.
> >
> > I only have some rough Ideas about how this might work, but I
> > would first like to discuss whether others even think this
> > necessary. So fire away...
> >
> > Cheers, Aljoscha
> >
>
> - --
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>   Dr. Bruno Cadonna
>   Postdoctoral Researcher
>
>   Databases and Information Systems
>   Department of Computer Science
>   Humboldt-Universität zu Berlin
>
>   http://www.informatik.hu-berlin.de/~cadonnab
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.11 (GNU/Linux)
>
> iQEcBAEBAgAGBQJVUvstAAoJEKdCIJx7flKwHBcIAIeyrcNuGUVqGYHuQmoEPj2r
> nFqSW6DZVwoLmqip300kfOdCWiwcnN881xOi9M0+JEaXPo0RyEfCvhsVJuJU3TS7
> /9WYSWtlk81UOkkBtJef4YnQUpB4fwIQGymgwx7fHNGCaTdd5qQ9YqAuGrhpPol0
> ZEcuZgQJSFGoz9k0kSkzit/2Td6xdkXesKq73ZZDNqVJvyUBu1oZyBs6DE2MHYo/
> waWM5iI1G/mKvZR9WaoYcTLi0I2D0zIcEp62zRial18I1D+GCcxYVjqB2dS1eS12
> eXMGBcvA+/WFW3oz2Na5f44Us7pb/RA65UminiacoOnVE7n9eLwy3miFUyb+ssE=
> =o4iT
> -----END PGP SIGNATURE-----
>

Re: [DISCUSS] Access to Time and Window in Streaming Operations

Posted by Bruno Cadonna <ca...@informatik.hu-berlin.de>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi,

I think timestamping results of a time window operator is essential.
Without timestamps in the results, it is not possible to execute two
time window operators one after the other.

Cheers,
Bruno

On 12.05.2015 18:30, Aljoscha Krettek wrote:
> Hi, I'll try to make it quick this time. I think we need to make 
> information about the event time of an element and information
> about windows in which it resides accessible to the user. A simple
> example would be the aggregation of some user behaviour, for
> example:
> 
> in = clickSource()
> 
> analysedData = in .window(10 minutes).every(5 minutes) 
> .groupBy("userId") .filter(is something interesting) 
> .sum("something")
> 
> analysedData.storeToMySystem()
> 
> Now the results of this window aggregation tell me that at some
> point, there was some window and in this window some attribute
> summed up to this. This might not be very helpful. What might be
> helpful is the information that there occurred a spike in something
> at 12:45 on Wednesday. Therefore, I think we need to make this
> information available somehow.
> 
> I only have some rough Ideas about how this might work, but I
> would first like to discuss whether others even think this
> necessary. So fire away...
> 
> Cheers, Aljoscha
> 

- -- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Dr. Bruno Cadonna
  Postdoctoral Researcher

  Databases and Information Systems
  Department of Computer Science
  Humboldt-Universität zu Berlin

  http://www.informatik.hu-berlin.de/~cadonnab

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)

iQEcBAEBAgAGBQJVUvstAAoJEKdCIJx7flKwHBcIAIeyrcNuGUVqGYHuQmoEPj2r
nFqSW6DZVwoLmqip300kfOdCWiwcnN881xOi9M0+JEaXPo0RyEfCvhsVJuJU3TS7
/9WYSWtlk81UOkkBtJef4YnQUpB4fwIQGymgwx7fHNGCaTdd5qQ9YqAuGrhpPol0
ZEcuZgQJSFGoz9k0kSkzit/2Td6xdkXesKq73ZZDNqVJvyUBu1oZyBs6DE2MHYo/
waWM5iI1G/mKvZR9WaoYcTLi0I2D0zIcEp62zRial18I1D+GCcxYVjqB2dS1eS12
eXMGBcvA+/WFW3oz2Na5f44Us7pb/RA65UminiacoOnVE7n9eLwy3miFUyb+ssE=
=o4iT
-----END PGP SIGNATURE-----