You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@samza.apache.org by Geoffry Sumter <vi...@gmail.com> on 2015/02/24 01:49:23 UTC

Reprocessing and windowing

Hey everyone,

I've been thinking about reprocessing
<http://samza.apache.org/learn/documentation/0.7.0/jobs/reprocessing.html> when
my job has windowed state
<http://samza.apache.org/learn/documentation/0.7.0/container/state-management.html#windowed-aggregation>
and
I have a few questions.

Context: I have a series of physical sensors that stream partial scans of
their surroundings over the course of ~5-10 minutes (at the end of 5-10
minutes its provided a full scan of its surroundings and starts over from
the start). Each packet of data has a timestamp that we're just going to
have to trust is 'close enough.' When processing in real-time it's very
natural to window the data every 5 minutes (wall clock) and merge into a
complete view of their collective surroundings. For our purposes, old data
arriving > 5 minutes late is no longer useful for many applications.

Now, I'd love to be able to reprocess data, especially by increasing
parallelism and processing quickly, but this seems incompatible with using
wall-clock-based windowed state. I could base my windowing/binning on the
timestamps provided by my input data, but then I have to be careful to
handle cases where some of my data may be arbitrarily delayed and the
possibility that one partition will get significantly ahead of other ones
(less interesting surroundings and faster computations) which makes waiting
for a majority of partitions to get to a certain point in time difficult.

Does anyone have experience with windowing and reprocessing? Any literature
recommendations?

Thanks!
Geoffry

Re: Reprocessing and windowing

Posted by Yi Pan <ni...@gmail.com>.
Hey, Geoffry,

We have started some work in SAMZA-552 to create a window operator API in
samza, as part of effort to implement support for a high-level language. I
will probably be able to have something to share in a few days and would
love to get feedbacks regarding to the window operator.

Thanks!

On Mon, Feb 23, 2015 at 4:49 PM, Geoffry Sumter <vi...@gmail.com> wrote:

> Hey everyone,
>
> I've been thinking about reprocessing
> <http://samza.apache.org/learn/documentation/0.7.0/jobs/reprocessing.html>
> when
> my job has windowed state
> <
> http://samza.apache.org/learn/documentation/0.7.0/container/state-management.html#windowed-aggregation
> >
> and
> I have a few questions.
>
> Context: I have a series of physical sensors that stream partial scans of
> their surroundings over the course of ~5-10 minutes (at the end of 5-10
> minutes its provided a full scan of its surroundings and starts over from
> the start). Each packet of data has a timestamp that we're just going to
> have to trust is 'close enough.' When processing in real-time it's very
> natural to window the data every 5 minutes (wall clock) and merge into a
> complete view of their collective surroundings. For our purposes, old data
> arriving > 5 minutes late is no longer useful for many applications.
>
> Now, I'd love to be able to reprocess data, especially by increasing
> parallelism and processing quickly, but this seems incompatible with using
> wall-clock-based windowed state. I could base my windowing/binning on the
> timestamps provided by my input data, but then I have to be careful to
> handle cases where some of my data may be arbitrarily delayed and the
> possibility that one partition will get significantly ahead of other ones
> (less interesting surroundings and faster computations) which makes waiting
> for a majority of partitions to get to a certain point in time difficult.
>
> Does anyone have experience with windowing and reprocessing? Any literature
> recommendations?
>
> Thanks!
> Geoffry
>

Re: Reprocessing and windowing

Posted by Roger Hoover <ro...@gmail.com>.
Hi Geoffry,

You might find the Google Millwheel paper and recent talk relevant.  That system supports windows based on event time as well as reprocessing.

Sent from my iPhone

> On Feb 23, 2015, at 4:49 PM, Geoffry Sumter <vi...@gmail.com> wrote:
> 
> Hey everyone,
> 
> I've been thinking about reprocessing
> <http://samza.apache.org/learn/documentation/0.7.0/jobs/reprocessing.html> when
> my job has windowed state
> <http://samza.apache.org/learn/documentation/0.7.0/container/state-management.html#windowed-aggregation>
> and
> I have a few questions.
> 
> Context: I have a series of physical sensors that stream partial scans of
> their surroundings over the course of ~5-10 minutes (at the end of 5-10
> minutes its provided a full scan of its surroundings and starts over from
> the start). Each packet of data has a timestamp that we're just going to
> have to trust is 'close enough.' When processing in real-time it's very
> natural to window the data every 5 minutes (wall clock) and merge into a
> complete view of their collective surroundings. For our purposes, old data
> arriving > 5 minutes late is no longer useful for many applications.
> 
> Now, I'd love to be able to reprocess data, especially by increasing
> parallelism and processing quickly, but this seems incompatible with using
> wall-clock-based windowed state. I could base my windowing/binning on the
> timestamps provided by my input data, but then I have to be careful to
> handle cases where some of my data may be arbitrarily delayed and the
> possibility that one partition will get significantly ahead of other ones
> (less interesting surroundings and faster computations) which makes waiting
> for a majority of partitions to get to a certain point in time difficult.
> 
> Does anyone have experience with windowing and reprocessing? Any literature
> recommendations?
> 
> Thanks!
> Geoffry