Posted to dev@samza.apache.org by Kishore N C <ki...@gmail.com> on 2015/10/14 18:33:45 UTC

Detecting "done" on a bounded input dataset

Hi,

Our data processing pipeline consists of a set of Samza jobs that form a
DAG. Sometimes, we have to feed finite datasets into the Kafka topic that
acts as the entry point to the pipeline. Given that the Samza jobs in the
DAG can have varying processing latencies (or can even temporarily fail or
get stuck), how do I detect that my assembly of jobs has finished
processing all records? It's not as simple as tallying the input and
output record counts, as some jobs could be filtering data, others could
be grouping records, etc.

Thanks,

Kishore.
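[Editor's note: since tallying record counts across the DAG does not work, one common pattern is an offset-based "drain" check: snapshot the end offsets of the bounded entry topic once loading finishes, then declare each stage done when it has consumed up to the end offset of every one of its input partitions, walking the DAG upstream-first so a downstream end offset is only captured after its upstream stage is drained. A minimal, self-contained sketch of the bookkeeping follows; the class and method names are hypothetical, not Samza APIs.]

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the "drain" check. Each stage's inputs are a map
// from partition to {consumedOffset, endOffset}; the pipeline is done when
// every stage has consumed up to the snapshotted end of all its inputs.
// Caveat: the end-offset snapshots must be taken in topological order, so a
// downstream topic's end offset is only fixed after its producer is drained.
class PipelineDoneChecker {

    /** A stage is drained when no input partition is behind its end offset. */
    static boolean stageDrained(Map<Integer, long[]> inputs) {
        for (long[] offsets : inputs.values()) {
            if (offsets[0] < offsets[1]) {
                return false; // still behind on this partition
            }
        }
        return true;
    }

    /** Stages listed in topological order; done only when all are drained. */
    static boolean pipelineDone(List<Map<Integer, long[]>> stages) {
        for (Map<Integer, long[]> stage : stages) {
            if (!stageDrained(stage)) {
                return false;
            }
        }
        return true;
    }
}
```

The per-partition offsets here would come from the jobs' checkpoints and the brokers' log-end offsets; the sketch only captures the decision logic.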

Re: Detecting "done" on a bounded input dataset

Posted by Kishore N C <ki...@gmail.com>.
Hi Yi,

Detecting both scenarios will be useful.

Scenario 1: To detect that the reprocessing job has "caught up" with the
stream, as described by Jay Kreps here
<http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html>,
so that we can point the app at the new DB and tear down the old version
of the jobs and the DB.
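[Editor's note: for this scenario, "caught up" is usually defined as total consumer lag (log-end offset minus committed offset, summed over partitions) falling below a threshold, since a live stream keeps growing and lag never reaches exactly zero. A minimal sketch of that check, with hypothetical names (this is not a Samza or Kafka API):]

```java
import java.util.Map;

// Hypothetical helper: decides whether the reprocessing consumer has
// "caught up" by comparing committed offsets against broker log-end
// offsets, allowing a small residual lag on a live stream.
class CatchUpChecker {
    private final long maxAllowedLag;

    CatchUpChecker(long maxAllowedLag) {
        this.maxAllowedLag = maxAllowedLag;
    }

    /** Total lag = sum over partitions of (logEndOffset - committedOffset). */
    long totalLag(Map<Integer, Long> logEndOffsets,
                  Map<Integer, Long> committedOffsets) {
        long lag = 0;
        for (Map.Entry<Integer, Long> e : logEndOffsets.entrySet()) {
            long committed = committedOffsets.getOrDefault(e.getKey(), 0L);
            lag += Math.max(0, e.getValue() - committed);
        }
        return lag;
    }

    /** Caught up when total lag is at or below the configured threshold. */
    boolean isCaughtUp(Map<Integer, Long> logEndOffsets,
                       Map<Integer, Long> committedOffsets) {
        return totalLag(logEndOffsets, committedOffsets) <= maxAllowedLag;
    }
}
```

In practice the log-end offsets would be fetched from the brokers and the committed offsets from the job's checkpoint topic; a monitor polling this check can trigger the DB cutover.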

Scenario 2: We want to reuse our stream infrastructure/code for bounded
datasets, so that we don't have to write the same jobs on Hadoop (which is
how it is done right now). So, detecting the end of the input is required
for shutting down the jobs.
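[Editor's note: for shutdown on a bounded input, one approach is an end-of-stream sentinel: the producer appends one marker message to every partition of the bounded topic after the last real record, and a task shuts itself down once it has seen a marker from each of its input partitions. A self-contained sketch of the tracking logic, with hypothetical names (not a Samza API):]

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical tracker: the producer writes one end-of-stream sentinel per
// partition of the bounded input topic; the consuming task records each
// sentinel it sees and knows it is done once every partition is accounted
// for. Duplicate sentinels (e.g. after a restart) are idempotent.
class EndOfStreamTracker {
    private final int expectedPartitions;
    private final Set<Integer> seenPartitions = new HashSet<>();

    EndOfStreamTracker(int expectedPartitions) {
        this.expectedPartitions = expectedPartitions;
    }

    /** Record a sentinel from a partition; true once all partitions are done. */
    boolean markEndOfStream(int partition) {
        seenPartitions.add(partition);
        return allDone();
    }

    boolean allDone() {
        return seenPartitions.size() >= expectedPartitions;
    }
}
```

In a Samza StreamTask, the natural hook is the TaskCoordinator passed to process(), which can request a shutdown; downstream jobs would in turn forward their own sentinels once done, so the markers propagate through the DAG.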

Regards,

Kishore.



On Wed, Oct 14, 2015 at 11:55 PM, Yi Pan <ni...@gmail.com> wrote:

> Hi, Kishore,
>
> First I want some clarification on your use case.
> 1) Scenario 1: you still want the Samza jobs to keep running, and simply
> want to detect the end of a certain stream. On detection, do you need to
> unsubscribe from the stream? Or are you still OK receiving more messages
> from the stream?
> 2) Scenario 2: you want the Samza jobs to shut down when detecting the end
> of a certain stream.
>
> Which scenario are you targeting?
>
> Thanks!
>
> -Yi
>
> On Wed, Oct 14, 2015 at 9:33 AM, Kishore N C <ki...@gmail.com> wrote:
>
> > Hi,
> >
> > Our data processing pipeline consists of a set of Samza jobs that form a
> > DAG. Sometimes, we have to feed finite datasets into the Kafka topic that
> > acts as the entry point to the pipeline. Given that the Samza jobs in the
> > DAG can have varying processing latencies (or can even temporarily fail
> > or get stuck), how do I detect that my assembly of jobs has finished
> > processing all records? It's not as simple as tallying the input and
> > output record counts, as some jobs could be filtering data, others could
> > be grouping records, etc.
> >
> > Thanks,
> >
> > Kishore.
> >
>



-- 
It is our choices that show what we truly are,
far more than our abilities.

Re: Detecting "done" on a bounded input dataset

Posted by Yi Pan <ni...@gmail.com>.
Hi, Kishore,

First I want some clarification on your use case.
1) Scenario 1: you still want the Samza jobs to keep running, and simply
want to detect the end of a certain stream. On detection, do you need to
unsubscribe from the stream? Or are you still OK receiving more messages
from the stream?
2) Scenario 2: you want the Samza jobs to shut down when detecting the end
of a certain stream.

Which scenario are you targeting?

Thanks!

-Yi

On Wed, Oct 14, 2015 at 9:33 AM, Kishore N C <ki...@gmail.com> wrote:

> Hi,
>
> Our data processing pipeline consists of a set of Samza jobs that form a
> DAG. Sometimes, we have to feed finite datasets into the Kafka topic that
> acts as the entry point to the pipeline. Given that the Samza jobs in the
> DAG can have varying processing latencies (or can even temporarily fail or
> get stuck), how do I detect that my assembly of jobs has finished
> processing all records? It's not as simple as tallying the input and
> output record counts, as some jobs could be filtering data, others could
> be grouping records, etc.
>
> Thanks,
>
> Kishore.
>