You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@samza.apache.org by "Jarrad, Ken " <ke...@citi.com.INVALID> on 2016/09/06 13:55:28 UTC

checkpoint on flush of system producer

I am writing a SystemProducer and I would be grateful for any comments or documentation.

I want my SystemProducer to collect messages until flush is invoked.
That is, I don't want to transmit each message 'one by one'.
I want to wait until flush is invoked and then transmit 'in bulk'.

I don't know how to do this without violating the 'at least once' guarantee.

I am concerned that any message that is 'sent' from my StreamTask will be included in a checkpoint.
The message might not have been transmitted, however.
The message may reside in my SystemProducer pending flush.

If the StreamTask is re-started from the checkpoint then the message will not be replayed.
[The task input stream is a Kafka topic]

How do I co-ordinate the checkpoint mechanism with SystemProducer so that the checkpoint is delayed until the message is flushed?


RE: checkpoint on flush of system producer

Posted by "Jarrad, Ken " <ke...@citi.com.INVALID>.
Kx Systems aka Q or KDB

-----Original Message-----
From: Yi Pan [mailto:nickpan47@gmail.com] 
Sent: 06 September 2016 17:40
To: dev@samza.apache.org
Subject: Re: checkpoint on flush of system producer

Hi, Jarrad,

Yes! You have found your answer! Looking forward to your implementation of SystemProducer. Just curious, what's the target output system that you are writing to?

-Yi

On Tue, Sep 6, 2016 at 9:01 AM, Jarrad, Ken <ke...@citi.com.invalid>
wrote:

> I think I have discovered the answer to my question. Hopefully someone 
> can confirm my understanding.
>
> The 'flush' method of my SystemProducer will be invoked prior to the 
> checkpoint. Thus the checkpoint cannot get ahead of the SystemProducer.
> Thus the guarantee is 'at least once'.
>
> I will do my data transmission in method 'flush' of SystemProducer. 
> Method 'send' will accumulate the messages, pending flush.
>
> Specifically, method 'commit' of TaskInstance invokes:
>         1. collector.flush
>         2. offsetManager.checkpoint(taskName)
>
> The TaskInstance commit is invoked by the RunLoop either:
>         1. due to time elapsed, or
>         2. by co-ordination
>
> The property task.commit.ms controls the periodic commits.
> The method 'commit' of TaskCoordinator requests a commit (from within 
> method 'process' of StreamTask).
>
> -----Original Message-----
> From: Jarrad, Ken [ICG-IT]
> Sent: 06 September 2016 14:55
> To: 'dev@samza.apache.org'
> Subject: checkpoint on flush of system producer
>
> I am writing a SystemProducer and I would be grateful for any comments 
> or documentation.
>
> I want my SystemProducer to collect messages until flush is invoked.
> That is, I don't want to transmit each message 'one by one'.
> I want to wait until flush is invoked and then transmit 'in bulk'.
>
> I don't know how to do this without violating the 'at least once'
> guarantee.
>
> I am concerned that any message that is 'sent' from my StreamTask will 
> be included in a checkpoint.
> The message might not have been transmitted, however.
> The message may reside in my SystemProducer pending flush.
>
> If the StreamTask is re-started from the checkpoint then the message 
> will not be replayed.
> [The task input stream is a Kafka topic]
>
> How do I co-ordinate the checkpoint mechanism with SystemProducer so 
> that the checkpoint is delayed until the message is flushed?
>
>

Re: checkpoint on flush of system producer

Posted by Yi Pan <ni...@gmail.com>.
Hi, Jarrad,

Yes! You have found your answer! Looking forward to your implementation of
SystemProducer. Just curious, what's the target output system that you are
writing to?

-Yi

On Tue, Sep 6, 2016 at 9:01 AM, Jarrad, Ken <ke...@citi.com.invalid>
wrote:

> I think I have discovered the answer to my question. Hopefully someone can
> confirm my understanding.
>
> The 'flush' method of my SystemProducer will be invoked prior to the
> checkpoint. Thus the checkpoint cannot get ahead of the SystemProducer.
> Thus the guarantee is 'at least once'.
>
> I will do my data transmission in method 'flush' of SystemProducer. Method
> 'send' will accumulate the messages, pending flush.
>
> Specifically, method 'commit' of TaskInstance invokes:
>         1. collector.flush
>         2. offsetManager.checkpoint(taskName)
>
> The TaskInstance commit is invoked by the RunLoop either:
>         1. due to time elapsed, or
>         2. by co-ordination
>
> The property task.commit.ms controls the periodic commits.
> The method 'commit' of TaskCoordinator requests a commit (from within
> method 'process' of StreamTask).
>
> -----Original Message-----
> From: Jarrad, Ken [ICG-IT]
> Sent: 06 September 2016 14:55
> To: 'dev@samza.apache.org'
> Subject: checkpoint on flush of system producer
>
> I am writing a SystemProducer and I would be grateful for any comments or
> documentation.
>
> I want my SystemProducer to collect messages until flush is invoked.
> That is, I don't want to transmit each message 'one by one'.
> I want to wait until flush is invoked and then transmit 'in bulk'.
>
> I don't know how to do this without violating the 'at least once'
> guarantee.
>
> I am concerned that any message that is 'sent' from my StreamTask will be
> included in a checkpoint.
> The message might not have been transmitted, however.
> The message may reside in my SystemProducer pending flush.
>
> If the StreamTask is re-started from the checkpoint then the message will
> not be replayed.
> [The task input stream is a Kafka topic]
>
> How do I co-ordinate the checkpoint mechanism with SystemProducer so that
> the checkpoint is delayed until the message is flushed?
>
>

RE: checkpoint on flush of system producer

Posted by "Jarrad, Ken " <ke...@citi.com.INVALID>.
I think I have discovered the answer to my question. Hopefully someone can confirm my understanding.

The 'flush' method of my SystemProducer will be invoked prior to the checkpoint. Thus the checkpoint cannot get ahead of the SystemProducer. Thus the guarantee is 'at least once'.

I will do my data transmission in method 'flush' of SystemProducer. Method 'send' will accumulate the messages, pending flush.

Specifically, method 'commit' of TaskInstance invokes:
	1. collector.flush
	2. offsetManager.checkpoint(taskName)

The TaskInstance commit is invoked by the RunLoop either:
	1. due to time elapsed, or 
	2. by co-ordination

The property task.commit.ms controls the periodic commits. 
The method 'commit' of TaskCoordinator requests a commit (from within method 'process' of StreamTask).

-----Original Message-----
From: Jarrad, Ken [ICG-IT] 
Sent: 06 September 2016 14:55
To: 'dev@samza.apache.org'
Subject: checkpoint on flush of system producer

I am writing a SystemProducer and I would be grateful for any comments or documentation.

I want my SystemProducer to collect messages until flush is invoked.
That is, I don't want to transmit each message 'one by one'.
I want to wait until flush is invoked and then transmit 'in bulk'.

I don't know how to do this without violating the 'at least once' guarantee.

I am concerned that any message that is 'sent' from my StreamTask will be included in a checkpoint.
The message might not have been transmitted, however.
The message may reside in my SystemProducer pending flush.

If the StreamTask is re-started from the checkpoint then the message will not be replayed.
[The task input stream is a Kafka topic]

How do I co-ordinate the checkpoint mechanism with SystemProducer so that the checkpoint is delayed until the message is flushed?