Posted to user@storm.apache.org by Ashok Gupta <gu...@gmail.com> on 2014/05/02 04:12:47 UTC

Question about OpaqueTridentKafkaSpout

Hi,

 I have a theoretical question about the guarantees
OpaqueTridentKafkaSpout provides. I would like to use an example to
illustrate it.

 Suppose a batch with txId 10 has tuples t1, t2, t3, t4, and they come
respectively from Kafka partitions p1, p2, p3, p4. When this batch is played
for the very first time, it fails processing; however, the commit happens for
tuple t3 in the database while it does not happen for tuples t1, t2, t4. Since
the batch failed, the metadata in ZooKeeper is not expected to be updated,
i.e. the offsets for p1, p2, p3, p4 will not be treated as committed. The
batch is expected to be replayed; however, suppose that before it gets
replayed, Kafka partition p3 goes down. What happens now? I understand that
another batch with the same transaction id containing t1, t2, t4 may be
replayed, but since p3 is down, t3 won't be replayed. And since t3 is not
replayed, even if the batch succeeds on replay, the offsets for p3 don't get
updated in ZooKeeper. That is all fine as far as fault tolerance and opaque
behavior are concerned.
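
 For concreteness, I picture the spout keeping per-partition metadata roughly
like this (the offsets below are made up, purely for illustration):

import java.util.HashMap;
import java.util.Map;

class TxId10Metadata {
    // Made-up offsets: the per-partition {startOffset, nextOffset} read for
    // txId 10's first attempt, which should only count as committed once the
    // whole batch succeeds.
    static Map<String, long[]> firstAttempt() {
        Map<String, long[]> meta = new HashMap<String, long[]>();
        meta.put("p1", new long[]{100, 105});
        meta.put("p2", new long[]{200, 203});
        meta.put("p3", new long[]{300, 301});  // the slice containing t3
        meta.put("p4", new long[]{400, 404});
        return meta;  // batch 10 fails, so none of these are committed
    }
}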

 My concern is more about what happens when partition p3 comes back up and the
spout starts reading data from the last offset it committed successfully.
Tuple t3 will then be read from partition p3 again, and since it will land in
a batch with some txId > 10 (say 19), it will be applied to the state again.
This apparently violates the exactly-once semantics.
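
 To make the setup concrete, the kind of topology I have in mind is roughly
this (a minimal sketch; the ZooKeeper address, topic and stream names are made
up, and the class/package names are from the 0.9.x storm-kafka module as I
remember them):

import backtype.storm.generated.StormTopology;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.tuple.Fields;
import storm.kafka.BrokerHosts;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;
import storm.kafka.trident.OpaqueTridentKafkaSpout;
import storm.kafka.trident.TridentKafkaConfig;
import storm.trident.TridentTopology;
import storm.trident.operation.builtin.Count;
import storm.trident.testing.MemoryMapState;

public class OpaqueKafkaCountTopology {
    public static StormTopology build() {
        // Hypothetical ZooKeeper address and topic, just for illustration.
        BrokerHosts zk = new ZkHosts("zkhost:2181");
        TridentKafkaConfig spoutConf = new TridentKafkaConfig(zk, "events");
        spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme()); // emits field "str"

        OpaqueTridentKafkaSpout spout = new OpaqueTridentKafkaSpout(spoutConf);

        TridentTopology topology = new TridentTopology();
        topology.newStream("kafka-events", spout)
                .groupBy(new Fields("str"))
                // This is the "state" my question is about: each batch's counts
                // are committed here together with the batch's txId.
                .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                                     new Fields("count"));
        return topology.build();
    }
}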

 Is the concern genuine or am I missing something?
Regards
-- 
Ashok Gupta,
(+1) 361-522-2172
San Jose, CA

Re: Question about OpaqueTridentKafkaSpout

Posted by Abhishek Bhattacharjee <ab...@gmail.com>.
I understand that. But the first post in the thread says a batch has tuples
t1, t2, ... from partitions p1, p2, ... I don't think it is possible for
tuples from different partitions to form a batch.


On Wed, May 7, 2014 at 8:23 AM, P. Taylor Goetz <pt...@gmail.com> wrote:

> It all depends on the nature of the spout.
>
> With a transactional spout, batches are always the same, even if replayed.
>
> With an opaque spout, batches can change. But you have the guarantee that
> a tuple will only ever be processed successfully in a single batch. If a
> tuple fails in one batch, it could succeed in another.
>
> -Taylor


-- 
*Abhishek Bhattacharjee*
*Pune Institute of Computer Technology*

Re: Question about OpaqueTridentKafkaSpout

Posted by "P. Taylor Goetz" <pt...@gmail.com>.
It all depends on the nature of the spout.

With a transactional spout, batches are always the same, even if replayed.

With an opaque spout, batches can change. But you have the guarantee that a tuple will only ever be processed successfully in a single batch. If a tuple fails in one batch, it could succeed in another.
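
Roughly, an opaque state keeps a previous and current value per key along with
the txId that produced it, something like this (a simplified sketch from
memory, not the exact storm classes; the real ones are
storm.trident.state.OpaqueValue and storm.trident.state.map.OpaqueMap):

class OpaqueCount {
    long txid;  // txId of the batch that produced "curr"
    long curr;  // value after that batch
    long prev;  // value before that batch

    static OpaqueCount applyBatch(OpaqueCount stored, long batchTxid, long batchDelta) {
        OpaqueCount updated = new OpaqueCount();
        updated.txid = batchTxid;
        if (stored != null && stored.txid == batchTxid) {
            // Replay of the same txId, possibly with different tuples (e.g. txId 10
            // replayed without t3): re-apply on top of the value from *before* the
            // earlier attempt, so any partial write from that attempt is rolled back.
            updated.prev = stored.prev;
            updated.curr = stored.prev + batchDelta;
        } else {
            // A new txId (e.g. txId 19 carrying t3 later): roll the value forward once.
            updated.prev = (stored == null) ? 0 : stored.curr;
            updated.curr = updated.prev + batchDelta;
        }
        return updated;
    }
}

So even if a failed attempt managed to write something, the successful attempt
of that same txId recomputes from the previous value and overwrites it, and a
tuple that only shows up later under a new txId rolls the value forward once.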

-Taylor

> On May 6, 2014, at 8:19 PM, Ashok Gupta <gu...@gmail.com> wrote:
> 
> I think it can. That is where the coordinator comes into the picture: the coordinator defines the parameters of a batch, and the emitters do the job of emitting the sub-portions of the batch.

Re: Question about OpaqueTridentKafkaSpout

Posted by Abhishek Bhattacharjee <ab...@gmail.com>.
Have you tried it? I don't think it is possible, because in the coordinator
you specify the number of partitions, and in the emit function you pass the
collector and the partition number as parameters, so a batch has tuples from
a single partition. That is my understanding.
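
The emitter shape I have in mind is roughly this (an illustrative, simplified
sketch, not the exact storm interfaces):

import storm.trident.operation.TridentCollector;
import storm.trident.topology.TransactionAttempt;

// Illustrative only: a simplified shape of an opaque partitioned spout's
// emitter, not the actual storm interface.
interface PartitionedEmitter<PartitionT, MetaT> {
    // Emit one partition's slice of the batch for this transaction attempt,
    // starting from lastMeta (e.g. that partition's last recorded offset),
    // and return the metadata to record if the batch later succeeds.
    MetaT emitPartitionBatch(TransactionAttempt tx, TridentCollector collector,
                             PartitionT partition, MetaT lastMeta);
}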


On Wed, May 7, 2014 at 5:49 AM, Ashok Gupta <gu...@gmail.com> wrote:

> I think it can. That is where the coordinator comes into the picture: the
> coordinator defines the parameters of a batch, and the emitters do the job
> of emitting the sub-portions of the batch.



-- 
*Abhishek Bhattacharjee*
*Pune Institute of Computer Technology*

Re: Question about OpaqueTridentKafkaSpout

Posted by Ashok Gupta <gu...@gmail.com>.
I think it can. That is where the coordinator comes into the picture: the
coordinator defines the parameters of a batch, and the emitters do the job of
emitting the sub-portions of the batch.
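
What I mean is roughly this (a simplified sketch, not the exact storm
interface):

// Illustrative only: a simplified shape of the batch coordinator.
interface BatchCoordinator<PartitionsT> {
    boolean isReady(long txid);            // may a batch for this txId start now?
    PartitionsT getPartitionsForBatch();   // batch-wide partition information that
                                           // the emitters split into sub-portions
}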





On Mon, May 5, 2014 at 12:50 PM, Abhishek Bhattacharjee <
abhishek.bhattacharjee11@gmail.com> wrote:

> Are you sure that a batch can consist of tuples from different partitions?
> I am just asking, I am not sure. If it can, then your question seems valid;
> otherwise it is not valid anymore :-)



-- 
Ashok Gupta,
(+1) 361-522-2172
San Jose, CA

Re: Question about OpaqueTridentKafkaSpout

Posted by Abhishek Bhattacharjee <ab...@gmail.com>.
Are you sure that a batch can consist of tuples from different partitions?
I am just asking, I am not sure. If it can, then your question seems valid;
otherwise it is not valid anymore :-)


-- 
*Abhishek Bhattacharjee*
*Pune Institute of Computer Technology*