Posted to user@spark.apache.org by Shushant Arora <sh...@gmail.com> on 2015/08/10 23:32:13 UTC

avoid duplicates due to executor failure in Spark Streaming

Hi

How can I avoid duplicate processing of Kafka messages in Spark Streaming 1.3
when an executor fails?

1. Can I somehow access the accumulators of a failed task from the retried
task, so that I can skip the events the failed task already processed for this
partition?

2. Or will I have to persist each processed message, check before processing
each message whether the failed task already handled it (roughly the pattern
sketched below), and delete this persisted information at the end of each
batch?
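
For option 2, a minimal sketch of what that per-batch bookkeeping could look
like, assuming a hypothetical ProcessedStore backed by some external key-value
store (HBase, Redis, a database). Everything here except the Spark RDD calls is
a placeholder, not an existing API; the key is whatever uniquely identifies a
message, e.g. topic/partition/offset.

import org.apache.spark.rdd.RDD

object BatchDedup {

  // Hypothetical external key-value store; these methods stand in for your own
  // storage layer and are not Spark APIs.
  trait ProcessedStore extends Serializable {
    def contains(key: String): Boolean   // already handled by an earlier (failed) attempt?
    def put(key: String): Unit           // remember that this message has been handled
    def clear(batchTime: Long): Unit     // drop the bookkeeping once the batch succeeds
  }

  // key uniquely identifies a message (e.g. "topic-partition-offset"); msg is the payload.
  def processBatch(rdd: RDD[(String, String)], store: ProcessedStore,
                   batchTime: Long, post: String => Unit): Unit = {
    rdd.foreachPartition { iter =>
      iter.foreach { case (key, msg) =>
        if (!store.contains(key)) {
          post(msg)        // the non-idempotent side effect (posting to an external server)
          store.put(key)
        }
      }
    }
    store.clear(batchTime) // runs on the driver after all partitions of this batch are done
  }
}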

Re: avoid duplicates due to executor failure in Spark Streaming

Posted by Cody Koeninger <co...@koeninger.org>.
Accumulators aren't going to work to communicate state changes between
executors.  You need external storage.
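
A minimal sketch of that external-storage approach, assuming the direct Kafka
stream from spark-streaming-kafka (so per-partition offset ranges are
available). The OffsetStore trait and its methods are hypothetical stand-ins
for whatever external system you use (HBase, a database, ZooKeeper), not Spark
or Kafka APIs. The store keeps the highest offset already posted for each
partition, so a retried task skips what a failed attempt already sent.

import kafka.serializer.StringDecoder
import org.apache.spark.TaskContext
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object DirectStreamDedup {

  // Hypothetical per-partition watermark store backed by external storage.
  // Implementations must be serializable (or be constructed on the executor).
  trait OffsetStore extends Serializable {
    def highestPosted(topic: String, partition: Int): Long
    def markPosted(topic: String, partition: Int, offset: Long): Unit
  }

  def postWithoutDuplicates(ssc: StreamingContext, kafkaParams: Map[String, String],
                            topics: Set[String], store: OffsetStore,
                            post: String => Unit): Unit = {
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.foreachRDD { rdd =>
      // With the direct stream, RDD partition i corresponds to offsetRanges(i).
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { iter =>
        val range = ranges(TaskContext.get.partitionId)
        val watermark = store.highestPosted(range.topic, range.partition)
        var offset = range.fromOffset
        iter.foreach { case (_, msg) =>
          if (offset > watermark) {   // a failed attempt already posted everything up to the watermark
            post(msg)
            store.markPosted(range.topic, range.partition, offset)
          }
          offset += 1
        }
      }
    }
  }
}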


Re: avoid duplicates due to executor failure in Spark Streaming

Posted by Shushant Arora <sh...@gmail.com>.
What if the processing is neither idempotent nor transactional, say I am
posting events to some external server after processing?

Is it possible to get the accumulator of a failed task in the retried task? Is
there any way to detect whether a task is a retried task or the original task?

I was trying to achieve something like incrementing a counter after each event
is processed, so that if the task fails, the retried task can just ignore the
already-processed events by reading the failed task's counter. Is it possible
to access an accumulator on a per-task basis directly, without writing to HDFS
or HBase?
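
On the "is this a retry" part of the question: the failed attempt's accumulator
updates are discarded, but a task can at least tell that it is a re-execution,
since TaskContext exposes the attempt number (attemptNumber, 0 for the first
attempt; worth checking it is available in your Spark version). A tiny sketch,
for illustration only; knowing it is a retry still does not tell you how far
the failed attempt got, which is why the skip logic has to consult external
storage.

import org.apache.spark.TaskContext
import org.apache.spark.rdd.RDD

object RetryCheck {
  def processWithRetryCheck(rdd: RDD[String], process: String => Unit): Unit = {
    rdd.foreachPartition { iter =>
      if (TaskContext.get.attemptNumber > 0) {
        // This partition is being re-executed after a failure. Knowing that alone
        // doesn't reveal how far the failed attempt got; that has to come from
        // external storage, because the failed attempt's accumulator updates are
        // discarded rather than exposed to the retry.
      }
      iter.foreach(process)
    }
  }
}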




Re: avoid duplicates due to executor failure in Spark Streaming

Posted by Cody Koeninger <co...@koeninger.org>.
http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers

http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations

https://www.youtube.com/watch?v=fXnNEq1v3VA
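
Roughly the pattern those links describe: with the direct (receiver-less)
stream, each RDD partition maps to one Kafka topic/partition and offset range,
so the output operation can commit its results and the offset range together
atomically, or skip the write if that range was already committed. A hedged
sketch, where saveResultsAndOffsets is a placeholder for your own transactional
or idempotent write, not a Spark or Kafka API.

import kafka.serializer.StringDecoder
import org.apache.spark.TaskContext
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

object TransactionalOutput {

  // Placeholder: commit this partition's results together with its offset range
  // atomically in your own datastore, or do nothing if the range is already
  // recorded there.
  def saveResultsAndOffsets(results: Iterator[String], range: OffsetRange): Unit = {
    // depends entirely on the external datastore
  }

  def exactlyOnceOutput(ssc: StreamingContext, kafkaParams: Map[String, String],
                        topics: Set[String]): Unit = {
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { iter =>
        val range = ranges(TaskContext.get.partitionId)
        saveResultsAndOffsets(iter.map { case (_, value) => value }, range)
      }
    }
  }
}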

