Posted to user@spark.apache.org by Grega Kešpret <gr...@celtra.com> on 2013/11/25 09:58:13 UTC

Resubmission due to a fetch failure

Hi!

We use Spark to process logs in batches and persist the end result in a
db. Last week, we re-ran the job on the same data a couple of times, only
to find that one run had more results than the rest. Digging through the
logs, we found out that a task had been lost and marked for resubmission.

I marked the lines here:
https://gist.github.com/gregakespret/7541805#file-spark-fetch-failure-L1432-L1509

Because of that, one block of data was processed twice and the final
result was not correct.

My question is: how can we catch such occurrences in the code, so that we
can do an effective rollback and discard the data that will get recomputed?

Thanks,


Grega
--
*Grega Kešpret*
Analytics engineer

Celtra — Rich Media Mobile Advertising
celtra.com <http://www.celtra.com/> |
@celtramobile <http://www.twitter.com/celtramobile>
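
One way to catch such occurrences in code follows from how accumulators
behave: updates made inside a transformation are applied again whenever a
lost task is recomputed (and also by speculative duplicates). The sketch
below assumes the sc.longAccumulator API of much later Spark releases, and
the input path and parse logic are hypothetical placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // Master is supplied by spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("log-batch"))

    // Hypothetical input path and parsing logic.
    val logs = sc.textFile("hdfs:///logs/2013-11-25/")
    def parse(line: String): String = line.trim // stand-in for real parsing

    val expectedCount = logs.count() // exact: action results are not doubled
    val processed = sc.longAccumulator("processed")

    val results = logs.map { line =>
      processed.add(1L) // re-applied if this partition is recomputed
      parse(line)
    }
    results.foreach(_ => ()) // force evaluation before checking the count

    // A value above the input size means some block ran more than once,
    // so this run should be discarded rather than persisted to the db.
    if (processed.value > expectedCount) {
      sys.error(s"duplicate processing detected: ${processed.value} > $expectedCount")
    }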

Re: Resubmission due to a fetch failure

Posted by Grega Kešpret <gr...@celtra.com>.
Hi!

I tried setting spark.task.maxFailures to 1 (with this patch applied:
https://github.com/apache/incubator-spark/pull/245) and started a job.
After some time, I killed all the JVMs running on one of the two workers.
I was expecting the Spark job to fail; however, it resubmitted the tasks
to the worker that was still alive, and the job succeeded.

Is there some other way I can make a Spark job fail fast?

Grega
--
*Grega Kešpret*
Analytics engineer

Celtra — Rich Media Mobile Advertising
celtra.com <http://www.celtra.com/> |
@celtramobile <http://www.twitter.com/celtramobile>
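
If the goal is to abort as soon as the scheduler starts recomputing work,
one option is a driver-side listener. This is only a sketch: it assumes
the SparkListener, FetchFailed/Resubmitted task-end reasons, and
cancelAllJobs APIs of later Spark releases, and it is not a supported
fail-fast switch in the version discussed here:

    import org.apache.spark.{FetchFailed, Resubmitted, SparkContext}
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Hypothetical fail-fast helper: cancel all running jobs the moment a
    // task ends with a fetch failure or is marked for resubmission, instead
    // of letting the stage be silently recomputed.
    class FailFastListener(sc: SparkContext) extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
        taskEnd.reason match {
          case Resubmitted | _: FetchFailed => sc.cancelAllJobs()
          case _ => // normal completion or other failures: leave them alone
        }
    }

    sc.addSparkListener(new FailFastListener(sc))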



Re: Resubmission due to a fetch failure

Posted by Grega Kešpret <gr...@celtra.com>.
Thanks!

Grega
--
*Grega Kešpret*
Analytics engineer

Celtra — Rich Media Mobile Advertising
celtra.com <http://www.celtra.com/> |
@celtramobile <http://www.twitter.com/celtramobile>



Re: Resubmission due to a fetch failure

Posted by Prashant Sharma <sc...@gmail.com>.
Did you mean spark.task.maxFailures?
http://spark.incubator.apache.org/docs/latest/configuration.html


-- 
s
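
For reference, spark.task.maxFailures (default 4) bounds how many times
any single task may fail before the job is aborted, and it only applies in
cluster mode. In the Spark version discussed here, configuration was done
through Java system properties set on the driver before the SparkContext
was created; a sketch, with a hypothetical master URL:

    import org.apache.spark.SparkContext

    // Must be set before the SparkContext is constructed.
    System.setProperty("spark.task.maxFailures", "1")
    val sc = new SparkContext("spark://master:7077", "log-batch")

Note the caveat: this limits failures of an individual task. Tasks that
are resubmitted because an executor was lost may be tracked separately,
which could explain why killing a worker (earlier in this thread) did not
fail the job.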

Re: Resubmission due to a fetch failure

Posted by Grega Kešpret <gr...@celtra.com>.
Bumping this thread so it gets attention.

Grega


Re: Resubmission due to a fetch failure

Posted by Grega Kešpret <gr...@celtra.com>.
Also, is there a way to tell Spark that it shouldn't resubmit failed
stages/tasks, but instead fail fast in case any fetch failure occurs?

Grega
--
*Grega Kešpret*
Analytics engineer

Celtra — Rich Media Mobile Advertising
celtra.com <http://www.celtra.com/> |
@celtramobile <http://www.twitter.com/celtramobile>
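
Short of a fail-fast switch, the duplicate-results problem itself can be
avoided by making the final write idempotent, so a recomputed block
replaces its earlier rows instead of adding to them. A minimal sketch:
the table and column names, the JDBC URL, and the assumption that results
is an RDD[String] with deterministic partition contents are all
hypothetical:

    import java.sql.DriverManager

    val batchId = "2013-11-25" // derived from the input, not from the run

    val written = results.mapPartitionsWithIndex { (pid, rows) =>
      val conn = DriverManager.getConnection("jdbc:postgresql://db/logs")
      try {
        // Delete anything a previous attempt of this partition wrote, so
        // resubmission cannot leave duplicate rows behind.
        val del = conn.prepareStatement(
          "DELETE FROM results WHERE batch_id = ? AND partition_id = ?")
        del.setString(1, batchId)
        del.setInt(2, pid)
        del.executeUpdate()

        val ins = conn.prepareStatement(
          "INSERT INTO results (batch_id, partition_id, value) VALUES (?, ?, ?)")
        rows.foreach { r =>
          ins.setString(1, batchId)
          ins.setInt(2, pid)
          ins.setString(3, r)
          ins.addBatch()
        }
        ins.executeBatch()
        Iterator.single(pid)
      } finally {
        conn.close()
      }
    }
    written.count() // mapPartitionsWithIndex is lazy; force the write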

