Posted to user@spark.apache.org by Sunita Arvind <su...@gmail.com> on 2018/01/25 04:34:25 UTC

Re: a way to allow spark job to continue despite task failures?

I had a similar situation and landed on this question.
I was finally able to make it do what I needed by cheating the Spark
driver :)
i.e. by setting a very high value: "--conf spark.task.maxFailures=800".
The default is 4; I made it 800 deliberately so that by the time 800
attempts for the failing tasks were done, the other tasks had completed.
You can set it to a higher or lower value depending on how many other
tasks you have and how long they take to complete.
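
For reference, a minimal sketch of the same setting applied
programmatically (assuming Scala; the app name is just a placeholder):

  import org.apache.spark.{SparkConf, SparkContext}

  // Allow each task up to 800 attempts before Spark fails the job,
  // giving the healthy tasks time to finish.
  val conf = new SparkConf()
    .setAppName("lenient-job") // placeholder name
    .set("spark.task.maxFailures", "800")
  val sc = new SparkContext(conf)

Note that the failing tasks are still retried, so the job may run
longer while they burn through their attempts.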

regards
Sunita

On Fri, Nov 13, 2015 at 4:50 PM, Ted Yu <yu...@gmail.com> wrote:

> I searched the code base and looked at:
> https://spark.apache.org/docs/latest/running-on-yarn.html
>
> I didn't find mapred.max.map.failures.percent or its counterpart.
>
> FYI
>
> On Fri, Nov 13, 2015 at 9:05 AM, Nicolae Marasoiu <
> nicolae.marasoiu@adswizz.com> wrote:
>
>> Hi,
>>
>>
>> I know a task can fail 2 times and only the 3rd breaks the entire job.
>>
>> I am good with this number of attempts.
>>
>> I would like Spark, after trying a task 3 times, to continue with the
>> other tasks.
>>
>> The job can be marked "failed", but I want all tasks to run.
>>
>> Please see my use case.
>>
>>
>> I read a Hadoop input set, and some of the gzip files are incomplete. I
>> would like to just skip them, and the only way I see is to tell Spark to
>> tolerate some permanently failing tasks, if that is possible. With
>> traditional Hadoop map-reduce this was possible using
>> mapred.max.map.failures.percent.
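>>
>> A minimal sketch of what "skip them" could look like as a workaround,
>> reading each file manually and dropping the ones that fail to decompress
>> (assuming Scala, an existing SparkContext sc, and a hypothetical input
>> path):
>>
>>   import java.util.zip.GZIPInputStream
>>   import scala.io.Source
>>   import scala.util.Try
>>
>>   // binaryFiles yields one (path, stream) pair per file, so a corrupt
>>   // file can be skipped without failing the whole task.
>>   val lines = sc.binaryFiles("hdfs:///data/*.gz") // hypothetical path
>>     .flatMap { case (path, stream) =>
>>       Try {
>>         val in = new GZIPInputStream(stream.open())
>>         Source.fromInputStream(in).getLines().toList
>>       }.getOrElse(Nil) // a truncated gzip throws while reading; skip it
>>     }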
>>
>>
>> Do map-reduce params like mapred.max.map.failures.percent apply to
>> Spark jobs on YARN?
>>
>> I edited $HADOOP_CONF_DIR/mapred-site.xml and added
>> mapred.max.map.failures.percent=30, but it does not seem to apply; the
>> job still failed after 3 task attempts.
>>
>>
>> Should Spark forward this parameter? Or do the mapred.* parameters not
>> apply at all?
>>
>> Are other Hadoop parameters taken into account and forwarded - e.g. the
>> ones involved in input reading, rather than "processing" or
>> "application" parameters like max.map.failures.percent? I saw that Spark
>> should scan HADOOP_CONF_DIR and forward its settings, but I guess this
>> does not apply to every parameter, since Spark has its own task
>> distribution and DAG stage processing logic, which just happens to have
>> a YARN implementation.
>>
>>
>> Do you know a way to do this in Spark - to tolerate a predefined number
>> of task failures but allow the job to continue? That way I could see all
>> the faulty input files in one job run, delete them all, and continue
>> with the rest.
>>
>>
>> Just to mention, doing a manual gzip -t on top of hadoop cat is
>> infeasible: map-reduce is far faster at scanning the 15K files worth
>> 70GB (it does about 25MB/s per node), while the old-style hadoop cat
>> does much less.
>>
>>
>> Thanks,
>>
>> Nicu
>>
>
>