Posted to user@spark.apache.org by Ivan Petrov <ca...@gmail.com> on 2020/08/27 10:50:26 UTC

Some sort of chaos monkey for spark jobs, do we have it?

Hi, I'm feeling pain while trying to insert 2-3 million records into
Mongo using a plain Spark RDD. There were so many hidden problems.

I would like to avoid this in the future and am looking for a way to kill
individual Spark tasks at a specific stage and verify the expected behaviour
of my Spark job.

Ideal setup:
1. write the Spark job
2. run the Spark job on YARN
3. run a tool that kills a certain % or number of tasks at a specific stage (rough sketch below)
4. verify the results
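
For what it's worth, a rough sketch of the kind of tool I have in mind, in Scala
(withInjectedFailures, failureRatio and maxInjectedAttempts are names I just made up,
not real Spark settings): wrap the RDD in a mapPartitions that makes a fraction of
first task attempts throw, so the injected failures count towards
spark.task.maxFailures the same way real ones do.

    import scala.reflect.ClassTag
    import scala.util.Random
    import org.apache.spark.TaskContext
    import org.apache.spark.rdd.RDD

    // Fault-injection sketch: a chosen fraction of first task attempts throw on purpose,
    // so retry / maxFailures behaviour can be exercised on the dev cluster.
    def withInjectedFailures[T: ClassTag](rdd: RDD[T],
                                          failureRatio: Double,
                                          maxInjectedAttempts: Int = 1): RDD[T] = {
      rdd.mapPartitions { iter =>
        val ctx = TaskContext.get()
        // only sabotage the first attempt(s) so the job can still succeed on retry
        if (ctx.attemptNumber() < maxInjectedAttempts && Random.nextDouble() < failureRatio) {
          throw new RuntimeException(
            s"chaos test: injected failure in stage ${ctx.stageId()}, partition ${ctx.partitionId()}")
        }
        iter
      }
    }

    // usage: wrap the RDD right before the Mongo write, e.g.
    // MongoSpark.save(withInjectedFailures(documents, failureRatio = 0.1))

There is also SparkContext.killTaskAttempt, which could be driven from a SparkListener
to build a more external "kill N% of tasks in stage X" tool, but I haven't checked
whether killed attempts count towards spark.task.maxFailures the same way real
failures do.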

Real-world scenario.
The Mongo Spark connector makes the very optimistic assumption that an insert never fails.
I've enabled ordered=false so the driver ignores duplicate-record insertions.
It kind of worked, until I met speculative execution.
- A task failed once because of duplicates. That's expected: another task had
already uploaded the same data.
- Then Spark killed the same task twice during speculative execution.
- The whole job failed, since that made 3 failures for a given task
and spark.task.maxFailures=4 (relevant settings recapped below).
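
For reference, the knobs involved, in spark-defaults.conf style (the values here are
just what I'd try on the dev cluster, not my production config):

    # speculative execution: re-launches attempts of slow-looking tasks
    spark.speculation               true
    spark.speculation.quantile      0.75
    spark.speculation.multiplier    1.5
    # number of failures of any particular task before giving up on the job (default 4)
    spark.task.maxFailures          4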

I didn't get three kills on the dev cluster during 100+ runs, but I did get them in
production :) The production cluster is a bit noisy.
Such a chaos monkey would help me tune my job configuration for production
using the dev cluster.