Posted to user@spark.apache.org by Jon Chase <jo...@gmail.com> on 2016/07/20 04:30:59 UTC

Extremely slow shuffle writes and large job time fluctuations

I'm running into an issue with a pyspark job where I'm sometimes seeing
extremely variable job times (20 minutes to 2 hours) and very long shuffle
write times (e.g. ~2 minutes to write 18 KB / 86 records).

The cluster setup is Amazon EMR 4.4.0 with Spark 1.6.0: an m4.2xlarge driver
and a single m4.10xlarge (40 vCPU, 160 GB) worker node running two executors,
with the following params:

--conf spark.driver.memory=10G --conf spark.default.parallelism=160 --conf
spark.driver.maxResultSize=4g --num-executors 2 --executor-cores 20
--executor-memory 67G
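
For reference, a rough sketch of the same settings expressed programmatically
(the real job passes them as the spark-submit flags above; the property names
below are, as far as I know, the standard equivalents of those flags, and
spark.driver.memory in particular only really takes effect as a flag since the
driver JVM is already up by the time SparkConf runs):

    from pyspark import SparkConf, SparkContext

    # Sketch only -- the real job sets these via spark-submit flags.
    # Property names are the standard equivalents of the CLI flags above.
    conf = (SparkConf()
            .set("spark.driver.memory", "10g")        # --conf spark.driver.memory=10G
            .set("spark.default.parallelism", "160")
            .set("spark.driver.maxResultSize", "4g")
            .set("spark.executor.instances", "2")     # --num-executors 2
            .set("spark.executor.cores", "20")        # --executor-cores 20
            .set("spark.executor.memory", "67g"))     # --executor-memory 67G

    sc = SparkContext(conf=conf)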

What's odd is that sometimes the job runs in 20 minutes and sometimes it
takes 2 hours - both runs on the same data.  I'm using RDDs (not
DataFrames).  There's plenty of RAM, and I've looked at the GC logs (we're
using CMS) and they look fine.  The job reads some data from files and does
some maps/filters/joins/etc.; nothing too special - roughly the shape
sketched below.
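
Purely illustrative - the paths, parsing, and keys below are made up - but
the real job follows this read -> map/filter -> join pattern:

    # Illustrative sketch only: hypothetical paths and parsing, just to show
    # the read -> map/filter -> join shape of the real job.
    events = (sc.textFile("s3://bucket/events/")           # hypothetical path
                .map(lambda line: line.split("\t"))
                .filter(lambda fields: len(fields) > 3)
                .map(lambda fields: (fields[0], fields)))  # key by first column

    users = (sc.textFile("s3://bucket/users/")             # hypothetical path
               .map(lambda line: line.split("\t"))
               .map(lambda fields: (fields[0], fields)))

    joined = events.join(users, numPartitions=160)         # the shuffle in question
    counts = joined.map(lambda kv: (kv[0], 1)).reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile("s3://bucket/output/")           # hypothetical path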

The only thing I've noticed that looks odd is that the slow runs of the job
have unusually long Shuffle Write times for some tasks.  For example, a
.join() has ~30 tasks out of 320 that each take 2.5 minutes, with a GC time
of 0.1 seconds, Shuffle Read Size / Records of 12 KB / 30, and, most
interestingly, a Write Time of 2.5 minutes for a Shuffle Write Size /
Records of 18 KB / 86.  Looking at the event timeline for the stage, it's
almost all yellow (Shuffle Write).
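
If it's useful to see the raw numbers, something like this against the
Spark 1.6 REST API should dump the slowest per-task shuffle writes for the
stage (a sketch - the endpoint is the standard /api/v1 one on the driver UI,
but I'm going from memory on the exact JSON field names, so they may need
adjusting):

    import requests  # assumes the requests library is available

    # Sketch: dump the slowest shuffle writes for one stage via the Spark
    # REST API on the driver UI (port 4040). Field names are from memory
    # and may need adjusting against the actual JSON.
    base = "http://localhost:4040/api/v1"
    app_id = requests.get(base + "/applications").json()[0]["id"]
    stage_id, attempt = 42, 0   # hypothetical stage id / attempt

    tasks = requests.get("%s/applications/%s/stages/%d/%d/taskList?length=500"
                         % (base, app_id, stage_id, attempt)).json()

    def write_time(t):
        return (t.get("taskMetrics", {})
                 .get("shuffleWriteMetrics", {})
                 .get("writeTime", 0))

    for t in sorted(tasks, key=write_time, reverse=True)[:30]:
        m = t["taskMetrics"]["shuffleWriteMetrics"]
        # writeTime appears to be reported in nanoseconds
        print("task %s: write %.1f s, %d bytes, %d records"
              % (t["taskId"], m["writeTime"] / 1e9,
                 m["bytesWritten"], m["recordsWritten"]))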

We've been running this job on a different EMR cluster topology (12
m3.2xlarges) and have not seen the slowdown described above.  We've only
observed it on the m4.10xlarge machine.

It might be worth mentioning again that this is pyspark with no DataFrames
(just RDDs).  When I run 'top' I sometimes see lots (e.g. 60 or 70) of
python processes on the executor node (one per partition being processed, I
assume?).
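
For what it's worth, these are the Python-worker settings I'd double-check
(a sketch; spark.python.worker.reuse and spark.python.worker.memory are
standard properties, and the defaults noted in the comments are just my
understanding of them):

    # Check the Python-worker-related settings the context is actually using.
    # My understanding of the defaults: reuse=true, memory=512m.
    conf = sc.getConf()
    print(conf.get("spark.python.worker.reuse", "true"))
    print(conf.get("spark.python.worker.memory", "512m"))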

It seems like this has something to do with the single m4.10xlarge setup, as
we haven't seen this behavior on the 12-node m3.2xlarge cluster.

What I really don't understand is why the job seems to run fine (20
minutes) for a while, and then (for the same data) takes so much longer (2
hours), and with such long shuffle write times.