Posted to user@spark.apache.org by Ashish Jain <as...@gmail.com> on 2014/09/29 19:43:22 UTC

When to start optimizing for GC?

Hello,

I have written a standalone Spark job which I run through the Ooyala Job
Server. The program works correctly; now I'm looking into how to
optimize it.

My program took 4 hours to run without any optimization. The first round of
optimizations (switching to the KryoSerializer, and compiling regex patterns
once so they can be reused) reduced the running time to 2.8 hours. I was
looking into the stages to understand what was going on, when I came across
this -

Duration   GC Time   Result Ser Time
3.4 min    2.8 min   -
3.3 min    2.8 min   10 ms
3.4 min    2.8 min   1 ms
3.3 min    2.8 min   1 ms
3.3 min    2.8 min   -
3.4 min    2.8 min   -
3.3 min    2.8 min   1 ms
3.3 min    2.8 min   -
3.4 min    2.9 min   -
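For reference, the regex part of the optimization mentioned above amounts to hoisting the compile step out of the per-record path, since java.util.regex.Pattern is immutable and thread-safe once built. A minimal Java sketch (the class, field, and method names are illustrative, not from the actual job):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexReuse {
    // Compile once; Pattern is immutable and thread-safe, so one
    // instance can be shared across all records and threads.
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Reuse the precompiled pattern for every record instead of
    // calling Pattern.compile inside the loop.
    static int countWords(String line) {
        Matcher m = WORD.matcher(line);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    public static void main(String[] args) {
        System.out.println(countWords("to be or not to be")); // prints 6
    }
}
```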

Is this an expected amount of time for a program to spend in GC? Or is it
time for me to dive deep into how GC is behaving for my program? Or would it
be easier if I just serialized the RDD onto an SSD and worked off the heap
(using Tachyon)? I'm still relatively new to Spark; there are several ways
of tuning and they are confusing, so please excuse any dumb questions. What
I'm doing -> read n files into n RDDs, cogroup them into 1, then do a
foreach to transform the objects into a string.
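To make the cogroup step concrete: semantically, cogroup gathers, for each key, the values from every input dataset. Assuming just two inputs for brevity, a plain-collections Java sketch of that behavior (no Spark involved; names are illustrative):

```java
import java.util.*;

public class CogroupSketch {
    // For each key present in either input, collect the values from both
    // sides, mirroring what RDD.cogroup yields per key:
    // (Iterable<V1>, Iterable<V2>). Keys absent on one side get an empty list.
    static Map<String, List<List<String>>> cogroup(
            Map<String, List<String>> a, Map<String, List<String>> b) {
        Map<String, List<List<String>>> out = new TreeMap<>();
        Set<String> keys = new TreeSet<>(a.keySet());
        keys.addAll(b.keySet());
        for (String k : keys) {
            out.put(k, Arrays.asList(
                a.getOrDefault(k, Collections.emptyList()),
                b.getOrDefault(k, Collections.emptyList())));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> x = Map.of("a", List.of("1"), "b", List.of("2"));
        Map<String, List<String>> y = Map.of("b", List.of("3"));
        System.out.println(cogroup(x, y));
        // prints {a=[[1], []], b=[[2], [3]]}
    }
}
```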

Setup and settings - 1 machine, 16 cores, 128 GB RAM
Driver memory (Ooyala Job Server) - 90gb using -Xmx90g
spark.executor.memory - 90gb
master = local[16]
spark.storage.memoryFraction=0.3
spark.shuffle.memoryFraction=0.6
spark.local.dir=SSD, input and output directories=SSD
spark.default.parallelism=48
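For reference, one cheap way to see whether that 2.8 min is long collection pauses or many accumulated minor GCs is to enable HotSpot GC logging on the JVM running the job. Standard flags for JDK 7/8 (matching this timeframe; the log path is illustrative):

```
-Xmx90g
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-Xloggc:/tmp/jobserver-gc.log
```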

Any suggestions for optimization are welcome.