Posted to user@spark.apache.org by Rajat Verma <ve...@gmail.com> on 2014/11/05 07:51:23 UTC

How to increase HDFS read parallelism

Hi
I have a simple use case where I have to join two feeds. I have two worker
nodes, each with 96 GB of memory and 24 cores, and I am running Spark (1.1.0)
on YARN (2.4.0).
I have allocated 80% of the resources to the Spark queue, and my Spark config
looks like this:
spark.executor.cores=18
spark.executor.memory=66g
spark.executor.instances=2
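
For reference, this is roughly how I set those values when creating the
context (a simplified sketch; the app name below is just a placeholder, and
the same settings could instead come from spark-defaults.conf or
spark-submit --conf):

import org.apache.spark.{SparkConf, SparkContext}

// Simplified sketch of my context setup; "join-feeds" is a placeholder name.
val conf = new SparkConf()
  .setAppName("join-feeds")
  .set("spark.executor.cores", "18")
  .set("spark.executor.memory", "66g")
  .set("spark.executor.instances", "2")
val sc = new SparkContext(conf)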

My job schedules 400 tasks, and at any given time roughly 40 tasks run in
parallel. The job is IO-bound and CPU utilisation is less than 40%.
The Spark tuning page recommends configuring 2-3 tasks per CPU core. I have
changed spark.default.parallelism to 80, but it is still running only about
40 tasks at a time.
How can I run more tasks in parallel?
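
For context, a simplified sketch of what the job does (the HDFS paths, the
tab-separated key extraction, and the partition counts below are placeholders
standing in for my real code):

import org.apache.spark.SparkContext._   // pair RDD functions (needed in 1.1)

// Read and key the two feeds; the second textFile argument is a hint for the
// minimum number of input partitions.
val feedA = sc.textFile("hdfs:///data/feedA", 80)
  .map(line => (line.split("\t")(0), line))
val feedB = sc.textFile("hdfs:///data/feedB", 80)
  .map(line => (line.split("\t")(0), line))

// An explicit partition count can also be passed to the join itself.
val joined = feedA.join(feedB, 80)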

One more question: I have to cache the combined RDD after the join. Should I
instead run 4 executors with 32 GB of memory each and set
-XX:+UseCompressedOops? What are the pros and cons of doing that?
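
For reference, the alternative I am considering would look roughly like this
(only a sketch; the settings are the ones described above, and the cache/count
calls are illustrative):

// Alternative sizing I am considering, in the same config style as above:
// spark.executor.instances=4
// spark.executor.memory=32g
// spark.executor.extraJavaOptions=-XX:+UseCompressedOops

// `joined` is the pair RDD produced by the join above; cache() keeps it in
// memory for reuse, and a simple action materialises it.
val combined = joined.cache()
combined.count()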

Thanks.