Posted to user@spark.apache.org by diplomatic Guru <di...@gmail.com> on 2015/07/03 14:58:47 UTC

Spark performance issue

Hello guys,

I'm after some advice on Spark performance.

I have a MapReduce job that reads input, carries out a simple calculation, and
writes the results to HDFS. I've implemented the same logic in a Spark job.
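
To give you an idea of the job's shape, here is a stripped-down sketch of the Spark side (the paths, field layout, and per-record calculation below are placeholders, not my real code):

import org.apache.spark.{SparkConf, SparkContext}

object SimpleCalcJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SimpleCalcJob"))

    // Read the input records from HDFS (placeholder path).
    val input = sc.textFile("hdfs:///data/input")

    // Simple per-record calculation (illustrative only; the real one differs).
    val results = input.map { line =>
      val fields = line.split(",")
      (fields(0), fields(1).toDouble * 2)
    }

    // Write the results back to HDFS (placeholder path).
    results.map { case (key, value) => s"$key,$value" }
      .saveAsTextFile("hdfs:///data/output")

    sc.stop()
  }
}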

When I ran both jobs on the same datasets, I got different execution times,
which is expected.

BUT
......
In my example, the MapReduce job is performing much better than the Spark job.

The difference is that I'm not changing much in the MR job's configuration
(memory, cores, etc.), but that's not the case with Spark, which is far more
configurable. So I'm sure my Spark configuration isn't right, which is why MR
is outperforming Spark, but I need your advice.

For example:

Test 1:
4.5GB data - the MR job took ~55 seconds to compute, but the Spark job took
~3 minutes and 20 seconds.

Test 2:
25GB data - MR took 2 minutes and 15 seconds, whereas the Spark job is still
running, and it's already been 15 minutes.


I have a cluster of 15 nodes. The maximum memory I can allocate to each
executor is 6GB. For Test 1, this is the config I used:

--executor-memory 6G --num-executors 4 --driver-memory 6G --executor-cores 2
(I also set "spark.storage.memoryFraction" to 0.3)


For Test 2:
--executor-memory 6G --num-executors 10 --driver-memory 6G --executor-cores 2
(I also set "spark.storage.memoryFraction" to 0.3)
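
For reference, the full submission command looks roughly like this (the class name and jar path are placeholders; I'm assuming YARN here since --num-executors is a YARN option):

spark-submit \
  --master yarn-client \
  --executor-memory 6G \
  --num-executors 10 \
  --driver-memory 6G \
  --executor-cores 2 \
  --conf spark.storage.memoryFraction=0.3 \
  --class com.example.SimpleCalcJob \
  simple-calc.jar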

I've tried all sorts of combinations but couldn't get better performance. Any
suggestions would be much appreciated.

Re: Spark performance issue

Posted by Silvio Fiorito <si...@granturing.com>.
It’ll help to see the code or at least understand what transformations you’re using.

Also, you have 15 nodes but you're not using all of them, so you may be losing data locality. You can see this in the Spark job UI: check whether any tasks have a locality level other than NODE_LOCAL or PROCESS_LOCAL.
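
For example, for the 25GB test you could spread the same resources across all 15 nodes instead of packing them onto 10, something like:

--num-executors 15 --executor-cores 2 --executor-memory 6G

The exact numbers depend on what else is running on the cluster, but one executor per node gives tasks a better chance of running node-local to their HDFS blocks.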

From: diplomatic Guru
Date: Friday, July 3, 2015 at 8:58 AM
To: "user@spark.apache.org<ma...@spark.apache.org>"
Subject: Spark performance issue
