Posted to user@spark.apache.org by innowireless TaeYun Kim <ta...@innowireless.co.kr> on 2014/07/08 02:42:46 UTC
The number of cores vs. the number of executors
Hi,
I'm trying to understand the relationship between the number of cores and the
number of executors when running a Spark job on YARN.
The test environment is as follows:
- # of data nodes: 3
- Data node machine spec:
- CPU: Core i7-4790 (# of cores: 4, # of threads: 8)
- RAM: 32GB (8GB x 4)
- HDD: 8TB (2TB x 4)
- Network: 1Gb
- Spark job flow: sc.textFile -> filter -> map -> filter -> mapToPair ->
reduceByKey -> map -> saveAsTextFile (a rough code sketch follows this list)
- input data
- type: single text file
- size: 165GB
- # of lines: 454,568,833
- output
- # of lines after second filter: 310,640,717
- # of lines of the result file: 99,848,268
- size of the result file: 41GB
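Spelled out in Spark's Java RDD API (which the mapToPair step suggests), the
flow above would look roughly like the following sketch. Only the chain of
operations comes from the post; the predicates, the key extraction, the count
values, and the HDFS paths are hypothetical placeholders.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class CoresVsExecutorsJob {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("CoresVsExecutors");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt"); // hypothetical path
        JavaPairRDD<String, Long> reduced = lines
            .filter(line -> !line.isEmpty())            // first filter (assumed predicate)
            .map(String::trim)                          // map (assumed transformation)
            .filter(line -> line.contains("\t"))        // second filter (assumed predicate)
            .mapToPair(line -> new Tuple2<>(line.split("\t")[0], 1L)) // assumed key extraction
            .reduceByKey((a, b) -> a + b);              // the shuffle happens here

        reduced
            .map(kv -> kv._1() + "\t" + kv._2())        // format the result lines
            .saveAsTextFile("hdfs:///data/output");     // hypothetical path

        sc.stop();
      }
    }

Note that reduceByKey is the only step that shuffles data across executors;
everything before it is pipelined into a single map-side stage.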
The job was run with the following configurations (an example spark-submit
command line is sketched after the timings):
1) --master yarn-client --executor-memory 19G --executor-cores 7
--num-executors 3 (one executor per data node, using as many cores as possible)
2) --master yarn-client --executor-memory 19G --executor-cores 4
--num-executors 3 (number of cores reduced)
3) --master yarn-client --executor-memory 4G --executor-cores 2
--num-executors 12 (fewer cores, more executors)
- elapsed times:
1) 50 min 15 sec
2) 55 min 48 sec
3) 31 min 23 sec
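For concreteness, configuration 3) corresponds to a spark-submit invocation
along the following lines. The flags are the ones listed above; the main
class, jar name, and HDFS paths are hypothetical placeholders.

    # Configuration 3): 12 executors with 2 cores and 4GB each.
    # The class name, jar, and paths are hypothetical.
    spark-submit --class com.example.CoresVsExecutorsJob \
      --master yarn-client \
      --executor-memory 4G \
      --executor-cores 2 \
      --num-executors 12 \
      coresvsexecutors.jar hdfs:///data/input.txt hdfs:///data/output

Assuming YARN spreads the executors evenly, 3) runs four executors per node,
which together use all 8 hardware threads and 16GB of the 32GB RAM on each
node; 1) and 2) instead run a single 19GB executor per node.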
To my surprise, 3) was much faster.
I thought that 1) would be faster, since there would be less inter-executor
communication when shuffling.
Although the total number of cores in 1) is smaller than in 3) (3 x 7 = 21
vs. 12 x 2 = 24), the core count alone cannot be the key factor, since 2),
with only 3 x 4 = 12 cores, still performed comparably to 1).
How can I understand this result?
Thanks.
RE: The number of cores vs. the number of executors
Posted by innowireless TaeYun Kim <ta...@innowireless.co.kr>.
For your information, I've attached the Ganglia monitoring screen capture to
the Stack Overflow question.
Please see:
http://stackoverflow.com/questions/24622108/apache-spark-the-number-of-cores-vs-the-number-of-executors