Posted to user@spark.apache.org by innowireless TaeYun Kim <ta...@innowireless.co.kr> on 2014/07/08 02:42:46 UTC
The number of cores vs. the number of executors
Hi,
I'm trying to understand the relationship between the number of cores and the
number of executors when running a Spark job on YARN.
The test environment is as follows:
- # of data nodes: 3
- Data node machine spec:
- CPU: Core i7-4790 (# of cores: 4, # of threads: 8)
- RAM: 32GB (8GB x 4)
- HDD: 8TB (2TB x 4)
- Network: 1Gb
- Spark job flow: sc.textFile -> filter -> map -> filter -> mapToPair ->
reduceByKey -> map -> saveAsTextFile (a rough code sketch follows this list)
- input data
- type: single text file
- size: 165GB
- # of lines: 454,568,833
- output
- # of lines after second filter: 310,640,717
- # of lines of the result file: 99,848,268
- size of the result file: 41GB
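Spelled out in Spark's Java RDD API (which the mapToPair step suggests), the
flow above would look roughly like the following sketch. Only the chain of
operations comes from the post; the predicates, the key extraction, the count
values, and the HDFS paths are hypothetical placeholders.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class CoresVsExecutorsJob {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("CoresVsExecutors");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt"); // hypothetical path
        JavaPairRDD<String, Long> reduced = lines
            .filter(line -> !line.isEmpty())            // first filter (assumed predicate)
            .map(String::trim)                          // map (assumed transformation)
            .filter(line -> line.contains("\t"))        // second filter (assumed predicate)
            .mapToPair(line -> new Tuple2<>(line.split("\t")[0], 1L)) // assumed key extraction
            .reduceByKey((a, b) -> a + b);              // the shuffle happens here

        reduced
            .map(kv -> kv._1() + "\t" + kv._2())        // format the result lines
            .saveAsTextFile("hdfs:///data/output");     // hypothetical path

        sc.stop();
      }
    }

Note that reduceByKey is the only step that shuffles data across executors;
everything before it is pipelined into a single map-side stage.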
The job was run with the following configurations (an example spark-submit
command line is sketched after the timings):
1) --master yarn-client --executor-memory 19G --executor-cores 7
--num-executors 3 (one executor per data node, using as many cores as possible)
2) --master yarn-client --executor-memory 19G --executor-cores 4
--num-executors 3 (number of cores reduced)
3) --master yarn-client --executor-memory 4G --executor-cores 2
--num-executors 12 (fewer cores, more executors)
- elapsed times:
1) 50 min 15 sec
2) 55 min 48 sec
3) 31 min 23 sec
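For concreteness, configuration 3) corresponds to a spark-submit invocation
along the following lines. The flags are the ones listed above; the main
class, jar name, and HDFS paths are hypothetical placeholders.

    # Configuration 3): 12 executors with 2 cores and 4GB each.
    # The class name, jar, and paths are hypothetical.
    spark-submit --class com.example.CoresVsExecutorsJob \
      --master yarn-client \
      --executor-memory 4G \
      --executor-cores 2 \
      --num-executors 12 \
      coresvsexecutors.jar hdfs:///data/input.txt hdfs:///data/output

Assuming YARN spreads the executors evenly, 3) runs four executors per node,
which together use all 8 hardware threads and 16GB of the 32GB RAM on each
node; 1) and 2) instead run a single 19GB executor per node.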
To my surprise, 3) was much faster.
I thought that 1) would be faster, since there would be less inter-executor
communication when shuffling.
Although the total number of cores in 1) is smaller than in 3) (3 x 7 = 21
vs. 12 x 2 = 24), the core count alone cannot be the key factor, since 2),
with only 3 x 4 = 12 cores, still performed comparably to 1).
How can I understand this result?
Thanks.
RE: The number of cores vs. the number of executors
Posted by innowireless TaeYun Kim <ta...@innowireless.co.kr>.
For your information, I've attached the Ganglia monitoring screen capture to
the Stack Overflow question.
Please see:
http://stackoverflow.com/questions/24622108/apache-spark-the-number-of-cores-vs-the-number-of-executors