Posted to user@spark.apache.org by Tom Hubregtsen <th...@gmail.com> on 2015/07/24 20:45:12 UTC

50% performance decrease when using local file vs hdfs

Hi,

When running two experiments with the same application, we see a 50%
performance difference between using HDFS and plain files on local disk, both
via the textFile/saveAsTextFile calls. Almost all of the performance loss is in Stage 1.

Input (in Stage 0):
The file is read in using val input = sc.textFile(inputFile). The total
input size is 500 GB. The files on local disk are partitioned into 128 MB files,
and HDFS is set to a block size of 128 MB. When looking at the number of
tasks, we see 4x more tasks with local files. We have seen this before, and it
seems to be because Spark breaks the local files up into 32 MB splits; this does
not happen with HDFS.
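
For concreteness, the partition count (which becomes the Stage 0 task count) can
be checked directly. A minimal spark-shell sketch, where inputFile is a placeholder
for the 500 GB input directory:

    // Counting the input partitions that become Stage 0 tasks.
    // On local files, Hadoop's LocalFileSystem reports a 32 MB block size
    // (fs.local.block.size, default 33554432 bytes), so each 128 MB file is
    // read as ~4 splits; on HDFS with 128 MB blocks it is one split per block.
    val input = sc.textFile(inputFile)
    println(s"Stage 0 tasks / input partitions: ${input.partitions.length}")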

Output (in Stage 1):
The file is written using saveAsTextFile(outputFile). The total output size
is 500 GB. Because we use a custom partitioner, we always have 9025 tasks in
this stage. This is the stage where we see most of the performance loss.
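
For reference, the write side looks roughly like this. This is only a sketch:
HashPartitioner(9025) stands in for our custom partitioner, and the keying line
is illustrative rather than our actual code:

    import org.apache.spark.HashPartitioner

    // The map below is illustrative; the real job builds its own (key, value) pairs.
    val keyed = input.map(line => (line.split("\t")(0), line))
    // HashPartitioner(9025) stands in for the custom partitioner mentioned above.
    val partitioned = keyed.partitionBy(new HashPartitioner(9025))
    // saveAsTextFile writes one part-NNNNN file per partition,
    // hence the fixed 9025 tasks (and output files) in Stage 1.
    partitioned.values.saveAsTextFile(outputFile)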

Questions:
* What is the cause of the performance loss?
-> Possible answers:
Because of the block size (e.g. 128 MB vs 33 MB) the write is less efficient
(more/less data being transferred at once),
or
because of the block size we need to open 4x as many files, leading to a
performance loss.
* How can we solve this? (We would prefer not to use HDFS; the kind of setting
we assume might be relevant is sketched below this list.)
* Bonus question: Should I use a different API to get better performance?
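
To make the second question concrete, this is the kind of knob we assume is
involved if the 32/33 MB local block size is the culprit. A sketch using
standard Hadoop configuration keys, untested on our side:

    // Ask the Hadoop input format for larger splits when reading local files.
    sc.hadoopConfiguration.setLong(
      "mapreduce.input.fileinputformat.split.minsize", 128L * 1024 * 1024)
    // Alternatively, raise the block size LocalFileSystem reports:
    // sc.hadoopConfiguration.setLong("fs.local.block.size", 128L * 1024 * 1024)
    val input = sc.textFile(inputFile)  // should now yield ~1 split per 128 MB file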

Thanks for any responses!

Tom Hubregtsen



