Posted to user@spark.apache.org by Wisc Forum <wi...@gmail.com> on 2013/11/24 20:22:45 UTC

Spark performance on Amazon EC2

Hi, Spark support:

I am working on a research project which uses Spark on Amazon EC2 as the cluster foundation for distributed computation.

The project basically consumes some image and text files, projects these files to different features, and indexes them for later querying.

Currently the image and text data are stored in S3, and our application distributes the projectors (which convert the data to features) to the worker nodes via Spark's map function; the results are persisted in memory for later use.

For the whole workflow, all the projectors run in parallel, while the indexing is done sequentially (i.e. indexing does not run on the worker nodes). I tried different numbers of worker nodes in the experiments and noticed that performance improves (in terms of time consumption) as the number of worker nodes goes from 2 to 3 to 4 and so on, but it degrades with 10 worker nodes and gets worse with 12 (time consumption goes up at 10 and 12), and then improves again with 14, 16, and 18 worker nodes (time consumption goes down again).
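
Roughly, the workflow looks like the sketch below (the extractFeatures projector, the S3 bucket path, and the master URL are placeholders for illustration, not our actual code):

import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object FeatureWorkflowSketch {
  // Placeholder projector: turns one raw record into a feature vector.
  def extractFeatures(record: String): Array[Double] =
    record.split("\\s+").map(_.length.toDouble)

  def main(args: Array[String]) {
    val sc = new SparkContext("spark://master:7077", "FeatureWorkflow")

    // Load the raw data from S3 and project it to features in parallel on the
    // worker nodes, keeping the results in memory for later use.
    val raw = sc.textFile("s3n://my-bucket/input/")
    val features = raw.map(extractFeatures).persist(StorageLevel.MEMORY_ONLY)

    // Indexing is sequential, so the features are pulled back to the driver
    // and processed one record at a time there.
    val collected = features.collect()
    collected.foreach(f => println(f.mkString(",")))  // stand-in for the sequential indexing step

    sc.stop()
  }
}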

I tried another task (a similar workload but a different program) where the intermediate features are used for training, and classification is finally applied to some test data against the trained model.
In this set of experiments we chose 1, 2, 3, 4, ..., 10 worker nodes. The same thing happened, except the performance was bumpier: time consumption was worse with 5 worker nodes than with 4, improved with 6, got worse with 8, and improved again with 9 and 10 worker nodes.

This is confusing, as I expected performance to scale near-linearly as we increase the number of worker nodes. Once the overhead outweighs the gain from distribution, performance should degrade and not recover.

P. S.

I did take the following tuning steps:

System.setProperty("spark.scheduler.mode", "FAIR") as I want each job received more evenly assigned cluster resource
System.setProperty("spark.task.maxFailures", "6") as I want the worker node tried more times before it fail on a task
System.setProperty("spark.akka.frameSize", "250") as our communication between master and worker is more than the default 10M

Did I miss any other tuning steps?

Thanks,
Xiaobing