Posted to user@spark.apache.org by Yi Ming Huang <hu...@cn.ibm.com> on 2015/03/19 07:44:20 UTC

Need some help on the Spark performance on Hadoop Yarn

Dear Spark experts, I would appreciate it if you could look into my problem
and give me some help and suggestions. Thank you!

I have a simple Spark application that parses and analyzes a log file, and I
can run it on my Hadoop YARN cluster. The problem is that it runs quite
slowly on the cluster, even slower than on a single Spark machine.

This is a sketch of my application:
1) Read in the log file and use mapToPair to transform the raw logs into my
object, Tuple2<String, LogEntry>. I use a String as the key so that I can
aggregate by key later.
2) Persist the RDD produced in step 1; call it logObjects.
3) Use aggregateByKey to calculate the sum and average value for each key. I
use aggregateByKey instead of reduceByKey because the output object type
differs from the input type.
4) Persist the RDD from step 3; call it aggregatedObjects.
5) Run several takeOrdered calls to get the top X values that I'm interested in.
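
To make step 3 concrete, here is a minimal plain-Java sketch of the
aggregateByKey idea: values are folded into a per-key accumulator inside each
partition, and the per-partition accumulators are then merged. It does not use
any Spark classes. The names Stats, seqOp, combOp and entry, the use of a
double as the value, and the sample data are all assumptions made for
illustration; my real application works on LogEntry objects.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; no Spark classes are used here.
final class AggregateByKeySketch {

    // Hypothetical accumulator. Its type differs from the input value type,
    // which is why an aggregate (not a plain reduce) is needed.
    static final class Stats {
        final double sum;
        final long count;

        Stats(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        double avg() {
            return count == 0 ? 0.0 : sum / count;
        }
    }

    // Folds one input value into the accumulator (used within a partition).
    static Stats seqOp(Stats acc, double value) {
        return new Stats(acc.sum + value, acc.count + 1);
    }

    // Merges two accumulators (used across partitions).
    static Stats combOp(Stats a, Stats b) {
        return new Stats(a.sum + b.sum, a.count + b.count);
    }

    // Small helper to build a (key, value) record.
    static Map.Entry<String, Double> entry(String key, double value) {
        return new SimpleEntry<>(key, value);
    }

    // Each inner list stands in for one partition of (key, value) records.
    static Map<String, Stats> aggregateByKey(List<List<Map.Entry<String, Double>>> partitions) {
        Map<String, Stats> merged = new TreeMap<>();
        for (List<Map.Entry<String, Double>> partition : partitions) {
            Map<String, Stats> local = new TreeMap<>();
            for (Map.Entry<String, Double> record : partition) {
                // Start from the zero value (sum 0, count 0) for an unseen key.
                Stats acc = local.getOrDefault(record.getKey(), new Stats(0.0, 0));
                local.put(record.getKey(), seqOp(acc, record.getValue()));
            }
            for (Map.Entry<String, Stats> e : local.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), AggregateByKeySketch::combOp);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Double>> partition1 = Arrays.asList(entry("a", 1.0), entry("b", 4.0));
        List<Map.Entry<String, Double>> partition2 = Arrays.asList(entry("a", 3.0));
        Map<String, Stats> result = aggregateByKey(Arrays.asList(partition1, partition2));
        for (Map.Entry<String, Stats> e : result.entrySet()) {
            Stats s = e.getValue();
            System.out.println(e.getKey() + " sum=" + s.sum + " count=" + s.count + " avg=" + s.avg());
        }
    }
}
```

In the real job the same three pieces (zero value, per-record function, merge
function) are what I pass to aggregateByKey on the pair RDD.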

What surprised me is that even with persist (MEMORY_ONLY_SER) on the two
major RDDs that I use later, the processing speed does not improve. It is
even slower than without persisting them. Any idea why? I logged some data
to stdout and found that the two major actions take more than 1 minute. It's
just a 1 GB log, though.
Another problem is that the job seems to use only two of the DataNodes in my
Hadoop YARN cluster, although I actually have three. Is there any
configuration that matters here?



I have attached the stderr output; please help me analyze it and suggest
where I can improve the speed. Thank you so much!
(See attached file: applicationLog.txt)
Best Regards
--------------------------------------------
Yi Ming Huang(黄毅铭)
ICS Performance
IBM Collaboration Solutions, China Development Lab, Shanghai
huangyim@cn.ibm.com (86-21)60922771
Addr: 5F, Building 10, No 399, Keyuan Road, Zhangjiang High Tech Park