Posted to user@spark.apache.org by Varadharajan Mukundan <sr...@gmail.com> on 2015/07/29 15:08:15 UTC
Simple Map Reduce taking lot of time
Hi All,
I'm running Spark 1.4.1 on an 8-core machine with 16 GB RAM. I have a 500 MB
CSV file with 10 columns, and I need to split it into multiple
CSV/Parquet files based on one of the fields in the CSV file. I've loaded
the CSV file using spark-csv and applied the transformations below. It
takes a long time (more than 20-30 minutes) and sometimes terminates with
an OOM error. Any ideas for better ways to do it? Thanks in advance!
I start spark-shell with the following options:
# Kryo serializer enabled
bin/spark-shell --driver-memory 6G --executor-memory 6G \
  --master "local[3]" \
  --conf spark.kryoserializer.buffer.max=200m \
  --packages com.databricks:spark-csv_2.11:1.1.0
val df = sqlContext.load("com.databricks.spark.csv",
  Map("header" -> "true",
      "path" -> "file:///file.csv",
      "partitionColumn" -> "date",
      "numPartitions" -> "4"))
df.map(r => (r(2), List(r))).reduceByKey((a,b) => a ++ b)
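One alternative I've been wondering about is skipping the groupBy-style
reduceByKey entirely and writing partitioned output directly. A rough
sketch (assuming Spark 1.4's DataFrameWriter, that "date" is the column
I'm splitting on, and a hypothetical output path):

```scala
// Sketch: instead of concatenating rows into in-memory lists per key
// (which builds large objects on a few partitions), write one Parquet
// sub-directory per distinct value of the "date" column.
// "file:///output" is a hypothetical path.
df.write
  .partitionBy("date")
  .format("parquet")
  .save("file:///output")
```

Would that avoid shuffling all rows into per-key lists, or is there a
better pattern for this?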
--
Thanks,
M. Varadharajan
------------------------------------------------
"Experience is what you get when you didn't get what you wanted"
-By Prof. Randy Pausch in "The Last Lecture"
My Journal :- http://varadharajan.in