Posted to user@spark.apache.org by Varadharajan Mukundan <sr...@gmail.com> on 2015/07/29 15:08:15 UTC

Simple Map Reduce taking a lot of time

Hi All,

I'm running Spark 1.4.1 on an 8-core machine with 16 GB RAM. I have a 500 MB
CSV file with 10 columns, and I need to split it into multiple CSV/Parquet
files based on one of the fields in the CSV file. I've loaded the file with
spark-csv and applied the transformations below. The job takes a long time
(more than 20-30 minutes) and sometimes terminates with an OOM error. Any
ideas for a better way to do this? Thanks in advance!

I start spark-shell with the options below:

# Kryo serializer enabled, with a larger buffer
bin/spark-shell --driver-memory 6G --executor-memory 6G --master "local[3]" \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryoserializer.buffer.max=200m \
  --packages com.databricks:spark-csv_2.11:1.1.0



val df = sqlContext.load("com.databricks.spark.csv",
  Map("header" -> "true",
    "path" -> "file:///file.csv",
    "partitionColumn" -> "date",
    "numPartitions" -> "4"
  )
)
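// Note: "partitionColumn" and "numPartitions" look like options for the JDBC
// data source rather than spark-csv 1.1.0, so they are most likely ignored
// here and the file is read with the default (split-based) partitioning.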

df.map(r => (r(2), List(r))).reduceByKey((a,b) => a ++ b)
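
I suspect the reduceByKey above is what hurts: concatenating immutable Lists
effectively builds every group in memory (like a groupByKey), so all rows for
a key end up in one task before anything is written out. One alternative I'm
considering is letting the DataFrame writer do the splitting. This is an
untested sketch; the output path is a placeholder and "date" stands for the
column I actually want to split on:

// Untested sketch (Spark 1.4 spark-shell, where sqlContext is predefined):
// write one Parquet sub-directory per distinct value of the "date" column
// instead of building the groups by hand with reduceByKey.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("file:///file.csv")

// partitionBy (available since Spark 1.4) produces directories such as
// .../date=2015-07-29/part-*, one per key value, without materialising a
// whole group in memory.
df.write
  .partitionBy("date")
  .parquet("file:///output_by_date")

If plain CSV output per value is really needed, I could also filter the
DataFrame once per distinct value of the column and save each result
separately, which is slower but still avoids building the giant Lists.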



-- 
Thanks,
M. Varadharajan

------------------------------------------------

"Experience is what you get when you didn't get what you wanted"
               -By Prof. Randy Pausch in "The Last Lecture"

My Journal :- http://varadharajan.in