Posted to user@spark.apache.org by Manu Mukerji <ma...@gmail.com> on 2014/09/08 23:37:59 UTC

Recommendations for performance

Hi,

Let me start by saying I am new to Spark (be gentle).

I have a large data set in Parquet (~1.5B rows, 900 columns)

Currently Impala takes ~1-2 seconds for the queries, while Spark SQL takes
~30 seconds.

Here is what I am currently doing:

I launch with SPARK_MEM=6g spark-shell

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
val parquetFile = sqlContext.parquetFile("hdfs:///user/ubuntu/parquet_data")
parquetFile.cache()                            // mark the SchemaRDD for caching
parquetFile.registerAsTable("parquet_table")   // expose it to SQL queries
val test1 = sqlContext.sql("select COUNT(1) FROM parquet_table WHERE 1=1 AND ....")
test1.take(1)                                  // action: runs the query and returns the count row
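
Side question: should I be using cacheTable() instead of calling cache() on the
SchemaRDD? My (possibly wrong) understanding is that cacheTable() is what puts a
table into Spark SQL's in-memory columnar cache, so something like:

sqlContext.cacheTable("parquet_table")
sqlContext.sql("select COUNT(1) FROM parquet_table").take(1)  // caching seems to be lazy, so run a query to actually build the cache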


My cluster has ~300 GB of memory across 5 x c3.8xlarge nodes.

I am guessing not all the data is fitting in memory with this config. A rough
back-of-the-envelope: 1.5B rows x 900 columns is ~1.35 trillion values, so even
at a byte or two per value that is already more than a terabyte uncompressed,
versus ~300 GB of cluster memory.

1) How do I determine how much memory the data will need once cached? (rough sketch below)
2) How do I tell Spark to load the data into memory and keep it there?
3) When I go to host:4040/storage/, why do I not see anything there?
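
For 1), the only idea I have come up with is to cache a small sample and
extrapolate from whatever shows up on the storage page (no idea if this is
sound):

val sampled = parquetFile.sample(false, 0.01, 42)  // roughly a 1% sample
sampled.cache()
sampled.foreach(_ => ())  // touch every row so the cached blocks actually get built
// then scale whatever size host:4040/storage/ reports by ~100x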

thanks,
Manu