Posted to user@spark.apache.org by sudiprc <su...@gmail.com> on 2014/07/17 00:02:58 UTC

SaveAsTextFile of RDD taking much time

Hi All,

I am new to Spark. I have written a program that reads data from a large
local file, sorts it using Spark SQL, and then filters it based on some
validation rules. I tested this program on a file of 23860746 lines, and it
took 39 secs (2 cores and Xmx of 6 GB). But when I try to write the result
to a local file, it takes far too long (I stopped the execution). For a file
of 100K lines, the program took 8 secs without saveAsTextFile, whereas with
writing to a file it takes 20 mins. BTW, I am using Kryo serialization and
the StorageLevel.MEMORY_ONLY_SER option for persisting in RAM.

The program is almost exactly as given below:

    package main.scala

    import scala.reflect.runtime.universe
    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext
    import org.apache.spark.serializer.KryoRegistrator
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.storage.StorageLevel
    import com.esotericsoftware.kryo.Kryo

    object SparkSqlApplication extends App {
      val txtFile = "/home/admin/scala/bigfile.txt"
      val outputDir = "file:///home/admin/scala/spark-poc/sample_data1_spark-sql"

      // Local master, Kryo serialization with a custom registrator
      val conf = new SparkConf()
        .setMaster("local")
        .setAppName("Spark App")
        .setSparkHome("$SPARK_HOME")
        .setJars(List("target/scala-2.10/spark-project_2.10-1.0.jar"))
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryo.registrator", "main.scala.ReconRegistrator")

      val sc = new SparkContext(conf)
      val sqlCtx = new SQLContext(sc)
      import sqlCtx.createSchemaRDD

      // Parse each comma-separated line into a Patient and register the table
      val patient = sc.textFile(txtFile)
        .persist(StorageLevel.MEMORY_ONLY_SER)
        .map(_.split(","))
        .map(arr => Patient(arr(0).trim(), arr(1), arr(2)))
        .registerAsTable("patient")

      val sortedPat = sqlCtx.sql("select * from patient order by pcode")

      // Keep only the rows for which applyRules returns false
      val validator = new GroovyIntegrator()
      val filteredInvalidPat =
        sortedPat.filter(patientRow => !validator.applyRules("" + patientRow(0)))

      // Coalesce to a single partition (no shuffle) and write as text
      filteredInvalidPat.coalesce(1, false).saveAsTextFile(outputDir)
    }

    case class Patient(pcode: String, disease: String, dcategory: String)

    class ReconRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[Patient])
      }
    }

Can anyone help with this?

Thanks,
Sudip


