Posted to user@spark.apache.org by Spark User <sp...@gmail.com> on 2016/09/27 20:02:25 UTC

Question about single/multi-pass execution in Spark-2.0 dataset/dataframe

case class Record(keyAttr: String, attr1: String, attr2: String, attr3: String)

val ds = sparkSession.createDataset(rdd).as[Record]

val attr1Counts = ds.groupBy("keyAttr", "attr1").count()

val attr2Counts = ds.groupBy("keyAttr", "attr2").count()

val attr3Counts = ds.groupBy("keyAttr", "attr3").count()

//similar counts for 20 attributes

//code to merge attr1Counts and attr2Counts and attr3Counts
//translate it to desired output format and save the result.
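
To make the merge step concrete, this is roughly what I have in mind (just a
sketch; the long (keyAttr, attrName, attrValue, count) shape and the column
names are one possible output format, not necessarily the final one):

import org.apache.spark.sql.functions.{col, lit}

// stack every per-attribute count into a single long-format DataFrame before saving
val merged = attr1Counts
    .select(col("keyAttr"), lit("attr1").as("attrName"), col("attr1").as("attrValue"), col("count"))
  .union(attr2Counts
    .select(col("keyAttr"), lit("attr2").as("attrName"), col("attr2").as("attrValue"), col("count")))
  .union(attr3Counts
    .select(col("keyAttr"), lit("attr3").as("attrName"), col("attr3").as("attrValue"), col("count")))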

Some more details:
1) The application is a Spark Streaming application with a batch interval on
the order of 5 - 10 minutes
2) The data set is large, on the order of millions of records per batch
3) I'm using Spark 2.0

The above implementation doesn't seem efficient if the dataset is scanned
separately for every count aggregation when computing attr1Counts, attr2Counts,
and attr3Counts. I'm concerned about the performance.
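
So far the only check I know of is to print the physical plan of each
aggregation, but since each query is planned separately I don't think this
tells me whether the scans are actually shared:

// each call prints a separate physical plan, and each plan contains its own
// scan of the source, which is part of why I suspect three passes over the data
attr1Counts.explain()
attr2Counts.explain()
attr3Counts.explain()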

Questions:
1) Does the Catalyst optimizer handle such queries and do a single pass over
the dataset under the hood?
2) Is there a better way to do such aggregations, maybe using UDAFs? Or is it
better to use RDD.reduceByKey for this use case? (A rough sketch of the
single-pass version I'm thinking of is below.)
RDD.reduceByKey performs well for this data volume and a batch interval of
5 - 10 minutes. I'm not sure whether the Dataset implementation described
above would be equivalent or better.
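
For reference, the single-pass alternative I'm considering is to flatten each
record into (keyAttr, attrName, attrValue) rows and aggregate once; a rough,
untested sketch (assuming the implicits from sparkSession are in scope):

import sparkSession.implicits._

// hypothetical single-pass version: one row per attribute per record,
// then a single groupBy/count covers all attributes in one shot
val exploded = ds.flatMap { r =>
  Seq(
    (r.keyAttr, "attr1", r.attr1),
    (r.keyAttr, "attr2", r.attr2),
    (r.keyAttr, "attr3", r.attr3)
    // ... one tuple per attribute, 20 in the real job
  )
}.toDF("keyAttr", "attrName", "attrValue")

val allCounts = exploded.groupBy("keyAttr", "attrName", "attrValue").count()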

Thanks,
Bharath