Posted to user@spark.apache.org by Spark User <sp...@gmail.com> on 2016/10/22 00:07:34 UTC

RDD to Dataset results in fixed number of partitions

Hi All,

I'm trying to create a Dataset from an RDD and do a groupBy on the Dataset. The
groupBy stage runs with 200 partitions, even though the RDD had 5000
partitions. I also can't seem to find a way to change those 200 partitions on
the Dataset to some other, larger number. This is limiting parallelism, since
there are 700 executors but only 200 partitions.

The code looks somewhat like:

val sqsDstream = sparkStreamingContext.union(
      (1 to 3).map(_ => sparkStreamingContext.receiverStream(new SQSReceiver()))
    ).transform(_.repartition(5000))

sqsDstream.foreachRDD(rdd => {
  val dataSet = sparkSession.createDataset(rdd)
  val aggregatedDataset: Dataset[Row] =
    dataSet.groupBy("primaryKey").agg(udaf("key1"))
  aggregatedDataset.foreachPartition(partition => {
    // write to output stream
  })
})
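
For reference, here's roughly how I'm seeing the partition counts (a minimal
sketch inside the foreachRDD, using the rdd and aggregatedDataset from the
snippet above; the println checks are just illustrative):

// illustrative checks only, placed after the aggregation inside foreachRDD
println(rdd.getNumPartitions)                    // 5000, as set by repartition
println(aggregatedDataset.rdd.getNumPartitions)  // always 200 after the groupBy/agg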


Any pointers would be appreciated.
Thanks,
Bharath