You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by Muthu Jayakumar <ba...@gmail.com> on 2017/07/11 21:05:03 UTC

DataFrame --- join / groupBy-agg question...

Hello there,

I may be having a naive question on join / groupBy-agg. During the days of
RDD, whenever I wanted to perform
a. groupBy-agg, I used to say reduceByKey (of PairRDDFunctions) with an
optional Partition-Strategy (with is number of partitions or Partitioner)
b. join (of PairRDDFunctions) and its variants, I used to have a way to
provide number of partitions

In DataFrame, how do I specify the number of partitions during this
operation? I could use repartition() after the fact. But this would be
another Stage in the Job.

One work around to increase the number of partitions / task during a join
is to set 'spark.sql.shuffle.partitions' it some desired number during
spark-submit. I am trying to see if there is a way to provide this
programmatically for every step of a groupBy-agg / join.

Please advice,
Muthu