Posted to user@spark.apache.org by Dibyendu Bhattacharya <di...@gmail.com> on 2018/01/31 03:51:01 UTC

Why does groupByKey still shuffle if SQL does "DISTRIBUTE BY" on the same columns?

Hi,

I am trying something like this:

val sesDS: Dataset[XXX] = hiveContext.sql(select).as[XXX]

The select statement is something like this: "select * from sometable ....
DISTRIBUTE by col1, col2, col3"

Then comes groupByKey...

val gpbyDS = sesDS.groupByKey(x => (x.col1, x.col2, x.col3))

As my select already distributes the data by the same columns that I use
in groupByKey, why does groupByKey still do a shuffle? Is this an issue,
or am I missing something?
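
A minimal sketch of how the physical plans could be compared to see where
the shuffle (an Exchange operator) is introduced. It assumes Spark 2.x with
a SparkSession rather than hiveContext. The case class Rec, the sample rows,
the local master and the object name ShuffleCheck are hypothetical stand-ins
for XXX and the real table, and the column-based groupBy is included only
for comparison.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type standing in for XXX in the post.
case class Rec(col1: String, col2: String, col3: String, value: Long)

object ShuffleCheck {
  def main(args: Array[String]): Unit = {
    // Assumption: a local SparkSession instead of the hiveContext in the post.
    val spark = SparkSession.builder()
      .appName("groupByKey-shuffle-check")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data registered under the table name used in the post.
    Seq(Rec("a", "b", "c", 1L), Rec("a", "b", "c", 2L))
      .toDS()
      .createOrReplaceTempView("sometable")

    val sesDS: Dataset[Rec] = spark
      .sql("select * from sometable DISTRIBUTE BY col1, col2, col3")
      .as[Rec]

    // Typed grouping as in the post: the key is computed by a Scala lambda.
    val typed = sesDS.groupByKey(x => (x.col1, x.col2, x.col3)).count()
    typed.explain() // look for Exchange operators in the printed plan

    // Column-based grouping on the same columns, for comparison only.
    val untyped = sesDS.groupBy($"col1", $"col2", $"col3").count()
    untyped.explain()

    spark.stop()
  }
}
```

Comparing the number of Exchange operators in the two printed plans should
show whether the extra shuffle is tied to the lambda-based key in groupByKey.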

Regards,
Dibyendu