Posted to user@spark.apache.org by Dibyendu Bhattacharya <di...@gmail.com> on 2018/01/31 03:51:01 UTC
Why does groupByKey still shuffle if SQL does "Distribute By" on the same columns?
Hi,
I am trying something like this:
val sesDS: Dataset[XXX] = hiveContext.sql(select).as[XXX]
The select statement is something like this: "select * from sometable ....
DISTRIBUTE BY col1, col2, col3"
Then comes groupByKey:
val gpbyDS = sesDS.groupByKey(x => (x.col1, x.col2, x.col3))
My select already distributes the data on the same columns that I use in
groupByKey, so why does groupByKey still do a shuffle? Is this an issue, or
am I missing something?
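For reference, here is a minimal sketch of how the shuffle can be observed in the physical plan. It is an illustration under assumptions, not my actual job: the case class Rec, its fields, the in-memory stand-in for sometable, and the use of a local SparkSession instead of hiveContext are all hypothetical. In the printed plans, each Exchange node corresponds to a shuffle, so the typed groupByKey and the untyped groupBy on the same columns can be compared side by side.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type standing in for XXX; the real schema is not shown in this post.
case class Rec(col1: String, col2: String, col3: String, value: Long)

object DistributeByPlan {
  def main(args: Array[String]): Unit = {
    // Assumption: a local SparkSession is used here in place of hiveContext.
    val spark = SparkSession.builder()
      .appName("distribute-by-plan")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Small in-memory stand-in for "sometable" (hypothetical data).
    Seq(Rec("a", "b", "c", 1L), Rec("a", "b", "c", 2L), Rec("x", "y", "z", 3L))
      .toDS()
      .createOrReplaceTempView("sometable")

    val sesDS: Dataset[Rec] =
      spark.sql("select * from sometable DISTRIBUTE BY col1, col2, col3").as[Rec]

    // Typed grouping: the key is computed by a Scala function over each object.
    val typedCounts = sesDS.groupByKey(x => (x.col1, x.col2, x.col3)).count()
    println("Plan for groupByKey:")
    typedCounts.explain()

    // Untyped grouping on the same three columns, printed for comparison.
    val untypedCounts = sesDS.groupBy("col1", "col2", "col3").count()
    println("Plan for groupBy:")
    untypedCounts.explain()

    spark.stop()
  }
}
```
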
Regards,
Dibyendu