Posted to user@spark.apache.org by KhajaAsmath Mohammed <md...@gmail.com> on 2017/04/27 15:27:27 UTC

Data Skew in Dataframe Groupby - Any suggestions?

Hi,

I am working on a requirement where I need to perform a groupBy on a set
of data and find the max value within each group.

The groupBy on the DataFrame results in data skew, and the job runs for
quite a long time (longer than the same work takes in Hive and Impala for
one day's worth of data).

Any suggestions on how to overcome this?

dataframe.groupBy(Constants.Datapoint.Vin,Constants.Datapoint.Utctime,Constants.Datapoint.ProviderDesc,Constants.Datapoint.Latitude,Constants.Datapoint.Longitude)

*Note:* I have added coalesce and persisted the data to memory and disk
too, but there is still no improvement.

Thanks,
Asmath.