Posted to user@spark.apache.org by Deepak Sharma <de...@gmail.com> on 2020/06/16 07:29:41 UTC

GroupBy issue while running K-Means - Dataframe

Hi All,
I have a custom implementation of K-Means that needs the data to be
grouped by a key in a DataFrame.
The data is heavily skewed for some of the keys, and for those keys the
grouping exceeds the BufferHolder limit:
 Cannot grow BufferHolder by size 17112 because the size after growing
exceeds size limitation 2147483632

I tried to work around it by converting the DataFrame to an RDD, using
reduceByKey on the RDD, and converting the result back to a DataFrame.
This fails with a Java heap out-of-memory error.
Since this looks like a common issue, I was wondering how others are
solving this problem?
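
To make the reduceByKey idea concrete, here is a minimal sketch in plain
Python (not Spark code). It assumes the per-key step is a centroid update,
i.e. only the sum and count of the vectors for each key are needed; the
message does not say what the custom K-Means does per key, so that is an
assumption. The names partial_sums, merge and centroids are hypothetical.
The point is that each key carries one running (sum, count) pair instead of
a buffer of all its rows, and that merge is associative and commutative,
which is the kind of function reduceByKey expects.

```python
def partial_sums(rows):
    """Fold (key, vector) rows into per-key (sum_vector, count) pairs.

    The state kept per key is one vector plus a counter, no matter how
    many rows share that key (assumption: only the mean is needed).
    """
    acc = {}
    for key, vec in rows:
        if key in acc:
            total, n = acc[key]
            acc[key] = ([a + b for a, b in zip(total, vec)], n + 1)
        else:
            acc[key] = (list(vec), 1)
    return acc


def merge(left, right):
    """Merge two partial results; order of merging does not matter."""
    out = dict(left)
    for key, (vec, n) in right.items():
        if key in out:
            total, m = out[key]
            out[key] = ([x + y for x, y in zip(total, vec)], m + n)
        else:
            out[key] = (list(vec), n)
    return out


def centroids(acc):
    """Turn (sum_vector, count) pairs into per-key mean vectors."""
    return {key: [x / n for x in total] for key, (total, n) in acc.items()}


# Two lists stand in for two partitions; key "a" is the skewed one.
part1 = [("a", [1.0, 2.0]), ("a", [3.0, 4.0]), ("b", [10.0, 0.0])]
part2 = [("a", [5.0, 6.0])]
result = centroids(merge(partial_sums(part1), partial_sums(part2)))
print(result)
```

If the per-key work really does need every row of a key at once, this
sketch does not apply as written, because the reduced value would then grow
with the size of the key.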
-- 
Thanks
Deepak