Posted to dev@spark.apache.org by Lydia Ickler <ic...@googlemail.com> on 2016/11/23 12:37:38 UTC

PowerIterationClustering can't handle "large" files

Hi all,

I have a question regarding Power Iteration Clustering (PIC).
I have an input file (a tab-separated edge list) that I read in and map to the input format required by PIC, RDD[(Long, Long, Double)], before running the clustering.
So far so good… 
The implementation works fine for small inputs (up to 50 MB),
but it crashes when I apply it to a 650 MB file.
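
For reference, here is a minimal sketch of the pipeline described above. This is not the attached code; the input path, number of clusters, and iteration count are placeholders chosen for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.PowerIterationClustering

    object PICExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("PICExample")
        val sc = new SparkContext(conf)

        // Parse the tab-separated edge list: srcId \t dstId \t similarity
        val similarities = sc.textFile("hdfs:///path/to/edges.tsv").map { line =>
          val fields = line.split("\t")
          (fields(0).toLong, fields(1).toLong, fields(2).toDouble)
        }

        // Run PIC on the pairwise-similarity RDD[(Long, Long, Double)]
        val model = new PowerIterationClustering()
          .setK(2)                // number of clusters (placeholder)
          .setMaxIterations(20)   // placeholder
          .run(similarities)

        // Inspect a few vertex-to-cluster assignments
        model.assignments.take(10).foreach(a => println(s"${a.id} -> ${a.cluster}"))

        sc.stop()
      }
    }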
My technical setup is a compute cluster with 1 master and 2 workers; the executor memory is set to 50 GB, and 24 cores are available in total.
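
A submission along these lines would match that configuration (the master URL and jar name are assumptions, not taken from the actual setup):

    spark-submit --class PICExample \
      --master spark://master:7077 \
      --executor-memory 50G \
      --total-executor-cores 24 \
      pic-example.jar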

Is it normal for the program to crash at this file size?
I attached my program code as well as the error output.

I hope someone can help me!
Best regards, 
Lydia