Posted to dev@spark.apache.org by Will Yang <er...@gmail.com> on 2014/12/25 06:34:40 UTC

Problems with large dataset using collect() and broadcast()

Hi all,
In my case, I have a huge HashMap[(Int, Long), (Double, Double,
Double)], several GB to tens of GB in size. After each iteration, I need to
collect() this HashMap, perform some calculations, and then broadcast()
it to every node. Currently I have 20GB for each executor, and after it
performs collect(), it gets stuck at "Added rdd_xx_xx" with no further
response shown on the Application UI.
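
The loop looks roughly like this (a simplified sketch; pairRdd, update()
and numIterations stand in for my real code):

    // pairRdd: RDD[((Int, Long), (Double, Double, Double))]
    var model: Map[(Int, Long), (Double, Double, Double)] =
      pairRdd.collectAsMap().toMap          // the whole map lands on the driver

    for (i <- 1 to numIterations) {
      val bc = sc.broadcast(model)          // ship the current map to every executor
      val next = pairRdd.mapValues(v => update(v, bc.value))
      model = next.collectAsMap().toMap     // collect() pulls everything back to the driver
      bc.unpersist(blocking = true)         // free the previous broadcast's blocks
    }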

I've tried lowering spark.shuffle.memoryFraction and
spark.storage.memoryFraction, but it seems the job can only handle a
HashMap of up to about 2GB. What should I optimize for in such conditions?
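
For reference, I set those fractions when submitting, something like
this (the values are illustrative, not my exact settings):

    spark-submit \
      --executor-memory 20g \
      --conf spark.shuffle.memoryFraction=0.1 \
      --conf spark.storage.memoryFraction=0.3 \
      ...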

(PS: sorry for my bad English & grammar)


Thanks

Re: Problems with large dataset using collect() and broadcast()

Posted by Patrick Wendell <pw...@gmail.com>.
Hi Will,

When you call collect(), the item you are collecting needs to fit in
memory on the driver. Is it possible your driver program does not have
enough memory?
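
If so, try giving the driver more heap when you launch the job, sized to
hold the whole map (40g here is just an example):

    spark-submit --driver-memory 40g ...

    # or, equivalently, in conf/spark-defaults.conf:
    spark.driver.memory   40g

Note that driver memory has to be set before the driver JVM starts, so
assigning spark.driver.memory in SparkConf inside the application is too
late for a client-mode driver.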

- Patrick

On Wed, Dec 24, 2014 at 9:34 PM, Will Yang <er...@gmail.com> wrote:
> Hi all,
> In my case, I have a huge HashMap[(Int, Long), (Double, Double,
> Double)], several GB to tens of GB in size. After each iteration, I
> need to collect() this HashMap, perform some calculations, and then
> broadcast() it to every node. Currently I have 20GB for each executor,
> and after it performs collect(), it gets stuck at "Added rdd_xx_xx"
> with no further response shown on the Application UI.
>
> I've tried lowering spark.shuffle.memoryFraction and
> spark.storage.memoryFraction, but it seems the job can only handle a
> HashMap of up to about 2GB. What should I optimize for in such
> conditions?
>
> (PS: sorry for my bad English & grammar)
>
>
> Thanks

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org