Posted to user@spark.apache.org by wilson <in...@bigcount.xyz> on 2022/04/19 10:33:42 UTC

RDD memory use question

Hello,

Do you know why, for a big dataset, a normal RDD job can finish fine, but
collect() fails with an out-of-memory error?

For instance, for a dataset with xxx million items, this runs fine:

  scala> rdd.map { x => x.split(",") }.map { x => (x(5).toString, x(6).toDouble) }.groupByKey.mapValues(x => x.sum/x.size).sortBy(-_._2).take(20)


But at the final stage, when I issued this command, I got:

scala> rdd.collect.size
22/04/19 18:26:52 ERROR Executor: Exception in task 13.0 in stage 44.0 
(TID 349)
java.lang.OutOfMemoryError: Java heap space


Thank you.
wilson

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: RDD memory use question

Posted by Sean Owen <sr...@gmail.com>.
Don't collect() - that pulls all data into memory. Use count().
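For example, roughly (a sketch, same rdd as in your mail below):

  scala> rdd.count()    // action runs on the executors; only a single Long comes back to the driver

  scala> rdd.collect()  // copies every element into the driver JVM, which is what blows the heap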

On Tue, Apr 19, 2022 at 5:34 AM wilson <in...@bigcount.xyz> wrote:

> Hello,
>
> Do you know why, for a big dataset, a normal RDD job can finish fine, but
> collect() fails with an out-of-memory error?
>
> For instance, for a dataset with xxx million items, this runs fine:
>
>   scala> rdd.map { x => x.split(",") }.map { x => (x(5).toString, x(6).toDouble) }.groupByKey.mapValues(x => x.sum/x.size).sortBy(-_._2).take(20)
>
>
> But at the final stage, when I issued this command, I got:
>
> scala> rdd.collect.size
> 22/04/19 18:26:52 ERROR Executor: Exception in task 13.0 in stage 44.0
> (TID 349)
> java.lang.OutOfMemoryError: Java heap space
>
>
> Thank you.
> wilson
>

Re: RDD memory use question

Posted by wilson <in...@bigcount.xyz>.
And maybe I'm being silly, but how do I get the RDD size? I just did:

scala> rdd.map( x=>1 ).reduce(_+_)
res6: Int = 10000000

Is there a .size() method?
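
Or should I just use count()? A quick sketch (I have not run this yet):

  scala> rdd.count()    // standard action for the number of elements; returns a Long without collecting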


wilson wrote:
> Hello,
> 
> Do you know why, for a big dataset, a normal RDD job can finish fine, but
> collect() fails with an out-of-memory error?
> 
> For instance, for a dataset with xxx million items, this runs fine:
> 
>   scala> rdd.map { x => x.split(",") }.map { x => (x(5).toString, x(6).toDouble) }.groupByKey.mapValues(x => x.sum/x.size).sortBy(-_._2).take(20)
> 
> 
> But at the final stage, when I issued this command, I got:
> 
> scala> rdd.collect.size
> 22/04/19 18:26:52 ERROR Executor: Exception in task 13.0 in stage 44.0 
> (TID 349)
> java.lang.OutOfMemoryError: Java heap space
> 
> 
> Thank you.
> wilson
> 

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org