Posted to user@spark.apache.org by wilson <in...@bigcount.xyz> on 2022/04/19 10:33:42 UTC
RDD memory use question
Hello,
For a big dataset, do you know why the general RDD job completes fine, but
collect() fails with a memory overflow?
For instance, on a dataset with xxx million items, this runs well:

scala> rdd.map(x => x.split(","))
         .map(x => (x(5).toString, x(6).toDouble))
         .groupByKey
         .mapValues(x => x.sum / x.size)
         .sortBy(-_._2)
         .take(20)
But in the final stage, when I issued this command, it got:
scala> rdd.collect.size
22/04/19 18:26:52 ERROR Executor: Exception in task 13.0 in stage 44.0
(TID 349)
java.lang.OutOfMemoryError: Java heap space
Thank you.
wilson
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org
Re: RDD memory use question
Posted by Sean Owen <sr...@gmail.com>.
Don't collect() - that pulls all data into memory. Use count().
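[Editor's note: a minimal spark-shell sketch of the distinction, assuming the same `rdd` as in the original message; not from the thread itself.]

// count() is executed on the executors and returns only a Long
// to the driver.
val n: Long = rdd.count()

// collect() materializes every element in the driver's heap --
// on a large RDD this is what triggers the OutOfMemoryError.
// val all = rdd.collect()

// If only a bounded sample is needed on the driver, take() is safe:
val sample = rdd.take(20)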
On Tue, Apr 19, 2022 at 5:34 AM wilson <in...@bigcount.xyz> wrote:
> [original message quoted]
Re: RDD memory use question
Posted by wilson <in...@bigcount.xyz>.
And maybe this is a silly question, but how do I get the RDD size? I just did:
scala> rdd.map( x=>1 ).reduce(_+_)
res6: Int = 10000000
Is there a .size() method?
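[Editor's note: RDD has no .size() method; the built-in count() action is the idiomatic equivalent of the map/reduce above, and returns a Long rather than an Int. Output below follows from res6 in the original message.]

scala> rdd.count()
res7: Long = 10000000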
wilson wrote:
> [original message quoted]