Posted to user@spark.apache.org by wdbaruni <wd...@gmail.com> on 2015/07/21 22:47:02 UTC

Which memory fraction is Spark using to compute RDDs that are not going to be persisted

I am new to Spark and I understand that Spark divides the executor memory
into the following fractions:

*RDD Storage:* The fraction Spark uses to store RDDs persisted with .persist() or
.cache(); it can be set via spark.storage.memoryFraction (default 0.6)

*Shuffle and aggregation buffers:* The fraction Spark uses to store shuffle
outputs; it can be set via spark.shuffle.memoryFraction (default 0.2). If the
shuffle output exceeds this fraction, Spark will spill the data to disk.

*User code:* Spark uses this fraction to execute arbitrary user code
(default 0.2)

I am not mentioning the storage and shuffle safety fractions for simplicity.
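
For reference, a minimal sketch of how these fractions can be set through
SparkConf (the app name is illustrative and the values simply repeat the
defaults; they are not recommendations):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("memory-fraction-demo")           # illustrative app name
        .set("spark.storage.memoryFraction", "0.6")   # persisted/cached RDDs
        .set("spark.shuffle.memoryFraction", "0.2"))  # shuffle/aggregation buffers
sc = SparkContext(conf=conf)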

My question is: which memory fraction does Spark use to compute and
transform RDDs that are not going to be persisted? For example:

from operator import add

lines = sc.textFile("i am a big file.txt")
count = lines.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
count.saveAsTextFile("output")

Here Spark will not load the whole file at once; it will partition the input
file and apply all of these transformations per partition in a single stage.
However, which memory fraction will Spark use to load the partitioned lines
and to compute flatMap() and map()?
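
To make the per-partition pipelining concrete, here is a rough, hand-written
equivalent of the map side using mapPartitions (illustrative only; it streams
each partition through as an iterator rather than materializing it):

from operator import add

def words_with_ones(lines_iter):
    # Each partition arrives as an iterator of lines and is streamed through,
    # so the full flatMap/map output for a partition need not sit in memory.
    for line in lines_iter:
        for word in line.split(' '):
            yield (word, 1)

counts = (sc.textFile("i am a big file.txt")
            .mapPartitions(words_with_ones)
            .reduceByKey(add))
counts.saveAsTextFile("output")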

Thanks





Re: Which memory fraction is Spark using to compute RDDs that are not going to be persisted

Posted by Andrew Or <an...@databricks.com>.
Hi,

It would be whatever's left in the JVM. This is not explicitly controlled
by a fraction like storage or shuffle. However, the computation usually
doesn't need to use that much space. In my experience it's almost always
the caching or the aggregation during shuffles that's the most memory
intensive.
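
For illustration only, a rough back-of-the-envelope split under the default
fractions from the question (the 10 GB heap is an assumed number, and the
safety fractions are again ignored for simplicity):

executor_heap_gb = 10.0                      # assumed executor heap size
storage_gb = executor_heap_gb * 0.6          # spark.storage.memoryFraction
shuffle_gb = executor_heap_gb * 0.2          # spark.shuffle.memoryFraction
leftover_gb = executor_heap_gb - storage_gb - shuffle_gb
# -> 6.0 GB for caching, 2.0 GB for shuffle buffers, and the remaining 2.0 GB
#    is "whatever's left" for task computation and user code.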

-Andrew
