Posted to user@spark.apache.org by insperatum <in...@gmail.com> on 2014/12/09 22:42:12 UTC

Caching RDDs with shared memory - bug or feature?

If all RDD elements within a partition contain pointers to a single shared
object, Spark persists as expected when the RDD is small. However, if the
RDD has more than 200 elements, Spark reports requiring much more memory
than it actually does. This becomes a problem for large RDDs, as Spark
refuses to persist even though it could. Is this a bug, or is there a
feature that I'm missing?
Cheers, Luke

val n = ???
class Elem(val s: Array[Int])
val rdd = sc.parallelize(Seq(1)).mapPartitions( _ => {
    val sharedArray = Array.ofDim[Int](10000000)   // Should require ~40MB
    (1 to n).toIterator.map(_ => new Elem(sharedArray))
}).cache().count()   // force computation

For n = 100:  MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 38.1 MB, free 898.7 MB)
For n = 200:  MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 38.2 MB, free 898.7 MB)
For n = 201:  MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 76.7 MB, free 860.2 MB)
For n = 5000: MemoryStore: Not enough space to cache rdd_1_0 in memory! (computed 781.3 MB so far)
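
For completeness, here is a self-contained version of the snippet above that
can be run outside the shell. The object name, the local[1] master and the
choice of n = 201 are just mine for a standalone run; cache() uses the default
MEMORY_ONLY storage level, same as above.

import org.apache.spark.{SparkConf, SparkContext}

object SharedArrayCacheRepro {
  class Elem(val s: Array[Int])

  def main(args: Array[String]): Unit = {
    // Single local executor thread, so everything runs in one JVM.
    val sc = new SparkContext(
      new SparkConf().setAppName("shared-array-cache").setMaster("local[1]"))

    val n = 201   // first value where the estimated size jumps in the logs above
    sc.parallelize(Seq(1), numSlices = 1).mapPartitions { _ =>
      val sharedArray = Array.ofDim[Int](10000000)   // ~40MB, referenced by every element
      (1 to n).iterator.map(_ => new Elem(sharedArray))
    }.cache().count()   // force computation so the MemoryStore logs the estimated block size

    sc.stop()
  }
}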

Note: for medium-sized n (n > 200, but small enough that Spark still caches the
block), the application's actual memory use stays where it should be - Spark
just seems to vastly overreport how much memory it's using.
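
If it helps, here is a rough way to sanity-check that claim from the shell.
This only makes sense in local mode, where the driver and executor share a
JVM; the usedHeapMB helper and the choice of n are mine, purely for
illustration.

// Assumes the same sc and Elem class as above; the RDD is built without the
// trailing .count() so it can be cached and counted explicitly here.
val n = 1000   // any n in the range where Spark still agrees to cache the block
val rdd = sc.parallelize(Seq(1)).mapPartitions { _ =>
  val sharedArray = Array.ofDim[Int](10000000)   // ~40MB, shared by every element
  (1 to n).iterator.map(_ => new Elem(sharedArray))
}

def usedHeapMB(): Long = {
  val rt = Runtime.getRuntime
  System.gc()   // best effort, just to make the before/after delta less noisy
  (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024)
}

val before = usedHeapMB()
rdd.cache().count()   // materialize the cached block
val after = usedHeapMB()
println(s"Used heap grew by roughly ${after - before} MB")
// The MemoryStore estimate grows with n, but the measured growth stays near the
// ~40MB of the single shared array.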



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Caching-RDDs-with-shared-memory-bug-or-feature-tp20596.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
