Posted to user@spark.apache.org by Michael Lewis <le...@me.com> on 2014/07/08 00:50:18 UTC

memory leak query

Hi,

I hope someone can help, as I’m not sure if I’m using Spark correctly. Basically, in the simple example below
I create an RDD which is just a sequence of random numbers. I then have a loop where I repeatedly invoke
rdd.count(), and what I can see is that the memory use always nudges upwards.

If I attach YourKit to the JVM, I can see the garbage collector in action, but eventually the JVM runs out of memory.

Can anyone spot if I am doing something wrong? (Obviously the example is slightly contrived, but basically I
have an RDD with a set of numbers and I’d like to submit lots of jobs that perform some calculation; this was
the simplest case I could create that would exhibit the same memory issue.)

Regards & Thanks,
Mike


import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
import scala.util.Random

object SparkTest {
  def main(args: Array[String]) {
    println("spark memory test")

    val jars = Seq("spark-test-1.0-SNAPSHOT.jar")

    val sparkConfig: SparkConf = new SparkConf()
                                 .setMaster("local")
                                 .setAppName("tester")
                                 .setJars(jars)

    val sparkContext = new SparkContext(sparkConfig)

    // Build an RDD of 1.2 million random integers split across 10 partitions.
    val list = Seq.fill(1200000)(Random.nextInt)
    val rdd: RDD[Int] = sparkContext.makeRDD(list, 10)

    // Repeatedly run a simple action; memory use creeps upwards on each iteration.
    for (i <- 1 to 1000000) {
      rdd.count()
    }
    sparkContext.stop()
  }
}


Re: memory leak query

Posted by Rico <ri...@gmail.com>.
Hi Michael, 

I asked a similar question
<http://apache-spark-user-list.1001560.n3.nabble.com/Caching-issue-with-msg-RDD-block-could-not-be-dropped-from-memory-as-it-does-not-exist-td10248.html#a10677>
before. My problem was that my data was too large to be cached in memory
because of serialization.
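
One common workaround in that situation is to cache the RDD in serialized form, which trades some
extra CPU for a much smaller memory footprint. A minimal sketch, assuming the standard Spark 1.x
RDD API (rdd here just stands for whichever RDD you are caching):

import org.apache.spark.storage.StorageLevel

// Keep each cached partition as a single serialized byte array, which is
// usually much more compact than the equivalent deserialized Java objects.
rdd.persist(StorageLevel.MEMORY_ONLY_SER)
rdd.count()   // the first action materialises and caches the serialized blocks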

But I tried to reproduce your test and I did not experience any memory
problem. First, since count operates on the same rdd, it should not increase
the memory usage. Second, since you do not cache the rdd, each new action
such as count will simply reload the data.
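
If the intent is to run many jobs against the same data, explicitly caching the RDD avoids rebuilding
it on every action. A minimal sketch against the example in your mail (only the cache() call is new;
the loop is copied from your code):

// Cache the RDD in memory so repeated actions reuse the same blocks
// instead of regenerating the data each time.
rdd.cache()

for (i <- 1 to 1000000) {
  rdd.count()   // after the first pass this reads the cached partitions
}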

I am not sure how much memory you have in your machine, but by default Spark
allocates 512 MB for each executor and spark.storage.memoryFraction defaults
to 0.6, which means you effectively have only about 300 MB available for
cached data. If you are running your app on a local machine, you can monitor
it by opening the web UI in your browser at localhost:4040.
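
If the default is too small for your data, you can raise the executor memory when you build the
SparkConf (note that in local mode the application runs inside the driver JVM, so the driver's own
heap size is what ultimately applies). A minimal sketch based on the configuration in the original
mail (the values are only illustrative, assuming Spark 1.x property names):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local")
  .setAppName("tester")
  .set("spark.executor.memory", "1g")           // per-executor heap; default is 512m
  .set("spark.storage.memoryFraction", "0.6")   // share of the heap usable for cached blocks

val sc = new SparkContext(conf)
// ... build RDDs and run jobs here ...
sc.stop()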


