Posted to dev@spark.apache.org by rzykov <rz...@gmail.com> on 2014/11/22 15:23:41 UTC

java.lang.OutOfMemoryError at simple local test

Dear all, 

Unfortunately I've not received any response on the users forum, so I decided
to post this question here.
We have encountered problems with jobs failing on large amounts of data. For
example, an application works perfectly with relatively small data, but when
the data doubles in size the application fails.

A simple local test was prepared for this question at
https://gist.github.com/copy-of-rezo/6a137e13a1e4f841e7eb
It generates two sets of key-value pairs, joins them, selects the distinct
values and finally counts them.

import org.apache.spark.{SparkConf, SparkContext}

object Spill {
  // 10 keys x 200 values = 2,000 (key, value) pairs per dataset
  def generate = {
    for {
      j <- 1 to 10
      i <- 1 to 200
    } yield (j, i)
  }

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName(getClass.getSimpleName)
    conf.set("spark.shuffle.spill", "true")
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)
    println(generate)

    val dataA = sc.parallelize(generate)
    val dataB = sc.parallelize(generate)
    val dst = dataA.join(dataB).distinct().count()
    println(dst)
  }
}

We compiled it locally and ran it three times with different memory settings:
1) --executor-memory 10M --driver-memory 10M --num-executors 1
--executor-cores 1
It fails with "java.lang.OutOfMemoryError: GC overhead limit exceeded" at
..... 
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137) 

2) --executor-memory 20M --driver-memory 20M --num-executors 1
--executor-cores 1
It works OK 

3) --executor-memory 10M --driver-memory 10M --num-executors 1
--executor-cores 1, but with less data: i now goes up to 100 instead of 200.
This halves the input data and shrinks the joined data by a factor of four
(see the quick check below):

  def generate = { 
    for{ 
      j <- 1 to 10 
      i <- 1 to 100   // previous value was 200 
    } yield(j, i) 
  } 
This code works OK. 
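
A quick sanity check of those numbers (not part of the gist, just arithmetic
on the loop bounds above, runnable as-is in the Scala REPL):

// Each of the 10 keys has N matching values on both sides, so
// dataA.join(dataB) emits 10 * N * N output pairs.
val inputAt200  = 10 * 200         //   2,000 input pairs per dataset
val inputAt100  = 10 * 100         //   1,000 input pairs: half as many
val joinedAt200 = 10 * 200 * 200   // 400,000 joined pairs
val joinedAt100 = 10 * 100 * 100   // 100,000 joined pairs: 4 times fewer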

We don't understand why 10M is not enough for such a simple operation on
approximately 32,000 bytes of ints (2 * 10 * 200 * 2 * 4). 10M of RAM does
work if we halve the data volume (2,000 records of (Int, Int)).
Why doesn't spilling to disk cover this case?
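
For what it's worth, here is our rough estimate of how much memory the shuffle
collections (the ExternalAppendOnlyMap above) get before they start spilling.
This is only a sketch: it assumes Spark 1.1's default
spark.shuffle.memoryFraction = 0.2 and spark.shuffle.safetyFraction = 0.8, and
the object name is ours, just for illustration:

object ShuffleBudgetEstimate {
  def main(args: Array[String]): Unit = {
    // Shuffle collections get roughly heap * memoryFraction * safetyFraction
    // before spilling kicks in (assumed defaults: 0.2 and 0.8).
    val heapBytes    = 10L * 1024 * 1024                // --executor-memory 10M
    val shuffleBytes = (heapBytes * 0.2 * 0.8).toLong   // ~1.6 MB
    println(s"approx. shuffle budget: ${shuffleBytes / 1024} KB")
  }
}

If that estimate is roughly right, the map gets only about 1.6 MB before it
would spill, which seems very tight for 400,000 boxed join records plus the
rest of the JVM overhead.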






Re: java.lang.OutOfMemoryError at simple local test

Posted by rzykov <rz...@gmail.com>.
We made some changes to the code (each key now has 1000 values, so the join
produces 1000 * 1000 elements per key) and raised the memory limits to 100M:

def generate = {
  for{
    j <- 1 to 10
    i <- 1 to 1000
  } yield(j, i)
}

~/soft/spark-1.1.0-bin-hadoop2.3/bin/spark-submit --master local
--executor-memory 100M --driver-memory 100M --class Spill --num-executors 1
--executor-cores 1 target/scala-2.10/Spill-assembly-1.0.jar

The result of this: 
14/11/24 14:57:40 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception
in thread Thread[Executor task launch worker-0,5,main]
java.lang.OutOfMemoryError: GC overhead limit exceeded

We decided to check this with a profiler and took this screenshot:
<http://apache-spark-developers-list.1001551.n3.nabble.com/file/n9532/%D0%A1%D0%BA%D1%80%D0%B8%D0%BD%D1%88%D0%BE%D1%82_2014-11-26_11.png>

Each element of the collection takes 48 bytes: each element is a scala.Tuple2
of two java.lang.Integer objects.
But Scala supports "@specialized"
<https://github.com/scala/scala/blob/v2.10.4/src/library/scala/Tuple2.scala#L19>
tuples over the unboxed primitive type Int, which takes only 4 bytes.
So from this point of view the collection should take about 1000 * 1000 * 2 * 4
= 8 MB plus some overhead, which is roughly 5 times less than the observed
memory consumption.
Why didn't Spark use primitive (@specialized) types in this case?
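
To make the comparison concrete, here is the arithmetic we are using. It is
only a rough sketch: it assumes a 64-bit JVM with compressed oops, the
per-object sizes are estimates (in the same ballpark as the 48 bytes the
profiler shows), and the object name is ours, just for illustration:

object BoxedTupleEstimate {
  def main(args: Array[String]): Unit = {
    val elements = 1000 * 1000
    // Estimated sizes with compressed oops:
    //   Tuple2:            ~12-byte header + 2 x 4-byte refs + padding ~ 24 bytes
    //   java.lang.Integer: ~12-byte header + 4-byte value              ~ 16 bytes (x2)
    val boxedBytes   = elements.toLong * (24 + 2 * 16)   // ~56 MB
    val unboxedBytes = elements.toLong * (2 * 4)         //   8 MB with unboxed Ints
    println(s"boxed   ~ ${boxedBytes / 1000000} MB")
    println(s"unboxed ~ ${unboxedBytes / 1000000} MB")
  }
}

Either way, the boxed representation is several times larger than the 8 MB we
would expect from unboxed (@specialized) Ints.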








Re: java.lang.OutOfMemoryError at simple local test

Posted by Sean Owen <so...@cloudera.com>.
10M is tiny compared to all of the overhead of running a fairly large, complex
Scala-based app in a JVM. I think you may be bumping up against practical
minimum heap sizes, and that you may find it is not really the data size. I
don't think Spark really scales down this far.
On Nov 22, 2014 2:24 PM, "rzykov" <rz...@gmail.com> wrote:

> We don't understand why 10M is not enough for such a simple operation on
> approximately 32,000 bytes of ints (2 * 10 * 200 * 2 * 4). 10M of RAM does
> work if we halve the data volume (2,000 records of (Int, Int)).
> Why doesn't spilling to disk cover this case?