Posted to user@spark.apache.org by Shivani Rao <ra...@gmail.com> on 2014/06/18 23:17:50 UTC

Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

I am trying to process a file that contains 4 log lines (not very long) and
then write my parsed out case classes to a destination folder, and I get
the following error:


java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.io.WritableUtils.readCompressedStringArray(WritableUtils.java:183)
    at org.apache.hadoop.conf.Configuration.readFields(Configuration.java:2244)
    at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:280)
    at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:75)
    at org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:39)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:974)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1848)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1752)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1328)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:350)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
    at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:165)
    at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:974)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1848)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1752)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1328)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1946)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1870)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1752)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1328)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1946)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1870)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1752)


Sadly, several folks have faced this error while trying to execute Spark
jobs, and there are various suggested solutions, none of which work for me:


a) I tried changing the number of partitions in my RDD by using coalesce(8)
(http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-0-java-lang-outOfMemoryError-Java-Heap-Space-td7735.html#a7736),
and the error persisted (a rough sketch of this attempt follows this list).

b) I tried setting SPARK_WORKER_MEM=2g and SPARK_EXECUTOR_MEMORY=10g, and
neither worked.

c) I strongly suspect there is a classpath error
(http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-spark-executor-memory-and-heap-size-td4719.html),
mainly because the call stack is repetitive. Maybe the OOM error is a
disguise?

d) I checked that I am not out of disk space and that I do not have too
many open files (the count from sudo ls /proc/<spark_master_process_id>/fd |
wc -l is well below the ulimit).
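
For concreteness, attempt (a) amounted to roughly the following sketch; the
master URL, app name, and input path are placeholders, not my actual values:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("<master-url>").setAppName("logParser"))
val rawLogs = sc.textFile("hdfs://<namenode>:8020/myfile.txt") // placeholder path
val coalesced = rawLogs.coalesce(8) // attempt (a): reduce the number of partitions
println(coalesced.count())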


I am also noticing multiple reflection calls, to find the right "class" I
guess, so it could be a "class not found" error disguising itself as a
memory error.


Here are other threads that describe the same situation, but have not been
resolved in any way so far:


http://apache-spark-user-list.1001560.n3.nabble.com/no-response-in-spark-web-UI-td4633.html

http://apache-spark-user-list.1001560.n3.nabble.com/Spark-program-thows-OutOfMemoryError-td4268.html


Any help is greatly appreciated. I am especially calling out to the creators
of Spark and the Databricks folks. This seems like a "known bug" waiting to
happen.


Thanks,

Shivani

-- 
Software Engineer
Analytics Engineering Team@ Box
Mountain View, CA

Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

Posted by Shivani Rao <ra...@gmail.com>.
Hello Eugene,

Thanks for your patience and answers. The issue was that one of the third-party
libraries was not built with "sbt assembly" but just packaged with "sbt
package", so it did not contain all of its dependencies.

Thanks for all your help

Shivani



-- 
Software Engineer
Analytics Engineering Team@ Box
Mountain View, CA

Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

Posted by Eugen Cepoi <ce...@gmail.com>.
In short, ADD_JARS will add the jar to your driver classpath and also send
it to the workers (similar to what you are doing when you do sc.addJars).

ex: MASTER=master/url ADD_JARS=/path/to/myJob.jar ./bin/spark-shell


You also have SPARK_CLASSPATH var but it does not distribute the code, it
is only used to compute the driver classpath.
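
For a standalone job (rather than the shell), the rough programmatic
equivalent is the sketch below; the master URL and jar paths are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://<master-host>:7077")
  .setAppName("myJob")
  .setJars(Seq("/path/to/myJob.jar")) // shipped to the executors, like ADD_JARS
val sc = new SparkContext(conf)

// or, once the context exists:
sc.addJar("/path/to/extra-dependency.jar")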


BTW, you are not supposed to change the compute_classpath.sh script



Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

Posted by Shivani Rao <ra...@gmail.com>.
Hello Eugene,

You are right about this. I did encounter the PermGen space error in the
spark-shell. Can you tell me a little more about ADD_JARS? To ensure my
spark-shell has all the required jars, I added the jars to "$CLASSPATH" in
the compute_classpath.sh script. Is there another way of doing it?

Shivani




-- 
Software Engineer
Analytics Engineering Team@ Box
Mountain View, CA

Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

Posted by Eugen Cepoi <ce...@gmail.com>.
In my case it was due to a case class I was defining in the spark-shell and
not being available on the workers. Packaging it in a jar and adding it with
ADD_JARS solved the problem. Note that I don't exactly remember if it was an
out of heap space exception or a PermGen space one. Make sure your jarsPath
is correct.

Usually, to debug this kind of problem, I use the spark-shell (you can do the
same in your job, but it is more time consuming to repackage, deploy, run,
iterate). Try, for example (a rough sketch follows below):
1) read the lines (without any processing) and count them
2) apply the processing and count again
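
A sketch of those two steps, assuming sc is the shell's SparkContext, the
path is a placeholder, and processLogs is whatever per-line function the job
uses:

val lines = sc.textFile("hdfs://<namenode>:8020/myfile.txt") // placeholder path
println(lines.count())                        // 1) plain read + count
println(lines.flatMap(processLogs).count())   // 2) read + process + count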




Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

Posted by Shivani Rao <ra...@gmail.com>.
Hello Abhi, I did try that and it did not work.

And Eugene, yes, I am assembling the argonaut libraries in the fat jar. So
how did you overcome this problem?

Shivani




-- 
Software Engineer
Analytics Engineering Team@ Box
Mountain View, CA

Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

Posted by Eugen Cepoi <ce...@gmail.com>.
On 20 June 2014 at 01:46, "Shivani Rao" <ra...@gmail.com> wrote:
>
> Hello Andrew,
>
> I wish I could share the code, but for proprietary reasons I can't. But I
can give some idea of what I am trying to do. The job reads a file and
processes each line of that file. I am not doing anything intense in the
"processLogs" function
>
> import argonaut._
> import argonaut.Argonaut._
>
>
> /* all of these case classes are created from json strings extracted from
the line in the processLogs() function
> *
> */
> case class struct1…
> case class struct2…
> case class value1(struct1, struct2)
>
> def processLogs(line:String): Option[(key1, value1)] {…
> }
>
> def run(sparkMaster, appName, executorMemory, jarsPath) {
>   val sparkConf = new SparkConf()
>    sparkConf.setMaster(sparkMaster)
>    sparkConf.setAppName(appName)
>    sparkConf.set("spark.executor.memory", executorMemory)
>     sparkConf.setJars(jarsPath) // This includes all the jars relevant
jars..
>    val sc = new SparkContext(sparkConf)
>   val rawLogs = sc.textFile("hdfs://<my-hadoop-namenode:8020:myfile.txt")
>
rawLogs.saveAsTextFile("hdfs://<my-hadoop-namenode:8020:writebackForTesting")
>
rawLogs.flatMap(processLogs).saveAsTextFile("hdfs://<my-hadoop-namenode:8020:outfile.txt")
> }
>
> If I switch to "local" mode, the code runs just fine, it fails with the
error I pasted above. In the cluster mode, even writing back the file we
just read fails
(rawLogs.saveAsTextFile("hdfs://<my-hadoop-namenode:8020:writebackForTesting")
>
> I still believe this is a classNotFound error in disguise
>

Indeed you are right, this can be the reason. I had similar errors when
defining case classes in the shell and trying to use them in the RDDs. Are
you shading argonaut in the fat jar ?

> Thanks
> Shivani
>
>
>
> On Wed, Jun 18, 2014 at 2:49 PM, Andrew Ash <an...@andrewash.com> wrote:
>>
>> Wait, so the file only has four lines and the job running out of heap
space?  Can you share the code you're running that does the processing?
 I'd guess that you're doing some intense processing on every line but just
writing parsed case classes back to disk sounds very lightweight.
>>
>> I

Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

Posted by abhiguruvayya <sh...@gmail.com>.
Once you have generated the final RDD, and before submitting it to the
reducer, try to repartition it into a known number of partitions using
either coalesce(partitions) or repartition(). A rule of thumb for the number
of data partitions is 3 * num_executors * cores_per_executor.
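
For example (a sketch only, borrowing the rawLogs RDD name from the code
earlier in this thread; the executor and core counts are made up):

// Illustrative numbers only; plug in your own executor and core counts.
val numExecutors = 4
val coresPerExecutor = 2
val numPartitions = 3 * numExecutors * coresPerExecutor // = 24

// repartition() always shuffles into exactly numPartitions partitions;
// coalesce() avoids a full shuffle when you are only reducing the count.
val repartitioned = rawLogs.repartition(numPartitions)
val coalesced = rawLogs.coalesce(8)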




Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

Posted by Shivani Rao <ra...@gmail.com>.
Hello Andrew,

I wish I could share the code, but for proprietary reasons I can't. I can
give some idea of what I am trying to do, though. The job reads a file and
processes each of its lines; I am not doing anything intense in the
"processLogs" function:

import org.apache.spark.{SparkConf, SparkContext}
import argonaut._
import argonaut.Argonaut._

/* All of these case classes are created from JSON strings extracted from
 * the line inside the processLogs() function.
 */
case class struct1…
case class struct2…
case class value1(struct1, struct2)

def processLogs(line: String): Option[(key1, value1)] = {…
}

def run(sparkMaster: String, appName: String, executorMemory: String,
        jarsPath: Seq[String]) {
  val sparkConf = new SparkConf()
  sparkConf.setMaster(sparkMaster)
  sparkConf.setAppName(appName)
  sparkConf.set("spark.executor.memory", executorMemory)
  sparkConf.setJars(jarsPath) // this includes all the relevant jars
  val sc = new SparkContext(sparkConf)

  val rawLogs = sc.textFile("hdfs://<my-hadoop-namenode>:8020/myfile.txt")

  // write the input straight back, just to test the round trip
  rawLogs.saveAsTextFile("hdfs://<my-hadoop-namenode>:8020/writebackForTesting")

  // parse each line and write the results out
  rawLogs.flatMap(processLogs).saveAsTextFile("hdfs://<my-hadoop-namenode>:8020/outfile.txt")
}
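
For completeness, this is roughly how run() gets invoked (the master URL,
memory setting, app name and jar path here are placeholders, not the real
values):

// Placeholder values only; the real master URL, memory and jar path differ.
run(
  sparkMaster = "spark://<master-host>:7077",
  appName = "logParser",
  executorMemory = "2g",
  jarsPath = Seq("/path/to/my-app-assembly.jar") // fat jar with argonaut + case classes
)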

If I switch to "local" mode, the code runs just fine; on the cluster it
fails with the error I pasted above. In cluster mode, even writing back the
file we just read fails:
rawLogs.saveAsTextFile("hdfs://<my-hadoop-namenode>:8020/writebackForTesting")

I still believe this is a classNotFound error in disguise

Thanks
Shivani



On Wed, Jun 18, 2014 at 2:49 PM, Andrew Ash <an...@andrewash.com> wrote:

> Wait, so the file only has four lines and the job is running out of heap
> space? Can you share the code you're running that does the processing? I'd
> guess that you're doing some intense processing on every line, but just
> writing parsed case classes back to disk sounds very lightweight.
>
>
> On Wed, Jun 18, 2014 at 5:17 PM, Shivani Rao <ra...@gmail.com> wrote:
>
>> I am trying to process a file that contains 4 log lines (not very long)
>> and then write my parsed out case classes to a destination folder, and I
>> get the following error:
>>
>>
>> java.lang.OutOfMemoryError: Java heap space
>>
>> Sadly, there are several folks that have faced this error while trying to
>> execute Spark jobs and there are various solutions, none of which work for
>> me
>>
>>
>> a) I tried (
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-0-java-lang-outOfMemoryError-Java-Heap-Space-td7735.html#a7736)
>> changing the number of partitions in my RDD by using coalesce(8) and the
>> error persisted
>>
>> b)  I tried changing SPARK_WORKER_MEM=2g, SPARK_EXECUTOR_MEMORY=10g, and
>> both did not work
>>
>> c) I strongly suspect there is a class path error (
>> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-spark-executor-memory-and-heap-size-td4719.html)
>> Mainly because the call stack is repetitive. Maybe the OOM error is a
>> disguise ?
>>
>> d) I checked that i am not out of disk space and that i do not have too
>> many open files (ulimit -u << sudo ls /proc/<spark_master_process_id>/fd |
>> wc -l)
>>
>>
>> I am also noticing multiple reflections happening to find the right
>> "class" i guess, so it could be "class Not Found: error disguising itself
>> as a memory error.
>>
>>
>> Here are other threads that are encountering same situation .. but have
>> not been resolved in any way so far..
>>
>>
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/no-response-in-spark-web-UI-td4633.html
>>
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-program-thows-OutOfMemoryError-td4268.html
>>
>>
>> Any help is greatly appreciated. I am especially calling out on creators
>> of Spark and Databrick folks. This seems like a "known bug" waiting to
>> happen.
>>
>>
>> Thanks,
>>
>> Shivani
>>
>> --
>> Software Engineer
>> Analytics Engineering Team@ Box
>> Mountain View, CA
>>
>
>


-- 
Software Engineer
Analytics Engineering Team@ Box
Mountain View, CA

Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

Posted by Andrew Ash <an...@andrewash.com>.
Wait, so the file only has four lines and the job is running out of heap
space? Can you share the code you're running that does the processing? I'd
guess that you're doing some intense processing on every line, but just
writing parsed case classes back to disk sounds very lightweight.


On Wed, Jun 18, 2014 at 5:17 PM, Shivani Rao <ra...@gmail.com> wrote:

> I am trying to process a file that contains 4 log lines (not very long)
> and then write my parsed out case classes to a destination folder, and I
> get the following error:
>
>
> java.lang.OutOfMemoryError: Java heap space
>
> Sadly, there are several folks that have faced this error while trying to
> execute Spark jobs and there are various solutions, none of which work for
> me
>
>
> a) I tried (
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-0-java-lang-outOfMemoryError-Java-Heap-Space-td7735.html#a7736)
> changing the number of partitions in my RDD by using coalesce(8) and the
> error persisted
>
> b)  I tried changing SPARK_WORKER_MEM=2g, SPARK_EXECUTOR_MEMORY=10g, and
> both did not work
>
> c) I strongly suspect there is a class path error (
> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-spark-executor-memory-and-heap-size-td4719.html)
> Mainly because the call stack is repetitive. Maybe the OOM error is a
> disguise ?
>
> d) I checked that i am not out of disk space and that i do not have too
> many open files (ulimit -u << sudo ls /proc/<spark_master_process_id>/fd |
> wc -l)
>
>
> I am also noticing multiple reflections happening to find the right
> "class" i guess, so it could be "class Not Found: error disguising itself
> as a memory error.
>
>
> Here are other threads that are encountering same situation .. but have
> not been resolved in any way so far..
>
>
>
> http://apache-spark-user-list.1001560.n3.nabble.com/no-response-in-spark-web-UI-td4633.html
>
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-program-thows-OutOfMemoryError-td4268.html
>
>
> Any help is greatly appreciated. I am especially calling out on creators
> of Spark and Databrick folks. This seems like a "known bug" waiting to
> happen.
>
>
> Thanks,
>
> Shivani
>
> --
> Software Engineer
> Analytics Engineering Team@ Box
> Mountain View, CA
>