Posted to user@spark.apache.org by Mohit Jaggi <mo...@gmail.com> on 2014/07/11 23:00:03 UTC

pyspark sc.parallelize running OOM with smallish data

spark_data_array here has about 35k rows with 4k columns. I have 4 nodes in
the cluster and gave 48g to the executors. I also tried Kryo serialization.

Traceback (most recent call last):
  File "/mohit/./m.py", line 58, in <module>
    spark_data = sc.parallelize(spark_data_array)
  File "/mohit/spark/python/pyspark/context.py", line 265, in parallelize
    jrdd = readRDDFromFile(self._jsc, tempFile.name, numSlices)
  File "/mohit/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
  File "/mohit/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.readRDDFromFile.
: java.lang.OutOfMemoryError: Java heap space
    at org.apache.spark.api.python.PythonRDD$.readRDDFromFile(PythonRDD.scala:279)
    at org.apache.spark.api.python.PythonRDD.readRDDFromFile(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
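
In case it helps, here is a stripped-down sketch of what m.py does around
line 58 (illustrative only; the real script builds spark_data_array from
files, and the names below are just placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="m")

# spark_data_array lives entirely on the driver:
# roughly 35,000 rows x 4,000 columns of doubles.
spark_data_array = [[0.0] * 4000 for _ in range(35000)]  # placeholder data

spark_data = sc.parallelize(spark_data_array)  # line 58, where the OOM is raised
print(spark_data.count())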

Re: pyspark sc.parallelize running OOM with smallish data

Posted by Mohit Jaggi <mo...@gmail.com>.
Continuing to debug with Scala, I tried this locally with enough memory
(10g) and it is able to count the dataset. With more memory (for executor
and driver) in a cluster it still fails. The data is about 2 GB: it is
30k * 4k doubles.
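
(Back-of-the-envelope arithmetic for that figure, counting only the raw
doubles; the ~2 GB I see presumably includes per-object and serialization
overhead:)

rows, cols = 30000, 4000
raw_gb = rows * cols * 8 / 1e9   # 8 bytes per double
print(raw_gb)                    # ~0.96 GB before any overhead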


On Sat, Jul 12, 2014 at 6:31 PM, Aaron Davidson <il...@gmail.com> wrote:

> I think this is probably dying on the driver itself, since you are likely
> materializing the whole dataset inside your Python driver. How large is
> spark_data_array compared to your driver memory?

Re: pyspark sc.parallelize running OOM with smallish data

Posted by Aaron Davidson <il...@gmail.com>.
I think this is probably dying on the driver itself, since you are likely
materializing the whole dataset inside your Python driver. How large is
spark_data_array compared to your driver memory?
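
If you are not sure, something along these lines gives a rough lower bound
(sketch only; it assumes spark_data_array is either a NumPy array or a list
of lists of Python floats):

import sys
import numpy as np

def approx_size_bytes(data):
    # NumPy arrays know the size of their underlying buffer.
    if isinstance(data, np.ndarray):
        return data.nbytes
    # For a list of lists of floats, include per-object overhead.
    rows = len(data)
    cols = len(data[0]) if rows else 0
    return rows * cols * sys.getsizeof(1.0)  # ~24 bytes per float on 64-bit CPython

print("%.2f GB, roughly" % (approx_size_bytes(spark_data_array) / 1e9))

If the array is a sizable fraction of the driver heap, try giving the driver
more memory when you launch the job, e.g. spark-submit --driver-memory 8g.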


On Fri, Jul 11, 2014 at 7:30 PM, Mohit Jaggi <mo...@gmail.com> wrote:

> I put the same dataset into Scala (using spark-shell) and it acts weird. I
> cannot do a count on it; the executors seem to hang. The web UI shows 0/96
> in the status bar and shows details about the worker nodes, but there is no
> progress.
> sc.parallelize does finish (though it takes too long for the data size) in Scala.

Re: pyspark sc.parallelize running OOM with smallish data

Posted by Mohit Jaggi <mo...@gmail.com>.
I put the same dataset into Scala (using spark-shell) and it acts weird. I
cannot do a count on it; the executors seem to hang. The web UI shows 0/96
in the status bar and shows details about the worker nodes, but there is no
progress.
sc.parallelize does finish (though it takes too long for the data size) in Scala.


On Fri, Jul 11, 2014 at 2:00 PM, Mohit Jaggi <mo...@gmail.com> wrote:

> spark_data_array here has about 35k rows with 4k columns. I have 4 nodes
> in the cluster and gave 48g to the executors. I also tried Kryo serialization.