Posted to issues@spark.apache.org by "Alexey Grishchenko (JIRA)" <ji...@apache.org> on 2015/09/06 22:40:45 UTC

[jira] [Commented] (SPARK-10362) Cannot create DataFrame from large pandas.DataFrame

    [ https://issues.apache.org/jira/browse/SPARK-10362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14732537#comment-14732537 ] 

Alexey Grishchenko commented on SPARK-10362:
--------------------------------------------

_createDataFrame()_ in Python, when called on a local collection, first calls _parallelize()_ on your data. The Python _parallelize()_ method works in the following way: it creates a temporary file, dumps all your data into it, and then loads this data on the Java side. What happens here is that you don't have enough memory in the JVM to load this data, so it raises _java.lang.OutOfMemoryError: Java heap space_. As all of this happens on the driver, I recommend increasing the driver memory with _spark.driver.memory_ or _--driver-memory_.
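
For example, a minimal sketch of how to raise it (the 8g value and the _your_app.py_ name are illustrative; size the heap to your data):

    # increase the driver heap when submitting the application
    spark-submit --driver-memory 8g your_app.py

    # or when starting the PySpark shell
    pyspark --driver-memory 8g

    # or persistently, in conf/spark-defaults.conf
    spark.driver.memory  8g

Note that in client mode _spark.driver.memory_ cannot be raised through _SparkConf_ inside the application itself, because the driver JVM has already started by the time that configuration is read.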

> Cannot create DataFrame from large pandas.DataFrame
> ---------------------------------------------------
>
>                 Key: SPARK-10362
>                 URL: https://issues.apache.org/jira/browse/SPARK-10362
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.4.1
>         Environment: Ubuntu 14.04
> Spark 1.4.1
>            Reporter: Hsueh-Min Chen
>            Priority: Minor
>
> I tried to convert a pandas.DataFrame object to pyspark's DataFrame. It works for a small pandas.DataFrame (~10000 rows), but fails for larger sizes.
> >>> sqlc = pyspark.sql.SQLContext(sc)
> >>> log  = sqlc.createDataFrame(logs.head(10000000))
> ---------------------------------------------------------------------------
> Py4JJavaError                             Traceback (most recent call last)
> /home/elsdrm/spark-1.4.1-bin-hadoop2.6/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
>     325                 # data could be list, tuple, generator ...
> --> 326                 rdd = self._sc.parallelize(data)
>     327             except Exception:
> /home/elsdrm/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py in parallelize(self, c, numSlices)
>     395         readRDDFromFile = self._jvm.PythonRDD.readRDDFromFile
> --> 396         jrdd = readRDDFromFile(self._jsc, tempFile.name, numSlices)
>     397         return RDD(jrdd, self, serializer)
> /home/elsdrm/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
>     537         return_value = get_return_value(answer, self.gateway_client,
> --> 538                 self.target_id, self.name)
>     539 
> /home/elsdrm/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
>     299                     'An error occurred while calling {0}{1}{2}.\n'.
> --> 300                     format(target_id, '.', name), value)
>     301             else:
> Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.readRDDFromFile.
> : java.lang.OutOfMemoryError: Java heap space
> 	at org.apache.spark.api.python.PythonRDD$.readRDDFromFile(PythonRDD.scala:389)
> 	at org.apache.spark.api.python.PythonRDD.readRDDFromFile(PythonRDD.scala)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:606)
> 	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> 	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> 	at py4j.Gateway.invoke(Gateway.java:259)
> 	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> 	at py4j.commands.CallCommand.execute(CallCommand.java:79)
> 	at py4j.GatewayConnection.run(GatewayConnection.java:207)
> 	at java.lang.Thread.run(Thread.java:745)
> During handling of the above exception, another exception occurred:
> TypeError                                 Traceback (most recent call last)
> <ipython-input-12-32fb25f5be64> in <module>()
> ----> 1 log  = sqlc.createDataFrame(logs.head(10000000))
> /home/elsdrm/spark-1.4.1-bin-hadoop2.6/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
>     326                 rdd = self._sc.parallelize(data)
>     327             except Exception:
> --> 328                 raise TypeError("cannot create an RDD from type: %s" % type(data))
>     329         else:
>     330             rdd = data
> TypeError: cannot create an RDD from type: <class 'list'>


