Posted to user@spark.apache.org by kevindahl <ke...@gmail.com> on 2015/03/16 14:23:23 UTC

Re: Spark 1.3 createDataframe error with pandas df

kevindahl wrote
> I'm trying to create a spark data frame from a pandas data frame, but for
> even the most trivial of datasets I get an error along the lines of this:
> 
> ---------------------------------------------------------------------------
> Py4JJavaError                             Traceback (most recent call last)
> <ipython-input-11-7857f9a55971> in <module>()
>       3 BabyDataSet = zip(names,births)
>       4 df = pd.DataFrame(data = BabyDataSet, columns=['Names', 'Births'])
> ----> 5 rdf = sqlCtx.createDataFrame(df)
> 
> C:\spark\python\pyspark\sql\context.pyc in createDataFrame(self, data, schema, samplingRatio)
>     332 
>     333         if isinstance(schema, (list, tuple)):
> --> 334             first = data.first()
>     335             if not isinstance(first, (list, tuple)):
>     336                 raise ValueError("each row in `rdd` should be list or tuple, "
> 
> C:\spark\python\pyspark\rdd.pyc in first(self)
>    1241         ValueError: RDD is empty
>    1242         """
> -> 1243         rs = self.take(1)
>    1244         if rs:
>    1245             return rs[0]
> 
> C:\spark\python\pyspark\rdd.pyc in take(self, num)
>    1223 
>    1224             p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
> -> 1225             res = self.context.runJob(self, takeUpToNumLeft, p, True)
>    1226 
>    1227             items += res
> 
> C:\spark\python\pyspark\context.pyc in runJob(self, rdd, partitionFunc, partitions, allowLocal)
>     841         # SparkContext#runJob.
>     842         mappedRDD = rdd.mapPartitions(partitionFunc)
> --> 843         it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal)
>     844         return list(mappedRDD._collect_iterator_through_file(it))
>     845 
> 
> C:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py in __call__(self, *args)
>     536         answer = self.gateway_client.send_command(command)
>     537         return_value = get_return_value(answer, self.gateway_client,
> --> 538                 self.target_id, self.name)
>     539 
>     540         for temp_arg in temp_args:
> 
> C:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
>     298                 raise Py4JJavaError(
>     299                     'An error occurred while calling {0}{1}{2}.\n'.
> --> 300                     format(target_id, '.', name), value)
>     301             else:
>     302                 raise Py4JError(
> 
> Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 7, localhost): java.net.SocketException: Connection reset
> 	at java.net.SocketInputStream.read(SocketInputStream.java:189)
> 	at java.net.SocketInputStream.read(SocketInputStream.java:121)
> 	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> 	at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> 	at java.io.DataInputStream.readInt(DataInputStream.java:387)
> 	at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:108)
> 	at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
> 	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:64)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> 
> Driver stacktrace:
> 	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
> 	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> 	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
> 	at scala.Option.foreach(Option.scala:236)
> 	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
> 	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
> 	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
> 	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> 
> A simple repro:
> 
> names = ['Bob','Jessica','Mary','John','Mel']
> births = [968, 155, 77, 578, 973]
> BabyDataSet = zip(names,births)
> df = pd.DataFrame(data = BabyDataSet, columns=['Names', 'Births'])
> rdf = sqlCtx.createDataFrame(df)
> 
> On Windows 8.1, Spark 1.3, Anaconda Python 2.7, IPython 3.0.0, pandas 0.15.2.

The original message may not have made it through, as I think I fumbled my
subscription. 

Anyway, I have since tested the same thing with Spark 1.3 on Ubuntu, using the
same Anaconda Python 2.7 / IPython 3.0.0 / pandas 0.15.2 stack, and it works
fine there.
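
For anyone following along, here is a rough sketch of what the pandas path in
createDataFrame seems to boil down to, pieced together from the frames in the
traceback above rather than from the actual Spark source (sc and sqlCtx are
assumed to be the usual shell context objects):

names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
births = [968, 155, 77, 578, 973]

# Sketch only: the pandas frame is effectively reduced to a list of column
# names plus an RDD of plain row tuples, roughly as if built by hand:
rows = zip(names, births)                      # [('Bob', 968), ...]
rdd = sc.parallelize(rows)

# createDataFrame then calls rdd.first() to check the row shape; that is the
# PythonRDD.runJob call in the traceback, and the point where the Windows run
# loses its Python worker ("Connection reset").
sdf = sqlCtx.createDataFrame(rdd, ['Names', 'Births'])
sdf.show()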



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-createDataframe-error-with-pandas-df-tp22053p22082.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Spark 1.3 createDataframe error with pandas df

Posted by kevindahl <ke...@gmail.com>.
It seems I am still hitting the same issue in other scenarios. Using the
'dylanmei/docker-zeppelin' container, I get the same error as before when
trying to create a Spark DataFrame from a pandas DataFrame.

code:

%pyspark
import pandas as pd

names = ['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]
BabyDataSet = zip(names,births)
df = pd.DataFrame(data = BabyDataSet, columns=['Names', 'Births'])
rdf = sqlc.createDataFrame(df)

result:

(<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.\n', JavaObject id=o49), <traceback object at 0x7f0c819dce60>)
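
In case the full stack trace is more useful than that shortened tuple, here is
a small sketch (plain Python, nothing Zeppelin-specific; sqlc as in the snippet
above) that wraps the call and prints the whole traceback:

%pyspark
import traceback
import pandas as pd

names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
births = [968, 155, 77, 578, 973]
df = pd.DataFrame(data=zip(names, births), columns=['Names', 'Births'])

try:
    rdf = sqlc.createDataFrame(df)
    print(rdf.collect())
except Exception:
    # Print the full Python-side traceback instead of the exception tuple;
    # the Java-side stack trace is normally embedded in the Py4JJavaError
    # message itself.
    traceback.print_exc()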



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-createDataframe-error-with-pandas-df-tp22053p22809.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Spark 1.3 createDataframe error with pandas df

Posted by Davies Liu <da...@databricks.com>.
On Mon, Mar 16, 2015 at 6:23 AM, kevindahl <ke...@gmail.com> wrote:
> kevindahl wrote
>> I'm trying to create a spark data frame from a pandas data frame, but for
>> even the most trivial of datasets I get an error along the lines of this:
>>
>> Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
>> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 7, localhost): java.net.SocketException: Connection reset

The Python worker process crashed; do you have any logging for this?
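
One quick way to narrow this down, as a sketch (it does not depend on pandas
at all): every PySpark action goes through the same Python worker, so a
trivial job should hit the same "Connection reset" if the worker itself is
dying:

# Exercises the same Python worker path as the rdd.first() inside
# createDataFrame: if this also fails, the worker is crashing for any job;
# if it succeeds, the problem is specific to the pandas conversion.
print(sc.parallelize(range(100)).map(lambda x: x * 2).count())

# In local mode the worker's stderr is usually interleaved with the driver
# console output; with a standalone master it normally ends up under the
# executor's work/ directory.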


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org