Posted to user@spark.apache.org by chocjy <ji...@gmail.com> on 2014/11/15 19:06:49 UTC
using zip gets EOFError
I was trying to zip one RDD with another RDD. I store my matrix in HDFS and load it as

    Ab_rdd = sc.textFile('data/Ab.txt', 100)

If I then do

    idx = sc.parallelize(range(m), 100)  # m is the number of records in Ab.txt
    print matrix_Ab.matrix.zip(idx).first()

I get the error below. If I instead store my matrix (Ab.txt) locally and use sc.parallelize to create the RDD, the error doesn't appear. Does anyone know what's going on? Thanks!
Traceback (most recent call last):
  File "/home/jiyan/randomized-matrix-algorithms/spark/src/l2_exp.py", line 51, in <module>
    print test_obj.execute_l2(matrix_Ab,A,b,x_opt,f_opt)
  File "/home/jiyan/randomized-matrix-algorithms/spark/src/test_l2.py", line 35, in execute_l2
    ls.fit()
  File "/home/jiyan/randomized-matrix-algorithms/spark/src/least_squares.py", line 23, in fit
    x = self.projection.execute(self.matrix_Ab, 'solve')
  File "/home/jiyan/randomized-matrix-algorithms/spark/src/projections.py", line 26, in execute
    PA = self.__project(matrix, lim)
  File "/home/jiyan/randomized-matrix-algorithms/spark/src/projections.py", line 50, in __project
    print matrix.zip_with_index(self.sc).first()
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/rdd.py", line 881, in first
    return self.take(1)[0]
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/rdd.py", line 868, in take
    iterator = mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator()
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o37.collectPartitions.
: java.lang.ClassCastException: [B cannot be cast to java.lang.String
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$4.apply(PythonRDD.scala:321)
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$4.apply(PythonRDD.scala:319)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:319)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:203)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
        at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:177)
PySpark worker failed with exception:
Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/worker.py", line 73, in main
    command = pickleSer._read_with_length(infile)
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/serializers.py", line 142, in _read_with_length
    length = read_int(stream)
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/serializers.py", line 337, in read_int
    raise EOFError
EOFError
14/11/15 00:36:17 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/worker.py", line 73, in main
    command = pickleSer._read_with_length(infile)
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/serializers.py", line 142, in _read_with_length
    length = read_int(stream)
  File "/opt/cloudera/parcels/CDH-5.1.3-1.cdh5.1.3.p0.12/lib/spark/python/pyspark/serializers.py", line 337, in read_int
    raise EOFError
EOFError
        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:118)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:148)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:81)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:574)
        at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:559)
Caused by: java.lang.ClassCastException: [B cannot be cast to java.lang.String
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$4.apply(PythonRDD.scala:321)
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$4.apply(PythonRDD.scala:319)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:319)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:203)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
        at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:177)
14/11/15 00:36:17 ERROR PythonRDD: This may have been caused by a prior exception:
java.lang.ClassCastException: [B cannot be cast to java.lang.String
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$4.apply(PythonRDD.scala:321)
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$4.apply(PythonRDD.scala:319)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:319)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:203)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
        at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:177)
14/11/15 00:36:17 INFO DAGScheduler: Failed to run first at /home/jiyan/randomized-matrix-algorithms/spark/src/projections.py:50
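Looking at the ClassCastException ([B cannot be cast to java.lang.String), my guess is that the two sides of the zip reach the JVM in different forms: textFile produces string records while parallelize produces pickled byte arrays, but I'm not certain of that. As a possible workaround I'm considering letting Spark assign the row indices itself rather than zipping against a separate parallelize(range(m), 100) RDD. Below is a minimal sketch of that idea; it assumes RDD.zipWithIndex is available in the Spark version bundled with CDH 5.1.3, and the SparkContext setup and 'data/Ab.txt' path are just placeholders taken from my example above.

    from pyspark import SparkContext

    sc = SparkContext(appName="zip_eoferror_workaround")  # hypothetical app name

    # Load the matrix rows from HDFS exactly as before.
    Ab_rdd = sc.textFile('data/Ab.txt', 100)

    # zip requires both RDDs to have the same number of partitions and the same
    # number of elements in each partition; zipWithIndex sidesteps having to
    # construct such a matching RDD by hand.
    indexed = Ab_rdd.zipWithIndex()                          # (row_string, index)
    indexed = indexed.map(lambda pair: (pair[1], pair[0]))   # (index, row_string)

    print indexed.first()

If zipWithIndex isn't available in this version, I could also try pushing both RDDs through a no-op map(lambda x: x) before zipping so they take the same serialization path, but I haven't verified whether that actually helps.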