Posted to user@spark.apache.org by umeshdangat <um...@gmail.com> on 2014/07/22 01:10:24 UTC
unable to create rdd with pyspark newAPIHadoopRDD
Hello,
I have only just started playing around with Spark to see if it fits my
needs. I was trying to read some data from Elasticsearch as an RDD so that
I could run some Python-based analytics on it. So far I am unable to create
the RDD object; the call fails with a serialization error.
I am working off the Spark repo at this commit on master:
abeacffb7bcdfa3eeb1e969aa546029a7b464eaa.
These are the steps I am following, as described in this patch:
https://github.com/apache/spark/pull/455
IPYTHON=1 \
SPARK_CLASSPATH=/Users/umeshdangat/Downloads/elasticsearch-hadoop-2.0.0/dist/elasticsearch-hadoop-mr-2.0.0.jar \
./bin/pyspark
from pyspark import SparkContext
sc = SparkContext('local[2]')
conf = {'es.resource': 'twitter/tweet'}  # index/type
rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat",
                         "org.apache.hadoop.io.NullWritable",
                         "org.elasticsearch.hadoop.mr.LinkedMapWritable",
                         conf=conf)
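For anyone who wants to rerun the call above with a different index, here is
a minimal sketch that packages the same arguments into a helper. The names
es_read_kwargs and read_es_rdd are hypothetical and made up for illustration;
only sc.newAPIHadoopRDD, its keyword names (as they appear in the traceback),
the three class names, and the 'es.resource' key come from the call above. The
'index/type' check is my assumption based on the comment in my snippet. This
only restates the call; it does not address the serialization error.

```python
# Class names copied from the call above.
ES_INPUT_FORMAT = "org.elasticsearch.hadoop.mr.EsInputFormat"
KEY_CLASS = "org.apache.hadoop.io.NullWritable"
VALUE_CLASS = "org.elasticsearch.hadoop.mr.LinkedMapWritable"


def es_read_kwargs(resource, extra_conf=None):
    """Build keyword arguments for sc.newAPIHadoopRDD (hypothetical helper).

    resource is expected to look like 'index/type' (an assumption taken from
    the comment in the original snippet). extra_conf is an optional dict of
    additional configuration entries that is merged over the defaults.
    """
    index, sep, doc_type = resource.partition("/")
    if not (index and sep and doc_type):
        raise ValueError("expected 'index/type', got %r" % (resource,))
    conf = {"es.resource": resource}
    conf.update(extra_conf or {})
    return {
        "inputFormatClass": ES_INPUT_FORMAT,
        "keyClass": KEY_CLASS,
        "valueClass": VALUE_CLASS,
        "conf": conf,
    }


def read_es_rdd(sc, resource, extra_conf=None):
    """Call sc.newAPIHadoopRDD with the arguments built above."""
    return sc.newAPIHadoopRDD(**es_read_kwargs(resource, extra_conf))
```

In the pyspark shell this would be used as rdd = read_es_rdd(sc, 'twitter/tweet'),
which should issue the same call as the snippet above.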
Stack Trace:
Py4JJavaError Traceback (most recent call last)
/Users/umeshdangat/Documents/spark/<ipython-input-4-ee964756398b> in <module>()
----> 1 rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat", "org.apache.hadoop.io.NullWritable", "org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)
/Users/umeshdangat/Documents/spark/python/pyspark/context.pyc in newAPIHadoopRDD(self, inputFormatClass, keyClass, valueClass, keyConverter, valueConverter, conf)
    426 jconf = self._dictToJavaMap(conf)
    427 jrdd = self._jvm.PythonRDD.newAPIHadoopRDD(self._jsc, inputFormatClass, keyClass,
--> 428 valueClass, keyConverter, valueConverter, jconf)
    429 return RDD(jrdd, self, PickleSerializer())
    430
/Users/umeshdangat/Documents/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    535 answer = self.gateway_client.send_command(command)
    536 return_value = get_return_value(answer, self.gateway_client,
--> 537 self.target_id, self.name)
    538
    539 for temp_arg in temp_args:
/Users/umeshdangat/Documents/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298 raise Py4JJavaError(
    299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
    301 else:
    302 raise Py4JError(
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2.0 in stage 1.0 (TID 2) had a not serializable result: scala.collection.convert.Wrappers$MapWrapper
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1045)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1029)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1027)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1027)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:632)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:632)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:632)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1230)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)