You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by umeshdangat <um...@gmail.com> on 2014/07/22 01:10:24 UTC

unable to create rdd with pyspark newAPIHadoopRDD

Hello,
 I have only just started playing around with spark to see if it fits my
needs. I was trying to read some data from elasticsearch as an rdd, so that
I could perform some python based analytics on it. I am unable to create the
rdd object as of now, failing with a serialization error.

Working of spark repo commit tag in master:
abeacffb7bcdfa3eeb1e969aa546029a7b464eaa.

Steps I am doing as mentioned in patch:
https://github.com/apache/spark/pull/455

IPYTHON=1
SPARK_CLASSPATH=/Users/umeshdangat/Downloads/elasticsearch-hadoop-2.0.0/dist/elasticsearch-hadoop-mr-2.0.0.jar
./bin/pyspark

from pyspark import SparkContext
sc = SparkContext('local[2]')
conf = {'es.resource': 'twitter/tweet'} #index/type
rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat",
"org.apache.hadoop.io.NullWritable",
"org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)

Stack Trace:
Py4JJavaError                             Traceback (most recent call last)
/Users/umeshdangat/Documents/spark/<ipython-input-4-ee964756398b> in
<module>()
----> 1 rdd =
sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat",
"org.apache.hadoop.io.NullWritable",
"org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)

/Users/umeshdangat/Documents/spark/python/pyspark/context.pyc in
newAPIHadoopRDD(self, inputFormatClass, keyClass, valueClass, keyConverter,
valueConverter, conf)
    426         jconf = self._dictToJavaMap(conf)
    427         jrdd = self._jvm.PythonRDD.newAPIHadoopRDD(self._jsc,
inputFormatClass, keyClass,
--> 428                                                    valueClass,
keyConverter, valueConverter, jconf)
    429         return RDD(jrdd, self, PickleSerializer())
    430 

/Users/umeshdangat/Documents/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py
in __call__(self, *args)
    535         answer = self.gateway_client.send_command(command)
    536         return_value = get_return_value(answer, self.gateway_client,
--> 537                 self.target_id, self.name)
    538 
    539         for temp_arg in temp_args:

/Users/umeshdangat/Documents/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py
in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task
2.0 in stage 1.0 (TID 2) had a not serializable result:
scala.collection.convert.Wrappers$MapWrapper
	at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1045)
	at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1029)
	at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1027)
	at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1027)
	at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:632)
	at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:632)
	at scala.Option.foreach(Option.scala:236)
	at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:632)
	at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1230)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
	at akka.actor.ActorCell.invoke(ActorCell.scala:456)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
	at akka.dispatch.Mailbox.run(Mailbox.scala:219)
	at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)





--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/unable-to-create-rdd-with-pyspark-newAPIHadoopRDD-tp10358.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.