Posted to user@spark.apache.org by pkphlam <pk...@gmail.com> on 2015/08/23 02:47:12 UTC

pickling error with PySpark and Elasticsearch-py analyzer

Reposting my question from SO:
http://stackoverflow.com/questions/32161865/elasticsearch-analyze-not-compatible-with-spark-in-python

I'm using the elasticsearch-py client within PySpark (Python 3) and I'm
running into a problem calling the analyze() function on ES against an RDD.
In particular, each record in my RDD is a string of text, and I'm trying to
analyze it to extract the token information, but I get an error when I call
analyze() inside a map function in Spark.

For example, this works perfectly fine:

>> from elasticsearch import Elasticsearch
>> es = Elasticsearch()
>> t = 'the quick brown fox'
>> es.indices.analyze(text=t)['tokens'][0]

{'end_offset': 3,
 'position': 1,
 'start_offset': 0,
 'token': 'the',
 'type': '<ALPHANUM>'}

However, when I try this:

>> trdd = sc.parallelize(['the quick brown fox'])
>> trdd.map(lambda x: es.indices.analyze(text=x)['tokens'][0]).collect()

I get a very long error message related to pickling (here's the end of it):

... in dump(self, obj)
    109             if 'recursion' in e.args[0]:
    110                 msg = """Could not pickle object as excessively deep recursion required."""
--> 111                 raise pickle.PicklingError(msg)
    112
    113     def save_memoryview(self, obj):

PicklingError: Could not pickle object as excessively deep recursion required.


I'm not sure what the error means. Am I doing something wrong? Is there a
way to map the ES analyze function onto records of an RDD?
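
One thing I wondered: is the problem that the lambda's closure captures the
es client object itself, which then can't be pickled when Spark ships the
function to the workers? If so, would the right approach be to create the
client on each worker instead, e.g. inside mapPartitions? Rough, untested
sketch (assumes each worker can reach the ES node, and reuses my existing
SparkContext sc):

from elasticsearch import Elasticsearch

def analyze_partition(texts):
    # Create the client on the worker so it is never pickled
    # as part of the closure shipped from the driver.
    es = Elasticsearch()  # assumes ES is reachable from each worker
    for t in texts:
        yield es.indices.analyze(text=t)['tokens'][0]

trdd = sc.parallelize(['the quick brown fox'])
trdd.mapPartitions(analyze_partition).collect()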






--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pickling-error-with-PySpark-and-Elasticsearch-py-analyzer-tp24402.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org