Posted to user@spark.apache.org by Fabian Böhnlein <fa...@gmail.com> on 2016/02/23 10:45:14 UTC
PySpark Pickle reading does not find module
Hi all,
how can I make a module/class visible when reading an RDD with
sc.pickleFile? The module seems to be missing from the environment even
after importing it in the driver's PySpark context.
The module is available when writing, but reading the file in a new
SparkContext (different from the one that wrote it) fails, even though
the imports are identical in both. Any ideas how to make the module
visible apart from the global import?
How I create it:
from scipy.sparse import csr, csr_matrix
import numpy as np

def get_csr(y):
    ...
    ..
    return csr_matrix(data, (row, col))
rdd = rdd1.map(lambda x: get_csr(x))
rdd.take(2)
[<1x150498 sparse matrix of type '<type 'numpy.float64'>' with 62 stored elements in Compressed Sparse Row format>,
<1x150498 sparse matrix of type '<type 'numpy.float64'>' with 84 stored elements in Compressed Sparse Row format>]
rdd.saveAsPickleFile(..)
Reading in a new SparkContext causes a "No module named
scipy.sparse.csr" error (see the traceback below).
Loading the file in the same SparkContext where it was written works.
PYTHONPATH is set on all workers to the same local Anaconda
distribution, and the local Anaconda of the particular worker that
raises the error definitely has the module available.
  File "/usr/local/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/usr/local/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
    return pickle.loads(obj)
ImportError: No module named scipy.sparse.csr
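For what it's worth, the mechanism behind the ImportError above can be reproduced without Spark: pickle records only the dotted module path and class name of each object, and unpickling re-imports that module on the reading side. A minimal stdlib-only sketch (the "mymod" module name is invented for illustration, standing in for scipy.sparse.csr):

```python
import pickle
import sys
import types

# Fabricate a module "mymod" that is importable while writing,
# mimicking scipy.sparse.csr being available in the writing context.
mod = types.ModuleType("mymod")

class Point:
    pass

Point.__module__ = "mymod"  # pickle will record the path "mymod.Point"
mod.Point = Point
sys.modules["mymod"] = mod

blob = pickle.dumps(Point())  # succeeds: mymod is importable here

# Simulate the reading executor, where the module cannot be imported.
del sys.modules["mymod"]

err = None
try:
    pickle.loads(blob)  # re-imports "mymod" and fails
except ImportError as e:
    err = e
print(err)
```

So the unpickling worker process itself must be able to import scipy.sparse.csr at read time, regardless of what the driver has imported.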
Thanks,
Fabian