Posted to user@spark.apache.org by Fabian Böhnlein <fa...@gmail.com> on 2016/02/23 10:45:14 UTC

PySpark Pickle reading does not find module

Hi all,

how can I make a module/class visible to sc.pickleFile? The environment 
seems to be missing it even after the module has been imported in the 
driver's PySpark context.

The module is available when writing, but reading the file in a 
different SparkContext than the one that wrote it fails, even though 
the imports are identical in both. Any ideas how I can point Spark to 
the module apart from the global import?

How I create it:

from scipy.sparse import csr, csr_matrix
import numpy as np

def get_csr(y):
    ...
    ..
    return csr_matrix((data, (row, col)))

rdd = rdd1.map(lambda x: get_csr(x))

rdd.take(2)
[<1x150498 sparse matrix of type '<type 'numpy.float64'>' with 62 stored elements in Compressed Sparse Row format>,
<1x150498 sparse matrix of type '<type 'numpy.float64'>' with 84 stored elements in Compressed Sparse Row format>]

rdd.saveAsPickleFile(..)
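
For completeness: reading the file back in the same SparkContext that 
wrote it works as expected. Roughly (the path is just a placeholder for 
the one I pass to saveAsPickleFile above):

# same SparkContext that ran saveAsPickleFile
loaded = sc.pickleFile("<pickle-path>")   # placeholder path
loaded.take(2)                            # returns the csr_matrix rows fine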


Reading in a new SparkContext raises an ImportError: No module named 
scipy.sparse.csr (full traceback below), whereas loading the file in the 
SparkContext that wrote it works, as shown above.
The PYTHONPATH on all workers points to the same local Anaconda 
distribution, and the Anaconda installation on the particular worker 
that raises the error definitely has the module available.
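
The failing read in the new session looks roughly like this (again with 
a placeholder path, and with the same imports done up front):

# new driver / new SparkContext, same imports as in the writing job
from scipy.sparse import csr, csr_matrix
import numpy as np

loaded = sc.pickleFile("<pickle-path>")   # placeholder path
loaded.take(2)                            # -> ImportError below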

  File "/usr/local/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/usr/local/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
    return pickle.loads(obj)
ImportError: No module named scipy.sparse.csr



Thanks,
Fabian