Posted to user@spark.apache.org by Naveen Swamy <mn...@gmail.com> on 2017/10/11 23:01:05 UTC
PySpark pickling behavior
Hello fellow users,
1) Is there documentation or are there guidelines for understanding in
which situations PySpark decides to pickle the functions I pass to the
map method?
2) Are there best practices for avoiding pickling when sharing variables,
etc.? I have objects that I want to pass to the map methods; however, those
methods use C++ libraries underneath, so PySpark tries to pickle the
entire object and fails while doing so.
I tried using broadcast variables, but the moment I change my function to
take additional parameters that must be passed through map, Spark builds a
closure object around them and tries to serialize that.
Now I can probably create a dummy function that only shares the
variables and initializes the model locally, and chain that into the map
pipeline, but it would be pretty awkward if I have to resort to that.
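For what it's worth, the local-initialization idea can be sketched without any dummy chaining: keep the heavy object out of the closure entirely by holding it in a module-level cache and loading it lazily on first call inside each worker process, so the mapped function captures only picklable arguments. This is only a sketch; `load_heavy_model` is a hypothetical placeholder for the real native-library load:

```python
import pickle

def load_heavy_model(args):
    """Hypothetical placeholder for the real (unpicklable) model load."""
    class _M:
        def predict(self, x):
            return x.upper()
    return _M()

_model = None  # per-process cache; never shipped to workers

def get_model(args):
    """Load the model once per worker process, on first use."""
    global _model
    if _model is None:
        _model = load_heavy_model(args)
    return _model

def predict(x, args):
    # Only x and args travel through pickle; the model stays local.
    return get_model(args).predict(x)

# A top-level function pickles by reference, so Spark can ship it.
blob = pickle.dumps(predict)
```

The same pattern works with mapPartitions, loading the model once per partition instead of once per record.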
Here is my situation in code:
class Model(object):
    __metaclass__ = Singleton
    model_loaded = False
    mod = None

    @staticmethod
    def load(args):
        # load model
        Model.model_loaded = True

    @staticmethod
    def predict(input, args):
        if not Model.model_loaded:
            Model.load(args)
        return Model.mod.predict(input)

def spark_main():
    args = parse_args()
    lines = read()
    rdd = sc.parallelize(lines)
    rdd = rdd.map(lambda x: Model.predict(x, args))  # fails here with:
    # pickle.PicklingError: Could not serialize object: TypeError: can't
    # pickle thread.lock objects
Thanks, Naveen