Posted to user@spark.apache.org by Naveen Swamy <mn...@gmail.com> on 2017/10/11 23:01:05 UTC

PySpark pickling behavior

Hello fellow users,

1) I am wondering if there is documentation or guidelines to understand in
what situations PySpark decides to pickle the functions I use in the map
method (a small example of what I mean follows this list).
2) Are there best practices for avoiding pickling, for sharing variables, etc.?

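For what it's worth, my current understanding is that PySpark serializes
(via cloudpickle) the function passed to map together with everything it
closes over, so a closure over plain data is fine while a closure over an
object holding unpicklable state (a thread lock, a native handle) blows up.
A tiny illustration of what I mean (a made-up example, not my real code):

from pyspark import SparkContext
import threading

sc = SparkContext.getOrCreate()
rdd = sc.parallelize([1, 2, 3])

scale = {"factor": 2}
# Fine: the lambda only closes over plain, picklable data.
print(rdd.map(lambda x: x * scale["factor"]).collect())

class Holder(object):
    def __init__(self):
        self.lock = threading.Lock()   # a thread lock is not picklable

h = Holder()
# Fails with pickle.PicklingError when the job runs, because the closure
# drags the Holder instance (and its lock) along with the function.
rdd.map(lambda x: (x, h)).collect()
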
I have a situation where I want to pass methods to map; however, those
methods use C++ libraries underneath, and PySpark decides to pickle the
entire object and fails when trying to do that.

I tried to use broadcast, but the moment I change my function to take
additional parameters that must be passed through map, Spark decides to
create an object and tries to serialize that (roughly as in the snippet
below).

Now I can probably create a dummy function that just shares the variables
and initializes the model locally, and chain that to the map method
(sketched below), but I think it would be pretty awkward if I have to
resort to that.
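
The workaround I have in mind is roughly this sketch (load_cpp_model stands
in for the real loader; the idea is to initialize the model lazily on each
executor so nothing heavyweight has to be pickled on the driver):

def predict_partition(rows, args):
    # Build the C++-backed model here, on the executor; the driver only
    # pickles this small function and the plain args.
    model = load_cpp_model(args)        # placeholder for the real loader
    for row in rows:
        yield model.predict(row)

args = parse_args()
rdd = sc.parallelize(read())
results = rdd.mapPartitions(lambda rows: predict_partition(rows, args))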

Here is my situation in code:

class Model(object):
    __metaclass__ = Singleton
    model_loaded = False
    mod = None

    @staticmethod
    def load(args):
        # load the model (C++-backed) into Model.mod
        ...

    @staticmethod
    def predict(input, args):
        if not Model.model_loaded:
            Model.load(args)
        return Model.mod.predict(input)

def spark_main():
    args = parse_args()
    lines = read()
    rdd = sc.parallelize(lines)
    # fails here with:
    # pickle.PicklingError: Could not serialize object: TypeError:
    #     can't pickle thread.lock objects
    rdd = rdd.map(lambda x: Model.predict(x, args))

Thanks, Naveen