Posted to dev@spark.apache.org by Justin Uang <ju...@gmail.com> on 2015/11/03 22:17:49 UTC

Re: Pickle Spark DataFrame

Is the Manager a Python multiprocessing manager? Why are you using
parallelism in Python when, in theory, most of the heavy lifting is done
by Spark?
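
If the goal is only to look a DataFrame up by id from the driver, one
option is to keep the live object in an ordinary in-process dict and put
just its key through pickle. Below is a minimal sketch of that idea, not a
tested Spark recipe. It assumes everything runs in the driver process, and
FakeDataFrame, register and lookup are hypothetical names; FakeDataFrame
stands in for a DataFrame by holding a handle that pickle rejects.

```python
import pickle
import threading

# Driver-local registry: key -> live object. Its values are never pickled.
_REGISTRY = {}


class FakeDataFrame:
    """Hypothetical stand-in for a Spark DataFrame with an unpicklable handle."""

    def __init__(self, name):
        self.name = name
        self._handle = threading.Lock()  # lock objects cannot be pickled


def register(key, obj):
    """Store the live object locally and return the key to pass around."""
    _REGISTRY[key] = obj
    return key


def lookup(key):
    """Return the live object for a key; only valid in the registering process."""
    return _REGISTRY[key]


df = FakeDataFrame("events")

# Pickling the object itself fails, which is what a pickling container runs into.
try:
    pickle.dumps(df)
    picklable = True
except (TypeError, pickle.PicklingError):
    picklable = False

# Pickling only the key works, and the key resolves back to the same object.
key = register("id", df)
restored_key = pickle.loads(pickle.dumps(key))
same_object = lookup(restored_key) is df
```

The key survives the pickle round trip, but lookup only resolves it in the
process that registered the object, so this does not hand a DataFrame to
another Python process.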

On Wed, Oct 28, 2015 at 4:27 PM agg212 <ag...@cs.brown.edu> wrote:

> I would just like to be able to put a Spark DataFrame in a manager.dict()
> and be able to get it out (manager.dict() calls pickle on the object being
> stored).  Ideally, I would just like to store a pointer to the DataFrame
> object so that it remains distributed within Spark (i.e., not materialize
> and then store).  Here is an example:
>
> from multiprocessing import Manager
>
> data = sqlContext.read.json(data_file) #load file (assumes an existing SQLContext)
> cache = Manager().dict() #shared container; pickles the values stored in it
> cache['id'] = data #store reference to data, not materialized result
> new_data = cache['id'] #get reference to distributed spark dataframe
> new_data.show()