Posted to user@spark.apache.org by Daedalus <tu...@gmail.com> on 2014/06/23 07:34:57 UTC

Persistent Local Node variables

*TL;DR:* I want to run a pre-processing step on the data from each partition
(such as parsing) and retain the parsed object on each node for future
processing calls to avoid repeated parsing.

/More detail:/

I have a server and two nodes in my cluster, with the data partitioned on HDFS.
I am trying to use Spark to process the data and send back results.

The data is available as text, and I would like to first parse this text,
and then run further processing on it.
To do this, I call a simple:
rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
	@Override
	public void call(Iterator<String> i) {
		ParsedData p = new ParsedData(i);
	}
});

I would like to retain this ParsedData object on each node for future
processing calls, so as to avoid parsing all over again. So in my next call,
I'd like to do something like this:

rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
	@Override
	public void call(Iterator<String> i) {
		// refer to the previously created ParsedData object
		p.process();
		// accumulate some results
	}
});



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Persistent-Local-Node-variables-tp8104.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Persistent Local Node variables

Posted by Mayur Rustagi <ma...@gmail.com>.
Are you trying to process the data as part of the same job (i.e., within the
same SparkContext)? Then all you have to do is cache the output RDD of your
processing. Spark will run your processing once and cache the results for
future tasks, unless the node caching the RDD goes down.
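In code, the cache approach might look roughly like this (a sketch against the 2014-era Java API; `textRdd`, `ParsedData`, and `process()` stand in for the original poster's own names, and one ParsedData is kept per partition):

```java
import java.util.Collections;
import java.util.Iterator;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.VoidFunction;

// Parse each partition once, producing one ParsedData object per partition,
// then cache the resulting RDD so later jobs reuse the parsed form.
JavaRDD<ParsedData> parsed = textRdd.mapPartitions(
        new FlatMapFunction<Iterator<String>, ParsedData>() {
            @Override
            public Iterable<ParsedData> call(Iterator<String> lines) {
                return Collections.singletonList(new ParsedData(lines));
            }
        });
parsed.cache(); // parsing runs once; results stay in memory on each node

// Later processing calls hit the cached objects instead of re-parsing:
parsed.foreach(new VoidFunction<ParsedData>() {
    @Override
    public void call(ParsedData p) {
        p.process();
    }
});
```

Note this is also essentially the mapPartitions approach asked about below: the parsed objects become an RDD in their own right, so caching them is the natural way to "pin" them on the nodes.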
If you are trying to retain it for quite a long time, you can:

   - Simplistically, store it on HDFS and load it each time.
   - Or store it in a table and pull it with Spark SQL every time
   (experimental).
   - Use the Ooyala JobServer to cache the data and do all processing
   through it.
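For the store-on-HDFS option, one possible shape (a sketch only; the HDFS path is hypothetical, `parsed` is the RDD of parsed objects from the earlier step, and `ParsedData` would need to implement `java.io.Serializable`) is to write the parsed RDD out as object files and reload it in a later job:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Persist the parsed RDD to HDFS as serialized objects...
parsed.saveAsObjectFile("hdfs://namenode:9000/cache/parsed");

// ...and in a later job (even under a new SparkContext), load it back
// instead of re-parsing the raw text:
JavaRDD<ParsedData> reloaded =
        sc.objectFile("hdfs://namenode:9000/cache/parsed");
```

This trades Java serialization cost on write/read against re-running the parser each time, which is usually a win only when parsing is expensive.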

Regards
Mayur


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Mon, Jun 23, 2014 at 11:14 AM, Daedalus <tu...@gmail.com>
wrote:

> Will using mapPartitions and creating a new RDD of ParsedData objects avoid
> multiple parsing?

Re: Persistent Local Node variables

Posted by Daedalus <tu...@gmail.com>.
Will using mapPartitions and creating a new RDD of ParsedData objects avoid
multiple parsing?


