Posted to user@spark.apache.org by zh8788 <78...@qq.com> on 2014/11/24 02:41:37 UTC

How to keep a local variable in each cluster?

Hi,

I am new to Spark; this is my first post here. I am trying to implement the
ADMM optimization algorithm for Lasso/SVM, and I have run into a problem:

Since the training data (label, feature) is large, I created an RDD for it and
cached it in memory. ADMM then needs to keep local parameters (u, v), which
are different for each partition. In each iteration, I need to use the
training data (only the part on that partition) together with u and v to
compute new values for u and v.

Question 1:

One way is to zip (training data, u, v) into one RDD and update it in each
iteration. But the training data is large and never changes, while only u and
v (which are small) change each iteration. If I zip the three together,
caching that RDD is useless, since it changes every iteration; and if I don't
cache it, the training data is recomputed every iteration. How can I avoid
that?

Question 2:

Related to Question 1: the online documentation says that an RDD we don't
cache will not stay in memory. Since RDD operations are evaluated lazily, I am
confused about when a previously computed RDD is actually in memory.

Case 1:

B = A.map(function1)
B.collect()    # This forces B to be computed?  And afterwards the node
               # releases B, since it is not cached?
D = B.map(function3)
D.collect()

Case 2:

B = A.map(function1)
D = B.map(function3)
D.collect()

Case 3:

B = A.map(function1)
C = A.map(function2)
D = B.map(function3)
D.collect()
 
In which of these cases is B in memory on each node when I compute D?

Question 3:

Can I write a function that operates on two RDDs?

E.g.  newfun(rdd1, rdd2)
# rdd1 is large and does not change for the whole run (the training data), so I can cache it
# rdd2 is small and changes in each iteration (u, v)


Question 4:

Are there other ways to solve this kind of problem? I think it is a common
problem, but I could not find any good solutions.


Thanks a lot

Han
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-keep-a-local-variable-in-each-cluster-tp19604.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: How to keep a local variable in each cluster?

Posted by zh8788 <78...@qq.com>.
 Any comments?





Re: How to keep a local variable in each cluster?

Posted by Yanbo <ya...@gmail.com>.

Sent from my iPad

> On Nov 24, 2014, at 9:41 AM, zh8788 <78...@qq.com> wrote:
>
> [...] In each iteration, I need to use the training data (only the part on
> that partition) together with u and v to compute new values for u and v.
>
RDD has a transformation named mapPartitions(); it runs your function separately on each partition of the RDD, so per-partition work like updating (u, v) can happen where the data lives.
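For your problem, a minimal sketch of that pattern could look like this (a
sketch only: admm_local_step and the parameter values are placeholders, not a
real solver). Cache the big RDD once, hold the small per-partition (u, v) on
the driver keyed by partition index, and broadcast them each iteration:

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="admm-sketch")

# Large, immutable training data: cache it once, reuse it every iteration.
data = sc.parallelize(range(10000), 8).persist(StorageLevel.MEMORY_ONLY)

def admm_local_step(points, u, v):
    # Placeholder for the real per-partition ADMM update.
    return u + 1.0, v + 1.0

# Small per-partition parameters stay on the driver, keyed by partition id.
params = {i: (0.0, 0.0) for i in range(data.getNumPartitions())}

for _ in range(10):
    b_params = sc.broadcast(params)       # ship the current (u, v) out

    def update(idx, points):
        u, v = b_params.value[idx]
        yield idx, admm_local_step(list(points), u, v)

    # Only the tiny (u, v) pairs travel back; the cached data never moves.
    params = dict(data.mapPartitionsWithIndex(update).collect())

This also answers Question 1: nothing large is ever zipped or shipped around,
and the cached training RDD is reused unchanged every iteration.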
>
> Question 2:
>
> [...] In which of these cases is B in memory on each node when I compute D?
>
In none of the three cases does B stay in memory on its own: without caching, each action recomputes B from A (consecutive narrow transformations are simply pipelined within a stage). If you want a certain RDD stored in memory, use RDD.persist(StorageLevel.MEMORY_ONLY), or the shorthand RDD.cache().
Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion.
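Using the names from your Case 1 (A, function1, function3 as in your example),
a small sketch:

from pyspark import StorageLevel

B = A.map(function1)
B.persist(StorageLevel.MEMORY_ONLY)   # shorthand for this level: B.cache()
B.collect()      # the first action computes B and caches its partitions
D = B.map(function3)
D.collect()      # reads B from the cache instead of recomputing from A
B.unpersist()    # release the memory once B is no longer needed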
> Question 3:
>
> Can I write a function that operates on two RDDs?  E.g. newfun(rdd1, rdd2),
> where rdd1 is the large, cached training data and rdd2 is the small (u, v)
> that changes in each iteration.
Yes, but such a function can only run on the driver; one RDD cannot be used inside a transformation of another. Since rdd2 is small, a common pattern is to collect() it and broadcast the result to the tasks that process rdd1.
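A sketch of that pattern (combine is a placeholder for whatever per-record
logic you need):

def newfun(rdd1, rdd2):
    # Runs on the driver: pull the small RDD down and broadcast it, so
    # tasks over the large cached rdd1 can read it without a shuffle.
    small = rdd2.collect()
    b = rdd1.context.broadcast(small)
    return rdd1.map(lambda x: combine(x, b.value))   # combine: placeholder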
