Posted to user@spark.apache.org by zh8788 <78...@qq.com> on 2014/11/24 02:41:37 UTC
How to keep a local variable in each cluster?
Hi,
I am new to Spark; this is my first post here. I am currently trying to
implement the ADMM optimization algorithm for Lasso/SVM, and I have run into
a problem:
Since the training data (label, feature) is large, I created an RDD and
cached it in memory. ADMM also needs to keep local parameters (u, v), which
are different for each partition. In each iteration I need the training data
on a given partition, together with that partition's u and v, to compute new
values for u and v.
Question 1:
One option is to zip (training data, u, v) into a single RDD and update it
each iteration. But the training data is large and never changes, while only
the small (u, v) change from iteration to iteration. If I zip the three
together, I cannot cache the resulting RDD (since it changes every
iteration); and if I don't cache it, the training data has to be recomputed
every iteration. How can I avoid that?
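One common pattern is to cache the large, static data once and keep the small per-partition parameters outside of it, rebuilding only the parameters each iteration. A minimal plain-Python sketch of that idea (no Spark required; the names and the update rule here are hypothetical, not real ADMM):

```python
# Plain-Python sketch (not Spark code): the large training data is "cached"
# once and never rebuilt, while the small per-partition (u, v) parameters
# live outside it and are replaced each iteration.

training_partitions = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # static, cached once
params = [(0.0, 0.0)] * len(training_partitions)            # small (u, v) per partition

def local_update(data, u, v):
    # placeholder for the real per-partition ADMM step (hypothetical rule)
    mean = sum(data) / len(data)
    return (0.5 * (u + mean), v + 0.1)

for _ in range(3):  # iterations
    # only params is rebuilt; training_partitions is reused untouched
    params = [local_update(d, u, v)
              for d, (u, v) in zip(training_partitions, params)]
```

The point of the sketch: nothing forces you to re-zip the big data with the small parameters, so the expensive structure stays cached while the cheap one is recreated freely.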
Question 2:
Related to Question 1: the online documentation says that if we don't cache
an RDD, it will not stay in memory. Since RDDs are evaluated lazily, I am
confused about when a previously computed RDD is actually in memory.
Case 1:
B = A.map(function1)
B.collect()  # This forces B to be computed? After that, does the node
             # release B, since it is not cached?
D = B.map(function3)
D.collect()
Case 2:
B = A.map(function1)
D = B.map(function3)
D.collect()
Case 3:
B = A.map(function1)
C = A.map(function2)
D = B.map(function3)
D.collect()
In which of these cases is B still in memory on each node when I compute D?
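For intuition, here is a plain-Python analogy (not actual Spark code) of what lazy evaluation implies: without caching, every action replays the whole lineage, so function1 runs again when D is computed, in all three cases:

```python
# Plain-Python analogy for Spark's lazy recomputation: without caching, each
# "action" re-runs the whole lineage from A, so function1 executes again when
# D is materialized.

calls = {"function1": 0}

def function1(x):
    calls["function1"] += 1
    return x * 2

def function3(x):
    return x + 1

A = [1, 2, 3]

# Case 1: "B.collect()" runs function1 once per element ...
B = [function1(x) for x in A]
# ... but D is rebuilt from the lineage, so function1 runs again:
D = [function3(function1(x)) for x in A]

assert calls["function1"] == 6  # 3 calls for B, 3 more for D
```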
Question 3:
Can I use a function to operate on two RDDs?
E.g. Function newfun(rdd1, rdd2)
# rdd1 is large and does not change over time (training data), so I can
# cache it
# rdd2 is small and changes in each iteration (u, v)
Question 4:
Or are there other ways to solve this kind of problem? I think this is a
common problem, but I could not find any good solutions.
Thanks a lot
Han
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-keep-a-local-variable-in-each-cluster-tp19604.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org
Re: How to keep a local variable in each cluster?
Posted by zh8788 <78...@qq.com>.
Any comments?
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-keep-a-local-variable-in-each-cluster-tp19604p19766.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: How to keep a local variable in each cluster?
Posted by Yanbo <ya...@gmail.com>.
Sent from my iPad
> On 2014/11/24 at 9:41 AM, zh8788 <78...@qq.com> wrote:
>
> Hi,
>
> I am new to spark. This is the first time I am posting here. Currently, I
> try to implement ADMM optimization algorithms for Lasso/SVM
> Then I come across a problem:
>
> Since the training data(label, feature) is large, so I created a RDD and
> cached the training data(label, feature ) in memory. Then for ADMM, it
> needs to keep local parameters (u,v) (which are different for each
> partition ). For each iteration, I need to use the training data(only on
> that partition), u, v to calculate the new value for u and v.
>
An RDD has a transformation named mapPartitions(), which runs a function separately on each partition of the RDD (see also mapPartitionsWithIndex() if you need to know which partition you are in).
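As a rough illustration of the idea (plain Python, not actual Spark code; the names are made up), mapPartitionsWithIndex-style processing lets each partition look up its own (u, v):

```python
# Plain-Python simulation of mapPartitionsWithIndex: each partition is
# processed as a whole, so per-partition state like (u, v) can be looked
# up by partition index. Names and the computation are illustrative only.

partitions = [[1.0, 2.0], [3.0, 4.0]]   # cached training data, by partition
uv = {0: (0.0, 1.0), 1: (0.5, 1.5)}     # small per-partition parameters

def process_partition(index, rows):
    u, v = uv[index]                    # fetch this partition's parameters
    return [r * u + v for r in rows]    # placeholder local computation

result = [process_partition(i, p) for i, p in enumerate(partitions)]
```

In real PySpark this would be something like rdd.mapPartitionsWithIndex(process_partition), with the small uv table typically shipped to executors via a broadcast variable.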
> Question1:
>
> One way is to zip (training data, u, v) into a rdd and update it in each
> iteration, but as we can see, training data is large and won't change for
> the whole time, only u, v (is small) are changed in each iteration. If I zip
> these three, I could not cache that rdd (since it changed for every
> iteration). But if did not cache that, I need to reuse the training data
> every iteration, how could I do it?
>
> Question2:
>
> Related to Question1, on the online documents, it said if we don't cache the
> rdd, it will not in the memory. And rdd uses delayed operation, then I am
> confused when can I view a previous rdd in memroy.
>
> Case1:
>
> B = A.map(function1).
> B.collect() #This forces B to be calculated ? After that, the node just
> release B since it is not cached ???
> D = B.map(function3)
> D.collect()
>
> Case2:
> B = A.map(function1).
> D = B.map(function3)
> D.collect()
>
> Case3:
>
> B = A.map(function1).
> C = A.map(function2)
> D = B.map(function3)
> D.collect()
>
> In which case, can I view B is in memory in each cluster when I calculate
> D?
>
If you want a certain RDD stored in memory, use RDD.persist(StorageLevel.MEMORY_ONLY) (cache() is shorthand for this).
Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion.
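The eviction policy itself can be illustrated with a tiny stand-alone LRU cache (this is only an analogy, not Spark's internals):

```python
from collections import OrderedDict

# Tiny LRU-cache analogy: when capacity is exceeded, the least-recently-used
# entry is dropped, just as Spark evicts old cached partitions under memory
# pressure. A miss then means the data must be recomputed from its lineage.

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)      # mark as recently used
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict least recently used

    def get(self, key):
        if key not in self.data:
            return None                     # miss: would force recomputation
        self.data.move_to_end(key)
        return self.data[key]

cache = LRUCache(2)
cache.put("partition-0", "data0")
cache.put("partition-1", "data1")
cache.get("partition-0")                    # touch partition-0
cache.put("partition-2", "data2")           # evicts partition-1, the LRU entry
```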
> Question3:
>
> can I use a function to do operations on two rdds?
Yes, but such a function can only be executed on the driver: you combine the two RDDs with driver-side operations such as zip() or join(), not by referencing one RDD inside a closure that runs on another.
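A plain-Python analogy of such a driver-side function (not actual Spark code; in Spark the element pairing would typically come from rdd1.zip(rdd2) or a join invoked on the driver):

```python
# Plain-Python analogy for a driver-side function over two datasets: the
# large static one and the small per-iteration parameters. Names are
# illustrative only.

def newfun(large_static, small_params):
    # pair up corresponding elements, as rdd1.zip(rdd2) would
    return [x + p for x, p in zip(large_static, small_params)]

rdd1 = [10.0, 20.0, 30.0]   # stands in for the cached training data
rdd2 = [0.1, 0.2, 0.3]      # stands in for the small, changing (u, v)

out = newfun(rdd1, rdd2)
```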
>
> E.g Function newfun(rdd1, rdd2)
> #rdd1 is large and do not change for the whole time (training data), which I
> can use cache
> #rdd2 is small and change in each iteration (u, v )
>
>
> Questions4:
>
> Or are there other ways to solve this kind of problem? I think this is
> common problem, but I could not find any good solutions.
>
>
> Thanks a lot
>
> Han