Posted to user@spark.apache.org by Greg <gr...@zooniverse.org> on 2014/07/31 13:00:15 UTC

understanding use of "filter" function in Spark

Hi, suppose I have some data of the form
k,(x,y)
which are all numbers. For each key value (k) I want to do kmeans clustering
on all corresponding (x,y) points. For each key value I have few enough
points that I'm happy to use a traditional (non-mapreduce) kmeans
implementation. The challenge is that I have a lot of different keys so I
want to use Hadoop/Spark to help split the clustering up over multiple
computers. With Hadoop-streaming and Python the code would be easy:
import sys

pts = []
current_k = None
for line in sys.stdin:
    # streaming input arrives sorted by key, one "k<TAB>x,y" record per line (format assumed)
    k, value = line.rstrip("\n").split("\t")
    x, y = map(float, value.split(","))
    if k == current_k:
        pts.append((x, y))
    else:
        if current_k is not None:
            run_kmeans(pts)  # do kmeans clustering on pts (run_kmeans is a placeholder)
        current_k = k
        pts = [(x, y)]  # start collecting points for the new key

(and obviously run kmeans for the final key as well)
How do I express this in Spark? The function f for both filter and
filterByKey needs to be transitive (all of the examples I've seen are just
adding values). Ideally I'd like to be able to run this iteratively,
changing the number of clusters for kmeans (so Spark would be nice). Given
how easy this is to do in Hadoop, I feel like this should be easy in Spark
as well but I can't figure out a way.

thanks :)



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/understanding-use-of-filter-function-in-Spark-tp11037.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: understanding use of "filter" function in Spark

Posted by Sean Owen <so...@cloudera.com>.
groupByKey will give you a PairRDD where, for each key k, you have an
Iterable over all corresponding (x,y) points. You can then call
mapValues and apply your clustering to each group of points, yielding a
result R. You end up with a PairRDD of (k,R) pairs. This of course
happens in parallel.
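
A minimal PySpark sketch of this approach, assuming the data is already
an RDD of (k, (x, y)) pairs; run_kmeans is a placeholder for whatever
single-machine kmeans implementation you prefer:

from pyspark import SparkContext

def run_kmeans(points):
    # placeholder: call your favourite single-machine kmeans
    # (e.g. scipy or scikit-learn) on the list of (x, y) points
    points = list(points)
    return len(points)  # stand-in for the real clustering result R

sc = SparkContext(appName="per-key-kmeans")

# toy data; in practice load and parse your own (k, (x, y)) records
data = sc.parallelize([(1, (0.0, 0.0)), (1, (1.0, 1.0)), (2, (5.0, 5.0))])

results = (data.groupByKey()            # (k, Iterable[(x, y)])
               .mapValues(run_kmeans))  # (k, R), computed in parallel

print(results.collect())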
