Posted to user@hama.apache.org by Praveen Sripati <pr...@gmail.com> on 2012/04/07 15:19:23 UTC

Canopy Clustering on BSP

Hi,

After Thomas's implementation of K-Means (3), I was motivated to extend it
using Canopy clustering. So, I started looking at the MR implementation
of Canopy, (1) and (2). The MR implementation of Canopy clustering is done
in two MR phases: the first identifies the canopies and the second assigns
the data points to the canopies. I don't see much improvement when this is
done using BSP. Please correct me if I am wrong.

Also, are there any algorithms that can be implemented easily on Hama/BSP
(for those who are getting started with Hama/BSP, like me) and that would
also show some performance improvement compared to the MR implementation?
I have seen Mahout, which has many algorithms implemented, and I would
like to see something similar in Hama.

Thanks,
Praveen

(1) -
http://horicky.blogspot.in/2011/04/k-means-clustering-in-map-reduce.html
(2) - https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering
(3) -
http://codingwiththomas.blogspot.in/2011/12/k-means-clustering-with-bsp-intuition.html
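For reference, the two phases described above can be sketched in a few
lines. This is only a hedged illustration of the idea, not the Mahout
code: the Euclidean distance, the greedy center selection, and the example
t1/t2 values are my assumptions.

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def make_canopies(points, t1, t2):
    """Phase 1: greedily pick canopy centers (the sequential part)."""
    canopies = []
    remaining = list(points)
    while remaining:
        center = remaining.pop(0)
        canopies.append(center)
        # Points within the tight threshold t2 are removed entirely;
        # points outside t2 may still become (or join) other canopies.
        remaining = [p for p in remaining if distance(p, center) > t2]
    return canopies

def assign(points, canopies, t1):
    """Phase 2: assign each point to every canopy within t1 (parallelizable)."""
    return {tuple(p): [i for i, c in enumerate(canopies)
                       if distance(p, c) <= t1]
            for p in points}

points = [(0, 0), (0.5, 0), (5, 5), (5.2, 5.1), (10, 0)]
canopies = make_canopies(points, t1=3.0, t2=1.0)
assignments = assign(points, canopies, t1=3.0)
```

Phase 2 is trivially parallel (each mapper can assign its own split), while
Phase 1's greedy loop is inherently sequential, which is what the rest of
this thread revolves around.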

Re: Canopy Clustering on BSP

Posted by Praveen Sripati <pr...@gmail.com>.
A single reducer gets the centroids from all the mappers and computes the
final centroids from them. So, the reducer shouldn't take much time.


Re: Canopy Clustering on BSP

Posted by Thomas Jungblut <th...@googlemail.com>.
I don't think so. It took me 2 minutes to do the mapping and nearly 18
hours to reduce, with ~80M points in a 30k-dimensional vector space.

Re: Canopy Clustering on BSP

Posted by Praveen Sripati <pr...@gmail.com>.
> If your algorithm has 99% of its time in sequential methods (this is the
> case in canopy clustering) then you won't get a large speed boost when
> parallelizing the remaining 1%.

Each mapper calculates the canopies for its subset of the input data and
sends them to a single reducer. The reducer produces the final canopies
using the thresholds (t1 and t2) and the distance measure. So, canopy
generation is parallelized across the mappers, and the reducer shouldn't
take much time either.

(1) http://horicky.blogspot.in/2011/04/k-means-clustering-in-map-reduce.html
(2) https://cwiki.apache.org/MAHOUT/canopy-clustering.html

Praveen


Re: Canopy Clustering on BSP

Posted by Thomas Jungblut <th...@googlemail.com>.
K-Means is iterative; you can't do anything about that. Syncs are needed,
and they are much more costly in MapReduce than in BSP.
If you follow the Mahout mailing list, Ted Dunning has found a paper on a
k-means variant that runs in a single pass, so there is no iteration there.
[1][2]

> Also, why is global sync very expensive? Is it because all the processes
> have to enter the barrier synchronisation before the next superstep
> starts?


Yes, and if your data is not distributed evenly, or you don't have
homogeneous servers, it becomes even more costly. It is also governed by
Amdahl's Law, which says that your algorithm can only be as fast as its
sequential part. If your algorithm spends 99% of its time in sequential
methods (this is the case in canopy clustering), then you won't get a
large speed boost from parallelizing the remaining 1%.
That is also the reason why communication with a master task (or at least
an algorithm design with a master task) isn't a good idea either: you're
then adding sequential parts and another sync, which makes the whole thing
slow.

But sometimes you cannot avoid sequential code and a master task.

[1] https://github.com/tdunning/knn
[2] http://web.engr.oregonstate.edu/~shindler/papers/FastKMeans_nips11.pdf
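The Amdahl's Law argument above can be made concrete with a one-line
formula (a sketch only; the 99% figure is the estimate from this thread,
not a measurement):

```python
def amdahl_speedup(serial_fraction, n_workers):
    # Maximum overall speedup when a fraction of the work is serial
    # and the rest is divided evenly over n_workers (Amdahl's Law).
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)

# With 99% sequential work, even 1000 workers barely help:
# the speedup is capped near 1/0.99, i.e. about 1.01x.
print(amdahl_speedup(0.99, 1000))

# Compare an algorithm that is only 1% sequential on the same cluster:
# roughly a 91x speedup.
print(amdahl_speedup(0.01, 1000))
```

This is why the greedy canopy-selection step dominates no matter how the
assignment step is parallelized, in BSP or in MapReduce.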




-- 
Thomas Jungblut
Berlin <th...@gmail.com>

Re: Canopy Clustering on BSP

Posted by Praveen Sripati <pr...@gmail.com>.
> It makes sense, since global sync is very expensive.

For any iterative algorithm like k-means, the number of global syncs is
proportional to the number of iterations. So, how does k-means fit into BSP?

Also, why is global sync very expensive? Is it because all the processes
have to enter the barrier synchronisation before the next superstep starts?

Praveen


Re: Canopy Clustering on BSP

Posted by Thomas Jungblut <th...@googlemail.com>.
There are algorithms that have very few supersteps; see the Matrix-Vector
Multiplication in GSoC this year.
It makes sense, since global sync is very expensive.

However, Canopy clustering does not fit very well, since there is a
parallel part and a sequential part.
So MapReduce is a good fit for canopy clustering.
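To make the superstep/sync structure concrete, here is a tiny
single-process sketch of BSP-style iterative k-means on 1-D points. It is
illustrative only: the real Hama BSP API (peers, messages, sync()) is not
shown, and the data and centroids are made up.

```python
def kmeans_superstep(partitions, centroids):
    """One BSP superstep: each peer assigns its local points, then the
    barrier merges the partial sums into new centroids (1-D sketch)."""
    sums = [[0.0, 0] for _ in centroids]  # (partial sum, count) per cluster
    # "Compute" phase: conceptually, each partition runs on its own peer.
    for part in partitions:
        for x in part:
            i = min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
            sums[i][0] += x
            sums[i][1] += 1
    # "Sync" phase: the barrier ends the superstep; the merged partial
    # sums yield the centroids for the next superstep.
    return [s / n if n else centroids[i] for i, (s, n) in enumerate(sums)]

centroids = [0.0, 10.0]
partitions = [[1.0, 2.0], [9.0, 11.0]]  # two peers, two local partitions
for _ in range(5):                      # one global sync per iteration
    centroids = kmeans_superstep(partitions, centroids)
```

One barrier per iteration is unavoidable here, which is exactly the cost
being discussed: BSP makes each sync cheaper than a full MapReduce job
boundary, but it cannot remove the iterations themselves.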



-- 
Thomas Jungblut
Berlin <th...@gmail.com>