You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@mahout.apache.org by Dan Filimon <da...@gmail.com> on 2012/12/03 19:47:08 UTC

BFR clustering algorithm?

For the project,

I've been doing some reading on Machine Learning in general (for
example, there's a neat Caltech course called Learning from Data that
I'm following to get a grip on the theoretical basics).

I decided that I'd read the clustering chapter in that book I sent you
to since it deals specifically with large scale clustering.

I was intrigued to find a description of the Bradley, Fayyad, Reina
(BFR) algorithm which is also designed to cluster data in Euclidean
spaces, in shards.

For each chunk of data, it picks k cluster centroids (maximizing the
distance between the selected points or some other technique).

New points are grouped into these existing clusters as long as their
Mahalanobis distance is less than a certain threshold (which results
in the point being within threshold standard deviations from the
centroid).

The remaining points are clustered into smaller miniclusters with less
stringent constraints with something like hierarchical clustering and
the remaining singleton clusters are outliers.

I was thinking this could just as well be done in Mappers and the
resulting centroids (the outliers and the miniclusters being folded
into the cluster whose centroid they're closest to) could be output to
the Reducer as the same kind of rough clustering we're using Streaming
KMeans for.

My question is... why did we pick streaming k-means in particular as
opposed to this algorithm. BFR seems like a decent candidate for the
mapper clustering and while it looks more complex (algorithmically) I
wonder how the clustering quality compares to streaming k-means?

What are your thoughts on this?

Re: BFR clustering algorithm?

Posted by Ted Dunning <te...@gmail.com>.

There are literally hundreds and hundreds of algorithms for k-means alone.
 That isn't even counting clustering that doesn't optimize k-means figure
of merit.

On Tue, Dec 4, 2012 at 5:05 PM, Dan Filimon <da...@gmail.com>wrote:

> On Tue, Dec 4, 2012 at 10:00 AM, Ted Dunning <te...@gmail.com>
> wrote:
> > I didn't know about BFR at the time and I always tend to choose
> simplicity
> > in any case.
> >
> > The theoretical bounds for streaming k-means are also persuasive.  The
> other
> > strong-ish candidate is k-means++, but it doesn't have the required
> sketch
> > architecture in the form that they have analyzed.
> >
> > BFR is a reasonable candidate for follow-on work, but we should drive to
> > conclusion with the current algorithm first.
>
> We should definitely focus on this algorithm for now. I was just
> surprised to find another one I hadn't known about. :)
>
> > On Mon, Dec 3, 2012 at 6:47 PM, Dan Filimon <dangeorge.filimon@gmail.com
> >
> > wrote:
> >>
> >> My question is... why did we pick streaming k-means in particular as
> >> opposed to this algorithm. BFR seems like a decent candidate for the
> >> mapper clustering and while it looks more complex (algorithmically) I
> >> wonder how the clustering quality compares to streaming k-means?
> >>
> >> What are your thoughts on this?
> >
> >
>

Re: BFR clustering algorithm?

Posted by Dan Filimon <da...@gmail.com>.

On Tue, Dec 4, 2012 at 10:00 AM, Ted Dunning <te...@gmail.com> wrote:
> I didn't know about BFR at the time and I always tend to choose simplicity
> in any case.
>
> The theoretical bounds for streaming k-means are also persuasive.  The other
> strong-ish candidate is k-means++, but it doesn't have the required sketch
> architecture in the form that they have analyzed.
>
> BFR is a reasonable candidate for follow-on work, but we should drive to
> conclusion with the current algorithm first.

We should definitely focus on this algorithm for now. I was just
surprised to find another one I hadn't known about. :)

> On Mon, Dec 3, 2012 at 6:47 PM, Dan Filimon <da...@gmail.com>
> wrote:
>>
>> My question is... why did we pick streaming k-means in particular as
>> opposed to this algorithm. BFR seems like a decent candidate for the
>> mapper clustering and while it looks more complex (algorithmically) I
>> wonder how the clustering quality compares to streaming k-means?
>>
>> What are your thoughts on this?
>
>

Re: BFR clustering algorithm?

Posted by Ted Dunning <te...@gmail.com>.

I didn't know about BFR at the time and I always tend to choose simplicity
in any case.

The theoretical bounds for streaming k-means are also persuasive.  The
other strong-ish candidate is k-means++, but it doesn't have the required
sketch architecture in the form that they have analyzed.

BFR is a reasonable candidate for follow-on work, but we should drive to
conclusion with the current algorithm first.

On Mon, Dec 3, 2012 at 6:47 PM, Dan Filimon <da...@gmail.com>wrote:

> My question is... why did we pick streaming k-means in particular as
> opposed to this algorithm. BFR seems like a decent candidate for the
> mapper clustering and while it looks more complex (algorithmically) I
> wonder how the clustering quality compares to streaming k-means?
>
> What are your thoughts on this?
>