Posted to user@mahout.apache.org by Prabhakar Srinivasan <pr...@gmail.com> on 2013/12/03 18:34:15 UTC

Outlier detection/Pruning

Hello!
Can someone point me to some explanatory documentation for outlier
detection and removal in Mahout clustering? I am unable to understand the
internal mechanism of outlier detection just by reading the Javadoc:
clusterClassificationThreshold Is a clustering strictness / outlier removal
parameter. Its value should be between 0 and 1. Vectors having pdf below
this value will not be clustered.

What does the pdf represent?

Thanks
Prabhakar

Re: Outlier detection/Pruning

Posted by Ted Dunning <te...@gmail.com>.
You should move to 0.8 and explore ball k-means.




On Tue, Dec 3, 2013 at 8:44 PM, Prabhakar Srinivasan <
prabhakar.srinivasan@gmail.com> wrote:

> Hello
> I am using Mahout 0.7 currently and this question is pertaining to that
> version. I am using Canopy clustering (CanopyDriver class)  first to
> determine the optimal number of clusters that best fits the dataset and
> passing that information as parameter to Kmeans clustering (kmeansDriver
> class).
>
> Regards
> Prabhakar
>
>
> On Tue, Dec 3, 2013 at 6:00 PM, Ted Dunning <te...@gmail.com> wrote:
>
> > Can you be more specific about which code you are asking about?
> >
> > The ball k-means implementation provides a capability somewhat like this,
> > but perhaps in a more clearly defined way.
> >
> >
> > On Tue, Dec 3, 2013 at 9:34 AM, Prabhakar Srinivasan <
> > prabhakar.srinivasan@gmail.com> wrote:
> >
> > > Hello!
> > > Can someone point me to some explanatory documentation for Outlier
> > > Detection & Removal in Clustering in Mahout. I am unable to understand
> > > the internal mechanism of outlier detection just by reading the Javadoc:
> > > clusterClassificationThreshold Is a clustering strictness / outlier
> > > removal parameter. Its value should be between 0 and 1. Vectors having
> > > pdf below this value will not be clustered.
> > >
> > > What does the pdf represent?
> > >
> > > Thanks
> > > Prabhakar
> > >
> >
>

Re: Outlier detection/Pruning

Posted by Prabhakar Srinivasan <pr...@gmail.com>.
Hello
I am currently using Mahout 0.7, and this question pertains to that
version. I first use Canopy clustering (the CanopyDriver class) to
determine the number of clusters that best fits the dataset, and then
pass that number as a parameter to k-means clustering (the KMeansDriver
class).
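
For readers unfamiliar with the canopy step mentioned above, here is a
minimal sketch of the textbook canopy procedure used to estimate a cluster
count. It is not Mahout's CanopyDriver code: the class name
CanopyCountSketch, the helper names, and the Euclidean distance are all
assumptions made for illustration. T1 is the loose membership radius and T2
(smaller than T1) is the tight radius inside which a point can no longer
seed a new canopy.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Mahout.
final class CanopyCountSketch {

    // Plain Euclidean distance (an assumption; any distance measure could be used).
    static double distance(double[] a, double[] b) {
        double sq = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sq += d * d;
        }
        return Math.sqrt(sq);
    }

    // Builds canopies greedily. Each canopy is a list whose first element is
    // its seed point. The number of canopies is the estimate for k.
    static List<List<double[]>> canopies(List<double[]> points, double t1, double t2) {
        if (t1 <= t2) {
            throw new IllegalArgumentException("t1 must be larger than t2");
        }
        List<double[]> remaining = new ArrayList<>(points);
        List<List<double[]>> result = new ArrayList<>();
        while (!remaining.isEmpty()) {
            double[] center = remaining.remove(0);
            List<double[]> members = new ArrayList<>();
            members.add(center);
            // Loose membership: every remaining point within t1 joins this canopy.
            for (double[] p : remaining) {
                if (distance(center, p) < t1) {
                    members.add(p);
                }
            }
            // Tight membership: points within t2 cannot seed another canopy.
            remaining.removeIf(p -> distance(center, p) < t2);
            result.add(members);
        }
        return result;
    }
}
```

The size of the returned list would then be the k handed to the k-means
step. The result depends on the order of the points and on the choice of T1
and T2, so it is an estimate rather than an optimum.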

Regards
Prabhakar


On Tue, Dec 3, 2013 at 6:00 PM, Ted Dunning <te...@gmail.com> wrote:

> Can you be more specific about which code you are asking about?
>
> The ball k-means implementation provides a capability somewhat like this,
> but perhaps in a more clearly defined way.
>
>
> On Tue, Dec 3, 2013 at 9:34 AM, Prabhakar Srinivasan <
> prabhakar.srinivasan@gmail.com> wrote:
>
> > Hello!
> > Can someone point me to some explanatory documentation for Outlier
> > Detection & Removal in Clustering in Mahout. I am unable to understand
> > the internal mechanism of outlier detection just by reading the Javadoc:
> > clusterClassificationThreshold Is a clustering strictness / outlier
> > removal parameter. Its value should be between 0 and 1. Vectors having
> > pdf below this value will not be clustered.
> >
> > What does the pdf represent?
> >
> > Thanks
> > Prabhakar
> >
>

Re: Outlier detection/Pruning

Posted by Ted Dunning <te...@gmail.com>.
Can you be more specific about which code you are asking about?

The ball k-means implementation provides a capability somewhat like this,
but perhaps in a more clearly defined way.


On Tue, Dec 3, 2013 at 9:34 AM, Prabhakar Srinivasan <
prabhakar.srinivasan@gmail.com> wrote:

> Hello!
> Can someone point me to some explanatory documentation for Outlier
> Detection & Removal in Clustering in Mahout. I am unable to understand the
> internal mechanism of outlier detection just by reading the Javadoc:
> clusterClassificationThreshold Is a clustering strictness / outlier removal
> parameter. Its value should be between 0 and 1. Vectors having pdf below
> this value will not be clustered.
>
> What does the pdf represent?
>
> Thanks
> Prabhakar
>

Re: Outlier detection/Pruning

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Tue, Dec 3, 2013 at 9:34 AM, Prabhakar Srinivasan <
prabhakar.srinivasan@gmail.com> wrote:

> Hello!
> Can someone point me to some explanatory documentation for Outlier
> Detection & Removal in Clustering in Mahout. I am unable to understand the
> internal mechanism of outlier detection just by reading the Javadoc:
> clusterClassificationThreshold Is a clustering strictness / outlier removal
> parameter. Its value should be between 0 and 1. Vectors having pdf below
> this value will not be clustered.
>
> What does the pdf represent?
>

I don't really know this in the context of the Mahout implementation, but
I'd venture to go out on a limb and say the pdf value is the value of the
probability density function for that data point. In outlier detection one
usually estimates the distribution of the data with some multidimensional
density estimation technique involving kernel functions and then simply
removes the highly improbable values.
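
The general idea described above (estimate a density with kernel functions,
then drop improbable points) can be sketched as follows. This is only an
illustration of the generic technique, not a description of what Mahout
does internally. The class DensityOutlierFilter, the Gaussian kernel, the
fixed bandwidth, and the choice to compare each point's density relative to
the densest point (so that a threshold between 0 and 1 is meaningful) are
all assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Mahout.
final class DensityOutlierFilter {

    // Gaussian kernel density estimate at x. The normalizing constant is
    // omitted because only ratios of densities are used below.
    static double density(double[] x, double[][] data, double bandwidth) {
        double sum = 0.0;
        for (double[] p : data) {
            double sq = 0.0;
            for (int i = 0; i < x.length; i++) {
                double d = x[i] - p[i];
                sq += d * d;
            }
            sum += Math.exp(-sq / (2.0 * bandwidth * bandwidth));
        }
        return sum / data.length;
    }

    // Keeps the points whose density, relative to the densest point,
    // is at least the threshold (expected to lie between 0 and 1).
    static List<double[]> removeOutliers(double[][] data, double bandwidth, double threshold) {
        double[] dens = new double[data.length];
        double max = 0.0;
        for (int i = 0; i < data.length; i++) {
            dens[i] = density(data[i], data, bandwidth);
            max = Math.max(max, dens[i]);
        }
        List<double[]> kept = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            if (dens[i] / max >= threshold) {
                kept.add(data[i]);
            }
        }
        return kept;
    }
}
```

With four points near the origin and one point far away, a moderate
threshold such as 0.5 would keep the four and drop the distant one. How
aggressive the pruning is depends on both the bandwidth and the threshold.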


> Thanks
> Prabhakar
>