
Posted to user@mahout.apache.org by adam35413 <ad...@gmail.com> on 2010/04/13 19:59:02 UTC

Clustering approach

I am doing cluster analysis, and I am interested in using data that
consists of n vectors of m independent variables each.  I have played around
with the examples, but the sample data is a control chart (i.e. each column
in a row is the output of a specific function).

For example, I could have 100 rows with 3 columns each, where each column is
an independent variable (1245, 22, 451.23 for example).  The values have
meaning when taken as a whole, but 1245 does not have any direct correlation
to 22.  I do not know how many clusters there are.  Is there a better
clustering method to use than mean shift?
 
-- 
View this message in context: http://n3.nabble.com/Clustering-approach-tp716743p716743.html
Sent from the Mahout User List mailing list archive at Nabble.com.

Re: Clustering approach

Posted by adam35413 <ad...@gmail.com>.

I do not have a good way of determining the distribution of each
variable, and I expect it to change as I get more data from different
sources.  To my mind, that means there is currently no good
clustering algorithm for this in the Mahout framework.

Thanks for the help!

Re: Clustering approach

Posted by Ted Dunning <te...@gmail.com>.

It really depends on your data and what sort of distribution it has.

Take a quick look at the distribution of each variable.  Look for stuck
values or highly asymmetric distributions.  Most of the Mahout clustering
methods will be pretty sensitive to that.
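
A quick per-column check along these lines can be sketched in plain Python.
This is only an illustration, not part of Mahout: the function name
column_diagnostics and the sample rows are made up, and "stuck values" is
interpreted here as one value taking up a large share of a column.

```python
from collections import Counter
from statistics import fmean, pstdev

def column_diagnostics(rows):
    """Return one dict per column: mean, stdev, skewness, share of the most common value."""
    report = []
    for col in zip(*rows):  # transpose rows into columns
        mean = fmean(col)
        sd = pstdev(col)
        # Population skewness (third standardized moment); 0 for a symmetric column.
        skew = fmean(((x - mean) / sd) ** 3 for x in col) if sd > 0 else 0.0
        # Share of rows holding the single most frequent value (a "stuck" value shows up as a high share).
        top_share = Counter(col).most_common(1)[0][1] / len(col)
        report.append({"mean": mean, "stdev": sd, "skew": skew, "top_share": top_share})
    return report

# Hypothetical sample rows in the shape described in the question (3 columns per row).
sample = [(1245, 22, 451.23), (1250, 22, 450.0), (1240, 22, 9000.0), (1248, 22, 452.1)]
for i, stats in enumerate(column_diagnostics(sample)):
    print(i, stats)
```

A skew far from 0 points to an asymmetric column, and a top_share close to 1
points to a column dominated by a single value.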

If you have a good statistical model of what kind of thing generates your
data, then the Dirichlet Process clustering method might be useful, but I
would start with k-means first.
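
To make the k-means suggestion concrete, here is a minimal sketch of the
textbook Lloyd's algorithm in plain Python.  It is not Mahout's
implementation.  The standardization step is an added assumption, motivated
by columns on very different scales (1245 vs. 22); the function names,
the fixed seed, and the sample rows are all hypothetical.  k still has to be
chosen by hand, for instance by trying several values.

```python
import random
from statistics import fmean, pstdev

def standardize(rows):
    """Rescale each column to zero mean and unit variance (constant columns become 0)."""
    cols = list(zip(*rows))
    means = [fmean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # avoid dividing by zero
    return [tuple((x - m) / s for x, m, s in zip(row, means, sds)) for row in rows]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct rows as starting centroids
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(fmean(c) for c in zip(*members))
    return centroids, labels

# Hypothetical rows: two visibly different groups of 3-column records.
rows = [(1245, 22, 451.23), (1250, 21, 450.0), (1240, 23, 452.0),
        (10, 900, 3.0), (12, 905, 2.5), (11, 898, 3.2)]
centroids, labels = kmeans(standardize(rows), k=2)
print(labels)
```

Without the standardization step, the column with the largest numeric range
would dominate the Euclidean distance.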

Also, you don't say, but it sounds like your data is not that large.  If so,
I would recommend using a general-purpose data analysis system like R rather
than Mahout.  The highly interactive nature of R would allow you to have a
much faster learning experience.
