Posted to user@mahout.apache.org by Dongxiang <zh...@gmail.com> on 2010/09/09 17:11:13 UTC

k-means clustering for high-dimensional numerical data

Hi, I am trying to use Mahout to cluster high-dimensional data.

The data format is very simple. Each data point is a 128-dimensional vector
of double values.

I am puzzled by the input format. From what I have found so far, I think I
need to first convert the data into SequenceFile format and then convert
that into Mahout's vector format. Am I going about this the right way?

I think this kind of application is very common, but I have been unable to
find any step-by-step tutorial or guide.

Can someone tell me the explicit commands to perform such a task? Thanks a
lot.



Re: k-means clustering for high-dimensional numerical data

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
In general, the transformation is as you suggest. You need to produce
a set of input sequence files which contain VectorWritable vectors from
your data. It should be a single-step process.

We don't have as much tutorial material as we could, but the problem of 
transforming your data into Mahout vectors (VectorWritable sequence 
files) is often quite application-specific. We do have an example in the 
synthetic control package which you may be able to use or adapt. Look at 
the InputDriver and InputMapper classes in 
o.a.m.clustering.syntheticcontrol in examples/. These take 
space-separated numeric data such as yours and produce the correct 
VectorWritable files. Look also at o.a.m.utils.vectors.arff.Driver in 
utils/ if ARFF format is useful.
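
As a rough illustration of the parsing half of that conversion, here is a
minimal sketch in plain Python (not Mahout code) that reads space-separated
numeric lines and checks that each one has the expected 128 values. The
function names parse_line and parse_lines and the EXPECTED_DIM constant are
hypothetical, chosen for this example only. Wrapping each parsed row in a
VectorWritable and writing the sequence file is left to the Mahout classes
mentioned above and is not shown here.

```python
from typing import Iterable, List

# Assumption: 128 values per line, as described in the original question.
EXPECTED_DIM = 128


def parse_line(line: str, expected_dim: int = EXPECTED_DIM) -> List[float]:
    """Parse one whitespace-separated line into a list of floats."""
    values = [float(token) for token in line.split()]
    if len(values) != expected_dim:
        raise ValueError(f"expected {expected_dim} values, got {len(values)}")
    return values


def parse_lines(lines: Iterable[str],
                expected_dim: int = EXPECTED_DIM) -> List[List[float]]:
    """Parse all non-blank lines; report the 1-based line number on failure."""
    points = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            points.append(parse_line(line, expected_dim))
        except ValueError as err:
            raise ValueError(f"line {lineno}: {err}") from err
    return points
```

Running a check like this over the raw file before conversion may make
malformed rows (a missing value, a stray non-numeric token) easier to find
than tracking them down after a failed clustering job.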

Jeff
