Posted to user@mahout.apache.org by Dongxiang <zh...@gmail.com> on 2010/09/09 17:11:13 UTC
k-means clustering for high-dimensional numerical data
Hi, I am trying to use Mahout to cluster high-dimensional data.
The data format is very simple: each data point has 128 dimensions with double
values.
I am puzzled by the input format. From what I have read, I think I
need to first convert the data into SequenceFile format and then convert
that into Mahout vector format. Is this the right approach?
I think this kind of application is very common, but I cannot find any
step-by-step tutorial or guide.
Can someone tell me the exact commands to perform such a task? Thanks a
lot.
Re: k-means clustering for high-dimensional numerical data
Posted by Jeff Eastman <jd...@windwardsolutions.com>.
In general, the transformation is as you suggest. You need to produce
a set of input sequence files which contain VectorWritable vectors from
your data. It should be a single-step process.
We don't have as much tutorial material as we could have, but the problem of
transforming your data into Mahout vectors (VectorWritable sequence
files) is often quite application-specific. We do have an example in the
synthetic control package which you may be able to use or adapt. Look at
the InputDriver and InputMapper classes in
org.apache.mahout.clustering.syntheticcontrol in examples/. These take
space-separated numeric data such as yours and produce the correct
VectorWritable files. Look also at org.apache.mahout.utils.vectors.arff.Driver
in utils/ if ARFF format is useful.
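
As a rough illustration of the parsing half of that transformation, here is a
minimal sketch in plain Java that turns space-separated lines into fixed-length
arrays of doubles. It is not the InputMapper code; the class and method names
(PointParser, parseLine, parseAll) are hypothetical, and the sketch assumes one
data point per line. Writing the result out as VectorWritable sequence files
needs the Mahout and Hadoop libraries and is deliberately left out here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: parses text lines into double[] points.
// Each double[] would still have to be wrapped in a Mahout vector and
// written to a sequence file; that step is not shown in this sketch.
class PointParser {

    /** Parses one whitespace-separated line into an array of the expected length. */
    public static double[] parseLine(String line, int dimensions) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length != dimensions) {
            throw new IllegalArgumentException(
                "expected " + dimensions + " values but found " + tokens.length);
        }
        double[] values = new double[dimensions];
        for (int i = 0; i < dimensions; i++) {
            values[i] = Double.parseDouble(tokens[i]);
        }
        return values;
    }

    /** Parses every non-blank line; blank lines are skipped. */
    public static List<double[]> parseAll(List<String> lines, int dimensions) {
        List<double[]> points = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            points.add(parseLine(line, dimensions));
        }
        return points;
    }

    public static void main(String[] args) {
        // Three-dimensional toy input; the data in this thread would use 128.
        List<String> lines = Arrays.asList("0.5 1.5 -2.0", "", "3 4 5");
        List<double[]> points = parseAll(lines, 3);
        System.out.println(points.size() + " points parsed");
    }
}
```

Checking the dimension on every line is a design choice for this sketch: a
malformed row fails early instead of producing a vector of the wrong length
that only causes trouble later during clustering.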
Jeff