
Posted to user@mahout.apache.org by diveman <sh...@gmail.com> on 2010/01/07 21:23:13 UTC

Kmeans clustering

I'm new to Mahout. I installed 0.3 on a 4-node cluster and ran the Mahout
k-means example with the syntheticcontrol data. I got output like the following:

output/canopies
output/clusters-0
output/clusters-1
output/clusters-2
output/clusters-3
output/clusters-4
output/clusters-5
output/clusters-6
output/data
output/points

From this I understand that in the points folder each point is labeled with a
cluster id. I'm wondering where I can find the cluster centers, radius info,
etc. And what's in clusters-0 through clusters-6? BTW, the sample data has
6 groups and the result has 7 clusters, any clue?

Thanks!
-- 
View this message in context: http://old.nabble.com/Kmeans-clustering-tp27066415p27066415.html
Sent from the Mahout User List mailing list archive at Nabble.com.

Re: Kmeans clustering

Posted by diveman <sh...@gmail.com>.

I copied the directories down to local disk and found that ClusterDumper
needs the files on both HDFS and local disk. So I made two folders with
exactly the same name, one in HDFS and one on local disk, and I get the
following:

hadoop jar mahout-utils-0.3-SNAPSHOT.jar org.apache.mahout.utils.clustering.ClusterDumper -s /data/clusters-6 -o /data/output
Input Path: /data/clusters-6/part-00000
Exception in thread "main" java.lang.NullPointerException
        at org.apache.mahout.utils.vectors.VectorHelper.vectorToString(VectorHelper.java:60)
        at org.apache.mahout.utils.clustering.ClusterDumper.printClusters(ClusterDumper.java:124)
        at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:253)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Any thoughts?






Re: Kmeans clustering

Posted by Delroy Cameron <de...@gmail.com>.

Indeed, for k-means clustering you should specify the path to the clusters
directory that contains the binary part-00000 file, not the binary file
itself, as the sequence file input (-s).

That is,
<path to clusters output>/clusters-<last iteration number>/
instead of
<path to clusters output>/clusters-<last iteration number>/part-00000

clusterdump worked fine for me with the following command:
./bin/mahout clusterdump \
-s <path to clusters output>/clusters-<last iteration number>/ \
-o <path for dump file> \
-p <path to clusters output>/clusteredPoints/ \
-d <path to input vectors>/dictionary.file-0 \
-dt sequencefile
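
A small sketch of the directory-versus-file point, in POSIX shell: if you
already have a path that ends in a part-NNNNN file, strip the file name so
that -s receives the enclosing directory. The function name to_seqfile_dir
is made up for illustration and is not part of Mahout.

```shell
# Hypothetical helper (not part of Mahout): turn a path that points at a
# part-NNNNN file into its enclosing directory, with a trailing slash.
to_seqfile_dir() {
  case "$1" in
    */part-[0-9]*) printf '%s/\n' "${1%/*}" ;;  # drop the part file name
    */)            printf '%s\n' "$1" ;;        # already ends in a slash
    *)             printf '%s/\n' "$1" ;;       # add the trailing slash
  esac
}
# Usage: to_seqfile_dir /data/clusters-6/part-00000   prints /data/clusters-6/
```

The result can then be passed as the -s argument in the command above.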


-----
--cheers
Delroy

Re: Kmeans clustering

Posted by Jake Mannix <ja...@gmail.com>.

You're running on Hadoop but using a relative path name like
"./clusteredPoints/". Is there a directory *on HDFS* at

  /home/adam/Coding/hadoopResearch/mahout/trunk/clusteredPoints/

?

  -jake
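
To make the question concrete: judging from the "Input Path:" line in the
log from this thread, it looks as if the relative argument was expanded
against the local working directory and the resulting absolute path was then
looked up on HDFS. A rough sketch of that expansion follows; the function
name resolve_against_cwd is hypothetical, and the expansion rule is an
assumption inferred from the log, not taken from the Mahout source.

```shell
# Hypothetical helper: show the absolute path a relative argument expands to
# when it is resolved against the current local working directory.
resolve_against_cwd() {
  case "$1" in
    /*)  printf '%s\n' "$1" ;;                  # already absolute
    ./*) printf '%s/%s\n' "$PWD" "${1#./}" ;;   # strip the leading "./"
    *)   printf '%s/%s\n' "$PWD" "$1" ;;
  esac
}
# Then check that path on HDFS rather than on local disk, for example:
#   hadoop fs -ls "$(resolve_against_cwd ./clusteredPoints/)"
```

If the listing fails, the directory exists only on local disk, which would
explain a "File does not exist" error coming from DistributedFileSystem.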


Re: Kmeans clustering

Posted by adam35413 <ad...@gmail.com>.

Yes, I see that now.  I'm an idiot.

Updated that, and now I'm seeing this:  

bin/mahout clusterdump --seqFileDir ./clusteredPoints/ --output testFile.txt
running on hadoop, using HADOOP_HOME=/home/adam/Coding/hadoopResearch/hadoop-0.20.2 and HADOOP_CONF_DIR=/home/adam/Coding/hadoopResearch/hadoop-0.20.2/conf
Input Path: /home/adam/Coding/hadoopResearch/mahout/trunk/clusteredPoints/part-00000
10/04/09 15:57:02 ERROR driver.MahoutDriver: MahoutDriver failed with args: [--seqFileDir, ./clusteredPoints/, --output, testFile.txt, null]
File does not exist: /home/adam/Coding/hadoopResearch/mahout/trunk/clusteredPoints/part-00000
Exception in thread "main" java.io.FileNotFoundException: File does not exist: /home/adam/Coding/hadoopResearch/mahout/trunk/clusteredPoints/part-00000
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
        at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:676)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)


The file does exist however, and the permissions have been set to 777 to
eliminate any potential conflict with access.  Making progress I guess, but
still no dice...

Re: Kmeans clustering

Posted by Drew Farris <dr...@gmail.com>.

I suspect you are seeing this error because ClusterDumper doesn't work
from Hadoop/HDFS. You will need to copy the directories down to your
local disk and run from there using java.
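
A minimal sketch of the copy step, assuming the standard "hadoop fs -get"
command is available on your PATH. The function name copy_to_local and the
HADOOP_CMD override are made up for illustration; the override only exists
so the command can be previewed with "echo" before running it for real.

```shell
# Sketch only: copy one HDFS directory into a local destination directory.
# HADOOP_CMD is a hypothetical override, handy for a dry run with "echo".
copy_to_local() {
  mkdir -p "$2" || return 1
  ${HADOOP_CMD:-hadoop} fs -get "$1" "$2/$(basename "$1")"
}
# Usage (HDFS paths from this thread; the local directory name is arbitrary):
#   copy_to_local output/clusters-6 ./local-output
#   copy_to_local output/points     ./local-output
```

How ClusterDumper is then launched with java depends on your classpath,
so that part is left out here.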


Re: Kmeans clustering

Posted by diveman <sh...@gmail.com>.

Thanks!

and when I try to run the dumper it gives me the following:
hadoop jar mahout-utils-0.3-SNAPSHOT.jar org.apache.mahout.utils.clustering.ClusterDumper -s output/clusters-6/ -o /data/output
Exception in thread "main" java.lang.NullPointerException
        at org.apache.mahout.utils.clustering.ClusterDumper.printClusters(ClusterDumper.java:112)
        at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:253)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)





Re: Kmeans clustering

Posted by Drew Farris <dr...@gmail.com>.

Each iteration of k-means clustering produces a clusters-X directory. In
this case, there were 7 iterations prior to the clusters converging.
The final cluster data can be found in clusters-6.

There is a utility in mahout-utils,
o.a.m.utils.clustering.ClusterDumper, that can be used to dump the data
from clusters-6 and points into a JSON-like format. You could use that
code as a starting point for discovering how to get at the data you're
interested in.
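
If you don't want to hard-code the iteration number, here is a small sketch
in POSIX shell and awk that picks the highest-numbered clusters-N entry from
a listing. The function name latest_clusters_dir is hypothetical. It assumes
the path is the last whitespace-separated field on each line, which holds for
plain lists of names and should also hold for "hadoop fs -ls" output as long
as the paths contain no spaces.

```shell
# Hypothetical helper: read a directory listing on stdin and print the
# clusters-N path with the largest N (compared numerically, so 10 > 6).
latest_clusters_dir() {
  awk '
    {
      n = split($NF, parts, "/")          # last field is taken as the path
      base = parts[n]                     # final path component
      if (base ~ /^clusters-[0-9]+$/) {
        iter = substr(base, 10) + 0       # digits after "clusters-"
        if (!found || iter > best) { best = iter; path = $NF; found = 1 }
      }
    }
    END { if (found) print path }
  '
}
# Usage: hadoop fs -ls output | latest_clusters_dir
```

For the listing in this thread that would select output/clusters-6.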
