Posted to user@mahout.apache.org by Paul Ingles <pi...@me.com> on 2009/07/14 01:02:17 UTC
Error with KMeans example in trunk (793689)
Hi,
I've been going over the kmeans stuff for the last few days to try to
understand how it works, and how I might extend it to work with the
data I'm looking to process. It's taken me a while to get a basic
understanding of things, and I really appreciate having lists like this
around for support.
I need to be able to label the vectors: each vector holds (for a
document) a set of similarity scores across a number of attributes. I
did some searching around payloads (after coming across the term in
some comments) but couldn't see how to add a payload to the Vector. I
then stumbled on MAHOUT-65 (https://issues.apache.org/jira/browse/MAHOUT-65),
which mentions the addition of the setName method to Vector. I've
tried building trunk, and although there were a few test failures for
other (seemingly unrelated) examples, I continued and managed to get
the mahout-examples jar/job files built to give it a whirl.
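To make concrete what I'm after, here's a tiny sketch of the shape I have in mind (a hypothetical stand-in class, not the actual Mahout API; the setName/getName pair is the part MAHOUT-65 describes on Vector):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a labelled sparse vector: a name identifying
// the source document, plus sparse attribute-index -> similarity-score
// entries. Only illustrates the API shape MAHOUT-65 talks about.
public class NamedSparseVector {
    private String name;
    private final Map<Integer, Double> values = new HashMap<>();

    public void setName(String name) { this.name = name; }
    public String getName() { return name; }

    public void set(int index, double value) { values.put(index, value); }
    public double get(int index) { return values.getOrDefault(index, 0.0); }

    public static void main(String[] args) {
        NamedSparseVector v = new NamedSparseVector();
        v.setName("doc-42");   // label tying the vector back to its document
        v.set(0, 0.83);        // similarity score for attribute 0
        v.set(3, 0.12);        // similarity score for attribute 3
        System.out.println(v.getName() + " " + v.get(0)); // prints "doc-42 0.83"
    }
}
```

The idea being that after clustering I can map each clustered vector back to the document it came from by its name.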
When I run the following:
$ hadoop jar examples/target/mahout-examples-0.2-SNAPSHOT.job
org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
I see it run the "Preparing Input" and "Running Canopy to get initial
clusters" steps, and then it starts "Running KMeans". But shortly
after, it breaks with the following trace:
---snip---
Running KMeans
09/07/13 23:49:34 INFO kmeans.KMeansDriver: Input: output/data Clusters In: output/canopies Out: output Distance: org.apache.mahout.utils.EuclideanDistanceMeasure
09/07/13 23:49:34 INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10 num Reduce Tasks: 1 Input Vectors: org.apache.mahout.matrix.SparseVector
09/07/13 23:49:34 INFO kmeans.KMeansDriver: Iteration 0
09/07/13 23:49:34 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
09/07/13 23:49:34 INFO mapred.FileInputFormat: Total input paths to process : 2
09/07/13 23:49:34 INFO mapred.JobClient: Running job: job_200907132019_0040
09/07/13 23:49:35 INFO mapred.JobClient: map 0% reduce 0%
09/07/13 23:49:42 INFO mapred.JobClient: map 50% reduce 0%
09/07/13 23:49:43 INFO mapred.JobClient: map 100% reduce 0%
09/07/13 23:49:49 INFO mapred.JobClient: map 100% reduce 100%
09/07/13 23:49:50 INFO mapred.JobClient: Job complete: job_200907132019_0040
09/07/13 23:49:50 INFO mapred.JobClient: Counters: 16
09/07/13 23:49:50 INFO mapred.JobClient:   File Systems
09/07/13 23:49:50 INFO mapred.JobClient:     HDFS bytes read=465629
09/07/13 23:49:50 INFO mapred.JobClient:     HDFS bytes written=5631
09/07/13 23:49:50 INFO mapred.JobClient:     Local bytes read=7806
09/07/13 23:49:50 INFO mapred.JobClient:     Local bytes written=15674
09/07/13 23:49:50 INFO mapred.JobClient:   Job Counters
09/07/13 23:49:50 INFO mapred.JobClient:     Launched reduce tasks=1
09/07/13 23:49:50 INFO mapred.JobClient:     Launched map tasks=2
09/07/13 23:49:50 INFO mapred.JobClient:     Data-local map tasks=2
09/07/13 23:49:50 INFO mapred.JobClient:   Map-Reduce Framework
09/07/13 23:49:50 INFO mapred.JobClient:     Reduce input groups=7
09/07/13 23:49:50 INFO mapred.JobClient:     Combine output records=10
09/07/13 23:49:50 INFO mapred.JobClient:     Map input records=600
09/07/13 23:49:50 INFO mapred.JobClient:     Reduce output records=7
09/07/13 23:49:50 INFO mapred.JobClient:     Map output bytes=465600
09/07/13 23:49:50 INFO mapred.JobClient:     Map input bytes=448580
09/07/13 23:49:50 INFO mapred.JobClient:     Combine input records=600
09/07/13 23:49:50 INFO mapred.JobClient:     Map output records=600
09/07/13 23:49:50 INFO mapred.JobClient:     Reduce input records=10
09/07/13 23:49:50 WARN kmeans.KMeansDriver: java.io.IOException: Cannot open filename /user/paul/output/clusters-0/_logs
java.io.IOException: Cannot open filename /user/paul/output/clusters-0/_logs
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1394)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1385)
    at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:338)
    at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:171)
    at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1437)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:304)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:241)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.runJob(KMeansDriver.java:194)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:100)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:56)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
    at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
---snip---
This is against revision 793689, running on my development Mac Pro
(pseudo-distributed single node) with Hadoop 0.19.1.
It's a bit late to be digging through what's going on, but I'll try
to take a look tomorrow; I'm really excited about giving kmeans a whirl
on the document processing I'm playing with. In the meantime, I was
wondering whether anyone else had seen the same, or knew a way to
accomplish something similar with the released version (or could point
me to a past good revision, perhaps?)
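One detail that might help whoever looks at this: the trace shows isConverged opening /user/paul/output/clusters-0/_logs, which is the bookkeeping directory Hadoop writes into job output alongside the part-* files. A sketch of the kind of name filter that would skip such entries (hypothetical names, not the actual Mahout code):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hadoop job output directories contain bookkeeping entries such as
// "_logs" next to the real part-* files. Code that opens every child of
// the output directory as a SequenceFile will choke on them, which is
// what the IOException above looks like. Hadoop's own convention treats
// names starting with '_' or '.' as hidden; this sketch applies that.
public class SkipHiddenPaths {
    static boolean isHidden(String name) {
        return name.startsWith("_") || name.startsWith(".");
    }

    static List<String> dataFiles(List<String> children) {
        return children.stream()
                .filter(n -> !isHidden(n))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> children =
            Arrays.asList("_logs", "part-00000", "part-00001", ".crc");
        System.out.println(dataFiles(children)); // [part-00000, part-00001]
    }
}
```

In real code the same predicate would go in a PathFilter passed to the FileSystem listing call, so only the part files get opened as SequenceFiles.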
Thanks again,
Paul
Re: Error with KMeans example in trunk (793689)
Posted by Paul Ingles <pa...@oobaloo.co.uk>.
I've also tried r787776 on Hadoop 0.19.1; I get a NoClassDefFoundError
for com/google/gson/reflect/TypeToken. I'm pretty sure this is the
same error I was seeing when trying 793689 against Hadoop 0.20.0.
I've checked the mahout-*-examples.job file, and the lib directory does
contain gson-1.3.jar, which does contain TypeToken.class at
com/google/gson/reflect, so I'm not too sure what's happening.
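For the record, here's roughly how I checked what the .job bundles (a throwaway helper I'm sketching from memory; the path is just my local layout):

```java
import java.io.IOException;
import java.util.jar.JarFile;

// Quick check that a .job/.jar archive actually contains the entry a
// NoClassDefFoundError complains about. A .job file is just a jar, so
// JarFile can open it directly.
public class FindClassInJob {
    static boolean contains(String jarPath, String entry) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getEntry(entry) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(contains(
            "examples/target/mahout-examples-0.2-SNAPSHOT.job",
            "lib/gson-1.3.jar"));
    }
}
```

Worth noting, though: a class that lives inside a nested lib/ jar is only visible if the framework unpacks the job file and puts those jars on the classpath, so the entry being present in the archive doesn't by itself prove the class is loadable at runtime.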
On 14 Jul 2009, at 13:23, Paul Ingles wrote:
> I noticed it was using 0.20.0 this morning and gave it a go. I think
> it failed at the Clustering phases with a NoClassDef error for the
> GSon stuff, but I don't remember exactly.
>
> I'm running from an earlier revision against 0.19 at the moment, but
> will try 0.20 again when it's finished and let you know how it goes.
>
> Thanks again,
> Paul
>
> On 14 Jul 2009, at 12:58, Grant Ingersoll wrote:
>
>> Try Hadoop 0.20.0, which is what trunk is now on. I will update
>> the docs.
Re: Error with KMeans example in trunk (793689)
Posted by Paul Ingles <pa...@oobaloo.co.uk>.
I noticed it was using 0.20.0 this morning and gave it a go. I think
it failed at the clustering phase with a NoClassDefFoundError for the
Gson stuff, but I don't remember exactly.
I'm running from an earlier revision against 0.19 at the moment, but
will try 0.20 again when it's finished and let you know how it goes.
Thanks again,
Paul
On 14 Jul 2009, at 12:58, Grant Ingersoll wrote:
> Try Hadoop 0.20.0, which is what trunk is now on. I will update the
> docs.
Re: Error with KMeans example in trunk (793689)
Posted by Grant Ingersoll <gs...@apache.org>.
Try Hadoop 0.20.0, which is what trunk is now on. I will update the
docs.
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search
Re: Error with KMeans example in trunk (793689)
Posted by Paul Ingles <pa...@oobaloo.co.uk>.
I'm not sure, I'm afraid; they happened while I was building at home.
I've just updated trunk here and the current revision (793894) builds
successfully. I'm going to switch the cluster over to 0.20.0 and see
whether I can get the KMeans example to run without the Gson problem I
was having before.
Thanks again,
Paul
On 14 Jul 2009, at 14:04, Grant Ingersoll wrote:
>
> What were the errors?
Re: Error with KMeans example in trunk (793689)
Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 13, 2009, at 7:02 PM, Paul Ingles wrote:
> [...] I've tried building trunk, and although there were a few test failures
> for other (seemingly unrelated) examples I continued and managed to
> get the mahout-examples jar/job files built to give it a whirl.
What were the errors?