Posted to user@mahout.apache.org by Paul Ingles <pi...@me.com> on 2009/07/14 01:02:17 UTC
Error with KMeans example in trunk (793689)
Hi,
I've been going over the kmeans stuff for the last few days to try to
understand how it works, and how I might extend it to work with the
data I'm looking to process. It's taken me a while to get a basic
understanding of things, and I really appreciate having lists like this
around for support.
I need to be able to label the vectors: each vector holds (for a
document) a set of similarity scores across a number of attributes. I
did some searching around payloads (after coming across the term in
some comments) but couldn't see how to add a payload to the Vector. I
then stumbled on MAHOUT-65 (https://issues.apache.org/jira/browse/MAHOUT-65),
which mentions the addition of the setName method to Vector. I've
tried building trunk, and although there were a few test failures for
other (seemingly unrelated) examples, I continued and managed to get
the mahout-examples jar/job files built to give it a whirl.
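To make concrete what I'm after, here's a tiny sketch of the shape I have in mind (a hypothetical stand-in class, not the actual Mahout API; the setName/getName pair is the part MAHOUT-65 describes on Vector):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a labelled sparse vector: a name identifying
// the source document, plus sparse attribute-index -> similarity-score
// entries. Only illustrates the API shape MAHOUT-65 talks about.
public class NamedSparseVector {
    private String name;
    private final Map<Integer, Double> values = new HashMap<>();

    public void setName(String name) { this.name = name; }
    public String getName() { return name; }

    public void set(int index, double value) { values.put(index, value); }
    public double get(int index) { return values.getOrDefault(index, 0.0); }

    public static void main(String[] args) {
        NamedSparseVector v = new NamedSparseVector();
        v.setName("doc-42");   // label tying the vector back to its document
        v.set(0, 0.83);        // similarity score for attribute 0
        v.set(3, 0.12);        // similarity score for attribute 3
        System.out.println(v.getName() + " " + v.get(0)); // prints "doc-42 0.83"
    }
}
```

The idea being that after clustering I can map each clustered vector back to the document it came from by its name.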
When I run the following:
$ hadoop jar examples/target/mahout-examples-0.2-SNAPSHOT.job
org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
I see it run the "Preparing Input" and "Running Canopy to get initial
clusters" steps, and then it starts "Running KMeans". But shortly
after, it breaks with the following trace:
---snip---
Running KMeans
09/07/13 23:49:34 INFO kmeans.KMeansDriver: Input: output/data Clusters In: output/canopies Out: output Distance: org.apache.mahout.utils.EuclideanDistanceMeasure
09/07/13 23:49:34 INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10 num Reduce Tasks: 1 Input Vectors: org.apache.mahout.matrix.SparseVector
09/07/13 23:49:34 INFO kmeans.KMeansDriver: Iteration 0
09/07/13 23:49:34 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
09/07/13 23:49:34 INFO mapred.FileInputFormat: Total input paths to process : 2
09/07/13 23:49:34 INFO mapred.JobClient: Running job: job_200907132019_0040
09/07/13 23:49:35 INFO mapred.JobClient: map 0% reduce 0%
09/07/13 23:49:42 INFO mapred.JobClient: map 50% reduce 0%
09/07/13 23:49:43 INFO mapred.JobClient: map 100% reduce 0%
09/07/13 23:49:49 INFO mapred.JobClient: map 100% reduce 100%
09/07/13 23:49:50 INFO mapred.JobClient: Job complete: job_200907132019_0040
09/07/13 23:49:50 INFO mapred.JobClient: Counters: 16
09/07/13 23:49:50 INFO mapred.JobClient:   File Systems
09/07/13 23:49:50 INFO mapred.JobClient:     HDFS bytes read=465629
09/07/13 23:49:50 INFO mapred.JobClient:     HDFS bytes written=5631
09/07/13 23:49:50 INFO mapred.JobClient:     Local bytes read=7806
09/07/13 23:49:50 INFO mapred.JobClient:     Local bytes written=15674
09/07/13 23:49:50 INFO mapred.JobClient:   Job Counters
09/07/13 23:49:50 INFO mapred.JobClient:     Launched reduce tasks=1
09/07/13 23:49:50 INFO mapred.JobClient:     Launched map tasks=2
09/07/13 23:49:50 INFO mapred.JobClient:     Data-local map tasks=2
09/07/13 23:49:50 INFO mapred.JobClient:   Map-Reduce Framework
09/07/13 23:49:50 INFO mapred.JobClient:     Reduce input groups=7
09/07/13 23:49:50 INFO mapred.JobClient:     Combine output records=10
09/07/13 23:49:50 INFO mapred.JobClient:     Map input records=600
09/07/13 23:49:50 INFO mapred.JobClient:     Reduce output records=7
09/07/13 23:49:50 INFO mapred.JobClient:     Map output bytes=465600
09/07/13 23:49:50 INFO mapred.JobClient:     Map input bytes=448580
09/07/13 23:49:50 INFO mapred.JobClient:     Combine input records=600
09/07/13 23:49:50 INFO mapred.JobClient:     Map output records=600
09/07/13 23:49:50 INFO mapred.JobClient:     Reduce input records=10
09/07/13 23:49:50 WARN kmeans.KMeansDriver: java.io.IOException: Cannot open filename /user/paul/output/clusters-0/_logs
java.io.IOException: Cannot open filename /user/paul/output/clusters-0/_logs
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1394)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1385)
    at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:338)
    at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:171)
    at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1437)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.isConverged(KMeansDriver.java:304)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:241)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.runJob(KMeansDriver.java:194)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:100)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:56)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
    at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
---snip---
This is against revision 793689, running on my development Mac Pro
(pseudo-distributed single node) with Hadoop 0.19.1.
It's a bit late to be digging through what's going on, but I'll try
to take a look tomorrow; I'm really excited about giving kmeans a whirl
on the document processing I'm playing with. In the meantime, I was
wondering whether anyone else had seen the same, or knew a way to
accomplish something similar with the released version (or could point
me to a past good revision, perhaps?)
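One detail that might help whoever looks at this: the trace shows isConverged opening /user/paul/output/clusters-0/_logs, which is the bookkeeping directory Hadoop writes into job output alongside the part-* files. A sketch of the kind of name filter that would skip such entries (hypothetical names, not the actual Mahout code):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hadoop job output directories contain bookkeeping entries such as
// "_logs" next to the real part-* files. Code that opens every child of
// the output directory as a SequenceFile will choke on them, which is
// what the IOException above looks like. Hadoop's own convention treats
// names starting with '_' or '.' as hidden; this sketch applies that.
public class SkipHiddenPaths {
    static boolean isHidden(String name) {
        return name.startsWith("_") || name.startsWith(".");
    }

    static List<String> dataFiles(List<String> children) {
        return children.stream()
                .filter(n -> !isHidden(n))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> children =
            Arrays.asList("_logs", "part-00000", "part-00001", ".crc");
        System.out.println(dataFiles(children)); // [part-00000, part-00001]
    }
}
```

In real code the same predicate would go in a PathFilter passed to the FileSystem listing call, so only the part files get opened as SequenceFiles.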
Thanks again,
Paul
Re: Error with KMeans example in trunk (793689)
Posted by Paul Ingles <pa...@oobaloo.co.uk>.
I've also tried r787776 on Hadoop 0.19.1; I get a NoClassDefFoundError
for com/google/gson/reflect/TypeToken. I'm pretty sure this is the
same error I was seeing when trying 793689 against Hadoop 0.20.0.
I've checked the mahout-*-examples.job file, and the lib directory does
contain gson-1.3.jar, which does contain TypeToken.class at
com/google/gson/reflect, so I'm not too sure what's happening.
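For the record, here's roughly how I checked what the .job bundles (a throwaway helper I'm sketching from memory; the path is just my local layout):

```java
import java.io.IOException;
import java.util.jar.JarFile;

// Quick check that a .job/.jar archive actually contains the entry a
// NoClassDefFoundError complains about. A .job file is just a jar, so
// JarFile can open it directly.
public class FindClassInJob {
    static boolean contains(String jarPath, String entry) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getEntry(entry) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(contains(
            "examples/target/mahout-examples-0.2-SNAPSHOT.job",
            "lib/gson-1.3.jar"));
    }
}
```

Worth noting, though: a class that lives inside a nested lib/ jar is only visible if the framework unpacks the job file and puts those jars on the classpath, so the entry being present in the archive doesn't by itself prove the class is loadable at runtime.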
On 14 Jul 2009, at 13:23, Paul Ingles wrote:
> I noticed it was using 0.20.0 this morning and gave it a go. I think
> it failed at the Clustering phases with a NoClassDef error for the
> GSon stuff, but I don't remember exactly.
>
> I'm running from an earlier revision against 0.19 at the moment, but
> will try 0.20 again when it's finished and let you know how it goes.
>
> Thanks again,
> Paul
>
> On 14 Jul 2009, at 12:58, Grant Ingersoll wrote:
>
>> Try Hadoop 0.20.0, which is what trunk is now on. I will update
>> the docs.
Re: Error with KMeans example in trunk (793689)
Posted by Paul Ingles <pa...@oobaloo.co.uk>.
I noticed it was using 0.20.0 this morning and gave it a go. I think
it failed at the clustering phase with a NoClassDefFoundError for the
Gson stuff, but I don't remember exactly.
I'm running from an earlier revision against 0.19 at the moment, but
will try 0.20 again when it's finished and let you know how it goes.
Thanks again,
Paul
On 14 Jul 2009, at 12:58, Grant Ingersoll wrote:
> Try Hadoop 0.20.0, which is what trunk is now on. I will update the
> docs.
Re: Error with KMeans example in trunk (793689)
Posted by Grant Ingersoll <gs...@apache.org>.
Try Hadoop 0.20.0, which is what trunk is now on. I will update the
docs.
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search
Re: Error with KMeans example in trunk (793689)
Posted by Paul Ingles <pa...@oobaloo.co.uk>.
I'm not sure, I'm afraid; they happened while I was building at home.
I've just updated trunk here and the current revision (793894) builds
successfully. I'm going to switch the cluster over to 0.20.0 and see
whether I can get the KMeans example to run without the Gson problem I
was having before.
Thanks again,
Paul
On 14 Jul 2009, at 14:04, Grant Ingersoll wrote:
>
> What were the errors?
Re: Error with KMeans example in trunk (793689)
Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 13, 2009, at 7:02 PM, Paul Ingles wrote:
> [...] I've tried building trunk, and although there were a few test failures
> for other (seemingly unrelated) examples I continued and managed to
> get the mahout-examples jar/job files built to give it a whirl.
What were the errors?