Posted to user@mahout.apache.org by "Severance, Steve" <ss...@ebay.com> on 2010/08/16 20:15:08 UTC

Clustering Questions

Hi. I have a few questions. I am using Mahout to do KMeans clustering. I have found the process somewhat complex. Some of my questions may have been answered in JIRA tickets, but I did look before I wrote this.

1. It appears that the .job files contain the code that is actually needed to run. How do I build these? They don't seem to be built when I build Mahout with Maven.

2. The Mahout 0.3 tag line numbers don't seem to match the compiled jars. What revision number is 0.3 built from?

3. It looks like the format of the cluster files changed between 0.3 and 0.4. Is this true?

4. I was never able to get the Cluster dumping tool to work. I wrote my own to export the clusters to Hive for analysis. Are there any plans for better Hive integration?





Thanks.



Steve


RE: Clustering Questions

Posted by "Severance, Steve" <ss...@ebay.com>.
I built everything on OSX and it works now.

Thanks.


Re: Clustering Questions

Posted by Robin Anil <ro...@gmail.com>.
Seems to me like an out-of-memory error. Try increasing the heap size.
Hadoop is throwing an "out of mem" exception, which doesn't get propagated to
the driver.
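
As a rough sketch of what "increasing the heap size" can look like: which
setting matters depends on which JVM is running out of memory, and the
message does not say. The 1024 MB values below are assumptions for
illustration only, not recommendations from this thread.

```shell
# Assumed sizes; pick values that fit the machine.
export MAVEN_OPTS="-Xmx1024m"   # heap for the JVM that runs Maven itself
export HADOOP_HEAPSIZE=1024     # in MB; read by the bin/hadoop launcher scripts

# On a cluster, task JVMs take their heap from a job property instead, e.g.
#   -Dmapred.child.java.opts=-Xmx1024m
# (this only works for drivers that parse generic Hadoop options).
```

Unit tests that Maven forks into a separate JVM may ignore MAVEN_OPTS, so
the test heap may have to be raised in the surefire configuration instead.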

Robin


Re: Clustering Questions

Posted by Drew Farris <dr...@apache.org>.
On Mon, Aug 16, 2010 at 2:15 PM, Severance, Steve <ss...@ebay.com> wrote:

> 1.       It appears that the .job files contain the code that is actually needed to run. How do I build these? They don't seem to be built when I build mahout with Maven.

'mvn clean install' will write the job files to */target/*.job --
example/target/mahout-examples-0.4-SNAPSHOT.job for example. If the
unit tests are failing, the job files won't be built. You can do a
build with unit tests disabled using 'mvn clean install -Pfastinstall'
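
A small sketch for checking whether the build actually produced job files.
The helper name list_job_files is made up for this example; the only thing
taken from the message above is the */target/*.job location and the
fastinstall profile.

```shell
# Build first, e.g. with unit tests disabled (command from the message above):
#   mvn clean install -Pfastinstall

# list_job_files [ROOT]: print every .job file found below a target/
# directory under ROOT (defaults to the current directory).
list_job_files() {
  find "${1:-.}" -type f -path '*/target/*.job' | sort
}
```

Running list_job_files from the top of the checkout prints nothing if the
build stopped before the job files were packaged.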

There is likely a problem running the unit tests that is specific to
Windows 7. I know there have been previous reports of difficulties with
the tests on Windows platforms.

There are some tips on the wiki page about building on Windows that
might be useful:
https://cwiki.apache.org/confluence/display/MAHOUT/BuildingMahout

HTH,
Drew

RE: Clustering Questions

Posted by "Severance, Steve" <ss...@ebay.com>.
I am on Windows 7, building through Cygwin. Here is one of the surefire reports.

Steve

-------------------------------------------------------------------------------
Test set: org.apache.mahout.fpm.pfpgrowth.PFPGrowthTest
-------------------------------------------------------------------------------
Tests run: 3, Failures: 2, Errors: 1, Skipped: 0, Time elapsed: 4.157 sec <<< FAILURE!
testStartParallelCounting(org.apache.mahout.fpm.pfpgrowth.PFPGrowthTest)  Time elapsed: 1.35 sec  <<< FAILURE!
junit.framework.ComparisonFailure: null expected:<[[(B,6), (D,6), (A,5), (E,4), (C,3)]]> but was:<[[]]>
	at junit.framework.Assert.assertEquals(Assert.java:81)
	at junit.framework.Assert.assertEquals(Assert.java:87)
	at org.apache.mahout.fpm.pfpgrowth.PFPGrowthTest.testStartParallelCounting(PFPGrowthTest.java:93)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at junit.framework.TestCase.runTest(TestCase.java:168)
	at junit.framework.TestCase.runBare(TestCase.java:134)
	at junit.framework.TestResult$1.protect(TestResult.java:110)
	at junit.framework.TestResult.runProtected(TestResult.java:128)
	at junit.framework.TestResult.run(TestResult.java:113)
	at junit.framework.TestCase.run(TestCase.java:124)
	at junit.framework.TestSuite.runTest(TestSuite.java:232)
	at junit.framework.TestSuite.run(TestSuite.java:227)
	at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83)
	at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:59)
	at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.executeTestSet(AbstractDirectoryTestSuite.java:115)
	at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.execute(AbstractDirectoryTestSuite.java:102)
	at org.apache.maven.surefire.Surefire.run(Surefire.java:180)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.maven.surefire.booter.SurefireBooter.runSuitesInProcess(SurefireBooter.java:350)
	at org.apache.maven.surefire.booter.SurefireBooter.main(SurefireBooter.java:1021)

testStartGroupingItems(org.apache.mahout.fpm.pfpgrowth.PFPGrowthTest)  Time elapsed: 0.014 sec  <<< FAILURE!
junit.framework.ComparisonFailure: null expected:<{[D=0, E=1, A=0, B=0, C=1]}> but was:<{[]}>
	at junit.framework.Assert.assertEquals(Assert.java:81)
	at junit.framework.Assert.assertEquals(Assert.java:87)
	at org.apache.mahout.fpm.pfpgrowth.PFPGrowthTest.testStartGroupingItems(PFPGrowthTest.java:101)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at junit.framework.TestCase.runTest(TestCase.java:168)
	at junit.framework.TestCase.runBare(TestCase.java:134)
	at junit.framework.TestResult$1.protect(TestResult.java:110)
	at junit.framework.TestResult.runProtected(TestResult.java:128)
	at junit.framework.TestResult.run(TestResult.java:113)
	at junit.framework.TestCase.run(TestCase.java:124)
	at junit.framework.TestSuite.runTest(TestSuite.java:232)
	at junit.framework.TestSuite.run(TestSuite.java:227)
	at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83)
	at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:59)
	at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.executeTestSet(AbstractDirectoryTestSuite.java:115)
	at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.execute(AbstractDirectoryTestSuite.java:102)
	at org.apache.maven.surefire.Surefire.run(Surefire.java:180)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.maven.surefire.booter.SurefireBooter.runSuitesInProcess(SurefireBooter.java:350)
	at org.apache.maven.surefire.booter.SurefireBooter.main(SurefireBooter.java:1021)

testStartParallelFPGrowth(org.apache.mahout.fpm.pfpgrowth.PFPGrowthTest)  Time elapsed: 2.789 sec  <<< ERROR!
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/D:/apache/mahout/trunk/core/output/frequentpatterns/fpgrowth
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
	at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:55)
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
	at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
	at org.apache.mahout.fpm.pfpgrowth.PFPGrowth.startAggregating(PFPGrowth.java:240)
	at org.apache.mahout.fpm.pfpgrowth.PFPGrowthTest.testStartParallelFPGrowth(PFPGrowthTest.java:110)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at junit.framework.TestCase.runTest(TestCase.java:168)
	at junit.framework.TestCase.runBare(TestCase.java:134)
	at junit.framework.TestResult$1.protect(TestResult.java:110)
	at junit.framework.TestResult.runProtected(TestResult.java:128)
	at junit.framework.TestResult.run(TestResult.java:113)
	at junit.framework.TestCase.run(TestCase.java:124)
	at junit.framework.TestSuite.runTest(TestSuite.java:232)
	at junit.framework.TestSuite.run(TestSuite.java:227)
	at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83)
	at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:59)
	at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.executeTestSet(AbstractDirectoryTestSuite.java:115)
	at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.execute(AbstractDirectoryTestSuite.java:102)
	at org.apache.maven.surefire.Surefire.run(Surefire.java:180)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.maven.surefire.booter.SurefireBooter.runSuitesInProcess(SurefireBooter.java:350)
	at org.apache.maven.surefire.booter.SurefireBooter.main(SurefireBooter.java:1021)


Re: Clustering Questions

Posted by Ted Dunning <te...@gmail.com>.
What platform (did you already say)?

On Mon, Aug 16, 2010 at 12:07 PM, Severance, Steve <ss...@ebay.com> wrote:

> I can provide any extra info needed.
>

Re: Clustering Questions

Posted by Sean Owen <sr...@gmail.com>.
Hmm, these are all passing for me. Sounds like some quirk in your
local setup. Under target/surefire-reports you will find complete logs
from tests, which would probably reveal the nature of the problem.
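
One possible way to narrow down which reports to read, sketched under the
assumption that each module writes plain-text reports to
target/surefire-reports/*.txt and marks failing test sets with FAILURE! or
ERROR!. The helper name failing_reports is made up for this example.

```shell
# failing_reports [ROOT]: print the surefire .txt reports under ROOT
# (defaults to the current directory) that contain a FAILURE! or ERROR! marker.
failing_reports() {
  find "${1:-.}" -type f -path '*/target/surefire-reports/*.txt' \
    -exec grep -l -E '(FAILURE|ERROR)!' {} + | sort
}
```

Each file it prints holds the stack traces for one failing test class.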

On Mon, Aug 16, 2010 at 8:07 PM, Severance, Steve <ss...@ebay.com> wrote:
> I updated to the current revision of trunk. It does not package correctly as some of the tests fail.
>
> Failed tests:
>  testStartParallelCounting(org.apache.mahout.fpm.pfpgrowth.PFPGrowthTest)
>  testStartGroupingItems(org.apache.mahout.fpm.pfpgrowth.PFPGrowthTest)
>

RE: Clustering Questions

Posted by "Severance, Steve" <ss...@ebay.com>.
I updated to the current revision of trunk. It does not package correctly as some of the tests fail.

Failed tests: 
  testStartParallelCounting(org.apache.mahout.fpm.pfpgrowth.PFPGrowthTest)
  testStartGroupingItems(org.apache.mahout.fpm.pfpgrowth.PFPGrowthTest)

Tests in error: 
  testLoglikelihood(org.apache.mahout.math.hadoop.similarity.vector.DistributedLoglikelihoodVectorSimilarityTest)
  testKMeansWithCanopyClusterInput(org.apache.mahout.clustering.kmeans.TestKmeansClustering)
  testCompleteJob(org.apache.mahout.cf.taste.hadoop.item.RecommenderJobTest)
  testCompleteJobBoolean(org.apache.mahout.cf.taste.hadoop.item.RecommenderJobTest)
  testTanimoto(org.apache.mahout.math.hadoop.similarity.vector.DistributedTanimotoCoefficientVectorSimilarityTest)
  testStartParallelFPGrowth(org.apache.mahout.fpm.pfpgrowth.PFPGrowthTest)
  testCanopyMapperManhattan(org.apache.mahout.clustering.canopy.TestCanopyCreation)
  testCanopyMapperEuclidean(org.apache.mahout.clustering.canopy.TestCanopyCreation)
  testCanopyReducerManhattan(org.apache.mahout.clustering.canopy.TestCanopyCreation)
  testCanopyReducerEuclidean(org.apache.mahout.clustering.canopy.TestCanopyCreation)
  testCanopyGenManhattanMR(org.apache.mahout.clustering.canopy.TestCanopyCreation)
  testCanopyGenEuclideanMR(org.apache.mahout.clustering.canopy.TestCanopyCreation)
  testClusterMapperManhattan(org.apache.mahout.clustering.canopy.TestCanopyCreation)
  testClusterMapperEuclidean(org.apache.mahout.clustering.canopy.TestCanopyCreation)
  testClusteringManhattanMR(org.apache.mahout.clustering.canopy.TestCanopyCreation)
  testClusteringEuclideanMR(org.apache.mahout.clustering.canopy.TestCanopyCreation)
  testUserDefinedDistanceMeasure(org.apache.mahout.clustering.canopy.TestCanopyCreation)
  testCanopyEuclideanMRJob(org.apache.mahout.clustering.meanshift.TestMeanShift)
  testCompleteJob(org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityTest)
  testMaxSimilaritiesPerItem(org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityTest)
  testRowWeightMapper(org.apache.mahout.math.hadoop.similarity.TestRowSimilarityJob)
  testSimilarityReducer(org.apache.mahout.math.hadoop.similarity.TestRowSimilarityJob)
  testSimilarityReducerSelfSimilarity(org.apache.mahout.math.hadoop.similarity.TestRowSimilarityJob)
  testSmallSampleMatrix(org.apache.mahout.math.hadoop.similarity.TestRowSimilarityJob)
  testLimitEntriesInSimilarityMatrix(org.apache.mahout.math.hadoop.similarity.TestRowSimilarityJob)
  testEvaluate(org.apache.mahout.ga.watchmaker.MahoutEvaluatorTest)
  testMaxHeapFPGrowth(org.apache.mahout.fpm.pfpgrowth.FPGrowthTest)
  testFuzzyKMeansMRJob(org.apache.mahout.clustering.fuzzykmeans.TestFuzzyKmeansClustering)
  testTranspose(org.apache.mahout.math.hadoop.TestDistributedRowMatrix)
  testMatrixTimesVector(org.apache.mahout.math.hadoop.TestDistributedRowMatrix)
  testMatrixTimesSquaredVector(org.apache.mahout.math.hadoop.TestDistributedRowMatrix)
  testMatrixTimesMatrix(org.apache.mahout.math.hadoop.TestDistributedRowMatrix)
  testSelfTestBayes(org.apache.mahout.classifier.bayes.BayesClassifierSelfTest)
  testSelfTestCBayes(org.apache.mahout.classifier.bayes.BayesClassifierSelfTest)
  testDistributedLanczosSolver(org.apache.mahout.math.hadoop.decomposer.TestDistributedLanczosSolver)

I can provide any extra info needed. 

My other build of trunk, which was from August 4th, fails to run seq2sparse because the Lucene standard analyzer cannot be found. Which job file should contain this?

Thanks.

Steve


RE: Clustering Questions

Posted by "Severance, Steve" <ss...@ebay.com>.
Thanks Ted.

I will move my code to trunk and get it working.

Steve


Re: Clustering Questions

Posted by Ted Dunning <te...@gmail.com>.
On Mon, Aug 16, 2010 at 11:15 AM, Severance, Steve <ss...@ebay.com> wrote:

>
> 1.       It appears that the .job files contain the code that is actually
> needed to run. How do I build these? They don't seem to be built when I
> build mahout with Maven.
>

Which version are you using?  I recommend trunk for pretty much everything.


> 2.       The Mahout 0.3 tag line numbers don't seem to match with the
> compiled jars. What revision number is 0.3 built from?
>

It should have been what was tagged.  But, even so, I recommend using trunk.


> 3.       It looks like the format of the cluster files changed between
> 0.3 and 0.4. Is this true?
>

Others can say for sure, but this is very likely.   0.4 is going to be a
major change.

> 4.       I was never able to get the Cluster dumping tool to work. I wrote my
> own to export the clusters to Hive for analysis. Are there any plans for
> better Hive integration?
>

This has been substantially improved.  Is there something that can be done
to facilitate Hive integration without making Hive a dependency?