Posted to user@mahout.apache.org by fx MA XIAOJUN <xi...@fujixerox.co.jp> on 2014/03/17 08:42:54 UTC

RE: reduce is too slow in StreamingKmeans

Thank you for your quick reply.

As to -km, I thought the log was log10 instead of ln. I was wrong...
This time I set -km 140000 and ran mahout streamingkmeans again (CDH 5.0 MRv1, Mahout 0.8).
The maps ran faster than before, but the reduce was still stuck at 76% forever.

So I uninstalled Mahout 0.8 and installed Mahout 0.9 in order to use the -rskm option.

Mahout kmeans can be executed properly, so I think the installation of Mahout 0.9 was successful.

However, when executing mahout streamingkmeans, I got the following error.
The Hadoop I installed is CDH5-beta1, MapReduce version 1.
----------------------------------------------------------------------------------------
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
	at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
	at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.runMapReduce(StreamingKMeansDriver.java:464)
	at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:419)
	at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:240)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
	at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:491)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
	at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
--------------------------------------------------------------------------------------------
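Note for readers hitting the same trace: "Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected" is the classic signature of code compiled against the Hadoop 1.x API (where JobContext is a class) running against jars in which JobContext is an interface, i.e. the Hadoop 2.x API line. A tiny stand-alone probe, purely illustrative and not part of Mahout, shows which flavour is actually on the job classpath:
----------------------------------------------------------------------------------------
// HadoopApiProbe.java (illustrative) -- compile and run with the cluster's hadoop jars on
// the classpath, for example: java -cp "$(hadoop classpath)":. HadoopApiProbe
public class HadoopApiProbe {
    public static void main(String[] args) throws Exception {
        Class<?> jc = Class.forName("org.apache.hadoop.mapreduce.JobContext");
        // Hadoop 1.x ships JobContext as a class; the 2.x API line ships it as an interface.
        System.out.println("JobContext is "
            + (jc.isInterface() ? "an interface (Hadoop 2.x-style API)" : "a class (Hadoop 1.x-style API)"));
    }
}
----------------------------------------------------------------------------------------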







-----Original Message-----
From: Suneel Marthi [mailto:suneel_marthi@yahoo.com] 
Sent: Wednesday, February 19, 2014 1:08 AM
To: user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

Streaming KMeans runs with a single reducer that runs Ball KMeans, hence the slow performance you have been experiencing.

How did you come up with -km 63000?

Given that you would like 10000 clusters (= k) and have 2,000,000 datapoints (= n), k * ln(n) = 10000 * ln(2 * 10^6) = 145087 (rounded to the nearest integer), and that should be the value of -km in your case (km = k * ln(n)).
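A quick sanity check of that arithmetic (just a sketch using the numbers from this thread; Java's Math.log is the natural logarithm):
----------------------------------------------------------------------------------------
// KmSanityCheck.java (illustrative) -- recomputes km = k * ln(n) for k = 10000, n = 2000000.
public class KmSanityCheck {
    public static void main(String[] args) {
        int k = 10000;        // requested clusters
        long n = 2000000L;    // datapoints
        long km = Math.round(k * Math.log(n));   // Math.log is ln
        System.out.println("km = " + km);        // prints km = 145087
    }
}
----------------------------------------------------------------------------------------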

Not sure if that's going to fix your reduce being stuck at 76% forever, but it's definitely worth a try.

If you would like to go with the -rskm option, please upgrade to Mahout 0.9. I still think there's an issue with the -rskm option in Mahout 0.9 and today's trunk when executing in MR mode, but it definitely works in the non-MR (-xm sequential) mode in 0.9.











On Monday, February 17, 2014 9:05 PM, Sylvia Ma <Xi...@fujixerox.co.jp> wrote:
 
I am using Mahout 0.8 embedded in CDH 5.0.0 provided by Cloudera and found that the reduce phase of mahout streamingkmeans is extremely slow.

For example:
With a dataset of 2,000,000 objects with 128 variables each, I would like to get 10000 clusters.

The command executed is as follows.
mahout streamingkmeans -i input -o output -ow -k 10000 -km 63000

I have 15 map tasks, all of which completed within 4 hours.
However, the reduce has taken over 100 hours and is still stuck at 76%.

I have tuned Hadoop's performance as follows (a programmatic sketch of the same settings appears after the list).
map task jvm = 3g
reduce task jvm = 10g
io.sort.mb = 512
io.sort.factor = 50
mapred.reduce.parallel.copies = 10
mapred.inmem.merge.threshold = 0 
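For reference, here is a sketch of how those MRv1 settings would look if applied programmatically to the job configuration; the two *.child.java.opts property names are an assumption for the "map/reduce task jvm" entries, while the remaining names are exactly the ones listed above.
----------------------------------------------------------------------------------------
// TuningSketch.java (illustrative) -- mirrors the tuning above on a Hadoop Configuration.
import org.apache.hadoop.conf.Configuration;

public class TuningSketch {
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();
        conf.set("mapred.map.child.java.opts", "-Xmx3g");     // assumed property for "map task jvm = 3g"
        conf.set("mapred.reduce.child.java.opts", "-Xmx10g"); // assumed property for "reduce task jvm = 10g"
        conf.setInt("io.sort.mb", 512);
        conf.setInt("io.sort.factor", 50);
        conf.setInt("mapred.reduce.parallel.copies", 10);
        conf.setInt("mapred.inmem.merge.threshold", 0);
        return conf;
    }

    public static void main(String[] args) {
        System.out.println(tunedConf().get("mapred.reduce.child.java.opts"));
    }
}
----------------------------------------------------------------------------------------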

I tried to assign enough memory, but the reduce is still extremely slow.


Why does the reduce take so much time?
And what can I do to speed up the job?

I wonder if it would help to set -rskm to true.
The -rskm option has a bug in Mahout 0.8, so I cannot give it a try...




Yours Sincerely,
Sylvia Ma

Re: reduce is too slow in StreamingKmeans

Posted by Suneel Marthi <su...@yahoo.com>.
The -rskm option works only in sequential mode and fails in MR. That's still an issue in the present trunk that needs to be fixed.
That should explain why Streaming KMeans with -rskm works only in sequential mode for you.

Mahout 0.9 has been built with the Hadoop 1.2.1 profile; not sure if that's going to work with 0.20.







RE: reduce is too slow in StreamingKmeans

Posted by fx MA XIAOJUN <xi...@fujixerox.co.jp>.
Dear Suneel,

I am very sorry that I did not reply to you for so long.
I just realized that your mail was automatically flagged as spam...

My data is 700MB.
mapred.child.java.opts=-Xmx4g

It takes 2 hours for one map task to complete.
15 map tasks were started for the job.
The maps run very fast, I think.


I changed the reduce memory to 10g and ran the job again. Nothing changed; it was still stuck at 76%. How much memory is enough?
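As a rough back-of-the-envelope answer, here is a sketch of what the single reducer has to hold, under the assumption that each of the 15 mappers forwards on the order of km = 145087 intermediate centroids of 128 doubles; the real footprint is higher once vector object overhead is added.
----------------------------------------------------------------------------------------
// ReducerFootprintSketch.java (illustrative) -- order-of-magnitude estimate only.
public class ReducerFootprintSketch {
    public static void main(String[] args) {
        long mappers = 15;              // map tasks in the job
        long centroidsPerMap = 145087;  // assumed ~km intermediate centroids per mapper
        long dims = 128;                // cardinality of each vector
        long bytes = mappers * centroidsPerMap * dims * 8;  // 8 bytes per double
        System.out.printf("~%.1f GB of raw vector data before any object overhead%n",
            bytes / (1024.0 * 1024 * 1024));                // prints ~2.1 GB
    }
}
----------------------------------------------------------------------------------------
So a 2GB reduce heap is not enough for the raw vectors alone; 10g should hold them, but the ball k-means pass that the lone reducer then runs over all of them remains the expensive part.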


P.S.
As to sequential mode, I asked the question in another mail, which you have already replied to.
Thank you again.


Ma







Re: reduce is too slow in StreamingKmeans

Posted by Suneel Marthi <su...@yahoo.com>.
When dealing with Streaming KMeans, it would be helpful for troubleshooting purposes if you could provide the values for k (# of clusters), km (= k * ln(n)) and n (# of datapoints).

Try setting -Xmx to a higher heap size and run the sequential version again.

I have seen OOM errors happen during the reduce phase while running the MR version; my reduce heap size was set to 2GB and I was trying to cluster about 2M datapoints, each of cardinality 100 (that's after running through SSVD-PCA).

Speaking from my experience, either the reducer fails with OOM errors or it stays stuck forever at 76% (and raises alarms with Operations because it is not making any progress).
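A simple cost estimate points the same way: under the assumption that roughly 15 * km intermediate centroids reach the lone reducer, even one k-means-style pass over them against 10,000 candidate centers is enormous. A sketch with the numbers from this thread:
----------------------------------------------------------------------------------------
// ReducePassCostSketch.java (illustrative) -- order-of-magnitude only.
public class ReducePassCostSketch {
    public static void main(String[] args) {
        double points = 15.0 * 145087;  // assumed intermediate centroids at the reducer
        double k = 10000;               // requested clusters
        double d = 128;                 // vector cardinality
        System.out.printf("~%.1e distance terms per clustering pass%n", points * k * d);
        // ~2.8e+12 per pass, so a multi-pass reduce on a single task can easily run for days.
    }
}
----------------------------------------------------------------------------------------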


How big is your dataset, and how long did it take for the map phase to complete?




RE: reduce is too slow in StreamingKmeans

Posted by fx MA XIAOJUN <xi...@fujixerox.co.jp>.
As mahout streamingkmeans has no problems in sequential mode, I would like to try sequential mode.
However, "java.lang.OutOfMemoryError" occurs.

I wonder where to set the JVM heap size for sequential mode.
Is it the same as for MapReduce mode?






RE: reduce is too slow in StreamingKmeans

Posted by fx MA XIAOJUN <xi...@fujixerox.co.jp>.
Thank you for your extremely quick reply.

>> What do you mean by this? kmeans hasn't changed between 0.8 and 0.9. Did you mean Streaming KMeans here?
I want to try using -rskm in streaming kmeans.
But in Mahout 0.8, if -rskm is set to true, errors occur.
I heard that the bug has been fixed in 0.9, so I upgraded from 0.8 to 0.9.


The Hadoop I installed is CDH5 MRv1, corresponding to Hadoop 0.20, not Hadoop 2.x (YARN).
CDH5 MRv1 has a compatible version of Mahout (mahout-0.8+cdh5.0.0b2+28) compiled by Cloudera.
So I uninstalled mahout-0.8+cdh5.0.0b2+28 and installed the Apache Mahout 0.9 distribution.
It turned out that "mahout kmeans" runs very well on MapReduce.
However, "mahout streamingkmeans" runs properly in sequential mode but fails in MapReduce mode.

If the problem were an incompatibility between Hadoop and Mahout, I don't think "mahout kmeans" would run properly.

Is Mahout 0.9 compatible with Hadoop 0.20?
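One way to answer that empirically is to ask the cluster's own jars; a small illustrative probe (not part of Mahout; org.apache.hadoop.util.VersionInfo exists in both the 1.x and 2.x lines):
----------------------------------------------------------------------------------------
// HadoopLineProbe.java (illustrative) -- prints the Hadoop version string that is actually
// on the classpath, for example: java -cp "$(hadoop classpath)":. HadoopLineProbe
import org.apache.hadoop.util.VersionInfo;

public class HadoopLineProbe {
    public static void main(String[] args) {
        System.out.println("Hadoop on classpath: " + VersionInfo.getVersion());
    }
}
----------------------------------------------------------------------------------------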








Re: reduce is too slow in StreamingKmeans

Posted by Suneel Marthi <su...@yahoo.com>.




On Monday, March 17, 2014 3:43 AM, fx MA XIAOJUN <xi...@fujixerox.co.jp> wrote:
 
Thank you for your quick reply.

As to -km, I thought the log was log10 instead of ln. I was wrong...
This time I set -km 140000 and ran mahout streamingkmeans again (CDH 5.0 MRv1, Mahout 0.8).
The maps ran faster than before, but the reduce was still stuck at 76% forever.

>> This has been my experience too, both with 0.8 and 0.9. 

So I uninstalled Mahout 0.8 and installed Mahout 0.9 in order to use the -rskm option.

Mahout kmeans can be executed properly, so I think the installation of Mahout 0.9 was successful.

>> What do you mean by this? kmeans hasn't changed between 0.8 and 0.9. Did you mean Streaming KMeans here?

However, when executing mahout streamingkmeans, I got the following error.
The Hadoop I installed is CDH5-beta1, MapReduce version 1.
----------------------------------------------------------------------------------------
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
    at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.runMapReduce(StreamingKMeansDriver.java:464)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:419)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:491)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
--------------------------------------------------------------------------------------------


Seems like you are trying to execute on Hadoop 2 while Mahout 0.9 has been built with the Hadoop 1.x profile, hence the error you are seeing.
If you would like to test on Hadoop 2, work off of the present trunk and build the code with the Hadoop 2 profile like below:

mvn clean install -Dhadoop2.profile=<hadoop 2.x version>

Please give that a try.




