Posted to user@mahout.apache.org by fx MA XIAOJUN <xi...@fujixerox.co.jp> on 2014/03/25 07:44:27 UTC

GC overhead limit exceeded in sequential mode of Mahout StreamingKMeans

I am using Mahout StreamingKMeans in sequential mode.
With a dataset of 2,000,000 objects with 128 variables each, I would like to get 10,000 clusters.

A "GC overhead limit exceeded" error occurred.
How do I set the Java memory limit for sequential mode?


Yours sincerely,
Ma
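For a sense of scale, here is a back-of-envelope sketch of the raw in-memory size of such a dataset, assuming dense vectors of 64-bit doubles (an assumption about the representation; Mahout may store vectors more compactly):

```python
# Rough footprint of 2,000,000 points x 128 dimensions, assuming
# dense 64-bit double vectors (an assumption, not from the thread).
n_points = 2_000_000
n_dims = 128
bytes_per_double = 8

data_bytes = n_points * n_dims * bytes_per_double
print(f"raw vector data: {data_bytes / 2**30:.1f} GiB")  # -> 1.9 GiB
```

Holding roughly 2 GiB of raw vectors plus clustering state in a default-sized JVM heap is already enough to trigger "GC overhead limit exceeded".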


Re: GC overhead limit exceeded in sequential mode of Mahout StreamingKMeans

Posted by Suneel Marthi <su...@yahoo.com>.
Hi Ma,

Are you really looking to create 10,000 clusters?

Could you first try with 1,000 clusters, so that your -km would then be ≈ 14509 for k = 1000?
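The sketch-size heuristic being applied here can be written out as follows (a minimal illustration; `estimated_km` is a hypothetical helper name, not a Mahout API):

```python
import math

# StreamingKMeans' -km is conventionally set to about k * ln(n):
# the estimated number of sketch clusters kept during streaming.
def estimated_km(k, n):
    return round(k * math.log(n))

print(estimated_km(1000, 2_000_000))   # -> 14509
print(estimated_km(10000, 2_000_000))  # -> 145087
```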





RE: GC overhead limit exceeded in sequential mode of Mahout StreamingKMeans

Posted by fx MA XIAOJUN <xi...@fujixerox.co.jp>.
Dear Suneel,
Thank you for your reply.


Dear Roland,
Thank you for joining the discussion of this problem.


My configuration is as follows:
-km is set to 140000 (from 10000 * ln(2000000)).

mapred.child.java.opts=-Xmx4g

Sequential mode does not start a MapReduce job, so I don't know whether mapred.child.java.opts will take effect. My computer has 64 GB of memory, and I don't know how to assign enough memory to a sequential Mahout job.

How about changing the configuration in hadoop-env, such as the heap size or the datanode memory size? Will those take effect?





Ma

 


Re: GC overhead limit exceeded in sequential mode of Mahout StreamingKMeans

Posted by Suneel Marthi <su...@yahoo.com>.
... forgot to ask: how many dimensions are you trying to cluster on?

Adding a combiner (there is none at present) may address this excessive memory usage in the reducer.





Re: GC overhead limit exceeded in sequential mode of Mahout StreamingKMeans

Posted by Suneel Marthi <su...@yahoo.com>.
Hi Roland,

Could you tell me how many intermediate centroids were being emitted from the mappers to the single reducer in your scenario? You have 6 GB allocated for a reducer, which is way more than what I can get on my work cluster (only 2 GB :-)).

I take it that you have not specified the -rskm option to further reduce the number of intermediate centroids. I guess it's high priority now to fix MAHOUT-1469, which addresses this and a few other StreamingKMeans issues. Please feel free to add to the JIRA anything else you may have about StreamingKMeans.

Is there anything else you observed and would like to add on this?

Let me make time to fix this.

Thanks again,
Suneel
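As a rough sense of scale, the intermediate centroids a single reducer must buffer grow with the number of mappers. The sketch below uses assumed numbers (the mapper count and dense-double representation are not given in the thread):

```python
# Hypothetical illustration: each mapper emits roughly -km sketch
# centroids, and a single reducer must buffer all of them.
km = 145090            # the thread's 10000 * ln(2000000) estimate
dims = 128
bytes_per_double = 8   # assuming dense double vectors
mappers = 50           # assumed; not stated in the thread

per_mapper = km * dims * bytes_per_double
total = per_mapper * mappers
print(f"per mapper: {per_mapper / 2**20:.0f} MiB")  # -> 142 MiB
print(f"reducer input: {total / 2**30:.1f} GiB")    # -> 6.9 GiB
```

Under these assumptions, the reducer's input alone outgrows a 6 GB heap, which is consistent with the failures reported here; a combiner or the -rskm option would shrink it.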







Re: GC overhead limit exceeded in sequential mode of Mahout StreamingKMeans

Posted by Roland von Herget <ro...@gmail.com>.
Hi Suneel,

I have the exact same problem with the following values:
No. of docs: 25,904,599
Command-line params: -k 1000 -km 17070
Reducer Xmx is 6 GB, running in full MapReduce mode.

Do you have any other ideas on what to try?

Thanks,
Roland
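Roland's -km of 17070 is consistent with the k * ln(n) rule used elsewhere in this thread (a quick check, not part of the original mails):

```python
import math

# Roland's -km of 17070 matches k * ln(n) for his corpus size.
k = 1000
n_docs = 25_904_599
print(round(k * math.log(n_docs)))  # -> 17070
```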



Re: GC overhead limit exceeded in sequential mode of Mahout StreamingKMeans

Posted by Suneel Marthi <su...@yahoo.com>.
What's your value for -km?
Based on what you provided, -km should be 10000 * ln(2000000) ≈ 145087.

Try reducing your number of clusters to 1000, with -km = 14509.






