Posted to user@mahout.apache.org by Sean Owen <sr...@gmail.com> on 2009/11/09 05:04:38 UTC

Re: got Error: GC overhead limit exceeded when generate product similariy

This basically means "out of heap space". It didn't technically run
out of memory, but, is triggering garbage collection so much that the
JVM figures it is critically low on free heap space.

You need to allocate more heap space. Try just 256MB with -Xmx256m.

Have you looked at the documentation? There are some other settings
you should be using with the JVM, like -server.
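
To confirm that the heap flag actually reached the JVM, a small check like
the one below can help. It is only a sketch: HeapCheck is a hypothetical
helper, not part of Mahout, and it relies on nothing beyond the standard
Runtime.maxMemory() call.

```java
// Hypothetical helper (not part of Mahout): reports the heap ceiling
// that the running JVM was actually given.
final class HeapCheck {

    /** Maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Example invocation: java -Xmx256m HeapCheck
        System.out.println("Max heap (MB): " + maxHeapMegabytes());
    }
}
```

Run with -Xmx256m, this should report a value in the neighborhood of 256;
the exact figure can vary by JVM and collector.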

Also you don't need to create a GenericItemSimilarity. That is going
to consume a fair bit of memory as it is pre-computing and storing in
memory all item-item similarities from a Pearson correlation.


On Mon, Nov 9, 2009 at 2:28 AM, cumtyjh <cu...@163.com> wrote:
> Hi all,
>
> I got an error when generating product similarity from a rating file that has about 250,000 records.
>
> It works when there are only 10,000 records in the rating file.
>
>
> Do you have any suggestions? Any help is appreciated.
>
> Thanks in advance.
>
>
> The code and log follow:
>
>
> File file = new File(ratingFile);
> logger.log(Level.INFO, "begin to load rating file...");
> FileDataModel model = new FileDataModel(file);
> logger.log(Level.INFO, "load rating file OK.");
> ItemSimilarity pearson = new LogLikelihoodSimilarity(model);
> GenericItemSimilarity gif = new GenericItemSimilarity(pearson, model);
>
>
>
> INFO: load rating file OK.
> - Reading file info...
> - Processed 100000 lines
> - Processed 200000 lines
> Exception in thread "Thread-9" java.lang.OutOfMemoryError: GC overhead limit exceeded
> at org.apache.mahout.cf.taste.impl.common.FastSet.<init>(FastSet.java:74)
> at org.apache.mahout.cf.taste.impl.model.GenericDataModel.getNumUsersWithPreferenceFor(GenericDataModel.java:195)
> at org.apache.mahout.cf.taste.impl.model.file.FileDataModel.getNumUsersWithPreferenceFor(FileDataModel.java:314)
> at org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity.itemSimilarity(LogLikelihoodSimilarity.java:48)
> at org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity$DataModelSimilaritiesIterator.next(GenericItemSimilarity.java:291)
> at org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity$DataModelSimilaritiesIterator.next(GenericItemSimilarity.java:260)
> at org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity.initSimilarityMaps(GenericItemSimilarity.java:128)
> at org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity.<init>(GenericItemSimilarity.java:103)
>
> 2009-11-09
>
>
>
> cumtyjh
>

Re: Re: Re: got Error: GC overhead limit exceeded when generateproductsimilariy

Posted by cumtyjh <cu...@163.com>.

Thanks again.

I will set some JVM options and try it as you said.





2009-11-09 



cumtyjh 




Re: Re: got Error: GC overhead limit exceeded when generate productsimilariy

Posted by Sean Owen <sr...@gmail.com>.

OK, I do suggest you use -server for an application computing
recommendations. In fact I recommend other flags for the best
performance:

http://lucene.apache.org/mahout/taste.html#performance

If, by "offline", you mean computing item-item similarity before any
recommendations are computed, then you are already doing so. The one
line of code which creates a GenericItemSimilarity from a
PearsonCorrelationSimilarity does exactly that.

You could also manually compare every item-item pair with
PearsonCorrelationSimilarity, write out the result in a file, then
read it back in later, and manually create a GenericItemSimilarity
from that info. It would involve writing a bit more code but would
work fine.
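
As a rough sketch of that write-out and read-back step: the class below is
hypothetical and deliberately independent of Mahout. The similarity function
is passed in as a plain lambda, the comma-separated file layout is an
assumption, and turning the entries that come back into a
GenericItemSimilarity is left to whatever constructor your Mahout version
offers for precomputed similarities.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Hypothetical helper: persists item-item similarities as "itemA,itemB,value" lines.
final class SimilarityFile {

    /** One stored item-item similarity. */
    static final class Entry {
        final long itemA;
        final long itemB;
        final double value;

        Entry(long itemA, long itemB, double value) {
            this.itemA = itemA;
            this.itemB = itemB;
            this.value = value;
        }
    }

    /** Scores every unordered pair once and streams it straight to disk. */
    static void write(long[] itemIds, ToDoubleBiFunction<Long, Long> similarity, Path out) {
        try (BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            for (int i = 0; i < itemIds.length; i++) {
                for (int j = i + 1; j < itemIds.length; j++) {
                    double s = similarity.applyAsDouble(itemIds[i], itemIds[j]);
                    if (!Double.isNaN(s)) { // skip pairs with no defined similarity
                        w.write(itemIds[i] + "," + itemIds[j] + "," + s);
                        w.newLine();
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the pairs back in the order they were written. */
    static List<Entry> read(Path in) {
        List<Entry> entries = new ArrayList<>();
        try (BufferedReader r = Files.newBufferedReader(in, StandardCharsets.UTF_8)) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] parts = line.split(",");
                entries.add(new Entry(Long.parseLong(parts[0]),
                                      Long.parseLong(parts[1]),
                                      Double.parseDouble(parts[2])));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }
}
```

Because each pair is written as soon as it is scored, the writing side never
holds the full similarity matrix in memory; only the reading side does.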

This approach will certainly take more memory, and your memory
requirements will grow as the square of the number of items. But it
will make things faster.
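
To put a number on "the square of the number of items": n items give
n * (n - 1) / 2 unordered pairs. The sketch below just does that arithmetic.
PairCount is a hypothetical name, and the bytes-per-pair figure is whatever
you assume for your own storage, not a measured Mahout value.

```java
// Hypothetical helper: back-of-the-envelope sizing for an all-pairs similarity table.
final class PairCount {

    /** Number of unordered item-item pairs among n items: n * (n - 1) / 2. */
    static long pairs(long n) {
        return n * (n - 1) / 2;
    }

    /** Rough size estimate, given an assumed storage cost per stored pair. */
    static long estimatedBytes(long n, long assumedBytesPerPair) {
        return pairs(n) * assumedBytesPerPair;
    }
}
```

For example, 10,000 items give 49,995,000 pairs, and 20,000 items give
199,990,000, so doubling the item count roughly quadruples the table.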

Sean

On Mon, Nov 9, 2009 at 4:31 AM, cumtyjh <cu...@163.com> wrote:
> thanks Sean Owen.
>
> i have set jvm options like -Xmx2048m,no -server.
>
>
> i want to generate item-item similarity offline, then i can use it for recommendation.
>
>
> do you have some suggestion on generating item-item similarity offline?
>
>
> 2009-11-09
>
>
>
> cumtyjh
>

Re: Re: Re: got Error: GC overhead limit exceeded whengenerateproductsimilariy

Posted by cumtyjh <cu...@163.com>.

Got it.


Thank you all.

2009-11-09 



cumtyjh 




Re: Re: Re: got Error: GC overhead limit exceeded when generateproductsimilariy

Posted by Ted Dunning <te...@gmail.com>.

On Mon, Nov 9, 2009 at 4:57 AM, Sean Owen <sr...@gmail.com> wrote:

> Ted will say, and again I agree, that Pearson is not usually the
> best similarity metric, though it is widely mentioned in collaborative
> filtering examples and literature.
>

You said it!  I don't need to.

>  What Ted quotes below is implemented in the framework as
> LogLikelihoodSimilarity. For that, I believe it *is* the pairs with
> the largest resulting similarity score that you do want to keep. Or at
> least it is more reasonable. Ted maybe you can check my thinking on
> that.
>

Yes.  And you don't even need the score in the end, just the fact that it
passed the threshold.  I typically weight the pairing by the IDF score of
the source item.
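
A minimal sketch of the "just the fact that it passed" idea, assuming the
scores arrive as a stream of pairs. All names here are hypothetical, the
threshold is something you would have to tune, and the IDF weighting is not
shown.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: keeps only the identity of pairs whose score clears a threshold.
final class ThresholdFilter {

    /** A scored item-item pair as produced by some similarity computation. */
    static final class ScoredPair {
        final long itemA;
        final long itemB;
        final double score;

        ScoredPair(long itemA, long itemB, double score) {
            this.itemA = itemA;
            this.itemB = itemB;
            this.score = score;
        }
    }

    /** Order-independent key, so that (a, b) and (b, a) are the same pair. */
    static String key(long a, long b) {
        return Math.min(a, b) + ":" + Math.max(a, b);
    }

    /** Returns the keys of pairs scoring at or above the threshold; the scores are dropped. */
    static Set<String> significant(Iterable<ScoredPair> pairs, double threshold) {
        Set<String> kept = new HashSet<>();
        for (ScoredPair p : pairs) {
            if (p.score >= threshold) {
                kept.add(key(p.itemA, p.itemB));
            }
        }
        return kept;
    }
}
```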

-- 
Ted Dunning, CTO
DeepDyve

Re: Re: Re: got Error: GC overhead limit exceeded when generateproductsimilariy

Posted by Sean Owen <sr...@gmail.com>.

Yes, I agree that keeping all pairs is quite expensive, unless your
data set is relatively small (like tens of thousands of items). If
you're not running out of memory, OK, you can get away with it for
now.

But yes, many of the similarities will not contain much information
and don't add much value -- the question is, which ones?

For Pearson correlation-based similarity, it's not just a matter of
keeping the ones with the largest and smallest similarity scores --
those nearest 1 or -1. A similarity of 0 could still be very useful
information. I think you would actually want to keep an item-item pair
based on how many users expressed a preference for both items. The
more users, the more important it is to keep that pair.
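
Counting the users who expressed a preference for both items can be sketched
as a set intersection. This assumes you can get, per item, the set of user
IDs that rated it; CoOccurrence is a hypothetical helper, not a Mahout class.

```java
import java.util.Set;

// Hypothetical helper: measures how much evidence backs an item-item pair.
final class CoOccurrence {

    /** Counts users present in both sets, iterating over the smaller one. */
    static int usersWithBoth(Set<Long> usersOfA, Set<Long> usersOfB) {
        Set<Long> small = usersOfA.size() <= usersOfB.size() ? usersOfA : usersOfB;
        Set<Long> large = (small == usersOfA) ? usersOfB : usersOfA;
        int count = 0;
        for (Long user : small) {
            if (large.contains(user)) {
                count++;
            }
        }
        return count;
    }
}
```

Pairs with a low count could then be dropped before their similarity is
stored at all.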

If you'd like an example of efficiently looking through a large list
of things, and keeping only the "top n" of them, see the TopItems
class. You don't want to generate all pairs at once, then throw some
away -- that would still run you out of memory.
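
The general technique can be sketched with a bounded min-heap, so that at
most n candidates are held at any time. This is a generic illustration built
on java.util.PriorityQueue, not the actual TopItems code; TopPairs and Scored
are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper: streaming "top n" selection with bounded memory.
final class TopPairs {

    /** A candidate pair with its score. */
    static final class Scored {
        final String pair;
        final double score;

        Scored(String pair, double score) {
            this.pair = pair;
            this.score = score;
        }
    }

    /** Returns the n highest-scoring candidates, best first, holding at most n at a time. */
    static List<Scored> topN(Iterable<Scored> stream, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Min-heap: the weakest of the current top n sits at the head.
        PriorityQueue<Scored> heap =
            new PriorityQueue<>(n, Comparator.comparingDouble((Scored s) -> s.score));
        for (Scored s : stream) {
            if (heap.size() < n) {
                heap.add(s);
            } else if (s.score > heap.peek().score) {
                heap.poll(); // evict the current weakest
                heap.add(s);
            }
        }
        List<Scored> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingDouble((Scored s) -> s.score).reversed());
        return result;
    }
}
```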

Ted will say, and again I agree, that Pearson is not usually the
best similarity metric, though it is widely mentioned in collaborative
filtering examples and literature.

What Ted quotes below is implemented in the framework as
LogLikelihoodSimilarity. For that, I believe it *is* the pairs with
the largest resulting similarity score that you do want to keep. Or at
least it is more reasonable. Ted maybe you can check my thinking on
that.
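
For reference, here is a sketch of the log-likelihood ratio on a 2x2 table
of counts, as I understand it from the post Ted links. It is not copied from
LogLikelihoodSimilarity; the class name and the k11..k22 parameter names are
assumptions made for illustration.

```java
// Hypothetical helper: log-likelihood ratio (G^2) for a 2x2 contingency table.
final class LogLikelihoodRatio {

    private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    /** Unnormalized entropy: xLogX(sum of counts) minus the sum of xLogX(count). */
    private static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            sum += c;
            parts += xLogX(c);
        }
        return xLogX(sum) - parts;
    }

    /**
     * k11: users who prefer both items; k12: users who prefer only the first;
     * k21: users who prefer only the second; k22: users who prefer neither.
     */
    static double llr(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp at zero to absorb tiny negative values from floating-point rounding.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }
}
```

A table whose rows and columns are independent scores about zero, and the
score grows as the co-occurrence becomes more surprising, which is why
keeping the largest values is reasonable here.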

Sean

On Mon, Nov 9, 2009 at 7:09 AM, Ted Dunning <te...@gmail.com> wrote:
> Close.
>
> See the link below for one approach to finding the most important ones.  I
> believe that Sean has added something like this to Taste/Mahout.
>
> http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html

Re: Re: Re: got Error: GC overhead limit exceeded when generateproductsimilariy

Posted by Ted Dunning <te...@gmail.com>.

Close.

See the link below for one approach to finding the most important ones.  I
believe that Sean has added something like this to Taste/Mahout.

http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html

On Sun, Nov 8, 2009 at 10:51 PM, Yi Wang <wa...@yahoo.com.cn> wrote:

> Maybe Ted means the top ones.



-- 
Ted Dunning, CTO
DeepDyve

Re: Re: Re: got Error: GC overhead limit exceeded when generateproductsimilariy

Posted by Yi Wang <wa...@yahoo.com.cn>.

Maybe Ted means the top ones.

--- On Mon, Nov 9, 2009, cumtyjh <cu...@163.com> wrote:

From: cumtyjh <cu...@163.com>
Subject: Re: Re: Re: got Error: GC overhead limit exceeded when generateproductsimilariy
To: "mahout-user" <ma...@lucene.apache.org>
Date: Mon, Nov 9, 2009, 2:42 PM


i am a new guy on recommendation, what is the meaning of  "significant ones"?


Re: Re: Re: got Error: GC overhead limit exceeded when generateproductsimilariy

Posted by cumtyjh <cu...@163.com>.

I am new to recommendation. What is the meaning of "significant ones"?

2009-11-09 



cumtyjh 



From: Ted Dunning
Sent: 2009-11-09 14:35:51
To: mahout-user
Cc:
Subject: Re: Re: got Error: GC overhead limit exceeded when generateproductsimilariy
 
You shouldn't be generating all item-item links.  You only want the
significant ones.
On Sun, Nov 8, 2009 at 8:31 PM, cumtyjh <cu...@163.com> wrote:
> i want to generate item-item similarity offline, then i can use it for
> recommendation.
-- 
Ted Dunning, CTO
DeepDyve

Re: Re: got Error: GC overhead limit exceeded when generate productsimilariy

Posted by Ted Dunning <te...@gmail.com>.

You shouldn't be generating all item-item links.  You only want the
significant ones.

On Sun, Nov 8, 2009 at 8:31 PM, cumtyjh <cu...@163.com> wrote:

> i want to generate item-item similarity offline, then i can use it for
> recommendation.




-- 
Ted Dunning, CTO
DeepDyve

Re: Re: got Error: GC overhead limit exceeded when generate productsimilariy

Posted by cumtyjh <cu...@163.com>.

Thanks, Sean Owen.

I have set JVM options like -Xmx2048m, but not -server.


I want to generate item-item similarity offline, so that I can then use it for recommendation.


Do you have any suggestions on generating item-item similarity offline?


2009-11-09 



cumtyjh 


