Posted to user@mahout.apache.org by WangRamon <ra...@hotmail.com> on 2011/10/18 09:55:14 UTC

Any general performance tips for job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer?




Hi All,

I'm running a recommender job on a Hadoop environment with about 600,000 users and 2,000,000 items; the total number of user-preference records is about 66,260,000, and the data file is about 1 GB. I found the RowSimilarityJob-CooccurrencesMapper-SimilarityReducer job is very slow, and I get a lot of log lines like these in the mapper task output:

2011-10-18 15:18:49,300 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 73
2011-10-18 15:20:23,410 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 64
2011-10-18 15:22:45,466 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 55
2011-10-18 15:25:07,928 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 46

I did find a similar question on the mailing list, e.g. http://mail-archives.apache.org/mod_mbox/mahout-user/201104.mbox/%3CBANLkTik5cswH8UnEPoErgePrKRMKJhf7LQ@mail.gmail.com%3E -- Sebastian said something about using Mahout 0.5 in that thread, and yes, I am using Mahout 0.5, but there was no further discussion. It would be great if you could share some ideas/suggestions here; that would be a big help to me. Thanks in advance.

BTW, I already have the following parameters set in Hadoop:

mapred.child.java.opts -> 2048M
fs.inmemory.size.mb -> 200
io.file.buffer.size -> 131072

I have two servers, each with 32GB RAM. Thanks!

Cheers,
Ramon

Re: Any general performance tips for job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer?

Posted by Sebastian Schelter <ss...@apache.org>.
Very generally speaking, RowSimilarityJob starts with a matrix A',
transposes it back to A and computes A'A (with some slight modifications
that allow the embedding of similarity measures).

The way this multiplication is done is very similar to Jake's "outer
column" trick, a.k.a. the column picture of matrix multiplication.

The crucial thing to look at is extremely long rows of A, which
correspond to the power users in recommendation lingo. Of course the
same problems arise in other domains such as document similarity, where
terms with a high document frequency would slow down the processing
time and techniques such as throwing away the 1% of terms with the
highest df are applied.
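To make that concrete, here is a minimal, hypothetical sketch of the column picture (the class and method names are made up, not Mahout's): A'A is accumulated as the sum of each row's outer product with itself, so a row with k nonzero entries contributes k^2 partial products -- which is exactly why extremely long rows dominate the runtime.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OuterProductSketch {

    // Computes S = A'A for a sparse matrix A given as rows (column index -> value).
    // Each row contributes the outer product of itself with itself; a row with
    // k nonzero entries therefore costs O(k^2) updates (the power-user problem).
    // Entry (i, j) of S is stored under the flat key i * numCols + j.
    static Map<Long, Double> selfSimilarity(List<Map<Integer, Double>> rows, int numCols) {
        Map<Long, Double> s = new HashMap<>();
        for (Map<Integer, Double> row : rows) {
            for (Map.Entry<Integer, Double> a : row.entrySet()) {
                for (Map.Entry<Integer, Double> b : row.entrySet()) {
                    long key = (long) a.getKey() * numCols + b.getKey();
                    s.merge(key, a.getValue() * b.getValue(), Double::sum);
                }
            }
        }
        return s;
    }

    public static void main(String[] args) {
        List<Map<Integer, Double>> rows = new ArrayList<>();
        rows.add(new HashMap<>(Map.of(0, 1.0, 1, 2.0)));
        rows.add(new HashMap<>(Map.of(1, 3.0)));
        // (A'A)[1][1] = 2*2 + 3*3 = 13
        System.out.println(selfSimilarity(rows, 2).get(1L * 2 + 1)); // prints 13.0
    }
}
```

In the distributed job the per-row pair generation happens in the mappers and the summation in the reducers, which is why the merge phase in the logs above grows with the squared row lengths.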

--sebastian

On 18.10.2011 10:24, Dan Brickley wrote:
> 2011/10/18 Sebastian Schelter <ss...@apache.org>:
>> Hi Ramon,
>>
>> my first suggestion would be to use Mahout 0.6 as significant
>> improvements have been made to RowSimilarityJob and the 0.5 version has
>> known bugs.
>>
>> The runtime of RowSimilarityJob is not only determined by the size of
>> the input but also by the distribution of the interactions among the
>> users.
> 
> As an aside, I've notice this 'users' terminology lurking in the
> background of RowSimilarityJob (eg. in JIRA discussion).
> 
> My use of it last week seemed perfectly reasonable; but rows were
> books (or bibliographic records), with feature columns from library
> topic codes. Does the 'user' terminology suggest it's really focussed
> on recommendations?
> 
> I'm used to seeing this in the Taste part of Mahout, where sometimes
> it's suggested we can re-use recommender pieces by eg. thinking more
> broadly and 'recommending topics to books' or vice versa. This makes
> sense but introduces an extra layer of conceptual confusion. Is there
> any important sense in which rows (or columns?) in RowSimilarityJob
> ought to be thought of as users? Or the values/weights as preferences?
> 
> cheers,
> 
> Dan


Re: Any general performance tips for job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer?

Posted by Sean Owen <sr...@gmail.com>.
I misunderstood the original question, then.

Thing-thing similarity is a key piece of most recommender algorithms,
and RowSimilarityJob is reused for its distributed computation. In
those senses, the answer is 'yes'.

But on the whole I think the answer is 'no'. The similarity metrics are
not derived from recommenders and are used in other contexts. You can
compute thing-thing similarity for other reasons. There is no user-item
asymmetry at this level, I think, so using the user-item terms is not
well motivated here.

At the same time I don't think it hurts much, and it lets you
understand the relation to recommendations (which is a primary user of
this general component) more easily.


On Thu, Oct 20, 2011 at 12:06 PM, Dan Brickley <da...@danbri.org> wrote:
> In general, I completely agree with your perspective here. Even when
> everything bottoms out as matrix maths underneath, that doesn't mean
> that developers should only ever see that abstraction in their
> day-to-day hacking. Mahout lets you adopt at various levels; Taste
> gives almost a drop-in running service; the bin/mahout utility and
> recommender APIs give a variety of high level entry points, and then
> of course being opensource, Java developers can jump into the code at
> any level that suits their need. For lots of those entry points,
> 'user' and 'item' are a great way to present things.
>
> Anyhow, I think my question still holds: is the 'bin/mahout
> rowsimilarity' piece of Mahout something that should be understood
> primarily as a recommendations-oriented component? For my application
> I was seeking just 'the most similar books' for any given book, to
> feed those affinities to Gephi for visual mapping. I could
> conceptualise this in terms of recommending I guess; but I didn't. So
> that's why I was mildly suprised when I noticed that others in Jira
> and email did seem to think of rowsimiliarityjob in
> recommendation-oriented terms (ie. users and items). I completely
> agree that those are useful notions to have in the APIs and utilities,
> I just somehow wasn't expecting it right there (just as I wouldn't
> expect it on the more mathsy APIs either).
>
> cheers,
>
> Dan
>
> ps. as an aside, your points here also remind me of a few passages in
> http://en.wikipedia.org/wiki/Six_Degrees:_The_Science_of_a_Connected_Age
> that emphasise how a purely mathemetical perspective on
> networks/graphs can obscure the ways in which different kinds of
> network can usefully be understood, and that sometimes you do need to
> think about the social context alongside the maths...
>
>> On Tue, Oct 18, 2011 at 9:24 AM, Dan Brickley <da...@danbri.org> wrote:
>>> As an aside, I've notice this 'users' terminology lurking in the
>>> background of RowSimilarityJob (eg. in JIRA discussion).
>>>
>>> My use of it last week seemed perfectly reasonable; but rows were
>>> books (or bibliographic records), with feature columns from library
>>> topic codes. Does the 'user' terminology suggest it's really focussed
>>> on recommendations?
>>>
>>> I'm used to seeing this in the Taste part of Mahout, where sometimes
>>> it's suggested we can re-use recommender pieces by eg. thinking more
>>> broadly and 'recommending topics to books' or vice versa. This makes
>>> sense but introduces an extra layer of conceptual confusion. Is there
>>> any important sense in which rows (or columns?) in RowSimilarityJob
>>> ought to be thought of as users? Or the values/weights as preferences?
>>>
>>> cheers,
>>>
>>> Dan
>

Re: Any general performance tips for job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer?

Posted by Dan Brickley <da...@danbri.org>.
Hi Sean,

On 18 October 2011 11:09, Sean Owen <sr...@gmail.com> wrote:
> Nice question. I have answers I like.
>
> Really, it would be better to find words that mean
> thing-being-recommended-to and thing-being-recommended. I couldn't
> find easy, general terms that were more intuitive than "user" and
> "item". Even though these things need not be actual people or
> products, and so are inaccurate terms, they connote the right sorts of
> ways of thinking about what they are and how they work.
>
> You could also say that since both can be anything, there should be at
> best one term for both -- a thing or entity. I don't like this on the
> same grounds that it makes things harder to think about in practice.
> Is that "thingID" the thing being recommended or recommended to in the
> code...?
>
> More important I don't think users and items are entirely symmetric,
> even though you could plug items in for users and vice versa. For
> instance, one is 'causing' the ratings and the other isn't. It's
> harder to make future predictions about the black-box source of new
> surprising data. That is, I may learn something quite new about you in
> your 1000th rating, when you rate your first classical music album
> ever; the 1000th rating for that same album probably didn't add much
> new info. Users, the causers, are more variable.
>
> And I think you do tend to have an independent/dependent variable, so
> to speak, in any setup. And, the algorithms sort of embed that
> assymmetry. Item-based recommenders aren't quite the same. For example
> it rather encourages you to pre-compute item-item similarity since
> this is likely to be relatively fixed, being the dependent variable.

In general, I completely agree with your perspective here. Even when
everything bottoms out as matrix maths underneath, that doesn't mean
that developers should only ever see that abstraction in their
day-to-day hacking. Mahout lets you adopt at various levels; Taste
gives almost a drop-in running service; the bin/mahout utility and
recommender APIs give a variety of high level entry points, and then
of course being opensource, Java developers can jump into the code at
any level that suits their need. For lots of those entry points,
'user' and 'item' are a great way to present things.

Anyhow, I think my question still holds: is the 'bin/mahout
rowsimilarity' piece of Mahout something that should be understood
primarily as a recommendations-oriented component? For my application
I was seeking just 'the most similar books' for any given book, to
feed those affinities to Gephi for visual mapping. I could have
conceptualised this in terms of recommending, I guess; but I didn't. So
that's why I was mildly surprised when I noticed that others in Jira
and email did seem to think of RowSimilarityJob in
recommendation-oriented terms (i.e. users and items). I completely
agree that those are useful notions to have in the APIs and utilities;
I just somehow wasn't expecting it right there (just as I wouldn't
expect it in the more mathsy APIs either).

cheers,

Dan

ps. as an aside, your points here also remind me of a few passages in
http://en.wikipedia.org/wiki/Six_Degrees:_The_Science_of_a_Connected_Age
that emphasise how a purely mathematical perspective on
networks/graphs can obscure the ways in which different kinds of
network can usefully be understood, and that sometimes you do need to
think about the social context alongside the maths...

> On Tue, Oct 18, 2011 at 9:24 AM, Dan Brickley <da...@danbri.org> wrote:
>> As an aside, I've notice this 'users' terminology lurking in the
>> background of RowSimilarityJob (eg. in JIRA discussion).
>>
>> My use of it last week seemed perfectly reasonable; but rows were
>> books (or bibliographic records), with feature columns from library
>> topic codes. Does the 'user' terminology suggest it's really focussed
>> on recommendations?
>>
>> I'm used to seeing this in the Taste part of Mahout, where sometimes
>> it's suggested we can re-use recommender pieces by eg. thinking more
>> broadly and 'recommending topics to books' or vice versa. This makes
>> sense but introduces an extra layer of conceptual confusion. Is there
>> any important sense in which rows (or columns?) in RowSimilarityJob
>> ought to be thought of as users? Or the values/weights as preferences?
>>
>> cheers,
>>
>> Dan

Re: Any general performance tips for job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer?

Posted by Sean Owen <sr...@gmail.com>.
Nice question. I have answers I like.

Really, it would be better to find words that mean
thing-being-recommended-to and thing-being-recommended. I couldn't
find easy, general terms that were more intuitive than "user" and
"item". Even though these things need not be actual people or
products, and so are inaccurate terms, they connote the right sorts of
ways of thinking about what they are and how they work.

You could also say that since both can be anything, there should be
just one term for both -- a thing or entity. I don't like this on the
same grounds: it makes things harder to think about in practice. Is
that "thingID" the thing being recommended or the thing recommended to
in the code...?

More importantly, I don't think users and items are entirely symmetric,
even though you could plug items in for users and vice versa. For
instance, one is 'causing' the ratings and the other isn't. It's
harder to make future predictions about the black-box source of new,
surprising data. That is, I may learn something quite new about you in
your 1000th rating, when you rate your first classical music album
ever; the 1000th rating for that same album probably didn't add much
new info. Users, the causers, are more variable.

And I think you do tend to have an independent/dependent variable, so
to speak, in any setup, and the algorithms sort of embed that
asymmetry. Item-based recommenders aren't quite the same; for example,
they rather encourage you to pre-compute item-item similarity, since
this is likely to be relatively fixed, being the dependent variable.

On Tue, Oct 18, 2011 at 9:24 AM, Dan Brickley <da...@danbri.org> wrote:
> As an aside, I've notice this 'users' terminology lurking in the
> background of RowSimilarityJob (eg. in JIRA discussion).
>
> My use of it last week seemed perfectly reasonable; but rows were
> books (or bibliographic records), with feature columns from library
> topic codes. Does the 'user' terminology suggest it's really focussed
> on recommendations?
>
> I'm used to seeing this in the Taste part of Mahout, where sometimes
> it's suggested we can re-use recommender pieces by eg. thinking more
> broadly and 'recommending topics to books' or vice versa. This makes
> sense but introduces an extra layer of conceptual confusion. Is there
> any important sense in which rows (or columns?) in RowSimilarityJob
> ought to be thought of as users? Or the values/weights as preferences?
>
> cheers,
>
> Dan
>

Re: Any general performance tips for job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer?

Posted by Dan Brickley <da...@danbri.org>.
2011/10/18 Sebastian Schelter <ss...@apache.org>:
> Hi Ramon,
>
> my first suggestion would be to use Mahout 0.6 as significant
> improvements have been made to RowSimilarityJob and the 0.5 version has
> known bugs.
>
> The runtime of RowSimilarityJob is not only determined by the size of
> the input but also by the distribution of the interactions among the
> users.

As an aside, I've noticed this 'users' terminology lurking in the
background of RowSimilarityJob (eg. in JIRA discussion).

My use of it last week seemed perfectly reasonable; but rows were
books (or bibliographic records), with feature columns from library
topic codes. Does the 'user' terminology suggest it's really focussed
on recommendations?

I'm used to seeing this in the Taste part of Mahout, where sometimes
it's suggested we can re-use recommender pieces by eg. thinking more
broadly and 'recommending topics to books' or vice versa. This makes
sense but introduces an extra layer of conceptual confusion. Is there
any important sense in which rows (or columns?) in RowSimilarityJob
ought to be thought of as users? Or the values/weights as preferences?

cheers,

Dan

Re: Any general performance tips for job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer?

Posted by Sean Owen <sr...@gmail.com>.
Good tip on the Hadoop args -- I just added this check.

0.6 might be out before the end of the year, not sure.

2011/10/18 WangRamon <ra...@hotmail.com>:
>
> Thanks Sebastian, will upgrade it, btw, do we have any plan to release 0.6 in the short future?

RE: Any general performance tips for job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer?

Posted by WangRamon <ra...@hotmail.com>.
Thanks Sebastian, I will upgrade. BTW, is there any plan to release 0.6 in the near future?
 > Date: Tue, 18 Oct 2011 11:41:27 +0200
> From: ssc@apache.org
> To: user@mahout.apache.org
> Subject: Re: Any general performance tips for job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer?
> 
> 
> 
> On 18.10.2011 11:10, Sean Owen wrote:
> > 0.6 is not released. I think you will find it just as stable (or
> > unstable :) ) as 0.5. I would not be afraid to put it in production,
> > with the usual care and testing you'd normally do.
> 
> I would definitely recommend using 0.6 in your context.
> 
> > 
> > 2011/10/18 WangRamon <ra...@hotmail.com>:
> >>
> >>
> >>
> >>
> >>
> >> Hi Sebastian
> >>
> >> Thanks for your quick reply.
> >>
> >> As far as i know latest Mahout release is: Mahout 0.5. Mahout 0.6 is still under development, please correct me if i were wrong, so i'm not sure can i use Mahout 0.6 in a product environment? We plan to run Mahout recommend Job on a 30+ nodes environment.
> >>
> >> I'm doing benchmark test right now, so I'm using a test data, every user will recommend about 60~120 items, so I think the data file should be fine now. I cannot find the two parameters listed in your mail "maxPrefsPerUserInItemSimilarity " and "maxPrefsPerUser", are these two for Mahout 0.6.  I see you mentioned to use ItemSimilarityJob, this job is not included in class "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob", instead, RecommenderJob use RowSimilarityJob, so what's difference between ItemSimilarityJob and RowSimilarityJob? How do i use ItemSimilarityJob? ThanksRamon
> >>
> 

Re: Any general performance tips for job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer?

Posted by Sebastian Schelter <ss...@apache.org>.

On 18.10.2011 11:10, Sean Owen wrote:
> 0.6 is not released. I think you will find it just as stable (or
> unstable :) ) as 0.5. I would not be afraid to put it in production,
> with the usual care and testing you'd normally do.

I would definitely recommend using 0.6 in your context.

> 
> 2011/10/18 WangRamon <ra...@hotmail.com>:
>>
>>
>>
>>
>>
>> Hi Sebastian
>>
>> Thanks for your quick reply.
>>
>> As far as i know latest Mahout release is: Mahout 0.5. Mahout 0.6 is still under development, please correct me if i were wrong, so i'm not sure can i use Mahout 0.6 in a product environment? We plan to run Mahout recommend Job on a 30+ nodes environment.
>>
>> I'm doing benchmark test right now, so I'm using a test data, every user will recommend about 60~120 items, so I think the data file should be fine now. I cannot find the two parameters listed in your mail "maxPrefsPerUserInItemSimilarity " and "maxPrefsPerUser", are these two for Mahout 0.6.  I see you mentioned to use ItemSimilarityJob, this job is not included in class "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob", instead, RecommenderJob use RowSimilarityJob, so what's difference between ItemSimilarityJob and RowSimilarityJob? How do i use ItemSimilarityJob? ThanksRamon
>>


Re: Any general performance tips for job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer?

Posted by Sean Owen <sr...@gmail.com>.
0.6 is not released. I think you will find it just as stable (or
unstable :) ) as 0.5. I would not be afraid to put it in production,
with the usual care and testing you'd normally do.

2011/10/18 WangRamon <ra...@hotmail.com>:
>
>
>
>
>
> Hi Sebastian
>
> Thanks for your quick reply.
>
> As far as i know latest Mahout release is: Mahout 0.5. Mahout 0.6 is still under development, please correct me if i were wrong, so i'm not sure can i use Mahout 0.6 in a product environment? We plan to run Mahout recommend Job on a 30+ nodes environment.
>
> I'm doing benchmark test right now, so I'm using a test data, every user will recommend about 60~120 items, so I think the data file should be fine now. I cannot find the two parameters listed in your mail "maxPrefsPerUserInItemSimilarity " and "maxPrefsPerUser", are these two for Mahout 0.6.  I see you mentioned to use ItemSimilarityJob, this job is not included in class "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob", instead, RecommenderJob use RowSimilarityJob, so what's difference between ItemSimilarityJob and RowSimilarityJob? How do i use ItemSimilarityJob? ThanksRamon
>

RE: Any general performance tips for job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer?

Posted by WangRamon <ra...@hotmail.com>.




Hi Sebastian,

Thanks for your quick reply.

As far as I know the latest Mahout release is 0.5 and Mahout 0.6 is still under development (please correct me if I'm wrong), so I'm not sure whether I can use Mahout 0.6 in a production environment. We plan to run the Mahout recommender job on a 30+ node environment.

I'm doing a benchmark test right now with test data; every user rates about 60~120 items, so I think the data file should be fine for now. I cannot find the two parameters listed in your mail, "maxPrefsPerUserInItemSimilarity" and "maxPrefsPerUser" -- are these two for Mahout 0.6? I see you mentioned using ItemSimilarityJob, but this job is not used in the class "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob"; instead, RecommenderJob uses RowSimilarityJob. So what's the difference between ItemSimilarityJob and RowSimilarityJob, and how do I use ItemSimilarityJob?

Thanks,
Ramon
> Date: Tue, 18 Oct 2011 10:10:43 +0200
> From: ssc@apache.org
> To: user@mahout.apache.org
> Subject: Re: Any general performance tips for job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer?
> 
> Hi Ramon,
> 
> my first suggestion would be to use Mahout 0.6 as significant
> improvements have been made to RowSimilarityJob and the 0.5 version has
> known bugs.
> 
> The runtime of RowSimilarityJob is not only determined by the size of
> the input but also by the distribution of the interactions among the
> users. In typical collaborative filtering datasets the interactions will
> roughly follow a power-law distribution which means that there are a few
> "power"-users with an enormous amount of interactions.
> 
> For each of these "power"-users the square of the number of their
> interactions has to be processed which means they significantly slow
> down the job without providing too much value (you don't learn a lot
> from people that like "nearly everything"). The interactions of these
> power-users need to be down-sampled which is done via the parameter
> --maxPrefsPerUserInItemSimilarity in RecommenderJob and
> --maxPrefsPerUser in ItemSimilarityJob.
> 
> --sebastian
> 
> 
> On 18.10.2011 09:55, WangRamon wrote:
> > 
> > 
> > 
> > 
> > Hi All I'm running a recommend job on a Hadoop environment with about 600000 users and 2000000 items, the total user-pref records is about 66260000, the data file is of 1GB size. I found the RowSimilarityJob-CooccurrencesMapper-SimilarityReducer job is very slow, and get a lot of logs like these in the mapper task output:  2011-10-18 15:18:49,300 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 73
> > 2011-10-18 15:20:23,410 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 64
> > 2011-10-18 15:22:45,466 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 55
> > 2011-10-18 15:25:07,928 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 46 
> > Actually, i do find some similar question from the mail list, e.g. http://mail-archives.apache.org/mod_mbox/mahout-user/201104.mbox/%3CBANLkTik5cswH8UnEPoErgePrKRMKJhf7LQ@mail.gmail.com%3E , Sebastian said something about to use Mahout 0.5 in that mail thread, and yes i'm using Mahout 0.5, however there is no further discussion, it will be great if you guys can share some ideas/suggestions here, that will be a big help to me, thanks in advance. BTW, i have the following parameters already set in Hadoop: mapred.child.java.opts -> 2048M, fs.inmemory.size.mb -> 200, io.file.buffer.size -> 131072. I have two servers, each with 32GB RAM, THANKS! Cheers, Ramon
> 


Re: Any general performance tips for job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer?

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Ramon,

my first suggestion would be to use Mahout 0.6 as significant
improvements have been made to RowSimilarityJob and the 0.5 version has
known bugs.

The runtime of RowSimilarityJob is determined not only by the size of
the input but also by the distribution of the interactions among the
users. In typical collaborative filtering datasets the interactions
roughly follow a power-law distribution, which means that there are a
few "power" users with an enormous number of interactions.

For each of these power users the square of the number of their
interactions has to be processed, which means they significantly slow
down the job without providing much value (you don't learn a lot
from people who like "nearly everything"). The interactions of these
power users need to be down-sampled, which is done via the parameter
--maxPrefsPerUserInItemSimilarity in RecommenderJob and
--maxPrefsPerUser in ItemSimilarityJob.
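A minimal sketch of that down-sampling idea (illustrative only; the names are made up and this is not Mahout's actual implementation): capping a power user at maxPrefs turns an O(n^2) contribution into an O(maxPrefs^2) one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class DownSampleSketch {

    // Keeps at most maxPrefs interactions per user, chosen uniformly at random
    // with a fixed seed for reproducibility. Users under the cap pass through
    // untouched.
    static List<String> downSample(List<String> prefs, int maxPrefs, long seed) {
        if (prefs.size() <= maxPrefs) {
            return prefs;
        }
        List<String> copy = new ArrayList<>(prefs);
        Collections.shuffle(copy, new Random(seed));
        return copy.subList(0, maxPrefs);
    }

    public static void main(String[] args) {
        List<String> prefs = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            prefs.add("item" + i);
        }
        // A user with 10,000 prefs would otherwise contribute ~10^8 pairs;
        // capped at 500 they contribute at most 250,000.
        System.out.println(downSample(prefs, 500, 42L).size()); // prints 500
    }
}
```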

--sebastian


On 18.10.2011 09:55, WangRamon wrote:
> 
> 
> 
> 
> Hi All I'm running a recommend job on a Hadoop environment with about 600000 users and 2000000 items, the total user-pref records is about 66260000, the data file is of 1GB size. I found the RowSimilarityJob-CooccurrencesMapper-SimilarityReducer job is very slow, and get a lot of logs like these in the mapper task output:  2011-10-18 15:18:49,300 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 73
> 2011-10-18 15:20:23,410 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 64
> 2011-10-18 15:22:45,466 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 55
> 2011-10-18 15:25:07,928 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 46 
> Actually, i do find some similar question from the mail list, e.g. http://mail-archives.apache.org/mod_mbox/mahout-user/201104.mbox/%3CBANLkTik5cswH8UnEPoErgePrKRMKJhf7LQ@mail.gmail.com%3E , Sebastian said something about to use Mahout 0.5 in that mail thread, and yes i'm using Mahout 0.5, however there is no further discussion, it will be great if you guys can share some ideas/suggestions here, that will be a big help to me, thanks in advance. BTW, i have the following parameters already set in Hadoop: mapred.child.java.opts -> 2048M, fs.inmemory.size.mb -> 200, io.file.buffer.size -> 131072. I have two servers, each with 32GB RAM, THANKS! Cheers, Ramon


RE: Any general performance tips for job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer?

Posted by WangRamon <ra...@hotmail.com>.
Hi Sean,

I will try increasing the properties "io.sort.factor" and "io.sort.mb" in core-site.xml to see what happens. BTW, I see you use String javaOpts = conf.get("mapred.child.java.opts"); to get the heap size for each map/reduce task. That's fine for Hadoop 0.20.2 and before, but since 0.20.3 it has been replaced by "mapred.map.child.java.opts" and "mapred.reduce.child.java.opts", so maybe you should use a default configuration or make it an argument for the user to supply.

Cheers,
Ramon
 > Date: Tue, 18 Oct 2011 09:58:30 +0100
> Subject: Re: Any general performance tips for job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer?
> From: srowen@gmail.com
> To: user@mahout.apache.org
> 
> If the merge phase is what's taking a while, I can suggest two
> parameter changes to help speed that up. (This is in addition to what
> Sebastian said.)
> 
> First, I think it's useful to let it do a 100-way segment merge
> instead of 10-way. (Or more.) This is controlled by "io.sort.factor"
> in Hadoop.
> 
> Second, you probably want to let the combiner do more combining, to
> reduce the number of records spilled and merged. For this, you set
> "io.sort.mb". This job has a Combiner so it's valid. You could set it
> up to half of your worker memory or so.
> 
> Here's a section of code in RecommenderJob that is used to configure
> this all automatically on a JobContext; if it works for you, we could
> include it in this job too:
> 
>   private static void setIOSort(JobContext job) {
>     Configuration conf = job.getConfiguration();
>     conf.setInt("io.sort.factor", 100);
>     int assumedHeapSize = 512;
>     String javaOpts = conf.get("mapred.child.java.opts");
>     if (javaOpts != null) {
>       Matcher m = Pattern.compile("-Xmx([0-9]+)([mMgG])").matcher(javaOpts);
>       if (m.find()) {
>         assumedHeapSize = Integer.parseInt(m.group(1));
>         String megabyteOrGigabyte = m.group(2);
>         if ("g".equalsIgnoreCase(megabyteOrGigabyte)) {
>           assumedHeapSize *= 1024;
>         }
>       }
>     }
>     // Cap this at 1024MB now; see
>     // https://issues.apache.org/jira/browse/MAPREDUCE-2308
>     conf.setInt("io.sort.mb", Math.min(assumedHeapSize / 2, 1024));
>     // For some reason the Merger doesn't report status for a long time;
>     // increase the timeout when running these jobs
>     conf.setInt("mapred.task.timeout", 60 * 60 * 1000);
>   }
> 
> 
> 2011/10/18 WangRamon <ra...@hotmail.com>:
> >
> >
> >
> >
> > Hi All I'm running a recommend job on a Hadoop environment with about 600000 users and 2000000 items, the total user-pref records is about 66260000, the data file is of 1GB size. I found the RowSimilarityJob-CooccurrencesMapper-SimilarityReducer job is very slow, and get a lot of logs like these in the mapper task output:  2011-10-18 15:18:49,300 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 73
> > 2011-10-18 15:20:23,410 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 64
> > 2011-10-18 15:22:45,466 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 55
> > 2011-10-18 15:25:07,928 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 46
> > Actually, i do find some similar question from the mail list, e.g. http://mail-archives.apache.org/mod_mbox/mahout-user/201104.mbox/%3CBANLkTik5cswH8UnEPoErgePrKRMKJhf7LQ@mail.gmail.com%3E , Sebastian said something about to use Mahout 0.5 in that mail thread, and yes i'm using Mahout 0.5, however there is no further discussion, it will be great if you guys can share some ideas/suggestions here, that will be a big help to me, thanks in advance. BTW, i have the following parameters already set in Hadoop: mapred.child.java.opts -> 2048M, fs.inmemory.size.mb -> 200, io.file.buffer.size -> 131072. I have two servers, each with 32GB RAM, THANKS! Cheers, Ramon

Re: Any general performance tips for job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer?

Posted by Sean Owen <sr...@gmail.com>.
If the merge phase is what's taking a while, I can suggest two
parameter changes to help speed that up. (This is in addition to what
Sebastian said.)

First, I think it's useful to let it do a 100-way segment merge
instead of 10-way. (Or more.) This is controlled by "io.sort.factor"
in Hadoop.

Second, you probably want to let the combiner do more combining, to
reduce the number of records spilled and merged. For this, you set
"io.sort.mb". This job has a Combiner, so raising it is valid here. You
could set it to up to half of your worker memory or so.

Here's a section of code in RecommenderJob that is used to configure
this all automatically on a JobContext; if it works for you, we could
include it in this job too:

  private static void setIOSort(JobContext job) {
    Configuration conf = job.getConfiguration();
    conf.setInt("io.sort.factor", 100);
    int assumedHeapSize = 512;
    String javaOpts = conf.get("mapred.child.java.opts");
    if (javaOpts != null) {
      Matcher m = Pattern.compile("-Xmx([0-9]+)([mMgG])").matcher(javaOpts);
      if (m.find()) {
        assumedHeapSize = Integer.parseInt(m.group(1));
        String megabyteOrGigabyte = m.group(2);
        if ("g".equalsIgnoreCase(megabyteOrGigabyte)) {
          assumedHeapSize *= 1024;
        }
      }
    }
    // Cap this at 1024MB now; see
    // https://issues.apache.org/jira/browse/MAPREDUCE-2308
    conf.setInt("io.sort.mb", Math.min(assumedHeapSize / 2, 1024));
    // For some reason the Merger doesn't report status for a long time;
    // increase the timeout when running these jobs
    conf.setInt("mapred.task.timeout", 60 * 60 * 1000);
  }


2011/10/18 WangRamon <ra...@hotmail.com>:
>
>
>
>
> Hi All I'm running a recommend job on a Hadoop environment with about 600000 users and 2000000 items, the total user-pref records is about 66260000, the data file is of 1GB size. I found the RowSimilarityJob-CooccurrencesMapper-SimilarityReducer job is very slow, and get a lot of logs like these in the mapper task output:  2011-10-18 15:18:49,300 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 73
> 2011-10-18 15:20:23,410 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 64
> 2011-10-18 15:22:45,466 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 55
> 2011-10-18 15:25:07,928 INFO org.apache.hadoop.mapred.Merger: Merging 10 intermediate segments out of a total of 46
> Actually, i do find some similar question from the mail list, e.g. http://mail-archives.apache.org/mod_mbox/mahout-user/201104.mbox/%3CBANLkTik5cswH8UnEPoErgePrKRMKJhf7LQ@mail.gmail.com%3E , Sebastian said something about to use Mahout 0.5 in that mail thread, and yes i'm using Mahout 0.5, however there is no further discussion, it will be great if you guys can share some ideas/suggestions here, that will be a big help to me, thanks in advance. BTW, i have the following parameters already set in Hadoop: mapred.child.java.opts -> 2048M, fs.inmemory.size.mb -> 200, io.file.buffer.size -> 131072. I have two servers, each with 32GB RAM, THANKS! Cheers, Ramon