Posted to user@mahout.apache.org by Sean Owen <sr...@gmail.com> on 2010/04/23 11:44:04 UTC

Overhauled org.apache.mahout.cf.taste.hadoop.item

I thought it might be worth bringing this back to the user list.

Ankur effectively raised issues about the performance of
org.apache.mahout.cf.taste.hadoop.item by adding
org.apache.mahout.cf.taste.hadoop.cooccurrence, which is a similar
recommender job (item cooccurrence-based) but with a different
implementation. ".item" ultimately does not distribute the matrix /
user-vector multiply, while ".cooccurrence" distributes it heavily.

.item did the multiply by side-loading the co-occurrence matrix into
a reducer, reading it from disk as MapFiles. Accessing columns this
way proved to be very slow.

After much experimentation, I've completely overhauled .item by
grafting in ideas from .cooccurrence. It is a sort of
best-of-both-worlds hybrid of the two. It borrows a clever way to join
two kinds of input into one MapReduce, in order to join the
co-occurrence matrix columns and individual elements of each user
vector. The product is output and recombined later. This hybrid
retains features of .item like accommodating user ratings.
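
For anyone trying to picture the data flow, here is a minimal
in-memory sketch of the partial-product idea described above. It is
not the code from .item or .cooccurrence and uses no Hadoop or Mahout
classes; the class PartialProductSketch, the Pref type, and the sample
numbers are all made up for illustration. Both kinds of input are
grouped by item ID, each co-occurrence column is scaled by the
matching user preference, and the partial products are summed per
user.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartialProductSketch {

    /** Hypothetical type: one non-zero element of a user vector. */
    static final class Pref {
        final long userID;
        final long itemID;
        final double value;

        Pref(long userID, long itemID, double value) {
            this.userID = userID;
            this.itemID = itemID;
            this.value = value;
        }
    }

    /**
     * Computes (co-occurrence matrix) x (user vector) for every user.
     * cooccurrenceColumns maps itemID -> that item's column (otherItemID -> count).
     */
    static Map<Long, Map<Long, Double>> recommend(
            Map<Long, Map<Long, Double>> cooccurrenceColumns, List<Pref> prefs) {

        // Stand-in for the shuffle: group user-vector elements by item ID,
        // so each one meets the matrix column with the same key.
        Map<Long, List<Pref>> prefsByItem = new HashMap<>();
        for (Pref p : prefs) {
            prefsByItem.computeIfAbsent(p.itemID, k -> new ArrayList<>()).add(p);
        }

        // Stand-in for the join and the later recombination: scale the
        // column by each preference and sum the partial products per user.
        Map<Long, Map<Long, Double>> scoresByUser = new TreeMap<>();
        for (Map.Entry<Long, List<Pref>> e : prefsByItem.entrySet()) {
            Map<Long, Double> column = cooccurrenceColumns.get(e.getKey());
            if (column == null) {
                continue; // item never co-occurred with anything
            }
            for (Pref p : e.getValue()) {
                Map<Long, Double> scores =
                    scoresByUser.computeIfAbsent(p.userID, k -> new TreeMap<>());
                for (Map.Entry<Long, Double> c : column.entrySet()) {
                    scores.merge(c.getKey(), c.getValue() * p.value, Double::sum);
                }
            }
        }
        return scoresByUser;
    }

    /** Runs the sketch on a tiny made-up data set with three items and two users. */
    static Map<Long, Map<Long, Double>> demo() {
        Map<Long, Map<Long, Double>> columns = new HashMap<>();
        columns.put(1L, Map.of(1L, 2.0, 2L, 1.0));
        columns.put(2L, Map.of(1L, 1.0, 2L, 2.0, 3L, 1.0));
        columns.put(3L, Map.of(2L, 1.0, 3L, 1.0));

        List<Pref> prefs = new ArrayList<>();
        prefs.add(new Pref(10L, 1L, 5.0));
        prefs.add(new Pref(10L, 2L, 3.0));
        prefs.add(new Pref(20L, 3L, 4.0));
        return recommend(columns, prefs);
    }

    public static void main(String[] args) {
        // Prints {10={1=13.0, 2=11.0, 3=3.0}, 20={2=4.0, 3=4.0}}
        System.out.println(demo());
    }
}
```

In a real job the grouping would be done by the shuffle and the
per-user summing by a later reduce (and a Combiner), so no step needs
the whole matrix in one place or reads it in a random-access manner.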

Several things together have sped this up for me by at least a factor
of 10: letting Hadoop manage the data flow (even though it takes a bit
more copying), not reading from MapFiles in a random-access manner,
using features like the Combiner, and being smarter about Writables.
Most of the gain comes from avoiding MapFiles.

I bring it up since it's interesting, a good development for anyone
using this implementation, and, I imagine, an area that is ripe for
more testing and improvement.

Sean

Re: Overhauled org.apache.mahout.cf.taste.hadoop.item

Posted by Jake Mannix <ja...@gmail.com>.
I really should get my partially finished version of this in there... it
seems you guys keep converging closer and closer to my weird
matrix-triple-product way of doing it as time goes on.  :)

But yes, in general: avoiding MapFiles always helps.  Hadoop
is designed for bulk sequential access, and letting it do that
allows for maximal throughput; doing anything else is... fraught
with peril.

  -jake

Re: Overhauled org.apache.mahout.cf.taste.hadoop.item

Posted by "Ankur C. Goel" <ga...@yahoo-inc.com>.
Glad to hear that co-occurrence (my baby) could contribute some useful performance enhancements.
Have you checked in the code? Should we move the comments here to
https://issues.apache.org/jira/browse/MAHOUT-305, since it essentially tracks the merger?

-@nkur

