Posted to user@mahout.apache.org by Chris Schilling <ch...@gmail.com> on 2011/11/14 19:24:57 UTC

distributed similarity calculation for CF

Hi All,

I was just curious whether the job flow for the distributed similarity calculation is documented anywhere.  What is the difference between calculating a similarity sequentially and using distributed matrix operations on Hadoop?  I am just looking for a high-level description of how to get from the user-item matrix to an item-item similarity score in MapReduce.

Thanks!
Chris


Re: distributed similarity calculation for CF

Posted by Chris Schilling <ch...@thecleversense.com>.
Hey Sean, 

Yeah, thanks for this.  I am about to start getting into more of the details in the code and wanted a higher-level overview.  It seemed to me that some of the similarity calculations were more difficult to distribute than others.  Anyway, I'll dig down a bit further in the next few weeks.

Chris

On Nov 14, 2011, at 1:18 PM, Sean Owen wrote:

> probably differs slightly due to bits of logic in the Hadoop job that
> would prune small or insignificant data.

Chris Schilling
Sr. Data Mining Engineer
Clever Sense, Inc.
"Curating the World Around You"


Re: distributed similarity calculation for CF

Posted by Sean Owen <sr...@gmail.com>.
I don't know if it's explicitly documented. It's just the jobs you see
in RowSimilarityJob though. Crudely: phase 1 computes some statistics
per row (vector, item) and transposes. Phase 2 does the similarity
computation. Phase 3 puts the results together.

At a high level it's not different than computing these values without
Hadoop. Of course, the parallel implementation on Hadoop is very
different in its details. The result is in theory the same, but
probably differs slightly due to bits of logic in the Hadoop job that
would prune small or insignificant data.
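
To make the three phases concrete, here is a rough sketch in plain Python, not Hadoop, and not the actual RowSimilarityJob code. It assumes cosine similarity, a tiny made-up item-by-user matrix, and no pruning; every name in it (ITEM_ROWS, phase1_norms_and_transpose, and so on) is hypothetical and only for illustration. Each function stands in for what a map/reduce pass would do.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: rows are items, columns are users.
ITEM_ROWS = {
    "A": {"u1": 5.0, "u2": 3.0, "u3": 4.0},
    "B": {"u1": 4.0, "u2": 3.0},
    "C": {"u2": 1.0, "u3": 5.0},
}


def phase1_norms_and_transpose(item_rows):
    """Phase 1: compute a per-row statistic (here the L2 norm) and transpose."""
    norms = {}
    by_user = defaultdict(dict)
    for item, row in item_rows.items():
        norms[item] = math.sqrt(sum(v * v for v in row.values()))
        for user, value in row.items():
            by_user[user][item] = value
    return norms, dict(by_user)


def phase2_pairwise(norms, by_user):
    """Phase 2: each user column emits partial products for the item pairs
    it contains ("map"); partials are summed per pair and normalised with
    the phase-1 norms ("reduce")."""
    dots = defaultdict(float)
    for column in by_user.values():
        for a, b in combinations(sorted(column), 2):
            dots[(a, b)] += column[a] * column[b]
    return {(a, b): dot / (norms[a] * norms[b]) for (a, b), dot in dots.items()}


def phase3_assemble(pair_scores):
    """Phase 3: gather the pairwise scores into one sorted list per item."""
    per_item = defaultdict(list)
    for (a, b), score in pair_scores.items():
        per_item[a].append((b, score))
        per_item[b].append((a, score))
    return {
        item: sorted(neighbours, key=lambda t: t[1], reverse=True)
        for item, neighbours in per_item.items()
    }


def item_item_similarities(item_rows):
    """Run the three phases in order."""
    norms, by_user = phase1_norms_and_transpose(item_rows)
    return phase3_assemble(phase2_pairwise(norms, by_user))


def sequential_cosine(row_a, row_b):
    """The same similarity computed directly for one pair of rows."""
    dot = sum(v * row_b[u] for u, v in row_a.items() if u in row_b)
    norm_a = math.sqrt(sum(v * v for v in row_a.values()))
    norm_b = math.sqrt(sum(v * v for v in row_b.values()))
    return dot / (norm_a * norm_b)
```

Calling item_item_similarities(ITEM_ROWS) returns, for each item, its neighbours sorted by score. Because this sketch does no pruning, each score agrees with sequential_cosine on the same two rows, which is the "in theory the same" part; a real job that drops small or insignificant entries could differ slightly.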

Does that start to answer?

@bejoy, this is not what is described in MiA Chapter 6. This is
RowSimilarityJob, which isn't described directly in the book.

On Mon, Nov 14, 2011 at 6:24 PM, Chris Schilling
<ch...@gmail.com> wrote:
> Hi All,
>
> I was just curious whether the job flow for the distributed similarity calculation is documented anywhere.  What is the difference between calculating a similarity sequentially and using distributed matrix operations on Hadoop?  I am just looking for a high-level description of how to get from the user-item matrix to an item-item similarity score in MapReduce.
>
> Thanks!
> Chris
>
>