You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@mahout.apache.org by Suneel Marthi <su...@yahoo.com> on 2012/01/20 17:38:09 UTC

Question on RowSimilarityJob

I am working on determining document similarity of a corpus I am working with using RowSimilarity.

Questions:-

a) What do the parameters - 'maxSimilaritiesPerRow' and 'excludeSelfSimilarity' mean?
b) Are there any docs available on RowSimilarityJob available, this is the best I could find on Sebastian's blog - http://ssc.io/rowsimilarityjob-on-steroids/ .

c) Also do we have any docs on RowIdJob ?

Thanks and Regards,
Suneel

Re: Question on RowSimilarityJob

Posted by Suneel Marthi <su...@yahoo.com>.

Thanks for the details, Sebastian. This is indeed very helpful and I completely understand now what you had on your whiteboard.

I had been normalizing my vectors during seq2sparse (with L2 norm since I would be using Cosine/Tanimoto Similarity in the RowSimilarityJob). But since the first step in the RowSimilarity job is normalizing the vectors I don't have to do that in seq2sparse; doing that would not impact the final outcome anyways.  


This has been a very useful thread and thanks for all your help.



________________________________
 From: Sebastian Schelter <ss...@apache.org>
To: Vicky <ja...@yahoo.com> 
Cc: "user@mahout.apache.org" <us...@mahout.apache.org> 
Sent: Thursday, February 2, 2012 8:08 AM
Subject: Re: Question on RowSimilarityJob
 
The heart of RowSimilarityJob is simple cooccurrence counting (with some
tweaks and optimizations). I'll give a simplified example:

We have a 3x3 matrix (could represent items rated by users, docs with
word counts e.g.)

row1: column1, column2
row2: column1, column3
row3: column1, column2

In the first MapReduce pass, we created an inverted index over the
columns (which means we transpose the matrix)

column1: row1, row2, row3
column2: row1, row3
column3: row2

In the second MapReduce pass, we the mapper emits all coocurring row
pairs per column:

for column1:

(row1,row2)
(row1,row3)
(row2,row1)
(row2,row3)
(row3,row1)
(row3,row2)

for column2:

(row1,row3)
(row3,row1)

for column3 there's nothing to emit.

The reducer receives all cooccurrences for a particular row and simply
needs to sum them up:

for row1:

(row1,row2)
(row1,row3)
(row1,row3)

row1,row2 -> 1 cooccurrence
row1,row3 -> 2 cooccurrences

for row2:

(row2,row1)
(row2,row3)

row2,row1 -> 1 cooccurrence
row2,row3 -> 1 cooccurrence

for row3:

(row3,row1)
(row3,row2)
(row3,row1)

row3,row1 -> 2 cooccurrences
row3,row2 -> 1 cooccurrence


Hope that helps with understanding the approach.

--sebastian




On 02.02.2012 12:47, Vicky wrote:
> Sebastian,
> when you said,
>>> The computation should be executed in a way that only rows that have at least one non-zero value in the same dimension (column) are compared
> 
> 
> if there are three columns in the document (columnA,columnB,columnC,columnD).  
> 
> Doc1: ColumnA, columnB, columnC,columnD
> Doc2:ColumnA,columnB,columnC,columnD
> 
> Will comparison be like Doc1:columnA against Doc2:columnA ; Doc1:columnB against Doc2:columnB, Doc1:columnC against Doc2:columnC and likewise?
> 
> 
> 
> 
> ----- Original Message -----
> From: Sebastian Schelter <ss...@apache.org>
> To: user@mahout.apache.org
> Cc: 
> Sent: Wednesday, February 1, 2012 6:06 AM
> Subject: Re: Question on RowSimilarityJob
> 
> Here's a brief description of RowSimilarityJob:
> 
> The goal is to compute all pairwise similarities between the rows of a
> sparse matrix A.
> 
> The computation should be executed in a way that only rows that have at
> least one non-zero value in the same dimension (column) are compared. We
> need this to avoid a quadratic number of pairwise comparisons.
> Furthermore we should be able to 'embed' arbitrary similarity measures
> and we should always be able to use a combiner in all MapReduce steps.
> 
> The computation is executed using three MapReduce passes:
> 
> In the first step, the rows of A are preprocessed via
> VectorSimilarityMeasure.normalize() (they could e.g. be binarized or
> scaled to unit-length), a single number for each row of A is computed
> via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed.
> 
> The second steps operates on the rows of A' (the columns of A). The
> mapper sums up all pairwise cooccurrences using
> VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so
> called 'stripes' pattern). The reducers sums up all cooccurrence vectors
> for one row and uses the similarity measure and the precomputed numbers
> from step one to compute all similarities via
> VectorSimilarityMeasure.similarity().
> 
> The third step ensures that only the top k similar rows per row are kept.
> 
> It's hard to see from the code but actually the job performs the matrix
> multiplication AA' via outer products with a modified (similarity
> measure specific) dot product.
> 
> /s
> 
> 
> 
> On 31.01.2012 22:40, Suneel Marthi wrote:
>> Sebastian,
>>
>> Question on the RowSimilarity job.
>>
>> a) I created sequence files from my document corpus.
>> b) Created vectors from sequence files with ngrams = 3, normalization = 2, min document frequency = 1 and minimum support = 1
>> c) Ran the RowId job on the vectors generated in (2) - this gives me an M * N matrix where M = number of  documents in my collection, N = cardinality of the vector - Correct?
>>
>> d) Ran the Rowsimilarity job on the matrix generated in (3) with Cosine Similarity measure and 'N' as the number of columns - this gives me an M * R matrix where  R < N. 
>>
>>
>>      I am not sure I completely understand as to what's happening in the RowSimilarity Job, I did read the paper at http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf  and have been staring at your whiteboard (http://ssc.io/wp-content/uploads/2011/08/v2.jpg) for a while to understand what's happening, but guess I need some help.
>>
>> I am willing to put some docs on the wiki for RowSimilarityJob once I am done.
>>
>> Thanks for your help.
>>
>>
>>
>> ________________________________
>>   From: Sebastian Schelter <ss...@apache.org>
>> To: user@mahout.apache.org 
>> Sent: Friday, January 20, 2012 12:58 PM
>> Subject: Re: Question on RowSimilarityJob
>>  
>> Hi,
>>
>> 'maxSimilaritiesPerRow' denotes the maximum number of similar rows
>> (documents in your use case) to keep per document.
>> 'excludeSelfSimilarity' means that rows (documents) should not be
>> compared to themselves.
>>
>> Sry for the lack of documentation, RowSimilarityJob was originally only
>> an internal job for the recommendation code. I'll try to add something
>> on the wiki in the next days.
>>
>> --sebastian
>>
>>
>> On 20.01.2012 17:38, Suneel Marthi wrote:
>>> I am working on determining document similarity of a corpus I am working with using RowSimilarity.
>>>
>>> Questions:-
>>>
>>> a) What do the parameters - 'maxSimilaritiesPerRow' and 'excludeSelfSimilarity' mean?
>>> b) Are there any docs available on RowSimilarityJob available, this is the best I could find on Sebastian's blog - http://ssc.io/rowsimilarityjob-on-steroids/ .
>>>
>>> c) Also do we have any docs on RowIdJob ?
>>>
>>> Thanks and Regards,
>>> Suneel
>>>

Re: Question on RowSimilarityJob

Posted by Sebastian Schelter <ss...@apache.org>.

Well these details are only a very, very simplified view of the job's
actual implementation, so it might confuse the readers.

And actually RowSimilarityJob is executing a kind of matrix
multiplication. If A is 0-1 matrix and we create AA' by summing outer
products, we end up with a computation that is very similar to the example.

--sebastian

On 03.02.2012 08:44, Lance Norskog wrote:
> Please add these details to the wiki page. I had no idea what this
> thing does: I though it was doing matrix multiplies.
> 
> On Thu, Feb 2, 2012 at 5:08 AM, Sebastian Schelter <ss...@apache.org> wrote:
>> The heart of RowSimilarityJob is simple cooccurrence counting (with some
>> tweaks and optimizations). I'll give a simplified example:
>>
>> We have a 3x3 matrix (could represent items rated by users, docs with
>> word counts e.g.)
>>
>> row1: column1, column2
>> row2: column1, column3
>> row3: column1, column2
>>
>> In the first MapReduce pass, we created an inverted index over the
>> columns (which means we transpose the matrix)
>>
>> column1: row1, row2, row3
>> column2: row1, row3
>> column3: row2
>>
>> In the second MapReduce pass, we the mapper emits all coocurring row
>> pairs per column:
>>
>> for column1:
>>
>> (row1,row2)
>> (row1,row3)
>> (row2,row1)
>> (row2,row3)
>> (row3,row1)
>> (row3,row2)
>>
>> for column2:
>>
>> (row1,row3)
>> (row3,row1)
>>
>> for column3 there's nothing to emit.
>>
>> The reducer receives all cooccurrences for a particular row and simply
>> needs to sum them up:
>>
>> for row1:
>>
>> (row1,row2)
>> (row1,row3)
>> (row1,row3)
>>
>> row1,row2 -> 1 cooccurrence
>> row1,row3 -> 2 cooccurrences
>>
>> for row2:
>>
>> (row2,row1)
>> (row2,row3)
>>
>> row2,row1 -> 1 cooccurrence
>> row2,row3 -> 1 cooccurrence
>>
>> for row3:
>>
>> (row3,row1)
>> (row3,row2)
>> (row3,row1)
>>
>> row3,row1 -> 2 cooccurrences
>> row3,row2 -> 1 cooccurrence
>>
>>
>> Hope that helps with understanding the approach.
>>
>> --sebastian
>>
>>
>>
>>
>> On 02.02.2012 12:47, Vicky wrote:
>>> Sebastian,
>>> when you said,
>>>>> The computation should be executed in a way that only rows that have at least one non-zero value in the same dimension (column) are compared
>>>
>>>
>>> if there are three columns in the document (columnA,columnB,columnC,columnD).
>>>
>>> Doc1: ColumnA, columnB, columnC,columnD
>>> Doc2:ColumnA,columnB,columnC,columnD
>>>
>>> Will comparison be like Doc1:columnA against Doc2:columnA ; Doc1:columnB against Doc2:columnB, Doc1:columnC against Doc2:columnC and likewise?
>>>
>>>
>>>
>>>
>>> ----- Original Message -----
>>> From: Sebastian Schelter <ss...@apache.org>
>>> To: user@mahout.apache.org
>>> Cc:
>>> Sent: Wednesday, February 1, 2012 6:06 AM
>>> Subject: Re: Question on RowSimilarityJob
>>>
>>> Here's a brief description of RowSimilarityJob:
>>>
>>> The goal is to compute all pairwise similarities between the rows of a
>>> sparse matrix A.
>>>
>>> The computation should be executed in a way that only rows that have at
>>> least one non-zero value in the same dimension (column) are compared. We
>>> need this to avoid a quadratic number of pairwise comparisons.
>>> Furthermore we should be able to 'embed' arbitrary similarity measures
>>> and we should always be able to use a combiner in all MapReduce steps.
>>>
>>> The computation is executed using three MapReduce passes:
>>>
>>> In the first step, the rows of A are preprocessed via
>>> VectorSimilarityMeasure.normalize() (they could e.g. be binarized or
>>> scaled to unit-length), a single number for each row of A is computed
>>> via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed.
>>>
>>> The second steps operates on the rows of A' (the columns of A). The
>>> mapper sums up all pairwise cooccurrences using
>>> VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so
>>> called 'stripes' pattern). The reducers sums up all cooccurrence vectors
>>> for one row and uses the similarity measure and the precomputed numbers
>>> from step one to compute all similarities via
>>> VectorSimilarityMeasure.similarity().
>>>
>>> The third step ensures that only the top k similar rows per row are kept.
>>>
>>> It's hard to see from the code but actually the job performs the matrix
>>> multiplication AA' via outer products with a modified (similarity
>>> measure specific) dot product.
>>>
>>> /s
>>>
>>>
>>>
>>> On 31.01.2012 22:40, Suneel Marthi wrote:
>>>> Sebastian,
>>>>
>>>> Question on the RowSimilarity job.
>>>>
>>>> a) I created sequence files from my document corpus.
>>>> b) Created vectors from sequence files with ngrams = 3, normalization = 2, min document frequency = 1 and minimum support = 1
>>>> c) Ran the RowId job on the vectors generated in (2) - this gives me an M * N matrix where M = number of  documents in my collection, N = cardinality of the vector - Correct?
>>>>
>>>> d) Ran the Rowsimilarity job on the matrix generated in (3) with Cosine Similarity measure and 'N' as the number of columns - this gives me an M * R matrix where  R < N.
>>>>
>>>>
>>>>      I am not sure I completely understand as to what's happening in the RowSimilarity Job, I did read the paper at http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf  and have been staring at your whiteboard (http://ssc.io/wp-content/uploads/2011/08/v2.jpg) for a while to understand what's happening, but guess I need some help.
>>>>
>>>> I am willing to put some docs on the wiki for RowSimilarityJob once I am done.
>>>>
>>>> Thanks for your help.
>>>>
>>>>
>>>>
>>>> ________________________________
>>>>   From: Sebastian Schelter <ss...@apache.org>
>>>> To: user@mahout.apache.org
>>>> Sent: Friday, January 20, 2012 12:58 PM
>>>> Subject: Re: Question on RowSimilarityJob
>>>>
>>>> Hi,
>>>>
>>>> 'maxSimilaritiesPerRow' denotes the maximum number of similar rows
>>>> (documents in your use case) to keep per document.
>>>> 'excludeSelfSimilarity' means that rows (documents) should not be
>>>> compared to themselves.
>>>>
>>>> Sry for the lack of documentation, RowSimilarityJob was originally only
>>>> an internal job for the recommendation code. I'll try to add something
>>>> on the wiki in the next days.
>>>>
>>>> --sebastian
>>>>
>>>>
>>>> On 20.01.2012 17:38, Suneel Marthi wrote:
>>>>> I am working on determining document similarity of a corpus I am working with using RowSimilarity.
>>>>>
>>>>> Questions:-
>>>>>
>>>>> a) What do the parameters - 'maxSimilaritiesPerRow' and 'excludeSelfSimilarity' mean?
>>>>> b) Are there any docs available on RowSimilarityJob available, this is the best I could find on Sebastian's blog - http://ssc.io/rowsimilarityjob-on-steroids/ .
>>>>>
>>>>> c) Also do we have any docs on RowIdJob ?
>>>>>
>>>>> Thanks and Regards,
>>>>> Suneel
>>>>>
>>
> 
> 
>

Re: Question on RowSimilarityJob

Posted by Lance Norskog <go...@gmail.com>.

Please add these details to the wiki page. I had no idea what this
thing does: I though it was doing matrix multiplies.

On Thu, Feb 2, 2012 at 5:08 AM, Sebastian Schelter <ss...@apache.org> wrote:
> The heart of RowSimilarityJob is simple cooccurrence counting (with some
> tweaks and optimizations). I'll give a simplified example:
>
> We have a 3x3 matrix (could represent items rated by users, docs with
> word counts e.g.)
>
> row1: column1, column2
> row2: column1, column3
> row3: column1, column2
>
> In the first MapReduce pass, we created an inverted index over the
> columns (which means we transpose the matrix)
>
> column1: row1, row2, row3
> column2: row1, row3
> column3: row2
>
> In the second MapReduce pass, we the mapper emits all coocurring row
> pairs per column:
>
> for column1:
>
> (row1,row2)
> (row1,row3)
> (row2,row1)
> (row2,row3)
> (row3,row1)
> (row3,row2)
>
> for column2:
>
> (row1,row3)
> (row3,row1)
>
> for column3 there's nothing to emit.
>
> The reducer receives all cooccurrences for a particular row and simply
> needs to sum them up:
>
> for row1:
>
> (row1,row2)
> (row1,row3)
> (row1,row3)
>
> row1,row2 -> 1 cooccurrence
> row1,row3 -> 2 cooccurrences
>
> for row2:
>
> (row2,row1)
> (row2,row3)
>
> row2,row1 -> 1 cooccurrence
> row2,row3 -> 1 cooccurrence
>
> for row3:
>
> (row3,row1)
> (row3,row2)
> (row3,row1)
>
> row3,row1 -> 2 cooccurrences
> row3,row2 -> 1 cooccurrence
>
>
> Hope that helps with understanding the approach.
>
> --sebastian
>
>
>
>
> On 02.02.2012 12:47, Vicky wrote:
>> Sebastian,
>> when you said,
>>>> The computation should be executed in a way that only rows that have at least one non-zero value in the same dimension (column) are compared
>>
>>
>> if there are three columns in the document (columnA,columnB,columnC,columnD).
>>
>> Doc1: ColumnA, columnB, columnC,columnD
>> Doc2:ColumnA,columnB,columnC,columnD
>>
>> Will comparison be like Doc1:columnA against Doc2:columnA ; Doc1:columnB against Doc2:columnB, Doc1:columnC against Doc2:columnC and likewise?
>>
>>
>>
>>
>> ----- Original Message -----
>> From: Sebastian Schelter <ss...@apache.org>
>> To: user@mahout.apache.org
>> Cc:
>> Sent: Wednesday, February 1, 2012 6:06 AM
>> Subject: Re: Question on RowSimilarityJob
>>
>> Here's a brief description of RowSimilarityJob:
>>
>> The goal is to compute all pairwise similarities between the rows of a
>> sparse matrix A.
>>
>> The computation should be executed in a way that only rows that have at
>> least one non-zero value in the same dimension (column) are compared. We
>> need this to avoid a quadratic number of pairwise comparisons.
>> Furthermore we should be able to 'embed' arbitrary similarity measures
>> and we should always be able to use a combiner in all MapReduce steps.
>>
>> The computation is executed using three MapReduce passes:
>>
>> In the first step, the rows of A are preprocessed via
>> VectorSimilarityMeasure.normalize() (they could e.g. be binarized or
>> scaled to unit-length), a single number for each row of A is computed
>> via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed.
>>
>> The second steps operates on the rows of A' (the columns of A). The
>> mapper sums up all pairwise cooccurrences using
>> VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so
>> called 'stripes' pattern). The reducers sums up all cooccurrence vectors
>> for one row and uses the similarity measure and the precomputed numbers
>> from step one to compute all similarities via
>> VectorSimilarityMeasure.similarity().
>>
>> The third step ensures that only the top k similar rows per row are kept.
>>
>> It's hard to see from the code but actually the job performs the matrix
>> multiplication AA' via outer products with a modified (similarity
>> measure specific) dot product.
>>
>> /s
>>
>>
>>
>> On 31.01.2012 22:40, Suneel Marthi wrote:
>>> Sebastian,
>>>
>>> Question on the RowSimilarity job.
>>>
>>> a) I created sequence files from my document corpus.
>>> b) Created vectors from sequence files with ngrams = 3, normalization = 2, min document frequency = 1 and minimum support = 1
>>> c) Ran the RowId job on the vectors generated in (2) - this gives me an M * N matrix where M = number of  documents in my collection, N = cardinality of the vector - Correct?
>>>
>>> d) Ran the Rowsimilarity job on the matrix generated in (3) with Cosine Similarity measure and 'N' as the number of columns - this gives me an M * R matrix where  R < N.
>>>
>>>
>>>      I am not sure I completely understand as to what's happening in the RowSimilarity Job, I did read the paper at http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf  and have been staring at your whiteboard (http://ssc.io/wp-content/uploads/2011/08/v2.jpg) for a while to understand what's happening, but guess I need some help.
>>>
>>> I am willing to put some docs on the wiki for RowSimilarityJob once I am done.
>>>
>>> Thanks for your help.
>>>
>>>
>>>
>>> ________________________________
>>>   From: Sebastian Schelter <ss...@apache.org>
>>> To: user@mahout.apache.org
>>> Sent: Friday, January 20, 2012 12:58 PM
>>> Subject: Re: Question on RowSimilarityJob
>>>
>>> Hi,
>>>
>>> 'maxSimilaritiesPerRow' denotes the maximum number of similar rows
>>> (documents in your use case) to keep per document.
>>> 'excludeSelfSimilarity' means that rows (documents) should not be
>>> compared to themselves.
>>>
>>> Sry for the lack of documentation, RowSimilarityJob was originally only
>>> an internal job for the recommendation code. I'll try to add something
>>> on the wiki in the next days.
>>>
>>> --sebastian
>>>
>>>
>>> On 20.01.2012 17:38, Suneel Marthi wrote:
>>>> I am working on determining document similarity of a corpus I am working with using RowSimilarity.
>>>>
>>>> Questions:-
>>>>
>>>> a) What do the parameters - 'maxSimilaritiesPerRow' and 'excludeSelfSimilarity' mean?
>>>> b) Are there any docs available on RowSimilarityJob available, this is the best I could find on Sebastian's blog - http://ssc.io/rowsimilarityjob-on-steroids/ .
>>>>
>>>> c) Also do we have any docs on RowIdJob ?
>>>>
>>>> Thanks and Regards,
>>>> Suneel
>>>>
>



-- 
Lance Norskog
goksron@gmail.com

Re: Question on RowSimilarityJob

Posted by Sebastian Schelter <ss...@apache.org>.

The heart of RowSimilarityJob is simple cooccurrence counting (with some
tweaks and optimizations). I'll give a simplified example:

We have a 3x3 matrix (could represent items rated by users, docs with
word counts e.g.)

row1: column1, column2
row2: column1, column3
row3: column1, column2

In the first MapReduce pass, we created an inverted index over the
columns (which means we transpose the matrix)

column1: row1, row2, row3
column2: row1, row3
column3: row2

In the second MapReduce pass, we the mapper emits all coocurring row
pairs per column:

for column1:

(row1,row2)
(row1,row3)
(row2,row1)
(row2,row3)
(row3,row1)
(row3,row2)

for column2:

(row1,row3)
(row3,row1)

for column3 there's nothing to emit.

The reducer receives all cooccurrences for a particular row and simply
needs to sum them up:

for row1:

(row1,row2)
(row1,row3)
(row1,row3)

row1,row2 -> 1 cooccurrence
row1,row3 -> 2 cooccurrences

for row2:

(row2,row1)
(row2,row3)

row2,row1 -> 1 cooccurrence
row2,row3 -> 1 cooccurrence

for row3:

(row3,row1)
(row3,row2)
(row3,row1)

row3,row1 -> 2 cooccurrences
row3,row2 -> 1 cooccurrence


Hope that helps with understanding the approach.

--sebastian




On 02.02.2012 12:47, Vicky wrote:
> Sebastian,
> when you said,
>>> The computation should be executed in a way that only rows that have at least one non-zero value in the same dimension (column) are compared
> 
> 
> if there are three columns in the document (columnA,columnB,columnC,columnD).  
> 
> Doc1: ColumnA, columnB, columnC,columnD
> Doc2:ColumnA,columnB,columnC,columnD
> 
> Will comparison be like Doc1:columnA against Doc2:columnA ; Doc1:columnB against Doc2:columnB, Doc1:columnC against Doc2:columnC and likewise?
> 
> 
> 
> 
> ----- Original Message -----
> From: Sebastian Schelter <ss...@apache.org>
> To: user@mahout.apache.org
> Cc: 
> Sent: Wednesday, February 1, 2012 6:06 AM
> Subject: Re: Question on RowSimilarityJob
> 
> Here's a brief description of RowSimilarityJob:
> 
> The goal is to compute all pairwise similarities between the rows of a
> sparse matrix A.
> 
> The computation should be executed in a way that only rows that have at
> least one non-zero value in the same dimension (column) are compared. We
> need this to avoid a quadratic number of pairwise comparisons.
> Furthermore we should be able to 'embed' arbitrary similarity measures
> and we should always be able to use a combiner in all MapReduce steps.
> 
> The computation is executed using three MapReduce passes:
> 
> In the first step, the rows of A are preprocessed via
> VectorSimilarityMeasure.normalize() (they could e.g. be binarized or
> scaled to unit-length), a single number for each row of A is computed
> via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed.
> 
> The second steps operates on the rows of A' (the columns of A). The
> mapper sums up all pairwise cooccurrences using
> VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so
> called 'stripes' pattern). The reducers sums up all cooccurrence vectors
> for one row and uses the similarity measure and the precomputed numbers
> from step one to compute all similarities via
> VectorSimilarityMeasure.similarity().
> 
> The third step ensures that only the top k similar rows per row are kept.
> 
> It's hard to see from the code but actually the job performs the matrix
> multiplication AA' via outer products with a modified (similarity
> measure specific) dot product.
> 
> /s
> 
> 
> 
> On 31.01.2012 22:40, Suneel Marthi wrote:
>> Sebastian,
>>
>> Question on the RowSimilarity job.
>>
>> a) I created sequence files from my document corpus.
>> b) Created vectors from sequence files with ngrams = 3, normalization = 2, min document frequency = 1 and minimum support = 1
>> c) Ran the RowId job on the vectors generated in (2) - this gives me an M * N matrix where M = number of  documents in my collection, N = cardinality of the vector - Correct?
>>
>> d) Ran the Rowsimilarity job on the matrix generated in (3) with Cosine Similarity measure and 'N' as the number of columns - this gives me an M * R matrix where  R < N. 
>>
>>
>>      I am not sure I completely understand as to what's happening in the RowSimilarity Job, I did read the paper at http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf  and have been staring at your whiteboard (http://ssc.io/wp-content/uploads/2011/08/v2.jpg) for a while to understand what's happening, but guess I need some help.
>>
>> I am willing to put some docs on the wiki for RowSimilarityJob once I am done.
>>
>> Thanks for your help.
>>
>>
>>
>> ________________________________
>>   From: Sebastian Schelter <ss...@apache.org>
>> To: user@mahout.apache.org 
>> Sent: Friday, January 20, 2012 12:58 PM
>> Subject: Re: Question on RowSimilarityJob
>>   
>> Hi,
>>
>> 'maxSimilaritiesPerRow' denotes the maximum number of similar rows
>> (documents in your use case) to keep per document.
>> 'excludeSelfSimilarity' means that rows (documents) should not be
>> compared to themselves.
>>
>> Sry for the lack of documentation, RowSimilarityJob was originally only
>> an internal job for the recommendation code. I'll try to add something
>> on the wiki in the next days.
>>
>> --sebastian
>>
>>
>> On 20.01.2012 17:38, Suneel Marthi wrote:
>>> I am working on determining document similarity of a corpus I am working with using RowSimilarity.
>>>
>>> Questions:-
>>>
>>> a) What do the parameters - 'maxSimilaritiesPerRow' and 'excludeSelfSimilarity' mean?
>>> b) Are there any docs available on RowSimilarityJob available, this is the best I could find on Sebastian's blog - http://ssc.io/rowsimilarityjob-on-steroids/ .
>>>
>>> c) Also do we have any docs on RowIdJob ?
>>>
>>> Thanks and Regards,
>>> Suneel
>>>

Re: Question on RowSimilarityJob

Posted by Vicky <ja...@yahoo.com>.

Sebastian,
when you said,
>>The computation should be executed in a way that only rows that have at least one non-zero value in the same dimension (column) are compared

if there are three columns in the document (columnA,columnB,columnC,columnD).  

Doc1: ColumnA, columnB, columnC,columnD
Doc2:ColumnA,columnB,columnC,columnD

Will comparison be like Doc1:columnA against Doc2:columnA ; Doc1:columnB against Doc2:columnB, Doc1:columnC against Doc2:columnC and likewise?

----- Original Message -----
From: Sebastian Schelter <ss...@apache.org>
To: user@mahout.apache.org
Cc: 
Sent: Wednesday, February 1, 2012 6:06 AM
Subject: Re: Question on RowSimilarityJob

Here's a brief description of RowSimilarityJob:

The goal is to compute all pairwise similarities between the rows of a
sparse matrix A.

The computation should be executed in a way that only rows that have at
least one non-zero value in the same dimension (column) are compared. We
need this to avoid a quadratic number of pairwise comparisons.
Furthermore we should be able to 'embed' arbitrary similarity measures
and we should always be able to use a combiner in all MapReduce steps.

The computation is executed using three MapReduce passes:

In the first step, the rows of A are preprocessed via
VectorSimilarityMeasure.normalize() (they could e.g. be binarized or
scaled to unit-length), a single number for each row of A is computed
via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed.

The second steps operates on the rows of A' (the columns of A). The
mapper sums up all pairwise cooccurrences using
VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so
called 'stripes' pattern). The reducers sums up all cooccurrence vectors
for one row and uses the similarity measure and the precomputed numbers
from step one to compute all similarities via
VectorSimilarityMeasure.similarity().

The third step ensures that only the top k similar rows per row are kept.

It's hard to see from the code but actually the job performs the matrix
multiplication AA' via outer products with a modified (similarity
measure specific) dot product.

/s

On 31.01.2012 22:40, Suneel Marthi wrote:
> Sebastian,
> 
> Question on the RowSimilarity job.
> 
> a) I created sequence files from my document corpus.
> b) Created vectors from sequence files with ngrams = 3, normalization = 2, min document frequency = 1 and minimum support = 1
> c) Ran the RowId job on the vectors generated in (2) - this gives me an M * N matrix where M = number of  documents in my collection, N = cardinality of the vector - Correct?
> 
> d) Ran the Rowsimilarity job on the matrix generated in (3) with Cosine Similarity measure and 'N' as the number of columns - this gives me an M * R matrix where  R < N. 
> 
> 
>     I am not sure I completely understand as to what's happening in the RowSimilarity Job, I did read the paper at http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf  and have been staring at your whiteboard (http://ssc.io/wp-content/uploads/2011/08/v2.jpg) for a while to understand what's happening, but guess I need some help.
> 
> I am willing to put some docs on the wiki for RowSimilarityJob once I am done.
> 
> Thanks for your help.
> 
> 
> 
> ________________________________
>  From: Sebastian Schelter <ss...@apache.org>
> To: user@mahout.apache.org 
> Sent: Friday, January 20, 2012 12:58 PM
> Subject: Re: Question on RowSimilarityJob
>  
> Hi,
> 
> 'maxSimilaritiesPerRow' denotes the maximum number of similar rows
> (documents in your use case) to keep per document.
> 'excludeSelfSimilarity' means that rows (documents) should not be
> compared to themselves.
> 
> Sry for the lack of documentation, RowSimilarityJob was originally only
> an internal job for the recommendation code. I'll try to add something
> on the wiki in the next days.
> 
> --sebastian
> 
> 
> On 20.01.2012 17:38, Suneel Marthi wrote:
>> I am working on determining document similarity of a corpus I am working with using RowSimilarity.
>>
>> Questions:-
>>
>> a) What do the parameters - 'maxSimilaritiesPerRow' and 'excludeSelfSimilarity' mean?
>> b) Are there any docs available on RowSimilarityJob available, this is the best I could find on Sebastian's blog - http://ssc.io/rowsimilarityjob-on-steroids/ .
>>
>> c) Also do we have any docs on RowIdJob ?
>>
>> Thanks and Regards,
>> Suneel
>>

Re: Question on RowSimilarityJob

Posted by Sebastian Schelter <ss...@apache.org>.

Great! Thank you!

On 02.02.2012 13:59, Dan Brickley wrote:
> On 1 February 2012 12:06, Sebastian Schelter <ss...@apache.org> wrote:
>> Here's a brief description of RowSimilarityJob:
>>
>> The goal is to compute all pairwise similarities between the rows of a
>> sparse matrix A.
>>
>> The computation should be executed in a way that only rows that have at
>> least one non-zero value in the same dimension (column) are compared. We
>> need this to avoid a quadratic number of pairwise comparisons.
>> Furthermore we should be able to 'embed' arbitrary similarity measures
>> and we should always be able to use a combiner in all MapReduce steps.
>>
>> The computation is executed using three MapReduce passes:
>>
>> In the first step, the rows of A are preprocessed via
>> VectorSimilarityMeasure.normalize() (they could e.g. be binarized or
>> scaled to unit-length), a single number for each row of A is computed
>> via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed.
>>
>> The second steps operates on the rows of A' (the columns of A). The
>> mapper sums up all pairwise cooccurrences using
>> VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so
>> called 'stripes' pattern). The reducers sums up all cooccurrence vectors
>> for one row and uses the similarity measure and the precomputed numbers
>> from step one to compute all similarities via
>> VectorSimilarityMeasure.similarity().
>>
>> The third step ensures that only the top k similar rows per row are kept.
>>
>> It's hard to see from the code but actually the job performs the matrix
>> multiplication AA' via outer products with a modified (similarity
>> measure specific) dot product.
> 
> Thanks for this. I've taken liberty of dropping it in the Wiki at
> https://cwiki.apache.org/confluence/display/MAHOUT/RowSimilarityJob
> (and linking this thread). Hope that's ok...
> 
> cheers,
> 
> Dan

Re: Question on RowSimilarityJob

Posted by Dan Brickley <da...@danbri.org>.

On 1 February 2012 12:06, Sebastian Schelter <ss...@apache.org> wrote:
> Here's a brief description of RowSimilarityJob:
>
> The goal is to compute all pairwise similarities between the rows of a
> sparse matrix A.
>
> The computation should be executed in a way that only rows that have at
> least one non-zero value in the same dimension (column) are compared. We
> need this to avoid a quadratic number of pairwise comparisons.
> Furthermore we should be able to 'embed' arbitrary similarity measures
> and we should always be able to use a combiner in all MapReduce steps.
>
> The computation is executed using three MapReduce passes:
>
> In the first step, the rows of A are preprocessed via
> VectorSimilarityMeasure.normalize() (they could e.g. be binarized or
> scaled to unit-length), a single number for each row of A is computed
> via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed.
>
> The second steps operates on the rows of A' (the columns of A). The
> mapper sums up all pairwise cooccurrences using
> VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so
> called 'stripes' pattern). The reducers sums up all cooccurrence vectors
> for one row and uses the similarity measure and the precomputed numbers
> from step one to compute all similarities via
> VectorSimilarityMeasure.similarity().
>
> The third step ensures that only the top k similar rows per row are kept.
>
> It's hard to see from the code but actually the job performs the matrix
> multiplication AA' via outer products with a modified (similarity
> measure specific) dot product.

Thanks for this. I've taken liberty of dropping it in the Wiki at
https://cwiki.apache.org/confluence/display/MAHOUT/RowSimilarityJob
(and linking this thread). Hope that's ok...

cheers,

Dan

Re: Question on RowSimilarityJob

Posted by Sebastian Schelter <ss...@apache.org>.

Here's a brief description of RowSimilarityJob:

The goal is to compute all pairwise similarities between the rows of a
sparse matrix A.

The computation should be executed in a way that only rows that have at
least one non-zero value in the same dimension (column) are compared. We
need this to avoid a quadratic number of pairwise comparisons.
Furthermore we should be able to 'embed' arbitrary similarity measures
and we should always be able to use a combiner in all MapReduce steps.

The computation is executed using three MapReduce passes:

In the first step, the rows of A are preprocessed via
VectorSimilarityMeasure.normalize() (they could e.g. be binarized or
scaled to unit-length), a single number for each row of A is computed
via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed.

The second steps operates on the rows of A' (the columns of A). The
mapper sums up all pairwise cooccurrences using
VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so
called 'stripes' pattern). The reducers sums up all cooccurrence vectors
for one row and uses the similarity measure and the precomputed numbers
from step one to compute all similarities via
VectorSimilarityMeasure.similarity().

The third step ensures that only the top k similar rows per row are kept.

It's hard to see from the code but actually the job performs the matrix
multiplication AA' via outer products with a modified (similarity
measure specific) dot product.

/s

On 31.01.2012 22:40, Suneel Marthi wrote:
> Sebastian,
> 
> Question on the RowSimilarity job.
> 
> a) I created sequence files from my document corpus.
> b) Created vectors from sequence files with ngrams = 3, normalization = 2, min document frequency = 1 and minimum support = 1
> c) Ran the RowId job on the vectors generated in (2) - this gives me an M * N matrix where M = number of  documents in my collection, N = cardinality of the vector - Correct?
> 
> d) Ran the Rowsimilarity job on the matrix generated in (3) with Cosine Similarity measure and 'N' as the number of columns - this gives me an M * R matrix where  R < N. 
> 
> 
>     I am not sure I completely understand as to what's happening in the RowSimilarity Job, I did read the paper at http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf  and have been staring at your whiteboard (http://ssc.io/wp-content/uploads/2011/08/v2.jpg) for a while to understand what's happening, but guess I need some help.
> 
> I am willing to put some docs on the wiki for RowSimilarityJob once I am done.
> 
> Thanks for your help.
> 
> 
> 
> ________________________________
>  From: Sebastian Schelter <ss...@apache.org>
> To: user@mahout.apache.org 
> Sent: Friday, January 20, 2012 12:58 PM
> Subject: Re: Question on RowSimilarityJob
>  
> Hi,
> 
> 'maxSimilaritiesPerRow' denotes the maximum number of similar rows
> (documents in your use case) to keep per document.
> 'excludeSelfSimilarity' means that rows (documents) should not be
> compared to themselves.
> 
> Sry for the lack of documentation, RowSimilarityJob was originally only
> an internal job for the recommendation code. I'll try to add something
> on the wiki in the next days.
> 
> --sebastian
> 
> 
> On 20.01.2012 17:38, Suneel Marthi wrote:
>> I am working on determining document similarity of a corpus I am working with using RowSimilarity.
>>
>> Questions:-
>>
>> a) What do the parameters - 'maxSimilaritiesPerRow' and 'excludeSelfSimilarity' mean?
>> b) Are there any docs available on RowSimilarityJob available, this is the best I could find on Sebastian's blog - http://ssc.io/rowsimilarityjob-on-steroids/ .
>>
>> c) Also do we have any docs on RowIdJob ?
>>
>> Thanks and Regards,
>> Suneel
>>

Re: Question on RowSimilarityJob

Posted by Suneel Marthi <su...@yahoo.com>.

Sebastian,

Question on the RowSimilarity job.

a) I created sequence files from my document corpus.
b) Created vectors from sequence files with ngrams = 3, normalization = 2, min document frequency = 1 and minimum support = 1
c) Ran the RowId job on the vectors generated in (2) - this gives me an M * N matrix where M = number of  documents in my collection, N = cardinality of the vector - Correct?

d) Ran the Rowsimilarity job on the matrix generated in (3) with Cosine Similarity measure and 'N' as the number of columns - this gives me an M * R matrix where  R < N. 

    I am not sure I completely understand as to what's happening in the RowSimilarity Job, I did read the paper at http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf  and have been staring at your whiteboard (http://ssc.io/wp-content/uploads/2011/08/v2.jpg) for a while to understand what's happening, but guess I need some help.

I am willing to put some docs on the wiki for RowSimilarityJob once I am done.

Thanks for your help.

________________________________
 From: Sebastian Schelter <ss...@apache.org>
To: user@mahout.apache.org 
Sent: Friday, January 20, 2012 12:58 PM
Subject: Re: Question on RowSimilarityJob

Hi,

'maxSimilaritiesPerRow' denotes the maximum number of similar rows
(documents in your use case) to keep per document.
'excludeSelfSimilarity' means that rows (documents) should not be
compared to themselves.

Sry for the lack of documentation, RowSimilarityJob was originally only
an internal job for the recommendation code. I'll try to add something
on the wiki in the next days.

--sebastian

On 20.01.2012 17:38, Suneel Marthi wrote:
> I am working on determining document similarity of a corpus I am working with using RowSimilarity.
> 
> Questions:-
> 
> a) What do the parameters - 'maxSimilaritiesPerRow' and 'excludeSelfSimilarity' mean?
> b) Are there any docs available on RowSimilarityJob available, this is the best I could find on Sebastian's blog - http://ssc.io/rowsimilarityjob-on-steroids/ .
> 
> c) Also do we have any docs on RowIdJob ?
> 
> Thanks and Regards,
> Suneel
>

Re: Question on RowSimilarityJob

Posted by Lance Norskog <go...@gmail.com>.

In fact I will back off there. It is possible to use Hadoop to create
a data model, which you can then use to do  recommendations for your
interactive search. It is similarly possible to do your clustering job
and use a disk-backed store to fetch a cluster.

I have not teased apart the various ways to compose a Hadoop-based
recommender job. There seem to be more options than I understood.

Lance

On Mon, Jan 23, 2012 at 6:06 AM, Suneel Marthi <su...@yahoo.com> wrote:
> Thanks Lance for your reply.
>
> I cannot go with any in-memory implementations as I am working with a corpus
> of ~600K documents.
>
> As regards clustering, I did try Mahout's various clustering implementations
> - kmeans, canopy and minhash. Minhash would have been the right algorithm
> for
> what I am trying to do but I never had good results from Mahout's minhash
> implementation.
>
> I had never tried any recommenders in Mahout, which one would you suggest; I
> would definitely need a Hadoop based solution as I am dealing with large
> data.
>
> Thanks,
> Suneel
>
> ________________________________
> From: Lance Norskog <go...@gmail.com>
> To: user@mahout.apache.org; Suneel Marthi <su...@yahoo.com>
> Sent: Friday, January 20, 2012 5:48 PM
>
> Subject: Re: Question on RowSimilarityJob
>
> This can be done as a recommender or with clustering. As a
> recommender, you would say, "find similar documents". In clustering,
> you can take one document and find its neighbors in the cluster graph.
>
> How much data do you have? An important point here is that most of
> Mahout's recommender algorithms are in-memory programs. There is only
> one Hadoop-based recommender. There are several Hadoop-based
> clustering algorithms.
>
> On Fri, Jan 20, 2012 at 2:02 PM, Suneel Marthi <su...@yahoo.com>
> wrote:
>> Thanks for the reply Sebastian.
>>
>>
>> For the task I am working on - 'determine similar documents in a
>> collection' - would RowSimilarity be the right approach?
>>
>>
>>
>>
>>
>>
>> ________________________________
>>  From: Sebastian Schelter <ss...@apache.org>
>> To: user@mahout.apache.org
>> Sent: Friday, January 20, 2012 12:58 PM
>> Subject: Re: Question on RowSimilarityJob
>>
>> Hi,
>>
>> 'maxSimilaritiesPerRow' denotes the maximum number of similar rows
>> (documents in your use case) to keep per document.
>> 'excludeSelfSimilarity' means that rows (documents) should not be
>> compared to themselves.
>>
>> Sry for the lack of documentation, RowSimilarityJob was originally only
>> an internal job for the recommendation code. I'll try to add something
>> on the wiki in the next days.
>>
>> --sebastian
>>
>>
>> On 20.01.2012 17:38, Suneel Marthi wrote:
>>> I am working on determining document similarity of a corpus I am working
>>> with using RowSimilarity.
>>>
>>> Questions:-
>>>
>>> a) What do the parameters - 'maxSimilaritiesPerRow' and
>>> 'excludeSelfSimilarity' mean?
>>> b) Are there any docs available on RowSimilarityJob available, this is
>>> the best I could find on Sebastian's blog -
>>> http://ssc.io/rowsimilarityjob-on-steroids/ .
>>>
>>> c) Also do we have any docs on RowIdJob ?
>>>
>>> Thanks and Regards,
>>> Suneel
>>>
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>
>



-- 
Lance Norskog
goksron@gmail.com

Re: Question on RowSimilarityJob

Posted by Lance Norskog <go...@gmail.com>.

This can be done as a recommender or with clustering. As a
recommender, you would say, "find similar documents". In clustering,
you can take one document and find its neighbors in the cluster graph.

How much data do you have? An important point here is that most of
Mahout's recommender algorithms are in-memory programs. There is only
one Hadoop-based recommender. There are several Hadoop-based
clustering algorithms.

On Fri, Jan 20, 2012 at 2:02 PM, Suneel Marthi <su...@yahoo.com> wrote:
> Thanks for the reply Sebastian.
>
>
> For the task I am working on - 'determine similar documents in a collection' - would RowSimilarity be the right approach?
>
>
>
>
>
>
> ________________________________
>  From: Sebastian Schelter <ss...@apache.org>
> To: user@mahout.apache.org
> Sent: Friday, January 20, 2012 12:58 PM
> Subject: Re: Question on RowSimilarityJob
>
> Hi,
>
> 'maxSimilaritiesPerRow' denotes the maximum number of similar rows
> (documents in your use case) to keep per document.
> 'excludeSelfSimilarity' means that rows (documents) should not be
> compared to themselves.
>
> Sry for the lack of documentation, RowSimilarityJob was originally only
> an internal job for the recommendation code. I'll try to add something
> on the wiki in the next days.
>
> --sebastian
>
>
> On 20.01.2012 17:38, Suneel Marthi wrote:
>> I am working on determining document similarity of a corpus I am working with using RowSimilarity.
>>
>> Questions:-
>>
>> a) What do the parameters - 'maxSimilaritiesPerRow' and 'excludeSelfSimilarity' mean?
>> b) Are there any docs available on RowSimilarityJob available, this is the best I could find on Sebastian's blog - http://ssc.io/rowsimilarityjob-on-steroids/ .
>>
>> c) Also do we have any docs on RowIdJob ?
>>
>> Thanks and Regards,
>> Suneel
>>



-- 
Lance Norskog
goksron@gmail.com

Re: Question on RowSimilarityJob

Posted by Suneel Marthi <su...@yahoo.com>.

Thanks for the reply Sebastian.

For the task I am working on - 'determine similar documents in a collection' - would RowSimilarity be the right approach?

________________________________
 From: Sebastian Schelter <ss...@apache.org>
To: user@mahout.apache.org 
Sent: Friday, January 20, 2012 12:58 PM
Subject: Re: Question on RowSimilarityJob

Hi,

'maxSimilaritiesPerRow' denotes the maximum number of similar rows
(documents in your use case) to keep per document.
'excludeSelfSimilarity' means that rows (documents) should not be
compared to themselves.

Sry for the lack of documentation, RowSimilarityJob was originally only
an internal job for the recommendation code. I'll try to add something
on the wiki in the next days.

--sebastian

On 20.01.2012 17:38, Suneel Marthi wrote:
> I am working on determining document similarity of a corpus I am working with using RowSimilarity.
> 
> Questions:-
> 
> a) What do the parameters - 'maxSimilaritiesPerRow' and 'excludeSelfSimilarity' mean?
> b) Are there any docs available on RowSimilarityJob available, this is the best I could find on Sebastian's blog - http://ssc.io/rowsimilarityjob-on-steroids/ .
> 
> c) Also do we have any docs on RowIdJob ?
> 
> Thanks and Regards,
> Suneel
>

Re: Question on RowSimilarityJob

Posted by Sebastian Schelter <ss...@apache.org>.

Hi,

'maxSimilaritiesPerRow' denotes the maximum number of similar rows
(documents in your use case) to keep per document.
'excludeSelfSimilarity' means that rows (documents) should not be
compared to themselves.

Sry for the lack of documentation, RowSimilarityJob was originally only
an internal job for the recommendation code. I'll try to add something
on the wiki in the next days.

--sebastian

On 20.01.2012 17:38, Suneel Marthi wrote:
> I am working on determining document similarity of a corpus I am working with using RowSimilarity.
> 
> Questions:-
> 
> a) What do the parameters - 'maxSimilaritiesPerRow' and 'excludeSelfSimilarity' mean?
> b) Are there any docs available on RowSimilarityJob available, this is the best I could find on Sebastian's blog - http://ssc.io/rowsimilarityjob-on-steroids/ .
> 
> c) Also do we have any docs on RowIdJob ?
> 
> Thanks and Regards,
> Suneel
>