Posted to user@mahout.apache.org by Niall Riddell <ni...@xspca.com> on 2011/07/22 13:23:38 UTC

Pairwise Document Similarity

Hi,

I would like to sense-check an approach to near-duplicate detection of
documents using Mahout.  After some basic research I've implemented a
basic proof of concept which works effectively on a small corpus.

I have taken the following pre-processing steps:

1) Parse the document
2) Remove unnecessary tokens
3) Split by sentence
4) Create w-shingles from sentence tokens
5) Hash shingles
6) Minhash hashes
7) Jaccard Similarity adjusting for number of hash functions used in minhash
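The pre-processing pipeline above can be sketched roughly as follows (a minimal illustration only; the shingle width, the number of hash functions, the tokenizer, and the use of salted MD5 as a stand-in for a family of independent hash functions are all my assumptions, not details from the post):

```python
import hashlib
import re

def shingles(sentence, w=3):
    """Step 4: build w-token shingles from a sentence (w=3 is an assumed choice)."""
    tokens = re.findall(r"\w+", sentence.lower())
    return {" ".join(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def minhash_signature(shingle_set, num_hashes=100):
    """Steps 5-6: one minimum per hash function; a seeded MD5 stands in
    for independent hash functions."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Step 7: the fraction of agreeing signature positions estimates the
    Jaccard similarity -- this is the 'adjusting for the number of hash
    functions used in minhash'."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

Two identical shingle sets produce identical signatures and an estimate of 1.0; the estimate's variance shrinks as the number of hash functions grows.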

In order to scale this I will be doing the following:

1) Use M/R for all steps
2) Avoid adding exact duplicate documents to similarity matrix
3) Construct an (additional) LSH matrix (threshold >=0.2), splitting
into buckets
4) Split the similarity job by blocks of document keys for each mapper
5) Every document in the minhash matrix gets submitted to every mapper
6) Each mapper queries the LSH matrix to look for candidates for
matching against
7) Each mapper matches against candidates in its block and writes out
a key (docid) and a vector of all similar documents ({docid, score})
8) The reducer then combines the results from each mapper into the
final similarity matrix
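The LSH bucketing in step 3 could look something like this single-machine sketch (the band/row split and the resulting threshold are assumed parameters; the post's 0.2 threshold would call for fewer rows per band than shown here):

```python
from collections import defaultdict

def lsh_buckets(signatures, bands=20, rows=5):
    """Split each minhash signature into bands of `rows` values; documents
    sharing any band's values land in the same bucket and become candidate
    pairs.  With b=20, r=5 the effective threshold is roughly
    (1/b)**(1/r) ~= 0.55; lowering r pushes it toward 0.2."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            buckets[(band, chunk)].add(doc_id)
    return buckets

def candidate_pairs(buckets):
    """Steps 6-7: only pairs that co-occur in some bucket are matched,
    instead of all O(n^2) document pairs."""
    pairs = set()
    for docs in buckets.values():
        docs = sorted(docs)
        for i in range(len(docs)):
            for j in range(i + 1, len(docs)):
                pairs.add((docs[i], docs[j]))
    return pairs
```

In the M/R version, the bucket key `(band, chunk)` would serve naturally as the map output key, so each reducer sees one bucket's candidates.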

I've only really used Mahout so far for the minhash work, but I would
like to use an LSH implementation and can't find one.  To avoid
re-inventing the wheel, I'm looking for general pointers on the
efficacy of my approach in the first instance, and then any guidance on
how best to implement it using the rest of Mahout.

I've gone through MIA and felt that RowSimilarityJob was a
possibility; however, I understand that a JIRA has been raised to make
it potentially less general, and in its current form it may not
match my performance/cost criteria (i.e. high/low).

Any help is greatly appreciated.

Thanks in advance.

Niall

Re: Pairwise Document Similarity

Posted by Niall Riddell <ni...@xspca.com>.
Thanks Grant, and apologies for not actually reading the JIRA properly; it
was not intended as a criticism of the excellent Recommender work in Taste.
I see now that the intention is to improve performance for specific use
cases.

My intention is to build a scalable near duplicate document detection
capability using Mahout.  The use case that I'm researching will require a
scalable approach across a large corpus of documents (>50m and growing).  I
see that there is another thread today on this topic.

I will go ahead and implement the solution as outlined in my original post,
delve further into the LSH implementation embedded in the clustering
code, and share my findings with the community.  I'm looking to test this on
a large corpus using EMR in order to get a feeling for the timings.

Cheers Niall

On Jul 23, 2011 7:55 AM, "Grant Ingersoll" <gs...@apache.org> wrote:
>
>
> On Jul 22, 2011, at 7:23 AM, Niall Riddell wrote:
> >
> >
> > I've gone through MIA and felt that RowSimilarityJob was a
> > possibility; however, I understand that a JIRA has been raised to make
> > it potentially less general, and in its current form it may not
> > match my performance/cost criteria (i.e. high/low).
>
> I don't think the goal of the JIRA issue is to make it less general; it's
> to make the cases that can benefit from smarter use of the co-occurrences
> scale better.  I see no reason why the existing format can't also be
> maintained for those similarity measures that can't benefit from more
> map-side calculation.
>
> -Grant
>

Re: Pairwise Document Similarity

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 22, 2011, at 7:23 AM, Niall Riddell wrote:
> 
> 
> I've gone through MIA and felt that RowSimilarityJob was a
> possibility; however, I understand that a JIRA has been raised to make
> it potentially less general, and in its current form it may not
> match my performance/cost criteria (i.e. high/low).

I don't think the goal of the JIRA issue is to make it less general; it's to make the cases that can benefit from smarter use of the co-occurrences scale better.  I see no reason why the existing format can't also be maintained for those similarity measures that can't benefit from more map-side calculation.

-Grant