Posted to user@mahout.apache.org by Miles Osborne <mi...@inf.ed.ac.uk> on 2009/07/17 21:49:41 UTC

Re: Finding the similarity of documents using Mahout for deduplication

a direct MR approach is as follows:

http://web.jhu.edu/HLTCOE/Publications/acl08_elsayed_pairwise_sim.pdf

this is not particularly efficient;  a better option would be to use a
randomised approach, eg:

Broder et al:

http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/CPM%202000.pdf

it would be very nice if someone could implement a randomised approach using
Hadoop ...

(this should be fairly easy to do, since you have to convert each document
into a set of shingles --could be done in one mapper-- and then sort these
documents plus some extra twists)
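
for concreteness, the shingling step could be as simple as hashing word
n-grams; the shingle size and the hash below are arbitrary choices, not
anything Mahout ships:

import java.util.HashSet;
import java.util.Set;

public class Shingler {
    // turn a document into the set of hashed word n-grams ("shingles")
    public static Set<Long> shingles(String text, int n) {
        String[] tokens = text.toLowerCase().split("\\s+");
        Set<Long> result = new HashSet<Long>();
        for (int i = 0; i + n <= tokens.length; i++) {
            long h = 17L;
            for (int j = i; j < i + n; j++) {
                h = h * 31L + tokens[j].hashCode();
            }
            result.add(h);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(shingles("the quick brown fox jumps over the lazy dog", 3));
    }
}

a mapper would just call shingles() on each input document and emit the
resulting set keyed by document id.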

Miles

2009/7/17 Jason Rutherglen <ja...@gmail.com>

> I think this comes up fairly often in search apps, duplicate
> documents are indexed (for example using SimplyHired's search
> there are 20 of the same job listed from different websites). A
> similarity score above a threshold would determine the documents
> are too similar, are duplicates, and therefore can be removed.
> Is there a recommended Mahout algorithm for this?
>



-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.

Re: Finding the similarity of documents using Mahout for deduplication

Posted by Ted Dunning <te...@gmail.com>.
I think that this would be nearly equivalent to the Lucene solution that I
mentioned ... good for real-time single document queries.

I would be very surprised if this were able to out-do the MR version for the
all-pairs problem.

On Sat, Jul 18, 2009 at 1:30 AM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:

> you could probably eliminate phase 2 if the output of phase 1 was stored in
> a Perfect Hashing table (say using Hypertable).  this works by storing a
> fingerprint for each shingle/count pair (a few bits) and organising the hash
> table such that you never get collisions (hence the Perfect Hashing).
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Finding the similarity of documents using Mahout for deduplication

Posted by Miles Osborne <mi...@inf.ed.ac.uk>.
basically a decoder does a beam search through a vast set of options, guided
by various probabilities

(those tables I mentioned are mainly language models)

it would be very interesting indeed to really open up the decoder in this
manner:  get MR to do a parallel search over the search graph.  my initial
guess is that this would be too slow, since the actual decoding can be done
in-memory on a typical Dell box (eg in 32G).  this is per-sentence; per
document is a different matter.

what i had in mind was something less radical:  take a document, split it
into shards; translate each shard (as a mapper) and have the reducer
recombine the translated sentences.  Hypertable would be used to serve the
language models and phrase tables (string-string mappings, specifying the
set of possible ways some source fragment gets translated into target
fragments).
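
a sketch of that shard-and-recombine job (new-style Hadoop API; the
tab-separated input format and the translate() stub are placeholders, not the
real decoder):

import java.io.IOException;
import java.util.TreeMap;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ShardTranslateJob {

    // input records: "docId <TAB> shardIndex <TAB> sourceText", one shard per line
    public static class TranslateMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 3);
            String docId = parts[0], shardIndex = parts[1], source = parts[2];
            String target = translate(source);   // stand-in for the real decoder
            // key by document so one reducer sees every shard of that document
            ctx.write(new Text(docId), new Text(shardIndex + "\t" + target));
        }

        private String translate(String source) {
            return source;                       // placeholder only
        }
    }

    // put the translated shards of each document back in their original order
    public static class RecombineReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text docId, Iterable<Text> shards, Context ctx)
                throws IOException, InterruptedException {
            TreeMap<Integer, String> ordered = new TreeMap<Integer, String>();
            for (Text shard : shards) {
                String[] parts = shard.toString().split("\t", 2);
                ordered.put(Integer.valueOf(parts[0]), parts[1]);
            }
            StringBuilder doc = new StringBuilder();
            for (String sentence : ordered.values()) {
                doc.append(sentence).append(' ');
            }
            ctx.write(docId, new Text(doc.toString().trim()));
        }
    }
}

the mapper would do the Hypertable lookups inside translate(); everything else
is plumbing.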

language models are interesting in that the vast majority of requests won't
be for a valid ngram  (the decoder searches over a space of possible
translations, most of which are garbage).  so, one possible speed-up would be
to keep a small in-memory cache of the table on board (as a Bloom Filter).
more interesting still would be to consider the stream of language model
requests and see whether it itself can be modelled  (think of this as say
Bernoulli trials:  do i get a hit or not).  if there is structure in this
stream, then i could design an algorithm to learn when it is in a sequence
of misses (exploring some low probability space of translations) or when it
has switched to, say, a high probability region.
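
the Bloom Filter part is easy to picture: something like the sketch below sits
in front of the remote table, a negative answer means "definitely not in the
LM" so the network request can be skipped, and a positive answer still has to
be confirmed remotely (sizes and hash mixing here are arbitrary):

import java.util.BitSet;

public class NgramBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    public NgramBloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // populated once from the set of ngrams the language model actually contains
    public void add(String ngram) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(hash(ngram, i));
        }
    }

    // false => the ngram is certainly absent, so skip the remote lookup
    public boolean mightContain(String ngram) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(hash(ngram, i))) {
                return false;
            }
        }
        return true;
    }

    private int hash(String ngram, int i) {
        int h = ngram.hashCode() * 31 + i * 0x9e3779b9;
        return Math.abs(h % size);
    }
}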

Miles

2009/7/18 Ted Dunning <te...@gmail.com>

> On Sat, Jul 18, 2009 at 1:30 AM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
>
> > i wish people would consider more often using Hypertable / HBase etc in
> > algorithms:  there are times when you want random access and, as a
> > community, i think we need to put more effort into working out good ways
> > to use all the options available.
> >
>
> I think that this is very sage advice.  I would actually add Lucene to this
> list.  By doing an implicit matrix multiply and sparsification, it provides
> an extraordinarily useful primitive operation.
>
>
> > currently as a background task I'm thinking how to
> > Hadoop-ify our Machine Translation system;  this involves random access to
> > big tables of string-value pairs, as well as a few other tables.
> > compressed, these can be 20G each and we need to hit these tables 100s of
> > thousands of times per sentence we translate.
> >
>
> For translating a single sentence, this is a very plausible design option.
>
>
> > so, the research questions
> > here then become how to (a) modify the Machine Translation decoding
> > procedure to best batch-up table requests --Google have published on
> > this-- and, more interestingly, (b) try to be more clever about whether a
> > network request actually needs to be made.
> >
>
> And this problem is also very interesting if you consider how these
> operations can be interleaved and reordered if you are translating thousands
> of sentences at a time.  Can you rephrase the problem so that the map takes
> a single sentence and emits all of the queries it would like to do for that
> sentence?  The map would also inject all of the tables that you want to
> query.  Then the reduce can group by table key and "perform" the lookups for
> each key value.  Then a second map-reduce would reorganize the results back
> to the sentence level.
>
> If your tables exceed memory, or if sorting the batched queries is
> asymptotically faster than issuing them as individual lookups, this could be
> much faster.
>
>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>



-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.

Re: Finding the similarity of documents using Mahout for deduplication

Posted by Ted Dunning <te...@gmail.com>.
On Sat, Jul 18, 2009 at 1:30 AM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:

> i wish people would consider more often using Hypertable / HBase etc in
> algorithms:  there are times when you want random access and, as a
> community, i think we need to put more effort into working out good ways to
> use all the options available.
>

I think that this is very sage advice.  I would actually add Lucene to this
list.  By doing an implicit matrix multiply and sparsification, it provides
an extraordinarily useful primitive operation.


> currently as a background task I'm thinking how to
> Hadoop-ify our Machine Translation system;  this involves random access to
> big tables of string-value pairs, as well as a few other tables.
> compressed, these can be 20G each and we need to hit these tables 100s of
> thousands of times per sentence we translate.
>

For translating a single sentence, this is a very plausible design option.


> so, the research questions
> here then become how to (a) modify the Machine Translation decoding
> procedure to best batch-up table requests --Google have published on this--
> and, more interestingly, (b) try to be more clever about whether a network
> request actually needs to be made.
>

And this problem is also very interesting if you consider how these
operations can be interleaved and reordered if you are translating thousands
of sentences at a time.  Can you rephrase the problem so that the map takes
a single sentence and emits all of the queries it would like to do for that
sentence?  The map would also inject all of the tables that you want to
query.  Then the reduce can group by table key and "perform" the lookups for
each key value.  Then a second map-reduce would reorganize the results back
to the sentence level.

If your tables exceed memory, or if sorting the batched queries is
asymptotically faster than issuing them as individual lookups, this could be
much faster.
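
A sketch of what that first phase could look like (the record formats and the
Q:/T: tagging are only illustrative; the second job would group the joined
output back by sentence id):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BatchedLookupJob {

    // input value: "sentenceId <TAB> sourceText"; emits (ngram, "Q:sentenceId")
    public static class QueryMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2);
            String sentenceId = parts[0];
            String[] tokens = parts[1].split("\\s+");
            for (int i = 0; i + 3 <= tokens.length; i++) {      // e.g. trigram queries
                String ngram = tokens[i] + " " + tokens[i + 1] + " " + tokens[i + 2];
                ctx.write(new Text(ngram), new Text("Q:" + sentenceId));
            }
        }
    }

    // input value: "ngram <TAB> probability" (the table itself); emits (ngram, "T:probability")
    public static class TableMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2);
            ctx.write(new Text(parts[0]), new Text("T:" + parts[1]));
        }
    }

    // joins the two streams on the ngram: emits (sentenceId, "ngram=probability")
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text ngram, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            String probability = null;
            List<String> askers = new ArrayList<String>();
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("T:")) {
                    probability = s.substring(2);
                } else {
                    askers.add(s.substring(2));
                }
            }
            if (probability == null) {
                return;                // most queries miss: no such ngram in the table
            }
            for (String sentenceId : askers) {
                ctx.write(new Text(sentenceId), new Text(ngram + "=" + probability));
            }
        }
    }
}

The driver would feed the sentences to QueryMapper and the table files to
TableMapper (e.g. via MultipleInputs) so that both streams meet at the same
reducer.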





-- 
Ted Dunning, CTO
DeepDyve

Re: Finding the similarity of documents using Mahout for deduplication

Posted by Miles Osborne <mi...@inf.ed.ac.uk>.
you could probably eliminate phase 2 if the output of phase 1 was stored in a
Perfect Hashing table (say using Hypertable).  this works by storing a
fingerprint for each shingle/count pair (a few bits) and organising the hash
table such that you never get collisions (hence the Perfect Hashing).

the table would be compact and --modulo network latency-- would have O(1)
lookup time for each shingle-count query.
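
in miniature, the fingerprinting side of that idea looks something like the
sketch below.  this is not a real perfect-hash construction (it just probes
linearly and lives with a tiny false-positive rate, which is usually fine for
counts), but it shows why the table stays compact: only a short fingerprint
per shingle is kept, never the shingle itself.

public class FingerprintCountTable {
    private final short[] fingerprints;   // 0 marks an empty slot
    private final int[] counts;
    private final int capacity;

    public FingerprintCountTable(int capacity) {
        this.capacity = capacity;
        this.fingerprints = new short[capacity];
        this.counts = new int[capacity];
    }

    public void add(long shingleHash, int count) {
        int i = slot(shingleHash);
        short f = fingerprint(shingleHash);
        while (fingerprints[i] != 0 && fingerprints[i] != f) {
            i = (i + 1) % capacity;       // linear probing; assumes spare capacity
        }
        fingerprints[i] = f;
        counts[i] += count;
    }

    public int count(long shingleHash) {
        int i = slot(shingleHash);
        short f = fingerprint(shingleHash);
        while (fingerprints[i] != 0) {
            if (fingerprints[i] == f) {
                return counts[i];
            }
            i = (i + 1) % capacity;
        }
        return 0;                         // never stored
    }

    private int slot(long shingleHash) {
        return (int) Math.abs(shingleHash % capacity);
    }

    private short fingerprint(long shingleHash) {
        short f = (short) (shingleHash >>> 48);   // 16 bits of the hash stand in for the key
        return f == 0 ? 1 : f;
    }
}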

i wish people would consider more often using Hypertable / HBase etc in
algorithms:  there are times when you want random access and, as a community,
i think we need to put more effort into working out good ways to use all the
options available.  currently as a background task I'm thinking how to
Hadoop-ify our Machine Translation system;  this involves random access to
big tables of string-value pairs, as well as a few other tables.
compressed, these can be 20G each and we need to hit these tables 100s of
thousands of times per sentence we translate.  so, the research questions
here then become how to (a) modify the Machine Translation decoding
procedure to best batch-up table requests --Google have published on this--
and, more interestingly, (b) try to be more clever about whether a network
request actually needs to be made.  i have ideas here.

Miles



2009/7/18 Ted Dunning <te...@gmail.com>

> Computing min-sets of shingle hashes is definitely the preferred approach.
> I think that a full implementation to find all duplicates in a corpus
> requires more than one map-reduce phase.
>
> phase 1:
> Convert to an inverted index that maps hash => docids for each hash in the
> minset of some document.
>
>    map: (_, document) => for (shingle-hash in min-set(document))
> emit(shingle-hash, docid)
>    reduce: (shingle-hash, docid-list) => (shingle-hash, docid-list)
>
> phase 2:
> Count the number of times that documents share a shingle hash value in
> their
> minsets.  Duplicate or near duplicate documents will share many shingle
> hashes.  The map explodes a list of documents into all pairs, the combine
> reduces this to counts, the reduce filters this to only output pairs with
> high enough counts.
>
>    map: (shingle-hash, docid-list) => ([doc1, doc2], 1)
>    combine: ([doc1, doc2], list-of-count) => ([doc1, doc2],
> sum(list-of-count))
>    reduce: ([doc1, doc2], list-of-count) => let n = sum(list-of-count) if
> (n > threshold) emit (null, [doc1, doc2])
>
> If you only want to be able to test a single document at a time in near
> real-time, you can adapt this to build a Lucene index out of the minsets of
> each document (rather than the terms).  That Lucene index can be used to
> find duplicates in real-time.
>
>
> On Fri, Jul 17, 2009 at 12:49 PM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
>
> >
> >
> >
> > http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/CPM%202000.pdf
> >
> > it would be very nice if someone could implement a randomised approach
> > using
> > Hadoop ...
> >
> > (this should be fairly easy to do, since you have to convert each document
> > into a set of shingles --could be done in one mapper-- and then sort these
> > documents plus some extra twists)
>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>



-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.

Re: Finding the similarity of documents using Mahout for deduplication

Posted by Jason Rutherglen <ja...@gmail.com>.
In this context does min-set refer to:
http://portal.acm.org/citation.cfm?id=1226882

> If you only want to be able to test a single document at a time in near
> real-time, you can adapt this to build a Lucene index out of the minsets of
> each document (rather than the terms).  That Lucene index can be used to
> find duplicates in real-time.

Interesting.

Thanks!

On Fri, Jul 17, 2009 at 4:01 PM, Ted Dunning <te...@gmail.com> wrote:
> Computing min-sets of shingle hashes is definitely the preferred approach.
> I think that a full implementation to find all duplicates in a corpus
> requires more than one map-reduce phase.
>
> phase 1:
> Convert to an inverted index that maps hash => docids for each hash in the
> minset of some document.
>
>    map: (_, document) => for (shingle-hash in min-set(document))
> emit(shingle-hash, docid)
>    reduce: (shingle-hash, docid-list) => (shingle-hash, docid-list)
>
> phase 2:
> Count the number of times that documents share a shingle hash value in their
> minsets.  Duplicate or near duplicate documents will share many shingle
> hashes.  The map explodes a list of documents into all pairs, the combine
> reduces this to counts, the reduce filters this to only output pairs with
> high enough counts.
>
>    map: (shingle-hash, docid-list) => ([doc1, doc2], 1)
>    combine: ([doc1, doc2], list-of-count) => ([doc1, doc2],
> sum(list-of-count))
>    reduce: ([doc1, doc2], list-of-count) => let n = sum(list-of-count) if
> (n > threshold) emit (null, [doc1, doc2])
>
> If you only want to be able to test a single document at a time in near
> real-time, you can adapt this to build a Lucene index out of the minsets of
> each document (rather than the terms).  That Lucene index can be used to
> find duplicates in real-time.
>
>
> On Fri, Jul 17, 2009 at 12:49 PM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
>
>>
>>
>> http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/CPM%202000.pdf
>>
>> it would be very nice if someone could implement a randomised approach
>> using
>> Hadoop ...
>>
>> (this should be fairly easy to do, since you have to convert each document
>> into a set of shingles --could be done in one mapper-- and then sort these
>> documents plus some extra twists)
>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Finding the similarity of documents using Mahout for deduplication

Posted by Ted Dunning <te...@gmail.com>.
Computing min-sets of shingle hashes is definitely the preferred approach.
I think that a full implementation to find all duplicates in a corpus
requires more than one map-reduce phase.

phase 1:
Convert to an inverted index that maps hash => docids for each hash in the
minset of some document.

    map: (_, document) => for (shingle-hash in min-set(document)) emit(shingle-hash, docid)
    reduce: (shingle-hash, docid-list) => (shingle-hash, docid-list)

phase 2:
Count the number of times that documents share a shingle hash value in their
minsets.  Duplicate or near-duplicate documents will share many shingle
hashes.  The map explodes each docid list into all pairs, the combine reduces
this to counts, and the reduce filters this to only output pairs with high
enough counts.

    map: (shingle-hash, docid-list) => ([doc1, doc2], 1)
    combine: ([doc1, doc2], list-of-count) => ([doc1, doc2], sum(list-of-count))
    reduce: ([doc1, doc2], list-of-count) => let n = sum(list-of-count); if (n > threshold) emit(null, [doc1, doc2])
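
For concreteness, the min-set(document) used in the phase 1 map could simply
be the k smallest shingle-hash values of the document; k and the shingling
itself are tuning choices, nothing above prescribes them:

import java.util.Set;
import java.util.TreeSet;

public class MinSet {
    // keep the k smallest shingle hashes; near-duplicates share most of their
    // shingles, so they share most of these minima as well
    public static Set<Long> minSet(Set<Long> shingleHashes, int k) {
        TreeSet<Long> smallest = new TreeSet<Long>();
        for (Long h : shingleHashes) {
            smallest.add(h);
            if (smallest.size() > k) {
                smallest.pollLast();     // drop the current largest
            }
        }
        return smallest;
    }
}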

If you only want to be able to test a single document at a time in near
real-time, you can adapt this to build a Lucene index out of the minsets of
each document (rather than the terms).  That Lucene index can be used to
find duplicates in real-time.
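
Ignoring Lucene's scoring and scalability, the query side of that index
amounts to posting lists from shingle hash to doc id; a toy, purely in-memory
version:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MinSetIndex {
    // posting lists: shingle hash -> ids of the indexed documents containing it
    private final Map<Long, List<String>> postings = new HashMap<Long, List<String>>();

    public void add(String docId, Set<Long> minSet) {
        for (Long h : minSet) {
            List<String> docs = postings.get(h);
            if (docs == null) {
                docs = new ArrayList<String>();
                postings.put(h, docs);
            }
            docs.add(docId);
        }
    }

    // ids of documents sharing at least `threshold` min-set hashes with the query
    public List<String> nearDuplicates(Set<Long> queryMinSet, int threshold) {
        Map<String, Integer> overlap = new HashMap<String, Integer>();
        for (Long h : queryMinSet) {
            List<String> docs = postings.get(h);
            if (docs == null) {
                continue;
            }
            for (String docId : docs) {
                Integer c = overlap.get(docId);
                overlap.put(docId, c == null ? 1 : c + 1);
            }
        }
        List<String> hits = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : overlap.entrySet()) {
            if (e.getValue() >= threshold) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}

Lucene gives the same behaviour by indexing the hashes as terms and issuing
the query document's min-set as an OR query, with the amount of overlap
reflected in the score.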


On Fri, Jul 17, 2009 at 12:49 PM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:

>
>
> http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/CPM%202000.pdf
>
> it would be very nice if someone could implement a randomised approach
> using
> Hadoop ...
>
> (this should be fairly easy to do, since you have to convert each document
> into a set of shingles --could be done in one mapper-- and then sort these
> documents plus some extra twists)




-- 
Ted Dunning, CTO
DeepDyve