Posted to user@mahout.apache.org by Rich Heimann <he...@gmail.com> on 2011/07/28 17:49:51 UTC

Duplicate documents in a corpus

All,

I am curious whether Lucene and/or Mahout can identify duplicate documents. I am
having trouble with many redundant docs in my corpus, which is inflating values
and putting a burden on users to process and reprocess much of the material. Can
the redundancy be removed or managed in some sense by either Lucene at ingestion
or Mahout at post-processing? The Vector Space Model seems notionally similar to
PCA or Factor Analysis, which both have similar ambitions. Thoughts?

Thank you in advance....

Regards,
Rich Heimann

Richard Heimann

Re: Duplicate documents in a corpus

Posted by Ted Dunning <te...@gmail.com>.
We also have a minhash implementation of some sort that I don't know much
about.
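
In case it helps to see the idea, here is a toy MinHash sketch (this is not the
Mahout code, and the class and method names are made up for illustration):
documents whose word shingles overlap heavily end up with mostly identical
signatures, so near-duplicates can be found by bucketing on parts of the
signature instead of comparing every pair of documents.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class MinHashSketch {

  private final int[] seedsA;
  private final int[] seedsB;

  public MinHashSketch(int numHashes, long seed) {
    Random rnd = new Random(seed);
    seedsA = new int[numHashes];
    seedsB = new int[numHashes];
    for (int i = 0; i < numHashes; i++) {
      seedsA[i] = rnd.nextInt(Integer.MAX_VALUE - 1) + 1; // non-zero multiplier
      seedsB[i] = rnd.nextInt(Integer.MAX_VALUE);
    }
  }

  // Break text into word 3-shingles; each shingle is one element of the document's set.
  private static Set<String> shingles(String text) {
    String[] words = text.toLowerCase().trim().split("\\s+");
    Set<String> result = new HashSet<String>();
    for (int i = 0; i + 3 <= words.length; i++) {
      result.add(words[i] + " " + words[i + 1] + " " + words[i + 2]);
    }
    return result;
  }

  // Signature position i holds the minimum value of hash function i over all shingles.
  public int[] signature(String text) {
    int[] sig = new int[seedsA.length];
    Arrays.fill(sig, Integer.MAX_VALUE);
    for (String shingle : shingles(text)) {
      int h = shingle.hashCode();
      for (int i = 0; i < sig.length; i++) {
        int hashed = (int) ((seedsA[i] * (long) h + seedsB[i]) & 0x7fffffffL);
        if (hashed < sig[i]) {
          sig[i] = hashed;
        }
      }
    }
    return sig;
  }

  // Fraction of matching positions estimates the Jaccard similarity of the shingle sets.
  public static double estimatedJaccard(int[] a, int[] b) {
    int same = 0;
    for (int i = 0; i < a.length; i++) {
      if (a[i] == b[i]) {
        same++;
      }
    }
    return same / (double) a.length;
  }
}

Two documents whose estimated Jaccard similarity comes out near 1.0 are very
likely near-duplicates and can be flagged for removal or review.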

On Thu, Jul 28, 2011 at 4:33 PM, Chris Schilling
<ch...@thecleversense.com> wrote:

> Hey Lance,
>
> LSH is a hashing mechanism:
> http://en.wikipedia.org/wiki/Locality-sensitive_hashing
>
> Ted implemented something like this to hash vectors for training SGD
> Logistic Regression.
>
> Chris
>
> On Jul 28, 2011, at 3:43 PM, Lance Norskog wrote:
>
> > Three different answers, for different levels of one question: how
> > similar are these documents?
> >
> > If they have the same exact bytes, the Solr/Lucene deduplication
> > technique will work, and is very fast. (I don't remember if it is a
> > Lucene or Solr feature.)
> >
> > If they have "minor text changes", different metadata etc., the
> > Nutch/Hadoop job may work.
> >
> > If they are rearranged, plagiarized, etc., the Mahout LSA/LSI tools
> > (can't find LSH as an acronym) are the most useful.
> >
> > Order of execution: the Solr/Lucene deduplication feature can be done
> > one document at a time, almost entirely in memory. I don't know about
> > the Nutch/Hadoop idea. The LSA/LSI tools very definitely need all (or
> > most) of the documents to build a model, then test each document
> > against the model. Since this is a numerical comparison, there will be
> > a failure rate, both ways: false positives and false negatives. False
> > positives throw away valid documents.
> >
> >
> >
> > On 7/28/11, Ted Dunning <te...@gmail.com> wrote:
> >> Mahout also has an LSH implementation that can help with this.
> >>
> >> On Thu, Jul 28, 2011 at 9:37 AM, Ken Krugler
> >> <kk...@transpac.com> wrote:
> >>
> >>>
> >>> On Jul 28, 2011, at 8:49am, Rich Heimann wrote:
> >>>
> >>>> All,
> >>>>
> >>>> I am curious whether Lucene and/or Mahout can identify duplicate documents. I am
> >>>> having trouble with many redundant docs in my corpus, which is inflating values
> >>>> and putting a burden on users to process and reprocess much of the material. Can
> >>>> the redundancy be removed or managed in some sense by either Lucene at ingestion
> >>>> or Mahout at post-processing? The Vector Space Model seems notionally similar to
> >>>> PCA or Factor Analysis, which both have similar ambitions. Thoughts?
> >>>
> >>> Nutch has a TextProfileSignature class that creates a hash which is
> >>> somewhat resilient to minor text changes between documents.
> >>>
> >>> Assuming you have such a hash, then it's trivial to use a Hadoop
> workflow
> >>> to remove duplicates.
> >>>
> >>> Or Solr supports removing duplicates as well - see
> >>> http://wiki.apache.org/solr/Deduplication
> >>>
> >>> -- Ken
> >>>
> >>> --------------------------
> >>> Ken Krugler
> >>> +1 530-210-6378
> >>> http://bixolabs.com
> >>> custom data mining solutions
> >
> > --
> > Lance Norskog
> > goksron@gmail.com
>
> Chris Schilling
> Sr. Data Mining Engineer
> Clever Sense, Inc.
> "Curating the World Around You"
> --------------------------------------------------------------
> Winner of the 2011 Fortune Brainstorm Start-up Idol
>
> Wanna join the Clever Team? We're hiring!
> --------------------------------------------------------------
>
>

Re: Duplicate documents in a corpus

Posted by Chris Schilling <ch...@thecleversense.com>.
Hey Lance,

LSH is a hashing mechanism:
http://en.wikipedia.org/wiki/Locality-sensitive_hashing

Ted implemented something like this to hash vectors for training SGD Logistic Regression.
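
For anyone who wants to see the shape of it, here is a tiny random-hyperplane
LSH sketch (again, not the Mahout implementation, just an illustration with
made-up names): vectors pointing in nearly the same direction usually land in
the same bucket, so you only compare documents within a bucket when hunting
for duplicates instead of doing an all-pairs comparison.

import java.util.Random;

public class RandomHyperplaneLsh {

  // One random direction per signature bit; keep numBits <= 31 so the key fits in an int.
  private final double[][] hyperplanes;

  public RandomHyperplaneLsh(int numBits, int dimension, long seed) {
    Random rnd = new Random(seed);
    hyperplanes = new double[numBits][dimension];
    for (int b = 0; b < numBits; b++) {
      for (int d = 0; d < dimension; d++) {
        hyperplanes[b][d] = rnd.nextGaussian();
      }
    }
  }

  // Bit b of the key is 1 when the vector lies on the positive side of hyperplane b.
  public int bucket(double[] vector) {
    int key = 0;
    for (int b = 0; b < hyperplanes.length; b++) {
      double dot = 0.0;
      for (int d = 0; d < vector.length; d++) {
        dot += hyperplanes[b][d] * vector[d];
      }
      if (dot >= 0.0) {
        key |= 1 << b;
      }
    }
    return key;
  }
}

The hashing used for SGD training is the related but different trick of hashed
feature encoding (compressing the feature space); the sketch above is aimed at
finding near neighbors rather than shrinking vectors.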

Chris

On Jul 28, 2011, at 3:43 PM, Lance Norskog wrote:

> Three different answers, for different levels of one question: how
> similar are these documents?
> 
> If they have the same exact bytes, the Solr/Lucene deduplication
> technique will work, and is very fast. (I don't remember if it is a
> Lucene or Solr feature.)
> 
> If they have "minor text changes", different metadata etc., the
> Nutch/Hadoop job may work.
> 
> If they are rearranged, plagiarized, etc., the Mahout LSA/LSI tools
> (can't find LSH as an acronym) are the most useful.
> 
> Order of execution: the Solr/Lucene deduplication feature can be done
> one document at a time, almost entirely in memory. I don't know about
> the Nutch/Hadoop idea. The LSA/LSI tools very definitely need all (or
> most) of the documents to build a model, then test each document
> against the model. Since this is a numerical comparison, there will be
> a failure rate, both ways: false positives and false negatives. False
> positives throw away valid documents.
> 
> 
> 
> On 7/28/11, Ted Dunning <te...@gmail.com> wrote:
>> Mahout also has an LSH implementation that can help with this.
>> 
>> On Thu, Jul 28, 2011 at 9:37 AM, Ken Krugler
>> <kk...@transpac.com> wrote:
>> 
>>> 
>>> On Jul 28, 2011, at 8:49am, Rich Heimann wrote:
>>> 
>>>> All,
>>>> 
>>>> I am curious whether Lucene and/or Mahout can identify duplicate documents. I am
>>>> having trouble with many redundant docs in my corpus, which is inflating values
>>>> and putting a burden on users to process and reprocess much of the material. Can
>>>> the redundancy be removed or managed in some sense by either Lucene at ingestion
>>>> or Mahout at post-processing? The Vector Space Model seems notionally similar to
>>>> PCA or Factor Analysis, which both have similar ambitions. Thoughts?
>>> 
>>> Nutch has a TextProfileSignature class that creates a hash which is
>>> somewhat resilient to minor text changes between documents.
>>> 
>>> Assuming you have such a hash, then it's trivial to use a Hadoop workflow
>>> to remove duplicates.
>>> 
>>> Or Solr supports removing duplicates as well - see
>>> http://wiki.apache.org/solr/Deduplication
>>> 
>>> -- Ken
>>> 
>>> --------------------------
>>> Ken Krugler
>>> +1 530-210-6378
>>> http://bixolabs.com
>>> custom data mining solutions
> 
> 
> -- 
> Lance Norskog
> goksron@gmail.com

Chris Schilling
Sr. Data Mining Engineer
Clever Sense, Inc.
"Curating the World Around You"
--------------------------------------------------------------
Winner of the 2011 Fortune Brainstorm Start-up Idol

Wanna join the Clever Team? We're hiring!
--------------------------------------------------------------


Re: Duplicate documents in a corpus

Posted by Lance Norskog <go...@gmail.com>.
Three different answers, for different levels of one question: how
similar are these documents?

If they have the same exact bytes, the Solr/Lucene deduplication
technique will work, and is very fast. (I don't remember if it is a
Lucene or Solr feature.)
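
For the exact-bytes case you don't even need the index to do it; something as
simple as the sketch below (not the Solr code, just an illustration) works one
document at a time with only a hash set in memory.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

public class ExactDuplicateFilter {

  // Hex digests of every document seen so far.
  private final Set<String> seen = new HashSet<String>();

  // Returns true the first time a byte sequence is seen, false for every exact duplicate.
  public boolean isFirstOccurrence(byte[] documentBytes) throws NoSuchAlgorithmException {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    byte[] digest = md5.digest(documentBytes);
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
      hex.append(String.format("%02x", b & 0xff));
    }
    return seen.add(hex.toString());
  }
}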

If they have "minor text changes", different metadata etc., the
Nutch/Hadoop job may work.

If they are rearranged, plagiarized, etc., the Mahout LSA/LSI tools
(can't find LSH as an acronym) are the most useful.

Order of execution: the Solr/Lucene deduplication feature can be done
one document at a time, almost entirely in memory. I don't know about
the Nutch/Hadoop idea. The LSA/LSI tools very definitely need all (or
most) of the documents to build a model, then test each document
against the model. Since this is a numerical comparison, there will be
a failure rate, both ways: false positives and false negatives. False
positives throw away valid documents.



On 7/28/11, Ted Dunning <te...@gmail.com> wrote:
> Mahout also has an LSH implementation that can help with this.
>
> On Thu, Jul 28, 2011 at 9:37 AM, Ken Krugler
> <kk...@transpac.com> wrote:
>
>>
>> On Jul 28, 2011, at 8:49am, Rich Heimann wrote:
>>
>> > All,
>> >
>> > I am curious whether Lucene and/or Mahout can identify duplicate documents. I am
>> > having trouble with many redundant docs in my corpus, which is inflating values
>> > and putting a burden on users to process and reprocess much of the material. Can
>> > the redundancy be removed or managed in some sense by either Lucene at ingestion
>> > or Mahout at post-processing? The Vector Space Model seems notionally similar to
>> > PCA or Factor Analysis, which both have similar ambitions. Thoughts?
>>
>> Nutch has a TextProfileSignature class that creates a hash which is
>> somewhat resilient to minor text changes between documents.
>>
>> Assuming you have such a hash, then it's trivial to use a Hadoop workflow
>> to remove duplicates.
>>
>> Or Solr supports removing duplicates as well - see
>> http://wiki.apache.org/solr/Deduplication
>>
>> -- Ken
>>
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> custom data mining solutions


-- 
Lance Norskog
goksron@gmail.com

Re: Duplicate documents in a corpus

Posted by Ted Dunning <te...@gmail.com>.
Mahout also has an LSH implementation that can help with this.

On Thu, Jul 28, 2011 at 9:37 AM, Ken Krugler <kk...@transpac.com> wrote:

>
> On Jul 28, 2011, at 8:49am, Rich Heimann wrote:
>
> > All,
> >
> > I am curious whether Lucene and/or Mahout can identify duplicate documents. I am
> > having trouble with many redundant docs in my corpus, which is inflating values
> > and putting a burden on users to process and reprocess much of the material. Can
> > the redundancy be removed or managed in some sense by either Lucene at ingestion
> > or Mahout at post-processing? The Vector Space Model seems notionally similar to
> > PCA or Factor Analysis, which both have similar ambitions. Thoughts?
>
> Nutch has a TextProfileSignature class that creates a hash which is
> somewhat resilient to minor text changes between documents.
>
> Assuming you have such a hash, then it's trivial to use a Hadoop workflow
> to remove duplicates.
>
> Or Solr supports removing duplicates as well - see
> http://wiki.apache.org/solr/Deduplication
>
> -- Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom data mining solutions

Re: Duplicate documents in a corpus

Posted by Ken Krugler <kk...@transpac.com>.
On Jul 28, 2011, at 8:49am, Rich Heimann wrote:

> All,
> 
> I am curious whether Lucene and/or Mahout can identify duplicate documents. I am
> having trouble with many redundant docs in my corpus, which is inflating values
> and putting a burden on users to process and reprocess much of the material. Can
> the redundancy be removed or managed in some sense by either Lucene at ingestion
> or Mahout at post-processing? The Vector Space Model seems notionally similar to
> PCA or Factor Analysis, which both have similar ambitions. Thoughts?

Nutch has a TextProfileSignature class that creates a hash which is somewhat resilient to minor text changes between documents.

Assuming you have such a hash, then it's trivial to use a Hadoop workflow to remove duplicates.
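
A rough sketch of such a workflow (not the Nutch job itself; it assumes the
signatures were already computed upstream, e.g. with something like
TextProfileSignature, and written as the keys of a SequenceFile of
(signature, document) pairs, and the class names are placeholders):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SignatureDedup {

  // Identity map: the signature is already the key, so records that share a
  // signature are grouped together at the reducer.
  public static class PassThroughMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text signature, Text doc, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(signature, doc);
    }
  }

  // Keep only the first document for each signature; everything else is a duplicate.
  public static class KeepFirstReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text signature, Iterable<Text> docs, Context ctx)
        throws IOException, InterruptedException {
      for (Text doc : docs) {
        ctx.write(signature, doc);
        break;
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "signature-dedup");
    job.setJarByClass(SignatureDedup.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setMapperClass(PassThroughMapper.class);
    job.setReducerClass(KeepFirstReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}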

Or Solr supports removing duplicates as well - see http://wiki.apache.org/solr/Deduplication
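
For reference, the setup on that wiki page boils down to an update processor
chain in solrconfig.xml, roughly like the following (field names here are
placeholders; check the wiki for the exact options):

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <!-- Fields whose contents feed the signature; adjust to your schema. -->
    <str name="fields">title,body</str>
    <!-- TextProfileSignature gives fuzzy matching; Lookup3Signature/MD5Signature are exact. -->
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

With overwriteDupes set to true, a new document whose signature matches an
existing one replaces it instead of piling up in the index.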

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions