Posted to user@mahout.apache.org by JAGANADH G <ja...@gmail.com> on 2010/07/07 16:23:16 UTC

Document Comparison with Mahout

Dear All

Is there any way or algorithm available to compare two documents?
E.g., check whether doc "A" is a copy (plagiarised version) of document "B".

With regards

-- 
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog

Re: Document Comparison with Mahout

Posted by Grant Ingersoll <gs...@apache.org>.

On Jul 8, 2010, at 2:21 AM, JAGANADH G wrote:

> On Wed, Jul 7, 2010 at 11:49 PM, Grant Ingersoll <gs...@apache.org> wrote:
> 
>> How do you want to determine copy?  Strictly or loosely?  Solr and Nutch
>> have some deduplication capabilities, including fuzzy matching.  They
>> probably could be brought into Mahout, too.
>> 
>> -Grant
>> 
>> 
>> 
> Dear Grant
> I am trying to make a strict match.
> I will try Solr and Nutch.

So, then you can do a checksum or something like that, right?
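A strict match of the kind suggested here could be sketched as below. This is only a minimal illustration, not something taken from Mahout, Solr, or Nutch: the function names `checksum` and `is_exact_copy` are hypothetical, and SHA-256 is an arbitrary choice of hash. Note that any difference at all, even one extra space, produces a different checksum.

```python
import hashlib


def checksum(text: str) -> str:
    """Return the SHA-256 hex digest of the text, encoded as UTF-8."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def is_exact_copy(doc_a: str, doc_b: str) -> bool:
    """True only when the two documents are character-for-character identical."""
    return checksum(doc_a) == checksum(doc_b)


if __name__ == "__main__":
    original = "The quick brown fox."
    print(is_exact_copy(original, "The quick brown fox."))
    print(is_exact_copy(original, "The quick  brown fox."))
```

For a large collection, the checksums can be computed once and stored, so that each new document needs only one hash and one lookup rather than a comparison against every stored text.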

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search

Re: Document Comparison with Mahout

Posted by JAGANADH G <ja...@gmail.com>.

On Wed, Jul 7, 2010 at 11:49 PM, Grant Ingersoll <gs...@apache.org> wrote:

> How do you want to determine copy?  Strictly or loosely?  Solr and Nutch
> have some deduplication capabilities, including fuzzy matching.  They
> probably could be brought into Mahout, too.
>
> -Grant
>
>
>
Dear Grant
I am trying to make a strict match.
I will try Solr and Nutch.
Thanks and Regards
-- 
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog

Re: Document Comparison with Mahout

Posted by JAGANADH G <ja...@gmail.com>.

On Thu, Jul 8, 2010 at 7:35 AM, dc tech <dc...@gmail.com> wrote:

> Document similarity is unlikely to work, as the typical case is a term paper
> for a class, where the papers will be similar (many similar words, etc.). One
> approach (suggested in a book; I do not recall the title now) is to take a
> sample of text fragments from document 1 and use those fragments as queries
> against the larger corpus. Plagiarism may be suggested if m of the n
> fragments match, assuming the cheater is smart and has at least not copied
> the entire document. Key questions would be:
> - how many text fragments (n) to take from the document under consideration
> (call it doc 1), and what fragment size and extraction technique to use (e.g.
> sentence breaks)
> - how many matches constitute a possible match (e.g. out of 10 fragments, a
> match is when 6 show up in a different document)
> - one pass only or multiple passes
>
I got the same idea from some research papers.
Somewhere I saw that LSI may also be useful for this, but I don't know
the details.
-- 
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog

Re: Document Comparison with Mahout

Posted by dc tech <dc...@gmail.com>.

Document similarity is unlikely to work, as the typical case is a term paper
for a class, where the papers will be similar (many similar words, etc.). One
approach (suggested in a book; I do not recall the title now) is to take a
sample of text fragments from document 1 and use those fragments as queries
against the larger corpus. Plagiarism may be suggested if m of the n
fragments match, assuming the cheater is smart and has at least not copied
the entire document. Key questions would be:
- how many text fragments (n) to take from the document under consideration
(call it doc 1), and what fragment size and extraction technique to use (e.g.
sentence breaks)
- how many matches constitute a possible match (e.g. out of 10 fragments, a
match is when 6 show up in a different document)
- one pass only or multiple passes

Hope that helps.
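
The fragment-sampling idea described above might look roughly like the following. It is a sketch under stated assumptions, not a Mahout feature: all function names are hypothetical, fragments are whole sentences found with a naive regular expression, a "query" is a plain substring test after lowercasing and whitespace normalization, and the defaults n=10 and m=6 simply reuse the example numbers from the message. It makes a single pass over the corpus.

```python
import random
import re


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def normalize(text):
    """Lowercase and collapse whitespace so trivial edits do not hide a match."""
    return " ".join(text.lower().split())


def sample_fragments(doc, n, seed=0):
    """Pick up to n sentences from doc at random; the seed makes runs repeatable."""
    sentences = split_sentences(doc)
    rng = random.Random(seed)
    return rng.sample(sentences, min(n, len(sentences)))


def count_matches(fragments, candidate):
    """Count how many fragments occur verbatim (after normalization) in candidate."""
    haystack = normalize(candidate)
    return sum(1 for fragment in fragments if normalize(fragment) in haystack)


def suspicious_documents(doc, corpus, n=10, m=6, seed=0):
    """Return names of corpus documents that contain at least m sampled fragments."""
    fragments = sample_fragments(doc, n, seed)
    return [name for name, text in corpus.items()
            if count_matches(fragments, text) >= m]


if __name__ == "__main__":
    doc = "One fish swims. Two fish swim. Red fish hide. Blue fish jump."
    corpus = {
        "copy": "Intro here. " + doc + " Outro.",
        "other": "Nothing alike at all. Totally different.",
    }
    print(suspicious_documents(doc, corpus, n=4, m=3))
```

A document with fewer than m sentences can never be flagged by this sketch, so m would need to scale with document length in practice. On a real corpus the substring test would likely be replaced by phrase queries against a search index.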





On Wed, Jul 7, 2010 at 2:19 PM, Grant Ingersoll <gs...@apache.org> wrote:

> How do you want to determine copy?  Strictly or loosely?  Solr and Nutch
> have some deduplication capabilities, including fuzzy matching.  They
> probably could be brought into Mahout, too.
>
> -Grant
>
> On Jul 7, 2010, at 10:23 AM, JAGANADH G wrote:
>
> > Dear All
> >
> > Is there any way or algorithm available to compare two documents?
> > E.g., check whether doc "A" is a copy (plagiarised version) of document "B".
> >
> > With regards
> >
> > --
> > **********************************
> > JAGANADH G
> > http://jaganadhg.freeflux.net/blog
>
>

Re: Document Comparison with Mahout

Posted by Grant Ingersoll <gs...@apache.org>.

How do you want to determine copy?  Strictly or loosely?  Solr and Nutch have some deduplication capabilities, including fuzzy matching.  They probably could be brought into Mahout, too.

-Grant

On Jul 7, 2010, at 10:23 AM, JAGANADH G wrote:

> Dear All
> 
> Is there any way or algorithm available to compare two documents?
> E.g., check whether doc "A" is a copy (plagiarised version) of document "B".
> 
> With regards
> 
> -- 
> **********************************
> JAGANADH G
> http://jaganadhg.freeflux.net/blog