Posted to user@mahout.apache.org by JAGANADH G <ja...@gmail.com> on 2010/07/07 16:23:16 UTC
Document Comparison with Mahout
Dear All
Is there any way or algorithm available to compare two documents?
E.g. check if doc "A" is a copy (plagiarised version) of document "B".
With regards
--
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog
Re: Document Comparison with Mahout
Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 8, 2010, at 2:21 AM, JAGANADH G wrote:
> On Wed, Jul 7, 2010 at 11:49 PM, Grant Ingersoll <gs...@apache.org> wrote:
>
>> How do you want to determine copy? Strictly or loosely? Solr and Nutch
>> have some deduplication capabilities, including fuzzy matching. They
>> probably could be brought into Mahout, too.
>>
>> -Grant
>>
>>
>>
> Dear Grant
> I am trying to make a strict match.
> I will try Solr and Nutch.
So, then you can do a checksum or something like that, right?
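
For illustration, a strict match by checksum could look like the sketch below. It is only an assumption of what "a checksum or something like that" might mean in practice: the function names are hypothetical, SHA-256 is an arbitrary choice of digest, and nothing here comes from Mahout, Solr, or Nutch. Note that any difference at all, even a trailing space, makes the digests differ.

```python
import hashlib


def checksum(text: str) -> str:
    """Return the SHA-256 hex digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def is_exact_copy(doc_a: str, doc_b: str) -> bool:
    """True only if the two documents are byte-for-byte identical.

    Hypothetical helper: a strict comparison, so whitespace or
    punctuation changes count as a different document.
    """
    return checksum(doc_a) == checksum(doc_b)
```

For a large collection, the digest of each document can be stored once and compared against the digest of every new document, instead of comparing full texts pairwise.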
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
Re: Document Comparison with Mahout
Posted by JAGANADH G <ja...@gmail.com>.
On Wed, Jul 7, 2010 at 11:49 PM, Grant Ingersoll <gs...@apache.org> wrote:
> How do you want to determine copy? Strictly or loosely? Solr and Nutch
> have some deduplication capabilities, including fuzzy matching. They
> probably could be brought into Mahout, too.
>
> -Grant
>
>
>
Dear Grant
I am trying to make a strict match.
I will try Solr and Nutch.
Thanks and Regards
--
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog
Re: Document Comparison with Mahout
Posted by JAGANADH G <ja...@gmail.com>.
On Thu, Jul 8, 2010 at 7:35 AM, dc tech <dc...@gmail.com> wrote:
> Document similarity alone is unlikely to work, since the typical case is a
> term paper for a class, where all the papers will be similar (many shared
> words, etc.). One approach (suggested in a book; I do not recall the title
> now) is to take a sample of text fragments from document 1 and use those
> fragments as queries against the larger corpus. Plagiarism may be suggested
> if m of the n fragments match, assuming the cheater is smart and has at
> least not copied the entire document. Key questions would be:
> - how many text fragments (n) to take from the document under consideration
> (call it doc 1), what fragment size to use, and how to extract them (e.g. at
> sentence breaks)
> - how many matches constitute a possible match (e.g. out of 10 fragments,
> a match is when 6 show up in a different document)
> - one pass only or multiple passes
>
> I got the same idea from some research papers.
Somewhere I saw that LSI would also be useful for this, but I don't know
the details.
--
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog
Re: Document Comparison with Mahout
Posted by dc tech <dc...@gmail.com>.
Document similarity alone is unlikely to work, since the typical case is a
term paper for a class, where all the papers will be similar (many shared
words, etc.). One approach (suggested in a book; I do not recall the title
now) is to take a sample of text fragments from document 1 and use those
fragments as queries against the larger corpus. Plagiarism may be suggested
if m of the n fragments match, assuming the cheater is smart and has at
least not copied the entire document. Key questions would be:
- how many text fragments (n) to take from the document under consideration
(call it doc 1), what fragment size to use, and how to extract them (e.g. at
sentence breaks)
- how many matches constitute a possible match (e.g. out of 10 fragments,
a match is when 6 show up in a different document)
- one pass only or multiple passes
Hope that helps.
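
The fragment-sampling idea above can be sketched roughly as follows. This is a minimal illustration, not an implementation from the book or from Mahout: the function names, the naive sentence splitter, the exact substring match standing in for a search query, and the default values n=10 and m=6 (taken from the example in the list) are all assumptions.

```python
import random
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def normalize(text):
    """Lowercase and collapse whitespace so trivial edits do not hide a match."""
    return " ".join(text.lower().split())


def count_matching_fragments(doc, other_doc, n=10, seed=0):
    """Sample up to n sentences from doc and count how many occur in other_doc.

    Returns (matches, fragments_sampled). The substring test is a stand-in
    for querying a search index over a larger corpus.
    """
    sentences = split_sentences(doc)
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    sample = rng.sample(sentences, min(n, len(sentences)))
    haystack = normalize(other_doc)
    matches = sum(1 for s in sample if normalize(s) in haystack)
    return matches, len(sample)


def looks_plagiarised(doc, other_doc, n=10, m=6, seed=0):
    """Flag a possible copy when at least m of the sampled fragments match."""
    matches, _ = count_matching_fragments(doc, other_doc, n, seed)
    return matches >= m
```

Running this once per candidate document corresponds to the "one pass" option; a second pass with a different seed or smaller fragments would be one way to follow up on borderline cases.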
On Wed, Jul 7, 2010 at 2:19 PM, Grant Ingersoll <gs...@apache.org> wrote:
> How do you want to determine copy? Strictly or loosely? Solr and Nutch
> have some deduplication capabilities, including fuzzy matching. They
> probably could be brought into Mahout, too.
>
> -Grant
>
> On Jul 7, 2010, at 10:23 AM, JAGANADH G wrote:
>
> > Dear All
> >
> > Is there any way or algorithm available to compare two documents?
> > E.g. check if doc "A" is a copy (plagiarised version) of document "B".
> >
> > With regards
> >
> > --
> > **********************************
> > JAGANADH G
> > http://jaganadhg.freeflux.net/blog
>
>
Re: Document Comparison with Mahout
Posted by Grant Ingersoll <gs...@apache.org>.
How do you want to determine copy? Strictly or loosely? Solr and Nutch have some deduplication capabilities, including fuzzy matching. They probably could be brought into Mahout, too.
-Grant
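
To make the strict versus loose distinction concrete, here is a small sketch of one common loose approach: word shingles compared with Jaccard similarity. It is a generic technique shown under assumptions (hypothetical function names, shingle size k=3) and is not a description of how Solr or Nutch implement deduplication.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Text shorter than one shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


def similarity(doc_a, doc_b, k=3):
    """Loose similarity in [0, 1]; 1.0 means identical shingle sets."""
    return jaccard(shingles(doc_a, k), shingles(doc_b, k))
```

A strict match corresponds to requiring a similarity of 1.0 (or simply equal checksums), while a loose match accepts anything above a chosen threshold.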
On Jul 7, 2010, at 10:23 AM, JAGANADH G wrote:
> Dear All
>
> Is there any way or algorithm available to compare two documents?
> E.g. check if doc "A" is a copy (plagiarised version) of document "B".
>
> With regards
>
> --
> **********************************
> JAGANADH G
> http://jaganadhg.freeflux.net/blog