Posted to java-user@lucene.apache.org by Amel Fraisse <am...@gmail.com> on 2010/12/30 12:05:54 UTC

Using Lucene/Solr for Plagiarism detection

Hello,

I am using Lucene for plagiarism detection.

The goal: when I receive a new document, I check the Solr index for any
document that shares some chunks with it.

To compute the similarity between the query (the suspicious document) and a
source document, I would use this formula:

Score(suspicious document, source document) = number of chunks common to
the source document and the suspicious document / total number of chunks in
the suspicious document.
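
For reference, the formula above can be sketched in plain Java, independent
of Lucene. This is only an illustration under assumptions not stated in this
message: a "chunk" is taken to be a run of three consecutive lowercased
words, and the names ChunkOverlapScore, chunks and score are made up for the
example. It does not show how to plug the score into Lucene's Similarity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class ChunkOverlapScore {

    // Assumption: a chunk is every run of chunkSize consecutive words.
    static List<String> chunks(String text, int chunkSize) {
        List<String> result = new ArrayList<>();
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return result;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i + chunkSize <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + chunkSize)));
        }
        return result;
    }

    // Common chunks divided by the total number of chunks in the suspicious document.
    // Each occurrence of a chunk in the suspicious document is counted.
    static double score(List<String> suspicious, List<String> source) {
        if (suspicious.isEmpty()) {
            return 0.0; // avoid division by zero for an empty document
        }
        Set<String> sourceChunks = new HashSet<>(source);
        int common = 0;
        for (String chunk : suspicious) {
            if (sourceChunks.contains(chunk)) {
                common++;
            }
        }
        return (double) common / suspicious.size();
    }

    public static void main(String[] args) {
        List<String> suspicious = chunks("the quick brown fox jumps over the lazy dog", 3);
        List<String> source = chunks("a quick brown fox jumps high", 3);
        System.out.println(score(suspicious, source));
    }
}
```

In the sample above, 2 of the 7 suspicious chunks also occur in the source
document, so the score is 2/7.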

So I need to change the scoring formula in the Similarity class.

How can I change the scoring formula? Is customizing the Similarity class
enough, or do I also need a custom Scorer?

Do you have an example of this use case?

Thanks for your help.

Re: Using Lucene/Solr for Plagiarism detection

Posted by Lance Norskog <go...@gmail.com>.
The MoreLikeThis feature may be exactly what you want. Try it out.

On Thu, Dec 30, 2010 at 8:28 AM, Amel Fraisse <am...@gmail.com> wrote:
> Hello,
>
> No, I'm not using cosine similarity metrics.

-- 
Lance Norskog
goksron@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Using Lucene/Solr for Plagiarism detection

Posted by Amel Fraisse <am...@gmail.com>.
Hello,

No, I'm not using cosine similarity metrics.


2010/12/30 Shashi Kant <sk...@sloan.mit.edu>

> Have you considered using document similarity metrics such as Cosine
> Similarity?


-- 
--------------
Amel Fraisse

Re: Using Lucene/Solr for Plagiarism detection

Posted by Shashi Kant <sk...@sloan.mit.edu>.
Have you considered using document similarity metrics such as Cosine Similarity?
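
As a rough illustration of what cosine similarity means here, the sketch
below compares two texts by their raw term-frequency vectors in plain Java.
It makes assumptions of its own: tokens are split on non-word characters and
lowercased, no TF-IDF weighting is applied, and the names
CosineSimilaritySketch, termFrequencies and cosine are made up for the
example. It is not how Lucene computes its scores internally.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class CosineSimilaritySketch {

    // Count how often each lowercased token occurs in the text.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // cos(a, b) = (a . b) / (|a| * |b|), or 0 if either vector is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            int freqA = entry.getValue();
            normA += (double) freqA * freqA;
            Integer freqB = b.get(entry.getKey());
            if (freqB != null) {
                dot += (double) freqA * freqB;
            }
        }
        for (int freqB : b.values()) {
            normB += (double) freqB * freqB;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> suspicious = termFrequencies("the quick brown fox");
        Map<String, Integer> source = termFrequencies("the slow brown dog");
        System.out.println(cosine(suspicious, source));
    }
}
```

Unlike the chunk-overlap ratio in the original question, this measure is
symmetric and ignores word order, so it may miss or overstate copied
passages depending on how the documents are tokenized.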


