You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@creadur.apache.org by Alexei Fedotov <al...@gmail.com> on 2009/03/23 08:07:29 UTC

[rat-1-cutnpaste] Cut&Paste Detector - Ideas

Hello Amila,
Thanks for your questions.

1. As for imports and annotations, the order of them may be a serious proof
for cut & pasted code. Generally I believe that the algorithm should be
language independent, and important structural things like method bodies
should be defined in a form of a grammar or plug-ins.

2. If you succeed to find the whole cut&pasted method, and this method is
more or less unique according to the search results, this proves it was
cut&pasted. I agree that it may worth to download the whole project the
method was stolen from and do local in-depth analysis. While it may be
useful and worth designing an extensible tool, I believe this is out of the
scope of the initial GSoC task due to availability of exisitng open source
solutions. The first run of our tool may create a list of URLs for further
in-depth analysis and configure script for CPD [1] to perform it.

BTW, I have updated the task [2] on required skills and added prefix to the
mail subject.

[1] http://pmd.sourceforge.net/cpd.html
[2] http://wiki.apache.org/general/SummerOfCode2009#rat-1-cutnpaste


On Sat, Mar 21, 2009 at 8:09 AM, Amila De Silva <ja...@gmail.com> wrote:

> Hi all,
>        I'm Amila. I'm interested in apache RAT cut&paste detector. After
> having a little discussion with Alexei I felt that
> It's worthy to express my ideas about the project through this mailing
> list.
> After going through the mails I found that you have considered about
> saving search engine resources, whilst detecting a plagarised code.
>
> This is what I thought of;
> 1. Say that we are going to check a whole class against the existing
> codes.As the import statement,annotations are common to most of the
> class
> we should avoid them. So first we are going to remove the import
> statements , annotations (and such common features) and have a clean
> code.This is the one
> we will use for searching afterwards.
>
> 2.Than directly using a sliding window mechanism I sugest this:
> First the code can be broken into components. For a java code these
> components are varaibles, methods,comments,...
> Then we are performing the search using these components.
> Suppose that there are five methods in a particular class(so there are
> five method components). We are taking one method out of those then
> perform a search.
> If it has some result,we compare the component we searched with the
> result and calculate number of similarities between them. I think that
> Levenshtein algorithm(
> which is used to calculate word distances) can be used here to
> determine the similarity between two codes (this value is a one
> heuristic value).
> In a similar manner we perform the search using other components also.
> If there are results then it is compared as above and heuristic value
> is
> added to the first.
>
> When performing a search if any component fails to return a result
> ,then we use sliding window mechanism on that component to determine
> its heuristic value.
> First start by getting the first word in the component(here the
> component is a single method), perform the search ,then add the second
> word to it, perform search
> We do this until we find the largest match. After that similarities
> are calculated.
>
> After performing these component search we'll have some value for a
> particular class file.
> This value can be used to determine if the code has been cut&paste.
>
> I'd like to hear your comments on this and start a bit discussion.
>
> Thanks In Advance!
> Best Regards,
> Amila
>



-- 
С уважением,
Алексей Федотов,
http://www.telecom-express.ru/
http://people.apache.org/~aaf/

Re: [rat-1-cutnpaste] Cut&Paste Detector - Ideas

Posted by Amila De Silva <ja...@gmail.com>.
Hello Alexei,
Thanks for the detailed reply!

I have a little problem to get clarified about the new eligibility. After
introducing the new prototype,
what RAT should be able to do?
Should it be able to find duplicates in a target code( which now CPD is
doing) or should it be able to find
a given code snippet in the target code?

2009/3/23 Alexei Fedotov <al...@gmail.com>

> Hello Amila,
> Thanks for your questions.
>
> 1. As for imports and annotations, the order of them may be a serious proof
> for cut & pasted code. Generally I believe that the algorithm should be
> language independent, and important structural things like method bodies
> should be defined in a form of a grammar or plug-ins.
>
> 2. If you succeed to find the whole cut&pasted method, and this method is
> more or less unique according to the search results, this proves it was
> cut&pasted. I agree that it may worth to download the whole project the
> method was stolen from and do local in-depth analysis. While it may be
> useful and worth designing an extensible tool, I believe this is out of the
> scope of the initial GSoC task due to availability of exisitng open source
> solutions. The first run of our tool may create a list of URLs for further
> in-depth analysis and configure script for CPD [1] to perform it.
>
> BTW, I have updated the task [2] on required skills and added prefix to the
> mail subject.
>
> [1] http://pmd.sourceforge.net/cpd.html
> [2] http://wiki.apache.org/general/SummerOfCode2009#rat-1-cutnpaste
>
>
> On Sat, Mar 21, 2009 at 8:09 AM, Amila De Silva <ja...@gmail.com> wrote:
>
> > Hi all,
> >        I'm Amila. I'm interested in apache RAT cut&paste detector. After
> > having a little discussion with Alexei I felt that
> > It's worthy to express my ideas about the project through this mailing
> > list.
> > After going through the mails I found that you have considered about
> > saving search engine resources, whilst detecting a plagarised code.
> >
> > This is what I thought of;
> > 1. Say that we are going to check a whole class against the existing
> > codes.As the import statement,annotations are common to most of the
> > class
> > we should avoid them. So first we are going to remove the import
> > statements , annotations (and such common features) and have a clean
> > code.This is the one
> > we will use for searching afterwards.
> >
> > 2.Than directly using a sliding window mechanism I sugest this:
> > First the code can be broken into components. For a java code these
> > components are varaibles, methods,comments,...
> > Then we are performing the search using these components.
> > Suppose that there are five methods in a particular class(so there are
> > five method components). We are taking one method out of those then
> > perform a search.
> > If it has some result,we compare the component we searched with the
> > result and calculate number of similarities between them. I think that
> > Levenshtein algorithm(
> > which is used to calculate word distances) can be used here to
> > determine the similarity between two codes (this value is a one
> > heuristic value).
> > In a similar manner we perform the search using other components also.
> > If there are results then it is compared as above and heuristic value
> > is
> > added to the first.
> >
> > When performing a search if any component fails to return a result
> > ,then we use sliding window mechanism on that component to determine
> > its heuristic value.
> > First start by getting the first word in the component(here the
> > component is a single method), perform the search ,then add the second
> > word to it, perform search
> > We do this until we find the largest match. After that similarities
> > are calculated.
> >
> > After performing these component search we'll have some value for a
> > particular class file.
> > This value can be used to determine if the code has been cut&paste.
> >
> > I'd like to hear your comments on this and start a bit discussion.
> >
> > Thanks In Advance!
> > Best Regards,
> > Amila
> >
>
>
>
> --
> С уважением,
> Алексей Федотов,
> http://www.telecom-express.ru/
> http://people.apache.org/~aaf/ <http://people.apache.org/%7Eaaf/>
>