You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@creadur.apache.org by Amila De Silva <ja...@gmail.com> on 2009/03/21 06:09:17 UTC

Cut&Paste Detector - Ideas

Hi all,
	I'm Amila. I'm interested in apache RAT cut&paste detector. After
having a little discussion with Alexei I felt that
It's worthy to express my ideas about the project through this mailing list.
After going through the mails I found that you have considered about
saving search engine resources, whilst detecting a plagarised code.

This is what I thought of;
1. Say that we are going to check a whole class against the existing
codes.As the import statement,annotations are common to most of the
class
we should avoid them. So first we are going to remove the import
statements , annotations (and such common features) and have a clean
code.This is the one
we will use for searching afterwards.

2.Than directly using a sliding window mechanism I sugest this:
First the code can be broken into components. For a java code these
components are varaibles, methods,comments,...
Then we are performing the search using these components.
Suppose that there are five methods in a particular class(so there are
five method components). We are taking one method out of those then
perform a search.
If it has some result,we compare the component we searched with the
result and calculate number of similarities between them. I think that
Levenshtein algorithm(
which is used to calculate word distances) can be used here to
determine the similarity between two codes (this value is a one
heuristic value).
In a similar manner we perform the search using other components also.
If there are results then it is compared as above and heuristic value
is
added to the first.

When performing a search if any component fails to return a result
,then we use sliding window mechanism on that component to determine
its heuristic value.
First start by getting the first word in the component(here the
component is a single method), perform the search ,then add the second
word to it, perform search
We do this until we find the largest match. After that similarities
are calculated.

After performing these component search we'll have some value for a
particular class file.
This value can be used to determine if the code has been cut&paste.

I'd like to hear your comments on this and start a bit discussion.

Thanks In Advance!
Best Regards,
Amila

Re: [rat-1-cutnpaste] Cut&Paste Detector - Ideas

Posted by Amila De Silva <ja...@gmail.com>.
Hello Alexei,
Thanks for the detailed reply!

I have a little problem to get clarified about the new eligibility. After
introducing the new prototype,
what RAT should be able to do?
Should it be able to find duplicates in a target code( which now CPD is
doing) or should it be able to find
a given code snippet in the target code?

2009/3/23 Alexei Fedotov <al...@gmail.com>

> Hello Amila,
> Thanks for your questions.
>
> 1. As for imports and annotations, the order of them may be a serious proof
> for cut & pasted code. Generally I believe that the algorithm should be
> language independent, and important structural things like method bodies
> should be defined in a form of a grammar or plug-ins.
>
> 2. If you succeed to find the whole cut&pasted method, and this method is
> more or less unique according to the search results, this proves it was
> cut&pasted. I agree that it may worth to download the whole project the
> method was stolen from and do local in-depth analysis. While it may be
> useful and worth designing an extensible tool, I believe this is out of the
> scope of the initial GSoC task due to availability of exisitng open source
> solutions. The first run of our tool may create a list of URLs for further
> in-depth analysis and configure script for CPD [1] to perform it.
>
> BTW, I have updated the task [2] on required skills and added prefix to the
> mail subject.
>
> [1] http://pmd.sourceforge.net/cpd.html
> [2] http://wiki.apache.org/general/SummerOfCode2009#rat-1-cutnpaste
>
>
> On Sat, Mar 21, 2009 at 8:09 AM, Amila De Silva <ja...@gmail.com> wrote:
>
> > Hi all,
> >        I'm Amila. I'm interested in apache RAT cut&paste detector. After
> > having a little discussion with Alexei I felt that
> > It's worthy to express my ideas about the project through this mailing
> > list.
> > After going through the mails I found that you have considered about
> > saving search engine resources, whilst detecting a plagarised code.
> >
> > This is what I thought of;
> > 1. Say that we are going to check a whole class against the existing
> > codes.As the import statement,annotations are common to most of the
> > class
> > we should avoid them. So first we are going to remove the import
> > statements , annotations (and such common features) and have a clean
> > code.This is the one
> > we will use for searching afterwards.
> >
> > 2.Than directly using a sliding window mechanism I sugest this:
> > First the code can be broken into components. For a java code these
> > components are varaibles, methods,comments,...
> > Then we are performing the search using these components.
> > Suppose that there are five methods in a particular class(so there are
> > five method components). We are taking one method out of those then
> > perform a search.
> > If it has some result,we compare the component we searched with the
> > result and calculate number of similarities between them. I think that
> > Levenshtein algorithm(
> > which is used to calculate word distances) can be used here to
> > determine the similarity between two codes (this value is a one
> > heuristic value).
> > In a similar manner we perform the search using other components also.
> > If there are results then it is compared as above and heuristic value
> > is
> > added to the first.
> >
> > When performing a search if any component fails to return a result
> > ,then we use sliding window mechanism on that component to determine
> > its heuristic value.
> > First start by getting the first word in the component(here the
> > component is a single method), perform the search ,then add the second
> > word to it, perform search
> > We do this until we find the largest match. After that similarities
> > are calculated.
> >
> > After performing these component search we'll have some value for a
> > particular class file.
> > This value can be used to determine if the code has been cut&paste.
> >
> > I'd like to hear your comments on this and start a bit discussion.
> >
> > Thanks In Advance!
> > Best Regards,
> > Amila
> >
>
>
>
> --
> С уважением,
> Алексей Федотов,
> http://www.telecom-express.ru/
> http://people.apache.org/~aaf/ <http://people.apache.org/%7Eaaf/>
>

[rat-1-cutnpaste] Cut&Paste Detector - Ideas

Posted by Alexei Fedotov <al...@gmail.com>.
Hello Amila,
Thanks for your questions.

1. As for imports and annotations, the order of them may be a serious proof
for cut & pasted code. Generally I believe that the algorithm should be
language independent, and important structural things like method bodies
should be defined in a form of a grammar or plug-ins.

2. If you succeed to find the whole cut&pasted method, and this method is
more or less unique according to the search results, this proves it was
cut&pasted. I agree that it may worth to download the whole project the
method was stolen from and do local in-depth analysis. While it may be
useful and worth designing an extensible tool, I believe this is out of the
scope of the initial GSoC task due to availability of exisitng open source
solutions. The first run of our tool may create a list of URLs for further
in-depth analysis and configure script for CPD [1] to perform it.

BTW, I have updated the task [2] on required skills and added prefix to the
mail subject.

[1] http://pmd.sourceforge.net/cpd.html
[2] http://wiki.apache.org/general/SummerOfCode2009#rat-1-cutnpaste


On Sat, Mar 21, 2009 at 8:09 AM, Amila De Silva <ja...@gmail.com> wrote:

> Hi all,
>        I'm Amila. I'm interested in apache RAT cut&paste detector. After
> having a little discussion with Alexei I felt that
> It's worthy to express my ideas about the project through this mailing
> list.
> After going through the mails I found that you have considered about
> saving search engine resources, whilst detecting a plagarised code.
>
> This is what I thought of;
> 1. Say that we are going to check a whole class against the existing
> codes.As the import statement,annotations are common to most of the
> class
> we should avoid them. So first we are going to remove the import
> statements , annotations (and such common features) and have a clean
> code.This is the one
> we will use for searching afterwards.
>
> 2.Than directly using a sliding window mechanism I sugest this:
> First the code can be broken into components. For a java code these
> components are varaibles, methods,comments,...
> Then we are performing the search using these components.
> Suppose that there are five methods in a particular class(so there are
> five method components). We are taking one method out of those then
> perform a search.
> If it has some result,we compare the component we searched with the
> result and calculate number of similarities between them. I think that
> Levenshtein algorithm(
> which is used to calculate word distances) can be used here to
> determine the similarity between two codes (this value is a one
> heuristic value).
> In a similar manner we perform the search using other components also.
> If there are results then it is compared as above and heuristic value
> is
> added to the first.
>
> When performing a search if any component fails to return a result
> ,then we use sliding window mechanism on that component to determine
> its heuristic value.
> First start by getting the first word in the component(here the
> component is a single method), perform the search ,then add the second
> word to it, perform search
> We do this until we find the largest match. After that similarities
> are calculated.
>
> After performing these component search we'll have some value for a
> particular class file.
> This value can be used to determine if the code has been cut&paste.
>
> I'd like to hear your comments on this and start a bit discussion.
>
> Thanks In Advance!
> Best Regards,
> Amila
>



-- 
С уважением,
Алексей Федотов,
http://www.telecom-express.ru/
http://people.apache.org/~aaf/