You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@creadur.apache.org by Marija Šljivović <ma...@gmail.com> on 2009/04/08 07:45:37 UTC

Re: RAT 1 Cut&Paste Detector project

HI!

Firstly, I apologise for my English.

I make a small research for already known algorithms for
code-plagiarism detection There are several of them,
but I am afraid that neither one of them cannot be used in our case
because code for comparation is missing. I will search more about
that.

Lets talk about things we know now and which I can implement.

The greatest problem in this tool which have to be solved is that if
this tool is being run on large source-code, time for completely
processing of it will be very long. Because of that we have to make
logic which recognise "critical" code segments.
 We will try to develope our own heuristic which can recognize which
code is "good" for cut&paste, and then pass that code on examination
(of course, we must have an brute-force (will check everything) mode
in this tool).

There are several ideas for heuristic which will recognise code which
is "good for copy and paste". As I am suggested, which is good and
easy to implement is: If we recognise a comment in source code which
is "uncommon", we can check it on search engines. Identical non-common
comment will be great indication of possible plagiarised code.
 In addition, identical comment with even small amount of "identical"
source code associated with them is the sign of cut&pasted code too.

If we can somehow recognize "independent" code chunks (almost all
variables are in this part of code and this part of code works only
with this variables and returns or sets single result), like
functions, this code will be sent to be checked because code with
these properties is more often plagiarised.

Instead of this we can do it in opposite way by removing code which is
common - for example in Java we use setProperty() and getProperty()
methods very often. If we exclude this code of our code for
comparation, we can send all rest to be checked.

Code "good" for copying is very complex code with little number of
input variables (searching functions, sorting functions...)

I think that just one thing is sure - there is no generic, universal
algorithm with which we will get good results... Well-documented
combination of rules that defines common copied code in combination
with brute-force approach can get best results.



If anybody has any idea or suggestion about this tool, it will be
great to inform us! We together can define which functionalities will
have this plagiariem checker tool.

My very basic prototype of tool is here http://b0ss.on.neobee.net/MinuliRad/RAT/

For the end...in my prototype of this tool I use one interface -
ISearchEngine and all parsers for search engines must implement it. If
we can provide Class name of current ISearchEngine implementation to
main program (by argument for example), we can instantiate object of
that class thought reflection. Why? In that way we can separate
parsers from logic and parsers will be "plug ins" - I will make parser
for krugle.org and place it in classpath of this tool, run tool with
ClassName of parser plug in tool arguments and have my own
implementation of krugle.org code search. Somebody else can provide
another implementation..., all plugins will be simple .jar files in
RAT classpath...


Best regards,

Marija Sljivovic

Re: RAT 1 Cut&Paste Detector project

Posted by Alexei Fedotov <al...@gmail.com>.
Hello Marija,
I'm impressed with your progress in learning things and prototyping a
solution. As for your question about useful reading, there is a bunch
of fun examples of Google Code Search usage here [1].

I agree that there is no one best algorithm to approach the problem.
My suggestion on this is to make it modular and parameterizable.
Instead of removing common code we can just slide a window through the
code, e.g. do something like that:

IHeuristic heuristic = new DictionaryHeuristic(commonWordProbabilities);
ILanguage language = new CPlusPlusLanguage();

while (window.slide()) {
  String content = window.content();
  if (heuristic.unique(content) + language.independent(content) > LIMIT) {
    ISearchResult result =
searchEngine.search(language.replaceVarsWithWildcards(content),
excludeUrls);
    if (result != null) {
      report.add(result);
    }
  }
}

Please, note that I'm not a fluent Java programmer, hence, you may
discover a better way to express this algorithm. The main intention of
this example is to show that the algorithm itself may be freed from
language and dictionary specifics.


BTW. Your English is ok, don't be sorry about it.

[1] http://kottke.org/06/10/google-code-search


2009/4/8 Marija Šljivović <ma...@gmail.com>:
> HI!
>
> Firstly, I apologise for my English.
>
> I make a small research for already known algorithms for
> code-plagiarism detection There are several of them,
> but I am afraid that neither one of them cannot be used in our case
> because code for comparation is missing. I will search more about
> that.
>
> Lets talk about things we know now and which I can implement.
>
> The greatest problem in this tool which have to be solved is that if
> this tool is being run on large source-code, time for completely
> processing of it will be very long. Because of that we have to make
> logic which recognise "critical" code segments.
>  We will try to develope our own heuristic which can recognize which
> code is "good" for cut&paste, and then pass that code on examination
> (of course, we must have an brute-force (will check everything) mode
> in this tool).
>
> There are several ideas for heuristic which will recognise code which
> is "good for copy and paste". As I am suggested, which is good and
> easy to implement is: If we recognise a comment in source code which
> is "uncommon", we can check it on search engines. Identical non-common
> comment will be great indication of possible plagiarised code.
>  In addition, identical comment with even small amount of "identical"
> source code associated with them is the sign of cut&pasted code too.
>
> If we can somehow recognize "independent" code chunks (almost all
> variables are in this part of code and this part of code works only
> with this variables and returns or sets single result), like
> functions, this code will be sent to be checked because code with
> these properties is more often plagiarised.
>
> Instead of this we can do it in opposite way by removing code which is
> common - for example in Java we use setProperty() and getProperty()
> methods very often. If we exclude this code of our code for
> comparation, we can send all rest to be checked.
>
> Code "good" for copying is very complex code with little number of
> input variables (searching functions, sorting functions...)
>
> I think that just one thing is sure - there is no generic, universal
> algorithm with which we will get good results... Well-documented
> combination of rules that defines common copied code in combination
> with brute-force approach can get best results.
>
>
>
> If anybody has any idea or suggestion about this tool, it will be
> great to inform us! We together can define which functionalities will
> have this plagiariem checker tool.
>
> My very basic prototype of tool is here http://b0ss.on.neobee.net/MinuliRad/RAT/
>
> For the end...in my prototype of this tool I use one interface -
> ISearchEngine and all parsers for search engines must implement it. If
> we can provide Class name of current ISearchEngine implementation to
> main program (by argument for example), we can instantiate object of
> that class thought reflection. Why? In that way we can separate
> parsers from logic and parsers will be "plug ins" - I will make parser
> for krugle.org and place it in classpath of this tool, run tool with
> ClassName of parser plug in tool arguments and have my own
> implementation of krugle.org code search. Somebody else can provide
> another implementation..., all plugins will be simple .jar files in
> RAT classpath...
>
>
> Best regards,
>
> Marija Sljivovic
>



-- 
With best regards / с наилучшими пожеланиями,
Alexei Fedotov / Алексей Федотов,
http://www.telecom-express.ru/
http://people.apache.org/~aaf/
http://harmony.apache.org/
http://code.google.com/p/openmeetings/