Posted to dev@creadur.apache.org by Marija Šljivović <ma...@gmail.com> on 2009/06/16 22:03:32 UTC

apache-rat-pd

Hi!
I am working on a copy&paste (plagiarism) detector.
You can find information about the project and reports on my progress at these
locations:
http://wiki.apache.org/general/MarijaSljivovic/SoC2009ApacheRatProposal
https://issues.apache.org/jira/browse/RAT-45
or get the source code and binary distributions from:
http://code.google.com/p/apache-rat-pd/
I am now thinking of writing some heuristic checkers for misspellings. These
algorithms will be able to spot misspelled words in source code.
The part of the code that contains them will then be sent to a code search
engine (Google Code Search, for example) to check whether it can find any
similar misspellings in public code bases.
That way we can estimate how likely it is that a part of the code is plagiarised.
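
As a rough illustration of the idea (a sketch only, not code from apache-rat-pd): split identifiers into words, keep the ones that are missing from a word list, and turn them into a query string. The word list, the function names and the query format are all assumptions made up for this example, and no real search engine is called.

```python
import re

# Hypothetical, tiny word list; a real checker would load a full dictionary.
KNOWN_WORDS = {"buffer", "length", "return", "received", "message", "size"}


def split_words(source):
    """Split source text into lowercase words, breaking camelCase and snake_case."""
    words = []
    for token in re.findall(r"[A-Za-z]+", source):
        # Break tokens such as getRecievedLenght or HTTPServer into parts.
        parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", token)
        words.extend(part.lower() for part in parts)
    return words


def suspicious_words(source, known=KNOWN_WORDS, min_length=4):
    """Return distinct words missing from the word list, in order of first appearance."""
    found = []
    for word in split_words(source):
        if len(word) >= min_length and word not in known and word not in found:
            found.append(word)
    return found


def build_query(words):
    """Join the suspicious words into a quoted query string (format is an assumption)."""
    return " ".join(f'"{word}"' for word in words)


snippet = "int getRecievedLenght(Buffer buffer) { return buffer.length; }"
print(suspicious_words(snippet))
print(build_query(suspicious_words(snippet)))
```

Short words such as "int" or "get" are skipped by the `min_length` filter, since keywords and common prefixes would otherwise drown out the rare misspellings that make a match interesting.
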
I am now looking for an open source library which can be used for this task. I
found one: jazzy ( http://jazzy.sourceforge.net/ ) and I think that it is
good for this purpose.
Can anyone suggest another solution that is better than jazzy?
Work on apache-rat-pd (plagiarism detector) is continuing. If you have any
suggestions or advice, please let me know.
Best regards,
Marija

Re: apache-rat-pd

Posted by Marija Šljivović <ma...@gmail.com>.
Thanks!

> I'm not sure whether it would be better but an alternative approach
> would be to use a semi-structured text analysis tool for example UIMA
> (http://incubator.apache.org/uima/) or lucene
> for lucene, start by looking at
> http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/spellchecker
> and then create a custom dictionary by tokenising a large number of
> source files
> robert


I will take a look... I did not know that lucene has a spell checker...
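
To make the "custom dictionary" suggestion concrete, here is a minimal sketch of my understanding of it. It does not use the Lucene spellchecker API: the standard-library `difflib` matcher stands in for it, and the corpus, function names and thresholds are assumptions chosen for illustration.

```python
import re
from collections import Counter
from difflib import get_close_matches


def build_dictionary(sources, min_count=2):
    """Collect words that occur at least min_count times across the source texts."""
    counts = Counter()
    for text in sources:
        counts.update(word.lower() for word in re.findall(r"[A-Za-z]{3,}", text))
    return {word for word, n in counts.items() if n >= min_count}


def suggest(word, dictionary, cutoff=0.75):
    """Return the closest dictionary word, or an empty list if nothing is close enough."""
    return get_close_matches(word.lower(), sorted(dictionary), n=1, cutoff=cutoff)


# Hypothetical stand-in for "a large number of source files".
corpus = [
    "public int length() { return length; }",
    "private int length; // buffer length",
    "public Buffer buffer() { return buffer; }",
]
dictionary = build_dictionary(corpus)

print(sorted(dictionary))
print(suggest("lenght", dictionary))
```

Because the dictionary is built from code rather than from an English word list, common identifiers and keywords count as correctly spelled, and a word is flagged only when it is close to, but not the same as, something programmers usually write.
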

> probably best to make the API pluggable (jazzy is LGPL but this is good
> advice in any case)

Making the checkers pluggable is a good idea.
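
A minimal sketch of what "pluggable" could look like, assuming one small interface that every checker implements. All class and method names here are hypothetical. A jazzy- or Lucene-backed checker would simply be another implementation of the same interface, so a library with a license such as the LGPL could stay an optional backend.

```python
from abc import ABC, abstractmethod


class MisspellingChecker(ABC):
    """Interface every checker backend implements (hypothetical name)."""

    @abstractmethod
    def misspelled(self, words):
        """Return the words from the input that the backend considers misspelled."""


class WordListChecker(MisspellingChecker):
    """Simple built-in backend that checks words against a fixed word list."""

    def __init__(self, known_words):
        self._known = {word.lower() for word in known_words}

    def misspelled(self, words):
        return [word for word in words if word.lower() not in self._known]


class Detector:
    """Depends only on the interface, so the backend can be swapped freely."""

    def __init__(self, checker):
        self._checker = checker

    def suspicious(self, words):
        return self._checker.misspelled(words)


detector = Detector(WordListChecker(["buffer", "length"]))
print(detector.suspicious(["Buffer", "lenght"]))
```
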

Best regards,
Marija

Re: apache-rat-pd

Posted by Robert Burrell Donkin <ro...@blueyonder.co.uk>.
Robert Burrell Donkin wrote:
> Marija Šljivović wrote:
>> Hi!
>> I am working on a copy&paste (plagiarism) detector.
> 
> cool
> 
>> You can find information about the project and reports on my progress at these
>> locations:
>> http://wiki.apache.org/general/MarijaSljivovic/SoC2009ApacheRatProposal
>> https://issues.apache.org/jira/browse/RAT-45
>> or get the source code and binary distributions from:
>> http://code.google.com/p/apache-rat-pd/
>> I am now thinking of writing some heuristic checkers for misspellings. These
>> algorithms will be able to spot misspelled words in source code.
>> The part of the code that contains them will then be sent to a code search
>> engine (Google Code Search, for example) to check whether it can find any
>> similar misspellings in public code bases.
>> That way we can estimate how likely it is that a part of the code is plagiarised.
>> I am now looking for an open source library which can be used for this task. I
>> found one: jazzy ( http://jazzy.sourceforge.net/ ) and I think that it is
>> good for this purpose.
> 
> probably best to make the API pluggable (jazzy is LGPL but this is good
> advice in any case)
> 
>> Can anyone suggest another solution that is better than jazzy?
> 
> i'm not sure whether it would be better but an alternative approach
> would be to use a semi-structured text analysis tool for example UIMA
> (http://incubator.apache.org/uima/) or lucene

for lucene, start by looking at
http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/spellchecker/
and then create a custom dictionary by tokenising a large number of
source files

- robert


Re: apache-rat-pd

Posted by Robert Burrell Donkin <ro...@blueyonder.co.uk>.
Marija Šljivović wrote:
> Hi!
> I am working on a copy&paste (plagiarism) detector.

cool

> You can find information about the project and reports on my progress at these
> locations:
> http://wiki.apache.org/general/MarijaSljivovic/SoC2009ApacheRatProposal
> https://issues.apache.org/jira/browse/RAT-45
> or get the source code and binary distributions from:
> http://code.google.com/p/apache-rat-pd/
> I am now thinking of writing some heuristic checkers for misspellings. These
> algorithms will be able to spot misspelled words in source code.
> The part of the code that contains them will then be sent to a code search
> engine (Google Code Search, for example) to check whether it can find any
> similar misspellings in public code bases.
> That way we can estimate how likely it is that a part of the code is plagiarised.
> I am now looking for an open source library which can be used for this task. I
> found one: jazzy ( http://jazzy.sourceforge.net/ ) and I think that it is
> good for this purpose.

probably best to make the API pluggable (jazzy is LGPL but this is good
advice in any case)

> Can anyone suggest another solution that is better than jazzy?

i'm not sure whether it would be better but an alternative approach
would be to use a semi-structured text analysis tool for example UIMA
(http://incubator.apache.org/uima/) or lucene

> Work on apache-rat-pd (plagiarism detector) is continuing.

great :-)

- robert