Posted to dev@creadur.apache.org by "Marija Sljivovic (JIRA)" <ji...@apache.org> on 2009/04/14 22:27:14 UTC

[jira] Created: (RAT-45) Apache RAT copy&paste detector - tool for detecting copied(plagiarised) code by searching on web code search engines

Apache RAT copy&paste detector - tool for detecting copied(plagiarised) code by searching on web code search engines
--------------------------------------------------------------------------------------------------------------------

                 Key: RAT-45
                 URL: https://issues.apache.org/jira/browse/RAT-45
             Project: RAT
          Issue Type: New Feature
         Environment: This improvement to the Apache RAT tool will be written in Java.
Requirements: OS with a Java runtime environment (RE) already installed, and an Internet connection
            Reporter: Marija Sljivovic


This document is about implementing a new tool which will be included in the Apache RAT project.
Original idea: http://wiki.apache.org/general/SummerOfCode2009#rat-project

The aim is to create a working, modular, configurable command-line tool
for searching web-based code search engines for possibly plagiarised code in our code bases.

The tool will be heuristic in nature. It will make guesses about code parts.
If it decides that a code part is a likely candidate for copy&paste, it will check whether there is matching code on code search engines.
That part of the code will be stored in a report if any match is found.
A person who reads this report will decide whether the code is really copied or not.

The algorithm at the base of this tool is a variant of the sliding-window algorithm.
The code parts which the algorithm generates will be checked by different heuristic methods and optionally
will be sent to a code search engine for checking.
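
A minimal sketch of the sliding-window idea described above, under stated assumptions: the window is counted in source lines, the heuristic is a plain predicate, and the names SlidingWindowSketch and suspiciousWindows are hypothetical, not taken from the attached code. The step of sending a flagged window to a search engine is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Illustration only: a line-based sliding window with a pluggable heuristic. */
class SlidingWindowSketch {

    /** Returns every window of windowSize consecutive lines that the heuristic flags. */
    static List<String> suspiciousWindows(List<String> lines, int windowSize,
                                          Predicate<String> heuristic) {
        List<String> flagged = new ArrayList<>();
        // Slide the window one line at a time over the whole file.
        for (int start = 0; start + windowSize <= lines.size(); start++) {
            String window = String.join("\n", lines.subList(start, start + windowSize));
            if (heuristic.test(window)) {
                flagged.add(window); // candidate to be sent to a code search engine
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList(
                "int a = 1;",
                "// quick sort taken from somewhere",
                "sort(data);",
                "return a;");
        // Assumed heuristic: a window that contains a line comment is worth checking.
        List<String> hits = suspiciousWindows(source, 2, w -> w.contains("//"));
        System.out.println(hits.size()); // prints 2
    }
}
```

Keeping the heuristic as a parameter means the window generation does not need to know which checks are applied or in which order.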

More information and ideas about this project can be found here:
http://wiki.apache.org/general/MarijaSljivovic/SoC2009ApacheRatProposal

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (RAT-45) Apache RAT copy&paste detector - tool for detecting copied(plagiarised) code by searching on web code search engines

Posted by "Marija Sljivovic (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/RAT-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marija Sljivovic updated RAT-45:
--------------------------------

    Attachment: apache-rat-pd-src-0.05.zip

After the code reviews (https://issues.apache.org/jira/browse/RAT-45) it was decided not to use enums and if/switch, but inheritance and polymorphism where possible.
This leads to simpler classes. JavaCommentsHeuristicChecker and PascalCommentsHeuristicChecker have been written.
These classes use regular expressions for comment matching. Some tests for these classes have been written, too.
I will now think about the ISearchEngine interface in light of the Google Code Search API and try to write a parser using the Google Code Search libraries. This is a very important part of this tool.
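
For readers without the attachment, here is a rough sketch of how regex-based comment checkers can share one base class instead of an enum plus switch. Only the two class names come from this message; the base class, the regular expressions, the method names and the limit comparison are assumptions, and the patterns are naive (for example, they also match comment markers inside string literals).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical base type: each language only supplies its own comment pattern. */
abstract class CommentsHeuristicChecker {
    private final Pattern commentPattern;

    protected CommentsHeuristicChecker(String regex) {
        // DOTALL lets ".*?" span line breaks inside block comments.
        this.commentPattern = Pattern.compile(regex, Pattern.DOTALL);
    }

    /** Counts the comments found in the given code part. */
    int countComments(String code) {
        Matcher matcher = commentPattern.matcher(code);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }
}

class JavaCommentsHeuristicChecker extends CommentsHeuristicChecker {
    JavaCommentsHeuristicChecker() {
        // "//" up to the end of the line, or a "/* ... */" block (reluctant match).
        super("//[^\\n]*|/\\*.*?\\*/");
    }
}

class PascalCommentsHeuristicChecker extends CommentsHeuristicChecker {
    PascalCommentsHeuristicChecker() {
        // "{ ... }" or "(* ... *)" blocks.
        super("\\{.*?\\}|\\(\\*.*?\\*\\)");
    }
}

class CheckerDemo {
    /** The caller works with the base type only, so no switch on the language is needed. */
    static boolean exceedsLimit(CommentsHeuristicChecker checker, String code, int limit) {
        return checker.countComments(code) > limit;
    }

    public static void main(String[] args) {
        String javaCode = "int a; // one\n/* two\n lines */ int b;";
        System.out.println(new JavaCommentsHeuristicChecker().countComments(javaCode)); // prints 2
    }
}
```

Adding another language then means adding one small subclass, without touching the calling code.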

>         Attachments: apache-rat-pd(maven included)0.03.zip, apache-rat-pd-0.02.zip, apache-rat-pd-src-0.04.zip, apache-rat-pd-src-0.05.zip, copyandpaste.zip, copyandpastedetector-src-0.01.zip, pom.xml
>
>   Original Estimate: 2688h
>  Remaining Estimate: 2688h


[jira] Commented: (RAT-45) Apache RAT copy&paste detector - tool for detecting copied(plagiarised) code by searching on web code search engines

Posted by "Alexei Fedotov (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/RAT-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701108#action_12701108 ] 

Alexei Fedotov commented on RAT-45:
-----------------------------------

When designing the search engine interface, please take a look at the Google Code Search API. They use the compatible APL 2.0, so we can at least take a look. It would be nice if we could reuse some of their code. The ground rule for reuse is to avoid mixing Google code with ours.

http://code.google.com/intl/en/apis/codesearch/
http://gdata-java-client.googlecode.com/files/gdata-src.java-1.26.0.java.zip

>         Attachments: copyandpaste.zip, copyandpastedetector-src-0.01.zip


[jira] Updated: (RAT-45) Apache RAT copy&paste detector - tool for detecting copied(plagiarised) code by searching on web code search engines

Posted by "Alexei Fedotov (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/RAT-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexei Fedotov updated RAT-45:
------------------------------

    Attachment: pom.xml

Attaching a very basic example of the POM file we need to compile the project with "mvn install".



>         Attachments: apache-rat-pd-0.02.zip, copyandpaste.zip, copyandpastedetector-src-0.01.zip, pom.xml


[jira] Commented: (RAT-45) Apache RAT copy&paste detector - tool for detecting copied(plagiarised) code by searching on web code search engines

Posted by "Alexei Fedotov (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/RAT-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701091#action_12701091 ] 

Alexei Fedotov commented on RAT-45:
-----------------------------------

Thanks for the update.

Let me start the code review. Regardless of their style, my comments are not mandatory to fix - they just reflect what I think.

1. The directory name should be aligned with other rat artifacts, e.g. apache-rat-<a short token>. A three-letter token would be nice.
2. pom.xml is missing. Once you read more about maven, you will see that this affects the whole directory structure. You would get 
    apache-rat-<the short token>/src/main/java/org/
    apache-rat-<the short token>/src/test/java/org/
3. The package names should be org.apache.rat.<the short token>
4. raport->report
As a general rule I suggest using the Eclipse IDE for development. It has an embedded spell checker.

5. ReadMe.txt - I like the content. The typical name of this file is given here: http://en.wikipedia.org/wiki/README Mixed case is from rare Windows dialect.

6. package "common" - I believe there should be a "util" package for the different manipulators. I believe there should be a separate package for language parser implementations.
7. package "tool" - well, the whole thing is the tool. This may be "core".
8. package "searchengines" - it would be nice to have something shorter, e.g. "engines"?

9. ISearchEngine.java - the license text should be the first comment.
As for the search engine, it should provide the following info:
    1. whether a given pattern is found
    2. how often a given pattern appears in different projects
    3. if the pattern is not too common, where it is found

The interface requires more work to reflect this logic (well, at least the number of interface functions does not match).
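
To make the three requirements above concrete, one possible shape of the interface is sketched below. The method names and the toy in-memory engine are purely hypothetical; they illustrate the three queries and are not the attached ISearchEngine or any real search engine API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the three queries listed above; all method names are assumptions. */
interface ISearchEngine {
    /** 1. Is the given pattern found at all? */
    boolean isFound(String pattern);

    /** 2. In how many different projects does the pattern appear? */
    int countProjects(String pattern);

    /** 3. Where is the pattern found (meant for patterns that are not too common)? */
    List<String> findLocations(String pattern);
}

/** Toy engine over in-memory sources, used only to exercise the interface offline. */
class InMemorySearchEngine implements ISearchEngine {
    private final Map<String, String> sourcesByProject = new LinkedHashMap<>();

    void addProject(String name, String source) {
        sourcesByProject.put(name, source);
    }

    @Override
    public List<String> findLocations(String pattern) {
        List<String> projects = new ArrayList<>();
        for (Map.Entry<String, String> entry : sourcesByProject.entrySet()) {
            if (entry.getValue().contains(pattern)) {
                projects.add(entry.getKey());
            }
        }
        return projects;
    }

    @Override
    public int countProjects(String pattern) {
        return findLocations(pattern).size();
    }

    @Override
    public boolean isFound(String pattern) {
        return countProjects(pattern) > 0;
    }
}
```

A caller could compare countProjects against a threshold first and only ask for locations when the pattern is rare enough to be interesting.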

10. enum ProgramingLanguages - probably not needed

11. There are general rules for writing javadoc. I do not request that all methods be documented in this way, but it would be nice to have all interface methods documented. This would save us from writing architectural documents and from misunderstandings.
http://java.sun.com/j2se/javadoc/writingdoccomments/#format

Finally, you are doing a great job! Thanks!



>         Attachments: copyandpaste.zip, copyandpastedetector-src-0.01.zip


[jira] Updated: (RAT-45) Apache RAT copy&paste detector - tool for detecting copied(plagiarised) code by searching on web code search engines

Posted by "Marija Sljivovic (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/RAT-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marija Sljivovic updated RAT-45:
--------------------------------

    Attachment: apache-rat-pd-0.1.1.zip

This is the up-to-date version of apache-rat-pd.

>         Attachments: apache-rat-pd(maven included)0.03.zip, apache-rat-pd-0.02.zip, apache-rat-pd-0.1.1.zip, apache-rat-pd-src-0.04.zip, apache-rat-pd-src-0.05.zip, copyandpaste.zip, copyandpastedetector-src-0.01.zip, pom.xml


[jira] Updated: (RAT-45) Apache RAT copy&paste detector - tool for detecting copied(plagiarised) code by searching on web code search engines

Posted by "Marija Sljivovic (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/RAT-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marija Sljivovic updated RAT-45:
--------------------------------

    Attachment: copyandpastedetector-src-0.01.zip

Support for heuristic algorithms has been added to the core algorithm.

>         Attachments: copyandpaste.zip, copyandpastedetector-src-0.01.zip


[jira] Commented: (RAT-45) Apache RAT copy&paste detector - tool for detecting copied(plagiarised) code by searching on web code search engines

Posted by "Marija Sljivovic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/RAT-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701324#action_12701324 ] 

Marija Sljivovic commented on RAT-45:
-------------------------------------

Firstly, I would like to thank you for allowing me to be a part of this project.
I will do my best to justify your trust in me.

About the suggestions:
1. I like long, descriptive names, but I agree with you in this situation (what do you think about apache-rat-pd (Plagiarism Detector) or apache-rat-plagdet?)
2. I know very little about maven, but I will read more about it.
3. Using the reverse domain naming convention is useful. Thanks for the suggestion. I will do that.
4. It is a typo... I always use the Eclipse spell checker, but it can't tell when a package name is misspelled...
5. "Mixed case is from rare Windows dialect." I like this comment. README is better for me, too.
6. This prototype uses a class for loading a source file from the file system, and when I thought about where to place this class I decided to create the common directory. I believe that I can use org.apache.rat.DirectoryWalker for reading a whole source directory, so FileManiplator will be deleted. The common package will then be deleted too.
7, 8. OK.
9. I will think more about this.
10. The language enum can be an inner class or something. Let's use it for a while, and if we decide that it is no longer useful, we will delete it.
11. I like well-documented code. I will try to document this code according to the standards you gave me. Thank you for the link.

About the Google Code Search API: I studied these libraries several days ago. I was afraid that their licence would be restrictive, but it is the Apache License, so I suppose that I can use this API in the parser for Google Code Search. There will be no mixing of code at all...
I am even thinking of instantiating all parsers using reflection, including the Google Code Search parser.
This way parsers will be plugins for our application, and we will not have problems with the licence of any jar library in any parser.
We can also have more than one parser for one code search engine... if anyone wants to write another plugin.
What do you think about it?
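
A small sketch of what instantiating parsers by reflection could look like. Everything in it (SearchEngineParser, DummyParser, ParserLoader) is a hypothetical name made up for illustration, and it assumes each plugin class has a no-argument constructor; reading the class names from a configuration file is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical plugin contract that every parser would implement. */
interface SearchEngineParser {
    String engineName();
}

/** Stand-in plugin so the sketch runs without any external jar. */
class DummyParser implements SearchEngineParser {
    public DummyParser() {
    }

    @Override
    public String engineName() {
        return "dummy";
    }
}

class ParserLoader {
    /** Instantiates each named class through its no-argument constructor. */
    static List<SearchEngineParser> load(List<String> classNames) {
        List<SearchEngineParser> parsers = new ArrayList<>();
        for (String name : classNames) {
            try {
                // asSubclass rejects classes that do not implement the contract.
                Class<? extends SearchEngineParser> cls =
                        Class.forName(name).asSubclass(SearchEngineParser.class);
                parsers.add(cls.getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot load parser " + name, e);
            }
        }
        return parsers;
    }

    public static void main(String[] args) {
        List<SearchEngineParser> parsers =
                load(Arrays.asList(DummyParser.class.getName()));
        System.out.println(parsers.get(0).engineName()); // prints dummy
    }
}
```

Because the application refers to parsers only by class name, a parser and the libraries it depends on can live in a separate jar that is simply added to the classpath.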

Thank you for these suggestions. I found them very useful.
I will correct the source according to this list very soon.

>         Attachments: copyandpaste.zip, copyandpastedetector-src-0.01.zip


[jira] Updated: (RAT-45) Apache RAT copy&paste detector - tool for detecting copied(plagiarised) code by searching on web code search engines

Posted by "Marija Sljivovic (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/RAT-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marija Sljivovic updated RAT-45:
--------------------------------

    Attachment: apache-rat-pd-src-0.04.zip

Improved maven support. The issue with the wrong package structure in the jar file generated by maven has been corrected.
A custom manifest file has been added to the project.
The project depends on apache-rat-core (apache-rat-core-0.6.jar), commons-lang-2.1.jar and junit (junit-3.8.1.jar).
Explanations about building and running apache-rat-pd have been added.

>         Attachments: apache-rat-pd(maven included)0.03.zip, apache-rat-pd-0.02.zip, apache-rat-pd-src-0.04.zip, copyandpaste.zip, copyandpastedetector-src-0.01.zip, pom.xml


[jira] Updated: (RAT-45) Apache RAT copy&paste detector - tool for detecting copied(plagiarised) code by searching on web code search engines

Posted by "Marija Sljivovic (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/RAT-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marija Sljivovic updated RAT-45:
--------------------------------

    Attachment: apache-rat-pd(maven included)0.03.zip

Thank you for the suggestions and for the help. I made a first version of the pom.xml file for this project, but I think that it is not good enough.
Support for maven has been added; the project depends on apache-rat-core (apache-rat-core-0.6.jar) and junit (junit-3.8.1.jar).
Maven can build the project, but the package structure inside the jar is not correct. It starts with the src package... This must be corrected.

>         Attachments: apache-rat-pd(maven included)0.03.zip, apache-rat-pd-0.02.zip, copyandpaste.zip, copyandpastedetector-src-0.01.zip, pom.xml


[jira] Updated: (RAT-45) Apache RAT copy&paste detector - tool for detecting copied(plagiarised) code by searching on web code search engines

Posted by "Marija Sljivovic (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/RAT-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marija Sljivovic updated RAT-45:
--------------------------------

    Attachment: apache-rat-pd-0.02.zip

The project structure has been changed following some of the recommendations which can be found at: http://issues.apache.org/jira/browse/RAT-45, but there is still room for improvement.
Basic support for reading whole directories of source files has been added.
Maven support is still unfinished. Several recommendations are still waiting to be implemented. The ISearchEngine interface is still the same, but a new implementation of GoogleCodeSearchParser is in development. It will use the Google Code Search API.
More about it at: http://wiki.apache.org/general/MarijaSljivovic/SoC2009ApacheRatProposal

>         Attachments: apache-rat-pd-0.02.zip, copyandpaste.zip, copyandpastedetector-src-0.01.zip


[jira] Commented: (RAT-45) Apache RAT copy&paste detector - tool for detecting copied(plagiarised) code by searching on web code search engines

Posted by "Aleksey Shipilev (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/RAT-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704008#action_12704008 ] 

Aleksey Shipilev commented on RAT-45:
-------------------------------------

Hi Marija!

Overall, the project looks great! I just have some minor issues to raise:

1. Please follow the Java coding conventions. They are available in this comprehensive guide: http://java.sun.com/docs/codeconv/html/CodeConvTOC.doc.html

2. Please use shorter names where possible. If you are running out of distinct names within the current package, it's probably time to create a new package, rather than resorting to this:

	private int checkByJavaCommentHueristicCheckers(String codeToBeChecked) {
		return checkByJavaSlashSlashCommentHueristicChecker(codeToBeChecked)
				+ checkByJavaSlashStarCommentHueristicChecker(codeToBeChecked);
	}

3. Please use proper OOP idioms, e.g. polymorphism where appropriate. Consider this:

		switch (language) {
		case Java:
			toret = checkByJavaCommentHueristicCheckers(codeToBeChecked) > limit;
			break;

A switch statement here is a maintenance disaster! You should use a polymorphic call here... and this leads to:

4. Please don't be afraid of creating another class if the task decomposition calls for it. Your check* methods can go into separate classes implementing the same interface. That would be a scalable and maintainable solution, unlike overblown switch-case statements.

5. To enforce the rules above, use FindBugs early and often. It will break most bad coding habits for you. I haven't seen anyone run with the strictest checks, but the relaxed checks help a lot. It will also tell you a lot more than I can :)

6. I'm eager to try this project in practice, but I got stuck on how to run your prototype. Ideally, there should be an intuitive way to check out the project, build it, and try it. Maven is good for that.
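
Points 3 and 4 could be sketched roughly as follows. This is only an illustration under assumptions: Checker, TokenChecker, SumChecker and CheckerDemo are made-up names, not classes from the attached prototype, and scoring by naively counting comment markers is a stand-in for whatever the real heuristics compute (it would, for instance, also count the "//" in a URL).

```java
import java.util.Arrays;
import java.util.List;

/** One heuristic; in this sketch a higher score means "more worth searching for". */
interface Checker {
    int score(String code);
}

/** Scores code by counting non-overlapping occurrences of a fixed token. */
final class TokenChecker implements Checker {
    private final String token;

    TokenChecker(String token) {
        if (token.isEmpty()) {
            throw new IllegalArgumentException("token must not be empty");
        }
        this.token = token;
    }

    @Override
    public int score(String code) {
        int count = 0;
        int from = code.indexOf(token);
        while (from >= 0) {
            count++;
            from = code.indexOf(token, from + token.length());
        }
        return count;
    }
}

/** Sums the scores of several checkers, replacing a hand-written "a() + b()" method. */
final class SumChecker implements Checker {
    private final List<Checker> parts;

    SumChecker(Checker... parts) {
        this.parts = Arrays.asList(parts);
    }

    @Override
    public int score(String code) {
        int total = 0;
        for (Checker part : parts) {
            total += part.score(code);
        }
        return total;
    }
}

final class CheckerDemo {
    /** Hypothetical factory: the two Java comment heuristics as one composite checker. */
    static Checker javaComments() {
        return new SumChecker(new TokenChecker("//"), new TokenChecker("/*"));
    }

    /** No switch on the language: the caller asks whichever checker it was given. */
    static boolean exceeds(Checker checker, String code, int limit) {
        return checker.score(code) > limit;
    }

    public static void main(String[] args) {
        String code = "int a; // one\n/* two */ int b; // three";
        System.out.println(exceeds(javaComments(), code, 2));
    }
}
```

Supporting another language then means writing one more Checker and passing it in, rather than adding a case to every switch.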

You have the potential to go. Go! :)



[jira] Updated: (RAT-45) Apache RAT copy&paste detector - tool for detecting copied(plagiarised) code by searching on web code search engines

Posted by "Marija Sljivovic (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/RAT-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marija Sljivovic updated RAT-45:
--------------------------------

    Attachment: copyandpaste.zip

This is the first, very basic prototype of the copy&paste checker tool.
This simple demo tool checks whether source code was copy&pasted from somewhere by searching
Google Code Search.
The tool is still unfinished and has only very basic usability.
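
The issue description calls the underlying algorithm a variant of the sliding-window algorithm. A minimal sketch of that idea, assuming line-based windows and a toy "is this worth searching" heuristic, might look like the following. SlidingWindow, windows and worthSearching are hypothetical names, the window size, step and threshold are arbitrary, and the call to a code search engine is deliberately left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the prototype's actual implementation.
final class SlidingWindow {

    /** Returns every run of size consecutive lines, advancing step lines at a time. */
    static List<String> windows(List<String> lines, int size, int step) {
        if (size <= 0 || step <= 0) {
            throw new IllegalArgumentException("size and step must be positive");
        }
        List<String> result = new ArrayList<>();
        for (int start = 0; start + size <= lines.size(); start += step) {
            result.add(String.join("\n", lines.subList(start, start + size)));
        }
        return result;
    }

    /** Toy heuristic: a window qualifies if it has at least minChars non-whitespace characters. */
    static boolean worthSearching(String window, int minChars) {
        return window.replaceAll("\\s+", "").length() >= minChars;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "int sum = 0;",
                "for (int i = 0; i < n; i++) {",
                "    sum += values[i];",
                "}");
        for (String window : windows(lines, 2, 1)) {
            if (worthSearching(window, 20)) {
                // A real tool would send this window to a code search engine here.
                System.out.println("candidate:\n" + window + "\n");
            }
        }
    }
}
```

Only windows that pass the heuristic would be sent to a search engine, which keeps the number of queries down.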
