Posted to dev@creadur.apache.org by maka82 <ma...@gmail.com> on 2009/06/24 23:57:52 UTC

Questions about limitations of use Google Code Search engine

Hi.
I am working on my project, apache-rat-pd.
The Apache RAT plagiarism detector is a command-line tool that searches
a code base for possibly plagiarized code using web code search engines.
This project is part of Google Summer of Code 2009 and is mentored
by Apache.

The idea is to query code search engines (such as Google Code Search [1],
Koders [2] or Krugle [3])
to check whether the code we send in the query was copied from somewhere.
More information about the project can be found at http://code.google.com/p/apache-rat-pd/

Our initial plan was to make it work with Google Code Search first,
because it is open to developers, has custom libraries, and has
great support for searching by regular expressions.
So far, I have created an initial version of this tool. It sends a piece
of code we suspect of being plagiarized as a query. We have run into some
problems and need your help to resolve them.

Sometimes, when we send Google Code Search a large number of queries in
a short period of time, the engine starts rejecting our queries.

So we have some questions:

 1. Does Google Code Search have some sort of DDoS attack [4] protection
 mechanism that we are triggering?
 If so, how can we avoid this behaviour? What rules must we follow?

 2. Our aim is to locate plagiarised code, so we sometimes send
Google Code Search very long queries when the code fragment is large.
Is there a limit on query length?

 3. The regular expression generator in apache-rat-pd may contain
mistakes, and we sometimes get false positive results when querying with
code that is not plagiarized. We ask Google Code Search to find a code
fragment using our regular expression query, and sometimes the engine
returns results that only partially match our code.
Do you have any advice on how to get only exact matches, to avoid false
positives?

We use the gdata-codesearch-2.0 library to communicate with the Google
Code Search engine.

Example query for a simple HelloWorld.java:
Code:
public class HelloWorld {
    public static void main(String argv[]) {
      System.out.println("Hello World.");
    }
 }

 Query:
 
http://www.google.com/codesearch/feeds/search?q=public(\s?)+class(\s?)+HelloWorld(\s?)+\{(\s?)+public(\s?)+static(\s?)+void(\s?)+main(\s?)+\((\s?)+String(\s?)+argv(\s?)+\[(\s?)+\](\s?)+\)(\s?)+\{(\s?)+System(\s?)+\.(\s?)+out(\s?)+\.(\s?)+println(\s?)+\((\s?)+"Hello(\s?)+World(\s?)+\.(\s?)+"(\s?)+\)(\s?)+;(\s?)+\}(\s?)+\}&start-index=null&max-results=null
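
For illustration, here is a minimal Java sketch of how a whitespace-tolerant query like the one above could be generated and URL-encoded. It is not the actual apache-rat-pd generator: the names QuerySketch, toRegex and toUrl and the tokenizing rule are assumptions made for this example only. The sketch puts \s+ between two adjacent words and \s* everywhere else, and it percent-encodes the query, because an unencoded + in a URL query string is commonly decoded as a space. The URLEncoder overload used here needs Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of apache-rat-pd or the gdata library.
final class QuerySketch {
    // A token is a run of word characters or one punctuation character.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");
    // Characters that must be escaped to be matched literally in a regex.
    private static final String META = "\\.[]{}()*+?^$|";
    // Feed URL copied from the query shown in this message.
    private static final String FEED =
            "http://www.google.com/codesearch/feeds/search?q=";

    private static boolean isWord(String token) {
        return token.matches("\\w+");
    }

    // Builds one regex that tolerates whitespace differences between tokens.
    static String toRegex(String code) {
        StringBuilder regex = new StringBuilder();
        String previous = null;
        Matcher m = TOKEN.matcher(code);
        while (m.find()) {
            String token = m.group();
            if (previous != null) {
                // Two adjacent words need a separator; elsewhere it is optional.
                boolean bothWords = isWord(previous) && isWord(token);
                regex.append(bothWords ? "\\s+" : "\\s*");
            }
            if (!isWord(token) && META.indexOf(token.charAt(0)) >= 0) {
                regex.append('\\'); // escape regex metacharacters
            }
            regex.append(token);
            previous = token;
        }
        return regex.toString();
    }

    // Percent-encodes the regex so that '+', '\' and '"' survive transport.
    static String toUrl(String regex) {
        return FEED + URLEncoder.encode(regex, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String regex = toRegex("public class HelloWorld {");
        System.out.println(regex);
        System.out.println(toUrl(regex));
    }
}
```

Note that a single regex built this way still spans several source lines, so it only helps if the search engine matches across line breaks.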


 [1] http://www.google.com/codesearch
 [2] http://www.koders.com/
 [3] http://www.krugle.com/
 [4]  http://en.wikipedia.org/wiki/Denial-of-service_attack

Best regards,
Marija

Re: Questions about limitations of use Google Code Search engine

Posted by Alexei Fedotov <al...@gmail.com>.
Great news! I join Marija in thanking Ben!



On Sat, Jul 4, 2009 at 3:33 AM, maka82<ma...@gmail.com> wrote:
> [...]



-- 
With best regards / с наилучшими пожеланиями,
Alexei Fedotov / Алексей Федотов,
http://www.telecom-express.ru/
http://harmony.apache.org/
http://code.google.com/p/openmeetings/

Re: Questions about limitations of use Google Code Search engine

Posted by maka82 <ma...@gmail.com>.
Hi!

I got some answers, and I am writing them here for everyone
who is interested.

1. There is a protection mechanism against DoS attacks in the Google Code
Search service.
   Increasing the wait time between consecutive queries should keep it from
being triggered.

2. There is a limit on query length; it is currently 1024 characters.

3. The problem with the previous query is that Code Search does not look for
multi-line matches.
    It returns results at all only because there are spaces in the query,
so each atom is matched somewhere in the file, but not necessarily on the
same line.
    To get a better result, all spaces should be escaped, e.g. with a query
like this:

    ^\s*public\s*class\s*HelloWorld\s*\{\s*$
    ^\s*public\s*static\s*void\s*main\s*\(\s*String\s*argv\[\]\)\s*\{\s*$
    etc.

 That way it is at least certain that every complete line is matched. But
Code Search cannot tell whether the lines are next to each other.
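
The per-line form suggested in point 3 can be sketched in Java roughly as follows. This is only an illustration under assumptions: LineQueries, lineRegex and queriesFor are hypothetical names, not gdata-codesearch API, and skipping over-long lines is just one possible policy. Each non-empty source line becomes one anchored pattern ending in \s*$ so that trailing whitespace is optional, and patterns longer than the 1024-character limit from point 2 are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of apache-rat-pd or the gdata library.
final class LineQueries {
    // Query length limit reported in this thread.
    static final int MAX_QUERY_LENGTH = 1024;

    // Escapes regex metacharacters so the text is matched literally.
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            if ("\\.[]{}()*+?^$|".indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    // Turns one source line into a pattern anchored at both ends,
    // with every run of whitespace replaced by \s*.
    static String lineRegex(String line) {
        String[] words = line.trim().split("\\s+");
        StringBuilder regex = new StringBuilder("^\\s*");
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                regex.append("\\s*");
            }
            regex.append(escape(words[i]));
        }
        return regex.append("\\s*$").toString();
    }

    // One query per non-empty line; over-long patterns are skipped.
    static List<String> queriesFor(String code) {
        List<String> queries = new ArrayList<>();
        for (String line : code.split("\\R")) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String query = lineRegex(line);
            if (query.length() <= MAX_QUERY_LENGTH) {
                queries.add(query);
            }
        }
        return queries;
    }

    public static void main(String[] args) {
        String code = "public class HelloWorld {\n"
                + "    public static void main(String argv[]) {\n"
                + "    }\n"
                + "}\n";
        queriesFor(code).forEach(System.out::println);
    }
}
```

Very short lines such as a lone closing brace will match almost any file, so a real tool would probably want to filter them out before querying.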

 Unfortunately, it is still not possible to get at the raw file
content with Code Search.
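
 For point 1, spacing out queries on the client side could look like the minimal sketch below. No concrete waiting time was given, so the 200 ms gap is purely an assumed placeholder, and QueryThrottle and awaitTurn are hypothetical names. The call that actually sends the query is left out.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper; the right gap between queries is not known.
final class QueryThrottle {
    private final long minGapNanos;
    private boolean calledBefore = false;
    private long lastCallNanos;

    QueryThrottle(long minGapMillis) {
        this.minGapNanos = TimeUnit.MILLISECONDS.toNanos(minGapMillis);
    }

    // Blocks until the minimum gap since the previous call has passed.
    synchronized void awaitTurn() {
        if (calledBefore) {
            long remaining = minGapNanos - (System.nanoTime() - lastCallNanos);
            if (remaining > 0) {
                try {
                    TimeUnit.NANOSECONDS.sleep(remaining);
                } catch (InterruptedException e) {
                    // Preserve the interrupt status for the caller.
                    Thread.currentThread().interrupt();
                }
            }
        }
        calledBefore = true;
        lastCallNanos = System.nanoTime();
    }

    public static void main(String[] args) {
        QueryThrottle throttle = new QueryThrottle(200); // assumed value
        for (int i = 0; i < 3; i++) {
            throttle.awaitTurn();
            System.out.println("query " + i + " may be sent now");
        }
    }
}
```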

 I would like to thank Ben for this information. :)

Best regards,
Marija


On Jul 3, 10:21 am, maka82 <ma...@gmail.com> wrote:
> [...]

Re: Questions about limitations of use Google Code Search engine

Posted by maka82 <ma...@gmail.com>.
After some research I found that it is not enough just to ask Google
Code Search whether a similar piece of code exists. False positives are
very frequent, so I plan to process the whole source file using the links
from the results provided by the gdata-codesearch API. When I find
potentially matching code in one of the results, I will do more analysis
locally to determine whether it really is the same piece of code.
Interestingly, the gdata-codesearch API does not provide a way to
download the source file linked from a Code Search result [1].
Anyway, it is possible to do that using some third-party libraries, but
it is not an elegant solution.
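
As a rough idea of the local analysis step, here is a minimal Java sketch. It assumes the candidate file's content has already been obtained somehow as a string, and LocalCheck, normalize and containsFragment are hypothetical names. It only collapses whitespace and checks for containment, so it is far simpler than a real similarity check.

```java
// Hypothetical helper; a real check would need to be more tolerant.
final class LocalCheck {
    // Collapses every run of whitespace into a single space.
    static String normalize(String code) {
        return code.trim().replaceAll("\\s+", " ");
    }

    // True if the fragment occurs in the file, ignoring whitespace layout.
    static boolean containsFragment(String fileContent, String fragment) {
        String needle = normalize(fragment);
        return !needle.isEmpty() && normalize(fileContent).contains(needle);
    }

    public static void main(String[] args) {
        String file = "class A {\n    int   x;\n}\n";
        System.out.println(containsFragment(file, "int x;"));
        System.out.println(containsFragment(file, "int y;"));
    }
}
```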

Best regards,
Marija

[1]
http://groups.google.com/group/Google-Code-Search/browse_thread/thread/e93c701fac029a67/d7f6a97b72838e12?hl=en&lnk=gst&q=download#d7f6a97b72838e12


On Jun 24, 11:57 pm, maka82 <ma...@gmail.com> wrote:
> [...]