Posted to dev@creadur.apache.org by Amila De Silva <ja...@gmail.com> on 2009/03/27 15:49:19 UTC

rat-1-cutnpaste - Code Search Optimization

Hi,
I did some research on code search engines (Google Code Search, Krugle
and Koders) to find a scalable solution.
As a first step, we can set an initial size for the sliding window
(the user can change this size).
When a long string is sent to a search engine, it is tokenized before
searching. As I understand it, there is a limit on the number of tokens
they create: if the query string is too long, everything after a certain
number of tokens is treated as a single token.
If we can find out this token limit, it is better to use it as the
window length (so that the window contains that many tokens).
Let’s say this size is n. If a query fails to find any result, all n
tokens are dropped, the next n tokens are loaded, and the search is
performed again.
If a query returns results, those URLs are recorded (I think it’s
better to keep only the first 3 or 4 URLs).
Even when a query returns results, the window still advances to the
next n tokens.
This way the whole code can be searched much more quickly while
conserving search engine resources. Once the list of URLs has been
prepared, an in-depth search can be performed.
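
The scan described above could be sketched roughly as follows. This is only an illustration under assumptions: the class name, the method name and the `search` function (a stand-in for a real search engine call) are hypothetical, and whitespace-joined tokens are used as the query string.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Sketch of the fixed-size window scan; every name here is hypothetical. */
class WindowScan {

    /**
     * Walks the tokens in non-overlapping windows of n tokens, sends each
     * window to the search function and keeps at most maxUrlsPerQuery URLs
     * from each query that returns results.
     */
    static Set<String> collectCandidateUrls(List<String> tokens, int n,
            int maxUrlsPerQuery, Function<String, List<String>> search) {
        if (n <= 0) {
            throw new IllegalArgumentException("window size must be positive");
        }
        Set<String> urls = new LinkedHashSet<>();
        // Hit or miss, the window always advances by n tokens.
        for (int start = 0; start < tokens.size(); start += n) {
            int end = Math.min(start + n, tokens.size());
            String query = String.join(" ", tokens.subList(start, end));
            List<String> hits = search.apply(query);
            for (int i = 0; i < Math.min(maxUrlsPerQuery, hits.size()); i++) {
                urls.add(hits.get(i));
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> tokens =
                List.of("int", "main", "(", ")", "{", "return", "0", ";", "}");
        // Stand-in for a search engine: only queries containing "return" match.
        Function<String, List<String>> fakeSearch = q -> q.contains("return")
                ? List.<String>of("http://example.org/a", "http://example.org/b")
                : List.<String>of();
        System.out.println(collectCandidateUrls(tokens, 3, 1, fakeSearch));
    }
}
```

Each token is sent to the search engine once, so the number of queries grows linearly with the size of the code being checked.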

I’d like to hear your comments on this method.

Best Regards,
Amila

Re: rat-1-cutnpaste - Code Search Optimization

Posted by Alexei Fedotov <al...@gmail.com>.

Amila, thanks.
In addition to your proposal-writing skills, I encourage you to
demonstrate your coding skills to those who will vote on your acceptance
as a GSoC participant. For example, you can write a class that parses
the cut&paste detector's arguments into some internal representation.
It should contain a main method, a loop that cycles through the
arguments, and a usage message. The latter is actually a lightweight way
to approach the architectural question of the tool's scope.
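
A class of the kind described here might look like the sketch below. The option names (`--window`, `--max-urls`), the defaults and the class name are invented for illustration only; nothing in this thread fixes the tool's real options.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of an argument parser; the options are hypothetical. */
class CutNPasteArgs {
    int windowSize = 8; // assumed default, not taken from the thread
    int maxUrls = 3;    // assumed default
    final List<String> files = new ArrayList<>();

    static final String USAGE =
            "Usage: java CutNPasteArgs [--window N] [--max-urls N] FILE...";

    /** Turns the raw arguments into the internal representation. */
    static CutNPasteArgs parse(String[] args) {
        CutNPasteArgs parsed = new CutNPasteArgs();
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "--window":
                    parsed.windowSize = positiveInt(args, ++i);
                    break;
                case "--max-urls":
                    parsed.maxUrls = positiveInt(args, ++i);
                    break;
                default:
                    if (args[i].startsWith("--")) {
                        throw new IllegalArgumentException("unknown option: " + args[i]);
                    }
                    parsed.files.add(args[i]);
            }
        }
        if (parsed.files.isEmpty()) {
            throw new IllegalArgumentException("no input files given");
        }
        return parsed;
    }

    /** Reads args[index] as a positive integer or rejects it. */
    private static int positiveInt(String[] args, int index) {
        if (index >= args.length) {
            throw new IllegalArgumentException("missing value for " + args[index - 1]);
        }
        int value;
        try {
            value = Integer.parseInt(args[index]);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("not a number: " + args[index]);
        }
        if (value <= 0) {
            throw new IllegalArgumentException("value must be positive: " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        try {
            CutNPasteArgs parsed = parse(args);
            System.out.println("window=" + parsed.windowSize
                    + " maxUrls=" + parsed.maxUrls + " files=" + parsed.files);
        } catch (IllegalArgumentException e) {
            // Any parse problem ends with the usage message.
            System.err.println(e.getMessage());
            System.err.println(USAGE);
            System.exit(1);
        }
    }
}
```

Writing the usage string first forces a decision about which inputs the tool accepts, which is the scope question raised above.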

Thanks.




2009/4/2 Amila De Silva <ja...@gmail.com>:
> I'll send my application asap.



-- 
With best regards / с наилучшими пожеланиями,
Alexei Fedotov / Алексей Федотов,
http://www.telecom-express.ru/
http://people.apache.org/~aaf/

Re: rat-1-cutnpaste - Code Search Optimization

Posted by Amila De Silva <ja...@gmail.com>.

Hi Alexei,
Thanks for the reply!
I'll send my application asap.
BR,
Amila


On 4/1/09, Alexei Fedotov <al...@gmail.com> wrote:
> Please send a proposal to the official GSoC app now.

Re: rat-1-cutnpaste - Code Search Optimization

Posted by Alexei Fedotov <al...@gmail.com>.

Amila,
I'm sorry, I unintentionally marked your mail as read. Please
don't hesitate to ping me again if you get no answer.

Your method would do the job. Let me just add that making the sliding
window size automatically adjustable would keep the same linear
algorithmic complexity, so it might be a worthwhile investment.
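
One possible reading of "automatically adjustable" is sketched below. The policy is an assumption, not something specified in this thread: the window is halved after a miss and doubled after a hit, within fixed bounds. The class name and the `hasMatch` predicate (a stand-in for a search engine query over a token range) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Sketch of a self-adjusting window; the resize policy is an assumption. */
class AdaptiveWindow {

    /** Returns the [start, end) token ranges that were queried, in order. */
    static List<int[]> scan(int tokenCount, int minSize, int maxSize,
            Predicate<int[]> hasMatch) {
        if (minSize <= 0 || maxSize < minSize) {
            throw new IllegalArgumentException("need 0 < minSize <= maxSize");
        }
        List<int[]> queried = new ArrayList<>();
        int size = maxSize;
        int start = 0;
        while (start < tokenCount) {
            int end = Math.min(start + size, tokenCount);
            int[] range = {start, end};
            queried.add(range);
            boolean hit = hasMatch.test(range);
            // Assumed policy: widen after a hit, narrow after a miss.
            size = hit ? Math.min(size * 2, maxSize) : Math.max(size / 2, minSize);
            // Windows never overlap, so each token is queried exactly once.
            start = end;
        }
        return queried;
    }

    public static void main(String[] args) {
        // Stand-in predicate: pretend only ranges starting at token 0 match.
        List<int[]> ranges = scan(20, 2, 8, r -> r[0] == 0);
        for (int[] r : ranges) {
            System.out.println("[" + r[0] + ", " + r[1] + ")");
        }
    }
}
```

Because every window covers at least `minSize` new tokens (except possibly the last) and one query is issued per window, the number of queries is at most about `tokenCount / minSize`, which stays linear in the input size.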

Please send a proposal to the official GSoC app now.

Thanks!


On Fri, Mar 27, 2009 at 6:49 PM, Amila De Silva <ja...@gmail.com> wrote:
> I'd like to hear your comments on this method.


