Posted to droids-dev@incubator.apache.org by Mingfai Ma <mi...@gmail.com> on 2009/04/05 01:34:15 UTC

Re: NoRobotClient bug? Seems like it doesn't check the to-be-crawled URI against robots.txt properly

On 5 Apr 2009, at 7:44 AM, Robin Howlett <ro...@gmail.com> wrote:

> I was just looking through NoRobotClient and have a concern about
> whether Droids will actually respect robots.txt when force allow is
> false in most scenarios; consider the following robots.txt:
>
> User-agent: *
> Disallow: /foo/
>
> and the starting URI: http://www.example.com/foo/bar.html
>
> In the code I see - in NoRobotClient.isUrlAllowed() - the following:
>
> String path = uri.getPath();
> String basepath = baseURI.getPath();
> if (path.startsWith(basepath)) {
>     path = path.substring(basepath.length());
>     if (!path.startsWith("/")) {
>         path = "/" + path;
>     }
> }
> ...
>
> Boolean allowed = this.rules != null ? this.rules.isAllowed( path ) : null;
> if (allowed == null) {
>     allowed = this.wildcardRules != null ? this.wildcardRules.isAllowed( path ) : null;
> }
> if (allowed == null) {
>     allowed = Boolean.TRUE;
> }
>
> The path will always be converted to /bar.html and is checked against
> the Rules in rules and wildcardRules but won't be found. However,
> basepath (which will now be /foo) is never checked against the Rules,
> therefore giving an incorrect true result for the isUrlAllowed
> method, no?
>
> robin
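
To make the failure mode concrete, here is a rough, self-contained
sketch of the path handling described above (the rule matching is
reduced to a plain startsWith check and basepath is hard-coded to the
/foo value from the example, so everything except the stripping logic
is an assumption, not the real NoRobotClient code):

    public class PathStripDemo {
        public static void main(String[] args) {
            String disallowedPrefix = "/foo/";   // from "Disallow: /foo/"
            String path = "/foo/bar.html";       // uri.getPath()
            String basepath = "/foo";            // baseURI.getPath() in this scenario

            // Mirrors the stripping in NoRobotClient.isUrlAllowed():
            if (path.startsWith(basepath)) {
                path = path.substring(basepath.length());
                if (!path.startsWith("/")) {
                    path = "/" + path;
                }
            }

            // path is now "/bar.html", so a prefix match against "/foo/"
            // fails and the URI is treated as allowed even though
            // robots.txt disallows it.
            System.out.println("path after stripping: " + path);
            System.out.println("caught by Disallow: /foo/ ? "
                    + path.startsWith(disallowedPrefix));
        }
    }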

I believe NoRobotClient has a problem, too. My crawling job got stuck
when accessing the 2nd link of a domain, and I had to work around the
problem with the force allow flag.
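
For reference, the workaround amounts to bypassing the robots.txt
check entirely; in my setup it looks roughly like the snippet below
(HttpProtocol and setForceAllow(...) are the names I use here and may
differ in your Droids revision, so treat this as a sketch):

    // Skip robots.txt evaluation altogether. This makes the crawler
    // ignore robots.txt, so it is only sensible for hosts you control
    // or while debugging the NoRobotClient behaviour.
    HttpProtocol httpProtocol = new HttpProtocol();
    httpProtocol.setForceAllow(true);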

Regards
Mingfai