Posted to droids-dev@incubator.apache.org by Robin Howlett <ro...@gmail.com> on 2009/04/04 23:44:46 UTC

NoRobotClient bug? Seems like it doesn't check the to-be-crawled URI against robots.txt properly

I was just looking through NoRobotClient and have a concern about whether Droids
will actually respect robots.txt when force allow is false in most
scenarios; consider the following robots.txt:

User-agent: *
Disallow: /foo/

and the starting URI: http://www.example.com/foo/bar.html

In the code I see - in NoRobotClient.isUrlAllowed() - the following:

String path = uri.getPath();
String basepath = baseURI.getPath();
if (path.startsWith(basepath)) {
  path = path.substring(basepath.length());
  if (!path.startsWith("/")) {
    path = "/" + path;
  }
}
...

Boolean allowed = this.rules != null ? this.rules.isAllowed(path) : null;
if (allowed == null) {
  allowed = this.wildcardRules != null ? this.wildcardRules.isAllowed(path) : null;
}
if (allowed == null) {
  allowed = Boolean.TRUE;
}

The path will always be converted to /bar.html and is checked against the
Rules in rules and wildcardRules but won't be found. However, basepath (which
will now be /foo) is never checked against the Rules, therefore giving an
incorrect true result for the isUrlAllowed method, no?
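
To make that concrete, here is a minimal standalone sketch of the same
prefix-stripping logic with the example values filled in (the class name and
the hard-coded URIs are mine, purely for illustration):

import java.net.URI;

public class PathStripDemo {
  public static void main(String[] args) {
    // Example values from above: robots.txt disallows /foo/ and the
    // starting URI lives under /foo/.
    URI baseURI = URI.create("http://www.example.com/foo");
    URI uri = URI.create("http://www.example.com/foo/bar.html");

    String path = uri.getPath();          // "/foo/bar.html"
    String basepath = baseURI.getPath();  // "/foo"

    // Same prefix stripping as in NoRobotClient.isUrlAllowed()
    if (path.startsWith(basepath)) {
      path = path.substring(basepath.length());
      if (!path.startsWith("/")) {
        path = "/" + path;
      }
    }

    // Prints "/bar.html": the /foo prefix that "Disallow: /foo/" would
    // match has already been stripped before the rules are consulted.
    System.out.println(path);
  }
}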

robin

Re: NoRobotClient bug? Seems like it doesn't check the to-be-crawled URI against robots.txt properly

Posted by Mingfai Ma <mi...@gmail.com>.
On 5 Apr 2009, at 7:44 AM, Robin Howlett <ro...@gmail.com> wrote:

> I was just looking through NoRobotClient and have a concern about whether
> Droids will actually respect robots.txt when force allow is false in most
> scenarios; consider the following robots.txt:
>
> User-agent: *
> Disallow: /foo/
>
> and the starting URI: http://www.example.com/foo/bar.html
>
> In the code I see - in NoRobotClient.isUrlAllowed() - the following:
>
> String path = uri.getPath();
> String basepath = baseURI.getPath();
> if (path.startsWith(basepath)) {
>   path = path.substring(basepath.length());
>   if (!path.startsWith("/")) {
>     path = "/" + path;
>   }
> }
> ...
>
> Boolean allowed = this.rules != null ? this.rules.isAllowed(path) : null;
> if (allowed == null) {
>   allowed = this.wildcardRules != null ? this.wildcardRules.isAllowed(path) : null;
> }
> if (allowed == null) {
>   allowed = Boolean.TRUE;
> }
>
> The path will always be converted to /bar.html and is checked against the
> Rules in rules and wildcardRules but won't be found. However, basepath (which
> will now be /foo) is never checked against the Rules, therefore giving an
> incorrect true result for the isUrlAllowed method, no?
>
> robin

I believe the NoRobotClient has a problem, too. My crawling job got stuck
when accessing the 2nd link of a domain. I had to work around the problem
with the force allow flag.
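
For reference, force allow just short-circuits the robots check. Below is a
self-contained sketch of the idea; these are not the actual Droids classes,
only an illustration of what the flag does:

import java.net.URI;

public class ForceAllowSketch {

  // Stand-in for the robots.txt check that is currently giving the
  // wrong answer; pretend robots.txt disallows this URI.
  static boolean robotsSaysAllowed(URI uri) {
    return false;
  }

  static boolean isAllowed(URI uri, boolean forceAllow) {
    // With forceAllow=true the robots.txt result is ignored entirely,
    // which is why it works around the NoRobotClient problem (and why
    // it is only a stop-gap, not a fix).
    return forceAllow || robotsSaysAllowed(uri);
  }

  public static void main(String[] args) {
    URI uri = URI.create("http://www.example.com/foo/bar.html");
    System.out.println(isAllowed(uri, true));   // true: check is skipped
    System.out.println(isAllowed(uri, false));  // false: robots result wins
  }
}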

Regards
Mingfai

Re: NoRobotClient bug? Seems like it doesn't check the to-be-crawled URI against robots.txt properly

Posted by Thorsten Scherler <th...@apache.org>.
On Sun, 2009-04-05 at 23:26 +0200, Thorsten Scherler wrote:
> On Sun, 2009-04-05 at 22:01 +0200, Thorsten Scherler wrote:
> > On Sun, 2009-04-05 at 00:44 +0100, Robin Howlett wrote:
> > >...
> > > The path will always be converted to /bar.html and is checked against the
> > > Rules in rules and wildcardRules but won't be found. However, basepath (which
> > > will now be /foo) is never checked against the Rules, therefore giving an
> > > incorrect true result for the isUrlAllowed method, no?
> > 
> > Hmm, see above, I disagree but have not debugged yet. Will do that now.
> > 
> 
> I just tried and you are right. The norobot code originally came from
> the hc project. Will have a look now at whether the bug was already in
> there or not.

http://svn.apache.org/viewvc/incubator/droids/branch/preIncubator/src/core/java/org/apache/http/norobots/NoRobotClient.java?revision=366650&view=markup

I just love svn. ;)

So it seems it has always been like this. Maybe we are calling it in a way
we should not. Let me explain:

https://svn.apache.org/repos/asf/incubator/droids/trunk/droids-norobots/src/test/java/org/apache/droids/norobots/TestNorobotsClient.java

I said earlier in the thread:
"The base path in our example is http://www.example.com."

I said this because of
https://issues.apache.org/jira/browse/DROIDS-4, which quotes
http://www.robotstxt.org/norobots-rfc.txt (sec 3.1):
"...under a standard relative path on the server: "/robots.txt"."

> It should be "new URL(base, "/robots.txt");" "

Meaning the base should be the root of the server and not
http://www.example.com/foo.
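
Here is a quick sketch of what I mean, using plain java.net.URI instead of
URL (the class and variable names are only for illustration):

import java.net.URI;

public class RobotsBaseSketch {
  public static void main(String[] args) {
    URI start = URI.create("http://www.example.com/foo/bar.html");

    // Per norobots-rfc sec 3.1, robots.txt lives under "/robots.txt",
    // i.e. relative to the server root, not to the starting path.
    System.out.println(start.resolve("/robots.txt"));
    // -> http://www.example.com/robots.txt

    // So the base handed to NoRobotClient should be the server root ("/"),
    // not "/foo"; otherwise isUrlAllowed() strips exactly the prefix
    // that "Disallow: /foo/" is supposed to match.
    System.out.println(start.resolve("/"));
    // -> http://www.example.com/
  }
}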

Can you open an issue so we do not lose track?

TIA

salu2
-- 
Thorsten Scherler <thorsten.at.apache.org>
Open Source <consulting, training and solutions>


Re: NoRobotClient bug? Seems like it doesn't check the to-be-crawled URI against robots.txt properly

Posted by Thorsten Scherler <th...@apache.org>.
On Sun, 2009-04-05 at 22:01 +0200, Thorsten Scherler wrote:
> On Sun, 2009-04-05 at 00:44 +0100, Robin Howlett wrote:
> >...
> > The path will always be converted to /bar.html and is checked against the
> > Rules in rules and wildcardRules but won't be found. However, basepath (which
> > will now be /foo) is never checked against the Rules, therefore giving an
> > incorrect true result for the isUrlAllowed method, no?
> 
> Hmm, see above, I disagree but have not debugged yet. Will do that now.
> 

I just tried and you are right. The norobot code originally came from
the hc project. Will have a look now at whether the bug was already in
there or not.

Anyway, can you submit a patch and add it to our issue tracker?

TIA

salu2
-- 
Thorsten Scherler <thorsten.at.apache.org>
Open Source <consulting, training and solutions>


Re: NoRobotClient bug? Seems like it doesn't check the to-be-crawled URI against robots.txt properly

Posted by Thorsten Scherler <th...@apache.org>.
On Sun, 2009-04-05 at 00:44 +0100, Robin Howlett wrote:
> I was just looking through NoRobotClient and have a concern about whether Droids
> will actually respect robots.txt when force allow is false in most
> scenarios; consider the following robots.txt:

It is easier to have a test class to debug this.

> 
> User-agent: *
> Disallow: /foo/
> 
> and the starting URI: http://www.example.com/foo/bar.html
> 
> In the code I see - in NoRobotClient.isUrlAllowed() - the following:
> 
> String path = uri.getPath();
> String basepath = baseURI.getPath();

The base path in our example is http://www.example.com.

> if (path.startsWith(basepath)) {
>  path = path.substring(basepath.length());
>  if (!path.startsWith("/")) {
>    path = "/" + path;
>  }
> }

path is /foo/bar.html

> ...
> 
> Boolean allowed = this.rules != null ? this.rules.isAllowed(path) : null;
> if (allowed == null) {
>   allowed = this.wildcardRules != null ? this.wildcardRules.isAllowed(path) : null;
> }
> if (allowed == null) {
>   allowed = Boolean.TRUE;
> }
> 
> The path will always be converted to /bar.html and is checked against the
> Rules in rules and wildcardRules but won't be found. However, basepath (which
> will now be /foo) is never checked against the Rules, therefore giving an
> incorrect true result for the isUrlAllowed method, no?

Hmm, see above, I disagree but have not debugged yet. Will do that now.

salu2

> robin
-- 
Thorsten Scherler <thorsten.at.apache.org>
Open Source <consulting, training and solutions>