Posted to dev@httpd.apache.org by Brian Behlendorf <br...@collab.net> on 2000/12/31 22:18:14 UTC

recursive robot queries

Doing a tail -f of /logs/www/weblogs on apache.org is a lesson
in... something.  Mainly robot insanity.  Every time I've checked
recently, about 1 out of every 20-30 accesses looks like this:

xml.apache.org 139.179.10.17 - - [31/Dec/2000:12:49:03 -0800] "HEAD /xerces-c/faq-other.html/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/graphics/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/images/images/build.html HTTP/1.0" 200 0 
"http://xml.apache.org:80/xerces-c/faq-other.html/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/graphics/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/images/images/install.html" "Wget/1.4.5"

and

www.apache.org 210.73.88.163 - - [31/Dec/2000:12:49:04 -0800] "GET /index/full/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/images/images/images/images/foundation/images/images/images/images/foundation/images/apache_pb.gif HTTP/1.0" 403 1282 
"http://www.apache.org:80/index/full/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/images/images/images/images/foundation/images/images/images/images/foundation/FAQ.html" "Wget/1.5.3"

These are allowed to happen due to content negotiation - any extra
information after a valid link is presumed to simply be PATH_INFO
information.  So in the www.apache.org example, the above URL will pull up
the page "/index", i.e. index.html, with "/full/foundation/...." as the
PATH_INFO.  How did this recursion start?
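A minimal sketch of the split described above, assuming a toy document root that contains only index.html. The names KNOWN_FILES, NEGOTIABLE_EXTENSIONS and split_path_info are hypothetical, and this is a simplification for illustration, not the server's actual lookup code:

```python
from typing import Optional, Tuple

# Hypothetical document root: only index.html, as in the www.apache.org example.
KNOWN_FILES = {"/index.html"}
# Extensions tried when the URL omits one (a crude stand-in for content negotiation).
NEGOTIABLE_EXTENSIONS = ("", ".html")


def split_path_info(url_path: str) -> Optional[Tuple[str, str]]:
    """Return (matched_file, path_info), or None if no prefix names a known file."""
    segments = url_path.strip("/").split("/")
    # Try the shortest prefix first: "/index", then "/index/full", and so on.
    for end in range(1, len(segments) + 1):
        prefix = "/" + "/".join(segments[:end])
        for ext in NEGOTIABLE_EXTENSIONS:
            if prefix + ext in KNOWN_FILES:
                rest = segments[end:]
                # Everything after the matched file is handed over as PATH_INFO.
                path_info = "/" + "/".join(rest) if rest else ""
                return prefix + ext, path_info
    return None


print(split_path_info("/index/full/foundation/images/asf_logo.gif"))
```

Under these assumptions, any URL that merely starts with /index maps to index.html, no matter how long the trailing path grows.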

I narrowed it down to this sequence of accesses from that host:

httpd.apache.org 210.73.88.163 - - [31/Dec/2000:08:07:15 -0800] "GET /docs/misc/known_client_problems.html HTTP/1.0" 200 13973 "http://httpd.apache.org/docs/misc/compat_notes.html" "Wget/1.5.3"
www.apache.org 210.73.88.163 - - [31/Dec/2000:08:07:25 -0800] "GET /index/full/4118 HTTP/1.0" 200 3785 "http://httpd.apache.org/docs/misc/known_client_problems.html" "Wget/1.5.3"
www.apache.org 210.73.88.163 - - [31/Dec/2000:08:07:26 -0800] "GET /index/full/foundation/images/asf_logo.gif HTTP/1.0" 200 3785 "http://www.apache.org:80/index/full/4118" "Wget/1.5.3"

Somehow Wget is munging the link from known_client_problems.html to
http://bugs.apache.org/index/full/4118 (a perfectly valid link) into a
link to http://www.apache.org/index/full/4118, and that URL renders what
http://www.apache.org/index does, only the relative URL on that page to
foundation/images/asf_logo.gif resolves to
http://www.apache.org/index/full/foundation/images/asf_logo.gif, and
getting that page leads to....
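The resolution step can be reproduced with Python's standard urljoin, which follows the usual relative-reference rules. This is only a sketch of what a client does with the relative image link, using the base URL from the Referer field in the log above; it is not Wget's own code:

```python
from urllib.parse import urljoin

# Base URL as the robot saw it (taken from the Referer field in the access log).
base = "http://www.apache.org:80/index/full/4118"
# Relative link found on the index page.
relative_link = "foundation/images/asf_logo.gif"

# The last segment of the base ("4118") is dropped and the relative link
# is appended to the remaining "directory" /index/full/.
resolved = urljoin(base, relative_link)
print(resolved)
# prints: http://www.apache.org:80/index/full/foundation/images/asf_logo.gif
```

The result matches the third request in the log, so the client is resolving the link correctly against a base URL that should never have returned the index page.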

Gar.  This is silly.  OK, so I can fix this by redirecting any requests to
www.apache.org/index/full to www.apache.org/, but that feels like and is
an ugly hack.  What's a more general way of solving this?  Is this a bug
in Wget?

In the XML case, I see the following chain:

xml.apache.org 139.179.10.17 - - [31/Dec/2000:02:15:31 -0800] "GET /xerces-c/feedback.html HTTP/1.0" 200 15788 "http://xml.apache.org:80/xerces-c/releases.html" "Wget/1.4.5"
xml.apache.org 139.179.10.17 - - [31/Dec/2000:02:18:20 -0800] "GET /xerces-c/faq-other.html/ HTTP/1.0" 200 29459 "http://xml.apache.org:80/xerces-c/feedback.html" "Wget/1.4.5"
xml.apache.org 139.179.10.17 - - [31/Dec/2000:02:18:28 -0800] "GET /xerces-c/faq-other.html/resources/script.js HTTP/1.0" 200 29459 "http://xml.apache.org:80/xerces-c/faq-other.html/" "Wget/1.4.5"
xml.apache.org 139.179.10.17 - - [31/Dec/2000:02:18:36 -0800] "GET /xerces-c/faq-other.html/resources/resources/script.js HTTP/1.0" 200 29459 "http://xml.apache.org:80/xerces-c/faq-other.html/resources/script.js" "Wget/1.4.5"
xml.apache.org 139.179.10.17 - - [31/Dec/2000:02:18:43 -0800] "GET /xerces-c/faq-other.html/resources/resources/resources/script.js HTTP/1.0" 200 29459 "http://xml.apache.org:80/xerces-c/faq-other.html/resources/resources/script.js" "Wget/1.4.5"

This is clearly a typo in /xerces-c/feedback.html.  I'll ask to have this
fixed, but it's painful to see how such an easy typo to make can cause
such a cascade.
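The cascade can be simulated in a few lines. This sketch assumes the page contains a relative link "resources/script.js" (inferred from the log, not checked against the page source) and that every URL under faq-other.html/ returns that same page, which the identical 29459-byte responses suggest. The helper name follow_relative_link is hypothetical:

```python
from typing import List
from urllib.parse import urljoin


def follow_relative_link(start_url: str, relative_link: str, hops: int) -> List[str]:
    """Resolve the same relative link against each successive URL."""
    urls = [start_url]
    for _ in range(hops):
        # Each response is the same page, so the same link is resolved again,
        # this time against the URL that was just fetched.
        urls.append(urljoin(urls[-1], relative_link))
    return urls


chain = follow_relative_link(
    "http://xml.apache.org:80/xerces-c/faq-other.html/",
    "resources/script.js",
    3,
)
for url in chain:
    print(url)
```

Each hop adds one more "resources/" segment, which is the growth visible in the log lines above.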

Anyways, just thought I'd post about this, coz I thought it was a humorous
problem.

	Brian



Re: recursive robot queries

Posted by "William A. Rowe, Jr." <wr...@covalent.net>.
From: "Roy T. Fielding" <fi...@ebuilt.com>
Sent: Monday, January 01, 2001 12:34 AM
>

> > These are allowed to happen due to content negotiation - any extra
> > information after a valid link is presumed to simply be PATH_INFO
> > information.  So in the www.apache.org example, the above URL will pull up
> > the page "/index", i.e. index.html, with "/full/foundation/...." as the
> > PATH_INFO.  How did this recursion start?
> 
> Blecko... there needs to be a way for ssi files to declare that they
> are going to use path_info (or declare that they are not) so that the
> server can redirect or block access to bogus URLs.

I've thought a lot this last week about our ssi support (trying to parse
FAQ.html, for one, and dealing with other aspects under 2.0.)

We've got a lot of conditions now that SSI was just never meant to
cope with.  Take an include of a footer.html from index.html.ja.jis ...
where the charset changes from the main body to the included body.
Some include targets nearly require their own 'mini-headers' ... 
header processing from ssi that would allow a subrequest working with
mod_charset_lite, for example, to force the subrequest encoding
back to the parent encoding.

Not to mention that we could kill etag/lastmodified issues for good,
or employ cache control headers.

The reason I've hedged on a rewrite of mod_autoindex is that it's
really a specialized case of ssi.  [that ought to make heads spin.]
Right now it's outside in, I want to look at it inside out.  It will
really make life simple for customizing and fancy indexing, and we
can get rid of the "Header and readme names aren't picking up the
right files!" bugs.

I'm sure we can come up with a ton of these.  The real questions are,
is SSI dead (as a 'growing' entity), in the sense that real growth
of that spec is no longer worthwhile?  If not, how do we begin to
make it relevant in a mixed-content, HTTP/1.1 world?

If it's going to be relevant, I believe we need to begin dealing
with the 'contents as a whole', merge included headers, perhaps
even deal with 'our' meta-tags (heck, we are parsing the document,
why not deal with a content-language meta tag?)  I'm guessing we
do some of this now, and ignore most of it.

Welcome to this next thousand years, everyone :-)

Bill 





Re: recursive robot queries

Posted by dean gaudet <dg...@arctic.org>.
On Sun, 31 Dec 2000, Roy T. Fielding wrote:

> > These are allowed to happen due to content negotiation - any extra
> > information after a valid link is presumed to simply be PATH_INFO
> > information.  So in the www.apache.org example, the above URL will pull up
> > the page "/index", i.e. index.html, with "/full/foundation/...." as the
> > PATH_INFO.  How did this recursion start?
>
> Blecko... there needs to be a way for ssi files to declare that they
> are going to use path_info (or declare that they are not) so that the
> server can redirect or block access to bogus URLs.

you mean something such as:

<!--#if expr="$PATH_INFO" -->
you're lame
<!--#else -->
rest of file including lots of <img src=""> and other crud
which robots would actually chase
<!--#endif -->

?

-dean


Re: recursive robot queries

Posted by Brian Behlendorf <br...@collab.net>.
On Sun, 31 Dec 2000, Roy T. Fielding wrote:
> > I narrowed it down to this sequence of accesses from that host:
> > 
> > httpd.apache.org 210.73.88.163 - - [31/Dec/2000:08:07:15 -0800] "GET /docs/misc/known_client_problems.html HTTP/1.0" 200 13973 "http://httpd.apache.org/docs/misc/compat_notes.html" "Wget/1.5.3"
> > www.apache.org 210.73.88.163 - - [31/Dec/2000:08:07:25 -0800] "GET /index/full/4118 HTTP/1.0" 200 3785 "http://httpd.apache.org/docs/misc/known_client_problems.html" "Wget/1.5.3"
> > www.apache.org 210.73.88.163 - - [31/Dec/2000:08:07:26 -0800] "GET /index/full/foundation/images/asf_logo.gif HTTP/1.0" 200 3785 "http://www.apache.org:80/index/full/4118" "Wget/1.5.3"
> 
> I don't think so -- the presence of www.apache.org:80 would seem to indicate
> that something on our side did a redirect using the default hostname instead
> of using bugs.apache.org.  

I am pretty sure I didn't skip a request, though; the chain pretty clearly
went

http://httpd.apache.org/docs/misc/compat_notes.html
http://httpd.apache.org/docs/misc/known_client_problems.html
http://www.apache.org:80/index/full/4118

There are intervening HEAD requests for each resource right before a
GET; hmm.  Does this look right?  Could this be confusing Wget?

[taz] 6:08pm primary > telnet bugs.apache.org 80
Trying 64.208.42.41...
Connected to bugs.apache.org.
Escape character is '^]'.
HEAD /index/full/4118 HTTP/1.0
Host: bugs.apache.org

HTTP/1.1 302 Found
Date: Tue, 02 Jan 2001 02:08:48 GMT
Server: Apache/1.3.15-dev (Unix) tomcat/1.0
Location: http://bugs.apache.org/index.cgi/full/4118
Connection: close
Content-Type: text/html; charset=iso-8859-1
Connection closed by foreign host.



	Brian




Re: recursive robot queries

Posted by Joshua Slive <sl...@finance.commerce.ubc.ca>.
On Sun, 31 Dec 2000, Roy T. Fielding wrote:

> > These are allowed to happen due to content negotiation - any extra
> > information after a valid link is presumed to simply be PATH_INFO
> > information.  So in the www.apache.org example, the above URL will pull up
> > the page "/index", i.e. index.html, with "/full/foundation/...." as the
> > PATH_INFO.  How did this recursion start?
>
> Blecko... there needs to be a way for ssi files to declare that they
> are going to use path_info (or declare that they are not) so that the
> server can redirect or block access to bogus URLs.
>

Yes!  This has been the subject of numerous bug reports and questions to
the newsgroup, and there is no real solution in the current server.

Joshua.


Re: recursive robot queries

Posted by "Roy T. Fielding" <fi...@ebuilt.com>.
> These are allowed to happen due to content negotiation - any extra
> information after a valid link is presumed to simply be PATH_INFO
> information.  So in the www.apache.org example, the above URL will pull up
> the page "/index", i.e. index.html, with "/full/foundation/...." as the
> PATH_INFO.  How did this recursion start?

Blecko... there needs to be a way for ssi files to declare that they
are going to use path_info (or declare that they are not) so that the
server can redirect or block access to bogus URLs.
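As a rough sketch of that idea, assuming each resource carries a flag saying whether it uses path_info: the flag, the function name status_for_request, and the choice of 404 as the "block" response are all hypothetical here, not an existing server feature.

```python
def status_for_request(accepts_path_info: bool, path_info: str) -> int:
    """Pick a status code for a request that matched a real file.

    accepts_path_info is a hypothetical per-resource declaration;
    path_info is whatever trailed the matched file in the URL.
    """
    if path_info and not accepts_path_info:
        # The file never asked for trailing path data, so treat the URL as bogus.
        return 404
    return 200


# A plain page that declares no use of path_info rejects the runaway URL.
print(status_for_request(False, "/full/foundation/images/asf_logo.gif"))
# A script that declares it uses path_info still gets the request.
print(status_for_request(True, "/full/4118"))
```

With a check like this the first bogus URL would fail, so a robot would never receive a page of relative links to chase.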

> I narrowed it down to this sequence of accesses from that host:
> 
> httpd.apache.org 210.73.88.163 - - [31/Dec/2000:08:07:15 -0800] "GET /docs/misc/known_client_problems.html HTTP/1.0" 200 13973 "http://httpd.apache.org/docs/misc/compat_notes.html" "Wget/1.5.3"
> www.apache.org 210.73.88.163 - - [31/Dec/2000:08:07:25 -0800] "GET /index/full/4118 HTTP/1.0" 200 3785 "http://httpd.apache.org/docs/misc/known_client_problems.html" "Wget/1.5.3"
> www.apache.org 210.73.88.163 - - [31/Dec/2000:08:07:26 -0800] "GET /index/full/foundation/images/asf_logo.gif HTTP/1.0" 200 3785 "http://www.apache.org:80/index/full/4118" "Wget/1.5.3"
> 
> Somehow Wget is munging the link from known_client_problems.html to
> http://bugs.apache.org/index/full/4118 (a perfectly valid link) into a
> link to http://www.apache.org/index/full/4118, and that URL renders what
> http://www.apache.org/index does, only the relative URL on that page to
> foundation/images/asf_logo.gif resolves to
> http://www.apache.org/index/full/foundation/images/asf_logo.gif, and
> getting that page leads to....
> 
> Gar.  This is silly.  OK, so I can fix this by redirecting any requests to
> www.apache.org/index/full to www.apache.org/, but that feels like and is
> an ugly hack.  What's a more general way of solving this?  Is this a bug
> in Wget?

I don't think so -- the presence of www.apache.org:80 would seem to indicate
that something on our side did a redirect using the default hostname instead
of using bugs.apache.org.  I suspect that it is a problem with the
httpd.apache.org vhost config [but this is just guessing on my part].
Or maybe we just need to update the output to use full URLs instead
of relative links.

....Roy