Posted to dev@httpd.apache.org by Marc Slemko <ma...@znep.com> on 1999/12/26 09:10:27 UTC

Ban MSIECrawler

Just a FYI; if you haven't already (and most people probably have), make
sure to ban anything resembling:

Mozilla/4.0 (compatible; MSIE 4.01; MSIECrawler; Windows 95)

from your site.  This is, AFAIK, IE in its lame-ass "kill the web" mode.
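
Something like this at the server config level should keep it out.  A rough,
untested sketch; it assumes mod_setenvif and the standard allow/deny access
control, and the "msiecrawler" variable name is arbitrary:

   BrowserMatchNoCase MSIECrawler msiecrawler

   <Location />
       Order Allow,Deny
       Allow from all
       Deny from env=msiecrawler
   </Location>

Anything sending that User-Agent then gets a 403 instead of content.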

I just saw a site get pummelled by over 250 hits per second from a bunch
of users using this.  They weren't doing anything special, I have no
reason to think there was any planned DoS attack.  The site simply had
something like:

ErrorDocument 404 http://site.example.com/notthere.html

in a .htaccess, ...where /notthere.html didn't exist on the site.
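
For anyone who hasn't run into this before: Apache treats a full-URL
ErrorDocument as an external redirect rather than serving the page itself,
so the exchange looks roughly like this (the first request path here is
made up; the rest comes from the config above):

   GET /any/missing/page  ->  302, Location: http://site.example.com/notthere.html
   GET /notthere.html     ->  404  ->  302, Location: http://site.example.com/notthere.html
   GET /notthere.html     ->  404  ->  302, ... and so on, for any client
                              that never gives up on redirects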

So whenever this pile of junk got a 404, it ended up getting stuck in a
loop of redirects to the same page it was already on.  How can MS release
software like this?


Re: Ban MSIECrawler

Posted by Marc Slemko <ma...@znep.com>.
On Sun, 26 Dec 1999, Greg Stein wrote:

> On Sun, 26 Dec 1999, Marc Slemko wrote:
> > Just a FYI; if you haven't already (and most people probably have), make
> > sure to ban anything resembling:
> > 
> > Mozilla/4.0 (compatible; MSIE 4.01; MSIECrawler; Windows 95)
> > 
> > from your site.  This is, AFAIK, IE in its lame-ass "kill the web" mode.
> > 
> > I just saw a site get pummelled by over 250 hits per second from a bunch
> > of users using this.  They weren't doing anything special, I have no
> > reason to think there was any planned DoS attack.  The site simply had
> > something like:
> > 
> > ErrorDocument 404 http://site.example.com/notthere.html
> > 
> > in a .htaccess, ...where /notthere.html didn't exist on the site.
> > 
> > So whenever this pile of junk got a 404, it ended up getting stuck in a
> > loop of redirects to the same page it was already on.  How can MS release
> > software like this?
> 
> Actually, it is a pretty nice feature to yank down a bunch of pages so
> that they will be available when you disconnect your laptop from the
> Internet. I'm not familiar with a similar feature in any other browser.
> It's kind of nice, actually, to have a copy of a web site on the plane
> with you for reference while you work.

That's fine, but its implementation is horrible enough that making the
webserver available to actual users is far more important than letting
MSIECrawler crap all over it...

> 
> If a site continues to return a redirect on an error, then what is the
> client supposed to do? It should go get the other document.
> 
> Personally... I can easily understand the fetch loop that results. Since

That's why RFC2068 said (and 1945 had something almost the same):

10.3 Redirection 3xx

   This class of status code indicates that further action needs to be
   taken by the user agent in order to fulfill the request. The action
   required MAY be carried out by the user agent without interaction
   with the user if and only if the method used in the second request is
   GET or HEAD. A user agent SHOULD NOT automatically redirect a request
   more than 5 times, since such redirections usually indicate an
   infinite loop.

which was amended in RFC2616 to:

10.3 Redirection 3xx

   This class of status code indicates that further action needs to be
   taken by the user agent in order to fulfill the request.  The action
   required MAY be carried out by the user agent without interaction
   with the user if and only if the method used in the second request is
   GET or HEAD. A client SHOULD detect infinite redirection loops, since
   such loops generate network traffic for each redirection.

      Note: previous versions of this specification recommended a
      maximum of five redirections. Content developers should be aware
      that there might be clients that implement such a fixed
      limitation.

If someone actually reads the RFCs for the protocol they are trying to
implement, the potential problem becomes clear.  It is made even clearer
in various less formal guidelines for robots, which the MSIECrawler
designers should have investigated, since a robot is exactly what it is.

[...]
> IMO, don't ban MSIECrawler (that'll just piss off users trying to cache
> your site). Fix the ErrorDocument response.

That doesn't work too well when you have hundreds of thousands of users,
any one of whom could stick one in a .htaccess without you knowing, until
the server is attacked by a roving pack of MSIECrawlers and brought to its
knees before you can jump in with a club and fight them off.


Re: Ban MSIECrawler

Posted by Greg Stein <gs...@lyra.org>.
On Sun, 26 Dec 1999, Marc Slemko wrote:
> Just a FYI; if you haven't already (and most people probably have), make
> sure to ban anything resembling:
> 
> Mozilla/4.0 (compatible; MSIE 4.01; MSIECrawler; Windows 95)
> 
> from your site.  This is, AFAIK, IE in its lame-ass "kill the web" mode.
> 
> I just saw a site get pummelled by over 250 hits per second from a bunch
> of users using this.  They weren't doing anything special, I have no
> reason to think there was any planned DoS attack.  The site simply had
> something like:
> 
> ErrorDocument 404 http://site.example.com/notthere.html
> 
> in a .htaccess, ...where /notthere.html didn't exist on the site.
> 
> So whenever this pile of junk got a 404, it ended up getting stuck in a
> loop of redirects to the same page it was already on.  How can MS release
> software like this?

Actually, it is a pretty nice feature to yank down a bunch of pages so
that they will be available when you disconnect your laptop from the
Internet. I'm not familiar with a similar feature in any other browser.
It's kind of nice, actually, to have a copy of a web site on the plane
with you for reference while you work.

If a site continues to return a redirect on an error, then what is the
client supposed to do? It should go get the other document.

Personally... I can easily understand the fetch loop that results. Since
an error occurred, it doesn't have any of the documents in its cache, so
it figures it has to go get it. Easy mistake to make, I'd say -- recording
what is in the cache as opposed to what was requested.

And it takes a real serious brain on a tester to think, "hey... how about
if I misconfigure the server in *this* way to see what IE does." There are
probably three testers on the planet that might have thought of that.

IMO, don't ban MSIECrawler (that'll just piss off users trying to cache
your site). Fix the ErrorDocument response.
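
Something like this (a local path instead of a full URL; the filename is just
an example) makes Apache serve the page internally with the 404 status, so
there is no redirect for a client to loop on:

   ErrorDocument 404 /errors/notfound.html

Just make sure the target actually exists; if it doesn't, Apache should just
fall back to its built-in error text rather than looping, but your custom
page never shows up.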

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/