Posted to dev@httpd.apache.org by Paul Sutton <pa...@awe.com> on 1998/02/09 15:17:29 UTC

killing robots

Umm, www.apacheweek.com is being attacked by a nasty robot. None of the
other vhosts we have are affected though. Perhaps it doesn't like apache?
Just thought I'd let you know in case it is attacking other apache-related
sites. 

We got 170,000 hits from it last week (fairly noticeable since we normally
only get 40,000 or so). It is coming from 193.136.17.202
(donald.di.uminho.pt) with a UA of "GETWWW-ROBOT/2.0".

We are also getting a few hits from another robot-like thing: from
118.40.17.203 (dp-m-a18.werple.net.au) with UA "Java1.1.3" (there is also
a Java1.1.4 agent out there, but that has only made a few requests). The
robot seems particularly broken -- we use multiviews on every request, but
Java1.1.3 seems to always add a trailing / unless the link contained an
extension, in which case it tries without the /.

Anyway, what's the current wisdom on how to deal with robots? Do you match
its UA & IP, then reject with a 404 or 500, or just trash the whole IP?  I
haven't really kept up with the robot wars, so any advice would be useful. 
Is there a good site which tracks nasty robot issues?

Paul


Re: killing robots

Posted by Chia-liang Kao <cl...@pamud.net>.
on 02/09/98 Mon, Paul Sutton <pa...@awe.com> wrote:
> Anyway, what's the current wisdom on how to deal with robots? Do you match
> its UA & IP, then reject with a 404 or 500, or just trash the whole IP?  I
> haven't really kept up with the robot wars, so any advice would be useful. 
> Is there a good site which tracks nasty robot issues?
>
> Paul
There is a `Standard for Robot Exclusion' which says robots should not
grab data that a site marks as off-limits in its /robots.txt.

refer to: http://info.webcrawler.com/mak/projects/robots/exclusion.html
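
For reference, a minimal /robots.txt along those lines might look like the
following (the first UA string is the one Paul quoted; the paths are just
illustrative):

    # ask the offending robot to stay away entirely
    User-agent: GETWWW-ROBOT
    Disallow: /

    # everyone else: please skip the logs area
    User-agent: *
    Disallow: /logs/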

But bad-mannered robots can simply choose not to obey the standard.

Just my $0.02. 

CLK
-- 
Chia-liang Kao  /  clkao@cirx.org
Panther Tech Co. , Taichung, Taiwan
http://www.pamud.net/~clkao
`Endless nonsense doesn't scare me' -- IOI 97

Re: killing robots

Posted by Rob Hartill <ro...@imdb.com>.
On Mon, 9 Feb 1998, Paul Sutton wrote:

> Umm, www.apacheweek.com is being attacked by a nasty robot. None of the
> other vhosts we have are affected though. Perhaps it doesn't like apache?
> Just thought I'd let you know in case it is attacking other apache-related
> sites. 
> 
> We got 170,000 hits from it last week (fairly noticeable since we normally
> only get 40,000 or so). It is coming from 193.136.17.202
> (donald.di.uminho.pt) with a UA of "GETWWW-ROBOT/2.0".

'GETWWW' is on my list of UA substrings to reject outright.

> We are also getting a few hits from another robot-like thing: from
> 118.40.17.203 (dp-m-a18.werple.net.au) with UA "Java1.1.3" (there is also
> a Java1.1.4 agent out there, but that has only made a few requests). The

'Java1' and 'Java3' are also on the list.

> robot seems particularly broken -- we use multiviews on every request, but
> Java1.1.3 seems to always add a trailing / unless the link contained an
> extension, in which case it tries without the /.
> 
> Anyway, what's the current wisdom on how to deal with robots?

catch them early and block them forever.

> Do you match
> its UA & IP, then reject with a 404 or 500, or just trash the whole IP?

blocking UAs is best if they don't pretend to be Mozilla. Actually, it's
also safe to block ^Mozilla/3.0$ ^Mozilla/4.0$ ^Mozilla/4.03$ because
they are also badly behaved robots trying to spoof servers into treating
them with the respect they don't deserve.
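
Something along these lines does the UA match and reject inside the server
itself (a sketch assuming a 1.3-style setup where mod_setenvif's BrowserMatch
and mod_access's `Deny from env=' are available; the substrings are the ones
from this thread and the directory path is a placeholder):

    # tag requests from known-bad UA substrings
    BrowserMatch ^GETWWW         bad_robot
    BrowserMatch ^Java1          bad_robot
    BrowserMatch ^Mozilla/3\.0$  bad_robot
    BrowserMatch ^Mozilla/4\.0$  bad_robot
    BrowserMatch ^Mozilla/4\.03$ bad_robot

    # then refuse them the document tree
    <Directory /www/htdocs>
        Order Allow,Deny
        Allow from all
        Deny from env=bad_robot
    </Directory>

They get a 403 rather than the 404/500 Paul mentioned, which is what Deny
hands out.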

The best line of defence against the worst offenders is a lower level
packet dropper (e.g. ipfw for FreeBSD). Lots of robots don't appreciate
that 'no' really does mean 'no'.
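
For example, to drop Paul's robot at the packet level on a FreeBSD box
(rule number and details are site-specific; this is just the shape of it):

    # silently drop everything from the offending host before it reaches httpd
    ipfw add 1000 deny tcp from 193.136.17.202 to any 80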

> I
> haven't really kept up with the robot wars, so any advice would be useful. 
> Is there a good site which tracks nasty robot issues?

We used to keep an alert list, but reporting the offenders consumes more
time than they are worth.

Reaction time is the key to saving your server or diskspace from getting
toasted. I run a perl script to count IP hits on the tail end of the
access log every 15-30m. Anything that crosses preset values for rate/volume
triggers an email warning.
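
A rough sketch of that kind of watcher in perl (not the actual script; the
log path, tail window and threshold are made-up values):

    #!/usr/bin/perl
    # Count hits per IP over the tail of a common-log-format access log and
    # print any address that crosses the threshold; run it from cron and pipe
    # the output to mail(1) to get the warning delivered.
    my $threshold = 500;
    my %hits;
    open(LOG, "tail -20000 /usr/local/apache/logs/access_log |")
        or die "can't run tail: $!";
    while (<LOG>) {
        my ($ip) = split;      # the client address is the first field
        $hits{$ip}++;
    }
    close(LOG);
    foreach my $ip (sort { $hits{$b} <=> $hits{$a} } keys %hits) {
        print "$ip\t$hits{$ip}\n" if $hits{$ip} >= $threshold;
    }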

xxx.lanl.gov have been known to send 1 email message per unwanted request
to site admins when they ignore earlier requests to clean up their act.

-=-==

Lincoln Stein <ls...@W3.ORG> is writing an article on bad UAs, you might
want to ping him for pointers to any info he may be willing to share.