Posted to dev@httpd.apache.org by Rob Hartill <ro...@imdb.com> on 1996/09/23 00:22:34 UTC

be afraid, be very afraid

Just when you thought it couldn't get any worse, the Brits bring you
"Agentware". I'm sure they'll argue that robots don't kill servers; people
do.


----- Forwarded message from Benjamin Franz -----

From: Benjamin Franz <sn...@netimages.com>
To: robots@webcrawler.com
Subject: Bad agent...A *very* bad agent.


It was brought to my attention that there is a 'personal agent' robot
available at <URL:http://www.agentware.com/>. I was asked my opinion on it
and its impact on the web in general, so I downloaded and reviewed it.

Not good. 

To be net friendly, this class of robot should be closely associated
with a central general-purpose robot that caches high-request items.
This one does not do that - which makes it a long-term loss for the
net as a whole - especially since it doesn't play by the rules most
robots abide by: it won't even identify itself to a server.
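
To make the point concrete, here is roughly what rule-abiding
behaviour looks like - sketched in modern Python, long after the
fact; the agent name, contact address, and site are invented for the
example:

    # A minimal polite fetcher: identify yourself, read robots.txt
    # first, obey it, and pace your requests.
    import time
    import urllib.request
    import urllib.robotparser

    AGENT = "ExampleAgent/0.1 (+mailto:operator@example.com)"

    rp = urllib.robotparser.RobotFileParser(
        "http://www.example.com/robots.txt")
    rp.read()  # fetch the exclusion file before anything else

    def polite_fetch(url):
        if not rp.can_fetch(AGENT, url):
            return None  # excluded by robots.txt: stay out
        req = urllib.request.Request(url,
                                     headers={"User-Agent": AGENT})
        time.sleep(1)  # one paced request at a time, no 'machine gun'
        return urllib.request.urlopen(req).read()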

This *particular* agent has a *severe* problem in that it ignores
robots.txt files and will walk right into infinite tree spaces at
high speed. I set an agent to look for 'devilbunnies rabbits' - and
it attempted to index an archive with in excess of 50,000 saved
articles that was firewalled with a robots.txt file precisely to
*prevent* such an occurrence. It crashed the agent program a few
hundred articles into the attempt, after firing a 'machine gun'
series of requests at the server. Not server friendly at all.
Lastly - rather than being self-limiting by default (halting when it
doesn't find any high-relevance pages or when it reaches a certain
amount of found material) - its default is to try to explore the
entire net, starting from its 'high grade' list from existing
general-purpose search engines. To top it all off - it has a 'fire
and forget' mode where you can leave it running for
hours...days...weeks...
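
For what it's worth, the firewall in question is nothing exotic - a
robots.txt at the server root along these lines (the path here is
illustrative, not the real one):

    # Keep all robots out of the article archive.
    User-agent: *
    Disallow: /archive/

A compliant robot fetches /robots.txt before anything else and stays
out of every path matched by a Disallow line; this agent never asks
for the file at all.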

This is very close to a 'worst case' robot - the only things that
could be worse would be integrating it directly into a web browser or
allowing it to fire parallel requests rather than the 'one request at
a time' approach it uses now.

--
Benjamin Franz



----- End of forwarded message from Benjamin Franz -----

it might be necessary to adapt the "user_agents" patch I submitted so
that one can block anonymous agents... If they don't have the
courtesy to identify themselves, then one should be able to decline
access to them (not by default, but by choice).
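
Something along these lines would do it - this is only a sketch of
the idea in the SetEnvIf style of later stock Apache, not the syntax
of the patch itself, and the directory path is made up:

    # Requests that send no User-Agent header never set the
    # variable, so they fall through to "Deny from all".
    SetEnvIf User-Agent ".+" identified_agent
    <Directory /usr/local/apache/htdocs>
        Order Deny,Allow
        Deny from all
        Allow from env=identified_agent
    </Directory>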

BTW, anyone know how many robots there are loose on the net?
I've had over 150 different ones pick up robots.txt in the last 3 months.


rob