Posted to modperl@perl.apache.org by Bill Moseley <mo...@hank.org> on 2000/11/05 01:39:10 UTC

Dealing with spiders

This is slightly OT, but any solution I use will be mod_perl, of course.

I'm wondering how people deal with spiders.  I don't mind being spidered as
long as it's a well behaved spider and follows robots.txt.  And at this
point I'm not concerned with the load spiders put on the server (and I know
there are modules for dealing with load issues).

But it's amazing how many are just lame in that they take perfectly good
HREF tags and mess them up in the request.  For example, every day I see
many requests from Novell's BorderManager where they forgot to convert HTML
entities in HREFs before making the request.

Here's another example:

64.3.57.99 - "-" [04/Nov/2000:04:36:22 -0800] "GET /../../../ HTTP/1.0" 400
265 "-" "Microsoft Internet Explorer/4.40.426 (Windows 95)" 5740

In the last day that IP has requested about 10,000 documents.  Over half
were 404 requests where some 404s were non-converted entities from HREFs,
but most were just for documents that do not and have never existed on this
site.  Almost 1000 requests were 400s (Bad Request, like the example above).
And I'd guess that's not really the correct user agent, either....

In general, what I'm interested in stopping are the thousands of requests
for documents that just don't exist on the site.  And to simply block the
lame ones, since they are, well, lame.

Anyway, what do you do with spiders like this, if anything?  Is it even an
issue that you deal with?

Do you use any automated methods to detect spiders, and perhaps block the
lame ones?  I wouldn't want to track every IP, but it seems like I could do
well just by looking at IPs that have a high proportion of 404s to 200s and
304s and that have been requesting over a long period of time, or very
frequently.

The reason I'm asking is that I was asked about all the 404s in the web
usage reports.  I know I could post-process the logs before running the web
reports, but it would be much more fun to use mod_perl to catch and block
them on the fly.
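
To make that concrete, here is a rough, untested sketch of the kind of thing
I mean (the Apache::SpiderWatch name, the thresholds, and the "status >= 400
is bad" test are all invented, and since %SEEN lives in each Apache child the
counts are per-process -- a DBM file or shared cache would be needed for a
true server-wide view):

    package Apache::SpiderWatch;
    use strict;
    use Apache::Constants qw(OK FORBIDDEN);

    my %SEEN;              # remote_ip => { good => n, bad => n }
    my $MIN_HITS  = 200;   # don't judge an IP before this many requests
    my $MAX_RATIO = 0.5;   # block once over half of its requests are bad

    # PerlAccessHandler: refuse IPs that have racked up too many bad requests
    sub handler {
        my $r  = shift;
        my $ip = $r->connection->remote_ip;
        my $s  = $SEEN{$ip} or return OK;
        my $total = ($s->{good} || 0) + ($s->{bad} || 0);
        return FORBIDDEN
            if $total >= $MIN_HITS and ($s->{bad} || 0) / $total > $MAX_RATIO;
        return OK;
    }

    # PerlLogHandler: tally how each request turned out
    sub log_handler {
        my $r  = shift;
        my $ip = $r->connection->remote_ip;
        if ($r->status >= 400) { $SEEN{$ip}{bad}++  }
        else                   { $SEEN{$ip}{good}++ }
        return OK;
    }
    1;

with something like this in httpd.conf:

    PerlModule        Apache::SpiderWatch
    PerlAccessHandler Apache::SpiderWatch
    PerlLogHandler    Apache::SpiderWatch::log_handler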

BTW -- I have blocked spiders on the fly before -- I used to have a decoy
in robots.txt that, if followed, would add that IP to the blocked list.  It
was interesting to see one spider get caught by that trick because it took
thousands and thousands of 403 errors before that spider got a clue that it
was blocked on every request.
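
For anyone curious, the trap itself fits in a few lines.  This is a
from-memory sketch rather than the original code; the /spider-trap URL and
the DB file path are invented, and there is no locking around the DBM file:

    # robots.txt advertises a path that nothing ever links to:
    #
    #   User-agent: *
    #   Disallow: /spider-trap
    #
    # so only a robot that reads robots.txt and then ignores it (or probes
    # the disallowed paths on purpose) should ever request it.
    package Apache::SpiderTrap;
    use strict;
    use Apache::Constants qw(OK FORBIDDEN);
    use DB_File;

    my $DB = '/usr/local/apache/var/spider-trap.db';   # invented path

    # PerlAccessHandler for the whole server
    sub handler {
        my $r  = shift;
        my $ip = $r->connection->remote_ip;

        my %bad;
        tie %bad, 'DB_File', $DB or return OK;   # fail open if the DB won't open

        if ($r->uri eq '/spider-trap') {
            $bad{$ip} = time;                    # walked into the trap
            $r->log_error("spider trap sprung by $ip");
        }
        my $blocked = exists $bad{$ip};
        untie %bad;

        return $blocked ? FORBIDDEN : OK;
    }
    1;

installed with a plain "PerlAccessHandler Apache::SpiderTrap" in httpd.conf.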

Thanks,


Bill Moseley
mailto:moseley@hank.org

Re: Dealing with spiders

Posted by Christoph Wernli <cw...@dwc.ch>.
Bill Moseley wrote:
> 
> At 03:29 PM 11/10/00 +0100, Marko van der Puil wrote:
> >What we could do as a community is create spiderlawenforcement.org,
> >a centralized database where we keep track of spiders and how they
> >index our sites.
> 
> At this point, I'd just like to figure out how to detect them
> programmatically.  It seems easy to spot them as a human looking through
> the logs, but less so with a program.  Some spiders fake the user agent.

Randal wrote Stonehenge::Throttle to deal with this - the relevant article is here:
<ht...@halfdome.holdit.com>

Cheers,

-Christoph

-- 
Let's say the docs present a simplified view of reality...    :-)
             -- Larry Wall in  <69...@jpl-devvax.JPL.NASA.GOV>

Re: Dealing with spiders

Posted by Bill Moseley <mo...@hank.org>.
At 03:29 PM 11/10/00 +0100, Marko van der Puil wrote:
>What we could do as a community is create spiderlawenforcement.org,
>a centralized database where we keep track of spiders and how they
>index our sites.

It comes up weekly, but it hasn't become that much of a problem yet.  The
bad spiders could just change IPs and user agent strings, too.

Yesterday I had 12,000 requests from a spider, but the spider added a slash
to the end of every query string, so over 11,000 were invalid requests --
yet the Apache log showed them as 200s (only the application knew they were
bad requests).
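
Since the application is the only thing that knows, one option is to have it
hand back a real 404 when the query string is mangled, so the access log
tells the truth.  A hedged sketch, assuming the handler parses the arguments
itself:

    use Apache::Constants qw(NOT_FOUND);

    my %args = $r->args;                  # parsed query string
    if (grep { m{/$} } values %args) {    # trailing slash glued on by the spider
        return NOT_FOUND;                 # logged as a 404 instead of a 200
    }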

At this point, I'd just like to figure out how to detect them
programmatically.  It seems easy to spot them as a human looking through
the logs, but less so with a program.  Some spiders fake the user agent.

It probably makes sense to run a cron job every few minutes to scan the
logs and write out a file of bad IP numbers, and use mod_perl to reload the
list of IPs to block every 100 requests or so.  I could look for lots of
requests from the same IP with a really high ratio of bad requests to good
ones.  But I'm sure it wouldn't be long before an AOL proxy got blocked.
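
The mod_perl half could be as small as this untested sketch
(Apache::BlockList and the file path are invented, and the counter is per
child, so "every 100 requests" really means every 100 requests that each
child handles):

    package Apache::BlockList;
    use strict;
    use Apache::Constants qw(OK FORBIDDEN);

    my $FILE = '/usr/local/apache/var/bad-ips.txt';   # one IP per line, from cron
    my %BLOCKED;
    my $count = 0;

    # PerlAccessHandler
    sub handler {
        my $r = shift;

        # re-read the cron-written list every 100 requests (per child)
        if ($count++ % 100 == 0) {
            if (open my $fh, '<', $FILE) {
                %BLOCKED = map { chomp; $_ => 1 } <$fh>;
                close $fh;
            }
        }
        return exists $BLOCKED{ $r->connection->remote_ip } ? FORBIDDEN : OK;
    }
    1;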

Again, the hard part is finding a good way to detect them...

And in my experience blocking doesn't always mean the requests from that
spider stop coming ;)




Bill Moseley
mailto:moseley@hank.org

Re: Dealing with spiders

Posted by ___cliff rayman___ <cl...@genwax.com>.
Robin Berjon wrote:

> But on a related issue, I got several logfiles corrupted because I log
> user-agents there, and some of them seem to use unicode names that confuse
> Apache and get converted to \n. Does anyone else have this problem? I don't
> think it could lead to a server compromise, but it's never pleasant to
> have corrupted logs...

If this is true, I would file an Apache bug report.

--
___cliff rayman___cliff@genwax.com___http://www.genwax.com/



Re: Dealing with spiders

Posted by Robin Berjon <ro...@knowscape.com>.
At 10:46 06/11/2000 -0800, ___cliff rayman___ wrote:
>> 64.3.57.99 - "-" [04/Nov/2000:04:36:22 -0800] "GET /../../../ HTTP/1.0" 400
>> 265 "-" "Microsoft Internet Explorer/4.40.426 (Windows 95)" 5740
>
>I don't think you have a lame spider here.  I think you have a hacker
>trying to hack your server.

That may not be a spider, but Bill's problem is a real one; I get a lot of
awful stuff from bots. A lot of it is due to bots not understanding
relative URLs properly and requesting lots of nonexistent documents in
under a minute before giving up.

But on a related issue, I got several logfiles corrupted because I log
user-agents there, and some of them seem to use unicode names that confuse
Apache and get converted to \n. Does anyone else have this problem? I don't
think it could lead to a server compromise, but it's never pleasant to
have corrupted logs...
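
A possible workaround would be to scrub the header before anything logs it --
this is an untested sketch with an invented package name, and whether it
actually catches the unicode case is a guess:

    package Apache::CleanUA;
    use strict;
    use Apache::Constants qw(DECLINED);

    # PerlPostReadRequestHandler: replace control characters in the incoming
    # User-Agent header with spaces, so whatever mod_log_config writes with
    # %{User-Agent}i stays on one line.
    sub handler {
        my $r  = shift;
        my $ua = $r->header_in('User-Agent');
        if (defined $ua and $ua =~ tr/\x00-\x1f\x7f//) {
            $ua =~ tr/\x00-\x1f\x7f/ /;
            $r->header_in('User-Agent' => $ua);
        }
        return DECLINED;   # carry on with the request as usual
    }
    1;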

-- robin b.
Mathematicians often resort to something called Hilbert space, which is
described as being n-dimensional.  Like modern sex, any number can play. --
James Blish, the Quincunx of Time


Re: Dealing with spiders

Posted by ___cliff rayman___ <cl...@genwax.com>.
Bill Moseley wrote:

> But it's amazing how many are just lame in that they take perfectly good
> HREF tags and mess them up in the request.  For example, every day I see
> many requests from Novell's BorderManager where they forgot to convert HTML
> entities in HREFs before making the request.
>
> Here's another example:
>
> 64.3.57.99 - "-" [04/Nov/2000:04:36:22 -0800] "GET /../../../ HTTP/1.0" 400
> 265 "-" "Microsoft Internet Explorer/4.40.426 (Windows 95)" 5740

I don't think you have a lame spider here.  I think you have a hacker trying
to hack your server.

>
>
> In the last day that IP has requested about 10,000 documents.  Over half
> were 404 requests where some 404s were non-converted entities from HREFs,
> but most were just for documents that do not and have never existed on this
> site.  Almost 1000 requests were 400s (Bad Request, like the example above).
> And I'd guess that's not really the correct user agent, either....

There is a current exploit for non-converted entities on Microsoft IIS.  Maybe
they're trying it out on your Apache for some reason.

>
>
> In general, what I'm interested in stopping are the thousands of requests
> for documents that just don't exist on the site.  And to simply block the
> lame ones, since they are, well, lame.

Perhaps you can run a cron job that scans your logs, identifies lame spiders
and/or hackers, and adds a rule to ipchains (assuming Linux 2.2.??) to deny
access from that IP to your server.  I understand that PortSentry does this
trick when it determines that an IP is scanning for open ports.
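
A rough sketch of that kind of cron job (the log path, thresholds, and
log-format regex are assumptions, and it naively re-adds rules on every run):

    #!/usr/bin/perl -w
    # tally status codes per IP from a common/combined-format access_log and
    # hand the worst offenders to ipchains (Linux 2.2); must run as root.
    use strict;

    my $LOG = '/usr/local/apache/logs/access_log';
    my (%bad, %total);

    open LOG, $LOG or die "can't read $LOG: $!";
    while (<LOG>) {
        # ip ... "request" status bytes ...
        my ($ip, $status) = /^(\S+) .*" (\d{3}) / or next;
        $total{$ip}++;
        $bad{$ip}++ if $status == 400 or $status == 404;
    }
    close LOG;

    for my $ip (keys %total) {
        next unless $total{$ip} > 500
            and ($bad{$ip} || 0) / $total{$ip} > 0.5;
        system 'ipchains', '-A', 'input', '-s', $ip, '-j', 'DENY';
    }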

hth,

--
___cliff rayman___cliff@genwax.com___http://www.genwax.com/


Re: Dealing with spiders

Posted by Jimi Thompson <jt...@link.com>.
I vote for that!  It would make my life about 5000 times simpler :)

Marko van der Puil wrote:

> [snip]
>
> What we could do as a community is create spiderlawenforcement.org, a centralized
> database where we keep track of spiders and how they index our sites. We could
> build a database of spiders indexed by Agent tag, those following robots.txt and
> those that explicitly exploit it, or blacklist some by IP if they keep breaking
> the rules.
>
> [snip]

--
Jimi Thompson
Web Master
L3 communications

"It's the same thing we do every night, Pinky."


Re: Dealing with spiders

Posted by Marko van der Puil <ma...@renesse.com>.
Hi,

I had the same thing; sometimes the spiders are programmed VERY sloppily. I
had a site that responded to ANY request made to its location. The majority
of spiders do not understand single and double quotes, or HREFs with the
quotes left out entirely. I also understand that absolute href="/bla" and
relative href="../bla" are a problem.

Those spiders would simply start getting URLs like

GET /foo/file=1243/date=12-30-2000/name=foobar'/foo/file=1243/date=12-30-2000/name=foobar
or
GET ../bla'
or
GET ../bla/'../bla'../bla'

and so on...

Then that request would generate a page with a load of faulty links that
would also be followed, since all the HREFs got built from the data in the
requested URL.

Then other spiders got those faulty links from each other, and soon I got more
traffic from spiders trying to index faulty links than from regular visitors. :)

What I did was to check the input for a particular URL and see if it was
correct (should have done that in the first place). Then I 404'ed the
bastards.... I am now redirecting them to the main page, which looks nicer in
your logs too. Plus the spider might be tempted to spider your page regularly
(most spiders drop redirects). You could also just return a plaintext OK --
lots of nice 200s in your stats...
Another solution I have seen is returning a doorway page to your site (search
engine SPAM!). That's hitting them back where it hurts. :)
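
Going back to the URL check, a rough sketch of it as a mod_perl handler (the
"sane URL" pattern here is just my example from above -- the real pattern is
obviously application-specific):

    use Apache::Constants qw(OK REDIRECT);

    sub handler {
        my $r = shift;

        # accept only /foo/file=NNNN/date=MM-DD-YYYY/name=word; anything else
        # (stray quotes, doubled paths, ../ games) gets bounced to the front
        # page, which most spiders will simply drop.
        unless ($r->uri =~ m{^/foo/file=\d+/date=\d\d-\d\d-\d{4}/name=\w+$}) {
            $r->header_out(Location => '/');
            return REDIRECT;
        }

        # ... normal page generation here ...
        return OK;
    }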

I've made remarks about this to the owners of those spiders (Excite/AltaVista),
but I have had no satisfactory responses from them.

What we could do as a community is create spiderlawenforcement.org, a centralized
database where we keep track of spiders and how they index our sites. We could
build a database of spiders indexed by Agent tag, those following robots.txt and
those that explicitly exploit it, or blacklist some by IP if they keep breaking
the rules. Lots of developers could use this database to block those nasty sons
of.... er, well, sons of spiders I suppose. All open-sourced of course, the data
available for free, and some Perl modules to access the db. Send an email to the
administrator of the spider every time a spider tries a bad link on a member
site, and watch how fast they'll fix the bl**dy things!

Let me know if any of you are interested in such a thing.



Bill Moseley wrote:

> This is slightly OT, but any solution I use will be mod_perl, of course.
>
> I'm wondering how people deal with spiders.  I don't mind being spidered as
> long as it's a well behaved spider and follows robots.txt.
>
> [snip]

--
Yours sincerely,
Met vriendelijke groeten,


Marko van der Puil http://www.renesse.com
   marko@renesse.com