You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@httpd.apache.org by André Warnier <aw...@ice-sa.com> on 2013/04/30 12:03:28 UTC

URL scanning by bots

Dear Apache developers,

This is a suggestion relative to the code of the Apache httpd webserver, and a possible
default new default option in the standard distribution of Apache httpd.
It also touches on WWW security, which is why I felt that it belongs on this list, rather
than on the general user's list. Please correct me if I am mistaken.

According to Netcraft, there are currently some 600 Million webservers on the WWW, with
more than 60% of those identified as "Apache".
I currently administer about 25 Apache httpd/Tomcat of these webservers, not remarkable in
any way (business applications for medium-sized companies).
In the logs of these servers, every day, there are episodes like the following :

209.212.145.91 - - [03/Apr/2013:00:52:32 +0200] "GET /muieblackcat HTTP/1.1" 404 362 "-" "-"
209.212.145.91 - - [03/Apr/2013:00:52:36 +0200] "GET //admin/index.php HTTP/1.1" 404 365
"-" "-"
209.212.145.91 - - [03/Apr/2013:00:52:36 +0200] "GET //admin/pma/index.php HTTP/1.1" 404
369 "-" "-"
209.212.145.91 - - [03/Apr/2013:00:52:36 +0200] "GET //admin/phpmyadmin/index.php
HTTP/1.1" 404 376 "-" "-"
209.212.145.91 - - [03/Apr/2013:00:52:37 +0200] "GET //db/index.php HTTP/1.1" 404 362 "-" "-"
209.212.145.91 - - [03/Apr/2013:00:52:37 +0200] "GET //dbadmin/index.php HTTP/1.1" 404 367
"-" "-"
... etc..

Such lines are the telltale trace of a "URL-scanning bot" or of the "URL-scanning" part of
a bot, and I am sure that you are all familiar with them. Obviously, these bots are
trying to find webservers which exhibit poorly-designed or poorly-configured applications,
with the aim of identifying hosts which can be submitted to various kinds of attacks, for
various purposes. As far as I can tell from my own unremarkable servers, I would surmise
that many or most webservers facing the Internet are submitted to this type of scan every
day.

Hopefully, most webservers are not really vulnerable to this type of scan.
But the fact is that *these scans are happening, every day, on millions of webservers*.
And they are at least a nuisance, and at worst a serious security problem when, as a
result of poorly configured webservers or applications, they lead to break-ins and
compromised systems.

It is basically a numbers game, like malicious emails : it costs very little to do this,
and if even a tiny proportion of webservers exhibit one of these vulnerabilities, because
of the numbers involved, it is worth doing it.
If there are 600 Million webservers, and 50% of them are scanned every day, and 0.01% of
these webservers are vulnerable because of one of these URLs, then it means that every
day, 30,000 (600,000,000 x 0.5 x 0.0001) vulnerable servers will be identified.

About the "cost" aspect : from the data in my own logs, such bots seem to be scanning
about 20-30 URLs per pass, at a rate of about 3-4 URLs per second.
Since it is taking my Apache httpd servers approximately 10 ms on average to respond (by a
404 Not Found) to one of these requests, and they only request 1 URL per 250 ms, I would
imagine that these bots have some built-in rate-limiting mechanism, to avoid being
"caught" by various webserver-protection tools. Maybe also they are smart, and scan
several servers in parallel, so as to limit the rate at which they "burden" any server in
particular. (In this rough calculation, I am ignoring network latency for now).

So if we imagine a smart bot which is scanning 10 servers in parallel, issuing 4 requests
per second to each of them, for a total of 20 URLs per server, and we assume that all
these requests result in 404 responses with an average response time of 10 ms, then it
"costs" this bot only about 2 seconds to complete the scan of 10 servers.
If there are 300 Million servers to scan, then the total cost for scanning all the
servers, by any number of such bots working cooperatively, is an aggregated 60 Million
seconds. And if one of such "botnets" has 10,000 bots, that boils down to only 6,000
seconds per bot.

Scary, that 50% of all Internet webservers can be scanned for vulnerabilities in less than
2 hours, and that such a scan may result in "harvesting" several thousand hosts,
candidates for takeover.

Now, how about making it so that without any special configuration or add-on software or
skills on the part of webserver administrators, it would cost these same bots *about 100
times as long (several days)* to do their scan ?

The only cost would a relatively small change to the Apache webservers, which is what my
suggestion consists of : adding a variable delay (say between 100 ms and 2000 ms) to any
404 response.

The suggestion is based on the observation that there is a dichotomy between this kind of
access by bots, and the kind of access made by legitimate HTTP users/clients : legitimate
users/clients (including the "good bots") are accessing mostly links "which work", so they
rarely get "404 Not Found" responses. Malicious URL-scanning bots on the other hand, by
the very nature of what they are scanning for, are getting many "404 Not Found" responses.

As a general idea thus, anything which impacts the delay to obtain a 404 response, should
impact these bots much more than it impacts legitimate users/clients.

How much ?

Let us imagine for a moment that this suggestion is implemented in the Apache webservers,
and is enabled in the default configuration. And let's imagine that after a while, 20% of
the Apache webservers deployed on the Internet have this feature enabled, and are now
delaying any 404 response by an average of 1000 ms.
And let's re-use the numbers above, and redo the calculation.
The same "botnet" of 10,000 bots is thus still scanning 300 Million webservers, each bot
scanning 10 servers at a time for 20 URLs per server. Previously, this took about 6000
seconds.
However now, instead of an average delay of 10 ms to obtain a 404 response, in 20% of the
cases (60 Million webservers) they will experience an average 1000 ms additional delay per
URL scanned.
This adds (60,000,000 / 10 * 20 URLs * 1000 ms) 120,000,000 seconds to the scan.
Divided by 10,000 bots, this is 12,000 additional seconds per bot (roughly 3 1/2 hours).

So with a small change to the code, no add-ons, no special configuration skills on the
part of the webserver administrator, no firewalls, no filtering, no need for updates to
any list of URLs or bot characteristics, little inconvenience to legitimate users/clients,
and a very partial adoption over time, it seems that this scheme could more than double
the cost for bots to acquire the same number of targets. Or, seen another way, it could
more than halve the number of webservers being scanned every day.

I know that this is a hard sell. The basic idea sounds a bit too simple to be effective.
It will not kill the bots, and it will not stop the bots from scanning Internet servers in
other ways that they use. It does not miraculously protect any single server against such
scans, and the benefit of any one server implementing this is diluted over all webservers
on the Internet.
But it is also not meant as an absolute weapon. It is targeted specifically at a
particular type of scan done by a particular type of bot for a particular purpose, and is
is just a scheme to make this more expensive for them. It may or may not discourage these
bots from continuing with this type of scan (if it does, that would be a very big result).
But at the same time, compared to any other kind of tool that can be used against these
scans, this one seems really cheap to implement, it does not seem to be easy to
circumvent, and it seems to have at least a potential of bringing big benefits to the WWW
at large.

If there are reasonable objections to it, I am quite prepared to accept that, and drop it.
I have already floated the idea in a couple of other places, and gotten what could be
described as "tepid" responses. But it seems to me that most of the negative-leaning
responses which I received so far, were more of the a-priori "it will never work" kind,
rather than real objections based on real facts.

So my hope here is that someone has the patience to read through this, and would have the
additional patience to examine the idea "professionally".

Re: URL scanning by bots

Posted by Yehuda Katz <ye...@ymkatz.net>.

On Tuesday, April 30, 2013, Christian Folini wrote:

> But you can try it out for yourself easily with
> 2-3 ModSecurity rules and the "pause" directive.
>
Someone suggested the same idea to me and I tried it out on one of my
servers by setting PHP as the 404 handler and having it loop there. (which
saves you the trouble of setting up mod_security if you already have PHP).
I noticed increased server load and no decrease in bot requests.
- Y

Re: URL scanning by bots

Posted by Christian Folini <ch...@netnea.com>.

On Fri, May 03, 2013 at 09:39:44AM +1000, Noel Butler wrote:
> > real-time blacklist lookup (-> ModSecurity's @rbl operator).
> 
> Try using that on busy servers (webhosts/ISP's)... might be fine for a
> SOHO, but in a larger commercial world, forget it, the impact is  far
> far worse than the other suggestions.

Certainly. But if we run 100% https anyways, enable a local dns cache
or even cache the results within apache, would it still be as
dangerous? So far my answer has been yes. But I would be interested
to hear a response from somebody who was crazy enough to enable it.

regs,

Christian

-- 
Complexity is the worst enemy of security, and the Internet -- 
and the computers and processes connected to it -- is getting
more complex all the time.
-- Bruce Schneier

Re: URL scanning by bots

Posted by Noel Butler <no...@ausics.net>.

On Wed, 2013-05-01 at 21:15 +0200, Christian Folini wrote:


> real-time blacklist lookup (-> ModSecurity's @rbl operator).


Try using that on busy servers (webhosts/ISP's)... might be fine for a
SOHO, but in a larger commercial world, forget it, the impact is  far
far worse than the other suggestions.

Re: URL scanning by bots

Posted by Reindl Harald <h....@thelounge.net>.


Am 02.05.2013 10:22, schrieb André Warnier:
> These tools must be downloaded separately, installed, configured and maintained, all by 
> someone who knows what he's doing. And this means that, in the end (and as the evidence 
> shows), only a tiny minority of webservers on the Internet will effectively set up one 
> of those, and the vast majority of webservers will not

FINE and this should stay as it is

if you make such things default sooner or later only a few people
are knowing what they are doing - this is a bad attitude!

if i need this useless protection i enable it
since it does not protect me from anything i do not need it
in most cases it would only waste ressources

if there is no vulnerable application this doe snot protect
you from anything and if there is a vulnerable webapp and
you believe this would protect youby obsucrity you have not
learned the lessons of the last few years

Re: URL scanning by bots

Posted by Reindl Harald <h....@thelounge.net>.

Am 03.05.2013 06:35, schrieb Ben Reser:
> On Thu, May 2, 2013 at 4:53 PM, Guenter Knauf <fu...@apache.org> wrote:
>> isnt that one of the core issues - that folks who dont know what they do run
>> a webserver? And then, shouldnt these get punished with being hacked so that
>> they try to learn and finally *know* what they do, and do it right next
>> time? ;-)
> 
> I have to say this is a horrible attitude.  Nobody should be
> advocating that a lack of knowledge should result in getting hacked

after many years in this business: people only learn if it hurts

Re: URL scanning by bots

Posted by Ben Reser <be...@reser.org>.

On Thu, May 2, 2013 at 4:53 PM, Guenter Knauf <fu...@apache.org> wrote:
> isnt that one of the core issues - that folks who dont know what they do run
> a webserver? And then, shouldnt these get punished with being hacked so that
> they try to learn and finally *know* what they do, and do it right next
> time? ;-)

I have to say this is a horrible attitude.  Nobody should be
advocating that a lack of knowledge should result in getting hacked.
Yes, it's the obvious result of the situation that exists today.  But
we should be striving to make things as easy for the inexperienced
user as possible.

Just the other day I saw someone saying pretty much that unless you're
an Apache httpd expert you should just use nginx.  That if you aren't
an expert then you won't have Apache configured properly and when you
get a heavy load your website will fall over.

It's just disappointing to see this attitude coming from a httpd PMC member.

Re: URL scanning by bots

Posted by "William A. Rowe Jr." <wr...@rowe-clan.net>.

On Fri, 03 May 2013 01:53:01 +0200
Guenter Knauf <fu...@apache.org> wrote:

> On 02.05.2013 10:22, André Warnier wrote:
> >
> > But I am a bit at a loss as to what to do next.  I could easily
> > enough install such a change on my own servers (they are all
> > running mod_perl). But then, if it shows that the bots do slow down
> > on my servers or avoid them, it still doesn't quite provide enough
> > evidence to prove that this would benefit the Internet at large,
> > does it ?

> No. But you wrote above that its not your intention to protect
> yourself and your servers, but more that you want to cure the world
> and enable to run webservers by 'folks who dont know what they do',
> or???

I like this meme.  The browser world learned long ago to protect their
users from themselves, to the point where the Goog went overboard and
set the do-not-track as a default option for a time.

We seem to have a far less forgiving attitude.  True that most of our
users are competent professionals or dedicated hobbyists who want to
run servers, not our home PC's/netbooks/tablets/phones.  But, all are
not equally experienced; if you have either had to hand off admin of
a significant farm or pick up a poorly set up cluster of servers, you
have to consider that even the most skilled admins would appreciate
that our defaults are sane.

This may not be the right proposal (I'm thinking a combo of honeypots
and iptables on-the-fly mods would serve the same purpose, better),
but let's not dismiss out of hand what seems simple or obvious to any
experienced sysadmin, only to have journeymen fall into whatever traps
we lay for them as defaults.

Re: URL scanning by bots

Posted by Guenter Knauf <fu...@apache.org>.

André,
On 02.05.2013 10:22, André Warnier wrote:
> I'd like to say that I do agree with you, in that there are already many
> tools to help defend one's servers against such scans, and against more
> targeted attacks.
> I have absolutely nothing /against/ these tools, and indeed installing
> and configuring such tools on a majority of webservers would do much
> more for Internet security in general, than my proposal ever would.
>
> But at the same time, there is the rub, as you say yourself : "All that
> is missing is enough people configuring their servers as you propose."
>
> These tools must be downloaded separately, installed, configured and
> maintained, all by someone who knows what he's doing.
isnt that one of the core issues - that folks who dont know what they do 
run a webserver? And then, shouldnt these get punished with being hacked 
so that they try to learn and finally *know* what they do, and do it 
right next time? ;-)

> And this means that, in the end (and as the evidence shows), only a tiny
> minority of webservers on the Internet will effectively set up one of
> those, and the vast majority of webservers will not.
> And among the millions of webservers that don't, there will be enough
> candidates for break-in to justify these URL scans, because URL-scanning
> at this moment is cheap and really fast.
>
> In contrast, my proposal is so simple from an Apache user point of view,
> that I believe that it could spread widely, without any other measure
> than configuring it by default in the default Apache distribution (and
> be easily turned off by whoever decides he doesn't want it).
>
> If my purpose was merely to shield my own servers, then I would not
> spend so much time trying to defend the merits of this proposal.
> Instead, I would install one of these tools and be done with it.  I am
> not doing it, because on the one hand my servers - as far as I know of
> course - do not exhibit any of these flaws which they are scanning for,
> and on the other hand because these traces in the logs provide me with
> information about how they work.
>
> I apologise if I repeat myself, and if I am perceived as "hot" sometimes.
> It may be because of a modicum of despair.  I don't know what I was
> expecting as a reaction to this proposal, but I am disappointed - maybe
> wrongly so.
> I was ready for criticism of the proposal, or for someone proving me
> wrong, on the base of real facts or calculations.  But what I am mostly
> seeing so far, are objections apparently based on a-priori opinions
> which my own factual observations show to be false.
> Not only my own though : the couple of people here who have contributed
> based on a real experience with real servers, do not seem to contradict
> my own findings.  So I am not totally despairing yet.
>
> But I am a bit at a loss as to what to do next.  I could easily enough
> install such a change on my own servers (they are all running mod_perl).
> But then, if it shows that the bots do slow down on my servers or avoid
> them, it still doesn't quite provide enough evidence to prove that this
> would benefit the Internet at large, does it ?
No. But you wrote above that its not your intention to protect yourself 
and your servers, but more that you want to cure the world and enable to 
run webservers by 'folks who dont know what they do', or???

Ok, 1st lets again assume that you really get here enough httpd 
developers who support your idea, and we get finally such functionality 
into httpd, and lets also assume the even more unlikely case that you 
get us to make this the default - but what do you expect when this will 
happen? *If* this really happens then this would go into trunk, that 
means unreleased code. Currently we have two maintained release 
branches, that are 2.2.x and 2.4.x, where the 1st 2.4.x release happened 
about 15 months ago. I dont know numbers, but I assume that currently 
after 15 months only 1 or 2 % have upgraded to 2.4.x, and the vast 
majority is still running the 2.2.x releases, or even still 2.0.x.
Maybe that within the next 9 months another 2-3% will upgrade, so that 
we have probably 5% running latest version 2 years after 1st release.
Now lets further assume that the httpd project decides to release a 
2.6.x within these next 9 months (which seems very unlikely, but who 
knows ...) which contains your slow-down code, and now imagine self how 
long from now on it would take until your slow-down code would be in use 
by default of at least 10%? 3 years, 4 years, ....?? When would it show 
up an effect? After 5 years? And are the bots in 5 years still the same 
as nowadays?
Furthermore, unless we are forced by security issues, there's no reason 
to break our policies and backport such a feature to the 2.4.y and 2.2.x 
branches ....
yeah, now I imagine that I did you totally disappoint with the above, 
but hey, thats the reality how things work with the httpd project ...

Ok, now let me throw in some things which I can think of what you still 
can do in order to make the world's internet better within the next 5 
years ... :-P

Instead of returning the 404s give them what they ask for; f.e. write a 
script which scans your logs and filter for those 404s, and within a few 
days you should have a nice list of those bot URLs; let the script 
automatically write/update a proxy config and proxy the bot requests to 
another dedicated test host, and make there your plays this them ...
send them an index page with a PHP script and let it return the bytes 
very slow; or do it quick and see what comes next, etc. etc.
mainly try to study the bots so that you really can predict how they do 
things, and how they behave next to such things you suggest;
and even more interesting is to analyse the host which runs the bot: is 
it perhaps vulnerable? f.e. what happens if you send them a 1 GB index 
page back? perhaps they get buffer overflows? is perhaps the backdoor 
for the bot control vulnerable?

All that reminds me of a worm attack which happened some years ago;
IIRC it was Nimda, and after studying the beast a bit I was finally able 
to understand the weakness of the worm, and able to strike back (I know, 
this is illegal, and I dont recommend it - I just want to mention whats 
possible if you want to make the inet better ...);
in order to give the botmaster control over the infected hosts the worm 
1st installed a backdoor, and that backdoor was then also the door to 
finally stop the beast and kill it: it was possible to take over control 
of the infected host, remove the worm code, and then close the backdoor ...
Isnt this the real thing you want to do?

Gün.

Re: URL scanning by bots

Posted by André Warnier <aw...@ice-sa.com>.

Christian Folini wrote:
> André,
> 
> On Wed, May 01, 2013 at 02:47:55AM +0200, André Warnier wrote:
>> With respect, I think that you misunderstood the purpose of the proposal.
>> It is not a protection mechanism for any server in particular.
>> And installing the delay on one server is not going to achieve much.
> 
> In fact I did understand the purpose, but I wanted to get
> my point across without writing a lengthy message on the
> merits and flaws of your theory.
> 
> My point is: ModSecurity has all you need to do this
> right now. All that is missing is enough people configuring
> their servers as you propose.
> 
> Like many others, I do not think this will work. If it really
> bothers you (and your bandwidth), then I would try and use a 
> real-time blacklist lookup (-> ModSecurity's @rbl operator).
> Given the work of the spam defenders these blacklist should
> contain the ipaddresses of the scanning bots as well.
> I do not have this configured, but I would be really
> interested to see the effect on average load, connection
> use and number of scanning attempts on a server.
> 
> Interesting discussion by the way. Maybe a bit hot, though.
> 

Hi.
Thank you for this "cool" contribution.

I'd like to say that I do agree with you, in that there are already many tools to help 
defend one's servers against such scans, and against more targeted attacks.
I have absolutely nothing /against/ these tools, and indeed installing and configuring 
such tools on a majority of webservers would do much more for Internet security in 
general, than my proposal ever would.

But at the same time, there is the rub, as you say yourself : "All that is missing is 
enough people configuring their servers as you propose."

These tools must be downloaded separately, installed, configured and maintained, all by 
someone who knows what he's doing.
And this means that, in the end (and as the evidence shows), only a tiny minority of 
webservers on the Internet will effectively set up one of those, and the vast majority of 
webservers will not.
And among the millions of webservers that don't, there will be enough candidates for 
break-in to justify these URL scans, because URL-scanning at this moment is cheap and 
really fast.

In contrast, my proposal is so simple from an Apache user point of view, that I believe 
that it could spread widely, without any other measure than configuring it by default in 
the default Apache distribution (and be easily turned off by whoever decides he doesn't 
want it).

If my purpose was merely to shield my own servers, then I would not spend so much time 
trying to defend the merits of this proposal.  Instead, I would install one of these tools 
and be done with it.  I am not doing it, because on the one hand my servers - as far as I 
know of course - do not exhibit any of these flaws which they are scanning for, and on the 
other hand because these traces in the logs provide me with information about how they work.

I apologise if I repeat myself, and if I am perceived as "hot" sometimes.
It may be because of a modicum of despair.  I don't know what I was expecting as a 
reaction to this proposal, but I am disappointed - maybe wrongly so.
I was ready for criticism of the proposal, or for someone proving me wrong, on the base of 
real facts or calculations.  But what I am mostly seeing so far, are objections apparently 
based on a-priori opinions which my own factual observations show to be false.
Not only my own though : the couple of people here who have contributed based on a real 
experience with real servers, do not seem to contradict my own findings.  So I am not 
totally despairing yet.

But I am a bit at a loss as to what to do next.  I could easily enough install such a 
change on my own servers (they are all running mod_perl). But then, if it shows that the 
bots do slow down on my servers or avoid them, it still doesn't quite provide enough 
evidence to prove that this would benefit the Internet at large, does it ?

Does anyone have knowledge of some organisation which could try this out on a sufficient 
number of servers to definitely either prove or disprove the idea ?

Re: URL scanning by bots

Posted by Christian Folini <ch...@netnea.com>.

André,

On Wed, May 01, 2013 at 02:47:55AM +0200, André Warnier wrote:
> With respect, I think that you misunderstood the purpose of the proposal.
> It is not a protection mechanism for any server in particular.
> And installing the delay on one server is not going to achieve much.

In fact I did understand the purpose, but I wanted to get
my point across without writing a lengthy message on the
merits and flaws of your theory.

My point is: ModSecurity has all you need to do this
right now. All that is missing is enough people configuring
their servers as you propose.

Like many others, I do not think this will work. If it really
bothers you (and your bandwidth), then I would try and use a 
real-time blacklist lookup (-> ModSecurity's @rbl operator).
Given the work of the spam defenders these blacklist should
contain the ipaddresses of the scanning bots as well.
I do not have this configured, but I would be really
interested to see the effect on average load, connection
use and number of scanning attempts on a server.

Interesting discussion by the way. Maybe a bit hot, though.

Best,

Christian Folini

-- 
We have to remember that what we observe is not nature herself, but
nature exposed to our method of questioning.  
-- Werner Heisenberg

Re: URL scanning by bots

Posted by Graham Leggett <mi...@sharp.fm>.

On 01 May 2013, at 2:47 AM, André Warnier <aw...@ice-sa.com> wrote:

> With respect, I think that you misunderstood the purpose of the proposal.
> It is not a protection mechanism for any server in particular.
> And installing the delay on one server is not going to achieve much.
> 
> It is something that, if it is installed on enough webservers on the Internet, may slow down the URL-scanning bots (hopefully a lot), and thereby inconvenience their botmasters.

You need to consider the environment that a typical URL scanner runs in, the open internet, which consists of vast swaths of machines that don't exist and hang when connecting, machines that are hidden behind firewalls that look like they don't exist, machines that are on slow internet connections that respond slowly. Bots are already engineered to handle these real world internet conditions, encountering a slow host is just something bots are used to doing anyway are are very unlikely to be slowed down because of it. And if they are slowed down, the bot authors simply treat that problem as a bug and fix it, and continue with what they are doing.

Regards,
Graham
--

Re: URL scanning by bots

Posted by Tom Evans <te...@googlemail.com>.

On Wed, May 1, 2013 at 10:37 AM, Ben Laurie <be...@links.org> wrote:
> So your argument is that extra connections use resources in servers
> but not clients?

I only care about the servers. However, the clients are most likely
constrained by CPU or network. Slowing down all the requests at the
server means that the client has more CPU and network available that
can be used to issue other requests to other servers.

In other words, how fast you respond to the client is largely
irrelevant to how many URLs the client can check per hour - the slower
you respond, the more requests the client can perform simultaneously.

Tom

Re: URL scanning by bots

Posted by Reindl Harald <h....@thelounge.net>.


Am 01.05.2013 11:37, schrieb Ben Laurie:
>> Well, no, actually this is not accurate. You are assuming that these
>> bots are written using blocking io semantics; that if a bot is delayed
>> by 2 seconds when getting a 404 from your server, it is not able to do
>> anything else in those 2 seconds. This is just incorrect.
>> Each bot process could launch multiple requests to multiple unrelated
>> hosts simultaneously, and select whatever ones are available to read
>> from. If you could globally add a delay to bots on all servers in the
>> world, all the bot owner needs to do to maintain the same throughput
>> is to raise the concurrency level of the bot's requests. The bot does
>> the same amount of work in the same amount of time, but now all our
>> servers use extra resources and are slow for clients on 404.
> 
> So your argument is that extra connections use resources in servers
> but not clients?

no - his argument was he has one server crippled down by the proposal
of the OP and the botnet has thousands of clients and only one job
while you have on your server PRIMARY the job to serve clients and
not playing games with ressources

Re: URL scanning by bots

Posted by Ben Laurie <be...@links.org>.

On 1 May 2013 10:19, Tom Evans <te...@googlemail.com> wrote:
> On Wed, May 1, 2013 at 1:47 AM, André Warnier <aw...@ice-sa.com> wrote:
>> Christian Folini wrote:
>>>
>>> Hey André,
>>>
>>> I do not think your protection mechanism is very good (for reasons
>>> mentioned before) But you can try it out for yourself easily with 2-3
>>> ModSecurity rules and the "pause" directive.
>>>
>>> Regs,
>>>
>>> Christian
>>>
>> Hi Christian.
>>
>> With respect, I think that you misunderstood the purpose of the proposal.
>> It is not a protection mechanism for any server in particular.
>> And installing the delay on one server is not going to achieve much.
>>
>
> Putting in any kind of delay means using more resources to deal with
> the same number of requests, even if you use a dedicated 'slow down'
> worker to deal especially just with this.
>
> The truth of the matter is that these sorts of spidering requests are
> irrelevant noise on the internet. It's not a targeted attack, it is
> simply someone looking for easy access to any machine.
>
>> It is something that, if it is installed on enough webservers on the
>> Internet, may slow down the URL-scanning bots (hopefully a lot), and thereby
>> inconvenience their botmasters. Hopefully to the point where they would
>> decide that it is not worth scanning that way anymore.  And if it dos not
>> inconvenience them enough to achieve that, at least it should reduce the
>> effectiveness of these bots, and diminish the number of systems that they
>> can scan over any given time period with the same number of bots.
>>
>
> Well, no, actually this is not accurate. You are assuming that these
> bots are written using blocking io semantics; that if a bot is delayed
> by 2 seconds when getting a 404 from your server, it is not able to do
> anything else in those 2 seconds. This is just incorrect.
> Each bot process could launch multiple requests to multiple unrelated
> hosts simultaneously, and select whatever ones are available to read
> from. If you could globally add a delay to bots on all servers in the
> world, all the bot owner needs to do to maintain the same throughput
> is to raise the concurrency level of the bot's requests. The bot does
> the same amount of work in the same amount of time, but now all our
> servers use extra resources and are slow for clients on 404.

So your argument is that extra connections use resources in servers
but not clients?

>
> Thanks, but no thanks.
>
> Tom

Re: URL scanning by bots

Posted by Stefan Fritsch <sf...@sfritsch.de>.

On Wednesday 01 May 2013, Graham Leggett wrote:
> Of course it might have an effect - the real important question is
> will it have a useful effect.
> 
> A bot that gives up scanning a box that by definition isn't
> vulnerable to that bot (thus the 404) doesn't achieve anything
> useful, the bot failed to infect the host before, it fails to
> infect the host now, nothing has stopped the bot moving to the
> next host and trying it's luck there. Perhaps it does achieve a
> reduction in traffic for you, but that is for you to decide, and
> the tools already exist for you to achieve this.

From my experience, a single bot will often scan for dozens of 
vulnerable php applications. If the delay causes the bot to go away 
before having scanned for all those applications, that may decrease 
the likelihood that forgotten, badly maintained web applications are 
found by bots. This would be a positive effect.

With mpm event, the delay could be done without tying up a worker 
(like mod_dialup). But even with the other mpms, I don't think the 
potential for DoSs would increase iff the 404 delay is kept so low 
that the sum of the delay and the time to read the request headers 
would stay below mod_reqtimeout's header read timeout value.

I would not be against anyone implementing such a delay scheme. But I 
am definitely not volunteering because I am not sure if it would 
actually be worth the effort. And there are other more important 
issues to fix.

Re: URL scanning by bots

Posted by Dirk-Willem van Gulik <Di...@bbc.co.uk>.

I think we're mixing three issues

1) Prevent Starvation.

protecting a server from server side/machine starvation (i.e. running out of file descriptors, sockets, mbuf's, whatever).

So here you are in the domain where there is no argument in terms of protocol violation/bad citizens; there simply
is no resource - some something gotta give.

So in this case - having hard cap's on certain things is helpful. And if for certain setups the current MaxXXX are
not good enough (e.g. to keep a slowlaris in check as per normal modus operandi) - then that needs to be fixed.

IMHO this is something dear to me, as a developer. And it should be fixed 'by default'. I.e. a normal configured
apache should behave sensible in situations like this.

And this is 'alway' the case.

There is a special case of this - and that is of sites which to some extend have shot themselves in the foot
with a 'bad' architecture; where somehow some application/language environment has a tendency to
eat rather a lot of resources; and leave (too little) to the httpd daemon.

2) Dealing with Noise

Generally observing that on a quiet server some 30-60% (speaking for myself here) of your 'load' is caused
by dubious/robotic stuff which does not benefit 'you' as the site owner with a desire to be found/public directly
or even indirectly (e.g. like a webcrawler of a search engine would do).

And this then begs the question - what can I, as a site owner (rather than a developer) configure - even if
that means that I 'break the internet' a bit.

And this is something you decide on a per server basis; for long periods of time.

3) Dealing with Attacks.

Dealing with specific attempts to overload a server one way or the other; with the intention to lock
other users out.

For most of us, and unlike above two examples, a specific situation; and one which probably last
relatively short (days, weeks).

And for which you are willing to do fairly draconian things - even if that significantly breaks normal
internet standard behaviour.

Would be useful to discuss each separately. As I do agree with some of the posters that apache is (no longer)
that strong* in area 1 and 3. And some modernisation may do wonders,

Thanks,

*: i.e. meaning that one tends to do highly specific things when dealing with issues like that; and while
generally (very) effective - they are not part of core apache lore & default config.

Re: URL scanning by bots

Posted by Reindl Harald <h....@thelounge.net>.

Am 01.05.2013 14:00, schrieb Reindl Harald:
> here you have something to read and learn that more and more attacks
> are done this way by exhausting ressources without high bandwith and
> THIS are the real problems server-admins have to fight and not the noise
> you see on your small site
> 
> http://www.slashroot.in/slowloris-http-dosdenial-serviceattack-and-prevention

and if you not realize the context:

* your idea would lead at more open connections
* slowloris protection limits the connections per IP

your idea would lead in make connections from ISP gateways with
a lot of users behind coming from the same IP where most likely
some are part of a botnet fragile

that's the other side of the coin, besides the fact that this open
connectionbs are a huge problem in large enivironments, for the
sevrer itsellf as also for connection tracking on routers and
firewalls if the count would be more than uninteresting noise

Re: URL scanning by bots

Posted by Reindl Harald <h....@thelounge.net>.


Am 01.05.2013 14:09, schrieb Marian Marinov:
> On 05/01/2013 03:00 PM, Reindl Harald wrote:
>> and YES making DOS-attacks easier is treatet as security risk by any
>> professional auditor and there where i work "threat middle" means
>> "fix it or shut down the customers project" and the last  time i got
>> this was by a not visible protection against Slowloris from the view
>> of the security-scanner
>> __________________________________________
>>
>> here you have something to read and learn that more and more attacks
>> are done this way by exhausting ressources without high bandwith and
>> THIS are the real problems server-admins have to fight and not the noise
>> you see on your small site
>>
>> http://www.slashroot.in/slowloris-http-dosdenial-serviceattack-and-prevention
>>
> 
> I have to agree that delaying 'malicious' requests is opening the servers to DoS attacks and SHOULD NOT be the
> default!
> This is not a solution to the problem. In fact what we have done was to automatically disable the delaying during
> excessive usage

and keep in mind that a SELF-DOS happens faster as someone thinks

* server with let say "MaxClients 150"
* hosting 500 domains on this machine (yes this is no problem these days)
* now you can start to calculate how many 404 errors in a specific timeframe are likely
* some of the 500 domains are pages with 50 images
* in this HIGH LIKELY scenario you want to serve ANY connection as fast as possible

i have seen webservers dying way too often

and this starts ALWAYS with load reaching a peak where all your slots are
busy and the amount of client machines which try to connect does not get
lower and after a specific point of load you have no way to survive

load in the context of a webserver means also connections

* a website with 50 images
* the website URL is called in the daily news on a TV sender
* now you have 50000 people which try to open the page more or less at the same time
* this means more then 2.5 Mio requests, you can serve 150 at the same time
* god beware you from 50 of your 150 slots are burried by "slow down the client"

Re: URL scanning by bots

Posted by Marian Marinov <mm...@yuhu.biz>.

On 05/01/2013 03:00 PM, Reindl Harald wrote:
>
>
> Am 01.05.2013 13:51, schrieb André Warnier:
>> There is so far one possible pitfall, which was identified by someone earlier on this list : the fact that delaying
>> 404 responses might have a bad effect on some particular kind of usage by legitimate clients/users.  So far, I
>> believe that such an effect could be mitigated by the fact that this option could be turned off, by any webserver
>> administrator with a modicum of knowledge
>
> do you really not understand it?
>
> anything which bring security risks and makes normal operations more
> fragile MUST NOT be the default behavior of a webserver
>
> and YES making DOS-attacks easier is treatet as security risk by any
> professional auditor and there where i work "threat middle" means
> "fix it or shut down the customers project" and the last  time i got
> this was by a not visible protection against Slowloris from the view
> of the security-scanner
> __________________________________________
>
> here you have something to read and learn that more and more attacks
> are done this way by exhausting ressources without high bandwith and
> THIS are the real problems server-admins have to fight and not the noise
> you see on your small site
>
> http://www.slashroot.in/slowloris-http-dosdenial-serviceattack-and-prevention
>

I have to agree that delaying 'malicious' requests is opening the servers to DoS attacks and SHOULD NOT be the default!
This is not a solution to the problem. In fact what we have done was to automatically disable the delaying during 
excessive usage.

Re: URL scanning by bots

Posted by Reindl Harald <h....@thelounge.net>.


Am 01.05.2013 13:51, schrieb André Warnier:
> There is so far one possible pitfall, which was identified by someone earlier on this list : the fact that delaying
> 404 responses might have a bad effect on some particular kind of usage by legitimate clients/users.  So far, I
> believe that such an effect could be mitigated by the fact that this option could be turned off, by any webserver
> administrator with a modicum of knowledge

do you really not understand it?

anything which bring security risks and makes normal operations more
fragile MUST NOT be the default behavior of a webserver

and YES making DOS-attacks easier is treatet as security risk by any
professional auditor and there where i work "threat middle" means
"fix it or shut down the customers project" and the last  time i got
this was by a not visible protection against Slowloris from the view
of the security-scanner
__________________________________________

here you have something to read and learn that more and more attacks
are done this way by exhausting ressources without high bandwith and
THIS are the real problems server-admins have to fight and not the noise
you see on your small site

http://www.slashroot.in/slowloris-http-dosdenial-serviceattack-and-prevention

Re: URL scanning by bots

Posted by Ben Reser <be...@reser.org>.

On Wed, May 1, 2013 at 7:16 AM, André Warnier <aw...@ice-sa.com> wrote:
> If it tries just one URL per server, and walks off if the response takes
> longer than some pre-determined value, then it all depends on what this
> value is.
> If the value is very small, then it will miss a larger proportion of the
> potential candidates. If the value is larger, then it miss less candidate
> servers, but it will be able to scan comparatively less servers within the
> same period of time.

The question becomes can they still achieve the number of hacked
servers they require in the timeframe they require while ignoring some
proportion of the servers on the Internet.  I think the answer to that
question is clearly yes.  The reason for that is the number of poorly
updated and secured servers is much higher than the number of secure
servers.

If you want the scanning to stop a far more productive effort would be
to try and get the people running vulnerable systems to secure them.
Until this happens there will always be incentive to scan, even if
you're  tar pitting them on some systems.

Re: URL scanning by bots

Posted by André Warnier <aw...@ice-sa.com>.

Graham Leggett wrote:
> On 01 May 2013, at 1:51 PM, André Warnier <aw...@ice-sa.com> wrote:
> 
>> But *based on the actual data and patterns which I can observe on my servers (not guesses), I think it might have an effect*.
> 
> Of course it might have an effect - the real important question is will it have a *useful* effect.
> 
> A bot that gives up scanning a box that by definition isn't vulnerable to that bot (thus the 404) doesn't achieve anything useful, the bot failed to infect the host before, it fails to infect the host now, nothing has stopped the bot moving to the next host and trying it's luck there. Perhaps it does achieve a reduction in traffic for you, but that is for you to decide, and the tools already exist for you to achieve this.
> 

Let me take this line of reasoning "ad absurdum" : the best strategy then for the bot 
would be not to scan at all, and just give up ahead of time, wouldn't it ?

Instead, isn't the logical explanation more like this :

The bot can not give up.  It's very purpose is to identify servers which have 
vulnerabilities that would allow a more targeted attempt at breaking into that server, 
right ?  In order to do that, it /must/ try a number of potentially-vulnerable URLs on 
each server, and it must wait to check how they respond.  It it walks off before waiting 
for the response, it has not achieved its main purpose, because it doesn't know the 
response to its question.

If it tries just one URL per server, and walks off if the response takes longer than some 
pre-determined value, then it all depends on what this value is.
If the value is very small, then it will miss a larger proportion of the potential 
candidates. If the value is larger, then it miss less candidate servers, but it will be 
able to scan comparatively less servers within the same period of time.

> To put this into perspective, Rackspace will give me a midrange virtual server instance with 8GB of RAM for $350-ish per month. If I wanted 10 000 of these, that's a $3.5m dollar a month server bill. Or I could break into and steal access to 10 000 servers in my botnet, some far larger than my 8GB ballpark, and save myself $3.5m per month. Will attempts by sites across the net to slow down my bots convince me to stop? For $3.5m worth of computing power that I am getting for free, I think not.
> 

Ah, but you are disregarding two important factors here:
1) spending 3.5 M$ to rent 10,000 servers is legal, and will not lead you to jail.
If anything, it will probably earn you some nice discount coupons.
In contrast, deploying and running a botnet of 10,000 servers is a criminal activity, and 
can result in a big fine and being put in jail.
If am going to take a certain risk of having to pay millions of $ in fines and damages, 
and spend some time in jail to boot, I would want to have a corresponding probability of 
making a profit. Not you ?
2) you seem to believe that deploying a botnet of 1000 bots costs nothing.  Who is going 
to write the code for your bot ? or alternatively, how much money would you be wanting to 
spend in order to buy the code ? (You can find prices in Google)
And would you know exactly who are the people you would be buying that code from ?

Let me pick on another element of your message : "the tools already exist for you to 
achieve this"
Yes, they do.  There are plenty of tools available, which achieve a much better protection 
for a server than my proposal ever would (although that is not really my purpose).

But have you already looked at these tools, really ?
Most of these tools require at least a significant expertise (and time) on the part of the 
webserver administrator to set them up correctly.  Many of the most effective ones also 
consume a significant amount of resources when running. Some of them even cost money.

Which in the end and practically leads to the current real-world situation : there are 
hundreds of millions of webservers on the Internet which do /not/ implement any of these 
tools.  Which is one of the elements which makes running these URL-scanning bots be a 
profitable proposition, until now.

In contrast, my proposal would not require any expertise or any time or any money on the 
part of whoever installs an Apache server.  They would just install the default server "as 
is", as they get it from the Apache website or from their preferred platform distribution.
And it would slow down the bots (until someone proves the opposite to me, I'll stick with 
that assertion).

Re: URL scanning by bots

Posted by Noel Butler <no...@ausics.net>.

On Wed, 2013-05-01 at 14:40 +0200, Graham Leggett wrote:

> On 01 May 2013, at 1:51 PM, André Warnier <aw...@ice-sa.com> wrote:
> 
> > But *based on the actual data and patterns which I can observe on my servers (not guesses), I think it might have an effect*.
> 
> Of course it might have an effect - the real important question is will it have a *useful* effect.
> 

Not that I can see, unless you're still running a 286 on 2400baud modem

> A bot that gives up scanning a box that by definition isn't vulnerable to that bot (thus the 404) 
> doesn't achieve anything useful, the bot failed to infect the host before, it fails to infect the 
> host now, nothing has stopped the bot moving to the next host and trying it's luck there. 

Exactly, I think too many people are overly paranoid, stop one bot now,
and in 15 seconds another one, and then another one and anoth......

> Will attempts by sites across the net to slow down my bots convince me to stop? For $3.5m worth of 
> computing power that I am getting for free, I think not.

I'm rather sure NOT, and those who hijack care even less, bots have been
a fact of server life since basically the dawn of the net, and they'll
still be here in another hundred years.

Unless I've missed something since I've been away for a bit, I think the
OP here should be more concerned about the code he runs or allows to be
run, than bothering with something as petty as rate limiting which will
ultimately affect genuine users and peeve them off pretty quick.

I can not see this feature useful, given its adverse effect potential,
it would need be off by default, and I could not see many server admins
bothering to enable it - but for fun, I just asked the question on IRC
(sysadminy type chan), currently 41 users in channel from various
countries  AU-NZ-DE-UK-FR-US-IN-ID-SA and a few unresolved, so it gives
a reasonable world-view,  I got 32 responses saying they wouldn't use
it, I got zero responses saying they would use it - so for me, too much
work for too little worth.

Cheers
Noel

Re: URL scanning by bots

Posted by Graham Leggett <mi...@sharp.fm>.

On 01 May 2013, at 1:51 PM, André Warnier <aw...@ice-sa.com> wrote:

> But *based on the actual data and patterns which I can observe on my servers (not guesses), I think it might have an effect*.

Of course it might have an effect - the real important question is will it have a *useful* effect.

A bot that gives up scanning a box that by definition isn't vulnerable to that bot (thus the 404) doesn't achieve anything useful, the bot failed to infect the host before, it fails to infect the host now, nothing has stopped the bot moving to the next host and trying it's luck there. Perhaps it does achieve a reduction in traffic for you, but that is for you to decide, and the tools already exist for you to achieve this.

To put this into perspective, Rackspace will give me a midrange virtual server instance with 8GB of RAM for $350-ish per month. If I wanted 10 000 of these, that's a $3.5m dollar a month server bill. Or I could break into and steal access to 10 000 servers in my botnet, some far larger than my 8GB ballpark, and save myself $3.5m per month. Will attempts by sites across the net to slow down my bots convince me to stop? For $3.5m worth of computing power that I am getting for free, I think not.

Regards,
Graham
--

Re: URL scanning by bots

Posted by André Warnier <aw...@ice-sa.com>.

Marian Marinov wrote:
> On 05/01/2013 12:19 PM, Tom Evans wrote:
>> On Wed, May 1, 2013 at 1:47 AM, André Warnier <aw...@ice-sa.com> wrote:
>>> Christian Folini wrote:
>>>>
>>>> Hey André,
>>>>
>>>> I do not think your protection mechanism is very good (for reasons
>>>> mentioned before) But you can try it out for yourself easily with 2-3
>>>> ModSecurity rules and the "pause" directive.
>>>>
>>>> Regs,
>>>>
>>>> Christian
>>>>
>>> Hi Christian.
>>>
>>> With respect, I think that you misunderstood the purpose of the 
>>> proposal.
>>> It is not a protection mechanism for any server in particular.
>>> And installing the delay on one server is not going to achieve much.
>>>
>>
>> Putting in any kind of delay means using more resources to deal with
>> the same number of requests, even if you use a dedicated 'slow down'
>> worker to deal especially just with this.
>>
>> The truth of the matter is that these sorts of spidering requests are
>> irrelevant noise on the internet. It's not a targeted attack, it is
>> simply someone looking for easy access to any machine.
> 
> I'm Head of Sysops at fairly large hosting provider, we have more then 
> 2000 machines and I can assure you, this 'noise' as you call it accounts 
> for about 20-25% of all requests to our servers. And the spam uploaded 
> on our servers accounts for about 35-40% of the DB size of all of our 
> customers.
> 
>>
>>> It is something that, if it is installed on enough webservers on the
>>> Internet, may slow down the URL-scanning bots (hopefully a lot), and 
>>> thereby
>>> inconvenience their botmasters. Hopefully to the point where they would
>>> decide that it is not worth scanning that way anymore.  And if it dos 
>>> not
>>> inconvenience them enough to achieve that, at least it should reduce the
>>> effectiveness of these bots, and diminish the number of systems that 
>>> they
>>> can scan over any given time period with the same number of bots.
>>>
>>
>> Well, no, actually this is not accurate. You are assuming that these
>> bots are written using blocking io semantics; that if a bot is delayed
>> by 2 seconds when getting a 404 from your server, it is not able to do
>> anything else in those 2 seconds. This is just incorrect.
>> Each bot process could launch multiple requests to multiple unrelated
>> hosts simultaneously, and select whatever ones are available to read
>> from. If you could globally add a delay to bots on all servers in the
>> world, all the bot owner needs to do to maintain the same throughput
>> is to raise the concurrency level of the bot's requests. The bot does
>> the same amount of work in the same amount of time, but now all our
>> servers use extra resources and are slow for clients on 404.
> 
> Actually, what we are observing is completely opposite to what you are 
> saying.
> Delaying spam bots, brute force attacks, and vulnerability scanners 
> significantly decreases the amount of requests we get from them.
> So, our observation tells us, that if you pretend that your machine is 
> slow, the bots abandon this IP and continue to the next one.
> 
> I believe that the bots are doing that, because there are many 
> vulnerable machines on the internet and there is no point in losing time 
> with a few slower ones. I may be wrong, but this is what we have seen.
> 

Thank you immensely.
This illustrates perfectly one of the problems I am encountering with this proposal.
Most of the objections to it are made by people who somehow seem to have some intellectual 
"a priori" of how bots or bot-masters would act or react, but without providing any actual 
fact to substantiate their opinion.

Once again : I am a little webserver administrator of a small collection of webservers. 
My vision of what happens on the Internet at large is limited to what I can observe on my 
own servers.  I do /not/ pretend that this proposal is correct, and I do /not/ pretend 
that its ultimate effect will be what I hope it could be.

But *based on the actual data and patterns which I can observe on my servers (not 
guesses), I think it might have an effect*. And when I try to substantiate this by some 
rough calculations - also based on real numbers which I can observe -, so far I can see 
nothing that would tell me that I am dead wrong.

There is so far one possible pitfall, which was identified by someone earlier on this list 
: the fact that delaying 404 responses might have a bad effect on some particular kind of 
usage by legitimate clients/users.  So far, I believe that such an effect could be 
mitigated by the fact that this option could be turned off, by any webserver administrator 
with a modicum of knowledge.

Re: URL scanning by bots

Posted by Marian Marinov <mm...@yuhu.biz>.

On 05/01/2013 03:22 PM, André Warnier wrote:
> Dirk-Willem van Gulik wrote:
>> On 1 mei 2013, at 13:31, Graham Leggett <mi...@sharp.fm> wrote:
>>> The evidence was just explained - a bot that does not get an answer quick enough gives up and looks elsewhere.
>>> The key words are "looks elsewhere".
>>
>>
>> For what it is worth - I've been experimenting with this (up till about 6 months ago) on a machine of mine. Having the
>> 200, 403, 404, 500 etc determined by an entirely unscientific 'modulo' of the IP address. Both on the main URL as well
>> as on a few PHP/plesk hole URLs. And have ignored/behaved normal for any source IP that has (ever) fetched robot.txt
>> from the same IP masked by the first 20 bits.
>>
>> That showed that bot's indeed slowdown/do-not-come back so soon if you give them a 403 or similar - but I saw no
>> differences as to which non 200 you give them (not tried slow reply or no reply). Do note though that I was focusing
>> on naughty non-robot.txt fetching bots.
>>
> For what it's worth also, thank you.
>
> This kind of response really helps, even if/when it would contradict the proposal that I am trying to push.  It helps
> because it provides some *evidence* which I am having difficulties collecting by myself, and which would allow to
> *really* judge the proposal on its merits, not just on unsubstantiated opinions.
>
> At another level, I would add this : if implementing my proposal turns out to have no effect, or a very small effect on
> the Internet at large, but effectively helps the server where it is active to avoid some of these scans, then I believe
> that considering the ease and very low cost of implementing this proposal, it would still be worth the trouble.
>
>
>
If the majority of web servers start slowing down the bots, this will simply make the bot authors make them stick to the 
IPs for more time. Once something becomes the standard they can very easily adopt to the new standard.

Re: URL scanning by bots

Posted by André Warnier <aw...@ice-sa.com>.

Dirk-Willem van Gulik wrote:
> On 1 mei 2013, at 13:31, Graham Leggett <mi...@sharp.fm> wrote:
>> The evidence was just explained - a bot that does not get an answer quick enough gives up and looks elsewhere.
>> The key words are "looks elsewhere".
> 
> 
> For what it is worth - I've been experimenting with this (up till about 6 months ago) on a machine of mine. Having the 200, 403, 404, 500 etc determined by an entirely unscientific 'modulo' of the IP address. Both on the main URL as well as on a few PHP/plesk hole URLs. And have ignored/behaved normal for any source IP that has (ever) fetched robot.txt from the same IP masked by the first 20 bits.
> 
> That showed that bot's indeed slowdown/do-not-come back so soon if you give them a 403 or similar - but I saw no differences as to which non 200 you give them (not tried slow reply or no reply). Do note though that I was focusing on naughty non-robot.txt fetching bots.
> 
For what it's worth also, thank you.

This kind of response really helps, even if/when it would contradict the proposal that I 
am trying to push.  It helps because it provides some *evidence* which I am having 
difficulties collecting by myself, and which would allow to *really* judge the proposal on 
its merits, not just on unsubstantiated opinions.

At another level, I would add this : if implementing my proposal turns out to have no 
effect, or a very small effect on the Internet at large, but effectively helps the server 
where it is active to avoid some of these scans, then I believe that considering the ease 
and very low cost of implementing this proposal, it would still be worth the trouble.

Re: URL scanning by bots

Posted by Dirk-Willem van Gulik <Di...@bbc.co.uk>.

On 1 mei 2013, at 13:31, Graham Leggett <mi...@sharp.fm> wrote:
> 
> The evidence was just explained - a bot that does not get an answer quick enough gives up and looks elsewhere.
> The key words are "looks elsewhere".

For what it is worth - I've been experimenting with this (up till about 6 months ago) on a machine of mine. Having the 200, 403, 404, 500 etc determined by an entirely unscientific 'modulo' of the IP address. Both on the main URL as well as on a few PHP/plesk hole URLs. And have ignored/behaved normal for any source IP that has (ever) fetched robot.txt from the same IP masked by the first 20 bits.

That showed that bot's indeed slowdown/do-not-come back so soon if you give them a 403 or similar - but I saw no differences as to which non 200 you give them (not tried slow reply or no reply). Do note though that I was focusing on naughty non-robot.txt fetching bots.

Dw

Re: URL scanning by bots

Posted by Graham Leggett <mi...@sharp.fm>.

On 01 May 2013, at 1:14 PM, Ben Laurie <be...@links.org> wrote:

> The fact you cannot explain the evidence does not invalidate the evidence.

The evidence was just explained - a bot that does not get an answer quick enough gives up and looks elsewhere.

The key words are "looks elsewhere".

> Jeez.

While some people might be frothing hysterically, others of us are trying to have a sensible discussion. :)

Regards,
Graham
--

Re: URL scanning by bots

Posted by Reindl Harald <h....@thelounge.net>.


Am 01.05.2013 13:14, schrieb Ben Laurie:
> The fact you cannot explain the evidence does not invalidate the evidence

what evidence has this thread?

the whole idea of slow down 404 repsones is broken and must never be default
on any setup nor should it be implemented at all - period

Re: URL scanning by bots

Posted by Ben Laurie <be...@links.org>.

On 1 May 2013 11:11, Graham Leggett <mi...@sharp.fm> wrote:
> On 01 May 2013, at 11:34 AM, Marian Marinov <mm...@yuhu.biz> wrote:
>
>> Actually, what we are observing is completely opposite to what you are saying.
>> Delaying spam bots, brute force attacks, and vulnerability scanners significantly decreases the amount of requests we get from them.
>> So, our observation tells us, that if you pretend that your machine is slow, the bots abandon this IP and continue to the next one.
>
> I don't see what difference this makes practically from the perspective of a bot. A server that returns 404 to a bot is of no interest to that bot. Whether that bot gave up because it saw a 404, or because it perceived the box to be too slow to bother is moot, in either case the bot isn't interested in that host anyway.
>
> Remember we're talking about bots, not people. Bots don't get bored, they don't get "dismayed" or "disillusioned", they just crack on with the job they've been given to do, massively, in parallel.
>
> I think bots would prefer it if servers it wasn't interested in returned nothing, as it means less incoming traffic for the bot to process, and potentially a lower chance that the bot would be discovered.

The fact you cannot explain the evidence does not invalidate the evidence. Jeez.

Re: URL scanning by bots

Posted by Graham Leggett <mi...@sharp.fm>.

On 01 May 2013, at 11:34 AM, Marian Marinov <mm...@yuhu.biz> wrote:

> Actually, what we are observing is completely opposite to what you are saying.
> Delaying spam bots, brute force attacks, and vulnerability scanners significantly decreases the amount of requests we get from them.
> So, our observation tells us, that if you pretend that your machine is slow, the bots abandon this IP and continue to the next one.

I don't see what difference this makes practically from the perspective of a bot. A server that returns 404 to a bot is of no interest to that bot. Whether that bot gave up because it saw a 404, or because it perceived the box to be too slow to bother is moot, in either case the bot isn't interested in that host anyway.

Remember we're talking about bots, not people. Bots don't get bored, they don't get "dismayed" or "disillusioned", they just crack on with the job they've been given to do, massively, in parallel.

I think bots would prefer it if servers it wasn't interested in returned nothing, as it means less incoming traffic for the bot to process, and potentially a lower chance that the bot would be discovered.

Regards,
Graham
--

Re: URL scanning by bots

Posted by Marian Marinov <mm...@yuhu.biz>.

On 05/01/2013 12:19 PM, Tom Evans wrote:
> On Wed, May 1, 2013 at 1:47 AM, André Warnier <aw...@ice-sa.com> wrote:
>> Christian Folini wrote:
>>>
>>> Hey André,
>>>
>>> I do not think your protection mechanism is very good (for reasons
>>> mentioned before) But you can try it out for yourself easily with 2-3
>>> ModSecurity rules and the "pause" directive.
>>>
>>> Regs,
>>>
>>> Christian
>>>
>> Hi Christian.
>>
>> With respect, I think that you misunderstood the purpose of the proposal.
>> It is not a protection mechanism for any server in particular.
>> And installing the delay on one server is not going to achieve much.
>>
>
> Putting in any kind of delay means using more resources to deal with
> the same number of requests, even if you use a dedicated 'slow down'
> worker to deal especially just with this.
>
> The truth of the matter is that these sorts of spidering requests are
> irrelevant noise on the internet. It's not a targeted attack, it is
> simply someone looking for easy access to any machine.

I'm Head of Sysops at fairly large hosting provider, we have more then 2000 machines and I can assure you, this 'noise' 
as you call it accounts for about 20-25% of all requests to our servers. And the spam uploaded on our servers accounts 
for about 35-40% of the DB size of all of our customers.

>
>> It is something that, if it is installed on enough webservers on the
>> Internet, may slow down the URL-scanning bots (hopefully a lot), and thereby
>> inconvenience their botmasters. Hopefully to the point where they would
>> decide that it is not worth scanning that way anymore.  And if it dos not
>> inconvenience them enough to achieve that, at least it should reduce the
>> effectiveness of these bots, and diminish the number of systems that they
>> can scan over any given time period with the same number of bots.
>>
>
> Well, no, actually this is not accurate. You are assuming that these
> bots are written using blocking io semantics; that if a bot is delayed
> by 2 seconds when getting a 404 from your server, it is not able to do
> anything else in those 2 seconds. This is just incorrect.
> Each bot process could launch multiple requests to multiple unrelated
> hosts simultaneously, and select whatever ones are available to read
> from. If you could globally add a delay to bots on all servers in the
> world, all the bot owner needs to do to maintain the same throughput
> is to raise the concurrency level of the bot's requests. The bot does
> the same amount of work in the same amount of time, but now all our
> servers use extra resources and are slow for clients on 404.

Actually, what we are observing is completely opposite to what you are saying.
Delaying spam bots, brute force attacks, and vulnerability scanners significantly decreases the amount of requests we 
get from them.
So, our observation tells us, that if you pretend that your machine is slow, the bots abandon this IP and continue to 
the next one.

I believe that the bots are doing that, because there are many vulnerable machines on the internet and there is no point 
in losing time with a few slower ones. I may be wrong, but this is what we have seen.

>
> Thanks, but no thanks.
>
> Tom
>
>

Re: URL scanning by bots

Posted by André Warnier <aw...@ice-sa.com>.

Tom Evans wrote:
> On Wed, May 1, 2013 at 1:47 AM, André Warnier <aw...@ice-sa.com> wrote:
>> Christian Folini wrote:
>>> Hey André,
>>>
>>> I do not think your protection mechanism is very good (for reasons
>>> mentioned before) But you can try it out for yourself easily with 2-3
>>> ModSecurity rules and the "pause" directive.
>>>
>>> Regs,
>>>
>>> Christian
>>>
>> Hi Christian.
>>
>> With respect, I think that you misunderstood the purpose of the proposal.
>> It is not a protection mechanism for any server in particular.
>> And installing the delay on one server is not going to achieve much.
>>
> 
> Putting in any kind of delay means using more resources to deal with
> the same number of requests, even if you use a dedicated 'slow down'
> worker to deal especially just with this.
> 
> The truth of the matter is that these sorts of spidering requests are
> irrelevant noise on the internet. It's not a targeted attack, it is
> simply someone looking for easy access to any machine.

I agree with the last statement.
But why is this "irrelevant noise" ? It is noise, and like all noise it is at least 
annoying, as it interferes with normal information flow.  It is exactly like spam, which 
has been estimated as representing at some moments as constituting up to 50% of the total 
Internet bandwidth.
My 25 unremarkable servers, collectively, have been for years on the receiving end of such 
noise, at the aggregated rate of several hundred or thousand of requests per day. That is 
25 servers on an Internet total of about 600 Million.  If my servers are not being 
specially targeted - and in principle I cannot imagine why they would - then I have to 
imagine that in aggregate over the Internet, we are talking about several hundred million 
HTTP requests per day.  Is that "irrelevant noise" ?

> 
>> It is something that, if it is installed on enough webservers on the
>> Internet, may slow down the URL-scanning bots (hopefully a lot), and thereby
>> inconvenience their botmasters. Hopefully to the point where they would
>> decide that it is not worth scanning that way anymore.  And if it dos not
>> inconvenience them enough to achieve that, at least it should reduce the
>> effectiveness of these bots, and diminish the number of systems that they
>> can scan over any given time period with the same number of bots.
>>
> 
> Well, no, actually this is not accurate. You are assuming that these
> bots are written using blocking io semantics; that if a bot is delayed
> by 2 seconds when getting a 404 from your server, it is not able to do
> anything else in those 2 seconds. This is just incorrect.
> Each bot process could launch multiple requests to multiple unrelated
> hosts simultaneously, and select whatever ones are available to read
> from. If you could globally add a delay to bots on all servers in the
> world, all the bot owner needs to do to maintain the same throughput
> is to raise the concurrency level of the bot's requests. The bot does
> the same amount of work in the same amount of time, but now all our
> servers use extra resources and are slow for clients on 404.
> 

I believe that this line of reasoning is deeply flawed.
If you use blocking I/O, then while your process is waiting, the scheduler can allocate 
the resources to another process in te meantime.
If you do not use blocking I/O, then you use CPU time polling the socket(s) to find out if 
they have something to read from, and this CPU time cannot be re-allocated to another 
process. You do not get something for nothing.
Opening 200 sockets to send 200 parallel requests, and then cycling through those 200 
sockets to see which one has a response yet may improve the *apparent* speed at which you 
are processing these requests/responses, but it will also dramatically raise the 
resource-usage profile of such a bot on the host it is running on.

Re: URL scanning by bots

Posted by Tom Evans <te...@googlemail.com>.

On Wed, May 1, 2013 at 1:47 AM, André Warnier <aw...@ice-sa.com> wrote:
> Christian Folini wrote:
>>
>> Hey André,
>>
>> I do not think your protection mechanism is very good (for reasons
>> mentioned before) But you can try it out for yourself easily with 2-3
>> ModSecurity rules and the "pause" directive.
>>
>> Regs,
>>
>> Christian
>>
> Hi Christian.
>
> With respect, I think that you misunderstood the purpose of the proposal.
> It is not a protection mechanism for any server in particular.
> And installing the delay on one server is not going to achieve much.
>

Putting in any kind of delay means using more resources to deal with
the same number of requests, even if you use a dedicated 'slow down'
worker to deal especially just with this.

The truth of the matter is that these sorts of spidering requests are
irrelevant noise on the internet. It's not a targeted attack, it is
simply someone looking for easy access to any machine.

> It is something that, if it is installed on enough webservers on the
> Internet, may slow down the URL-scanning bots (hopefully a lot), and thereby
> inconvenience their botmasters. Hopefully to the point where they would
> decide that it is not worth scanning that way anymore.  And if it dos not
> inconvenience them enough to achieve that, at least it should reduce the
> effectiveness of these bots, and diminish the number of systems that they
> can scan over any given time period with the same number of bots.
>

Well, no, actually this is not accurate. You are assuming that these
bots are written using blocking io semantics; that if a bot is delayed
by 2 seconds when getting a 404 from your server, it is not able to do
anything else in those 2 seconds. This is just incorrect.
Each bot process could launch multiple requests to multiple unrelated
hosts simultaneously, and select whatever ones are available to read
from. If you could globally add a delay to bots on all servers in the
world, all the bot owner needs to do to maintain the same throughput
is to raise the concurrency level of the bot's requests. The bot does
the same amount of work in the same amount of time, but now all our
servers use extra resources and are slow for clients on 404.

Thanks, but no thanks.

Tom

Re: URL scanning by bots

Posted by André Warnier <aw...@ice-sa.com>.

Christian Folini wrote:
> Hey André,
> 
> I do not think your protection mechanism is very good (for reasons
> mentioned before) But you can try it out for yourself easily with 
> 2-3 ModSecurity rules and the "pause" directive.
> 
> Regs,
> 
> Christian
> 
Hi Christian.

With respect, I think that you misunderstood the purpose of the proposal.
It is not a protection mechanism for any server in particular.
And installing the delay on one server is not going to achieve much.

It is something that, if it is installed on enough webservers on the Internet, may slow 
down the URL-scanning bots (hopefully a lot), and thereby inconvenience their botmasters. 
Hopefully to the point where they would decide that it is not worth scanning that way 
anymore.  And if it dos not inconvenience them enough to achieve that, at least it should 
reduce the effectiveness of these bots, and diminish the number of systems that they can 
scan over any given time period with the same number of bots.

















> 
> On Tue, Apr 30, 2013 at 12:03:28PM +0200, André Warnier wrote:
>> Dear Apache developers,
>>
>> This is a suggestion relative to the code of the Apache httpd webserver, and a possible
>> default new default option in the standard distribution of Apache httpd.
>> It also touches on WWW security, which is why I felt that it belongs on this list, rather
>> than on the general user's list. Please correct me if I am mistaken.
>>
>> According to Netcraft, there are currently some 600 Million webservers on the WWW, with
>> more than 60% of those identified as "Apache".
>> I currently administer about 25 Apache httpd/Tomcat of these webservers, not remarkable in
>> any way (business applications for medium-sized companies).
>> In the logs of these servers, every day, there are episodes like the following :
>>
>> 209.212.145.91 - - [03/Apr/2013:00:52:32 +0200] "GET /muieblackcat HTTP/1.1" 404 362 "-" "-"
>> 209.212.145.91 - - [03/Apr/2013:00:52:36 +0200] "GET //admin/index.php HTTP/1.1" 404 365
>> "-" "-"
>> 209.212.145.91 - - [03/Apr/2013:00:52:36 +0200] "GET //admin/pma/index.php HTTP/1.1" 404
>> 369 "-" "-"
>> 209.212.145.91 - - [03/Apr/2013:00:52:36 +0200] "GET //admin/phpmyadmin/index.php
>> HTTP/1.1" 404 376 "-" "-"
>> 209.212.145.91 - - [03/Apr/2013:00:52:37 +0200] "GET //db/index.php HTTP/1.1" 404 362 "-" "-"
>> 209.212.145.91 - - [03/Apr/2013:00:52:37 +0200] "GET //dbadmin/index.php HTTP/1.1" 404 367
>> "-" "-"
>> ... etc..
>>
>> Such lines are the telltale trace of a "URL-scanning bot" or of the "URL-scanning" part of
>> a bot, and I am sure that you are all familiar with them.  Obviously, these bots are
>> trying to find webservers which exhibit poorly-designed or poorly-configured applications,
>> with the aim of identifying hosts which can be submitted to various kinds of attacks, for
>> various purposes.  As far as I can tell from my own unremarkable servers, I would surmise
>> that many or most webservers facing the Internet are submitted to this type of scan every
>> day.
>>
>> Hopefully, most webservers are not really vulnerable to this type of scan.
>> But the fact is that *these scans are happening, every day, on millions of webservers*.
>> And they are at least a nuisance, and at worst a serious security problem  when, as a
>> result of poorly configured webservers or applications, they lead to break-ins and
>> compromised systems.
>>
>> It is basically a numbers game, like malicious emails : it costs very little to do this,
>> and if even a tiny proportion of webservers exhibit one of these vulnerabilities, because
>> of the numbers involved, it is worth doing it.
>> If there are 600 Million webservers, and 50% of them are scanned every day, and 0.01% of
>> these webservers are vulnerable because of one of these URLs, then it means that every
>> day, 30,000 (600,000,000 x 0.5 x 0.0001) vulnerable servers will be identified.
>>
>> About the "cost" aspect : from the data in my own logs, such bots seem to be scanning
>> about  20-30 URLs per pass, at a rate of about 3-4 URLs per second.
>> Since it is taking my Apache httpd servers approximately 10 ms on average to respond (by a
>> 404 Not Found) to one of these requests, and they only request 1 URL per 250 ms, I would
>> imagine that these bots have some built-in rate-limiting mechanism, to avoid being
>> "caught" by various webserver-protection tools.  Maybe also they are smart, and scan
>> several servers in parallel, so as to limit the rate at which they "burden" any server in
>> particular. (In this rough calculation, I am ignoring network latency for now).
>>
>> So if we imagine a smart bot which is scanning 10 servers in parallel, issuing 4 requests
>> per second to each of them, for a total of 20 URLs per server, and we assume that all
>> these requests result in 404 responses with an average response time of 10 ms, then it
>> "costs" this bot only about 2 seconds to complete the scan of 10 servers.
>> If there are 300 Million servers to scan, then the total cost for scanning all the
>> servers, by any number of such bots working cooperatively, is an aggregated 60 Million
>> seconds.  And if one of such "botnets" has 10,000 bots, that boils down to only 6,000
>> seconds per bot.
>>
>> Scary, that 50% of all Internet webservers can be scanned for vulnerabilities in less than
>> 2 hours, and that such a scan may result in "harvesting" several thousand hosts,
>> candidates for takeover.
>>
>> Now, how about making it so that without any special configuration or add-on software or
>> skills on the part of webserver administrators, it would cost these same bots *about 100
>> times as long (several days)* to do their scan ?
>>
>> The only cost would a relatively small change to the Apache webservers, which is what my
>> suggestion consists of : adding a variable delay (say between 100 ms and 2000 ms) to any
>> 404 response.
>>
>> The suggestion is based on the observation that there is a dichotomy between this kind of
>> access by bots, and the kind of access made by legitimate HTTP users/clients : legitimate
>> users/clients (including the "good bots") are accessing mostly links "which work", so they
>> rarely get "404 Not Found" responses.  Malicious URL-scanning bots on the other hand, by
>> the very nature of what they are scanning for, are getting many "404 Not Found" responses.
>>
>> As a general idea thus, anything which impacts the delay to obtain a 404 response, should
>> impact these bots much more than it impacts legitimate users/clients.
>>
>> How much ?
>>
>> Let us imagine for a moment that this suggestion is implemented in the Apache webservers,
>> and is enabled in the default configuration.  And let's imagine that after a while, 20% of
>> the Apache webservers deployed on the Internet have this feature enabled, and are now
>> delaying any 404 response by an average of 1000 ms.
>> And let's re-use the numbers above, and redo the calculation.
>> The same "botnet" of 10,000 bots is thus still scanning 300 Million webservers, each bot
>> scanning 10 servers at a time for 20 URLs per server.  Previously, this took about 6000
>> seconds.
>> However now, instead of an average delay of 10 ms to obtain a 404 response, in 20% of the
>> cases (60 Million webservers) they will experience an average 1000 ms additional delay per
>> URL scanned.
>> This adds (60,000,000 / 10 * 20 URLs * 1000 ms) 120,000,000 seconds to the scan.
>> Divided by 10,000 bots, this is 12,000 additional seconds per bot (roughly 3 1/2 hours).
>>
>> So with a small change to the code, no add-ons, no special configuration skills on the
>> part of the webserver administrator, no firewalls, no filtering, no need for updates to
>> any list of URLs or bot characteristics, little inconvenience to legitimate users/clients,
>> and a very partial adoption over time, it seems that this scheme could more than double
>> the cost for bots to acquire the same number of targets.  Or, seen another way, it could
>> more than halve the number of webservers being scanned every day.
>>
>> I know that this is a hard sell.  The basic idea sounds a bit too simple to be effective.
>> It will not kill the bots, and it will not stop the bots from scanning Internet servers in
>> other ways that they use. It does not miraculously protect any single server against such
>> scans, and the benefit of any one server implementing this is diluted over all webservers
>> on the Internet.
>> But it is also not meant as an absolute weapon.  It is targeted specifically at a
>> particular type of scan done by a particular type of bot for a particular purpose, and is
>> is just a scheme to make this more expensive for them.  It may or may not discourage these
>> bots from continuing with this type of scan (if it does, that would be a very big result).
>> But at the same time, compared to any other kind of tool that can be used against these
>> scans, this one seems really cheap to implement, it does not seem to be easy to
>> circumvent, and it seems to have at least a potential of bringing big benefits to the WWW
>> at large.
>>
>> If there are reasonable objections to it, I am quite prepared to accept that, and drop it.
>>  I have already floated the idea in a couple of other places, and gotten what could be
>> described as "tepid" responses.  But it seems to me that most of the negative-leaning
>> responses which I received so far, were more of the a-priori "it will never work" kind,
>> rather than real objections based on real facts.
>>
>> So my hope here is that someone has the patience to read through this, and would have the
>> additional patience to examine the idea "professionally".
>>
>

Re: URL scanning by bots

Posted by Christian Folini <ch...@netnea.com>.

Hey André,

I do not think your protection mechanism is very good (for reasons
mentioned before) But you can try it out for yourself easily with 
2-3 ModSecurity rules and the "pause" directive.

Regs,

Christian


On Tue, Apr 30, 2013 at 12:03:28PM +0200, André Warnier wrote:
> Dear Apache developers,
> 
> This is a suggestion relative to the code of the Apache httpd webserver, and a possible
> default new default option in the standard distribution of Apache httpd.
> It also touches on WWW security, which is why I felt that it belongs on this list, rather
> than on the general user's list. Please correct me if I am mistaken.
> 
> According to Netcraft, there are currently some 600 Million webservers on the WWW, with
> more than 60% of those identified as "Apache".
> I currently administer about 25 Apache httpd/Tomcat of these webservers, not remarkable in
> any way (business applications for medium-sized companies).
> In the logs of these servers, every day, there are episodes like the following :
> 
> 209.212.145.91 - - [03/Apr/2013:00:52:32 +0200] "GET /muieblackcat HTTP/1.1" 404 362 "-" "-"
> 209.212.145.91 - - [03/Apr/2013:00:52:36 +0200] "GET //admin/index.php HTTP/1.1" 404 365
> "-" "-"
> 209.212.145.91 - - [03/Apr/2013:00:52:36 +0200] "GET //admin/pma/index.php HTTP/1.1" 404
> 369 "-" "-"
> 209.212.145.91 - - [03/Apr/2013:00:52:36 +0200] "GET //admin/phpmyadmin/index.php
> HTTP/1.1" 404 376 "-" "-"
> 209.212.145.91 - - [03/Apr/2013:00:52:37 +0200] "GET //db/index.php HTTP/1.1" 404 362 "-" "-"
> 209.212.145.91 - - [03/Apr/2013:00:52:37 +0200] "GET //dbadmin/index.php HTTP/1.1" 404 367
> "-" "-"
> ... etc..
> 
> Such lines are the telltale trace of a "URL-scanning bot" or of the "URL-scanning" part of
> a bot, and I am sure that you are all familiar with them.  Obviously, these bots are
> trying to find webservers which exhibit poorly-designed or poorly-configured applications,
> with the aim of identifying hosts which can be submitted to various kinds of attacks, for
> various purposes.  As far as I can tell from my own unremarkable servers, I would surmise
> that many or most webservers facing the Internet are submitted to this type of scan every
> day.
> 
> Hopefully, most webservers are not really vulnerable to this type of scan.
> But the fact is that *these scans are happening, every day, on millions of webservers*.
> And they are at least a nuisance, and at worst a serious security problem  when, as a
> result of poorly configured webservers or applications, they lead to break-ins and
> compromised systems.
> 
> It is basically a numbers game, like malicious emails : it costs very little to do this,
> and if even a tiny proportion of webservers exhibit one of these vulnerabilities, because
> of the numbers involved, it is worth doing it.
> If there are 600 Million webservers, and 50% of them are scanned every day, and 0.01% of
> these webservers are vulnerable because of one of these URLs, then it means that every
> day, 30,000 (600,000,000 x 0.5 x 0.0001) vulnerable servers will be identified.
> 
> About the "cost" aspect : from the data in my own logs, such bots seem to be scanning
> about  20-30 URLs per pass, at a rate of about 3-4 URLs per second.
> Since it is taking my Apache httpd servers approximately 10 ms on average to respond (by a
> 404 Not Found) to one of these requests, and they only request 1 URL per 250 ms, I would
> imagine that these bots have some built-in rate-limiting mechanism, to avoid being
> "caught" by various webserver-protection tools.  Maybe also they are smart, and scan
> several servers in parallel, so as to limit the rate at which they "burden" any server in
> particular. (In this rough calculation, I am ignoring network latency for now).
> 
> So if we imagine a smart bot which is scanning 10 servers in parallel, issuing 4 requests
> per second to each of them, for a total of 20 URLs per server, and we assume that all
> these requests result in 404 responses with an average response time of 10 ms, then it
> "costs" this bot only about 2 seconds to complete the scan of 10 servers.
> If there are 300 Million servers to scan, then the total cost for scanning all the
> servers, by any number of such bots working cooperatively, is an aggregated 60 Million
> seconds.  And if one of such "botnets" has 10,000 bots, that boils down to only 6,000
> seconds per bot.
> 
> Scary, that 50% of all Internet webservers can be scanned for vulnerabilities in less than
> 2 hours, and that such a scan may result in "harvesting" several thousand hosts,
> candidates for takeover.
> 
> Now, how about making it so that without any special configuration or add-on software or
> skills on the part of webserver administrators, it would cost these same bots *about 100
> times as long (several days)* to do their scan ?
> 
> The only cost would a relatively small change to the Apache webservers, which is what my
> suggestion consists of : adding a variable delay (say between 100 ms and 2000 ms) to any
> 404 response.
> 
> The suggestion is based on the observation that there is a dichotomy between this kind of
> access by bots, and the kind of access made by legitimate HTTP users/clients : legitimate
> users/clients (including the "good bots") are accessing mostly links "which work", so they
> rarely get "404 Not Found" responses.  Malicious URL-scanning bots on the other hand, by
> the very nature of what they are scanning for, are getting many "404 Not Found" responses.
> 
> As a general idea thus, anything which impacts the delay to obtain a 404 response, should
> impact these bots much more than it impacts legitimate users/clients.
> 
> How much ?
> 
> Let us imagine for a moment that this suggestion is implemented in the Apache webservers,
> and is enabled in the default configuration.  And let's imagine that after a while, 20% of
> the Apache webservers deployed on the Internet have this feature enabled, and are now
> delaying any 404 response by an average of 1000 ms.
> And let's re-use the numbers above, and redo the calculation.
> The same "botnet" of 10,000 bots is thus still scanning 300 Million webservers, each bot
> scanning 10 servers at a time for 20 URLs per server.  Previously, this took about 6000
> seconds.
> However now, instead of an average delay of 10 ms to obtain a 404 response, in 20% of the
> cases (60 Million webservers) they will experience an average 1000 ms additional delay per
> URL scanned.
> This adds (60,000,000 / 10 * 20 URLs * 1000 ms) 120,000,000 seconds to the scan.
> Divided by 10,000 bots, this is 12,000 additional seconds per bot (roughly 3 1/2 hours).
> 
> So with a small change to the code, no add-ons, no special configuration skills on the
> part of the webserver administrator, no firewalls, no filtering, no need for updates to
> any list of URLs or bot characteristics, little inconvenience to legitimate users/clients,
> and a very partial adoption over time, it seems that this scheme could more than double
> the cost for bots to acquire the same number of targets.  Or, seen another way, it could
> more than halve the number of webservers being scanned every day.
> 
> I know that this is a hard sell.  The basic idea sounds a bit too simple to be effective.
> It will not kill the bots, and it will not stop the bots from scanning Internet servers in
> other ways that they use. It does not miraculously protect any single server against such
> scans, and the benefit of any one server implementing this is diluted over all webservers
> on the Internet.
> But it is also not meant as an absolute weapon.  It is targeted specifically at a
> particular type of scan done by a particular type of bot for a particular purpose, and is
> is just a scheme to make this more expensive for them.  It may or may not discourage these
> bots from continuing with this type of scan (if it does, that would be a very big result).
> But at the same time, compared to any other kind of tool that can be used against these
> scans, this one seems really cheap to implement, it does not seem to be easy to
> circumvent, and it seems to have at least a potential of bringing big benefits to the WWW
> at large.
> 
> If there are reasonable objections to it, I am quite prepared to accept that, and drop it.
>  I have already floated the idea in a couple of other places, and gotten what could be
> described as "tepid" responses.  But it seems to me that most of the negative-leaning
> responses which I received so far, were more of the a-priori "it will never work" kind,
> rather than real objections based on real facts.
> 
> So my hope here is that someone has the patience to read through this, and would have the
> additional patience to examine the idea "professionally".
> 

-- 
Christian Folini - <ch...@netnea.com>

Re: URL scanning by bots

Posted by André Warnier <aw...@ice-sa.com>.

Ben Reser wrote:
> On Tue, Apr 30, 2013 at 3:03 AM, André Warnier <aw...@ice-sa.com> wrote:
>> Let us imagine for a moment that this suggestion is implemented in the
>> Apache webservers,
>> and is enabled in the default configuration.  And let's imagine that after a
>> while, 20% of
>> the Apache webservers deployed on the Internet have this feature enabled,
>> and are now
>> delaying any 404 response by an average of 1000 ms.
>> And let's re-use the numbers above, and redo the calculation.
>> The same "botnet" of 10,000 bots is thus still scanning 300 Million
>> webservers, each bot
>> scanning 10 servers at a time for 20 URLs per server.  Previously, this took
>> about 6000
>> seconds.
>> However now, instead of an average delay of 10 ms to obtain a 404 response,
>> in 20% of the
>> cases (60 Million webservers) they will experience an average 1000 ms
>> additional delay per
>> URL scanned.
>> This adds (60,000,000 / 10 * 20 URLs * 1000 ms) 120,000,000 seconds to the
>> scan.
>> Divided by 10,000 bots, this is 12,000 additional seconds per bot (roughly 3
>> 1/2 hours).
> 
> Let's assume that such a feature gets added, however it's not likely
> going to be the default feature.  There are quite a few places that
> serve a lot of legitimate soft 404s for reasons that I'm not going to
> bother to get into here.

Could you actually give an example of such a "legitimate" use-case ?
(I am not saying that you are wrong, it's just that I genuinely cannot think of such a case)

One comment apart from that, is that if there are indeed such sites, I would imagine that 
they are of the kind which is professionally managed, and that it would not be difficult 
in that case for the administrator to disable (or tune) the feature.

> 
> Any site that goes to the trouble of enabling such a feature is
> probably not going to be a site that is vulnerable to what these
> scanners are looking for.  So if I was a bot writer I'd wait for some
> amount of time and if I didn't have a response I'd move on.  I'd also
> not just move along with the next scan on your web server, I'd
> probably just move on to a different host.  If nothing else a sever
> that responds to request slowly is not likely to be interesting to me.
> 
> As a result I'd say your suggestion if wildly practiced actually helps
> the scanners rather than hurting them, because they can identify hosts
> that are unlikely to worth their time scanning with a single request.
> 

Assuming that you meant "widely" ..

Allow me to reply to that (worthy) objection :

In the simple calculations which I indicated initially, I omitted the impact of the 
network latency, and I used a single figure of 10 ms to estimate the average response time 
of a server (for a 404 response).

According to my own experiments, average network latency to reach Internet servers (even 
with standard pings) is of the order of magnitude of at least 50 ms.  That is for 
well-connected servers.
So from the bot client's point of view, to the basic server response time for a single 
request, you would have to add at least 50 ms on average.

On the other hand, let me disgress a bit to introduce the rest of the answer.

My professional specialty is information management, and many of my customers have 
databases containing URL links to reference pages on the WWW, which they maintain and 
provide to their own internal users.  From time to time we need to go through their 
databases, and verify that the links which they have stored are still current.
So for these customers we are regularly running programs of the "URL checker" type.  These 
are in a way similar to URL-scanning bots, except that they target a longer list of URLs 
(usually several hundred or thousand), usually distributed over many servers, and these 
are real URLs that work (or worked at some point in time).

So anyway, these programs thus try a long list of WWW URLs, and check the type of response 
that they get : if they get a 200 then the link is ok; if they get most anything else then 
the link is flagged as "dubious" in the database, for further manual inspection.
Since the program needs to scan many URLs in a reasonable time, it has to use a timeout 
for each URL that it is trying to check. For example, it will issue a request to a server, 
and if it does not receive a response within (say) 5 seconds, it gives up and flags the 
link as dubious.
Over many runs of these programs, I have noticed that if I set this timeout much below 5 
seconds (say 2 seconds), then I get of the order of 30% or more "false dubious" links.
In reality most of the time these are working links, but it just so happens that many 
servers occasionally do not respond faster than 2 seconds. (And if I re-run the same 
program with the same parameters immediately afterward, I will again get 30% of slow 
links, and many will be different compared to the previous run).
Obviously I cannot do that, because it would mean that my customer has to check hundreds 
of URLs by hand afterward. So on average the timeout is set at 5 seconds, and this is a 
value obtained empirically after many many runs.

What I am leading to is : if the time by which each 404 response is delayed, is randomly 
variable, for example between 50 ms and 2 seconds, then it is very difficult for a bot to 
determine if this is a "normal" delay just due to the load on the webserver at that 
particular time, or if this is deliberate, or if this server is just slow in general.
And if the bot gets a first response which is fast (or slow), it doesn't really say 
anything about how fast or slow the next response would be.

That's what I meant when I stated that this scheme would be hard for a bot to circumvent. 
I am not saying that it is impossible, but any scheme to circumvent this would need at 
least a certain level of sophistication, which again raises the cost.

And now facetiously again, if what you are writing above about bots detecting this anyway 
and consequently avoiding my own websites was correct, then I would be very happy too, 
since I would have found a very simple way to have the bots avoid my servers.

Re: URL scanning by bots

Posted by Ben Reser <be...@reser.org>.

On Tue, Apr 30, 2013 at 3:03 AM, André Warnier <aw...@ice-sa.com> wrote:
> Let us imagine for a moment that this suggestion is implemented in the
> Apache webservers,
> and is enabled in the default configuration.  And let's imagine that after a
> while, 20% of
> the Apache webservers deployed on the Internet have this feature enabled,
> and are now
> delaying any 404 response by an average of 1000 ms.
> And let's re-use the numbers above, and redo the calculation.
> The same "botnet" of 10,000 bots is thus still scanning 300 Million
> webservers, each bot
> scanning 10 servers at a time for 20 URLs per server.  Previously, this took
> about 6000
> seconds.
> However now, instead of an average delay of 10 ms to obtain a 404 response,
> in 20% of the
> cases (60 Million webservers) they will experience an average 1000 ms
> additional delay per
> URL scanned.
> This adds (60,000,000 / 10 * 20 URLs * 1000 ms) 120,000,000 seconds to the
> scan.
> Divided by 10,000 bots, this is 12,000 additional seconds per bot (roughly 3
> 1/2 hours).

Let's assume that such a feature gets added, however it's not likely
going to be the default feature.  There are quite a few places that
serve a lot of legitimate soft 404s for reasons that I'm not going to
bother to get into here.

Any site that goes to the trouble of enabling such a feature is
probably not going to be a site that is vulnerable to what these
scanners are looking for.  So if I was a bot writer I'd wait for some
amount of time and if I didn't have a response I'd move on.  I'd also
not just move along with the next scan on your web server, I'd
probably just move on to a different host.  If nothing else a sever
that responds to request slowly is not likely to be interesting to me.

As a result I'd say your suggestion if wildly practiced actually helps
the scanners rather than hurting them, because they can identify hosts
that are unlikely to worth their time scanning with a single request.

Re: URL scanning by bots

Posted by Reindl Harald <h....@thelounge.net>.


Am 30.04.2013 20:38, schrieb Ben Laurie:
> On 30 April 2013 11:14, Reindl Harald <h....@thelounge.net> wrote:
>> no - this idea is very very bad and if you ever saw a
>> DDOS-attack from 10 thousands of ip-addresses on a
>> machine you maintain you would not consider anything
>> which makes responses slower because it is the wrong
>> direction
> 
> There's no reason to make this a DoS vector - clearly you can queue
> all the delayed responses in a single process and not tie up available
> processes. And if that process gets full, you just drop them on the
> floor

PLEASE inform you how a server works

* you have at least a lot of open connections
* you will overload port and/or file-hanlde ressources
* delay respones is purely idiotic
* on any server with load you will ALWAYS get rid of connections
  as fast as possible in ANY situation and context

PLEASE come back until you understood why delay responses is
simply idiotic, even for regular sites if they have noticeable
load and a lot of 404 because a relaunch or whatever bug



> In general, I hate the argument that improvement X has obvious
> workaround A and therefore we should not bother with it. It's
> absolutely impossible to make forward progress in security with that
> attitude

and with your attitude propose things while not understand
any basics it would be a imporvment of what possible?

Re: URL scanning by bots

Posted by Issac Goldstand <ma...@beamartyr.net>.

On 30/04/2013 21:38, Ben Laurie wrote:
> On 30 April 2013 11:14, Reindl Harald <h....@thelounge.net> wrote:
>> Am 30.04.2013 12:03, schrieb André Warnier:
>>> As a general idea thus, anything which impacts the delay to obtain a 404 response, should
>>> impact these bots much more than it impacts legitimate users/clients.
>>>
>>> How much ?
>>>
>>> Let us imagine for a moment that this suggestion is implemented in the Apache webservers,
>>> and is enabled in the default configuration.  And let's imagine that after a while, 20% of
>>> the Apache webservers deployed on the Internet have this feature enabled, and are now
>>> delaying any 404 response by an average of 1000 ms
>>
>> which is a invitation for a DDOS-attack because it would
>> make it easier to use every available worker and by the
>> delay at the same time active iptables-rate-controls
>> get useless because you need fewer connections for the
>> same damage
>>
>> no - this idea is very very bad and if you ever saw a
>> DDOS-attack from 10 thousands of ip-addresses on a
>> machine you maintain you would not consider anything
>> which makes responses slower because it is the wrong
>> direction
> 
> There's no reason to make this a DoS vector - clearly you can queue
> all the delayed responses in a single process and not tie up available
> processes. And if that process gets full, you just drop them on the
> floor.
> 

1) You're still keeping TCP connections in your kernel which on an
incredibly busy server are an important resource.
2) What do you mean "drop them on the floor"?  You're going to let your
real users suffer by getting dropped connections instead of a 404?  At
best, they'll just try again, to the same URL, producing another 404
that'll hang around for a long time.  At worst, you're pissing off real
people.

Re: URL scanning by bots

Posted by Ben Laurie <be...@links.org>.

On 30 April 2013 11:14, Reindl Harald <h....@thelounge.net> wrote:
> Am 30.04.2013 12:03, schrieb André Warnier:
>> As a general idea thus, anything which impacts the delay to obtain a 404 response, should
>> impact these bots much more than it impacts legitimate users/clients.
>>
>> How much ?
>>
>> Let us imagine for a moment that this suggestion is implemented in the Apache webservers,
>> and is enabled in the default configuration.  And let's imagine that after a while, 20% of
>> the Apache webservers deployed on the Internet have this feature enabled, and are now
>> delaying any 404 response by an average of 1000 ms
>
> which is a invitation for a DDOS-attack because it would
> make it easier to use every available worker and by the
> delay at the same time active iptables-rate-controls
> get useless because you need fewer connections for the
> same damage
>
> no - this idea is very very bad and if you ever saw a
> DDOS-attack from 10 thousands of ip-addresses on a
> machine you maintain you would not consider anything
> which makes responses slower because it is the wrong
> direction

There's no reason to make this a DoS vector - clearly you can queue
all the delayed responses in a single process and not tie up available
processes. And if that process gets full, you just drop them on the
floor.

Re: URL scanning by bots

Posted by Reindl Harald <h....@thelounge.net>.

Am 30.04.2013 12:03, schrieb André Warnier:
> As a general idea thus, anything which impacts the delay to obtain a 404 response, should
> impact these bots much more than it impacts legitimate users/clients.
> 
> How much ?
> 
> Let us imagine for a moment that this suggestion is implemented in the Apache webservers,
> and is enabled in the default configuration.  And let's imagine that after a while, 20% of
> the Apache webservers deployed on the Internet have this feature enabled, and are now
> delaying any 404 response by an average of 1000 ms

which is a invitation for a DDOS-attack because it would
make it easier to use every available worker and by the
delay at the same time active iptables-rate-controls
get useless because you need fewer connections for the
same damage

no - this idea is very very bad and if you ever saw a
DDOS-attack from 10 thousands of ip-addresses on a
machine you maintain you would not consider anything
which makes responses slower because it is the wrong
direction

Re: URL scanning by bots

Posted by "Steinar H. Gunderson" <sg...@bigfoot.com>.

On Tue, Apr 30, 2013 at 08:54:47PM +0200, Lazy wrote:
> mod_security + simple scripts+ ipset + iptables TARPIT in the raw table
> 
> this way You would be able to block efficiently a very large number of
> ipnumbers, using
> TARPIT will take care of the
> delaying new bot connections at minimal cost (much lower then delaying the
> request in userspace, or even returning some error code)

Note that tarpit is not such a cool strategy anymore once you make a mistake
and hit legitimate traffic. E.g., someone once took down all of Debian's
email handling for a day or so, due to misconfigured tarpitting overloading
the server.

/* Steinar */
-- 
Homepage: http://www.sesse.net/

Re: URL scanning by bots

Posted by Lazy <la...@gmail.com>.

2013/4/30 Graham Leggett <mi...@sharp.fm>

> On 30 Apr 2013, at 12:03 PM, André Warnier <aw...@ice-sa.com> wrote:
>
> > The only cost would a relatively small change to the Apache webservers,
> which is what my
> > suggestion consists of : adding a variable delay (say between 100 ms and
> 2000 ms) to any
> > 404 response.
>
> This would have no real effect.
>
> Bots are patient, slowing them down isn't going to inconvenience a bot in
> any way. The simple workaround if the bot does take too long is to simply
> send the requests in parallel. At the same time, slowing down 404s would
> break real websites, as 404 isn't necessarily an error, but rather simply a
> notice that says the resource isn't found.
>
> Regards,
> Graham
> --
>
>
If you want to slow down the bots I whould suggest using

mod_security + simple scripts+ ipset + iptables TARPIT in the raw table

this way You would be able to block efficiently a very large number of
ipnumbers, using
TARPIT will take care of the
delaying new bot connections at minimal cost (much lower then delaying the
request in userspace, or even returning some error code)

http://ipset.netfilter.org/
http://serverfault.com/questions/113796/setting-up-tarpit-technique-in-
iptables
http://www.modsecurity.org/documentation/modsecurity-apache/1.9.3/html-
multipage/05-actions.html
-- 
Michal Grzedzicki

Re: URL scanning by bots

Posted by André Warnier <aw...@ice-sa.com>.

Ben Laurie wrote:
> On 30 April 2013 11:29, Graham Leggett <mi...@sharp.fm> wrote:
>> On 30 Apr 2013, at 12:03 PM, André Warnier <aw...@ice-sa.com> wrote:
>>
>>> The only cost would a relatively small change to the Apache webservers, which is what my
>>> suggestion consists of : adding a variable delay (say between 100 ms and 2000 ms) to any
>>> 404 response.
>> This would have no real effect.
>>
>> Bots are patient, slowing them down isn't going to inconvenience a bot in any way. The simple workaround if the bot does take too long is to simply send the requests in parallel.
> 
> Disagree. Raising the bar reduces volume.
> 
> In general, I hate the argument that improvement X has obvious
> workaround A and therefore we should not bother with it. It's
> absolutely impossible to make forward progress in security with that
> attitude. Every defence is defeatable (says experience) yet some are
> still worth putting in place.
> 

Thank you for putting this succintly.
That is exactly the point of my proposal : raising the bar.

Honestly, I do not know by how much it would raise the bar, nor how much it would have as 
an effect in general. It just seems to me like an idea that may be worth trying, or at 
least really evaluated "scientifically", to verify my many assumptions and approximations.

I just cannot think of how to do this practically, without actually rolling it out on a 
sufficient number of servers, and involving some organisation that has the infrastructure 
and the tools to measure the impact.

Re: URL scanning by bots

Posted by Graham Leggett <mi...@sharp.fm>.

On 30 Apr 2013, at 8:42 PM, Ben Laurie <be...@links.org> wrote:

>> This would have no real effect.
>> 
>> Bots are patient, slowing them down isn't going to inconvenience a bot in any way. The simple workaround if the bot does take too long is to simply send the requests in parallel.
> 
> Disagree. Raising the bar reduces volume.
> 
> In general, I hate the argument that improvement X has obvious
> workaround A and therefore we should not bother with it. It's
> absolutely impossible to make forward progress in security with that
> attitude. Every defence is defeatable (says experience) yet some are
> still worth putting in place.

It's not worth breaking the web to do it.

If you wanted to do something constructive with these requests, come up with a way to signal the owner of the originating IP address that something dodgy is running on their machine.

Regards,
Graham
--

Re: URL scanning by bots

Posted by Ben Laurie <be...@links.org>.

On 30 April 2013 11:29, Graham Leggett <mi...@sharp.fm> wrote:
> On 30 Apr 2013, at 12:03 PM, André Warnier <aw...@ice-sa.com> wrote:
>
>> The only cost would a relatively small change to the Apache webservers, which is what my
>> suggestion consists of : adding a variable delay (say between 100 ms and 2000 ms) to any
>> 404 response.
>
> This would have no real effect.
>
> Bots are patient, slowing them down isn't going to inconvenience a bot in any way. The simple workaround if the bot does take too long is to simply send the requests in parallel.

Disagree. Raising the bar reduces volume.

In general, I hate the argument that improvement X has obvious
workaround A and therefore we should not bother with it. It's
absolutely impossible to make forward progress in security with that
attitude. Every defence is defeatable (says experience) yet some are
still worth putting in place.

> At the same time, slowing down 404s would break real websites, as 404 isn't necessarily an error, but rather simply a notice that says the resource isn't found.
>
> Regards,
> Graham
> --
>

Re: URL scanning by bots

Posted by Micha Lenk <mi...@lenk.info>.

Hi,

Am 03.05.2013 11:27, schrieb Dirk-Willem van Gulik:
> FWIIW - the same sentiments where expressed when 'greylisting[1]' in
> SMTP came in vogue. For small relays (speaking just from personal
> experience and from the vantage of my own private tiny MTA's) that
> has however not been the case. Greylisting did dampen things
> significantly - and the effect lasts to this day.

The main difference I see here is, that a SMTP server that uses
greylisting really can close the client connection almost immediately
with keeping minimal state, usually on cheap disk. So, until the client
retries, neither the kernel nor any processes have to deal with the
greylisting during the delay period.

In HTTP this is totally different. You can't just return a temporary
error code and assume that web browser will retry some reasonable
moments later. For this reason you would have to delay the real HTTP
response. And this has a substantial resource usage impact, as you have
to maintain state across all operating system layers. The network stack
needs to maintain the TCP connection open, the kernel needs to maintain
an open socket for the server process, the server process needs to
maintain an (some kind of) active HTTP request -- for every single
delayed request. These resources would just wait for the delay timer to
expire, so essentially hanging around without doing anything useful, and
without changing any outcome of the actual HTTP transaction. As others
already pointed out, this opens the doors for denial of service attacks
by excessive resource usage. From a security point of view, you
definitely don't want such HTTP response delays.

Regards,
Micha

Re: URL scanning by bots

Posted by Dirk-Willem van Gulik <Di...@bbc.co.uk>.

On 3 mei 2013, at 10:55, Marian Marinov <mm...@yuhu.biz> wrote:
> 
> If Apache by default delays 404s, this may have some effect in the first month or two after the release of this change. But then the the botnet writers will learn and update their software.
> I do believe that these guys are monitoring mailing lists like these or at least reading the change logs of the most popular web servers.
> So, I believe that such change would have a very limited impact on the whole Internet or at least will be combated fairly easy.

FWIIW - the same sentiments where expressed when 'greylisting[1]' in SMTP came in vogue. For small relays (speaking just from personal experience and from the vantage of my own private tiny MTA's) that has however not been the case. Greylisting did dampen things significantly - and the effect lasts to this day.

But agreed - I think it is important to deal with this issue differently that with the overload issue.

Dw.

1: http://en.wikipedia.org/wiki/Greylisting

Re: URL scanning by bots

Posted by Micha Lenk <mi...@lenk.info>.

Hi André

Am 03.05.2013 14:37, schrieb André Warnier:
> Basically, after a few cycles like this, all his 100 pool connections
> will be waiting for a response, and it would have no choice between
> either waiting, or starting to kill the connections that have been
> waiting more than a certain amount of time.
> Or, increasing its number of connections and become more conspicuous (***).

If I were about to run such a bot (having read this thread) I would just
let it wait for the responses on all 100 connections. As long as the bot
is busy, I could easily work on other stuff. It's just a shift in the
numbers game. So why should I care about it?

Regards,
Micha

Re: URL scanning by bots

Posted by Michael Felt <ma...@gmail.com>.

An interesting discussion. The admin of the server I use is rather critical
about malicious connections. His way to prevent continuing malicious
connections is to route the source IP address (incoming) to 127.0.0.1 after
X errors reported from a single IP address within Y minutes.

>From the logic presented here I guess that makes his server susceptible to
a DOS serivce: many bots could attack his server using all available ports
as no ACK are ever returned. However, there is also code in the server OS
to combat this particular problem.

So, yes: if servers delay answering - they are holding ports longer - and
this consumes IP (port) resources. If delaying responses means bot-masters
are going, because they are upset, re-program bots to massively spam a slow
server - the bot-masters also run a risk of detection.

My impression is that, all in all, a good discussion has been started - at
least people are (again?) talking about possibilities for managing "noise"
in a healthy way. Maybe, let's hope!, a good suggestion/idea does come up.
Just because it is hard is not a good reason to stop working towards a
solution - aka surrender.

So, to the person who started this discussion: thanks for the impulse.

On Fri, May 3, 2013 at 2:37 PM, André Warnier <aw...@ice-sa.com> wrote:

> Tom Evans wrote:
>
>> On Fri, May 3, 2013 at 10:54 AM, André Warnier <aw...@ice-sa.com> wrote:
>>
>>> So here is a challenge for the Apache devs : describe how a bot-writer
>>> could
>>> update his software to avoid the consequences of the scheme that I am
>>> advocating, without consequences on the effectivity of their
>>> URL-scanning.
>>>
>>
>> This has been explained several times. The bot makes requests
>> asynchronously with a short select() timeout. If it doesn't have a
>> response from one of its current requests due to artificial delays, it
>> makes an additional request, not necessarily to the same server.
>>
>> The fact that a single response takes longer to arrive is not
>> relevant, the bot can overall process roughly as many requests in the
>> same period as without a delay. The amount of concurrency that would
>> be required would be proportional to the artificial delay and the
>> network RTT.
>>
>> There is a little overhead due to the extra concurrency, but not much
>> - you are not processing any more requests in a specific time period,
>> nor using more network traffic than without concurrency, the only real
>> cost is more simultaneous network connections, most of which are idle
>> waiting for the artificial delay to expire.
>>
>> I would not be surprised if bots already behave like this, as it is a
>> useful way of increasing scanning rate if you have servers that are
>> slow to respond already, or have high network RTT.
>>
>>
> Ok, maybe I am understanding this wrongly. But I am open to be proven
> wrong.
>
> Suppose a bot is scanning 10000 IP's, 100 at a time concurrently (*), for
> 20 potentially
> vulnerable URLs per server. That is thus 200,000 HTTP requests to make.
> And let's suppose that the bot cannot tell, from the delay experienced
> when receiving any
> particular response, if this is a server that is artifically delaying
> responses, or if
> this is a normal delay due to whatever condition (**).
> And let's also suppose that, on the total of 200,000 requests, only 1%
> (2000) will be "hits" (where the URL actually responds by other than a 404
> response). That leaves 99% of requests (198,000) responding with a 404.
> And let's suppose that the bot is extra-smart, and always keeps his "pool"
> of 100 outgoing connections busy, in the sense that as soon as a response
> was received on one connection, that connection is closed and immediately
> re-opened for another HTTP request.
>
> If no webserver implements the scheme, we assume 10 ms per 404 response.
>
> So the bot launches the first batch of 100 requests (taking 10 ms to do
> so), then goes back to check its first connection and finds a response. If
> the response is  not a 404, it's a "hit" and gets added to the table of
> vulnerable IP's
> (and to gain some extra time, it means that if there would have been extra
> URLs to scan for the same server, they could now be canceled - although
> this could be disputed).
> If the response is a 404, it's a "miss". But it doesn't mean that there
> are no other vulnerable URLs on that server, so it still needs to scan the
> others.
> All in all, if the bot can keep issuing requests and processing responses
> at the rate of 100 per 10 ms on average, it will take it a total of 200,000
> / 100 * 10 ms = 2,000 ms to perform the scan of the 200,000 URLs, and it
> will have collected 2000 hits after doing so.
>
> Now let's suppose that out of these 10000 servers, 10% of them implement
> the scheme, and delay their 404 responses by an average of 1000 ms.
> So now the bot launches the first 100 requests in 10 ms, then goes back to
> check the status of the first one. With a probability of 0.1, this could be
> one of the delayed ones.
> In that case, no response will be there yet, and the bot skips to the next
> connection.
> At the end of this pass, the bot will thus have received 90 responses (10
> are still delayed), and re-issued 90 new requests. Then on the next pass,
> the same 10 delayed responses would still be delayed (on average), and
> among the 90 new ones, 9 would also be.
> So now it can only issue 81 new requests, and when it comes back to check,
> 10 + 9 + 8 = 27 will be delayed.
> Basically, after a few cycles like this, all his 100 pool connections will
> be waiting for a response, and it would have no choice between either
> waiting, or starting to kill the connections that have been waiting more
> than a certain amount of time.
> Or, increasing its number of connections and become more conspicuous (***).
>
> If it choses to wait, then its time to complete the scan of the 10000 IP's
> will have increased by 200,000 * 10% * 1000 ms = 20,000,000 ms.
> If it chooses not to wait, then it will never know if this URL was
> vulnerable or not.
>
> Is there a flaw in this reasoning ?
>
> If not, then the avoidance-scheme based on becoming more parallel would be
> quite ineffective, no ?
>
>
>
> (*) I pick 100 at a time, imagining that as the number of established
> outgoing connections
> increases, a bot becomes more and more visible on the host it is running
> on. So I imagine
> that there is a reasonable limit to how many of them it can open at a time.
>
> (**) this being because the server varies the individual 404 delay
> randomly between 2
> reasonable values (100 ms and 2000 ms e.g.)which can happen on any normal
> server.
>
> (***) I would say that a bot which would be opening 100 outgoing
> connections in parallel on average would already be *very* conspicuous.
>
>

Re: URL scanning by bots

Posted by André Warnier <aw...@ice-sa.com>.

Tom Evans wrote:
> On Fri, May 3, 2013 at 10:54 AM, André Warnier <aw...@ice-sa.com> wrote:
>> So here is a challenge for the Apache devs : describe how a bot-writer could
>> update his software to avoid the consequences of the scheme that I am
>> advocating, without consequences on the effectivity of their URL-scanning.
> 
> This has been explained several times. The bot makes requests
> asynchronously with a short select() timeout. If it doesn't have a
> response from one of its current requests due to artificial delays, it
> makes an additional request, not necessarily to the same server.
> 
> The fact that a single response takes longer to arrive is not
> relevant, the bot can overall process roughly as many requests in the
> same period as without a delay. The amount of concurrency that would
> be required would be proportional to the artificial delay and the
> network RTT.
> 
> There is a little overhead due to the extra concurrency, but not much
> - you are not processing any more requests in a specific time period,
> nor using more network traffic than without concurrency, the only real
> cost is more simultaneous network connections, most of which are idle
> waiting for the artificial delay to expire.
> 
> I would not be surprised if bots already behave like this, as it is a
> useful way of increasing scanning rate if you have servers that are
> slow to respond already, or have high network RTT.
> 

Ok, maybe I am understanding this wrongly. But I am open to be proven wrong.

Suppose a bot is scanning 10000 IP's, 100 at a time concurrently (*), for 20 potentially
vulnerable URLs per server. That is thus 200,000 HTTP requests to make.
And let's suppose that the bot cannot tell, from the delay experienced when receiving any
particular response, if this is a server that is artifically delaying responses, or if
this is a normal delay due to whatever condition (**).
And let's also suppose that, on the total of 200,000 requests, only 1% (2000) will be 
"hits" (where the URL actually responds by other than a 404 response). That leaves 99% of 
requests (198,000) responding with a 404.
And let's suppose that the bot is extra-smart, and always keeps his "pool" of 100 outgoing 
connections busy, in the sense that as soon as a response was received on one connection, 
that connection is closed and immediately re-opened for another HTTP request.

If no webserver implements the scheme, we assume 10 ms per 404 response.

So the bot launches the first batch of 100 requests (taking 10 ms to do so), then goes 
back to check its first connection and finds a response. If the response is  not a 404, 
it's a "hit" and gets added to the table of vulnerable IP's
(and to gain some extra time, it means that if there would have been extra URLs to scan 
for the same server, they could now be canceled - although this could be disputed).
If the response is a 404, it's a "miss". But it doesn't mean that there are no other 
vulnerable URLs on that server, so it still needs to scan the others.
All in all, if the bot can keep issuing requests and processing responses at the rate of 
100 per 10 ms on average, it will take it a total of 200,000 / 100 * 10 ms = 2,000 ms to 
perform the scan of the 200,000 URLs, and it will have collected 2000 hits after doing so.

Now let's suppose that out of these 10000 servers, 10% of them implement the scheme, and 
delay their 404 responses by an average of 1000 ms.
So now the bot launches the first 100 requests in 10 ms, then goes back to check the 
status of the first one. With a probability of 0.1, this could be one of the delayed ones.
In that case, no response will be there yet, and the bot skips to the next connection.
At the end of this pass, the bot will thus have received 90 responses (10 are still 
delayed), and re-issued 90 new requests. Then on the next pass, the same 10 delayed 
responses would still be delayed (on average), and among the 90 new ones, 9 would also be.
So now it can only issue 81 new requests, and when it comes back to check, 10 + 9 + 8 = 27 
will be delayed.
Basically, after a few cycles like this, all his 100 pool connections will be waiting for 
a response, and it would have no choice between either waiting, or starting to kill the 
connections that have been waiting more than a certain amount of time.
Or, increasing its number of connections and become more conspicuous (***).

If it choses to wait, then its time to complete the scan of the 10000 IP's will have 
increased by 200,000 * 10% * 1000 ms = 20,000,000 ms.
If it chooses not to wait, then it will never know if this URL was vulnerable or not.

Is there a flaw in this reasoning ?

If not, then the avoidance-scheme based on becoming more parallel would be quite 
ineffective, no ?

(*) I pick 100 at a time, imagining that as the number of established outgoing connections
increases, a bot becomes more and more visible on the host it is running on. So I imagine
that there is a reasonable limit to how many of them it can open at a time.

(**) this being because the server varies the individual 404 delay randomly between 2
reasonable values (100 ms and 2000 ms e.g.)which can happen on any normal server.

(***) I would say that a bot which would be opening 100 outgoing connections in parallel 
on average would already be *very* conspicuous.

Re: URL scanning by bots

Posted by Tom Evans <te...@googlemail.com>.

On Fri, May 3, 2013 at 10:54 AM, André Warnier <aw...@ice-sa.com> wrote:
> So here is a challenge for the Apache devs : describe how a bot-writer could
> update his software to avoid the consequences of the scheme that I am
> advocating, without consequences on the effectivity of their URL-scanning.

This has been explained several times. The bot makes requests
asynchronously with a short select() timeout. If it doesn't have a
response from one of its current requests due to artificial delays, it
makes an additional request, not necessarily to the same server.

The fact that a single response takes longer to arrive is not
relevant, the bot can overall process roughly as many requests in the
same period as without a delay. The amount of concurrency that would
be required would be proportional to the artificial delay and the
network RTT.

There is a little overhead due to the extra concurrency, but not much
- you are not processing any more requests in a specific time period,
nor using more network traffic than without concurrency, the only real
cost is more simultaneous network connections, most of which are idle
waiting for the artificial delay to expire.

I would not be surprised if bots already behave like this, as it is a
useful way of increasing scanning rate if you have servers that are
slow to respond already, or have high network RTT.

Tom

Re: URL scanning by bots

Posted by Graham Leggett <mi...@sharp.fm>.

On 03 May 2013, at 11:54 AM, André Warnier <aw...@ice-sa.com> wrote:

> So here is a challenge for the Apache devs : describe how a bot-writer could update his software to avoid the consequences of the scheme that I am advocating, without consequences on the effectivity of their URL-scanning.

Attempt to process multiple connections asynchronously, in parallel. For a practical implementation of exactly this, look at the ab load testing tool that comes with httpd.

Regards,
Graham
--

Re: URL scanning by bots

Posted by André Warnier <aw...@ice-sa.com>.

Marian Marinov wrote:
> On 05/03/2013 07:24 AM, Ben Reser wrote:
>> On Tue, Apr 30, 2013 at 5:23 PM, André Warnier <aw...@ice-sa.com> wrote:
>>> Alternatives :
>>> 1) if you were running such a site (which I would still suppose is a
>>> minority of the 600 Million websites which exist), you could easily 
>>> disable
>>> the feature.
>>> 2) you could instead return a redirect response, to a page saying 
>>> "that one
>>> was sold, but look at these".
>>> That may be even more friendly to search engines, and to customers.
>>
>> My point isn't that there aren't alternatives, but that 404's are
>> legitimate responses that legitimate users can be expected to receive.
>>   As such you'll find it nearly impossible in my opinion to convince
>> people to degrade performance for them as a default.  If it isn't a
>> default you're hardly any better off than you are today since it will
>> not be widely deployed.
>>
>> If you want to see a case where server behavior has been tweaked in
>> order to combat miscreants go take a look at SMTP.  SMTP is no longer
>> simple, largely because of the various schemes people have undertaken
>> to stop spam.  Despite all these schemes, spam still exists and the
>> only effective counters has been:
>> 1) Securing open-relays.
>> 2) Removing the bot-nets that are sending the spam.
>> 3) Ultimately improving the security of the vulnerable systems that
>> are sending the spam.
>>
>> All the effort towards black lists, SPF, domainkeys, etc... has been
>> IMHO a waste of time.  At best it has been a temporarily road block.
>>
>>
> If Apache by default delays 404s, this may have some effect in the first 
> month or two after the release of this change. 

I like that. So at least we are not at the "no effect" stage anymore. ;-)

But then the the botnet
> writers will learn and update their software.
> I do believe that these guys are monitoring mailing lists like these or 
> at least reading the change logs of the most popular web servers.
> So, I believe that such change would have a very limited impact on the 
> whole Internet or at least will be combated fairly easy.
> 

And I believe that the Apache developers are smart people, as smart or smarter 
collectively than the bot writers.  And one of the tenets of open-source software is that 
"security by obscurity is not security".

So here is a challenge for the Apache devs : describe how a bot-writer could update his 
software to avoid the consequences of the scheme that I am advocating, without 
consequences on the effectivity of their URL-scanning.

P.S. About discussing this on the dev list : I originally tried a couple more discrete 
channels. But I was either ignored, or sent back to the user's list. So I picked this list 
as somewhat in-between.
This being said, I believe that letting the bot-writers know about such a change may 
actually help the scheme. If the bot-writers do not find a good way to avoid the 
consequences of the scheme, they might just decide to avoid URL-scanning, and focus their 
efforts elsewhere.  As far as I am concerned, that would be the biggest prize of all.

Re: URL scanning by bots

Posted by Marian Marinov <mm...@yuhu.biz>.

On 05/03/2013 07:24 AM, Ben Reser wrote:
> On Tue, Apr 30, 2013 at 5:23 PM, André Warnier <aw...@ice-sa.com> wrote:
>> Alternatives :
>> 1) if you were running such a site (which I would still suppose is a
>> minority of the 600 Million websites which exist), you could easily disable
>> the feature.
>> 2) you could instead return a redirect response, to a page saying "that one
>> was sold, but look at these".
>> That may be even more friendly to search engines, and to customers.
>
> My point isn't that there aren't alternatives, but that 404's are
> legitimate responses that legitimate users can be expected to receive.
>   As such you'll find it nearly impossible in my opinion to convince
> people to degrade performance for them as a default.  If it isn't a
> default you're hardly any better off than you are today since it will
> not be widely deployed.
>
> If you want to see a case where server behavior has been tweaked in
> order to combat miscreants go take a look at SMTP.  SMTP is no longer
> simple, largely because of the various schemes people have undertaken
> to stop spam.  Despite all these schemes, spam still exists and the
> only effective counters has been:
> 1) Securing open-relays.
> 2) Removing the bot-nets that are sending the spam.
> 3) Ultimately improving the security of the vulnerable systems that
> are sending the spam.
>
> All the effort towards black lists, SPF, domainkeys, etc... has been
> IMHO a waste of time.  At best it has been a temporarily road block.
>
>
If Apache by default delays 404s, this may have some effect in the first month or two after the release of this change. 
But then the the botnet writers will learn and update their software.
I do believe that these guys are monitoring mailing lists like these or at least reading the change logs of the most 
popular web servers.
So, I believe that such change would have a very limited impact on the whole Internet or at least will be combated 
fairly easy.

Marian

Re: URL scanning by bots

Posted by Reindl Harald <h....@thelounge.net>.


Am 03.05.2013 11:38, schrieb André Warnier:
> I agree that 404's are legitimate responses.
> And I agree that legitimate clients/users can expect to receive them.
> But if they do receive them when appropriate, but receive them slower than other kinds of responses, this is not
> really "breaking the rules"

maybe you have not much expierience and not watching error-logs
on servers with some hundret domains

on our machines 99% of any web-apps is *carefully* written inhouse
there are always mistakes resulting in a lot of 404 which people
building templates and inclduing modules mostly not realize

this starts with no "favicon.ico" in the docroot while most
"modern" browsers try to access them without a link-tag and
so your first delay is on the homepage itself

looking at the crap-quality of the most common webapps the
situation is much more worse - well, this all does affect you
only if you have load and traffic on your machine but then
it hurts really
_____________________

i had a large project 2 years ago where some hundret people
where in front of their machine with a webcam, the application
generated thumbnails which where shown on the page and cleanup
routines to get rid of the thumbs of no longer active users

we worked hard to optimize all the code to get as less as possible
404 errors while bot fill the disks, the overall connection count
was very very high while the braodcast show connected to the app
was on air, the load was very high but all ran smooth with 500
apache workers

with the proposal of this thread the server would not have
survived 10 minutes by have all wroker-processes in this
useless wait-state for zero benefit

if someone is paranoid enough he may setup such nonsense but
do not believe you heal the world this way

Re: URL scanning by bots

Posted by André Warnier <aw...@ice-sa.com>.

Ben Reser wrote:
> On Tue, Apr 30, 2013 at 5:23 PM, André Warnier <aw...@ice-sa.com> wrote:
>> Alternatives :
>> 1) if you were running such a site (which I would still suppose is a
>> minority of the 600 Million websites which exist), you could easily disable
>> the feature.
>> 2) you could instead return a redirect response, to a page saying "that one
>> was sold, but look at these".
>> That may be even more friendly to search engines, and to customers.
> 
> My point isn't that there aren't alternatives, but that 404's are
> legitimate responses that legitimate users can be expected to receive.

I agree that 404's are legitimate responses.
And I agree that legitimate clients/users can expect to receive them.
But if they do receive them when appropriate, but receive them slower than other kinds of 
responses, this is not really "breaking the rules".  There is nothing anywhere in the HTTP 
RFC's which promises a response within a certain timeframe to any kind of HTTP request, 
and being slow to respond - to any kind of request - is a totally normal occurrence on the 
WWW.

To push the reasoning a bit further : imagine that hardware and software were still as 
they were 10 years ago (with CPU's running at 2-300 MHz e.g.).  Then any webserver would 
never respond to a 404 within less than the kind of delay time I am talking about.  But 
that should not invalidate any well-built web application.  Any legitimate web application 
which would rely on a particular response time for a 404 response would seem to me very 
badly-designed.  Not to you ?

>  As such you'll find it nearly impossible in my opinion to convince
> people to degrade performance for them as a default.

Of course, I'm having some difficulties convincing people. What do you think I'm doing 
right now ? ;-)

But here also, I would like to state things in a more nuanced way.  First you'd have to 
define "degrading performance".  For whom ?
Part of my argument is that this could be implemented so as to not degrade the performance 
of the webserver very much, or at all (see (*) below).
Another part is that the client delay would only really impact such clients which by the 
nature of what they are doing, are expected indeed to receive a lot of 404 responses.
I believe that this includes most URL-scanning bots, but extremely few legitimate 
clients/users. I cannot prove that, but it seems to me a reasonable assumption. (**)

Then, a more general comment on "degrading performance" : the current situation is that 
URL-scanning is degrading performance, for everyone on the whole WWW.  These URL-scanning 
bots serve no useful purpose, except to the criminals who run them.  But they are using up 
today a significant portion of the webserver resources and of the general WWW bandwidth, 
which results in degraded performance for everyone, every day.
You may not realise it, but every day you are paying a tax because of these URL-scans : 
whether you are a server or a client, you are paying for some CPU cycles and some bits/s 
bandwidth which you are not using yourself; or you are paying for a filter to block them 
out. Is this something that you feel that you have to just accept, without trying to do 
anything about it ?

   If it isn't a
> default you're hardly any better off than you are today since it will
> not be widely deployed.

I agree entirely.  Having this deployed as the default option is a vital part of the plan. 
Otherwise, it would suffer from the same ills as all the very good methods of webserver 
protection which already exist : they require special resources to deploy, and in 
consequence they are not being widely deployed, despite being available and being effective.

For a more philosophical response, see (***) below.

> 
> If you want to see a case where server behavior has been tweaked in
> order to combat miscreants go take a look at SMTP.  SMTP is no longer
> simple, largely because of the various schemes people have undertaken
> to stop spam.  Despite all these schemes, spam still exists and the
> only effective counters has been:
> 1) Securing open-relays.
> 2) Removing the bot-nets that are sending the spam.
> 3) Ultimately improving the security of the vulnerable systems that
> are sending the spam.
> 
> All the effort towards black lists, SPF, domainkeys, etc... has been
> IMHO a waste of time.  At best it has been a temporarily road block.
> 

In a way, you are providing more arguments in favor of this 404-delay scheme.
What are the main reasons why many attempts at blocking spam were not very succesful ?
I would argue that it was because
a) they were complicated and costly to implement
b) they did not tackle the problem close enough to the source
c) they relied on "a posteriori" information, such as black-lists (before you can 
black-list some IP, you first have to have received spam from it, and then you can start 
broadcasting this IP, and then it takes a while for the recipients to receive and 
implement the information, by which time it is already obsolete)

The scheme which I propose would avoid some of these pitfalls, by
- being very easy to implement (in fact, apart from the original work needed to 
incorporate the scheme in the webserver code once, it would not require any additional 
effort by anyone
- attacking the botnets who do URL-scanning (admittedly as a very small part of their 
activities, but that is the target of this proposal) as close to the source as possible : 
making their very activity in that respect unprofitable in a global sense
- not relying on any prior knowledge of the bots, nor requiring any information to be 
either maintained or accessed. The response is 404 ? add the delay (or not,
if you have some valid reason not to).

How much simpler can a scheme be ?

------------
(*) In terms of performance for the server :
The scheme would come into play once the server has already determined that all that 
remains to be done to satisfy this request, is to send back a 404 response.

That is a relatively simple task, which does not require the entire resources which are 
needed to generate a "normal" response. So this could be off-loaded to a relatively 
lightweigth "child" or "thread" of the server, and the original request-processing "heavy" 
thread/child would become free earlier, to process other potentially valuable requests.

I do not know exactly what the overhead would be for passing this task (returning the 404 
response) from the heavy thread/child to the lightweight thread/child, and if the benefit 
of freeing the original thread/child earlier would compensate for that overhead.

Intuitively however, I would not be very surprised if this kind of scheme would prove 
profitable for the server as a whole, even for a number of other non-404 responses.
There are a whole series of non-200 responses which by RFC definition do not include a 
body, or include a standard body which is the same for all requests with the same status 
code. If I have a prefork Apache server where each child contains a whole perl interpreter 
e.g., why tie all of this up just for sending back a status line to some slow client ?

(**) professionally, for the last 15 years I have been running the technical part of a 
company which specialises in information management through web interfaces. In that 
timespan, I have designed a lot of web applications, and examined a whole lot more which I 
didn't design myself. I have never in that timespan ecountered an application which relied 
on a 404 response as "valid". They were always considered as "errors" and treated as such 
in the code.  That is not a proof that there aren't any, but a reasonable base for my 
assumption, I believe.

(***)
If you like sweeping comparisons, here is one :

Up to some 40 years ago, large cities such as London, Paris, Los Angeles etc. were 
periodically afflicted by smog, which apart from being disagreeable, was also damaging 
people's health.  The problem was that what caused this smog, was also the result of a 
whole lot of individual activities which on the other hand brought individual people 
prosperity, lower costs and higher standards of living.  Nevertheless, at some point a 
wide-enough consensus developed that allowed laws to be passed, which forced people to 
spend more money (e.g. paying for catalytic converters, smoke scrubbers etc.), but in 
return brought cleaner air over their cities.  These laws are not perfect, and affect some 
people more than others, but by and large nobody today in these cities could deny the 
improvements in the quality of the air that they breathe.
That it did not stop air pollution in general, and that some of the polluting activities 
just moved somewhere else ? yes, but slowly these other places are also passing laws, and 
little by little the improvement becomes global (or it stops getting much worse than if 
nothing had been done).

Without taking myself too seriously, I believe that what I propose is of the same category 
of things. It is a global measure meant to tackle a small fraction of what currently 
pollutes the Internet and is an inconvenience and a cost to everyone.
And what distinguishes it from the above laws, is that it doesn't really cost anything.

Let my try to provide some elements to substantiate that last sentence :

Let's say that altogether it would cost 5 days of development on the part of one of the 
Apache dev gurus.  And let's say that on the other hand it would result, one way or 
another, over a period of 2 years, in a global decrease of only 10% of only the 
URL-scanning activity.  What would be the real cost/benefit analysis ?

Let's take a dev cost of 1000$/day. The feature development would thus cost 5,000$.

The other side is more tricky, but let's use some of the numbers that I have used before.
I have 25 servers, and in total these servers receive at least 1000 such individual 
URL-scanning requests per day, and on average they take at least 10 ms to return such a 
404 response.  So let's say that in aggregate this costs me 10 ms * 1000 = 10 seconds of 
server time per day, over the 25 servers.

A server costs about 2000$ to purchase, and is obsolete in 3 years. To simplify, say that 
3 years is 1000 days, so it's basic cost is 2$/day. I also pay hosting charges, bandwidth, 
maintenance, support etc. which raise this cost, say, to 5$/day.
A day has 86400 seconds, so the cost of one server for 1 second is 5$/86400 ~ 0.00005 $.
So the cost of URL-scanning for my 25 servers, per day, is 0.00005 $ x 10s = 0.0005 $.

That is ridiculous, for me, not even worth writing about.

But, there are 600 million webservers on the Internet, and by and large, they are all 
being scanned in the same way.
So this is a total cost of 0.0005 $ x 600,000,000 / 25 = 12,000 $ / day.

So if the scheme reduces the amount of URL-scanning by as little as 10%, that would be a 
saving of 1,200 $/day. It would thus take less than a week to recoup the initial 
development costs, and it would be pure profit thereafter, because there is essentially 
nothing else to do.

If a company could be set up to do this commercially, would you join me as an investor ?

Re: URL scanning by bots

Posted by Ben Reser <be...@reser.org>.

On Tue, Apr 30, 2013 at 5:23 PM, André Warnier <aw...@ice-sa.com> wrote:
> Alternatives :
> 1) if you were running such a site (which I would still suppose is a
> minority of the 600 Million websites which exist), you could easily disable
> the feature.
> 2) you could instead return a redirect response, to a page saying "that one
> was sold, but look at these".
> That may be even more friendly to search engines, and to customers.

My point isn't that there aren't alternatives, but that 404's are
legitimate responses that legitimate users can be expected to receive.
 As such you'll find it nearly impossible in my opinion to convince
people to degrade performance for them as a default.  If it isn't a
default you're hardly any better off than you are today since it will
not be widely deployed.

If you want to see a case where server behavior has been tweaked in
order to combat miscreants go take a look at SMTP.  SMTP is no longer
simple, largely because of the various schemes people have undertaken
to stop spam.  Despite all these schemes, spam still exists and the
only effective counters has been:
1) Securing open-relays.
2) Removing the bot-nets that are sending the spam.
3) Ultimately improving the security of the vulnerable systems that
are sending the spam.

All the effort towards black lists, SPF, domainkeys, etc... has been
IMHO a waste of time.  At best it has been a temporarily road block.

Re: URL scanning by bots

Posted by André Warnier <aw...@ice-sa.com>.

Ben Reser wrote:
> On Tue, Apr 30, 2013 at 4:09 PM, André Warnier <aw...@ice-sa.com> wrote:
>> But I have been trying to figure out a real use case, where expecting 404
>> responses in the course of legitimate applications or website access would
>> be a normal thing to do, and I admit that I haven't been able to think of
>> any.
>> Can you come up with an example where this would really be a user case and
>> where delying 404 responses would really "break something" ?
> 
> Imagine you're a real estate agent.  You have listings for properties,
> each one gets a unique URL.  You want search engines to index your
> properties that are for sale.  When they sell you want those
> properties to stop being shown in search engines.  So you start
> returning a 404 for those pages.  For a while you might still show
> something about the property with something saying that it's been sold
> and some links to other similar properties that aren't sold.
> Eventually you purge that property entirely and it's a generic 404
> page.  Clearly in such a scenario you don't want to delay 404 pages.
> 
> There are of course other examples in other industries.  But basically
> any situation where you were you have pages for things that are
> temporary you're probably going to want to do something like this.
> 

Thank you.  That is a good example.

Alternatives :
1) if you were running such a site (which I would still suppose is a minority of the 600 
Million websites which exist), you could easily disable the feature.
2) you could instead return a redirect response, to a page saying "that one was sold, but 
look at these".
That may be even more friendly to search engines, and to customers.

Re: URL scanning by bots

Posted by Ben Reser <be...@reser.org>.

On Tue, Apr 30, 2013 at 4:09 PM, André Warnier <aw...@ice-sa.com> wrote:
> But I have been trying to figure out a real use case, where expecting 404
> responses in the course of legitimate applications or website access would
> be a normal thing to do, and I admit that I haven't been able to think of
> any.
> Can you come up with an example where this would really be a user case and
> where delying 404 responses would really "break something" ?

Imagine you're a real estate agent.  You have listings for properties,
each one gets a unique URL.  You want search engines to index your
properties that are for sale.  When they sell you want those
properties to stop being shown in search engines.  So you start
returning a 404 for those pages.  For a while you might still show
something about the property with something saying that it's been sold
and some links to other similar properties that aren't sold.
Eventually you purge that property entirely and it's a generic 404
page.  Clearly in such a scenario you don't want to delay 404 pages.

There are of course other examples in other industries.  But basically
any situation where you were you have pages for things that are
temporary you're probably going to want to do something like this.

Re: URL scanning by bots

Posted by André Warnier <aw...@ice-sa.com>.

Graham Leggett wrote:
> On 30 Apr 2013, at 12:03 PM, André Warnier <aw...@ice-sa.com> wrote:
>
>> The only cost would a relatively small change to the Apache webservers, which is what my
>> suggestion consists of : adding a variable delay (say between 100 ms and 2000 ms) to any
>> 404 response.
>
> This would have no real effect.
>
> Bots are patient, slowing them down isn't going to inconvenience a bot in any way. The simple workaround if the bot does take too long is to simply send the requests in parallel. At the same time, slowing down 404s would break real websites, as 404 isn't necessarily an error, but rather simply a notice that says the resource isn't found.
>
Hello.
Thank you for your response.
You make several points above, and I would like to respond to them separately.

1) This would have no real effect.
A: yes, it would.

This is a facetious response, of course. I am making it just in order to illustrate a
kind of objection which I have encountered before : an "a priori" objection, without a
real justification. So I am responding in kind.
This was just for illustration, I hope that you don't mind.

But, you /do/ provide some arguments to justify that, so let me discuss them :

2) "Bots are patient, slowing them down isn't going to inconvenience a bot in any way"

A: I beg to disagree.
First, I would make a distinction between "the bot" (which is just a program running
somewhere, and can obviously not be inconvenienced), and the "owner" of the bot, usually
called "bot-master". And I believe that the bot-master can be seriously inconvenienced.
And through the bots, he is the real target.

Here are my reasons for believing that he can be inconvenienced :

I may seem that creating a bot, distributing it and running it for malicious purposes is
free. But that is not true, it has a definite cost.
Most countries now have laws defining this as criminal behaviour, and many countries now
have dedicated officials which are trying to track down "bot-masters" and bring them to
justice.
So the very first cost of running a botnet is the opportunity risk of getting caught,
paying a big fine and maybe going to prison. And this is not just theory.
There have been several items in the news in the last few of years that show this to be
true. Search Google for "botmaster jailed" e.g.

As a second argument, I would state that if it did not cost anything to create and run a
botnet, then nobody would pay for it. And that it not true. Nowadays one can purchase bot
code, or rent an existing botnet - or even parts of it - for a price. And the price is not
trivial. To rent a botnet of several thousand bots for a week can cost several thousand US
Dollars. And obviously, there is a market.
See here :
http://www.zdnet.com/blog/security/study-finds-the-average-price-for-renting-a-botnet/6528
or search Google for equivalent information.

If it does cost something to create and run a malicious botnet, then if someone does it,
it is in order to get a return on his investment.
The kind of desired return can vary (think Anonymous or some intelligence services), but
it is obvious to me that if someone is running botnets which *do* scan my servers (and
most servers on the Internet) for vulnerable URLs, they are not doing this for the simple
pleasure of doing it. They are expecting a return, or else they wouldn't do it.
The faster that they can scan servers and identify likely targets for further mischief,
the better the return compared to the costs.
As long as the likely return outweighs the costs, they will continue.
Raise the cost or lower the return below a certain treshold however, and it will become
uneconomical, and they will stop.
At what point this would happen, I can't tell.
But I do know one thing : what I am suggesting /would/ slow them down, so it goes in the
right direction : to raise their cost and/or diminish their return.

2) The simple workaround if the bot does take too long is to simply send the requests in
parallel.

A: I already mentioned that point in my original suggestion and I tried to show that it
doesn't really matter, but let me add another aspect :

The people who run bots are not using their own computers or their own bandwidth to do
this. That would be really uneconomical, and really dangerous for them.
Instead, they rely on discreetely "infecting" computers belonging to other people, and
then using those computers and their bandwidth to run their operation.

If your computer has been infected and is running a bot in the background, you may not
notice it, as long as the bot is using a small amount of resources.
But if the bot running on your computer starts to use any significant amount of CPU or
bandwidth, then the probability of you noticing will increase. And if you notice it, you
will kill it, won't you ? And if you do that, there is one less bot in the botnet.

What I am saying is that one can not just increase forever the amount of parallelism in
the scans that a bot is performing. There is a limit to the amount of resources that a bot
can use in its host while remaining discreet.
My original sample calculations used individual bots, each issuing 200 requests in 2
seconds. How many more can one bot issue and remain discreet ?

So really, if you admit that the suggestion, if implemented, would slow down the action of
scanning a number of servers, then in order to keep scanning the same number of servers in
the same time, the only practical response would be to increase the number of bots doing
the scanning.
And then, we run back to the argument above : it increases the cost.

3) "At the same time, slowing down 404s would break real websites, as 404 isn't
necessarily an error, but rather simply a notice that says the resource isn't found."

A: I believe that this is a more tricky objection.
I agree, a 404 is just an indication that the resource isn't found.

But I have been trying to figure out a real use case, where expecting 404 responses in the
course of legitimate applications or website access would be a normal thing to do, and I
admit that I haven't been able to think of any.
Can you come up with an example where this would really be a user case and where delying
404 responses would really "break something" ?

I would also like to offer a precision : my suggestion is to make this an *optional*
feature, that can be easily tuned or disabled by a webserver administrator.
(Similarly to a number of other security-minded configuration directives in Apache httpd)

It would just be a lot more effective if it was enabled by default, in the standard
configuration of the standard Apache httpd distributions.
The reason for that is again a numbers game : there are about 600 Million webservers in
total, at least 60% of them (360,000,000) being "Apache" servers.
Of these 360 million, how many would you say are professionally installed and managed ?
(How many competent webserver administrators are there in the world, and how many
webservers can each one of them take care of ?)
If I was to venture a number, I would say that the number of Apache webservers that are
professionally installed and managed is probably not higher than a few milllions, maybe
10% of the above.
That leaves many more millions which are not so, and those are the target of the
sugestion. If it was a default option, then over time, as new Apache httpd webservers are
installed - or older ones upgraded - the proportion of servers where this option is
activated would automatically increase, without any further intervention.

And as I have already tried to show, any additional percent overall of the installed
webservers where this would be active, increases the total URL scan time by several
million seconds. No matter how parallel the scan is, that number doesn't change.

I hope to have provided convincing arguments in my responses to your objections.
And if not, I'll try harder.

There is also a limit for me though : I do not have the skills nor the resources to
actually set up a working model of this. I cannot create (or rent) a real botnet and
thousands of target servers in order to really prove my arguments.
But maybe someone could think of a way to really prove or disprove this ? Whatever the
results, I would be really delighted.

Re: URL scanning by bots

Posted by Graham Leggett <mi...@sharp.fm>.

On 30 Apr 2013, at 12:03 PM, André Warnier <aw...@ice-sa.com> wrote:

> The only cost would a relatively small change to the Apache webservers, which is what my
> suggestion consists of : adding a variable delay (say between 100 ms and 2000 ms) to any
> 404 response.

This would have no real effect.

Bots are patient, slowing them down isn't going to inconvenience a bot in any way. The simple workaround if the bot does take too long is to simply send the requests in parallel. At the same time, slowing down 404s would break real websites, as 404 isn't necessarily an error, but rather simply a notice that says the resource isn't found.

Regards,
Graham
--