Posted to users@httpd.apache.org by Tom Ray <to...@blazestudios.com> on 2002/08/23 18:42:40 UTC

Wget

Is there a way to protect the websites on my server from someone using
Wget?

Any help is appreciated.

TIA.

Tom


---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.
To unsubscribe, e-mail: users-unsubscribe@httpd.apache.org
   "   from the digest: users-digest-unsubscribe@httpd.apache.org
For additional commands, e-mail: users-help@httpd.apache.org


Re: Wget

Posted by Bruno Wolff III <br...@wolff.to>.
On Mon, Aug 26, 2002 at 15:13:31 +0200,
  Wolter Kamphuis <ap...@wkamphuis.student.utwente.nl> wrote:
> 
> I now use robotcop (http://www.robotcop.org/) to block webspiders. On some
> of my pages (especially dynamic ones) I include a one-pixel image link.
> Everyone following this link will be blocked for two days. Normal browsers
> won't follow this link, so they are unaffected. I catch about 10 to 20
> people a day using wget, Teleport Pro and other such spiders.

I use a two-step process. I add links that don't surround any content and
that point to a separate page. That page displays a warning not to follow
any links off of it. It also has meta-robot tags saying not to index the
page or follow links off of it. There is a link on that page that runs a
cgi-bin script which blocks the connecting IP address until it is manually
removed.

This is aimed more at stopping robots that ignore meta-robot tags than at
catching things like wget that pull content too fast but don't do so
repeatedly.
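
For reference, here is a minimal sketch of how the blocking half of such a
setup could be wired together with mod_rewrite. This is an assumed
illustration, not Bruno's actual script or file layout: the trap CGI appends
a line like "10.1.2.3 deny" to a map file, and every later request from that
address is refused.

# httpd.conf (server or virtual-host context; RewriteMap is not allowed
# in .htaccess). The map name and file path are hypothetical.
RewriteEngine On
RewriteMap banned txt:/usr/local/apache/conf/banned-ips.txt
RewriteCond ${banned:%{REMOTE_ADDR}|ok} =deny
RewriteRule .* - [F]

mod_rewrite re-reads a txt: map when the file's modification time changes,
so additions made by the CGI script take effect without a restart.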



Re: Wget

Posted by Michael <mi...@asstr.org>.
	This is a good approach :) You don't want to block wget as such
(because there are far worse things out there, most of which
you will never have heard of). What you want to block is the
particular behavior that all web mirroring programs exhibit:
harvesting large blocks of pages (real pages, not
"supplementals" like images, etc.) in a very short period of time.

	You can actually use mod_throttle to do that, more or less. The
problem is that if you're seriously interested in blocking that behavior,
chances are you have a lot of traffic, and mod_throttle is not the
best-written program algorithm-wise, nor does it have some of the
features you'll need (like the ability to permanently exempt certain
IPs, such as cache-*.aol.com, from blocking). robotcop may be a good
alternative...

	About anti-mirroring software in general... If you had a site
with, say, 165,000 pieces of erotic literature from the benign to
the bizarre you'd do your best to block mirroring (even though
the site is free and ad-free) because the people doing mirroring either
don't understand that they'll dislike 98% of what they download
or they're trying to set up a mirror. Since many of the authors
who contribute content to my site have specifically requested
that they not be published anywhere else, it's part of my job
to try to ensure that their wishes are fulfilled, so mirrors should
ask for permission - not just attempt to download the entire
site willy-nilly.

	The other thing that really gets me is that none of the mirroring
software I've come across even uses the transparent gzip
compression that keeps our bandwidth bills in a reasonable range,
so they're "stealing" twice from legitimate users of the site (once
by downloading material they'll never read and once by using more
bandwidth to do it than they need to).
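
(For anyone wondering what that compression looks like on the server side:
on Apache 2.0 it is typically mod_deflate, on 1.3 it was mod_gzip. Below is
a minimal mod_deflate sketch; this is an assumed example, not necessarily
the configuration used on the site above.)

# Assumes Apache 2.0 with mod_deflate loaded.
AddOutputFilterByType DEFLATE text/html text/plain text/css

Clients that send "Accept-Encoding: gzip" then receive compressed responses
automatically, which is the "transparent" part that mirroring tools tend to
skip.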

- Michael

On Mon, 26 Aug 2002, Wolter Kamphuis wrote:

> Hi,
>
> I also had some problems with webspiders. A website I'm running consists
> of many (1,500) pages, each showing one image, like a gallery. People who
> wanted all the images just let wget do a recursive download of the
> complete website. The result was that almost half of my traffic went to
> those webspiders.
>
> I now use robotcop (http://www.robotcop.org/) to block webspiders. On some
> of my pages (especially dynamic ones) I include a one-pixel image link.
> Everyone following this link will be blocked for two days. Normal browsers
> won't follow this link, so they are unaffected. I catch about 10 to 20
> people a day using wget, Teleport Pro and other such spiders.
>
> However, there are some issues with robotcop. There is always a chance
> you will block innocent users; about one or two of the spiders I catch
> daily are innocent users. There's not much I can do about it since I don't
> know why they follow the 'invisible link'. Still, one or two out of 30k
> visitors isn't that much.
>
> Also, if you have robotcop behave like a tarpit (serving crap to the
> clients very slowly), every caught spider will occupy one (or more) Apache
> processes; in that case it's easy to perform a DoS attack if you have the
> right tools. I solved this by building a special 'tarpitd' daemon that
> handles the 'crap serving'. It also helps against worms and people
> scanning for Apache: a scan of my webserver now takes hours to complete.
>
> Cheers,
>   Wolter
>
>
> > Is there a way to protect the websites on my server from someone using
> > Wget?
> >
> > Any help is appreciated.
> >
> > TIA.
> >
> > Tom




Re: Wget

Posted by Wolter Kamphuis <ap...@wkamphuis.student.utwente.nl>.
Hi,

I also had some problems with webspiders. A website I'm running consists
of many (1,500) pages, each showing one image, like a gallery. People who
wanted all the images just let wget do a recursive download of the
complete website. The result was that almost half of my traffic went to
those webspiders.

I now use robotcop (http://www.robotcop.org/) to block webspiders. On some
of my pages (especially dynamic ones) I include a one-pixel image link.
Everyone following this link will be blocked for two days. Normal browsers
won't follow this link, so they are unaffected. I catch about 10 to 20
people a day using wget, Teleport Pro and other such spiders.

However, there are some issues with robotcop. There is always a chance
you will block innocent users; about one or two of the spiders I catch
daily are innocent users. There's not much I can do about it since I don't
know why they follow the 'invisible link'. Still, one or two out of 30k
visitors isn't that much.

Also, if you have robotcop behave like a tarpit (serving crap to the
clients very slowly), every caught spider will occupy one (or more) Apache
processes; in that case it's easy to perform a DoS attack if you have the
right tools. I solved this by building a special 'tarpitd' daemon that
handles the 'crap serving'. It also helps against worms and people
scanning for Apache: a scan of my webserver now takes hours to complete.

Cheers,
  Wolter


> Is there a way to protect the websites on my server from someone using
> Wget?
>
> Any help is appreciated.
>
> TIA.
>
> Tom






Re: Wget

Posted by Rodent of Unusual Size <Ke...@Golux.Com>.
Tom Ray wrote:
> 
> Is there a way to protect the websites on my server from someone using
> Wget?

You can do something with BrowserMatch and 'Deny from env=', but it's not
a solution -- just a sieve.  Wget allows the user-agent to be changed through
a command-line option, which will defeat this.
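
A minimal sketch of that sieve (the directives are real; the environment
variable name and the directory path are just illustrative):

# Mark requests whose User-Agent claims to be Wget, then refuse them.
# Trivially defeated by wget --user-agent=..., as noted above.
BrowserMatchNoCase Wget bad_agent
<Directory "/var/www/html">
    Order Allow,Deny
    Allow from all
    Deny from env=bad_agent
</Directory>
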
-- 
#ken	P-)}

Ken Coar, Sanagendamgagwedweinini  http://Golux.Com/coar/
Author, developer, opinionist      http://Apache-Server.Com/

"Millennium hand and shrimp!"



Re: Wget

Posted by "Webmaster EraSinar.com" <we...@erasinar.com>.
Hope this helps.

Put this in your robots.txt:

User-agent: Wget/1.6
Disallow: /

User-agent: Wget/1.5.3
Disallow: /

User-agent: Wget
Disallow: /
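# Note: robots.txt is advisory only. wget obeys it by default, but
# "wget -e robots=off" (or a changed User-Agent string, as mentioned
# above) bypasses it, so this only keeps out well-behaved clients.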

