Posted to users@httpd.apache.org by Carlos S <ne...@gmail.com> on 2011/01/05 00:03:02 UTC

[users@httpd] disable wget-like user-agents

Is there any way to block downloads/traffic from wget-like user
agents? Can this be done using the User-Agent string? Any documentation
link or example would be really helpful.

--
cs.

---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.
To unsubscribe, e-mail: users-unsubscribe@httpd.apache.org
   "   from the digest: users-digest-unsubscribe@httpd.apache.org
For additional commands, e-mail: users-help@httpd.apache.org


Re: [users@httpd] disable wget-like user-agents

Posted by Igor Galić <i....@brainsware.org>.
----- "Mark Montague" <ma...@catseye.org> wrote:

> On January 4, 2011 22:32 , Carlos S <ne...@gmail.com> wrote:
> > Recently I was trying to download a package using wget, but the
> > website prevented access to it. I tried --user-agent  option but it
> > didn't work either. So I was curious to know what strategy this web
> > admin must have implemented.
> 
> Without an example URL, I can only speculate, but the ideas that come
> to 
> mind first are denying the download unless a cookie is set (you could

i.galic@panic ~ % wget --help | grep cook
       --no-cookies            don’t use cookies.
       --load-cookies=FILE     load cookies from FILE before session.
       --save-cookies=FILE     save cookies to FILE after session.
       --keep-session-cookies  load and save session (non-permanent) cookies.
i.galic@panic ~ %                             
 
> get quite complex with this, such as setting the cookie via
> JavaScript, 

Yup, that (JS) would kill off wget, but also many other (sensible) clients.

> which wget won't execute), checking the referrer header, or other 

i.galic@panic ~ % wget --help | grep -i referer
       --referer=URL           include ‘Referer: URL’ header in HTTP request.
i.galic@panic ~ %

> JavaScript based checks.


> --
>    Mark Montague
>    mark@catseye.org

i

-- 
Igor Galić

Tel: +43 (0) 664 886 22 883
Mail: i.galic@brainsware.org
URL: http://brainsware.org/



Re: [users@httpd] disable wget-like user-agents

Posted by Mark Montague <ma...@catseye.org>.
  On January 4, 2011 22:32 , Carlos S <ne...@gmail.com> wrote:
> Recently I was trying to download a package using wget, but the
> website prevented access to it. I tried --user-agent  option but it
> didn't work either. So I was curious to know what strategy this web
> admin must have implemented.

Without an example URL, I can only speculate, but the ideas that come to 
mind first are denying the download unless a cookie is set (you could 
get quite complex with this, such as setting the cookie via JavaScript, 
which wget won't execute), checking the referrer header, or other 
JavaScript based checks.
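
[Editor's note: a rough sketch of the cookie-gating idea described above, in
Apache 2.2 mod_rewrite terms. The cookie name "dl_ok", the file path, and the
landing page are made-up placeholders, not anything confirmed from the thread.]

    # Server/vhost context. If the (hypothetical) "dl_ok" cookie is absent,
    # redirect requests for the package to a landing page that sets the
    # cookie (e.g. via JavaScript) before allowing the real download.
    RewriteEngine On
    RewriteCond %{HTTP_COOKIE} !(^|;\s*)dl_ok=1
    RewriteRule ^/downloads/package\.tar\.gz$ /landing.html [R,L]

Since wget does not execute JavaScript, it never obtains the cookie, though
wget's --load-cookies option could still replay one captured from a browser.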

--
   Mark Montague
   mark@catseye.org




Re: [users@httpd] disable wget-like user-agents

Posted by Carlos S <ne...@gmail.com>.
Thanks for the links, Mark and Doug. The webscrapers list looks interesting.

I had looked at the mod_rewrite and User-Agent header solution.

Recently I was trying to download a package using wget, but the
website prevented access to it. I tried the --user-agent option but it
didn't work either. So I was curious to know what strategy this web
admin must have implemented. Maybe I used an incorrect user-agent
string? I remember using AppleWebKit and Mozilla strings; I will try
again.

(Not giving out that particular URL out of courtesy.)

-cs.


On Tue, Jan 4, 2011 at 5:33 PM, Doug McNutt <do...@macnauchtan.com> wrote:
> At 18:19 -0500 1/4/11, Mark Montague wrote:
>>Follow the example below, but use only the user agent condition, omit the IP condition, and suitably adjust the RewriteRule regular expression to match the URL(s) you wish to block:
>>
>>http://httpd.apache.org/docs/2.2/rewrite/rewrite_guide.html#blocking-of-robots
>>
>>Note that wget has a -U option that can be used to get around this block by using a user agent string that you are not blocking -- so the block will not prevent a determined downloader.
>
> *******
>
> You might want to have a look at this rather new mailing list.  It's interested in doing exactly the opposite of what you want.
>
> List-Id: webscrapers talk <webscrapers.cool.haxx.se>
> List-Archive: <http://cool.haxx.se/pipermail/webscrapers>
> List-Post: <ma...@cool.haxx.se>
> List-Help: <mailto:webscrapers-request@cool.haxx.se?subject=help>
> List-Subscribe: <http://cool.haxx.se/cgi-bin/mailman/listinfo/webscrapers>, <mailto:webscrapers-request@cool.haxx.se?subject=subscribe>
>
>
>
> --
>
> --> From the U S of A, the only socialist country that refuses to admit it. <--
>



Re: [users@httpd] disable wget-like user-agents

Posted by Doug McNutt <do...@macnauchtan.com>.
At 18:19 -0500 1/4/11, Mark Montague wrote:
>Follow the example below, but use only the user agent condition, omit the IP condition, and suitably adjust the RewriteRule regular expression to match the URL(s) you wish to block:
>
>http://httpd.apache.org/docs/2.2/rewrite/rewrite_guide.html#blocking-of-robots
>
>Note that wget has a -U option that can be used to get around this block by using a user agent string that you are not blocking -- so the block will not prevent a determined downloader.

*******

You might want to have a look at this rather new mailing list.  It's interested in doing exactly the opposite of what you want. 

List-Id: webscrapers talk <webscrapers.cool.haxx.se>
List-Archive: <http://cool.haxx.se/pipermail/webscrapers>
List-Post: <ma...@cool.haxx.se>
List-Help: <mailto:webscrapers-request@cool.haxx.se?subject=help>
List-Subscribe: <http://cool.haxx.se/cgi-bin/mailman/listinfo/webscrapers>, <mailto:webscrapers-request@cool.haxx.se?subject=subscribe>



-- 

--> From the U S of A, the only socialist country that refuses to admit it. <--



Re: [users@httpd] disable wget-like user-agents

Posted by Mark Montague <ma...@catseye.org>.
  On January 4, 2011 18:03 , Carlos S <ne...@gmail.com> wrote:
> Is there any way to disable download/traffic from wget-like user
> agents? Can this be done using user-agent string? Any documentation
> link or example will be really helpful.

Follow the example below, but use only the user agent condition, omit 
the IP condition, and suitably adjust the RewriteRule regular expression 
to match the URL(s) you wish to block:

http://httpd.apache.org/docs/2.2/rewrite/rewrite_guide.html#blocking-of-robots

Note that wget has a -U option that can be used to get around this block 
by using a user agent string that you are not blocking -- so the block 
will not prevent a determined downloader.
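
[Editor's note: adapted from the rewrite-guide example linked above, a
user-agent-only version might look like the following. The /downloads/ path
and the agent list are placeholders to adjust for your site.]

    # Server/vhost context, Apache 2.2
    RewriteEngine On
    # Match common command-line downloaders by User-Agent, case-insensitively
    RewriteCond %{HTTP_USER_AGENT} (wget|curl|libwww) [NC]
    # Return 403 Forbidden for anything under /downloads/
    RewriteRule ^/downloads/ - [F]

As Mark notes, wget -U "Mozilla/5.0 ..." trivially defeats a check like this.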

--
   Mark Montague
   mark@catseye.org

