Posted to users@subversion.apache.org by Thomas Beale <th...@deepthought.com.au> on 2006/07/09 18:02:44 UTC
stopping webcrawlers using robots.txt
Hi,
I have looked around but not found the answer to the question: how to
make /robots.txt visible in an apache virtual host config for a
subversion server. How would I tell Apache to allow requests to read
/robots.txt given the following configuration? (Or - how can I just
block robots going into the SVN repositories)?
<VirtualHost 1.2.3.4>
ServerAdmin webmaster@xxxx.org
ServerName svn.xxxx.org
<Location />
DAV svn
SVNParentPath /usr/local/var/svn
# authorisation
AuthzSVNAccessFile /etc/subversion/access-control
# authentication
AuthType Basic
AuthName "development Subversion Repository"
AuthUserFile /etc/subversion/authentication
# anonymous access rules
Satisfy Any
Require valid-user
</Location>
</VirtualHost>
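For what it's worth, the robots.txt file itself is trivial - the difficulty in this thread is only in getting Apache to serve it past mod_dav_svn. A file asking all crawlers to stay out of everything would be:

```
User-agent: *
Disallow: /
```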
thanks,
- thomas beale
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@subversion.tigris.org
For additional commands, e-mail: users-help@subversion.tigris.org
Re: stopping webcrawlers using robots.txt
Posted by Thomas Beale <th...@deepthought.com.au>.
Sorry - this fix doesn't work. It does what it is supposed to do with
respect to serving robots.txt, but now Subversion clients can't do
updates, although they can still do some other operations like commit and log...
apologies for the previous post...
- thomas beale
Thomas Beale wrote:
>
> Our system administrator messed around with this, and produced the
> following solution, which works. The key was to change Location
> directives to Directory directives, i.e. to match directory patterns not
> URL patterns.
>
> <VirtualHost 1.2.3.4>
> ServerAdmin webmaster@xxxx.org
>
> ServerName svn.xxxx.org
>
> DocumentRoot /usr/local/var/svn
>
> RewriteEngine On
> RewriteRule .*robots\.txt$ /generic-root/robots.txt [PT]
>
> Alias /generic-root /usr/local/var/generic-root
>
> <Directory /usr/local/var/generic-root>
> SetHandler default-handler
> allow from all
> </Directory>
>
> <Directory /usr/local/var/svn>
> DAV svn
> SVNParentPath /usr/local/var/svn
>
> # authorisation
> AuthzSVNAccessFile /etc/subversion/access-control
>
> # authentication
> AuthType Basic
> AuthName "development Subversion Repository"
> AuthUserFile /etc/subversion/authentication
>
> # anonymous access rules
> Satisfy Any
> Require valid-user
> </Directory>
> </VirtualHost>
>
>
>
> Thomas Beale wrote:
>>
>> Hi,
>>
>> I have looked around but not found the answer to the question: how to
>> make /robots.txt visible in an apache virtual host config for a
>> subversion server. How would I tell Apache to allow requests to read
>> /robots.txt given the following configuration? (Or - how can I just
>> block robots going into the SVN repositories)?
>>
>> <VirtualHost 1.2.3.4>
>> ServerAdmin webmaster@xxxx.org
>>
>> ServerName svn.xxxx.org
>>
>> <Location />
>> DAV svn
>> SVNParentPath /usr/local/var/svn
>>
>> # authorisation
>> AuthzSVNAccessFile /etc/subversion/access-control
>>
>> # authentication
>> AuthType Basic
>> AuthName "development Subversion Repository"
>> AuthUserFile /etc/subversion/authentication
>>
>> # anonymous access rules
>> Satisfy Any
>> Require valid-user
>> </Location>
>> </VirtualHost>
>>
>>
>>
>> thanks,
>>
>> - thomas beale
Re: stopping webcrawlers using robots.txt
Posted by Thomas Beale <th...@deepthought.com.au>.
Our system administrator messed around with this, and produced the
following solution, which works. The key was to change Location
directives to Directory directives, i.e. to match directory patterns not
URL patterns.
<VirtualHost 1.2.3.4>
ServerAdmin webmaster@xxxx.org
ServerName svn.xxxx.org
DocumentRoot /usr/local/var/svn
RewriteEngine On
RewriteRule .*robots\.txt$ /generic-root/robots.txt [PT]
Alias /generic-root /usr/local/var/generic-root
<Directory /usr/local/var/generic-root>
SetHandler default-handler
allow from all
</Directory>
<Directory /usr/local/var/svn>
DAV svn
SVNParentPath /usr/local/var/svn
# authorisation
AuthzSVNAccessFile /etc/subversion/access-control
# authentication
AuthType Basic
AuthName "development Subversion Repository"
AuthUserFile /etc/subversion/authentication
# anonymous access rules
Satisfy Any
Require valid-user
</Directory>
</VirtualHost>
Thomas Beale wrote:
>
> Hi,
>
> I have looked around but not found the answer to the question: how to
> make /robots.txt visible in an apache virtual host config for a
> subversion server. How would I tell Apache to allow requests to read
> /robots.txt given the following configuration? (Or - how can I just
> block robots going into the SVN repositories)?
>
> <VirtualHost 1.2.3.4>
> ServerAdmin webmaster@xxxx.org
>
> ServerName svn.xxxx.org
>
> <Location />
> DAV svn
> SVNParentPath /usr/local/var/svn
>
> # authorisation
> AuthzSVNAccessFile /etc/subversion/access-control
>
> # authentication
> AuthType Basic
> AuthName "development Subversion Repository"
> AuthUserFile /etc/subversion/authentication
>
> # anonymous access rules
> Satisfy Any
> Require valid-user
> </Location>
> </VirtualHost>
>
>
>
> thanks,
>
> - thomas beale
Re: stopping webcrawlers using robots.txt
Posted by Ryan Schmidt <su...@ryandesign.com>.
I worked on this problem today and reached the following solution. If
you try this out and it works (or if you try it out and it doesn't
work) please let me know!
<VirtualHost *:80>
ServerName svn.example.com
RewriteEngine on
RewriteCond %{REQUEST_METHOD} ^GET$
RewriteRule ^/(favicon\.ico|robots\.txt|svnrsrc/.*)$ http://www.example.com/svnroot/$1 [P,L]
<Location />
DAV svn
SVNParentPath /path/to/subversion/repositories
SVNListParentPath on
SVNIndexXSLT /svnrsrc/index.xslt
AuthType Basic
AuthName "Subversion Repositories"
AuthUserFile /path/to/subversion/conf/users
Require valid-user
</Location>
</VirtualHost>
Basically, no matter what manner of Alias directives and the like I
tried, the DAV server in the Location directive always wanted to take
precedence. The only way I found to conditionally override the DAV
server was to proxy the request away to a separate vhost using
mod_rewrite. I only do this for GET requests, and only for
favicon.ico, robots.txt and a directory svnrsrc where you can put
xslt files, css files, images and anything else you might need to
properly show your directory listings. All other requests are still
handled by mod_dav_svn.
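To make concrete which requests Ryan's rule diverts away from mod_dav_svn, the same pattern can be exercised with Python's re module (the repository path in the last example is made up):

```python
import re

# The same pattern as the RewriteRule above: only three root-level
# paths are proxied; everything else still reaches mod_dav_svn.
proxied = re.compile(r"^/(favicon\.ico|robots\.txt|svnrsrc/.*)$")

print(bool(proxied.match("/robots.txt")))              # True
print(bool(proxied.match("/favicon.ico")))             # True
print(bool(proxied.match("/svnrsrc/index.xslt")))      # True
# A robots.txt inside a repository does not match, so Subversion
# working copies are unaffected:
print(bool(proxied.match("/myrepo/trunk/robots.txt"))) # False
```

Note the anchored `^/`: unlike a pattern such as `.*robots\.txt$`, this cannot hijack a robots.txt stored inside a repository, which may be why the earlier rewrite attempt broke client updates.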
Here, I just made a directory svnroot in the document root of my
normal vhost (www.example.com) to contain the favicon.ico, robots.txt
and svnrsrc that will be used by svn.example.com. It shouldn't bother
anybody there. If you'd like, you should even be able to prevent
people from accessing it directly via http://www.example.com/svnroot/
and make it only available via the proxied connection, like this:
<VirtualHost *:80>
ServerName www.example.com
DocumentRoot /wherever/the/document/root/is
...
<Location /svnroot>
Order allow,deny
Allow from 127.0.0.1
</Location>
</VirtualHost>
Note: This is your normal vhost, not your Subversion vhost.
Re: stopping webcrawlers using robots.txt
Posted by Thomas Beale <th...@deepthought.com.au>.
Ryan Schmidt wrote:
>
>
> I tried this suggestion from Todd which sounded promising:
>
> On Jul 10, 2006, at 01:44, Todd D. Esposito wrote:
>
>> Alias /robots.txt /some/non/svn/path/robots.txt
>> <Location /robots.txt>
>> SetHandler default-handler
>> </Location>
>
> But it doesn't seem to be working.
not for me either... I added the following, but it still doesn't help...
Alias /robots.txt /usr/local/var/svn/robots.txt
<Location /robots.txt>
SetHandler default-handler
# added this
Allow from all
</Location>
I'm not an apache specialist at all, so I don't really like messing
around in a trial and error fashion too much...I'm getting our sysadmin
to have a look at it.
The other thing that occurred to me is that we are running wsvn, and web
bots and crawlers use those URLs heavily as well (I just looked in the
logs - they are there alright). So I should block /wsvn/ from our main
server as well....
>
> This may not be a great help to you, but when I was unable to solve this
> within Apache, and since I was playing around with the lighttpd web
> server anyway, I arranged it so that web access to the repository
> occurred via lighttpd, which proxied all requests to Apache running on a
> different port -- all requests, that is, except for favicon.ico,
> robots.txt, and the CSS and XSLT stylesheets. Working copies themselves
> directly accessed the Apache port (since although lighttpd is supposed
> to support proxying to Apache / Subversion, it seems to be broken at the
> moment).
I can see that this would work, but I think I will persevere more with
the current vhost config to see if we can't get apache to do the right
thing (i.e., what I want it to do;-)
Any more advice is welcome of course...
thanks,
- thomas beale
Re: stopping webcrawlers using robots.txt
Posted by Ryan Schmidt <su...@ryandesign.com>.
On Jul 10, 2006, at 00:07, Evert|Rooftop wrote:
> I'm guessing you could still create an alias for robots.txt... but I'm
> not 100% sure...
>
> We simply use authentication everywhere... I can't really understand why
> you would want to open your repository to everyone, except robots...
The motivation might not be to exclude or include any particular
robots. (Well-written) robots will request /robots.txt on hosts they
crawl, just like (many modern) browsers will request /favicon.ico. If
these files are not present, Apache will log a 404 to the error log.
For a sysadmin trying to use the error log to see if there are any
real problems on the site, these "false positives" quickly become
very irritating, and the sysadmin will look for a way to shut them off.
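The behaviour Ryan describes can be seen with Python's standard urllib.robotparser: a compliant crawler fetches /robots.txt first and honours it. Here the file's contents are fed to the parser directly rather than fetched over HTTP, and the hostname is illustrative:

```python
from urllib.robotparser import RobotFileParser

# A Disallow-all robots.txt, as a Subversion host might serve it.
rules = [
    "User-agent: *",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

# A well-behaved crawler would now skip every URL on the host.
print(rp.can_fetch("GoogleBot", "http://svn.example.com/repo/trunk/"))  # False
```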
If you were using a single repository with SVNPath, you could use
Bob's suggestion:
On Jul 9, 2006, at 20:42, Bob Proulx wrote:
> I suppose you
> could check in robots.txt into the top level of your repository. But
> then it would be part of your project and so forth.
But since you're using SVNParentPath and multiple repositories, that
option is not available.
I tried this suggestion from Todd which sounded promising:
On Jul 10, 2006, at 01:44, Todd D. Esposito wrote:
> Alias /robots.txt /some/non/svn/path/robots.txt
> <Location /robots.txt>
> SetHandler default-handler
> </Location>
But it doesn't seem to be working.
This may not be a great help to you, but when I was unable to solve
this within Apache, and since I was playing around with the lighttpd
web server anyway, I arranged it so that web access to the repository
occurred via lighttpd, which proxied all requests to Apache running
on a different port -- all requests, that is, except for favicon.ico,
robots.txt, and the CSS and XSLT stylesheets. Working copies
themselves directly accessed the Apache port (since although lighttpd
is supposed to support proxying to Apache / Subversion, it seems to
be broken at the moment).
Re: stopping webcrawlers using robots.txt
Posted by Evert | Rooftop <ev...@rooftopsolutions.nl>.
I'm guessing you could still create an alias for robots.txt... but I'm not
100% sure...
We simply use authentication everywhere... I can't really understand why
you would want to open your repository to everyone, except robots...
Evert
Thomas Beale wrote:
> Bob Proulx wrote:
>>> ...
>>> <Location />
>>> DAV svn
>>> SVNParentPath /usr/local/var/svn
>>>
>>
>> Oh, I see. You have configured your subversion repository as the only
>> visible path in your web server. In my opinion you have made a bad
>> choice of repository location. By putting it directly in the root you
>> have prevented yourself from doing anything else with your web server.
>> At that point I think you are unable to do what you want.
> we chose that URL so as to be able to have http://svn.openEHR.org
> (i.e. our main URL is http://www.openEHR.org). Does no-one else do
> this? It seems an obvious thing to do....
>> So to
>> answer your question, in your configuration you can't. I suppose you
>> could check in robots.txt into the top level of your repository. But
>> then it would be part of your project and so forth.
>>
> no, I think that would just be plain confusing, and I don't think it
> would work anyway....but surely there is a way to tell apache to deal
> in a certain way with a request straight off '/' , even with this
> configuration?
>
> I don't think changing the URL is an option - it is published all over
> the place - everyone in our community knows it. I know we could put
> some kind of redirect in, but that seems a pretty poor option as
> well....I am quite surprised to find that the configuration which
> seems to be preferred by the subversion manual is not compatible with
> managing web robots...
>
> - thomas
>
>
>
Re: stopping webcrawlers using robots.txt
Posted by Thomas Beale <th...@deepthought.com.au>.
Bob Proulx wrote:
> Thomas Beale wrote:
>> I don't think changing the URL is an option - it is published all over
>> the place - everyone in our community knows it.
>
> I understand the pain of it.
>
>> I am quite surprised to find that the configuration which seems to
>> be preferred by the subversion manual is not compatible with
>> managing web robots...
>
> Where does it say this in the subversion manual? I can't find
> anything that recommends that, and if any such example exists I presume it to be a
> documentation bug that should be reported. The examples I see all use
> /repos. Also the subversion project itself uses the /repos
> convention.
You are right - I just checked; what I was remembering was the
SVNParentPath approach, which means that authorisation is down one
level, in each named repository, rather than being just off /
But still it seems unfortunate that using http://svn.xxx.yyy/repo_name
isn't more flexible - it's a nice simple URL and easy to remember. And
we have no problems except that I think we possibly need a way to
control robots ....
- thomas
>
> http://svn.collab.net/repos/svn/trunk
>
> Bob
Re: stopping webcrawlers using robots.txt
Posted by Bob Proulx <bo...@proulx.com>.
Thomas Beale wrote:
> I don't think changing the URL is an option - it is published all over
> the place - everyone in our community knows it.
I understand the pain of it.
> I am quite surprised to find that the configuration which seems to
> be preferred by the subversion manual is not compatible with
> managing web robots...
Where does it say this in the subversion manual? I can't find
anything that recommends that, and if any such example exists I presume it to be a
documentation bug that should be reported. The examples I see all use
/repos. Also the subversion project itself uses the /repos
convention.
http://svn.collab.net/repos/svn/trunk
Bob
Re: stopping webcrawlers using robots.txt
Posted by Thomas Beale <Th...@OceanInformatics.biz>.
Bob Proulx wrote:
>> ...
>> <Location />
>> DAV svn
>> SVNParentPath /usr/local/var/svn
>>
>
> Oh, I see. You have configured your subversion repository as the only
> visible path in your web server. In my opinion you have made a bad
> choice of repository location. By putting it directly in the root you
> have prevented yourself from doing anything else with your web server.
> At that point I think you are unable to do what you want.
we chose that URL so as to be able to have http://svn.openEHR.org (i.e.
our main URL is http://www.openEHR.org). Does no-one else do this? It
seems an obvious thing to do....
> So to
> answer your question, in your configuration you can't. I suppose you
> could check in robots.txt into the top level of your repository. But
> then it would be part of your project and so forth.
>
no, I think that would just be plain confusing, and I don't think it
would work anyway... but surely there is a way to tell Apache to deal in
a certain way with a request straight off '/', even with this
configuration?
I don't think changing the URL is an option - it is published all over
the place - everyone in our community knows it. I know we could put some
kind of redirect in, but that seems a pretty poor option as well....I am
quite surprised to find that the configuration which seems to be
preferred by the subversion manual is not compatible with managing web
robots...
- thomas
Re: stopping webcrawlers using robots.txt
Posted by Bob Proulx <bo...@proulx.com>.
Thomas Beale wrote:
> how to make /robots.txt visible in an apache virtual host config for
> a subversion server.
Put it in your document root. Let it be served normally by the web
server.
> How would I tell Apache to allow requests to read /robots.txt given
> the following configuration?
> ...
> <Location />
> DAV svn
> SVNParentPath /usr/local/var/svn
Oh, I see. You have configured your subversion repository as the only
visible path in your web server. In my opinion you have made a bad
choice of repository location. By putting it directly in the root you
have prevented yourself from doing anything else with your web server.
At that point I think you are unable to do what you want. So to
answer your question, in your configuration you can't. I suppose you
could check in robots.txt into the top level of your repository. But
then it would be part of your project and so forth.
I suggest you reconfigure your server to put the subversion
repositories under an /svn directory. That will release your web
server for other uses under your document root.
> <Location /svn>
That way you can put a robots.txt file into your document root and all
will behave normally.
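A sketch of that layout, reusing the paths and auth settings from the original post (the DocumentRoot shown is an assumption, and this is untested):

```
<VirtualHost 1.2.3.4>
ServerName svn.xxxx.org
# robots.txt goes in the ordinary document root:
DocumentRoot /var/www/svn-docroot

<Location /svn>
DAV svn
SVNParentPath /usr/local/var/svn
AuthzSVNAccessFile /etc/subversion/access-control
AuthType Basic
AuthName "development Subversion Repository"
AuthUserFile /etc/subversion/authentication
Satisfy Any
Require valid-user
</Location>
</VirtualHost>
```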
Sorry that I am proposing that you change your URL. But I think that
is the best course of action. If you don't do it for this problem you
will eventually need to do it for another reason later. The earlier
you make URL changes the better because it only gets more painful
later.
Bob
Re: stopping webcrawlers using robots.txt
Posted by "Todd D. Esposito" <To...@ToddEsposito.com>.
Thomas,
Try inserting something like what I've indicated below, inline with your
vhost block:
On Sun, July 9, 2006 13:02, Thomas Beale said:
>
> Hi,
>
> I have looked around but not found the answer to the question: how to
> make /robots.txt visible in an apache virtual host config for a
> subversion server. How would I tell Apache to allow requests to read
> /robots.txt given the following configuration? (Or - how can I just
> block robots going into the SVN repositories)?
>
> <VirtualHost 1.2.3.4>
> ServerAdmin webmaster@xxxx.org
>
> ServerName svn.xxxx.org
Alias /robots.txt /some/non/svn/path/robots.txt
<Location /robots.txt>
SetHandler default-handler
</Location>
>
> <Location />
> DAV svn
> SVNParentPath /usr/local/var/svn
>
> # authorisation
> AuthzSVNAccessFile /etc/subversion/access-control
>
> # authentication
> AuthType Basic
> AuthName "development Subversion Repository"
> AuthUserFile /etc/subversion/authentication
>
> # anonymous access rules
> Satisfy Any
> Require valid-user
> </Location>
> </VirtualHost>
That should do it, but be warned I didn't test it before sending it off;
YMMV.
Todd D. Esposito
www.Turtol.com -- Web Applications and Hosting
Todd@Turtol.com
Todd@ToddEsposito.com