Posted to users@subversion.apache.org by Thomas Beale <th...@deepthought.com.au> on 2006/07/09 18:02:44 UTC

stopping webcrawlers using robots.txt

Hi,

I have looked around but not found the answer to this question: how do I 
make /robots.txt visible in an Apache virtual host config for a 
Subversion server? How would I tell Apache to allow requests to read 
/robots.txt given the following configuration? (Or: how can I just 
block robots from going into the SVN repositories?)

<VirtualHost 1.2.3.4>
         ServerAdmin webmaster@xxxx.org

         ServerName svn.xxxx.org

         <Location />
                 DAV svn
                 SVNParentPath /usr/local/var/svn

                 # authorisation
                 AuthzSVNAccessFile /etc/subversion/access-control

                 # authentication
                 AuthType Basic
                 AuthName "development Subversion Repository"
                 AuthUserFile /etc/subversion/authentication

                 # anonymous access rules
                 Satisfy Any
                 Require valid-user
         </Location>
</VirtualHost>



thanks,

- thomas beale


Re: stopping webcrawlers using robots.txt

Posted by Thomas Beale <th...@deepthought.com.au>.
Sorry - this fix doesn't work after all. It does what it is supposed to 
do with respect to serving robots.txt, but now Subversion clients can't 
do updates, although they can do some other operations like commit and log.

apologies for the previous post...

- thomas beale

Thomas Beale wrote:
> 
> Our system administrator messed around with this, and produced the 
> following solution, which works. The key was to change Location 
> directives to Directory directives, i.e. to match directory patterns not 
> URL patterns.
> 
> <VirtualHost 1.2.3.4>
>         ServerAdmin webmaster@xxxx.org
> 
>         ServerName svn.xxxx.org
> 
>         DocumentRoot /usr/local/var/svn
> 
>         RewriteEngine  On
>         RewriteRule    .*robots\.txt$    /generic-root/robots.txt  [PT]
> 
>         Alias /generic-root /usr/local/var/generic-root
> 
>         <Directory /usr/local/var/generic-root>
>                 SetHandler default-handler
>                 allow from all
>         </Directory>
> 
>         <Directory /usr/local/var/svn>
>                 DAV svn
>                 SVNParentPath /usr/local/var/svn
> 
>                 # authorisation
>                 AuthzSVNAccessFile /etc/subversion/access-control
> 
>                 # authentication
>                 AuthType Basic
>                 AuthName "development Subversion Repository"
>                 AuthUserFile /etc/subversion/authentication
> 
>                 # anonymous access rules
>                 Satisfy Any
>                 Require valid-user
>         </Directory>
> </VirtualHost>


Re: stopping webcrawlers using robots.txt

Posted by Thomas Beale <th...@deepthought.com.au>.
Our system administrator messed around with this, and produced the 
following solution, which works. The key was to change Location 
directives to Directory directives, i.e. to match directory patterns not 
URL patterns.

<VirtualHost 1.2.3.4>
         ServerAdmin webmaster@xxxx.org

         ServerName svn.xxxx.org

         DocumentRoot /usr/local/var/svn

         RewriteEngine  On
         RewriteRule    .*robots\.txt$    /generic-root/robots.txt  [PT]

         Alias /generic-root /usr/local/var/generic-root

         <Directory /usr/local/var/generic-root>
                 SetHandler default-handler
                 allow from all
         </Directory>

         <Directory /usr/local/var/svn>
                 DAV svn
                 SVNParentPath /usr/local/var/svn

                 # authorisation
                 AuthzSVNAccessFile /etc/subversion/access-control

                 # authentication
                 AuthType Basic
                 AuthName "development Subversion Repository"
                 AuthUserFile /etc/subversion/authentication

                 # anonymous access rules
                 Satisfy Any
                 Require valid-user
         </Directory>
</VirtualHost>
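
The robots.txt served from /usr/local/var/generic-root can then be as 
simple as this (assuming the aim is to keep all well-behaved crawlers 
out of every repository):

# /usr/local/var/generic-root/robots.txt
User-agent: *
Disallow: /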





Re: stopping webcrawlers using robots.txt

Posted by Ryan Schmidt <su...@ryandesign.com>.
I worked on this problem today and reached the following solution. If 
you try this out and it works (or if you try it out and it doesn't 
work), please let me know!

<VirtualHost *:80>
	ServerName svn.example.com
	
	RewriteEngine on
	RewriteCond %{REQUEST_METHOD} ^GET$
	RewriteRule ^/(favicon\.ico|robots\.txt|svnrsrc/.*)$ http://www.example.com/svnroot/$1 [P,L]
	
	<Location />
		DAV svn
		SVNParentPath /path/to/subversion/repositories
		SVNListParentPath on
		SVNIndexXSLT /svnrsrc/index.xslt
		AuthType Basic
		AuthName "Subversion Repositories"
		AuthUserFile /path/to/subversion/conf/users
		Require valid-user
	</Location>
</VirtualHost>

Basically, no matter what manner of Alias directives and the like I  
tried, the DAV server in the Location directive always wanted to take  
precedence. The only way I found to conditionally override the DAV  
server was to proxy the request away to a separate vhost using  
mod_rewrite. I only do this for GET requests, and only for  
favicon.ico, robots.txt and a directory svnrsrc where you can put 
XSLT files, CSS files, images and anything else you might need to 
properly show your directory listings. All other requests are still  
handled by mod_dav_svn.
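
One caveat: the [P] flag hands the rewritten request to mod_proxy, so 
this approach assumes mod_proxy and mod_proxy_http are loaded alongside 
mod_rewrite, along these lines (module paths vary by distribution, and 
many packages enable these through their own config snippets):

LoadModule rewrite_module modules/mod_rewrite.so
LoadModule proxy_module modules/mod_proxy.so
LoadModule proxy_http_module modules/mod_proxy_http.so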

Here, I just made a directory svnroot in the document root of my  
normal vhost (www.example.com) to contain the favicon.ico, robots.txt  
and svnrsrc that will be used by svn.example.com. It shouldn't bother  
anybody there. If you'd like, you should even be able to prevent  
people from accessing it directly via http://www.example.com/svnroot/  
and make it only available via the proxied connection, like this:

<VirtualHost *:80>
	ServerName www.example.com
	DocumentRoot /wherever/the/document/root/is
	...
	<Location /svnroot>
		Order allow,deny
		Allow from 127.0.0.1
	</Location>
</VirtualHost>

Note: This is your normal vhost, not your Subversion vhost.
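
A quick way to check the result from outside (hostnames as in the 
example above): requesting http://svn.example.com/robots.txt should 
return the file via the proxy, while http://www.example.com/svnroot/ 
should be refused with a 403 for any client other than the server itself.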



Re: stopping webcrawlers using robots.txt

Posted by Thomas Beale <th...@deepthought.com.au>.
Ryan Schmidt wrote:
> I tried this suggestion from Todd which sounded promising:
> 
> On Jul 10, 2006, at 01:44, Todd D. Esposito wrote:
> 
>> Alias /robots.txt /some/non/svn/path/robots.txt
>> <Location /robots.txt>
>>   SetHandler default-handler
>> </Location>
> 
> But it doesn't seem to be working.

Not for me either... I added the following, but it still doesn't help:

         Alias /robots.txt /usr/local/var/svn/robots.txt
         <Location /robots.txt>
                 SetHandler default-handler
                 # added this
                 Allow from all
         </Location>

I'm not an Apache specialist at all, so I don't really like messing 
around in a trial-and-error fashion too much... I'm getting our sysadmin 
to have a look at it.

The other thing that occurred to me is that we are running wsvn, and web 
bots and crawlers use those URLs heavily as well (I just looked in the 
logs - they are there all right). So I should block /wsvn/ from our main 
server as well...
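
That part at least should just be a couple of lines in the main server's 
robots.txt, assuming wsvn is mounted at /wsvn/:

User-agent: *
Disallow: /wsvn/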

> 
> This may not be a great help to you, but when I was unable to solve this 
> within Apache, and since I was playing around with the lighttpd web 
> server anyway, I arranged it so that web access to the repository 
> occurred via lighttpd, which proxied all requests to Apache running on a 
> different port -- all requests, that is, except for favicon.ico, 
> robots.txt, and the CSS and XSLT stylesheets. Working copies themselves 
> directly accessed the Apache port (since although lighttpd is supposed 
> to support proxying to Apache / Subversion, it seems to be broken at the 
> moment).

I can see that this would work, but I think I will persevere with the 
current vhost config to see if we can't get Apache to do the right 
thing (i.e., what I want it to do ;-)

Any more advice is welcome of course...

thanks,

- thomas beale


Re: stopping webcrawlers using robots.txt

Posted by Ryan Schmidt <su...@ryandesign.com>.
On Jul 10, 2006, at 00:07, Evert|Rooftop wrote:

> I'm guessing you could still create an alias for robots.txt, but I'm
> not 100% sure.
>
> We simply use authentication everywhere. I can't really understand why
> you would want to open your repository for everyone except robots.

The motivation might not be to exclude or include any particular  
robots. (Well-written) robots will request /robots.txt on hosts they  
crawl, just like (many modern) browsers will request /favicon.ico. If  
these files are not present, Apache will log a 404 to the error log.  
For a sysadmin trying to use the error log to see if there are any  
real problems on the site, these "false positives" quickly become  
very irritating, and the sysadmin will look for a way to shut them off.

If you were using a single repository with SVNPath, you could use  
Bob's suggestion:

On Jul 9, 2006, at 20:42, Bob Proulx wrote:

> I suppose you
> could check robots.txt into the top level of your repository.  But
> then it would be part of your project and so forth.

But since you're using SVNParentPath and multiple repositories, that  
option is not available.

I tried this suggestion from Todd which sounded promising:

On Jul 10, 2006, at 01:44, Todd D. Esposito wrote:

> Alias /robots.txt /some/non/svn/path/robots.txt
> <Location /robots.txt>
>   SetHandler default-handler
> </Location>

But it doesn't seem to be working.

This may not be a great help to you, but when I was unable to solve  
this within Apache, and since I was playing around with the lighttpd  
web server anyway, I arranged it so that web access to the repository  
occurred via lighttpd, which proxied all requests to Apache running  
on a different port -- all requests, that is, except for favicon.ico,  
robots.txt, and the CSS and XSLT stylesheets. Working copies  
themselves directly accessed the Apache port (since although lighttpd  
is supposed to support proxying to Apache / Subversion, it seems to  
be broken at the moment).



Re: stopping webcrawlers using robots.txt

Posted by Evert | Rooftop <ev...@rooftopsolutions.nl>.
I'm guessing you could still create an alias for robots.txt, but I'm not
100% sure.

We simply use authentication everywhere. I can't really understand why
you would want to open your repository for everyone except robots.

Evert




Re: stopping webcrawlers using robots.txt

Posted by Thomas Beale <th...@deepthought.com.au>.
Bob Proulx wrote:
> Thomas Beale wrote:
>> I don't think changing the URL is an option - it is published all over 
>> the place - everyone in our community knows it.
> 
> I understand the pain of it.
> 
>> I am quite surprised to find that the configuration which seems to
>> be preferred by the subversion manual is not compatible with
>> managing web robots...
> 
> Where does it say this in the subversion manual?  I can't find
> anything that recommends that, and if any exist I presume it to be a
> documentation bug that should be reported.  The examples I see all use
> /repos.  Also the subversion project itself uses the /repos
> convention.

You are right - I just checked; what I was remembering was the 
SVNParentPath approach, which means that authorisation happens one level 
down, in each named repository, rather than just off /.
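
In other words, with SVNParentPath the rules in the AuthzSVNAccessFile 
are written per repository, with each section prefixed by the repository 
name. A minimal sketch (repository, user and group names are purely 
illustrative):

# /etc/subversion/access-control
[groups]
developers = alice, bob

[some_repo:/]
* = r
@developers = rw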

But it still seems unfortunate that using http://svn.xxx.yyy/repo_name 
isn't more flexible - it's a nice simple URL and easy to remember. And 
we have had no problems, except that I think we need a way to control 
robots...

- thomas




Re: stopping webcrawlers using robots.txt

Posted by Bob Proulx <bo...@proulx.com>.
Thomas Beale wrote:
> I don't think changing the URL is an option - it is published all over 
> the place - everyone in our community knows it.

I understand the pain of it.

> I am quite surprised to find that the configuration which seems to
> be preferred by the subversion manual is not compatible with
> managing web robots...

Where does it say this in the subversion manual?  I can't find
anything that recommends that, and if any exist I presume it to be a
documentation bug that should be reported.  The examples I see all use
/repos.  Also the subversion project itself uses the /repos
convention.

  http://svn.collab.net/repos/svn/trunk

Bob


Re: stopping webcrawlers using robots.txt

Posted by Thomas Beale <Th...@OceanInformatics.biz>.
Bob Proulx wrote:
>> ...
>>         <Location />
>>                 DAV svn
>>                 SVNParentPath /usr/local/var/svn
>>     
>
> Oh, I see.  You have configured your subversion repositories as the only
> visible paths in your web server.  In my opinion you have made a bad
> choice of repository location.  By putting them directly at the root you
> have prevented yourself from doing anything else with your web server.
> At that point I think you are unable to do what you want.
We chose that URL so as to be able to have http://svn.openEHR.org (i.e. 
our main URL is http://www.openEHR.org). Does no-one else do this? It 
seems an obvious thing to do...
> So to
> answer your question, in your configuration you can't.  I suppose you
> could check robots.txt into the top level of your repository.  But
> then it would be part of your project and so forth.
>   
No, I think that would just be plain confusing, and I don't think it 
would work anyway... but surely there is a way to tell Apache to handle 
a request straight off '/' in a certain way, even with this 
configuration?

I don't think changing the URL is an option - it is published all over 
the place and everyone in our community knows it. I know we could put 
some kind of redirect in, but that seems a pretty poor option as well... 
I am quite surprised to find that the configuration which seems to be 
preferred by the subversion manual is not compatible with managing web 
robots...

- thomas



Re: stopping webcrawlers using robots.txt

Posted by Bob Proulx <bo...@proulx.com>.
Thomas Beale wrote:
> how to make /robots.txt visible in an apache virtual host config for
> a subversion server.

Put it in your document root.  Let it be served normally by the web
server.

> How would I tell Apache to allow requests to read /robots.txt given
> the following configuration?
> ...
>         <Location />
>                 DAV svn
>                 SVNParentPath /usr/local/var/svn

Oh, I see.  You have configured your subversion repositories as the only
visible paths in your web server.  In my opinion you have made a bad
choice of repository location.  By putting them directly at the root you
have prevented yourself from doing anything else with your web server.
At that point I think you are unable to do what you want.  So to
answer your question, in your configuration you can't.  I suppose you
could check robots.txt into the top level of your repository.  But
then it would be part of your project and so forth.

I suggest you reconfigure your server to put the subversion
repositories under an /svn directory.  That will release your web
server for other uses under your document root.

>         <Location /svn>

That way you can put a robots.txt file into your document root and all
will behave normally.
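
Applied to your original configuration, that would look something like 
this (the DocumentRoot path is illustrative; the contents of the 
Location block are unchanged):

<VirtualHost 1.2.3.4>
         ServerAdmin webmaster@xxxx.org

         ServerName svn.xxxx.org

         # robots.txt lives here and is served as a normal file
         DocumentRoot /usr/local/var/www-root

         <Location /svn>
                 DAV svn
                 SVNParentPath /usr/local/var/svn

                 # authorisation
                 AuthzSVNAccessFile /etc/subversion/access-control

                 # authentication
                 AuthType Basic
                 AuthName "development Subversion Repository"
                 AuthUserFile /etc/subversion/authentication

                 # anonymous access rules
                 Satisfy Any
                 Require valid-user
         </Location>
</VirtualHost>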

Sorry that I am proposing that you change your URL.  But I think that
is the best course of action.  If you don't do it for this problem you
will eventually need to do it for another reason later.  The earlier
you make URL changes the better because it only gets more painful
later.

Bob


Re: stopping webcrawlers using robots.txt

Posted by "Todd D. Esposito" <To...@ToddEsposito.com>.
Thomas,

Try inserting something like what I've indicated below, inline with your
vhost block:

On Sun, July 9, 2006 13:02, Thomas Beale said:
>
> <VirtualHost 1.2.3.4>
>          ServerAdmin webmaster@xxxx.org
>
>          ServerName svn.xxxx.org

Alias /robots.txt /some/non/svn/path/robots.txt
<Location /robots.txt>
  SetHandler default-handler
</Location>

>
>          <Location />
>                  DAV svn
>                  SVNParentPath /usr/local/var/svn
>
>                  # authorisation
>                  AuthzSVNAccessFile /etc/subversion/access-control
>
>                  # authentication
>                  AuthType Basic
>                  AuthName "development Subversion Repository"
>                  AuthUserFile /etc/subversion/authentication
>
>                  # anonymous access rules
>                  Satisfy Any
>                  Require valid-user
>          </Location>
> </VirtualHost>

That should do it, but be warned I didn't test it before sending it off;
YMMV.

Todd D. Esposito
www.Turtol.com  -- Web Applications and Hosting
Todd@Turtol.com
Todd@ToddEsposito.com

