Posted to users@tomcat.apache.org by Martin Kouba <ma...@symbiont-it.cz> on 2011/05/24 13:50:05 UTC

CrawlerSessionManagerValve question

What is the reason NOT to assume that a request with more than one 
User-Agent header originates from a bot?
See lines 133, 134 in Tomcat 7.0.14.

Thanks
Martin

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Re: CrawlerSessionManagerValve question

Posted by André Warnier <aw...@ice-sa.com>.
Mark Thomas wrote:
> On 24/05/2011 12:50, Martin Kouba wrote:
>> What is the reason NOT to assume that request with more than one
>> User-Agent header originates from a bot?
>> See lines 133, 134 in Tomcat 7.0.14.
> 
> Simply that none of the samples I looked at had multiple UA headers and
> a suggestion from another committer that skipping those requests might
> be a way to save a few cycles.
> 
> If you have traces that show multiple headers, I'd be interested in
> seeing them.
> 

From the RFC police :

RFC 2616, 4.2 Message Headers :

Multiple message-header fields with the same field-name MAY be present in a message if and 
only if the entire field-value for that header field is defined as a comma-separated list 
[i.e., #(values)].

(note the "if and only if")

RFC 2616, 14.43 User-Agent

User-Agent     = "User-Agent" ":" 1*( product | comment )

(so *not* defined as '#(values)')


==> (my interpretation) : multiple User-Agent headers are invalid.

Discussion :

14.43 otherwise says :

The field can contain multiple product tokens (section 3.8) and comments identifying the 
agent and any subproducts which form a significant part of the user agent. By convention, 
the product tokens are listed in order of their significance for identifying the application.

and 4.2 otherwise says :

It MUST be possible to combine the multiple header fields into one "field-name: 
field-value" pair, without changing the semantics of the message, by appending each 
subsequent field-value to the first, each separated by a comma. The order in which header 
fields with the same field-name are received is therefore significant to the 
interpretation of the combined field value, and thus a proxy MUST NOT change the order of 
these field values when a message is forwarded.

Thus, if one were to accept multiple User-Agent headers, and combine them as a 
comma-separated list, one would then have trouble respecting the "order of their 
significance" as expressed in 14.43.
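
To make the combining rule from 4.2 concrete, here is a minimal sketch in Java. The class and method names (HeaderCombiner, combine) and the sample agent strings are made up for illustration; this is not code from Tomcat. It only shows what "appending each subsequent field-value to the first, each separated by a comma" produces when applied to two User-Agent values.

```java
import java.util.List;

// Hypothetical helper, not part of Tomcat: applies the RFC 2616 section 4.2
// combining rule to the values of several header fields with the same name.
class HeaderCombiner {

    // Joins the field values in the order received, separated by a comma.
    // A space is added after the comma for readability; the RFC's list
    // syntax tolerates optional whitespace around the separator.
    static String combine(List<String> fieldValues) {
        return String.join(", ", fieldValues);
    }

    public static void main(String[] args) {
        // Two made-up User-Agent values, as if sent in two separate headers.
        String combined = combine(List.of("ExampleBot/1.0", "ExampleLib/2.3"));
        // prints "ExampleBot/1.0, ExampleLib/2.3"
        System.out.println(combined);
    }
}
```

The combined value now contains a comma between the two product tokens, which the User-Agent definition in 14.43 (a sequence of products and comments, not a '#' list) does not provide for.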

So it makes sense to allow only one User-Agent header.

And maybe the "lines 133, 134 in Tomcat 7.0.14" should be modified to reject the request 
if it has more than one such header?
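
A rough sketch of what such a check could look like, in Java. This is not the valve's actual code, and the names (UserAgentCheck, countValues, shouldReject) are invented here. It assumes the values arrive as a java.util.Enumeration<String>, which is the type the Servlet 3.0 API returns from HttpServletRequest.getHeaders("User-Agent"); what "reject" would mean in practice (for example a 400 response) is left open.

```java
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical sketch, not taken from CrawlerSessionManagerValve.
class UserAgentCheck {

    // Counts the values in an Enumeration, e.g. the one obtained from
    // HttpServletRequest.getHeaders("User-Agent"). A null argument is
    // treated as "no values".
    static int countValues(Enumeration<String> values) {
        int count = 0;
        while (values != null && values.hasMoreElements()) {
            values.nextElement();
            count++;
        }
        return count;
    }

    // Assumption for this sketch: a request is rejected only when it
    // carries more than one User-Agent header.
    static boolean shouldReject(Enumeration<String> userAgentValues) {
        return countValues(userAgentValues) > 1;
    }

    public static void main(String[] args) {
        Enumeration<String> one =
                Collections.enumeration(List.of("ExampleBot/1.0"));
        Enumeration<String> two =
                Collections.enumeration(List.of("ExampleBot/1.0", "ExampleLib/2.3"));
        System.out.println(shouldReject(one)); // prints "false"
        System.out.println(shouldReject(two)); // prints "true"
    }
}
```

A request with no User-Agent header at all is not rejected by this sketch, since the RFC text quoted above only argues against sending more than one.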





Re: CrawlerSessionManagerValve question

Posted by Mark Thomas <ma...@apache.org>.
On 24/05/2011 12:50, Martin Kouba wrote:
> What is the reason NOT to assume that request with more than one
> User-Agent header originates from a bot?
> See lines 133, 134 in Tomcat 7.0.14.

Simply that none of the samples I looked at had multiple UA headers and
a suggestion from another committer that skipping those requests might
be a way to save a few cycles.

If you have traces that show multiple headers, I'd be interested in
seeing them.

Cheers,

Mark


