Posted to users@tomcat.apache.org by Chris Cheshire <ya...@gmail.com> on 2020/02/28 14:51:07 UTC

CrawlerSessionManagerValve

(9.0.31)

What is the reason why the pattern isn't compiled with the case
insensitive flag? Is it due to performance?

Chris

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Re: CrawlerSessionManagerValve

Posted by Christopher Schultz <ch...@christopherschultz.net>.

Chris,

On 2/28/20 13:25, Chris Cheshire wrote:
> On Fri, Feb 28, 2020 at 12:51 PM Christopher Schultz
> <ch...@christopherschultz.net> wrote:
>>
>> Chris and Mark,
>>
>> On 2/28/20 11:51, Mark Thomas wrote:
>>> On 28/02/2020 14:51, Chris Cheshire wrote:
>>>> (9.0.31)
>>>>
>>>> What is the reason why the pattern isn't compiled with the
>>>> case insensitive flag? Is it due to performance?
>>>
>>> I wrote that Valve. At least the first iteration anyway.
>>> Others improved it along the way.
>>>
>>> I honestly can't remember why I opted for [bB]ot rather than
>>> using CASE_INSENSITIVE.
>>>
>>> I do remember that the focus was on fixing an issue we (the
>>> ASF) were having with our public Jira instance at the time in
>>> that bots were generating huge numbers of sessions and, in
>>> turn, using up large amounts of memory.
>>>
>>> Looking at it with the benefit of hindsight I'd worry about:
>>> - performance
>>> - avoiding false positives
>>>
>>> There probably isn't much in it but I'd expect the current
>>> solution is the right one for both of those. Unless you have a
>>> very different UA pattern, in which case CASE_INSENSITIVE might
>>> help. But I am guessing about the performance which really
>>> isn't the done thing.
>>>
>>> If someone was to demonstrate that there was a measurable
>>> performance benefit to some realistic patterns to using
>>> CASE_INSENSITIVE then I'd support an enhancement to add an
>>> attribute to specify the flags to use when compiling the
>>> pattern.
>>
>
> More of a curiosity. I am doing some crawler checking in my webapp.
> I have a grossly repetitive regex and I was looking at this valve
> as an example to optimize things a bit. I figured if it was a CI
> check then it would negate the need for patterns like [bB].
>
> There are a couple of common patterns that it is leaving out though:
>
> .*[sS]p[iy]der.*
> .*facebookexternalhit.*
> .*(Mediapartners|Feedfetcher)-[gG]oogle.*
>
> (last one is adding 'mediapartners' to the subpattern already in
> your default regex)
>
>> You can always use the (?i) flag-enabler if you want to use
>> case-insensitive matches without changing the code.
>>
>
> +1 Did not know about this! If the flags can be specified in the
> pattern itself, then there probably isn't much need for adding
> extra attributes to the valve to achieve it.
>
> Java regex tutorial[1] does say there is a slight performance hit
> for a CI check, but it's not quantified. With processing speed
> increases, my guess is it is completely negligible per request.

Case-insensitive checks are probably pretty quick unless you start to
get into Unicode casing and locale-specific casing. Converting [A-Z]
-> [a-z] is a simple range check on the byte value followed by adding
('a' - 'A'). If you need to be able to convert Б to lowercase before
performing that comparison, it requires ... some additional effort.

See the documentation for the (?u) or UNICODE_CASE flag.
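
To make the ASCII versus Unicode distinction concrete, here is a minimal
sketch. The class and method names (UnicodeCaseDemo, asciiToLower,
matches) are made up for illustration and are not part of Tomcat. It
relies on the documented behaviour of java.util.regex.Pattern:
CASE_INSENSITIVE on its own only folds US-ASCII letters, and
UNICODE_CASE (or the embedded (?u) flag) has to be added for letters
such as Б.

```java
import java.util.regex.Pattern;

final class UnicodeCaseDemo {
    // ASCII-only lowercasing: a range check plus a fixed offset of 'a' - 'A'.
    static char asciiToLower(char c) {
        return (c >= 'A' && c <= 'Z') ? (char) (c + ('a' - 'A')) : c;
    }

    // Compiles the regex (embedded flags included) and tests the whole input.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String upperBe = "\u0411"; // CYRILLIC CAPITAL LETTER BE
        String lowerBe = "\u0431"; // CYRILLIC SMALL LETTER BE

        System.out.println(asciiToLower('B'));
        // ASCII letters are folded by (?i) alone.
        System.out.println(matches("(?i)b", "B"));
        // Non-ASCII letters are not folded by (?i) alone.
        System.out.println(matches("(?i)" + lowerBe, upperBe));
        // Adding (?u) enables Unicode-aware case folding.
        System.out.println(matches("(?iu)" + lowerBe, upperBe));
    }
}
```

For User-Agent strings, which are overwhelmingly ASCII, plain (?i) is
likely to be all that is needed.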

-chris


Re: CrawlerSessionManagerValve

Posted by Chris Cheshire <ya...@gmail.com>.
On Fri, Feb 28, 2020 at 12:51 PM Christopher Schultz
<ch...@christopherschultz.net> wrote:
>
> Chris and Mark,
>
> On 2/28/20 11:51, Mark Thomas wrote:
> > On 28/02/2020 14:51, Chris Cheshire wrote:
> >> (9.0.31)
> >>
> >> What is the reason why the pattern isn't compiled with the case
> >> insensitive flag? Is it due to performance?
> >
> > I wrote that Valve. At least the first iteration anyway. Others
> > improved it along the way.
> >
> > I honestly can't remember why I opted for [bB]ot rather than
> > using CASE_INSENSITIVE.
> >
> > I do remember that the focus was on fixing an issue we (the ASF)
> > were having with our public Jira instance at the time in that bots
> > were generating huge numbers of sessions and, in turn, using up
> > large amounts of memory.
> >
> > Looking at it with the benefit of hindsight I'd worry about:
> > - performance
> > - avoiding false positives
> >
> > There probably isn't much in it but I'd expect the current solution
> > is the right one for both of those. Unless you have a very
> > different UA pattern, in which case CASE_INSENSITIVE might help.
> > But I am guessing about the performance which really isn't the done
> > thing.
> >
> > If someone was to demonstrate that there was a measurable
> > performance benefit to some realistic patterns to using
> > CASE_INSENSITIVE then I'd support an enhancement to add an
> > attribute to specify the flags to use when compiling the pattern.
>

More of a curiosity. I am doing some crawler checking in my webapp. I
have a grossly repetitive regex and I was looking at this valve as an
example to optimize things a bit. I figured if it was a CI check then
it would negate the need for patterns like [bB].

There are a couple of common patterns that it is leaving out though:

.*[sS]p[iy]der.*
.*facebookexternalhit.*
.*(Mediapartners|Feedfetcher)-[gG]oogle.*

(last one is adding 'mediapartners' to the subpattern already in your
default regex)
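
As a rough sketch of how these three patterns behave when joined into
a single alternation, the snippet below checks a few User-Agent
strings. The class name, the EXTRA_CRAWLERS constant and the sample
User-Agent values are illustrative assumptions, and the alternation
shown is not the valve's actual default.

```java
import java.util.regex.Pattern;

final class ExtraCrawlerPatterns {
    // The three patterns from the message above, joined with '|'.
    static final Pattern EXTRA_CRAWLERS = Pattern.compile(
            ".*[sS]p[iy]der.*"
            + "|.*facebookexternalhit.*"
            + "|.*(Mediapartners|Feedfetcher)-[gG]oogle.*");

    // True when the whole User-Agent string matches one of the alternatives.
    static boolean isCrawler(String userAgent) {
        return userAgent != null && EXTRA_CRAWLERS.matcher(userAgent).matches();
    }

    public static void main(String[] args) {
        // Sample User-Agent values, for illustration only.
        String[] agents = {
            "Mozilla/5.0 (compatible; Baiduspider/2.0)",
            "facebookexternalhit/1.1",
            "Mediapartners-Google",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
        };
        for (String ua : agents) {
            System.out.println(isCrawler(ua) + "  " + ua);
        }
    }
}
```

A combined expression like this could in principle be supplied through
the valve's crawlerUserAgents attribute, alongside whatever default
alternatives are still wanted.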

> You can always use the (?i) flag-enabler if you want to use
> case-insensitive matches without changing the code.
>

+1 Did not know about this! If the flags can be specified in the
pattern itself, then there probably isn't much need for adding extra
attributes to the valve to achieve it.

The Java regex tutorial [1] does say there is a slight performance hit
for a CI check, but it doesn't quantify it. Given how fast processors
are now, my guess is that it is completely negligible per request.

Chris

[1] https://docs.oracle.com/javase/tutorial/essential/regex/pattern.html



Re: CrawlerSessionManagerValve

Posted by Christopher Schultz <ch...@christopherschultz.net>.

Chris and Mark,

On 2/28/20 11:51, Mark Thomas wrote:
> On 28/02/2020 14:51, Chris Cheshire wrote:
>> (9.0.31)
>>
>> What is the reason why the pattern isn't compiled with the case
>> insensitive flag? Is it due to performance?
>
> I wrote that Valve. At least the first iteration anyway. Others
> improved it along the way.
>
> I honestly can't remember why I opted for [bB]ot rather than
> using CASE_INSENSITIVE.
>
> I do remember that the focus was on fixing an issue we (the ASF)
> were having with our public Jira instance at the time in that bots
> were generating huge numbers of sessions and, in turn, using up
> large amounts of memory.
>
> Looking at it with the benefit of hindsight I'd worry about:
> - performance
> - avoiding false positives
>
> There probably isn't much in it but I'd expect the current solution
> is the right one for both of those. Unless you have a very
> different UA pattern, in which case CASE_INSENSITIVE might help.
> But I am guessing about the performance which really isn't the done
> thing.
>
> If someone was to demonstrate that there was a measurable
> performance benefit to some realistic patterns to using
> CASE_INSENSITIVE then I'd support an enhancement to add an
> attribute to specify the flags to use when compiling the pattern.

You can always use the (?i) flag-enabler if you want to use
case-insensitive matches without changing the code.
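
A minimal sketch of what that looks like in plain Java, outside the
valve. The class name and the sample User-Agent strings are invented
for illustration. It compares an explicit [bB] class, the embedded
(?i) flag, and the Pattern.CASE_INSENSITIVE compile flag.

```java
import java.util.regex.Pattern;

final class InlineFlagDemo {
    // Explicit character class: accepts "bot" and "Bot" only.
    static final Pattern EXPLICIT = Pattern.compile(".*[bB]ot.*");
    // Embedded flag: case-insensitivity lives in the pattern string itself.
    static final Pattern INLINE = Pattern.compile("(?i).*bot.*");
    // Same matching behaviour as INLINE, but requested through the compile flag.
    static final Pattern FLAGGED = Pattern.compile(".*bot.*", Pattern.CASE_INSENSITIVE);

    // True when the whole User-Agent string matches the pattern.
    static boolean isCrawler(Pattern p, String userAgent) {
        return p.matcher(userAgent).matches();
    }

    public static void main(String[] args) {
        // Made-up User-Agent values.
        String[] agents = { "Googlebot/2.1", "ExampleBot/1.0", "EXAMPLEBOT/1.0" };
        for (String ua : agents) {
            System.out.println(ua
                    + " explicit=" + isCrawler(EXPLICIT, ua)
                    + " inline=" + isCrawler(INLINE, ua)
                    + " flagged=" + isCrawler(FLAGGED, ua));
        }
    }
}
```

The last sample is only matched by the case-insensitive variants, which
is the false-positive trade-off mentioned earlier in the thread: (?i)
is broader than [bB]ot, not just shorter.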

-chris


Re: CrawlerSessionManagerValve

Posted by Mark Thomas <ma...@apache.org>.
On 28/02/2020 14:51, Chris Cheshire wrote:
> (9.0.31)
> 
> What is the reason why the pattern isn't compiled with the case
> insensitive flag? Is it due to performance?

I wrote that Valve. At least the first iteration anyway. Others improved
it along the way.

I honestly can't remember why I opted for [bB]ot rather than using
CASE_INSENSITIVE.

I do remember that the focus was on fixing an issue we (the ASF) were
having with our public Jira instance at the time in that bots were
generating huge numbers of sessions and, in turn, using up large amounts
of memory.

Looking at it with the benefit of hindsight I'd worry about:
- performance
- avoiding false positives

There probably isn't much in it but I'd expect the current solution is
the right one for both of those. Unless you have a very different UA
pattern, in which case CASE_INSENSITIVE might help. But I am guessing
about the performance which really isn't the done thing.

If someone were to demonstrate a measurable performance benefit from
using CASE_INSENSITIVE with some realistic patterns, then I'd support
an enhancement to add an attribute to specify the flags to use when
compiling the pattern.
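
For anyone who wants to try such a comparison, a crude timing sketch
might look like the following. Everything in it is an assumption made
for illustration: the class name, the sample User-Agent strings and
the round count. A System.nanoTime loop is only a rough indicator; a
proper harness such as JMH would be needed before drawing conclusions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class CaseFlagTiming {
    // Counts how many User-Agent strings fully match the pattern.
    static int countMatches(Pattern p, List<String> agents) {
        int n = 0;
        for (String ua : agents) {
            if (p.matcher(ua).matches()) {
                n++;
            }
        }
        return n;
    }

    // Wall-clock nanoseconds for 'rounds' passes over the list (no JIT warm-up control).
    static long timeNanos(Pattern p, List<String> agents, int rounds) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            sink += countMatches(p, agents);
        }
        long elapsed = System.nanoTime() - start;
        if (sink < 0) {
            throw new IllegalStateException("unreachable; keeps the loop result live");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        // Sample User-Agent values, for illustration only.
        List<String> agents = Arrays.asList(
                "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
                "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
        Pattern explicit = Pattern.compile(".*[bB]ot.*");
        Pattern insensitive = Pattern.compile(".*bot.*", Pattern.CASE_INSENSITIVE);

        int rounds = 200_000;
        System.out.println("explicit ns:    " + timeNanos(explicit, agents, rounds));
        System.out.println("insensitive ns: " + timeNanos(insensitive, agents, rounds));
    }
}
```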

Mark
