You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@tomcat.apache.org by bu...@apache.org on 2020/04/05 20:22:13 UTC

[Bug 64309] New: Improve repository regular expression performance used for bootstrapping Catalina

https://bz.apache.org/bugzilla/show_bug.cgi?id=64309

            Bug ID: 64309
           Summary: Improve repository regular expression performance used
                    for bootstrapping Catalina
           Product: Tomcat 9
           Version: 9.0.x
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Catalina
          Assignee: dev@tomcat.apache.org
          Reporter: paulmuriel.biyabi@gmail.com
  Target Milestone: -----

The goal of this enhancement is to improve the regular expression used for
searching class loader repositories when bootstrapping Catalina in Tomcat 9

The regular expression currently used is (\".*?\")|(([^,])*)
The above regular expression an alternation.
An alternation is used to match a single regular expression out of several
possible regular expressions.
it uses a vertical bar or pipe symbol (|) to separate the different options in
a regular expression.
In our case, the different options are (\".*?\") and (([^,])*)

Our enhancement is about the first option, namely, (\".*?\") 

As can be seen in the first option, there is a question mark is in fact a lazy
quantifier. Lazy quantifiers makes repeatable symbols such as the asterisk (*)
match as few characters as possible. That is fine. But that comes with a cost
called, backtracking, implemented by regex-directed engines. There are broadly
two types of regular expression engines, namely, regex-directed engines and
text-directed engines. The Java regular expression engine is regex-directed,
which means it implements backtracking. Backtracking slows down regex-directed
engines. Therefore, if it is possible to get rid of backtracking by finding an
equivalent regular expression, then the application gains in performance.

Now, looking at the regular (\".*?\") expression, it simply says: “Give me the
shortest sequence of characters enclosed in double quotes”. The word shortest
comes from the usage of the lazy quantifier (?) as explained above. An
equivalent regular expression is (\"[^\"]*\")

The equivalent regular expression replaces the .* with the [^\"] (negated
character class). The negated character class does not use backtracking and is
therefore by far more efficient.

Let us examine the efficiencies of both regular expressions on a sample string,
namely (\"abc\") 

Case 1: With the lazy quantifier.
 In this case, the regex engine does the following:

It compares the first token \" in the regular expression with the first
character \" in the string. Both match. The engine now compares the next token
.* to the second character a in the string. Again, both match. Because of the
lazy quantifier (?), the engine compares the next character b in the string to
the to the last token \"  in the regular expression. But both do not match. The
engine backtracks to the .* token and notices that b matches that token. The
engine again compares the next character c to the last token \" in the regular
expression. They do not match. The engine backtracks again to .*  and notices 
that c matches that token. The engine now compares the last token \" in the
regular expression to the last character \" in the string, and finds that both
match. The engine reports a match. For the above string with 3 characters (a,
b, and c) between quotes, it took 7 steps for the engine to find a match. For n
(n>=1) characters between quotes, it will take the engine 3 + 2(n-1) steps.

Case 2: With the negated character class.
In this case, the regex engine does:

It compares the first token \" in the regular expression with the first
character \" in the string. Both match. The engine now compares the next token
[^\"] to the second character a in the string. They both match (because the
character a is not equal to the quote character). The engine again compares
[^\"] to the next character b in the string (because of the repetition symbol
*). Again, they match. The engine again compares [^\"] to the next character c
in the string (because of the repetition symbol *).  Again, they match. The
engine again compares [^\"] to the next character \" in the string (because of
the repetition symbol *). Now, they do not match because of the negation (^).
However, because at least one match was found for the repetition symbol, the
engine compares the last token \" in the regular expression with the last
character \" in the string. Both match. The engine reports a match.

For the above string with 3 characters (a, b, and c) between quotes, it took 6
steps for the engine to find a match. For n (n>=1) characters between quotes,
it will take the engine n + 3 steps.

Therefore, for case 1, we have  3 + 2(n-1) steps, and for case 2, we have  n +
3 steps. As n grows, 3 + 2(n-1) becomes very bigger than n + 3.

For example, if n = 50:
3 + 2(n-1)  gives 101
n + 3 gives 53
We therefore avoid 48 computations with the negated character sequence.

In conclusion, we gain be using (\"[^\"]*\")|(([^,])*) rather than
(\".*?\")|(([^,])*) for searching class loader repositories when bootstrapping
Catalina in Tomcat 9.

I will create a pull request for the enhancement if this proposal is approved.

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@tomcat.apache.org
For additional commands, e-mail: dev-help@tomcat.apache.org


[Bug 64309] Improve repository regular expression performance used for bootstrapping Catalina

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=64309

--- Comment #1 from Mark Thomas <ma...@apache.org> ---
No objections but I do wonder if the performance difference is noticeable or
even measurable given that this happens once on start-up.

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@tomcat.apache.org
For additional commands, e-mail: dev-help@tomcat.apache.org


[Bug 64309] Improve repository regular expression performance used for bootstrapping Catalina

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=64309

Mark Thomas <ma...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #2 from Mark Thomas <ma...@apache.org> ---
Fixed in:
- master for 10.0.0-M5 onwards
- 9.0.x for 9.0.35 onwards
- 8.5.x for 8.5.55 onwards

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@tomcat.apache.org
For additional commands, e-mail: dev-help@tomcat.apache.org