Posted to oro-dev@jakarta.apache.org by "Bob Dickinson (BSL)" <bo...@brutesquadlabs.com> on 2002/10/14 12:42:10 UTC

ORO Performance

In my previous post I mentioned that we found ORO to be faster than JDK 1.4
regex.  I tend to agree with Jeffrey Friedl about regex performance
profiling, in that it's hard to come up with meaningful comparisons, but
here is what we found while attempting to model our application's use of
regex...

We tested ORO versus JDK 1.4 regex with the regular expressions from the
SpamAssassin rule set.  This is a set of roughly 600 expressions ranging
from creepy-complicated to very simple.  Each package failed to handle
roughly 20 expressions, so those expressions were not included in the
respective tests (the unsupported ORO expressions were all caused by the
lack of support for (?-i) and the unsupported JDK expressions were caused by
a parsing bug where it didn't understand that certain characters don't need
to be escaped inside of a character class).
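
To illustrate the two constructs mentioned above, here is a minimal sketch
using java.util.regex in a current JDK.  The patterns and the class name are
made up for illustration; they are not taken from the SpamAssassin rule set,
and the characters in the character class are not necessarily the ones that
tripped up the JDK 1.4 parser.

```java
import java.util.regex.Pattern;

class InlineFlagSketch {
    // True if the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // (?i) turns case-insensitive matching on, (?-i) turns it off again,
        // so "abc" matches in any case but "DEF" must be upper case.
        String toggled = "(?i)abc(?-i)DEF";
        System.out.println(fullMatch(toggled, "ABCDEF"));
        System.out.println(fullMatch(toggled, "abcdef"));

        // Inside a character class, characters such as . * + are literals
        // and do not need a backslash (illustrative choice of characters).
        String unescaped = "[.*+]";
        System.out.println(fullMatch(unescaped, "*"));
        System.out.println(fullMatch(unescaped, "a"));
    }
}
```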

We tested using a single text stream, which was roughly 3 KB of data pulled
from an email message (which included plaintext and html).  About 10 of the
600 expressions produced a match (modeling our expected use).  We ran each
expression 1000 times to get an average time per expression.  The
expressions were compiled at the start of the test before test timing began.
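
The procedure above can be sketched roughly as follows.  This is not the
harness we actually ran; it is a minimal illustration that uses
java.util.regex only, and the class name, method names, sample text, and
sample patterns are all hypothetical.  It also uses System.nanoTime(), which
is not available in JDK 1.4.  An ORO version would presumably follow the same
shape, compiling with Perl5Compiler and searching with Perl5Matcher.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class RegexTimingSketch {
    // Average milliseconds per search of `text` with an already compiled pattern.
    static double averageMillis(Pattern pattern, String text, int runs) {
        int matches = 0;
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            if (pattern.matcher(text).find()) {
                matches++;
            }
        }
        long elapsed = System.nanoTime() - start;
        // Touch the counter so the loop body cannot be optimized away.
        if (matches < 0) {
            throw new IllegalStateException("unreachable");
        }
        return elapsed / 1_000_000.0 / runs;
    }

    // Average over all patterns of each pattern's average time.
    static double averageMillisPerExpression(List<Pattern> patterns, String text, int runs) {
        double total = 0.0;
        for (Pattern p : patterns) {
            total += averageMillis(p, text, runs);
        }
        return total / patterns.size();
    }

    public static void main(String[] args) {
        // Placeholder input; the real test used about 3 KB of email text.
        String text = "Placeholder message body with <b>html</b> and plain text.";
        String[] sources = { "(?i)free money", "<b>[^<]*</b>", "\\d{3}-\\d{4}" };

        // Compile everything up front so compilation is excluded from timing.
        List<Pattern> compiled = new ArrayList<>();
        for (String source : sources) {
            compiled.add(Pattern.compile(source));
        }

        double avg = averageMillisPerExpression(compiled, text, 1000);
        System.out.println(avg + " ms per expression");
    }
}
```

A serious measurement would also want a warm-up pass before timing, since the
first iterations run before the JIT has compiled the matching code.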

Attached is an XML file that contains the expression sets and times for ORO
and SUN.  The "time" attribute on the top level node is the average number
of milliseconds per expression.  The "time" attribute on each individual
expression is the average number of milliseconds for that expression.  The
list for each package is sorted by time.

The bottom line result was 0.256 ms per expression for ORO and 0.416 ms per
expression for SUN, so SUN took roughly 1.6 times as long on average, which
is a pretty significant difference.  It is
interesting to see that the packages have different performance curves (SUN
has the slowest slow expressions and the fastest fast expressions), and to
see that the list of degenerate case expressions is only partially the same
between the packages.

We did some hand tuning to address the degenerate cases for both packages,
and while there was some improvement, the relative performance stayed more
or less the same.

We also tested the regex packages in our application, sending batches of
10,000 messages at a time through a relay that applied all of these
expressions.  Those results were consistent with the observed performance of
the packages above.

Bob Dickinson
Brute Squad Labs, Inc.