You are viewing a plain text version of this content. The canonical link for it is here.
Posted to oro-user@jakarta.apache.org by Jordi Salvat i Alabart <js...@atg.com> on 2003/12/02 13:33:08 UTC
Re: Need help optimizing a regexp
Hi Daniel,
just writing back to thank you for your help and report on the outcome:
As you said, Awk regexps perform the same whether I factor out that '<'.
It was easy to use Awk regexps on byte input, which reduced the
garbage-generation rate, but did not change CPU consumption measurably.
The 'good' Perl regexp (the one with the '<' factored out) still
performed slightly better than the Awk option in CPU usage. Since Perl's
are much more flexible and allowed for more readable regexps, I decided
to stay with Perl's.
--
Thanks again,
Jordi.
En/na Daniel F. Savarese ha escrit:
> In message <3F...@atg.com>, Jordi Salvat i Alabart writes:
>
>>First question (out of sheer curiosity): why is this later regexp faster
>>than the earlier one?
>
>
> The expression is too long for me to analyze on a glance, but anything
> you can do to rewrite a pattern that reduces backtracking will yield
> performance gains. It looks like you moved the common prefix < for
> each alternation to the beginning of the expression and grouped the rest.
> That will definitely reduce backtracking and since it definitively
> establishes the first character of the pattern, match attempts need
> only be made at instances of < in the input.
>
>
>>Second question: I would like to run the regexps against the HTML
>>content as a byte array (byte[]) without having to convert it into a
>>string. Can ORO do this?
>>
>>On the reasons why I don't want to do the byte[]-to-String conversion:
>>1/ Memory efficiency.
>>2/ I don't need it: even if there were multi-byte characters in the
>>input, they are not part of my problem.
>>3/ The conversion can cause problems if the input is wrong.
>
>
> Since you're dealing with 8-bit input, I think you may get better
> results using AwkMatcher. It would have been able to optimize
> your original expression rather than requiring you to tweak it
> yourself. In other words, both of the expressions you listed
> would have been converted into the same DFA, whereas they result
> in two different Perl NFAs. You still have to work with char
> input, but if you're just doing a single pass on the input
> an feeding an InputStreamReader that wraps ByteArrayInputStream
> to AwkStreamInput will work. But it would be more efficient
> to just store the HTML in a char[].
>
>
>>Third question: I've read that byte-based regexp engines use a type of
>>state machines which is significantly faster than char-based regexp
>>engines. Am I correct? Can ORO take advantage of this? Could you
>>recommend a regexp engine which can?
>
>
> The difference is really DFA versus NFA. You can't build a straight
> table-based DFA with 16-bit characters because you wind up with
> 64k possible inputs for each state. Building table based DFAs
> for 16-bit characters just uses too much memory or is too slow
> if you try to save memory using sparse matrices, so they tend to
> be implemented as NFAs of one sort or another. 8-bit characters
> give you 256 transitions for each state, which is manageable when
> using array-based table lookups. Even though AwkMatcher uses char
> input, it only pays attention to the lower 8 bits, so it can be
> rather fast in the right situations like your application. The
> thing to keep in mind is that it builds the DFA for a pattern
> in lazy fashion as it is performing a match, so initial matches
> (when the DFA is being built) will be slower than later matches
> (after the DFA has been built).
>
> In any case, I suggest you stick with Perl5Matcher if it meets your
> needs just because Perl regular expressions have a richer syntax
> than awk. Otherwise, try AwkMatcher. You shouldn't have to change
> any code other than making your PatternMatcher a new AwkMatcher()
> instead of a new Perl5Matcher(). However your pattern will have
> to change slightly since the non capturing (?:) group construct
> isn't supported by awk.
>
> daniel
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: oro-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: oro-user-help@jakarta.apache.org
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: oro-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: oro-user-help@jakarta.apache.org