You are viewing a plain text version of this content. The canonical link for it is here.
Posted to oro-user@jakarta.apache.org by Jordi Salvat i Alabart <js...@atg.com> on 2003/12/02 13:33:08 UTC
Re: Need help optimizing a regexp

Hi Daniel,

just writing back to thank you for your help and report on the outcome:

As you said, Awk regexps perform the same whether I factor out that '<'.

It was easy to use Awk regexps on byte input, which reduced the 
garbage-generation rate, but did not change CPU consumption measurably.

The 'good' Perl regexp (the one with the '<' factored out) still 
performed slightly better than the Awk option in CPU usage. Since Perl's 
are much more flexible and allowed for more readable regexps, I decided 
to stay with Perl's.

-- 
Thanks again,

Jordi.

En/na Daniel F. Savarese ha escrit:
> In message <3F...@atg.com>, Jordi Salvat i Alabart writes:
> 
>>First question (out of sheer curiosity): why is this later regexp faster 
>>than the earlier one?
> 
> 
> The expression is too long for me to analyze on a glance, but anything
> you can do to rewrite a pattern that reduces backtracking will yield
> performance gains.  It looks like you moved the common prefix < for
> each alternation to the beginning of the expression and grouped the rest.
> That will definitely reduce backtracking and since it definitively
> establishes the first character of the pattern, match attempts need
> only be made at instances of < in the input.
> 
> 
>>Second question: I would like to run the regexps against the HTML 
>>content as a byte array (byte[]) without having to convert it into a 
>>string. Can ORO do this?
>>
>>On the reasons why I don't want to do the byte[]-to-String conversion:
>>1/ Memory efficiency.
>>2/ I don't need it: even if there were multi-byte characters in the 
>>input, they are not part of my problem.
>>3/ The conversion can cause problems if the input is wrong.
> 
> 
> Since you're dealing with 8-bit input, I think you may get better
> results using AwkMatcher.  It would have been able to optimize
> your original expression rather than requiring you to tweak it
> yourself.  In other words, both of the expressions you listed
> would have been converted into the same DFA, whereas they result
> in two different Perl NFAs.  You still have to work with char
> input, but if you're just doing a single pass on the input
> an feeding an InputStreamReader that wraps ByteArrayInputStream
> to AwkStreamInput will work.  But it would be more efficient
> to just store the HTML in a char[].
> 
> 
>>Third question: I've read that byte-based regexp engines use a type of 
>>state machines which is significantly faster than char-based regexp 
>>engines. Am I correct? Can ORO take advantage of this? Could you 
>>recommend a regexp engine which can?
> 
> 
> The difference is really DFA versus NFA.  You can't build a straight
> table-based DFA with 16-bit characters because you wind up with
> 64k possible inputs for each state.  Building table based DFAs
> for 16-bit characters just uses too much memory or is too slow
> if you try to save memory using sparse matrices, so they tend to
> be implemented as NFAs of one sort or another.  8-bit characters
> give you 256 transitions for each state, which is manageable when
> using array-based table lookups.  Even though AwkMatcher uses char
> input, it only pays attention to the lower 8 bits, so it can be
> rather fast in the right situations like your application.  The
> thing to keep in mind is that it builds the DFA for a pattern
> in lazy fashion as it is performing a match, so initial matches
> (when the DFA is being built) will be slower than later matches
> (after the DFA has been built).
> 
> In any case, I suggest you stick with Perl5Matcher if it meets your
> needs just because Perl regular expressions have a richer syntax
> than awk.  Otherwise, try AwkMatcher.  You shouldn't have to change
> any code other than making your PatternMatcher a new AwkMatcher()
> instead of a new Perl5Matcher().  However your pattern will have
> to change slightly since the non capturing (?:) group construct
> isn't supported by awk.
> 
> daniel
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: oro-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: oro-user-help@jakarta.apache.org
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: oro-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: oro-user-help@jakarta.apache.org