Posted to oro-user@jakarta.apache.org by Janek Bogucki <ja...@yahoo.co.uk> on 2001/10/15 12:14:02 UTC

Performance dips with four matches

Hi,

I have a regex which I use to parse HTML files which
are marked up with HTML comments. Performance is fine
but then dips as I increase the number of contained
matches in the page.

<html>
<body>
<!-- name="news"  -->
The news goes here.
Add the template in /_include/templates.html.
Could we use miniburst.gif for the icon?
<!-- name="/news" -->
</body>
</html>

The main thing is the pairing of the string "news" in
the HTML comments. I have found a performance dip when
parsing a document with 4 of these sections in sequence:

<!-- name="A" -->
--- some HTML ---
<!-- name="/A" -->
<!-- name="B" -->
--- some HTML ---
<!-- name="/B" -->
<!-- name="C" -->
--- some HTML ---
<!-- name="/C" -->
<!-- name="D" -->
--- some HTML ---
<!-- name="/D" -->
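
To reproduce a page of this shape, a small helper along these lines could
generate N consecutive sections. This is only a sketch: the class name
SectionPageBuilder, the method buildPage, and the single-letter section
names are assumptions for illustration, not part of the original code.

```java
class SectionPageBuilder {

    /**
     * Builds a page of n consecutive sections named A, B, C, ...
     * Each section wraps the given body in an opening and a closing
     * name comment. Hypothetical helper; n is limited to 26 letters.
     */
    static String buildPage(int n, String body) {
        if (n < 0 || n > 26) {
            throw new IllegalArgumentException("n must be between 0 and 26");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            char name = (char) ('A' + i);
            sb.append("<!-- name=\"").append(name).append("\" -->\n");
            sb.append(body).append('\n');
            sb.append("<!-- name=\"/").append(name).append("\" -->\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints the four-section layout shown above.
        System.out.print(buildPage(4, "--- some HTML ---"));
    }
}
```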

My pattern can parse 3 such pairs in 12 seconds, but
when I moved to 4 pairs it took about 3300 seconds. I
have an alternative approach that I will use; however,
I'm interested to know whether there is a problem with
my expression (which otherwise works).

This is my pattern (for the curious).

    private static final String HTML_PATTERN =
    /*
     * <!-- name="news" -->
     * or
     * <!-- name="news" template="news-tmpl" -->
     *
     * (Remember to escape backslashes:
     *
     * \n -> \\n
     * \w -> \\w
     *
     * etc)
     *
     */
    "<!--\\s*name\\s*=\\s*\"([\\w\\-]+)\"\\s*(template\\s*=\\s*\"[\\w\\-]+\")?\\s*-->" +
    
    /*
     * enclosed content
     */
    "((\\s|.)*)" +
    
    /*
     * <!-- name="/news" -->
     */
    "<!--\\s*name\\s*=\\s*\"/\\1\"\\s*-->" ;

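For comparison, a minimal sketch of the same pairing idea with the
enclosed-content group written as a single character class matched
reluctantly. It uses java.util.regex rather than ORO, so it is an
assumption that the same change carries over to ORO's Perl5 syntax; the
class name SectionPairs and the method sectionNames are made up for
illustration. The motivation is that in (\s|.)* a whitespace character can
match either branch of the alternation, which can multiply the backtracking
paths the engine explores. Whether that accounts for the timings above has
not been verified.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SectionPairs {

    // Opening comment, e.g. <!-- name="news" -->, with an optional template attribute.
    private static final String OPEN =
        "<!--\\s*name\\s*=\\s*\"([\\w\\-]+)\"\\s*(template\\s*=\\s*\"[\\w\\-]+\")?\\s*-->";

    // Enclosed content: one character class, matched reluctantly (assumed variation).
    private static final String BODY = "([\\s\\S]*?)";

    // Closing comment must repeat the captured name, e.g. <!-- name="/news" -->.
    private static final String CLOSE = "<!--\\s*name\\s*=\\s*\"/\\1\"\\s*-->";

    private static final Pattern SECTION = Pattern.compile(OPEN + BODY + CLOSE);

    /** Returns the names of all paired sections found in the page, in order. */
    static List<String> sectionNames(String page) {
        List<String> names = new ArrayList<>();
        Matcher m = SECTION.matcher(page);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String page =
            "<!-- name=\"A\" -->\n--- some HTML ---\n<!-- name=\"/A\" -->\n" +
            "<!-- name=\"B\" -->\n--- some HTML ---\n<!-- name=\"/B\" -->\n";
        System.out.println(sectionNames(page));
    }
}
```

As in the original pattern, group 1 holds the section name, group 2 the
optional template attribute, and group 3 the enclosed content.
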
Many Thanks,
Janek Bogucki
