Posted to oro-user@jakarta.apache.org by Janek Bogucki <ja...@yahoo.co.uk> on 2001/10/15 12:14:02 UTC
Performance dips with four matches
Hi,
I have a regex which I use to parse HTML files which
are marked up with HTML comments. Performance is fine
but then dips as I increase the number of contained
matches in the page.
<html>
<body>
<!-- name="news" -->
The news goes here.
Add the template in /_include/templates.html.
Could we use miniburst.gif for the icon?
<!-- name="/news" -->
</body>
</html>
The main thing is the pairing of the string "news" in
the HTML comments. I have found a performance dip when
parsing a document with 4 of these sections in sequence:
<!-- name="A" -->
--- some HTML ---
<!-- name="/A" -->
<!-- name="B" -->
--- some HTML ---
<!-- name="/B" -->
<!-- name="C" -->
--- some HTML ---
<!-- name="/C" -->
<!-- name="D" -->
--- some HTML ---
<!-- name="/D" -->
My pattern can parse 3 such pairs in 12 seconds, but
when I moved to 4 pairs it took about 3300 seconds. I
have an alternative approach I will use; however, I'm
interested to know whether there is a problem with my
expression (which otherwise works).
This is my pattern (for the curious).
private static final String HTML_PATTERN =
    /*
     * <!-- name="news" -->
     * or
     * <!-- name="news" template="news-tmpl" -->
     *
     * (Remember to escape backslashes:
     *
     *   \n -> \\n
     *   \w -> \\w
     *
     * etc.)
     */
    "<!--\\s*name\\s*=\\s*\"([\\w\\-]+)\"\\s*(template\\s*=\\s*\"[\\w\\-]+\")?\\s*-->"
    +
    /*
     * enclosed content
     */
    "((\\s|.)*)" +
    /*
     * <!-- name="/news" -->
     */
    "<!--\\s*name\\s*=\\s*\"/\\1\"\\s*-->";
Many Thanks,
Janek Bogucki
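
A likely source of the slowdown in the pattern above is the enclosed-content
group "((\\s|.)*)". A space or tab matches both \s and ., so each such
character can be consumed in two ways, and the greedy * first runs to the end
of the document before backtracking toward the closing comment. The number of
paths the matcher may have to try can therefore grow very quickly as more
sections follow the first one. The following is a minimal sketch of one
possible rewrite, not a definitive fix. It assumes java.util.regex rather than
ORO (whose classes are not used here), and the names SectionParser and parse
are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SectionParser {

    // Opening and closing comments are matched exactly as in the original
    // pattern; only the "enclosed content" group is changed.
    private static final Pattern SECTION = Pattern.compile(
        "<!--\\s*name\\s*=\\s*\"([\\w\\-]+)\"\\s*(template\\s*=\\s*\"[\\w\\-]+\")?\\s*-->"
        // [\s\S] matches any character (newlines included) in exactly one way,
        // and the reluctant *? stops at the first closing comment that
        // satisfies the \1 backreference.
        + "([\\s\\S]*?)"
        + "<!--\\s*name\\s*=\\s*\"/\\1\"\\s*-->");

    /** Returns a map of section name to enclosed content, in document order. */
    public static Map<String, String> parse(String html) {
        Map<String, String> sections = new LinkedHashMap<>();
        Matcher m = SECTION.matcher(html);
        while (m.find()) {
            // Group 1 is the section name, group 3 the enclosed content.
            sections.put(m.group(1), m.group(3));
        }
        return sections;
    }

    public static void main(String[] args) {
        String html =
            "<!-- name=\"A\" -->\n--- some HTML ---\n<!-- name=\"/A\" -->\n"
            + "<!-- name=\"B\" -->\n--- some HTML ---\n<!-- name=\"/B\" -->\n";
        System.out.println(parse(html).keySet()); // prints [A, B]
    }
}
```

Note that the content is now group 3 with no inner group 4, so any code that
read the groups of the original pattern by index would need checking. Whether
ORO shows the same improvement with this change is an assumption that has not
been measured here.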