You are viewing a plain text version of this content. The canonical link for it is here.
Posted to j-dev@xerces.apache.org by "Geoff M. Granum" <ge...@lookingglasssoftware.com> on 2007/07/06 04:17:42 UTC

Re: More on (Re: (XERCESJ-589) Bug with pattern restriction on long strings)

Hello Michael,

Everything works so far, with the exception of UNION, which technically  
works fine but exposes an infinite loop bug which was being broken by a  
stack overflow before.

The regex: (((((boy)|(girl))[0-1][x-z]{2})?)|(man|woman)[0-1]?[y|n])*
With Target: boy0xxwoman1ygirl1xyman

Loops forever (the ending is invalid, needs [0-1]?[y|n] ). Well, forever  
being 'until you run out of memory'. I didn't notice this until I set my  
memory back to a reasonable size, as I had the -Xmx flag set to a GB  
earlier for some playing.

So the options are
a. ) Fix it. Being a serious edge case, I'm thinking this isn't really  
worth the risk. More bluntly, it would take me far more time than I want  
to spend on it to become comfortable with any fix I might produce.

b. ) Ignore the case (let the system die of OOM *eventually*, recognizing  
your CPU will be pegged for a while before this happens)

c. ) Throw an exception if the stack depth gets to some crazy level (a  
million deep, for instance), perhaps adding a System variable to set the  
allowable depth.

d. ) Other?


For reference, using the standard JVM (no flags, 64MB heap space, IIRC), I  
can check the DNA string appended to itself ~425 times before reaching a  
depth of a million stack elements. DNA_STRING is 2273 characters long.

I run out of memory (heap space) around 650*DNA_STRING, with a depth of  
1,477,450. (the string being 1,477,450 characters long... big shock).

DNA String runs out of memory very quickly, whereas the above regex is  
pretty slow about it because it pushes and pops a huge number of stack  
items, *eventually* achieving an overflow. The DNA string pushes items  
until it gets an answer, then backs out in the same order.

Also, and oddly, the 'CAPTURE' option doesn't get hit a single time in the  
7000 or so regex tests provided by the W3C group. Testing it with the full  
suite, which takes a painfully long time.

I'm moving into optimization mode now that the regex tests check out; once  
I hear back as to how the group wants the new bug handled, I'll implement  
it and post the code for review.

Oh, did you want the Test Suite code I had to implement? There's a lot of  
generics code in it, and it's incredibly hackish, but it's free to whoever  
wants it. It is by no means robust nor complete. IntelliJ has an 'Export  
to Eclipse' setting for modules, if that interests you.

Cheers,
-- 
Geoff M. Granum
Portland, Oregon

On Mon, 25 Jun 2007 05:43:16 -0700, Michael Glavassevich  
<mr...@ca.ibm.com> wrote:

> Hi Geoff,
>
> The W3C test suite contains many regex tests, particularly this large
> bucket [2] of tests contributed last year. That should give you a pretty
> good selection though beware that some of the tests are invalid. The  
> known
> problems are documented in the W3C's Bugzilla here [3].
>
> As for the code, one thing that may not be obvious is that it needs to be
> thread-safe. This is because the RegularExpression objects are cached in
> the schema grammar which could be shared with several parsers and
> validators. To avoid having many large synchronized blocks, the matching
> code keeps its state local to the call stack. Hoping that's the approach
> you've been taking.
>
> Thanks.
>
> [1] http://www.w3.org/XML/2004/xml-schema-test-suite/index.html#releases
> [2]
> http://dev.w3.org/cvsweb/XML/xml-schema-test-suite/2004-01-14/xmlschema2006-11-06/msMeta/Regex_w3c.xml
> [3]
> http://www.w3.org/Bugs/Public/buglist.cgi?query_format=specific&order=relevance+desc&bug_status=__open__&product=XML+Schema+Test+Suite&content=
>
> Thanks.
>
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: mrglavas@ca.ibm.com
> E-mail: mrglavas@apache.org
>
> "Geoff M. Granum" <ge...@lookingglasssoftware.com> wrote on
> 06/25/2007 04:15:27 AM:
>
>> (If you don't care about the particulars, but have some Regex's you can
>> contribute, jump to the code bit. Thanks)
>>
>> I have two implementations to test; one is a (somewhat) naive linked
> list
>> stack manager, the other is (as yet) still recursive.
>>
>> The former works, but I put it together as a proof of concept and don't
>> trust it much. Fifty-two return points in one method is a tad much.
>> Implemented as a raw java.util.Stack is ten times as slow as the
> original,
>> and creating a private static LocalStack class as a LinkedList is twice
> as
>> slow.
>>
>> Though, 10K runs of the first thousand chars of the two example regex
>> patterns take ~1.2 and 2.6 seconds, respectively. So .12ms and .26ms per
>
>> run. I'm rather set against ANY performance decrement, or I'd have just
>> verified that code and moved on.
>>
>> The latter implementation is a refactor of the method to a single point
> of
>> exit. THAT goal is working, now I have to make sure that I can add
> values
>> to an internal stack manager without blowing away any state -- some of
> the
>> CASE statements are a mite obtuse, and I don't like using breaks much.
>> Breaks also seem to affect the ability of the optimizer to do its job,
> as
>> the last CASE I modified (op.CLOSURE) gave a 10% performance boost
> without
>> it. Although I'm suspicious, as it's late and now the stack overflows
>> somewhat (ok, a lot) earlier than before. I did add a number of
> variables,
>> so it's possible I made no mistake in the logic (I'd better not have!).
>>
>> --- The request part ---
>>
>> Regardless of the final form, I need to populate a test library:
>>
>> I have a few regular expressions lying around, and I figure I'll parse
> in
>> a few of my XML files and modify the RegularExpression class to dump
>> anything it sees to a file... I still doubt I'd have more than 20, and
>> none of them shockingly complex.
>>
>> So if you could send me your favorite regular expressions, along with a
>> couple of stings to match them against (some pass, some fail, but
> indicate
>> which), it would be a big help.
>>
>> Even better, if you could format them like this sample:
>>
>> testCases.add(new TestCase(
>>    "Overall description",
>>    "Your Regex Pattern",
>>    new SubCase("A description", shouldPass, "matchString" ),
>>    new SubCase("A description2", shouldPass, "matchString2" ),
>>    ... more SubCases ...
>> ));
>>
>> I would be able to paste them straight into the unit test and run them.
>> The SubCase argument uses varArgs, so add as many as you want/will. Feel
>
>> free to add your 'contributed by:' to the overall description area for
>> credit... Though I'd remind you not to include a parsable (or any, lest
>> random-someone ask you for help later) e-mail address on this list, as
> it
>> is public and archived.
>>
>> My own direct e-mail address is (my first name @ my last name).biz. And
> if
>> someone has written a parser for THAT, they can have it.
>>
>> The more complex your tests the better, for the beat down. Tailored
>> regex's would be grand for focused testing (e.g. the simplest lookahead,
>
>> lookbehind, singleline, multiline, etc). But I figure that's asking for
>> real work.
>>
>> Also, or instead, if you have a 'regular expression rich' schema and
>> conforming xml file that you can send (think 'might become public'), I
>> should be able to parse those out without much trouble.
>>
>> And yes (obviously), my test library uses 1.5 features... I'll convert
> it
>> if the changes are approved for commit. Keeps me sane.
>>
>> Of course the changes to RegularExpression are using JDK 1.3 as a
> target,
>> as that is the lowest I've available. My memory of the differences
> between
>> 1.2 and 1.3 are fuzzy, but I don't think anything I'm using has changed
>> since 1.0. My only real concern is that my JVM has a better optimizer
> and
>> could be hiding poor performance that I induce.
>>
>>
>> Thanks much,
>> --
>> Geoff M. Granum
>> 760-534-1636
>> Portland, Oregon
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: j-dev-unsubscribe@xerces.apache.org
>> For additional commands, e-mail: j-dev-help@xerces.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: j-dev-unsubscribe@xerces.apache.org
> For additional commands, e-mail: j-dev-help@xerces.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: j-dev-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-dev-help@xerces.apache.org


Re: More on (Re: (XERCESJ-589) Bug with pattern restriction on long strings)

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Hi Geoff,

"Geoff M. Granum" <ge...@lookingglasssoftware.com> wrote on 
07/05/2007 10:17:42 PM:

> Hello Michael,
> 
> Everything works so far, with the exception of UNION, which technically 
> works fine but exposes an infinite loop bug which was being broken by a 
> stack overflow before.
> 
> The regex: (((((boy)|(girl))[0-1][x-z]{2})?)|(man|woman)[0-1]?[y|n])*
> With Target: boy0xxwoman1ygirl1xyman
> 
> Loops forever (the ending is invalid, needs [0-1]?[y|n] ). Well, forever 
 
> being 'until you run out of memory'. I didn't notice this until I set my 
 
> memory back to a reasonable size, as I had the -Xmx flag set to a GB 
> earlier for some playing.
> 
> So the options are
> a. ) Fix it. Being a serious edge case, I'm thinking this isn't really 
> worth the risk. More bluntly, it would take me far more time than I want 
 
> to spend on it to become comfortable with any fix I might produce.
> 
> b. ) Ignore the case (let the system die of OOM *eventually*, 
recognizing 
> your CPU will be pegged for a while before this happens)
> 
> c. ) Throw an exception if the stack depth gets to some crazy level (a 
> million deep, for instance), perhaps adding a System variable to set the 
 
> allowable depth.
> 
> d. ) Other?

I would go with option d: open a new JIRA issue [1] to make folks aware 
that this is a bug. Perhaps someone will fix it one day.

> For reference, using the standard JVM (no flags, 64MB heap space, IIRC), 
I 
> can check the DNA string appended to itself ~425 times before reaching a 
 
> depth of a million stack elements. DNA_STRING is 2273 characters long.
> 
> I run out of memory (heap space) around 650*DNA_STRING, with a depth of 
> 1,477,450. (the string being 1,477,450 characters long... big shock).

Cool. That's leaps and bounds better than the current code. I think folks 
were getting the stack overflow with around 2000 to 3000 characters 
(without increasing the stack size from the default).

> DNA String runs out of memory very quickly, whereas the above regex is 
> pretty slow about it because it pushes and pops a huge number of stack 
> items, *eventually* achieving an overflow. The DNA string pushes items 
> until it gets an answer, then backs out in the same order.
> 
> Also, and oddly, the 'CAPTURE' option doesn't get hit a single time in 
the 
> 7000 or so regex tests provided by the W3C group. Testing it with the 
full 
> suite, which takes a painfully long time.
> 
> I'm moving into optimization mode now that the regex tests check out; 
once 
> I hear back as to how the group wants the new bug handled, I'll 
implement 
> it and post the code for review.
> 
> Oh, did you want the Test Suite code I had to implement? There's a lot 
of 
> generics code in it, and it's incredibly hackish, but it's free to 
whoever 
> wants it. It is by no means robust nor complete. IntelliJ has an 'Export 
 
> to Eclipse' setting for modules, if that interests you.

If your test suite can be back-ported to Java 1.3 perhaps it could be 
included with the other sanity/unit tests which run off the build.xml 
'test' target.

> Cheers,
> -- 
> Geoff M. Granum
> Portland, Oregon

Thanks.

[1] http://issues.apache.org/jira/browse/XERCESJ

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: j-dev-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-dev-help@xerces.apache.org