You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@daffodil.apache.org by "Brandon Sloane (JIRA)" <ji...@apache.org> on 2019/01/08 21:10:00 UTC

[jira] [Created] (DAFFODIL-2048) Exponential blowup in regular expression matching

Brandon Sloane created DAFFODIL-2048:
----------------------------------------

             Summary: Exponential blowup in regular expression matching
                 Key: DAFFODIL-2048
                 URL: https://issues.apache.org/jira/browse/DAFFODIL-2048
             Project: Daffodil
          Issue Type: Bug
          Components: Libraries
    Affects Versions: 2.3.0
            Reporter: Brandon Sloane


It is possible for simple regular expression to trigger exponential behavior at run time.

For instance, the regular expression "(a|a)*x" is exponential in the number of "a" characters that occurs in the beginning of the string.

If the string does not match, this behavior is directly observable.

If the string does match, the behavior may still be observable. Daffodil performs regular expression matches using a fixed size buffer. If such a match fails because it ran out of buffer, Daffodil will try again using a bigger buffer. This means that even if a string will end up matching, if it requires resizing the buffer, we will observe exponential time when trying to match using the smaller buffer.

 

Below is a testcase demonstrating this:
{quote}@Test def testUSASCII7BitEncoderMalformedError {
 val encoder = BitsCharsetUSASCII7BitPacked.newEncoder
 val bb = ByteBuffer.allocate(3)
 val cb = CharBuffer.wrap("ab" + 128.toChar) // 128 is not encodable in 7 bits
 val coderResult = encoder.encode(cb, bb, true)
 assertTrue(coderResult.isUnmappable())
 assertEquals(coderResult.length, 1)
 }

@Test def testRegexpMatch{
 
 val regex = "(a|a)*x".r
 var baseStr=""
 
// val regex = "(a|a)*y*x".r
// var baseStr="x"
// for(i <- 0 to 64){
// baseStr = "y"+baseStr;
// }
 
 val oldDecoder = finfo.decoder
 finfo.decoder=BitsCharsetUSASCII.newDecoder();
 
 for(i <- 0 to 300){
 var str="";
 for(j <- 0 to i){
 str = str+"a"
 }
 str=str+baseStr;
 val dis = org.apache.daffodil.io.InputSourceDataInputStream(str.getBytes);
 
 val matcher = regex.pattern.matcher(str)
 matcher.reset()
 val start=System.currentTimeMillis();
 val ans = dis.lookingAt(matcher, finfo)
 val stop = System.currentTimeMillis()
 println(i+": "+(stop-start)+"ms")
 }
 
 finfo.decoder=oldDecoder;
 
 }
{quote}
Replace the definition of baseStr and regex with the commented out version to see the behaviour when the match is successful.

 

There are several related issues here:

1) The regular expression above is, in fact, regular. It should be able to run in linear time. The obvious replacement of "a" for "(a|a)" fixes this. However, this same situation can occur in less obvious situations (for instance, I ran into this by adding the (?s) flag to a regex containing (.|\r)).

2) We currently accept regular expressions that are not actually regular, so there will be ones that genuinely take exponential time. In this case, we should have some form of timeout. (Preferably deterministic)

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)