You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@daffodil.apache.org by "Steve Lawrence (Jira)" <ji...@apache.org> on 2021/02/04 20:36:00 UTC

[jira] [Closed] (DAFFODIL-2048) Exponential blowup in regular expression matching

     [ https://issues.apache.org/jira/browse/DAFFODIL-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Lawrence closed DAFFODIL-2048.
------------------------------------

> Exponential blowup in regular expression matching
> -------------------------------------------------
>
>                 Key: DAFFODIL-2048
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-2048
>             Project: Daffodil
>          Issue Type: Bug
>          Components: Libraries
>    Affects Versions: 2.3.0
>            Reporter: Brandon Sloane
>            Priority: Major
>
> It is possible for simple regular expression to trigger exponential behavior at run time.
> For instance, the regular expression "(a|a)*x" is exponential in the number of "a" characters that occurs in the beginning of the string.
> If the string does not match, this behavior is directly observable.
> If the string does match, the behavior may still be observable. Daffodil performs regular expression matches using a fixed size buffer. If such a match fails because it ran out of buffer, Daffodil will try again using a bigger buffer. This means that even if a string will end up matching, if it requires resizing the buffer, we will observe exponential time when trying to match using the smaller buffer.
>  
> Below is a testcase demonstrating this:
> {quote}@Test def testUSASCII7BitEncoderMalformedError {
>  val encoder = BitsCharsetUSASCII7BitPacked.newEncoder
>  val bb = ByteBuffer.allocate(3)
>  val cb = CharBuffer.wrap("ab" + 128.toChar) // 128 is not encodable in 7 bits
>  val coderResult = encoder.encode(cb, bb, true)
>  assertTrue(coderResult.isUnmappable())
>  assertEquals(coderResult.length, 1)
>  }
> @Test def testRegexpMatch{
>  
>  val regex = "(a|a)*x".r
>  var baseStr=""
>  
> // val regex = "(a|a)*y*x".r
> // var baseStr="x"
> // for(i <- 0 to 64){
> // baseStr = "y"+baseStr;
> // }
>  
>  val oldDecoder = finfo.decoder
>  finfo.decoder=BitsCharsetUSASCII.newDecoder();
>  
>  for(i <- 0 to 300){
>  var str="";
>  for(j <- 0 to i){
>  str = str+"a"
>  }
>  str=str+baseStr;
>  val dis = org.apache.daffodil.io.InputSourceDataInputStream(str.getBytes);
>  
>  val matcher = regex.pattern.matcher(str)
>  matcher.reset()
>  val start=System.currentTimeMillis();
>  val ans = dis.lookingAt(matcher, finfo)
>  val stop = System.currentTimeMillis()
>  println(i+": "+(stop-start)+"ms")
>  }
>  
>  finfo.decoder=oldDecoder;
>  
>  }
> {quote}
> Replace the definition of baseStr and regex with the commented out version to see the behaviour when the match is successful.
>  
> There are several related issues here:
> 1) The regular expression above is, in fact, regular. It should be able to run in linear time. The obvious replacement of "a" for "(a|a)" fixes this. However, this same situation can occur in less obvious situations (for instance, I ran into this by adding the (?s) flag to a regex containing (.|\r)).
> 2) We currently accept regular expressions that are not actually regular, so there will be ones that genuinely take exponential time. In this case, we should have some form of timeout. (Preferably deterministic)
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)