You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@daffodil.apache.org by "Brandon Sloane (JIRA)" <ji...@apache.org> on 2019/01/08 21:10:00 UTC
[jira] [Created] (DAFFODIL-2048) Exponential blowup in regular
expression matching
Brandon Sloane created DAFFODIL-2048:
----------------------------------------
Summary: Exponential blowup in regular expression matching
Key: DAFFODIL-2048
URL: https://issues.apache.org/jira/browse/DAFFODIL-2048
Project: Daffodil
Issue Type: Bug
Components: Libraries
Affects Versions: 2.3.0
Reporter: Brandon Sloane
It is possible for simple regular expression to trigger exponential behavior at run time.
For instance, the regular expression "(a|a)*x" is exponential in the number of "a" characters that occurs in the beginning of the string.
If the string does not match, this behavior is directly observable.
If the string does match, the behavior may still be observable. Daffodil performs regular expression matches using a fixed size buffer. If such a match fails because it ran out of buffer, Daffodil will try again using a bigger buffer. This means that even if a string will end up matching, if it requires resizing the buffer, we will observe exponential time when trying to match using the smaller buffer.
Below is a testcase demonstrating this:
{quote}@Test def testUSASCII7BitEncoderMalformedError {
val encoder = BitsCharsetUSASCII7BitPacked.newEncoder
val bb = ByteBuffer.allocate(3)
val cb = CharBuffer.wrap("ab" + 128.toChar) // 128 is not encodable in 7 bits
val coderResult = encoder.encode(cb, bb, true)
assertTrue(coderResult.isUnmappable())
assertEquals(coderResult.length, 1)
}
@Test def testRegexpMatch{
val regex = "(a|a)*x".r
var baseStr=""
// val regex = "(a|a)*y*x".r
// var baseStr="x"
// for(i <- 0 to 64){
// baseStr = "y"+baseStr;
// }
val oldDecoder = finfo.decoder
finfo.decoder=BitsCharsetUSASCII.newDecoder();
for(i <- 0 to 300){
var str="";
for(j <- 0 to i){
str = str+"a"
}
str=str+baseStr;
val dis = org.apache.daffodil.io.InputSourceDataInputStream(str.getBytes);
val matcher = regex.pattern.matcher(str)
matcher.reset()
val start=System.currentTimeMillis();
val ans = dis.lookingAt(matcher, finfo)
val stop = System.currentTimeMillis()
println(i+": "+(stop-start)+"ms")
}
finfo.decoder=oldDecoder;
}
{quote}
Replace the definition of baseStr and regex with the commented out version to see the behaviour when the match is successful.
There are several related issues here:
1) The regular expression above is, in fact, regular. It should be able to run in linear time. The obvious replacement of "a" for "(a|a)" fixes this. However, this same situation can occur in less obvious situations (for instance, I ran into this by adding the (?s) flag to a regex containing (.|\r)).
2) We currently accept regular expressions that are not actually regular, so there will be ones that genuinely take exponential time. In this case, we should have some form of timeout. (Preferably deterministic)
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)