Posted to oro-dev@jakarta.apache.org by bu...@apache.org on 2001/09/19 23:26:28 UTC

DO NOT REPLY [Bug 3730] New: - Perl5Matcher sometimes confuses the begin/end offsets on similar sub patterns in a regular expression

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=3730>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=3730

Perl5Matcher sometimes confuses the begin/end offsets on similar sub patterns in a regular expression

           Summary: Perl5Matcher sometimes confuses the begin/end offsets on
                    similar sub patterns in a regular expression
           Product: ORO
           Version: 2.0
          Platform: Other
        OS/Version: Other
            Status: NEW
          Severity: Normal
          Priority: Other
         Component: Main
        AssignedTo: oro-dev@jakarta.apache.org
        ReportedBy: jamesv@screamingmedia.com


Here is the test program:


import com.oroinc.text.regex.*;
import java.io.*;

public class bug_report
{
    public static void main(String[] args) throws Exception
    {
        String regex  = "\010[(]GAME +GID:([^;]+); +GDATE:([^;]*); +GSTART:([^;]*);"
                       +" +GSITE:([^;]*); +GNEUTRAL:([^;]*); +GSTAT:([^;]*);"
                       +" +GPERIOD:([^;]*);[^\r\n]*[\r\n]+"
                       +"("
                       +"(\010[(]TEAM +TNAME:([^;]*);( +[^:]+:[^;]*;){3}"
                       +" +THOME: *([Yy][Ee][Ss]); +TSCORE:([^;]*); +TSTAT:([^;]*)[^\r\n]*[\r\n]+)"
                       +"|"
                       +"(\010[(]TEAM +TNAME:([^;]*);( +[^:]+:[^;]*;){3}"
                       +" +THOME: *([Nn][Oo]); +TSCORE:([^;]*); +TSTAT:([^;]*)[^\r\n]*[\r\n]+)"
                       +"){2}";


        // Each record starts with the \010 (backspace) character that the
        // pattern requires.
        String input  = "\010(GAME GID:13805; GDATE:11/01/2000; GSTART:19:30; "
                       +"GSITE:Charlotte Coliseum; GNEUTRAL:NO; GSTAT:Final; GPERIOD:4; \n"
                       +"\010(TEAM TNAME:Hornets; TLOCALE:Charlotte; "
                       +"TCONF:Eastern; TDIV:Central; THOME:YES; TSCORE:77; TSTAT:LOST; TID:9;)\n"
                       +"\010(TEAM TNAME:Wizards; TLOCALE:Washington; "
                       +"TCONF:Eastern; TDIV:Atlantic; THOME:NO; TSCORE:95; TSTAT:WON; TID:7;))\n";

        String input2 = "\010(GAME GID:13789; GDATE:10/31/2000; GSTART:19:30; "
                       +"GSITE:TD Waterhouse Centre; GNEUTRAL:NO; GSTAT:Final; GPERIOD:4; \n"
                       +"\010(TEAM TNAME:Magic; TLOCALE:Orlando; TCONF:Eastern; "
                       +"TDIV:Atlantic; THOME:YES; TSCORE:97; TSTAT:WON; TID:5;)\n"
                       +"\010(TEAM TNAME:Wizards; TLOCALE:Washington; "
                       +"TCONF:Eastern; TDIV:Atlantic; THOME:NO; TSCORE:86; TSTAT:LOST; TID:7;))\n";
        	
        Perl5Compiler p5compiler = new Perl5Compiler();
        Perl5Pattern p5pattern = null;
        Perl5Matcher p5matcher = new Perl5Matcher();
        PatternMatcherInput p5input = new PatternMatcherInput(input2);

        try {
            p5pattern = (Perl5Pattern) p5compiler.compile(regex,
                        Perl5Compiler.SINGLELINE_MASK |
                        Perl5Compiler.READ_ONLY_MASK  );
        } catch(MalformedPatternException e) {
            System.out.println("Error:  Bad Perl5 pattern.");
            System.out.println(e.getMessage());
            return; // without a compiled pattern there is nothing to match
        }

        boolean result = p5matcher.matchesPrefix(p5input, p5pattern);
		
        if( result )
        {
            MatchResult mr = p5matcher.getMatch();
            int groups     = mr.groups();
            int start      = -1;
            int end        = -1;
            String matchStr = null;
            for( int x = 0; x < groups; x++ )
            {
                start = mr.beginOffset(x);
                end   = mr.endOffset(x);
                //matchStr = mr.group(x);

                //System.out.print("Pos: "+x+"\tStart: "+start+"\tEnd: "+end+"\tMatch: "+matchStr);
                System.out.print("Pos: "+x+"\tStart: "+start+"\tEnd: "+end);

                // A group can never legitimately begin after it ends.
                if( start > end )
                    System.out.println( " -- ERROR" );
                else
                    System.out.println();
            }
        }
        else
        {
            System.out.println("No Match");
        }
        System.out.println("Program terminating");
    }
    
}    


And here is the output it produces:

Pos: 0    Start: 0    End: 338
Pos: 1    Start: 11    End: 16
Pos: 2    Start: 24    End: 34
Pos: 3    Start: 43    End: 48
Pos: 4    Start: 56    End: 76
Pos: 5    Start: 87    End: 89
Pos: 6    Start: 97    End: 102
Pos: 7    Start: 112    End: 113
Pos: 8    Start: 224    End: 338
Pos: 9    Start: 224    End: 224
Pos: 10    Start: 237    End: 237
Pos: 11    Start: 280    End: 295
Pos: 12    Start: 302    End: 192 -- ERROR
Pos: 13    Start: 201    End: 203
Pos: 14    Start: 211    End: 214
Pos: 15    Start: 224    End: 338
Pos: 16    Start: 237    End: 244
Pos: 17    Start: 280    End: 295
Pos: 18    Start: 302    End: 304
Pos: 19    Start: 313    End: 315
Pos: 20    Start: 323    End: 327
Program terminating



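Notice that Pos 12 and Pos 18 share the same Start value (302), and that Pos 12
reports an End (192) that comes before its Start.  In the regex these two groups
sit at the same position in two nearly identical sub patterns.  Offset 192 is
where "YES" ends in the first TEAM line and 302 is where "NO" begins in the
second, so group 12 appears to have picked up its Start from the second TEAM
line while keeping its End from the first.  Granted, there are many similar sub
patterns; in fact, lines 2 and 3 of the pattern are almost exactly the same
except for [Yy][Ee][Ss] and [Nn][Oo]...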
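
As a possible cross-check, the same pattern can be run through a different
regex engine to see which offsets it reports for each group.  The sketch below
swaps in java.util.regex for Perl5Matcher, so it does not exercise ORO at all.
It assumes that Pattern.DOTALL and Matcher.lookingAt() are reasonable stand-ins
for SINGLELINE_MASK and matchesPrefix(), and that every record in input2 starts
with the \010 character the pattern requires.  The class name OffsetCheck and
the helper groupSpans are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OffsetCheck
{
    // Same pattern as in the test program, split across shorter literals.
    static final String REGEX =
          "\010[(]GAME +GID:([^;]+); +GDATE:([^;]*); +GSTART:([^;]*);"
        + " +GSITE:([^;]*); +GNEUTRAL:([^;]*); +GSTAT:([^;]*);"
        + " +GPERIOD:([^;]*);[^\r\n]*[\r\n]+"
        + "("
        + "(\010[(]TEAM +TNAME:([^;]*);( +[^:]+:[^;]*;){3}"
        + " +THOME: *([Yy][Ee][Ss]); +TSCORE:([^;]*); +TSTAT:([^;]*)[^\r\n]*[\r\n]+)"
        + "|"
        + "(\010[(]TEAM +TNAME:([^;]*);( +[^:]+:[^;]*;){3}"
        + " +THOME: *([Nn][Oo]); +TSCORE:([^;]*); +TSTAT:([^;]*)[^\r\n]*[\r\n]+)"
        + "){2}";

    // input2 from the test program; the leading \010 on each record is an
    // assumption based on what the pattern requires.
    static final String INPUT2 =
          "\010(GAME GID:13789; GDATE:10/31/2000; GSTART:19:30; "
        + "GSITE:TD Waterhouse Centre; GNEUTRAL:NO; GSTAT:Final; GPERIOD:4; \n"
        + "\010(TEAM TNAME:Magic; TLOCALE:Orlando; TCONF:Eastern; "
        + "TDIV:Atlantic; THOME:YES; TSCORE:97; TSTAT:WON; TID:5;)\n"
        + "\010(TEAM TNAME:Wizards; TLOCALE:Washington; "
        + "TCONF:Eastern; TDIV:Atlantic; THOME:NO; TSCORE:86; TSTAT:LOST; TID:7;))\n";

    // Returns {start, end} for every group (including group 0) of a match
    // anchored at the beginning of the input, or null when there is no match.
    // Groups that did not take part in the match come back as {-1, -1}.
    static int[][] groupSpans(String regex, String input)
    {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        if (!m.lookingAt())
            return null;

        int[][] spans = new int[m.groupCount() + 1][];
        for (int x = 0; x <= m.groupCount(); x++)
            spans[x] = new int[] { m.start(x), m.end(x) };
        return spans;
    }

    public static void main(String[] args)
    {
        int[][] spans = groupSpans(REGEX, INPUT2);
        if (spans == null)
        {
            System.out.println("No Match");
            return;
        }
        for (int x = 0; x < spans.length; x++)
        {
            // Same report format as the original test program.
            System.out.print("Pos: "+x+"\tStart: "+spans[x][0]+"\tEnd: "+spans[x][1]);
            System.out.println(spans[x][0] > spans[x][1] ? " -- ERROR" : "");
        }
    }
}
```

Comparing this listing line by line with the Perl5Matcher output should show
whether the disagreement is limited to the groups inside the [Yy][Ee][Ss]
alternative, though the two engines are not guaranteed to agree on groups that
last matched in an earlier repetition.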