Posted to oro-dev@jakarta.apache.org by bu...@apache.org on 2005/11/07 14:11:08 UTC

DO NOT REPLY [Bug 37382] New: - stack over flow while using a Regex

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=37382>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
INSERTED IN THE BUG DATABASE.

http://issues.apache.org/bugzilla/show_bug.cgi?id=37382

           Summary: stack over flow while using a Regex
           Product: ORO
           Version: 2.0.7
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main
        AssignedTo: oro-dev@jakarta.apache.org
        ReportedBy: hi_pkr@yahoo.com
                CC: hi_pkr@yahoo.com


Hi,

I am using the ORO regex API, version 2.0.7, to extract tagged data from HTML
source. For example, I want the source of every form found in an HTML page, so
I wrote my regex like this:

Regex formReg = new Regex("(?i)(<form(.|\\s)*?>(.|\\s)*?</form>)");

because the following one didn't work:

Regex formReg = new Regex("(?i)(<form.*?>.*?</form>)");

since . matches any character except a newline.
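
For comparison, here is a minimal sketch of the "let . match newlines" idea. It
uses java.util.regex from the JDK rather than ORO (a swap made only so the
example is self-contained), where the DOTALL flag makes . match line
terminators, so the (.|\s) alternation is not needed. The names FormRegexSketch
and extractForms are hypothetical. ORO's Perl5Compiler also defines a
SINGLELINE_MASK option that should play the same role, but that has not been
tried against the attached file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FormRegexSketch {
    // DOTALL lets '.' match line terminators; CASE_INSENSITIVE mirrors (?i).
    private static final Pattern FORM = Pattern.compile(
            "<form.*?>.*?</form>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns every <form ...> ... </form> span found in the given HTML text.
    static List<String> extractForms(String html) {
        List<String> forms = new ArrayList<>();
        Matcher m = FORM.matcher(html);
        while (m.find()) {
            forms.add(m.group());
        }
        return forms;
    }

    public static void main(String[] args) {
        String html = "<html>\n<FORM name=\"a\"\n method=\"post\">\n<input>\n</form>\n</html>";
        for (String form : extractForms(html)) {
            System.out.println("Matched " + form);
        }
    }
}
```

Like the original pattern, this does not handle nested forms; it simply stops
at the first </form> after each opening tag.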

So my first regex worked well and I was able to get the complete form data,
from <form..... to </form>.

However, when the form was big (around 400 lines and 30K bytes), matching
failed with a StackOverflowError. The output and the error are pasted below:

Matched <form name="param" action="http://www/parametric/ProductParametric" 
method="post">
<input name="sterm" type="hidden">
</form>
matcher.getMatch().endOffset(1) 4480
Matched <form name="cross" action="http://www/crossref/search.jsp" 
method="post">
<input name="partNumber" type="hidden">
</form>
matcher.getMatch().endOffset(1) 127
java.lang.StackOverflowError
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
	at org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)
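
The repeated Perl5Matcher.__match frames suggest that the matcher recurses as
(.|\s)*? consumes the form body, so the stack depth grows with the size of the
form (an inference from the trace, not confirmed against the ORO source). One
way to sidestep that, sketched below under that assumption, is to locate the
tags with a plain case-insensitive scan instead of a regex. FormScanSketch,
indexOfIgnoreCase and extractForms are hypothetical names. Like the original
pattern, the scan does not handle nested forms and would also accept a tag such
as <formula>.

```java
import java.util.ArrayList;
import java.util.List;

public class FormScanSketch {
    private static final String OPEN = "<form";
    private static final String CLOSE = "</form>";

    // Case-insensitive indexOf that works on the original string,
    // so the returned offsets can be used directly with substring().
    static int indexOfIgnoreCase(String s, String needle, int from) {
        int last = s.length() - needle.length();
        for (int i = Math.max(from, 0); i <= last; i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Collects each span from "<form" up to and including the next "</form>".
    static List<String> extractForms(String html) {
        List<String> forms = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = indexOfIgnoreCase(html, OPEN, pos);
            if (start < 0) {
                break;
            }
            int close = indexOfIgnoreCase(html, CLOSE, start);
            if (close < 0) {
                break; // unterminated form: stop rather than guess
            }
            int end = close + CLOSE.length();
            forms.add(html.substring(start, end));
            pos = end;
        }
        return forms;
    }

    public static void main(String[] args) {
        String html = "<FORM name=\"a\">\n<input>\n</form>\n<form></form>";
        for (String form : extractForms(html)) {
            System.out.println("Matched " + form);
        }
    }
}
```

The scan is a simple loop, so its stack use does not depend on how long the
form is.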


I am also pasting the method I wrote for the extraction; it can be called
directly from a main method and run:

----------------------------------------------------------------------------

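import org.apache.oro.text.regex.PatternMatcher;
import org.apache.oro.text.regex.Perl5Compiler;
import org.apache.oro.text.regex.Perl5Matcher;
import org.apache.oro.text.regex.Perl5Pattern;

// IoUtils.readFile (not shown) is assumed to return the file contents as a String.
public static void testRegOro() {
    try {
        String html = IoUtils.readFile("file.txt");
        // String html = "all work and no play makes jack a dull boy";
        Perl5Compiler compiler = new Perl5Compiler();
        Perl5Pattern pattern = (Perl5Pattern) compiler.compile(
                "(<form(.|\\s)*?>(.|\\s)*?</form>)",
                Perl5Compiler.CASE_INSENSITIVE_MASK | Perl5Compiler.READ_ONLY_MASK);
        PatternMatcher matcher = new Perl5Matcher();
        int i = 0;
        // Print at most three matches, searching the remaining text after each one.
        while (matcher.contains(html, pattern) && i++ < 3) {
            System.out.println("Matched " + matcher.getMatch().group(1));
            System.out.println("matcher.getMatch().endOffset(1) "
                    + matcher.getMatch().endOffset(1));
            // Drop everything up to the end of this match before searching again.
            html = html.substring(matcher.getMatch().endOffset(1));
            // System.out.println("html " + html);
        }
    } catch (Throwable e) {
        // Throwable rather than Exception, so the StackOverflowError is printed too.
        e.printStackTrace();
    }
}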

------------------------------------------------------------------------------

As the code shows, I read the input from file.txt; I am attaching that file to
the bug as well.

I would really appreciate it if you could look into this, shed some light on
the cause, and say whether it can be improved.

Thanks in advance!
Regards,
Pushpesh Kr. Rajwanshi
