Posted to oro-user@jakarta.apache.org by ravi <ra...@htinc.com> on 2003/05/08 19:16:20 UTC

Performance

I don't know whether this mail belongs here; it should probably go to
the developers list, but I'm sending it anyway. I apologize for that.

I'm working on a text mining application that extracts key words and
phrases from documents. I was using the jregex package in my
application. It worked fine, but it took a lot of time. Then I tried
ORO, but ORO seems to take even more time than jregex. I ran a profiler
on both packages for 50 documents (each document has around 1000 words
on average). ORO's regex package made around 119 million invocations.
That's a lot. The execution time was 35% higher with ORO than with
jregex. I'm attaching the profiler output for both packages. I was
wondering whether you could make any performance optimizations.

Thanks,
Ravi.


RE: Performance

Posted by ravi <ra...@htinc.com>.
I'm attaching the code that I used. I don't know why it's taking so
much time. There must be something wrong with my regular expressions or
with my code. Can somebody please look at it and let me know what's
wrong? You can try any piece of text as input. I would really
appreciate it. Thanks in advance.

import java.util.StringTokenizer;
import org.apache.oro.text.regex.MalformedPatternException;
import org.apache.oro.text.regex.Pattern;
import org.apache.oro.text.regex.PatternMatcher;
import org.apache.oro.text.regex.Perl5Compiler;
import org.apache.oro.text.regex.Perl5Matcher;
import org.apache.oro.text.regex.Perl5Substitution;
import org.apache.oro.text.regex.Util;

public class RegexTest  // placeholder class name
{
    private static Perl5Compiler compiler;
    private static PatternMatcher matcher;
    private static Perl5Substitution substitution;

    public static void main(String args[]) throws MalformedPatternException
    {
        compiler = new Perl5Compiler();
        matcher = new Perl5Matcher();
        substitution = new Perl5Substitution();
        String text = args[0];

        // Replace tabs, then put spaces around punctuation and clitics.
        Pattern pattern = getPattern("\t");
        text = replaceText(pattern," ",text);
        pattern = getPattern("[\\[\\]\\{\\}\\^\\~?!()\";/\\|,<>`]");
        text = replaceText(pattern," $& ",text);
        pattern = getPattern("^('|&)");
        text = replaceText(pattern,"$& ",text);
        pattern = getPattern("([^A-Za-z0-9])('|&|@|%|\\*)");
        text = replaceText(pattern,"$& ",text);
        pattern = getPattern("('|:|-|#|\\*|\\+|\\$|&|@|'S|'D|'M|'LL|'RE|'VE|N'T"
                             + "|'s|'d|'m|'ll|'re|'ve|n't)$");
        text = replaceText(pattern," $&",text);
        pattern = getPattern("('|:|-|#|\\*|\\+|\\$|&|@|'S|'D|'M|'LL|'RE|'VE|N'T"
                             + "|'s|'d|'m|'ll|'re|'ve|n't)([^A-Za-z0-9])");
        text = replaceText(pattern," $&",text);

        // Classify each whitespace-separated token.
        StringTokenizer strTok = new StringTokenizer(text);
        while(strTok.hasMoreTokens())
        {
            String token = strTok.nextToken();
            token = token.trim();
            pattern = getPattern("([A-Za-z0-9][.])$");
            if(contains(pattern,token))
            {
                pattern = getPattern("^([A-Za-z]\\.([A-Za-z]\\.)+|[A-Za-z]\\.|"
                                     + "[A-Z][bcdfghj-np-tvxz]+\\.)$");
                if(contains(pattern,token))
                {
                    // code to process token which does not use any regex stuff
                }
            }
            else
            {
                pattern = getPattern("^([A-Za-z0-9])");
                if(contains(pattern,token))
                {
                    pattern = getPattern("([A-Za-z0-9]+\\.[A-Za-z]+|[0-9]+\\.[A-Za-z])");
                    if(contains(pattern,token))
                    {
                        // code to process token which does not use any regex stuff
                    }
                    else
                    {
                        if(contains(getPattern("^([A-Za-z])"),token))
                        {
                            // code to process token which does not use any regex stuff
                        }
                    }
                }
                else
                {
                    pattern = getPattern("([.!?])$");
                    if(contains(pattern,token))
                    {
                        // code to process token which does not use any regex stuff
                    }
                }
            }
        }
    }

    public static boolean contains(Pattern pattern,String str)
    {
        return matcher.contains(str,pattern);
    }

    public static String replaceText(Pattern pattern,String replacement,String str)
    {
        substitution.setSubstitution(replacement);
        return Util.substitute(matcher,pattern,substitution,str,Util.SUBSTITUTE_ALL);
    }

    // Note: this compiles the pattern again on every call.
    public static Pattern getPattern(String pattern) throws MalformedPatternException
    {
        return compiler.compile(pattern);
    }
}

Thanks,
Ravi.


---------------------------------------------------------------------
To unsubscribe, e-mail: oro-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: oro-user-help@jakarta.apache.org


Re: Performance

Posted by Kwok Peng Tuck <pe...@makmal.net>.
Maybe you could try refining the way you write your regexp?
I've processed a log file of about 8 MB (about 400 KB gzipped, read
without unpacking) and managed to finish the whole job in under a
minute.

Let the list have a look at the regexp in question and some sample
input to play with.
I'm sure someone can come up with a way to improve the regexp if it's needed.
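
As a starting point, here is a minimal sketch of two refinements that
often help with this kind of token classification: compile each regexp
once and reuse it, and replace single-character tests with plain
character checks. It is only a sketch under assumptions. It uses
java.util.regex from the JDK instead of ORO so that it runs without
extra jars, the two patterns are the ones from the code posted in this
thread, and the class name TokenChecks, the method names, and the
returned labels are made up for illustration. With ORO the same idea
would be to call Perl5Compiler.compile once per pattern and keep the
resulting Pattern objects in static fields.

```java
import java.util.StringTokenizer;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and the labels are illustrative only.
class TokenChecks {
    // Compiled once when the class is loaded, then reused for every token.
    private static final Pattern ABBREVIATION = Pattern.compile(
        "^([A-Za-z]\\.([A-Za-z]\\.)+|[A-Za-z]\\.|[A-Z][bcdfghj-np-tvxz]+\\.)$");
    private static final Pattern DOTTED = Pattern.compile(
        "([A-Za-z0-9]+\\.[A-Za-z]+|[0-9]+\\.[A-Za-z])");

    static boolean isAsciiLetter(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    static boolean isAsciiAlnum(char c) {
        return isAsciiLetter(c) || (c >= '0' && c <= '9');
    }

    // Returns a label naming the branch that a token would take.
    static String classify(String token) {
        int n = token.length();
        if (n == 0) {
            return "none";
        }
        char first = token.charAt(0);
        char last = token.charAt(n - 1);

        // Same test as "([A-Za-z0-9][.])$" for tokens without a trailing newline.
        if (n >= 2 && last == '.' && isAsciiAlnum(token.charAt(n - 2))) {
            return ABBREVIATION.matcher(token).find() ? "abbreviation" : "none";
        }
        // Same test as "^([A-Za-z0-9])".
        if (isAsciiAlnum(first)) {
            if (DOTTED.matcher(token).find()) {
                return "dotted";
            }
            // Same test as "^([A-Za-z])".
            return isAsciiLetter(first) ? "word" : "none";
        }
        // Same test as "([.!?])$" for tokens without a trailing newline.
        return (last == '.' || last == '!' || last == '?') ? "punctuation" : "none";
    }

    public static void main(String[] args) {
        String text = args.length > 0 ? args[0] : "The U.S. site example.com is up !";
        StringTokenizer tokens = new StringTokenizer(text);
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            System.out.println(token + " -> " + classify(token));
        }
    }
}
```

Whether this makes a measurable difference depends on the input, so it
is worth running the profiler again before and after such a change.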


RE: Performance

Posted by ravi <ra...@htinc.com>.
Hello,
 Somehow my attachments didn't go through, so I'm including just the
profiler results for the ORO package. I apologize for the format.

Name                               Inv.  Inv. %    Time  Time %  Time/Inv.  Total time

org.apache.oro.text.regex     118426552   95.82  261140   55.68   0.002205     2032223

  OpCode                       98762481   79.91  138054   29.43   0.001398      177015
    _getNext                   32752815   26.50   70327   14.99   0.002147      107332
    _getNextOffset             32752761   26.50   35063    7.48   0.001071       36034
    _getNextOperator           15641479   12.66   15632    3.33   0.000999       16096
    _getOperand                12828688   10.38   12497    2.66   0.000974       12877
    _getArg1                    3560870    2.88    3468    0.74   0.000974        3573
    _getArg2                     612934    0.50     563    0.12   0.000919         581
    _getPrevOperator             612934    0.50     504    0.11   0.000822         522

  Perl5Matcher                 18888412   15.28  121894   25.99   0.006453     1397696
    _match                     14612106   11.82  109887   23.43   0.007520      570284
    _tryExpression              1381490    1.12    3551    0.76   0.002570      258582
    _interpret                   147604    0.12    3251    0.69   0.022025      263162
    _matchUnicodeClass          1731118    1.40    1857    0.40   0.001073        1908
    _repeat                      365360    0.30    1654    0.35   0.004527        2699
    contains                     141828    0.11     556    0.12   0.003920       36498
    _initInterpreterGlobals      147604    0.12     376    0.08   0.002547         601
    contains                     141828    0.11     356    0.08   0.002510       35938
    _pushState                    93608    0.08     138    0.03   0.001474         140
    contains                       5776    0.00      90    0.02   0.015582      227684
    _popState                     81988    0.07      88    0.02   0.001073          90
    _setLastMatchResult            4402    0.00      60    0.01   0.013630          80
    _compare                      33700    0.03      30    0.01   0.000890          30

  Perl5Repetition                760538    0.62     791    0.17   0.001040         813
    <init>                       760538    0.62     791    0.17   0.001040         813

  Util                             2748    0.00     171    0.04   0.062227      456109
    substitute                     1374    0.00      90    0.02   0.065502      228004
    substitute                     1374    0.00      81    0.02   0.058952      228105

  Perl5Compiler                     467    0.00     110    0.02   0.235546         400
    compile                          19    0.00      50    0.01   2.631579         140
    _parseExpression                108    0.00      30    0.01   0.277778         140
    _parseUnicodeClass               34    0.00      20    0.00   0.588235          20
    _parseBranch                    306    0.00      10    0.00   0.032680         100

  Perl5Substitution                4657    0.00      80    0.02   0.017178         150
    _calcSub                       4078    0.00      70    0.01   0.017165         120 30
    appendSubstitution              579    0.00      10    0.00   0.017271          30

  CharStringPointer                5024    0.00      20    0.00   0.003981          20
    getValue                       3454    0.00      10    0.00   0.002895          10
    _isAtEnd                       1570    0.00      10    0.00   0.006369          10

  Perl5MatchResult                 2225    0.00      20    0.00   0.008989          20
    groups                         2225    0.00      20    0.00   0.008989          20

Thanks,
Ravi.