Posted to oro-user@jakarta.apache.org by ravi <ra...@htinc.com> on 2003/05/08 19:16:20 UTC
Performance
I'm not sure whether this mail belongs here; it should probably go to
the developers list. I apologize if this is the wrong place.
I'm working on a text mining application that extracts key words and
phrases from documents. I was using the jregex package in my
application. It worked correctly, but it was slow. Then I tried ORO, but
ORO appears to be even slower than jregex. I ran a profiler on both
packages over 50 documents (each document has around 1000 words on
average). ORO's regex package made around 119 million invocations, and
the execution time was 35% higher with ORO than with jregex. I'm
attaching the profiler output for both packages. I was wondering whether
any performance optimizations are possible.
Thanks,
Ravi.
RE: Performance
Posted by ravi <ra...@htinc.com>.
I'm attaching the code that I used. I don't know why it takes so long.
There must be something wrong with my regular expressions or with my
code. Could somebody please look at it and let me know what's wrong?
You can try any piece of text as input. I would really appreciate it.
Thanks in advance.
import java.util.StringTokenizer;
import org.apache.oro.text.regex.MalformedPatternException;
import org.apache.oro.text.regex.Pattern;
import org.apache.oro.text.regex.PatternMatcher;
import org.apache.oro.text.regex.Perl5Compiler;
import org.apache.oro.text.regex.Perl5Matcher;
import org.apache.oro.text.regex.Perl5Substitution;
import org.apache.oro.text.regex.Util;

// The enclosing class was omitted from the post; this name is illustrative.
public class TokenizerProfile
{
    private static Perl5Compiler compiler;
    private static PatternMatcher matcher;
    private static Perl5Substitution substitution;

    public static void main(String args[]) throws MalformedPatternException
    {
        compiler = new Perl5Compiler();
        matcher = new Perl5Matcher();
        substitution = new Perl5Substitution();
        String text = args[0];

        // Normalize tabs, then put spaces around punctuation.
        Pattern pattern = getPattern("\t");
        text = replaceText(pattern," ",text);
        pattern = getPattern("[\\[\\]\\{\\}\\^\\~?!()\";/\\|,<>`]");
        text = replaceText(pattern," $& ",text);
        pattern = getPattern("^('|&)");
        text = replaceText(pattern,"$& ",text);
        pattern = getPattern("([^A-Za-z0-9])('|&|@|%|\\*)");
        text = replaceText(pattern,"$& ",text);
        pattern = getPattern("('|:|-|#|\\*|\\+|\\$|&|@|'S|'D|'M|'LL|'RE|'VE|N'T"
            + "|'s|'d|'m|'ll|'re|'ve|n't)$");
        text = replaceText(pattern," $&",text);
        pattern = getPattern("('|:|-|#|\\*|\\+|\\$|&|@|'S|'D|'M|'LL|'RE|'VE|N'T"
            + "|'s|'d|'m|'ll|'re|'ve|n't)([^A-Za-z0-9])");
        text = replaceText(pattern," $&",text);

        // Classify each whitespace-separated token.
        StringTokenizer strTok = new StringTokenizer(text);
        while(strTok.hasMoreTokens())
        {
            String token = strTok.nextToken();
            token = token.trim();
            pattern = getPattern("([A-Za-z0-9][.])$");
            if(contains(pattern,token))
            {
                pattern = getPattern("^([A-Za-z]\\.([A-Za-z]\\.)+|[A-Za-z]\\."
                    + "|[A-Z][bcdfghj-np-tvxz]+\\.)$");
                if(contains(pattern,token))
                {
                    // code to process token, which does not use any regex
                }
            }
            else
            {
                pattern = getPattern("^([A-Za-z0-9])");
                if(contains(pattern,token))
                {
                    pattern =
                        getPattern("([A-Za-z0-9]+\\.[A-Za-z]+|[0-9]+\\.[A-Za-z])");
                    if(contains(pattern,token))
                    {
                        // code to process token, which does not use any regex
                    }
                    else
                    {
                        if(contains(getPattern("^([A-Za-z])"),token))
                        {
                            // code to process token, which does not use any regex
                        }
                    }
                }
                else
                {
                    pattern = getPattern("([.!?])$");
                    if(contains(pattern,token))
                    {
                        // code to process token, which does not use any regex
                    }
                }
            }
        }
    }

    public static boolean contains(Pattern pattern,String str)
    {
        return matcher.contains(str,pattern);
    }

    public static String replaceText(Pattern pattern,String replacement,
                                     String str)
    {
        substitution.setSubstitution(replacement);
        return Util.substitute(matcher,pattern,substitution,str,
                               Util.SUBSTITUTE_ALL);
    }

    // compile() throws the checked MalformedPatternException on a bad pattern.
    public static Pattern getPattern(String pattern)
        throws MalformedPatternException
    {
        return compiler.compile(pattern);
    }
}
Thanks,
Ravi.
---------------------------------------------------------------------
To unsubscribe, e-mail: oro-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: oro-user-help@jakarta.apache.org
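A note on the code as posted: getPattern(...) is called inside the token
loop, so the same few expressions are recompiled for every token. A
common remedy is to compile each expression once and reuse the compiled
object. The sketch below illustrates that idea only. It uses
java.util.regex rather than ORO so that it runs with nothing but the
JDK, and the class name, constant names, and classify helper are all
hypothetical. The profile posted in this thread records only 19 compile
calls, so compilation may not be the dominant cost in the profiled run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

public class PrecompiledPatterns {
    // Each expression is compiled once, when the class is loaded,
    // instead of once per token. Patterns are taken from the posted code.
    private static final Pattern ENDS_WITH_ALNUM_DOT =
            Pattern.compile("[A-Za-z0-9][.]$");
    private static final Pattern ABBREVIATION = Pattern.compile(
            "^([A-Za-z]\\.([A-Za-z]\\.)+|[A-Za-z]\\.|[A-Z][bcdfghj-np-tvxz]+\\.)$");
    private static final Pattern STARTS_WITH_ALNUM =
            Pattern.compile("^[A-Za-z0-9]");
    private static final Pattern SENTENCE_END =
            Pattern.compile("[.!?]$");

    // Hypothetical classifier that loosely mirrors the outer branches of
    // the posted loop; the labels are illustrative only.
    static String classify(String token) {
        if (ENDS_WITH_ALNUM_DOT.matcher(token).find()) {
            return ABBREVIATION.matcher(token).find()
                    ? "abbreviation" : "word-with-period";
        }
        if (STARTS_WITH_ALNUM.matcher(token).find()) {
            return "word";
        }
        if (SENTENCE_END.matcher(token).find()) {
            return "punctuation";
        }
        return "other";
    }

    // Splits on whitespace, as the posted code does, and classifies each token.
    static List<String> classifyAll(String text) {
        List<String> labels = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(text);
        while (tokens.hasMoreTokens()) {
            labels.add(classify(tokens.nextToken()));
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(classifyAll("Dr. Smith arrived at 5 p.m. today ."));
    }
}
```

With ORO the equivalent change would be to call compiler.compile(...)
for each expression before the loop and keep the resulting Pattern
objects in fields or local variables.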
Re: Performance
Posted by Kwok Peng Tuck <pe...@makmal.net>.
Maybe you could try refining the way you write your regexp?
I've processed a log file of about 8 MB (gzipped ~400 KB, without
unpacking) and managed to finish the whole job in under a minute.
Let the list have a look at the regexp in question and some sample
input to play with.
I'm sure someone can come up with a way to improve the regexp if it's
needed.
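As one possible kind of refinement: several of the posted expressions
alternate between single characters, for example ('|&|@|%|\\*). A
character class such as ['&@%*] matches the same set of characters, and
a backtracking engine can usually test it in one step rather than trying
each branch in turn. The sketch below only checks that the two forms
produce the same result. It uses java.util.regex rather than ORO so that
it runs on a plain JDK, the class and method names are hypothetical, and
whether the rewrite actually speeds up ORO has not been measured here.

```java
import java.util.regex.Pattern;

public class CharClassRefinement {
    // Form used in the posted code: alternation of single characters.
    static final Pattern ALTERNATION =
            Pattern.compile("([^A-Za-z0-9])('|&|@|%|\\*)");
    // Candidate rewrite: the same characters as one character class.
    static final Pattern CHAR_CLASS =
            Pattern.compile("([^A-Za-z0-9])(['&@%*])");

    // Appends a space after every match, like the "$& " substitution
    // in the posted code ("$0" is java.util.regex's whole-match reference).
    static String spaceOut(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll("$0 ");
    }

    public static void main(String[] args) {
        String sample = "rock 'n' roll, 50 %off, x *y, (&c)";
        String viaAlternation = spaceOut(ALTERNATION, sample);
        String viaCharClass = spaceOut(CHAR_CLASS, sample);
        System.out.println(viaAlternation);
        System.out.println("same result: " + viaAlternation.equals(viaCharClass));
    }
}
```

Multi-character alternatives such as 'LL or N'T cannot be folded into a
character class this way, so those branches would have to stay as
alternation.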
RE: Performance
Posted by ravi <ra...@htinc.com>.
Hello,
Somehow my attachments didn't go through. I'm including the profiler
results for the ORO package only. I apologize for the format.
Name                              Inv.  Inv. %    Time  Time %  Time/Inv.  Total time
org.apache.oro.text.regex    118426552   95.82  261140   55.68   0.002205     2032223
  OpCode                      98762481   79.91  138054   29.43   0.001398      177015
    _getNext                  32752815   26.50   70327   14.99   0.002147      107332
    _getNextOffset            32752761   26.50   35063    7.48   0.001071       36034
    _getNextOperator          15641479   12.66   15632    3.33   0.000999       16096
    _getOperand               12828688   10.38   12497    2.66   0.000974       12877
    _getArg1                   3560870    2.88    3468    0.74   0.000974        3573
    _getArg2                    612934    0.50     563    0.12   0.000919         581
    _getPrevOperator            612934    0.50     504    0.11   0.000822         522
  Perl5Matcher                18888412   15.28  121894   25.99   0.006453     1397696
    _match                    14612106   11.82  109887   23.43   0.007520      570284
    _tryExpression             1381490    1.12    3551    0.76   0.002570      258582
    _interpret                  147604    0.12    3251    0.69   0.022025      263162
    _matchUnicodeClass         1731118    1.40    1857    0.40   0.001073        1908
    _repeat                     365360    0.30    1654    0.35   0.004527        2699
    contains                    141828    0.11     556    0.12   0.003920       36498
    _initInterpreterGlobals     147604    0.12     376    0.08   0.002547         601
    contains                    141828    0.11     356    0.08   0.002510       35938
    _pushState                   93608    0.08     138    0.03   0.001474         140
    contains                      5776    0.00      90    0.02   0.015582      227684
    _popState                    81988    0.07      88    0.02   0.001073          90
    _setLastMatchResult           4402    0.00      60    0.01   0.013630          80
    _compare                     33700    0.03      30    0.01   0.000890          30
  Perl5Repetition               760538    0.62     791    0.17   0.001040         813
    <init>                      760538    0.62     791    0.17   0.001040         813
  Util                            2748    0.00     171    0.04   0.062227      456109
    substitute                    1374    0.00      90    0.02   0.065502      228004
    substitute                    1374    0.00      81    0.02   0.058952      228105
  Perl5Compiler                    467    0.00     110    0.02   0.235546         400
    compile                         19    0.00      50    0.01   2.631579         140
    _parseExpression               108    0.00      30    0.01   0.277778         140
    _parseUnicodeClass              34    0.00      20    0.00   0.588235          20
    _parseBranch                   306    0.00      10    0.00   0.032680         100
  Perl5Substitution               4657    0.00      80    0.02   0.017178         150
    _calcSub                      4078    0.00      70    0.01   0.017165      120 30
    appendSubstitution             579    0.00      10    0.00   0.017271          30
  CharStringPointer               5024    0.00      20    0.00   0.003981          20
    getValue                      3454    0.00      10    0.00   0.002895          10
    _isAtEnd                      1570    0.00      10    0.00   0.006369          10
  Perl5MatchResult                2225    0.00      20    0.00   0.008989          20
    groups                        2225    0.00      20    0.00   0.008989          20
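To put these totals in perspective, the invocation counts can be related
back to the workload described at the start of the thread (50 documents
of roughly 1000 words each). The following is a back-of-the-envelope
calculation only: the words-per-document figure is the poster's own
approximation, and the class and method names are illustrative.

```java
public class ProfileArithmetic {
    // Invocation counts copied from the profiler output above.
    static final long ORO_INVOCATIONS = 118_426_552L;    // org.apache.oro.text.regex
    static final long OPCODE_INVOCATIONS = 98_762_481L;  // OpCode
    static final long MATCH_INVOCATIONS = 14_612_106L;   // Perl5Matcher._match

    // Workload as described in the first message (word count is approximate).
    static final int DOCUMENTS = 50;
    static final int WORDS_PER_DOCUMENT = 1000;

    // Average invocations per document (integer division).
    static long perDocument(long invocations) {
        return invocations / DOCUMENTS;
    }

    // Average invocations per word (integer division).
    static long perWord(long invocations) {
        return invocations / ((long) DOCUMENTS * WORDS_PER_DOCUMENT);
    }

    // Share of all ORO invocations, in percent.
    static double shareOfOro(long invocations) {
        return 100.0 * invocations / ORO_INVOCATIONS;
    }

    public static void main(String[] args) {
        System.out.println("ORO calls per document: " + perDocument(ORO_INVOCATIONS));
        System.out.println("ORO calls per word:     " + perWord(ORO_INVOCATIONS));
        System.out.println("_match calls per word:  " + perWord(MATCH_INVOCATIONS));
        System.out.printf("OpCode share of ORO:    %.1f%%%n",
                shareOfOro(OPCODE_INVOCATIONS));
    }
}
```

Under those assumptions the run averages about 2.4 million ORO calls per
document, or a little under 2,400 per word, with roughly 83% of the ORO
calls being OpCode accessor methods.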
Thanks,
Ravi.