You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@harmony.apache.org by Deven You <de...@gmail.com> on 2010/07/06 05:02:46 UTC

A question about android regex implementation

Hi  Jesse and All,
I have written some simple benchmarks for harmony regex and find the
performance of harmony is poor compared to RI. For example, Mathcer.find()
only reach 60% of that of RI. I heard Android use icu4jni re-implement this
module. Since icu4jni use native code I think it may has higher performance
than harmony. I am trying to use icu4jni as the back-end of harmony regex
but find icu4jni has no functions related to regex operations.
I know there are some android guys in our community. So can anyone tell me
some detail info for android's regex, like if it re-implement the regex
logic using native code by android itself rather than icu4jni and really get
higher performance compared to harmony regex? Thanks a lot!

Re: A question about android regex implementation

Posted by Deven You <de...@gmail.com>.
2010/7/8 enh <en...@google.com>

> On Wed, Jul 7, 2010 at 09:55, Jesse Wilson <je...@google.com> wrote:
> > On Mon, Jul 5, 2010 at 8:02 PM, Deven You <de...@gmail.com> wrote:
> > > I have written some simple benchmarks for harmony regex and find the
> > > performance of harmony is poor compared to RI. For example,
> Mathcer.find()
> > > only reach 60% of that of RI.
>
> i'd be interested to see your benchmark code. our regular expression
> benchmarks actually focus on replacing use of Pattern/Matcher with
> simpler fast paths where the full complexity isn't needed. for
> example, most String.split calls don't actually use regular
> expressions. i'm not sure how much of this work made it into froyo,
> and it's certainly still a work in progress.
>
I have written a cpp simple becnmark use icu4c regex corresponding to the
java ones to test icu4c regex performance:

  int32_t            flags=0;
  UParseError        pe;
  UErrorCode         status=U_ZERO_ERROR;
  UnicodeString      re("(\\d{1,3})");
  UnicodeString      data = "aaaa123456789045";
  int i=0;


  RegexPattern *pat = RegexPattern::compile(re, flags, pe, status);

  RegexMatcher *matcher = pat->matcher(data, status);

  //warm up
  for (i=0; i<1000000; i++) {
    while (matcher->find()) {
    }
    matcher->reset();
  }

  timeval time_val;
  long sec1, sec2, us1, us2;
  gettimeofday(&time_val, NULL);
  sec1=time_val.tv_sec;
  us1=time_val.tv_usec;

  for (i=0; i<10000000; i++) {
    while (matcher->find()) {
    }
    matcher->reset();
   }
  gettimeofday(&time_val, NULL);
  sec2=time_val.tv_sec;
  us2=time_val.tv_usec;
  double interval = (sec2 -sec1)*1000.0 + (us2-us1)/1000.0;
  printf("total time of MatcherFind is %f ms \n", interval);

The test result is similar to harmony regex about 9883 to 10977 ms.
However when I use pattern ^([\S]+)\s+([\S]+).*\r?\n and matching string
"Test Matcher-find() performance estimation!\r\n". The test results show
icu4c has the best performance:
harmony:  14962 - 15760 ms
RI:             10999 - 11587 ms
ICU4C:     7193 - 7558 ms
It seems if the regex get complex, ICU4C will do the best performance.

> one thing we want to do is use http://code.google.com/p/caliper/
> benchmarks to track the relative performance of Java implementations
> of the stuff that's currently native code (but doesn't inherently need
> to be). we don't have anything like that for regular expressions yet.
> (our tool for conveniently running benchmarks is
> http://code.google.com/p/vogar/ and you can find some of our actual
> benchmarks at http://code.google.com/p/dalvik/.)
>
> > > I heard Android use icu4jni re-implement this
> > > module. Since icu4jni use native code I think it may has higher
> performance
> > > than harmony. I am trying to use icu4jni as the back-end of harmony
> regex
> > > but find icu4jni has no functions related to regex operations.
>
> although, as jesse said, regular expressions are a special case where
> there was no original icu4jni implementation, icu4jni is a dead
> project (http://site.icu-project.org/#TOC-ICU4JNI). Android used the
> icu4jni code, but it's been gradually rewritten since then. the
> process is incomplete, and the extent of the rewrite varies from class
> to class, but you're probably better off taking code from Android
> rather than icu4jni.
>
> > > I know there are some android guys in our community. So can anyone tell
> me
> > > some detail info for android's regex, like if it re-implement the regex
> > > logic using native code by android itself rather than icu4jni and
> really get
> > > higher performance compared to harmony regex? Thanks a lot!
> >
> > We actually use icu4c's pattern, which we access via JNI. Documentation
> is
> > here:
> >  http://icu-project.org/apiref/icu4c/classRegexPattern.html
> >
> > Our regex-java integration code is Apache-licensed and available in our
> Git
> > repository here:
> >
> >
> http://android.git.kernel.org/?p=platform/libcore.git;a=tree;f=regex/src/main/java/java/util/regex;h=f9207d043a24d3dc034bf2cc3a5fdd57115692c3;hb=master
>
> actually, in shipping versions of Android we use the C API rather than
> the C++ API. see
>
> http://android.git.kernel.org/?p=platform/libcore.git;a=blob;f=icu/src/main/native/NativeRegEx.cpp;h=7b3cafc0779ef7f3ed69a32128af7f96e3498922;hb=master
> for the gory details.
>
> i've recently rewritten this to use the C++ APIs, and it's much
> cleaner for it. if you look at the current code you'll see that we
> copy the input to the native heap, because icu4c < 4.6 can't cope with
> its input moving about between calls (and a VM can't necessarily
> guarantee the Java char[] doesn't move around).
>
> in future, we'll use a new icu4c API to get the best of both worlds:
>
>
> http://sourceforge.net/mailarchive/forum.php?thread_name=AANLkTim7czTaFNEBQm8tO3LhdDingvS3pshpW79XBIhu@mail.gmail.com&forum_name=icu-design
>
> if you're shipping your own copy of icu (rather than using some system
> one you have no control over), you might want to backport that change
> (which has been submitted).
>
> one final thing to be aware of is that icu4c's regular expression
> implementation isn't 100% compatible with the RI. most obviously,
> there's no support for CANON_EQ, but there are a variety of more
> subtle incompatibilities too.
>
> --
> Elliott Hughes - http://who/enh - http://jessies.org/~enh/
>

Fwd: A question about android regex implementation

Posted by Deven You <de...@gmail.com>.
---------- Forwarded message ----------
From: Deven You <de...@gmail.com>
Date: 2010/7/8
Subject: Re: A question about android regex implementation
To: enh <en...@google.com>




2010/7/8 enh <en...@google.com>

On Wed, Jul 7, 2010 at 09:55, Jesse Wilson <je...@google.com> wrote:
> > On Mon, Jul 5, 2010 at 8:02 PM, Deven You <de...@gmail.com> wrote:
> > > I have written some simple benchmarks for harmony regex and find the
> > > performance of harmony is poor compared to RI. For example,
> Mathcer.find()
> > > only reach 60% of that of RI.
>
> i'd be interested to see your benchmark code. our regular expression
> benchmarks actually focus on replacing use of Pattern/Matcher with
> simpler fast paths where the full complexity isn't needed. for
> example, most String.split calls don't actually use regular
> expressions. i'm not sure how much of this work made it into froyo,
> and it's certainly still a work in progress.
>
I first found  harmony regex is slower than RI  is from an internal
benchmark test result. In this benchmark, Mather.find() is relatively hot
and slower than RI. Then I write a simple test to verify it:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherFind {


public static void main(String[] args) {
 Pattern pat = Pattern.compile("(\\d{1,3})");
Matcher matcher = pat.matcher("aaaa123456789045");
 //warm up
for (int i = 0; i <1000000; i++) {
 while(matcher.find()) {
}
 matcher.reset();
}
 long time = System.currentTimeMillis();
for (int i = 0; i <10000000; i++) {
 while(matcher.find()) {
}
 matcher.reset();
}
time = System.currentTimeMillis() - time;
               System.out.println("total time of MatcherFind is " + time + "
ms!");

}

The results are harmony takes 9841 -10079 ms meanwhile RI takes 5596 - 5877
ms.

In volano benchmark, Matcher(Pattern, CharSequence) and Mather.group(int)
are also slower than RI, although they are not hot methods. My simple
benchmarks for these 2 methods also approve this.
1. Matcher(Pattern, CharSequence)

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherConstruct {
public static void main(String[] args) {
        Pattern pat = Pattern.compile("XX");
        Matcher matcher = null;

        //warmup phase
        for (int i = 0; i < 100000; i++) {
         matcher = pat.matcher("Today is XX-XX-XX ...");
        }

        long time = System.currentTimeMillis();
        for (int i = 0; i < 1000000; i++) {
         matcher = pat.matcher("Today is XX-XX-XX ...");
        }
        time = System.currentTimeMillis() - time;
        System.out.println("Total time of MatcherConstruct is " + time + "
ms!");
}
}

2. Matcher.groun(int)

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherGroupInt {
 public static void main(String[] args) {
        String testPattern = "(\\d{1,3})";
        String testString = "aaaa123456789045";

        Pattern pat = Pattern.compile(testPattern);
        Matcher matcher = pat.matcher(testString);

        //warm up
        for (int i = 0; i < 500000; i++) {
            while (matcher.find()) {
                String result = matcher.group(1);
               }
            matcher.reset();
        }

        long time = System.currentTimeMillis();
        for (int i = 0; i < 1000000; i++) {
            while (matcher.find()) {
                String result = matcher.group(1);
               }
            matcher.reset();
        }
        time = System.currentTimeMillis() - time;
        System.out.println("total time of MatcherGroupInt is " + time + "
ms!");
}
}


> one thing we want to do is use http://code.google.com/p/caliper/
> benchmarks to track the relative performance of Java implementations
> of the stuff that's currently native code (but doesn't inherently need
> to be). we don't have anything like that for regular expressions yet.
> (our tool for conveniently running benchmarks is
> http://code.google.com/p/vogar/ and you can find some of our actual
> benchmarks at http://code.google.com/p/dalvik/.)
>
> > > I heard Android use icu4jni re-implement this
> > > module. Since icu4jni use native code I think it may has higher
> performance
> > > than harmony. I am trying to use icu4jni as the back-end of harmony
> regex
> > > but find icu4jni has no functions related to regex operations.
>
> although, as jesse said, regular expressions are a special case where
> there was no original icu4jni implementation, icu4jni is a dead
> project (http://site.icu-project.org/#TOC-ICU4JNI). Android used the
> icu4jni code, but it's been gradually rewritten since then. the
> process is incomplete, and the extent of the rewrite varies from class
> to class, but you're probably better off taking code from Android
> rather than icu4jni.
>
> > > I know there are some android guys in our community. So can anyone tell
> me
> > > some detail info for android's regex, like if it re-implement the regex
> > > logic using native code by android itself rather than icu4jni and
> really get
> > > higher performance compared to harmony regex? Thanks a lot!
> >
> > We actually use icu4c's pattern, which we access via JNI. Documentation
> is
> > here:
> >  http://icu-project.org/apiref/icu4c/classRegexPattern.html
> >
> > Our regex-java integration code is Apache-licensed and available in our
> Git
> > repository here:
> >
> >
> http://android.git.kernel.org/?p=platform/libcore.git;a=tree;f=regex/src/main/java/java/util/regex;h=f9207d043a24d3dc034bf2cc3a5fdd57115692c3;hb=master
>
> actually, in shipping versions of Android we use the C API rather than
> the C++ API. see
>
> http://android.git.kernel.org/?p=platform/libcore.git;a=blob;f=icu/src/main/native/NativeRegEx.cpp;h=7b3cafc0779ef7f3ed69a32128af7f96e3498922;hb=master
> for the gory details.
>
> i've recently rewritten this to use the C++ APIs, and it's much
> cleaner for it. if you look at the current code you'll see that we
> copy the input to the native heap, because icu4c < 4.6 can't cope with
> its input moving about between calls (and a VM can't necessarily
> guarantee the Java char[] doesn't move around).
>
> in future, we'll use a new icu4c API to get the best of both worlds:
>
>
> http://sourceforge.net/mailarchive/forum.php?thread_name=AANLkTim7czTaFNEBQm8tO3LhdDingvS3pshpW79XBIhu@mail.gmail.com&forum_name=icu-design
>
> if you're shipping your own copy of icu (rather than using some system
> one you have no control over), you might want to backport that change
> (which has been submitted).
>
> one final thing to be aware of is that icu4c's regular expression
> implementation isn't 100% compatible with the RI. most obviously,
> there's no support for CANON_EQ, but there are a variety of more
> subtle incompatibilities too.
>
> --
> Elliott Hughes - http://who/enh - http://jessies.org/~enh/
>

Re: A question about android regex implementation

Posted by enh <en...@google.com>.
On Wed, Jul 7, 2010 at 09:55, Jesse Wilson <je...@google.com> wrote:
> On Mon, Jul 5, 2010 at 8:02 PM, Deven You <de...@gmail.com> wrote:
> > I have written some simple benchmarks for harmony regex and find the
> > performance of harmony is poor compared to RI. For example, Mathcer.find()
> > only reach 60% of that of RI.

i'd be interested to see your benchmark code. our regular expression
benchmarks actually focus on replacing use of Pattern/Matcher with
simpler fast paths where the full complexity isn't needed. for
example, most String.split calls don't actually use regular
expressions. i'm not sure how much of this work made it into froyo,
and it's certainly still a work in progress.

one thing we want to do is use http://code.google.com/p/caliper/
benchmarks to track the relative performance of Java implementations
of the stuff that's currently native code (but doesn't inherently need
to be). we don't have anything like that for regular expressions yet.
(our tool for conveniently running benchmarks is
http://code.google.com/p/vogar/ and you can find some of our actual
benchmarks at http://code.google.com/p/dalvik/.)

> > I heard Android use icu4jni re-implement this
> > module. Since icu4jni use native code I think it may has higher performance
> > than harmony. I am trying to use icu4jni as the back-end of harmony regex
> > but find icu4jni has no functions related to regex operations.

although, as jesse said, regular expressions are a special case where
there was no original icu4jni implementation, icu4jni is a dead
project (http://site.icu-project.org/#TOC-ICU4JNI). Android used the
icu4jni code, but it's been gradually rewritten since then. the
process is incomplete, and the extent of the rewrite varies from class
to class, but you're probably better off taking code from Android
rather than icu4jni.

> > I know there are some android guys in our community. So can anyone tell me
> > some detail info for android's regex, like if it re-implement the regex
> > logic using native code by android itself rather than icu4jni and really get
> > higher performance compared to harmony regex? Thanks a lot!
>
> We actually use icu4c's pattern, which we access via JNI. Documentation is
> here:
>  http://icu-project.org/apiref/icu4c/classRegexPattern.html
>
> Our regex-java integration code is Apache-licensed and available in our Git
> repository here:
>
> http://android.git.kernel.org/?p=platform/libcore.git;a=tree;f=regex/src/main/java/java/util/regex;h=f9207d043a24d3dc034bf2cc3a5fdd57115692c3;hb=master

actually, in shipping versions of Android we use the C API rather than
the C++ API. see
http://android.git.kernel.org/?p=platform/libcore.git;a=blob;f=icu/src/main/native/NativeRegEx.cpp;h=7b3cafc0779ef7f3ed69a32128af7f96e3498922;hb=master
for the gory details.

i've recently rewritten this to use the C++ APIs, and it's much
cleaner for it. if you look at the current code you'll see that we
copy the input to the native heap, because icu4c < 4.6 can't cope with
its input moving about between calls (and a VM can't necessarily
guarantee the Java char[] doesn't move around).

in future, we'll use a new icu4c API to get the best of both worlds:

http://sourceforge.net/mailarchive/forum.php?thread_name=AANLkTim7czTaFNEBQm8tO3LhdDingvS3pshpW79XBIhu@mail.gmail.com&forum_name=icu-design

if you're shipping your own copy of icu (rather than using some system
one you have no control over), you might want to backport that change
(which has been submitted).

one final thing to be aware of is that icu4c's regular expression
implementation isn't 100% compatible with the RI. most obviously,
there's no support for CANON_EQ, but there are a variety of more
subtle incompatibilities too.

--
Elliott Hughes - http://who/enh - http://jessies.org/~enh/

Re: A question about android regex implementation

Posted by Jesse Wilson <je...@google.com>.
On Mon, Jul 5, 2010 at 8:02 PM, Deven You <de...@gmail.com> wrote:

> I have written some simple benchmarks for harmony regex and find the
> performance of harmony is poor compared to RI. For example, Mathcer.find()
> only reach 60% of that of RI. I heard Android use icu4jni re-implement this
> module. Since icu4jni use native code I think it may has higher performance
> than harmony. I am trying to use icu4jni as the back-end of harmony regex
> but find icu4jni has no functions related to regex operations.
> I know there are some android guys in our community. So can anyone tell me
> some detail info for android's regex, like if it re-implement the regex
> logic using native code by android itself rather than icu4jni and really get
> higher performance compared to harmony regex? Thanks a lot!
>

We actually use icu4c's pattern, which we access via JNI. Documentation is
here:
  http://icu-project.org/apiref/icu4c/classRegexPattern.html

Our regex-java integration code is Apache-licensed and available in our Git
repository here:

http://android.git.kernel.org/?p=platform/libcore.git;a=tree;f=regex/src/main/java/java/util/regex;h=f9207d043a24d3dc034bf2cc3a5fdd57115692c3;hb=master