Posted to dev@openoffice.apache.org by "Pedro F. Giffuni" <gi...@tutopia.com> on 2011/06/23 21:39:49 UTC

RegExp replacement (was Re: Some more strange files in the OOo code)

--- On Thu, 6/23/11, Mathias Bauer <Ma...@gmx.net> wrote:
...

> 
> You are talking about the list of external source tarballs?

Yes, my mailer ate the original reply, sorry.

Anyways ...

I looked at the RegExp stuff, as I promised.

OpenOffice has a C++ interface to GNU regex so Google's
RE2 seemed like a natural fit there. Unfortunately I see
TextSearch::RESrchBkwrd in textsearch.cxx so I assume we
need backreferences. The RE2 website says:

"If you absolutely need backreferences and generalized
assertions, then RE2 is not for you, but you might be
interested in irregexp, Google Chrome's regular expression
engine."
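
For readers unfamiliar with the term: a backreference such as \1 must match
exactly the text that an earlier capture group matched, which is the feature
the RE2 quote above says it does not offer. The following is only a minimal
sketch in Python's re module, chosen for brevity; the pattern and the
find_doubled_words helper are hypothetical illustrations and are not taken
from textsearch.cxx or any OOo code.

```python
import re

# Hypothetical pattern for illustration: a word immediately repeated.
# \1 is a backreference: it must match the same text that group 1 captured.
DOUBLED_WORD = re.compile(r"\b(\w+) \1\b")


def find_doubled_words(text):
    """Return each word that appears twice in a row in text."""
    return [m.group(1) for m in DOUBLED_WORD.finditer(text)]


if __name__ == "__main__":
    # prints ['is', 'the']
    print(find_doubled_words("this is is a test of the the engine"))
```

An engine without backreferences cannot express "the same word again" in a
single pattern, so whether that matters depends on which patterns the
search code actually has to accept.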

irregexp replaced PCRE and is mentioned here:
http://blog.chromium.org/2009/02/irregexp-google-chromes-new-regexp.html

And the code, integrated in chrome's v8, is here:
http://v8.googlecode.com/svn/trunk/src/ regex-*

It's also C++ and it's under a BSD license. I couldn't find
it as an independent package so someone who actually knows
C++ well will have to do the fun part. Well, at least it's
much better than writing our own ;-).

Pedro.


RE: RegExp replacement (was Re: Some more strange files in the OOo code)

Posted by "Dennis E. Hamilton" <de...@acm.org>.
+1

That would certainly be a clean choice for those regular expressions that show up in the OpenDocument format of ODF 1.2 documents.  It is an easy way to document the implementation-dependent choice.

The ICU License is on this page <http://userguide.icu-project.org/intro>.  It is on the BSD model.

 - Dennis

PS: And welcome, Eike, it is great to see you here. 



-----Original Message-----
From: Eike Rathke [mailto:ooo@erack.de] 
Sent: Thursday, June 23, 2011 15:36
To: ooo-dev@incubator.apache.org
Subject: Re: RegExp replacement (was Re: Some more strange files in the OOo code)

Hi Pedro,

On Thursday, 2011-06-23 12:39:49 -0700, Pedro F. Giffuni wrote:

> OpenOffice has a C++ interface to GNU regex so Google's
> RE2 seemed like a natural fit there. Unfortunately I see
> TextSearch::RESrchBkwrd in textsearch.cxx so I assume we
> need backreferences. The RE2 website says:
> 
> "If you absolutely need backreferences and generalized
> assertions, then RE2 is not for you, but you might be
> interested in irregexp, Google Chrome's regular expression
> engine."

I strongly propose to go for ICU's RE instead. OOo already makes heavy
use of ICU, the ICU REs support Unicode conforming to TR18
http://www.unicode.org/reports/tr18/ and seem to have all we need. See
http://userguide.icu-project.org/strings/regexp
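
As a rough illustration of why Unicode-aware matching matters for an office
suite, the sketch below compares Unicode and ASCII-only word matching. It
uses Python's re module purely as a stand-in: it is not ICU, it makes no
claim of TR18 conformance, and the sample text is an assumption made for the
example.

```python
import re

# Sample text with non-ASCII letters: "naïve café"
text = "na\u00efve caf\u00e9"

# Unicode-aware matching (the default for str patterns in Python 3):
# accented letters count as word characters.
unicode_words = re.findall(r"\w+", text)

# ASCII-only matching: accented letters split or truncate the words.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

if __name__ == "__main__":
    print(unicode_words)
    print(ascii_words)
```

With ASCII-only rules a "whole word" search would treat the accented letters
as word boundaries, which is the kind of behaviour a Unicode-conforming
engine is meant to avoid.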

  Eike

-- 
 PGP/OpenPGP/GnuPG encrypted mail preferred in all private communication.
 Key ID: 0x293C05FD - 997A 4C60 CE41 0149 0DB3  9E96 2F1A D073 293C 05FD


Re: RegExp replacement (was Re: Some more strange files in the OOo code)

Posted by Greg Stein <gs...@gmail.com>.
On Jun 24, 2011 6:35 AM, "Eike Rathke" <oo...@erack.de> wrote:
>
> Hi Mathias,
>
> On Friday, 2011-06-24 10:11:43 +0200, Mathias Bauer wrote:
>
> > Eike, while we are at regexp: do you know something about
> >
> > boost/Regex_Experimental.tar.gz
>
> Indeed ;-)  A very old experiment to get boost REs compiling, see
> boost/README.Regex_Experimental
> It's not needed for OOo, and now that there are ICU REs we'd have an
> alternative.

I've had poor experience with boost. Let's go with ICU. (or PCRE if needed)

Cheers,
-g

Re: RegExp replacement (was Re: Some more strange files in the OOo code)

Posted by Eike Rathke <oo...@erack.de>.
Hi Mathias,

On Friday, 2011-06-24 10:11:43 +0200, Mathias Bauer wrote:

> Eike, while we are at regexp: do you know something about
> 
> boost/Regex_Experimental.tar.gz

Indeed ;-)  A very old experiment to get boost REs compiling, see
boost/README.Regex_Experimental
It's not needed for OOo, and now that there are ICU REs we'd have an
alternative.

  Eike

-- 
 PGP/OpenPGP/GnuPG encrypted mail preferred in all private communication.
 Key ID: 0x293C05FD - 997A 4C60 CE41 0149 0DB3  9E96 2F1A D073 293C 05FD

Re: RegExp replacement (was Re: Some more strange files in the OOo code)

Posted by Mathias Bauer <Ma...@gmx.net>.
On 24.06.2011 00:35, Eike Rathke wrote:
> Hi Pedro,
>
> On Thursday, 2011-06-23 12:39:49 -0700, Pedro F. Giffuni wrote:
>
>> OpenOffice has a C++ interface to GNU regex so Google's
>> RE2 seemed like a natural fit there. Unfortunately I see
>> TextSearch::RESrchBkwrd in textsearch.cxx so I assume we
>> need backreferences. The RE2 website says:
>>
>> "If you absolutely need backreferences and generalized
>> assertions, then RE2 is not for you, but you might be
>> interested in irregexp, Google Chrome's regular expression
>> engine."
>
> I strongly propose to go for ICU's RE instead. OOo already makes heavy
> use of ICU, the ICU REs support Unicode conforming to TR18
> http://www.unicode.org/reports/tr18/ and seem to have all we need. See
> http://userguide.icu-project.org/strings/regexp
>
>    Eike
>

Eike, while we are at regexp: do you know something about

boost/Regex_Experimental.tar.gz

Regards,
Mathias

Re: RegExp replacement (was Re: Some more strange files in the OOo code)

Posted by Pedro Giffuni <gi...@tutopia.com>.
 On Fri, 24 Jun 2011 00:35:45 +0200, Eike Rathke <oo...@erack.de> wrote:
> Hi Pedro,
>
> On Thursday, 2011-06-23 12:39:49 -0700, Pedro F. Giffuni wrote:
>
>> OpenOffice has a C++ interface to GNU regex so Google's
>> RE2 seemed like a natural fit there. Unfortunately I see
>> TextSearch::RESrchBkwrd in textsearch.cxx so I assume we
>> need backreferences. The RE2 website says:
>>
>> "If you absolutely need backreferences and generalized
>> assertions, then RE2 is not for you, but you might be
>> interested in irregexp, Google Chrome's regular expression
>> engine."
>
> I strongly propose to go for ICU's RE instead. OOo already makes heavy
> use of ICU, the ICU REs support Unicode conforming to TR18
> http://www.unicode.org/reports/tr18/ and seem to have all we need. See
> http://userguide.icu-project.org/strings/regexp
>
>   Eike

 Thanks for the excellent suggestion, Eike!

 +1 for code deduplication.

 I understand we also have some C++ experts from (ahem) IBM so
 we can always point to them for any issue that arises ;-).

 Pedro.

Re: RegExp replacement (was Re: Some more strange files in the OOo code)

Posted by Eike Rathke <oo...@erack.de>.
Hi Pedro,

On Thursday, 2011-06-23 12:39:49 -0700, Pedro F. Giffuni wrote:

> OpenOffice has a C++ interface to GNU regex so Google's
> RE2 seemed like a natural fit there. Unfortunately I see
> TextSearch::RESrchBkwrd in textsearch.cxx so I assume we
> need backreferences. The RE2 website says:
> 
> "If you absolutely need backreferences and generalized
> assertions, then RE2 is not for you, but you might be
> interested in irregexp, Google Chrome's regular expression
> engine."

I strongly propose to go for ICU's RE instead. OOo already makes heavy
use of ICU, the ICU REs support Unicode conforming to TR18
http://www.unicode.org/reports/tr18/ and seem to have all we need. See
http://userguide.icu-project.org/strings/regexp

  Eike

-- 
 PGP/OpenPGP/GnuPG encrypted mail preferred in all private communication.
 Key ID: 0x293C05FD - 997A 4C60 CE41 0149 0DB3  9E96 2F1A D073 293C 05FD

Re: RegExp replacement (was Re: Some more strange files in the OOo code)

Posted by "Pedro F. Giffuni" <gi...@tutopia.com>.
--- On Thu, 6/23/11, Greg Stein <gs...@gmail.com> wrote:
...
> 
> I was talking about the C++ wrappers that are part of PCRE
> itself.
>

Ah, I didn't know about those ... no objection at all.

FWIW, since irregexp actually replaced PCRE in Chrome, I
would expect it to be faster, but then it also does some
weird things in order to get an improvement of around
10%.

I agree PCRE is a fine and working solution.

cheers,
 
Pedro.

Re: RegExp replacement (was Re: Some more strange files in the OOo code)

Posted by Greg Stein <gs...@gmail.com>.
On Thu, Jun 23, 2011 at 16:26, Pedro F. Giffuni <gi...@tutopia.com> wrote:
> (Sorry my previous message only went private)
>
> --- On Thu, 6/23/11, Greg Stein <gs...@gmail.com> wrote:
>
> (Snipped the irregexp stuff)
>
>> > much better than writing our own ;-).
>>
>> PCRE also has C++ wrappers, and *is* packaged up and
>> delivered as a library.
>>
>
> Hmm.. and it's documented:
>    http://www.daemon.de/PCRE
>
> One thing to take into account though:
>
> PCRE   --> BSD licensed
> PCRE++ --> LGPL
>
> I certainly won't object to PCRE++ if developers feel more
> comfortable with it, but if we use PCRE++ it will have to
> be a dependency because we cannot bring it directly to the
> tree. It wouldn't make much sense to design our own C++
> wrapper.

I was talking about the C++ wrappers that are part of PCRE itself.

For example:
  http://vcs.pcre.org/viewvc/code/trunk/pcrecpp.h?view=markup


Cheers,
-g

Re: RegExp replacement (was Re: Some more strange files in the OOo code)

Posted by "Pedro F. Giffuni" <gi...@tutopia.com>.
(Sorry my previous message only went private)

--- On Thu, 6/23/11, Greg Stein <gs...@gmail.com> wrote:

(Snipped the irregexp stuff)

> > much better than writing our own ;-).
> 
> PCRE also has C++ wrappers, and *is* packaged up and
> delivered as a library.
>

Hmm.. and it's documented:
    http://www.daemon.de/PCRE

One thing to take into account though:

PCRE   --> BSD licensed
PCRE++ --> LGPL

I certainly won't object to PCRE++ if developers feel more
comfortable with it, but if we use PCRE++ it will have to
be a dependency because we cannot bring it directly to the
tree. It wouldn't make much sense to design our own C++
wrapper.


Pedro.  


Re: RegExp replacement (was Re: Some more strange files in the OOo code)

Posted by Greg Stein <gs...@gmail.com>.
On Thu, Jun 23, 2011 at 15:39, Pedro F. Giffuni <gi...@tutopia.com> wrote:
>
> --- On Thu, 6/23/11, Mathias Bauer <Ma...@gmx.net> wrote:
> ...
>
>>
>> You are talking about the list of external source tarballs?
>
> Yes, my mailer ate the original reply, sorry.
>
> Anyways ...
>
> I looked at the RegExp stuff, as I promised.
>
> OpenOffice has a C++ interface to GNU regex so Google's
> RE2 seemed like a natural fit there. Unfortunately I see
> TextSearch::RESrchBkwrd in textsearch.cxx so I assume we
> need backreferences. The RE2 website says:
>
> "If you absolutely need backreferences and generalized
> assertions, then RE2 is not for you, but you might be
> interested in irregexp, Google Chrome's regular expression
> engine."
>
> irregexp replaced PCRE and is mentioned here:
> http://blog.chromium.org/2009/02/irregexp-google-chromes-new-regexp.html
>
> And the code, integrated in chrome's v8, is here:
> http://v8.googlecode.com/svn/trunk/src/ regex-*
>
> It's also C++ and it's under a BSD license. I couldn't find
> it as an independent package so someone who actually knows
> C++ well will have to do the fun part. Well, at least it's
> much better than writing our own ;-).

PCRE also has C++ wrappers, and *is* packaged up and delivered as a library.

Cheers,
-g