Posted to dev@lucy.apache.org by Nick Wellnhofer <we...@aevum.de> on 2013/03/12 00:31:59 UTC

Re: [lucy-dev] git commit: refs/heads/c-bindings-wip2 - Implement POSIX RegexTokenizer

On Mar 11, 2013, at 23:21 , Marvin Humphrey <ma...@rectangular.com> wrote:

> What might theoretically be useful is specifying a regex engine for the sake
> of index portability across hosts -- for example, specifying that a Perl build
> of Lucy use PCRE instead of Perl's regex engine.  There are a couple of ways
> we could do that.
> 
> One option would be to offer a compile-time configuration option for
> RegexTokenizer.  However, incompatible configurations would fail silently,
> producing subtly different results under the inappropriate engine rather than
> bombing out.
> 
> A more reliable technique would be to provide dedicated classes such as
> "PCRETokenizer" which are associated with specific regex engines.  However,
> such an approach has notable cost because the regex engine code would need to
> be bundled to protect against incompatibilities across regex engine versions.

We could simply include the type of the regex engine and even the particular version in the serialization and equality test of RegexTokenizer. This should safely guard against using different engines for indexing.

Whether regex engines are selected at compile time, by choosing a dedicated class, or by a constructor argument shouldn't make a difference then. I'd prefer the latter approach.
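
A minimal sketch of the idea, in Python rather than Lucy's own C or Perl. All names here (RegexTokenizerSpec, engine, engine_version, dump, load) are hypothetical and chosen for illustration; this is not Lucy's actual API. It only shows how carrying the engine type and version through serialization makes the equality test notice a mismatch:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegexTokenizerSpec:
    """Hypothetical stand-in for a tokenizer's serializable state."""
    pattern: str
    engine: str          # assumed labels such as "pcre", "perl", "posix"
    engine_version: str  # assumed free-form version string

    def dump(self) -> dict:
        # Engine type and version travel with the pattern.
        return {
            "pattern": self.pattern,
            "engine": self.engine,
            "engine_version": self.engine_version,
        }

    @classmethod
    def load(cls, data: dict) -> "RegexTokenizerSpec":
        return cls(data["pattern"], data["engine"], data["engine_version"])


# The generated __eq__ compares all three fields, so a tokenizer built
# against a different engine or version no longer tests equal.
indexed = RegexTokenizerSpec(r"\w+", "pcre", "8.32")
reloaded = RegexTokenizerSpec.load(indexed.dump())
other_engine = RegexTokenizerSpec(r"\w+", "perl", "5.16")
```

With the engine passed as a constructor argument, as here, the same check applies no matter how the engine was selected.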

Nick

Re: [lucy-dev] git commit: refs/heads/c-bindings-wip2 - Implement POSIX RegexTokenizer

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Mon, Mar 11, 2013 at 4:31 PM, Nick Wellnhofer <we...@aevum.de> wrote:

> We could simply include the type of the regex engine and even the particular
> version in the serialization and equality test of RegexTokenizer. This
> should safely guard against using different engines for indexing.
>
> Whether regex engines are selected at compile time, by choosing a dedicated
> class, or by a constructor argument shouldn't make a difference then. I'd
> prefer the latter approach.

Having the equality test fail leads to a harsh consequence, though -- when the
regex engine changes (e.g. because you upgraded the host language) all your
apps start throwing exceptions as soon as they try to open the index.  And
it's quite possible that the changes in the regex behavior don't even affect
your app or cause only minor degradation.

This problem of degraded recall is the same one we face with all Analyzer
behavior changes.  The best solution for many users is to live with a window
of inferior search results while refreshing the index after an upgrade.

If we introduce extra fields specifying the engine and possibly the version,
how about leaving them undefined by default, falling back to whatever is
available?  It's a little strange to have an opt-in which ties your hands on
upgrade, though...
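
A rough sketch of what such an opt-in could look like, in Python for brevity. The names (TokenizerConfig, is_compatible) and the exact matching rules are assumptions made for illustration, not anything Lucy implements. Fields left undefined accept whatever engine the host provides; only pinned fields are enforced:

```python
from typing import Optional


class TokenizerConfig:
    """Hypothetical stored tokenizer settings; engine fields are opt-in."""

    def __init__(self, pattern: str,
                 engine: Optional[str] = None,
                 engine_version: Optional[str] = None) -> None:
        self.pattern = pattern
        self.engine = engine                  # None: use whatever is available
        self.engine_version = engine_version  # None: accept any version


def is_compatible(stored: TokenizerConfig,
                  host_engine: str, host_version: str) -> bool:
    """Return True if the host's regex engine satisfies the stored settings."""
    if stored.engine is None:
        return True   # nothing pinned, fall back to the host engine
    if stored.engine != host_engine:
        return False  # pinned engine differs from the host's
    if stored.engine_version is None:
        return True   # engine pinned, version left open
    return stored.engine_version == host_version


unpinned = TokenizerConfig(r"\w+")
pinned_engine = TokenizerConfig(r"\w+", engine="pcre")
pinned_both = TokenizerConfig(r"\w+", engine="pcre", engine_version="8.32")
```

Under this scheme an unpinned index keeps opening after a host-language upgrade, while one pinned to a specific version would be rejected, which is exactly the hand-tying trade-off.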

Fortunately, with StandardTokenizer and EasyAnalyzer moving to the forefront
in all our sample code, we can assume that fewer people use RegexTokenizer
these days and all this matters less than it once did. :)

Marvin Humphrey