You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2011/12/23 04:18:33 UTC

[lucy-dev] Promoting new analysis components

On Thu, Dec 22, 2011 at 11:31:21PM +0100, Nick Wellnhofer wrote:
> On 22/12/11 06:15, Marvin Humphrey wrote:
>> Can you please create a JIRA issue for this, Nick?  The reason is that our
>> CHANGES file is a list of JIRA issues, and we want people to be able to see
>> that EasyAnalyzer was added in 0.3.0 and get a link to an explanation.
>
> Done.
>
>> One thought: We may not want to have EasyAnalyzer inherit from PolyAnalyzer.
>> It's fine, because it works for now... but how about we at least override
>> Dump/Load to store only "_class" and "language"?  If we refactor the
>> Analyzer chain, PolyAnalyzer, since it allows *any* Analyzer to be first --
>> not just a Tokenizer -- may not survive the refactoring in its current
>> form.
>
> OK, I changed that in commit r1222495.

Looks great, Nick!

Now that EasyAnalyzer is in, I think we should promote the use of all the
improvements Nick has made to the analysis chain.

  * Swap in EasyAnalyzer for PolyAnalyzer, Normalizer for CaseFolder, and
    StandardTokenizer for RegexTokenizer everywhere we can.
  * Deprecate the "language" parameter to PolyAnalyzer#new.

By "deprecate", I mean:

  * Open a JIRA issue so that a suitably titled entry ends up in the CHANGES
    file.
  * Mark the "language" param as "deprecated" in the PolyAnalyzer docs.

We don't have a strong deprecation mechanism available to us right now, so I
think that's the best we can do.

Here are some of the files which are going to need documentation changes
because they promote the old Analyzers:

  perl/lib/Lucy.pod
  perl/lib/Lucy/Docs/Cookbook/CustomQueryParser.pod
  perl/lib/Lucy/Docs/Cookbook/CustomQuery.pod
  perl/lib/Lucy/Docs/Tutorial/Analysis.pod
  perl/lib/Lucy/Docs/Tutorial/BeyondSimple.pod
  perl/lib/Lucy/Plan/Schema.pm
  perl/lib/Lucy/Plan/FullTextType.pm
  perl/sample/indexer.pl

It's not important that any of these changes happen before 0.3.0.  The docs
changes can happen at any time, and the parameter deprecation only allows the
simplification of a single class (PolyAnalyzer itself).  It would also be nice
to switch most test cases to use the new Analyzers, but that can also happen
at any time.

In contrast, here are a couple changes we should *not* make prior to 0.3.0,
because they have index compatibility implications:

  * Change Lucy::Simple to use EasyAnalyzer instead of PolyAnalyzer.
  * Implement CaseFolder as a subclass of Normalizer.

Marvin Humphrey

Re: [lucy-dev] Promoting new analysis components

Posted by Nick Wellnhofer <we...@aevum.de>.

On 09/02/2012 02:49, Marvin Humphrey wrote:
> After reviewing the Lucy::Simple code, I realized that we can avoid breaking
> compat with only a few extra lines.
>
>    * If the index exists during new(), extract the schema and type from what's
>      on disk.
>    * Otherwise, create a new EasyAnalyzer for the type.
>
> That way, we avoid a schema conflict crash when indexes built by Lucy::Simple
> prior to 0.4.0 are read by 0.4.0 or above.

That's a good idea.

> However, CaseFolder and Normalizer presumably have slightly different case
> mappings, thus the subclassing change is a back compat break.  It shouldn't be
> a horrible break (depending on how close the mappings are) because it will
> only affect search-time, screwing up the results only for terms which contain
> code points whose mapping has changed.

The German sharp s ("ß") is handled differently by the CaseFolder and 
the Normalizer. The CaseFolder leaves it untouched, whereas the 
Normalizer converts it to "ss". Fortunately, the snowball stemmer also 
converts sharp s to "ss", so many users should be fine.

> I don't think we should outright remove CaseFolder without a really good
> reason, because that will force almost all of our users to change their code
> and then reindex from scratch.  But a subtle compat break might be OK,
> especially since you can update all the docs in place after upgrading and only
> suffer during a window of time from slightly degraded search results.

+1

Nick

Re: [lucy-dev] Promoting new analysis components

Posted by Nick Wellnhofer <we...@aevum.de>.

On 10/02/2012 01:40, Marvin Humphrey wrote:
> On Thu, Feb 09, 2012 at 09:09:08PM +0100, Nick Wellnhofer wrote:
>> See the attached patch for my second attempt.
>
> There is still potential for a schema conflict here.  We have to get
> $self->{type} out of the existing schema when the index exists.

Oh yes, of course. I just committed another version that also takes care 
of the FieldType.

Nick

Re: [lucy-dev] Promoting new analysis components

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Thu, Feb 09, 2012 at 09:09:08PM +0100, Nick Wellnhofer wrote:
> On 09/02/2012 18:49, Marvin Humphrey wrote:
>> Rehashing our short exchange on IRC for the benefit of the list... I suggest
>> using PolyReader#open for this, since it returns NULL rather than throwing an
>> exception on failure.  If open() is successful, the resulting reader can be
>> used as an argument to IndexSearcher#new:
>
> Actually, PolyReader#open doesn't return NULL if the index is empty.
>
> But the list of seg_readers will be empty, so we can test for that.

Ah, sorry for misremembering.  

That behavior seems sub-optimal in retrospect.

Nevertheless, your revised approach will work.

> See the attached patch for my second attempt.

There is still potential for a schema conflict here.  We have to get
$self->{type} out of the existing schema when the index exists.

    if ( !@{ $reader->seg_readers } ) {
        # index is empty, create new schema
        $self->{schema} = Lucy::Plan::Schema->new;
    }
    else {
        # get schema from reader
        my $schema = $self->{schema} = $reader->get_schema;
        my ($field) = @{ $schema->get_fields };
        $self->{type} = $schema->fetch_type($field);
    }

    # Create a new FieldType if we haven't discovered one yet.
    if ( !$self->{type} ) {
        my $analyzer = Lucy::Analysis::EasyAnalyzer->new(
            language => $language );
        $self->{type} = Lucy::Plan::FullTextType->new(
            analyzer => $analyzer, );
    }

Aside from that, the patch looked good to me.

> Nick

> diff --git a/perl/lib/Lucy/Simple.pm b/perl/lib/Lucy/Simple.pm
> index aeb92c4..f09301b 100644
> --- a/perl/lib/Lucy/Simple.pm
> +++ b/perl/lib/Lucy/Simple.pm
> @@ -50,7 +50,6 @@ sub new {
>      # Get type and schema.
>      my $analyzer = Lucy::Analysis::EasyAnalyzer->new( language => $language );
>      $self->{type} = Lucy::Plan::FullTextType->new( analyzer => $analyzer, );
> -    my $schema = $self->{schema} = Lucy::Plan::Schema->new;
>  
>      # Cache the object for later clean-up.
>      weaken( $obj_cache{ refaddr $self } = $self );
> @@ -61,6 +60,15 @@ sub new {
>  sub _lazily_create_indexer {
>      my $self = shift;
>      if ( !defined $self->{indexer} ) {
> +        my $reader = Lucy::Index::PolyReader->open( index => $self->{path} );
> +        if ( ! @{ $reader->seg_readers } ) {
> +            # index is empty, create new schema
> +            $self->{schema} = Lucy::Plan::Schema->new;
> +        }
> +        else {
> +            # get schema from reader
> +            $self->{schema} = $reader->get_schema;
> +        }
>          $self->{indexer} = Lucy::Index::Indexer->new(
>              schema => $self->{schema},
>              index  => $self->{path},
> @@ -70,11 +78,11 @@ sub _lazily_create_indexer {
>  
>  sub add_doc {
>      my ( $self, $hashref ) = @_;
> -    my $schema = $self->{schema};
>      my $type   = $self->{type};
>      croak("add_doc requires exactly one argument: a hashref")
>          unless ( @_ == 2 and reftype($hashref) eq 'HASH' );
>      $self->_lazily_create_indexer;
> +    my $schema = $self->{schema};
>      $schema->spec_field( name => $_, type => $type ) for keys %$hashref;
>      $self->{indexer}->add_doc($hashref);
>  }

Re: [lucy-dev] Promoting new analysis components

Posted by Nick Wellnhofer <we...@aevum.de>.

On 09/02/2012 18:49, Marvin Humphrey wrote:
> Rehashing our short exchange on IRC for the benefit of the list... I suggest
> using PolyReader#open for this, since it returns NULL rather than throwing an
> exception on failure.  If open() is successful, the resulting reader can be
> used as an argument to IndexSearcher#new:

Actually, PolyReader#open doesn't return NULL if the index is empty. But 
the list of seg_readers will be empty, so we can test for that.

See the attached patch for my second attempt.

Nick

Re: [lucy-dev] Promoting new analysis components

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Thu, Feb 09, 2012 at 01:50:54PM +0100, Nick Wellnhofer wrote:
> On 09/02/2012 02:49, Marvin Humphrey wrote:
>> After reviewing the Lucy::Simple code, I realized that we can avoid breaking
>> compat with only a few extra lines.
>>
>>    * If the index exists during new(), extract the schema and type from what's
>>      on disk.
>>    * Otherwise, create a new EasyAnalyzer for the type.
>>
>> That way, we avoid a schema conflict crash when indexes built by Lucy::Simple
>> prior to 0.4.0 are read by 0.4.0 or above.
>
> I tried to implement this and ran into two little problems:
>
> 1. If the index doesn't exist and a schema isn't supplied, a "no schema"  
> exception is thrown, but the write lock isn't released.

Rehashing our short exchange on IRC for the benefit of the list... I suggest
using PolyReader#open for this, since it returns NULL rather than throwing an
exception on failure.  If open() is successful, the resulting reader can be
used as an argument to IndexSearcher#new:

    my $reader = Lucy::Index::PolyReader->open( index => $path );
    if ($reader) {
        $self->{searcher} = Lucy::Search::IndexSearcher->new(
            index => $reader) );
    }

>From there, getting the Schema is just an accessor call, and getting a
FieldType is annoying but doable.

The write lock fix is justified on its own, so +1 to that.

As an aside, the fact that IndexReader#open throws an exception is a legacy of
Lucene porting.  I think it might be nice to explore making our constructors
consistent, so that we use open() when there are system resources involved,
new() when there aren't, and have new() always throw exceptions and open()
consistently return NULL on failure if possible.

> 2. I didn't find a way to get the schema of an indexer from Perl.

+1 to add Indexer_Get_Schema() and to expose it from Perl.  The code in your
patch looked good.

Marvin Humphrey

Re: [lucy-dev] Promoting new analysis components

Posted by Nick Wellnhofer <we...@aevum.de>.

On 09/02/2012 02:49, Marvin Humphrey wrote:
> After reviewing the Lucy::Simple code, I realized that we can avoid breaking
> compat with only a few extra lines.
>
>    * If the index exists during new(), extract the schema and type from what's
>      on disk.
>    * Otherwise, create a new EasyAnalyzer for the type.
>
> That way, we avoid a schema conflict crash when indexes built by Lucy::Simple
> prior to 0.4.0 are read by 0.4.0 or above.

I tried to implement this and ran into two little problems:

1. If the index doesn't exist and a schema isn't supplied, a "no schema" 
exception is thrown, but the write lock isn't released.

2. I didn't find a way to get the schema of an indexer from Perl.

See the attached patch for my attempt to fix this.

Nick

Re: [lucy-dev] Promoting new analysis components

Posted by Nick Wellnhofer <we...@aevum.de>.

On 10/02/2012 19:41, Marvin Humphrey wrote:
> As an academic exercise, though, I'd like to explore how Dump/Load might still
> work under subclassing.

[...]

> Therefore, we need to override Load() in the subclassed CaseFolder.  We can't
> invoke the super Load() method, but that's OK -- we can go through
> CaseFolder_init() to flesh out the object:
>
>      CaseFolder*
>      CaseFolder_load(CaseFolder *self, Obj *dump) {
>          UNUSED_VAR(dump);
>          return CaseFolder_init(self);
>      }

 From an academic point of view, this is ugly because it would break if 
someone implements another set of Dump/Load methods in a parent class of 
Normalizer (for example Analyzer) and expects them to be called from all 
subclasses.

On the plus side, it would allow us to completely remove the Transform 
methods of CaseFolder. But the methods are already there and we only 
have to convert them to wrappers around Normalizer.

IMO composition is the most maintable approach, because it doesn't 
require deeper knowledge of the Dump mechanism.

Nick

Re: [lucy-dev] Promoting new analysis components

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Fri, Feb 10, 2012 at 02:05:26PM +0100, Nick Wellnhofer wrote:
> The original plan was to implement CaseFolder as a subclass of  
> Normalizer, but I think that doesn't play well with the Dump/Load  
> functions. Composition might be a better approach.

Composition is a fine approach as well, so +1 if that's your preference.

As an academic exercise, though, I'd like to explore how Dump/Load might still
work under subclassing.

If the subclassed CaseFolder were a brand new class, it would be fine for it
to inherit Dump/Load from Normalizer.  The class name is part of the dump
data; aside from that, everything else would be the same:

    {
        "_class": "Lucy::Analysis::CaseFolder",    # <--------
        "case_fold": 1,
        "normalization_form": "NFKC",
        "strip_accents": 0
    }

The problem is that there are schema files out there in the wild which contain
serialized CaseFolders with dump data that won't satisfy Normalizer's
implementation of Load():

    # Missing "case_fold", "normalization_form", and "strip_accents".
    {
        "_class": "Lucy::Analysis::CaseFolder"
    }

Therefore, we need to override Load() in the subclassed CaseFolder.  We can't
invoke the super Load() method, but that's OK -- we can go through
CaseFolder_init() to flesh out the object:

    CaseFolder*
    CaseFolder_load(CaseFolder *self, Obj *dump) {
        UNUSED_VAR(dump);
        return CaseFolder_init(self);
    }

CaseFolder_init() will invoke Normalizer_init() in a standard subclass
implementation, doing the work that invoking the superclass's Load() ordinarily
would have done.

We should also override Dump() in the subclass (just keeping the implementation
in the current CaseFolder will do) so that the dump data stays consistent and
this change has no visible impact on schema files.

Marvin Humphrey

Re: [lucy-dev] Promoting new analysis components

Posted by Nick Wellnhofer <we...@aevum.de>.

On 09/02/2012 02:49, Marvin Humphrey wrote:
> On Wed, Feb 08, 2012 at 05:04:56PM +0100, Nick Wellnhofer wrote:
>> On 23/12/2011 04:18, Marvin Humphrey wrote:
>>>     * Implement CaseFolder as a subclass of Normalizer.
>>
>> This has yet to be done. We could also mark the CaseFolder as deprecated
>> and remove it completely later.
>
> The cost for keeping CaseFolder around in its current form is high, because it
> is tied into a perlapi function and thus needs a per-host implementation. (The
> perlapi function's name broke in late Perl 5.15 releases, which was a PITA to
> troubleshoot).  In contrast, the cost for keeping CaseFolder around is small
> if it becomes a subclass of Normalizer.
>
> However, CaseFolder and Normalizer presumably have slightly different case
> mappings, thus the subclassing change is a back compat break.  It shouldn't be
> a horrible break (depending on how close the mappings are) because it will
> only affect search-time, screwing up the results only for terms which contain
> code points whose mapping has changed.
>
> I don't think we should outright remove CaseFolder without a really good
> reason, because that will force almost all of our users to change their code
> and then reindex from scratch.  But a subtle compat break might be OK,
> especially since you can update all the docs in place after upgrading and only
> suffer during a window of time from slightly degraded search results.

The original plan was to implement CaseFolder as a subclass of 
Normalizer, but I think that doesn't play well with the Dump/Load 
functions. Composition might be a better approach.

Nick

Re: [lucy-dev] Promoting new analysis components

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Wed, Feb 08, 2012 at 05:04:56PM +0100, Nick Wellnhofer wrote:
> On 23/12/2011 04:18, Marvin Humphrey wrote:
>> Now that EasyAnalyzer is in, I think we should promote the use of all the
>> improvements Nick has made to the analysis chain.
>>
>>    * Swap in EasyAnalyzer for PolyAnalyzer, Normalizer for CaseFolder, and
>>      StandardTokenizer for RegexTokenizer everywhere we can.
>
> Done.

Excellent!  Lots of great-looking commits coming through.   The revisions to
the tutorial looked sane; I figured that would be the trickiest part.

>>    * Deprecate the "language" parameter to PolyAnalyzer#new.
>>
>> By "deprecate", I mean:
>>
>>    * Open a JIRA issue so that a suitably titled entry ends up in the CHANGES
>>      file.
>>    * Mark the "language" param as "deprecated" in the PolyAnalyzer docs.
>>
>> We don't have a strong deprecation mechanism available to us right now, so I
>> think that's the best we can do.
>
> I just noticed that I removed the "language" parameter from the  
> PolyAnalyzer docs, but I can revert that part of my commit and mark it  
> as deprecated.
>
> Regarding the JIRA issue: I couldn't find a good issue type for  
> deprecations. "Task" seems the most appropriate to me.

I agree, there's no good answer, so +1 for "Task".

>> It's not important that any of these changes happen before 0.3.0.  The docs
>> changes can happen at any time, and the parameter deprecation only allows the
>> simplification of a single class (PolyAnalyzer itself).  It would also be nice
>> to switch most test cases to use the new Analyzers, but that can also happen
>> at any time.
>
> The tests have been converted, too.

Lookin' good!

>> In contrast, here are a couple changes we should *not* make prior to 0.3.0,
>> because they have index compatibility implications:
>>
>>    * Change Lucy::Simple to use EasyAnalyzer instead of PolyAnalyzer.
>
> I've done that now.

After reviewing the Lucy::Simple code, I realized that we can avoid breaking
compat with only a few extra lines.

  * If the index exists during new(), extract the schema and type from what's
    on disk.
  * Otherwise, create a new EasyAnalyzer for the type.

That way, we avoid a schema conflict crash when indexes built by Lucy::Simple
prior to 0.4.0 are read by 0.4.0 or above.

>>    * Implement CaseFolder as a subclass of Normalizer.
>
> This has yet to be done. We could also mark the CaseFolder as deprecated  
> and remove it completely later.

The cost for keeping CaseFolder around in its current form is high, because it
is tied into a perlapi function and thus needs a per-host implementation. (The
perlapi function's name broke in late Perl 5.15 releases, which was a PITA to
troubleshoot).  In contrast, the cost for keeping CaseFolder around is small
if it becomes a subclass of Normalizer.

However, CaseFolder and Normalizer presumably have slightly different case
mappings, thus the subclassing change is a back compat break.  It shouldn't be
a horrible break (depending on how close the mappings are) because it will
only affect search-time, screwing up the results only for terms which contain
code points whose mapping has changed.

I don't think we should outright remove CaseFolder without a really good
reason, because that will force almost all of our users to change their code
and then reindex from scratch.  But a subtle compat break might be OK,
especially since you can update all the docs in place after upgrading and only
suffer during a window of time from slightly degraded search results.

Marvin Humphrey

Re: [lucy-dev] Promoting new analysis components

Posted by Nick Wellnhofer <we...@aevum.de>.

On 23/12/2011 04:18, Marvin Humphrey wrote:
> Now that EasyAnalyzer is in, I think we should promote the use of all the
> improvements Nick has made to the analysis chain.
>
>    * Swap in EasyAnalyzer for PolyAnalyzer, Normalizer for CaseFolder, and
>      StandardTokenizer for RegexTokenizer everywhere we can.

Done.

>    * Deprecate the "language" parameter to PolyAnalyzer#new.
>
> By "deprecate", I mean:
>
>    * Open a JIRA issue so that a suitably titled entry ends up in the CHANGES
>      file.
>    * Mark the "language" param as "deprecated" in the PolyAnalyzer docs.
>
> We don't have a strong deprecation mechanism available to us right now, so I
> think that's the best we can do.

I just noticed that I removed the "language" parameter from the 
PolyAnalyzer docs, but I can revert that part of my commit and mark it 
as deprecated.

Regarding the JIRA issue: I couldn't find a good issue type for 
deprecations. "Task" seems the most appropriate to me.

> It's not important that any of these changes happen before 0.3.0.  The docs
> changes can happen at any time, and the parameter deprecation only allows the
> simplification of a single class (PolyAnalyzer itself).  It would also be nice
> to switch most test cases to use the new Analyzers, but that can also happen
> at any time.

The tests have been converted, too.

> In contrast, here are a couple changes we should *not* make prior to 0.3.0,
> because they have index compatibility implications:
>
>    * Change Lucy::Simple to use EasyAnalyzer instead of PolyAnalyzer.

I've done that now.

>    * Implement CaseFolder as a subclass of Normalizer.

This has yet to be done. We could also mark the CaseFolder as deprecated 
and remove it completely later.

Nick