Posted to user@lucy.apache.org by Gerald Richter <ri...@ecos.de> on 2015/01/17 15:55:14 UTC

[lucy-user] Running query string thru Analyzer?

Hi,

 
I have defined a field in the following way:

 
    my $tokenizer    = Lucy::Analysis::StandardTokenizer->new;
    my $normalizer   = Lucy::Analysis::Normalizer->new (strip_accents => 1, case_fold => 1) ;
    my $field_analyzer = Lucy::Analysis::PolyAnalyzer->new
                            (
                            analyzers => [ $tokenizer, $normalizer ],
                            );
    my $field_type  = Lucy::Plan::FullTextType->new (analyzer => $field_analyzer) ;
    $schema->spec_field( name => 'option_ndx',  type => $field_type );

 
When I now run a query (either with a TermQuery or a WildcardQuery), and the indexed document was "Foo baß", it works as long as I query for "foo", but not when I query for "Foo" or "baß". So I guess I have to run the query string thru the same analyzer as the indexer does.

 
The question is how can I do this or is Lucy able to do this for me?

 
Thanks & Regards

 
Gerald

 
P.S. I am using Lucy 0.42


Re: [lucy-user] Running query string thru Analyzer?

Posted by Nick Wellnhofer <we...@aevum.de>.
On 17/01/2015 18:19, Marvin Humphrey wrote:
> On Sat, Jan 17, 2015 at 8:58 AM, Nick Wellnhofer <we...@aevum.de> wrote:
>
>> Oops, I just saw that queries for "Foo" don't work either, so scratch that.
>> Can you show us your indexing and querying code or even a self-contained
>> test case?
>
> The issue is probably that TermQuery's constructor takes exactly what you give
> it, which may not match what's in the index.  In this case, `foo` is in the
> index, so queries for `Foo` don't work.

Ah yes, of course. Sorry for the noise.

Another approach to manually analyze fields for a TermQuery would be:

     my $type = $schema->fetch_type('option_ndx');
     # get_analyzer only works for FullTextType.
     my $analyzer = $type->get_analyzer;
     my $tokens = $analyzer->split('Foo');
     # A TermQuery matches a single term, so check the size of the
     # returned array before using its first element.
     die "Expected exactly one token" unless @$tokens == 1;
     my $term_query = Lucy::Search::TermQuery->new(
         field => 'option_ndx',
         term  => $tokens->[0],
     );

Some of this is explained in the QueryObjects tutorial and the 
CustomQueryParser cookbook entry:

 
https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Docs/Tutorial/QueryObjects.pod
 
https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Docs/Cookbook/CustomQueryParser.pod

But unfortunately, the get_analyzer method of FullTextType is undocumented. I 
think this should be fixed.

Nick


Re: [lucy-user] Running query string thru Analyzer?

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Sat, Jan 17, 2015 at 8:58 AM, Nick Wellnhofer <we...@aevum.de> wrote:

> Oops, I just saw that queries for "Foo" don't work either, so scratch that.
> Can you show us your indexing and querying code or even a self-contained
> test case?

The issue is probably that TermQuery's constructor takes exactly what you give
it, which may not match what's in the index.  In this case, `foo` is in the
index, so queries for `Foo` don't work.

A QueryParser will probably give Gerald what he wants, because it will apply
the appropriate Analyzer.

  use Lucy;
  use Data::Dumper qw( Dumper );
  my $searcher = Lucy::Search::IndexSearcher->new(index => '/path/to/index');
  my $qparser = Lucy::Search::QueryParser->new(
    schema => $searcher->get_schema,
    fields => ['content'],  # optional, try without this.
  );
  my $query = $qparser->parse("Foo");
  warn Dumper($query->dump);

Marvin Humphrey
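
The mismatch described above (a raw term looked up as given, against an index
that holds analyzed terms) can be sketched without Lucy at all. The following
is only an approximation: the analyze() helper and the %index hash are
hypothetical stand-ins, and Perl's built-in fc() is used in place of Lucy's
StandardTokenizer and Normalizer, whose exact output may differ.

```perl
use strict;
use warnings;
use utf8;            # the source below contains a literal "ß"
use feature 'fc';    # full Unicode case folding, Perl 5.16+

# Hypothetical stand-in for the indexing-time analysis chain: split on
# whitespace, then case-fold each token. This approximates, but is not,
# Lucy's StandardTokenizer + Normalizer.
sub analyze {
    my ($text) = @_;
    return [ map { fc($_) } split ' ', $text ];
}

# Toy "index": the set of analyzed terms of the document.
my %index = map { $_ => 1 } @{ analyze('Foo baß') };

# A raw term is looked up exactly as given ...
my $raw_hit = exists $index{'Foo'} ? 1 : 0;

# ... while an analyzed term goes through the same folding as the document.
my $analyzed_hit = exists $index{ analyze('Foo')->[0] } ? 1 : 0;

print "raw: $raw_hit, analyzed: $analyzed_hit\n";
```

In this toy version the raw lookup for "Foo" misses and the analyzed lookup
hits. Note that Perl's fc() folds "ß" to "ss", so the stored term differs from
the original text in more than letter case.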

Re: [lucy-user] Running query string thru Analyzer?

Posted by Nick Wellnhofer <we...@aevum.de>.
On 17/01/2015 17:50, Nick Wellnhofer wrote:
> Lucy's Query classes do that automatically for you. My guess is that either
> your indexed document or your query term contains a "ß" character in the wrong
> encoding. The most common reasons are:

Oops, I just saw that queries for "Foo" don't work either, so scratch that. 
Can you show us your indexing and querying code or even a self-contained test 
case?

Nick


Re: [lucy-user] Running query string thru Analyzer?

Posted by Nick Wellnhofer <we...@aevum.de>.
On 17/01/2015 15:55, Gerald Richter wrote:
> When I now run a query (either with a TermQuery or a WildcardQuery), and the indexed document was "Foo baß", it works as long as I query for "foo", but not when I query for "Foo" or "baß". So I guess I have to run the query string thru the same analyzer as the indexer does.
>
> The question is how can I do this or is Lucy able to do this for me?

Lucy's Query classes do that automatically for you. My guess is that either 
your indexed document or your query term contains a "ß" character in the wrong 
encoding. The most common reasons are:

- UTF-8 string in source code without "use utf8;".
- String read from UTF-8 file without setting the file encoding
   or without decoding manually.

If a search for "ba\xC3\x9F" works, then the problem is with the indexed 
document. If a search for "ba\xDF" works, the problem is with your query term.

Nick
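
The two literals in the test above are the same word in two forms:
"ba\xC3\x9F" is the UTF-8 byte sequence for "baß", and "ba\xDF" is the decoded
character string. A minimal sketch of the difference, assuming only the core
Encode module (the variable names are illustrative):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 encoded bytes: "ß" takes two bytes, so the string has length 4.
my $bytes = "ba\xC3\x9F";

# Decoded character string: "ß" is the single character U+00DF, length 3.
my $chars = "ba\xDF";

# Decoding the bytes yields the character string ...
my $decoded = decode('UTF-8', $bytes);

# ... and encoding the characters yields the bytes again.
my $encoded = encode('UTF-8', $chars);

printf "bytes:   length %d\n", length $bytes;
printf "decoded: length %d\n", length $decoded;
print  "round trip ok\n" if $decoded eq $chars && $encoded eq $bytes;
```

A string that skipped the decoding step (for example, read from a UTF-8 file
without an encoding layer) looks like $bytes here, while a properly decoded
one looks like $chars.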


Re: [lucy-user] Running query string thru Analyzer?

Posted by Peter Karman <pe...@peknet.com>.
On 1/17/15 8:55 AM, Gerald Richter wrote:
> Hi,
> 
>  
> I have defined a field in the following way:
> 
>  
>     my $tokenizer    = Lucy::Analysis::StandardTokenizer->new;
>     my $normalizer   = Lucy::Analysis::Normalizer->new (strip_accents => 1, case_fold => 1) ;
>     my $field_analyzer = Lucy::Analysis::PolyAnalyzer->new
>                             (
>                             analyzers => [ $tokenizer, $normalizer ],
>                             );
>     my $field_type  = Lucy::Plan::FullTextType->new (analyzer => $field_analyzer) ;
>     $schema->spec_field( name => 'option_ndx',  type => $field_type );
> 
>  
> When I now run a query (either with a TermQuery or a WildcardQuery), and the indexed document was "Foo baß", it works as long as I query for "foo", but not when I query for "Foo" or "baß". So I guess I have to run the query string thru the same analyzer as the indexer does.
> 
>  
> The question is how can I do this or is Lucy able to do this for me?
> 

In addition to the good advice elsewhere on this thread, you can use the
Search::Query Lucy dialect to parse and analyze plain strings
appropriately, with code like this:

----------------------------------
use Lucy;
use Search::Query;

my ($idx, $query) = get_index_name_and_query();

my $searcher = Lucy::Search::IndexSearcher->new( index => $idx );
my $schema   = $searcher->get_schema();

# build field mapping
my %fields;
for my $field_name ( @{ $schema->all_fields() } ) {
    $fields{$field_name} = {
        type     => $schema->fetch_type($field_name),
        analyzer => $schema->fetch_analyzer($field_name),
    };
}

my $query_parser = Search::Query->parser(
    dialect        => 'Lucy',
    croak_on_error => 1,
    default_field  => 'foo',  # applied to "bare" terms with no field
    fields         => \%fields
);

my $parsed_query = $query_parser->parse($query);
my $lucy_query   = $parsed_query->as_lucy_query();
my $hits         = $searcher->hits( query => $lucy_query );

--------------------------------



Something similar is performed in Dezi::Lucy::Searcher:
https://metacpan.org/source/KARMAN/Dezi-App-0.013/lib/Dezi/Lucy/Searcher.pm#L124

See
https://metacpan.org/pod/Search::Query::Dialect::Lucy



-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com