Posted to user@lucy.apache.org by Anil Pachuri <an...@yahoo.com> on 2014/03/02 23:18:31 UTC

[lucy-user] synonym terms

Hi there,

How should one handle synonym terms in Lucy? I wonder if expanding the query (e.g. terms separated by 'OR') is the best way to do this. Is there a built-in function/sample code available in Lucy that shows how to handle synonym terms at the index level? Please advise.

TIA!
AP

Re: [lucy-user] synonym terms

Posted by Anil Pachuri <an...@yahoo.com>.
Very helpful and clear reply. Thanks a lot, Peter.



On Monday, March 3, 2014 2:22 PM, Peter Karman <pe...@peknet.com> wrote:
 
On 3/2/14 4:18 PM, Anil Pachuri wrote:

> Hi there,
>
> How should one handle synonym terms in Lucy? I wonder if expanding
> the query (e.g. terms separated by 'OR') is the best way to do this.
> Is there a built-in function/sample code available in Lucy that shows
> how to handle synonym terms at the index level? Please advise.
>

As you allude, there are two ways to solve the problem: at index time, 
or at search time.

There are trade-offs to both; I prefer to do as much at index time as 
possible, for a couple of reasons. One, stuffing the index with extra 
data at index time means the search-time code doesn't have to work 
harder (running a long OR'd string, e.g.). Two, it makes debugging 
easier IME, because standard searching code gets the same results as 
customized searching code. E.g., you can dump a lexicon to see exactly 
what is in the index, synonyms included. OTOH, see the caveats below.

I don't know of any examples in the wild for doing this at index time, 
but I imagine something like this would work:

  my %doc   = get_doc_to_index();
  my @terms = get_terms_from_doc(\%doc);  # should analyze like Lucy does
  my %synonyms;
  for my $term (@terms) {
      for my $syn (get_synonyms($term)) {
          $synonyms{$syn}++;  # avoid duplicates
      }
  }
  # make sure your schema has a 'synonyms' field defined
  $doc{synonyms} = join ' ', keys %synonyms;
  add_to_indexer(\%doc);
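
For anyone who wants to try the idea without an index handy, here is a minimal, self-contained sketch of the helpers used above. Everything in it is an assumption for illustration: the %SYNONYMS table, the 'content' field name, the synonyms_field_for() wrapper, and the crude lowercase-and-split tokenizer, which only stands in for whatever analyzer your Lucy schema really uses.

```perl
use strict;
use warnings;

# Hypothetical synonym table; a real one might be loaded from a thesaurus file.
my %SYNONYMS = (
    car   => [qw(automobile auto)],
    quick => [qw(fast rapid)],
);

# Crude stand-in for Lucy's analysis chain (an assumption, not Lucy's
# behavior): lowercase the text and split on non-word characters.
sub get_terms_from_doc {
    my ($doc) = @_;
    return grep { length } split /\W+/, lc $doc->{content};
}

# Return the known synonyms for one term, or an empty list.
sub get_synonyms {
    my ($term) = @_;
    return @{ $SYNONYMS{$term} || [] };
}

# Build the value for the 'synonyms' field: de-duplicated via hash keys,
# sorted so the output is stable from run to run.
sub synonyms_field_for {
    my ($doc) = @_;
    my %seen;
    for my $term ( get_terms_from_doc($doc) ) {
        $seen{$_}++ for get_synonyms($term);
    }
    return join ' ', sort keys %seen;
}

my %doc = ( content => 'A quick car. Another car!' );
$doc{synonyms} = synonyms_field_for( \%doc );
print "$doc{synonyms}\n";    # prints "auto automobile fast rapid"
```

Note that "car" appears twice in the document but its synonyms land in the field only once. The sort is not required for indexing; it just makes the field easier to eyeball when debugging.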


The caveats here (and anytime you do this at index-time) include:

  * snipping/highlighting will be strange, since a match in the 
'synonyms' field will have zero context.

  * you're increasing the size of your index with content that doesn't 
actually exist in your document corpus. That can have unforeseen 
usability impact, depending on your application.

  * the 'synonyms' field is "virtual" or "private" so you'll have to 
decide whether you want to expose it as part of your public interface or 
not.


Otherwise, if you do this at search-time with query expansion, I would 
expect a small (maybe not measurable) performance hit and more 
complicated search code. You could use the Search::Query term_expander 
feature[0].

  my $parser = Search::Query->parser(
      dialect => 'Lucy',
      term_expander => sub {
          my ($term, $field) = @_;
          return ($term) if ref $term;    # skip ranges
          return ( get_array_of_synonyms_for_term($term), $term );
      },
  );
  my $query      = $parser->parse($str);
  my $lucy_query = $query->as_lucy_query();
  my $hits       = $lucy_searcher->hits( query => $lucy_query );
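
To make the expansion step concrete, here is a small pure-Perl sketch of what such a callback hands back for a single term. The %SYNONYMS table and the or_clause() helper are hypothetical, and the OR string is only there to show the shape of the expanded query; it is not meant to reflect how Search::Query itself assembles or stringifies the result.

```perl
use strict;
use warnings;

# Hypothetical synonym table for illustration.
my %SYNONYMS = ( car => [qw(automobile auto)] );

sub get_array_of_synonyms_for_term {
    my ($term) = @_;
    return @{ $SYNONYMS{ lc $term } || [] };
}

# Same shape as the term_expander callback above: pass references
# (ranges) through untouched, otherwise return synonyms plus the term.
sub expand_term {
    my ( $term, $field ) = @_;
    return ($term) if ref $term;
    return ( get_array_of_synonyms_for_term($term), $term );
}

# Illustration only: render the expanded terms as a hand-written OR clause.
sub or_clause {
    my ($term) = @_;
    my @terms = expand_term($term);
    return @terms > 1 ? '(' . join( ' OR ', @terms ) . ')' : $terms[0];
}

print or_clause('car'),  "\n";    # prints "(automobile OR auto OR car)"
print or_clause('boat'), "\n";    # prints "boat"
```

A term with no known synonyms comes back unchanged, so the expander is safe to run on every term in the query.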


A third way to approach the problem, though it doesn't directly answer 
the question you posed, is to treat the synonyms as 'suggestions' for 
further searches, rather than searching for them automatically. 
Something like LucyX::Suggester[1] could be extended to include synonyms 
in addition to spellings.


[0] https://metacpan.org/pod/Search::Query::Parser#term_expander
[1] https://metacpan.org/pod/LucyX::Suggester

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com
