Posted to user@lucy.apache.org by Anil Pachuri <an...@yahoo.com> on 2014/03/02 23:18:31 UTC
[lucy-user] synonym terms
Hi there,
How should one handle synonym terms in Lucy? I wonder if expanding the query (e.g. terms separated by 'OR') is the best way to do this. Is there a built-in function/sample code available in Lucy that shows how to handle synonym terms at the index level? Please advise.
TIA!
AP
Re: [lucy-user] synonym terms
Posted by Anil Pachuri <an...@yahoo.com>.
Very helpful and clear reply. Thanks a lot, Peter.
On Monday, March 3, 2014 2:22 PM, Peter Karman <pe...@peknet.com> wrote:
On 3/2/14 4:18 PM, Anil Pachuri wrote:
> Hi there,
>
> How should one handle synonym terms in Lucy? I wonder if expanding
> the query (e.g. terms separated by 'OR') is the best way to do this.
> Is there a built-in function/sample code available in Lucy that shows
> how to handle synonym terms at the index level? Please advise.
>
As you allude to, there are two ways to solve the problem: at index time,
or at search time.
There are trade-offs to both; I prefer to do as much at index time as
possible, for a couple of reasons. One, stuffing the index with extra
data at index time means the search-time code doesn't have to work
harder (running a long OR'd string, e.g.). Two, it makes debugging
easier IME, because standard searching code gets the same results as
customized searching code. E.g., you can dump a lexicon to see exactly
what is in the index, synonyms included. OTOH, see the caveats below.
I don't know of any examples in the wild for doing this at index time,
but I imagine something like this would work:
  my %doc = get_doc_to_index();
  # should analyze like Lucy does, so the terms match what gets indexed
  my @terms = get_terms_from_doc(\%doc);
  my %synonyms;
  for my $term (@terms) {
      for my $syn (get_synonyms($term)) {
          $synonyms{$syn}++;    # hash keys avoid duplicates
      }
  }
  # make sure your schema has a 'synonyms' field defined
  $doc{synonyms} = join ' ', keys %synonyms;
  add_to_indexer(\%doc);
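The helper functions in that snippet are placeholders. As a rough, self-contained sketch of what they might look like (the %THESAURUS data and the naive tokenizer are assumptions for illustration only; real code should run the text through the same analyzer the schema uses, so that stemming and case folding match what Lucy indexes):

```perl
use strict;
use warnings;

# Hypothetical thesaurus; a real one might be loaded from a file or database.
my %THESAURUS = (
    car   => [qw(automobile auto)],
    quick => [qw(fast rapid)],
);

# Crude stand-in for Lucy's analysis chain: lowercase every field value and
# split on non-word characters. No stemming or stopword removal happens here.
sub get_terms_from_doc {
    my ($doc) = @_;
    my @terms;
    for my $value ( values %$doc ) {
        push @terms, grep { length } split /\W+/, lc $value;
    }
    return @terms;
}

# Return the synonyms for one term, or an empty list if there are none.
sub get_synonyms {
    my ($term) = @_;
    return @{ $THESAURUS{$term} || [] };
}

my %doc = ( title => 'Quick car', body => 'A car that is quick.' );

my %synonyms;
for my $term ( get_terms_from_doc( \%doc ) ) {
    $synonyms{$_}++ for get_synonyms($term);    # hash keys avoid duplicates
}

# Sorted only so the output is stable; the order does not matter to the index.
$doc{synonyms} = join ' ', sort keys %synonyms;
print "$doc{synonyms}\n";
```

The resulting %doc, with its extra 'synonyms' field, is what would be handed to the indexer.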
The caveats here (and anytime you do this at index-time) include:
* snipping/highlighting will be strange, since a match in the
'synonyms' field will have zero context.
* you're increasing the size of your index with content that doesn't
actually exist in your document corpus. That can have unforeseen
usability impact, depending on your application.
* the 'synonyms' field is "virtual" or "private" so you'll have to
decide whether you want to expose it as part of your public interface or
not.
Otherwise, if you do this at search-time with query expansion, I would
expect a small (maybe not measurable) performance hit and more
complicated search code. You could use the Search::Query term_expander
feature[0].
  use Search::Query;

  my $parser = Search::Query->parser(
      dialect       => 'Lucy',
      term_expander => sub {
          my ($term, $field) = @_;
          return ($term) if ref $term;    # skip ranges
          return ( get_array_of_synonyms_for_term($term), $term );
      },
  );
  my $query      = $parser->parse($str);
  my $lucy_query = $query->as_lucy_query();
  my $hits       = $lucy_searcher->hits( query => $lucy_query );
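To make the effect of that expansion concrete, here is a minimal sketch in plain Perl, with no Lucy or Search::Query involved, of the kind of OR'd query string it amounts to. The %THESAURUS data and the expand_query_string() helper are made up for illustration, and the sketch handles only bare whitespace-separated terms (no phrases, fields, or operators in the input):

```perl
use strict;
use warnings;

# Hypothetical synonym data.
my %THESAURUS = ( car => [qw(automobile auto)] );

# Stand-in for the synonym lookup used in the term_expander callback.
sub get_array_of_synonyms_for_term {
    my ($term) = @_;
    return @{ $THESAURUS{ lc $term } || [] };
}

# Turn each term that has synonyms into a parenthesized OR group and leave
# the other terms untouched.
sub expand_query_string {
    my ($str) = @_;
    my @clauses;
    for my $term ( split ' ', $str ) {
        my @alts = ( $term, get_array_of_synonyms_for_term($term) );
        push @clauses,
            @alts > 1 ? '(' . join( ' OR ', @alts ) . ')' : $term;
    }
    return join ' ', @clauses;
}

print expand_query_string('red car') . "\n";
# prints: red (car OR automobile OR auto)
```

The more synonyms each term has, the longer that OR'd string gets, which is the search-time cost mentioned above.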
A third way to approach the problem, though it doesn't directly answer
the question you posed, is to treat the synonyms as 'suggestions' for
further searches, rather than searching for them automatically.
Something like LucyX::Suggester[1] could be extended to include synonyms
in addition to spellings.
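As a rough illustration of the suggestion idea only (this does not use LucyX::Suggester; the %THESAURUS data and the suggest_queries() helper are hypothetical), one could build alternative query strings by swapping in one synonym at a time and offer them to the user instead of running them automatically:

```perl
use strict;
use warnings;

# Hypothetical synonym data.
my %THESAURUS = (
    car   => [qw(automobile auto)],
    quick => [qw(fast)],
);

# Return alternative query strings, each with exactly one term replaced by
# one of its synonyms.
sub suggest_queries {
    my ($str) = @_;
    my @terms = split ' ', $str;
    my @suggestions;
    for my $i ( 0 .. $#terms ) {
        for my $syn ( @{ $THESAURUS{ lc $terms[$i] } || [] } ) {
            my @alt = @terms;
            $alt[$i] = $syn;
            push @suggestions, join ' ', @alt;
        }
    }
    return @suggestions;
}

print "Did you mean: $_\n" for suggest_queries('quick car');
```

Each suggestion can then be run as an ordinary search if the user picks it, so neither the index nor the default query changes.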
[0] https://metacpan.org/pod/Search::Query::Parser#term_expander
[1] https://metacpan.org/pod/LucyX::Suggester
--
Peter Karman . http://peknet.com/ . peter@peknet.com