Posted to solr-user@lucene.apache.org by Laurent Gilles <lg...@sollan.com> on 2007/09/11 13:27:31 UTC

Synonyms expressions sens

Hi,

I'm currently facing a relevancy issue with multi-word synonyms. Here is a
test case.

Given the following synonym definition:

--------------------------------------------------------------------

capital punishment, death sentence, death penalty

--------------------------------------------------------------------

 

and a synonym filter configured with expand=true at index time, the document:

--------------------------------------------------------------------

The prisoner escaped just before the death sentence had been set.

--------------------------------------------------------------------

 

will be indexed as:

--------------------------------------------------------------------

The prisoner escaped just before the (death sentence | death penalty |
capital punishment) had been set.

--------------------------------------------------------------------

 

Now, if a user searches for "capital" (which could mean 'Paris, capital of
France'), the query matches the token "capital" that synonym expansion added
to the document at index time, which makes no sense.

I expected that I would have to give the entire expression "capital
punishment" to match a document that contains "death sentence", and not
only a part of the expression.
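
The effect can be reproduced with a small Python model of index-time
expansion. This is only a sketch, not Solr's analyzer: `expand_tokens` and
`SYNONYMS` are names made up for the illustration, and tokenization is
reduced to lowercasing and splitting on whitespace.

```python
# Simplified model of index-time synonym expansion (expand=true).
# Not Solr code: it only shows which words end up in the index.

SYNONYMS = [
    ["capital punishment", "death sentence", "death penalty"],
]


def expand_tokens(text, synonym_groups):
    """Return the set of words indexed for `text` when every matched
    phrase is expanded to all members of its synonym group."""
    tokens = text.lower().replace(".", "").split()
    indexed = set(tokens)
    for group in synonym_groups:
        phrases = [phrase.split() for phrase in group]
        for phrase in phrases:
            n = len(phrase)
            # Look for the phrase as a contiguous run of tokens.
            if any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1)):
                # Phrase found: add the words of every synonym in the group.
                for other in phrases:
                    indexed.update(other)
    return indexed


doc = "The prisoner escaped just before the death sentence had been set."
indexed = expand_tokens(doc, SYNONYMS)

print("capital" in indexed)  # True: the single word now matches the document
print("paris" in indexed)    # False
```

Once the words of every synonym are injected individually, nothing in the
index records that "capital" only arrived as part of "capital punishment".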

 

This seems to be normal Solr behaviour, but it gives me a relevance problem
in the results, since a word contained in an expression can have a
completely different meaning from the same word on its own.

 

Is there a trick or a way to match a complete synonym expression, and not
the individual words that the expansion has added to documents?

 

Thanks,

Laurent


RE: Synonyms expressions sens

Posted by Laurent Gilles <lg...@sollan.com>.
Thanks for the advice, Grant.

I tried putting '_' into the synonyms, but step by step I realised that it
was becoming more and more intrusive into the Solr source code.
I have found another solution, which I would like to describe here in order
to get outside advice and perhaps have someone point out bugs or side
effects I have not seen.
I do not touch the source code; I only change my synonym.txt and the way
I manage indexing in schema.xml.

Given a synonym list like:

capital punishment, death sentence, death penalty
10, dix, X
17, Dix sept, XVII
18, dix huit, XVIII
Rock, jazz, modern music => modern music
Coluche, colucci => colucci
Coluche, coluci => coluci
Coluche, colucchi => colucchi
coluche, michel colucci => michel colucci

I faced two major problems with index-time synonym expansion (expand=true):
- Synonym families can get mixed up ("10, dix, X" with "17, Dix sept, XVII"
or "18, dix huit, XVIII").
- A query can match unexpected results because of language ambiguity and,
more generally, because expansion puts new tokens into the document that
will be matched at query time (e.g. the query "capital" will match a
document containing "death sentence").

So here is what I've done.

A single line in the synonym file can be seen as a family of synonyms, i.e.
interchangeable terms and expressions.
Instead of injecting into the document, at index time, every possibility
found in the synonym list for a single match, I changed the list to give an
ID to each synonym family. The index-time synonym filter is no longer
configured with expand=true but with expand=false, so that a matched term is
replaced by the ID of its family.

Then, at query time, I added the synonym filter again with expand=false so
that synonyms matched in the query are replaced by their corresponding ID.

Here is my synonym list, used with expand=false:

SynFamily1, capital punishment, death sentence, death penalty
SynFamily2, 10, dix, x
SynFamily89, 17, xvii, dix sept
SynFamily112, 18, xviii, dix huit
rock, modern music => HierFamily2017
jazz, modern music => HierFamily2014
coluche, collucci => HierFamily1537
coluche, colluche => HierFamily1538
coluche, colucchi => HierFamily1541
coluche, colucci => HierFamily1542
coluche, coluchi => HierFamily1543
coluche, coluci => HierFamily1544

It seems to work fine: a query for "capital" no longer matches a document
that originally contained "death sentence", since the synonym replacement is
limited to the single-token ID "SynFamily1". To match such a document, a
query like "capital punishment" must be made.

The synonym mixing also seems to have disappeared (a document containing
"dix huit" will not match the query "10").
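
The family-ID scheme described above can be modelled in a few lines of
Python. This is a sketch under simplifying assumptions, not Solr's
implementation: `analyze` and `matches` are hypothetical helpers, the same
analysis is applied to documents and queries, phrases are matched longest
first, and only three of the families listed above are included.

```python
# Simplified model of the family-ID scheme (expand=false on both sides).
# Not Solr code: the analyzer below is a stand-in written for illustration.

FAMILIES = {
    "SynFamily1": ["capital punishment", "death sentence", "death penalty"],
    "SynFamily2": ["10", "dix", "x"],
    "SynFamily112": ["18", "xviii", "dix huit"],
}

# Map each phrase (as a tuple of words) to the ID of its family.
PHRASE_TO_ID = {
    tuple(phrase.split()): family_id
    for family_id, phrases in FAMILIES.items()
    for phrase in phrases
}
MAX_LEN = max(len(phrase) for phrase in PHRASE_TO_ID)


def analyze(text):
    """Replace every synonym phrase with its family ID, longest match first."""
    words = text.lower().replace(".", "").split()
    out, i = [], 0
    while i < len(words):
        for n in range(min(MAX_LEN, len(words) - i), 0, -1):
            family_id = PHRASE_TO_ID.get(tuple(words[i:i + n]))
            if family_id:
                out.append(family_id)
                i += n
                break
        else:
            # No phrase starts at this position: keep the word unchanged.
            out.append(words[i])
            i += 1
    return out


def matches(query, document):
    """True if every analyzed query token occurs in the analyzed document."""
    return set(analyze(query)) <= set(analyze(document))


doc = "The prisoner escaped just before the death sentence had been set."
print(matches("capital", doc))                # False
print(matches("capital punishment", doc))     # True
print(matches("10", "chapitre dix huit"))     # False
print(matches("18", "chapitre dix huit"))     # True
```

In this model the original words are replaced rather than kept, so a query
for "death" alone no longer matches the document either. Whether that is
acceptable depends on the application.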

My question is: have I missed something? The solution seems too simple,
and since I started working on full-text search engines I have always run
into side effects after changing the logic, so I'm a little sceptical... :)

Voila !

Thanks for your time

Laurent



-----Original Message-----
From: Grant Ingersoll [mailto:gsingers@apache.org]
Sent: Tuesday, 11 September 2007 14:53
To: solr-user@lucene.apache.org
Subject: Re: Synonyms expressions sens

Inline...
On Sep 11, 2007, at 7:27 AM, Laurent Gilles wrote:

> Is there a trick or a way to match a complete synonym expression, and not
> the individual words that the expansion has added to documents?
>

Ah, the ambiguity of language :-)

I can think of a couple of different suggestions to try:
1. Index your phrase synonyms as a single token, such as  
capital_punishment, death_penalty, etc. This requires that you be  
able to recognize phrases during indexing and querying, since you  
will want to transform capital punishment in your documents to  
capital_punishment.  Alternatively, you could create a query like  
("capital punishment" OR capital_punishment)

2. On the query side, you could produce queries like: capital AND  
-"capital punishment"

I don't know your system, but I suppose there is always the chance
that a user searching for capital really does want all occurrences of
capital (assuming no other context), which may cause problems.

HTH,
Grant



Re: Synonyms expressions sens

Posted by Grant Ingersoll <gs...@apache.org>.
Inline...
On Sep 11, 2007, at 7:27 AM, Laurent Gilles wrote:

> Is there a trick or a way to match a complete synonym expression, and not
> the individual words that the expansion has added to documents?
>

Ah, the ambiguity of language :-)

I can think of a couple of different suggestions to try:
1. Index your phrase synonyms as a single token, such as  
capital_punishment, death_penalty, etc. This requires that you be  
able to recognize phrases during indexing and querying, since you  
will want to transform capital punishment in your documents to  
capital_punishment.  Alternatively, you could create a query like  
("capital punishment" OR capital_punishment)

2. On the query side, you could produce queries like: capital AND  
-"capital punishment"

I don't know your system, but I suppose there is always the chance
that a user searching for capital really does want all occurrences of
capital (assuming no other context), which may cause problems.
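
Both suggestions can be sketched as text preprocessing outside Solr. The
Python below is only an illustration under assumptions: `PHRASES`,
`join_phrases` and `exclude_phrase_query` are hypothetical names, the phrase
list is hard-coded, and the generated query simply follows the syntax
written in suggestion 2.

```python
import re

# Hypothetical phrase list; in practice it would come from the synonym file.
PHRASES = ["capital punishment", "death sentence", "death penalty"]

# Longest phrases first so that longer matches win over shorter ones.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(p) for p in sorted(PHRASES, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def join_phrases(text):
    """Suggestion 1: rewrite each known phrase as one underscore-joined token.
    Apply it to documents before indexing and to queries before searching."""
    return _PATTERN.sub(lambda m: m.group(0).lower().replace(" ", "_"), text)


def exclude_phrase_query(term):
    """Suggestion 2: search for `term` but exclude the known phrases that
    contain it as a word."""
    phrases = [p for p in PHRASES if term.lower() in p.split()]
    return " AND ".join([term] + ['-"%s"' % p for p in phrases])


print(join_phrases("The prisoner escaped just before the death sentence had been set."))
print(join_phrases("Paris, capital of France"))
print(exclude_phrase_query("capital"))
```

With single tokens such as death_sentence in the index, a synonym line like
"capital_punishment, death_sentence, death_penalty" no longer injects the
bare word "capital" into documents.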

HTH,
Grant