
Posted to solr-user@lucene.apache.org by anuvenk <an...@hotmail.com> on 2008/01/23 21:06:54 UTC

solr synonyms behaviour

I need to understand this synonym behaviour

I have this synonym
divorce mediation,alternative dispute resolution

When I run the query with debug on, this is the parsedquery_toString I see:
(((text:divorc^0.8 | name:divorc^2.0)~0.01 (text:mediat^0.8 |
name:mediat^2.0)~0.01)~2) (text:"(divorc altern) (disput mediat)
resolut"~5^0.8 | name:"(divorc altern) (disput mediat) resolut"~5^2.0)~0.01

I understand how it's grouping the synonyms, like this:
(divorc altern) (disput mediat) resolut

What I don't understand is how it does the matching.

Does it mean it will find all matches with either of the words (divorc
altern), either of the words (disput mediat), and/or resolut?

I have the synonym filter only at query time because I can't re-index the
data (or a portion of it) every time I add a synonym, plus a couple of other
reasons.

Could someone please explain how the matching works in this case? Thanks.
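
For what it's worth, the phrase part of that parsed query can be read as a
list of positions, each holding alternative terms. The sketch below is a toy
model in Python, not Lucene code: phrase_matches is a hypothetical helper, the
tokens are assumed to be already stemmed, and both the ~5 slop and the
separate per-term clause of the query are ignored.

```python
# Toy model of the phrase clause "(divorc altern) (disput mediat) resolut".
# Each entry is one position; any term in the set is accepted there.
PHRASE_POSITIONS = [
    {"divorc", "altern"},
    {"disput", "mediat"},
    {"resolut"},
]


def phrase_matches(doc_tokens, positions=PHRASE_POSITIONS):
    """Return True if some run of consecutive tokens fits every position in order."""
    width = len(positions)
    for start in range(len(doc_tokens) - width + 1):
        window = doc_tokens[start:start + width]
        if all(token in allowed for token, allowed in zip(window, positions)):
            return True
    return False


# One accepted term per position, in order.
print(phrase_matches(["divorc", "disput", "resolut"]))
# Only two tokens, so the third position can never be filled.
print(phrase_matches(["divorc", "mediat"]))
```

In this model a document needs one term from each position, in order, so
"divorc mediat" on its own does not satisfy the phrase clause.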

-- 
View this message in context: http://www.nabble.com/solr-synonyms-behaviour-tp15051211p15051211.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: solr synonyms behaviour

Posted by Guillaume Smet <gu...@gmail.com>.

Chris,

On Sat, Jan 26, 2008 at 2:30 AM, Chris Hostetter
<ho...@fucit.org> wrote:
> : I have the synonym filter only at query time coz i can't re-index data (or
> : portion of data) everytime i add a synonym and a couple of other reasons.
>
> Use cases like yours will *never* work as a query time synonym ... hence
> all of the information about multi-word synonyms and the caveats about
> using them in the wiki...
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#SynonymFilter

Considering these problems, it might be better to move the
SynonymFilter from type="query" to type="index" in the example file.
This file is very often used as a reference.

Or perhaps we should just mention potential problems and a link to the
documentation in the existing comment: "in this example, we will only
use synonyms at query time".

Thoughts?

-- 
Guillaume

RE: solr synonyms behaviour

Posted by Laurent Gilles <lg...@sollan.com>.

Hi Swarag,

Indeed, we were faced with a problem with what we called Hierarchy synonym
search. I think it is a little different from what you are looking for, but
who knows, maybe it could lead you to a solution for your problem too.

So here was our need:
Let's say we have this hierarchy of words

              +---- Jazz
              |  
Modern music -+---- Rock
              |
              +---- Hip Hop

So the term "Modern music" covers many music styles; in our case, Jazz,
Rock and Hip Hop.

We need a search that behaves this way:

 - If a user searches for Jazz, it should return only documents containing
Jazz, not documents containing Rock, Hip Hop or Modern Music.
 - If a user searches for Modern Music, it should return any document in the
hierarchy, including of course documents containing Modern Music.

To be able to do this, we kept our approach of assigning identifiers to
synonyms, but that wasn't enough to achieve this goal, so we ended up making
a field specifically for hierarchy search, with a synonym filter pointing to
two different files for index and query time.

Here is the file for index time:
jazz => HIERARCHY_1
rock => HIERARCHY_2
hip hop => HIERARCHY_3

Here is the file for query time:
Jazz, modern music => HIERARCHY_1
rock, modern music => HIERARCHY_2
hip hop, modern music => HIERARCHY_3

with the following schema configuration:
<fieldtype name="string_hier" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    (...)
    <filter class="solr.SynonymFilterFactory" synonyms="hierarchies.index.txt"
            ignoreCase="true" expand="false" />
    (...)
  </analyzer>
  <analyzer type="query">
    (...)
    <filter class="solr.SynonymFilterFactory" synonyms="hierarchies.query.txt"
            ignoreCase="true" expand="false" />
    (...)
  </analyzer>
</fieldtype>

This way, a document containing "jazz", "rock" or "hip hop" will be indexed
with "HIERARCHY_1", "HIERARCHY_2" or "HIERARCHY_3" respectively.
A document containing "modern music" keeps "modern music", since it is
not matched at index time.

User searches:
Case 1
If a user searches for "I love jazz", the query-time analyzer for the *_hier
field type will transform the query into "I love HIERARCHY_1"
and will match ONLY documents containing jazz.

Case 2
If a user searches for "I love modern music", the query-time analyzer for the
*_hier field type will transform the query into
"I love [HIERARCHY_1 | HIERARCHY_2 | HIERARCHY_3]"

So it is in fact looking for the whole hierarchy. The only exception is
"modern music" itself, which is not matched here, but will be matched if we
run, in parallel, a search on another full-text-stemmed field type.
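
The two cases above can be sketched in a few lines of Python. This is only a
toy model of the scheme, not Solr code: the dictionaries stand in for
hierarchies.index.txt and hierarchies.query.txt, and analyze, hierarchy_ids
and matches are hypothetical helpers. It assumes that the three query-time
rules sharing "modern music" on the left side add up to all three identifiers,
as described in Case 2.

```python
# Stand-in for hierarchies.index.txt (index-time rules).
INDEX_RULES = {
    "jazz": ["HIERARCHY_1"],
    "rock": ["HIERARCHY_2"],
    "hip hop": ["HIERARCHY_3"],
}

# Stand-in for hierarchies.query.txt (query-time rules).
QUERY_RULES = {
    "jazz": ["HIERARCHY_1"],
    "rock": ["HIERARCHY_2"],
    "hip hop": ["HIERARCHY_3"],
    "modern music": ["HIERARCHY_1", "HIERARCHY_2", "HIERARCHY_3"],
}


def analyze(text, rules):
    """Lowercase, split on whitespace, and replace the longest matching phrase."""
    words = text.lower().split()
    max_len = max(len(key.split()) for key in rules)
    positions = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in rules:
                positions.append(set(rules[phrase]))
                i += n
                break
        else:
            # No rule matched at this word: keep the word itself.
            positions.append({words[i]})
            i += 1
    return positions


def hierarchy_ids(positions):
    """Collect every HIERARCHY_* identifier produced by the analysis."""
    return {tok for pos in positions for tok in pos if tok.startswith("HIERARCHY_")}


def matches(document, query):
    """True if the document and the query share at least one hierarchy identifier."""
    doc_ids = hierarchy_ids(analyze(document, INDEX_RULES))
    query_ids = hierarchy_ids(analyze(query, QUERY_RULES))
    return bool(doc_ids & query_ids)


print(matches("a rock concert", "I love jazz"))
print(matches("a rock concert", "I love modern music"))
```

In this model a document that only says "modern music" carries no identifier
at index time, which is the exception mentioned above.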

That was our need and our solution; I hope it helps you find a solution
for your problem.
Regards,
Laurent

-----Original Message-----
From: swarag [mailto:Swarag_Segu@citysearch.com]
Sent: Tuesday, 29 July 2008 04:08
To: solr-user@lucene.apache.org
Subject: RE: solr synonyms behaviour

Hi Laurent

Laurent Gilles wrote:
> 
> Hi,
> 
> I was faced with the same issues regarding multi-word synonyms.
> Let's say a synonyms list like:
> 
> club, bar, night cabaret
> 
> Now if we have a document containing "club", with the default synonyms
> filter behaviour with expand=true, we will end up in the lucene index with
> a
> document containing "club|bar|night cabaret".
> So if the user search for "night", the query-time will search for "night"
> in
> the index and will match our document since it had been "enriched" @
> index-time, and it really contains the token "night".
> 
> The only valid solution I found was to create a field type used
> exclusively for synonym search, where:
> 
> @IndexTime
> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="false" />
> @QueryTime
> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="false" />
> 
> And with a customised synonyms file that looks like:
> 
> SYN_ID_1, club, bar, night cabaret
> 
> So for our document containing "club", the synonym filter at index time
> with
> expand=false will replace every matching token/expression in the document
> with the SYN_ID_1.
> 
> And at query time, when an user search for "night", since "night" is not
> alone in synonyms definition, it will not be matched, even by "normal"
> search, because every document containing "club" or "bar" would have been
> "enriched" with "SYN_ID_1" and NOT with "club|bar|night cabaret", so the
> final indexed document will not contains isolated token from synonyms
> expression that risks to be matched later without notice.
> 
> In order to match our document containing "club", the user HAVE TO type
> the
> entire expression "night cabaret", and not only part of the expression.
> 
> 
> Of course, as I said before, this field was exclusively used for synonym
> matching, so it requires another field for normal full-text-stemmed search
> to add normal results, this approach give us the opportunity to setup
> Boosting separately on full-text-stemmed search VS synonyms search, let's
> say :
> 
> "title_stem":"club"^100 OR "title_syns":"club"^10
> 
> I hope I have been clear, even if I don't believe so. The fact is that this
> approach fixed our problem, since we didn't want synonym matching if
> the user only types part of a synonym expression.
> 
> Regards,
> Laurent
> 
> 

This has seemed to solve our problem. Thank you very much for your help. 
Once we have our environment setup and all of our data indexed, it may even
provide an extra 'bonus' to be able to add different weights/boosts for the
different fields.

Now, not to be too greedy, but I am wondering if there is a way to utilize
this technique for "Explicit synonym matching" (i.e. synonym mappings that
use the '=>' operator).  For example, we may have a couple mappings like the
following:
night club=>club, bar
swim club=>club, team

As you can see, both night clubs and swim clubs are clubs, but are not
necessarily equivalent with the term "club".  It would be nice to be able to
search for "night club" and only see results for "clubs" and "bars", but not
necessarily "teams", which otherwise, would show up in the results if we use
Equivalent synonyms.

Just wondering if you have been able to do this as well.

Again, thank you for your help!

-- 
View this message in context:
http://www.nabble.com/solr-synonyms-behaviour-tp15051211p18703520.html

RE: solr synonyms behaviour

Posted by swarag <Sw...@citysearch.com>.

Hi Laurent


Laurent Gilles wrote:
> 
> Hi,
> 
> I was faced with the same issues regarding multi-word synonyms.
> Let's say a synonyms list like:
> 
> club, bar, night cabaret
> 
> Now if we have a document containing "club", with the default synonyms
> filter behaviour with expand=true, we will end up in the lucene index with
> a
> document containing "club|bar|night cabaret".
> So if the user search for "night", the query-time will search for "night"
> in
> the index and will match our document since it had been "enriched" @
> index-time, and it really contains the token "night".
> 
> The only valid solution I found was to create a field type used
> exclusively for synonym search, where:
> 
> @IndexTime
> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="false" />
> @QueryTime
> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="false" />
> 
> And with a customised synonyms file that looks like:
> 
> SYN_ID_1, club, bar, night cabaret
> 
> So for our document containing "club", the synonym filter at index time
> with
> expand=false will replace every matching token/expression in the document
> with the SYN_ID_1.
> 
> And at query time, when an user search for "night", since "night" is not
> alone in synonyms definition, it will not be matched, even by "normal"
> search, because every document containing "club" or "bar" would have been
> "enriched" with "SYN_ID_1" and NOT with "club|bar|night cabaret", so the
> final indexed document will not contains isolated token from synonyms
> expression that risks to be matched later without notice.
> 
> In order to match our document containing "club", the user HAVE TO type
> the
> entire expression "night cabaret", and not only part of the expression.
> 
> 
> Of course, as I said before, this field was exclusively used for synonym
> matching, so it requires another field for normal full-text-stemmed search
> to add normal results, this approach give us the opportunity to setup
> Boosting separately on full-text-stemmed search VS synonyms search, let's
> say :
> 
> "title_stem":"club"^100 OR "title_syns":"club"^10
> 
> I hope I have been clear, even if I don't believe so. The fact is that this
> approach fixed our problem, since we didn't want synonym matching if
> the user only types part of a synonym expression.
> 
> Regards,
> Laurent
> 
> 

This has seemed to solve our problem. Thank you very much for your help. 
Once we have our environment setup and all of our data indexed, it may even
provide an extra 'bonus' to be able to add different weights/boosts for the
different fields.

Now, not to be too greedy, but I am wondering if there is a way to utilize
this technique for "Explicit synonym matching" (i.e. synonym mappings that
use the '=>' operator).  For example, we may have a couple mappings like the
following:
night club=>club, bar
swim club=>club, team

As you can see, both night clubs and swim clubs are clubs, but are not
necessarily equivalent with the term "club".  It would be nice to be able to
search for "night club" and only see results for "clubs" and "bars", but not
necessarily "teams", which otherwise, would show up in the results if we use
Equivalent synonyms.
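
To make the two rule styles concrete, here is a small Python sketch of how a
line of synonyms.txt could be read. parse_rule is a hypothetical helper, not
Solr's own parser, and it only models the behaviour discussed in this thread:
'=>' maps the left side to the right side, while a plain comma list maps every
entry to the whole list (expand=true) or to its first entry (expand=false).

```python
def parse_rule(line, expand=True):
    """Turn one synonyms.txt line into a {source: [targets]} mapping."""
    if "=>" in line:
        # Explicit mapping: left-hand entries are replaced by the right-hand ones.
        left, right = line.split("=>", 1)
        sources = [s.strip() for s in left.split(",")]
        targets = [t.strip() for t in right.split(",")]
    else:
        # Equivalent synonyms: every entry is a source.
        sources = [s.strip() for s in line.split(",")]
        targets = sources if expand else sources[:1]
    return {source: list(targets) for source in sources}


rules = {}
for line in ["night club=>club, bar", "swim club=>club, team"]:
    rules.update(parse_rule(line))

# "night club" maps to club and bar only; "team" belongs to the other rule.
print(rules["night club"])
print(rules["swim club"])
```

With explicit rules like these, "night club" and "swim club" keep separate
target lists even though both include "club".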

Just wondering if you have been able to do this as well.

Again, thank you for your help!

-- 
View this message in context: http://www.nabble.com/solr-synonyms-behaviour-tp15051211p18703520.html

RE: solr synonyms behaviour

Posted by Laurent Gilles <lg...@sollan.com>.

Hi,

I was faced with the same issues regarding multi-word synonyms.
Let's say we have a synonyms list like:

club, bar, night cabaret

Now if we have a document containing "club", with the default synonym
filter behaviour (expand=true) we will end up with a document in the Lucene
index containing "club|bar|night cabaret".
So if the user searches for "night", the query will look for "night" in
the index and will match our document, since it was "enriched" at
index time and really does contain the token "night".
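
The leak described above can be reproduced with a toy model in Python. This
is a sketch, not Solr code: index_tokens_expanded is a hypothetical helper
that mimics expand=true for the single synonym group above, and it treats the
index as a plain set of tokens with no positions.

```python
SYNONYM_GROUP = ["club", "bar", "night cabaret"]


def index_tokens_expanded(text, group=SYNONYM_GROUP):
    """Return the tokens indexed for `text` when the group is expanded."""
    words = text.lower().split()
    tokens = set(words)
    padded = " " + " ".join(words) + " "
    # If any entry of the group occurs in the text, add every word of every entry.
    if any(" " + entry + " " in padded for entry in group):
        for entry in group:
            tokens.update(entry.split())
    return tokens


doc_tokens = index_tokens_expanded("The Blue Club")
# The document never mentioned "night", yet the token is now in the index.
print("night" in doc_tokens)
```

A single-word query for "night" therefore finds the "club" document, which is
the unwanted match.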

The only valid solution I found was to create a field type used exclusively
for synonym search, where:

@IndexTime
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="false" />
@QueryTime
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="false" />

And with a customised synonyms file that looks like:

SYN_ID_1, club, bar, night cabaret

So for our document containing "club", the synonym filter at index time with
expand=false will replace every matching token/expression in the document
with SYN_ID_1.

At query time, when a user searches for "night", since "night" does not
appear on its own in the synonym definition, it will not be matched, even by
a "normal" search, because every document containing "club" or "bar" will
have been "enriched" with "SYN_ID_1" and NOT with "club|bar|night cabaret".
The final indexed document will not contain isolated tokens from synonym
expressions that risk being matched later without notice.

In order to match our document containing "club", the user HAS TO type the
entire expression "night cabaret", and not only part of the expression.
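
Here is a rough Python sketch of that behaviour, under the assumption that
expand=false reduces every entry of the line to its first entry (SYN_ID_1) at
both index and query time. analyze and matches are hypothetical helpers, and
a query is taken to match when all of its tokens are present in the document.

```python
GROUP = ["SYN_ID_1", "club", "bar", "night cabaret"]


def analyze(text, group=GROUP):
    """Replace every group entry found in `text` with the first entry of the group."""
    target = group[0]
    # Try longer phrases first so "night cabaret" wins over single words.
    phrases = sorted((entry.lower().split() for entry in group), key=len, reverse=True)
    words = text.lower().split()
    out = []
    i = 0
    while i < len(words):
        for phrase in phrases:
            if words[i:i + len(phrase)] == phrase:
                out.append(target)
                i += len(phrase)
                break
        else:
            # Not part of any synonym entry: keep the word unchanged.
            out.append(words[i])
            i += 1
    return out


def matches(document, query):
    """True if every analyzed query token occurs in the analyzed document."""
    return set(analyze(query)) <= set(analyze(document))


print(analyze("The Blue Club"))
print(matches("The Blue Club", "night"))
print(matches("The Blue Club", "night cabaret"))
```

Because the index only ever holds SYN_ID_1 for the group, a partial query such
as "night" has nothing to hit in the "club" document.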


Of course, as I said before, this field is used exclusively for synonym
matching, so it requires another field for normal full-text-stemmed search
to add normal results. This approach gives us the opportunity to set up
boosting separately for full-text-stemmed search vs. synonym search, let's
say:

"title_stem":"club"^100 OR "title_syns":"club"^10

I hope I have been clear, even if I don't believe so. The fact is that this
approach fixed our problem, since we didn't want synonym matching if
the user only types part of a synonym expression.

Regards,
Laurent



-----Original Message-----
From: swarag [mailto:Swarag_Segu@citysearch.com]
Sent: Friday, 25 July 2008 23:48
To: solr-user@lucene.apache.org
Subject: Re: solr synonyms behaviour



swarag wrote:
> 
> 
> Yonik Seeley wrote:
>> 
>> On Tue, Jul 15, 2008 at 2:27 PM, swarag <Sw...@citysearch.com>
>> wrote:
>>> To my understanding, this means I am using synonyms at index time and
>>> NOT
>>> query time. And yet, I am still having these problems with synonyms.
>> 
>> Can you give a specific example?  Use debugQuery=true to see what the
>> resulting query is.
>> You can also use the admin analysis page to see what the output of the
>> index and query analyzers.
>> 
>> -Yonik
>> 
>> 
> 
> So it sounds like using the '=>' operator for synonyms that may or may not
> contain multiple words causes problems.  So I changed my synonyms.txt to
> the following:
> 
> club,bar,night cabaret
> 
> In schema.xml, I now have the following:
>     <fieldType name="text" class="solr.TextField"
> positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncrements="true"/>
>         <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.EnglishPorterFilterFactory"
> protected="protwords.txt"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>       	<tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt"/>
>         <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.EnglishPorterFilterFactory"
> protected="protwords.txt"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>     </fieldType>
> 
> As you can see, 'night cabaret' is my only multi-word synonym term.
> Searches for 'bar' and 'club' now behave as expected.  However, if I
> search for JUST 'night' or JUST 'cabaret', it looks like it is still using
> the synonyms 'bar' and 'club', which is not what is desired.  I only want
> 'bar' and 'club' to be returned if a search for the complete 'night
> cabaret' is submitted.
> 
> Since query-time synonyms is turned "off", the resulting
> parsedquery_toString is simply "name:night", "name:cabaret", etc...
> 
> Thanks!
> 

We are still having problems. Searches for single words that are part of a
multi-word synonym seem to be affected by the synonyms, when they should
not.  Anyone else experience this?  If not, would you mind explaining your
config and the format of your synonyms.txt file?
-- 
View this message in context:
http://www.nabble.com/solr-synonyms-behaviour-tp15051211p18660135.html

Re: solr synonyms behaviour

Posted by swarag <Sw...@citysearch.com>.


swarag wrote:
> 
> 
> Yonik Seeley wrote:
>> 
>> On Tue, Jul 15, 2008 at 2:27 PM, swarag <Sw...@citysearch.com>
>> wrote:
>>> To my understanding, this means I am using synonyms at index time and
>>> NOT
>>> query time. And yet, I am still having these problems with synonyms.
>> 
>> Can you give a specific example?  Use debugQuery=true to see what the
>> resulting query is.
>> You can also use the admin analysis page to see what the output of the
>> index and query analyzers.
>> 
>> -Yonik
>> 
>> 
> 
> So it sounds like using the '=>' operator for synonyms that may or may not
> contain multiple words causes problems.  So I changed my synonyms.txt to
> the following:
> 
> club,bar,night cabaret
> 
> In schema.xml, I now have the following:
>     <fieldType name="text" class="solr.TextField"
> positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncrements="true"/>
>         <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.EnglishPorterFilterFactory"
> protected="protwords.txt"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>       	<tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt"/>
>         <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.EnglishPorterFilterFactory"
> protected="protwords.txt"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
>     </fieldType>
> 
> As you can see, 'night cabaret' is my only multi-word synonym term.
> Searches for 'bar' and 'club' now behave as expected.  However, if I
> search for JUST 'night' or JUST 'cabaret', it looks like it is still using
> the synonyms 'bar' and 'club', which is not what is desired.  I only want
> 'bar' and 'club' to be returned if a search for the complete 'night
> cabaret' is submitted.
> 
> Since query-time synonyms is turned "off", the resulting
> parsedquery_toString is simply "name:night", "name:cabaret", etc...
> 
> Thanks!
> 

We are still having problems. Searches for single words that are part of a
multi-word synonym seem to be affected by the synonyms, when they should
not.  Anyone else experience this?  If not, would you mind explaining your
config and the format of your synonyms.txt file?
-- 
View this message in context: http://www.nabble.com/solr-synonyms-behaviour-tp15051211p18660135.html

Re: solr synonyms behaviour

Posted by swarag <Sw...@citysearch.com>.


Yonik Seeley wrote:
> 
> On Tue, Jul 15, 2008 at 2:27 PM, swarag <Sw...@citysearch.com>
> wrote:
>> To my understanding, this means I am using synonyms at index time and NOT
>> query time. And yet, I am still having these problems with synonyms.
> 
> Can you give a specific example?  Use debugQuery=true to see what the
> resulting query is.
> You can also use the admin analysis page to see what the output of the
> index and query analyzers.
> 
> -Yonik
> 
> 

So it sounds like using the '=>' operator for synonyms that may or may not
contain multiple words causes problems.  So I changed my synonyms.txt to the
following:

club,bar,night cabaret

In schema.xml, I now have the following:
    <fieldType name="text" class="solr.TextField"
positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
      	<tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

As you can see, 'night cabaret' is my only multi-word synonym term. Searches
for 'bar' and 'club' now behave as expected.  However, if I search for JUST
'night' or JUST 'cabaret', it looks like it is still using the synonyms
'bar' and 'club', which is not what is desired.  I only want 'bar' and
'club' to be returned if a search for the complete 'night cabaret' is
submitted.

Since query-time synonyms is turned "off", the resulting
parsedquery_toString is simply "name:night", "name:cabaret", etc...

Thanks!
-- 
View this message in context: http://www.nabble.com/solr-synonyms-behaviour-tp15051211p18476205.html

Re: solr synonyms behaviour

Posted by Yonik Seeley <yo...@apache.org>.

On Tue, Jul 15, 2008 at 2:27 PM, swarag <Sw...@citysearch.com> wrote:
> To my understanding, this means I am using synonyms at index time and NOT
> query time. And yet, I am still having these problems with synonyms.

Can you give a specific example?  Use debugQuery=true to see what the
resulting query is.
You can also use the admin analysis page to see what the output of the
index and query analyzers.
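
As a minimal sketch of the first suggestion, the request URL can be built in
Python like this. Only the debugQuery=true parameter comes from the thread;
the host, port and /solr/select path are assumptions about a local example
install, and debug_query_url is a hypothetical helper.

```python
from urllib.parse import urlencode


def debug_query_url(base_url, query, **extra):
    """Build a select URL that asks Solr to include debug output."""
    params = {"q": query, "debugQuery": "true"}
    params.update(extra)
    return base_url + "?" + urlencode(params)


# Assumed local endpoint; adjust to your own installation.
url = debug_query_url("http://localhost:8983/solr/select", "name:night")
print(url)
```

Fetching that URL should return the normal results plus a debug section that
includes the parsed query.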

-Yonik

Re: solr synonyms behaviour

Posted by swarag <Sw...@citysearch.com>.


matt connolly wrote:
> 
> You won't have the multiple word problem if you use synonyms at index time
> instead of query time.
> 
> 
> swarag wrote:
>> 
>> Here is a basic example of some synonyms in my synonyms.txt:
>> club=>club,bar,night cabaret
>> bar=>bar,club
>> 
>> As you can see, a search for 'bar' will return any documents with 'bar'
>> or 'club' in the name. This works fine. However, a search for 'club'
>> SHOULD return any documents with 'club', 'bar' or 'night cabaret' in the
>> name, but it does not. It only returns 'bar' and 'club'.  
>> 
>> Interestingly, a search for 'night cabaret' gives me all 'night
>> cabaret's, 'bar's and 'club's...which is quite unexpected since I'm using
>> uni-directional synonym config (using the => symbol)
>> 
>> Does your config give you my desired behavior?
>> 
> 
> 

Is there something I am missing here? This is an excerpt from my schema.xml:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
      	<tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

To my understanding, this means I am using synonyms at index time and NOT
query time. And yet, I am still having these problems with synonyms.

-- 
View this message in context: http://www.nabble.com/solr-synonyms-behaviour-tp15051211p18471922.html

Re: solr synonyms behaviour

Posted by matt connolly <ma...@cabinetuk.com>.

You won't have the multiple word problem if you use synonyms at index time
instead of query time.


swarag wrote:
> 
> Here is a basic example of some synonyms in my synonyms.txt:
> club=>club,bar,night cabaret
> bar=>bar,club
> 
> As you can see, a search for 'bar' will return any documents with 'bar' or
> 'club' in the name. This works fine. However, a search for 'club' SHOULD
> return any documents with 'club', 'bar' or 'night cabaret' in the name,
> but it does not. It only returns 'bar' and 'club'.  
> 
> Interestingly, a search for 'night cabaret' gives me all 'night cabaret's,
> 'bar's and 'club's...which is quite unexpected since I'm using
> uni-directional synonym config (using the => symbol)
> 
> Does your config give you my desired behavior?
> 

-- 
View this message in context: http://www.nabble.com/solr-synonyms-behaviour-tp15051211p18471373.html

Re: solr synonyms behaviour

Posted by swarag <Sw...@citysearch.com>.


matt connolly wrote:
> 
> 
> swarag wrote:
>> 
>> Knowing the Lucene struggles with multi-word query-time synonyms, my
>> question is, does this also affect index-time synonyms? What other
>> alternatives do we have if we require there to be multiple word synonyms?
>> 
> 
> No the multiple word problem doesn't happen with index synonyms, only
> query synonyms.
> 
> See:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46
> 
> I ended up using index time synonyms, but ideally, I'd like to see a
> filter factory that does something like the SynsExpand tool does (which
> was written for lucene, not solr).
> 

I've tried this and it doesn't seem to work. Here are the basics of my
config:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="false"/>
...
Synonyms for queryTime is off

Here is a basic example of some synonyms in my synonyms.txt:
club=>club,bar,night cabaret
bar=>bar,club

As you can see, a search for 'bar' will return any documents with 'bar' or
'club' in the name. This works fine. However, a search for 'club' SHOULD
return any documents with 'club', 'bar' or 'night cabaret' in the name, but
it does not. It only returns 'bar' and 'club'.  

Interestingly, a search for 'night cabaret' gives me all 'night cabaret's,
'bar's and 'club's...which is quite unexpected since I'm using
uni-directional synonym config (using the => symbol)

Does your config give you my desired behavior?
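
One possible reading of these results, sketched in Python: with '=>' rules
applied only at index time, a rule rewrites documents that contain its
left-hand side and nothing else. The model below is an assumption-laden toy,
not Solr: index_tokens and search are hypothetical helpers, the document names
are invented, every query term is required, and rules are looked up word by
word because both left-hand sides are single words.

```python
# The two rules from synonyms.txt, as {left side: [right side entries]}.
RULES = {
    "club": ["club", "bar", "night cabaret"],
    "bar": ["bar", "club"],
}


def index_tokens(name):
    """Tokens stored for a document name after index-time rule replacement."""
    tokens = set()
    for word in name.lower().split():
        for replacement in RULES.get(word, [word]):
            tokens.update(replacement.split())
    return tokens


def search(query, docs):
    """Return the documents whose indexed tokens contain every query term."""
    terms = query.lower().split()
    return [doc for doc in docs if all(term in index_tokens(doc) for term in terms)]


DOCS = ["Blue Club", "Corner Bar", "Night Cabaret Paris"]  # invented sample names
print(search("club", DOCS))
print(search("night cabaret", DOCS))
```

In this model "Night Cabaret Paris" is never rewritten, so a search for 'club'
cannot find it, while "Blue Club" is indexed with the tokens night and cabaret
and so turns up for 'night cabaret'. The sketch does not reproduce bars
showing up for 'night cabaret'; that may depend on details not shown here.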
-- 
View this message in context: http://www.nabble.com/solr-synonyms-behaviour-tp15051211p18469995.html

Re: solr synonyms behaviour

Posted by matt connolly <ma...@cabinetuk.com>.


swarag wrote:
> 
> Knowing the Lucene struggles with multi-word query-time synonyms, my
> question is, does this also affect index-time synonyms? What other
> alternatives do we have if we require there to be multiple word synonyms?
> 

No the multiple word problem doesn't happen with index synonyms, only query
synonyms.

See:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46

I ended up using index time synonyms, but ideally, I'd like to see a filter
factory that does something like the SynsExpand tool does (which was written
for lucene, not solr).
-- 
View this message in context: http://www.nabble.com/solr-synonyms-behaviour-tp15051211p18461507.html

Re: solr synonyms behaviour

Posted by swarag <Sw...@citysearch.com>.


hossman wrote:
> 
> This is "Issue #1" regarding trying to use query time multi word synonyms 
> discussed on the wiki...
> 
>>> "The Lucene QueryParser tokenizes on white space before giving any
>>> text to the Analyzer, so if a person searches for the words sea biscit
>>> the analyzer will be given the words "sea" and "biscit" separately, and
>>> will not know that they match a synonym."
> 
> On the "boosting" part of the query (where the dismax handler
> automagically quotes the entire input and queries it against the "pf"
> fields), the synonyms do get used (because the whole input is analyzed as
> one string), but in this case the phrase queries will match any of these
> phrases...
> 
>    divorce dispute resolution
>    alternative mediation resolution
>    divorce mediation resolution
>    etc...
> 
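> ..it will *NOT* match a document that contains only this phrase...
> 
>    divorce mediation
> 
> ...because "resolution" is still required at the third position. The 
> phrase "alternative dispute resolution" does match, but only as one 
> more combination of the three positions: the SynonymFilter has no way to 
> tell the query parser which words should be linked to which other words 
> when building up the phrase query.  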
> 
> This is "Issue #2" regarding trying to use query time multi word synonyms
> discussed on the wiki...
> 
>>> Phrase searching (ie: "sea biscit") will cause the QueryParser to pass 
>>> the entire string to the analyzer, but if the SynonymFilter is 
>>> configured to expand the synonyms, then when the QueryParser gets the  
>>> resulting list of tokens back from the Analyzer, it will construct a  
>>> MultiPhraseQuery that will not have the desired effect. This is because  
>>> of the limited mechanism available for the Analyzer to indicate that 
>>> two terms occupy the same position: there is no way to indicate that a  
>>> "phrase" occupies the same position as a term. For our example the  
>>> resulting MultiPhraseQuery would be "(sea | sea | seabiscuit) (biscuit 
>>> | biscit)" which would not match the simple case of "seabiscuit" 
>>> occurring in a document
> 
> : I have the synonym filter only at query time coz i can't re-index data
> : (or portion of data) everytime i add a synonym and a couple of other
> : reasons.
> 
> Use cases like yours will *never* work as a query time synonym ... hence 
> all of the information about multi-word synonyms and the caveats about 
> using them in the wiki...
> 
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#SynonymFilter
> 
> 
> -Hoss
> 
> 
> 

We have a very similar problem, and want to make sure that this is hopeless
with Solr before we try something else...

I have a synonyms.txt file similar to the following:
bar=>bar, club
club=>club, bar, night club
...

A search for 'bar' returns the exact results we want: anything with 'bar' or
'club' in the name.  However, a search for 'club' produces very strange
results: name:"(club bar night) club"
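The odd-looking parsed query can be reproduced with a small toy model. The Python sketch below is not Solr code; it assumes the filter simply stacks the k-th word of every replacement at position k, and the names stack_positions and to_query_string are made up for illustration.

```python
def stack_positions(replacements):
    """Stack the k-th word of every replacement phrase at position k."""
    positions = []
    for phrase in replacements:
        for k, word in enumerate(phrase.split()):
            if k == len(positions):
                positions.append([])      # first phrase long enough to reach position k
            if word not in positions[k]:
                positions[k].append(word)
    return positions


def to_query_string(positions):
    """Render stacked positions the way the debug output groups them."""
    return " ".join(
        "(" + " ".join(p) + ")" if len(p) > 1 else p[0]
        for p in positions
    )


# Right-hand side of the rule: club=>club, bar, night club
club_positions = stack_positions(["club", "bar", "night club"])
print(to_query_string(club_positions))
```

Under that assumption the two-word replacement "night club" pushes a second position into the phrase, which is why a document then has to contain "club" right after "club", "bar" or "night".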

Knowing that Lucene struggles with multi-word query-time synonyms, my
question is, does this also affect index-time synonyms? What other
alternatives do we have if we need multi-word synonyms?

-- 
View this message in context: http://www.nabble.com/solr-synonyms-behaviour-tp15051211p18349953.html

Re: solr synonyms behaviour

Posted by Chris Hostetter <ho...@fucit.org>.

: so when i do a debug this is the parsedquery_tostring i see:
: (((text:divorc^0.8 | name:divorc^2.0)~0.01 (text:mediat^0.8 |
: name:mediat^2.0)~0.01)~2) (text:"(divorc altern) (disput mediat)
: resolut"~5^0.8 | name:"(divorc altern) (disput mediat) resolut"~5^2.0)~0.01

FYI: it's very hard to make sense of this kind of information without also 
knowing what your original URL is and how your request handler is 
configured ... i assume you are using dismax and the original request is 
something like...
                  q = divorce mediation
                 qf = text^0.8 name^2
                 pf = text^0.8 name^2
                 ps = 5

...correct?  if so then you didn't cut/paste the full query tostring ... 
there should be "+" in front of that first "("

: Now what i don't understand is how its doing the matching
: 
: Does it mean it will find all matches with either of the words (divorc
: altern), either of the words (disput mediat) (and/or) resolut

anything matching both "divorce" and "mediation" in either the "text" or 
"name" field will be considered a match ... the synonyms aren't affecting 
anything mandatory in the query, because the query parser has already 
split the input up before it's analyzed, so the synonyms don't come into 
play at all.

This is "Issue #1" regarding trying to use query time multi word synonyms 
discussed on the wiki...

>> "The Lucene QueryParser tokenizes on white space before giving any 
>> text to the Analyzer, so if a person searches for the words sea biscit 
>> the analyzer will be given the words "sea" and "biscit" separately, and 
>> will not know that they match a synonym."
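A rough way to picture Issue #1 is the toy Python sketch below. It is not Solr or Lucene code: RULES, rules_matched, query_parser_view and phrase_view are hypothetical names, and the only thing modelled is whether a multi-word synonym rule can fire when the text arrives word by word versus as one string.

```python
# Multi-word synonym entries, taken from the rule in the original post.
RULES = [
    ("divorce", "mediation"),
    ("alternative", "dispute", "resolution"),
]


def rules_matched(text):
    """Return the multi-word rules whose words appear consecutively in text."""
    tokens = text.lower().split()
    matched = []
    for rule in RULES:
        n = len(rule)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == rule:
                matched.append(rule)
                break
    return matched


def query_parser_view(query):
    """Split on whitespace first, then analyze each word on its own."""
    return [rules_matched(chunk) for chunk in query.split()]


def phrase_view(query):
    """Analyze the whole input as one string, as happens for a quoted phrase."""
    return rules_matched(query)


print(query_parser_view("divorce mediation"))  # no rule can fire on single words
print(phrase_view("divorce mediation"))        # the two-word rule is seen
```

In this model the word-by-word path never sees two adjacent words, so a two-word rule has nothing to match against.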

on the "boosting" part of the query (where the dismax handler 
automagically quotes the entire input and queries it against the "pf" 
fields), the synonyms do get used (because the whole input is analyzed as 
one string) but in this case the phrase queries will match any of these 
phrases...

   divorce dispute resolution
   alternative mediation resolution
   divorce mediation resolution
   etc...

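..it will *NOT* match a document that contains only this phrase...

   divorce mediation

...because "resolution" is still required at the third position. The 
phrase "alternative dispute resolution" does match, but only as one 
more combination of the three positions: the SynonymFilter has no way to 
tell the query parser which words should be linked to which other words 
when building up the phrase query.  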
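The position-by-position matching of the phrase part can be sketched in Python. This is a toy model, not Lucene's implementation: it assumes exact adjacency (the ~5 slop and the boosts are ignored), it works on the stemmed terms from the debug output, and QUERY_POSITIONS and phrase_matches are names invented for the example.

```python
# Each set holds the stemmed terms that share one position in
# text:"(divorc altern) (disput mediat) resolut".
QUERY_POSITIONS = [
    {"divorc", "altern"},
    {"disput", "mediat"},
    {"resolut"},
]


def phrase_matches(doc_terms, positions=QUERY_POSITIONS):
    """True if some run of consecutive doc terms fits every position in order."""
    n = len(positions)
    for start in range(len(doc_terms) - n + 1):
        window = doc_terms[start:start + n]
        if all(term in allowed for term, allowed in zip(window, positions)):
            return True
    return False


print(phrase_matches(["divorc", "disput", "resolut"]))  # divorce dispute resolution
print(phrase_matches(["altern", "mediat", "resolut"]))  # alternative mediation resolution
print(phrase_matches(["divorc", "mediat", "resolut"]))  # divorce mediation resolution
print(phrase_matches(["divorc", "mediat"]))             # divorce mediation
```

The first three calls return True and the last returns False: any mix of one term per position is accepted, while the bare two-word phrase never reaches the third position.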

This is "Issue #2" regarding trying to use query time multi word synonyms
discussed on the wiki...

>> Phrase searching (ie: "sea biscit") will cause the QueryParser to pass 
>> the entire string to the analyzer, but if the SynonymFilter is 
>> configured to expand the synonyms, then when the QueryParser gets the  
>> resulting list of tokens back from the Analyzer, it will construct a  
>> MultiPhraseQuery that will not have the desired effect. This is because  
>> of the limited mechanism available for the Analyzer to indicate that 
>> two terms occupy the same position: there is no way to indicate that a  
>> "phrase" occupies the same position as a term. For our example the  
>> resulting MultiPhraseQuery would be "(sea | sea | seabiscuit) (biscuit 
>> | biscit)" which would not match the simple case of "seabiscuit" 
>> occurring in a document

: I have the synonym filter only at query time coz i can't re-index data (or
: portion of data) everytime i add a synonym and a couple of other reasons.

Use cases like yours will *never* work as a query time synonym ... hence 
all of the information about multi-word synonyms and the caveats about 
using them in the wiki...

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#SynonymFilter


-Hoss