Posted to solr-user@lucene.apache.org by Zac Smith <za...@trinkit.com> on 2012/02/04 23:40:45 UTC

Multi word synonyms

Hi

I have seen several questions on this already but haven't been able to sort my issue. My problem is that multi-word synonyms aren't behaving as I would expect. I have copied my field type definition at the bottom of this message, but the relevant synonym filter is here (used at index time):
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory" />	

Say I have synonyms.txt set up like this:
syrup,sugar syrup,stock syrup

When indexing the text 'syrup', the 3 phrases are treated equivalently as expected. I can see this in the Index Analyzer as they all occupy the same term position.

But if all of the synonyms are phrases, it doesn't work.
e.g. synonyms.txt looks like:
simple syrup,sugar syrup,stock syrup

Now when putting the text 'simple syrup' into the Index Analyzer I can only see the original term listed. It is not finding the synonyms.

Anyone know how to fix this?

Zac

Field Type definition:
<fieldType name="phrase_searcher" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
	<analyzer type="index">
		<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt" />				
		<tokenizer class="solr.WhitespaceTokenizerFactory" />
		<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory" />				
		<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
		<filter class="solr.LowerCaseFilterFactory" />
		<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
		<filter class="solr.PorterStemFilterFactory" />
		<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
	</analyzer>
	<analyzer type="query">
		<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt" />
		<tokenizer class="solr.WhitespaceTokenizerFactory" />
		<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
		<filter class="solr.LowerCaseFilterFactory" />
		<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
		<filter class="solr.PorterStemFilterFactory" />
	</analyzer>
</fieldType>
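
One possible reading of the behaviour described above, sketched as a toy model in Python. This is not Solr's SynonymFilter code, and the function names are invented for illustration. It assumes that tokenizerFactory="solr.KeywordTokenizerFactory" only controls how the entries in synonyms.txt are parsed, so 'simple syrup' is stored as one token, while the field's WhitespaceTokenizerFactory still hands the filter 'simple' and 'syrup' as two separate tokens. Under that assumption a single-word entry such as 'syrup' can match, but a rule made up only of phrases never can.

```python
# Toy model only: NOT Solr's SynonymFilter. All names are hypothetical.

def whitespace_tokenize(text):
    """Split the field value on whitespace, as a whitespace tokenizer would."""
    return text.split()

def load_rules(line):
    """Parse one synonyms.txt line, keeping each entry as ONE token.

    This mimics the assumed effect of parsing the file with a keyword tokenizer.
    """
    entries = [entry.strip().lower() for entry in line.split(",")]
    return {entry: entries for entry in entries}

def expand(tokens, rules):
    """Look up each incoming token; emit its synonyms at the same position."""
    return [rules.get(token.lower(), [token]) for token in tokens]

mixed = load_rules("syrup,sugar syrup,stock syrup")
phrases = load_rules("simple syrup,sugar syrup,stock syrup")

# The single token 'syrup' is a key in the rule table, so it expands.
print(expand(whitespace_tokenize("syrup"), mixed))
# [['syrup', 'sugar syrup', 'stock syrup']]

# 'simple' and 'syrup' arrive separately; neither equals 'simple syrup'.
print(expand(whitespace_tokenize("simple syrup"), phrases))
# [['simple'], ['syrup']]
```

If this reading is right, the second case is what the Index Analyzer shows: only the original terms, with no synonyms attached.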


Re: Multi word synonyms

Posted by Roman Chyla <ro...@gmail.com>.
Try separating multi word synonyms with a null byte

simple\0syrup,sugar\0syrup,stock\0syrup

see https://issues.apache.org/jira/browse/LUCENE-4499 for details

roman


RE: Multi word synonyms

Posted by Zac Smith <za...@trinkit.com>.
Thanks for your response. When I don't include the KeywordTokenizerFactory in the SynonymFilter definition, I get additional term values that I don't want.

e.g. synonyms.txt looks like:
simple syrup,sugar syrup,stock syrup

A document with a value containing 'simple syrup' can now be found when searching for just 'stock'.

So the problem I am trying to address with KeywordTokenizerFactory, is to prevent my multi word synonyms from getting broken down into single words.

Thanks
Zac
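
The effect described above can be reproduced with a small toy model (plain Python, not Solr code; `indexed_terms` is a hypothetical helper). It assumes that, without KeywordTokenizerFactory, each synonym entry is split on whitespace and that, with expand="true", every word of every equivalent entry ends up in the index for the matched text.

```python
# Toy model only: NOT Solr code. `indexed_terms` is a hypothetical helper.

def indexed_terms(text, synonym_line):
    """Return the set of single-word terms indexed for `text`.

    Each synonym entry is split on whitespace. If the text contains one
    entry as a run of consecutive tokens, the words of all equivalent
    entries are added to the indexed terms.
    """
    entries = [entry.strip().split() for entry in synonym_line.split(",")]
    tokens = text.split()
    terms = set(tokens)
    for entry in entries:
        n = len(entry)
        found = any(tokens[i:i + n] == entry for i in range(len(tokens) - n + 1))
        if found:
            for other in entries:
                terms.update(other)
            break
    return terms

terms = indexed_terms("simple syrup", "simple syrup,sugar syrup,stock syrup")
print(sorted(terms))     # ['simple', 'stock', 'sugar', 'syrup']
print("stock" in terms)  # True: a search for just 'stock' now hits the document
```

In this model the phrase boundaries are lost once the entries are split into words, which is the unwanted match on 'stock'.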




Re: Multi word synonyms

Posted by Erick Erickson <er...@gmail.com>.
I'm not quite sure what you're trying to do with KeywordTokenizerFactory in
your SynonymFilter definition, but if I use the defaults, then the
all-phrase form works just fine.

So the question is "what problem are you trying to address by using
KeywordTokenizerFactory?"

Best
Erick

On Sun, Feb 5, 2012 at 8:21 AM, O. Klein <kl...@octoweb.nl> wrote:
> Your query analyser will tokenize "simple syrup" into "simple" and "syrup"
> and won't match on "simple syrup" in the synonyms.txt
>
> So you have to change the query analyzer into KeywordTokenizerFactory as
> well.
>
> It might be an idea to make a field for synonyms only with this tokenizer and
> another field to search on and use dismax. Never tried this though.
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p3717215.html
> Sent from the Solr - User mailing list archive at Nabble.com.

RE: Multi word synonyms

Posted by Zac Smith <za...@trinkit.com>.
Thanks for the response. This almost worked. I created a new field using the KeywordTokenizerFactory as you suggested. The only problem was that searches only found documents when quotes were used.
E.g.
synonyms.txt set up like this:
simple syrup,sugar syrup,stock syrup

I indexed a document with the value 'simple syrup'. Searches only found the document when using quotes:
e.g.
"simple syrup" or "stock syrup" matched
simple syrup (no quotes) did not match

Here is the field I created:
        <fieldType name="synonym_searcher" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
            <analyzer type="index">
                <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt" />				
				<tokenizer class="solr.KeywordTokenizerFactory" />
				<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory" />
                <filter class="solr.LowerCaseFilterFactory" />				
            </analyzer>
            <analyzer type="query">
				<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt" />
                <tokenizer class="solr.KeywordTokenizerFactory" />				
                <filter class="solr.LowerCaseFilterFactory" />				
            </analyzer>
        </fieldType>

Any ideas? Also, I am using dismax and solr 3.5.0.

Thanks
Zac
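
A toy sketch of why quotes might make the difference here (plain Python, not Solr code; all names are hypothetical). It assumes the query parser splits unquoted input on whitespace before the field's analyzer sees it, so the KeywordTokenizerFactory at query time receives 'simple' and 'syrup' one at a time, while a quoted phrase reaches it whole.

```python
import re

# Toy model only: NOT Solr code. All names here are hypothetical.

# What the keyword-tokenized, synonym-expanded, lowercased field is assumed
# to hold for a document whose value is 'simple syrup'.
INDEXED_TERMS = {"simple syrup", "sugar syrup", "stock syrup"}

def parser_chunks(user_query):
    """Split on whitespace, but keep double-quoted phrases in one piece."""
    pairs = re.findall(r'"([^"]*)"|(\S+)', user_query)
    return [quoted or bare for quoted, bare in pairs]

def matches(user_query):
    """Analyze each chunk on its own (keyword tokenizer plus lowercasing)."""
    return any(chunk.lower() in INDEXED_TERMS for chunk in parser_chunks(user_query))

print(parser_chunks('simple syrup'))   # ['simple', 'syrup']
print(matches('simple syrup'))         # False
print(parser_chunks('"stock syrup"'))  # ['stock syrup']
print(matches('"stock syrup"'))        # True
```

Under this assumption the field definition itself is behaving consistently; the whitespace split happens before the analyzer is involved.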

-----Original Message-----
From: O. Klein [mailto:klein@octoweb.nl] 
Sent: Sunday, February 05, 2012 5:22 AM
To: solr-user@lucene.apache.org
Subject: Re: Multi word synonyms

Your query analyser will tokenize "simple syrup" into "simple" and "syrup"
and won't match on "simple syrup" in the synonyms.txt

So you have to change the query analyzer into KeywordTokenizerFactory as well.

It might be an idea to make a field for synonyms only with this tokenizer and another field to search on and use dismax. Never tried this though.

--
View this message in context: http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p3717215.html
Sent from the Solr - User mailing list archive at Nabble.com.



Problem Multi word synonyms in solr 3.4

Posted by Pravin Agrawal <Pr...@persistent.co.in>.
Hi All,

I am trying to use synonyms in Solr 3.4 and am facing the issue below with multi-word synonyms.

I am using the edismax query parser with the following fields in qf and pf:

qf: name^1.2,name_synonym^0.5
pf: phrase_name^3

The analyzer that I am using for name_synonym is as follows:

<fieldType name="text_synonym" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="0" preserveOriginal="0" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
</fieldType>



With above configuration the below type of synonyms works fine

foobar => foo bar

FnB => foo and bar

aaa,bbb,ccc





However for following multiword synonym, the dismax query is incorrectly formed for qf field

xxx zzz, aaa bbb, mmm nnn, aaabbb





The parsedquery_tostring that gets formed for the query aaabbb is as follows



+(name:aaabbb^1.2 | name_synonym:" xxx zzz aaa bbb mmm (nnn aaabbb)"^0.5)~0.5 (phrase_name:" xxx zzz aaa bbb mmm (nnn aaabbb)"~5^3.0)~0.5



I am expecting a query like



+(name:aaabbb^1.2 | ((name_synonym:xxx zzz name_synonym:aaa bbb name_synonym:mmm nnn name_synonym:aaabbb)^0.5))~0.5



Similarly for query xxx zzz I am getting following parsedquery_tostring from dismax



+((name:xxx^1.2 | name_synonym:xxx^0.5 | name:zzz^1.2 | name_synonym:zzz^0.5)~0.5) (phrase_name:"xxx zzz"~5^3.0)~0.5



But I m expecting following query



+((name:xxx^1.2 | name_synonym:xxx^0.5 | name:zzz^1.2 | name_synonym:zzz^0.5)~0.5) (phrase_name:"xxx zzz"~5^3.0 | phrase_name:"aaa bbb"~5^3.0 | phrase_name:"mmm nnn"~5^3.0 | phrase_name:"aaabbb"~5^3.0)~0.5





However it's not the case.

Please let me know if I am missing something or its expected behavior. Also please let me know what should be done to get my desired output.



Thanks in advance.

Pravin


RE: Multi word synonyms

Posted by Zac Smith <za...@trinkit.com>.
Are you able to explain how I would create another field to fit my scenario?

-----Original Message-----
From: O. Klein [mailto:klein@octoweb.nl] 
Sent: Tuesday, February 07, 2012 1:28 PM
To: solr-user@lucene.apache.org
Subject: RE: Multi word synonyms

Well, if you want both multi word and single words I guess you will have to create another field :) Or make queries like you suggested.

--
View this message in context: http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p3724009.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: Multi word synonyms

Posted by "O. Klein" <kl...@octoweb.nl>.
Well, if you want both multi word and single words I guess you will have to
create another field :) Or make queries like you suggested.

--
View this message in context: http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p3724009.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Multi word synonyms

Posted by Zac Smith <za...@trinkit.com>.
It doesn't seem to do it for me. My field type is:
        <fieldType name="synonym_searcher" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
            <analyzer type="index">                
				<tokenizer class="solr.KeywordTokenizerFactory" />
				<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory" />                
            </analyzer>
            <analyzer type="query">				
                <tokenizer class="solr.KeywordTokenizerFactory" />                
            </analyzer>
        </fieldType>

I am using edismax and solr 3.5 and multi word values can only be matched when using quotes.

-----Original Message-----
From: O. Klein [mailto:klein@octoweb.nl] 
Sent: Tuesday, February 07, 2012 12:49 PM
To: solr-user@lucene.apache.org
Subject: RE: Multi word synonyms

Isn't that what autoGeneratePhraseQueries="true" is for?

--
View this message in context: http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p3723886.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: Multi word synonyms

Posted by "O. Klein" <kl...@octoweb.nl>.
Isn't that what autoGeneratePhraseQueries="true" is for?

--
View this message in context: http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p3723886.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Multi word synonyms

Posted by Zac Smith <za...@trinkit.com>.
I suppose I could translate every user query to include the term with quotes.

e.g. if someone searches for stock syrup I send a query like:
q=stock syrup OR "stock syrup"

Seems like a bit of a hack though, is there a better way of doing this?

Zac
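
The rewrite described above could be done in client code before the request is sent. A minimal sketch in Python, assuming the query parser in use honours the OR operator. `with_phrase_variant` is a hypothetical helper name, and queries that already contain quotes are passed through untouched so that no unbalanced quotes are produced.

```python
# Hypothetical client-side helper, not part of Solr or any Solr client library.

def with_phrase_variant(user_query):
    """Return the query plus a quoted copy of it, joined with OR.

    Single-word queries and queries that already contain a double quote
    are returned unchanged (after whitespace normalization).
    """
    text = " ".join(user_query.split())
    if len(text.split()) < 2 or '"' in text:
        return text
    return f'{text} OR "{text}"'

print(with_phrase_variant("stock syrup"))  # stock syrup OR "stock syrup"
print(with_phrase_variant("syrup"))        # syrup
```

The result would still need URL encoding before being sent as the q parameter.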




Re: Multi word synonyms

Posted by "O. Klein" <kl...@octoweb.nl>.
Your query analyser will tokenize "simple syrup" into "simple" and "syrup"
and won't match on "simple syrup" in the synonyms.txt

So you have to change the query analyzer into KeywordTokenizerFactory as
well.

It might be an idea to make a field for synonyms only with this tokenizer and
another field to search on and use dismax. Never tried this though.

--
View this message in context: http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p3717215.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Multi word synonyms

Posted by Jack Krupansky <ja...@basetechnology.com>.
Yes, it is sad but true that multi-word synonym processing does not "work 
right out of the box" for all common interesting cases. It does do 
semi-well for index-time processing, but even there, matching synonyms of 
varying lengths within larger phrases will sometimes work and sometimes not, 
unless you allow some amount of phrase slop.

The LucidWorks Search query parser does handle query-time synonyms 
reasonably well, but using some complicated, ad hoc processing that is not 
easy to replicate in your average application that doesn't have that extra, 
proprietary "magic". If you want robust, query-time processing of synonyms 
(which is a lot more flexible than index-time processing), you would need to 
replicate some form of that logic.

A couple of months ago I did propose that we design and implement a set of 
interfaces to support robust handling of multi-word synonyms at query time, 
but there was... NO interest expressed by any developers. Since then, the 
Lucene and Solr query parsers have diverged even further, making the support 
for such an interface even more problematic - unless we just bite the bullet 
and say that the Lucene query parser is a hopeless dinosaur and leave it 
behind in the dust as a remnant of "the early days" of Lucene and Solr. 
Also, the fact that we still have three distinct main Solr query parsers 
(SolrQueryParser, a derivative of the classic Lucene query parser; dismax; 
and edismax) makes this task rather problematic, and the number of other 
"niche" query parsers which could use better synonym processing makes it 
a very daunting one. If we ever do 
integrate the "big three" (and write the Lucene query parser), then maybe 
the time will be ripe to revisit robust query-time multi-word synonym 
support.

(Or, maybe LucidWorks will finally donate their query parser!)

-- Jack Krupansky

-----Original Message----- 
From: Bernd Fehling
Sent: Thursday, November 29, 2012 8:19 AM
To: solr-user@lucene.apache.org
Subject: Re: Multi word synonyms

There are also other solutions:

Multi-word synonym filter (synonym expansion)
https://issues.apache.org/jira/browse/LUCENE-4499

Since Solr 3.4 I have had my own solution, which might become obsolete once
LUCENE-4499 is in a released version.
http://www.ub.uni-bielefeld.de/~befehl/base/solr/eurovoc.html


Am 29.11.2012 13:44, schrieb O. Klein:
> Found an article about the issue of  multi word synonyms
> <http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/>  .
>
> Not sure it's the solution I'm looking for, but it may be for someone 
> else.
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p4023220.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 


Re: Multi word synonyms

Posted by Bernd Fehling <be...@uni-bielefeld.de>.
There are also other solutions:

Multi-word synonym filter (synonym expansion)
https://issues.apache.org/jira/browse/LUCENE-4499

Since Solr 3.4 I have had my own solution, which might become obsolete once
LUCENE-4499 is in a released version.
http://www.ub.uni-bielefeld.de/~befehl/base/solr/eurovoc.html


Am 29.11.2012 13:44, schrieb O. Klein:
> Found an article about the issue of  multi word synonyms
> <http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/>  .
> 
> Not sure it's the solution I'm looking for, but it may be for someone else.
> 
> 
> 
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p4023220.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 

Re: Multi word synonyms

Posted by "O. Klein" <kl...@octoweb.nl>.
Found an article about the issue of multi word synonyms:
<http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/>

Not sure it's the solution I'm looking for, but it may be for someone else.



--
View this message in context: http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p4023220.html
Sent from the Solr - User mailing list archive at Nabble.com.