Posted to solr-user@lucene.apache.org by Zac Smith <za...@trinkit.com> on 2012/02/04 23:40:45 UTC
Multi word synonyms
Hi
I have seen several questions on this already but haven't been able to sort my issue. My problem is that multi-word synonyms aren't behaving as I would expect. I have copied my field type definition at the bottom of this message, but the relevant synonym filter is here (used at index time):
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory" />
Say I have synonyms.txt set up like this:
syrup,sugar syrup,stock syrup
When indexing the text 'syrup', the 3 phrases are treated equivalently as expected. I can see this in the Index Analyzer as they all occupy the same term position.
But if all of the synonyms are phrases, it doesn't work.
e.g. synonyms.txt looks like:
simple syrup,sugar syrup,stock syrup
Now when putting the text 'simple syrup' into the Index Analyzer I can only see the original term listed. It is not finding the synonyms.
Anyone know how to fix this?
Zac
Field Type definition:
<fieldType name="phrase_searcher" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt" />
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
<filter class="solr.PorterStemFilterFactory" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt" />
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
<filter class="solr.PorterStemFilterFactory" />
</analyzer>
</fieldType>
Re: Multi word synonyms
Posted by Roman Chyla <ro...@gmail.com>.
Try separating the words of each multi-word synonym with a null byte:
simple\0syrup,sugar\0syrup,stock\0syrup
see https://issues.apache.org/jira/browse/LUCENE-4499 for details
roman
On Sun, Feb 5, 2012 at 10:31 PM, Zac Smith <za...@trinkit.com> wrote:
> Thanks for your response. When I don't include the KeywordTokenizerFactory
> in the SynonymFilter definition, I get additional term values that I don't
> want.
>
> e.g. synonyms.txt looks like:
> simple syrup,sugar syrup,stock syrup
>
> A document with a value containing 'simple syrup' can now be found when
> searching for just 'stock'.
>
> So the problem I am trying to address with KeywordTokenizerFactory, is to
> prevent my multi word synonyms from getting broken down into single words.
>
> Thanks
> Zac
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Sunday, February 05, 2012 8:07 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Multi word synonyms
>
> I'm not quite sure what you're trying to do with KeywordTokenizerFactory
> in your SynonymFilter definition, but if I use the defaults, then the
> all-phrase form works just fine.
>
> So the question is "what problem are you trying to address by using
> KeywordTokenizerFactory?"
>
> Best
> Erick
>
> On Sun, Feb 5, 2012 at 8:21 AM, O. Klein <kl...@octoweb.nl> wrote:
> > Your query analyser will tokenize "simple syrup" into "simple" and
> > "syrup" and won't match on "simple syrup" in the synonyms.txt
> >
> > So you have to change the query analyzer into KeywordTokenizerFactory
> > as well.
> >
> > It might be an idea to make a field for synonyms only with this tokenizer
> > and another field to search on and use dismax. Never tried this though.
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p3717215.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
RE: Multi word synonyms
Posted by Zac Smith <za...@trinkit.com>.
Thanks for your response. When I don't include the KeywordTokenizerFactory in the SynonymFilter definition, I get additional term values that I don't want.
e.g. synonyms.txt looks like:
simple syrup,sugar syrup,stock syrup
A document with a value containing 'simple syrup' can now be found when searching for just 'stock'.
So the problem I am trying to address with KeywordTokenizerFactory is to prevent my multi-word synonyms from getting broken down into single words.
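The unwanted match described above can be reproduced with a small toy model. This is only a sketch under simplifying assumptions, not Solr's SynonymFilter: it assumes the synonym entries are split on whitespace and that, with expand="true", the words of every synonym in the group are stacked position by position. The function names are made up for illustration.

```python
# Simplified model of index-time expansion when synonym entries are split
# on whitespace. Not Solr code; token positions are approximated.

def expand(stream, group):
    """Stack the words of every synonym in the group onto the matched positions."""
    positions = [{tok} for tok in stream]
    for phrase in group:
        n = len(phrase)
        for start in range(len(stream) - n + 1):
            if stream[start:start + n] == phrase:
                for other in group:
                    for offset, word in enumerate(other):
                        idx = start + offset
                        while idx >= len(positions):
                            positions.append(set())
                        positions[idx].add(word)
    return positions

def matches_term(positions, term):
    """A single-term query matches if the term sits at any position."""
    return any(term in tokens for tokens in positions)

group = [entry.split() for entry in "simple syrup,sugar syrup,stock syrup".split(",")]
indexed = expand("simple syrup".split(), group)

print(indexed)
print(matches_term(indexed, "stock"))
```

In this model "simple", "sugar" and "stock" all end up at the first position, so a one-word search for "stock" finds a document that only ever contained "simple syrup".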
Thanks
Zac
-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: Sunday, February 05, 2012 8:07 AM
To: solr-user@lucene.apache.org
Subject: Re: Multi word synonyms
I'm not quite sure what you're trying to do with KeywordTokenizerFactory in your SynonymFilter definition, but if I use the defaults, then the all-phrase form works just fine.
So the question is "what problem are you trying to address by using KeywordTokenizerFactory?"
Best
Erick
On Sun, Feb 5, 2012 at 8:21 AM, O. Klein <kl...@octoweb.nl> wrote:
> Your query analyser will tokenize "simple syrup" into "simple" and "syrup"
> and won't match on "simple syrup" in the synonyms.txt
>
> So you have to change the query analyzer into KeywordTokenizerFactory
> as well.
>
> It might be an idea to make a field for synonyms only with this tokenizer
> and another field to search on and use dismax. Never tried this though.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p3717215.html
> Sent from the Solr - User mailing list archive at Nabble.com.
Re: Multi word synonyms
Posted by Erick Erickson <er...@gmail.com>.
I'm not quite sure what you're trying to do with KeywordTokenizerFactory in
your SynonymFilter definition, but if I use the defaults, then the
all-phrase form works just fine.
So the question is "what problem are you trying to address by using
KeywordTokenizerFactory?"
Best
Erick
On Sun, Feb 5, 2012 at 8:21 AM, O. Klein <kl...@octoweb.nl> wrote:
> Your query analyser will tokenize "simple syrup" into "simple" and "syrup"
> and won't match on "simple syrup" in the synonyms.txt
>
> So you have to change the query analyzer into KeywordTokenizerFactory as
> well.
>
> It might be an idea to make a field for synonyms only with this tokenizer and
> another field to search on and use dismax. Never tried this though.
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p3717215.html
> Sent from the Solr - User mailing list archive at Nabble.com.
RE: Multi word synonyms
Posted by Zac Smith <za...@trinkit.com>.
Thanks for the response. This almost worked: I created a new field using the KeywordTokenizerFactory as you suggested. The only problem was that searches only found documents when quotes were used.
E.g.
synonyms.txt set up like this:
simple syrup,sugar syrup,stock syrup
I indexed a document with the value 'simple syrup'. Searches only found the document when using quotes:
e.g.
"simple syrup" or "stock syrup" matched
simple syrup (no quotes) did not match
Here is the field I created:
<fieldType name="synonym_searcher" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt" />
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt" />
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
Any ideas? Also, I am using dismax and solr 3.5.0.
Thanks
Zac
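A likely explanation for the quotes-only behaviour is that the query parser splits unquoted input on whitespace before the field's analyzer runs, so the keyword tokenizer only ever sees one word at a time. The sketch below is a toy model of that assumption, not the dismax parser itself; the indexed set and function names are hypothetical.

```python
import re

# Whole-value tokens, as a keyword-tokenized field with expanded synonyms
# would hold them for a document containing 'simple syrup'.
INDEXED = {"simple syrup", "sugar syrup", "stock syrup"}

def parser_chunks(query):
    """Split on whitespace, but keep double-quoted phrases together."""
    return [quoted or word for quoted, word in re.findall(r'"([^"]*)"|(\S+)', query)]

def any_chunk_matches(query):
    """Each chunk is analyzed on its own and compared to the indexed tokens."""
    return any(chunk.lower() in INDEXED for chunk in parser_chunks(query))

print(parser_chunks('simple syrup'), any_chunk_matches('simple syrup'))
print(parser_chunks('"stock syrup"'), any_chunk_matches('"stock syrup"'))
```

Under this model the unquoted query is looked up as "simple" and "syrup" separately, neither of which equals the single indexed token "simple syrup".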
-----Original Message-----
From: O. Klein [mailto:klein@octoweb.nl]
Sent: Sunday, February 05, 2012 5:22 AM
To: solr-user@lucene.apache.org
Subject: Re: Multi word synonyms
Your query analyser will tokenize "simple syrup" into "simple" and "syrup"
and won't match on "simple syrup" in the synonyms.txt
So you have to change the query analyzer into KeywordTokenizerFactory as well.
It might be an idea to make a field for synonyms only with this tokenizer and another field to search on and use dismax. Never tried this though.
--
View this message in context: http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p3717215.html
Sent from the Solr - User mailing list archive at Nabble.com.
Problem Multi word synonyms in solr 3.4
Posted by Pravin Agrawal <Pr...@persistent.co.in>.
Hi All,
I am trying to use synonyms in Solr 3.4 and am facing the issue below with multi-word synonyms.
I am using the edismax query parser with the following fields in qf and pf:
qf: name^1.2,name_synonym^0.5
pf: phrase_name^3
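For anyone wanting to reproduce the parsed queries shown further down, a request along these lines could be built. This is a sketch with assumptions: the host and path are hypothetical, and the qf list is written space-separated on the assumption that the comma above is just list notation. defType, q, qf, pf and debugQuery are the standard request parameters.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical host and handler path; adjust to the actual Solr instance.
SOLR_SELECT = "http://localhost:8983/solr/select"

def build_debug_url(user_query):
    """Build an edismax request that also returns the parsed query for inspection."""
    params = {
        "defType": "edismax",
        "q": user_query,
        "qf": "name^1.2 name_synonym^0.5",  # assumption: space-separated list
        "pf": "phrase_name^3",
        "debugQuery": "true",
    }
    return SOLR_SELECT + "?" + urlencode(params)

url = build_debug_url("xxx zzz")
print(url)
# Round-trip check that the parameters survive URL encoding.
print(parse_qs(urlsplit(url).query)["qf"])
```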
The analyzer that I am using for name_synonym is as follows:
<fieldType name="text_synonym" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="0" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
With the above configuration, the following types of synonyms work fine:
foobar => foo bar
FnB => foo and bar
aaa,bbb,ccc
However, for the following multi-word synonym, the dismax query is incorrectly formed for the qf field:
xxx zzz, aaa bbb, mmm nnn, aaabbb
The parsedquery_tostring that gets formed for the query aaabbb is as follows
+(name:aaabbb^1.2 | name_synonym:" xxx zzz aaa bbb mmm (nnn aaabbb)"^0.5)~0.5 (phrase_name:" xxx zzz aaa bbb mmm (nnn aaabbb)"~5^3.0)~0.5
I am expecting a query like
+(name:aaabbb^1.2 | ((name_synonym:xxx zzz name_synonym:aaa bbb name_synonym:mmm nnn name_synonym:aaabbb)^0.5))~0.5
Similarly for query xxx zzz I am getting following parsedquery_tostring from dismax
+((name:xxx^1.2 | name_synonym:xxx^0.5 | name:zzz^1.2 | name_synonym:zzz^0.5)~0.5) (phrase_name:"xxx zzz"~5^3.0)~0.5
But I am expecting the following query:
+((name:xxx^1.2 | name_synonym:xxx^0.5 | name:zzz^1.2 | name_synonym:zzz^0.5)~0.5) (phrase_name:"xxx zzz"~5^3.0 | phrase_name:"aaa bbb"~5^3.0 | phrase_name:"mmm nnn"~5^3.0 | phrase_name:"aaabbb"~5^3.0)~0.5
However, that is not the case.
Please let me know if I am missing something or whether this is expected behavior. Also, please let me know what should be done to get my desired output.
Thanks in advance.
Pravin
RE: Multi word synonyms
Posted by Zac Smith <za...@trinkit.com>.
Are you able to explain how I would create another field to fit my scenario?
-----Original Message-----
From: O. Klein [mailto:klein@octoweb.nl]
Sent: Tuesday, February 07, 2012 1:28 PM
To: solr-user@lucene.apache.org
Subject: RE: Multi word synonyms
Well, if you want both multi word and single words I guess you will have to create another field :) Or make queries like you suggested.
--
View this message in context: http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p3724009.html
Sent from the Solr - User mailing list archive at Nabble.com.
RE: Multi word synonyms
Posted by "O. Klein" <kl...@octoweb.nl>.
Well, if you want both multi word and single words I guess you will have to
create another field :) Or make queries like you suggested.
--
View this message in context: http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p3724009.html
Sent from the Solr - User mailing list archive at Nabble.com.
RE: Multi word synonyms
Posted by Zac Smith <za...@trinkit.com>.
It doesn't seem to do it for me. My field type is:
<fieldType name="synonym_searcher" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory" />
</analyzer>
</fieldType>
I am using edismax and solr 3.5 and multi word values can only be matched when using quotes.
-----Original Message-----
From: O. Klein [mailto:klein@octoweb.nl]
Sent: Tuesday, February 07, 2012 12:49 PM
To: solr-user@lucene.apache.org
Subject: RE: Multi word synonyms
Isn't that what autoGeneratePhraseQueries="true" is for?
--
View this message in context: http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p3723886.html
Sent from the Solr - User mailing list archive at Nabble.com.
RE: Multi word synonyms
Posted by "O. Klein" <kl...@octoweb.nl>.
Isn't that what autoGeneratePhraseQueries="true" is for?
--
View this message in context: http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p3723886.html
Sent from the Solr - User mailing list archive at Nabble.com.
RE: Multi word synonyms
Posted by Zac Smith <za...@trinkit.com>.
I suppose I could translate every user query to include the term with quotes.
e.g. if someone searches for stock syrup I send a query like:
q=stock syrup OR "stock syrup"
Seems like a bit of a hack though, is there a better way of doing this?
Zac
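The rewrite described above could be done on the client side before the query is sent. A minimal sketch, assuming a hypothetical helper name and that input already containing quotes should be left alone; whether OR is honored depends on the query parser in use.

```python
def with_phrase_alternative(user_query):
    """Return the query plus a quoted copy when it has more than one word."""
    words = user_query.split()
    if len(words) < 2 or '"' in user_query:
        # Single words and already-quoted input are passed through unchanged.
        return user_query
    phrase = " ".join(words)
    return f'{phrase} OR "{phrase}"'

print(with_phrase_alternative("stock syrup"))
print(with_phrase_alternative("syrup"))
```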
-----Original Message-----
From: Zac Smith
Sent: Sunday, February 05, 2012 7:28 PM
To: solr-user@lucene.apache.org
Subject: RE: Multi word synonyms
Thanks for the response. This almost worked: I created a new field using the KeywordTokenizerFactory as you suggested. The only problem was that searches only found documents when quotes were used.
E.g.
synonyms.txt setup like this:
simple syrup,sugar syrup,stock syrup
I indexed a document with the value 'simple syrup'. Searches only found the document when using quotes:
e.g.
"simple syrup" or "stock syrup" matched
simple syrup (no quotes) did not match
Here is the field I created:
<fieldType name="synonym_searcher" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt" />
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt" />
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
Any ideas? Also, I am using dismax and solr 3.5.0.
Thanks
Zac
-----Original Message-----
From: O. Klein [mailto:klein@octoweb.nl]
Sent: Sunday, February 05, 2012 5:22 AM
To: solr-user@lucene.apache.org
Subject: Re: Multi word synonyms
Your query analyser will tokenize "simple syrup" into "simple" and "syrup"
and won't match on "simple syrup" in the synonyms.txt
So you have to change the query analyzer into KeywordTokenizerFactory as well.
It might be an idea to make a field for synonyms only with this tokenizer and another field to search on and use dismax. Never tried this though.
--
View this message in context: http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p3717215.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Multi word synonyms
Posted by "O. Klein" <kl...@octoweb.nl>.
Your query analyser will tokenize "simple syrup" into "simple" and "syrup"
and won't match on "simple syrup" in the synonyms.txt
So you have to change the query analyzer into KeywordTokenizerFactory as
well.
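The mismatch described above can be illustrated with a toy model. This is only a sketch, not Solr's SynonymFilter: it assumes a synonym rule fires when its token sequence appears contiguously in the token stream, and all function names are made up for illustration.

```python
# Toy model only: this is not Solr's SynonymFilter, just the matching idea.

def whitespace_tokenize(text):
    """Split on whitespace, as a whitespace tokenizer would."""
    return text.split()

def keyword_tokenize(text):
    """Keep the whole input as one token, as a keyword tokenizer would."""
    return [text.strip()]

def load_rules(line, rule_tokenizer):
    """Parse one comma-separated synonyms.txt line into token tuples."""
    return [tuple(rule_tokenizer(entry)) for entry in line.split(",")]

def matching_rules(stream, rules):
    """Return the rules whose token sequence occurs contiguously in the stream."""
    hits = []
    for rule in rules:
        n = len(rule)
        if any(tuple(stream[i:i + n]) == rule for i in range(len(stream) - n + 1)):
            hits.append(rule)
    return hits

line = "simple syrup,sugar syrup,stock syrup"
stream = whitespace_tokenize("simple syrup")  # two tokens: simple, syrup

keyword_rules = load_rules(line, keyword_tokenize)        # one token per entry
whitespace_rules = load_rules(line, whitespace_tokenize)  # two tokens per entry

print(matching_rules(stream, keyword_rules))
print(matching_rules(stream, whitespace_rules))
```

With keyword-tokenized rules, each entry is a single token containing a space, which a whitespace-tokenized stream never produces, so nothing matches. A list such as syrup,sugar syrup,stock syrup still works in this model because the one-word entry "syrup" does appear in the stream.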
It might be an idea to make a field for synonyms only with this tokenizer and
another field to search on and use dismax. Never tried this though.
--
View this message in context: http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p3717215.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Multi word synonyms
Posted by Jack Krupansky <ja...@basetechnology.com>.
Yes, it is sad but true that multi-word synonym processing does not "work
right out of the box" for all common interesting cases, although it does do
semi-well for index-time processing, but even there, matching synonyms of
varying lengths within larger phrases will sometimes work but sometimes not
unless you allow some amount of phrase slop.
The LucidWorks Search query parser does handle query-time synonyms
reasonably well, but using some complicated, ad hoc processing that is not
easy to replicate in your average application that doesn't have that extra,
proprietary "magic". If you want robust, query-time processing of synonyms
(which is a lot more flexible than index-time processing), you would need to
replicate some form of that logic.
A couple of months ago I did propose that we design and implement a set of
interfaces to support robust handling of multi-word synonyms at query time,
but there was... NO interest expressed by any developers. Since then, the
Lucene and Solr query parsers have diverged even further, making the support
for such an interface even more problematic - unless we just bite the bullet
and say that the Lucene query parser is a hopeless dinosaur and leave it
behind in the dust as a remnant of "the early days" of Lucene and Solr.
Also, the fact that we still have three distinct main Solr query parsers
(SolrQueryParser, a derivative of the classic Lucene query parser; dismax;
and edismax) makes this task rather problematic, and the number of other
"niche" query parsers that could use better synonym processing makes it a
very daunting one. If we ever do
integrate the "big three" (and write the Lucene query parser), then maybe
the time will be ripe to revisit robust query-time multi-word synonym
support.
(Or, maybe LucidWorks will finally donate their query parser!)
-- Jack Krupansky
-----Original Message-----
From: Bernd Fehling
Sent: Thursday, November 29, 2012 8:19 AM
To: solr-user@lucene.apache.org
Subject: Re: Multi word synonyms
There are also other solutions:
Multi-word synonym filter (synonym expansion)
https://issues.apache.org/jira/browse/LUCENE-4499
Since Solr 3.4 I have had my own solution, which might become obsolete once
LUCENE-4499 is in a released version.
http://www.ub.uni-bielefeld.de/~befehl/base/solr/eurovoc.html
On 29.11.2012 13:44, O. Klein wrote:
> Found an article about the issue of multi word synonyms
> <http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/> .
>
> Not sure it's the solution I'm looking for, but it may be for someone
> else.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p4023220.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
Re: Multi word synonyms
Posted by Bernd Fehling <be...@uni-bielefeld.de>.
There are also other solutions:
Multi-word synonym filter (synonym expansion)
https://issues.apache.org/jira/browse/LUCENE-4499
Since Solr 3.4 I have had my own solution, which might become obsolete once
LUCENE-4499 is in a released version.
http://www.ub.uni-bielefeld.de/~befehl/base/solr/eurovoc.html
On 29.11.2012 13:44, O. Klein wrote:
> Found an article about the issue of multi word synonyms
> <http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/> .
>
> Not sure it's the solution I'm looking for, but it may be for someone else.
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p4023220.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
Re: Multi word synonyms
Posted by "O. Klein" <kl...@octoweb.nl>.
Found an article about the issue of multi word synonyms
<http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/> .
Not sure it's the solution I'm looking for, but it may be for someone else.
--
View this message in context: http://lucene.472066.n3.nabble.com/Multi-word-synonyms-tp3716292p4023220.html
Sent from the Solr - User mailing list archive at Nabble.com.