Posted to solr-user@lucene.apache.org by Nick D <nd...@gmail.com> on 2018/02/05 23:58:04 UTC
Min-should-match and Multi-word synonyms unexpected result
I have run into an issue with multi-word synonyms and a min-should-match
(MM) of anything other than `0`, *Solr version 6.6.0*.
Here is my example query, first with mm set to zero and the second with a
non-zero value:
With MM set to 0
select?fl=*&indent=on&wt=json&debug=ALL&q=EIB&qf=ngs_title%20ngs_field_description&sow=false&mm=0
which parses to:
parsedquery_toString":"+(((+ngs_field_description:enterprise
+ngs_field_description:interface +ngs_field_description:builder)
ngs_field_description:eib) | ((+ngs_title:enterprise
+ngs_title:interface +ngs_title:builder) ngs_title:eib))~0.01"
and using my default MM (2<-35%)
select?fl=*&indent=on&wt=json&debug=ALL&q=EIB&qf=ngs_title%20ngs_field_description&sow=false
which parses to:
((((+ngs_field_description:enterprise +ngs_field_description:interface
+ngs_field_description:builder) ngs_field_description:eib)~2) |
(((+ngs_title:enterprise +ngs_title:interface +ngs_title:builder)
ngs_title:eib)~2))
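For anyone wanting to reproduce the comparison, here is a minimal sketch that builds the two request strings so that mm is the only thing that changes. The helper name select_url is made up for illustration; the parameters are exactly the ones in the queries above.

```python
from urllib.parse import quote, urlencode


def select_url(mm=None):
    """Build the select request used above; mm is omitted when None.

    select_url is a hypothetical helper, not part of any Solr client.
    """
    params = {
        "fl": "*",
        "indent": "on",
        "wt": "json",
        "debug": "ALL",
        "q": "EIB",
        "qf": "ngs_title ngs_field_description",
        "sow": "false",
    }
    if mm is not None:
        params["mm"] = mm
    # quote_via=quote encodes the space in qf as %20; safe="*" keeps fl=* readable.
    return "select?" + urlencode(params, safe="*", quote_via=quote)


print(select_url("0"))  # explicit mm=0
print(select_url())     # falls back to the configured default mm
```

Appending the result to the collection URL and diffing the two debug outputs makes it easy to see that only the ~2 differs.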
My synonym here is:
EIB, Enterprise Interface Builder
For my two documents, the field ngs_title has the values "EIB" (Doc 1)
and "enterprise interface builder" (Doc 2).
Doc 1 is returned by both queries because EIB matches directly. Doc 2 is
not returned when MM is non-zero, even though EIB and Enterprise Interface
Builder are defined as equivalent synonyms. In the parsed query I can see
the ~2 being applied for the MM, but my expectation was that MM would be
satisfied via the synonyms, given that I am not actually searching for a
phrase.
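To show where the ~2 comes from, here is a rough sketch of how an mm spec such as 2<-35% resolves to a clause count, following the rules described for the mm parameter in the Solr reference guide. The function name min_should_match is hypothetical and this is not Solr's actual code. In the parsed query each field's group holds two optional clauses (the expanded synonym and eib), so the spec requires both of them.

```python
import re


def min_should_match(optional_clauses: int, spec: str) -> int:
    """Resolve an mm spec to a required clause count (illustrative sketch only)."""
    result = optional_clauses
    spec = re.sub(r"\s*<\s*", "<", spec.strip())
    if "<" in spec:
        # Conditional specs: "N<X" means X applies only when there are more than N clauses.
        for part in spec.split():
            upper, _, sub = part.partition("<")
            if optional_clauses <= int(upper):
                return result
            result = min_should_match(optional_clauses, sub)
        return result
    if spec.endswith("%"):
        # Percentages are truncated toward zero; negative means "this many may be missing".
        calc = optional_clauses * int(spec[:-1]) / 100
        result = optional_clauses + int(calc) if calc < 0 else int(calc)
    else:
        calc = int(spec)
        result = optional_clauses + calc if calc < 0 else calc
    return max(0, min(optional_clauses, result))


print(min_should_match(2, "2<-35%"))  # 2: both optional clauses are required
print(min_should_match(2, "0"))       # 0: either clause alone is enough
```

With two optional clauses, a document that matches only the expanded synonym or only eib cannot satisfy a requirement of 2.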
I couldn't find much on the relationship between the two, apart from some
of the things Doug Turnbull linked to in another solr-user question and
this blog post that mentions weirdness around MM and multi-word synonyms:
https://lucidworks.com/2017/04/18/multi-word-synonyms-solr-adds-query-time-support/
http://opensourceconnections.com/blog/2013/10/27/why-is-multi-term-synonyms-so-hard-in-solr/
Also looked through the comments here,
https://issues.apache.org/jira/browse/SOLR-9185, but at first glance didn't
see anything that jumped out at me.
Here is the field definition for the ngs_* fields:
<fieldType name="ngram" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory"
mapping="mapping-ISOLatin1Accent.txt"/>
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="([()])" replacement=""/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory"
pattern="(^[^0-9A-Za-z_]+)|([^0-9A-Za-z_]+$)" replacement=""/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
maxGramSize="50"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory"
synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
I am not sure whether MM can no longer be used for this type of query or
whether I have set something up incorrectly. Any help would be greatly
appreciated.
Nick
Re: Min-should-match and Multi-word synonyms unexpected result
Posted by Nick D <nd...@gmail.com>.
Thanks Steve,
I'll test out that version.
Nick
On Feb 6, 2018 6:23 AM, "Steve Rowe" <sa...@gmail.com> wrote:
> Hi Nick,
>
> I think this was fixed by https://issues.apache.org/jira/browse/LUCENE-7878
> in Solr 6.6.1.
>
> --
> Steve
> www.lucidworks.com
>
Re: Min-should-match and Multi-word synonyms unexpected result
Posted by Steve Rowe <sa...@gmail.com>.
Hi Nick,
I think this was fixed by https://issues.apache.org/jira/browse/LUCENE-7878 in Solr 6.6.1.
--
Steve
www.lucidworks.com