Posted to dev@lucene.apache.org by Andrea Gazzarini <a....@sease.io> on 2018/07/26 07:04:43 UTC

SynonymGraphFilter followed by StopFilter

Hi,
I have the following field type definition:

<fieldtype name="text" class="solr.TextField" autoGeneratePhraseQueries="true">
     <analyzer type="index">
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
     <analyzer type="query">
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="false" expand="true"/>
         <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="false"/>
     </analyzer>
</fieldtype>

Where synonyms and stopwords are defined as follows:

synonyms = out of warranty,oow
stopwords = of

Running the following query:

q=my tv went out *of* warranty something *of*

I get wrong results, with the following explain:

title:my title:tv title:went (title:oow *PhraseQuery(title:"out ? warranty something"))*

That is, the synonym is correctly detected and I can see that the graph 
information is correctly reported in the positionLength, but it seems to 
be wrongly interpreted by the QueryParser.
I guess the reason is the removal of "of" by the StopFilter, which

  * removes the "of" term within the phrase (I wouldn't want that)
  * creates a "hole" in the span defined by the "oow" term, which has
    been marked as a synonym with a positionLength = 3, therefore
    including the next available term (something).

I tried changing the StopFilter so that it ignores stopwords that are 
marked as SYNONYM or that are part of a previous synonym span, and it 
works: it correctly produces the following query:

title:my title:tv title:went *(title:oow PhraseQuery(title:"out of warranty"))* title:something
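
A minimal sketch of the "part of a previous synonym span" rule, written as a Python model and not as the actual StopFilter change: `Token` and `graph_aware_stop_filter` are hypothetical names, and the sketch covers only the span check, not the SYNONYM type check.

```python
from typing import NamedTuple

class Token(NamedTuple):
    term: str
    pos_inc: int  # position increment relative to the previous token
    pos_len: int  # how many positions the token spans

def graph_aware_stop_filter(tokens, stopwords):
    """Drop stopwords unless they start inside the span of an earlier multi-position token."""
    kept, skipped, pos, span_end = [], 0, -1, 0
    for tok in tokens:
        pos += tok.pos_inc
        if tok.pos_len > 1:
            # Remember how far the widest span seen so far reaches.
            span_end = max(span_end, pos + tok.pos_len)
        if tok.term in stopwords and pos >= span_end:
            skipped += tok.pos_inc  # outside any span: remove and leave a hole
        else:
            kept.append(tok._replace(pos_inc=tok.pos_inc + skipped))
            skipped = 0
    return kept

# Assumed synonym filter output for "my tv went out of warranty something of"
graph = [
    Token("my", 1, 1), Token("tv", 1, 1), Token("went", 1, 1),
    Token("oow", 1, 3), Token("out", 0, 1), Token("of", 1, 1),
    Token("warranty", 1, 1), Token("something", 1, 1), Token("of", 1, 1),
]
kept = graph_aware_stop_filter(graph, {"of"})
print([t.term for t in kept])
```

The "of" inside the "oow" span survives and the trailing "of" is removed. Keeping "of" in the query phrase only helps if "of" is also present in the index, which is the case for the index analyzer shown above because it has no StopFilter.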

So I'd like to ask your opinion about this. Am I missing something? Do 
you think it's better to open a JIRA issue? If the solution is a 
graph-aware stop filter, do you think it's better to change the existing 
filter or to subclass it?

Best,
Andrea

Re: SynonymGraphFilter followed by StopFilter

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Thu, Jul 26, 2018 at 10:25 PM, Michael Sokolov <ms...@gmail.com>
wrote:

>  > In general I’d avoid index-time synonyms in lucene because synonyms can
> create graphs (eg if a single term gets expanded to several terms), and we
> can’t index graphs correctly.
>
> I wonder what it would take to address this. I guess the blast radius of
> adding a token "width" could be pretty large. Is there an issue or any past
> discussion about that?
>

I think some ideas have been mentioned on past issues, e.g. using payloads
to hold the position length should be workable (with a custom Query)
without any source code changes to Lucene, but I don't know of anyone
building a prototype.
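
To make the payload idea concrete, here is a rough Python model of it, not a Lucene prototype: `encode_width`, `decode_width`, `is_adjacent` and the `postings` layout are all hypothetical, and the 2-byte encoding is an arbitrary choice. The point is only that a custom query could read the position length back from the payload when deciding whether two terms are adjacent.

```python
import struct

def encode_width(pos_len):
    """Pack a position length into a 2-byte big-endian payload."""
    return struct.pack(">H", pos_len)

def decode_width(payload):
    """Read the position length back; tokens without a payload span one position."""
    return struct.unpack(">H", payload)[0] if payload else 1

def is_adjacent(first, second):
    """True when `second` starts exactly where `first` ends."""
    pos, payload = first
    return second[0] == pos + decode_width(payload)

# Hypothetical postings for "went out of warranty something" with "oow"
# injected at index time: term -> (position, payload)
postings = {
    "went": (0, None),
    "oow": (1, encode_width(3)),
    "out": (1, None), "of": (2, None), "warranty": (3, None),
    "something": (4, None),
}

print(is_adjacent(postings["oow"], postings["something"]))  # width-aware check
print(is_adjacent(postings["oow"], postings["of"]))         # "of" lies inside the span
```

With the width available, "oow something" is recognised as a phrase and "oow of" is not; with positions alone the two answers would be reversed.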

Mike McCandless

http://blog.mikemccandless.com

Re: SynonymGraphFilter followed by StopFilter

Posted by Robert Muir <rc...@gmail.com>.

No Solr patches are necessary: SynonymQuery fixed that IDF issue 3 years ago.
The advice on this thread is just extremely outdated.
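
A small numeric illustration of the IDF point, as a Python sketch and not Lucene's scoring code. It assumes a BM25-style idf and assumes that the synonyms are scored with one shared document frequency (here the largest one in the set); the terms, document frequencies and collection size are made up.

```python
import math

def bm25_idf(doc_freq, doc_count):
    """BM25-style inverse document frequency."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

doc_count = 1_000_000                            # made-up collection size
doc_freq = {"tv": 20_000, "television": 50}      # made-up document frequencies

# Separate term queries: each synonym gets its own idf, so the rare one dominates.
separate = {term: bm25_idf(df, doc_count) for term, df in doc_freq.items()}

# Synonym-style scoring (assumption): one shared idf for the whole set,
# computed from the largest document frequency among the synonyms.
shared = bm25_idf(max(doc_freq.values()), doc_count)

print(separate)
print(shared)
```

With separate idf values the rare variant scores much higher than the common one; with a shared value both variants contribute on the same scale.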

On Fri, Jul 27, 2018 at 7:08 AM, Alessandro Benedetti <a....@sease.io>
wrote:

> Hi all,
> I just want to add that
> "With synonyms at query time, you’ll see different idf for terms in the
> synonym set, with the rare variant scoring higher. That is probably the
> opposite of what is expected."
> should be solved by : https://issues.apache.org/jira/browse/SOLR-11662
>
> At least that feature brings flexibility in.
>
> Cheers
>
> --------------------------
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> www.sease.io

Re: SynonymGraphFilter followed by StopFilter

Posted by Alessandro Benedetti <a....@sease.io>.

Hi all,
I just want to add that
"With synonyms at query time, you’ll see different idf for terms in the
synonym set, with the rare variant scoring higher. That is probably the
opposite of what is expected."
should be solved by : https://issues.apache.org/jira/browse/SOLR-11662

At least that feature brings flexibility in.

Cheers

--------------------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io

On Fri, Jul 27, 2018 at 3:25 AM, Michael Sokolov <ms...@gmail.com> wrote:

>  > In general I’d avoid index-time synonyms in lucene because synonyms can
> create graphs (eg if a single term gets expanded to several terms), and we
> can’t index graphs correctly.
>
> I wonder what it would take to address this. I guess the blast radius of
> adding a token "width" could be pretty large. Is there an issue or any past
> discussion about that?

Re: SynonymGraphFilter followed by StopFilter

Posted by Michael Sokolov <ms...@gmail.com>.

 > In general I’d avoid index-time synonyms in lucene because synonyms can
create graphs (eg if a single term gets expanded to several terms), and we
can’t index graphs correctly.

I wonder what it would take to address this. I guess the blast radius of
adding a token "width" could be pretty large. Is there an issue or any past
discussion about that?

On Thu, Jul 26, 2018 at 11:42 AM Andrea Gazzarini <a....@sease.io>
wrote:

> Hi Walter,
> many thanks for the response and without any constraint at all, I would
> agree with you. From your message I clearly understand your experience is
> greater than mine. My 2 cents inline below:
>
> > Move the synonym filter to the index analyzer chain. That provides
> better performance and avoids some surprising relevance behavior. With
> synonyms at query time, you’ll see different idf for terms in the synonym
> set, with the rare variant scoring higher. That is probably the opposite of
> what is expected.
>
> Unfortunately moving the synonym filter to the index analyzer is not an
> option: the project where I'm working on has a huge index and the synonyms
> list is something that (at least in this stage) frequently changes;
> re-index everything from scratch each time a change occurs is a big
> problem. On the other side, the IDF issue you mention doesn't produce so
> many unwanted effect, at least until now. But I got the point, thanks for
> the hint.
>
> > Also, phrase synonyms just don’t work at query time because the terms
> are parsed into individual tokens by the query parser, not the tokenizer.
> Here I dont' get you: using the SynonymGraph Filter + SplitOnWhiteSpace =
> false + AutoGeneratePhraseQueries I get the synonym phrasing correctly
> working (see the first example in my email).
>
> > Don’t use stop words. Just remove that line. Removing stop words is a
> performance and space hack that was useful in the 1960’s, but causes
> problems now. I’ve never used stop word removal and I started in search
> with Infoseek in 1996. Stop word removal is like a binary idf, ignoring
> common words. Since we have idf, we can give a lower score to common words
> and keep them in the index.
>
> And this is, as I see, something which animated long discussions around
> using / avoiding stopwords. I will check your suggestion, what it means to
> apply that approach to my project, but in meantime I think, also looking at
> the JIRA Alan pointed in his answer, the issue is there, and it's real; I
> mean, it is something that it doesn't work as expected (my use case, as far
> as I understand, is just an example because the thing is broader and it is
> related to the FilteredTokenFilter)
>
> Thanks again,
> Andrea

Re: SynonymGraphFilter followed by StopFilter

Posted by Andrea Gazzarini <a....@sease.io>.

Hi Walter,
many thanks for the response; without any constraints, I would agree 
with you. From your message it is clear that your experience is greater 
than mine. My 2 cents inline below:

 > Move the synonym filter to the index analyzer chain. That provides 
better performance and avoids some surprising relevance behavior. With 
synonyms at query time, you’ll see different idf for terms in the 
synonym set, with the rare variant scoring higher. That is probably the 
opposite of what is expected.

Unfortunately, moving the synonym filter to the index analyzer is not an 
option: the project I'm working on has a huge index, and the synonym 
list changes frequently (at least at this stage); re-indexing everything 
from scratch each time a change occurs is a big problem. On the other 
hand, the IDF issue you mention hasn't produced many unwanted effects, 
at least until now. But I got the point, thanks for the hint.

 > Also, phrase synonyms just don’t work at query time because the terms 
are parsed into individual tokens by the query parser, not the tokenizer.

Here I don't follow you: using the SynonymGraphFilter + SplitOnWhiteSpace 
= false + AutoGeneratePhraseQueries, I get synonym phrasing working 
correctly (see the first example in my email).
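
The difference between the two positions can be shown with a toy Python model, which is not the Solr query parser: `matched_synonyms` and `analyze` are hypothetical helpers, and the `split_on_whitespace` flag only imitates the idea of handing the analyzer one word at a time versus the whole query text.

```python
def matched_synonyms(tokens, rules):
    """Return the synonym of every rule whose phrase occurs in the token list."""
    hits = []
    for phrase, synonym in rules.items():
        words = phrase.split()
        for i in range(len(tokens) - len(words) + 1):
            if tokens[i:i + len(words)] == words:
                hits.append(synonym)
    return hits

def analyze(query, rules, split_on_whitespace):
    """Run the synonym matcher per whitespace chunk, or once over the whole query."""
    chunks = query.split() if split_on_whitespace else [query]
    hits = []
    for chunk in chunks:
        hits.extend(matched_synonyms(chunk.lower().split(), rules))
    return hits

rules = {"out of warranty": "oow"}
query = "my tv went out of warranty"

print(analyze(query, rules, split_on_whitespace=True))   # analyzer sees one word at a time
print(analyze(query, rules, split_on_whitespace=False))  # analyzer sees the whole query
```

When the text is split before analysis, the three-word rule can never match; when the whole query reaches the analyzer, it does.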

 > Don’t use stop words. Just remove that line. Removing stop words is a 
performance and space hack that was useful in the 1960’s, but causes 
problems now. I’ve never used stop word removal and I started in search 
with Infoseek in 1996. Stop word removal is like a binary idf, ignoring 
common words. Since we have idf, we can give a lower score to common 
words and keep them in the index.

And this, as I see it, is a topic that has animated long discussions 
about using or avoiding stopwords. I will look into your suggestion and 
what it would mean to apply that approach to my project. In the 
meantime, also looking at the JIRA issues Alan pointed to in his answer, 
I think the issue is there and it's real; I mean, something doesn't work 
as expected (my use case, as far as I understand, is just one example, 
because the problem is broader and is related to FilteringTokenFilter).

Thanks again,
Andrea

On 26/07/18 16:59, Walter Underwood wrote:
> Move the synonym filter to the index analyzer chain. That provides 
> better performance and avoids some surprising relevance behavior. With 
> synonyms at query time, you’ll see different idf for terms in the 
> synonym set, with the rare variant scoring higher. That is probably 
> the opposite of what is expected.
>
> Also, phrase synonyms just don’t work at query time because the terms 
> are parsed into individual tokens by the query parser, not the tokenizer.
>
> Don’t use stop words. Just remove that line. Removing stop words is a 
> performance and space hack that was useful in the 1960’s, but causes 
> problems now. I’ve never used stop word removal and I started in 
> search with Infoseek in 1996. Stop word removal is like a binary idf, 
> ignoring common words. Since we have idf, we can give a lower score to 
> common words and keep them in the index.
>
> Do those two things and it should work as you expect.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>> On Jul 26, 2018, at 3:23 AM, Andrea Gazzarini <a.gazzarini@sease.io> wrote:
>>
>> Hi Alan, thanks for the response and thank you very much for the pointers
>>
>>
>> On 26/07/18 12:16, Alan Woodward wrote:
>>> Hi Andrea,
>>>
>>> This is a long-standing issue: see 
>>> https://issues.apache.org/jira/browse/LUCENE-4065 and 
>>> https://issues.apache.org/jira/browse/LUCENE-8250 for discussion.  I 
>>> don’t think we’ve reached a consensus on how to fix it yet, but more 
>>> examples are good.
>>>
>>> Unfortunately I don’t think changing the StopFilter to ignore 
>>> SYNONYM tokens will work, because then you’ll generate queries that 
>>> always fail - they’ll search for ‘of’ in the middle of the phrase, 
>>> but ‘of’ never gets indexed because it’s removed by the StopFilter 
>>> at index time.
>>>
>>> - Alan
>>>
>>>> On 26 Jul 2018, at 08:04, Andrea Gazzarini <a.gazzarini@sease.io> wrote:
>>>>
>>>> Hi,
>>>> I have the following field type definition:
>>>> <fieldtype name="text" class="solr.TextField" 
>>>> autoGeneratePhraseQueries="true">
>>>>      <analyzer type="index">
>>>>          <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>          <filter class="solr.LowerCaseFilterFactory"/>
>>>>      </analyzer>
>>>>      <analyzer type="query">
>>>>          <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>          <filter class="solr.LowerCaseFilterFactory"/>
>>>>          <filter class="solr.SynonymGraphFilterFactory" 
>>>> synonyms="synonyms.txt" ignoreCase="false" expand="true"/>
>>>>          <filter class="solr.StopFilterFactory" words="stopwords.txt" 
>>>> ignoreCase="false"/>
>>>>      </analyzer>
>>>> </fieldtype>
>>>> Where synonyms and stopwords are defined as follows:
>>>>
>>>> synonyms = out of warranty,oow
>>>> stopwords = of
>>>>
>>>> Running the following query:
>>>>
>>>> q=my tv went out *of* warranty something *of*
>>>>
>>>> I get wrong results, with the following explain:
>>>>
>>>> title:my title:tv title:went (title:oow *PhraseQuery(title:"out ? 
>>>> warranty something"))*
>>>>
>>>> That is, the synonyms is correctly detected, I see the graph 
>>>> information are correctly reported in the positionLength, it seems 
>>>> they are wrongly interpreted by the QueryParser.
>>>> I guess the reason is the "of" removal operated by the StopFilter, 
>>>> which
>>>>
>>>>   * removes the "of" term within the phrase (I wouldn't want that)
>>>>   * creates a "hole" in the span defined by the "oow" term, which
>>>>     has been marked as a synonym with a positionLength = 3,
>>>>     therefore including the next available term (something).
>>>>
>>>> I tried to change the StopFilter in order to ignore stopwords that 
>>>> are marked as SYNONYM or that are part of a previous synonym span, 
>>>> and it works: it correctly produces the following query:
>>>>
>>>> title:my title:tv title:went *(title:oow PhraseQuery(title:"out of 
>>>> warranty"))* title:something
>>>>
>>>> So I'd like to ask your opinion about this. Am I missing something? 
>>>> Do you think it's better to open a JIRA issue? If the solution is a 
>>>> graph aware stop filter, do you think it's better to change the 
>>>> existing filter or to subclass it?
>>>>
>>>> Best,
>>>> Andrea
>>>>
>>>>
>>>
>>
>

Re: SynonymGraphFilter followed by StopFilter

Posted by Alan Woodward <ro...@gmail.com>.

> Also, phrase synonyms just don’t work at query time because the terms are parsed into individual tokens by the query parser, not the tokenizer.

This is no longer the case. In general I’d avoid index-time synonyms in Lucene, because synonyms can create graphs (e.g. if a single term gets expanded to several terms) and we can’t index graphs correctly.

I’d agree that removing stop words is generally unnecessary, but there are other reasons that you’d want to filter out terms from the TokenStream, and we should be able to handle those situations correctly.

> On 26 Jul 2018, at 15:59, Walter Underwood <wu...@wunderwood.org> wrote:
> 
> Move the synonym filter to the index analyzer chain. That provides better performance and avoids some surprising relevance behavior. With synonyms at query time, you’ll see different idf for terms in the synonym set, with the rare variant scoring higher. That is probably the opposite of what is expected.
> 
> Also, phrase synonyms just don’t work at query time because the terms are parsed into individual tokens by the query parser, not the tokenizer.
> 
> Don’t use stop words. Just remove that line. Removing stop words is a performance and space hack that was useful in the 1960’s, but causes problems now. I’ve never used stop word removal and I started in search with Infoseek in 1996. Stop word removal is like a binary idf, ignoring common words. Since we have idf, we can give a lower score to common words and keep them in the index. 
> 
> Do those two things and it should work as you expect. 
> 
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Jul 26, 2018, at 3:23 AM, Andrea Gazzarini <a.gazzarini@sease.io> wrote:
>> 
>> Hi Alan, thanks for the response and thank you very much for the pointers
>> 
>> On 26/07/18 12:16, Alan Woodward wrote:
>>> Hi Andrea,
>>> 
>>> This is a long-standing issue: see https://issues.apache.org/jira/browse/LUCENE-4065 and https://issues.apache.org/jira/browse/LUCENE-8250 for discussion.  I don’t think we’ve reached a consensus on how to fix it yet, but more examples are good.
>>> 
>>> Unfortunately I don’t think changing the StopFilter to ignore SYNONYM tokens will work, because then you’ll generate queries that always fail - they’ll search for ‘of’ in the middle of the phrase, but ‘of’ never gets indexed because it’s removed by the StopFilter at index time.
>>> 
>>> - Alan
>>> 
>>>> On 26 Jul 2018, at 08:04, Andrea Gazzarini <a.gazzarini@sease.io> wrote:
>>>> 
>>>> Hi, 
>>>> I have the following field type definition: 
>>>> <fieldtype name="text" class="solr.TextField" autoGeneratePhraseQueries="true">
>>>>     <analyzer type="index">
>>>>         <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>>     </analyzer>
>>>>     <analyzer type="query">
>>>>         <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>>         <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="false" expand="true"/>
>>>>         <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="false"/>
>>>>     </analyzer>
>>>> </fieldtype>
>>>> Where synonyms and stopwords are defined as follows: 
>>>> 
>>>> synonyms = out of warranty,oow
>>>> stopwords = of
>>>> 
>>>> Running the following query:
>>>> 
>>>> q=my tv went out of warranty something of
>>>> 
>>>> I get wrong results, with the following explain: 
>>>> 
>>>> title:my title:tv title:went (title:oow PhraseQuery(title:"out ? warranty something"))
>>>> 
>>>> That is, the synonyms is correctly detected, I see the graph information are correctly reported in the positionLength, it seems they are wrongly interpreted by the QueryParser. 
>>>> I guess the reason is the "of" removal operated by the StopFilter, which
>>>>
>>>>   * removes the "of" term within the phrase (I wouldn't want that)
>>>>   * creates a "hole" in the span defined by the "oow" term, which has been marked as a synonym with a positionLength = 3, therefore including the next available term (something).
>>>>
>>>> I tried to change the StopFilter in order to ignore stopwords that are marked as SYNONYM or that are part of a previous synonym span, and it works: it correctly produces the following query: 
>>>> 
>>>> title:my title:tv title:went (title:oow PhraseQuery(title:"out of warranty")) title:something
>>>> 
>>>> So I'd like to ask your opinion about this. Am I missing something? Do you think it's better to open a JIRA issue? If the solution is a graph aware stop filter, do you think it's better to change the existing filter or to subclass it?
>>>> 
>>>> Best, 
>>>> Andrea
>>>> 
>>>> 
>>> 
>> 
>

Re: SynonymGraphFilter followed by StopFilter

Posted by Walter Underwood <wu...@wunderwood.org>.

Move the synonym filter to the index analyzer chain. That provides better performance and avoids some surprising relevance behavior. With synonyms at query time, you’ll see different idf for terms in the synonym set, with the rare variant scoring higher. That is probably the opposite of what is expected.

Also, phrase synonyms just don’t work at query time because the terms are parsed into individual tokens by the query parser, not the tokenizer.

Don’t use stop words. Just remove that line. Removing stop words is a performance and space hack that was useful in the 1960s, but causes problems now. I’ve never used stop word removal and I started in search with Infoseek in 1996. Stop word removal is like a binary idf, ignoring common words. Since we have idf, we can give a lower score to common words and keep them in the index.

Do those two things and it should work as you expect. 
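
To illustrate the idf effect mentioned above, here is a minimal Python sketch. It assumes a BM25-style idf, ln(1 + (N - n + 0.5) / (n + 0.5)), and that each synonym variant is scored as its own clause. The document counts are invented for illustration and do not come from any real index.

```python
import math


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """BM25-style idf: the fewer documents contain a term, the larger the value."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical counts: 10,000 documents in the index,
# "warranty" occurs in 500 of them, the abbreviation "oow" in only 5.
DOC_COUNT = 10_000
idf_common = bm25_idf(500, DOC_COUNT)
idf_rare = bm25_idf(5, DOC_COUNT)

print(f"idf(warranty) = {idf_common:.3f}")
print(f"idf(oow)      = {idf_rare:.3f}")
```

Under these assumptions the rare abbreviation gets the larger idf, so a document matching only "oow" can outscore one matching the common spelling.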

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 26, 2018, at 3:23 AM, Andrea Gazzarini <a....@sease.io> wrote:
> 
> Hi Alan, thanks for the response and thank you very much for the pointers
> 
> On 26/07/18 12:16, Alan Woodward wrote:
>> Hi Andrea,
>> 
>> This is a long-standing issue: see https://issues.apache.org/jira/browse/LUCENE-4065 and https://issues.apache.org/jira/browse/LUCENE-8250 for discussion.  I don’t think we’ve reached a consensus on how to fix it yet, but more examples are good.
>> 
>> Unfortunately I don’t think changing the StopFilter to ignore SYNONYM tokens will work, because then you’ll generate queries that always fail - they’ll search for ‘of’ in the middle of the phrase, but ‘of’ never gets indexed because it’s removed by the StopFilter at index time.
>> 
>> - Alan
>> 
>>> On 26 Jul 2018, at 08:04, Andrea Gazzarini <a.gazzarini@sease.io> wrote:
>>> 
>>> Hi, 
>>> I have the following field type definition: 
>>> <fieldtype name="text" class="solr.TextField" autoGeneratePhraseQueries="true">
>>>     <analyzer type="index">
>>>         <tokenizer class="solr.StandardTokenizerFactory"/>
>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>     </analyzer>
>>>     <analyzer type="query">
>>>         <tokenizer class="solr.StandardTokenizerFactory"/>
>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>         <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="false" expand="true"/>
>>>         <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="false"/>
>>>     </analyzer>
>>> </fieldtype>
>>> Where synonyms and stopwords are defined as follows: 
>>> 
>>> synonyms = out of warranty,oow
>>> stopwords = of
>>> 
>>> Running the following query:
>>> 
>>> q=my tv went out of warranty something of
>>> 
>>> I get wrong results, with the following explain: 
>>> 
>>> title:my title:tv title:went (title:oow PhraseQuery(title:"out ? warranty something"))
>>> 
>>> That is, the synonyms is correctly detected, I see the graph information are correctly reported in the positionLength, it seems they are wrongly interpreted by the QueryParser. 
>>> I guess the reason is the "of" removal operated by the StopFilter, which
>>>
>>>   * removes the "of" term within the phrase (I wouldn't want that)
>>>   * creates a "hole" in the span defined by the "oow" term, which has been marked as a synonym with a positionLength = 3, therefore including the next available term (something).
>>>
>>> I tried to change the StopFilter in order to ignore stopwords that are marked as SYNONYM or that are part of a previous synonym span, and it works: it correctly produces the following query: 
>>> 
>>> title:my title:tv title:went (title:oow PhraseQuery(title:"out of warranty")) title:something
>>> 
>>> So I'd like to ask your opinion about this. Am I missing something? Do you think it's better to open a JIRA issue? If the solution is a graph aware stop filter, do you think it's better to change the existing filter or to subclass it?
>>> 
>>> Best, 
>>> Andrea
>>> 
>>> 
>> 
>

Re: SynonymGraphFilter followed by StopFilter

Posted by Andrea Gazzarini <a....@sease.io>.

Hi Alan, thanks for the response, and thank you very much for the pointers.


On 26/07/18 12:16, Alan Woodward wrote:
> Hi Andrea,
>
> This is a long-standing issue: see 
> https://issues.apache.org/jira/browse/LUCENE-4065 and 
> https://issues.apache.org/jira/browse/LUCENE-8250 for discussion.  I 
> don’t think we’ve reached a consensus on how to fix it yet, but more 
> examples are good.
>
> Unfortunately I don’t think changing the StopFilter to ignore SYNONYM 
> tokens will work, because then you’ll generate queries that always 
> fail - they’ll search for ‘of’ in the middle of the phrase, but ‘of’ 
> never gets indexed because it’s removed by the StopFilter at index time.
>
> - Alan
>
>> On 26 Jul 2018, at 08:04, Andrea Gazzarini <a.gazzarini@sease.io> wrote:
>>
>> Hi,
>> I have the following field type definition:
>> <fieldtype name="text" class="solr.TextField" 
>> autoGeneratePhraseQueries="true">
>>      <analyzer type="index">
>>          <tokenizer class="solr.StandardTokenizerFactory"/>
>>          <filter class="solr.LowerCaseFilterFactory"/>
>>      </analyzer>
>>      <analyzer type="query">
>>          <tokenizer class="solr.StandardTokenizerFactory"/>
>>          <filter class="solr.LowerCaseFilterFactory"/>
>>          <filter class="solr.SynonymGraphFilterFactory" 
>> synonyms="synonyms.txt" ignoreCase="false" expand="true"/>
>>          <filter class="solr.StopFilterFactory" words="stopwords.txt" 
>> ignoreCase="false"/>
>>      </analyzer>
>> </fieldtype>
>> Where synonyms and stopwords are defined as follows:
>>
>> synonyms = out of warranty,oow
>> stopwords = of
>>
>> Running the following query:
>>
>> q=my tv went out *of* warranty something *of*
>>
>> I get wrong results, with the following explain:
>>
>> title:my title:tv title:went (title:oow *PhraseQuery(title:"out ? 
>> warranty something"))*
>>
>> That is, the synonyms is correctly detected, I see the graph 
>> information are correctly reported in the positionLength, it seems 
>> they are wrongly interpreted by the QueryParser.
>> I guess the reason is the "of" removal operated by the StopFilter, which
>>
>>   * removes the "of" term within the phrase (I wouldn't want that)
>>   * creates a "hole" in the span defined by the "oow" term, which has
>>     been marked as a synonym with a positionLength = 3, therefore
>>     including the next available term (something).
>>
>> I tried to change the StopFilter in order to ignore stopwords that 
>> are marked as SYNONYM or that are part of a previous synonym span, 
>> and it works: it correctly produces the following query:
>>
>> title:my title:tv title:went *(title:oow PhraseQuery(title:"out of 
>> warranty"))* title:something
>>
>> So I'd like to ask your opinion about this. Am I missing something? 
>> Do you think it's better to open a JIRA issue? If the solution is a 
>> graph aware stop filter, do you think it's better to change the 
>> existing filter or to subclass it?
>>
>> Best,
>> Andrea
>>
>>
>

Re: SynonymGraphFilter followed by StopFilter

Posted by Alan Woodward <ro...@gmail.com>.

Hi Andrea,

This is a long-standing issue: see https://issues.apache.org/jira/browse/LUCENE-4065 and https://issues.apache.org/jira/browse/LUCENE-8250 for discussion. I don’t think we’ve reached a consensus on how to fix it yet, but more examples are good.

Unfortunately I don’t think changing the StopFilter to ignore SYNONYM tokens will work, because then you’ll generate queries that always fail: they’ll search for ‘of’ in the middle of the phrase, but ‘of’ never gets indexed because it’s removed by the StopFilter at index time.
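
For reference, the hole in the original report can be reproduced with plain position arithmetic. The Python sketch below is not Lucene code: the `Token` type, the helper names and the token order emitted by the synonym filter are all assumptions made for illustration. It models each token as an edge from its position to position + positionLength, and models stop word removal as adding the removed token's increment to the next surviving token.

```python
from typing import NamedTuple


class Token(NamedTuple):
    term: str
    pos_inc: int  # position increment relative to the previous token
    pos_len: int  # number of positions the token spans


def edges(tokens):
    """Convert (pos_inc, pos_len) pairs into (term, start_node, end_node) edges."""
    pos = -1
    result = []
    for t in tokens:
        pos += t.pos_inc
        result.append((t.term, pos, pos + t.pos_len))
    return result


def stop_filter(tokens, stopwords):
    """Drop stop words, carrying their increment over to the next kept token."""
    skipped = 0
    kept = []
    for t in tokens:
        if t.term in stopwords:
            skipped += t.pos_inc
        else:
            kept.append(t._replace(pos_inc=t.pos_inc + skipped))
            skipped = 0
    return kept


def dead_ends(edge_list):
    """Nodes where a token ends but none starts (the final node is excluded)."""
    starts = {s for _, s, _ in edge_list}
    ends = {e for _, _, e in edge_list}
    return sorted(ends - starts - {max(ends)})


# Assumed synonym filter output for "my tv went out of warranty something of".
graph = [
    Token("my", 1, 1), Token("tv", 1, 1), Token("went", 1, 1),
    Token("oow", 1, 3),  # synonym spanning the three original positions
    Token("out", 0, 1), Token("of", 1, 1), Token("warranty", 1, 1),
    Token("something", 1, 1), Token("of", 1, 1),
]

before = edges(graph)
after = edges(stop_filter(graph, {"of"}))
print("dead ends before stop filter:", dead_ends(before))
print("dead ends after stop filter: ", dead_ends(after))
```

In this model the unfiltered graph has no dead ends, while the filtered one has a dead end at node 4: the side path that starts with "out" no longer connects to "warranty", even though "oow" still spans nodes 3 to 6.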

- Alan

> On 26 Jul 2018, at 08:04, Andrea Gazzarini <a.gazzarini@sease.io> wrote:
> 
> Hi, 
> I have the following field type definition: 
> <fieldtype name="text" class="solr.TextField" autoGeneratePhraseQueries="true">
>     <analyzer type="index">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
>     <analyzer type="query">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="false" expand="true"/>
>         <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="false"/>
>     </analyzer>
> </fieldtype>
> Where synonyms and stopwords are defined as follows: 
> 
> synonyms = out of warranty,oow
> stopwords = of
> 
> Running the following query:
> 
> q=my tv went out of warranty something of
> 
> I get wrong results, with the following explain: 
> 
> title:my title:tv title:went (title:oow PhraseQuery(title:"out ? warranty something"))
> 
> That is, the synonyms is correctly detected, I see the graph information are correctly reported in the positionLength, it seems they are wrongly interpreted by the QueryParser. 
> I guess the reason is the "of" removal operated by the StopFilter, which
>
>   * removes the "of" term within the phrase (I wouldn't want that)
>   * creates a "hole" in the span defined by the "oow" term, which has been marked as a synonym with a positionLength = 3, therefore including the next available term (something).
>
> I tried to change the StopFilter in order to ignore stopwords that are marked as SYNONYM or that are part of a previous synonym span, and it works: it correctly produces the following query: 
> 
> title:my title:tv title:went (title:oow PhraseQuery(title:"out of warranty")) title:something
> 
> So I'd like to ask your opinion about this. Am I missing something? Do you think it's better to open a JIRA issue? If the solution is a graph aware stop filter, do you think it's better to change the existing filter or to subclass it?
> 
> Best, 
> Andrea
> 
>