Posted to users@solr.apache.org by Mateusz Matela <mm...@man.poznan.pl> on 2023/01/11 10:53:56 UTC

Quoted phrase doesn't match when stemming and synonyms combined.

Hi,

My query is 'test polskie'.
I use MorfologikFilter (MF) for Polish stemming; it turns 'polskie' into 
'polski' + 'polskie'.
I also use SynonymGraphFilter (SGF), which turns 'polski' into 'pol'.
Here's what I see in query analysis (token positions in parentheses):
Tokenizer: test(1) polskie(2)
MF: test(1) polskie(2) polski(2)
SGF: test(1) polskie(2) pol(3) polski(3).

When I search for "test polskie" with quotation marks, a document with 
the same text doesn't match.
I think it's because SGF changes the positions of its output tokens (SGF 
is applied only at query time, so in the index the positions are only 1 
and 2). It matches when I disable SGF.
Am I doing something wrong, or is this a bug in SGF?

Thanks,
Mateus
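
The analysis output above can be turned into a small simulation. This is a minimal sketch in Python, assuming a simplified model in which a quoted phrase matches only if every query position lines up with a document position that holds one of that position's terms. It is not Lucene's actual implementation, and the name exact_phrase_match is made up for illustration.

```python
from typing import Dict, Set

Positions = Dict[int, Set[str]]  # token position -> terms at that position


def exact_phrase_match(doc: Positions, query: Positions) -> bool:
    """Simplified exact (slop 0) phrase check.

    Terms that share a query position are treated as alternatives.
    """
    q_positions = sorted(query)
    first = q_positions[0]
    for start in doc:
        # Anchor the first query position at this document position and
        # require every later query position to line up at the same offset.
        if all(query[p] & doc.get(start + (p - first), set()) for p in q_positions):
            return True
    return False


# Index side: tokenizer + MF only (SGF is not applied at index time).
indexed_doc = {1: {"test"}, 2: {"polskie", "polski"}}

# Query side as reported by the analysis screen, after SGF.
query_with_sgf = {1: {"test"}, 2: {"polskie"}, 3: {"pol", "polski"}}

# Query side with SGF disabled.
query_without_sgf = {1: {"test"}, 2: {"polskie", "polski"}}

print(exact_phrase_match(indexed_doc, query_with_sgf))     # False
print(exact_phrase_match(indexed_doc, query_without_sgf))  # True
```

In this model the query needs some term at position 3, but the indexed document has nothing there, which is consistent with the behavior described in the message.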


Re: Quoted phrase doesn't match when stemming and synonyms combined.

Posted by Dave <ha...@gmail.com>.
That’s awesome you found it! And of course, anytime. But again, having a complete reindex plan ready would be wise, in my opinion. It’s just something that makes you feel a tad safer when the s and the fan hit each other. I’ve had to rebuild well over a terabyte of a Solr index in less than a couple of weeks, and the stress the first time was enough to make sure I was ready for when I needed to do it again, which of course I did.

> On Jan 12, 2023, at 10:02 AM, Mateusz Matela <mm...@man.poznan.pl> wrote:
> 
> If anyone's interested, I've submitted https://github.com/apache/lucene/issues/12080
> I found a small change in code that seems to fix the problem.
> Thank you Dave for the feedback!

Re: Quoted phrase doesn't match when stemming and synonyms combined.

Posted by Mateusz Matela <mm...@man.poznan.pl>.
If anyone's interested, I've submitted 
https://github.com/apache/lucene/issues/12080
I found a small change in code that seems to fix the problem.
Thank you Dave for the feedback!

On 11.01.2023 at 15:17, Dave wrote:
> On one hand that’s great news; on the other, it probably deserves a ticket, but you need to have a very specific scenario where your index filters don’t match your query filters.
>
> Also, maybe spend some time putting together a reindexing plan. Solr can use multiple cores, so you can index content simultaneously if it’s split up, rather than running a single indexing process. In Perl you can use forking via the process manager CPAN module; most other languages can do it as well (but not as well, imo).


Re: Quoted phrase doesn't match when stemming and synonyms combined.

Posted by Dave <ha...@gmail.com>.
On one hand that’s great news; on the other, it probably deserves a ticket, but you need to have a very specific scenario where your index filters don’t match your query filters.

Also, maybe spend some time putting together a reindexing plan. Solr can use multiple cores, so you can index content simultaneously if it’s split up, rather than running a single indexing process. In Perl you can use forking via the process manager CPAN module; most other languages can do it as well (but not as well, imo).
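
As a rough illustration of the "split it up" idea, here is a minimal Python sketch that divides the documents into batches and sends them concurrently. Everything in it is an assumption made for illustration: the function names (chunked, reindex_in_parallel, fake_send), the batch size, and the worker count are hypothetical, and the sender is a stand-in. A real sender would post each batch to Solr's update handler.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterator, List

Doc = Dict[str, str]


def chunked(docs: List[Doc], size: int) -> Iterator[List[Doc]]:
    """Yield consecutive slices of at most `size` documents."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]


def reindex_in_parallel(
    docs: List[Doc],
    send_batch: Callable[[List[Doc]], int],
    batch_size: int = 500,
    workers: int = 4,
) -> int:
    """Send batches concurrently and return the total number of documents sent.

    `send_batch` is a hypothetical callable that indexes one batch and
    returns how many documents it handled.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(send_batch, chunked(docs, batch_size)))


def fake_send(batch: List[Doc]) -> int:
    # Stand-in for a real indexing call; it only counts the documents.
    return len(batch)


docs = [{"id": str(i), "text": "test polskie"} for i in range(1200)]
print(reindex_in_parallel(docs, fake_send))  # 1200
```

Threads are used here because sending batches is mostly waiting on the network; a process pool would be closer to the forking approach mentioned above.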



> On Jan 11, 2023, at 8:47 AM, Mateusz Matela <mm...@man.poznan.pl> wrote:
> 
> After reindexing with SGF the document matches, as expected.
> 
> Still, it looks like SGF was designed to work well when used only in query, and it's just a bug revealed by an edge case. Shall I submit an issue to https://github.com/apache/lucene ?

Re: Quoted phrase doesn't match when stemming and synonyms combined.

Posted by Mateusz Matela <mm...@man.poznan.pl>.
After reindexing with SGF the document matches, as expected.

Still, it looks like SGF was designed to work well when used only in 
query, and it's just a bug revealed by an edge case. Shall I submit an 
issue to https://github.com/apache/lucene ?

On 11.01.2023 at 13:09, Dave wrote:
> Yes, then that is a problem, and I agree it should be intuitive that the quotes work without the modifier. I’m not familiar enough with the underlying code to know for sure what’s going on in this instance, but I wonder whether reindexing the content with the filter would fix it? You can experiment with just that one document and see.
>
> Otherwise, reindexing your content from scratch should have a plan, as upgrades or new filters become necessary. It’s definitely inconvenient, but sometimes you’ve got to do what you’ve got to do, so it’s better to be ready for it. A search index should always be considered temporary and replaceable: it’s not a database, it’s a search tool for searching a data set. If you build with that in mind, you put the index on replaceable hardware and have a plan for that hardware to simply die and be replaced.


Re: Quoted phrase doesn't match when stemming and synonyms combined.

Posted by Dave <ha...@gmail.com>.
Yes, then that is a problem, and I agree it should be intuitive that the quotes work without the modifier. I’m not familiar enough with the underlying code to know for sure what’s going on in this instance, but I wonder whether reindexing the content with the filter would fix it? You can experiment with just that one document and see.

Otherwise, reindexing your content from scratch should have a plan, as upgrades or new filters become necessary. It’s definitely inconvenient, but sometimes you’ve got to do what you’ve got to do, so it’s better to be ready for it. A search index should always be considered temporary and replaceable: it’s not a database, it’s a search tool for searching a data set. If you build with that in mind, you put the index on replaceable hardware and have a plan for that hardware to simply die and be replaced.

> On Jan 11, 2023, at 6:27 AM, Mateusz Matela <mm...@man.poznan.pl> wrote:
> 
> On 11.01.2023 at 12:04, Dave wrote:
>> Hmm. As an experiment, what happens when you use a range of three or four with the quotes, using the tilde in the query?
> 
> You mean a query like "test polskie"~1 ? Yes, it does match.
> 
> Unfortunately it's not a workaround I can use, because the query is provided by the users. It's quite intuitive for them to use quotes, but not necessarily tildes. And if I added it artificially, it would be a slightly different query, which may not always be what the user wants.
> 
>> Also, I generally find it best to use the same filters for both indexing and query; just a personal preference, I know it’s not always possible, however.
> 
> The problem here is that I'd need to reindex documents when synonym definitions change, which is quite inconvenient.
> It would solve the problem if SGF did not increase the positions. Am I correct to assume that this is not the correct behavior and should be fixed? It doesn't do that when there's only one token at the position it modifies, for example:
> 
> test(1) polski(2) -> test(1) pol(2) polski(2)
> 
> Then the document does match.
> 

Re: Quoted phrase doesn't match when stemming and synonyms combined.

Posted by Mateusz Matela <mm...@man.poznan.pl>.
On 11.01.2023 at 12:04, Dave wrote:
> Hmm. As an experiment, what happens when you use a range of three or four with the quotes, using the tilde in the query?

You mean a query like "test polskie"~1 ? Yes, it does match.

Unfortunately it's not a workaround I can use, because the query is 
provided by the users. It's quite intuitive for them to use quotes, but 
not necessarily tildes. And if I added it artificially, it would be a 
slightly different query, which may not always be what the user wants.

> Also, I generally find it best to use the same filters for both indexing and query; just a personal preference, I know it’s not always possible, however.

The problem here is that I'd need to reindex documents when synonym 
definitions change, which is quite inconvenient.
It would solve the problem if SGF did not increase the positions. Am I 
correct to assume that this is not the correct behavior and should be 
fixed? It doesn't do that when there's only one token at the position it 
modifies, for example:

test(1) polski(2) -> test(1) pol(2) polski(2)

Then the document does match.
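
The observations in this message (the ~1 query matches, and the single-token case matches without any slop) can be reproduced with a small simulation. This is a minimal sketch in Python, assuming a simplified model of sloppy phrase matching: pick one document occurrence per query position, subtract the query position from the document position, and accept when the resulting offsets differ by at most the slop. It is not Lucene's actual algorithm, and the name sloppy_phrase_match is made up for illustration.

```python
from itertools import product
from typing import Dict, List, Set

Positions = Dict[int, Set[str]]  # token position -> terms at that position


def sloppy_phrase_match(doc: Positions, query: Positions, slop: int = 0) -> bool:
    """Simplified phrase check with slop; terms at one query position are alternatives."""
    candidates: List[List[int]] = []
    for q_pos, terms in sorted(query.items()):
        # Offset of each matching document position relative to this query position.
        offsets = [d_pos - q_pos for d_pos, d_terms in doc.items() if terms & d_terms]
        if not offsets:
            return False  # no alternative for this query position occurs in the document
        candidates.append(offsets)
    # With slop 0 all offsets must be identical, i.e. an exact phrase.
    return any(max(combo) - min(combo) <= slop for combo in product(*candidates))


indexed_doc = {1: {"test"}, 2: {"polskie", "polski"}}

# "test polskie" after MF and SGF at query time.
two_token_case = {1: {"test"}, 2: {"polskie"}, 3: {"pol", "polski"}}

# "test polski" after SGF: the synonym stays at position 2.
single_token_case = {1: {"test"}, 2: {"pol", "polski"}}

print(sloppy_phrase_match(indexed_doc, two_token_case, slop=0))     # False
print(sloppy_phrase_match(indexed_doc, two_token_case, slop=1))     # True
print(sloppy_phrase_match(indexed_doc, single_token_case, slop=0))  # True
```

In this model the shifted tokens are off by exactly one position, which is why a slop of 1 is enough to recover the match.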


Re: Quoted phrase doesn't match when stemming and synonyms combined.

Posted by Dave <ha...@gmail.com>.
Hmm. As an experiment, what happens when you use a range of three or four with the quotes, using the tilde in the query?

Also, I generally find it best to use the same filters for both indexing and query; just a personal preference, I know it’s not always possible, however.

> On Jan 11, 2023, at 5:56 AM, Mateusz Matela <mm...@man.poznan.pl> wrote:
> 
> Hi,
> 
> My query is 'test polskie'.
> I use MorfologikFilter (MF) for Polish stemming; it turns 'polskie' into 'polski' + 'polskie'.
> I also use SynonymGraphFilter (SGF), which turns 'polski' into 'pol'.
> Here's what I see in query analysis (token positions in parentheses):
> Tokenizer: test(1) polskie(2)
> MF: test(1) polskie(2) polski(2)
> SGF: test(1) polskie(2) pol(3) polski(3).
> 
> When I search for "test polskie" with quotation marks, a document with the same text doesn't match.
> I think it's because SGF changes the positions of its output tokens (SGF is applied only at query time, so in the index the positions are only 1 and 2). It matches when I disable SGF.
> Am I doing something wrong, or is this a bug in SGF?
> 
> Thanks,
> Mateus
>