Posted to solr-user@lucene.apache.org by "Oakley, Craig (NIH/NLM/NCBI) [C]" <cr...@nih.gov> on 2019/02/01 16:55:16 UTC

change in White Space when upgrading 6.6 to 7.4

We had a problem when upgrading from Solr 6.6 to Solr 7.4 in that a query ceased to work.


The query was of the form http://localhost:8983/solr/collection/select?indent=on&q=ABC4856.21%20AND%20-field1:ABC4856.21&wt=json&rows=0

Basically finding a count of those records where there is some field which has "ABC4856.21", but where the field field1 does not have that string (in other words, where there is some field other than field1 which has "ABC4856.21")

For this particular collection, running the query against Solr 6.6 resulted in "response":{"numFound":0 (which was correct), but running it against Solr 7.4 resulted in "response":{"numFound":21322074

After some investigation, it seemed to be a problem with the initial "ABC4856.21" being tokenized as "ABC4856" and "21"

We found various work-arounds such as putting quotation marks around the string or adding "*:" after the "q="; but the user wanted the exact same query to work in Solr 7.4 as it had in Solr 6.6

Eventually, we found a solution by adding "<str name="sow">true</str>" to the Select handler in solrconfig.xml (for "Separate On Whitespace").
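
For anyone wanting to try this before editing solrconfig.xml: sow is also accepted as an ordinary request parameter, so the two behaviours can be compared per request. Below is a minimal Python sketch, assuming the localhost URL and field name from the example query above; build_select_url is a hypothetical helper written for this illustration, not part of Solr.

```python
from urllib.parse import urlencode


def build_select_url(base, q, sow=None, debug=False):
    """Build a /select URL; sow=None leaves the server-side default in place."""
    params = {"q": q, "wt": "json", "rows": 0, "indent": "on"}
    if sow is not None:
        # Solr expects the literal strings "true" / "false".
        params["sow"] = "true" if sow else "false"
    if debug:
        params["debugQuery"] = "on"
    return base + "?" + urlencode(params)


# Assumed endpoint and query, copied from the example in this message.
BASE = "http://localhost:8983/solr/collection/select"
QUERY = "ABC4856.21 AND -field1:ABC4856.21"

url_default = build_select_url(BASE, QUERY)
url_sow_true = build_select_url(BASE, QUERY, sow=True, debug=True)
print(url_default)
print(url_sow_true)
```

Fetching both URLs and comparing numFound should show whether sow alone accounts for the difference.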

This solution seems to be sufficient; but we would like to be sure we understand the solution.

Looking at lucene.apache.org/solr/guide/7_4/tokenizers.html#standard-tokenizer it would seem that the period should not split the string into two tokens.

Could someone clarify how we can know which tokenizer is used when, and which definition of white space is used when?

Thanks

Re: change in White Space when upgrading 6.6 to 7.4

Posted by Matt Pearce <ma...@flax.co.uk>.

The default value of sow changed from true to false between 6.x and 7.x, 
which is why the problem has appeared for you, and why it is solved by 
setting sow=true in your defaults.

With sow=true, I would expect your query to be broken into three parts, 
and then tokenised:
ABC4856.21
AND
-field1:ABC4856.21
With sow=false, the whole query will be tokenised in one go, so one of 
the query analysers on the fields being searched is behaving differently 
depending on the string passed.

Does the parsed query (in the debugQuery=true output) give any 
indication of the differences between the two versions? What analysis is 
done on the fields being queried?
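
One way to make that comparison is to pull the parsedquery_toString value out of each JSON response and look at the two side by side. The Python sketch below assumes wt=json responses with debugQuery on. The two sample bodies are trimmed stand-ins built from the debug output posted later in this thread, and parsed_query is a hypothetical helper name.

```python
import json


def parsed_query(response_text):
    """Return the parsedquery_toString value from a Solr JSON response body."""
    return json.loads(response_text)["debug"]["parsedquery_toString"]


# Trimmed stand-ins for two responses; only the debug key used here is kept.
resp_sow_false = json.dumps({"debug": {"parsedquery_toString":
    "+(+((text:pd text:00002485621) | isolation_source:PDS000024856.21) "
    "-erd_group:PDS000024856.21)"}})
resp_sow_true = json.dumps({"debug": {"parsedquery_toString":
    '+(+(text:"pd 00002485621" | isolation_source:PDS000024856.21) '
    "-erd_group:PDS000024856.21)"}})

for label, body in (("sow=false", resp_sow_false), ("sow=true", resp_sow_true)):
    print(label, "->", parsed_query(body))
```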

Thanks,
Matt


-- 
Matt Pearce
Flax - Open Source Enterprise Search
www.flax.co.uk

RE: change in White Space when upgrading 6.6 to 7.4

Posted by "Oakley, Craig (NIH/NLM/NCBI) [C]" <cr...@nih.gov>.

> Can we take this thread back to the mailing list, please? It would be good to allow other people to weigh in!

Sure

-----Original Message-----
From: Matt Pearce <ma...@flax.co.uk> 
Sent: Friday, February 08, 2019 6:45 AM
To: Oakley, Craig (NIH/NLM/NCBI) [C] <cr...@nih.gov>
Subject: Re: change in White Space when upgrading 6.6 to 7.4


The first (sow=false) query parses to:
"+(+((text:pd text:00002485621) | isolation_source:PDS000024856.21) 
-erd_group:PDS000024856.21)"
while the sow=true query parses to:
"+(+(text:\"pd 00002485621\" | isolation_source:PDS000024856.21) 
-erd_group:PDS000024856.21)"

This suggests to me that the analyzer on the text field is using the 
WordDelimiterFilterFactory (or WordDelimiterGraphFilterFactory), and 
splitting the query text into separate tokens on number/word boundaries 
- so "ABC123" => "ABC" "123". It is also stripping the "S" from "PDS", 
and the decimal point from the numeric part, as you can see from the 
"text:00002485621" part of both queries - this may not be the 
WordDelimiter filter, but I suspect it probably is.
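
To make the suspected splitting concrete, here is a toy Python approximation. It is not Solr's WordDelimiterFilterFactory and makes no claim about the actual configuration. It assumes only three things: punctuation and letter/digit boundaries split the token, adjacent numeric parts are joined back together, and letters are lowercased. The function name toy_word_delimiter is made up for this sketch.

```python
import re


def toy_word_delimiter(token):
    """Very rough imitation of word-delimiter style splitting (illustration only)."""
    # Runs of letters or digits become parts; anything else (e.g. ".") is a delimiter.
    parts = re.findall(r"[A-Za-z]+|[0-9]+", token)
    out = []
    digits = ""
    for part in parts:
        if part.isdigit():
            digits += part  # join adjacent numeric parts into one token
        else:
            if digits:
                out.append(digits)
                digits = ""
            out.append(part.lower())
    if digits:
        out.append(digits)
    return out


print(toy_word_delimiter("ABC123"))
print(toy_word_delimiter("PDS000024856.21"))
```

The second call yields "pds" and "00002485621". The real analysis chain evidently also reduces "pds" to "pd"; that step is not modelled here.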

It works when sow=true, because it's generating a phrase query. When 
sow=false, it doesn't generate a phrase query and you're getting matches 
on both "pd" and "00002485621" - presumably "pd" appears in a lot of 
your documents.
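
The difference between the two cases can be illustrated with a small Python sketch. The documents below are invented for the example, and the two matching functions are simplified stand-ins for how a set of optional term clauses and a phrase query behave, not Lucene code.

```python
def matches_any_term(doc_tokens, terms):
    """Non-phrase case: a document matches if it contains any one of the terms."""
    return any(term in doc_tokens for term in terms)


def matches_phrase(doc_tokens, terms):
    """Phrase case: the terms must appear next to each other, in order."""
    n = len(terms)
    return any(doc_tokens[i:i + n] == terms
               for i in range(len(doc_tokens) - n + 1))


# Hypothetical analysed documents (made up for illustration).
docs = {
    "doc1": ["pd", "sample", "from", "clinic"],
    "doc2": ["pd", "00002485621"],
    "doc3": ["environmental", "swab"],
}
terms = ["pd", "00002485621"]

any_hits = [name for name, tokens in docs.items() if matches_any_term(tokens, terms)]
phrase_hits = [name for name, tokens in docs.items() if matches_phrase(tokens, terms)]
print(any_hits)     # ['doc1', 'doc2']
print(phrase_hits)  # ['doc2']
```

A document containing only "pd" matches in the first case but not in the second, which is consistent with the very large numFound seen with sow=false.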

A possible solution without using sow=true would be to modify the 
analyzer on your text field so it doesn't use 
WordDelimiterFilterFactory, and retains "PDS000024856.21" as a single 
token, or modify the behaviour of that filter so it doesn't split the 
tokens the same way. Of course, this may not be what you want, depending 
on the other data you have in the text field.

Can we take this thread back to the mailing list, please? It would be 
good to allow other people to weigh in!

Thanks,
Matt

On 07/02/2019 15:58, Oakley, Craig (NIH/NLM/NCBI) [C] wrote:
> Thanks. Here is what I have.
> 
> The first curl output is the problem results. The next two were changing the query (adding quotation marks or adding "*:")
> 
> After the third curl output, I upload a new solrconfig.xml (in another window) to include <str name="sow">true</str> in the /select requestHandler; I then RELOAD the core and run the final curl command
> 
> The correct answer should have numFound 0 (the only query which fails to get the correct answer is the first: the original query with sow defaulting to false in Solr 7.4)
> 
> Let me know if you see any clarification in the debugQuery output
> 
> Thanks again
> 
> 
> 
> *[10:33 ~ 2209]$ curl -s 'http://host:9999/solr/isolates/select?indent=on&q=PDS000024856.21%20AND%20-erd_group:PDS000024856.21&wt=json&rows=0&debugQuery=on'|tee ~/solr/DBH14432debug190207a.out
> {
>    "responseHeader":{
>      "zkConnected":true,
>      "status":0,
>      "QTime":1,
>      "params":{
>        "q":"PDS000024856.21 AND -erd_group:PDS000024856.21",
>        "indent":"on",
>        "rows":"0",
>        "wt":"json",
>        "debugQuery":"on"}},
>    "response":{"numFound":21322074,"start":0,"docs":[]
>    },
>    "debug":{
>      "rawquerystring":"PDS000024856.21 AND -erd_group:PDS000024856.21",
>      "querystring":"PDS000024856.21 AND -erd_group:PDS000024856.21",
>      "parsedquery":"+(+DisjunctionMaxQuery(((text:pd text:00002485621) | isolation_source:PDS000024856.21)) -erd_group:PDS000024856.21)",
>      "parsedquery_toString":"+(+((text:pd text:00002485621) | isolation_source:PDS000024856.21) -erd_group:PDS000024856.21)",
>      "explain":{},
>      "QParser":"ExtendedDismaxQParser",
>      "altquerystring":null,
>      "boost_queries":null,
>      "parsed_boost_queries":[],
>      "boostfuncs":null,
>      "timing":{
>        "time":1.0,
>        "prepare":{
>          "time":0.0,
>          "query":{
>            "time":0.0},
>          "facet":{
>            "time":0.0},
>          "facet_module":{
>            "time":0.0},
>          "mlt":{
>            "time":0.0},
>          "highlight":{
>            "time":0.0},
>          "stats":{
>            "time":0.0},
>          "expand":{
>            "time":0.0},
>          "terms":{
>            "time":0.0},
>          "debug":{
>            "time":0.0}},
>        "process":{
>          "time":0.0,
>          "query":{
>            "time":0.0},
>          "facet":{
>            "time":0.0},
>          "facet_module":{
>            "time":0.0},
>          "mlt":{
>            "time":0.0},
>          "highlight":{
>            "time":0.0},
>          "stats":{
>            "time":0.0},
>          "expand":{
>            "time":0.0},
>          "terms":{
>            "time":0.0},
>          "debug":{
>            "time":0.0}}}}}
> *[10:33 ~ 2210]$ curl -s 'http://host:9999/solr/isolates/select?indent=on&q="PDS000024856.21"%20AND%20-erd_group:PDS000024856.21&wt=json&rows=0&debugQuery=on'|tee ~/solr/DBH14432debug190207b.out
> {
>    "responseHeader":{
>      "zkConnected":true,
>      "status":0,
>      "QTime":1,
>      "params":{
>        "q":"\"PDS000024856.21\" AND -erd_group:PDS000024856.21",
>        "indent":"on",
>        "rows":"0",
>        "wt":"json",
>        "debugQuery":"on"}},
>    "response":{"numFound":0,"start":0,"docs":[]
>    },
>    "debug":{
>      "rawquerystring":"\"PDS000024856.21\" AND -erd_group:PDS000024856.21",
>      "querystring":"\"PDS000024856.21\" AND -erd_group:PDS000024856.21",
>      "parsedquery":"+(+DisjunctionMaxQuery((text:\"pd 00002485621\" | isolation_source:PDS000024856.21)) -erd_group:PDS000024856.21)",
>      "parsedquery_toString":"+(+(text:\"pd 00002485621\" | isolation_source:PDS000024856.21) -erd_group:PDS000024856.21)",
>      "explain":{},
>      "QParser":"ExtendedDismaxQParser",
>      "altquerystring":null,
>      "boost_queries":null,
>      "parsed_boost_queries":[],
>      "boostfuncs":null,
>      "timing":{
>        "time":1.0,
>        "prepare":{
>          "time":0.0,
>          "query":{
>            "time":0.0},
>          "facet":{
>            "time":0.0},
>          "facet_module":{
>            "time":0.0},
>          "mlt":{
>            "time":0.0},
>          "highlight":{
>            "time":0.0},
>          "stats":{
>            "time":0.0},
>          "expand":{
>            "time":0.0},
>          "terms":{
>            "time":0.0},
>          "debug":{
>            "time":0.0}},
>        "process":{
>          "time":0.0,
>          "query":{
>            "time":0.0},
>          "facet":{
>            "time":0.0},
>          "facet_module":{
>            "time":0.0},
>          "mlt":{
>            "time":0.0},
>          "highlight":{
>            "time":0.0},
>          "stats":{
>            "time":0.0},
>          "expand":{
>            "time":0.0},
>          "terms":{
>            "time":0.0},
>          "debug":{
>            "time":0.0}}}}}
> *[10:33 ~ 2211]$ curl -s 'http://host:9999/solr/isolates/select?indent=on&q=*:PDS000024856.21%20AND%20-erd_group:PDS000024856.21&wt=json&rows=0&debugQuery=on'|tee ~/solr/DBH14432debug190207c.out
> {
>    "responseHeader":{
>      "zkConnected":true,
>      "status":0,
>      "QTime":9,
>      "params":{
>        "q":"*:PDS000024856.21 AND -erd_group:PDS000024856.21",
>        "indent":"on",
>        "rows":"0",
>        "wt":"json",
>        "debugQuery":"on"}},
>    "response":{"numFound":0,"start":0,"docs":[]
>    },
>    "debug":{
>      "rawquerystring":"*:PDS000024856.21 AND -erd_group:PDS000024856.21",
>      "querystring":"*:PDS000024856.21 AND -erd_group:PDS000024856.21",
>      "parsedquery":"+(+DisjunctionMaxQuery((text:*\\:pds000024856.21 | isolation_source:*\\:PDS000024856.21)) -erd_group:PDS000024856.21)",
>      "parsedquery_toString":"+(+(text:*\\:pds000024856.21 | isolation_source:*\\:PDS000024856.21) -erd_group:PDS000024856.21)",
>      "explain":{},
>      "QParser":"ExtendedDismaxQParser",
>      "altquerystring":null,
>      "boost_queries":null,
>      "parsed_boost_queries":[],
>      "boostfuncs":null,
>      "timing":{
>        "time":9.0,
>        "prepare":{
>          "time":8.0,
>          "query":{
>            "time":8.0},
>          "facet":{
>            "time":0.0},
>          "facet_module":{
>            "time":0.0},
>          "mlt":{
>            "time":0.0},
>          "highlight":{
>            "time":0.0},
>          "stats":{
>            "time":0.0},
>          "expand":{
>            "time":0.0},
>          "terms":{
>            "time":0.0},
>          "debug":{
>            "time":0.0}},
>        "process":{
>          "time":0.0,
>          "query":{
>            "time":0.0},
>          "facet":{
>            "time":0.0},
>          "facet_module":{
>            "time":0.0},
>          "mlt":{
>            "time":0.0},
>          "highlight":{
>            "time":0.0},
>          "stats":{
>            "time":0.0},
>          "expand":{
>            "time":0.0},
>          "terms":{
>            "time":0.0},
>          "debug":{
>            "time":0.0}}}}}
> [10:33 ~ 2212]$ curlp -s 'http://host:9999/solr/admin/cores?action=RELOAD&core=isolates_shard1_replica_n1&indent=on'
> {
>    "responseHeader":{
>      "status":0,
>      "QTime":2628}}
> *[10:41 ~ 2213]$ curl -s 'http://host:9999/solr/isolates/select?indent=on&q=PDS000024856.21%20AND%20-erd_group:PDS000024856.21&wt=json&rows=0&debugQuery=on'|tee ~/solr/DBH14432debug190207d.out
> {
>    "responseHeader":{
>      "zkConnected":true,
>      "status":0,
>      "QTime":112,
>      "params":{
>        "q":"PDS000024856.21 AND -erd_group:PDS000024856.21",
>        "indent":"on",
>        "rows":"0",
>        "wt":"json",
>        "debugQuery":"on"}},
>    "response":{"numFound":0,"start":0,"docs":[]
>    },
>    "debug":{
>      "rawquerystring":"PDS000024856.21 AND -erd_group:PDS000024856.21",
>      "querystring":"PDS000024856.21 AND -erd_group:PDS000024856.21",
>      "parsedquery":"+(+DisjunctionMaxQuery((text:\"pd 00002485621\" | isolation_source:PDS000024856.21)) -erd_group:PDS000024856.21)",
>      "parsedquery_toString":"+(+(text:\"pd 00002485621\" | isolation_source:PDS000024856.21) -erd_group:PDS000024856.21)",
>      "explain":{},
>      "QParser":"ExtendedDismaxQParser",
>      "altquerystring":null,
>      "boost_queries":null,
>      "parsed_boost_queries":[],
>      "boostfuncs":null,
>      "timing":{
>        "time":111.0,
>        "prepare":{
>          "time":0.0,
>          "query":{
>            "time":0.0},
>          "facet":{
>            "time":0.0},
>          "facet_module":{
>            "time":0.0},
>          "mlt":{
>            "time":0.0},
>          "highlight":{
>            "time":0.0},
>          "stats":{
>            "time":0.0},
>          "expand":{
>            "time":0.0},
>          "terms":{
>            "time":0.0},
>          "debug":{
>            "time":0.0}},
>        "process":{
>          "time":110.0,
>          "query":{
>            "time":110.0},
>          "facet":{
>            "time":0.0},
>          "facet_module":{
>            "time":0.0},
>          "mlt":{
>            "time":0.0},
>          "highlight":{
>            "time":0.0},
>          "stats":{
>            "time":0.0},
>          "expand":{
>            "time":0.0},
>          "terms":{
>            "time":0.0},
>          "debug":{
>            "time":0.0}}}}}
> [10:41 ~ 2214]$
> 
> -----Original Message-----
> From: Matt Pearce <ma...@flax.co.uk>
> Sent: Thursday, February 07, 2019 5:12 AM
> To: Oakley, Craig (NIH/NLM/NCBI) [C] <cr...@nih.gov>
> Subject: Re: change in White Space when upgrading 6.6 to 7.4
> 
> 
> Sorry - I'd intended to reply to the list, hit the wrong button in my
> mail client!
> 
