Posted to users@solr.apache.org by Dwane Hall <dw...@hotmail.com> on 2021/08/17 07:31:47 UTC

Edismax mysteries?

Hi all,

A quick question regarding query analysis if someone is feeling brave and knows a bit about the edismax parser's behaviour?!

It's probably best explained as an example:

I have 3 fields with two field types (defined below)
ST_Field1 - Field type of search_text
ST_Field2 - Field type of search_text
LC_Field1 - Field type of lowercase

<!--English only text searching-->
<fieldType name="search_text" class="solr.TextField" positionIncrementGap="100" uninvertible="false">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
      <filter class="solr.FlattenGraphFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
     <analyzer type="query">
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
       <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>


<!--Code value searching e.g. A flight number QF123  -->
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100" uninvertible="false">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.TrimFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
</fieldType>
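
As a rough illustration of how differently these two query analyzers treat the same input, here is a toy Python model. It is not Solr code: it assumes the whole query string reaches each analyzer unsplit, it ignores WordDelimiterGraphFilter entirely, and the function names are made up for this sketch.

```python
# Toy model of the two query-time analysis chains; NOT Solr's implementation.
# WordDelimiterGraphFilter is deliberately left out to keep the sketch small.

def search_text_query_tokens(text):
    """Approximate WhitespaceTokenizer followed by LowerCaseFilter."""
    return [tok.lower() for tok in text.split()]


def lowercase_query_tokens(text):
    """Approximate KeywordTokenizer + TrimFilter + LowerCaseFilter."""
    return [text.strip().lower()]


query = "34567 something"
print(search_text_query_tokens(query))  # two tokens
print(lowercase_query_tokens(query))    # one token containing the space
```

The "search_text" chain produces one token per whitespace-separated word, while the "lowercase" chain keeps the entire string as a single token.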

Now if I query these fields with a two-term query "34567 something" (not a phrase query, q.op=AND) and change only the qf fields, the query parser's behaviour changes significantly.

Query 1 using qf=ST_Field1 ST_Field2
When I don't use a "lowercase" fieldType in qf, the generated query consists of a MUST DisjunctionMaxQuery for each term (2 total), with each qf field a SHOULD clause. This is the behaviour I'm expecting.
"querystring":"34567 something",
"parsedquery":"+(+DisjunctionMaxQuery((ST_Field1:34567 | ST_Field2:34567)) +DisjunctionMaxQuery((ST_Field1:something | ST_Field2:something)))"

Query 2 using qf=ST_Field1 LC_Field1
When I combine a "lowercase" fieldType with a "search_text" fieldType in qf, the generated query consists of a single MUST DisjunctionMaxQuery, with the terms on the "search_text" field combined as MUST clauses and the "lowercase" field receiving the whole query string as a single SHOULD clause.
"querystring":"34567 something",
"parsedquery":"+(+DisjunctionMaxQuery(((+ST_Field1:34567 +ST_Field1:something) | LC_Field1:34567 something )))"

Does anyone know, at a high level, the rules that dictate these changes in query behaviour? If so, is there a particular analysis chain to avoid to limit the chances of it happening (i.e. force Query 1 behaviour, not Query 2 behaviour above)? The Open Source Connections guys (John Berryman) have a great post on edismax (https://opensourceconnections.com/blog/2013/03/07/the-anatomy-of-a-dismax-query/). It was either there or on this forum that I read that edismax behaviour will change if the query gets "too complex", but it'd be useful to understand some of the specifics of what forces this behaviour change so we can predict when to expect it!

Cheers,

Dwane

Solr 8.8.2

Re: Edismax mysteries?

Posted by Dwane Hall <dw...@hotmail.com>.

Thanks Shawn, that makes sense!  Setting sow=true did yield the behaviour I was looking for, but I'll have to run some more tests to confirm everything else is behaving as expected.  Additionally, thanks for the warning regarding mm and q.op; I'll be sure to keep an eye out for that little gotcha!  Your mention of sow led me to two excellent articles on edismax and sow. For other interested readers, they give good insight into some peculiarities that can pop up when using edismax and how its behaviour can be interpreted.

Doug Turnbull (Open Source Connections)

https://opensourceconnections.com/blog/2018/02/20/edismax-and-multiterm-synonyms-oddities/

Alessandro Benedetti (Sease)

https://sease.io/2021/05/apache-solr-sow-parameter-split-on-whitespace-and-multi-field-full-text-search.html

Thanks again for your input it was a big help.

Cheers,

Dwane
________________________________
From: Shawn Heisey <ap...@elyograg.org>
Sent: Tuesday, 17 August 2021 11:51 PM
To: users@solr.apache.org <us...@solr.apache.org>
Subject: Re: Edismax mysteries?

On 8/17/2021 1:31 AM, Dwane Hall wrote:
> Query 2 using qf=ST_Field1 LC_Field1
> When I use a "lowercase" fieldType in qf with a "search_text" fieldType - The query generated consists of a single MUST DisjunctionMaxQuery with each qf field of type "search_text" a MUST clause and the field of type "lowercase" a SHOULD clause
> "querystring":"34567 something",
> "parsedquery":"+(+DisjunctionMaxQuery(((+ST_Field1:34567 +ST_Field1:something) | LC_Field1:34567 something )))"
>
> Does anyone know, at a high level, the rules that dictate these changes in query behaviour? If so, is there a particular analysis chain to avoid to limit the chances of it happening (i.e. force Query 1 behaviour, not Query 2 behaviour above)? The Open Source Connections guys (John Berryman) have a great post on edismax (https://opensourceconnections.com/blog/2013/03/07/the-anatomy-of-a-dismax-query/). It was either there or on this forum that I read that edismax behaviour will change if the query gets "too complex", but it'd be useful to understand some of the specifics of what forces this behaviour change so we can predict when to expect it!

On the first query, since both fields have the same fieldType, the query
parser can combine the terms in each clause for brevity, and maybe also
for performance.  But for the second query, since the second field is a
different type, it must be separate.  Because you used the Keyword
Tokenizer on that field, the query string is not split into separate
terms.  I think it was version 7.0 that changed the "sow" parameter
(split on whitespace) so it defaults to false.  Now the query parser no
longer splits the input on whitespace; the analysis must do that if you
need it.  You could try setting "sow=true" and see what that gives you
... but depending on the nature of your data, doing so may not actually
produce the results that you want.
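
For anyone who wants to compare the two behaviours side by side, a minimal Python sketch for building the two debug requests might look like the following. The host, port and collection name ("mycollection") are placeholders rather than anything from this thread, and the helper name is made up; the request parameters themselves (defType, q.op, qf, sow, debugQuery) are standard Solr ones.

```python
from urllib.parse import urlencode

# Hypothetical host and collection name; adjust for your own deployment.
SOLR_SELECT = "http://localhost:8983/solr/mycollection/select"


def build_query_url(qf_fields, sow):
    """Build an edismax debug request with sow switched on or off."""
    params = {
        "defType": "edismax",
        "q": "34567 something",
        "q.op": "AND",
        "qf": " ".join(qf_fields),
        "sow": "true" if sow else "false",
        "debugQuery": "true",  # makes Solr return the parsedquery
    }
    return SOLR_SELECT + "?" + urlencode(params)


url_unsplit = build_query_url(["ST_Field1", "LC_Field1"], sow=False)
url_split = build_query_url(["ST_Field1", "LC_Field1"], sow=True)
print(url_unsplit)
print(url_split)
```

Fetching both URLs and comparing the "parsedquery" entries in the debug output should show whether sow=true gives one clause per term for your schema.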

The only part that doesn't look right with your description is that the
lowercase field is a SHOULD clause, which q.op=AND shouldn't produce.
Can you add "echoParams=all" to your query and check for "mm" and "q.op"
parameters in the response header?  With dismax and edismax,
interactions between q.op and mm can be very tricky to get right and can
produce some very surprising results. I would recommend that you only
set one of them.
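
As a sketch of that check, assuming the response is requested as JSON and that echoed parameters appear under responseHeader.params, something like this Python could flag the case where both are set. The sample response is hand-written for illustration, not real Solr output, and the function name is hypothetical.

```python
import json


def q_op_and_mm(response_text):
    """Return the echoed q.op and mm values (None where absent)."""
    header = json.loads(response_text).get("responseHeader", {})
    params = header.get("params", {})
    return params.get("q.op"), params.get("mm")


# Hand-written sample shaped like a Solr JSON response; not real output.
sample = json.dumps({
    "responseHeader": {
        "status": 0,
        "params": {"q": "34567 something", "q.op": "AND", "echoParams": "all"},
    }
})

q_op, mm = q_op_and_mm(sample)
if q_op is not None and mm is not None:
    print("both q.op and mm are set; consider keeping only one")
else:
    print("q.op =", q_op, "mm =", mm)
```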

Thanks,
Shawn


Re: Edismax mysteries?

Posted by Jan Høydahl <ja...@cominvent.com>.

Dwane,

You may find this unresolved issue an interesting (and long) read https://issues.apache.org/jira/browse/SOLR-12779

Jan
