Posted to solr-user@lucene.apache.org by John Blythe <jo...@gmail.com> on 2018/09/12 02:32:08 UTC

6.x to 7.x differences

hi, all.

we recently migrated to cloud. part of that migration jumped us from 6.1 to
7.4.

one example query produces 42 results on our old solr instance and 19k
results on our new cloud instance.

the analyzer is the same aside from WordDelimiterFilterFactory moving over
to the graph variation of it and the lucene parser moving from 6.1 to 7.4
obviously.

i've used the analysis tool in solr admin to try to determine the
difference between the two. i'm seeing the same output between index and
query results, yet when actually running the queries i get that huge
divergence of results.

i'm left scratching my head at this point. i'm guessing it's from the
lucene parser? hoping to get some clarity from you guys!

thanks!

--
John Blythe

Re: 6.x to 7.x differences

Posted by Shawn Heisey <ap...@elyograg.org>.

On 9/11/2018 8:32 PM, John Blythe wrote:
> we recently migrated to cloud. part of that migration jumped us from 6.1 to
> 7.4.
>
> one example query between our old solr instance and our new cloud instance
> produces 42 results and 19k results.
>
> the analyzer is the same aside from WordDelimiterFilterFactory moving over
> to the graph variation of it and the lucene parser moving from 6.1 to 7.4
> obviously.

Did you completely reindex after changing your schema?  Not doing this, 
especially if attempting to use the index from the earlier version, can 
lead to problems.  Have you checked what happens if you use the 
non-graph version of WDF (and completely reindex), so you can see 
whether that changes anything?  That filter will disappear in 8.0, but 
it's still there for all of 7.x.

Adding "debug=query" to your URL parameters is very useful in locating 
differences.  Maybe 6.1 and 7.4 are parsing the query differently.  
There's a good chance that this will reveal something we can pursue.
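
A minimal sketch of that comparison in Python, for illustration only.  The 
host names and the "products" collection are made up.  The "debug" and 
"parsedquery" keys are where Solr's JSON response puts the parsed query 
when debug=query is set.  Fetch each URL and compare the parsedquery 
strings from the 6.1 and 7.4 instances.

```python
from urllib.parse import urlencode


def debug_url(base_url: str, collection: str, query: str) -> str:
    """Build a /select URL that asks Solr to include query-debug output."""
    params = {"q": query, "rows": 0, "wt": "json", "debug": "query"}
    return f"{base_url}/{collection}/select?{urlencode(params)}"


def parsed_query(response: dict) -> str:
    """Pull the parsed form of the query out of a decoded JSON response."""
    return response.get("debug", {}).get("parsedquery", "")


if __name__ == "__main__":
    # Hypothetical hosts and collection name; substitute your own.
    for base in ("http://old-solr:8983/solr", "http://new-solr:8983/solr"):
        print(debug_url(base, "products", "blue widget"))
```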

> i've used the analysis tool in solr admin to try to determine the
> difference between the two. i'm seeing the same output between index and
> query results yet when actually running the queries have that huge
> divergence of results.

One of the big differences between 6.x and 7.x for query parsing is that 
the sow (split on whitespace) parameter defaults to true in 6.x (and I 
think it didn't even exist in 6.1, so it's effectively true).  In 7.x, 
that parameter defaults to false.  So the query parser in 7.x tends to 
behave *exactly* like what you see in the analysis tool, whereas in 6.x 
the input would be split on whitespace before ever reaching analysis, 
which can result in very subtle differences in how the input is 
analyzed.  Adding "sow=true" to your URL parameters is something you can 
try as a quick test.
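
As a rough sketch of that quick test (Python, with a hypothetical host 
and collection): build the same query with and without sow=true, run 
both, and compare numFound between the two responses.

```python
from urllib.parse import urlencode


def select_url(base_url, collection, query, sow=None):
    """Build a /select URL, optionally forcing the sow parameter."""
    params = {"q": query, "rows": 0, "wt": "json"}
    if sow is not None:
        # Solr expects the literal strings "true" / "false".
        params["sow"] = "true" if sow else "false"
    return f"{base_url}/{collection}/select?{urlencode(params)}"


def num_found(response: dict) -> int:
    """Read the hit count from a decoded JSON select response."""
    return response["response"]["numFound"]


if __name__ == "__main__":
    # Hypothetical host and collection name; substitute your own.
    base = "http://new-solr:8983/solr"
    print(select_url(base, "products", "blue widget"))            # 7.x default
    print(select_url(base, "products", "blue widget", sow=True))  # 6.x-style
```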

Thanks,
Shawn

Re: 6.x to 7.x differences

Posted by John Blythe <jo...@gmail.com>.

thanks, shawn. yep, i saw the multi term synonym discussion when googling
around a bit after your first reply. pretty jazzed about finally getting to
tinker w that instead of creating our regex duct tape solution
for_multi_term_synonyms!

thanks again-

--
John Blythe



Re: 6.x to 7.x differences

Posted by Shawn Heisey <ap...@elyograg.org>.

On 9/12/2018 8:12 AM, John Blythe wrote:
> shawn: at first, no. we rsynced data up after running it through the
> migration tool. we'd gotten errors when using WDF so updated all instances
> of it to WDGF (and subsequently added FlattenGraphFilterFactory to each
> index analyzer that used WDGF to avoid errors).

The messages you get in the log from WDF are not errors. They are 
warnings.  Just letting you know that the filter will be removed in the 
next major version.

> the sow seems to be the key here. adding that to the query url dropped me
> from +19k to 62 results lol. 'subtle' is a not so subtle understatement in
> this case! i'm a big fan of finally being able to not be driven batty by
> the analysis vs. query results though, so looking forward to playing w that
> some more. for our immediate purposes, however, i think this solves it!

Setting sow=false is a key part of the "graph" nature of the new filters 
that aren't deprecated.  Mostly this is to support multi-word synonyms 
properly.

https://lucidworks.com/2017/04/18/multi-word-synonyms-solr-adds-query-time-support/

Thanks,
Shawn

Re: 6.x to 7.x differences

Posted by John Blythe <jo...@gmail.com>.

hey guys.

preeti: good thought, but this was something we were already aware of and
had accounted for. thanks tho!

shawn: at first, no. we rsynced data up after running it through the
migration tool. we'd gotten errors when using WDF so updated all instances
of it to WDGF (and subsequently added FlattenGraphFilterFactory to each
index analyzer that used WDGF to avoid errors).

the sow seems to be the key here. adding that to the query url dropped me
from +19k to 62 results lol. 'subtle' is a not so subtle understatement in
this case! i'm a big fan of finally being able to not be driven batty by
the analysis vs. query results though, so looking forward to playing w that
some more. for our immediate purposes, however, i think this solves it!

--
John Blythe



RE: 6.x to 7.x differences

Posted by Preeti Bhat <pr...@shoregrp.com>.

Hi John,

Please check the solrQueryParser option. It was removed in 7.0, so you will need to provide <str name="q.op">AND</str> in solrconfig.xml or pass the q.op parameter while querying to solve this problem. By default Solr treats it as an "OR" operation, which leads to too many results.

Old Way: In Managed-schema or schema.xml
<solrQueryParser defaultOperator="AND"/>

New Way: in solrconfig.xml

  <initParams path="/update/**,/query,/select,/tvrh,/elevate,/spell,/browse">
    <lst name="defaults">
      <str name="q.op">AND</str>
    </lst>
  </initParams>
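
For the per-request variant, a small Python sketch (the query text and the helper name are hypothetical) that passes q.op as a URL parameter instead of setting it in solrconfig.xml:

```python
from urllib.parse import urlencode


def with_default_operator(query: str, operator: str = "AND") -> str:
    """Return a query string that sets the default operator for one request."""
    if operator not in ("AND", "OR"):
        raise ValueError("q.op must be AND or OR")
    return urlencode({"q": query, "q.op": operator})


print(with_default_operator("blue widget"))  # q=blue+widget&q.op=AND
```

Append the returned string to the /select URL of the collection being queried.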


Thanks and Regards,
Preeti Bhat
