Posted to solr-user@lucene.apache.org by Webster Homer <we...@sial.com> on 2017/09/06 21:51:33 UTC

Consecutive calls to a query give different results

I am using Solr 6.2.0 configured as a SolrCloud with 2 shards and 4
replicas (a total of 4 nodes).

If I run the query multiple times, I see three different top-scoring
results. No data load is running, and all data has been committed.

I get these three different hits with their scores:
copperiinitratehemipentahydrate2325919004194      430.61722
copperiinitrateoncelite1234598765                 432.44238
copperiinitratehydrate18756anhydrousbasis13778319 428.24185

How is it that the same search against the same data can give different
responses? I looked at the specific cores and they look OK; the numDocs
for the replicas in a shard match.

This is the query:
http://ae1c-ecomdev-msc01.sial.com:8983/solr/sial-catalog-product/select?defType=edismax&fl=searchmv_en_keywords,%20searchmv_keywords,searchmv_pno,%20searchmv_en_s_pri_name,%20search_en_p_pri_name,%20search_pno%20[explain%20style=nl]&group.field=id_s&group.limit=30&group=true&group.sort=sort_ds%20asc&indent=on&mm=2%3C-25%25&q.op=OR&q=copper%20nitrate&qf=search_pid^500%20search_concat_pno^400%20searchmv_concat_sku^400%20searchmv_pno^300%20search_concat_pno_genr^100%20searchmv_pno_genr%20searchmv_p_skus_genr%20searchmv_user_term^200%20search_lform^190%20searchmv_en_acronym^180%20search_en_root_name^170%20searchmv_en_s_pri_name^160%20search_en_p_pri_name^150%20searchmv_en_synonyms^145%20searchmv_en_keywords^140%20search_en_sortkey^120%20searchmv_p_skus^100%20searchmv_chem_comp^90%20searchmv_en_name_suf%20searchmv_cas_number^80%20searchmv_component_cas^70%20search_beilstein^50%20search_color_idx^40%20search_ecnumber^30%20search_egecnumber^30%20search_femanumber^20%20searchmv_isbn^10%20search_mdl_number%20searchmv_en_page_title%20searchmv_en_descriptions%20searchmv_en_attributes%20searchmv_rtecs%20searchmv_lookahead_terms%20searchmv_xref_comparable_pno%20searchmv_xref_comparable_sku%20searchmv_xref_equivalent_pno%20searchmv_xref_exact_pno%20searchmv_xref_exact_sku%20searchmv_component_molform&rows=30&sort=score%20desc,sort_en_name%20asc,sort_ds%20asc,search_pid%20asc&wt=json


Re: Consecutive calls to a query give different results

Posted by Erick Erickson <er...@gmail.com>.
Whew! I haven't been lying to people for _years_......

On Thu, Sep 7, 2017 at 5:58 AM, Yonik Seeley <ys...@gmail.com> wrote:
> On Thu, Sep 7, 2017 at 12:47 AM, Erick Erickson <er...@gmail.com> wrote:
>> bq: and deleted documents are irrelevant to term statistics...
>>
>> Did you mean "relevant"? Or do I have to adjust my thinking _again_?
>
> One can make it work either way ;-)
> Whether a document is marked as deleted or not has no effect on term
> statistics (i.e. irrelevant)
> OR documents marked for deletion still count in term statistics (i.e. relevant)
>
> I guess I used the former because we don't go out of our way to still
> include deleted documents... it's just a side effect of the index
> structure that we don't (and can't easily) update statistics when a
> document is marked as deleted.
>
> -Yonik
>
>
>> Erick
>>
>> On Wed, Sep 6, 2017 at 7:48 PM, Yonik Seeley <ys...@gmail.com> wrote:
>>> Different replicas of the same shard can have different numbers of
>>> deleted documents (really just marked as deleted), and deleted
>>> documents are irrelevant to term statistics (like the number of
>>> documents a term appears in).  Documents marked for deletion stop
>>> contributing to corpus statistics when they are actually removed (via
>>> expunge deletes, merges, optimizes).
>>> -Yonik
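A minimal Lucene sketch illustrating the point above (the index path,
field, and term are illustrative assumptions, not from the thread):
numDocs() excludes documents marked as deleted, but docFreq() is read from
segment-level statistics and still counts them until a merge physically
removes them.

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;

    public class DeletedDocStats {
        public static void main(String[] args) throws Exception {
            // Placeholder path; point it at any Lucene/Solr core index.
            try (DirectoryReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get("/var/solr/data/core1/index")))) {
                // Live-document counts exclude deletions...
                System.out.println("numDocs        = " + reader.numDocs());
                System.out.println("numDeletedDocs = " + reader.numDeletedDocs());
                // ...but docFreq still reflects docs only marked deleted, so two
                // replicas with identical numDocs can still score differently.
                Term t = new Term("searchmv_en_keywords", "copper");
                System.out.println("docFreq        = " + reader.docFreq(t));
            }
        }
    }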

Re: Consecutive calls to a query give different results

Posted by Webster Homer <we...@sial.com>.
We have several cloud collections, but this one is updated once a day with
a partial load, and once a week with a full load followed by a delete based
on an index_date field (the timestamp of the Solr record).

For this and related collections, optimizing once per day is probably
acceptable.

We do have other collections that are updated every 15 minutes; from what
you write, I don't think those could be optimized.
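For those frequently updated collections, one possibility worth testing (a
sketch only; the base URL and collection name are placeholders, and this
alternative is not suggested elsewhere in the thread) is an occasional hard
commit with expungeDeletes=true, which merges away only the segments
containing deletions instead of rewriting the whole index:

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest.ACTION;
    import org.apache.solr.client.solrj.request.UpdateRequest;

    public class ExpungeDeletes {
        public static void main(String[] args) throws Exception {
            // Deprecated constructor kept for SolrJ 6.x compatibility.
            try (HttpSolrClient client =
                     new HttpSolrClient("http://localhost:8983/solr")) {
                UpdateRequest req = new UpdateRequest();
                req.setAction(ACTION.COMMIT, true, true); // hard commit, wait for searcher
                req.setParam("expungeDeletes", "true");   // merge segments holding deletes
                req.process(client, "sial-catalog-product");
            }
        }
    }

This is equivalent to hitting update?commit=true&expungeDeletes=true over
HTTP.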




Re: Consecutive calls to a query give different results

Posted by Erick Erickson <er...@gmail.com>.
bq: So apparently it IS essential to run optimize after a data load

Don't do this if you can avoid it; you run the risk of an excessive
amount of your index consisting of deleted documents, unless you follow a
process whereby you periodically (and I'm talking at least hours, if not
once per day) index data and then don't change the index for a bunch more
hours.

You're missing the point when it comes to deleted docs. Different
replicas of the _same_ shard commit at different wall clock times due
to network delays. Therefore, which segments are merged will not be
identical between replicas when a commit happens, since commits are
local.

So replica1 may merge segments 1, 3, 6 into segment 7, while
replica2 may merge segments 1, 2, 4 into segment 7.

Here's the key: now replica1 may have 100 deleted documents (ones marked
as deleted but still in segments 2, 4 and 5), while replica2 may have 90
deleted documents (the ones still in segments 3, 5 and 6).

The term frequency and document frequency statistics for some terms are
therefore _not_ the same, so the scoring will be slightly different.
Depending on which replica serves the query, the order of docs may differ
somewhat when the scores are close.

Optimizing squeezes all the deleted documents out of all the replicas,
so the scores become identical.

This doesn't happen, of course, if you have only one replica.

Best,
Erick
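If, after weighing the trade-offs Erick describes, a nightly optimize is
still wanted, a minimal SolrJ sketch might look like the following (the
base URL and collection name are placeholders):

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.UpdateResponse;

    public class NightlyOptimize {
        public static void main(String[] args) throws Exception {
            // Deprecated constructor kept for SolrJ 6.x compatibility.
            try (HttpSolrClient client =
                     new HttpSolrClient("http://localhost:8983/solr")) {
                // Merges each replica down to a single segment, physically
                // removing deleted documents so term statistics line up again.
                UpdateResponse rsp = client.optimize("sial-catalog-product");
                System.out.println("optimize status: " + rsp.getStatus());
            }
        }
    }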


Re: Consecutive calls to a query give different results

Posted by Webster Homer <we...@sial.com>.
We have several Solr clouds; a couple of them have only 1 replica per
shard. We have never observed the problem when we have a single replica,
only when there are multiple replicas per shard.
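One way to confirm that the replicas themselves score differently (a
sketch; the core URLs, query, and qf field are illustrative) is to send the
same query directly to each replica of a shard with distrib=false and
compare the top hits:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocumentList;

    public class CompareReplicas {
        public static void main(String[] args) throws Exception {
            // Placeholder core URLs; list every replica of one shard.
            String[] cores = {
                "http://host1:8983/solr/sial-catalog-product_shard1_replica1",
                "http://host2:8983/solr/sial-catalog-product_shard1_replica2"
            };
            SolrQuery q = new SolrQuery("copper nitrate");
            q.set("defType", "edismax");
            q.set("qf", "searchmv_en_keywords"); // illustrative subset of the real qf
            q.set("distrib", "false");           // score against this core only
            q.setFields("id", "score");
            q.setRows(3);
            for (String url : cores) {
                try (HttpSolrClient client = new HttpSolrClient(url)) {
                    SolrDocumentList docs = client.query(q).getResults();
                    // Different scores here, with identical numDocs, point at
                    // differing deleted-document counts between the replicas.
                    System.out.println(url + " -> top hit: " + docs.get(0));
                }
            }
        }
    }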


Re: Consecutive calls to a query give different results

Posted by Webster Homer <we...@sial.com>.
The scores are not the same:

Doc     Score
305340  432.44238
C2646   428.24185
12837   430.61722

One other thing: I just ran optimize, and now document 305340 is
consistently the top score. So apparently it IS essential to run optimize
after a data load.

Note that we see this behavior fairly commonly on our Solr cloud
instances; this was not the first time. This particular situation was on a
development system.


Re: Consecutive calls to a query give different results

Posted by Webster Homer <we...@sial.com>.
The scores are not the same. For example:
Doc 305340: 432.44238
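
For reference, I pulled the scores with a trimmed-down form of the query in
my first mail (a sketch; the full qf list is omitted here for brevity):

http://ae1c-ecomdev-msc01.sial.com:8983/solr/sial-catalog-product/select?q=copper%20nitrate&defType=edismax&fl=id,score&rows=3&wt=json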

On Thu, Sep 7, 2017 at 10:02 AM, David Hastings <
hastings.recursive@gmail.com> wrote:

> "I am concerned that the same
> search gives different results after each search. The top document seems to
> cycle between 3 different documents"
>
>
> If you do a debug query on the search, are the scores for the top 3
> documents the same or not? You can easily have three documents with the
> same score, so when you have a result set that is ranked 1-1-1-2-3-4...
> you can expect the 1-1-1 group to rotate arbitrarily. Perhaps use a
> second element, like the id, in your ranking.

Re: Consecutive calls to a query give different results

Posted by David Hastings <ha...@gmail.com>.
"I am concerned that the same
search gives different results after each search. The top document seems to
cycle between 3 different documents"


If you do a debug query on the search, are the scores for the top 3 documents
the same or not? You can easily have three documents with the same score, so
when you have a result set that is ranked 1-1-1-2-3-4... you can expect the
1-1-1 group to rotate arbitrarily. Perhaps use a second element, like the id,
in your ranking.
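
For example, something like this (a sketch; substitute your own collection
URL and unique key field):

http://localhost:8983/solr/yourcollection/select?q=copper%20nitrate&fl=id,score&rows=3&debugQuery=true&sort=score%20desc,id%20asc

The explain section of the debug output will show whether the three scores
are truly identical, and the id tie-breaker makes the order deterministic
when they are.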





Re: Consecutive calls to a query give different results

Posted by Erick Erickson <er...@gmail.com>.
Here's Mike McCandless' blog on the topic:

https://www.elastic.co/blog/lucenes-handling-of-deleted-documents

The same options he mentions are available in Solr as both use Lucene
under the covers.

The long and short of it is that you can have a significant number of
deleted documents in your index, depending on the update pattern.

One thing Mike doesn't mention is the root of why I'm so negative about
optimize (and forceMergeDeletes, a.k.a. expungeDeletes in Solr, is just an
optimize that only mashes segments together if they have > X% deleted
docs). Let's say your max segment size is 5G and you optimize an index
down to a single 100G segment. That segment will _not_ be merged again
until it has < 2.5G live docs. That's not a typo: 97.5% deleted docs.

You could ameliorate this somewhat by specifying the number of segments
to optimize down to (the default is 1). Say you determine that you have
100G of live data; specify 20 segments for the optimize and you're back
near the 5G maximum. This would be better, I'd guess, but I haven't
tested it personally.
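
For example (a sketch; host and collection name are placeholders):

http://localhost:8983/solr/yourcollection/update?optimize=true&maxSegments=20

maxSegments controls how many segments the optimize leaves behind rather
than the default single segment.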

Best,
Erick

On Fri, Sep 8, 2017 at 10:36 AM, Webster Homer <we...@sial.com> wrote:
> Thank you, Erick Erickson and Shawn Heisey for your excellent answers.
> For some of our collections, it would seem that an occasional optimize
> would be a good thing. However we have some collections that are updated
> constantly
>
> Would using the commit expungeDeletes help mitigate the issue?
>
> I also came across a discussion of Lucene merge policies. and the
> TieredMergePolicy.
> Is there documentation about this? I notice that a couple of our replicas
> in some of our collections have ~30% deleted documents which I would think
> would contribute to the problem.
> I have at least 3 collections that are updated constantly, and would not
> lend themselves to being optimized what is the best approach for these?
>
> Thanks
>
> On Fri, Sep 8, 2017 at 9:47 AM, Shawn Heisey <ap...@elyograg.org> wrote:
>
>> On 9/7/2017 8:54 AM, Webster Homer wrote:
>> > I am not concerned about deleted documents. I am concerned that the same
>> > search gives different results after each search. The top document seems
>> to
>> > cycle between 3 different documents
>> >
>> > I have an enhanced collections info api call that calls the core admin
>> api
>> > to get the index information for the replica.
>> > When I said the numdocs were the same I meant exactly that. maxdocs and
>> > deleted documents are not the same for the replicas, but the number of
>> > numdocs is.
>> >
>> > Or are you saying that the search is looking at deleted documents
>> wouldn't
>> > that be a very significant bug?
>>
>> Lucene score calculations take a lot of information in the index into
>> account when calculating the score.  That includes deleted documents,
>> because they are part of the index.  When you delete a document, Lucene
>> just makes a note saying "internal document ID number NNNN is deleted."
>> The actual information for that document is not removed from the index,
>> because doing so could take a very long time.
>>
>> When you make queries against a replicated SolrCloud, the queries are
>> load balanced across the entire cloud, so different queries will hit
>> different replicas.  With different numbers of deleted documents in
>> different replicas (which is not unusual), the scores are going to come
>> out a little bit different on each query.  If you're sorting by score
>> (which is the default sort), that *can* affect the order.  Your replicas
>> have a fairly high percentage of deleted documents, so there is a lot of
>> extra information affecting the scores.  The relative difference in the
>> deleted document count between the replicas is high as well, so multiple
>> queries could be substantially different.
>>
>> It is not a bug that Lucene and Solr look at deleted documents.
>> Removing deleted document information from things like the score
>> calculation would be VERY computationally intense, bordering on the
>> impossible.  To assure good performance, Lucene doesn't even try.
>> Because the way Lucene tracks deleted documents is with a list of
>> internal Lucene document IDs, those documents are easily removed from
>> *results*, but their contents are an integral part of the index and that
>> information can only be truly removed by completely rewriting (merging)
>> the index segments.
>>
>> You can get rid of all deleted documents with an optimize operation,
>> which is a forced merge of the entire index down to one segment -- but
>> just like it sounds, that is a complete rewrite of the index.  It
>> involves a huge amount of CPU resources and disk I/O, and can severely
>> impact normal indexing and query operations while it's happening.  If
>> the collection is extremely large, an optimize could take hours.  For
>> indexes that change rapidly, optimize is strongly discouraged, except as
>> an occasional "clean things up" operation, run during non-peak times.
>>
>> Thanks,
>> Shawn
>>
>>
>
> --
>
>
> This message and any attachment are confidential and may be privileged or
> otherwise protected from disclosure. If you are not the intended recipient,
> you must not copy this message or attachment or disclose the contents to
> any other person. If you have received this transmission in error, please
> notify the sender immediately and delete the message and any attachment
> from your system. Merck KGaA, Darmstadt, Germany and any of its
> subsidiaries do not accept liability for any omissions or errors in this
> message which may arise as a result of E-Mail-transmission or for damages
> resulting from any unauthorized changes of the content of this message and
> any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its
> subsidiaries do not guarantee that this message is free of viruses and does
> not accept liability for any damages caused by any virus transmitted
> therewith.
>
> Click http://www.emdgroup.com/disclaimer to access the German, French,
> Spanish and Portuguese versions of this disclaimer.

Re: Consecutive calls to a query give different results

Posted by Webster Homer <we...@sial.com>.
Thank you, Erick Erickson and Shawn Heisey, for your excellent answers.
For some of our collections it would seem that an occasional optimize
would be a good thing. However, we have some collections that are updated
constantly.

Would using a commit with expungeDeletes help mitigate the issue?

I also came across a discussion of Lucene merge policies and the
TieredMergePolicy. Is there documentation about this? I notice that a
couple of replicas in some of our collections have ~30% deleted
documents, which I would think would contribute to the problem.
I have at least 3 collections that are updated constantly and would not
lend themselves to being optimized; what is the best approach for these?
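
For concreteness, this is the sort of thing I had in mind (sketches;
collection name and values are placeholders). A commit with expungeDeletes:

http://localhost:8983/solr/mycollection/update?commit=true&expungeDeletes=true

and, for the merge policy, something along these lines in the indexConfig
section of solrconfig.xml:

<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
  <double name="reclaimDeletesWeight">3.0</double>
</mergePolicyFactory>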

Thanks

On Fri, Sep 8, 2017 at 9:47 AM, Shawn Heisey <ap...@elyograg.org> wrote:

> On 9/7/2017 8:54 AM, Webster Homer wrote:
> > I am not concerned about deleted documents. I am concerned that the same
> > search gives different results after each search. The top document seems
> to
> > cycle between 3 different documents
> >
> > I have an enhanced collections info api call that calls the core admin
> api
> > to get the index information for the replica.
> > When I said the numdocs were the same I meant exactly that. maxdocs and
> > deleted documents are not the same for the replicas, but the number of
> > numdocs is.
> >
> > Or are you saying that the search is looking at deleted documents
> wouldn't
> > that be a very significant bug?
>
> Lucene score calculations take a lot of information in the index into
> account when calculating the score.  That includes deleted documents,
> because they are part of the index.  When you delete a document, Lucene
> just makes a note saying "internal document ID number NNNN is deleted."
> The actual information for that document is not removed from the index,
> because doing so could take a very long time.
>
> When you make queries against a replicated SolrCloud, the queries are
> load balanced across the entire cloud, so different queries will hit
> different replicas.  With different numbers of deleted documents in
> different replicas (which is not unusual), the scores are going to come
> out a little bit different on each query.  If you're sorting by score
> (which is the default sort), that *can* affect the order.  Your replicas
> have a fairly high percentage of deleted documents, so there is a lot of
> extra information affecting the scores.  The relative difference in the
> deleted document count between the replicas is high as well, so multiple
> queries could be substantially different.
>
> It is not a bug that Lucene and Solr look at deleted documents.
> Removing deleted document information from things like the score
> calculation would be VERY computationally intense, bordering on the
> impossible.  To assure good performance, Lucene doesn't even try.
> Because the way Lucene tracks deleted documents is with a list of
> internal Lucene document IDs, those documents are easily removed from
> *results*, but their contents are an integral part of the index and that
> information can only be truly removed by completely rewriting (merging)
> the index segments.
>
> You can get rid of all deleted documents with an optimize operation,
> which is a forced merge of the entire index down to one segment -- but
> just like it sounds, that is a complete rewrite of the index.  It
> involves a huge amount of CPU resources and disk I/O, and can severely
> impact normal indexing and query operations while it's happening.  If
> the collection is extremely large, an optimize could take hours.  For
> indexes that change rapidly, optimize is strongly discouraged, except as
> an occasional "clean things up" operation, run during non-peak times.
>
> Thanks,
> Shawn
>
>

-- 


This message and any attachment are confidential and may be privileged or 
otherwise protected from disclosure. If you are not the intended recipient, 
you must not copy this message or attachment or disclose the contents to 
any other person. If you have received this transmission in error, please 
notify the sender immediately and delete the message and any attachment 
from your system. Merck KGaA, Darmstadt, Germany and any of its 
subsidiaries do not accept liability for any omissions or errors in this 
message which may arise as a result of E-Mail-transmission or for damages 
resulting from any unauthorized changes of the content of this message and 
any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its 
subsidiaries do not guarantee that this message is free of viruses and does 
not accept liability for any damages caused by any virus transmitted 
therewith.

Click http://www.emdgroup.com/disclaimer to access the German, French, 
Spanish and Portuguese versions of this disclaimer.

Re: Consecutive calls to a query give different results

Posted by Shawn Heisey <ap...@elyograg.org>.
On 9/7/2017 8:54 AM, Webster Homer wrote:
> I am not concerned about deleted documents. I am concerned that the same
> search gives different results after each search. The top document seems
> to cycle between 3 different documents.
>
> I have an enhanced collections info API call that calls the core admin
> API to get the index information for each replica.
> When I said the numDocs were the same I meant exactly that: maxDocs and
> deleted documents are not the same across the replicas, but numDocs is.
>
> Or are you saying that the search is looking at deleted documents?
> Wouldn't that be a very significant bug?

Lucene score calculations take a lot of information in the index into
account when calculating the score.  That includes deleted documents,
because they are part of the index.  When you delete a document, Lucene
just makes a note saying "internal document ID number NNNN is deleted." 
The actual information for that document is not removed from the index,
because doing so could take a very long time.

When you make queries against a replicated SolrCloud, the queries are
load balanced across the entire cloud, so different queries will hit
different replicas.  With different numbers of deleted documents in
different replicas (which is not unusual), the scores are going to come
out a little bit different on each query.  If you're sorting by score
(which is the default sort), that *can* affect the order.  Your replicas
have a fairly high percentage of deleted documents, so there is a lot of
extra information affecting the scores.  The relative difference in the
deleted document count between the replicas is high as well, so multiple
queries could be substantially different.

It is not a bug that Lucene and Solr look at deleted documents. 
Removing deleted document information from things like the score
calculation would be VERY computationally intense, bordering on the
impossible.  To assure good performance, Lucene doesn't even try. 
Because the way Lucene tracks deleted documents is with a list of
internal Lucene document IDs, those documents are easily removed from
*results*, but their contents are an integral part of the index and that
information can only be truly removed by completely rewriting (merging)
the index segments.

You can get rid of all deleted documents with an optimize operation,
which is a forced merge of the entire index down to one segment -- but
just like it sounds, that is a complete rewrite of the index.  It
involves a huge amount of CPU resources and disk I/O, and can severely
impact normal indexing and query operations while it's happening.  If
the collection is extremely large, an optimize could take hours.  For
indexes that change rapidly, optimize is strongly discouraged, except as
an occasional "clean things up" operation, run during non-peak times.
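
If you do decide on an occasional optimize, it is a single call to the
update handler (a sketch; host and collection name are placeholders):

http://localhost:8983/solr/mycollection/update?optimize=true

By default this merges down to one segment; the maxSegments parameter lets
you leave more.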

Thanks,
Shawn


Re: Consecutive calls to a query give different results

Posted by Webster Homer <we...@sial.com>.
I am not concerned about deleted documents. I am concerned that the same
search gives different results after each search. The top document seems
to cycle between 3 different documents.

I have an enhanced collections info API call that calls the core admin
API to get the index information for each replica.
When I said the numDocs were the same I meant exactly that: maxDocs and
deleted documents are not the same across the replicas, but numDocs is.

Or are you saying that the search is looking at deleted documents?
Wouldn't that be a very significant bug?

The four replicas:
shard1
core_node1
"numDocs": 383817,
"maxDocs": 611592,
"deletedDocs": 227775,
"size": "2.49 GB",
"lastModified": "2017-09-07T08:18:03.639Z",
"current": true,
"version": 35644,
"segmentCount": 28

core_node3
"numDocs": 383817,
"maxDocs": 571737,
"deletedDocs": 187920,
"size": "2.85 GB",
"lastModified": "2017-09-07T08:18:03.634Z",
"current": false,
"version": 35562,
"segmentCount": 36
shard2
core_node2
"numDocs": 385326,
"maxDocs": 529214,
"deletedDocs": 143888,
"size": "2.13 GB",
"lastModified": "2017-09-07T08:18:03.632Z",
"current": true,
"version": 34783,
"segmentCount": 24
core_node4
"numDocs": 385326,
"maxDocs": 488201,
"deletedDocs": 102875,
"size": "1.96 GB",
"lastModified": "2017-09-07T08:18:03.633Z",
"current": true,
"version": 34932,
"segmentCount": 21



Re: Consecutive calls to a query give different results

Posted by Yonik Seeley <ys...@gmail.com>.
On Thu, Sep 7, 2017 at 12:47 AM, Erick Erickson <er...@gmail.com> wrote:
> bq: and deleted documents are irrelevant to term statistics...
>
> Did you mean "relevant"? Or do I have to adjust my thinking _again_?

One can make it work either way ;-)
Whether a document is marked as deleted or not has no effect on term
statistics (i.e. irrelevant)
OR documents marked for deletion still count in term statistics (i.e. relevant)

I guess I used the former because we don't go out of our way to still
include deleted documents... it's just a side effect of the index
structure that we don't (and can't easily) update statistics when a
document is marked as deleted.

-Yonik



Re: Consecutive calls to a query give different results

Posted by Erick Erickson <er...@gmail.com>.
bq: and deleted documents are irrelevant to term statistics...

Did you mean "relevant"? Or do I have to adjust my thinking _again_?

Erick


Re: Consecutive calls to a query give different results

Posted by Yonik Seeley <ys...@gmail.com>.
Different replicas of the same shard can have different numbers of
deleted documents (really just marked as deleted), and deleted
documents are irrelevant to term statistics (like the number of
documents a term appears in).  Documents marked for deletion stop
contributing to corpus statistics when they are actually removed (via
expunge deletes, merges, optimizes).
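
You can see this directly by querying each core with distribution disabled
(a sketch; host, core name, and query are placeholders):

http://localhost:8983/solr/sial-catalog-product_shard1_replica1/select?q=copper%20nitrate&fl=id,score&rows=3&distrib=false&debugQuery=true

The explain output includes the docFreq/docCount used for each term, and
those numbers will diverge across replicas as the deletes diverge.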
-Yonik

