Posted to solr-user@lucene.apache.org by AshB <bi...@gmail.com> on 2019/01/04 11:40:48 UTC

Solr relevancy score different on replicated nodes

Version: Solr 7.4.0, ZooKeeper 3.4.11
Architecture: two boxes, Machine-1 and Machine-2, each holding a single
instance of Solr

We have a collection which was single shard and single replica, i.e. s=1
and rf=1.

A few days back we added a replica to it, but the score for the same query
comes out different on the different replicas.

http://Machine-1:8983/solr/MyTestCollection/select?q=%22data%22+OR+(data)&rows=10&fl=score&defType=edismax&qf=search_field+content&wt=json

"response":{"numFound":5836,"start":0,"maxScore":4.418847,"docs":[

whereas on the other machine (replica)

http://Machine-2:8983/solr/MyTestCollection/select?q=%22data%22+OR+(data)&rows=10&fl=score&defType=edismax&qf=search_field+content&wt=json

"response":{"numFound":5836,"start":0,"maxScore":4.4952264,"docs":[

The maxScore is different.

Relevancy is affected by sharding, but we did not expect that from
replication, as the same documents get copied to the other node. The score
explanation shows that docCount and docFreq are uneven.

idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
1.050635000 docCount: 10020.000000000 docFreq: 3504.0000000

idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
1.068795100 docCount: 10291.000000000 docFreq: 3534.0000000

Is this expected? What could be wrong here? Please suggest.
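
The two idf values can be reproduced from the docCount and docFreq figures in the explain output. A minimal Python sketch, assuming only the formula printed above (the function name `bm25_idf` is made up for illustration):

```python
import math

def bm25_idf(doc_count, doc_freq):
    """idf as printed in the explain output:
    log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    """
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# docCount / docFreq pairs copied from the two explain outputs
idf_a = bm25_idf(10020, 3504)
idf_b = bm25_idf(10291, 3534)

print(round(idf_a, 6), round(idf_b, 6))
```

Both nodes apply the same formula; only the inputs differ, and that difference carries through to maxScore.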



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Solr relevancy score different on replicated nodes

Posted by Mikhail Khludnev <mk...@apache.org>.
Replicated segments might have different deleted documents by design.
Precise numbers can be achieved via exact stats; see
https://lucene.apache.org/solr/guide/6_6/distributed-requests.html#DistributedRequests-ConfiguringstatsCache_DistributedIDF_


On Fri, Jan 4, 2019 at 2:40 PM AshB <bi...@gmail.com> wrote:



-- 
Sincerely yours
Mikhail Khludnev

Re: Solr relevancy score different on replicated nodes

Posted by Aman Tandon <am...@gmail.com>.
Thanks Erick for your suggestions and time.

On Tue, Feb 12, 2019, 22:32 Erick Erickson <erickerickson@gmail.com wrote:


Re: Solr relevancy score different on replicated nodes

Posted by Erick Erickson <er...@gmail.com>.
You really only have four options:
1> use exactstats. This won't guarantee precise matches, but they'll be closer.
2> optimize (not particularly recommended, but if you're willing to do
   it periodically it'll have the stats match until the next updates).
3> use TLOG/PULL replicas and confine the requests to the PULL
   replicas. There'll _still_ be some window for mismatches;
   specifically, the default is commit_interval/2.
4> define the problem away.
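
For option 3, one way to steer queries toward PULL replicas is the `shards.preference` request parameter. The sketch below only builds the request URL; it assumes your Solr version supports `shards.preference` with `replica.type`, and the helper name, host and collection are placeholders taken from this thread.

```python
from urllib.parse import urlencode

def pull_preferred_query_url(base_url, collection, query):
    """Build a select URL that asks Solr to prefer PULL replicas."""
    params = {
        "q": query,
        "fl": "score",
        # A preference, not a hard restriction: other replica types
        # may still be used if no PULL replica is available.
        "shards.preference": "replica.type:PULL",
    }
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"

url = pull_preferred_query_url("http://Machine-1:8983", "MyTestCollection", "data")
print(url)
```

Because this is only a preference, it narrows the mismatch window rather than closing it.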

Best,
Erick

On Tue, Feb 12, 2019 at 2:42 AM Aman Tandon <am...@gmail.com> wrote:

Re: Solr relevancy score different on replicated nodes

Posted by Aman Tandon <am...@gmail.com>.
Hi Erick,

Any suggestions on this?

Regards,
Aman

On Fri, Feb 8, 2019, 17:07 Aman Tandon <amantandon.10@gmail.com wrote:


Re: Solr relevancy score different on replicated nodes

Posted by Aman Tandon <am...@gmail.com>.
Hi Erick,

I find this thread very relevant to people who are facing the same
problem.

In our case, we have a signals aggregation collection with a total of
around 8 million records. We have a SolrCloud architecture (3 shards and 4
replicas), and the whole index is around 2.5 GB.

We use this collection to fetch the most-clicked products for a query and
boost them in search results. The boost score is the query score on the
aggregation collection.

But when the query goes to a different replica, we get a different boost
score for some of the keywords, so on page refresh the ordering of results
keeps changing.

To solve this we tried the exactstats cache for distributed IDF, and at
debug level I see the global stats merge in the logs, but we still get
different scores when refreshing the results from the aggregation collection.

Our indexing occurs once a day, so should we optimize daily, or should
we reduce the merge segment count to 2/3? Currently it is -1.

What are your suggestions on this?

Regards,
Aman

On Fri, Feb 8, 2019, 00:15 Erick Erickson <erickerickson@gmail.com wrote:


Re: Solr relevancy score different on replicated nodes

Posted by Erick Erickson <er...@gmail.com>.
Optimization is safe. The large segment is irrelevant; you'll
lose a little parallelization, but on an index with this few
documents I doubt you'll notice.

As of Solr 7.5, optimize will respect the max segment size,
which defaults to 5G, but you're well under that limit.

Best,
Erick

On Sun, Feb 3, 2019 at 11:54 PM Ashish Bisht <bi...@gmail.com> wrote:

Re: Solr relevancy score different on replicated nodes

Posted by Ashish Bisht <bi...@gmail.com>.
Thanks Erick and everyone. We are checking on the stats cache.

I noticed stats skew again and optimized the index to correct it, as per
these documents:

https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/
and
https://lucidworks.com/2018/06/20/solr-and-optimizing-your-index-take-ii/

I wanted to check on the points below, considering we want the stats skew
to be corrected.

1. Once optimized, a single segment won't be naturally merged easily. As we
might be doing a manual optimize every time, what I visualize is that at a
certain point in the future we might have a single large segment. What
impact is this large segment going to have?
Our index is ~30k documents, i.e. files with content (segment size <1 GB as
of now).

2. Do you recommend going for optimize in these situations? Probably it will
be done only when the stats skew. Is it safe?

Regards
Ashish

 





Re: Solr relevancy score different on replicated nodes

Posted by Walter Underwood <wu...@wunderwood.org>.
Is this a sharded Solr Cloud collection? If so, you can try using global IDF.
That should make the scores more similar on different nodes.

https://lucene.apache.org/solr/guide/6_6/distributed-requests.html#DistributedRequests-ConfiguringstatsCache_DistributedIDF_

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jan 29, 2019, at 10:38 AM, David Hastings <ha...@gmail.com> wrote:


Re: Solr relevancy score different on replicated nodes

Posted by David Hastings <ha...@gmail.com>.
Maybe instead of using the Solr score in your metrics, find a way to use
the document's position in the results? You can never trust the score to
be consistent; it's constantly changing as the index changes.
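
One possible reading of this suggestion, sketched in Python: derive the relevancy part of the combined score from the document's rank instead of its raw score. The linear decay and the function name are assumptions made for illustration; nothing in this thread prescribes a particular mapping.

```python
def rank_based_relevance(rank, num_found, weight=80.0):
    """Relevancy component from the 0-based rank of a document.

    The top result gets the full weight; the value decays linearly
    with rank. Raw Solr scores are not used at all.
    """
    return weight * (1 - rank / num_found)

top = rank_based_relevance(0, 100)
middle = rank_based_relevance(50, 100)
print(top, middle)
```

With this mapping, small score differences between replicas change the value only when they actually change the order of the documents.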

On Tue, Jan 29, 2019 at 1:29 PM Ashish Bisht <bi...@gmail.com>
wrote:


Re: Solr relevancy score different on replicated nodes

Posted by Ashish Bisht <bi...@gmail.com>.
Hi Erick, 

Our business wanted the score not to be based totally on the default
relevancy algorithm, but instead on a mix of Solr relevancy + user metrics
(80% + 20%).

Each result doc's score is calculated against the max score as a fraction
of 80. The remaining 20 comes from user metrics.

Finally, the sort happens on the new score.

But say we got the first page correctly, and for the second page the request
goes to another replica where the max score is different. The UI may then
show a wrong sort order compared to the first page. E.g. the last value of
page 1 is 70 and the first value of the second page can be 72, i.e.
distorted sorting.

On top of that, we are not using pagination but an infinite scroll, which
makes it more noticeable.
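
The 80/20 mix described above can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: the function name is invented, `user_metric` is assumed to be normalized to 0..1, and the raw score 3.9 is arbitrary. The two maxScore values are the ones reported at the start of this thread.

```python
def blended_score(solr_score, max_score, user_metric):
    """80 points from Solr relevancy (relative to maxScore)
    plus 20 points from a user metric assumed to lie in 0..1."""
    return 80.0 * solr_score / max_score + 20.0 * user_metric

# Same raw score and user metric, normalized against the maxScore
# reported by each of the two machines.
on_machine_1 = blended_score(3.9, 4.418847, 0.5)
on_machine_2 = blended_score(3.9, 4.4952264, 0.5)
print(round(on_machine_1, 2), round(on_machine_2, 2))
```

In practice the raw score of a document would also differ between replicas, so the real distortion can be larger or smaller than in this simplified comparison.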

Please suggest. 

Regards
Ashish









Re: Solr relevancy score different on replicated nodes

Posted by Erick Erickson <er...@gmail.com>.
No, this is not a bug but a consequence of the design. ExactStats can help,
but there is no guarantee that different replicas will compute the exact same
score. Scores should be very close, however.

You haven't explained why you need the scores to match. 99% of the time,
worrying about scores at this level is misguided. So I'd really try to
figure out whether they're necessary or not.

Best,
Erick

On Tue, Jan 29, 2019 at 1:51 AM Ashish Bisht <bi...@gmail.com> wrote:

Re: Solr relevancy score different on replicated nodes

Posted by Ashish Bisht <bi...@gmail.com>.
Hi Erick,

To test this scenario I added the replica again and for a few days have been
monitoring metrics like Num Docs, Max Doc, and Deleted Docs from the
*Overview* section of the core. I checked the *Segments Info* section too.
Everything looks in sync.

http://<mach-1>:8983/solr/#/MyTestCollection_*shard1_replica_n7*/
http://<mach-2>:8983/solr/#/MyTestCollection_*4_shard1_replica_n7*/

If they go out of sync in the future, I just wanted to confirm whether this
is a bug, although you mentioned:

*bq. Shouldn't both replica and leader come to same state 
after this much long period. 

No. After that long, the docs will be the same, all the docs 
present on one replica will be present and searchable on 
the other. However, they will be in different segments so the 
"stats skew" will remain. *


We need these scores, so as a temporary solution, what if we monitor these
metrics for any issues and take action (either optimize or delete and re-add
the replica) accordingly? Does that make sense?
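
The monitoring idea could be sketched like this in Python. How the numbers are fetched is left out; the dictionary keys are assumed to mirror the Num Docs, Max Doc and Deleted Docs values from the core Overview, and the sample figures are hypothetical.

```python
def stats_skew(replica_a, replica_b, keys=("numDocs", "maxDoc", "deletedDocs")):
    """Return {metric: (value_a, value_b)} for every metric that differs."""
    return {
        k: (replica_a[k], replica_b[k])
        for k in keys
        if replica_a[k] != replica_b[k]
    }

# Hypothetical numbers, for illustration only.
a = {"numDocs": 30000, "maxDoc": 30250, "deletedDocs": 250}
b = {"numDocs": 30000, "maxDoc": 30410, "deletedDocs": 410}

skew = stats_skew(a, b)
print(skew)
```

An empty result means the two replicas report the same counts; a difference in maxDoc or deletedDocs while numDocs matches is the stats skew discussed in this thread.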




Re: Solr relevancy score different on replicated nodes

Posted by Erick Erickson <er...@gmail.com>.
What Elizabeth said.

Really, this is an intractable problem. Even in the TLOG
and PULL replica case, the replicas of an index getting updates
will still fire their replication requests at different wall-clock
times. Even if that were coordinated, the vagaries of
networks etc. would _still_ mean the various replicas
would see slightly different "snapshots" of the index.
True, the window would be smaller....

The only situations I've seen where the scores on different
replicas are always identical are when the index is optimized,
which isn't recommended except if you can do it
all the time, or when TLOG and PULL replicas are used and
the index is not undergoing continuous updates.

As for locking subsequent requests to a set of nodes, the
idea has been bandied about but usually falls down when
it's realized that this has the potential to unevenly distribute
the load.

Best,
Erick

On Fri, Jan 11, 2019 at 3:13 AM Elizabeth Haubert
<eh...@opensourceconnections.com> wrote:

Re: Solr relevancy score different on replicated nodes

Posted by Elizabeth Haubert <eh...@opensourceconnections.com>.
Hello,

To a certain extent, I agree with Erick that this isn't a problem, but
looks like one.  The nature of TF*IDF is such that you will see different
scores for the same query over time on the same replica, or on different
replicas for the same query with most replication schemes. This is mildly
annoying when the score is displayed to the user, although I have found
most end users do not pay that much attention to the floating point score.
Testers do.  On a small index with high write/delete traffic and homogeneous
docs, I've seen it cause document re-orderings when the same query is
repeated and sent to different replicas, such as for paging, and that is
noticeable to end users.

How big is your index, and how different are the percentages you are
seeing?  This is a much more pronounced problem on smaller indices; it is
possible this is a problem with your test setup, but not production.

Your solution of directing users to a consistent replica will solve the
change in values over a session-sized window of time.   With a single
shard, you could use a Master/Slave setup and direct queries at a given
slave.  This has a number of operational consequences though, as it means
you will lose the benefits of SolrCloud.

Mikhail's suggestion to use ExactStats would be cleaner:
https://lucene.apache.org/solr/guide/6_6/distributed-requests.html#DistributedRequests-ConfiguringstatsCache_DistributedIDF_


Elizabeth

On Fri, Jan 11, 2019 at 3:56 AM Ashish Bisht <bi...@gmail.com>
wrote:


Re: Solr relevancy score different on replicated nodes

Posted by Ashish Bisht <bi...@gmail.com>.
Hi Erick,

Your statement "*At best, I've seen UIs where they display, say, 1 to 5
stars that are just showing the percentile that the particular doc had
_relative to the max score*"  is something we are trying to achieve, but we
are dealing in percentages rather than stars (ratings).

The change in maxScore per node is messing it up.

I was wondering whether it is possible to make one complete request (for a
term) go through one replica, i.e. if we could tell the client which replica
served the first request, subsequent paginated requests should go through
that replica until the keyword is changed. Do you think that is possible, or
a good idea? If yes, is there a way in Solr to know which replica served a
request?

Regards
Ashish





Re: Solr relevancy score different on replicated nodes

Posted by Erick Erickson <er...@gmail.com>.
bq. Shouldn't both replica and leader come to same state
after this much long period.

No. After that long, the docs will be the same, all the docs
present on one replica will be present and searchable on
the other. However, they will be in different segments so the
"stats skew" will remain.

But displaying the scores isn't a good reason to worry about
this. Frankly, that's almost always a mistake. Scores are
meaningless outside of ranking the docs _in a single
query_. Because a doc in one query got a score of 10 but
some other doc in some other query scored 5 doesn't say
anything at all about whether one was "twice as good" as
another. Even within the same query, the same two
scores don't mean one doc is "twice as good".

I think this is a waste of effort frankly. At best, I've seen
UIs where they display, say, 1 to 5 stars that are just
showing the percentile that the particular doc had
_relative to the max score of that query_, unrelated
to any other query.
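
A rough Python sketch of that kind of display, assuming the raw scores of
one result page are already available as a list. The function name and the
bucketing rule are illustrative only; it maps each score to 1..5 stars as a
fraction of the top score of that same query:

```python
import math

def stars_relative_to_max(scores, max_stars=5):
    """Map raw scores from ONE query to 1..max_stars, relative to its top score."""
    if not scores:
        return []
    top = max(scores)
    if top <= 0:
        # No usable top score: give every document the lowest rating.
        return [1 for _ in scores]
    stars = []
    for score in scores:
        fraction = max(score, 0.0) / top  # 0.0 .. 1.0, meaningful within this query only
        stars.append(max(1, math.ceil(fraction * max_stars)))
    return stars

page = stars_relative_to_max([4.0, 2.0, 0.4])
same_ranking_scaled = stars_relative_to_max([8.0, 4.0, 0.8])
```

Scaling every score by the same factor leaves the stars unchanged, which is
why coarse buckets are less sensitive to small per-replica differences than
raw percentages. A document near a bucket boundary can still move.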

If you insist (and again I think it's a mistake) you can
optimize periodically, but if you're using anything
earlier than Solr 7.5 that has its own traps and I do
NOT recommend it unless you can do it every time
you change your index. See:
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/
and
https://lucidworks.com/2018/06/20/solr-and-optimizing-your-index-take-ii/


Re: Solr relevancy score different on replicated nodes

Posted by Ashish Bisht <bi...@gmail.com>.
Thank you, Erick, for explaining.

In my scenario, I stopped indexing and updates too, and waited for 1 day.
I restarted Solr too. Shouldn't both replica and leader come to the same
state after such a long period? As you said, this gets corrected by segment
merging; I hope that is an internal process and no manual activity is
required.

For us the score matters, as we are using it to display some scenarios on
search, and it gave changing values. As of now we depend on a single
shard and replica, but in future we might need more replicas.
Will scheduling indexing and updates outside peak query hours help?

I have tried the exact stats cache while debugging score differences during
sharding. It didn't help much. Anyhow, that's a different topic.

Thanks again, 

Regards
Ashish Bisht






Re: Solr relevancy score different on replicated nodes

Posted by Erick Erickson <er...@gmail.com>.
You misunderstand my point. The wall clock times _will_ be
different on leader and follower. It follows that the
documents contained in the individual segments on
the leader and follower will _not_ be identical.

This leads to _deleted_ documents being in different
segments on the leader and follower. Which also means
that the merge decisions will eventually merge different
segments.

Now remember that over time when you update a doc,
the doc is "marked as deleted", but some of the stats,
e.g. term frequency, _still_ include the data for the
deleted docs and will until the segment is merged.

So the term frequency for some term on the leader
will be slightly different than on the follower and thus
the scoring will differ depending on which replica
gets the query. Etc.
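
A toy Python model of that effect. It is not Lucene code, and the Segment
class and document ids are made up for illustration. Two replicas hold the
same live documents, but only one of them has merged away the segment that
contains a deleted (updated) document, so their statistics differ:

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    docs: dict                                  # internal doc id -> set of terms
    deleted: set = field(default_factory=set)   # ids marked as deleted

    def doc_count(self):
        return len(self.docs)  # deleted docs still count until a merge

    def doc_freq(self, term):
        return sum(1 for terms in self.docs.values() if term in terms)

def merge(segments):
    """Rewrite segments into one, dropping documents marked as deleted."""
    live = {}
    for seg in segments:
        for doc_id, terms in seg.docs.items():
            if doc_id not in seg.deleted:
                live[doc_id] = terms
    return Segment(docs=live)

def index_stats(segments, term):
    return (sum(s.doc_count() for s in segments),
            sum(s.doc_freq(term) for s in segments))

def live_docs(segments):
    return sum(len(s.docs) - len(s.deleted) for s in segments)

def build_replica():
    # Doc "a" was updated: the old copy is marked deleted, the new one is in segment 2.
    seg1 = Segment(docs={"a_v1": {"data"}, "b": {"other"}}, deleted={"a_v1"})
    seg2 = Segment(docs={"a_v2": {"data"}, "c": {"data"}})
    return [seg1, seg2]

follower = build_replica()           # has not merged yet
leader = [merge(build_replica())]    # has already merged

follower_stats = index_stats(follower, "data")  # (docCount, docFreq)
leader_stats = index_stats(leader, "data")
```

Both replicas return the same live documents, yet docCount and docFreq for
"data" differ between them, which is the same shape of discrepancy that the
explain output in this thread shows.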

The fact that you deleted and re-added the follower
supports the above. And your scores will skew as
you continue to update documents over time.

Generally this isn't something that people concern
themselves with, but if it's important to you, you can
try enabling ExactStatsCache to see whether it helps; see:
https://lucene.apache.org/solr/guide/6_6/distributed-requests.html

Best,
Erick


Re: Solr relevancy score different on replicated nodes

Posted by Ashish Bisht <bi...@gmail.com>.
Hi Erick,

Thank you for the details, but it doesn't look like a time difference in
autocommit caused this issue. As I said, if I run a retrieve-all query or a
keyword query on both servers, they return the correct number of docs; it's
just the relevancy score that takes different values.

I waited for a brief period and the discrepancy was still there (with no
indexing going on). So I went ahead and deleted the follower node (thinking
the leader replica should be in the correct state). After adding the new
replica again, the issue no longer appears.

We will monitor whether it appears again in future.

Regards
Ashish




Re: Solr relevancy score different on replicated nodes

Posted by Erick Erickson <er...@gmail.com>.
Ashish:

Deleting and re-adding a replica is not a solution. Even if you did,
that would then be identical only until you started indexing again,
then the stats could skew a bit.

When you index to NRT replicas, the wall clock times that cause the
commits to trigger will be different due to network delays. What
happens essentially is that the doc gets indexed to the leader at time
X but hits the replica Y milliseconds later. So on leader, the
autocommit interval expires at time X+Z (Z being your autocommit
interval) but X+Y+Z on the follower. However, some additional docs may
have already been indexed on the leader but not yet on the follower
when the autocommit trigger happens so the newly-closed segment on the
leader can have docs that the newly-closed segment on the follower
does not have.

The point is that the termfreq does _not_ change when a document is
deleted in some segment (and remember that an update is really a
delete followed by an add). The data associated with deleted docs is
not purged until segments are merged. Further, the decision about
which segments to merge is influenced by how many documents are
deleted in each.

All of which means that the tf/idf statistics are different (slightly)
and you either have to use distributed IDF or just live with it.
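
To see how much the statistics matter, here is a small Python check that
plugs the docCount and docFreq values from the first message of this thread
into the idf formula shown in its explain output. The function name is
mine; the formula and the numbers are the ones quoted there:

```python
import math

def bm25_idf(doc_count, doc_freq):
    """idf = log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))"""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Values reported by the two replicas for the same term.
idf_replica_1 = bm25_idf(10020, 3504)
idf_replica_2 = bm25_idf(10291, 3534)

print(f"replica 1 idf: {idf_replica_1:.6f}")
print(f"replica 2 idf: {idf_replica_2:.6f}")
```

The two results should come out close to the 1.050635 and 1.068795 shown in
the explain output, so the gap in maxScore is consistent with nothing more
than the different docCount and docFreq on each replica.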

You're saying that the document count of live documents is different,
and that's more concerning. Is this true for brief intervals or is it
true when there is _no_ indexing going on _and_ your autocommit
interval is allowed to expire? In that case it's a different problem.
However, if the condition is transitory and goes away if you stop
indexing, then it's the same issue I outlined above; autocommit is
happening at different wall-clock times.

Best,
Erick


Re: Solr relevancy score different on replicated nodes

Posted by Ashish Bisht <bi...@gmail.com>.
Hi Erick, 

I have updated that I am not facing this problem in a new collection. 

As per 3), I can try deleting a replica and adding it again, but the
confusion is which of the two I should delete (I am wondering which replica
is giving the correct score for the query).

Both replicas return the same number of docs for a match-all query. It's
strange that docCount and docFreq differ in the query explain output.

Regards
Ashish




Re: Solr relevancy score different on replicated nodes

Posted by Erick Erickson <er...@gmail.com>.
See particularly point 3 here and to a lesser extent point 2.
https://support.lucidworks.com/s/question/0D58000003LRpijCAD/the-number-of-results-returned-is-not-constant-every-time-i-query-solr

For point two (the internal Lucene doc IDs are different) you can
easily correct it by adding sort=score desc, solrId asc to the query.
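
A short Python sketch of building such a request with the tie-breaker sort.
The helper name is hypothetical, and solrId stands in for whatever your
unique key field is called. The host and collection are the ones from the
original post:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_select_url(base_url, query, tie_breaker_field="solrId", rows=10):
    """Build a /select URL that sorts by score, then by a unique field."""
    params = {
        "q": query,
        "rows": rows,
        "fl": "score",
        # Secondary sort on a unique field makes the order of equal scores stable.
        "sort": f"score desc, {tie_breaker_field} asc",
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"

url = build_select_url(
    "http://Machine-1:8983/solr/MyTestCollection",
    '"data" OR (data)',
)
print(url)
```

This only stabilizes the order of documents whose scores tie. It does not
make the scores themselves equal across replicas.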

That article was written before TLOG and PULL replicas came into the
picture. Since those replica types all have the
exact same index structure you shouldn't have this problem in that case.

Best,
Erick
