Posted to users@solr.apache.org by Cameron M VandenBerg <cm...@cs.cmu.edu> on 2021/03/19 15:15:52 UTC

Distributed IDF for Solr using ExactStatsCache issue

Hello,

I am using Solr in a distributed environment where I have split my collection into parts, which I have running on different nodes.  When I create each part of the collection, I set numShards and replicationFactor to 1.  The query speed is most important to us, and we are not worried about load on the system.

I want a Distributed IDF across all parts of the collection so I have added this line to my solrconfig.xml:
<statsCache class="org.apache.solr.search.stats.ExactStatsCache" />

This seems to work about 90% of the time, but if I run the same request over and over again, sometimes I get scores with a local IDF for just one part of the collection.  Here is a request example:
/solr/collection1,collection2/query?q=fulltext:shark&rows=500&fl=id,url,title,score&sort=score+desc

I still get documents from both collection1 and collection2, but sometimes I get the same scores as when I query collection1 alone.  I believe that in those cases only collection1's document frequency for the term is being used.
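
To illustrate why the scores shift, here is a minimal sketch of the IDF arithmetic, assuming the default BM25 similarity (whose IDF is log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))). The per-collection numbers are made up for illustration and are not taken from my index:

```python
import math


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """IDF in the form used by Lucene's BM25 similarity."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical statistics for the term "shark" (illustrative values only).
stats = {
    "collection1": {"doc_freq": 50, "doc_count": 1_000},
    "collection2": {"doc_freq": 2_000, "doc_count": 1_000_000},
}

# IDF each collection would compute from its own statistics.
local_idf = {
    name: bm25_idf(s["doc_freq"], s["doc_count"]) for name, s in stats.items()
}

# Distributed IDF: sum the statistics across collections first.
global_doc_freq = sum(s["doc_freq"] for s in stats.values())
global_doc_count = sum(s["doc_count"] for s in stats.values())
global_idf = bm25_idf(global_doc_freq, global_doc_count)

for name, value in local_idf.items():
    print(f"{name}: local idf = {value:.4f}")
print(f"combined: distributed idf = {global_idf:.4f}")
```

With very differently sized collections, the local and distributed values are far apart, so a request that falls back to one collection's statistics produces visibly different scores.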

Should I use a different configuration?  I would like to make sure the IDF is always distributed and the same every time I run the same query.  Is there any technique I could use to ensure that this happens?

Thank you,
Cameron VandenBerg


Re: Distributed IDF for Solr using ExactStatsCache issue

Posted by thallesr <th...@gmail.com>.
Not exactly your case, but I ran into the same problem.
What I was able to determine is that the scores differ because maxDoc is
used to calculate the score, and that value sometimes differs between
replicas.

None of the exactStats implementations solved the problem for me: they
help when the shards have different statistics, but not when the replicas
of the same shard have different ones.

One straightforward fix is to change the Solr code and the exactStats
implementation to request statistics from all replicas of the shards
involved (not just one per shard), so that the calculated score is always
the same. Of course, this has a lot of downsides. I did that, but it meant
the whole cluster was queried for every search, which made it impractical
in my case.

I created a script that repeatedly adds a random number of docs while
simultaneously deleting some. After a few iterations, maxDoc differed
between the replicas of a shard, even on a single instance. I then
confirmed that none of those implementations solved the problem.
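
A toy model of how this can happen, as a sketch only: in Lucene, maxDoc still counts deleted documents until a segment merge drops them, so two replicas holding the same live documents can report different maxDoc values if only one of them has merged. The `Segment` and `ReplicaIndex` classes below are hypothetical and are not part of Solr or Lucene.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    live: int
    deleted: int = 0


@dataclass
class ReplicaIndex:
    """Hypothetical stand-in for one replica's index."""

    segments: List[Segment] = field(default_factory=list)

    def add(self, count: int) -> None:
        # Each batch of added documents lands in a new segment.
        self.segments.append(Segment(live=count))

    def delete(self, count: int) -> None:
        # Mark documents as deleted, oldest segments first.
        remaining = count
        for seg in self.segments:
            taken = min(seg.live, remaining)
            seg.live -= taken
            seg.deleted += taken
            remaining -= taken
            if remaining == 0:
                break

    def merge(self) -> None:
        # Merging rewrites the segments and drops deleted documents.
        self.segments = [Segment(live=self.num_docs)]

    @property
    def num_docs(self) -> int:
        return sum(s.live for s in self.segments)

    @property
    def max_doc(self) -> int:
        return sum(s.live + s.deleted for s in self.segments)


replica_a, replica_b = ReplicaIndex(), ReplicaIndex()
for replica in (replica_a, replica_b):
    replica.add(100)
    replica.delete(30)
replica_a.merge()  # assume only replica A happens to merge

print(replica_a.num_docs, replica_a.max_doc)
print(replica_b.num_docs, replica_b.max_doc)
```

Both replicas hold the same live documents, yet any statistic derived from maxDoc depends on which replica answers.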




RE: Distributed IDF for Solr using ExactStatsCache issue

Posted by Cameron M VandenBerg <cm...@cs.cmu.edu>.
Is there anything we can do to raise this issue?  Can I file it in Jira, or do I need to go through another channel first?

Thank you,
Cameron VandenBerg

-----Original Message-----
From: Michael Gibney <mi...@michaelgibney.net> 
Sent: Wednesday, March 24, 2021 11:11 AM
To: users@solr.apache.org
Subject: Re: Distributed IDF for Solr using ExactStatsCache issue

I see the same behavior you do; I'm going to paraphrase the problem (probably covering a lot of the same ground you've already covered), to be sure that we're on the same page:

It looks like this issue is specifically related to multi-collection requests (i.e., I don't observe this issue for a request against a single collection). Checking `docCount` in the score "explain" (with `debug=true`), it looks like multi-collection requests pick one collection or the other (apparently non-deterministically?) when retrieving distributed `docCount` for idf calculation. This seems definitely undesirable. For the case you describe, where the two jointly-queried collections are very different sizes, distributed idf would be at once more necessary _and_ more obviously unhelpful (as evidently currently implemented).

I'm hoping someone will contradict me if I'm missing something, and at this point I'm still not sure exactly what's happening under the hood; but I _am_ confident (as you have said) that single-collection idf is being used for straightforward multi-collection requests, and I really can't think of any case in which that would be desirable (assuming an asserted preference for distributed idf, via statsCache config).

One problem that occurs to me is that statsCache is configured per-collection, so if two collections with different statsCaches are queried jointly, is there a way to determine which config takes priority?
... and probably related followup questions. Probably nothing insurmountable, but still ...


Re: Distributed IDF for Solr using ExactStatsCache issue

Posted by Michael Gibney <mi...@michaelgibney.net>.
I see the same behavior you do; I'm going to paraphrase the problem
(probably covering a lot of the same ground you've already covered), to be
sure that we're on the same page:

It looks like this issue is specifically related to multi-collection
requests (i.e., I don't observe this issue for a request against a single
collection). Checking `docCount` in the score "explain" (with
`debug=true`), it looks like multi-collection requests pick one collection
or the other (apparently non-deterministically?) when retrieving
distributed `docCount` for idf calculation. This seems definitely
undesirable. For the case you describe, where the two jointly-queried
collections are very different sizes, distributed idf would be at once more
necessary _and_ more obviously unhelpful (as evidently currently
implemented).

I'm hoping someone will contradict me if I'm missing something, and at this
point I'm still not sure exactly what's happening under the hood; but I
_am_ confident (as you have said) that single-collection idf is being used
for straightforward multi-collection requests, and I really can't think of
any case in which that would be desirable (assuming an asserted preference
for distributed idf, via statsCache config).

One problem that occurs to me is that statsCache is configured
per-collection, so if two collections with different statsCaches are
queried jointly, is there a way to determine which config takes priority?
... and probably related followup questions. Probably nothing
insurmountable, but still ...
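
For anyone who wants to repeat the `docCount` check, here is a rough sketch of pulling that value out of the per-document "explain" text returned with `debug=true`. The exact wording of the explain output varies between versions, so the pattern below is an assumption: it accepts either a `docCount` label or an `N, total number of documents with field` label, and the helper names are my own.

```python
import re
from typing import Dict, Set

# Assumed label formats; adjust to match the explain output you actually see.
DOC_COUNT_PATTERN = re.compile(
    r"^\s*([0-9.eE+]+) = (?:docCount|N, total number of documents with field)\b",
    re.MULTILINE,
)


def doc_counts(explain_text: str) -> Set[float]:
    """Return every distinct docCount value found in one explain string."""
    return {float(value) for value in DOC_COUNT_PATTERN.findall(explain_text)}


def doc_counts_by_document(explain_by_id: Dict[str, str]) -> Dict[str, Set[float]]:
    """Map each document id to the docCount values used in its score."""
    return {doc_id: doc_counts(text) for doc_id, text in explain_by_id.items()}
```

Passing the `explain` map from the `debug` section of a response to `doc_counts_by_document` and comparing the result across repeated requests should show whether the value flips between one collection's count and the other's.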

On Tue, Mar 23, 2021 at 7:55 AM Cameron M VandenBerg <cm...@cs.cmu.edu>
wrote:

> Hi Michael,
>
> I have 8 shards (on 8 different nodes) and no replicas with about 500
> million documents.  Additionally, I have a collection with just 2 shards
> and no replicas (and significantly fewer documents) where I see the same
> behavior.  I do observe this behavior even when I route the query through
> the same "entry node".  To see this behavior, I can just hit refresh on the
> same query several times.  Most of the time, the scores do reflect a
> distributed IDF, but sometimes the scores reflect the IDF of only one of
> the shards (even though documents from both shards are returned).
>
> Thanks!
> Cameron VandenBerg

RE: Distributed IDF for Solr using ExactStatsCache issue

Posted by Cameron M VandenBerg <cm...@cs.cmu.edu>.
Hi Michael,

I have 8 shards (on 8 different nodes) and no replicas with about 500 million documents.  Additionally, I have a collection with just 2 shards and no replicas (and significantly fewer documents) where I see the same behavior.  I do observe this behavior even when I route the query through the same "entry node".  To see this behavior, I can just hit refresh on the same query several times.  Most of the time, the scores do reflect a distributed IDF, but sometimes the scores reflect the IDF of only one of the shards (even though documents from both shards are returned).
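
Instead of hitting refresh by hand, the check can be scripted. The following is a minimal sketch, assuming the `/query` handler returns JSON with documents under `response.docs` and that `id` and `score` are requested in `fl`; the function names, base URL, and attempt count are placeholders of mine.

```python
import json
import urllib.parse
import urllib.request
from collections import Counter


def fetch_scores(base_url, collections, params):
    """Run one query and return {document id: score}."""
    query = urllib.parse.urlencode(params)
    url = f"{base_url}/solr/{','.join(collections)}/query?{query}"
    with urllib.request.urlopen(url) as response:
        body = json.load(response)
    return {doc["id"]: doc["score"] for doc in body["response"]["docs"]}


def count_score_variants(runs):
    """Count how often each distinct id-to-score assignment occurred."""
    return Counter(tuple(sorted(run.items())) for run in runs)


def check_consistency(base_url, collections, params, attempts=20):
    """Repeat the same query and group the runs by the scores returned."""
    runs = [fetch_scores(base_url, collections, params) for _ in range(attempts)]
    return count_score_variants(runs)
```

For example, `check_consistency("http://localhost:8983", ["collection1", "collection2"], {"q": "fulltext:shark", "rows": 500, "fl": "id,score", "sort": "score desc"})` should return a single entry if scoring is stable; more than one entry means the scores changed between otherwise identical requests.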

Thanks!
Cameron VandenBerg

-----Original Message-----
From: Michael Gibney <mi...@michaelgibney.net> 
Sent: Monday, March 22, 2021 10:20 PM
To: users@solr.apache.org
Subject: Re: Distributed IDF for Solr using ExactStatsCache issue

Cameron,
What is your cluster configuration? i.e., how many nodes, how many replicas per node, how many replicas in each collection, etc.? Do you observe consistent behavior for the same query if you always route that query via the same "entry node" (i.e., not load balanced over the cluster)?
Michael


Re: Distributed IDF for Solr using ExactStatsCache issue

Posted by Michael Gibney <mi...@michaelgibney.net>.
Cameron,
What is your cluster configuration? i.e., how many nodes, how many replicas
per node, how many replicas in each collection, etc.? Do you observe
consistent behavior for the same query if you always route that query via
the same "entry node" (i.e., not load balanced over the cluster)?
Michael
