Posted to solr-user@lucene.apache.org by Steven Bower <sm...@alcyon.net> on 2014/03/12 16:07:34 UTC

IDF maxDocs / numDocs

I am noticing that maxDocs is consistently different between replicas, and
because it is used in the idf calculation, idf scores for the same query/doc
differ between replicas. Obviously an optimize can normalize the maxDocs
values, but that is only temporary. Is there a way to have idf use numDocs
instead (as it should be consistent across replicas)?

thanks,

steve
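
To make the effect concrete, here is a minimal stand-alone sketch (plain Java, no Lucene dependency). It assumes the classic Lucene TF-IDF form, idf = 1 + ln(N / (docFreq + 1)), with maxDoc passed in as N. The class name and all the numbers are hypothetical and chosen only for illustration.

```java
// Stand-alone illustration; does not use or depend on Lucene.
final class IdfDrift {

    // Classic TF-IDF style idf, assumed here: 1 + ln(n / (docFreq + 1)).
    static double idf(long docFreq, long n) {
        return 1.0 + Math.log(n / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        long docFreq = 500;               // hypothetical document frequency of the term
        long maxDocReplicaA = 1_000_000;  // hypothetical: few unmerged deletes
        long maxDocReplicaB = 1_080_000;  // hypothetical: more unmerged deletes

        // Same term, same live documents, but a different N on each replica.
        System.out.printf("replica A idf = %.4f%n", idf(docFreq, maxDocReplicaA));
        System.out.printf("replica B idf = %.4f%n", idf(docFreq, maxDocReplicaB));
    }
}
```

With the same docFreq, the two results differ by ln(maxDocB / maxDocA), so the size of the score drift tracks the ratio of the two maxDoc values rather than their absolute difference.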

Re: IDF maxDocs / numDocs

Posted by Steven Bower <sm...@alcyon.net>.

My problem is that maxDoc() and docCount() both include deleted documents
in their values. Because of merging etc., those numbers can differ per
replica (or at least that is what I'm seeing). I need a value that is
consistent across replicas. I see the comment mentions not using
IndexReader.numDocs(), but there doesn't seem to be a way to get hold of the
IndexReader within a similarity implementation (only TermStatistics and
CollectionStatistics are passed in, and neither contains a ref to the
reader).

I am contemplating just using a static value for the "number of docs", as
this won't change dramatically very often.

steve
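
A rough sketch of the static-value idea, in plain Java rather than as a real Lucene Similarity subclass. The class name FixedCountIdf, the constant, and the method shape are all hypothetical. It only shows that once N is a configured constant, the per-replica maxDoc no longer affects the result.

```java
// Stand-alone illustration of idf with a fixed, configured document count.
// Not a Lucene Similarity; names and values are hypothetical.
final class FixedCountIdf {

    private final long assumedNumDocs; // configured once, identical on every replica

    FixedCountIdf(long assumedNumDocs) {
        if (assumedNumDocs <= 0) {
            throw new IllegalArgumentException("assumedNumDocs must be positive");
        }
        this.assumedNumDocs = assumedNumDocs;
    }

    // replicaMaxDoc is accepted only to show that it is deliberately ignored.
    double idf(long docFreq, long replicaMaxDoc) {
        return 1.0 + Math.log(assumedNumDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        FixedCountIdf fixed = new FixedCountIdf(1_000_000);
        // Both calls print the same value even though maxDoc differs.
        System.out.println(fixed.idf(500, 1_000_000));
        System.out.println(fixed.idf(500, 1_080_000));
    }
}
```

One caveat: this pins only N. As far as I know, docFreq is also not adjusted for deleted documents until segments merge, so it can still differ slightly between replicas.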


RE: IDF maxDocs / numDocs

Posted by Markus Jelsma <ma...@openindex.io>.

Hi Steve - it seems most similarities use CollectionStatistics.maxDoc() in idfExplain, but there's also a docCount(). We use docCount() in all our custom similarities, also because it allows you to have multiple languages in one index where one is much larger than the other. The small language will have very high IDF scores using maxDoc(), but they are proportional enough using docCount(). Using docCount() also fixes SolrCloud ranking problems, unless one of your replicas becomes inconsistent ;)

https://lucene.apache.org/core/4_7_0/core/org/apache/lucene/search/CollectionStatistics.html#docCount%28%29
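
A small stand-alone sketch of the multi-language point (plain Java, no Lucene dependency). It assumes the classic 1 + ln(N / (docFreq + 1)) form and compares N = maxDoc (all documents in the index) with N = docCount (only documents that have a term in the field). The class name and numbers are hypothetical.

```java
// Stand-alone illustration; does not use or depend on Lucene.
final class DocCountVsMaxDoc {

    // Classic TF-IDF style idf, assumed here: 1 + ln(n / (docFreq + 1)).
    static double idf(long docFreq, long n) {
        return 1.0 + Math.log(n / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        long maxDoc = 1_000_000;       // hypothetical: every document in the index
        long docCountSmallLang = 10_000; // hypothetical: documents with the small-language field
        long docFreq = 100;            // hypothetical: term frequency within that field

        // With maxDoc, the term looks far rarer than it is within its own language.
        System.out.printf("idf using maxDoc   = %.4f%n", idf(docFreq, maxDoc));
        System.out.printf("idf using docCount = %.4f%n", idf(docFreq, docCountSmallLang));
    }
}
```

The gap between the two values is ln(maxDoc / docCount), which is why terms in a small-language field get inflated IDF when maxDoc is used.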


RE: IDF maxDocs / numDocs

Posted by Markus Jelsma <ma...@openindex.io>.

Oh yes, I see what you mean. I would try SOLR-1632 to get distributed IDF, but it seems to be broken now.