Posted to solr-user@lucene.apache.org by "alessandro.benedetti" <a....@sease.io> on 2017/12/04 14:21:49 UTC

Re: Skewed IDF in multi lingual index, again

Hi Markus,
just out of interest, why did
"It was solved back then by using docCount instead of maxDoc when
calculating idf, it worked really well!" solve the problem?

I assume you are using different fields, one per language.
Each field appears in a different number of docs, I guess,
e.g.
text_en -> 10000 docs
text_fr -> 1000 docs
text_it -> 500 docs

Is the reason docCount improved things that it is a count relative to a
specific field, while maxDoc is global across the whole index?
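
To make the question concrete, here is a minimal sketch (plain Java, no
Lucene dependency) that plugs the field sizes above into the BM25 idf
formula log(1 + (N - df + 0.5) / (df + 0.5)), once with N = maxDoc and once
with N = the per-field docCount. The class name, the assumption that every
document has exactly one language field, and the choice of a term that
occurs in half of each field's documents are illustrative assumptions, not
data from this thread.

```java
// Sketch only: BM25 idf evaluated with hypothetical document frequencies.
final class IdfSkewSketch {

    // BM25 idf: log(1 + (N - df + 0.5) / (df + 0.5)).
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        String[] fields = {"text_en", "text_fr", "text_it"};
        long[] docCounts = {10_000, 1_000, 500};

        // Assumption: each document has exactly one language field,
        // so the index-wide maxDoc is the sum of the per-field counts.
        long maxDoc = 10_000 + 1_000 + 500;

        for (int i = 0; i < fields.length; i++) {
            // Assumption: the term occurs in half of the field's documents.
            long df = docCounts[i] / 2;
            System.out.printf("%s  df=%d  idf(maxDoc)=%.3f  idf(docCount)=%.3f%n",
                fields[i], df, idf(df, maxDoc), idf(df, docCounts[i]));
        }
    }
}
```

Under these assumptions the per-field docCount gives the term the same idf
in all three fields, while the global maxDoc makes it look much rarer in
the smaller French and Italian fields.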







-----
---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Skewed IDF in multi lingual index, again

Posted by "alessandro.benedetti" <a....@sease.io>.

Furthermore, taking a look at the code for BM25 similarity, it seems to me it
is currently working correctly:
- docCount is used per field unless it is -1 (unavailable), in which case
maxDoc is used


  /**
   * Computes a score factor for a simple term and returns an explanation
   * for that score factor.
   *
   * <p>
   * The default implementation uses:
   *
   * <pre class="prettyprint">
   * idf(docFreq, docCount);
   * </pre>
   *
   * Note that {@link CollectionStatistics#docCount()} is used instead of
   * {@link org.apache.lucene.index.IndexReader#numDocs() IndexReader#numDocs()} because also
   * {@link TermStatistics#docFreq()} is used, and when the latter
   * is inaccurate, so is {@link CollectionStatistics#docCount()}, and in the same direction.
   * In addition, {@link CollectionStatistics#docCount()} does not skew when fields are sparse.
   *
   * @param collectionStats collection-level statistics
   * @param termStats term-level statistics for the term
   * @return an Explain object that includes both an idf score factor
   *         and an explanation for the term.
   */
  public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats) {
    final long df = termStats.docFreq();
    final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
    final float idf = idf(df, docCount);
    return Explanation.match(idf, "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
        Explanation.match(df, "docFreq"),
        Explanation.match(docCount, "docCount"));
  }




Re: Skewed IDF in multi lingual index, again

Posted by Doug Turnbull <dt...@opensourceconnections.com>.

It is challenging, as relevance performance will be very dependent on the
use case and domain (there's no one globally perfect relevance solution).
But a good set of metrics to see *generally* how stock Solr performs across
a reasonable set of verticals would be nice.

My philosophy about Lucene-based search is that it's not a solution, but
rather a framework that should have sane defaults but large amounts of
configurability.

For example, I'm not sure there's a globally "right" answer to maxDoc vs
docCount.

Problems with docCount come into play when a field is usually empty across
the corpus but occasionally filled out. This creates a strong bias against
matches in that usually empty field, when previously a match in that field
was weighted very highly.

For example, consider a product catalog that has a rarely used,
user-editable tag field and a product description, such as:

Product Name: Nice Pants!
Product Description: Come wear these pants!
Tags: [blue] [acid-wash]

Product Name: Acid Wash Pants
Product Description: Come wear these pants!
Tags: (empty)

In this case, the IDF for the acid-wash match in tags is very low using
docCount, whereas with maxDoc it was very high. I'm not sure what the right
answer is, but there is often a desire for more complete docs to be
boosted much higher, which the maxDoc method does.
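
The effect on a sparse tags field can be sketched with the BM25 idf formula
log(1 + (N - df + 0.5) / (df + 0.5)). All counts below are hypothetical
numbers picked for illustration (a catalog of 100,000 products of which 50
have tags), and the class and constant names are invented for this sketch.

```java
// Sketch only: how a sparse field's idf changes between maxDoc and docCount.
final class SparseFieldIdfSketch {

    // Hypothetical catalog statistics, not taken from any real index.
    static final long MAX_DOC = 100_000;        // all products in the index
    static final long TAGS_DOC_COUNT = 50;      // products that have any tag
    static final long TAGS_DOC_FREQ = 5;        // products tagged "acid-wash"
    static final long NAME_DOC_COUNT = 100_000; // every product has a name
    static final long NAME_DOC_FREQ = 2_000;    // names containing the term

    // BM25 idf: log(1 + (N - df + 0.5) / (df + 0.5)).
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        System.out.printf("tags, N = maxDoc:   %.3f%n", idf(TAGS_DOC_FREQ, MAX_DOC));
        System.out.printf("tags, N = docCount: %.3f%n", idf(TAGS_DOC_FREQ, TAGS_DOC_COUNT));
        System.out.printf("name, N = docCount: %.3f%n", idf(NAME_DOC_FREQ, NAME_DOC_COUNT));
    }
}
```

With these made-up numbers, the tag match outweighs the name match when
N = maxDoc and falls below it when N = docCount, which is the reversal
described above.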

Another case where docCount can be a problem is copy fields: With copy
fields, you care that the original field had terms, even if for some reason
they were removed in the analysis chain. This can happen with some methods
we use for simple entity extraction.

Further, the definitions of BM25 etc. rely on a corpus-level document
frequency for a term and don't have a concept of fields. BM25F can mostly
be implemented with BlendedTermQuery, which blends doc frequencies across
fields:
http://opensourceconnections.com/blog/2016/10/19/bm25f-in-lucene/
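
As a rough illustration of what blending document frequencies across fields
can mean, the sketch below gives every field the same document frequency
for a term before computing idf. Taking the maximum per-field docFreq is a
blending rule assumed here for illustration; it is not presented as what
BlendedTermQuery does internally, and the field names and counts are
hypothetical.

```java
import java.util.Map;

// Sketch only: one possible way to blend per-field document frequencies.
final class BlendedDocFreqSketch {

    // BM25 idf: log(1 + (N - df + 0.5) / (df + 0.5)).
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Assumed blending rule: every field uses the largest per-field docFreq.
    static long blendedDocFreq(Map<String, Long> docFreqByField) {
        return docFreqByField.values().stream()
            .mapToLong(Long::longValue)
            .max()
            .orElse(0L);
    }

    public static void main(String[] args) {
        long corpusSize = 100_000; // hypothetical corpus-level document count
        Map<String, Long> docFreqByField = Map.of("tags", 5L, "name", 2_000L);

        long blended = blendedDocFreq(docFreqByField);
        docFreqByField.forEach((field, df) ->
            System.out.printf("%s  own idf=%.3f  blended idf=%.3f%n",
                field, idf(df, corpusSize), idf(blended, corpusSize)));
    }
}
```

With a blended frequency, a match in a sparse field no longer gets a very
different idf from a match on the same term in a dense field.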

On Tue, Dec 5, 2017 at 10:28 AM alessandro.benedetti <a....@sease.io>
wrote:

> Thanks Yonik and thanks Doug.
>
> I agree with Doug on adding a few generic test corpora that Jenkins
> automatically runs some metrics on, to check that Apache Lucene/Solr changes
> don't affect a golden truth too much.
> This of course can be very complex, but I think it is a direction the
> Apache Lucene/Solr community should work on.
>
> Given that, I do believe that in this case, moving from maxDocs (field
> independent) to docCount (field dependent) was a good move (and this
> specific multi-language use case is an example).
>
> Actually, I also believe that theoretically docCount (field dependent) is
> still better than maxDocs (field dependent).
> This is because docCount (field dependent) represents a state in time
> associated with the current index, while maxDocs represents a historical
> consideration.
> A corpus of documents can change over time, and how rare a term is can
> change drastically (take a highly dynamic domain such as news).
>
> Doug, were you able to generalise and abstract any consideration from what
> happened to your customers and why they got regressions moving from maxDocs
> to docCount (field dependent)?
-- 
Consultant, OpenSource Connections. Contact info at
http://o19s.com/about-us/doug-turnbull/; Free/Busy (http://bit.ly/dougs_cal)

Re: Skewed IDF in multi lingual index, again

Posted by "alessandro.benedetti" <a....@sease.io>.

Thanks Yonik and thanks Doug.

I agree with Doug on adding a few generic test corpora that Jenkins
automatically runs some metrics on, to check that Apache Lucene/Solr changes
don't affect a golden truth too much.
This of course can be very complex, but I think it is a direction the Apache
Lucene/Solr community should work on.

Given that, I do believe that in this case, moving from maxDocs (field
independent) to docCount (field dependent) was a good move (and this
specific multi-language use case is an example).

Actually, I also believe that theoretically docCount (field dependent) is
still better than maxDocs (field dependent).
This is because docCount (field dependent) represents a state in time
associated with the current index, while maxDocs represents a historical
consideration.
A corpus of documents can change over time, and how rare a term is can
change drastically (take a highly dynamic domain such as news).

Doug, were you able to generalise and abstract any consideration from what
happened to your customers and why they got regressions moving from maxDocs
to docCount (field dependent)?





Re: Skewed IDF in multi lingual index, again

Posted by Doug Turnbull <dt...@opensourceconnections.com>.

Just a piece of feedback from clients on the original docCount change.

I have seen several cases with clients where the switch to docCount
surprised and harmed relevance.

More broadly, I’m concerned that when we make these changes there’s no
testing process against test corpora with judgments and relevance metrics
to understand their impact. I see it mentioned in a JIRA from time to time
that someone saw an improvement in NDCG on a private collection, and we
have to take their word for it.

Public testing of relevance against every build using stock settings could
be extremely valuable and would more easily justify these changes.
Something similar to the performance tests that are made.

Sadly I can only complain now :) I wish I had time to work on something
like this.

Doug

On Tue, Dec 5, 2017 at 7:38 AM Yonik Seeley <ys...@gmail.com> wrote:

> On Tue, Dec 5, 2017 at 5:15 AM, alessandro.benedetti
> <a....@sease.io> wrote:
> > "Lucene/Solr doesn't actually delete documents when you delete them, it
> > just marks them as deleted.  I'm pretty sure that the difference between
> > docCount and maxDoc is deleted documents.  Maybe I don't understand what
> > I'm talking about, but that is the best I can come up with. "
> >
> > Thanks Shawn, yes, that is correct and I was aware of it.
> > I was curious about another difference:
> > I think we confirmed that docCount is local to the field (thanks Yonik
> > for that), so:
> >
> > docCount(index, field1) = # of documents in the index that currently have
> > value(s) for field1
> >
> > My question is:
> >
> > maxDocs(index, field1) = max # of documents in the index that had value(s)
> > for field1
> >
> > OR
> >
> > maxDocs(index) = max # of documents that appeared in the index (field
> > independent)
>
> The latter.
> I imagine that's why docCount was introduced (to avoid changing the
> meaning of an existing term).
> FWIW, the scoring change was made in
> https://issues.apache.org/jira/browse/LUCENE-6711 for Lucene/Solr 6.0
>
> -Yonik
>

Re: Skewed IDF in multi lingual index, again

Posted by Yonik Seeley <ys...@gmail.com>.

On Tue, Dec 5, 2017 at 5:15 AM, alessandro.benedetti
<a....@sease.io> wrote:
> "Lucene/Solr doesn't actually delete documents when you delete them, it
> just marks them as deleted.  I'm pretty sure that the difference between
> docCount and maxDoc is deleted documents.  Maybe I don't understand what
> I'm talking about, but that is the best I can come up with. "
>
> Thanks Shawn, yes, that is correct and I was aware of it.
> I was curious about another difference:
> I think we confirmed that docCount is local to the field (thanks Yonik for
> that), so:
>
> docCount(index, field1) = # of documents in the index that currently have
> value(s) for field1
>
> My question is:
>
> maxDocs(index, field1) = max # of documents in the index that had value(s)
> for field1
>
> OR
>
> maxDocs(index) = max # of documents that appeared in the index (field
> independent)

The latter.
I imagine that's why docCount was introduced (to avoid changing the
meaning of an existing term).
FWIW, the scoring change was made in
https://issues.apache.org/jira/browse/LUCENE-6711 for Lucene/Solr 6.0

-Yonik

Re: Skewed IDF in multi lingual index, again

Posted by "alessandro.benedetti" <a....@sease.io>.

"Lucene/Solr doesn't actually delete documents when you delete them, it 
just marks them as deleted.  I'm pretty sure that the difference between 
docCount and maxDoc is deleted documents.  Maybe I don't understand what 
I'm talking about, but that is the best I can come up with. "

Thanks Shawn, yes, that is correct and I was aware of it.
I was curious about another difference:
I think we confirmed that docCount is local to the field (thanks Yonik for
that), so:

docCount(index, field1) = # of documents in the index that currently have
value(s) for field1

My question is:

maxDocs(index, field1) = max # of documents in the index that had value(s)
for field1

OR

maxDocs(index) = max # of documents that appeared in the index (field
independent)

Regards





Re: Skewed IDF in multi lingual index, again

Posted by Yonik Seeley <ys...@gmail.com>.

On Mon, Dec 4, 2017 at 1:35 PM, Shawn Heisey <ap...@elyograg.org> wrote:
> I'm pretty sure that the difference between docCount and maxDoc is deleted documents.

docCount (not the best name) here is the number of documents with the
field being searched.  docFreq (df) is the number of documents
actually containing the term in that field.
In the past, maxDoc was used instead of docCount.

-Yonik
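
The three statistics named in this message can be illustrated on a toy
in-memory "index". This is only a sketch: documents are plain maps, terms
are split on whitespace, deleted documents are ignored, and the class,
method, and sample data are hypothetical rather than Lucene APIs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Sketch only: docCount, docFreq and a maxDoc-like total on toy data.
final class FieldStatsSketch {

    // docCount: number of documents that have any value for the field.
    static long docCount(List<Map<String, String>> docs, String field) {
        return docs.stream().filter(d -> d.containsKey(field)).count();
    }

    // docFreq: number of documents whose value for the field contains the term.
    static long docFreq(List<Map<String, String>> docs, String field, String term) {
        return docs.stream()
            .filter(d -> d.containsKey(field))
            .filter(d -> Arrays.asList(d.get(field).split("\\s+")).contains(term))
            .count();
    }

    // Hypothetical sample documents: three English, one French.
    static List<Map<String, String>> sampleDocs() {
        return List.of(
            Map.of("text_en", "blue pants"),
            Map.of("text_en", "red shirt"),
            Map.of("text_en", "blue hat"),
            Map.of("text_fr", "pantalon bleu"));
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = sampleDocs();
        // Stand-in for maxDoc: every document, whatever fields it has.
        System.out.println("total docs        = " + docs.size());
        System.out.println("docCount(text_en) = " + docCount(docs, "text_en"));
        System.out.println("docCount(text_fr) = " + docCount(docs, "text_fr"));
        System.out.println("docFreq(text_en, blue) = " + docFreq(docs, "text_en", "blue"));
    }
}
```

The total is field independent, while docCount changes with the field being
searched and docFreq changes with both the field and the term.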

Re: Skewed IDF in multi lingual index, again

Posted by Shawn Heisey <ap...@elyograg.org>.

On 12/4/2017 7:21 AM, alessandro.benedetti wrote:
> the reason docCount was improving things is because it was using a docCount
> relative to a specific field while maxDoc is global all over the index ?

Lucene/Solr doesn't actually delete documents when you delete them, it 
just marks them as deleted.  I'm pretty sure that the difference between 
docCount and maxDoc is deleted documents.  Maybe I don't understand what 
I'm talking about, but that is the best I can come up with.

Not all aspects of the impact on scores from deleted documents can be 
eliminated, but there has been some effort to make it as minimal as 
possible.  For what has been described here, the actual count is 
available, so it gets used.

Thanks,
Shawn