Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2015/09/28 06:02:04 UTC
[jira] [Commented] (LUCENE-6818) Implementing Divergence from Independence (DFI) Term-Weighting for Lucene/Solr

    [ https://issues.apache.org/jira/browse/LUCENE-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14910012#comment-14910012 ] 

Robert Muir commented on LUCENE-6818:
-------------------------------------

It happens when expected = 0, which is caused by the craziness of how spans score (they will happily score a term that does not exist). In this case totalTermFreq is zero, which makes expected zero, and the formula then produces infinity (which the test checks for).

The test has this explanation for how spans score terms that don't exist:

{code}
    // The problem: "normal" lucene queries create scorers, returning null if terms don't exist
    // This means they never score a term that does not exist.
    // however with spans, there is only one scorer for the whole hierarchy:
    // inner queries are not real queries, their boosts are ignored, etc.
{code}

The typical solution is to do something like adjust expected:

{code}
    final float expected = (1 + stats.getTotalTermFreq()) * docLen / (1 + stats.getNumberOfFieldTokens());
{code}
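
A minimal sketch of the failure and of the adjustment above, using plain Java with no Lucene dependency. The class and method names are hypothetical, and the measure (freq - expected) / sqrt(expected) is only an assumed stand-in for whatever the patch computes; any formula that divides by expected behaves the same way. The freq of 1 is also just an illustrative value.

```java
class ExpectedSmoothingSketch {

    // Unadjusted expected frequency: goes to zero when totalTermFreq is zero.
    static float unsmoothedExpected(long totalTermFreq, long numberOfFieldTokens, float docLen) {
        return totalTermFreq * docLen / numberOfFieldTokens;
    }

    // Adjusted form from the comment: add 1 to both counts so the result
    // stays positive even for a term that does not exist in the field.
    static float smoothedExpected(long totalTermFreq, long numberOfFieldTokens, float docLen) {
        return (1 + totalTermFreq) * docLen / (1 + numberOfFieldTokens);
    }

    // Assumed measure (hypothetical, not taken from the patch): it divides by
    // sqrt(expected), so expected == 0 gives Infinity or NaN.
    static float measure(float freq, float expected) {
        return (freq - expected) / (float) Math.sqrt(expected);
    }

    public static void main(String[] args) {
        long totalTermFreq = 0L;          // the term does not exist
        long numberOfFieldTokens = 1000L; // illustrative collection size
        float docLen = 10f;               // illustrative document length
        float freq = 1f;                  // illustrative frequency handed to the scorer

        float raw = unsmoothedExpected(totalTermFreq, numberOfFieldTokens, docLen);
        float adjusted = smoothedExpected(totalTermFreq, numberOfFieldTokens, docLen);

        System.out.println("unadjusted: expected=" + raw
                + " finite=" + Float.isFinite(measure(freq, raw)));
        System.out.println("adjusted:   expected=" + adjusted
                + " finite=" + Float.isFinite(measure(freq, adjusted)));
    }
}
```

With the unadjusted form the measure is not finite for a missing term; with the adjusted form expected is small but positive, so the measure stays finite.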

I have not read the paper, but these are the things to deal with when integrating into Lucene. Another thing to be careful about is making sure that integration with Lucene's boosting is really safe: index-time boosts work on the norm, by making the document appear shorter or longer, so docLen might have a "crazy" value if the user does this.
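
To make the docLen concern concrete, here is a rough sketch of how an index-time boost can distort the length a similarity sees. It assumes the norm is stored as boost / sqrt(length) and decoded as 1 / norm^2, and it leaves out the lossy single-byte encoding of the norm entirely. The class and method names are hypothetical and are not Lucene API.

```java
class BoostedLengthSketch {

    // Assumed index-time norm: the boost is folded into the length normalization.
    static float norm(float boost, int length) {
        return boost / (float) Math.sqrt(length);
    }

    // Assumed search-time decoding: the similarity recovers a "length" from the norm.
    static float apparentLength(float norm) {
        return 1.0f / (norm * norm);
    }

    public static void main(String[] args) {
        int realLength = 100; // illustrative number of tokens in the field
        for (float boost : new float[] {1.0f, 10.0f, 0.1f}) {
            float docLen = apparentLength(norm(boost, realLength));
            System.out.println("boost=" + boost + " apparent docLen=" + docLen);
        }
    }
}
```

Under these assumptions the apparent length is the real length divided by the square of the boost, so a large boost makes a 100-token document look roughly one token long and a small boost makes it look thousands of tokens long. A formula that uses docLen directly should stay well behaved across that whole range.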


> Implementing Divergence from Independence (DFI) Term-Weighting for Lucene/Solr
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-6818
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6818
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/query/scoring
>    Affects Versions: 5.3
>            Reporter: Ahmet Arslan
>            Assignee: Robert Muir
>            Priority: Minor
>              Labels: similarity
>             Fix For: Trunk
>
>         Attachments: LUCENE-6818.patch
>
>
> As explained in the [write-up|http://lucidworks.com/blog/flexible-ranking-in-lucene-4], many state-of-the-art ranking model implementations have been added to Apache Lucene.
> This issue aims to include the DFI model, which is the non-parametric counterpart of the Divergence from Randomness (DFR) framework.
> DFI is both parameter-free and non-parametric:
> * parameter-free: it does not require any parameter tuning or training.
> * non-parametric: it does not make any assumptions about word frequency distributions on document collections.
> It is highly recommended *not* to remove stopwords (very common terms: the, of, and, to, a, in, for, is, on, that, etc) with this similarity.
> For more information see: [A nonparametric term weighting method for information retrieval based on measuring the divergence from independence|http://dx.doi.org/10.1007/s10791-013-9225-4]


