Posted to dev@lucene.apache.org by "Ryan Ernst (JIRA)" <ji...@apache.org> on 2014/07/30 17:37:39 UTC
[jira] [Commented] (LUCENE-5847) Improved implementation of language models in lucene

    [ https://issues.apache.org/jira/browse/LUCENE-5847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14079387#comment-14079387 ] 

Ryan Ernst commented on LUCENE-5847:
------------------------------------

Why can't the background score be implemented in the specific scorers for Dirichlet or JM? I don't think the Scorer interface should be cluttered with something specific to one implementation.

> Improved implementation of language models in lucene 
> -----------------------------------------------------
>
>                 Key: LUCENE-5847
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5847
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Hadas Raviv
>            Priority: Minor
>             Fix For: 5.0
>
>         Attachments: LUCENE-2507.patch
>
>
> The current implementation of language models in Lucene is based on the paper "A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval" by Zhai and Lafferty ('01). Specifically, LMDirichletSimilarity and LMJelinekMercerSimilarity use a normalized smoothed score for a matching term in a document, as suggested in that paper.
> However, Lucene doesn't assign a score to query terms that do not appear in a matched document. According to the "pure" LM approach, these terms should be assigned a collection-probability "background score". With Jelinek-Mercer smoothing, the final result list produced by Lucene is rank equivalent to the one a full LM implementation would produce. This is not the case for Dirichlet smoothing, because the background score is document dependent. Documents in which not all query terms appear are missing the document-dependent background score for the missing terms, and this component affects the final ranking of documents in the list.
> Since LM is a baseline method in many works in the IR research field, I attach a patch that implements a full LM in Lucene. The basic issue that should be addressed here is assigning a document a score that depends on *all* the query terms, collection statistics and the document length. The general idea of what I did is adding a new getBackGroundScore(int docID) method to similarity, scorer and bulkScorer. Then, when a collector assigns a score to a document (score = scorer.score()), I added the background score (score = scorer.score() + scorer.background(doc)) that is assigned by the similarity class used for ranking.
> The patch also includes a correction of the document length so that it is the real document length and not the encoded one. This is required for the full LM implementation.
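
For readers following the thread, the difference between the two smoothing methods for a query term that is missing from a document can be illustrated with a small standalone sketch. It uses the textbook formulas from the Zhai and Lafferty paper, not Lucene's Similarity classes and not the attached patch; the class name, method names, and the values of mu, lambda, and the collection probability are all assumptions chosen for illustration.

```java
// Illustrative sketch only: every name here is hypothetical. This is neither
// Lucene API nor the code in the attached patch.
final class LmBackgroundSketch {

    // Dirichlet smoothing: log p(w|d) = log((tf + mu * p(w|C)) / (docLen + mu)).
    // With tf = 0 the result still depends on docLen.
    static double dirichletLogProb(double tf, double docLen, double collectionProb, double mu) {
        return Math.log((tf + mu * collectionProb) / (docLen + mu));
    }

    // Jelinek-Mercer smoothing: log p(w|d) = log((1 - lambda) * tf / docLen + lambda * p(w|C)).
    // With tf = 0 the result reduces to log(lambda * p(w|C)), independent of the document.
    static double jelinekMercerLogProb(double tf, double docLen, double collectionProb, double lambda) {
        return Math.log((1.0 - lambda) * tf / docLen + lambda * collectionProb);
    }

    public static void main(String[] args) {
        double collectionProb = 0.001; // assumed p(w|C) for the missing term
        double mu = 2000.0;            // assumed Dirichlet parameter
        double lambda = 0.1;           // assumed Jelinek-Mercer parameter

        // Contribution of a missing term (tf = 0) in a short and a long document.
        System.out.println("Dirichlet, docLen=10:   " + dirichletLogProb(0, 10, collectionProb, mu));
        System.out.println("Dirichlet, docLen=1000: " + dirichletLogProb(0, 1000, collectionProb, mu));
        System.out.println("JM, docLen=10:          " + jelinekMercerLogProb(0, 10, collectionProb, lambda));
        System.out.println("JM, docLen=1000:        " + jelinekMercerLogProb(0, 1000, collectionProb, lambda));
    }
}
```

Under these assumptions, the Jelinek-Mercer contribution of a missing term is the same constant for every document, so dropping it leaves the ranking unchanged. The Dirichlet contribution varies with document length, so dropping it can reorder documents that match different subsets of the query terms.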


