Posted to dev@lucene.apache.org by "cheyenne.lin" <ch...@gmail.com> on 2012/02/02 09:40:52 UTC

Smoothing language model by Lucene

I have an old implementation, Lucene-lm by ilps, which is a good start.
However, that implementation doesn't include a smoothing algorithm, and I
found it particularly hard to rewrite the core scoring mechanism to enable
smoothing.

(Background: in a language model, a smoothing strategy gives a document a
small non-zero weight for a query term it does not contain. Of course it
doesn't change anything for a single keyword, but consider a
multiple-keyword query: when a document is strongly relevant to a few
distinguishing keywords but lacks the others, smoothing may be important.)
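
To make the background concrete, here is a minimal sketch of query
likelihood with and without smoothing. It assumes Jelinek-Mercer
interpolation, p(w|d) = (1 - lambda) * tf(w,d)/|d| + lambda * p(w|C), as
the smoothing method; the class name, toy documents and lambda value are
all made up for illustration and have nothing to do with Lucene-lm.

```java
import java.util.Arrays;
import java.util.List;

public class SmoothingSketch {
    // Toy tokenized documents (hypothetical data, not from a real index).
    static final List<String> DOC_A =
        Arrays.asList("lucene", "scoring", "lucene", "scoring", "index");
    static final List<String> DOC_B =
        Arrays.asList("smoothing", "notes", "about", "other", "things");
    static final List<List<String>> COLLECTION = Arrays.asList(DOC_A, DOC_B);
    static final List<String> QUERY =
        Arrays.asList("lucene", "scoring", "smoothing");

    /** Number of occurrences of term in a tokenized document. */
    static long tf(List<String> doc, String term) {
        return doc.stream().filter(term::equals).count();
    }

    /**
     * Log query likelihood under Jelinek-Mercer smoothing.
     * lambda = 0 gives the unsmoothed maximum-likelihood model.
     */
    static double logLikelihood(List<String> query, List<String> doc,
                                List<List<String>> collection, double lambda) {
        long collectionLength =
            collection.stream().mapToLong(List::size).sum();
        double score = 0.0;
        for (String term : query) {
            long collectionTf =
                collection.stream().mapToLong(d -> tf(d, term)).sum();
            double pDoc = (double) tf(doc, term) / doc.size();
            double pColl = (double) collectionTf / collectionLength;
            // One missing term makes this log(0) when lambda is 0.
            score += Math.log((1 - lambda) * pDoc + lambda * pColl);
        }
        return score;
    }

    public static void main(String[] args) {
        for (double lambda : new double[] {0.0, 0.2}) {
            System.out.println("lambda=" + lambda
                + " docA=" + logLikelihood(QUERY, DOC_A, COLLECTION, lambda)
                + " docB=" + logLikelihood(QUERY, DOC_B, COLLECTION, lambda));
        }
    }
}
```

Without smoothing both toy documents score negative infinity, because each
misses at least one query term. With smoothing, the document that matches
two of the three keywords strongly ranks above the one that matches a
single keyword.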

In the Lucene framework, for a multiple-keyword query (say, the simplest
unigram, non-positional query), the following happens, as I understand it:

1) QueryParser parses the query string into BooleanQuery.clauses (weights).
2) The corresponding scorer of BooleanQuery merges the document scores of
each clause.
3) The problem is that each clause's TermDocs only covers the inverted index
entries of that clause's term, which makes a smoothing strategy impossible,
because a document is not scored against every query term.
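
The three steps above can be sketched in a few lines. This is not Lucene's
BooleanScorer, only a toy model of the same idea under assumed data: the
inverted index is a hypothetical map from term to postings (doc id to term
frequency), and the per-term score is simply the raw term frequency.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class PostingsMergeSketch {
    /** Builds a tiny hypothetical inverted index: term -> (docId -> tf). */
    static Map<String, Map<Integer, Integer>> toyIndex() {
        Map<String, Map<Integer, Integer>> index = new HashMap<>();
        index.computeIfAbsent("lucene", t -> new HashMap<>()).put(1, 2);
        index.computeIfAbsent("scoring", t -> new HashMap<>()).put(1, 2);
        index.computeIfAbsent("smoothing", t -> new HashMap<>()).put(2, 1);
        return index;
    }

    /**
     * Sums per-term scores over the postings of each query term.
     * A document is only touched by terms whose postings contain it.
     */
    static Map<Integer, Double> accumulate(
            Map<String, Map<Integer, Integer>> index, String[] queryTerms) {
        Map<Integer, Double> scores = new TreeMap<>();
        for (String term : queryTerms) {
            Map<Integer, Integer> postings =
                index.getOrDefault(term, new HashMap<>());
            for (Map.Entry<Integer, Integer> e : postings.entrySet()) {
                // Placeholder per-term score: the raw term frequency.
                scores.merge(e.getKey(), (double) e.getValue(), Double::sum);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        String[] query = {"lucene", "scoring", "smoothing"};
        System.out.println(accumulate(toyIndex(), query));
    }
}
```

Document 1 never receives a contribution for "smoothing", and document 2
never receives one for "lucene" or "scoring", which is exactly the gap
described in step 3.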

What can I do about that? What class should I concentrate on?

--
View this message in context: http://lucene.472066.n3.nabble.com/Smoothing-language-model-by-Lucene-tp3709311p3709311.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Smoothing language model by Lucene

Posted by Robert Muir <rc...@gmail.com>.

On Thu, Feb 2, 2012 at 3:40 AM, cheyenne.lin <ch...@gmail.com> wrote:
>
> What can I do about that? What class should I concentrate on?
>

As an example, you could take a look at Lucene's trunk; it has two of the
methods from the Zhai/Lafferty paper:
"A study of smoothing methods for language models applied to Ad Hoc
information retrieval."

http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/search/similarities/LMDirichletSimilarity.java
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.java
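
For context, the paper notes that a smoothed score can be rearranged so
that the per-term work only touches documents that contain the term. The
sketch below checks that algebra for Dirichlet smoothing,
p(w|d) = (tf + mu * p(w|C)) / (|d| + mu), on toy data. It is not the code
in the classes linked above; the class name, data and mu value are
assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class DirichletSketch {
    // Toy tokenized documents (hypothetical data).
    static final List<String> DOC_A =
        Arrays.asList("lucene", "scoring", "lucene", "scoring", "index");
    static final List<String> DOC_B =
        Arrays.asList("smoothing", "notes", "about", "other", "things");
    static final List<List<String>> COLLECTION = Arrays.asList(DOC_A, DOC_B);
    static final List<String> QUERY =
        Arrays.asList("lucene", "scoring", "smoothing");

    static long tf(List<String> doc, String term) {
        return doc.stream().filter(term::equals).count();
    }

    /** Collection language model p(w|C); assumes the term occurs somewhere. */
    static double collectionProb(String term) {
        long total = COLLECTION.stream().mapToLong(List::size).sum();
        long count = COLLECTION.stream().mapToLong(d -> tf(d, term)).sum();
        return (double) count / total;
    }

    /** Direct form: visits every query term, even those absent from doc. */
    static double fullScore(List<String> doc, double mu) {
        double score = 0.0;
        for (String term : QUERY) {
            double p = collectionProb(term);
            score += Math.log((tf(doc, term) + mu * p) / (doc.size() + mu));
        }
        return score;
    }

    /** Per-term part: non-zero only for terms that occur in doc. */
    static double matchingPart(List<String> doc, double mu) {
        double score = 0.0;
        for (String term : QUERY) {
            long f = tf(doc, term);
            if (f > 0) {
                score += Math.log(1 + f / (mu * collectionProb(term)));
            }
        }
        return score;
    }

    /** Length part: depends on the document length only, not on matches. */
    static double lengthPart(List<String> doc, double mu) {
        return QUERY.size() * Math.log(mu / (doc.size() + mu));
    }

    /** Query constant: the same for every document, so rank-irrelevant. */
    static double queryConstant() {
        return QUERY.stream()
            .mapToDouble(t -> Math.log(collectionProb(t))).sum();
    }

    public static void main(String[] args) {
        double mu = 10.0;
        double decomposed = matchingPart(DOC_A, mu)
            + lengthPart(DOC_A, mu) + queryConstant();
        System.out.println(fullScore(DOC_A, mu) + " vs " + decomposed);
    }
}
```

The two numbers printed by main agree up to floating-point rounding, so
the smoothing contribution for absent terms is carried by the length part
and the query constant rather than by visiting non-matching postings.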

-- 
lucidimagination.com