Posted to java-user@lucene.apache.org by Nikita Zhiltsov <ni...@gmail.com> on 2012/12/10 06:36:29 UTC

LMDirichletSimilarity for multiple fields

Hi all,

I'm implementing a mixture of language models approach in Lucene 4.0.0.
Here is a little math to be precise.

The ranking score for a query q consisting of terms t:

p(q | \theta) = \prod_{t \in q} p(t | \theta)

where

p(t | \theta) = \sum_f \alpha_f p(t | \theta^f)

and

p(t | \theta^f) = (freq(t) + \mu_f p(t | \theta_c^f)) / (length(f) + \mu_f)

Here \mu_f is the Dirichlet prior for field f and \alpha_f is the mixture
weight of field f.

I've enhanced LMDirichletSimilarity to work with per-field priors:

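import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.index.Norm;
import org.apache.lucene.search.similarities.BasicStats;
import org.apache.lucene.search.similarities.LMDirichletSimilarity;

public class LMPerFieldDirichletSimilarity extends LMDirichletSimilarity {
    // Returns the smoothed probability p(t | \theta^f) itself, not its logarithm.
    @Override
    protected float score(BasicStats stats, float freq, float docLen) {
        // The average field length serves as the per-field prior \mu_f.
        float mu = stats.getAvgFieldLength();
        float collectionProbability = ((LMStats) stats).getCollectionProbability();
        return (freq + mu * collectionProbability) / (docLen + mu);
    }

    // Stores the raw field length as the norm.
    // Note: the cast to byte wraps around for lengths above 127.
    @Override
    public void computeNorm(FieldInvertState state, Norm norm) {
        norm.setByte((byte) state.getLength());
    }

    // Reads the stored field length back as a float.
    @Override
    protected float decodeNormValue(byte norm) {
        return norm;
    }
}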

With this class I can combine CustomScoreQuery, BooleanQuery and FieldsQuery
to retrieve relevant documents and compute the ranking function (the first
probability above). However, my current solution omits the p(t | \theta^f)
values for fields that do not contain any occurrence of a given term t.
Those values should be computed by LMPerFieldDirichletSimilarity.score with
freq=0.
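
To make the missing quantity explicit, here is the per-field formula from above with freq(t) = 0 substituted. It only restates that formula; no new symbols are introduced:

```latex
p(t \mid \theta^f)\Big|_{freq(t) = 0}
  = \frac{\mu_f \, p(t \mid \theta_c^f)}{length(f) + \mu_f}
```

This is strictly positive whenever the term has a nonzero collection probability in field f, so dropping it changes the mixture sum.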

The problem surely comes from the fact that, by default, Lucene only scores
a term against the documents and fields whose postings contain it. This is
less severe for LMDirichletSimilarity in the single-field case, since such
documents are simply irrelevant. In the multi-field case, however, these
values cannot be omitted whenever the document contains at least one
occurrence of the term, no matter in which field.
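
To illustrate the effect outside of Lucene, here is a minimal plain-Java sketch of the mixture for a single term in a single document. It is not Lucene code: the class MixtureLmSketch, the FieldStats holder, the skipAbsentFields flag and all numbers in main are hypothetical and only there to show how the sum differs when the freq=0 fields are skipped.

```java
import java.util.Arrays;
import java.util.List;

public class MixtureLmSketch {

    /** Hypothetical holder for the statistics of one term in one field of one document. */
    static final class FieldStats {
        final double alpha;          // mixture weight \alpha_f
        final double mu;             // Dirichlet prior \mu_f
        final double collectionProb; // p(t | \theta_c^f)
        final double freq;           // freq(t) in this field, 0 if the term is absent
        final double length;         // length(f)

        FieldStats(double alpha, double mu, double collectionProb, double freq, double length) {
            this.alpha = alpha;
            this.mu = mu;
            this.collectionProb = collectionProb;
            this.freq = freq;
            this.length = length;
        }
    }

    /** Dirichlet-smoothed p(t | \theta^f), the same expression as in score(). */
    static double fieldProb(double freq, double collectionProb, double length, double mu) {
        return (freq + mu * collectionProb) / (length + mu);
    }

    /**
     * p(t | \theta) = \sum_f \alpha_f p(t | \theta^f).
     * With skipAbsentFields = true, fields with freq = 0 are left out of the sum,
     * which mimics the behaviour described above.
     */
    static double termProb(List<FieldStats> fields, boolean skipAbsentFields) {
        double p = 0.0;
        for (FieldStats f : fields) {
            if (skipAbsentFields && f.freq == 0.0) {
                continue;
            }
            p += f.alpha * fieldProb(f.freq, f.collectionProb, f.length, f.mu);
        }
        return p;
    }

    public static void main(String[] args) {
        // Made-up statistics: the term occurs in the first field but not in the second.
        List<FieldStats> fields = Arrays.asList(
                new FieldStats(0.7, 10, 0.01, 2, 10),
                new FieldStats(0.3, 100, 0.001, 0, 100));

        System.out.println("full mixture:        " + termProb(fields, false));
        System.out.println("absent fields skipped: " + termProb(fields, true));
    }
}
```

The full mixture is always at least as large as the truncated one, and the gap depends on \alpha_f, \mu_f and the field length, so it differs from document to document and can change the ranking.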

How would you add these values while scoring?

-- 

Nikita Zhiltsov

Visiting Graduate Student
Emory University
Intelligent Information Access Lab
E500 Emerson Hall, Atlanta, Georgia, USA
Phone: (404) 834-5364
E-mail: znikita@emory.edu


---------------------------------------------------------------------
Graduate Student, Research Fellow
Kazan Federal University
Computational Linguistics Laboratory
Russia, 420008
Kazan, Prof. Nuzhina Str., 1/37 room 117
Skype: nickita.jhiltsov
Personal page: http://cll.niimm.ksu.ru/~nzhiltsov
E-mail: nikita.zhiltsov@gmail.com

---------------------------------------------------------------------