Posted to java-user@lucene.apache.org by Nikita Zhiltsov <ni...@gmail.com> on 2012/12/10 06:36:29 UTC
LMDirichletSimilarity for multiple fields
Hi all,
I'm implementing a mixture-of-language-models approach in Lucene 4.0.0.
Here is a little math to be precise.
The ranking score for a query q consisting of terms t is:
p(q | \theta) = \prod_{t \in q} p(t | \theta)
where
p(t | \theta) = \sum_f \alpha_f p(t | \theta^f)
and
p(t | \theta^f) = (freq(t) + \mu_f p(t | \theta_c^f)) / (length(f) + \mu_f)
Here \mu_f is the Dirichlet prior for field f.
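To make the formulas above concrete, here is a minimal sketch in plain Java, independent of Lucene. The class and method names (MixtureLmSketch, fieldProbability, termProbability) and the sample numbers are hypothetical and only illustrate the arithmetic; note that a field with freq = 0 still contributes a smoothed, nonzero value to the sum.

```java
public final class MixtureLmSketch {

    // p(t | \theta^f) = (freq(t) + \mu_f * p(t | \theta_c^f)) / (length(f) + \mu_f)
    static double fieldProbability(double freq, double fieldLength,
                                   double mu, double collectionProbability) {
        return (freq + mu * collectionProbability) / (fieldLength + mu);
    }

    // p(t | \theta) = \sum_f \alpha_f * p(t | \theta^f), summed over ALL fields,
    // including those where the term does not occur (freq = 0).
    static double termProbability(double[] alpha, double[] freq, double[] fieldLength,
                                  double[] mu, double[] collectionProbability) {
        double p = 0.0;
        for (int f = 0; f < alpha.length; f++) {
            p += alpha[f] * fieldProbability(freq[f], fieldLength[f],
                                             mu[f], collectionProbability[f]);
        }
        return p;
    }

    public static void main(String[] args) {
        // Two hypothetical fields; the term occurs twice in the first, never in the second.
        double[] alpha = {0.5, 0.5};
        double[] freq = {2.0, 0.0};
        double[] fieldLength = {10.0, 10.0};
        double[] mu = {10.0, 10.0};
        double[] collectionProbability = {0.01, 0.01};

        System.out.println(termProbability(alpha, freq, fieldLength, mu, collectionProbability));
    }
}
```

The query score is then the product of termProbability over the query terms (or, equivalently, the sum of its logarithms).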
I've enhanced LMDirichletSimilarity to work with per-field priors:
public class LMPerFieldDirichletSimilarity extends LMDirichletSimilarity {

    // Dirichlet-smoothed p(t | \theta^f); the average field length serves as \mu_f.
    @Override
    protected float score(BasicStats stats, float freq, float docLen) {
        float mu = stats.getAvgFieldLength();
        float collectionProbability = ((LMStats) stats).getCollectionProbability();
        return (freq + mu * collectionProbability) / (docLen + mu);
    }

    // Store the raw field length in the norm byte (values above 127 do not fit).
    @Override
    public void computeNorm(FieldInvertState state, Norm norm) {
        norm.setByte((byte) state.getLength());
    }

    // Read the field length back from the norm byte.
    @Override
    protected float decodeNormValue(byte norm) {
        return (float) norm;
    }
}
With this class I can combine CustomScoreQuery, BooleanQuery and FieldsQuery
to retrieve relevant documents and compute the ranking function (the first
probability above). However, my current solution omits the p(t | \theta^f)
values for fields that do not contain the given term t. Those values should
be computed by LMPerFieldDirichletSimilarity.score with freq=0.
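For reference, setting freq(t) = 0 in the per-field formula above leaves only the collection-model term, so the value that goes missing for such a field would be:

```latex
p(t \mid \theta^f)\Big|_{\mathrm{freq}(t)=0}
  = \frac{\mu_f \, p(t \mid \theta_c^f)}{\mathrm{length}(f) + \mu_f}
```

This is greater than zero whenever p(t | \theta_c^f) > 0, so dropping it changes the weighted sum over fields.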
Surely, the problem is that Lucene does not retrieve such term positions by
default. This is not so severe for LMDirichletSimilarity with a single field,
since documents without the term are simply irrelevant. But for multi-field
documents, one cannot omit those values if the document contains at least one
occurrence of the term, no matter in which field.
How would you add these values while scoring?
--
Nikita Zhiltsov
Visiting Graduate Student
Emory University
Intelligent Information Access Lab
E500 Emerson Hall, Atlanta, Georgia, USA
Phone: (404) 834-5364
E-mail: znikita@emory.edu
---------------------------------------------------------------------
Graduate Student, Research Fellow
Kazan Federal University
Computational Linguistics Laboratory
Russia, 420008
Kazan, Prof. Nuzhina Str., 1/37 room 117
Skype: nickita.jhiltsov
Personal page: http://cll.niimm.ksu.ru/~nzhiltsov
E-mail: nikita.zhiltsov@gmail.com
---------------------------------------------------------------------