Posted to java-user@lucene.apache.org by Qi Li <al...@gmail.com> on 2010/12/28 21:11:06 UTC
relevant score calculation
Happy Holidays !
Test case
doc1 : test -- one two three
doc2 : test, one two three
doc3 : one two three
Search query : "one two three" by QueryParser and StandardAnalyzer
Question: why do all three documents have the same score? I would like
doc3 to score higher because it is an exact match and shorter. Can
anybody explain this? I would appreciate it a lot.
Here is my code and its output
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class Test {

    public static void main(String[] args) {
        test();
    }

    private static void test() {
        String[] contents = {"test -- one two three",
                             "test, one two three",
                             "one two three"};

        Directory dir = new RAMDirectory();
        try {
            IndexWriter writer = new IndexWriter(dir,
                    new StandardAnalyzer(Version.LUCENE_30),
                    IndexWriter.MaxFieldLength.UNLIMITED);
            for (int i = 0; i < contents.length; i++) {
                Document doc = new Document();
                doc.add(new Field("de", contents[i], Field.Store.YES,
                        Field.Index.ANALYZED));
                writer.addDocument(doc);
            }
            writer.close();

            IndexSearcher searcher = new IndexSearcher(dir);
            QueryParser parser = new QueryParser(Version.LUCENE_30, "de",
                    new StandardAnalyzer(Version.LUCENE_30));

            Query q = parser.parse("one two three");
            TopDocs topDocs = searcher.search(q, 10);
            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                Document doc = searcher.doc(scoreDoc.doc);
                System.out.println(doc.get("de"));
                Explanation explan = searcher.explain(q, scoreDoc.doc);
                System.out.println(explan.toString());
            }
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (ParseException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
test -- one two three
0.6168854 = (MATCH) sum of:
  0.20562847 = (MATCH) weight(de:one in 0), product of:
    0.57735026 = queryWeight(de:one), product of:
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.8105233 = queryNorm
    0.35615897 = (MATCH) fieldWeight(de:one in 0), product of:
      1.0 = tf(termFreq(de:one)=1)
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.5 = fieldNorm(field=de, doc=0)
  0.20562847 = (MATCH) weight(de:two in 0), product of:
    0.57735026 = queryWeight(de:two), product of:
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.8105233 = queryNorm
    0.35615897 = (MATCH) fieldWeight(de:two in 0), product of:
      1.0 = tf(termFreq(de:two)=1)
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.5 = fieldNorm(field=de, doc=0)
  0.20562847 = (MATCH) weight(de:three in 0), product of:
    0.57735026 = queryWeight(de:three), product of:
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.8105233 = queryNorm
    0.35615897 = (MATCH) fieldWeight(de:three in 0), product of:
      1.0 = tf(termFreq(de:three)=1)
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.5 = fieldNorm(field=de, doc=0)

test, one two three
0.6168854 = (MATCH) sum of:
  0.20562847 = (MATCH) weight(de:one in 1), product of:
    0.57735026 = queryWeight(de:one), product of:
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.8105233 = queryNorm
    0.35615897 = (MATCH) fieldWeight(de:one in 1), product of:
      1.0 = tf(termFreq(de:one)=1)
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.5 = fieldNorm(field=de, doc=1)
  0.20562847 = (MATCH) weight(de:two in 1), product of:
    0.57735026 = queryWeight(de:two), product of:
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.8105233 = queryNorm
    0.35615897 = (MATCH) fieldWeight(de:two in 1), product of:
      1.0 = tf(termFreq(de:two)=1)
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.5 = fieldNorm(field=de, doc=1)
  0.20562847 = (MATCH) weight(de:three in 1), product of:
    0.57735026 = queryWeight(de:three), product of:
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.8105233 = queryNorm
    0.35615897 = (MATCH) fieldWeight(de:three in 1), product of:
      1.0 = tf(termFreq(de:three)=1)
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.5 = fieldNorm(field=de, doc=1)

one two three
0.6168854 = (MATCH) sum of:
  0.20562847 = (MATCH) weight(de:one in 2), product of:
    0.57735026 = queryWeight(de:one), product of:
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.8105233 = queryNorm
    0.35615897 = (MATCH) fieldWeight(de:one in 2), product of:
      1.0 = tf(termFreq(de:one)=1)
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.5 = fieldNorm(field=de, doc=2)
  0.20562847 = (MATCH) weight(de:two in 2), product of:
    0.57735026 = queryWeight(de:two), product of:
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.8105233 = queryNorm
    0.35615897 = (MATCH) fieldWeight(de:two in 2), product of:
      1.0 = tf(termFreq(de:two)=1)
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.5 = fieldNorm(field=de, doc=2)
  0.20562847 = (MATCH) weight(de:three in 2), product of:
    0.57735026 = queryWeight(de:three), product of:
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.8105233 = queryNorm
    0.35615897 = (MATCH) fieldWeight(de:three in 2), product of:
      1.0 = tf(termFreq(de:three)=1)
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.5 = fieldNorm(field=de, doc=2)
Best regards,
Qi Li
Re: relevant score calculation
Posted by Ian Lea <ia...@gmail.com>.
Some of the factors that go into the score calculation are encoded as
a single byte, with an inevitable loss of precision. Length may well be
one of these, and Lucene is not differentiating between your 3- and
4-word docs. Try indexing a document that is significantly longer than
3 or 4 words.
Further reading: http://lucene.apache.org/java/3_0_3/scoring.html, the
javadocs for Similarity and DefaultSimilarity, and whatever Google finds.
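To make the byte precision loss concrete, here is a small plain-Java sketch. It approximately reimplements the one-byte norm encoding Lucene 3.x is assumed to use (the SmallFloat.floatToByte315 scheme: a float's exponent plus its top two explicit mantissa bits packed into one byte); the class and method names below are illustrative, not Lucene API. It shows the 3-term and 4-term length norms, 1/sqrt(3) and 1/sqrt(4), collapsing to the same stored value:

```java
public class NormPrecisionDemo {

    // Approximate reimplementation of the one-byte norm encoding
    // (assumed to follow Lucene 3.x SmallFloat.floatToByte315):
    // keep the float's exponent and top 2 explicit mantissa bits.
    public static byte encodeNorm(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> 21;                    // sign+exp+2 mantissa bits
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1;   // underflow
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1;                                  // overflow: clamp
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    public static float decodeNorm(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xFF) << 21;                    // restore mantissa position
        bits += (63 - 15) << 24;                        // restore exponent bias
        return Float.intBitsToFloat(bits);
    }

    public static float roundTrip(float norm) {
        return decodeNorm(encodeNorm(norm));
    }

    public static void main(String[] args) {
        float threeTerms = (float) (1.0 / Math.sqrt(3)); // 0.57735026
        float fourTerms  = (float) (1.0 / Math.sqrt(4)); // 0.5
        System.out.println(roundTrip(threeTerms));       // prints 0.5
        System.out.println(roundTrip(fourTerms));        // prints 0.5
    }
}
```

Both norms decode back to 0.5, which matches the fieldNorm(field=de, ...) = 0.5 lines in the explain output for every document in the test.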
--
Ian.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: relevant score calculation
Posted by Qi Li <al...@gmail.com>.
I tried to override the default lengthNorm method with the suggestion in
this link:
https://issues.apache.org/jira/browse/LUCENE-2187.
But that did not work, because it does not give every field length from
1 to 10 a unique score.
Here is my solution, which only works for shorter fields. Any critiques
or better solutions are welcome.
// Wrapper class shown for completeness; the class name is illustrative.
public class ShortFieldSimilarity extends DefaultSimilarity {

    // Hand-picked norms for fields of 1..10 terms; each survives the
    // one-byte norm encoding as a distinct value.
    private float[] fs = {1.0f, 0.9f, 0.8f, 0.7f, 0.6f, 0.45f, 0.40f,
                          0.35f, 0.30f, 0.20f};

    @Override
    public float lengthNorm(String fieldName, int numTerms) {
        if (numTerms < 11 && numTerms > 0) {
            return fs[numTerms - 1];
        }
        // Longer fields: fall back to the default, capped at the 10-term
        // norm so they never outscore a short field.
        float result = super.lengthNorm(fieldName, numTerms);
        if (result > 0.1875f) {
            return 0.1875f;
        }
        return result;
    }
}
Here are the resulting fieldNorm values for 1 to 10 terms:

# of terms    lengthNorm
         1    1.0
         2    0.875
         3    0.75
         4    0.625
         5    0.5
         6    0.4375
         7    0.375
         8    0.3125
         9    0.25
        10    0.1875
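A quick plain-Java check confirms why the hand-picked values work. It reuses an approximate reimplementation of the one-byte norm encoding Lucene 3.x is assumed to use (the SmallFloat.floatToByte315 scheme; names here are illustrative, not Lucene API): the default 1/sqrt(numTerms) norms for 1 to 10 terms collapse to only a few distinct bytes, while the ten custom values each encode to a distinct byte:

```java
import java.util.HashSet;
import java.util.Set;

public class NormDistinctnessCheck {

    // Approximate reimplementation of the assumed Lucene 3.x one-byte
    // norm encoding: keep the exponent plus top 2 explicit mantissa bits.
    public static byte encodeNorm(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> 21;
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1;   // underflow
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1;                                  // overflow: clamp
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    // DefaultSimilarity-style lengthNorm: 1/sqrt(numTerms).
    public static int distinctDefaultNorms() {
        Set<Byte> seen = new HashSet<Byte>();
        for (int n = 1; n <= 10; n++) {
            seen.add(encodeNorm((float) (1.0 / Math.sqrt(n))));
        }
        return seen.size();
    }

    // The hand-picked table from the custom lengthNorm above.
    public static int distinctCustomNorms() {
        float[] fs = {1.0f, 0.9f, 0.8f, 0.7f, 0.6f, 0.45f, 0.40f,
                      0.35f, 0.30f, 0.20f};
        Set<Byte> seen = new HashSet<Byte>();
        for (float f : fs) {
            seen.add(encodeNorm(f));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("default 1/sqrt(n), n=1..10: "
                + distinctDefaultNorms() + " distinct encoded norms");
        System.out.println("custom table:               "
                + distinctCustomNorms() + " distinct encoded norms");
    }
}
```

Under this encoding the defaults for 3 and 4 terms, 6 and 7 terms, and 8 to 10 terms all collide, which is why documents of those lengths tie under DefaultSimilarity; the custom values avoid every collision.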
Qi
Re: relevant score calculation
Posted by Qi Li <al...@gmail.com>.
Ahmet and Ian:
Thanks to both of you very much. I will try the patch.
Qi
Re: relevant score calculation
Posted by Ahmet Arslan <io...@yahoo.com>.
As Ian said, the length norm values of all your documents are the same.
See Jay Hill's message at http://search-lucene.com/m/Qw6CZpvRjw/