Posted to java-user@lucene.apache.org by Qi Li <al...@gmail.com> on 2010/12/28 21:11:06 UTC

relevant score calculation

Happy Holidays!

Test case
    doc1 :   test -- one two three
    doc2 :   test, one two three
    doc3 :   one two three

Search query :  "one two three" by QueryParser and StandardAnalyzer

Question:  why do all three documents have the same score?  I would really like
doc3 to score higher because it is an exact match and the shortest.  Can
anybody explain this?  I would appreciate it a lot.

Here is my code and its output

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class Test {

    public static void main(String[] args) {
        test();
    }

    private static void test() {
        String[] contents = {"test -- one two three",
                             "test, one two three",
                             "one two three"};

        Directory dir = new RAMDirectory();
        try {
            // Index the three documents into a single analyzed, stored field.
            IndexWriter writer = new IndexWriter(dir,
                    new StandardAnalyzer(Version.LUCENE_30),
                    IndexWriter.MaxFieldLength.UNLIMITED);
            for (int i = 0; i < contents.length; i++) {
                Document doc = new Document();
                doc.add(new Field("de", contents[i], Field.Store.YES,
                        Field.Index.ANALYZED));
                writer.addDocument(doc);
            }
            writer.close();

            // Search for "one two three" and print each hit with its score explanation.
            IndexSearcher searcher = new IndexSearcher(dir);
            QueryParser parser = new QueryParser(Version.LUCENE_30, "de",
                    new StandardAnalyzer(Version.LUCENE_30));

            Query q = parser.parse("one two three");
            TopDocs topDocs = searcher.search(q, 10);
            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                Document doc = searcher.doc(scoreDoc.doc);
                System.out.println(doc.get("de"));
                Explanation explan = searcher.explain(q, scoreDoc.doc);
                System.out.println(explan.toString());
            }

        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (ParseException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}


test -- one two three
0.6168854 = (MATCH) sum of:
  0.20562847 = (MATCH) weight(de:one in 0), product of:
    0.57735026 = queryWeight(de:one), product of:
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.8105233 = queryNorm
    0.35615897 = (MATCH) fieldWeight(de:one in 0), product of:
      1.0 = tf(termFreq(de:one)=1)
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.5 = fieldNorm(field=de, doc=0)
  0.20562847 = (MATCH) weight(de:two in 0), product of:
    0.57735026 = queryWeight(de:two), product of:
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.8105233 = queryNorm
    0.35615897 = (MATCH) fieldWeight(de:two in 0), product of:
      1.0 = tf(termFreq(de:two)=1)
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.5 = fieldNorm(field=de, doc=0)
  0.20562847 = (MATCH) weight(de:three in 0), product of:
    0.57735026 = queryWeight(de:three), product of:
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.8105233 = queryNorm
    0.35615897 = (MATCH) fieldWeight(de:three in 0), product of:
      1.0 = tf(termFreq(de:three)=1)
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.5 = fieldNorm(field=de, doc=0)

test, one two three
0.6168854 = (MATCH) sum of:
  0.20562847 = (MATCH) weight(de:one in 1), product of:
    0.57735026 = queryWeight(de:one), product of:
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.8105233 = queryNorm
    0.35615897 = (MATCH) fieldWeight(de:one in 1), product of:
      1.0 = tf(termFreq(de:one)=1)
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.5 = fieldNorm(field=de, doc=1)
  0.20562847 = (MATCH) weight(de:two in 1), product of:
    0.57735026 = queryWeight(de:two), product of:
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.8105233 = queryNorm
    0.35615897 = (MATCH) fieldWeight(de:two in 1), product of:
      1.0 = tf(termFreq(de:two)=1)
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.5 = fieldNorm(field=de, doc=1)
  0.20562847 = (MATCH) weight(de:three in 1), product of:
    0.57735026 = queryWeight(de:three), product of:
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.8105233 = queryNorm
    0.35615897 = (MATCH) fieldWeight(de:three in 1), product of:
      1.0 = tf(termFreq(de:three)=1)
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.5 = fieldNorm(field=de, doc=1)

one two three
0.6168854 = (MATCH) sum of:
  0.20562847 = (MATCH) weight(de:one in 2), product of:
    0.57735026 = queryWeight(de:one), product of:
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.8105233 = queryNorm
    0.35615897 = (MATCH) fieldWeight(de:one in 2), product of:
      1.0 = tf(termFreq(de:one)=1)
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.5 = fieldNorm(field=de, doc=2)
  0.20562847 = (MATCH) weight(de:two in 2), product of:
    0.57735026 = queryWeight(de:two), product of:
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.8105233 = queryNorm
    0.35615897 = (MATCH) fieldWeight(de:two in 2), product of:
      1.0 = tf(termFreq(de:two)=1)
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.5 = fieldNorm(field=de, doc=2)
  0.20562847 = (MATCH) weight(de:three in 2), product of:
    0.57735026 = queryWeight(de:three), product of:
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.8105233 = queryNorm
    0.35615897 = (MATCH) fieldWeight(de:three in 2), product of:
      1.0 = tf(termFreq(de:three)=1)
      0.71231794 = idf(docFreq=3, maxDocs=3)
      0.5 = fieldNorm(field=de, doc=2)

Best regards,
Qi Li

Re: relevant score calculation

Posted by Ian Lea <ia...@gmail.com>.
Some of the factors that go into the score calculation are encoded as
a byte, with an inevitable loss of precision.  Maybe length is one of
these and Lucene is not differentiating between your 3 and 4 word
docs.  Try indexing a document that is significantly longer than 3 or
4 words.
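
Here's a quick sketch of what that precision loss looks like with the 3.0
API (just an illustration, not tested against your index): the raw
lengthNorm values for 3 and 4 terms should collapse to the same stored
byte, while a much longer field should not.

import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.Similarity;

public class NormPrecisionDemo {
    public static void main(String[] args) {
        DefaultSimilarity sim = new DefaultSimilarity();
        for (int numTerms : new int[] {3, 4, 12}) {
            // Raw length norm: 1/sqrt(numTerms).
            float raw = sim.lengthNorm("de", numTerms);
            // Norms are stored in the index as a single byte per field per doc.
            byte stored = Similarity.encodeNorm(raw);
            // This is the value the scorer actually sees at search time.
            float decoded = Similarity.decodeNorm(stored);
            System.out.println(numTerms + " terms: raw=" + raw
                    + ", decoded=" + decoded);
        }
        // Expect 0.577 (3 terms) and 0.5 (4 terms) to both decode to 0.5,
        // matching the fieldNorm=0.5 in your explain output, while 12 terms
        // decodes to something smaller, so a much longer doc would score lower.
    }
}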

Further reading: http://lucene.apache.org/java/3_0_3/scoring.html, the
javadocs for Similarity and DefaultSimilarity, whatever Google finds.


--
Ian.


On Tue, Dec 28, 2010 at 8:11 PM, Qi Li <al...@gmail.com> wrote:
> Happy Holidays !
>
> Test case
>    doc1 :   test -- one two three
>    doc2 :   test, one two three
>    doc3 :   one two three
>
> Search query :  "one two three" by QueryParser and StandardAnalyzer
>
> Question:  why all of three documents have the same score?  I really want
> the doc3 has higher score because it is an exact match and short.   Can
> anybody explain this?  I will appreciate a lot



Re: relevant score calculation

Posted by Qi Li <al...@gmail.com>.
I tried overriding the default lengthNorm method with the suggestion in
https://issues.apache.org/jira/browse/LUCENE-2187,
but that will not work because not every number of terms from 1 to 10 ends
up with a unique score once the norm is encoded.

Here is my solution, which only works for shorter fields.  The override goes
in a DefaultSimilarity subclass.  Any critiques or better solutions are
welcome.

    private float[] fs = {1.0f, 0.9f, 0.8f, 0.7f, 0.6f, 0.45f, 0.40f,
                          0.35f, 0.30f, 0.20f};

    @Override
    public float lengthNorm(String fieldName, int numTerms) {
        // For 1 to 10 terms, use hand-picked values that stay distinct
        // even after Lucene encodes the norm into a single byte.
        if (numTerms < 11 && numTerms > 0) {
            return fs[numTerms - 1];
        }
        // Longer fields fall back to the default 1/sqrt(numTerms),
        // capped so they never score above a 10-term field.
        float result = super.lengthNorm(fieldName, numTerms);
        if (result > 0.1875f) {
            return 0.1875f;
        }
        return result;
    }

Here is the resulting fieldNorm (after the byte encoding) for 1 to 10 terms:

   # of terms    fieldNorm
       1          1.0
       2          .875
       3          .75
       4          .625
       5          .5
       6          .4375
       7          .375
       8          .3125
       9          .25
      10          .1875
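
To actually use this, the custom Similarity has to be set on both the writer
and the searcher, and because lengthNorm is baked into the norms at index
time, the documents have to be re-indexed after changing it.  A sketch, e.g.
inside the test() method from my first mail (ShortFieldSimilarity is just a
placeholder name for the DefaultSimilarity subclass holding the override
above):

Similarity sim = new ShortFieldSimilarity();

IndexWriter writer = new IndexWriter(dir,
        new StandardAnalyzer(Version.LUCENE_30),
        IndexWriter.MaxFieldLength.UNLIMITED);
writer.setSimilarity(sim);     // norms get written with the custom lengthNorm
// ... addDocument(...) calls, then writer.close() ...

IndexSearcher searcher = new IndexSearcher(dir);
searcher.setSimilarity(sim);   // keep query-time weighting consistent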

Qi



On Wed, Dec 29, 2010 at 9:00 AM, Ahmet Arslan <io...@yahoo.com> wrote:

> > Test case
> >     doc1 :   test -- one two
> > three
> >     doc2 :   test, one two three
> >     doc3 :   one two three
> >
> > Search query :  "one two three" by QueryParser and
> > StandardAnalyzer
> >
> > Question:  why all of three documents have the same
> > score?
>
> As Ian said, length norm values of your all documents are the same.
> See Jay Hill's message at http://search-lucene.com/m/Qw6CZpvRjw/
>

Re: relevant score calculation

Posted by Qi Li <al...@gmail.com>.
Ahmet and Ian:

Thanks to both of you very much. I will try the patch.

Qi


Re: relevant score calculation

Posted by Ahmet Arslan <io...@yahoo.com>.
> Test case
>     doc1 :   test -- one two
> three
>     doc2 :   test, one two three
>     doc3 :   one two three
> 
> Search query :  "one two three" by QueryParser and
> StandardAnalyzer
> 
> Question:  why all of three documents have the same
> score? 

As Ian said, the length norm values of all your documents are the same.
See Jay Hill's message at http://search-lucene.com/m/Qw6CZpvRjw/
