Posted to java-user@lucene.apache.org by stefcl <st...@gmail.com> on 2010/02/15 15:13:52 UTC

Strange Fuzzyquery results scoring when using a low minimal distance

Hello,

I'm using Lucene v3.
Please consider the following spellings:

Lucene
Lucéne
lucéne
Lucane
Lucen

When searching for "lucéne" among those words using a FuzzyQuery (with a
minimum similarity of 0.5), the results are:

1. Lucene 1.0259752
2. Lucane 1.0259752
3. Lucéne 0.95660806
4. lucéne 0.95660806
5. Lucen 0.30779266

#4 is an exact match, so why does it receive a lower score than "Lucane",
which contains one incorrect letter?

Also, if you raise the minimum similarity a bit (to 0.6 or above), the
ordering becomes what I would expect:

1. Lucéne 1.0438477
2. lucéne 1.0438477
3. Lucene 0.97959816
4. Lucane 0.97959816


Any idea?
Thanks in advance...


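The code I use:

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class FuzzyScoreTest   // class name is arbitrary
{
    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws IOException
    {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

        // 1. index the five spellings, one document each
        Directory index = new RAMDirectory();
        IndexWriter w = new IndexWriter(index, analyzer, true,
                IndexWriter.MaxFieldLength.UNLIMITED);

        addDoc(w, "Lucene");
        addDoc(w, "Lucéne");
        addDoc(w, "lucéne");
        addDoc(w, "Lucane");
        addDoc(w, "Lucen");

        w.close();

        // 2. build the fuzzy query (second argument is the minimum similarity)
        FuzzyQuery q = new FuzzyQuery(new Term("title", "lucéne"), 0.5f);

        // 3. search
        IndexSearcher searcher = new IndexSearcher(index);

        TopDocs topDocs = searcher.search(q, 10);
        ScoreDoc[] hits = topDocs.scoreDocs;

        // 4. display results
        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; i++)
        {
            Document d = searcher.doc(hits[i].doc);
            System.out.println((i + 1) + ". " + d.get("title") + " " + hits[i].score);
        }

        // searcher can only be closed when there
        // is no need to access the documents any more.
        searcher.close();
    }

    // Adds one document whose analyzed, stored "title" field holds the given value.
    private static void addDoc(IndexWriter w, String value) throws IOException
    {
        Document doc = new Document();
        doc.add(new Field("title", value, Field.Store.YES, Field.Index.ANALYZED));
        w.addDocument(doc);
    }
}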
-- 
View this message in context: http://old.nabble.com/Strange-Fuzzyquery-results-scoring-when-using-a-low-minimal-distance-tp27594371p27594371.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: Strange Fuzzyquery results scoring when using a low minimal distance

Posted by stefcl <st...@gmail.com>.

Thanks for the very detailed answer.
Using FuzzyLikeThisQuery solves the problem.




RE: Strange Fuzzyquery results scoring when using a low minimal distance

Posted by Uwe Schindler <uw...@thetaphi.de>.

The problem is the following:
The docFreq of the term "lucéne" is 2; all other terms have a docFreq of 1 (because StandardAnalyzer lowercases everything). Terms with a lower docFreq get a higher score in TermQuery, and in your case this outweighs the boosting done by FuzzyQuery (because your index is so small).
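
A quick way to see where the docFreq of 2 comes from is to lowercase the five titles and count how often each resulting term occurs. The sketch below is plain Java, not Lucene: toLowerCase stands in for the analyzer (an assumption; StandardAnalyzer does more than lowercasing in general), and the class name DocFreqSketch is made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class DocFreqSketch {
    // Counts in how many single-word "documents" each lowercased term occurs.
    static Map<String, Integer> docFreq(String[] titles) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String title : titles) {
            // Stand-in for the analyzer: lowercase only, no accent folding.
            String term = title.toLowerCase(Locale.ROOT);
            freq.merge(term, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        String[] titles = {"Lucene", "Lucéne", "lucéne", "Lucane", "Lucen"};
        docFreq(titles).forEach(
                (term, df) -> System.out.println(term + " docFreq=" + df));
    }
}
```

"Lucéne" and "lucéne" collapse into the same term, so that term is counted twice, while the other three terms are each counted once.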

If you raise the minSimilarity a little bit, your query matches fewer terms and the rewritten BooleanQuery contains fewer clauses. At some point the extra weight of the less frequent terms is no longer relevant for the final score.
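
To see the effect in numbers, here is a rough sketch that recomputes the scores by hand. It is not Lucene code and rests on several assumptions about Lucene 3.0: idf = 1 + ln(numDocs / (docFreq + 1)) as in DefaultSimilarity, a fuzzy similarity of 1 - editDistance / min(queryLength, termLength), a term boost of (similarity - minSimilarity) / (1 - minSimilarity), coord disabled on the rewritten BooleanQuery, and tf and field norm both equal to 1 for these one-word titles. The class and method names are made up for illustration.

```java
class FuzzyScoreSketch {
    static final String[] TERMS = {"lucéne", "lucene", "lucane", "lucen"};
    static final int[] DOC_FREQ = {2, 1, 1, 1};       // after lowercasing
    static final int[] EDIT_DISTANCE = {0, 1, 1, 2};  // to the query term "lucéne"
    static final int NUM_DOCS = 5;
    static final int QUERY_LENGTH = 6;

    // Assumed DefaultSimilarity-style inverse document frequency.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Assumed fuzzy boost; NaN means the term does not pass minSim.
    static double boost(int editDistance, int termLength, double minSim) {
        double similarity =
                1.0 - (double) editDistance / Math.min(QUERY_LENGTH, termLength);
        if (similarity <= minSim + 1e-9) {   // epsilon guards against rounding
            return Double.NaN;
        }
        return (similarity - minSim) / (1.0 - minSim);
    }

    // Scores for TERMS, in order; unmatched terms get NaN.
    static double[] scores(double minSim) {
        int n = TERMS.length;
        double[] idfs = new double[n];
        double[] boosts = new double[n];
        double sumOfSquares = 0.0;
        for (int i = 0; i < n; i++) {
            idfs[i] = idf(DOC_FREQ[i], NUM_DOCS);
            boosts[i] = boost(EDIT_DISTANCE[i], TERMS[i].length(), minSim);
            if (!Double.isNaN(boosts[i])) {
                sumOfSquares += Math.pow(boosts[i] * idfs[i], 2);
            }
        }
        double queryNorm = 1.0 / Math.sqrt(sumOfSquares);
        double[] result = new double[n];
        for (int i = 0; i < n; i++) {
            // query weight (boost * idf) times document weight (idf), normalized
            result[i] = queryNorm * boosts[i] * idfs[i] * idfs[i];
        }
        return result;
    }

    public static void main(String[] args) {
        for (double minSim : new double[] {0.5, 0.6}) {
            double[] s = scores(minSim);
            System.out.println("minSimilarity = " + minSim);
            for (int i = 0; i < TERMS.length; i++) {
                System.out.println("  " + TERMS[i] + " " + String.format("%.4f", s[i]));
            }
        }
    }
}
```

Under these assumptions the numbers come out very close to the scores reported in the original post (roughly 1.026 for "lucene" against 0.957 for "lucéne" at 0.5, and 0.980 against 1.044 at 0.6). Since queryNorm is the same for every term within one query, the order is decided by boost * idf^2. The exact match has 1.0 * 2.28, about 2.28. The one-edit terms have about 0.67 * 3.67 = 2.45 at 0.5, but only about 0.58 * 3.67 = 2.14 at 0.6, because a higher minSimilarity rescales their boost downwards.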

By the way, you can always look at the explain() results, which tell you how the score was computed.

The fix (applies only to trunk, see issue https://issues.apache.org/jira/browse/LUCENE-124) is to ignore the scoring of the TermQueries generated by FuzzyQuery and only look at the edit distance. This is implemented by another MultiTermQuery.RewriteMethod, which can be set with FuzzyQuery.setRewriteMethod().

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de



Re: Strange Fuzzyquery results scoring when using a low minimal distance

Posted by stefcl <st...@gmail.com>.

Thanks a lot,
but I still don't understand why raising the min similarity a little bit
changes the ordering...




Re: Strange Fuzzyquery results scoring when using a low minimal distance

Posted by mark harwood <ma...@yahoo.co.uk>.

This could be down to IDF, i.e. "Lucane" is ranked higher because it is rarer, despite having a worse edit distance.
This is arguably a bug.
See http://issues.apache.org/jira/browse/LUCENE-329, which discusses this. You could try subclassing QueryParser and overriding newFuzzyQuery to return a FuzzyLikeThisQuery (found in "contrib/queries").
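
For reference, the edit distances in question can be checked with a small standalone sketch. This is a textbook Levenshtein implementation in plain Java, not Lucene's own code, and the class name EditDistanceSketch is made up for illustration. The terms are given in lowercase, on the assumption that the analyzer lowercases them at index time.

```java
class EditDistanceSketch {
    // Classic dynamic-programming Levenshtein distance, keeping two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;   // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;   // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String query = "lucéne";
        for (String term : new String[] {"lucéne", "lucene", "lucane", "lucen"}) {
            System.out.println(term + " distance=" + levenshtein(query, term));
        }
    }
}
```

The exact match is at distance 0, "lucene" and "lucane" are each one substitution away, and "lucen" needs two edits, so on edit distance alone the exact match should come first.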

Cheers
Mark


