
Posted to dev@lucene.apache.org by "Jingkei Ly (JIRA)" <ji...@apache.org> on 2010/07/23 18:22:51 UTC

[jira] Created: (LUCENE-2557) FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches

FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches
------------------------------------------------------------------------------

                 Key: LUCENE-2557
                 URL: https://issues.apache.org/jira/browse/LUCENE-2557
             Project: Lucene - Java
          Issue Type: Bug
          Components: Query/Scoring
    Affects Versions: 3.0.2
            Reporter: Jingkei Ly


The FuzzyQuery often causes misspellings to be ranked higher than the exact match, which seems to be an undesirable property generally. 

For example, in an index of surnames, if I search using a FuzzyQuery for "smith", misspellings such as "smiith" or "smiht" appear near the top of the search results, ahead of documents that match "smith".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2557) FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891673#action_12891673 ] 

Robert Muir commented on LUCENE-2557:
-------------------------------------

I don't understand why we need to average any IDFs. This seems really costly, and I think in general the idea of fuzzy is to find misspellings.

Furthermore, I don't understand why it's important for the IDF whether the query term exists in the index or not, because the query itself could be misspelled.



[jira] Reopened: (LUCENE-2557) FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches

Posted by "Jingkei Ly (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jingkei Ly reopened LUCENE-2557:
--------------------------------


Robert,

I posted a comment just before yours (apparently in the same minute). I made an additional point that TopTermsBoostOnlyBooleanQueryRewrite undesirably ignores the IDF. Does that make the issue valid again?


[jira] Updated: (LUCENE-2557) FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches

Posted by "Jingkei Ly (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jingkei Ly updated LUCENE-2557:
-------------------------------

    Attachment: idf-scoring-test-case.patch

I've attached a test case which demonstrates some of the scoring issues (the patch applies to the existing TestFuzzyQuery class). With the default FuzzyQuery, the fuzzy terms "joness" and "smiith" get promoted to the top of the search results because they have higher IDFs than the exact matches.

If you modify the test so that the FuzzyQuerys use TopTermsBoostOnlyBooleanQueryRewrite, i.e. uncomment these lines in the test case:
{code}
smithQuery.setRewriteMethod(new MultiTermQuery.TopTermsBoostOnlyBooleanQueryRewrite());
jonesQuery.setRewriteMethod(new MultiTermQuery.TopTermsBoostOnlyBooleanQueryRewrite());
{code}

The fuzzy terms are correctly relegated to the bottom of the search results but, because IDF is ignored, "jones" appears more highly scored than "smith" even though "smith" is the rarer term.

Ideally the solution should solve both these issues.
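
As a rough illustration of the first effect described above, here is a minimal sketch in plain Java. It is not Lucene code: it assumes the classic idf shape 1 + ln(numDocs / (docFreq + 1)), uses idf squared as a simplified stand-in for the IDF contribution, and ignores tf, norms, coord and queryNorm. The class name, document frequencies and the fuzzy boost of 0.6 are all hypothetical.

```java
// Illustrative arithmetic only, not Lucene code. It ignores tf, norms,
// coord and queryNorm; every number below is hypothetical.
class IdfRankingSketch {

    // Assumed classic idf shape: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // Simplified per-term weight when IDF is scored: boost * idf^2.
    static double weightWithIdf(double boost, int docFreq, int numDocs) {
        double termIdf = idf(docFreq, numDocs);
        return boost * termIdf * termIdf;
    }

    // Simplified per-term weight when only the boost is scored.
    static double weightBoostOnly(double boost) {
        return boost;
    }

    public static void main(String[] args) {
        int numDocs = 10000;     // hypothetical index size
        int smithDocFreq = 500;  // exact match, relatively common
        int smiithDocFreq = 1;   // misspelling, very rare
        double exactBoost = 1.0; // exact match keeps the full boost
        double fuzzyBoost = 0.6; // hypothetical boost for a one-edit variant

        System.out.println("with IDF:   smith="
                + weightWithIdf(exactBoost, smithDocFreq, numDocs)
                + " smiith=" + weightWithIdf(fuzzyBoost, smiithDocFreq, numDocs));
        System.out.println("boost only: smith=" + weightBoostOnly(exactBoost)
                + " smiith=" + weightBoostOnly(fuzzyBoost));
    }
}
```

With these made-up numbers the rare misspelling gets the larger weight when IDF is scored, and the exact match wins when only the boost is scored. In the boost-only case every exact match gets the same weight regardless of how rare its term is, which is the second issue.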


[jira] Commented: (LUCENE-2557) FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891680#action_12891680 ] 

Robert Muir commented on LUCENE-2557:
-------------------------------------

bq. I agree that fuzzy is to find misspellings, but I don't think it should favour misspellings above an exact match.

So what is the problem with TopTermsBoostOnlyBooleanQueryRewrite? It will never do this.

While I agree this is a really simple solution to the problem, it seemed to me from the comments in LUCENE-329 that there were differing opinions on how one might want to combine the factors of edit distance boost, tf, idf, etc. It seems it will depend on the application.

So I definitely don't think this is a bug in FuzzyQuery. Personally, I am not against adding new alternative rewrite methods like the one you added here, so that more choices are available. But this just seems to be the same issue as LUCENE-329 to me.

My personal preference would be to take this code and bring LUCENE-329 up to speed, e.g. by creating an alternative in contrib/queries or something that uses Mark Harwood's "smart fuzzy" logic, which is currently limited to FuzzyLikeThis.



[jira] Commented: (LUCENE-2557) FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches

Posted by "Mark Harwood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892285#action_12892285 ] 

Mark Harwood commented on LUCENE-2557:
--------------------------------------

I think we're agreed that the effects of IDF are troublesome when ranking variant term matches but I question that the default solution should be to remove IDF from the equation completely.

Doing that reminds me of the time my mother thought the shadow in a photograph was annoying and cut it out with a pair of scissors leaving a big hole in its place.
What we're proposing here instead is the equivalent of some "photoshopping" to retain some of the original information but suitably blurred to provide a more natural balance to the overall picture.

Some degree of IDF can be usefully retained from a FuzzyQuery in order to achieve balance with all the other (potentially non-fuzzy) optional clauses that may exist in a BooleanQuery.
The proposal is that the most natural blending of IDF scores within a FuzzyQuery is to use only the IDF of the input term (which defines the user's original intent) and use this to score a match on any suggested variant. If the input term does not exist in the index, the average IDF of all variants is used as the next best alternative for scoring each variant.

This approach has exactly the same ranking effect as the existing "remove IDF" policy within a single FuzzyQuery but has the added advantage of sitting better with the other optional clauses that may exist in a containing query.
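
A minimal sketch of that policy in plain Java, to make the two cases concrete. This is not the attached patch: the class and method names are hypothetical, the idf formula is the assumed classic shape 1 + ln(numDocs / (docFreq + 1)), and a docFreq of 0 is used here to mean "input term not in the index".

```java
import java.util.List;

// Hypothetical sketch of the "one shared IDF per fuzzy expansion" policy.
class SharedIdfSketch {

    // Assumed classic idf shape: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // Returns the single IDF value that every variant is scored with.
    // inputTermDocFreq == 0 means the input term is not in the index.
    static double sharedIdf(int inputTermDocFreq, List<Integer> variantDocFreqs, int numDocs) {
        if (inputTermDocFreq > 0) {
            // Preferred case: the user's own term defines the IDF.
            return idf(inputTermDocFreq, numDocs);
        }
        if (variantDocFreqs.isEmpty()) {
            throw new IllegalArgumentException("no variants to average over");
        }
        // Fallback: average the IDFs of the variants that were found.
        double sum = 0.0;
        for (int docFreq : variantDocFreqs) {
            sum += idf(docFreq, numDocs);
        }
        return sum / variantDocFreqs.size();
    }
}
```

Because every variant then carries the same IDF, the ordering inside one FuzzyQuery is decided by the boost alone, while the clause as a whole still carries an IDF-sized weight next to other clauses.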

The question over core vs contrib comes down to what is considered the more natural/expected behaviour. 



[jira] Resolved: (LUCENE-2557) FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-2557.
---------------------------------

    Resolution: Duplicate

Duplicate of LUCENE-124, which added this new rewrite method in trunk and 3.x:

{noformat}
* LUCENE-124: Add a TopTermsBoostOnlyBooleanQueryRewrite to MultiTermQuery.
  This rewrite method is similar to TopTermsScoringBooleanQueryRewrite, but
  only scores terms by their boost values. For example, this can be used 
  with FuzzyQuery to ensure that exact matches are always scored higher, 
  because only the boost will be used in scoring. 
{noformat}


[jira] Updated: (LUCENE-2557) FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches

Posted by "Jingkei Ly (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jingkei Ly updated LUCENE-2557:
-------------------------------

    Attachment: LUCENE-2557.patch

I've had a crack at implementing a fix, based on suggestions in LUCENE-329. It takes the IDF of the term used in the FuzzyQuery if it exists in the index and uses that as the IDF. If the term is not in the index it uses the average IDF of all the terms.

It is implemented as a rewrite method similar to TopTermsBoostOnlyBooleanQueryRewrite from LUCENE-124, although it required modifying TopTermsBooleanQueryRewrite a little bit.


[jira] Commented: (LUCENE-2557) FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892297#action_12892297 ] 

Robert Muir commented on LUCENE-2557:
-------------------------------------

So here is an option for this issue: we could reword the whole issue as 'improve FuzzyQuery defaults'.

If we were to do this, I would suggest the following at the minimum:
* Instead of a default distance of 0.5 (from the query parser), if a distance isn't provided explicitly (~0.6 etc.), calculate one that will perform well and never brute-force compare all the terms.
* Instead of a default max expansions equal to the BooleanQuery max clause count (1024), use a more reasonable number of expansions by default (such as 50).
* Instead of the current rewrite, use a rewrite similar to FuzzyLikeThis. Maybe we don't even need to average docfreq across all 50 terms even, maybe the top-5 or so is sufficient.

If we were to do something like this, maybe we could improve performance and behavior instead of making tradeoffs.



[jira] Commented: (LUCENE-2557) FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892315#action_12892315 ] 

Robert Muir commented on LUCENE-2557:
-------------------------------------

bq. If this policy is a performance concern then we could reduce the number of terms as you suggest or just ignore IDF entirely in this case but I'm not sure the averaging costs represent any kind of real performance concern given the IO costs of accessing TermDocs.

I suggested reducing the number of terms (for the averaging), but also the number of default expansions.
I think in general expanding to 1024 is obscene...

But also, if we reduce this number, FuzzyTermsEnum itself gets faster, too.
FuzzyTermsEnum is aware (via an attribute) when the priority queue is filled, and it knows the minimal score to be competitive.
When a certain edit distance is no longer competitive, it optimizes itself by swapping in a more efficient Automaton.
This is safe because the pq's comparator is score, then the term's compareTo (lexicographic order).

Simple example: let's say you ask for a max of 1 expansion, but with a fuzzy query of max edit distance 1.
As soon as the enum finds a term of ed=1, further terms of ed=1 are no longer competitive, so it will then try to seek
to an exact match (swapping in an ed=0 automaton) and exit, instead of wasting time seeking to useless terms.

It's a bit more complicated since the boost value is really not just edit distance but also string length, but I think this illustration works.
It's one reason why I think we should try to 'improve the defaults'.
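
The queue behaviour in the simple example can be sketched in plain Java as follows. This is not the actual Lucene code: the class names are hypothetical, the boosts are made-up values, and the tie-break direction (for equal boosts, the lexicographically later term counts as less competitive) is an assumption chosen so that terms enumerated later in sorted order cannot displace an equally scored earlier one.

```java
import java.util.PriorityQueue;

// Sketch of a bounded "top terms" queue; not the actual Lucene classes.
class TopTermsSketch {

    static final class ScoredTerm implements Comparable<ScoredTerm> {
        final String text;
        final double boost;

        ScoredTerm(String text, double boost) {
            this.text = text;
            this.boost = boost;
        }

        // Least competitive sorts first: lower boost, then (assumed
        // tie-break) the lexicographically later term.
        @Override
        public int compareTo(ScoredTerm other) {
            int byBoost = Double.compare(boost, other.boost);
            return byBoost != 0 ? byBoost : other.text.compareTo(text);
        }
    }

    private final int maxSize;
    // Min-heap: the head is always the weakest term currently kept.
    private final PriorityQueue<ScoredTerm> queue = new PriorityQueue<>();

    TopTermsSketch(int maxSize) {
        if (maxSize < 1) {
            throw new IllegalArgumentException("maxSize must be >= 1");
        }
        this.maxSize = maxSize;
    }

    // Offers a term; returns true if it was kept in the queue.
    boolean offer(String text, double boost) {
        ScoredTerm candidate = new ScoredTerm(text, boost);
        if (queue.size() < maxSize) {
            queue.add(candidate);
            return true;
        }
        if (candidate.compareTo(queue.peek()) <= 0) {
            return false; // not competitive against the weakest kept term
        }
        queue.poll();
        queue.add(candidate);
        return true;
    }

    // Once the queue is full, this is the boost of the weakest kept term;
    // an enumerator could use it to skip edit distances that cannot compete.
    double minCompetitiveBoost() {
        return queue.size() < maxSize ? Double.NEGATIVE_INFINITY : queue.peek().boost;
    }
}
```

With a queue of size 1, offering "smiht" and then "smiith" at the same boost keeps only "smiht"; from that point on only a strictly higher boost (an exact match in this example) can still get in, which is the condition under which the enum could stop looking at ed=1 terms.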



[jira] Commented: (LUCENE-2557) FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches

Posted by "Jingkei Ly (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891678#action_12891678 ] 

Jingkei Ly commented on LUCENE-2557:
------------------------------------

{quote}
I don't understand why we need to average any IDFs. This seems really costly, and I think in general the idea of fuzzy is to find misspellings.
{quote}
I agree that fuzzy is to find misspellings, but I don't think it should favour misspellings above an exact match. I think the reasoning behind the average IDFs (I based that on comments in LUCENE-329) is that in the absence of an IDF from the exact match it's better than nothing to have an average of the terms you do know. Perhaps there is a better heuristic for that case, though.

{quote}
Furthermore, I don't understand why it's important for the IDF whether the query term exists in the index or not, because the query itself could be misspelled.
{quote}
I think it's a fair assumption that users are searching for specific terms (+fore:john +sur:smith), so it is unlikely that they would have a misspelling in the original query. If they did misspell it and got erroneous results, it seems it would be immediately clear that the cause is a misspelt query.



[jira] Commented: (LUCENE-2557) FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892290#action_12892290 ] 

Robert Muir commented on LUCENE-2557:
-------------------------------------

bq. I think we're agreed that the effects of IDF are troublesome when ranking variant term matches but I question that the default solution should be to remove IDF from the equation completely.

Mark, just FYI, the boost-only rewrite method isn't the default (it's just a simple option, but no runtime behavior has changed; it still uses "normal" boolean expansion as the default).

I basically agree with your idea that it would be nicer to add a smarter rewrite method, and if so, probably change the defaults. But my concerns are:
* This "input term exists / does not exist" thing seems a little weird to me, as mentioned.
* Doing a lot of docfreq/idf calls seems expensive? This is no problem with FuzzyLikeThis, I think, though. Doesn't it use a more reasonable PQ size (50 or something)?


[jira] Commented: (LUCENE-2557) FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches

Posted by "Eks Dev (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892341#action_12892341 ] 

Eks Dev commented on LUCENE-2557:
---------------------------------

It looks like we have one invariant:
IDF(QueryTerm) >= IDF(ExpansionTerm) // Prevents documents matching an expansion term (ET) from scoring better than documents with an exact match on the query term (QT).

Fixing all expansions to IDF(QT) would remove dynamics of the score, making the contribution to the score for all expansions identical. Maybe proportionally scaling the IDF of all expansions to preserve their mutual IDF dynamics (relative to IDF(QT), to keep up with the invariant) would work better?

In the case where there is no matching QueryTerm, why not simply preserve the expansion term IDF? What is the averaging good for, performance?
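
One possible reading of the proportional-scaling idea, as a minimal plain-Java sketch. The class and method names are hypothetical and the scaling rule is an assumption: all expansion IDFs are multiplied by one common factor so that the largest of them equals IDF(QT), and they are left unchanged if they already satisfy the invariant.

```java
// Hypothetical sketch: scale expansion IDFs by a common factor so that
// none exceeds the query term's IDF, keeping their ratios intact.
class ScaledIdfSketch {

    static double[] scaleToQueryTerm(double queryTermIdf, double[] expansionIdfs) {
        double max = 0.0;
        for (double value : expansionIdfs) {
            max = Math.max(max, value);
        }
        // Only scale down; leave values alone if the invariant already holds.
        double factor = max > queryTermIdf ? queryTermIdf / max : 1.0;

        double[] scaled = new double[expansionIdfs.length];
        for (int i = 0; i < expansionIdfs.length; i++) {
            scaled[i] = expansionIdfs[i] * factor;
        }
        return scaled;
    }
}
```

For example, with IDF(QT) = 4.0 and expansion IDFs of 8.0 and 2.0, the common factor is 0.5 and the scaled values are 4.0 and 1.0, so the 4:1 ratio between the expansions is preserved while the invariant holds.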


[jira] Commented: (LUCENE-2557) FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches

Posted by "Mark Harwood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892344#action_12892344 ] 

Mark Harwood commented on LUCENE-2557:
--------------------------------------

bq. Fixing all expansions to IDF(QT) would remove dynamics of the score, making the contribution to the score for all expansions identical. 

The "boost" property is used by fuzzy/synonyms etc. to express the preference for one term variant over another. The effects of this boost setting are demonstrably wiped out when the unfiltered IDF of term variants is used (see the attached JUnit test).

bq. , why not simply preserving expansion Term IDF,

See above. The objective is for all variants in an expanded query to share the same IDF setting in order for the boost setting to work as required.


[jira] Commented: (LUCENE-2557) FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches

Posted by "Mark Harwood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892311#action_12892311 ] 

Mark Harwood commented on LUCENE-2557:
--------------------------------------

bq. I don't understand why we need to average any IDFs. This seems really costly

IDF lookups and averaging etc should only be calculated for the top "n" terms that finally make it into the query. "Top" in this case being some edit distance threshold or synonymity measure. All the required doc frequency info for IDF is available in RAM on TermEnum which is iterated across anyway and so shouldn't incur any extra disk seeks. So given a query that expands to 1,000 terms the cost of computing the average IDF for that set of terms is surely lost in the cost of 1,000 disk seeks on the TermDocs as part of query evaluation? I need to review the code to remind myself of how it is processed but it feels like it should be cheap.

bq. average docfreq across all 50 terms even, maybe the top-5 or so is sufficient.

That could work. The IDF score simply has to be a value that is used as a constant for all the expanded terms in a fuzzy query and, as an added bonus, represents a value that can be usefully contrasted with other query clauses.  The averaging policy is just a fall-back position in the rarer situations when a user's original input term has no associated IDF value we can use. If this policy is a performance concern then we could reduce the number of terms as you suggest or just ignore IDF entirely in this case but I'm not sure the averaging costs represent any kind of real performance concern given the IO costs of accessing TermDocs.
