You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2008/01/08 23:42:34 UTC

[jira] Created: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input okenlengh is small compared to minSimilarity

short circuit FuzzyQuery.rewrite when input okenlengh is small compared to minSimilarity
----------------------------------------------------------------------------------------

                 Key: LUCENE-1124
                 URL: https://issues.apache.org/jira/browse/LUCENE-1124
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Query/Scoring
            Reporter: Hoss Man


I found this (unreplied to) email floating around in my Lucene folder from during the holidays...

{noformat}
From: Timo Nentwig
To: java-dev
Subject: Fuzzy makes no sense for short tokens
Date: Mon, 31 Dec 2007 16:01:11 +0100
Message-Id: <20...@nitwit.de>

Hi!

it generally makes no sense to search fuzzy for short tokens because changing
even only a single character of course already results in a high edit
distance. So it actually only makes sense in this case:

           if( token.length() > 1f / (1f - minSimilarity) )

E.g. changing one character in a 3-letter token (foo) results in an edit
distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher
we can save all the expensive rewrite() logic.
{noformat}

I don't know much about FuzzyQueries, but this reasoning seems sound ... FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in the event that the input token is shorter then some simple math on the minSimilarity.  (i'm not smart enough to be certain that the math above is right however ... it's been a while since i looked at Levenstein distances ... tests needed)


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input okenlengh is small compared to minSimilarity

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557144#action_12557144 ] 

Otis Gospodnetic commented on LUCENE-1124:
------------------------------------------

This makes sense, Timo.  Could you please attach a patch + a simple unit test?


> short circuit FuzzyQuery.rewrite when input okenlengh is small compared to minSimilarity
> ----------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1124
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1124
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Query/Scoring
>            Reporter: Hoss Man
>
> I found this (unreplied to) email floating around in my Lucene folder from during the holidays...
> {noformat}
> From: Timo Nentwig
> To: java-dev
> Subject: Fuzzy makes no sense for short tokens
> Date: Mon, 31 Dec 2007 16:01:11 +0100
> Message-Id: <20...@nitwit.de>
> Hi!
> it generally makes no sense to search fuzzy for short tokens because changing
> even only a single character of course already results in a high edit
> distance. So it actually only makes sense in this case:
>            if( token.length() > 1f / (1f - minSimilarity) )
> E.g. changing one character in a 3-letter token (foo) results in an edit
> distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher
> we can save all the expensive rewrite() logic.
> {noformat}
> I don't know much about FuzzyQueries, but this reasoning seems sound ... FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in the event that the input token is shorter then some simple math on the minSimilarity.  (i'm not smart enough to be certain that the math above is right however ... it's been a while since i looked at Levenstein distances ... tests needed)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1124:
--------------------------------

    Priority: Trivial  (was: Major)

> short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1124
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1124
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Query/Scoring
>            Reporter: Hoss Man
>            Assignee: Mark Miller
>            Priority: Trivial
>             Fix For: 2.9
>
>         Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch
>
>
> I found this (unreplied to) email floating around in my Lucene folder from during the holidays...
> {noformat}
> From: Timo Nentwig
> To: java-dev
> Subject: Fuzzy makes no sense for short tokens
> Date: Mon, 31 Dec 2007 16:01:11 +0100
> Message-Id: <20...@nitwit.de>
> Hi!
> it generally makes no sense to search fuzzy for short tokens because changing
> even only a single character of course already results in a high edit
> distance. So it actually only makes sense in this case:
>            if( token.length() > 1f / (1f - minSimilarity) )
> E.g. changing one character in a 3-letter token (foo) results in an edit
> distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher
> we can save all the expensive rewrite() logic.
> {noformat}
> I don't know much about FuzzyQueries, but this reasoning seems sound ... FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in the event that the input token is shorter then some simple math on the minSimilarity.  (i'm not smart enough to be certain that the math above is right however ... it's been a while since i looked at Levenstein distances ... tests needed)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Resolved: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1124.
----------------------------------------

    Resolution: Fixed

> short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1124
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1124
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Query/Scoring
>            Reporter: Hoss Man
>            Assignee: Mark Miller
>            Priority: Trivial
>             Fix For: 2.9.1, 3.0
>
>         Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch
>
>
> I found this (unreplied to) email floating around in my Lucene folder from during the holidays...
> {noformat}
> From: Timo Nentwig
> To: java-dev
> Subject: Fuzzy makes no sense for short tokens
> Date: Mon, 31 Dec 2007 16:01:11 +0100
> Message-Id: <20...@nitwit.de>
> Hi!
> it generally makes no sense to search fuzzy for short tokens because changing
> even only a single character of course already results in a high edit
> distance. So it actually only makes sense in this case:
>            if( token.length() > 1f / (1f - minSimilarity) )
> E.g. changing one character in a 3-letter token (foo) results in an edit
> distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher
> we can save all the expensive rewrite() logic.
> {noformat}
> I don't know much about FuzzyQueries, but this reasoning seems sound ... FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in the event that the input token is shorter then some simple math on the minSimilarity.  (i'm not smart enough to be certain that the math above is right however ... it's been a while since i looked at Levenstein distances ... tests needed)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated LUCENE-1124:
-------------------------------------

    Summary: short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity  (was: short circuit FuzzyQuery.rewrite when input okenlengh is small compared to minSimilarity)

> short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1124
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1124
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Query/Scoring
>            Reporter: Hoss Man
>         Attachments: LUCENE-1124.patch, LUCENE-1124.patch
>
>
> I found this (unreplied to) email floating around in my Lucene folder from during the holidays...
> {noformat}
> From: Timo Nentwig
> To: java-dev
> Subject: Fuzzy makes no sense for short tokens
> Date: Mon, 31 Dec 2007 16:01:11 +0100
> Message-Id: <20...@nitwit.de>
> Hi!
> it generally makes no sense to search fuzzy for short tokens because changing
> even only a single character of course already results in a high edit
> distance. So it actually only makes sense in this case:
>            if( token.length() > 1f / (1f - minSimilarity) )
> E.g. changing one character in a 3-letter token (foo) results in an edit
> distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher
> we can save all the expensive rewrite() logic.
> {noformat}
> I don't know much about FuzzyQueries, but this reasoning seems sound ... FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in the event that the input token is shorter then some simple math on the minSimilarity.  (i'm not smart enough to be certain that the math above is right however ... it's been a while since i looked at Levenstein distances ... tests needed)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Reopened: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reopened LUCENE-1124:
----------------------------------------


This fix breaks the case when the exact term is present in the index.

> short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1124
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1124
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Query/Scoring
>            Reporter: Hoss Man
>            Assignee: Mark Miller
>            Priority: Trivial
>             Fix For: 2.9
>
>         Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch
>
>
> I found this (unreplied to) email floating around in my Lucene folder from during the holidays...
> {noformat}
> From: Timo Nentwig
> To: java-dev
> Subject: Fuzzy makes no sense for short tokens
> Date: Mon, 31 Dec 2007 16:01:11 +0100
> Message-Id: <20...@nitwit.de>
> Hi!
> it generally makes no sense to search fuzzy for short tokens because changing
> even only a single character of course already results in a high edit
> distance. So it actually only makes sense in this case:
>            if( token.length() > 1f / (1f - minSimilarity) )
> E.g. changing one character in a 3-letter token (foo) results in an edit
> distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher
> we can save all the expensive rewrite() logic.
> {noformat}
> I don't know much about FuzzyQueries, but this reasoning seems sound ... FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in the event that the input token is shorter then some simple math on the minSimilarity.  (i'm not smart enough to be certain that the math above is right however ... it's been a while since i looked at Levenstein distances ... tests needed)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Resolved: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller resolved LUCENE-1124.
---------------------------------

    Resolution: Fixed

Thanks! Committed.

> short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1124
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1124
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Query/Scoring
>            Reporter: Hoss Man
>            Assignee: Mark Miller
>            Priority: Trivial
>             Fix For: 2.9
>
>         Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch
>
>
> I found this (unreplied to) email floating around in my Lucene folder from during the holidays...
> {noformat}
> From: Timo Nentwig
> To: java-dev
> Subject: Fuzzy makes no sense for short tokens
> Date: Mon, 31 Dec 2007 16:01:11 +0100
> Message-Id: <20...@nitwit.de>
> Hi!
> it generally makes no sense to search fuzzy for short tokens because changing
> even only a single character of course already results in a high edit
> distance. So it actually only makes sense in this case:
>            if( token.length() > 1f / (1f - minSimilarity) )
> E.g. changing one character in a 3-letter token (foo) results in an edit
> distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher
> we can save all the expensive rewrite() logic.
> {noformat}
> I don't know much about FuzzyQueries, but this reasoning seems sound ... FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in the event that the input token is shorter then some simple math on the minSimilarity.  (i'm not smart enough to be certain that the math above is right however ... it's been a while since i looked at Levenstein distances ... tests needed)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1124:
---------------------------------------

    Fix Version/s:     (was: 2.9)
                   3.0
                   2.9.1

> short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1124
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1124
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Query/Scoring
>            Reporter: Hoss Man
>            Assignee: Mark Miller
>            Priority: Trivial
>             Fix For: 2.9.1, 3.0
>
>         Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch
>
>
> I found this (unreplied to) email floating around in my Lucene folder from during the holidays...
> {noformat}
> From: Timo Nentwig
> To: java-dev
> Subject: Fuzzy makes no sense for short tokens
> Date: Mon, 31 Dec 2007 16:01:11 +0100
> Message-Id: <20...@nitwit.de>
> Hi!
> it generally makes no sense to search fuzzy for short tokens because changing
> even only a single character of course already results in a high edit
> distance. So it actually only makes sense in this case:
>            if( token.length() > 1f / (1f - minSimilarity) )
> E.g. changing one character in a 3-letter token (foo) results in an edit
> distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher
> we can save all the expensive rewrite() logic.
> {noformat}
> I don't know much about FuzzyQueries, but this reasoning seems sound ... FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in the event that the input token is shorter then some simple math on the minSimilarity.  (i'm not smart enough to be certain that the math above is right however ... it's been a while since i looked at Levenstein distances ... tests needed)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12625476#action_12625476 ] 

Hoss Man commented on LUCENE-1124:
----------------------------------

I don't deal with FuzzyQueries much, but skimming this issue it seems to touch on a lot of hte same things that spawned the creation of the "mm" syntax for specifying the "minNumberShouldMatch" value on BooleanQueries in the Solr dismax query parser...

http://lucene.apache.org/solr/api/org/apache/solr/util/doc-files/min-should-match.html

...perhaps something similar could be used to allow people to specify simpel expressions for dictating the "fuzzyiness" of short input vs medium length input, vs long input.

> short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1124
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1124
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Query/Scoring
>            Reporter: Hoss Man
>         Attachments: LUCENE-1124.patch, LUCENE-1124.patch
>
>
> I found this (unreplied to) email floating around in my Lucene folder from during the holidays...
> {noformat}
> From: Timo Nentwig
> To: java-dev
> Subject: Fuzzy makes no sense for short tokens
> Date: Mon, 31 Dec 2007 16:01:11 +0100
> Message-Id: <20...@nitwit.de>
> Hi!
> it generally makes no sense to search fuzzy for short tokens because changing
> even only a single character of course already results in a high edit
> distance. So it actually only makes sense in this case:
>            if( token.length() > 1f / (1f - minSimilarity) )
> E.g. changing one character in a 3-letter token (foo) results in an edit
> distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher
> we can save all the expensive rewrite() logic.
> {noformat}
> I don't know much about FuzzyQueries, but this reasoning seems sound ... FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in the event that the input token is shorter then some simple math on the minSimilarity.  (i'm not smart enough to be certain that the math above is right however ... it's been a while since i looked at Levenstein distances ... tests needed)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input okenlengh is small compared to minSimilarity

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1124:
--------------------------------

    Attachment: LUCENE-1124.patch

This optimization is correct. Highlights some interesting things about fuzzy query as well i.e. if you put a minsim of 0.9, your query term *has* to be over 10 chars to have any hope of getting a match. For the default of 0.5 its 2 chars, so in the common case the optimization doesn't do much good, and you do have to pay for the check every time no matter what. For larger minsim values though, this will turn a lot of fuzz queries into no ops.

- Mark

> short circuit FuzzyQuery.rewrite when input okenlengh is small compared to minSimilarity
> ----------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1124
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1124
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Query/Scoring
>            Reporter: Hoss Man
>         Attachments: LUCENE-1124.patch
>
>
> I found this (unreplied to) email floating around in my Lucene folder from during the holidays...
> {noformat}
> From: Timo Nentwig
> To: java-dev
> Subject: Fuzzy makes no sense for short tokens
> Date: Mon, 31 Dec 2007 16:01:11 +0100
> Message-Id: <20...@nitwit.de>
> Hi!
> it generally makes no sense to search fuzzy for short tokens because changing
> even only a single character of course already results in a high edit
> distance. So it actually only makes sense in this case:
>            if( token.length() > 1f / (1f - minSimilarity) )
> E.g. changing one character in a 3-letter token (foo) results in an edit
> distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher
> we can save all the expensive rewrite() logic.
> {noformat}
> I don't know much about FuzzyQueries, but this reasoning seems sound ... FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in the event that the input token is shorter then some simple math on the minSimilarity.  (i'm not smart enough to be certain that the math above is right however ... it's been a while since i looked at Levenstein distances ... tests needed)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1124:
--------------------------------

    Fix Version/s: 2.9
         Assignee: Mark Miller

> short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1124
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1124
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Query/Scoring
>            Reporter: Hoss Man
>            Assignee: Mark Miller
>             Fix For: 2.9
>
>         Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch
>
>
> I found this (unreplied to) email floating around in my Lucene folder from during the holidays...
> {noformat}
> From: Timo Nentwig
> To: java-dev
> Subject: Fuzzy makes no sense for short tokens
> Date: Mon, 31 Dec 2007 16:01:11 +0100
> Message-Id: <20...@nitwit.de>
> Hi!
> it generally makes no sense to search fuzzy for short tokens because changing
> even only a single character of course already results in a high edit
> distance. So it actually only makes sense in this case:
>            if( token.length() > 1f / (1f - minSimilarity) )
> E.g. changing one character in a 3-letter token (foo) results in an edit
> distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher
> we can save all the expensive rewrite() logic.
> {noformat}
> I don't know much about FuzzyQueries, but this reasoning seems sound ... FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in the event that the input token is shorter then some simple math on the minSimilarity.  (i'm not smart enough to be certain that the math above is right however ... it's been a while since i looked at Levenstein distances ... tests needed)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1124:
---------------------------------------

    Attachment: LUCENE-1124.patch

Attach patch (based on 2.9) showing the bug, along with the fix.  Instead of rewriting to empty BooleanQuery when prefix term is not long enough, I rewrite to TermQuery with that prefix.  This way the exact term matches.

I'll commit shortly to trunk & 2.9.x.

> short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1124
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1124
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Query/Scoring
>            Reporter: Hoss Man
>            Assignee: Mark Miller
>            Priority: Trivial
>             Fix For: 2.9
>
>         Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch
>
>
> I found this (unreplied to) email floating around in my Lucene folder from during the holidays...
> {noformat}
> From: Timo Nentwig
> To: java-dev
> Subject: Fuzzy makes no sense for short tokens
> Date: Mon, 31 Dec 2007 16:01:11 +0100
> Message-Id: <20...@nitwit.de>
> Hi!
> it generally makes no sense to search fuzzy for short tokens because changing
> even only a single character of course already results in a high edit
> distance. So it actually only makes sense in this case:
>            if( token.length() > 1f / (1f - minSimilarity) )
> E.g. changing one character in a 3-letter token (foo) results in an edit
> distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher
> we can save all the expensive rewrite() logic.
> {noformat}
> I don't know much about FuzzyQueries, but this reasoning seems sound ... FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in the event that the input token is shorter then some simple math on the minSimilarity.  (i'm not smart enough to be certain that the math above is right however ... it's been a while since i looked at Levenstein distances ... tests needed)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input okenlengh is small compared to minSimilarity

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1124:
--------------------------------

    Attachment: LUCENE-1124.patch

Computing the needed term length in the constructor is probably better.

> short circuit FuzzyQuery.rewrite when input okenlengh is small compared to minSimilarity
> ----------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1124
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1124
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Query/Scoring
>            Reporter: Hoss Man
>         Attachments: LUCENE-1124.patch, LUCENE-1124.patch
>
>
> I found this (unreplied to) email floating around in my Lucene folder from during the holidays...
> {noformat}
> From: Timo Nentwig
> To: java-dev
> Subject: Fuzzy makes no sense for short tokens
> Date: Mon, 31 Dec 2007 16:01:11 +0100
> Message-Id: <20...@nitwit.de>
> Hi!
> it generally makes no sense to search fuzzy for short tokens because changing
> even only a single character of course already results in a high edit
> distance. So it actually only makes sense in this case:
>            if( token.length() > 1f / (1f - minSimilarity) )
> E.g. changing one character in a 3-letter token (foo) results in an edit
> distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher
> we can save all the expensive rewrite() logic.
> {noformat}
> I don't know much about FuzzyQueries, but this reasoning seems sound ... FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in the event that the input token is shorter then some simple math on the minSimilarity.  (i'm not smart enough to be certain that the math above is right however ... it's been a while since i looked at Levenstein distances ... tests needed)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1124:
--------------------------------

    Attachment: LUCENE-1124.patch

Updated to trunk.

Im going to commit in few days if no one objects.

> short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1124
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1124
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Query/Scoring
>            Reporter: Hoss Man
>         Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch
>
>
> I found this (unreplied to) email floating around in my Lucene folder from during the holidays...
> {noformat}
> From: Timo Nentwig
> To: java-dev
> Subject: Fuzzy makes no sense for short tokens
> Date: Mon, 31 Dec 2007 16:01:11 +0100
> Message-Id: <20...@nitwit.de>
> Hi!
> it generally makes no sense to search fuzzy for short tokens because changing
> even only a single character of course already results in a high edit
> distance. So it actually only makes sense in this case:
>            if( token.length() > 1f / (1f - minSimilarity) )
> E.g. changing one character in a 3-letter token (foo) results in an edit
> distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher
> we can save all the expensive rewrite() logic.
> {noformat}
> I don't know much about FuzzyQueries, but this reasoning seems sound ... FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in the event that the input token is shorter then some simple math on the minSimilarity.  (i'm not smart enough to be certain that the math above is right however ... it's been a while since i looked at Levenstein distances ... tests needed)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org