You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Thomas Morton (JIRA)" <ji...@apache.org> on 2009/02/28 04:15:12 UTC

[jira] Created: (LUCENE-1550) Add N-Gram String Matching for Spell Checking

Add N-Gram String Matching for Spell Checking
---------------------------------------------

                 Key: LUCENE-1550
                 URL: https://issues.apache.org/jira/browse/LUCENE-1550
             Project: Lucene - Java
          Issue Type: New Feature
          Components: contrib/spellchecker
    Affects Versions: 2.9
            Reporter: Thomas Morton
             Fix For: 2.9


N-Gram version of edit distance based on paper by Grzegorz Kondrak, "N-gram similarity and distance". Proceedings of the Twelfth International Conference on String Processing and Information Retrieval (SPIRE 2005), pp. 115-126,  Buenos Aires, Argentina, November 2005. 
http://www.cs.ualberta.ca/~kondrak/papers/spire05.pdf


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1550) Add N-Gram String Matching for Spell Checking

Posted by "Thomas Morton (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Morton updated LUCENE-1550:
----------------------------------

    Attachment: LUCENE-1550.patch


2 seems a reasonable default.  Experiments in paper should comparable results for bi-grams and tri-grams.  Made an empty constructor which sets n=2.

Yes that can be moved up without penalty.

That's a bug in the empty case.  Should return 0 unless both strings are empty.  I ported this bug form the Levenstein Distance code.  It's now fixed in both and has unit tests in both.  New patch attached.

Technically NGramDistance(1) is the same thing as LevensteinDistance but LevensteinDistance code is more straight forward and may be slightly faster.


> Add N-Gram String Matching for Spell Checking
> ---------------------------------------------
>
>                 Key: LUCENE-1550
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1550
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>    Affects Versions: 2.9
>            Reporter: Thomas Morton
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1550.patch, LUCENE-1550.patch
>
>
> N-Gram version of edit distance based on paper by Grzegorz Kondrak, "N-gram similarity and distance". Proceedings of the Twelfth International Conference on String Processing and Information Retrieval (SPIRE 2005), pp. 115-126,  Buenos Aires, Argentina, November 2005. 
> http://www.cs.ualberta.ca/~kondrak/papers/spire05.pdf

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Closed: (LUCENE-1550) Add N-Gram String Matching for Spell Checking

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll closed LUCENE-1550.
-----------------------------------

       Resolution: Fixed
    Lucene Fields:   (was: [New])

> Add N-Gram String Matching for Spell Checking
> ---------------------------------------------
>
>                 Key: LUCENE-1550
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1550
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>    Affects Versions: 2.9
>            Reporter: Thomas Morton
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1550.patch, LUCENE-1550.patch, LUCENE-1550.patch
>
>
> N-Gram version of edit distance based on paper by Grzegorz Kondrak, "N-gram similarity and distance". Proceedings of the Twelfth International Conference on String Processing and Information Retrieval (SPIRE 2005), pp. 115-126,  Buenos Aires, Argentina, November 2005. 
> http://www.cs.ualberta.ca/~kondrak/papers/spire05.pdf

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1550) Add N-Gram String Matching for Spell Checking

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688079#action_12688079 ] 

Grant Ingersoll commented on LUCENE-1550:
-----------------------------------------

Hey Tom,

Few questions:

# Do you have recommendations on picking n?
# On line 78 or so, can't that be moved up?  s1/t1 are calculated on line 47 and not assigned to.  Seems like it would be an optimization to return out if they are 0.  Also, can it just be:
{code}
if (s1 == 0 || t1 == 0){return 1;};
{code}
In fact, all tests still pass when this is moved up to the top.  However, I must not be understanding something, as why should:
{code}
public void testEmpty() throws Exception {
    StringDistance nsd = new NGramDistance(1);
    float d = nsd.getDistance("", "al");
    assertEquals(d,1.0f,0.001);
  }
{code}
pass?


> Add N-Gram String Matching for Spell Checking
> ---------------------------------------------
>
>                 Key: LUCENE-1550
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1550
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>    Affects Versions: 2.9
>            Reporter: Thomas Morton
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1550.patch
>
>
> N-Gram version of edit distance based on paper by Grzegorz Kondrak, "N-gram similarity and distance". Proceedings of the Twelfth International Conference on String Processing and Information Retrieval (SPIRE 2005), pp. 115-126,  Buenos Aires, Argentina, November 2005. 
> http://www.cs.ualberta.ca/~kondrak/papers/spire05.pdf

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1550) Add N-Gram String Matching for Spell Checking

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694718#action_12694718 ] 

Grant Ingersoll commented on LUCENE-1550:
-----------------------------------------

Hey Tom,

The empty string test still fails, namely b/c it is expecting 0.0 instead of -1.  Seems like we should just return 0 for the case where one is empty.  Although, even that is a bit weird, right?  For instance, in the Edit distance case, there still is a notion of the number of edits one needs to make to get from a string to an empty string, right? That is, the distance from "a" to "" seems less than the distance of "abcdef" to "", no?

I guess I'm fine with 0, but what does the literature suggest for this edge case?

> Add N-Gram String Matching for Spell Checking
> ---------------------------------------------
>
>                 Key: LUCENE-1550
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1550
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>    Affects Versions: 2.9
>            Reporter: Thomas Morton
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1550.patch, LUCENE-1550.patch
>
>
> N-Gram version of edit distance based on paper by Grzegorz Kondrak, "N-gram similarity and distance". Proceedings of the Twelfth International Conference on String Processing and Information Retrieval (SPIRE 2005), pp. 115-126,  Buenos Aires, Argentina, November 2005. 
> http://www.cs.ualberta.ca/~kondrak/papers/spire05.pdf

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Assigned: (LUCENE-1550) Add N-Gram String Matching for Spell Checking

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll reassigned LUCENE-1550:
---------------------------------------

    Assignee: Grant Ingersoll

> Add N-Gram String Matching for Spell Checking
> ---------------------------------------------
>
>                 Key: LUCENE-1550
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1550
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>    Affects Versions: 2.9
>            Reporter: Thomas Morton
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1550.patch
>
>
> N-Gram version of edit distance based on paper by Grzegorz Kondrak, "N-gram similarity and distance". Proceedings of the Twelfth International Conference on String Processing and Information Retrieval (SPIRE 2005), pp. 115-126,  Buenos Aires, Argentina, November 2005. 
> http://www.cs.ualberta.ca/~kondrak/papers/spire05.pdf

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1550) Add N-Gram String Matching for Spell Checking

Posted by "Thomas Morton (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Morton updated LUCENE-1550:
----------------------------------

    Attachment: LUCENE-1550.patch

Patch includes implementation of n-gram string matching.

This implementation uses the position-based optimization to compute partial matches of n-gram sub-strings and adds a null-character prefix of size n-1  so that the first character is contained in the same number of n-grams as a middle character.  Null-character prefix matches are discounted so that  strings with no matching characters will return a distance of 0.

Includes test cases.

Fixes javadoc description of StringDistance to reflect that 1 is used for the same string and 0 for non-matching.

> Add N-Gram String Matching for Spell Checking
> ---------------------------------------------
>
>                 Key: LUCENE-1550
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1550
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>    Affects Versions: 2.9
>            Reporter: Thomas Morton
>             Fix For: 2.9
>
>         Attachments: LUCENE-1550.patch
>
>
> N-Gram version of edit distance based on paper by Grzegorz Kondrak, "N-gram similarity and distance". Proceedings of the Twelfth International Conference on String Processing and Information Retrieval (SPIRE 2005), pp. 115-126,  Buenos Aires, Argentina, November 2005. 
> http://www.cs.ualberta.ca/~kondrak/papers/spire05.pdf

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1550) Add N-Gram String Matching for Spell Checking

Posted by "Thomas Morton (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701362#action_12701362 ] 

Thomas Morton commented on LUCENE-1550:
---------------------------------------

The implementations returns a normalized edit distance (normalized by string length) and specifically 1 if the strings are the same and 0 if that are maximally different.  0 in that case makes sense as the number of edits is equal to the number of characters in the longest string, so:

1- (2 edits /2 length) = 0


> Add N-Gram String Matching for Spell Checking
> ---------------------------------------------
>
>                 Key: LUCENE-1550
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1550
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>    Affects Versions: 2.9
>            Reporter: Thomas Morton
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1550.patch, LUCENE-1550.patch, LUCENE-1550.patch
>
>
> N-Gram version of edit distance based on paper by Grzegorz Kondrak, "N-gram similarity and distance". Proceedings of the Twelfth International Conference on String Processing and Information Retrieval (SPIRE 2005), pp. 115-126,  Buenos Aires, Argentina, November 2005. 
> http://www.cs.ualberta.ca/~kondrak/papers/spire05.pdf

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1550) Add N-Gram String Matching for Spell Checking

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711150#action_12711150 ] 

Grant Ingersoll commented on LUCENE-1550:
-----------------------------------------

Committed revision 776704.

Thanks Tom!

> Add N-Gram String Matching for Spell Checking
> ---------------------------------------------
>
>                 Key: LUCENE-1550
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1550
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>    Affects Versions: 2.9
>            Reporter: Thomas Morton
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1550.patch, LUCENE-1550.patch, LUCENE-1550.patch
>
>
> N-Gram version of edit distance based on paper by Grzegorz Kondrak, "N-gram similarity and distance". Proceedings of the Twelfth International Conference on String Processing and Information Retrieval (SPIRE 2005), pp. 115-126,  Buenos Aires, Argentina, November 2005. 
> http://www.cs.ualberta.ca/~kondrak/papers/spire05.pdf

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1550) Add N-Gram String Matching for Spell Checking

Posted by "Thomas Morton (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Morton updated LUCENE-1550:
----------------------------------

    Attachment: LUCENE-1550.patch

Fixes the empty string case.
Adds some additional unit tests.

> Add N-Gram String Matching for Spell Checking
> ---------------------------------------------
>
>                 Key: LUCENE-1550
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1550
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>    Affects Versions: 2.9
>            Reporter: Thomas Morton
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1550.patch, LUCENE-1550.patch, LUCENE-1550.patch
>
>
> N-Gram version of edit distance based on paper by Grzegorz Kondrak, "N-gram similarity and distance". Proceedings of the Twelfth International Conference on String Processing and Information Retrieval (SPIRE 2005), pp. 115-126,  Buenos Aires, Argentina, November 2005. 
> http://www.cs.ualberta.ca/~kondrak/papers/spire05.pdf

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1550) Add N-Gram String Matching for Spell Checking

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated LUCENE-1550:
------------------------------------

    Priority: Minor  (was: Major)

not major

> Add N-Gram String Matching for Spell Checking
> ---------------------------------------------
>
>                 Key: LUCENE-1550
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1550
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>    Affects Versions: 2.9
>            Reporter: Thomas Morton
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1550.patch
>
>
> N-Gram version of edit distance based on paper by Grzegorz Kondrak, "N-gram similarity and distance". Proceedings of the Twelfth International Conference on String Processing and Information Retrieval (SPIRE 2005), pp. 115-126,  Buenos Aires, Argentina, November 2005. 
> http://www.cs.ualberta.ca/~kondrak/papers/spire05.pdf

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org