Posted to dev@lucene.apache.org by "Samuel García Martínez (JIRA)" <ji...@apache.org> on 2013/02/21 22:28:13 UTC

[jira] [Updated] (LUCENE-4793) Spellchecker don't find suggestion for concrete misspelled 6 letter words

     [ https://issues.apache.org/jira/browse/LUCENE-4793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Samuel García Martínez updated LUCENE-4793:
-------------------------------------------

    Description: 
While debugging the behaviour of the Solr spellchecker (IndexBasedSpellChecker, which delegates to the Lucene SpellChecker), I think I found a bug that appears when the input is a 6-letter word, for example:
  - george
  - anthem
  - argued
  - fluent

Because of the values returned by getMin() and getMax(), the n-grams indexed for these terms have lengths 3 and 4. So the indexed fields would be something like this:
  - for "george"
     - start3: "geo"
     - start4: "geor"
     - end3: "rge"
     - end4: "orge"
     - 3: "geo", "eor", "org", "rge"
     - 4: "geor", "eorg", "orge"
  - for "anthem"
     - start3: "ant"
     - start4: "anth"
     - end3: "hem"
     - end4: "them"
     - 3: "ant", "nth", "the", "hem"
     - 4: "anth", "nthe", "them"
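
For illustration, the field layout above can be reproduced with a few lines of Python. This is only a sketch, not Lucene's Java code: gram_fields is a hypothetical helper, the field labels simply follow the listing above, and the gram lengths 3 and 4 are passed in by hand instead of coming from getMin()/getMax().

```python
def gram_fields(word, min_n, max_n):
    """Return the n-gram fields for `word`, labelled as in the listing above.

    Hypothetical helper for illustration; not part of Lucene.
    """
    fields = {}
    for n in range(min_n, max_n + 1):
        # All contiguous substrings of length n, left to right.
        grams = [word[i:i + n] for i in range(len(word) - n + 1)]
        if not grams:
            continue  # word is shorter than n: nothing to index
        fields[f"start{n}"] = grams[0]   # leading gram
        fields[f"end{n}"] = grams[-1]    # trailing gram
        fields[str(n)] = grams           # every gram of this length
    return fields


for name, value in gram_fields("george", 3, 4).items():
    print(f"{name}: {value}")
```

Passing the lengths as arguments makes it easy to try other combinations on the same word.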

The problem shows up when the user swaps the 3rd and 4th characters, misspelling the word like this:
  - geroge
  - anhtem

The queries generated for these terms (boolean queries made of SHOULD clauses) are:
- for "geroge" 
  - start3: "ger"
  - start4: "gero"
  - end3: "oge"
  - end4: "roge"
  - 3: "ger", "ero", "rog", "oge"
  - 4: "gero", "erog", "roge"
- for "anhtem"
  - start3: "anh"
  - start4: "anht"
  - end3: "tem"
  - end4: "htem"
  - 3: "anh", "nht", "hte", "tem"
  - 4: "anht", "nhte", "htem"
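
The indexed terms and the query terms can be compared mechanically. The sketch below is Python, not Lucene code; grams and shared_terms are hypothetical helpers, and the field labels follow the listings above. It intersects, field by field, what the correct word would index with what the misspelled word would query, assuming gram lengths 3 and 4.

```python
def grams(word, n):
    """All contiguous substrings of length n, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def shared_terms(indexed, query, lengths):
    """Map each field to the terms both words produce for that field."""
    shared = {}
    for n in lengths:
        a, b = grams(indexed, n), grams(query, n)
        if not a or not b:
            continue  # one of the words is shorter than n
        shared[f"start{n}"] = {a[0]} & {b[0]}
        shared[f"end{n}"] = {a[-1]} & {b[-1]}
        shared[str(n)] = set(a) & set(b)
    return shared


for indexed, query in [("george", "geroge"), ("anthem", "anhtem")]:
    hits = {f: t for f, t in shared_terms(indexed, query, (3, 4)).items() if t}
    # An empty dict means no SHOULD clause has a term in common with the index.
    print(indexed, query, hits)
```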

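So, as you can see, none of the query terms appears in the fields indexed for the correct word, so this kind of misspelling never retrieves the suitable suggestion, even though the distance score between the two words is 0.95555556 (on the scale where 1.0 means identical).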
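
The description does not say which distance implementation produced 0.95555556. As an assumption, the sketch below uses a textbook Jaro-Winkler similarity (prefix scaling 0.1, prefix capped at 4 characters, boost applied only above 0.7), which happens to reproduce that figure for "george" vs "geroge". It is plain Python written for illustration, not Lucene's implementation.

```python
def jaro_winkler(s1, s2, scaling=0.1, max_prefix=4, boost_threshold=0.7):
    """Jaro-Winkler similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used = [False] * len(s2)
    matched1 = []
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == ch:
                used[j] = True
                matched1.append(ch)
                break
    m = len(matched1)
    if m == 0:
        return 0.0
    matched2 = [s2[j] for j in range(len(s2)) if used[j]]
    # Matched characters that appear in a different order, counted in pairs.
    transpositions = sum(a != b for a, b in zip(matched1, matched2)) // 2
    jaro = (m / len(s1) + m / len(s2) + (m - transpositions) / m) / 3
    if jaro < boost_threshold:
        return jaro
    # Winkler boost for a shared prefix.
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * scaling * (1 - jaro)


print(round(jaro_winkler("george", "geroge"), 8))  # 0.95555556
```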

I think getMin(int l) and getMax(int l) should return 2 and 3, respectively, for l == 6. Debugging other lengths, I did not find any problem with any kind of misspelling.
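
A quick way to see what the proposed lengths would change is to list the query clauses that hit the correctly spelled word when grams of length 2 and 3 are used. Again, this is a Python sketch under assumptions: matching_clauses is a hypothetical helper, the field labels extend the ones used above (start2, end2, 2), and nothing here calls Lucene.

```python
def grams(word, n):
    """All contiguous substrings of length n, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def matching_clauses(indexed, query, lengths):
    """List the (field, term) query clauses that would hit the indexed word."""
    hits = []
    for n in lengths:
        a, b = grams(indexed, n), grams(query, n)
        if not a or not b:
            continue  # one of the words is shorter than n
        if a[0] == b[0]:
            hits.append((f"start{n}", b[0]))
        if a[-1] == b[-1]:
            hits.append((f"end{n}", b[-1]))
        hits.extend((str(n), g) for g in sorted(set(a) & set(b)))
    return hits


# Proposed gram lengths for 6-letter words: 2 and 3.
print(matching_clauses("george", "geroge", (2, 3)))
print(matching_clauses("anthem", "anhtem", (2, 3)))
```

With length-2 grams the first two and last two characters survive a swap of the 3rd and 4th characters, so the start, end and plain gram fields each contribute at least one matching clause.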
    
> Spellchecker don't find suggestion for concrete misspelled 6 letter words
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-4793
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4793
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/spellchecker
>    Affects Versions: 3.6, 4.0, 4.1
>            Reporter: Samuel García Martínez
>            Priority: Minor
>
