You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "David Bowen (JIRA)" <ji...@apache.org> on 2009/01/30 18:56:59 UTC

[jira] Created: (LUCENE-1532) File based spellcheck with doc frequencies supplied

File based spellcheck with doc frequencies supplied
---------------------------------------------------

Key: LUCENE-1532
URL: https://issues.apache.org/jira/browse/LUCENE-1532
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/spellchecker
Reporter: David Bowen

The file-based spellchecker treats all words in the dictionary as equally valid, so it can suggest a very obscure word rather than a more common word which is equally close to the misspelled word that was entered. It would be very useful to have the option of supplying an integer with each word which indicates its commonness. I.e. the integer could be the document frequency in some index or set of indexes.

I've implemented a modification to the spellcheck API to support this by defining a DocFrequencyInfo interface for obtaining the doc frequency of a word, and a class which implements the interface by looking up the frequency in an index. So Lucene users can provide alternative implementations of DocFrequencyInfo. I could submit this as a patch if there is interest. Alternatively, it might be better to just extend the spellcheck API to have a way to supply the frequencies when you create a PlainTextDictionary, but that would mean storing the frequencies somewhere when building the spellcheck index, and I'm not sure how best to do that.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669323#action_12669323 ] 

Mark Miller commented on LUCENE-1532:
-------------------------------------

Just to expand:

I almost think it would be best if both the file and index based checkers used a normalized frequency combined with the edit distance. The only difference would be how each gets the term list and freq info. With the index version, the freq info can be auto calculated from the index, and with the file version, either all terms have the same freq or the user can supply a file with user freqs (possible the index version could just write that file).

> File based spellcheck with doc frequencies supplied
> ---------------------------------------------------
>
>                 Key: LUCENE-1532
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1532
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: David Bowen
>
> The file-based spellchecker treats all words in the dictionary as equally valid, so it can suggest a very obscure word rather than a more common word which is equally close to the misspelled word that was entered.  It would be very useful to have the option of supplying an integer with each word which indicates its commonness.  I.e. the integer could be the document frequency in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by defining a DocFrequencyInfo interface for obtaining the doc frequency of a word, and a class which implements the interface by looking up the frequency in an index.  So Lucene users can provide alternative implementations of DocFrequencyInfo.  I could submit this as a patch if there is interest.  Alternatively, it might be better to just extend the spellcheck API to have a way to supply the frequencies when you create a PlainTextDictionary, but that would mean storing the frequencies somewhere when building the spellcheck index, and I'm not sure how best to do that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669599#action_12669599 ] 

Mark Miller commented on LUCENE-1532:
-------------------------------------

bq. In one corpus doc. frequency of 3 means it is probably a typo, in another this means nothing...

I think typos are another problem though. There can be too many high frequency typos and low frequency correct spellings. I think this has to be attacked in other ways anyway. Maybe favoring an a true dictionary slightly, over the user dictionary. Certainly its hard to use freq for it though. Helps keep those typos from getting suggested though - and you only pay by seeing fewer less common, but correct, suggestions as well.

bq. My proposal is to work with real frequency as you have no information loss there ... 

I don't think the info you are losing is helpful. If you start favoring a word that occurs 70,000 times heavily over words that occur 40,000 times, I think it works in the favor of bad suggestions. On a scaled freq chart, they might actually be a 4 and 5 or both the same freq even. Considering that 70k-40k doesnt likely tell you much in terms of which is a better suggestion, this allows edit distance to play the larger role that it should in deciding.

Of course it makes sense for the implementation to be able to work with the raw values as you say though. We wouldn't want to hardcode the normalization. Your right - who knows what is right to do for it, or whether you should even normalize at the end of the day. I don't. A casual experience showed good results though, and I think supplying something like that out of the box will really improve lucenes spelling suggestions.
- Mark

> File based spellcheck with doc frequencies supplied
> ---------------------------------------------------
>
>                 Key: LUCENE-1532
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1532
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: David Bowen
>            Priority: Minor
>
> The file-based spellchecker treats all words in the dictionary as equally valid, so it can suggest a very obscure word rather than a more common word which is equally close to the misspelled word that was entered.  It would be very useful to have the option of supplying an integer with each word which indicates its commonness.  I.e. the integer could be the document frequency in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by defining a DocFrequencyInfo interface for obtaining the doc frequency of a word, and a class which implements the interface by looking up the frequency in an index.  So Lucene users can provide alternative implementations of DocFrequencyInfo.  I could submit this as a patch if there is interest.  Alternatively, it might be better to just extend the spellcheck API to have a way to supply the frequencies when you create a PlainTextDictionary, but that would mean storing the frequencies somewhere when building the spellcheck index, and I'm not sure how best to do that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669582#action_12669582 ] 

Robert Muir commented on LUCENE-1532:
-------------------------------------

I agree the frequency information is very useful, but I'm not sure the exact frequency number at just word-level is really that useful for spelling correction, assuming a normal zipfian distribution.

using the frequency as a basic guide: 'typo or non-typo', 'common or uncommon', etc might be the best use for it.


> File based spellcheck with doc frequencies supplied
> ---------------------------------------------------
>
>                 Key: LUCENE-1532
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1532
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: David Bowen
>            Priority: Minor
>
> The file-based spellchecker treats all words in the dictionary as equally valid, so it can suggest a very obscure word rather than a more common word which is equally close to the misspelled word that was entered.  It would be very useful to have the option of supplying an integer with each word which indicates its commonness.  I.e. the integer could be the document frequency in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by defining a DocFrequencyInfo interface for obtaining the doc frequency of a word, and a class which implements the interface by looking up the frequency in an index.  So Lucene users can provide alternative implementations of DocFrequencyInfo.  I could submit this as a patch if there is interest.  Alternatively, it might be better to just extend the spellcheck API to have a way to supply the frequencies when you create a PlainTextDictionary, but that would mean storing the frequencies somewhere when building the spellcheck index, and I'm not sure how best to do that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

Posted by "Mark Harwood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669906#action_12669906 ] 

Mark Harwood commented on LUCENE-1532:
--------------------------------------

Surely the biggest factor in picking the right spelling suggestion is to look at the other words the user has typed in the query? A quick search tells me there are 4 words used in the average Google query (I used 4 words to find this out). Measuring coocurrence of variants with the other query words seems like a much more useful measure than considering IDF of isolated word variants?

This may be useful : http://issues.apache.org/jira/browse/LUCENE-474?focusedCommentId=12358701#action_12358701 

> File based spellcheck with doc frequencies supplied
> ---------------------------------------------------
>
>                 Key: LUCENE-1532
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1532
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: David Bowen
>            Priority: Minor
>
> The file-based spellchecker treats all words in the dictionary as equally valid, so it can suggest a very obscure word rather than a more common word which is equally close to the misspelled word that was entered.  It would be very useful to have the option of supplying an integer with each word which indicates its commonness.  I.e. the integer could be the document frequency in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by defining a DocFrequencyInfo interface for obtaining the doc frequency of a word, and a class which implements the interface by looking up the frequency in an index.  So Lucene users can provide alternative implementations of DocFrequencyInfo.  I could submit this as a patch if there is interest.  Alternatively, it might be better to just extend the spellcheck API to have a way to supply the frequencies when you create a PlainTextDictionary, but that would mean storing the frequencies somewhere when building the spellcheck index, and I'm not sure how best to do that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1532) File based spellcheck with doc frequencies supplied

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1532:
--------------------------------

    Priority: Minor  (was: Major)

> File based spellcheck with doc frequencies supplied
> ---------------------------------------------------
>
>                 Key: LUCENE-1532
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1532
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: David Bowen
>            Priority: Minor
>
> The file-based spellchecker treats all words in the dictionary as equally valid, so it can suggest a very obscure word rather than a more common word which is equally close to the misspelled word that was entered.  It would be very useful to have the option of supplying an integer with each word which indicates its commonness.  I.e. the integer could be the document frequency in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by defining a DocFrequencyInfo interface for obtaining the doc frequency of a word, and a class which implements the interface by looking up the frequency in an index.  So Lucene users can provide alternative implementations of DocFrequencyInfo.  I could submit this as a patch if there is interest.  Alternatively, it might be better to just extend the spellcheck API to have a way to supply the frequencies when you create a PlainTextDictionary, but that would mean storing the frequencies somewhere when building the spellcheck index, and I'm not sure how best to do that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

Posted by "Eks Dev (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669018#action_12669018 ] 

Eks Dev commented on LUCENE-1532:
---------------------------------

bq. so it can suggest a very obscure word rather than a more common word which is equally close to the misspelled word that was entered

in my experience freq information brings there a lot, but is not linear. It is not always that word with higher frequency makes better suggestion. Common sense is that high frequency words get often misspelled in different ways in normal corpus. Making following patterns: 

HF(High Freiquency) Word against LF(Low Frequency) that is similar in edit distance sense is much more likely typo/misspelling than HF vs HF case. 

Similar cases with HF vs LF
"the" against "hte"
"think" vs "tihnk"

Very similar, but HF vs HF 
"think" vs "thing"

some cases that jump out of these ideas are synonyms, alternative spellings and very common mistakes. Very tricky to isolate just by using some distance measure and frequency. Her you need context.
similar and HF vs HF
"thomas" vs "tomas" sometimes spelling mistake, sometimes different names...

depends what you are trying to achieve, if you expect mistakes in query you are good if you assume HF suggestions are better, but if you go for high recall you need to cover cases where query  term is correct  you have to dig into your corpus to find incorrect words (Query "think about it" should find document containing "tihnk about it")

very challenging problem.... but cutting to the chase. The proposal is to make it possible to define
 float Function(Edit distance, Query_Token_Freq, Corpus_Token_Freq) that returns some measure that is higher  for more similar pairs considering edit distance and frequency (value that gets used as condition for priority queue) . Default could just work as you described. (It is maybe already possible, I  did not look at it). 

  



 


> File based spellcheck with doc frequencies supplied
> ---------------------------------------------------
>
>                 Key: LUCENE-1532
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1532
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: David Bowen
>
> The file-based spellchecker treats all words in the dictionary as equally valid, so it can suggest a very obscure word rather than a more common word which is equally close to the misspelled word that was entered.  It would be very useful to have the option of supplying an integer with each word which indicates its commonness.  I.e. the integer could be the document frequency in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by defining a DocFrequencyInfo interface for obtaining the doc frequency of a word, and a class which implements the interface by looking up the frequency in an index.  So Lucene users can provide alternative implementations of DocFrequencyInfo.  I could submit this as a patch if there is interest.  Alternatively, it might be better to just extend the spellcheck API to have a way to supply the frequencies when you create a PlainTextDictionary, but that would mean storing the frequencies somewhere when building the spellcheck index, and I'm not sure how best to do that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669605#action_12669605 ] 

Robert Muir commented on LUCENE-1532:
-------------------------------------

when you talk about hardcoding normalization, I really don't see where its unfair or even 'hardcoding' to assume a zipfian distribution in any corpus of text for incorporating the frequency weight....

I agree the specific corpus determines some of these properties but at the end of the day they all tend to have the same general distribution curve even if the specifics are different.

> File based spellcheck with doc frequencies supplied
> ---------------------------------------------------
>
>                 Key: LUCENE-1532
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1532
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: David Bowen
>            Priority: Minor
>
> The file-based spellchecker treats all words in the dictionary as equally valid, so it can suggest a very obscure word rather than a more common word which is equally close to the misspelled word that was entered.  It would be very useful to have the option of supplying an integer with each word which indicates its commonness.  I.e. the integer could be the document frequency in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by defining a DocFrequencyInfo interface for obtaining the doc frequency of a word, and a class which implements the interface by looking up the frequency in an index.  So Lucene users can provide alternative implementations of DocFrequencyInfo.  I could submit this as a patch if there is interest.  Alternatively, it might be better to just extend the spellcheck API to have a way to supply the frequencies when you create a PlainTextDictionary, but that would mean storing the frequencies somewhere when building the spellcheck index, and I'm not sure how best to do that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669502#action_12669502 ] 

Mark Miller commented on LUCENE-1532:
-------------------------------------

A little experimentation showed better results. It may depend though. I think its more useful when the dictionary contains lots of misspellings (many index based spellcheck indexes). In this case, I think its more important that docFreq play a role with edit distance to get good results (rather than just being an edit distance tie breaker). The fact that one term appeared 30,000 times and another 36,700 doesn't make much of a difference in spell checking. Words that are relatively similar in frequency are bucketed together, and then edit distance can judge from there. Especially with misspellings, this can work really well. The unaltered term frequencies are too widely distributed to be super helpful as part of a weight. Normalizing down allows the edit distance to play a stronger role, and keeps super frequent terms from clobbering good results. But it makes the more frequent terms more likely to be chosen as the suggestion. The edit distances will likely be similar too - but say one word beats another by a small edit distance - it can certainly make sense to choose the word that lost because its a 10 freq and the other word a 1 freq. You will satisfy more users. Even a 10 vs a 4 or 10 vs a 5 - you will likely guess better.

Keep in mind, I'm no expert on spell checking though.

I have a feeling that a similar move would be beneficial to a dictionary based spellchecker too. Breaking the freqs down into smaller buckets keeps insignificant differences from  playing a role in the correction. I'd love to test a little and see how straight edit distance compares to an edit distance / freq weight with a dictionary approach. I still wouldn't be surprised if slightly favoring more frequent words by allowing a bit of edit distance leeway wouldnt improve results. Saying this word is chosen because it beats the other by a slim edit distance, when the loser is a high frequency word in the language, and the winner a low, makes little sense.

I just kind of like the idea of unifying the two approaches also. Really just talking out loud though.

- Mark

> File based spellcheck with doc frequencies supplied
> ---------------------------------------------------
>
>                 Key: LUCENE-1532
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1532
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: David Bowen
>            Priority: Minor
>
> The file-based spellchecker treats all words in the dictionary as equally valid, so it can suggest a very obscure word rather than a more common word which is equally close to the misspelled word that was entered.  It would be very useful to have the option of supplying an integer with each word which indicates its commonness.  I.e. the integer could be the document frequency in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by defining a DocFrequencyInfo interface for obtaining the doc frequency of a word, and a class which implements the interface by looking up the frequency in an index.  So Lucene users can provide alternative implementations of DocFrequencyInfo.  I could submit this as a patch if there is interest.  Alternatively, it might be better to just extend the spellcheck API to have a way to supply the frequencies when you create a PlainTextDictionary, but that would mean storing the frequencies somewhere when building the spellcheck index, and I'm not sure how best to do that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Issue Comment Edited: (LUCENE-1532) File based spellcheck with doc frequencies supplied

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669589#action_12669589 ] 

markrmiller@gmail.com edited comment on LUCENE-1532 at 2/2/09 5:02 AM:
-------------------------------------------------------------

bq. but I'm not sure the exact frequency number at just word-level is really that useful for spelling correction, assuming a normal zipfian distribution. 

Thats what normalizing down takes care of. 1-10 is just out of the hat. You could do 1-3 and have low freq, med freq, hi freq. (note: i found that when normalizing, taking the top value as like the 90-95 percentile created a better distribution - knocks off a decent amount of outliers that can push everything else to lower freq values)

Consider I make a site called MarkMiller.com - its full of stuff about Mark Miller. In my dictionary is Mike Muller though, which is mentioned on the site twice. Mark Miller is mentioned thousands of times. Now if I type something like Mlller and it suggest Muller just using edit distance - that type of thing will create a lot of bad suggestions. Muller is practically unheard of on my site, but I am suggesting it over Miller which is all over the place. Edit distance by itself as the first cut off creates too many of these close bad suggestions. So its not that freq should be used heavily - but it can clear up these little oddities quite nicely.


      was (Author: markrmiller@gmail.com):
    bq. but I'm not sure the exact frequency number at just word-level is really that useful for spelling correction, assuming a normal zipfian distribution. 

Thats what normalizing down takes care of. 1-10 is just out of the hat. You could do 1-3 and have low freq, med freq, hi freq.

Consider I make a site called MarkMiller.com - its full of stuff about Mark Miller. In my dictionary is Mike Muller though, which is mentioned on the site twice. Mark Miller is mentioned thousands of times. Now if I type something like Mlller and it suggest Muller just using edit distance - that type of thing will create a lot of bad suggestions. Muller is practically unheard of on my site, but I am suggesting it over Miller which is all over the place. Edit distance by itself as the first cut off creates too many of these close bad suggestions. So its not that freq should be used heavily - but it can clear up this little oddities quite nicely.

  
> File based spellcheck with doc frequencies supplied
> ---------------------------------------------------
>
>                 Key: LUCENE-1532
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1532
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: David Bowen
>            Priority: Minor
>
> The file-based spellchecker treats all words in the dictionary as equally valid, so it can suggest a very obscure word rather than a more common word which is equally close to the misspelled word that was entered.  It would be very useful to have the option of supplying an integer with each word which indicates its commonness.  I.e. the integer could be the document frequency in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by defining a DocFrequencyInfo interface for obtaining the doc frequency of a word, and a class which implements the interface by looking up the frequency in an index.  So Lucene users can provide alternative implementations of DocFrequencyInfo.  I could submit this as a patch if there is interest.  Alternatively, it might be better to just extend the spellcheck API to have a way to supply the frequencies when you create a PlainTextDictionary, but that would mean storing the frequencies somewhere when building the spellcheck index, and I'm not sure how best to do that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669029#action_12669029 ] 

Mark Miller commented on LUCENE-1532:
-------------------------------------

Our spellchecking def needs improvement.

I like the idea of using a weight measure - something that combines frequency and edit distance. The spellchecker can make much better suggestions this way, and do things like only return a suggestion if it has a higher frequency or higher weight.

I've found that unaltered frequency is not a great stat to use though - it becomes much better if you do something like normalize freq to a value between 1-10. Then use that with the edit distance to calculate the weight. Or some such magic.

- Mark

> File based spellcheck with doc frequencies supplied
> ---------------------------------------------------
>
>                 Key: LUCENE-1532
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1532
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: David Bowen
>
> The file-based spellchecker treats all words in the dictionary as equally valid, so it can suggest a very obscure word rather than a more common word which is equally close to the misspelled word that was entered.  It would be very useful to have the option of supplying an integer with each word which indicates its commonness.  I.e. the integer could be the document frequency in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by defining a DocFrequencyInfo interface for obtaining the doc frequency of a word, and a class which implements the interface by looking up the frequency in an index.  So Lucene users can provide alternative implementations of DocFrequencyInfo.  I could submit this as a patch if there is interest.  Alternatively, it might be better to just extend the spellcheck API to have a way to supply the frequencies when you create a PlainTextDictionary, but that would mean storing the frequencies somewhere when building the spellcheck index, and I'm not sure how best to do that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669601#action_12669601 ] 

Robert Muir commented on LUCENE-1532:
-------------------------------------

I think we are on the same page here, I'm just suggesting that if the broad goal is to improve spellcheck, I think smarter distance metrics are also worth looking at.

In my tests I got significantly better results by tuning the ED function as mentioned, I also use freetts/cmudict to incorporate phonetic edit distance and average the two. (The idea being to help with true typos but also with genuinely bad spellers).  The downside to these tricks are that they are language-dependent.

For reference the other thing I will mention is aspell has some test data here: http://aspell.net/test/orig/ , maybe it is useful in some way?


> File based spellcheck with doc frequencies supplied
> ---------------------------------------------------
>
>                 Key: LUCENE-1532
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1532
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: David Bowen
>            Priority: Minor
>
> The file-based spellchecker treats all words in the dictionary as equally valid, so it can suggest a very obscure word rather than a more common word which is equally close to the misspelled word that was entered.  It would be very useful to have the option of supplying an integer with each word which indicates its commonness.  I.e. the integer could be the document frequency in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by defining a DocFrequencyInfo interface for obtaining the doc frequency of a word, and a class which implements the interface by looking up the frequency in an index.  So Lucene users can provide alternative implementations of DocFrequencyInfo.  I could submit this as a patch if there is interest.  Alternatively, it might be better to just extend the spellcheck API to have a way to supply the frequencies when you create a PlainTextDictionary, but that would mean storing the frequencies somewhere when building the spellcheck index, and I'm not sure how best to do that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

Posted by "David Bowen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669497#action_12669497 ] 

David Bowen commented on LUCENE-1532:
-------------------------------------

Mark, what is the reason for normalizing the doc frequency values?  Is the point to get some kind of scale that is comparable with edit distance?

I was assuming that edit distance would remain the primary criterion, and doc frequency would just be a tie-breaker (as is currently the case for the index-based spellcheck), but I could well imagine that some less simple combination of the two could give better results.

> File based spellcheck with doc frequencies supplied
> ---------------------------------------------------
>
>                 Key: LUCENE-1532
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1532
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: David Bowen
>            Priority: Minor
>
> The file-based spellchecker treats all words in the dictionary as equally valid, so it can suggest a very obscure word rather than a more common word which is equally close to the misspelled word that was entered.  It would be very useful to have the option of supplying an integer with each word which indicates its commonness.  I.e. the integer could be the document frequency in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by defining a DocFrequencyInfo interface for obtaining the doc frequency of a word, and a class which implements the interface by looking up the frequency in an index.  So Lucene users can provide alternative implementations of DocFrequencyInfo.  I could submit this as a patch if there is interest.  Alternatively, it might be better to just extend the spellcheck API to have a way to supply the frequencies when you create a PlainTextDictionary, but that would mean storing the frequencies somewhere when building the spellcheck index, and I'm not sure how best to do that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669576#action_12669576 ] 

Robert Muir commented on LUCENE-1532:
-------------------------------------

just a suggestion... I got better results by refining edit distance costs by keyboard layout (substituting a 'd' with an 'f' costs less than 'd' with 'j', and i also penalize less for transposition).

if you have lots of terms it helps for ed function to be able to discriminate terms better.

> File based spellcheck with doc frequencies supplied
> ---------------------------------------------------
>
>                 Key: LUCENE-1532
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1532
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: David Bowen
>            Priority: Minor
>
> The file-based spellchecker treats all words in the dictionary as equally valid, so it can suggest a very obscure word rather than a more common word which is equally close to the misspelled word that was entered.  It would be very useful to have the option of supplying an integer with each word which indicates its commonness.  I.e. the integer could be the document frequency in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by defining a DocFrequencyInfo interface for obtaining the doc frequency of a word, and a class which implements the interface by looking up the frequency in an index.  So Lucene users can provide alternative implementations of DocFrequencyInfo.  I could submit this as a patch if there is interest.  Alternatively, it might be better to just extend the spellcheck API to have a way to supply the frequencies when you create a PlainTextDictionary, but that would mean storing the frequencies somewhere when building the spellcheck index, and I'm not sure how best to do that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

Posted by "Eks Dev (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669595#action_12669595 ] 

Eks Dev commented on LUCENE-1532:
---------------------------------

.bq but I'm not sure the exact frequency number at just word-level is really that useful for spelling correction, assuming a normal zipfian distribution.

you are probably right, you cannot expect high resolution from frequency, but exact frequency information is your "source information". Clustering it on anything is just one algorithmic modification where, at the end, less information remains. Mark suggests 1-10, someone else would be happy with 1-3  ... who could tell? Therefore I would recommend real frequency information and leave possibility for end user to decide what to do with it. 

Frequency distribution is not simple measure, depends heavily on corpus composition, size. In one corpus doc. frequency of 3 means it is probably a typo, in another this means nothing...

My proposal is to work with real frequency as you have no information loss there ...  




      


> File based spellcheck with doc frequencies supplied
> ---------------------------------------------------
>
>                 Key: LUCENE-1532
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1532
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: David Bowen
>            Priority: Minor
>
> The file-based spellchecker treats all words in the dictionary as equally valid, so it can suggest a very obscure word rather than a more common word which is equally close to the misspelled word that was entered.  It would be very useful to have the option of supplying an integer with each word which indicates its commonness.  I.e. the integer could be the document frequency in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by defining a DocFrequencyInfo interface for obtaining the doc frequency of a word, and a class which implements the interface by looking up the frequency in an index.  So Lucene users can provide alternative implementations of DocFrequencyInfo.  I could submit this as a patch if there is interest.  Alternatively, it might be better to just extend the spellcheck API to have a way to supply the frequencies when you create a PlainTextDictionary, but that would mean storing the frequencies somewhere when building the spellcheck index, and I'm not sure how best to do that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669603#action_12669603 ] 

Mark Miller commented on LUCENE-1532:
-------------------------------------

Just to clear up the normalization (since I havn't done a good job of explaining it) -

If you use a weight that combines freq and edit distance (so that small edit distance wins don't completely decide the suggestion), using the real term frequencies will create math that explodes higher freq terms  unfairly. Normalizing down makes things a little better - in an index with term freqs that go from 1 to 400,000 - you don't want a term with freq:150,000 to be heavily preferred over something with freq:125,000. They should really be treated about the same, with edit distance the main factor.

- Mark

> File based spellcheck with doc frequencies supplied
> ---------------------------------------------------
>
>                 Key: LUCENE-1532
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1532
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: David Bowen
>            Priority: Minor
>
> The file-based spellchecker treats all words in the dictionary as equally valid, so it can suggest a very obscure word rather than a more common word which is equally close to the misspelled word that was entered.  It would be very useful to have the option of supplying an integer with each word which indicates its commonness.  I.e. the integer could be the document frequency in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by defining a DocFrequencyInfo interface for obtaining the doc frequency of a word, and a class which implements the interface by looking up the frequency in an index.  So Lucene users can provide alternative implementations of DocFrequencyInfo.  I could submit this as a patch if there is interest.  Alternatively, it might be better to just extend the spellcheck API to have a way to supply the frequencies when you create a PlainTextDictionary, but that would mean storing the frequencies somewhere when building the spellcheck index, and I'm not sure how best to do that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669589#action_12669589 ] 

Mark Miller commented on LUCENE-1532:
-------------------------------------

bq. but I'm not sure the exact frequency number at just word-level is really that useful for spelling correction, assuming a normal zipfian distribution. 

Thats what normalizing down takes care of. 1-10 is just out of the hat. You could do 1-3 and have low freq, med freq, hi freq.

Consider I make a site called MarkMiller.com - its full of stuff about Mark Miller. In my dictionary is Mike Muller though, which is mentioned on the site twice. Mark Miller is mentioned thousands of times. Now if I type something like Mlller and it suggest Muller just using edit distance - that type of thing will create a lot of bad suggestions. Muller is practically unheard of on my site, but I am suggesting it over Miller which is all over the place. Edit distance by itself as the first cut off creates too many of these close bad suggestions. So its not that freq should be used heavily - but it can clear up this little oddities quite nicely.


> File based spellcheck with doc frequencies supplied
> ---------------------------------------------------
>
>                 Key: LUCENE-1532
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1532
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: David Bowen
>            Priority: Minor
>
> The file-based spellchecker treats all words in the dictionary as equally valid, so it can suggest a very obscure word rather than a more common word which is equally close to the misspelled word that was entered.  It would be very useful to have the option of supplying an integer with each word which indicates its commonness.  I.e. the integer could be the document frequency in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by defining a DocFrequencyInfo interface for obtaining the doc frequency of a word, and a class which implements the interface by looking up the frequency in an index.  So Lucene users can provide alternative implementations of DocFrequencyInfo.  I could submit this as a patch if there is interest.  Alternatively, it might be better to just extend the spellcheck API to have a way to supply the frequencies when you create a PlainTextDictionary, but that would mean storing the frequencies somewhere when building the spellcheck index, and I'm not sure how best to do that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

Posted by "Eks Dev (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669579#action_12669579 ] 

Eks Dev commented on LUCENE-1532:
---------------------------------

.bq  I got better results by refining edit distance costs by keyboard layout 

Sure,  better distance helps a lot, but even in that case frequency information brings a lot. Frequency brings you some information about  corpus that is orthogonal to information you get from pure "word1" vs "word2" comparison. 



> File based spellcheck with doc frequencies supplied
> ---------------------------------------------------
>
>                 Key: LUCENE-1532
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1532
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: David Bowen
>            Priority: Minor
>
> The file-based spellchecker treats all words in the dictionary as equally valid, so it can suggest a very obscure word rather than a more common word which is equally close to the misspelled word that was entered.  It would be very useful to have the option of supplying an integer with each word which indicates its commonness.  I.e. the integer could be the document frequency in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by defining a DocFrequencyInfo interface for obtaining the doc frequency of a word, and a class which implements the interface by looking up the frequency in an index.  So Lucene users can provide alternative implementations of DocFrequencyInfo.  I could submit this as a patch if there is interest.  Alternatively, it might be better to just extend the spellcheck API to have a way to supply the frequencies when you create a PlainTextDictionary, but that would mean storing the frequencies somewhere when building the spellcheck index, and I'm not sure how best to do that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org