Posted to solr-user@lucene.apache.org by Germán Biozzoli <ge...@gmail.com> on 2009/12/07 06:03:04 UTC

Spellchecking - Is there a way to do this?

Hello everybody

1. I have tons of digitized text containing the errors you would expect
from the OCR process.
2. I have indexed it with Solr, and that is working OK.
3. I have added an index-based spellchecker for words and phrases, hoping
to offer suggestions of "suspicious" alternative query expressions, or
query expressions related to the current one, in order to find documents
that contain the original expression with OCR errors (for example, the
user searches for "state and democracy" and the interface offers "stete
and demcraci" as an alternate query expression).

My first problem is that I need suggestions even when the expression
has returned results. It seems that suggestions only appear when there
are no results. Is there a way to get them in both cases?

The second question is: for the purposes I've mentioned, is it best to
use the spellchecker or the MLT component? Or something else (such as a
fuzzy query)?

Thanks a lot
German

Re: Spellchecking - Is there a way to do this?

Posted by Lance Norskog <go...@gmail.com>.
Another thing you might check into is stemming. The Porter stemmer
included in Solr is "aggressive", meaning that it will tend to do
weird things with misspellings. A different stemmer called KStem,
available from www.lucidimagination.com/Downloads, is less aggressive.
Porter turns "changes" and "changing" into "chang", while KStem does
not go this far.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem

On Thu, Dec 17, 2009 at 12:59 PM, Lance Norskog <go...@gmail.com> wrote:
> Character-based NGrams are a good tool for this problem. MLT is a
> document-wide numerical analysis.
>
> If the common types of OCR mistakes are different from what NGrams
> create, you might tune the ngram generator. For example, swapping
> letters might not happen very often. Single- and multi-word errors
> must happen a lot.
>
> If you do a facet query on your indexed terms, you will get a lot of
> facets with only one appearance in the index. These are often
> misspellings. It is possible to automate pulling these and creating a
> matching set of synonyms for words that appear in the spelling index.
>



-- 
Lance Norskog
goksron@gmail.com

Re: Spellchecking - Is there a way to do this?

Posted by Lance Norskog <go...@gmail.com>.
Character-based NGrams are a good tool for this problem. MLT is a
document-wide numerical analysis.

If the common types of OCR mistakes are different from what NGrams
create, you might tune the ngram generator. For example, swapping
letters might not happen very often. Single- and multi-word errors
must happen a lot.
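
To make the n-gram idea concrete, here is a minimal Python sketch. It is not Solr's own implementation: it simply scores two words by the Jaccard overlap of their padded character bigrams. The function names, the toy vocabulary, and the 0.3 threshold are all assumptions chosen for illustration.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of `word`, padded with '_' at both ends."""
    padded = "_" + word.lower() + "_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=2):
    """Jaccard overlap between the n-gram sets of two words (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def ocr_variants(term, vocabulary, threshold=0.3, n=2):
    """Return vocabulary words that resemble `term`, most similar first."""
    scored = [(ngram_similarity(term, w, n), w) for w in vocabulary if w != term]
    return [w for score, w in sorted(scored, reverse=True) if score >= threshold]


# Hypothetical index vocabulary mixing real words and OCR-damaged ones.
vocabulary = ["state", "stete", "democracy", "demcraci", "statue", "solr"]
print(ocr_variants("state", vocabulary))
print(ocr_variants("democracy", vocabulary))
```

With this toy vocabulary the real word "statue" scores above the OCR variant "stete" for the query "state", which is the sort of mismatch that tuning the n-gram size or the scoring would have to deal with.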

If you do a facet query on your indexed terms, you will get a lot of
facets with only one appearance in the index. These are often
misspellings. It is possible to automate pulling these and creating a
matching set of synonyms for words that appear in the spelling index.
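
The singleton-to-synonym step could be automated roughly as follows. This is a sketch under assumptions: the term counts come from an in-memory token list rather than a real Solr facet response, closeness is measured with Python's difflib (a stand-in, not anything Solr provides), and the helper names and the 0.6 cutoff are hypothetical. The output uses the comma-separated "a, b" style of a Solr synonyms file.

```python
from collections import Counter
from difflib import get_close_matches


def singleton_synonyms(tokens, cutoff=0.6):
    """Map each term seen exactly once to its closest term seen more than once."""
    counts = Counter(tokens)
    frequent = [t for t, c in counts.items() if c > 1]
    singletons = [t for t, c in counts.items() if c == 1]
    mapping = {}
    for term in singletons:
        match = get_close_matches(term, frequent, n=1, cutoff=cutoff)
        if match:  # singletons with no close frequent term are left alone
            mapping[term] = match[0]
    return mapping


def to_synonym_lines(mapping):
    """Group variants under their target word, one comma-separated line per group."""
    groups = {}
    for variant, target in mapping.items():
        groups.setdefault(target, []).append(variant)
    return [", ".join([target] + sorted(variants))
            for target, variants in sorted(groups.items())]


# Hypothetical stand-in for the indexed terms and their frequencies.
tokens = "state state stete democracy democracy demcraci solr".split()
mapping = singleton_synonyms(tokens)
for line in to_synonym_lines(mapping):
    print(line)
```

A term that appears once is not necessarily a misspelling (rare but legitimate words look the same), so the generated lines would need review before being used.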

On Tue, Dec 15, 2009 at 12:57 PM, Chris Hostetter
<ho...@fucit.org> wrote:
>
> : My first problem is that I need suggestions even when the expression
> : has returned results. It seems that suggestions only appear when there
> : are no results. Is there a way to get them in both cases?
>
> can you give us an example of what your queries look like?  with the
> example configs, i can get matches, as well as suggestions...
>
>
> http://localhost:8983/solr/spell?q=ide&spellcheck=true
>
> : The second question is: for the purposes I've mentioned, is it best to
> : use the spellchecker or the MLT component? Or something else (such as a
> : fuzzy query)?
>
> there's no clear cut answer to that -- i don't remember anyone else ever
> asking about anything particularly similar to what you're doing, so i
> don't know that there is any precedent for a "best" way to go about it.
>
>
>
> -Hoss
>
>



-- 
Lance Norskog
goksron@gmail.com

Re: Spellchecking - Is there a way to do this?

Posted by Chris Hostetter <ho...@fucit.org>.
: My first problem is that I need suggestions even when the expression
: has returned results. It seems that suggestions only appear when there
: are no results. Is there a way to get them in both cases?

can you give us an example of what your queries look like?  with the 
example configs, i can get matches, as well as suggestions...


http://localhost:8983/solr/spell?q=ide&spellcheck=true
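
For anyone scripting this check, a request like the one above can be assembled as in the sketch below. It assumes the /spell handler from the example configs at the base URL shown; the helper name is hypothetical, and spellcheck.count and spellcheck.extendedResults are SpellCheckComponent parameters added here only for illustration. Whether suggestions come back alongside matches still depends on how the spellchecker is configured.

```python
from urllib.parse import urlencode


def spellcheck_url(query, base="http://localhost:8983/solr/spell", count=5):
    """Build a query URL that asks for search results and spelling suggestions."""
    params = {
        "q": query,
        "spellcheck": "true",                  # turn the spellcheck component on
        "spellcheck.count": count,             # max suggestions per term
        "spellcheck.extendedResults": "true",  # include term frequency details
    }
    return base + "?" + urlencode(params)


url = spellcheck_url("ide")
print(url)
print(spellcheck_url("state and democracy"))
```

The script only builds the URL; fetching it and reading the spellcheck section of the response is left to whatever HTTP client is at hand.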

: The second question is: for the purposes I've mentioned, is it best to
: use the spellchecker or the MLT component? Or something else (such as a
: fuzzy query)?

there's no clear cut answer to that -- i don't remember anyone else ever 
asking about anything particularly similar to what you're doing, so i 
don't know that there is any precedent for a "best" way to go about it.



-Hoss