Posted to solr-user@lucene.apache.org by Thomas Michael Engelke <th...@posteo.de> on 2014/09/01 13:43:56 UTC

Solr spellcheck returns more than 1 word for a 1 word spellcheck

I'm in the process of incorporating Solr spellchecking into our product. For that, I've created a new field:

 <field name="spell" type="spell" indexed="true" stored="true" required="false" multiValued="false"/>
 <copyField source="name" dest="spell" maxChars="30000"/>

And in the fieldType definitions:

 <fieldType name="spell" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
   </analyzer>
 </fieldType>

Then I feed the names of products into the corresponding core. They can have a lot of words (examples):

 door lock rear left
 Door brake, door in front + rear fitting.

However, the names get pretty long, and in the source data, they have been truncated. This sometimes leaves parts of words at the end:

 The water pump can evacuate some coo

I have created a spellcheck component, feeding off the `spell` field defined earlier. Now for the problem.

Sometimes, when I look up a slightly misspelled word, I get results I do not expect. Example request:

 http://solr.url:8983/solr/en/spell?q=coole

This is (part of) the response:

 <str name="word">cooler</str><int name="freq">21</int>
 <str name="word">coo le</str><int name="freq">2</int>
 <str name="word">cable</str><int name="freq">334</int>
 <str name="word">co o le</str><int name="freq">4</int>
 [...]

Now, as you can see, the misspelled `coole` should have been `cooler`, and that is the first suggestion. However, the second and fourth suggestions baffle me. After a bit of research, I found them to be multiple words lumped together. As I described above, `coo` was part of a name that was truncated. I found `co` the same way, and the source data contains a small number of `o` characters on their own (product number names).

Now, my question is: why is Solr suggesting multiple words pasted together when spellchecking a single word? Is there a way to prevent Solr from pasting together word parts to forge suggestions?

RE: Solr spellcheck returns more than 1 word for a 1 word spellcheck

Posted by "Dyer, James" <Ja...@ingramcontent.com>.

This is the WordBreakSolrSpellChecker, which is there to correct spelling errors involving misplaced whitespace (or is it "white space"?). To disable it, remove this line, or one like it, from your requestHandler in solrconfig.xml:

<str name="spellcheck.dictionary">wordbreak</str>
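
For illustration, a handler with the word-break dictionary taken out might look roughly like the sketch below. The handler name `/spell` is taken from the request URL earlier in the thread; the dictionary name `default` and the other values are assumptions modeled on the stock example configuration, not the poster's actual settings.

```xml
<!-- Sketch only: handler name taken from the /spell request above;
     dictionary name "default" and the remaining values are assumed. -->
<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="spellcheck">on</str>
    <!-- Keep the regular, index-based dictionary. -->
    <str name="spellcheck.dictionary">default</str>
    <!-- Removed, so WordBreakSolrSpellChecker no longer contributes
         multi-word suggestions such as "coo le":
    <str name="spellcheck.dictionary">wordbreak</str>
    -->
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.count">10</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

A dictionary can also be chosen per request with the `spellcheck.dictionary` URL parameter, which overrides whatever is listed under `defaults`.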

Keep in mind, if you want the best of both worlds, you can keep the wordbreak dictionary and use the "collation" feature, which tries to pick the combination of spelling corrections that best fixes your user's query. See http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.collate and the following sections.
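
As a rough sketch of the collation route, the handler defaults could keep both dictionaries and switch collation on. The dictionary names and all numeric values below are assumptions chosen for illustration and would need tuning against the real index.

```xml
<!-- Sketch only: dictionary names and numeric values are assumed. -->
<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="spellcheck">on</str>
    <!-- Both spellcheckers stay active. -->
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.dictionary">wordbreak</str>
    <!-- Build whole-query rewrites from the individual suggestions. -->
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.collateExtendedResults">true</str>
    <!-- A value above 0 makes Solr test candidate collations against
         the index and return only those that produce hits. -->
    <str name="spellcheck.maxCollationTries">10</str>
    <str name="spellcheck.maxCollations">5</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

With this in place, the client would read the collations from the response rather than the raw per-word suggestion list.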

James Dyer
Ingram Content Group
(615) 213-4311

