Posted to java-user@lucene.apache.org by Caroline Collet <ca...@pertimm.com> on 2016/06/07 09:20:05 UTC

Lucene DirectSpellChecker strange behavior

Hello,

I am seeing some very strange behavior with Lucene's DirectSpellChecker.
I have set the prefixLength to 0 and indexed a single item with a single
field: brand=samsung.
I then ran queries containing spelling mistakes.

When I search for "smsng" I get "samsung", which is logical since only
2 corrections are needed to reach "samsung".
When I search for "amsung" I get "samsung", since I have set the
prefixLength to 0.
But when I search for "amung", which also has only 2 errors, I do not get
"samsung"; I get nothing.

I don't understand this behaviour; it is as if no other correction is
permitted when the first letter is misspelled.

Did I miss some parameter of the spellchecker that could explain this
behavior?

For reference, I am using:
- Lucene 5.5.0
- JRE 1.8

Thank you in advance for taking the time to answer my question,
Best regards,
-- 
PERTIMM <http://www.pertimm.com/fr/>

Caroline Collet
Ingénieur développement

Tel : +33 (0)1 80 04 82 89
caroline.collet@pertimm.com
http://www.pertimm.com/fr/

Pertimm
51, boulevard Voltaire
92600 Asnières-Sur-Seine, France




Re: Lucene DirectSpellChecker strange behavior

Posted by Caroline Collet <ca...@pertimm.com>.
Thank you for your prompt reply; this makes perfect sense.

On 07/06/2016 17:24, Robert Muir wrote:
> It's just a heuristic: it does not allow 2 edits
> (insertion/deletion/substitution/transposition) to the word if the
> first character differs
> (https://github.com/apache/lucene-solr/blob/master/lucene/suggest/src/java/org/apache/lucene/search/spell/DirectSpellChecker.java#L411).
> So when it goes back for n=2, it requires the first character to match.
>
> At least at the time the thing was written, this had a very large
> impact on performance, because otherwise too much of the term
> dictionary must be inspected and it is much slower. The idea is that
> it won't hurt quality too much, for the same reason that many string
> distance functions incorporate a bias towards a matching prefix
> (e.g. Jaro-Winkler).




Re: Lucene DirectSpellChecker strange behavior

Posted by Robert Muir <rc...@gmail.com>.
It's just a heuristic: it does not allow 2 edits
(insertion/deletion/substitution/transposition) to the word if the first
character differs
(https://github.com/apache/lucene-solr/blob/master/lucene/suggest/src/java/org/apache/lucene/search/spell/DirectSpellChecker.java#L411).
So when it goes back for n=2, it requires the first character to match.
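
To make the heuristic concrete, here is a minimal sketch in plain Java that reproduces the three queries from the original question. It is a simplified model, not Lucene's actual code: the class and method names are made up for illustration, and the rule "required prefix = max(minPrefix, n - 1)" is an assumption chosen to match the behaviour described above. It also ignores other filters a real spellchecker applies, such as accuracy thresholds and term frequencies.

```java
/** Simplified model of the prefix heuristic; hypothetical names, NOT Lucene's code. */
final class PrefixHeuristicSketch {

    /** Edit distance counting insertion, deletion, substitution and adjacent transposition. */
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
                // Adjacent transposition counts as a single edit.
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    /** One pass with an edit budget of n. Assumption: required prefix = max(minPrefix, n - 1). */
    static boolean matchesAt(String query, String term, int minPrefix, int n) {
        int prefix = Math.max(minPrefix, n - 1);
        if (query.length() < prefix || term.length() < prefix) return false;
        if (!term.startsWith(query.substring(0, prefix))) return false;
        return editDistance(query, term) <= n;
    }

    /** Try n = 1 first, then go back for n = 2 (and so on up to maxEdits). */
    static boolean suggests(String query, String term, int minPrefix, int maxEdits) {
        for (int n = 1; n <= maxEdits; n++) {
            if (matchesAt(query, term, minPrefix, n)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String[] queries = {"smsng", "amsung", "amung"};
        for (String q : queries) {
            boolean hit = suggests(q, "samsung", 0, 2);
            System.out.println(q + " -> " + (hit ? "samsung" : "(nothing)"));
        }
        // Prints:
        // smsng -> samsung
        // amsung -> samsung
        // amung -> (nothing)
    }
}
```

In this model "amung" is 2 edits away from "samsung", exactly like "smsng", but it is rejected in the n=2 pass because its first character differs.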

At least at the time the thing was written, this had a very large impact on
performance, because otherwise too much of the term dictionary must be
inspected and it is much slower. The idea is that it won't hurt quality too
much, for the same reason that many string distance functions incorporate a
bias towards a matching prefix (e.g. Jaro-Winkler).
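
As an illustration of that prefix bias, the following is a textbook-style Jaro-Winkler sketch in plain Java. It is not taken from Lucene; the class name is hypothetical, and the scaling factor 0.1 and the maximum prefix of 4 characters are the commonly used defaults, assumed here rather than read from any particular library.

```java
/** Textbook-style Jaro-Winkler; hypothetical class, assumed defaults (0.1 scaling, prefix <= 4). */
final class JaroWinklerSketch {

    /** Jaro similarity in [0, 1]. */
    static double jaro(String s, String t) {
        if (s.isEmpty() && t.isEmpty()) return 1.0;
        // Characters only count as matching within this distance of each other.
        int window = Math.max(0, Math.max(s.length(), t.length()) / 2 - 1);
        boolean[] sMatched = new boolean[s.length()];
        boolean[] tMatched = new boolean[t.length()];
        int matches = 0;
        for (int i = 0; i < s.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(t.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!tMatched[j] && s.charAt(i) == t.charAt(j)) {
                    sMatched[i] = true;
                    tMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Count matched characters that line up in a different order.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < s.length(); i++) {
            if (!sMatched[i]) continue;
            while (!tMatched[k]) k++;
            if (s.charAt(i) != t.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        int transpositions = outOfOrder / 2;
        return (m / s.length() + m / t.length() + (m - transpositions) / m) / 3.0;
    }

    /** Jaro similarity boosted by the length of the common prefix (at most 4 characters). */
    static double jaroWinkler(String s, String t) {
        double j = jaro(s, t);
        int limit = Math.min(4, Math.min(s.length(), t.length()));
        int prefix = 0;
        while (prefix < limit && s.charAt(prefix) == t.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }

    public static void main(String[] args) {
        // Two made-up misspellings, each one substitution away from "samsung".
        System.out.println("samsunk: " + jaroWinkler("samsung", "samsunk"));
        System.out.println("xamsung: " + jaroWinkler("samsung", "xamsung"));
    }
}
```

Both misspellings have the same plain Jaro similarity to "samsung", but "samsunk" gets the higher Jaro-Winkler score because its first four characters are intact, while "xamsung" gets no prefix bonus at all.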

