Posted to general@lucene.apache.org by Artem Lukanin <ic...@mail.ru> on 2013/05/30 14:26:46 UTC

minFuzzyLength in FuzzySuggester behaves differently for English and Russian

minFuzzyLength is measured in bytes, which I think is wrong, because it is
expected to be measured in letters. In English the word "table" is 5 bytes,
but in Russian the word "книга" is 10 bytes in UTF-8, though it has only 5
letters. If I have English and Russian words in one field, I have to multiply
minFuzzyLength by 2 whenever the current query contains Russian letters.
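
A minimal sketch of the byte/letter mismatch described above, using only the
JDK (no Lucene classes). The class and method names are hypothetical and exist
only for this illustration; the Russian word is written with Unicode escapes
so the example does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthDemo {
    // Length as a byte-based comparison would see it: number of UTF-8 bytes.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Length in Unicode code points, which equals the letter count for these words.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String english = "table";
        // The 5-letter Russian word from the message, spelled with Unicode escapes.
        String russian = "\u043a\u043d\u0438\u0433\u0430";
        for (String word : new String[] {english, russian}) {
            System.out.println(word + ": " + utf8Length(word) + " bytes, "
                    + codePointLength(word) + " code points");
        }
    }
}
```

Both words have 5 code points, but the Cyrillic word takes 2 bytes per letter
in UTF-8, which is where the factor of 2 comes from.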

Though this hack works, it is still wrong, because swapping or substituting
individual bytes inside a Russian letter does not model a typo. Every arc in
the FST should be a letter, not a byte.
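
To illustrate why byte-level edits do not correspond to typos, here is a small
JDK-only sketch (hypothetical class and method names, not Lucene code). It
takes the UTF-8 bytes of two Cyrillic letters and swaps each adjacent pair of
bytes, then checks whether the result still decodes as UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ByteSwapDemo {
    // Strict check: returns false if the bytes are not well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Returns a copy of the input with bytes i and i+1 exchanged.
    static byte[] swapAdjacent(byte[] bytes, int i) {
        byte[] out = bytes.clone();
        byte tmp = out[i];
        out[i] = out[i + 1];
        out[i + 1] = tmp;
        return out;
    }

    public static void main(String[] args) {
        // First two letters of the Russian word; UTF-8 bytes are D0 BA D0 BD.
        byte[] original = "\u043a\u043d".getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i + 1 < original.length; i++) {
            byte[] edited = swapAdjacent(original, i);
            System.out.println("swap bytes " + i + " and " + (i + 1)
                    + ": valid UTF-8 = " + isValidUtf8(edited));
        }
    }
}
```

For this input every single byte swap yields a malformed sequence, so none of
them can match a real word with two letters exchanged.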



--
View this message in context: http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-tp4067018.html
Sent from the Lucene - General mailing list archive at Nabble.com.

Re: Re[2]: minFuzzyLength in FuzzySuggester behaves differently for English and Russian

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Wed, Jun 5, 2013 at 2:51 AM, Artem Lukanin <ic...@mail.ru> wrote:
>  OK, I will try to do it myself.

Thank you!

> As I understand I have to clone lucene_solr_4_3 from https://github.com/apache/lucene-solr.git and upload a patch to the issue for review?

I'm not a git user, but that sounds right!  See here for more details:

    http://wiki.apache.org/lucene-java/HowToContribute

Mike McCandless

http://blog.mikemccandless.com

Re[2]: minFuzzyLength in FuzzySuggester behaves differently for English and Russian

Posted by Artem Lukanin <ic...@mail.ru>.
OK, I will try to do it myself. As I understand it, I have to clone lucene_solr_4_3 from https://github.com/apache/lucene-solr.git and upload a patch to the issue for review?
>Thanks Artem.  If you have time/energy to work out a patch that would
>be great :)
>
>Mike McCandless
>
>http://blog.mikemccandless.com
>
>
>On Mon, Jun 3, 2013 at 7:17 AM, Artem Lukanin < [hidden email] > wrote:
>> I have opened an issue:  https://issues.apache.org/jira/browse/LUCENE-5030
>>
>>
>>






Re: minFuzzyLength in FuzzySuggester behaves differently for English and Russian

Posted by Michael McCandless <lu...@mikemccandless.com>.
Thanks Artem.  If you have time/energy to work out a patch that would
be great :)

Mike McCandless

http://blog.mikemccandless.com


On Mon, Jun 3, 2013 at 7:17 AM, Artem Lukanin <ic...@mail.ru> wrote:
> I have opened an issue: https://issues.apache.org/jira/browse/LUCENE-5030
>
>
>

Re: minFuzzyLength in FuzzySuggester behaves differently for English and Russian

Posted by Artem Lukanin <ic...@mail.ru>.
I have opened an issue: https://issues.apache.org/jira/browse/LUCENE-5030




Re: minFuzzyLength in FuzzySuggester behaves differently for English and Russian

Posted by Michael McCandless <lu...@mikemccandless.com>.
This unfortunately is a limitation of the current FuzzySuggester
implementation: it computes edits in UTF-8 space instead of Unicode
character (code point) space.

This should be fixable: we'd need to fix TokenStreamToAutomaton to
work in Unicode character space, then fix FuzzySuggester to do the
same steps that FuzzyQuery does: do the LevN expansion in Unicode
character space, then convert that automaton to UTF-8, then intersect
with the suggest FST.
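
The ordering of steps described above (edit in Unicode space first, convert to
UTF-8 second) can be sketched without Lucene. The following is only a
plain-Java stand-in under stated assumptions: it enumerates single
transpositions explicitly instead of building a Levenshtein automaton, it does
not touch TokenStreamToAutomaton or the suggest FST, and all class and method
names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class CodePointEditsDemo {
    // Step 1: produce edited variants in code point space.
    // Here: every string obtained by swapping one pair of adjacent code points.
    static List<String> transpositions(String s) {
        int[] cps = s.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < cps.length; i++) {
            if (cps[i] == cps[i + 1]) {
                continue; // swapping equal code points changes nothing
            }
            int[] copy = cps.clone();
            int tmp = copy[i];
            copy[i] = copy[i + 1];
            copy[i + 1] = tmp;
            out.add(new String(copy, 0, copy.length));
        }
        return out;
    }

    // Step 2: only after editing, convert each variant to UTF-8 bytes,
    // which is the form a byte-based index would be matched against.
    static List<byte[]> toUtf8(List<String> variants) {
        List<byte[]> out = new ArrayList<>();
        for (String v : variants) {
            out.add(v.getBytes(StandardCharsets.UTF_8));
        }
        return out;
    }

    public static void main(String[] args) {
        String query = "\u043a\u043d\u0438\u0433\u0430"; // 5 Cyrillic letters
        List<String> variants = transpositions(query);
        List<byte[]> encoded = toUtf8(variants);
        for (int i = 0; i < variants.size(); i++) {
            System.out.println(variants.get(i) + " -> " + encoded.get(i).length + " bytes");
        }
    }
}
```

Because the edit happens before encoding, each variant costs exactly one edit
in letters and always encodes to well-formed UTF-8.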

Could you open an issue for this?  I won't have any time soon to work
on this but we should open an issue to discuss / see if someone else
has time / iterate. Thanks!

Mike McCandless

http://blog.mikemccandless.com


On Thu, May 30, 2013 at 8:39 AM, Artem Lukanin <ic...@mail.ru> wrote:
> BTW, I have to set maxEdits=2 to allow letter transpositions in Russian,
> because there will be actually 2 transpositions of 4 bytes representing 2
> Russian letters in UTF-8.
>
> The worst case is when one field has both Russian and English letters (or
> e.g. numbers), where I have to use minFuzzyLength=6 and maxEdits=2, which
> will work only for Russian words of more than 2 letters and for English
> words of more than 5 letters!
>
>
>

Re: minFuzzyLength in FuzzySuggester behaves differently for English and Russian

Posted by Artem Lukanin <ic...@mail.ru>.
Artem Lukanin wrote
> BTW, I have to set maxEdits=2 to allow letter transpositions in Russian,
> because there will be actually 2 transpositions of 4 bytes representing 2
> Russian letters in UTF-8.

This is true only for the transposition of the first 2 letters (when
nonFuzzyPrefix=0).




Re: minFuzzyLength in FuzzySuggester behaves differently for English and Russian

Posted by Artem Lukanin <ic...@mail.ru>.
BTW, I have to set maxEdits=2 to allow letter transpositions in Russian,
because there will be actually 2 transpositions of 4 bytes representing 2
Russian letters in UTF-8.
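
A rough check of the edit counts, as a sketch only: the code below computes a
Damerau-Levenshtein distance (the restricted "optimal string alignment"
variant, chosen here for brevity and not taken from Lucene) once over UTF-8
bytes and once over code points. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class EditDistanceDemo {
    // Optimal string alignment distance: insert, delete, substitute,
    // and transposition of two adjacent symbols each cost 1.
    static int osaDistance(int[] a, int[] b) {
        int[][] d = new int[a.length + 1][b.length + 1];
        for (int i = 0; i <= a.length; i++) d[i][0] = i;
        for (int j = 0; j <= b.length; j++) d[0][j] = j;
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
                if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1]) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length][b.length];
    }

    // The string as a sequence of unsigned UTF-8 byte values.
    static int[] utf8Symbols(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int[] out = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) out[i] = bytes[i] & 0xFF;
        return out;
    }

    // The string as a sequence of Unicode code points.
    static int[] codePointSymbols(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String word = "\u043a\u043d\u0438\u0433\u0430";  // original spelling
        String typo = "\u043d\u043a\u0438\u0433\u0430";  // first two letters exchanged
        System.out.println("distance in bytes: "
                + osaDistance(utf8Symbols(word), utf8Symbols(typo)));
        System.out.println("distance in code points: "
                + osaDistance(codePointSymbols(word), codePointSymbols(typo)));
    }
}
```

For this pair the typo is 1 edit away in code points but 2 edits away in
bytes, which matches the need for maxEdits=2 reported above.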

The worst case is when one field has both Russian and English letters (or
e.g. numbers), where I have to use minFuzzyLength=6 and maxEdits=2, which
will work only for Russian words of more than 2 letters and for English
words of more than 5 letters!
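
The arithmetic behind minFuzzyLength=6 can be sketched as follows. This is an
illustration with hypothetical names, and it assumes that fuzzy matching is
enabled once the key length is at least minFuzzyLength; that comparison is an
assumption chosen to agree with the numbers in this message, not a quote of
the FuzzySuggester source.

```java
import java.nio.charset.StandardCharsets;

public class MinFuzzyLengthDemo {
    // Assumed byte-based rule: the length counted is the number of UTF-8 bytes.
    static boolean fuzzyAllowedByBytes(String word, int minFuzzyLength) {
        return word.getBytes(StandardCharsets.UTF_8).length >= minFuzzyLength;
    }

    // Letter-based alternative: the length counted is the number of code points.
    static boolean fuzzyAllowedByCodePoints(String word, int minFuzzyLength) {
        return word.codePointCount(0, word.length()) >= minFuzzyLength;
    }

    public static void main(String[] args) {
        String[] words = {
            "\u0434\u0430",        // 2 Cyrillic letters, 4 bytes
            "\u043a\u043e\u0442",  // 3 Cyrillic letters, 6 bytes
            "table",               // 5 Latin letters, 5 bytes
            "tables"               // 6 Latin letters, 6 bytes
        };
        for (String w : words) {
            System.out.println(w + ": bytes rule (6) = " + fuzzyAllowedByBytes(w, 6)
                    + ", code point rule (3) = " + fuzzyAllowedByCodePoints(w, 3));
        }
    }
}
```

Under the byte rule a 3-letter Russian word qualifies while a 5-letter English
word does not; a threshold counted in code points would treat both scripts the
same way.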


