Posted to solr-user@lucene.apache.org by Rob Koeling <ro...@gmail.com> on 2012/11/14 01:18:21 UTC

Has anyone got HunspellStemFilterFactory working?

If so, would you be willing to share the .dic and .aff files with me?
When I try to load a dictionary file, Solr complains:

java.lang.RuntimeException: java.io.IOException: Unable to load hunspell data! [dictionary=en_GB.dic,affix=en_GB.aff]
    at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:116)
.......
Caused by: java.text.ParseException: The first non-comment line in the affix file must be a 'SET charset', was: 'FLAG num'
    at org.apache.lucene.analysis.hunspell.HunspellDictionary.getDictionaryEncoding(HunspellDictionary.java:306)
    at org.apache.lucene.analysis.hunspell.HunspellDictionary.<init>(HunspellDictionary.java:130)
    at org.apache.lucene.analysis.hunspell.HunspellStemFilterFactory.inform(HunspellStemFilterFactory.java:103)
    ... 46 more

When I change the first line to 'SET charset', Solr is still not happy. I got
the dictionary files from the OpenOffice website.

I'm using Solr 4.0 (but had the same problem with 3.6).

  - Rob

Re: Has anyone got HunspellStemFilterFactory working?

Posted by Rob Koeling <ro...@gmail.com>.

Thanks for your reply, Sergey!

Well, I was a bit puzzled. I had already tried adding a line to set the
character set, but Solr complained about that as well.
I then installed the Russian dictionary, and Solr was happy to load it. I
noticed that, for Russian, the character set was declared only in the affix
file. So when I added the line 'SET UTF-8' only to the affix file for en_GB,
all was well. I must have added that same line to the .dic file as well
before, and I suppose that was what Solr was complaining about.

I just checked, and that does seem to be the case. The character set
should be set only on the first line of the .aff file; the .dic file should
be left alone.
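
For anyone who wants to check their own files, here is a minimal Python sketch of that rule. It assumes, based only on the error message above, that the loader skips blank lines and lines starting with '#' and then expects 'SET <encoding>'. The function names are made up for illustration and are not part of Solr or Lucene. The default of UTF-8 is also just an assumption: the declared encoding has to match how the .aff file is really encoded.

```python
def first_directive(aff_text):
    """Return the first non-blank, non-comment line of an affix file, or None."""
    for line in aff_text.splitlines():
        stripped = line.lstrip("\ufeff").strip()  # tolerate a leading BOM
        if stripped and not stripped.startswith("#"):
            return stripped
    return None


def has_leading_set(aff_text):
    """True if the first directive has the form 'SET <encoding>'."""
    directive = first_directive(aff_text)
    if directive is None:
        return False
    parts = directive.split()
    return len(parts) == 2 and parts[0] == "SET"


def ensure_leading_set(aff_text, default_encoding="UTF-8"):
    """Return affix text whose first line is a SET directive.

    An existing SET line found further down is moved to the top, so the
    encoding already declared by the file wins over the (assumed) default.
    Only the .aff text is handled here; the .dic file is never touched.
    """
    if has_leading_set(aff_text):
        return aff_text
    encoding = default_encoding
    kept = []
    for line in aff_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "SET":
            encoding = parts[1]
            continue
        kept.append(line)
    return "\n".join(["SET " + encoding] + kept) + "\n"


if __name__ == "__main__":
    sample = "# affix file without a leading SET line\nFLAG num\n"
    print(first_directive(sample))
    print(ensure_leading_set(sample), end="")
```

In practice you would read the .aff file with the encoding it is actually stored in, pass the text through ensure_leading_set, and write it back with that same encoding.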

Thanks again Sergey, that was very useful.

Best,

   - Rob




On Wed, Nov 14, 2012 at 11:08 AM, Сергей Бирюков <ka...@yandex.ru> wrote:
> [...]

Re: Has anyone got HunspellStemFilterFactory working?

Posted by Сергей Бирюков <ka...@yandex.ru>.

Rob, as regards your "problem":
> 'SET charset'
The word 'charset' must be replaced with the name of a character set (i.e. an encoding).
For example, you can write 'SET UTF-8'.

BUT...

----

Be careful!
At least for Russian morphology, HunspellStemFilterFactory has
bug(s) in its algorithm.

A simple comparison with the original hunspell library shows a huge difference.


For example, on Linux x86_64 (Ubuntu 12.10):

1) INSTALL:
# sudo apt-get install hunspell hunspell-ru


2) TEST with the string "мама мыла раму мелом"
(meaning: "mom was_washing frame (with) chalk"):

2.1) OS hunspell library
# echo "мама мыла раму мелом" | hunspell -d ru_RU -D -m
gives results:
...
     LOADED DICTIONARY:
     /usr/share/hunspell/ru_RU.aff
     /usr/share/hunspell/ru_RU.dic

     мама  -> мама
     мыла  -> мыло | мыть     <<< as noun | as verb
     раму  -> рама
     мелом -> мел

2.2) Solr's HunspellStemFilterFactory
fieldType config:
     <fieldType name="text_hunspell" class="solr.TextField" positionIncrementGap="100">
       <analyzer>
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         <filter class="solr.LowerCaseFilterFactory" />
         <filter class="solr.HunspellStemFilterFactory" dictionary="ru_RU.dic" affix="ru_RU.aff" ignoreCase="true" />
       </analyzer>
     </fieldType>

gives results:
     мама  -> мама | мама                   : FAILED: duplicate words
     мыла  -> мыть | мыло                   : SUCCESS: all OK
     раму  -> рама | расти                  : FAILED: the second word is wrong and extraneous
     мелом -> мести | метить | месть | мел  : FAILED: only the last word is correct; the others are extraneous
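
The comparison above can also be done mechanically. The following is only a sketch in Python: the two tables are copied by hand from the results listed above, with system hunspell treated as the reference, and the function names are hypothetical. It calls neither hunspell nor Solr. It just reports, per word, which stems are duplicated, extra, or missing.

```python
from collections import Counter

# Stems reported by the system hunspell library (treated as the reference).
REFERENCE = {
    "мама": ["мама"],
    "мыла": ["мыло", "мыть"],
    "раму": ["рама"],
    "мелом": ["мел"],
}

# Stems reported by Solr's HunspellStemFilterFactory in the same test.
SOLR = {
    "мама": ["мама", "мама"],
    "мыла": ["мыть", "мыло"],
    "раму": ["рама", "расти"],
    "мелом": ["мести", "метить", "месть", "мел"],
}


def compare_stems(expected, actual):
    """For each word, list stems that are duplicated, extra, or missing."""
    report = {}
    for word, want in expected.items():
        got = actual.get(word, [])
        counts = Counter(got)
        report[word] = {
            "duplicates": sorted(s for s, n in counts.items() if n > 1),
            "extra": sorted(set(got) - set(want)),
            "missing": sorted(set(want) - set(got)),
        }
    return report


def is_ok(entry):
    """A word passes only if nothing is duplicated, extra, or missing."""
    return not (entry["duplicates"] or entry["extra"] or entry["missing"])


report = compare_stems(REFERENCE, SOLR)

if __name__ == "__main__":
    for word, entry in report.items():
        print(word, "OK" if is_ok(entry) else "FAILED", entry)
```

Stem order is ignored on purpose, so "мыть | мыло" counts as matching "мыло | мыть".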

----------

That's why I have long used a JNA (v3.2.7) binding to the original (system)
libhunspell.so :)

----
Best regards
   Sergey Biryukov
   Moscow, Russian Federation

