Posted to solr-user@lucene.apache.org by Shamik Bandopadhyay <sh...@gmail.com> on 2014/09/09 22:14:39 UTC

Czech stemmer

Hi,

  I'm facing stemming issues with Czech-language search. Solr/Lucene
currently provides CzechStemFilterFactory as the sole option; a Snowball
Porter stemmer doesn't seem to be available for Czech. Here's the issue.

A search for "posunout" ("to move" in English) returns results, but a search
for "posunulo" ("moved" in English) returns none. I used the following text
as the field content for the search.

"Pomocí multifunkčních uzlů je možné odkazy mnoha způsoby upravovat. Můžete
přidat a odstranit odkazy, přidat a odstranit vrcholy, prodloužit nebo
přesunout prodloužení čáry nebo přesunout text odkazu. Přístup k požadované
možnosti získáte po přesunutí ukazatele myši na uzel. Z uzlu prodloužení
čáry můžete zvolit tyto možnosti: Protáhnout: Umožňuje posunout prodloužení
odkazové čáry. Délka prodloužení čáry: Umožňuje prodloužit prodloužení
čáry. Přidat odkaz: Umožňuje přidat jednu nebo více odkazových čar. Z uzlu
koncového bodu odkazu můžete zvolit tyto možnosti: Protáhnout: Umožňuje
posunout koncový bod odkazové čáry. Přidat vrchol: Umožňuje přidat vrchol k
odkazové čáře. Odstranit odkaz: Umožňuje odstranit vybranou odkazovou čáru.
Z uzlu vrcholu odkazu můžete zvolit tyto možnosti: Protáhnout: Umožňuje
posunout vrchol. Přidat vrchol: Umožňuje přidat vrchol na odkazovou čáru.
Odstranit vrchol: Umožňuje odstranit vrchol. "

Just wondering if there's a different stemmer available or a way to address
this.

Schema :

<fieldType name="text_csy" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_cz.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_csy.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.CzechStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_cz.txt"/>
    <filter class="solr.CzechStemFilterFactory"/>
  </analyzer>
</fieldType>

Any pointers will be appreciated.

- Thanks,
Shamik

Re: Czech stemmer

Posted by shamik <sh...@gmail.com>.

Lukas,

  Thanks for the information. I took the dictionary and used the Hunspell
stemmer. It worked for the use case I had mentioned, i.e. "posunout" and
"posunulo", but it had an impact on other search terms. For example, a search
for "ukončit" or "ukončí" no longer returns any results, though both work
with CzechStemFilterFactory. I know there will be trade-offs between
stemmers, but I'm not sure which one fits the bill. Not knowing Czech
doesn't help either.

Thanks,
Shamik


Re: Czech stemmer

Posted by Lukáš Vlček <lu...@gmail.com>.

Hi,

I would recommend looking at a stemmer or token filter based on Hunspell
dictionaries. I am not a Solr user, so I cannot point you to the appropriate
documentation, but the Czech dictionary that can be used with Hunspell is of
high quality. It can be downloaded from OpenOffice here:
http://extensions.services.openoffice.org/en/project/czech-dictionary-pack-ceske-slovniky-cs-cz
(distributed under the GPL).

Note: the last time I looked at it, the dictionary contained one broken affix
rule, which may require a manual fix depending on how strict the rule loader
is in Solr. If you are interested in more details and cannot figure it out
yourself, feel free to ping me again. I can point you to some resources on
how I used it with Elasticsearch; I assume the basic concepts apply to Solr
as well.
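
For reference, a Hunspell-based field type in a Solr schema might look roughly
like the sketch below. This is only an illustration, not a tested
configuration: the field type name "text_cs_hunspell" and the file names
"cs_CZ.dic" and "cs_CZ.aff" are assumptions (use whatever the dictionary pack
actually contains), and the strictAffixParsing attribute is, as far as I know,
only honoured by the 4.x releases, where setting it to false lets the loader
skip an affix rule it cannot parse.

```xml
<!-- Hypothetical field type; the name and dictionary file names are assumptions. -->
<fieldType name="text_cs_hunspell" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_cz.txt"/>
    <!-- cs_CZ.dic / cs_CZ.aff: assumed names of the files extracted from
         the Czech dictionary pack and placed in the core's conf directory -->
    <filter class="solr.HunspellStemFilterFactory"
            dictionary="cs_CZ.dic"
            affix="cs_CZ.aff"
            ignoreCase="true"
            strictAffixParsing="false"/>
  </analyzer>
</fieldType>
```

Using the same analyzer chain at index and query time keeps the stems
comparable on both sides. The Analysis screen in the Solr admin UI can then
show what a given term (for example "posunulo") is reduced to.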

Regards,
Lukas
