Posted to solr-user@lucene.apache.org by Oleg Burlaca <ol...@burlaca.com> on 2010/07/27 10:05:36 UTC

Russian stemmer

Hello,

I'm using SnowballPorterFilterFactory with language="Russian".
Stemming works fine except for people's names and geographical places.
Here are some examples:

searching for Ковров should also find Коврова, Коврову, Ковровом, Коврове.

Are there other stemming plugins for the Russian language that can handle
this?
If not, what are the options? A simple solution may be to use wildcard
queries in Standard mode instead of the DisMaxQueryHandler:
Ковров*

but I'd like to avoid it.

Thanks.

Re: Russian stemmer

Posted by Dennis Gearon <ge...@sbcglobal.net>.
I have studied some Russian. I kind of got the picture from the texts that all the exceptions had already been 'found', and were listed in the book. 

I do know that languages are living, changing organisms, but Russian has got to be more regular than English I would think, even WITH all six cases and 3 genders.

Dennis Gearon


--- On Tue, 7/27/10, Robert Muir <rc...@gmail.com> wrote:

> From: Robert Muir <rc...@gmail.com>
> Subject: Re: Russian stemmer
> To: solr-user@lucene.apache.org
> Date: Tuesday, July 27, 2010, 7:12 AM
> right, but your problem is this is the current output:
>
> Ковров -> Ковр
> Коврову -> Ковров
> Ковровом -> Ковров
> Коврове -> Ковров
>
> so, if Ковров was simply left alone, all your forms would match...
> 

Re: Russian stemmer

Posted by Robert Muir <rc...@gmail.com>.
Right, but your problem is that this is the current output:

Ковров -> Ковр
Коврову -> Ковров
Ковровом -> Ковров
Коврове -> Ковров

So if Ковров were simply left alone, all your forms would match.
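
The mismatch in the output above can be reproduced with a small Java sketch. The stem table is copied from the output quoted in this message and only stands in for the real stemmer; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class StemMatchDemo {
    // Stems copied from the output quoted in this message. This table is a
    // stand-in for the real stemmer, not a call into Snowball.
    static final Map<String, String> STEMS = new LinkedHashMap<>();
    static {
        STEMS.put("Ковров", "Ковр");
        STEMS.put("Коврову", "Ковров");
        STEMS.put("Ковровом", "Ковров");
        STEMS.put("Коврове", "Ковров");
    }

    // Returns the indexed forms whose stem equals the stem of the query form.
    static List<String> matches(String query) {
        String queryStem = STEMS.get(query);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : STEMS.entrySet()) {
            if (entry.getValue().equals(queryStem)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Print, for every form, which other forms it would match.
        for (String form : STEMS.keySet()) {
            System.out.println(form + " matches " + matches(form));
        }
    }
}
```

With these stems, a query for Ковров matches only itself, while Коврову, Ковровом and Коврове match one another.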

2010/7/27 Oleg Burlaca <ol...@burlaca.com>

> Thanks Robert for all your help,
>
> The idea of [A-Z].* stopwords is ideal for the English language,
> although in Russian nouns are inflected: Борис, Борису, Бориса, Борисом
>
> I'll try the RussianLightStemFilterFactory (the article in the PDF
> mentioned it's more accurate).
>
> Once again thanks,
> Oleg Burlaca
>



-- 
Robert Muir
rcmuir@gmail.com

Re: Russian stemmer

Posted by Oleg Burlaca <ol...@burlaca.com>.
Thanks Robert for all your help,

The idea of [A-Z].* stopwords is ideal for the English language,
although in Russian nouns are inflected: Борис, Борису, Бориса, Борисом

I'll try the RussianLightStemFilterFactory (the article in the PDF mentioned
it's more accurate).

Once again thanks,
Oleg Burlaca

On Tue, Jul 27, 2010 at 12:07 PM, Robert Muir <rc...@gmail.com> wrote:

> 2010/7/27 Oleg Burlaca <ol...@burlaca.com>
>
> > Actually the situation with Немцов из ок,
> > I've just checked how Yandex works with Немцов and Немцова:
> > http://nano.yandex.ru/project/inflect/
> >
> > I think there are two solutions:
> > a) manually search for both Немцов and then Немцова
> > b) use wildcard query: Немцов*
> >
>
> Well, here is one idea of a more general solution.
> The problem with "protected words" is you must have a complete list.
>
> One idea would be to add a filter that protects any words from stemming
> that
> match a regular expression:
> In english maybe someone wants to avoid any capitalized words to reduce
> trouble: [A-Z].*
> in your case then some pattern like [A-Я].*ов might prevent problems.
>
>
> > Robert, thanks for the RussianLightStemFilterFactory info,
> > I've found this page
> > http://www.mail-archive.com/solr-commits@lucene.apache.org/msg06857.html
> > that somehow describes it. Where can I read more about
> > RussianLightStemFilterFactory ?
> >
> >
> Here is the link:
>
> http://doc.rero.ch/lm.php?url=1000,43,4,20091209094227-CA/Dolamic_Ljiljana_-_Indexing_and_Searching_Strategies_for_the_Russian_20091209.pdf
>
>

Re: Russian stemmer

Posted by Robert Muir <rc...@gmail.com>.
2010/7/27 Oleg Burlaca <ol...@burlaca.com>

> Actually the situation with Немцов is ok,
> I've just checked how Yandex works with Немцов and Немцова:
> http://nano.yandex.ru/project/inflect/
>
> I think there are two solutions:
> a) manually search for both Немцов and then Немцова
> b) use wildcard query: Немцов*
>

Well, here is one idea of a more general solution.
The problem with "protected words" is you must have a complete list.

One idea would be to add a filter that protects from stemming any words
that match a regular expression.
In English, someone might want to avoid stemming any capitalized words to
reduce trouble: [A-Z].*
In your case, a pattern like [A-Я].*ов might prevent problems.
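
A minimal Java sketch of that regular-expression check, using only java.util.regex. It assumes both endpoints of the character range are Cyrillic (А to Я) and that the check runs before any lowercasing; the class and method names are hypothetical, and this is not an existing Solr filter.

```java
import java.util.regex.Pattern;

final class RegexProtectDemo {
    // Assumption: the range runs from Cyrillic А (U+0410) to Я (U+042F).
    // The pattern reads: an uppercase Cyrillic first letter, then anything,
    // ending in "ов". Note that Ё (U+0401) lies outside this range.
    static final Pattern PROTECTED = Pattern.compile("[\u0410-\u042F].*ов");

    // True if the whole token matches the pattern and should skip stemming.
    static boolean isProtected(String token) {
        return PROTECTED.matcher(token).matches();
    }

    public static void main(String[] args) {
        String[] tokens = { "Ковров", "Коврова", "Немцов", "ковров", "Борис" };
        for (String token : tokens) {
            System.out.println(token + " protected: " + isProtected(token));
        }
    }
}
```

Only the capitalized base forms ending in ов (Ковров, Немцов) are protected here. A lowercased token no longer matches, which is why the check would have to come before a lowercase filter, and a name such as Борис is not covered by this pattern at all.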


> Robert, thanks for the RussianLightStemFilterFactory info,
> I've found this page
> http://www.mail-archive.com/solr-commits@lucene.apache.org/msg06857.html
> that somehow describes it. Where can I read more about
> RussianLightStemFilterFactory ?
>
>
Here is the link:
http://doc.rero.ch/lm.php?url=1000,43,4,20091209094227-CA/Dolamic_Ljiljana_-_Indexing_and_Searching_Strategies_for_the_Russian_20091209.pdf





-- 
Robert Muir
rcmuir@gmail.com

Re: Russian stemmer

Posted by Oleg Burlaca <ol...@burlaca.com>.
Actually the situation with Немцов is ok,
I've just checked how Yandex works with Немцов and Немцова:
http://nano.yandex.ru/project/inflect/

I think there are two solutions:
a) manually search for both Немцов and then Немцова
b) use wildcard query: Немцов*

Robert, thanks for the RussianLightStemFilterFactory info,
I've found this page
http://www.mail-archive.com/solr-commits@lucene.apache.org/msg06857.html
that describes it to some extent. Where can I read more about
RussianLightStemFilterFactory?

Regards,
Oleg

2010/7/27 Oleg Burlaca <ol...@burlaca.com>

> A similar word is Немцов.
> The strange thing is that searching for "Немцова" will not find documents
> containing "Немцов"
>
> Немцова: 14 articles
>
> http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2%D0%B0
>
> Немцов: 74 articles
>
> http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2
>
>
>
>

Re: Russian stemmer

Posted by Oleg Burlaca <ol...@burlaca.com>.
A similar word is Немцов.
The strange thing is that searching for "Немцова" will not find documents
containing "Немцов"

Немцова: 14 articles
http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2%D0%B0

Немцов: 74 articles
http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2

Re: Russian stemmer

Posted by Oleg Burlaca <ol...@burlaca.com>.
Yes, I'm sure I've enabled SnowballPorterFilterFactory at both index and
query time, because search works fine
except for names and geographic locations.

I've noticed that searching for
Коврова

also shows documents that contain Коврову, Коврове

Search for Ковров, 7 results:
http://www.sova-center.ru/search/?q=%D0%BA%D0%BE%D0%B2%D1%80%D0%BE%D0%B2

Search for Коврова, 26 results:
http://www.sova-center.ru/search/?lg=1&q=%D0%BA%D0%BE%D0%B2%D1%80%D0%BE%D0%B2%D0%B0

Adding such words to stopwords.txt would be a tedious task, as there are 7
million Russian names :)

Kind Regards,
Oleg Burlaca



On Tue, Jul 27, 2010 at 11:35 AM, Robert Muir <rc...@gmail.com> wrote:

> another look, your problem is ковров itself... its mapped to ковр
>
> a workaround might be to use the protected words functionality to
> keep ковров and any other problematic people/geo names as-is.
>
> separately, in trunk there is an alternative russian stemmer
> (RussianLightStemFilterFactory), which might give you less problems on
> average, but I noticed it has this same problem with the example you gave.
>

Re: Russian stemmer

Posted by Robert Muir <rc...@gmail.com>.
On another look, your problem is ковров itself: it's mapped to ковр.

A workaround might be to use the protected words functionality to
keep ковров and any other problematic people/geo names as-is.

Separately, in trunk there is an alternative Russian stemmer
(RussianLightStemFilterFactory), which might give you fewer problems on
average, but I noticed it has the same problem with the example you gave.
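
The effect of protecting a word can be sketched in plain Java. The toy stemmer below is an assumption: it only hard-codes the mappings reported in this thread and is not the Snowball implementation. In Solr itself the list would normally come from a protected-words file referenced in the analyzer configuration rather than from code like this.

```java
import java.util.Set;
import java.util.function.UnaryOperator;

final class ProtectedWordsDemo {
    // Hypothetical stand-in for the stemmer: it only knows the mappings
    // reported in this thread and returns any other token unchanged.
    static String toyStem(String token) {
        switch (token) {
            case "ковров":
                return "ковр";
            case "коврова":
            case "коврову":
            case "ковровом":
            case "коврове":
                return "ковров";
            default:
                return token;
        }
    }

    // Wraps a stemmer so that tokens in the protected set pass through as-is.
    static UnaryOperator<String> protect(UnaryOperator<String> stemmer,
                                         Set<String> protectedWords) {
        return token -> protectedWords.contains(token) ? token : stemmer.apply(token);
    }

    public static void main(String[] args) {
        UnaryOperator<String> stem = protect(ProtectedWordsDemo::toyStem, Set.of("ковров"));
        for (String token : new String[] { "ковров", "коврова", "коврову" }) {
            System.out.println(token + " -> " + stem.apply(token));
        }
    }
}
```

Once ковров is protected, it and its inflected forms all end up as the same term, so a search for any one form can match the others. The cost is that every such name has to be listed by hand.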

On Tue, Jul 27, 2010 at 4:25 AM, Robert Muir <rc...@gmail.com> wrote:

> All of your examples stem to "ковров":
>
>    assertAnalyzesTo(a, "Коврова Коврову Ковровом Коврове",
>           new String[] { "ковров", "ковров", "ковров", "ковров" });
>     }
>
> Are you sure you enabled this at *both* index and query time?
>



-- 
Robert Muir
rcmuir@gmail.com

Re: Russian stemmer

Posted by Robert Muir <rc...@gmail.com>.
All of your examples stem to "ковров":

   // "a" is the analyzer under test; its setup was not included in this post.
   assertAnalyzesTo(a, "Коврова Коврову Ковровом Коврове",
          new String[] { "ковров", "ковров", "ковров", "ковров" });

Are you sure you enabled this at *both* index and query time?

2010/7/27 Oleg Burlaca <ol...@burlaca.com>

> Hello,
>
> I'm using SnowballPorterFilterFactory with language="Russian".
> The stemming works ok except people names, geographical places.
> Here are some examples:
>
> searching for Ковров should also find Коврова, Коврову, Ковровом, Коврове.
>
> Are there other stemming plugins for the russian language that can handle
> this?
> If not, what are the options. A simple solution may be to use the wildcard
> queries in Standard mode instead of the DisMaxQueryHandler:
> Ковров*
>
> but I'd like to avoid it.
>
> Thanks.
>



-- 
Robert Muir
rcmuir@gmail.com