Posted to java-user@lucene.apache.org by Leos Literak <li...@seznam.cz> on 2002/09/05 13:34:28 UTC

czech support

Hi,

I use Lucene on my website and I am enthusiastic
about it. I was able to enable full-text search for my
complicated database-driven site within a few hours!

But I need some enhancements that are related to
my language.

1\ In Czech, words have many inflected forms.
In English, a noun has only two forms:

singular: dog
plural: dogs

but we have 7 forms (cases) for each of singular
and plural:

pes,psa,psovi,psu,pse,psovi,psem
psi,psu,psum,psy,psi,psech,psy

I'd like to be able to search for all of these
variants by the first form: pes

e.g. whenever I encounter "psa", index it as "pes"

2\ Another issue is with diacritics.

For example lávka (la'vka).

Some people use them, some write words without them.

So I'd like to be able to look up both
lavka and lávka. An easy way is to index words without
diacritics, because that is the common denominator.

The algorithm is quite simple.
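
A minimal sketch of the diacritics removal described above, assuming a
JDK that ships java.text.Normalizer (Java 6 or later). The names
DiacriticsFolder and fold are made up for this example; they are not
part of Lucene.

```java
import java.text.Normalizer;

class DiacriticsFolder {

    /**
     * Decomposes each accented letter into base letter + combining
     * mark (Unicode NFD), then deletes the combining marks.
     */
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "l\u00e1vka" is "lavka" with an acute accent on the first a.
        System.out.println(fold("l\u00e1vka")); // prints "lavka"
    }
}
```

The same function would have to be applied both to the indexed text and
to the query, so that both spellings end up as the same term.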



So my question is: what kind of interfaces/classes
should I implement/override? I have no idea of the
relations between the classes in Lucene and their
purpose, so it would take me a lot of time to find that out.

Thank you for your ideas!

	Leos


-- 
Leos Literak
http://AbcLinuxu.cz - tady je tucnakum hej!




--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: czech support

Posted by Gerhard Schwarz <ge...@fpg.de>.
Leos Literak wrote:
> But I need some enhancements, that are related to
> my language.
[...]
> but we have 7 shapes for each of singular
> and plural:
> 
> pes,psa,psovi,psu,pse,psovi,psem
> psi,psu,psum,psy,psi,psech,psy
> 
> I'd like to be able to search for all of this
> variants in first shape: pes
> 
> eg. whenever I encounter "psa" index it as "pes"

How are those suffixes related to each other? It makes most sense to
reduce every plural to its singular form first. After that you can
strip unnecessary suffixes.
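
As a rough illustration of the "index psa as pes" mapping quoted above,
here is a sketch that uses a hand-filled lookup table instead of suffix
rules (stripping the "-a" from "psa" alone would give "ps", not "pes").
The class name LemmaTable and the method lemma are hypothetical, and the
table holds only the forms listed in the original mail, written without
diacritics.

```java
import java.util.HashMap;
import java.util.Map;

class LemmaTable {

    // Inflected form -> first form, for the one example word in this thread.
    private static final Map<String, String> LEMMAS = new HashMap<>();

    static {
        String[] forms = {
            "pes", "psa", "psovi", "psu", "pse", "psem",   // singular
            "psi", "psum", "psy", "psech"                  // plural
        };
        for (String form : forms) {
            LEMMAS.put(form, "pes");
        }
    }

    /** Returns the first form if the token is known, else the token itself. */
    static String lemma(String token) {
        return LEMMAS.getOrDefault(token, token);
    }

    public static void main(String[] args) {
        System.out.println(lemma("psa"));   // prints "pes"
        System.out.println(lemma("lavka")); // prints "lavka" (not in the table)
    }
}
```

A real table would have to come from a dictionary or from rules for the
whole language; this only shows the shape of the lookup.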

> 2\ another issue is with diacritics.
> 
> for example lávka (la'vka)
> 
> some people use it, some write words without it.
> 
> so I'd like to be able to look up both
> lavka and lávka. easy way is to index words without
> diacritics, because it is common denominator.

If you eliminate the diacritics for indexing and searching,
you have the advantage that everyone can search, even someone
who is not familiar with Czech diacritics or is not able
to type them (wrong codepage, font, whatever).
 
> so my question is: what kind of interfaces/classes
> shall I implement/overwrite? i have no idea of
> relations between classes in Lucene and their
> purpose. so it would take me lot of time to find that.

You should first look at the classes Analyzer and TokenFilter. You need
an Analyzer that pipes the content of a document through a suitable
Tokenizer (StandardTokenizer should fit for Czech) and then through
the needed filters. One of those filters is your own, which makes the
needed grammatical changes.
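
The tokenizer-then-filters idea can be sketched in plain Java. This is
not Lucene code: AnalysisChain, then and analyze are invented names that
only imitate the roles of Analyzer, Tokenizer and TokenFilter, and the
one-word "grammar" filter is a stand-in for a real Czech filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

class AnalysisChain {

    // Filters are applied to every token in the order they were added.
    private final List<UnaryOperator<String>> filters = new ArrayList<>();

    AnalysisChain then(UnaryOperator<String> filter) {
        filters.add(filter);
        return this;
    }

    /** Splits text on non-letters, then runs each token through the filters. */
    List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}]+")) {
            if (raw.isEmpty()) {
                continue; // split yields an empty first piece if text starts with a separator
            }
            String token = raw;
            for (UnaryOperator<String> filter : filters) {
                token = filter.apply(token);
            }
            terms.add(token);
        }
        return terms;
    }

    public static void main(String[] args) {
        AnalysisChain chain = new AnalysisChain()
            .then(t -> t.toLowerCase(Locale.ROOT))     // lower-case filter
            .then(t -> t.equals("psa") ? "pes" : t);   // toy "grammar" filter
        System.out.println(chain.analyze("Vidim psa.")); // prints "[vidim, pes]"
    }
}
```

In Lucene itself the custom step would be a TokenFilter subclass wired
into an Analyzer; the exact method signatures depend on the Lucene
version and are not shown here.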


Greets, Gerhard



Re: czech support

Posted by Leos Literak <li...@seznam.cz>.
Hi,

That's great! I knew that (i|a)spell does the job too,
but I had no idea how to integrate it with Lucene.
I planned to get inspired by its rules :-)

When I return from vacation, I will definitely look
through this archive and test it! It would save me lots
of work. I just miss better documentation.

Nevertheless, thank you for pointing me in the right direction!

	Leos


Alex Murzaku wrote:
> it seems this analyzer based on ispell was posted only on lucene-dev:
> 
> http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg01757.html
> 
> if it works as promised, it would add support to all the languages
> supported by ispell. if you look at the list of available dictionaries
> in http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html, czech and
> hungarian appear to have ispell dictionaries.
> 
> alex
> 
> --- Zsolt Czinkos <cz...@interware.hu> wrote:
> 
>>Hello
>>
>>You need to write your own analyzer/filter which can handle czech
>>grammar.
>>I face up with the same problem in Hungarian. Not as easy as English
>>- lots of work.
>>
>>Best regards,
>>
>>czinkos
>>
>>On Thu, Sep 05, 2002 at 01:34:28PM +0200, Leos Literak wrote:
>>
>>>Hi,
>>>
>>>I use lucene on my website and i am enthusiastic
>>>about it. I was able to enable fulltext for my
>>>complicated database driven site within few hours!
>>>
>>>But I need some enhancements, that are related to
>>>my language.
>>>
>>>1\ in Czech, there are variants of words.
>>>in english, you have only two shapes:
>>>
>>>singular: dog
>>>plural: dogs
>>>
>>>but we have 7 shapes for each of singular
>>>and plural:
>>>
>>>pes,psa,psovi,psu,pse,psovi,psem
>>>psi,psu,psum,psy,psi,psech,psy
>>>
>>>I'd like to be able to search for all of this
>>


-- 
Leos Literak
http://AbcLinuxu.cz - tady je tucnakum hej!






Re: czech support

Posted by Alex Murzaku <mu...@yahoo.com>.
it seems this analyzer based on ispell was posted only on lucene-dev:

http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg01757.html

if it works as promised, it would add support to all the languages
supported by ispell. if you look at the list of available dictionaries
in http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html, czech and
hungarian appear to have ispell dictionaries.

alex

--- Zsolt Czinkos <cz...@interware.hu> wrote:
> Hello
> 
> You need to write your own analyzer/filter which can handle czech
> grammar.
> I face up with the same problem in Hungarian. Not as easy as English
> - lots of work.
> 
> Best regards,
> 
> czinkos
> 
> On Thu, Sep 05, 2002 at 01:34:28PM +0200, Leos Literak wrote:
> > Hi,
> > 
> > I use lucene on my website and i am enthusiastic
> > about it. I was able to enable fulltext for my
> > complicated database driven site within few hours!
> > 
> > But I need some enhancements, that are related to
> > my language.
> > 
> > 1\ in Czech, there are variants of words.
> > in english, you have only two shapes:
> > 
> > singular: dog
> > plural: dogs
> > 
> > but we have 7 shapes for each of singular
> > and plural:
> > 
> > pes,psa,psovi,psu,pse,psovi,psem
> > psi,psu,psum,psy,psi,psech,psy
> > 
> > I'd like to be able to search for all of this
> 


=====
__________________________________
alex@lissus.com -- http://www.lissus.com




Re: czech support

Posted by Zsolt Czinkos <cz...@interware.hu>.
Hello

You need to write your own analyzer/filter which can handle Czech
grammar.
I face the same problem in Hungarian. It is not as easy as English - lots of work.

Best regards,

czinkos

On Thu, Sep 05, 2002 at 01:34:28PM +0200, Leos Literak wrote:
> Hi,
> 
> I use lucene on my website and i am enthusiastic
> about it. I was able to enable fulltext for my
> complicated database driven site within few hours!
> 
> But I need some enhancements, that are related to
> my language.
> 
> 1\ in Czech, there are variants of words.
> in english, you have only two shapes:
> 
> singular: dog
> plural: dogs
> 
> but we have 7 shapes for each of singular
> and plural:
> 
> pes,psa,psovi,psu,pse,psovi,psem
> psi,psu,psum,psy,psi,psech,psy
> 
> I'd like to be able to search for all of this
