Posted to dev@lucene.apache.org by Rodrigo Reyes <re...@charabia.net> on 2002/03/11 21:59:01 UTC

Normalization

Hi,

I'd like to talk about the normalization (aka filter) processing of a string
being indexed/searched, and how it is done in Lucene. I'll end with a
proposal for another method of handling it.

The Lucene engine includes some filters whose purpose is to remove
meaningless morphological marks, in order to extend document retrieval to
pertinent documents that do not match the exact forms used by users in their
queries.

There are some filters provided off-the-shelf along with Lucene, a Porter
stemmer and a stemmer specific to German. However, my point is not only that
there can't be a single stemmer for all languages (this is obvious to
everybody, I guess), but that ideally there would be several filters for the
same language. For example, the Porter filter is fine for standard English,
but rather inappropriate for proper nouns. On the contrary, the soundex is
probably fine for names, but it generates inaccurate results when used as a
filter on a whole document. Generally speaking, there may be very different
strategies when normalizing text, whether highly aggressive (like the
soundex) or rather soft (like simple diacritics removal). But it is up to
the designer of the search engine to carefully choose a strategy according
to their audience and target documents. It is even possible to mix several
strategies by including an information extraction system that would
additionally store the proper nouns, the dates, the places, etc. in separate
indexes.

In my opinion, stemming is not the perfect, unique solution for
normalization. For example, I personally prefer a normalization that
includes stemming, but also some light phonetic simplification that discards
the differences between close phonemes (like the French
é/è/ê/ei/ai/ait/ais/aient/etc or ain/ein/in/un/etc), as it gives good
results on texts taken from Usenet (while it may be a bit too aggressive
for newspaper texts written by journalists).

Well, in fact my main point is the following: having one filter per
language is wrong. Second point is: having the filter algorithm hard-coded
in a programming language is wrong as well. There should be a simple way of
specifying a filter in a simple, dedicated language. In this respect, the
Snowball project is really interesting, as it solves the issue. In my mind,
there should mainly be a normalizer engine, with many configuration files,
easy to modify to implement or adapt a filter. This is an important issue,
as the accuracy of the search engine is directly linked to the normalization
strategy.

However, an important point is also the ease of use of such a language. In
my attempt to build such a simple description language, I came up with
something that I hope is quite simple, yet powerful enough: something that
just specifies the letters to transform, the right and left context, and the
replacement string. In my opinion, this covers 80% of the need for (at
least) European languages. I implemented it (in Java) and wrote a normalizer
for French, which stems and phonetically simplifies its input.

Just as an example, here is a small excerpt of my French normalizer (written
in the toy language I implemented):
:: sh ::        > ch
:: sch ::       > ch
// transform the "in"/"yn" into the same string, when not pronounced "inn"
 :: in :: [~aeiouymn] > 1
[~aeiouy] :: yn :: [~aeiouynm]  > 1   // "syndicat", "synchro", but not "payer"
:: ives :: $ > if    // "consécutives"

Before the first "::" is the left context, after the second "::" is the
right context. "$" indicates a word boundary.
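
Roughly speaking, a rule line is parsed into those four parts. A simplified
Java sketch (the Rule class here is just for illustration, not the actual
code, which handles a few more details):

import java.util.ArrayList;
import java.util.List;

public class Rule {
    public final String left;        // left context (may be empty)
    public final String focus;       // the letters to transform
    public final String right;       // right context (may be empty)
    public final String replacement; // replacement string

    public Rule(String left, String focus, String right, String replacement) {
        this.left = left; this.focus = focus;
        this.right = right; this.replacement = replacement;
    }

    /** Parses a line like: [~aeiouy] :: yn :: [~aeiouynm]  > 1 */
    public static Rule parse(String line) {
        int comment = line.indexOf("//");              // strip trailing comments
        if (comment >= 0) line = line.substring(0, comment);
        int first = line.indexOf("::");
        int second = line.indexOf("::", first + 2);
        int arrow = line.indexOf('>', second + 2);
        return new Rule(line.substring(0, first).trim(),
                        line.substring(first + 2, second).trim(),
                        line.substring(second + 2, arrow).trim(),
                        line.substring(arrow + 1).trim());
    }
}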

Some features are still missing in my implementation, such as putting
constraints on the word length (i.e. applying a transformation only to words
that have more than x letters) or the like, but overall I am satisfied with
it.

As an example of the result (the two input forms are pronounced identically
in French, although the second is not written correctly):
read: <démesuré> result: <demezur>
read: <daimesurré> result: <demezur>

Before going through the process of submitting it to the Lucene project, I'd
like to hear your comments on the approach. Of high concern is the language
used to describe the normalization process, as I am not fully satisfied with
it, but hey, it's hard to find something really simple yet just expressive
enough. Any ideas?

Rodrigo
http://www.charabia.net





Re: Normalization

Posted by Brian Goetz <br...@quiotix.com>.
> Isn't this really a property of an index rather than an entire Lucene
> build?

Technically no, but in spirit, yes.  

Personally, I always liked the idea of creating an Analyzer at index
creation time, and having the Analyzer object stored as a serialized
object in the index.  Then you couldn't make the all-too-common
mistake of indexing with one and then trying to search with another.
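
Something along these lines (just a sketch; it assumes the Analyzer
implementation is java.io.Serializable, and the "analyzer.ser" file name is
made up for illustration):

import java.io.*;
import org.apache.lucene.analysis.Analyzer;

public class AnalyzerStore {

    /** Write the analyzer used at indexing time next to the index files. */
    public static void save(Analyzer analyzer, File indexDir) throws IOException {
        ObjectOutputStream out = new ObjectOutputStream(
            new FileOutputStream(new File(indexDir, "analyzer.ser")));
        out.writeObject(analyzer);      // only works for Serializable analyzers
        out.close();
    }

    /** Load the very same analyzer before searching, so queries match the index. */
    public static Analyzer load(File indexDir)
            throws IOException, ClassNotFoundException {
        ObjectInputStream in = new ObjectInputStream(
            new FileInputStream(new File(indexDir, "analyzer.ser")));
        Analyzer analyzer = (Analyzer) in.readObject();
        in.close();
        return analyzer;
    }
}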

> If so, having a text-based way to describe a policy is very helpful
> and better than a source code-based one.

yup.



Re: Normalization

Posted by Dmitry Serebrennikov <dm...@earthlink.net>.
Rodrigo Reyes wrote:

> Good point. I had this drawback in mind, but I am not totally convinced
> that the compilation process is really a good protection barrier; I'd
> rather rely on educational explanations and warnings. However, since the
> parser & interpreter are already written, it shouldn't be that hard to
> write a source-code generator (at least, it'd make it more
> efficient/faster, and that's not something I can be against).
Isn't this really a property of an index rather than an entire Lucene 
build? I mean, ideally, the exact normalization procedure would be fixed 
when an index is created and stored together with that index, so that 
all future documents added to the index also use it and all queries 
against the index can be normalized using the same procedure. As for 
searching across multiple indexes (with MultiSearcher), it would either 
(a) refuse to work when the indexes do not share a normalization policy, or 
(b) apply a different normalization policy to each sub-index.
I realize that this is easier said than done, but would you agree that 
this would be the ideal solution?
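
For (a), something like the following sketch would do, assuming each index
directory carried some fingerprint of its normalization policy in a file
(the file name and format are made up here):

import java.io.*;

public class NormalizationPolicyCheck {

    /** Reads the policy fingerprint assumed to be stored with each index. */
    static String policyOf(File indexDir) throws IOException {
        BufferedReader r = new BufferedReader(
            new FileReader(new File(indexDir, "normalizer.policy")));
        String id = r.readLine();
        r.close();
        return id;
    }

    /** Option (a): refuse to search across indexes with different policies. */
    static void requireSamePolicy(File[] indexDirs) throws IOException {
        String first = policyOf(indexDirs[0]);
        for (int i = 1; i < indexDirs.length; i++) {
            if (!first.equals(policyOf(indexDirs[i])))
                throw new IllegalStateException(
                    "indexes do not share a normalization policy");
        }
    }
}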

If so, having a text-based way to describe a policy is very helpful and 
better than a source code-based one.

Dmitry.




Re: Normalization

Posted by Brian Goetz <br...@quiotix.com>.
> Would it make sense to allow a full regex in the matching part? Could
> use regex or oromatcher packages. Don't know how that would affect your
> hashing though...

It's important that normalization rules are guaranteed to converge.
If you have rules like
 a->ie
and
 i->a
you're in trouble.

Allowing regular expressions in the matching part seems like asking for
trouble here...
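
A cheap safety net, if the engine applies a single pass, is to check on a
test vocabulary that the output is already a fixed point, i.e. that a second
pass changes nothing; cyclic rule pairs like the one above show up as words
that keep changing. A sketch (Normalizer here is just a placeholder
interface for whatever engine is used):

import java.util.ArrayList;
import java.util.List;

public class ConvergenceCheck {

    /** Placeholder for the rule engine under test. */
    public interface Normalizer { String normalize(String word); }

    /** Returns the words for which a second pass still changes the output. */
    public static List nonFixedPoints(Normalizer n, String[] words) {
        List bad = new ArrayList();
        for (int i = 0; i < words.length; i++) {
            String once = n.normalize(words[i]);
            if (!once.equals(n.normalize(once)))
                bad.add(words[i]);
        }
        return bad;
    }
}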



RE: Normalization

Posted by Alex Murzaku <mu...@earthlink.net>.
Hi Rodrigo and Brian,

The power of regex is desirable, especially in the left and right context
matching. As it is, you need to write a lot of little rules for every
possible combination. A regex instead would allow just one rule to cover
most of the combinations. For example, you have a rule that removes
"ation(s)" at the end of a word. That creates a stem like "n" for
"nation(s)". This kind of problem could be resolved by having a way to
define units bigger than just one letter, for example a syllable.

The other feature that I have found useful is the possibility to create
classes of sounds (letters). You can get around it with enumeration, but
sometimes it makes sense to be able to define groups of consonants, vowels,
etc.

But in the end, you are right, regex is too powerful. My view is that once
the people using this tool have spent the time to learn and understand it,
they will always aim at covering as many linguistic exceptions as possible.
The present limitations could become frustrating.

Just my two lipas.

Alex





Re: Normalization

Posted by Rodrigo Reyes <re...@charabia.net>.
Hi Alex,

> Would it make sense to allow a full regex in the matching part? Could
> use regex or oromatcher packages. Don't know how that would affect your
> hashing though...

 My answer is not really different from Brian's: you don't really need all
that power. Although I don't have significant experience with non-European
languages, this is not the first tool of this kind I have written, and to my
knowledge you don't really need more power than that. At least, not the kind
of additional expressiveness that can be provided by regexps (although, as I
mentioned in another mail, you may need restrictions on the size of the
string input or output; for example, soundex specifies a 4-letter limitation
that is not currently addressed by the language).

However, I'd be very interested in hearing about counter-examples that would
need it. The only counter-example I could find was the annoyance of having
to remove sequences of the same letter, which was ugly, so I added an option
called "uniquify" to do the job more easily (as you can see in the soundex
or French normalizer).
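
To be clear, the "uniquify" option just collapses runs of the same letter
into one; in Java the idea is roughly:

    static String uniquify(String s) {
        StringBuffer out = new StringBuffer();
        char last = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != last) out.append(c);   // keep only the first of a run: "assschh" -> "asch"
            last = c;
        }
        return out.toString();
    }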

Rodrigo





RE: Normalization

Posted by Alex Murzaku <mu...@earthlink.net>.
Would it make sense to allow a full regex in the matching part? Could
use regex or oromatcher packages. Don't know how that would affect your
hashing though...

Alex





Re: Normalization

Posted by Rodrigo Reyes <re...@charabia.net>.
Hi Alex,

 Thanks for your feedback,

> The rules seem to be applied sequentially and each rule modifies the
> output of the previous one. This is kind of risky, especially if the rule
> set becomes too big. The author of the rules needs to keep this in mind
> at all times. For example, there is a rule for "ons$" and a following
> one for "ions$". The second one will never be matched because the string
> will be changed by the first rule it matches. Even though aimons and
> aimions should be reduced to "em", they end up as "em" and "emi". Maybe
> this could be solved if you do longest match first.

You're right, rule-masking is a real problem, but not exactly in the example
you give.

The rules are not applied sequentially as they appear in the file; they are
stored in a (kind of) transducer where the first state is the first letter
of the focused string (i.e. neither the right nor the left context). The
rules are thus hashed according to the first letter of the central string.
The normalizer iterates through the letters of the string, applying the
smallest subset of the rules at each letter, and reducing the string as it
goes. In your example, the rules for "ions$" would be applied when the
normalizer reaches the letter i, reducing the string, and therefore the rule
for "ons$" cannot be applied.

However, when the strings begin with the same char, you're right, there is a
risk of a rule never being applied. For example, with "on" then "ont", the
latter is unlikely to ever be used. As you state, this can be solved by
sorting the rules of a given subset longest first (which is a very good
point, I'll fix it in the source, thanks!).
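
To give an idea of the dispatch (a simplified sketch, not the actual
classes; Rule is assumed to carry its focus string and replacement, as in
the parsing sketch in my first mail), the rules are grouped by the first
letter of their focus string, and each group is kept sorted
longest-focus-first:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RuleTable {
    // maps the first letter of a rule's focus string to the rules starting with it
    private final Map byFirstLetter = new HashMap();

    public void add(Rule rule) {
        Character key = new Character(rule.focus.charAt(0));
        List bucket = (List) byFirstLetter.get(key);
        if (bucket == null) {
            bucket = new ArrayList();
            byFirstLetter.put(key, bucket);
        }
        bucket.add(rule);
        // longest focus first, so "ont" is tried before "on" and is never masked
        Collections.sort(bucket, new Comparator() {
            public int compare(Object a, Object b) {
                return ((Rule) b).focus.length() - ((Rule) a).focus.length();
            }
        });
    }

    /** The only rules worth trying at the letter currently under the cursor. */
    public List candidatesFor(char c) {
        List bucket = (List) byFirstLetter.get(new Character(c));
        return bucket == null ? Collections.EMPTY_LIST : bucket;
    }
}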

> The other consequence of sequential application is a possible change of
> context; some rules could therefore never be reached. I don't remember how
> we got around this.

The context is never changed within a single pass, as there is an input
buffer and an output buffer. Rules read from the input buffer and write to
the output buffer; this way, the context is never modified. The description
language in fact allows, at the same time, rules that rely on the context
not being changed by other rules (this safety is provided by the
input/output double buffer) and rules that rely on the changes made by other
rules (with multiple passes over the data, using the #start keyword).
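
A skeleton of one such pass, with the context matching elided and
RuleTable/Rule as in the sketches above:

    /** One pass: rules read the untouched input buffer and write to the output buffer. */
    static String onePass(RuleTable rules, String input) {
        StringBuffer output = new StringBuffer();
        int i = 0;
        while (i < input.length()) {
            Rule match = null;
            java.util.List candidates = rules.candidatesFor(input.charAt(i));
            for (int r = 0; r < candidates.size(); r++) {
                Rule rule = (Rule) candidates.get(r);
                // candidates are in longest-first order, so the first focus matching here wins
                if (input.startsWith(rule.focus, i) /* && left/right contexts match */) {
                    match = rule;
                    break;
                }
            }
            if (match == null) {
                output.append(input.charAt(i));     // no rule: copy the letter through
                i++;
            } else {
                output.append(match.replacement);   // contexts are always checked against
                i += match.focus.length();          // the input, never against the output
            }
        }
        return output.toString();
    }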

Rodrigo





RE: Normalization

Posted by Alex Murzaku <mu...@earthlink.net>.
Hi Rodrigo,

A couple of things that I should have warned you about in our discussion
yesterday. 

The rules seem to be applied sequentially and each rule modifies the
output of the previous one. This is kind of risky, especially if the rule
set becomes too big. The author of the rules needs to keep this in mind
at all times. For example, there is a rule for "ons$" and a following
one for "ions$". The second one will never be matched because the string
will be changed by the first rule it matches. Even though aimons and
aimions should be reduced to "em", they end up as "em" and "emi". Maybe
this could be solved if you do longest match first.

The other consequence of sequential application is a possible change of
context; some rules could therefore never be reached. I don't remember how
we got around this.

Alex









Re: Normalization

Posted by Rodrigo Reyes <re...@charabia.net>.
> Anyway, I'll try to add a few comments in the source code (although it's
> very small, like 8 small classes) and package it so that the Lucene
> developers can try it. Should be ready tomorrow.

Ok, please find attached the archive of the normalizer. To compile it, just
type "ant". To test the French normalizer, just run "ant test-french", or
"ant test-soundex" for the soundex.

Rodrigo



Re: Normalization

Posted by Rodrigo Reyes <re...@charabia.net>.
Hi Brian,

> Great stuff, Rodrigo!  Welcome.

Thanks :-)

> will stop working.  So any such filtering language should produce code
> (or data) that becomes part of the program, rather than simply a
> configuration file along with the program.  In other words, it should
> be considered source code, not configuration data.

 Good point. I had this drawback in mind, but I am not totally convinced
that the compilation process is really a good protection barrier; I'd rather
rely on educational explanations and warnings. However, since the parser &
interpreter are already written, it shouldn't be that hard to write a
source-code generator (at least, it'd make it more efficient/faster, and
that's not something I can be against). I mainly wanted to write it and the
French normalizer as a proof-of-concept.

> Great idea!  We'd love to have something like this.  This is the sort
> of contribution we're really looking for.  I'm willing to help write
> a parser for it if the language gets complicated.

Great, some extensions may be needed to describe additional word-length
constraints on rules, and so on, but my belief is that it should stay as
simple as possible.

Anyway, I'll try to add a few comments in the source code (although it's
very small, like 8 small classes) and package it so that the Lucene
developers can try it. Should be ready tomorrow.

Rodrigo





Re: Normalization

Posted by Brian Goetz <br...@quiotix.com>.
Great stuff, Rodrigo!  Welcome.  

Your comments are right on the mark.  While Lucene has a great
architecture for building flexible text processing systems, the
supplied tokenizers and analyzers aren't perfect.  Fortunately,
it's easy to add new ones.

> Well, in fact my main point is the following: having one filter per
> language is wrong. Second point is: having the filter algorithm hard-coded
> in a programming language is wrong as well. There should be a simple way of
> specifying a filter in a simple, dedicated language. In this respect, the
> Snowball project is really interesting, as it solves the issue. In my mind,
> there should mainly be a normalizer engine, with many configuration files,
> easy to modify to implement or adapt a filter. This is an important issue,
> as the accuracy of the search engine is directly linked to the normalization
> strategy.

I'm all for domain-specific languages, but you have to be careful about
making the filter language too easy to change, since if the filter is
changed after the archive is created and documents are indexed, searches
will stop working.  So any such filtering language should produce code
(or data) that becomes part of the program, rather than simply a
configuration file along with the program.  In other words, it should
be considered source code, not configuration data.

> Before going through the process of submitting it to the Lucene project,
> I'd like to hear your comments on the approach. Of high concern is
> the language used to describe the normalization process, as I am not
> fully satisfied with it, but hey, it's hard to find something really
> simple yet just expressive enough.

Great idea!  We'd love to have something like this.  This is the sort
of contribution we're really looking for.  I'm willing to help write
a parser for it if the language gets complicated.




Re: Normalization

Posted by Brian Goetz <br...@quiotix.com>.
> As I have said before in this list, this gets way off of Lucene. The
> normalizer, or the morphologic analyzer or the phonetic transducer, or
> the stemmer, or the thesaurus -- they all could be stand-alone products.

I've got to disagree with you here on two points.  

1.  Lucene's architecture is all about flexibility and plug-ins.
Rodrigo's proposal is entirely consistent with that -- offering better
tools for building Analyzers.  (Contrast this with some of the
proposals that have been flying for building crawlers and such --
those truly are off the mark as tools to put INTO Lucene.)

2.  The vast majority of users will use one of the provided analyzers
(SimpleAnalyzer, StandardAnalyzer.)  Fair or not, Lucene will be
judged on how well it does on typical documents using the "default"
tools.  Right now, the default tools are unnecessarily weak.

> I used to make such products many years ago and there are companies that
> still sell such tools (e.g. inXight). I like the way Lucene is now: the
> included analyzers/filters can be used as-is, but everyone can also use
> whatever else they need. One could use the German or Porter stemmer, but
> anyone could easily use other analyzers as well (for example all the
> languages Snowball offers). This is fine as long as Lucene remains a
> library.

My understanding of Rodrigo's idea (filtered through my own view of
the project philosophy) is that he's proposing an "Analyzer
Construction Kit".  That seems like a great idea to me, and while we
could say "put it in /contrib", it really does seem like the sort of
thing we want to have.  

> As Brian says, what matters is to keep the analyzers synchronized
> between indexing and searching. Is there a way to force this?

Having it generate Analyzer source code seems like a pretty good way
to me.  
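
To make that concrete: the generated source could be little more than a
TokenFilter with the compiled rules baked in, roughly like this (a sketch
from memory against the current TokenStream API; the normalize() body is
just a stand-in for whatever the generator would emit):

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/** Example of what generated filter code might look like. */
public class GeneratedFrenchNormalizerFilter extends TokenFilter {

    public GeneratedFrenchNormalizerFilter(TokenStream in) {
        input = in;                       // wire up the wrapped stream
    }

    public final Token next() throws IOException {
        Token t = input.next();
        if (t == null) return null;
        String norm = normalize(t.termText());
        return new Token(norm, t.startOffset(), t.endOffset());
    }

    /** Stand-in: the generator would emit the compiled rule set here. */
    private String normalize(String word) {
        return word.toLowerCase();
    }
}

The generator could also spit out a matching Analyzer that chains this
filter behind a tokenizer, so the whole thing is compiled in like any other
source file.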

> I would rather prefer changes to the core engine

Is this a change to the core engine, or an additional tool that can
be plugged into the engine?  I think the latter.

> that accommodate all/many
> possible "normalizations" like what Joanne Sproston contributed some
> months ago, i.e. the possibility to return more than one word for a
> filtered word and store them in the same document position (useful for
> synonyms and for agglutinative languages like Finnish, Turkish, etc.)

That's a change to the core.  Might be useful, but is a more intrusive
change.



RE: Normalization

Posted by Alex Murzaku <mu...@earthlink.net>.
The generic string transducer kit could become a fine and widely used
Lucene contrib tool, but it could also become more than that: a standalone
tool like Snowball. The formal language Rodrigo describes is quite
powerful and allows for a lot.

What I was trying to say is that it doesn't need to be plugged in. But
thinking it over and reading your comments, I now see that having it output
Analyzer code could be quite nice and would enforce index/search analyzer
synchronization.





Re: Normalization

Posted by Brian Goetz <br...@quiotix.com>.
> As I have said before in this list, this gets way off of Lucene. The
> normalizer, or the morphologic analyzer or the phonetic transducer, or
> the stemmer, or the thesaurus -- they all could be stand-alone products.

I think that as Lucene matures, ALL of the sample implementations of
Analyzers (SimpleAnalyzer, StandardAnalyzer, the Porter stemmer)
should be moved out of the "core" project and into the "library" of
plug-ins, leaving the core with only interfaces and perhaps the most
basic building blocks (WordTokenizer, LowerCaseFilter.)  Until
recently, there have been few plug-ins available, but this is changing
and eventually we will want to recognize this.

I think a good step would be to create a separate Lucene subproject,
for Analyzers and other plug-ins, and we can give out commit privs to
those more widely to people who have that domain expertise.  



RE: Normalization

Posted by Alex Murzaku <mu...@earthlink.net>.
As I have said before in this list, this gets way off of Lucene. The
normalizer, or the morphologic analyzer or the phonetic transducer, or
the stemmer, or the thesaurus -- they all could be stand-alone products.
I used to make such products many years ago and there are companies that
still sell such tools (e.g. inXight). I like the way Lucene is now: the
included analyzers/filters can be used as-is, but everyone can also use
whatever else they need. One could use the German or Porter stemmer, but
anyone could easily use other analyzers as well (for example all the
languages Snowball offers). This is fine as long as Lucene remains a
library.

As Brian says, what matters is to keep the analyzers synchronized
between indexing and searching. Is there a way to force this?

I would rather prefer changes to the core engine that accommodate all/many
possible "normalizations" like what Joanne Sproston contributed some
months ago, i.e. the possibility to return more than one word for a
filtered word and store them in the same document position (useful for
synonyms and for agglutinative languages like Finnish, Turkish, etc.)


