Posted to java-user@lucene.apache.org by Konrad Scherer <bc...@uottawa.ca> on 2002/11/21 21:12:26 UTC

StandardFilter that works for French

Hello all,

I am using Lucene to index both English and French documents and have run 
into some problems with the analysis of the text. The project I am working 
with is using the searches to do language analysis so this may not be 
relevant to some people. Here is a quick explanation.

In French there are six words (me, te, se, le/la, ne, de) in which the final 
vowel is replaced with an apostrophe when the following word starts with a 
vowel. For example, "me aider" becomes "m'aider". Currently Lucene indexes 
m'aider, s'aider, and n'aider as different words when in fact they should be 
analyzed as me aider, se aider, ne aider, etc. So I modified StandardFilter 
to return these words as two tokens. To do that I had to add a one-token 
buffer. I toyed with modifying StandardTokenizer.jj but I was worried about 
unintended changes in behavior.
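
As a rough illustration of the one-token buffer idea (this is not the code that was posted to the list), here is a minimal sketch in plain Java. It uses strings instead of Lucene Token objects so it runs on its own. The class name ElisionSplitter, the split helper, and the choice to expand l' to "le" (it could equally be "la") are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a token filter; plain strings replace Lucene tokens. */
class ElisionSplitter implements Iterator<String> {
    // First letters of the elided words: me, te, se, le/la, ne, de.
    private static final String ELIDED_LETTERS = "mtslnd";

    private final Iterator<String> input;
    private String pending; // the one-token buffer holding the second half of a split word

    ElisionSplitter(Iterator<String> input) {
        this.input = input;
    }

    @Override
    public boolean hasNext() {
        return pending != null || input.hasNext();
    }

    @Override
    public String next() {
        // Emit the buffered second half before reading more input.
        if (pending != null) {
            String buffered = pending;
            pending = null;
            return buffered;
        }
        String token = input.next();
        // Split only when the apostrophe is the second character and the
        // first character is one of the six letters.
        if (token.length() > 2 && token.charAt(1) == '\''
                && ELIDED_LETTERS.indexOf(Character.toLowerCase(token.charAt(0))) >= 0) {
            pending = token.substring(2);
            // Assumption: restore the elided vowel as "e" (so l' becomes "le").
            return Character.toLowerCase(token.charAt(0)) + "e";
        }
        return token;
    }

    /** Convenience helper for trying the splitter on a few tokens. */
    static List<String> split(String... tokens) {
        List<String> out = new ArrayList<>();
        Iterator<String> it = new ElisionSplitter(Arrays.asList(tokens).iterator());
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }
}
```

Calling ElisionSplitter.split("je", "veux", "m'aider") yields the four tokens je, veux, me, aider. A real filter would also have to decide what offsets and position increments to give the two halves, which this sketch ignores.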

This change should not affect English indexing. The only exception I can 
think of is that a word like m'lord would be indexed as "me lord". Still, it 
might be better to make a French package and add this to a French filter.

I hope this is useful to anyone working with French.
All the best.

Konrad

Re: StandardFilter that works for French

Posted by Konrad Scherer <bc...@uottawa.ca>.
>
>There are a number of contractions in English that could be affected if
>you're using the apostrophe as a marker, e.g.: isn't, wouldn't, I'd, he's,
>hasn't.  (Granted, these are often considered stop words.)  Thus, I think
>that your idea of incorporating this change into a French filter, rather
>than modifying Standard filter, is a good idea.

Sorry, I forgot to mention that it only looks at words where the apostrophe 
is the second character, and only for words that start with one of the six 
magic letters m, t, s, l, n, d. If filtering the very English-specific 's and 
'S possessives is good enough for the StandardFilter, then why not French as 
well? In the comments of StandardTokenizer.jj we have "This should be a 
good tokenizer for most European-language documents". Most people will use 
this one, so why not have it work as well as possible? The standard tokenizer 
is very English-centric, and the code I posted was for those who may not be 
aware of that. I work with a lot of bilingual documents (English and French) 
and in my case this filter improves the quality of the index.
More philosophically, there probably shouldn't even be a "standard" 
analyzer, just language-specific ones.
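
To make the rule concrete, here is a small self-contained check in Java. It is only a sketch of the rule as described above, not the posted filter; the class name ElisionRule and the method isCandidate are made up for the example, and the minimum-length test is an added assumption.

```java
/** Hypothetical helper: tests the "apostrophe in second position" rule. */
class ElisionRule {
    static boolean isCandidate(String word) {
        return word.length() > 2                       // assumption: something must follow the apostrophe
            && word.charAt(1) == '\''                  // apostrophe is the second character
            && "mtslnd".indexOf(Character.toLowerCase(word.charAt(0))) >= 0;
    }

    public static void main(String[] args) {
        // The English contractions from the quoted reply, plus two words that match.
        String[] words = {"isn't", "wouldn't", "I'd", "he's", "hasn't", "m'lord", "n'aider"};
        for (String w : words) {
            System.out.println(w + " -> " + isCandidate(w));
        }
    }
}
```

None of the five English contractions match: the apostrophe is either not in second position, or (for I'd) the first letter is not one of the six. Only m'lord and n'aider are picked up.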
All the best

Konrad


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: StandardFilter that works for French

Posted by Joshua O'Madadhain <jm...@ics.uci.edu>.
On Thu, 21 Nov 2002, Konrad Scherer wrote:

> In French you have 6 words (me, te, se, le/la , ne, de) where the e is
> replaced with an apostrophe when the following word starts with a vowel.
> For example me aider becomes m'aider. Currently Lucene indexes m'aider,
> s'aider, n'aider as different words when in fact they should be analyzed as
> me aider, se aider, ne aider, etc. So I modified Standard filter to send
> back these words as two words. I had to add a one Token buffer. I toyed
> with modifying StandardTokenizer.jj but I was worried about unintended
> changes in behavior.
>
> This change will not affect English indexing. The only change I can think
> of is that a word like m'lord would be indexed as "me lord". Still it might
> be better to make a French package and add this to a French Filter.

There are a number of contractions in English that could be affected if
you're using the apostrophe as a marker, e.g.: isn't, wouldn't, I'd, he's,
hasn't.  (Granted, these are often considered stop words.)  Thus, I think
that your idea of incorporating this change into a French filter, rather
than modifying Standard filter, is a good idea.

Joshua O'Madadhain

  jmadden@ics.uci.edu....Obscurium Per Obscurius....www.ics.uci.edu/~jmadden
   Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for.  -- Bill Watterson
 My opinions are too rational and insightful to be those of any organization.



