Posted to java-user@lucene.apache.org by Hugo Lafayette <hu...@temis-group.com> on 2005/10/11 15:22:36 UTC

Bad behaviors of FrenchAnalyzer

Hi there,

I just tested the French analyzer, which works well for the most part
(the stemmer in particular). But at the moment I see two unexpected
behaviors with the default configuration:

- accented characters: The French analyzer keeps accents, which can be
useful but can also get in the way. I just have to add
ISOLatinFilter.java to correct that, but an option to keep them or not
might be useful.

- apostrophe (') characters: The standard analyzer does NOT tokenize on
('), because of words like O'Reilly. But in French, lots of expressions
must be tokenized, like "j'aime" or "l'amour", which each contain
2 tokens ("je" & "aime" and "le" & "amour" respectively). I'm quite
surprised that nobody else has noticed this suspicious behavior before,
so maybe I missed something.
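
To illustrate the accent issue in the first point, here is a minimal
sketch of accent folding that uses only the JDK. It is not the Lucene
filter mentioned above: the names AccentFoldingSketch and foldAccents
are made up for this example, and it assumes that dropping Unicode
combining marks is an acceptable approximation for French text.

```java
import java.text.Normalizer;

final class AccentFoldingSketch {
    // Hypothetical helper: decompose accented letters (NFD), then drop
    // the combining marks that the decomposition exposes.
    static String foldAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file encoding-independent.
        System.out.println(foldAccents("\u00e9l\u00e8ve"));
        System.out.println(foldAccents("aim\u00e9"));
    }
}
```

Whether folding should be on by default is exactly the open question
raised above, since it makes words such as "aime" and "aimé"
indistinguishable.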

Anyway, I don't know how to proceed, since I have to index both English
and French text.

The simple way would be to change the standard analyzer grammar
(basically remove the APOSTROPHE rules) to get 2 tokens. But I'm afraid
of unexpected side effects.

The other way would be to make the French analyzer further tokenize
"j'aime" into 2 sub-tokens (with a token buffer, right?). Is that the
right thing to do? Is this a bug that will be fixed soon? Is there
another way around it?

Thanks in advance for your answers, and congrats on your delightful
software!


PS: I'm working with "lucene-1.9-rc1-dev" version from the svn repository.

-- 
Hugo

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Bad behaviors of FrenchAnalyzer

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Oct 11, 2005, at 10:52 AM, Hugo Lafayette wrote:
> Erik Hatcher wrote:
>
>
>> Rather than changing StandardAnalyzer, you could create a custom
>> Analyzer that is something along the lines of StandardTokenizer  ->
>> custom apostrophe splitting filter -> ISOLatinFilter.
>>
>
> Why not include that in the FrenchStemFilter "next()" method itself?
> Would that be bad design?

I've not personally used the FrenchStemFilter, so I cannot comment on  
its behavior at all.  I'm out of my league in that realm.

> And I'm quite concerned about performance, but it seems to me that
> your solution will only affect "APOSTROPHE"-typed tokens, so the
> overhead will be nonexistent, right?

There is little need to be concerned with analyzer performance, at
least at this stage.  First have a problem, then optimize for it.  I
don't speculate about performance.  But yes, only the apostrophe type
(whatever that is, I'm not looking at the code now, but I think it's
"<APOSTROPHE>", with angle brackets) would need to be caught and
split; the rest could pass straight through.  Again, look at
StandardFilter for an example - it strips the trailing 's from
tokens of that type.

>> You get a special type for words with interior apostrophes from
>> StandardTokenizer (look at StandardFilter to see how that works). You
>> could create a simple TokenFilter that splits apostrophe'd tokens
>> into two.
>>
>
> I'm not sure how to do that efficiently. Is it something like
> this?
>
> <code>
>
> private Stack subTokens; // previously initialized, never null
>
> public final Token next() throws IOException {
>   // Emit any buffered sub-token before reading from the wrapped stream.
>   if (!subTokens.empty()) {
>     return (Token) subTokens.pop();
>   }
>   Token t = input.next();
>   // equals() compares contents; == only compares references.
>   if (t != null && APOSTROPHE_TYPE.equals(t.type())) {
>     // Pieces must be pushed in reverse order: pop() returns the last push.
>     tokenizeApostrophe(t, subTokens);
>     if (!subTokens.empty()) {
>       t = (Token) subTokens.pop(); // first piece, not the unsplit token
>     }
>   }
>   return t;
> }
>
> </code>
>
> where "tokenizeApostrophe(Token, Stack)" conditionally splits the
> token into 2 others and pushes them on the stack.

Using a stack (or only a single spare Token if you will only split
into two pieces) is a good approach.  I haven't tried your code, but
I recommend writing some unit tests that exercise your filter
separately and ensure it splits tokens as you expect.  :)
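
The single-spare-token variant can be sketched without any Lucene
classes, which also makes it easy to unit test in isolation. This is
only an illustration under stated assumptions: ApostropheSplitter and
splitAll are hypothetical names, plain strings stand in for Token
objects, and every token with an interior apostrophe is split (a real
filter would check the token type first).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a TokenFilter: wraps an upstream token source
// and splits each token at its first interior apostrophe.
final class ApostropheSplitter implements Iterator<String> {
    private final Iterator<String> input;
    private String spare; // second half of the last split token, if any

    ApostropheSplitter(Iterator<String> input) {
        this.input = input;
    }

    public boolean hasNext() {
        return spare != null || input.hasNext();
    }

    public String next() {
        if (spare != null) {          // emit the buffered half first
            String t = spare;
            spare = null;
            return t;
        }
        String t = input.next();
        int i = t.indexOf('\'');
        if (i > 0 && i < t.length() - 1) {   // interior apostrophe only
            spare = t.substring(i + 1);
            return t.substring(0, i);
        }
        return t;                     // everything else passes straight through
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }

    // Convenience for tests: run the splitter over a fixed token list.
    static List<String> splitAll(String... tokens) {
        List<String> out = new ArrayList<String>();
        Iterator<String> it = new ApostropheSplitter(Arrays.asList(tokens).iterator());
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }
}
```

Note that this sketch would also split "O'Reilly" into "O" and
"Reilly", which is the side effect to watch for when indexing English
and French text together.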

     Erik




Re: Bad behaviors of FrenchAnalyzer

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Oct 11, 2005, at 10:04 AM, Hugo Lafayette wrote:

> First of all, maybe I'm making a false assumption here, but if you
> strip leading "j'", "t'" and so on, that means that if you run a
> search like:
>
>  +text:"il m'aime"
>
> you will get documents with the sentence "il m'aime" (French for "he
> loves me") and documents with the sentence "il t'aime" (French for "he
> loves you"), which is wrong, right?

I don't speak French, and I can't tell you whether  
Lingua::Stem::Snowball strips m' and t' -- the docs say "This method  
strips 's (english) and l', d', ... (french)."

That's a compelling example you have there, though, so I would hope  
not.  Conceptually, I would want the search to focus on the  
relatively rare word for "love" rather than on the pronouns.   
However, if the stemmer strips the pronouns, "m'aime" and "t'aime"  
would be conflated, which is as you say, "wrong". :)  Is "aime" ever  
used in isolation, or is it always hitched to a pronoun?

> So if this is correct, this is why I need to index both "m" and "aime"
> as distinct tokens.
>
> And I guess this is why "O'Reilly" is not split by the
> StandardAnalyzer, since you don't want to find the documents
> containing "N'Reilly".

Actually, the reason is that you wouldn't want to conflate searches  
for "Reilly" and "O'Reilly".  Further processing of a token falls  
under the rubric of stemming.

> More generally, I am a native French speaker, but I'm not sure there
> are cases where a string with an apostrophe has to be split into two
> (real) searchable tokens. I know the word "aujourd'hui" (French for
> "today"), but it is really a complete word in itself, which does not
> need to be split again.

So you wouldn't need a search for "aujourd" or "hui" to turn up  
documents which contain "aujourd'hui"?  Very good.

But then, what about "t'aime"?  If a search for "aime" should match  
documents which contain "t'aime", then that's our problematic  
example.  You wouldn't care about searching for a pronoun -- EXCEPT  
when trying to match a phrase. If that's the case, then the  
StandardTokenizer may in fact be inadequate for French -- "t'aime"  
should be broken up into two tokens: "t" and "aime".
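
The difference between the two strategies for phrase queries can be
shown with a small self-contained sketch. It is a toy model, not how
Lucene evaluates phrases: ElisionPhraseSketch, tokenize and
phraseMatches are hypothetical names, tokens are split on whitespace,
and a phrase "matches" when its tokens appear consecutively in the
document.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ElisionPhraseSketch {
    // keepPrefix = false models a stemmer that strips "m'", "t'", ...
    // keepPrefix = true models a tokenizer that emits both halves.
    static List<String> tokenize(String text, boolean keepPrefix) {
        List<String> out = new ArrayList<String>();
        for (String word : text.toLowerCase().split("\\s+")) {
            int i = word.indexOf('\'');
            if (i > 0 && i < word.length() - 1) {
                if (keepPrefix) {
                    out.add(word.substring(0, i));
                }
                out.add(word.substring(i + 1));
            } else {
                out.add(word);
            }
        }
        return out;
    }

    // Toy phrase match: the query tokens must occur as a contiguous run.
    static boolean phraseMatches(List<String> doc, List<String> query) {
        return Collections.indexOfSubList(doc, query) >= 0;
    }
}
```

With stripping, the query "il m'aime" and the document "il t'aime"
both reduce to the tokens "il", "aime", so the phrase matches; with
splitting, the "m" and "t" tokens keep the two sentences apart while
a plain search for "aime" still finds both.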

> If this is important to you, I could look further, and ask some French
> linguists for help.

I'm asking because a new version of my own search engine library has  
a default tokenizer which keeps apostrophic strings together (like  
StandardTokenizer), and I want to be aware of cases where this choice  
causes problems.  However, it's unlikely I'll change that behavior,  
as the problem is addressed by making it trivially easy to customize  
the tokenizer.  So I would say that for my own purposes, consulting a  
linguist is probably overkill.

Cheers,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: Bad behaviors of FrenchAnalyzer

Posted by Hugo Lafayette <hu...@temis-group.com>.
Marvin Humphrey wrote:

> I'm curious: are there any cases in French where a string with an  
> apostrophe in it ought to be split into two searchable tokens?  I  
> know of no such cases in English: you never want to search for the ll  
> in you'll, or the O in O'Reilly, etc.

First of all, maybe I'm making a false assumption here, but if you strip
leading "j'", "t'" and so on, that means that if you run a search like:

 +text:"il m'aime"

you will get documents with the sentence "il m'aime" (French for "he
loves me") and documents with the sentence "il t'aime" (French for "he
loves you"), which is wrong, right?

So if this is correct, this is why I need to index both "m" and "aime"
as distinct tokens.

And I guess this is why "O'Reilly" is not split by the
StandardAnalyzer, since you don't want to find the documents containing
"N'Reilly".

More generally, I am a native French speaker, but I'm not sure there
are cases where a string with an apostrophe has to be split into two
(real) searchable tokens. I know the word "aujourd'hui" (French for
"today"), but it is really a complete word in itself, which does not
need to be split again.
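
A conditional split that leaves words like "aujourd'hui" intact could
look like the sketch below. The exception list is purely illustrative
and contains only the one word named above; ConditionalSplitSketch,
KEEP_WHOLE and split are hypothetical names, and a usable list would
need input from someone who knows the language well.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ConditionalSplitSketch {
    // Illustrative exception list: words that contain an apostrophe
    // but should stay a single searchable token.
    static final Set<String> KEEP_WHOLE =
            new HashSet<String>(Arrays.asList("aujourd'hui"));

    static List<String> split(String token) {
        int i = token.indexOf('\'');
        boolean interior = i > 0 && i < token.length() - 1;
        if (!interior || KEEP_WHOLE.contains(token.toLowerCase())) {
            return Collections.singletonList(token);
        }
        return Arrays.asList(token.substring(0, i), token.substring(i + 1));
    }
}
```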

If this is important to you, I could look further, and ask some French
linguists for help.

-- 
Hugo



Re: Bad behaviors of FrenchAnalyzer

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Oct 11, 2005, at 7:52 AM, Hugo Lafayette wrote:

> Why not include that in the FrenchStemFilter "next()" method itself?
> Would that be bad design?

I agree with your assessment.  Conceptually, this is a stemming  
problem.  By extension, it's not a tokenizing problem, and the  
behavior of the StandardTokenizer is fine.

The most recent Perl/CPAN version of the Snowball stemming library  
added an option to strip leading l', d', etc in French.  I know this  
because until the most recent version, it didn't strip trailing 's in  
English, either, and I had to write some workarounds, just like  
you'll have to.

http://search.cpan.org/search?query=snowball

I'm curious: are there any cases in French where a string with an  
apostrophe in it ought to be split into two searchable tokens?  I  
know of no such cases in English: you never want to search for the ll  
in you'll, or the O in O'Reilly, etc.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: Bad behaviors of FrenchAnalyzer

Posted by Hugo Lafayette <hu...@temis-group.com>.
Erik Hatcher wrote:

> Rather than changing StandardAnalyzer, you could create a custom  
> Analyzer that is something along the lines of StandardTokenizer  ->  
> custom apostrophe splitting filter -> ISOLatinFilter. 

Why not include that in the FrenchStemFilter "next()" method itself?
Would that be bad design?

And I'm quite concerned about performance, but it seems to me that
your solution will only affect "APOSTROPHE"-typed tokens, so the
overhead will be nonexistent, right?

> You get a special type for words with interior apostrophes from 
> StandardTokenizer (look at StandardFilter to see how that works). You
> could create a simple TokenFilter that splits apostrophe'd tokens 
> into two.

I'm not sure how to do that efficiently. Is it something like
this?

<code>

private Stack subTokens; // previously initialized, never null

public final Token next() throws IOException {
  // Emit any buffered sub-token before reading from the wrapped stream.
  if (!subTokens.empty()) {
    return (Token) subTokens.pop();
  }
  Token t = input.next();
  // equals() compares contents; == only compares references.
  if (t != null && APOSTROPHE_TYPE.equals(t.type())) {
    // Pieces must be pushed in reverse order: pop() returns the last push.
    tokenizeApostrophe(t, subTokens);
    if (!subTokens.empty()) {
      t = (Token) subTokens.pop(); // first piece, not the unsplit token
    }
  }
  return t;
}

</code>

where "tokenizeApostrophe(Token, Stack)" conditionally splits the
token into 2 others and pushes them on the stack.

> Maybe it's simple enough also to expand "j" and "l" into "je" and
> "le" in the same step too?

It will be simple, but I'm not sure yet I want to expand them back.
Maybe it will be useful to index the "j" token after all.

Anyway thanks for your quick answer,

-- 
Hugo



Re: Bad behaviors of FrenchAnalyzer

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Oct 11, 2005, at 9:22 AM, Hugo Lafayette wrote:
> - accented characters: The French analyzer keeps accents, which can
> be useful but can also get in the way. I just have to add
> ISOLatinFilter.java to correct that, but an option to keep them or
> not might be useful.
>
> - apostrophe (') characters: The standard analyzer does NOT tokenize
> on ('), because of words like O'Reilly. But in French, lots of
> expressions must be tokenized, like "j'aime" or "l'amour", which each
> contain 2 tokens ("je" & "aime" and "le" & "amour" respectively). I'm
> quite surprised that nobody else has noticed this suspicious behavior
> before, so maybe I missed something.

Rather than changing StandardAnalyzer, you could create a custom  
Analyzer that is something along the lines of StandardTokenizer  ->  
custom apostrophe splitting filter -> ISOLatinFilter.  You get a  
special type for words with interior apostrophes from  
StandardTokenizer (look at StandardFilter to see how that works).   
You could create a simple TokenFilter that splits apostrophe'd tokens  
into two.  Maybe it's simple enough also to expand "j" and "l" into  
"je" and "le" in the same step too?
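
Expanding the elided prefix is a simple table lookup, sketched here
under loud assumptions: ElisionExpansionSketch, EXPANSIONS and expand
are hypothetical names, and the table holds only the two pairs
mentioned above. The mapping is lossy, since "l'" can stand for "la"
as well as "le", so expanding may not always be what you want.

```java
import java.util.HashMap;
import java.util.Map;

final class ElisionExpansionSketch {
    // Only the two pairs from the discussion; not a complete list.
    static final Map<String, String> EXPANSIONS = new HashMap<String, String>();
    static {
        EXPANSIONS.put("j", "je");
        EXPANSIONS.put("l", "le"); // lossy: "l'" may also stand for "la"
    }

    // Returns the expanded form of an elided prefix, or the input
    // unchanged when no expansion is known.
    static String expand(String prefix) {
        String full = EXPANSIONS.get(prefix.toLowerCase());
        return full != null ? full : prefix;
    }
}
```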

     Erik

