Posted to java-user@lucene.apache.org by Christophe from paris <zl...@yahoo.fr> on 2008/08/06 12:35:33 UTC

search with accent not match

Hello

I use FrenchAnalyzer for indexing:

// 'true' creates a new index, overwriting any existing one
IndexWriter writer = new IndexWriter(pathOfIndex, new FrenchAnalyzer(), true);
Document doc = new Document();
doc.add(new Field("TXT_CHARACT_VALUE", word.toLowerCase(), Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(doc);

And for searching:

IndexReader reader = IndexReader.open(pathOfIndex);
Searcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new FrenchAnalyzer();
QueryParser parser = new QueryParser(field, analyzer);
// motRecherche holds the search term
Query query = parser.parse(motRecherche);
Hits hits = searcher.search(query);

My documents contain the words "lumiere" and "lumière".

When I search for "lumière", only documents containing "lumière" match;
"lumiere" is not returned.

If I search for "lumiere", the results are lumiere, lumieres, lumiére and
lumiéres, but not lumière.

To match everything I have to search for "lumiere OR lumière",
but that is not the best solution.
-- 
View this message in context: http://www.nabble.com/search-with-accent-not-match-tp18848522p18848522.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: search with accent not match

Posted by Christophe from paris <zl...@yahoo.fr>.

Yes markrmiller, the order is important, so now I have:

TokenStream result = new StandardTokenizer(reader);
result = new StandardFilter(result);
result = new StopFilter(result, stoptable);
result = new ISOLatin1AccentFilter(result);
result = new FrenchStemFilter(result, excltable);
result = new LowerCaseFilter(result);

And finally, with ISOLatin1AccentFilter, the results are good :)

Thank you.

Now on to the Polish search ^^



Re: search with accent not match

Posted by Mark Miller <ma...@gmail.com>.

You certainly can: just create your own Analyzer, starting with a copy
of the French one you are using.

Then plug in the filter at the point where you want it applied:

result = new ISOLatin1AccentFilter(result);

You have to decide for yourself where it goes. If you put it before the
stop word step, more stop words might be removed than if it came after;
that kind of thing usually comes down to individual requirements and
filter limitations. Conversely, if your stop word list contains
diacritics and you run the accent filter before applying the list, some
expected stop words will never be removed, and so on.
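
To illustrate the ordering point, here is a minimal sketch in plain Java.
It does not use the Lucene filters themselves: the class name
FilterOrderSketch, the two-word stop list and the fold helper (which uses
java.text.Normalizer as a stand-in for ISOLatin1AccentFilter) are all
assumptions made for illustration. String literals use Unicode escapes so
that the source does not depend on the file encoding.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FilterOrderSketch {
    // Hypothetical stop list: plain "ou" and accented "\u00e0" (a with grave).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("ou", "\u00e0"));

    // Stand-in for an accent filter: decompose, then drop combining marks.
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "");
    }

    // Keep only the tokens that are not in the stop list.
    static List<String> removeStops(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Fold every token in the list.
    static List<String> foldAll(List<String> tokens) {
        List<String> folded = new ArrayList<>();
        for (String t : tokens) {
            folded.add(fold(t));
        }
        return folded;
    }

    static List<String> stopThenFold(List<String> tokens) {
        return foldAll(removeStops(tokens));
    }

    static List<String> foldThenStop(List<String> tokens) {
        return removeStops(foldAll(tokens));
    }

    public static void main(String[] args) {
        // Tokens: a-grave, "ou" with u-grave, "lumiere" with e-grave.
        List<String> tokens =
                Arrays.asList("\u00e0", "o\u00f9", "lumi\u00e8re");
        System.out.println("stop then fold: " + stopThenFold(tokens));
        System.out.println("fold then stop: " + foldThenStop(tokens));
    }
}
```

With this hypothetical stop list, removing stop words first gives
[ou, lumiere]: the accented stop word is removed as expected, but the
folded "ou" survives. Folding first gives [a, lumiere]: the folded "ou" is
now removed as an extra stop word, while the accented stop word is never
matched and leaks through as "a".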



Re: search with accent not match

Posted by Christophe from paris <zl...@yahoo.fr>.

Currently, my FrenchAnalyzer has:

TokenStream result = new StandardTokenizer(reader);
result = new StandardFilter(result);
result = new StopFilter(result, stoptable);
result = new FrenchStemFilter(result, excltable);
result = new LowerCaseFilter(result);


Can I use ISOLatin1AccentFilter in this class for both indexing and searching?
If so, where should it go?



Re: search with accent not match

Posted by Mark Miller <ma...@gmail.com>.

Check out org.apache.lucene.analysis.ISOLatin1AccentFilter

It will strip diacritics - just be sure to use it at index time and 
query time to get what you want. Also, you will no longer be able to 
differentiate between the two in your searching (rarely that important 
in my opinion, but others certainly disagree).

- Mark
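
As a rough illustration of what stripping diacritics means for the words in
this thread, here is a minimal sketch in plain Java. It is not the
implementation of ISOLatin1AccentFilter: the class name AccentFoldingSketch
and the fold helper are hypothetical, and java.text.Normalizer is used as a
stand-in, so the real filter may treat some characters (ligatures, for
example) differently. String literals use Unicode escapes so that the source
does not depend on the file encoding.

```java
import java.text.Normalizer;

class AccentFoldingSketch {
    // Decompose each character (NFD) and drop the combining marks,
    // so that e-grave and e-acute both become a plain "e".
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // "lumiere" unaccented, with e-grave, and with e-acute.
        String[] words = {"lumiere", "lumi\u00e8re", "lumi\u00e9re"};
        for (String w : words) {
            System.out.println(w + " -> " + fold(w));
        }
    }
}
```

All three spellings fold to "lumiere". If the same folding is applied to the
indexed text and to the query, they match each other, and the price is that
a search can no longer tell them apart.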


Re: search with accent not match

Posted by lekamm <ca...@gmail.com>.

http://www.blardone.org/2008/10/12/lucene-query-accented-character/

It is specific to PHP, but the approach can easily be tried on the same
problem in Java.

I had the same problem as "Christophe from paris", and changing the query to
its HTML-encoded equivalent makes my search queries work.

So perhaps Chris could try HTML-encoding his queries that contain accents
and see whether more results are returned.

And sorry if it is a PHP-only solution.




Re: search with accent not match

Posted by Chris Hostetter <ho...@fucit.org>.

: http://www.blardone.org/2008/10/12/lucene-query-accented-character/

that post appears to be specifically about a PHP function to convert UTF-8 
characters to their HTML equivalents ... which doesn't really seem relevant 
to the poster's question ...

: > I'm use FrenchAnalyzer for index 
	...
: > in my document i have the word "lumiere" and "lumière"
: > 
: > when i search lumière only document match lumière but "lumiere" is not
: > return
: > 
: > and if search "lumiere" the result is lumiere, lumieres ,lumiére,lumiéres
: > but not lumière

1) you should take a look at the Luke tool to help make sense of exactly 
what is getting indexed and how your query is getting parsed -- or just 
write a simple java program to look at the tokens produced by your 
analyzer.

2) the FrenchAnalyzer doesn't by default do any accent normalization (so 
i'm not sure why your search for lumiere is even matching lumiére) ... but 
you may want to make your own Analyzer wrapping the FrenchAnalyzer that 
also uses the ISOLatin1AccentFilter to deal with this.

-Hoss

Re: search with accent not match

Posted by lekamm <ca...@gmail.com>.

Does this:

http://www.blardone.org/2008/10/12/lucene-query-accented-character/

solve your problem?

Cheers,

lekamm


