Posted to java-user@lucene.apache.org by hans meiser <fi...@yahoo.de> on 2006/11/06 16:41:55 UTC

whats the correct way to do normalisation?

Hi,
   
  Lucene indexes documents in three languages here
(English, German, and French). I want to normalize certain
characters, e.g. umlauts: ä -> ae
  I did it as follows:
  New Analyzer:
public class SpecialCharsAnalyzer extends StandardAnalyzer {
  public SpecialCharsAnalyzer() {
  }

  public SpecialCharsAnalyzer(Set stopWords) {
    super(stopWords);
  }

  public SpecialCharsAnalyzer(String[] stopWords) {
    super(stopWords);
  }

  public SpecialCharsAnalyzer(File stopwords) throws IOException {
    super(stopwords);
  }

  public SpecialCharsAnalyzer(Reader stopwords) throws IOException {
    super(stopwords);
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = super.tokenStream(fieldName, reader);
    ts = new SpecialCharacterFilter(ts);
    return ts;
  }
}
  Is the SpecialCharsAnalyzer::tokenStream implemented correctly?
  
New Filter:
public class SpecialCharacterFilter extends TokenFilter {
  public SpecialCharacterFilter(TokenStream input) {
    super(input);
  }

  @Override
  public Token next() throws IOException {
    Token t = input.next();
    if (t == null)
      return null;
    String str = t.termText();
    if (str.indexOf("ä") != -1) {
      str = str.replaceAll("ä", "ae");
      t = new Token(str, t.startOffset(), t.endOffset() + 1);
    }
    return t;
  }
}
  Is SpecialCharacterFilter::next implemented correctly,
in particular for the "ä" case?
  
Is this the correct way to do normalisation?
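For reference, the folding itself, independent of the Lucene plumbing, can be sketched in plain Java. The mapping table below is illustrative, not exhaustive; class and method names are made up for the example:

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Map;

public class UmlautNormalizer {
    // German umlauts expand to two letters, so a plain replace is needed.
    private static final Map<String, String> MAP = new LinkedHashMap<String, String>();
    static {
        MAP.put("ä", "ae"); MAP.put("ö", "oe"); MAP.put("ü", "ue");
        MAP.put("Ä", "Ae"); MAP.put("Ö", "Oe"); MAP.put("Ü", "Ue");
        MAP.put("ß", "ss");
    }

    public static String normalize(String s) {
        for (Map.Entry<String, String> e : MAP.entrySet()) {
            s = s.replace(e.getKey(), e.getValue());
        }
        // French accents (é, è, ê, ...) fold to the bare letter:
        // decompose to NFD, then drop the combining marks.
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                         .replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(normalize("Überraschung")); // Ueberraschung
        System.out.println(normalize("café"));         // cafe
    }
}
```

A filter's next() would then only need to call normalize() on the term text.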
  thx

 		

Re: whats the correct way to do normalisation?

Posted by Joe <fi...@yahoo.de>.
Hi,
> : I want "Überraschung" to be found by
> :
> : Überr*
> : Ueberr*
> :
> : So the best I can do is the normalisation manually (not via an
> : analyzer) before indexing/searching?
>
> Or use an Analyzer at index time that puts both the UTF-8 version of the
> string and the Latin-1 version of the string in the same field (at the
> same position so they still work with phrases) and at query time just
> search for the text the user types in as is ... that should work for both
> straight term queries and prefix/wildcard queries that don't get analyzed
> at query time.
>   
Oh yes, that sounds good too.




		

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: whats the correct way to do normalisation?

Posted by Chris Hostetter <ho...@fucit.org>.
: I want "Überraschung" to be found by
:
: Überr*
: Ueberr*
:
: So the best I can do is the normalisation manually (not via an
: analyzer) before indexing/searching?

Or use an Analyzer at index time that puts both the UTF-8 version of the
string and the Latin-1 version of the string in the same field (at the
same position so they still work with phrases) and at query time just
search for the text the user types in as is ... that should work for both
straight term queries and prefix/wildcard queries that don't get analyzed
at query time.
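Modeled without Lucene, the idea looks like this: for each token, the folded variant is emitted at the same position as the original (in a real TokenFilter this would be done with a position increment of 0), so phrase queries still line up. The class and names below are a standalone illustration, not Lucene API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DualVariantExpander {
    // A (term, position) pair standing in for a Lucene token.
    public static final class Tok {
        public final String term;
        public final int position;
        public Tok(String term, int position) { this.term = term; this.position = position; }
    }

    // Emit each original term and, if different, its folded form at the
    // SAME position (position increment 0 in Lucene terms).
    public static List<Tok> expand(List<String> terms) {
        List<Tok> out = new ArrayList<Tok>();
        int pos = 0;
        for (String term : terms) {
            out.add(new Tok(term, pos));
            String folded = term.replace("ä", "ae").replace("ö", "oe")
                                .replace("ü", "ue").replace("ß", "ss");
            if (!folded.equals(term)) {
                out.add(new Tok(folded, pos)); // same position as the original
            }
            pos++;
        }
        return out;
    }

    public static void main(String[] args) {
        for (Tok t : expand(Arrays.asList("über", "nacht"))) {
            System.out.println(t.position + ": " + t.term);
        }
    }
}
```

With both "über" and "ueber" indexed at position 0, the prefix queries Überr* and Ueberr* both match without any query-time analysis.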




-Hoss




Re: whats the correct way to do normalisation?

Posted by Joe <fi...@yahoo.de>.
Hi,
> http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-133cf44dd3dff3680c96c1316a663e881eeac35a
>
> Are Wildcard, Prefix, and Fuzzy queries case sensitive?
>
> Unlike other types of Lucene queries, Wildcard, Prefix, and Fuzzy queries
> are not passed through the Analyzer, which is the component that performs
> operations such as stemming and lowercasing

Ok, thx

I want "Überraschung" to be found by

Überr*
Ueberr*

So the best I can do is the normalisation manually (not via an
analyzer) before indexing/searching?


		



Re: whats the correct way to do normalisation?

Posted by Chris Hostetter <ho...@fucit.org>.
http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-133cf44dd3dff3680c96c1316a663e881eeac35a

Are Wildcard, Prefix, and Fuzzy queries case sensitive?

Unlike other types of Lucene queries, Wildcard, Prefix, and Fuzzy queries
are not passed through the Analyzer, which is the component that performs
operations such as stemming and lowercasing.

The reason for skipping the Analyzer is that if you were searching for
"dogs*" you would not want "dogs" first stemmed to "dog", since that would
then match "dog*", which is not the intended query.


: Date: Tue, 7 Nov 2006 12:41:58 +0100 (CET)
: From: hans meiser <fi...@yahoo.de>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Re: whats the correct way to do normalisation?
:
: Hi,
:
: On Nov 6, 2006, at 11:27 AM, hans meiser wrote:
: >> public final Token next() throws java.io.IOException {
: >>   final Token t = input.next();
: >>   if (t == null)
: >>     return null;
: >>   return new Token(removeAccents(t.termText()), t.startOffset(),
: >>                    t.endOffset(), t.type());
: >> }
: >>
:
: > For highlighting purposes, it's best to keep the offsets in the
: > original text, not adjusted for token mutation.
:
:   OK, I corrected it.
:
:   For a "normal" search without a "*" it works now. But when I do a
:   search with "*" or "?", my newly implemented filter is not called, so
:   for example my umlauts are not replaced by the analyzer (filter).
:
:   I do a:
:   Analyzer analyzer = new SpecialCharsAnalyzer();
:   QueryParser parser = new QueryParser(DocumentFields.TEXT, analyzer);
:   query = parser.parse(queryStr);
:
:   For wildcards the tokenStream method of my analyzer isn't called.
:   What am I doing wrong?
:
:



-Hoss




Re: whats the correct way to do normalisation?

Posted by Daniel Naber <lu...@danielnaber.de>.
On Tuesday 07 November 2006 12:41, hans meiser wrote:

>   For a  "normal" search without a "*" it works now. But when i do a
>   search with an "*" or a "?" my newly implemented filter is not called
> and for example my umlauts are not replaced by the analyzer(filter).

See
http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-133cf44dd3dff3680c96c1316a663e881eeac35a

-- 
http://www.danielnaber.de



Re: whats the correct way to do normalisation?

Posted by hans meiser <fi...@yahoo.de>.
Hi, 
  
On Nov 6, 2006, at 11:27 AM, hans meiser wrote:
>> public final Token next() throws java.io.IOException {
>>   final Token t = input.next();
>>   if (t == null)
>>     return null;
>>   return new Token(removeAccents(t.termText()), t.startOffset(),
>>                    t.endOffset(), t.type());
>> }
>>

> For highlighting purposes, it's best to keep the offsets in the 
> original text, not adjusted for token mutation.
   
  OK, I corrected it.
   
  For a "normal" search without a "*" it works now. But when I do a
  search with "*" or "?", my newly implemented filter is not called, so
  for example my umlauts are not replaced by the analyzer (filter).
   
  I do a:
  Analyzer analyzer = new SpecialCharsAnalyzer();
  QueryParser parser = new QueryParser(DocumentFields.TEXT, analyzer);
  query = parser.parse(queryStr);
   
  For wildcards the tokenStream method of my analyzer isn't called.
  What am I doing wrong?

 		

Re: whats the correct way to do normalisation?

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Nov 6, 2006, at 11:27 AM, hans meiser wrote:

> Hi,
>
>> Did you take a look at IsoLatin1AccentFilter ?
>
>   It does nearly what I need, but not quite.
>
>    public final Token next() throws java.io.IOException {
>  final Token t = input.next();
>    if (t == null)
>    return null;
>  return new Token(removeAccents(t.termText()), t.startOffset(),  
> t.endOffset(), t.type());
>  }
>
>   Here a new Token is also created. My question: why is the end offset
> not corrected for the newly created token? Sometimes the new token is
> longer than before.
>   Complete code link:
>   http://developer.spikesource.com/spikewatch.logs/fedora-3-i386/2221/lucene/reports/clover/org/apache/lucene/analysis/ISOLatin1AccentFilter.html

For highlighting purposes, it's best to keep the offsets in the  
original text, not adjusted for token mutation.
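This can be illustrated without Lucene: the filter may rewrite the term text, but the offsets keep pointing into the original input, so a highlighter slices the original string correctly even when the folded term has a different length. The class and names below are made up for the example:

```java
public class OffsetDemo {
    // A token: possibly rewritten term text plus offsets into the ORIGINAL input.
    static final class Tok {
        final String term;
        final int start;
        final int end;
        Tok(String term, int start, int end) { this.term = term; this.start = start; this.end = end; }
    }

    // Fold "ä" -> "ae" in the term text but leave the offsets untouched,
    // mirroring what ISOLatin1AccentFilter does.
    static Tok fold(String text, int start, int end) {
        String term = text.substring(start, end).replace("ä", "ae");
        return new Tok(term, start, end);
    }

    public static void main(String[] args) {
        String text = "der Bär schläft";
        Tok t = fold(text, 4, 7); // the token "Bär"
        System.out.println(t.term);                         // Baer: what gets indexed
        System.out.println(text.substring(t.start, t.end)); // Bär: what gets highlighted
    }
}
```

If the end offset were "corrected" to the folded length, the highlighter would mark "Bär " including the trailing space, i.e. the wrong span of the original text.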

	Erik




Re: whats the correct way to do normalisation?

Posted by hans meiser <fi...@yahoo.de>.
Hi,
   
  > Did you take a look at IsoLatin1AccentFilter ?
   
  It does nearly what I need, but not quite.
   
 public final Token next() throws java.io.IOException {
   final Token t = input.next();
   if (t == null)
     return null;
   return new Token(removeAccents(t.termText()), t.startOffset(), t.endOffset(), t.type());
 }
   
  Here a new Token is also created. My question: why is the end offset not
  corrected for the newly created token? Sometimes the new token is longer than before.
  Complete code link:
  http://developer.spikesource.com/spikewatch.logs/fedora-3-i386/2221/lucene/reports/clover/org/apache/lucene/analysis/ISOLatin1AccentFilter.html
  


 

 		

Re: whats the correct way to do normalisation?

Posted by Patrick Turcotte <pa...@gmail.com>.
Hi,

Did you take a look at IsoLatin1AccentFilter ?

Patrick

On 11/6/06, hans meiser <fi...@yahoo.de> wrote:
> [...]