Posted to java-user@lucene.apache.org by KK <di...@gmail.com> on 2009/06/08 11:21:41 UTC

How to make WordDelimiterFilter [pulled from a Solr nightly] not break
non-English words in the wrong places during Lucene indexing/searching?

Hi All,
I'm trying to index some Indian web pages whose content is basically a mix
of an Indian language and, say, 5% English in the same page. I cannot use
the standard or simple analyzer for this, as they break the non-English
words in the wrong places [because isLetter(ch) happens to be false for
some characters even when they are part of a word]. So I wrote/extended an
analyzer that does the following:
import java.io.Reader;
import org.apache.lucene.analysis.*;

public class IndicAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // Tokenize on whitespace only, so no Indic character is ever treated
    // as a break point.
    TokenStream ts = new WhitespaceTokenizer(reader);
    //ts = new WordDelimiterFilter(ts, 1, 1, 1, 1, 0); // from a Solr nightly
    ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
    ts = new LowerCaseFilter(ts);
    ts = new PorterStemFilter(ts); // English stemmer; leaves most non-English tokens as-is
    return ts;
  }
}
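
To see what this chain actually emits, I dump tokens like this (a minimal
sketch against the Lucene 2.4-era TokenStream API; the field name "content"
is just a placeholder):

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class TokenDump {
  public static void main(String[] args) throws Exception {
    // A Hindi word ("namaste", written with escapes here) followed by an
    // e-mail address, i.e. the kind of mixed content I am indexing.
    TokenStream ts = new IndicAnalyzer().tokenStream("content",
        new StringReader("\u0928\u092E\u0938\u094D\u0924\u0947 hello@how.com"));
    Token token;
    while ((token = ts.next()) != null) {
      System.out.println(new String(token.termBuffer(), 0, token.termLength()));
    }
    // With the WordDelimiterFilter line commented out, the Hindi word stays
    // intact, and the e-mail address comes out as one single token.
  }
}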
This works fine to some extent as long as the line above stays commented
out, but then it cannot find results when the document contains a string
like "hello@how.com" and the query is "hello". That is expected, as the
chain above does no word delimiting around characters like @ . , etc.
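
That delimiting is exactly what I want WordDelimiterFilter for. Here is a
minimal sketch of the behaviour I am after (assuming the nightly build's
org.apache.solr.analysis.WordDelimiterFilter and, as I read the Solr
source, the flag order generateWordParts, generateNumberParts,
catenateWords, catenateNumbers, catenateAll):

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.solr.analysis.WordDelimiterFilter;

public class DelimiterDump {
  public static void main(String[] args) throws Exception {
    TokenStream ts = new WhitespaceTokenizer(new StringReader("hello@how.com"));
    // Same flags as in the commented-out line above.
    ts = new WordDelimiterFilter(ts, 1, 1, 1, 1, 0);
    Token token;
    while ((token = ts.next()) != null) {
      System.out.println(new String(token.termBuffer(), 0, token.termLength()));
    }
    // I expect the word parts "hello", "how", "com" (plus a catenated
    // form, since catenateWords is 1), so a query for "hello" would match.
  }
}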
Now the problem: when I enable the WordDelimiterFilter [the commented-out
line; I got this filter from Solr], it breaks Hindi words at characters
that are actually part of the word. Going through the filter's code, I
found it relies on Java's standard Character.isLetter(), which returns
false for those Hindi characters. The Javadoc says isLetter() is Unicode
compliant, but as far as I can tell that only means it follows the Unicode
general categories: Devanagari dependent vowel signs (matras) are combining
marks (category Mn), not letters, so isLetter() returns false for them and
the filter treats them as delimiters even though they belong to the word.
I'm stuck and don't know how to get rid of this problem. Because of it,
when I search for a Hindi word, say "helo" [assume it's Hindi], the
highlighter highlights the word, but along with that it also highlights
the letters h/e/l/o of this word wherever it finds them, which it should
not do, right?
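
A small check seems to confirm the suspicion; the dependent vowel sign is
a combining mark, not a letter (plain JDK, no Lucene involved):

public class IsLetterCheck {
  public static void main(String[] args) {
    char ka = '\u0915';         // DEVANAGARI LETTER KA
    char vowelSignI = '\u093F'; // DEVANAGARI VOWEL SIGN I, a matra
    System.out.println(Character.isLetter(ka));          // true
    System.out.println(Character.isLetter(vowelSignI));  // false
    // The vowel sign's general category is Mn (non-spacing mark):
    System.out.println(
        Character.getType(vowelSignI) == Character.NON_SPACING_MARK); // true
  }
}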
I request both Solr and Lucene users to guide me in fixing this issue.
BTW, do we need to do some sort of normalization of the content before
sending it to the Lucene indexer? Just a thought; I don't know what the
way out is.
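
On that normalization thought: what I have in mind is something like NFC
normalization of the raw content before indexing (a sketch assuming Java
6's java.text.Normalizer; I don't know yet whether this alone would fix
the tokenization):

import java.text.Normalizer;

public class NormalizeBeforeIndex {
  public static void main(String[] args) {
    // The same Hindi letter typed two ways: precomposed QA vs. KA + NUKTA.
    String precomposed = "\u0958";
    String typed = "\u0915\u093C";
    System.out.println(precomposed.equals(typed)); // false: same text, different code points
    String a = Normalizer.normalize(precomposed, Normalizer.Form.NFC);
    String b = Normalizer.normalize(typed, Normalizer.Form.NFC);
    System.out.println(a.equals(b)); // true: both normalize to the same sequence
  }
}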