You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Harini Raghavan <ha...@insideview.com> on 2007/08/31 19:40:16 UTC

Query Analyzer Issue

Hi Everyone,

I am facing some strange behaviour with Analyzers. I am using SimpleAnalyzer
for some fields in my Compass entity, but I also wrote a custom Analyzer
that is slightly different from the SimpleAnalyzer as I wanted to allow even
letters and digits in company name column.
So custom analyzer CompanyNameAnalyzer uses the following tokenizer.

public class CompanyNameTokenizer extends CharTokenizer {
  /** Construct a new CompanyNameTokenizer. */
  public CompanyNameTokenizer(Reader in) {
    super(in);
  }

  protected char normalize(char c) {
    return Character.toLowerCase(c);
  }

  /** Collects only characters which are numbers or letters */
  protected boolean isTokenChar(char c) {
    Character ch = new Character(c);
    // Exclude @ special character while tokenizing
    if(Character.isLetterOrDigit(c) || ch.equals('@'))
        return true;
    else
        return false;
  }
}

But for some reason when I search for +companyName:good +companyName:will,
the word 'will' is ignored in the search, I get results that match only
good. I guess this means that 'will' is being stripped off as it is a stop
word. I don't understand why this should happen because the custom Analyzer
I am using does not use the StopFilter. So why should it filter the stop
words?

I tried looking at the Lucene source too, but no luck. Any suggestions would
be appreciated.

Thanks,
Harini

Re: Query Analyzer Issue

Posted by Kalvir Sandhu <ka...@kalv.co.uk>.

You should test your Analyzer to confirm what tokens are being produced. You
can do this by using a helper class, to save time there is one written in
the Lucene in Action book called AnalyzerUtils, you should be able to get it
out of the download of sourcecode from the book here:
http://www.lucenebook.com/LuceneInAction.zip

It's very easy to use. You give it an analyzer and some text and out will
pop the tokens.

You can also use Luke:
http://www.getopt.org/luke/ - To diagnose your index. Very useful tool. You
can see what terms you have stored against what document.

On 8/31/07, Harini Raghavan <ha...@insideview.com> wrote:
>
> Hi Everyone,
>
> I am facing some strange behaviour with Analyzers. I am using
> SimpleAnalyzer
> for some fields in my Compass entity, but I also wrote a custom Analyzer
> that is slightly different from the SimpleAnalyzer as I wanted to allow
> even
> letters and digits in company name column.
> So custom analyzer CompanyNameAnalyzer uses the following tokenizer.
>
> public class CompanyNameTokenizer extends CharTokenizer {
>   /** Construct a new CompanyNameTokenizer. */
>   public CompanyNameTokenizer(Reader in) {
>     super(in);
>   }
>
>   protected char normalize(char c) {
>     return Character.toLowerCase(c);
>   }
>
>   /** Collects only characters which are numbers or letters */
>   protected boolean isTokenChar(char c) {
>     Character ch = new Character(c);
>     // Exclude @ special character while tokenizing
>     if(Character.isLetterOrDigit(c) || ch.equals('@'))
>         return true;
>     else
>         return false;
>   }
> }
>
> But for some reason when I search for +companyName:good +companyName:will,
> the word 'will' is ignored in the search, I get results that match only
> good. I guess this means that 'will' is being stripped off as it is a stop
> word. I don't understand why this should happen because the custom
> Analyzer
> I am using does not use the StopFilter. So why should it filter the stop
> words?
>
> I tried looking at the Lucene source too, but no luck. Any suggestions
> would
> be appreciated.
>
> Thanks,
> Harini
>

Re: Query Analyzer Issue

Posted by sandeep chawla <sa...@gmail.com>.

Which analyzer are you using in your query parser ?

can you share the one line of code in which you construct a QueryParser Object.

As you might be  parsing  a  query string made of different fields, I
suggest you use a

PerFieldAnalyzerWrapper  which lets you do unique analysis for different fields.


~sandeep


On 31/08/2007, Harini Raghavan <ha...@insideview.com> wrote:
> Hi Everyone,
>
> I am facing some strange behaviour with Analyzers. I am using SimpleAnalyzer
> for some fields in my Compass entity, but I also wrote a custom Analyzer
> that is slightly different from the SimpleAnalyzer as I wanted to allow even
> letters and digits in company name column.
> So custom analyzer CompanyNameAnalyzer uses the following tokenizer.
>
> public class CompanyNameTokenizer extends CharTokenizer {
>   /** Construct a new CompanyNameTokenizer. */
>   public CompanyNameTokenizer(Reader in) {
>     super(in);
>   }
>
>   protected char normalize(char c) {
>     return Character.toLowerCase(c);
>   }
>
>   /** Collects only characters which are numbers or letters */
>   protected boolean isTokenChar(char c) {
>     Character ch = new Character(c);
>     // Exclude @ special character while tokenizing
>     if(Character.isLetterOrDigit(c) || ch.equals('@'))
>         return true;
>     else
>         return false;
>   }
> }
>
> But for some reason when I search for +companyName:good +companyName:will,
> the word 'will' is ignored in the search, I get results that match only
> good. I guess this means that 'will' is being stripped off as it is a stop
> word. I don't understand why this should happen because the custom Analyzer
> I am using does not use the StopFilter. So why should it filter the stop
> words?
>
> I tried looking at the Lucene source too, but no luck. Any suggestions would
> be appreciated.
>
> Thanks,
> Harini
>


-- 
SANDEEP CHAWLA
House No- 23                     			
10th main 					
BTM 1st  Stage     					
Bangalore						Mobile: 91-9986150603

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org