Posted to java-user@lucene.apache.org by Mufaddal Khumri <MK...@allegromedical.com> on 2006/02/23 07:10:17 UTC

hyphen not being removed by standard filter

Hi,

I might be missing something. I have a custom analyzer, the gist of which is:

	public TokenStream tokenStream(String fieldName, Reader reader)
	{
		TokenStream result = new StandardTokenizer(reader);
		result = new StandardFilter(result);
		result = new LowerCaseFilter(result);
		result = new StopFilter(result, stopSet);
		result = new PorterStemFilter(result);
		return result;
	}

I test the analyzer above with the following query string:
the is EOS-20D canon amazing

In my test code I do this to see what my analyzed query string looks like:

		PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardStemmingAnalyzer());
		analyzer.addAnalyzer("categoryNames", new KeywordAnalyzer());

		TokenStream stream = analyzer.tokenStream(null, new StringReader(queryString));
		String analyzedQueryString = "";
		
		while(true)
		{
			Token token = stream.next();
			if(token == null)
			{
				break;
			}
		
			analyzedQueryString = analyzedQueryString + token.termText() + " ";
		}
		
		analyzedQueryString = analyzedQueryString.trim();

		log.debug("analyzedQueryString = " + analyzedQueryString);

The output of the log statement above is:

analyzedQueryString = eos-20d canon amaz

I see that the common stop words have been removed, everything has been lowercased, and the query has been stemmed. Why was the hyphen not removed by the StandardFilter? Or does the standard analyzer remove hyphens only from phrases like "eos - 20d" and not from "eos-20d"?
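
To make the expectation concrete, here is a minimal plain-Java sketch of the result I was hoping for, with each hyphen treated as a term separator. It does not use Lucene at all and is not a claim about how StandardFilter works internally. The class name HyphenSplitSketch and the helper splitOnHyphens are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class HyphenSplitSketch {

    // Hypothetical helper (not part of Lucene): takes an already analyzed,
    // space-separated string and splits every term on hyphens.
    static List<String> splitOnHyphens(String analyzed) {
        List<String> terms = new ArrayList<>();
        for (String term : analyzed.trim().split("\\s+")) {
            for (String part : term.split("-")) {
                // Skip empty pieces produced by leading or repeated hyphens.
                if (!part.isEmpty()) {
                    terms.add(part);
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Input is the analyzed string from the log output above.
        System.out.println(String.join(" ", splitOnHyphens("eos-20d canon amaz")));
    }
}
```

Applying something like this to each token's text after analysis would be one possible workaround, assuming hyphenated terms should always be split.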

Thanks.