Posted to java-user@lucene.apache.org by Martin Braun <mb...@uni-hd.de> on 2006/11/17 11:04:43 UTC

Search "C++" with Solrs WordDelimiterFilter

hi all,

I would like to make it possible to search for "C++" and "C#". I found
a hint in the archive to customize the appropriate *.jj file with this
code from NutchAnalysis.jj:

     // irregular words
| <#IRREGULAR_WORD: (<C_PLUS_PLUS>|<C_SHARP>)>
| <#C_PLUS_PLUS: ("C"|"c") "++" >
| <#C_SHARP: ("C"|"c") "#" >
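(For illustration only: this is not JavaCC and not part of any Lucene or Nutch class, just a standalone regex-based sketch of what those token rules do, i.e. match "C++" and "C#" case-insensitively before falling back to ordinary word characters. The class name is made up.)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone sketch of the <IRREGULAR_WORD> rules above: try the
// irregular alternatives first, then fall back to plain word tokens.
public class IrregularWordSketch {
    private static final Pattern TOKEN =
        Pattern.compile("[Cc]\\+\\+|[Cc]#|\\w+");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group()); // "C++" and "c#" survive as whole tokens
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I code in C++ and c# daily"));
    }
}
```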

I am using a custom analyzer with Yonik's WordDelimiterFilter:

@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowerCaseFilter(
        new WordDelimiterFilter(new WhitespaceTokenizer(reader), 1, 1, 1, 1, 1));
}


But as far as I can see, WordDelimiterFilter only uses the
WhitespaceTokenizer, which is not built from a JavaCC file.

What would be the best way to integrate this feature, preferably
without changing the Lucene sources?

Should I replace the WhitespaceTokenizer with my own JavaCC-based
tokenizer (are there any docs on doing this)?

tia,
martin

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Search "C++" with Solrs WordDelimiterFilter

Posted by Chris Hostetter <ho...@fucit.org>.
WordDelimiterFilter doesn't explicitly use a Tokenizer -- that's the
beauty of TokenFilters: you can compose them around any other TokenStream
instance that you want.

If you have a custom grammar file of your own that you like, you can use
it to build your own Tokenizer and then wrap that up in a
WordDelimiterFilter (and any other filters you want) to make a custom
Analyzer ... this is all StandardAnalyzer does: it wraps the
StandardTokenizer (which is built from a .jj file) with a few useful
TokenFilters.
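
As a standalone illustration of that composition (plain Java, no Lucene
classes; the `Tokens` interface and both method names here are made up),
a filter chain is just decorators around a token source -- the filter
never cares which tokenizer sits underneath it:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Locale;

// Toy model of the TokenStream idea: a tokenizer produces tokens, and
// any number of filters can wrap any token source.
public class ChainSketch {
    public interface Tokens { String next(); } // returns null when exhausted

    // the "tokenizer": splits input on whitespace
    public static Tokens whitespace(String text) {
        final Iterator<String> it = Arrays.asList(text.split("\\s+")).iterator();
        return new Tokens() {
            public String next() { return it.hasNext() ? it.next() : null; }
        };
    }

    // a "filter": wraps ANY Tokens instance, lower-casing each token
    public static Tokens lowerCase(final Tokens in) {
        return new Tokens() {
            public String next() {
                String t = in.next();
                return t == null ? null : t.toLowerCase(Locale.ROOT);
            }
        };
    }

    public static void main(String[] args) {
        Tokens chain = lowerCase(whitespace("Search C++ With Filters"));
        for (String t; (t = chain.next()) != null; ) {
            System.out.println(t);
        }
    }
}
```

Swap in a different tokenizer (say, one generated from your own .jj
grammar) and the same filters keep working unchanged -- that is the
point of the TokenFilter design.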





-Hoss

