Posted to java-user@lucene.apache.org by Martin Braun <mb...@uni-hd.de> on 2006/11/17 11:04:43 UTC
Search "C++" with Solr's WordDelimiterFilter
hi all,
I would like to make it possible to search for "C++" and "C#".
I found a hint in the archive to customize the appropriate *.jj file
with the code from NutchAnalysis.jj:
// irregular words
| <#IRREGULAR_WORD: (<C_PLUS_PLUS>|<C_SHARP>)>
| <#C_PLUS_PLUS: ("C"|"c") "++" >
| <#C_SHARP: ("C"|"c") "#" >
I am using a custom analyzer with Yonik's WordDelimiterFilter:
@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowerCaseFilter(
        new WordDelimiterFilter(
            new WhitespaceTokenizer(reader), 1, 1, 1, 1, 1));
}
But as far as I can see, WordDelimiterFilter only uses the
WhitespaceTokenizer, which does not use a JavaCC file.
What would be the best way to integrate this feature, preferably
without changing the Lucene sources?
Should I override the WhitespaceTokenizer using JavaCC (are there
any docs on doing this)?
tia,
martin
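[Editorial note: not part of the original thread. As an alternative to a JavaCC grammar, a plain regex-based tokenizer can also keep "C++" and "C#" whole while splitting everything else on non-word characters. A hypothetical, self-contained sketch (IrregularWordTokenizer is an invented name, not a Lucene class; a real Lucene Tokenizer would need to extend Lucene's Tokenizer and emit Token objects):]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IrregularWordTokenizer {
    // Alternation order matters: try the "irregular words" C++ and C#
    // (either case) first, then fall back to runs of word characters.
    private static final Pattern TOKEN =
            Pattern.compile("[Cc]\\+\\+|[Cc]#|\\w+");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group().toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I code in C++ and C#"));
        // [i, code, in, c++, and, c#]
    }
}
```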
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Search "C++" with Solr's WordDelimiterFilter
Posted by Chris Hostetter <ho...@fucit.org>.
WordDelimiterFilter doesn't explicitly use a Tokenizer -- that's the
beauty of TokenFilters: you can compose them around any other TokenStream
instance that you want.
If you have a custom grammar file of your own that you like, you can use
it to build your own Tokenizer and then wrap that up in a
WordDelimiterFilter (and any other filters you want) to make a custom
Analyzer ... this is all StandardAnalyzer does: it wraps the
StandardTokenizer (which is built from a .jj file) with a few useful
TokenFilters.
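[Editorial note: not part of the original thread. The wrapping Hoss describes is essentially the decorator pattern. A minimal self-contained sketch with simplified stand-in classes -- these are invented simplifications, not the real Lucene API, whose TokenStream contract is richer:]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Simplified stand-in for Lucene's TokenStream abstraction.
interface TokenStream {
    String next(); // returns the next token, or null when exhausted
}

// A "tokenizer" is a TokenStream that reads from raw input.
class SimpleWhitespaceTokenizer implements TokenStream {
    private final Iterator<String> it;
    SimpleWhitespaceTokenizer(String input) {
        it = Arrays.asList(input.split("\\s+")).iterator();
    }
    public String next() { return it.hasNext() ? it.next() : null; }
}

// A "filter" is a TokenStream that wraps any other TokenStream --
// this is the composition described above.
class LowerCaseFilter implements TokenStream {
    private final TokenStream input;
    LowerCaseFilter(TokenStream input) { this.input = input; }
    public String next() {
        String t = input.next();
        return t == null ? null : t.toLowerCase();
    }
}

public class ComposeDemo {
    public static void main(String[] args) {
        // Wrap a tokenizer in a filter, exactly as an Analyzer would.
        TokenStream ts = new LowerCaseFilter(
                new SimpleWhitespaceTokenizer("Search C++ And C#"));
        List<String> out = new ArrayList<>();
        for (String t = ts.next(); t != null; t = ts.next()) out.add(t);
        System.out.println(out); // [search, c++, and, c#]
    }
}
```

Because every filter is itself a TokenStream, any number of them can be stacked around any tokenizer, custom or stock.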
: Date: Fri, 17 Nov 2006 11:04:43 +0100
: From: Martin Braun <mb...@uni-hd.de>
: Reply-To: java-user@lucene.apache.org, mbraun@uni-hd.de
: To: java-user@lucene.apache.org
: Subject: Search "C++" with Solrs WordDelimiterFilter
-Hoss