Posted to java-user@lucene.apache.org by Sujit Pal <su...@comcast.net> on 2011/10/17 20:27:05 UTC
Re: How do you see if a tokenstream has tokens without consuming the tokens?
Hi Paul,
Since you have modified the StandardAnalyzer (I presume you mean
StandardFilter), why not check term.text() and, if it's all
punctuation, skip the analysis for that term? Something like this in
your StandardFilter:
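public final boolean incrementToken() throws IOException {
  if (!input.incrementToken()) {
    return false;  // no more tokens upstream
  }
  CharTermAttribute ta = getAttribute(CharTermAttribute.class);
  // isAllPunctuation() is a helper you would write yourself. Pass the
  // attribute (a CharSequence) rather than ta.buffer(), which can hold
  // stale chars beyond ta.length().
  if (isAllPunctuation(ta)) {
    return true;  // pass the token through untouched
  } else {
    ... normal processing here
  }
}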
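For what it's worth, here is a minimal sketch of what the
isAllPunctuation helper could look like. The class name
PunctuationCheck and the exact rule (non-empty, and no code point is a
letter or digit) are my assumptions, not anything Lucene provides, so
adjust the rule to your data. It takes a CharSequence, which a
CharTermAttribute already is.

```java
public final class PunctuationCheck {
    private PunctuationCheck() {}

    /**
     * Hypothetical helper: returns true when the term is non-empty and
     * contains no letter or digit code points. Null and empty input
     * return false.
     */
    public static boolean isAllPunctuation(CharSequence term) {
        if (term == null || term.length() == 0) {
            return false;
        }
        for (int i = 0; i < term.length(); ) {
            int cp = Character.codePointAt(term, i);
            if (Character.isLetterOrDigit(cp)) {
                return false;  // found a real word character
            }
            i += Character.charCount(cp);  // step over surrogate pairs
        }
        return true;
    }
}
```

Note that under this rule anything that is not a letter or digit
(symbols, whitespace) counts as "punctuation", which seemed like the
right call for artist names and titles but is a judgment call.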
If the filters are made keyword attribute aware (I have a bug open on
this, LUCENE-3236, although I only asked for the Lowercase and Stop
filters there), then it's even simpler: you can plug in your own filter
that marks the term with a KeywordAttribute so downstream filters pass
it through.
-sujit
On Mon, 2011-10-17 at 13:12 +0100, Paul Taylor wrote:
> We have a modified version of a Lucene StandardAnalyzer; we use it for
> tokenizing music metadata such as artist names & song titles, so
> typically only a few words. When tokenizing, it usually strips out
> punctuation, which is correct. However, if the input text consists of
> only punctuation characters then we end up with nothing; for these
> particular RARE cases I want to use a mapping filter.
>
> So what I try to do is have my analyzer tokenize as normal, then if the
> result is no tokens, retokenize with the mapping filter. I check that it
> has no tokens using incrementToken(), but then can't see how I
> decrementToken(). How can I do this, or is there a more efficient way of
> doing this? Note that of maybe 10,000,000 records only a few hundred
> will have this problem, so I need a solution which doesn't impact
> performance unreasonably.
>
> NormalizeCharMap specialcharConvertMap = new NormalizeCharMap();
> specialcharConvertMap.add("!", "Exclamation");
> specialcharConvertMap.add("?","QuestionMark");
> ...............
>
> public TokenStream tokenStream(String fieldName, Reader reader) {
>     CharFilter specialCharFilter =
>         new MappingCharFilter(specialcharConvertMap, reader);
>
>     // First attempt: tokenize the raw input as normal
>     StandardTokenizer tokenStream =
>         new StandardTokenizer(LuceneVersion.LUCENE_VERSION, reader);
>     try {
>         if (!tokenStream.incrementToken()) {
>             // No tokens: retokenize through the mapping filter
>             tokenStream = new StandardTokenizer(
>                 LuceneVersion.LUCENE_VERSION, specialCharFilter);
>         } else {
>             // TODO: set tokenStream back as it was before incrementToken()
>         }
>     } catch (IOException ioe) {
>         // ignored
>     }
>     TokenStream result = new LowerCaseFilter(tokenStream);
>     return result;
> }
>
> thanks for any help
>
>
> Paul
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>