Posted to java-user@lucene.apache.org by Sujit Pal <su...@comcast.net> on 2011/10/17 20:27:05 UTC

Re: How do you see if a tokenstream has tokens without consuming the tokens?

Hi Paul,

Since you have modified the StandardAnalyzer (I presume you mean
StandardFilter), why not do a check on the term text, and if it's all
punctuation, skip the analysis for that term? Something like this in
your StandardFilter:

public final boolean incrementToken() throws IOException {
  if (!input.incrementToken()) {
    return false;
  }
  CharTermAttribute ta = getAttribute(CharTermAttribute.class);
  if (isAllPunctuation(ta.buffer(), ta.length())) {
    // all-punctuation term: pass it through untouched
    return true;
  } else {
    ... normal processing here
  }
}
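
isAllPunctuation() is just a helper you would write yourself; a rough
sketch (assuming "punctuation" here simply means any character that is
neither a letter nor a digit) could be:

private static boolean isAllPunctuation(char[] buffer, int length) {
  // only inspect the first 'length' chars - the term buffer may be larger
  for (int i = 0; i < length; i++) {
    if (Character.isLetterOrDigit(buffer[i])) {
      return false;
    }
  }
  return length > 0;
}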

If the filters are made keyword-attribute aware (I have an issue open on
this, LUCENE-3236, although I only asked about the Lowercase and Stop
filters there), then it's even simpler: you can plug in your own filter
that marks the term with a KeywordAttribute so that downstream filters
pass it through.
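
Roughly, such a filter could look like the sketch below (the class name
is made up, it reuses the isAllPunctuation() helper sketched above, and
it assumes a Lucene version where KeywordAttribute exists, i.e. 3.1+):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;

public final class PunctuationAsKeywordFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final KeywordAttribute keywordAtt = addAttribute(KeywordAttribute.class);

  public PunctuationAsKeywordFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // mark all-punctuation terms as keywords so that keyword-aware
    // downstream filters (stemmer, stop filter, ...) leave them alone;
    // isAllPunctuation() is the helper sketched earlier in this mail
    if (isAllPunctuation(termAtt.buffer(), termAtt.length())) {
      keywordAtt.setKeyword(true);
    }
    return true;
  }
}

You would then insert it early in your analyzer's filter chain, before
the filters that should skip those terms.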

-sujit

On Mon, 2011-10-17 at 13:12 +0100, Paul Taylor wrote:
> We have a modified version of a Lucene StandardAnalyzer which we use for 
> tokenizing music metadata such as artist names & song titles, so 
> typically only a few words. On tokenizing, it usually strips out 
> punctuation, which is correct; however, if the input text consists of 
> only punctuation characters then we end up with nothing, and for these 
> particular RARE cases I want to use a mapping filter.
> 
> So what I try to do is have my analyzer tokenize as normal, and then, if 
> the result is no tokens, retokenize with the mapping filter. I check that 
> it has no tokens using incrementToken(), but then can't see how I 
> "decrementToken()". How can I do this, or is there a more efficient way of 
> doing it? Note that of maybe 10,000,000 records only a few hundred will 
> have this problem, so I need a solution which doesn't impact performance 
> unreasonably.
> 
>      NormalizeCharMap specialcharConvertMap = new NormalizeCharMap();
>      specialcharConvertMap.add("!", "Exclamation");
>      specialcharConvertMap.add("?","QuestionMark");
>      ...............
> 
>      public TokenStream tokenStream(String fieldName, Reader reader) {
>          CharFilter specialCharFilter =
>              new MappingCharFilter(specialcharConvertMap, reader);
> 
>          StandardTokenizer tokenStream =
>              new StandardTokenizer(LuceneVersion.LUCENE_VERSION, reader);
>          try
>          {
>              if (tokenStream.incrementToken() == false)
>              {
>                  tokenStream = new StandardTokenizer(
>                      LuceneVersion.LUCENE_VERSION, specialCharFilter);
>              }
>              else
>              {
>                  //TODO **************** set tokenstream back as it was
>                  //before increment token
>              }
>          }
>          catch (IOException ioe)
>          {
> 
>          }
>          TokenStream result =
>              new LowerCaseFilter(LuceneVersion.LUCENE_VERSION, tokenStream);
>          return result;
>      }
> 
> thanks for any help
> 
> 
> Paul
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org