Posted to java-user@lucene.apache.org by Paul Taylor <pa...@fastmail.fm> on 2011/11/02 17:12:09 UTC

Creating additional tokens from input in a token filter

I have a tokenizer filter that takes tokens and then drops any non-alphanumeric
characters,

e.g. 'this-stuff' becomes 'thisstuff',

but what I actually want it to do is split the one token into multiple tokens,
using the non-alphanumeric characters as word boundaries,

e.g. 'this-stuff' becomes 'this stuff'.

How do I do this?

thanks Paul

(You may be wondering why I didn't just filter out these characters at the
tokenizer stage, but I had to keep them in to solve another problem: they
need to be kept for 'words' that consist only of non-alphanumeric characters.)

This is my existing class:

public class MusicbrainzTokenizerFilter extends TokenFilter {
    /**
     * Construct filtering <i>in</i>.
     */
    public MusicbrainzTokenizerFilter(TokenStream in) {
        super(in);
        termAtt = addAttribute(CharTermAttribute.class);
        typeAtt = addAttribute(TypeAttribute.class);
    }

    private static final String ALPHANUMANDPUNCTUATION
            = MusicbrainzTokenizer.TOKEN_TYPES[MusicbrainzTokenizer.ALPHANUMANDPUNCTUATION];

    // this filter uses the type attribute
    private TypeAttribute       typeAtt;
    private CharTermAttribute   termAtt;

    /**
     * Advances to the next token in the stream; returns false at EOS.
     * <p>Drops non-alphanumeric characters from ALPHANUMANDPUNCTUATION tokens.
     */
    @Override
    public final boolean incrementToken() throws java.io.IOException {
        if (!input.incrementToken()) {
            return false;
        }

        char[] buffer = termAtt.buffer();
        final int bufferLength = termAtt.length();
        final String type = typeAtt.type();

        if (type == ALPHANUMANDPUNCTUATION) {      // drop non-alphanumerics
            int upto = 0;
            for (int i = 0; i < bufferLength; i++) {
                char c = buffer[i];
                if (Character.isLetterOrDigit(c)) {
                    buffer[upto++] = c;
                }
                // else: drop the character
            }
            termAtt.setLength(upto);
        }
        return true;
    }
}



Re: Creating additional tokens from input in a token filter

Posted by Paul Taylor <pa...@fastmail.fm>.
On 02/11/2011 17:15, Uwe Schindler wrote:
> Hi Paul,
>
> There is WordDelimiterFilter which does exactly what you want. In 3.x it's
> unfortunately only shipped in the Solr JAR file, but in 4.0 it's in the
> analyzers-common module.
>
> Uwe
>
Ah great. Erm, I'm being a bit dense, but where is Lucene 4.0? I've looked
in various places under http://svn.apache.org/viewvc/lucene/dev/ but can't
see it.


Paul



Re: Creating additional tokens from input in a token filter

Posted by Paul Taylor <pa...@fastmail.fm>.
On 02/11/2011 20:48, Paul Taylor wrote:
> On 02/11/2011 17:15, Uwe Schindler wrote:
>> Hi Paul,
>>
>> There is WordDelimiterFilter which does exactly what you want. In 3.x it's
>> unfortunately only shipped in the Solr JAR file, but in 4.0 it's in the
>> analyzers-common module.
> Okay, so I found it and it looks very interesting, but it is really overly
> complex for what I want to do and doesn't handle my specific case. Could
> anyone possibly give a code example of how I create two tokens from one?
> Assume I already know how to split it (I can't work that bit out).
>
I took another look at WordDelimiterFilter and managed to get it to work,
sweet, thanks very much.

In case it is of interest to others, and because I had to hack
WordDelimiterFilter a little, this is my solution.

1. I changed my existing tokenizer to convert control/punctuation chars
to a '-' rather than dropping them:

       if (type == ALPHANUMANDPUNCTUATION) {      // map non-alphanumerics to '-'
             int upto = 0;
             for (int i = 0; i < bufferLength; i++) {
                 char c = buffer[i];
                 if (!Character.isLetterOrDigit(c)) {
                     // Replace control/punctuation chars with '-' to help the word delimiter
                     buffer[upto++] = '-';
                 }
                 else {
                     // Normal char
                     buffer[upto++] = c;
                 }
             }
             termAtt.setLength(upto);
         }

2. I took a copy of WordDelimiterFilter and WordDelimiterIterator and modified
them slightly so that the filter only does anything when the attribute type
equals ALPHANUMANDPUNCTUATION (I couldn't see any constructor that would let
me set this):

public boolean incrementToken() throws IOException {
    while (true) {
      if (!hasSavedState) {
        // process a new input word
        if (!input.incrementToken()) {
          return false;
        }

        // use the word delimiter just on these tokens; pass everything else through unchanged
        if (typeAttribute.type() != MusicbrainzTokenizer.TOKEN_TYPES[MusicbrainzTokenizer.ALPHANUMANDPUNCTUATION]) {
            return true;
        }
        ...................
}

3. Added my WordDelimiterFilter to the chain and just set it to generateWordParts:

streams.filteredTokenStream = new WordDelimiterFilter(
        streams.filteredTokenStream,
        WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE,
        1,                        // generateWordParts
        0, 0, 0, 0, 0, 0, 0, 0,   // all other options off
        null);                    // no protected words
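
A quick way to sanity-check the finished chain is to consume the analyzer output
directly and print each term. A small, untested helper along those lines (the
analyzer that builds the chain above is assumed to exist; only the standard
TokenStream consumption pattern matters here):

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class PrintTokens {
    /** Prints each token the given analyzer produces for the text, one per line. */
    public static void print(Analyzer analyzer, String text) throws IOException {
        TokenStream ts = analyzer.tokenStream("name", new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString());
        }
        ts.end();
        ts.close();
    }
}

Calling it with the Musicbrainz analyzer and the text 'this-stuff' should print
'this' and then 'stuff'.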

Cheers Paul



Re: Creating additional tokens from input in a token filter

Posted by Paul Taylor <pa...@fastmail.fm>.
On 02/11/2011 17:15, Uwe Schindler wrote:
> Hi Paul,
>
> There is WordDelimiterFilter which does exactly what you want. In 3.x it's
> unfortunately only shipped in the Solr JAR file, but in 4.0 it's in the
> analyzers-common module.
Okay, so I found it and it looks very interesting, but it is really overly
complex for what I want to do and doesn't handle my specific case. Could
anyone possibly give a code example of how I create two tokens from one?
Assume I already know how to split it (I can't work that bit out).

     public final boolean incrementToken() throws java.io.IOException {
         if (!input.incrementToken()) {
             return false;
         }

         char[] buffer = termAtt.buffer();
         final int bufferLength = termAtt.length();
         final String type = typeAtt.type();

         if (type == ALPHANUMANDPUNCTUATION) {
             int upto = 0;

             for (int i = 0; i < bufferLength; i++) {
                 char c = buffer[i];
                 if (!Character.isLetterOrDigit(c) )
                 {
                     //TODO PUT ALL CHARS AFTER THIS INTO A NEW TOKEN
                 }
                 else {
                     buffer[upto++] = c;
                 }
             }
             termAtt.setLength(upto);
         }
         return true;
     }
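
The general pattern for creating two tokens from one inside incrementToken() is
to emit the first part immediately, queue the remaining parts, and return them on
later calls, restoring a captured State so the other attributes (type, offsets)
carry over. A rough, untested sketch along those lines (the class name and the
splitting regex are invented for illustration; offset adjustment is omitted):

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

/** Sketch: splits each term on non-alphanumeric characters and emits the parts as separate tokens. */
public final class SplitOnNonAlphanumericFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);

    // parts of the current input token that still have to be emitted
    private final Deque<String> pending = new ArrayDeque<String>();
    private AttributeSource.State savedState;

    public SplitOnNonAlphanumericFilter(TokenStream in) {
        super(in);
    }

    @Override
    public boolean incrementToken() throws IOException {
        // first hand out any parts queued from the previous input token
        if (!pending.isEmpty()) {
            restoreState(savedState);              // carry over type, offsets, etc.
            termAtt.setEmpty().append(pending.removeFirst());
            posIncAtt.setPositionIncrement(1);     // each part occupies its own position
            return true;
        }

        if (!input.incrementToken()) {
            return false;
        }

        // split the current term at non-alphanumeric characters
        String[] parts = termAtt.toString().split("[^\\p{L}\\p{N}]+");
        int emitted = 0;
        for (String part : parts) {
            if (part.length() == 0) {
                continue;
            }
            if (emitted++ == 0) {
                termAtt.setEmpty().append(part);   // first part replaces the current token
            } else {
                pending.addLast(part);             // the rest are queued for later calls
            }
        }
        if (!pending.isEmpty()) {
            savedState = captureState();           // remembered for the queued parts
        }
        // a token with no letters or digits at all falls through unchanged
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pending.clear();
        savedState = null;
    }
}

Chained after the existing tokenizer this would turn 'this-stuff' into 'this'
followed by 'stuff', while a token such as '!!!' passes through untouched.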







RE: Creating additional tokens from input in a token filter

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi Paul,

There is WordDelimiterFilter which does exactly what you want. In 3.x it's
unfortunately only shipped in the Solr JAR file, but in 4.0 it's in the
analyzers-common module.

Uwe
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de
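
For reference, a minimal sketch of what dropping WordDelimiterFilter into a
chain can look like, mirroring the multi-flag constructor call used elsewhere
in this thread; the package and exact signature differ between the 3.x Solr JAR
and the 4.0 analyzers-common module, so treat this as an outline rather than
the definitive API:

import org.apache.lucene.analysis.TokenStream;
// NOTE: the import for WordDelimiterFilter/WordDelimiterIterator depends on the
// version: org.apache.solr.analysis in 3.x, org.apache.lucene.analysis.miscellaneous in 4.0.

public final class WordDelimiterChain {
    /**
     * Wraps an existing token stream so that tokens such as 'this-stuff' are
     * split into their word parts ('this', 'stuff'); every other option is off.
     */
    public static TokenStream wrap(TokenStream in) {
        return new WordDelimiterFilter(
                in,
                WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE,
                1,                        // generateWordParts
                0, 0, 0, 0, 0, 0, 0, 0,   // number parts, catenation, case/numeric splits etc. off
                null);                    // no protected words
    }
}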



