Posted to java-user@lucene.apache.org by "G.Long" <jd...@gmail.com> on 2014/10/09 16:54:15 UTC

custom token filter generates empty tokens

Hi :)

I wrote a custom token filter which removes special characters.
Sometimes all characters of a token are removed, so the filter
produces an empty token. I would like to remove this token from the
token stream, but I'm not sure how to do that.

Is there something missing in my custom token filter or do I need to 
chain another custom token filter to remove empty tokens?

Regards :)

ps:

this is the code of my custom filter :

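import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class SpecialCharFilter extends TokenFilter {

     private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);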

     protected SpecialCharFilter(TokenStream input) {
         super(input);
     }

     @Override
     public boolean incrementToken() throws IOException {

         if (!input.incrementToken()) {
             return false;
         }

         final char[] buffer = termAtt.buffer();
         final int length = termAtt.length();
         final char[] newBuffer = new char[length];

         int newIndex = 0;
         for (int i = 0; i < length; i++) {
             if (!isFilteredChar(buffer[i])) {
                 newBuffer[newIndex] = buffer[i];
                 newIndex++;
             }
         }

         // Use only the characters that were kept; newBuffer is padded
         // with '\0' past newIndex.
         String term = new String(newBuffer, 0, newIndex).trim();
         termAtt.setEmpty().append(term);

         return true;
     }
}

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: custom token filter generates empty tokens

Posted by "G.Long" <jd...@gmail.com>.
Thanks Ahmet and Jose for your help :)

Regards,

On 09/10/2014 18:29, Jose Fernandez wrote:
> When you return true from incrementToken() you tell Lucene to add your token to the token stream. At the end of incrementToken() check for an empty token. If it's empty then return incrementToken() to process the next token. This will affect your positions so if you're doing phrase search you will need to adjust the position attribute to account for the now-empty token.


RE: custom token filter generates empty tokens

Posted by Jose Fernandez <jf...@sdl.com>.
When you return true from incrementToken(), you tell Lucene to add your token to the token stream. At the end of incrementToken(), check for an empty token. If it is empty, return the result of calling incrementToken() again so that the next token is processed instead. This will affect your positions, so if you're doing phrase search you will need to adjust the position increment (PositionIncrementAttribute) to account for the now-empty token.
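
To make the position bookkeeping concrete, here is a minimal plain-Java sketch of the skip-and-accumulate logic described above. It is a model, not Lucene code: the Token class, the strip and skipEmpty helpers, and the letter-or-digit definition of isFilteredChar are all assumptions made for illustration. In a real TokenFilter the same idea would loop over input.incrementToken() and update the PositionIncrementAttribute instead of building a list.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of skipping empty tokens; not actual Lucene code. */
final class EmptyTokenSkipper {

    /** Hypothetical stand-in for the term text plus its position increment. */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Assumed definition: anything that is not a letter or digit is filtered. */
    static boolean isFilteredChar(char c) {
        return !Character.isLetterOrDigit(c);
    }

    /** Returns the term with all filtered characters removed. */
    static String strip(String term) {
        StringBuilder sb = new StringBuilder(term.length());
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (!isFilteredChar(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /**
     * Drops tokens that become empty and adds their position increments
     * to the next surviving token, so later tokens keep their positions.
     */
    static List<Token> skipEmpty(List<Token> input) {
        List<Token> output = new ArrayList<>();
        int skipped = 0; // increments of dropped tokens not yet passed on
        for (Token t : input) {
            String stripped = strip(t.term);
            if (stripped.isEmpty()) {
                skipped += t.positionIncrement;
                continue; // move on to the next token instead of emitting
            }
            output.add(new Token(stripped, t.positionIncrement + skipped));
            skipped = 0;
        }
        return output;
    }
}
```

For example, the tokens "foo!", "***" and "bar", each with a position increment of 1, come out as "foo" with increment 1 and "bar" with increment 2, so "bar" keeps its original position and a phrase query still sees the gap.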



Re: custom token filter generates empty tokens

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi G.Long,

You can use TrimFilter+LengthFilter to remove empty/whitespace tokens.
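
A rough plain-Java model of what such a chain does to the terms, assuming a minimum length of 1. The class and method names are hypothetical, and String.trim() only approximates the whitespace handling of Lucene's TrimFilter. With the constructors available in Lucene 5.0 and later (an assumption about your version; 4.x releases also have variants taking a Version argument), the real chain would be along the lines of new LengthFilter(new TrimFilter(stream), 1, Integer.MAX_VALUE).

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of a trim step followed by a length filter. */
final class TrimThenLength {

    /**
     * Trims each term, then keeps only terms whose trimmed length lies
     * in [min, max]. With min = 1, empty and whitespace-only terms are
     * dropped.
     */
    static List<String> trimAndFilter(List<String> terms, int min, int max) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            String trimmed = term.trim();
            if (trimmed.length() >= min && trimmed.length() <= max) {
                out.add(trimmed);
            }
        }
        return out;
    }
}
```

This model ignores positions entirely; whether removed tokens leave a gap in the real chain depends on how the Lucene filters handle position increments in your version.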


Ahmet
