Posted to java-user@lucene.apache.org by Dan Armbrust <da...@gmail.com> on 2005/08/23 17:32:13 UTC

WhiteSpace Tokenizer question

I wrote a slightly modified version of the WhitespaceTokenizer that
allows me to treat other characters as whitespace.  My thought was that
this would be an easy way to make it tokenize on characters such as "-".

My tokenizer looks like this:

public class CustomWhiteSpaceTokenizer extends CharTokenizer
{

    protected boolean isTokenChar(char c)
    {
        // A character is part of a token unless it is whitespace or one of
        // the extra characters configured to be treated as whitespace.
        return !(Character.isWhitespace(c)
                || whiteSpaceChars_.contains(new Character(c)));
    }

<snip other stuff>
}
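
For anyone who wants to try the character test without Lucene on the classpath, here is a minimal standalone sketch. It does not extend CharTokenizer; the class name SeparatorTokenizerSketch, the extraSeparators set (standing in for whiteSpaceChars_ above), and the tokenize helper are all hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the custom tokenizer; not Lucene code.
class SeparatorTokenizerSketch {
    // Extra characters to treat like whitespace (assumed role of whiteSpaceChars_).
    private final Set<Character> extraSeparators = new HashSet<>();

    SeparatorTokenizerSketch(String separators) {
        for (char c : separators.toCharArray()) {
            extraSeparators.add(c);
        }
    }

    // Same test as isTokenChar in the post: whitespace and the extra
    // separator characters never belong to a token.
    boolean isTokenChar(char c) {
        return !(Character.isWhitespace(c) || extraSeparators.contains(c));
    }

    // Collect maximal runs of token characters, skipping separators.
    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        SeparatorTokenizerSketch tokenizer = new SeparatorTokenizerSketch("-");
        System.out.println(tokenizer.tokenize("john  a"));  // [john, a]
        System.out.println(tokenizer.tokenize("john--a"));  // [john, a]
    }
}
```

At this level the two inputs produce the same tokens, so the tokenizer itself does not distinguish a space from a hyphen.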

When I use my Analyzer, which uses this tokenizer, in the QueryParser with
the character "-" defined as whitespace, the following query is parsed
like this:

"title:(john  a) body:(john  a) " -> (title:john title:a) (body:john body:a)

which is what I expect.  But then the following query:

"title:(john--a) body:(john--a) " -> title:"john a" body:"john a"

isn't what I want.  I can't figure out why it behaves differently for
these characters (space vs. hyphen) when I am specifying both of them
as non-token characters.

This is with the svn trunk as of yesterday.
Any help appreciated,

Thanks,

Dan

-- 
****************************
Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
daniel.armbrust(at)mayo.edu
http://informatics.mayo.edu/


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: WhiteSpace Tokenizer question

Posted by Yonik Seeley <ys...@gmail.com>.

It's the QueryParser, not the Analyzer.
When the query parser sees multiple tokens from what looks like a
single word, it puts them in a phrase query.

I think the only way to change that behavior would be to modify the
QueryParser.

-Yonik
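
The behavior described above can be illustrated with a simplified model. This is a sketch under stated assumptions, not Lucene's actual QueryParser code: the class QueryParserModel and its methods analyze, parse, and replaceHyphens are hypothetical. The model assumes the parser first splits the query text on real whitespace and only then hands each chunk to the analyzer; a chunk that yields more than one token becomes a phrase.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the described behavior; not Lucene code.
class QueryParserModel {
    // Stand-in for the custom analyzer: splits on whitespace and '-'.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[\\s-]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // The parser splits on whitespace itself, then analyzes each chunk.
    // One token gives a term clause; several tokens give a phrase clause.
    static String parse(String field, String queryText) {
        List<String> clauses = new ArrayList<>();
        for (String chunk : queryText.trim().split("\\s+")) {
            List<String> tokens = analyze(chunk);
            if (tokens.isEmpty()) {
                continue;
            }
            if (tokens.size() == 1) {
                clauses.add(field + ":" + tokens.get(0));
            } else {
                clauses.add(field + ":\"" + String.join(" ", tokens) + "\"");
            }
        }
        return String.join(" ", clauses);
    }

    // Possible workaround (an assumption, not suggested in the thread):
    // turn the extra separators into spaces before parsing.
    static String replaceHyphens(String queryText) {
        return queryText.replace('-', ' ');
    }

    public static void main(String[] args) {
        System.out.println(parse("title", "john  a"));                  // title:john title:a
        System.out.println(parse("title", "john--a"));                  // title:"john a"
        System.out.println(parse("title", replaceHyphens("john--a"))); // title:john title:a
    }
}
```

In the model, "john--a" reaches the analyzer as a single chunk and comes back as two tokens, hence the phrase. Rewriting the query string before parsing avoids touching the QueryParser, but note that a leading "-" is also the prohibit operator in the query syntax, so a blanket replacement like the one sketched here would remove that operator as well.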
