Posted to java-user@lucene.apache.org by Dan Armbrust <da...@gmail.com> on 2005/08/23 17:32:13 UTC
WhiteSpace Tokenizer question
I wrote a slightly modified version of the WhitespaceTokenizer that
allows me to treat other characters as whitespace. My thought was that
this would be an easy way to make it tokenize on characters such as "-".
My tokenizer looks like this:
public class CustomWhiteSpaceTokenizer extends CharTokenizer
{
    protected boolean isTokenChar(char c)
    {
        if (Character.isWhitespace(c) || whiteSpaceChars_.contains(new Character(c)))
        {
            return false;
        }
        else
        {
            return true;
        }
    }
    <snip other stuff>
}
When I use my Analyzer which uses this tokenizer in the QueryParser with
the character "-" defined as whitespace, the following query gets parsed
like this:
"title:(john a) body:(john a) " -> (title:john title:a) (body:john body:a)
which is what I expect. But then the following query:
"title:(john--a) body:(john--a) " -> title:"john a" body:"john a"
Isn't what I want. I can't seem to figure out why it is behaving
differently on these characters (space vs hyphen) when I am specifying
them both as a non-token.
This is with the svn trunk as of yesterday.
Any help appreciated,
Thanks,
Dan
--
****************************
Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
daniel.armbrust(at)mayo.edu
http://informatics.mayo.edu/
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: WhiteSpace Tokenizer question
Posted by Yonik Seeley <ys...@gmail.com>.
It's the QueryParser, not the Analyzer.
When the query parser sees multiple tokens from what looks like a
single word, it puts them in a phrase query.
I think the only way to change that behavior would be to modify the
QueryParser.
-Yonik
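
To make the point above concrete, here is a rough, self-contained sketch of the behavior described. It is a simplified model, not Lucene's actual QueryParser code: the class PhraseQuerySketch and its methods analyze() and parse() are hypothetical names invented for illustration, and the output strings only imitate the query notation used in the original post. The assumption being modeled is that the parser first splits the query text on real whitespace, then runs the analyzer on each resulting word, and wraps the tokens in a phrase whenever a single word yields more than one token.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model only; none of these classes or methods are Lucene's.
class PhraseQuerySketch {
    // Extra characters the analyzer treats as whitespace (hypothetical configuration).
    static final Set<Character> EXTRA_WHITESPACE = new HashSet<>();
    static {
        EXTRA_WHITESPACE.add('-');
    }

    // Same test as isTokenChar in the original post.
    static boolean isTokenChar(char c) {
        return !(Character.isWhitespace(c) || EXTRA_WHITESPACE.contains(c));
    }

    // Stand-in for the analyzer: emit each run of token characters as one token.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Stand-in for the parser: split on real whitespace first, then analyze
    // each word on its own. One token becomes a term; several become a phrase.
    static String parse(String field, String queryText) {
        List<String> clauses = new ArrayList<>();
        for (String word : queryText.trim().split("\\s+")) {
            List<String> tokens = analyze(word);
            if (tokens.isEmpty()) {
                continue;
            }
            if (tokens.size() == 1) {
                clauses.add(field + ":" + tokens.get(0));
            } else {
                clauses.add(field + ":\"" + String.join(" ", tokens) + "\"");
            }
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(parse("title", "john a"));   // title:john title:a
        System.out.println(parse("title", "john--a"));  // title:"john a"
    }
}
```

In this model the space in "john a" is consumed by the parser before the analyzer ever runs, so each word is analyzed separately and becomes its own term. The hyphens in "john--a" are only seen by the analyzer, which returns two tokens for what the parser considers a single word, and that is the case that ends up as a phrase.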