Posted to java-user@lucene.apache.org by Nilesh Vijaywargiay <ni...@gmail.com> on 2012/03/27 20:03:30 UTC
Lucene tokenization
I have a string 01a_b-_-c-d which is tokenized as
01a_b
c
d
and the string a_b-_-c_d which is tokenized as
a
b
c
d
Why is there a difference when there is a digit at the beginning? I am
using the standard unstemmed tokenizer.
RE: Lucene tokenization
Posted by Steven A Rowe <sa...@syr.edu>.
Hi Nilesh,
Which version of Lucene are you using? StandardTokenizer behavior changed in v3.1.
Steve
-----Original Message-----
From: Nilesh Vijaywargiay [mailto:nilesh.vijay@gmail.com]
Sent: Tuesday, March 27, 2012 2:04 PM
To: java-user@lucene.apache.org
Subject: Lucene tokenization
I have a string 01a_b-_-c-d which is tokenized as 01a_b c d
and the string a_b-_-c_d which is tokenized as a b c d
why is there a difference when there is a digit at the beginning? I am using standard unstemmed tokenizer.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Lucene tokenization
Posted by Paul Libbrecht <pa...@hoplahup.net>.
Nilesh,
The StandardAnalyzer is full of generally useful special cases, including email and number detection.
I suppose you have hit one such special case, which presumably has a justification of some sort.
I can't tell you why, but I can tell you it's really hard to change because others rely on it somehow (I think).
paul
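To make the "number detection" special case concrete, here is a minimal sketch of one likely explanation. It assumes the pre-3.1 (classic) StandardTokenizer grammar, which, as far as I recall, has a rule that keeps alphanumeric runs joined by a single `_`, `-`, `/`, `.` or `,` together as one number-like token when every other run contains a digit. The class `ClassicNumRuleSketch` and its `tokenize` method are hypothetical plain Java, not Lucene's API. They model only that one rule and ignore emails, host names, acronyms and everything else the real tokenizer handles.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of a single rule; NOT Lucene code.
final class ClassicNumRuleSketch {

    // Assumption: these are the joiner characters allowed inside a number-like token.
    private static boolean isJoiner(char c) {
        return c == '_' || c == '-' || c == '/' || c == '.' || c == ',';
    }

    private static boolean hasDigit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isDigit(s.charAt(i))) return true;
        }
        return false;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) { i++; continue; }

            // Collect alphanumeric runs separated by exactly one joiner character.
            List<String> runs = new ArrayList<>();
            List<Integer> ends = new ArrayList<>();
            int j = i;
            while (true) {
                int start = j;
                while (j < text.length() && Character.isLetterOrDigit(text.charAt(j))) j++;
                runs.add(text.substring(start, j));
                ends.add(j);
                boolean joinerThenRun = j + 1 < text.length()
                        && isJoiner(text.charAt(j))
                        && Character.isLetterOrDigit(text.charAt(j + 1));
                if (!joinerThenRun) break;
                j++; // step over the joiner
            }

            // Longest prefix of at least two runs in which every other run has a digit.
            int best = 1;
            boolean evenOk = true, oddOk = true;
            for (int k = 0; k < runs.size(); k++) {
                boolean d = hasDigit(runs.get(k));
                if (k % 2 == 0) evenOk &= d; else oddOk &= d;
                if (k >= 1 && (evenOk || oddOk)) best = k + 1;
            }

            int end = ends.get(best - 1);
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("01a_b-_-c-d")); // [01a_b, c, d]
        System.out.println(tokenize("a_b-_-c_d"));   // [a, b, c, d]
    }
}
```

Under this model, `01a` contains a digit, so `01a_b` stays together as one token, while `a_b` has no digit on either side of the underscore and is split. Whether this is what happens in your setup depends on the Lucene version, as Steve points out.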
On 27 March 2012 at 20:03, Nilesh Vijaywargiay wrote:
> I have a string 01a_b-_-c-d which is tokenized as
> 01a_b
> c
> d
>
> and the string a_b-_-c_d which is tokenized as
> a
> b
> c
> d
>
> why is there a difference when there is a digit at the beginning? I am
> using standard unstemmed tokenizer.