Posted to general@lucene.apache.org by WWilson <wo...@hotmail.com> on 2009/10/22 15:38:01 UTC

Issue with Tokenising with Standard Analyzer and commas

Hi,

I am not a Lucene expert by any means, and I am hoping that you may be
able to help me with a little problem I currently have.

I am using the Lucene standard analyzer on an address text field. My issue
arises when the address contains a comma.

For example: 87,Green Street

The standard analyzer treats the comma as an important interconnecting
character and keeps 87,Green as a single token.

I presume this is to ensure numeric values (10,000) are correctly
maintained.

The problem is that individual searches for 87 or green do not match the
token 87,Green.

Should the standard analyzer not check the text on either side of the comma
to ensure both sides are numeric and, if they are not, split the token
87,Green into 87 and Green?

I can wrap the standard analyzer and post-process the tokens it generates to
get this effect, but I was wondering whether the behaviour above is the
standard analyzer 'working as intended'.
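
A minimal sketch of that post-processing step in plain Java, to make the
rule concrete. CommaTokenSplitter and its methods are hypothetical names and
not part of Lucene. The sketch assumes "numeric" means digits only, and it
leaves out the wiring into a token filter around the standard analyzer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Lucene class: splits a token on commas
// unless the text on both sides of the comma is purely numeric.
class CommaTokenSplitter {

    // True when the text is non-empty and consists only of digits
    // (an assumption about what counts as "numeric").
    static boolean isNumeric(String text) {
        if (text.isEmpty()) {
            return false;
        }
        for (int i = 0; i < text.length(); i++) {
            if (!Character.isDigit(text.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Returns the sub-tokens for one analyzer token. A comma is kept
    // only when the parts directly before and after it are numeric.
    static List<String> split(String token) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String previousPart = null;
        for (String part : token.split(",", -1)) {
            boolean keepComma = previousPart != null
                    && isNumeric(previousPart)
                    && isNumeric(part);
            if (keepComma) {
                // Numeric on both sides: keep the comma inside the token.
                current.append(',').append(part);
            } else {
                // Otherwise emit what we have and start a new token.
                if (current.length() > 0) {
                    result.add(current.toString());
                }
                current.setLength(0);
                current.append(part);
            }
            previousPart = part;
        }
        if (current.length() > 0) {
            result.add(current.toString());
        }
        return result;
    }
}
```

With this rule, CommaTokenSplitter.split("87,Green") yields the two tokens
87 and Green, while "10,000" stays a single token. A value such as "87,10"
would still be kept whole, which may or may not be what an address field
needs.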

Many thanks
