Posted to dev@lucene.apache.org by "colm.mchugh" <co...@mapflow.com> on 2012/03/22 14:23:25 UTC

Case where StandardAnalyzer doesn't remove punctuation

I'm using Lucene to search address data, and came across an interesting case
where StandardAnalyzer appears not to remove punctuation (a comma). To
illustrate, the following code snippet uses StandardAnalyzer to analyze an
address, printing out each analyzed token.

The output of the code snippet is:

If the code is altered slightly so the String text is initialized as
follows:

(there's a space between the first comma and the building number) then the
output is as follows:

I would expect the output to be the same in both cases
based on my understanding. Is this a known issue? Or am I off on my
understanding? It's not a biggie. It caught my attention because I have a
unit test that asserts token text is all lower case or alphanumeric. It can
be easily got around, but I thought it worth posting about.
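
One way such a workaround might look, as a minimal sketch only: put a space
after every comma before the text reaches the analyzer, so that a comma never
sits directly between two digits. The class name, the method name and the
sample address are all hypothetical (the address in the original post was not
preserved), and nothing here is part of Lucene.

```java
// Hypothetical pre-processing step; not part of Lucene.
class CommaSpacing {

    // Insert a space after any comma that is directly followed by a
    // non-whitespace character. Commas already followed by whitespace,
    // or at the end of the string, are left untouched.
    static String spaceAfterCommas(String text) {
        return text.replaceAll(",(?=\\S)", ", ");
    }

    public static void main(String[] args) {
        // Invented sample address, for illustration only.
        System.out.println(spaceAfterCommas("Unit 5,24 Main Street"));
    }
}
```

This would also split genuine numbers such as "1,000", so it only suits data
like addresses where a comma between digits is not a thousands separator.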

--
View this message in context: http://lucene.472066.n3.nabble.com/Case-where-StandardAnalyzer-doesn-t-remove-punctuation-tp3848460p3848460.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


RE: Case where StandardAnalyzer doesn't remove punctuation

Posted by "colm.mchugh" <co...@mapflow.com>.
Hi Steve,

Thanks for your response. Totally makes sense, given that the comma
character is widely used in written number syntax (e.g. 1000 is the same
as 1,000). Thanks also for the notes re the mailing list and nabble.

Colm.

--
View this message in context: http://lucene.472066.n3.nabble.com/Case-where-StandardAnalyzer-doesn-t-remove-punctuation-tp3848460p3858661.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.



RE: Case where StandardAnalyzer doesn't remove punctuation

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Colm,

Thanks for bringing the issue up.  The behavior you mention is expected, though - see below for details.

In the future, you should use the java-user@l.a.o mailing list for questions about Lucene *usage* -- this mailing list (dev@l.a.o), by contrast, is intended for development discussions.

<rant>
FYI, Nabble.com stripped out all of your code and examples before sending your message to the mailing list.  My suggestion: stop using Nabble.  (I've described this problem to their support people a couple of times, and they apparently just don't care, since it still persists, years later.)
</rant>

StandardTokenizer, the tokenizer included in StandardAnalyzer, implements the Word Break rules from Unicode 6.0.0 UAX#29 <http://www.unicode.org/reports/tr29/tr29-17.html>.  These rules are international in nature.

The UAX#29 Word Break rules that prohibit breaking around commas when surrounded by digits are:

	Do not break within sequences, such as "3.2" or "3,456.789".

	WB11.	Numeric (MidNum | MidNumLet) × Numeric
	WB12.	Numeric × (MidNum | MidNumLet) Numeric

(In these rules, "×" means "do not break".)
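
To make the effect of WB11/WB12 concrete, here is a rough sketch in plain Java. It is NOT StandardTokenizer and not a UAX#29 implementation: it only treats ',' and '.' as stand-ins for MidNum/MidNumLet, keeps them when they sit directly between two digits, and breaks on every other non-alphanumeric character. It also lowercases tokens, roughly as StandardAnalyzer's lowercase filter would. The class name, method names and the sample address are all made up for illustration (the address from the original post was stripped).

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of UAX#29 rules WB11/WB12; NOT Lucene's StandardTokenizer.
class WordBreakSketch {

    // Assumption: only ',' and '.' stand in for the MidNum/MidNumLet classes.
    static boolean isMidNum(char c) {
        return c == ',' || c == '.';
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // WB11/WB12 analogue: no break around a MidNum character
            // that has a digit on both sides.
            boolean betweenDigits = isMidNum(c)
                    && i > 0 && Character.isDigit(text.charAt(i - 1))
                    && i + 1 < text.length() && Character.isDigit(text.charAt(i + 1));
            if (Character.isLetterOrDigit(c) || betweenDigits) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // Any other character ends the current token and is dropped.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Invented sample addresses, for illustration only.
        System.out.println(String.join(" | ", tokenize("Unit 5,24 Main Street")));
        System.out.println(String.join(" | ", tokenize("Unit 5, 24 Main Street")));
    }
}
```

With no space after the comma, "5,24" stays together as a single token; with the space, the comma no longer has a digit on both sides, so it is dropped and "5" and "24" come out as separate tokens.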

From the table "Word_Break property values" at <http://www.unicode.org/reports/tr29/tr29-17.html#Default_Word_Boundaries>, you can see that the set of characters that are assigned the Word_Break:MidNum property value is a superset of the characters assigned the Line_Break:Infix_Numeric property value, which includes the comma character.  You can see the full set of characters that are assigned the Line_Break:Infix_Numeric property value here: <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3ALine_Break+%3D+Infix_Numeric%3A%5D> (note that because this utility refers to Unicode 6.1.0, the results may differ slightly from Lucene v3.5.0 StandardTokenizer, since it uses Unicode 6.0.0).

Steve
