Posted to solr-user@lucene.apache.org by Vincenzo D'Amore <v....@gmail.com> on 2016/06/01 23:13:57 UTC

StandardTokenizer behaviour with apostrophe and colon

Hi all,

StandardTokenizer doesn't split text on an apostrophe (punctuation mark
' ) or on a colon (punctuation mark : ).

Just to be clear: according to the documentation, all punctuation marks are
delimiters, with an exception for periods (dots), so I would expect a pair
of Italian words like "nell'aria" to be split into two words, "nell" and
"aria".

So I have worked around the problem by using a WordDelimiterFilterFactory.
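
A minimal sketch of such a workaround is below. The field type name
"text_it_split" is hypothetical and the attribute values are assumptions
chosen for illustration, not necessarily the exact configuration in use:

```xml
<!-- Sketch only: "text_it_split" is a hypothetical name and the
     attribute values below are illustrative assumptions. -->
<fieldType name="text_it_split" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- StandardTokenizer emits "nell'aria" as a single token -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- WordDelimiterFilter then splits that token on intra-word
         delimiters such as ' and : -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            splitOnCaseChange="0"
            preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a chain like this, "nell'aria" should come out as the two tokens
"nell" and "aria"; the Analysis screen in the Solr admin UI is a quick way
to check the output of each stage.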

Is this a bug or undocumented behaviour? In any case, what should I do next?

Best regards,
Vincenzo


-- 
Vincenzo D'Amore
email: v.damore@gmail.com
skype: free.dev
mobile: +39 349 8513251