Posted to issues@lucene.apache.org by "romseygeek (via GitHub)" <gi...@apache.org> on 2023/05/04 11:15:54 UTC

[GitHub] [lucene] romseygeek commented on issue #12264: Shouldn't StandardTokenizer keep aplanum dot joined?

romseygeek commented on issue #12264:
URL: https://github.com/apache/lucene/issues/12264#issuecomment-1534586713

   The tokenizer is based on http://unicode.org/reports/tr29/, which has rules for handling dots that appear in numbers or in URLs. It does seem, though, that URLs with a number before a dot are not handled here. The relevant rule, I think, is http://unicode.org/reports/tr29/#WB6, which tells the tokenizer not to break on letter + dot + letter, and then WB11 tells it not to break on number + dot + number, but there's nothing about number + dot + letter, possibly because there are also a bunch of cases where we *do* actually want to break here?
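
   To make the gap concrete, here is a minimal sketch of the two dot rules in plain Java. It is a toy illustration, not Lucene's StandardTokenizer or a full UAX #29 implementation: the class name `DotRuleSketch` and the helpers `dotJoins` and `tokenize` are hypothetical, only `.` is treated as a joining character, and every other non-alphanumeric character is treated as a break.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the dot handling discussed above (hypothetical names,
// NOT Lucene's implementation and not a complete UAX #29 segmenter).
class DotRuleSketch {

    // A dot joins its neighbours only for letter + dot + letter (in the
    // spirit of WB6/WB7) or digit + dot + digit (in the spirit of WB11/WB12).
    static boolean dotJoins(String s, int i) {
        if (i == 0 || i == s.length() - 1) {
            return false; // a leading or trailing dot never joins
        }
        char prev = s.charAt(i - 1);
        char next = s.charAt(i + 1);
        boolean letterDotLetter = Character.isLetter(prev) && Character.isLetter(next);
        boolean digitDotDigit = Character.isDigit(prev) && Character.isDigit(next);
        return letterDotLetter || digitDotDigit;
    }

    // Split the input into tokens, keeping a dot inside a token only
    // when dotJoins allows it; all other punctuation ends the token.
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isLetterOrDigit(c) || (c == '.' && dotJoins(s, i))) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("example.com"));   // letter + dot + letter stays joined
        System.out.println(tokenize("3.14"));          // digit + dot + digit stays joined
        System.out.println(tokenize("1.example.com")); // digit + dot + letter is split
    }
}
```

   In this sketch `example.com` and `3.14` each come out as a single token, while `1.example.com` is split into `1` and `example.com`, because neither rule covers a digit followed by a dot and a letter.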


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
