Posted to issues@lucene.apache.org by "mkhludnev (via GitHub)" <gi...@apache.org> on 2023/05/04 10:19:21 UTC

[GitHub] [lucene] mkhludnev opened a new issue, #12264: Shouldn't StandardTokenizer keep aplanum dot joined?

mkhludnev opened a new issue, #12264:
URL: https://github.com/apache/lucene/issues/12264

   ### Description
   
   ### AS-IS
   `a9nine.com` -> `a9nine.com`
   `3.14` -> `3.14`
   ### Problem
   `a9.com` -> `a9` `com`
   
   Shouldn't `StandardTokenizer` keep `a9.com` joined as a single token as well?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mkhludnev commented on issue #12264: Shouldn't StandardTokenizer keep aplanum dot joined?

Posted by "mkhludnev (via GitHub)" <gi...@apache.org>.
mkhludnev commented on issue #12264:
URL: https://github.com/apache/lucene/issues/12264#issuecomment-1534792087

   The proposal, modeled on http://unicode.org/reports/tr29/#WB7, is to implement two new "do not break" rules:
   *WB6a*
   `AHLetter Numeric | × | (MidLetter | MidNumLetQ) AHLetter`
   *WB7d*
   `AHLetter Numeric (MidLetter | MidNumLetQ) | × | AHLetter`
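
To make the two proposed rules concrete, here is a minimal Python sketch that reads them literally. It is not Lucene code and not the actual UAX #29 property tables: the helper names `kind` and `proposed_no_break` are hypothetical, and the character classes are an ASCII-only assumption (letters as `AHLetter`, digits as `Numeric`, and only `.` and `'` standing in for `MidLetter | MidNumLetQ`).

```python
def kind(ch):
    """Rough ASCII-only stand-in for the UAX #29 word-break classes (assumption)."""
    if ch.isalpha():
        return "AHLetter"
    if ch.isdigit():
        return "Numeric"
    if ch in ".'":
        return "Mid"  # stands in for (MidLetter | MidNumLetQ)
    return "Other"


def proposed_no_break(text, i):
    """Return the proposed rule forbidding a break before text[i], or None."""
    kinds = [kind(c) for c in text]

    def at(j):
        # Class at index j, or None when j falls outside the text.
        return kinds[j] if 0 <= j < len(kinds) else None

    pattern = ("AHLetter", "Numeric", "Mid", "AHLetter")
    # WB6a: AHLetter Numeric x (MidLetter | MidNumLetQ) AHLetter
    if (at(i - 2), at(i - 1), at(i), at(i + 1)) == pattern:
        return "WB6a"
    # WB7d: AHLetter Numeric (MidLetter | MidNumLetQ) x AHLetter
    if (at(i - 3), at(i - 2), at(i - 1), at(i)) == pattern:
        return "WB7d"
    return None


if __name__ == "__main__":
    word = "a9.com"
    for i in range(1, len(word)):
        print(repr(word[:i]), "|", repr(word[i:]), "->", proposed_no_break(word, i))
```

In this sketch, the boundaries on either side of the dot in `a9.com` are covered by WB6a and WB7d respectively. Under the same literal reading, `a99.com` is not covered, because the character before the last digit is another digit rather than a letter; whether that is intended is a question for the proposal.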
   




[GitHub] [lucene] mkhludnev commented on issue #12264: Shouldn't StandardTokenizer keep aplanum dot joined?

Posted by "mkhludnev (via GitHub)" <gi...@apache.org>.
mkhludnev commented on issue #12264:
URL: https://github.com/apache/lucene/issues/12264#issuecomment-1534733406

   Thanks @romseygeek. Right, it's a question. Maybe it's worth discussing.
   
   For reference: https://lists.apache.org/thread/gpxz58jdb9n1sh2oxx161g4kkd7x94wn




[GitHub] [lucene] romseygeek commented on issue #12264: Shouldn't StandardTokenizer keep aplanum dot joined?

Posted by "romseygeek (via GitHub)" <gi...@apache.org>.
romseygeek commented on issue #12264:
URL: https://github.com/apache/lucene/issues/12264#issuecomment-1534586713

   The tokenizer is based on http://unicode.org/reports/tr29/, which has rules for handling dots that appear in numbers or in URLs, but it does seem that URLs with a number before a dot are not handled here. The relevant rule, I think, is http://unicode.org/reports/tr29/#WB6, which tells the tokenizer not to break on letter + dot + letter, and WB11 tells it not to break on number + dot + number. There is nothing about number + dot + letter, possibly because there are also a bunch of cases where we *do* actually want to break here?
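
The gap described above can be reproduced with a small Python sketch of a subset of the word-break rules. This is a simplified illustration, not the `StandardTokenizer` grammar: the function names are hypothetical, the classes are ASCII-only assumptions, and only `.` is treated as a mid character (the real rule set covers many more characters and rules).

```python
def char_class(ch):
    """Very rough ASCII-only stand-in for UAX #29 word-break properties."""
    if ch.isalpha():
        return "AHLetter"
    if ch.isdigit():
        return "Numeric"
    if ch == ".":
        return "MidNumLet"  # allowed inside both words and numbers
    return "Other"


def no_break(text, i):
    """True if a simplified rule forbids a break between text[i-1] and text[i]."""
    cls = [char_class(c) for c in text]
    left, right = cls[i - 1], cls[i]
    before = cls[i - 2] if i >= 2 else None
    after = cls[i + 1] if i + 1 < len(text) else None
    # WB5, WB8, WB9, WB10: adjacent letters and digits stay together.
    if left in ("AHLetter", "Numeric") and right in ("AHLetter", "Numeric"):
        return True
    # WB6 / WB7: letter + dot + letter.
    if left == "AHLetter" and right == "MidNumLet" and after == "AHLetter":
        return True
    if before == "AHLetter" and left == "MidNumLet" and right == "AHLetter":
        return True
    # WB11 / WB12: number + dot + number.
    if left == "Numeric" and right == "MidNumLet" and after == "Numeric":
        return True
    if before == "Numeric" and left == "MidNumLet" and right == "Numeric":
        return True
    # No rule matched (for example number + dot + letter): break here.
    return False


def tokenize(text):
    """Split at every allowed break, then keep segments with a letter or digit."""
    segments, start = [], 0
    for i in range(1, len(text) + 1):
        if i == len(text) or not no_break(text, i):
            segments.append(text[start:i])
            start = i
    return [s for s in segments if any(c.isalnum() for c in s)]


if __name__ == "__main__":
    for sample in ("a9nine.com", "3.14", "a9.com"):
        print(sample, "->", tokenize(sample))
```

With these simplified rules, `a9nine.com` and `3.14` each come out as one token, while `a9.com` splits into `a9` and `com`, because neither the letter rule nor the number rule matches a digit followed by a dot and a letter.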

