Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/11/26 01:58:06 UTC

[GitHub] [lucene] maomao905 commented on issue #11976: End offset for compatibility characters is not incremented with ICUNormalizer2CharFilter

maomao905 commented on issue #11976:
URL: https://github.com/apache/lucene/issues/11976#issuecomment-1327957480

   Thanks! 
   
   > both 1 and 月 should have same offsets as they come from same input character ㋀
   
   I ran your suggested test and it failed. I am not sure whether this is a bug or not.
   The end offset of character `1` is 2 (expected offset is 3).
   ```
   $ ./gradlew test --tests org.apache.lucene.analysis.icu.TestICUNormalizer2CharFilter.testDecomposeFromSameInputCharacter
   ...
   org.apache.lucene.analysis.icu.TestICUNormalizer2CharFilter > testDecomposeFromSameInputCharacter FAILED
       java.lang.AssertionError: endOffset 2 term=1 expected:<3> but was:<2>
   ...
   ```
   
   > Test/issue should not be named "combining character" as there are no combining characters involved. "combining character" has a very specific meaning in unicode and this is not that.
   
   I changed the issue title from "combining character" to "compatibility character".
   `㋀` appears to belong to the [Enclosed CJK Letters and Months](https://en.wikipedia.org/wiki/Enclosed_CJK_Letters_and_Months) block in Unicode.
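   
   To illustrate why `1` and `月` come from the same input character, here is a minimal sketch that applies NFKC to `㋀` (U+32C0). It uses the JDK's `java.text.Normalizer` rather than `ICUNormalizer2CharFilter`, and the class name `CompatDecomposition` and the helper `nfkc` are hypothetical, so it only shows the compatibility decomposition, not the char filter's offset correction.
   
   ```java
   import java.text.Normalizer;

   public class CompatDecomposition {
       // Hypothetical helper: NFKC via the JDK normalizer (an assumption for
       // illustration; the issue itself concerns ICUNormalizer2CharFilter).
       static String nfkc(String input) {
           return Normalizer.normalize(input, Normalizer.Form.NFKC);
       }

       public static void main(String[] args) {
           String input = "\u32C0"; // ㋀, a single UTF-16 code unit
           String output = nfkc(input);

           // One input character expands to two output characters.
           System.out.println(input.length() + " -> " + output.length());
           for (int i = 0; i < output.length(); i++) {
               System.out.printf("U+%04X%n", (int) output.charAt(i));
           }
       }
   }
   ```
   
   Since both output characters (U+0031 and U+6708) are produced from the one input character, a token for either of them would be expected to map back to the offsets of `㋀` in the original text.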


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

