Posted to j-dev@xerces.apache.org by bu...@apache.org on 2002/04/07 05:27:31 UTC
DO NOT REPLY [Bug 7806] New: - Non-BMP Unicode block names in regexes
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=7806>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
INSERTED IN THE BUG DATABASE.
Summary: Non-BMP Unicode block names in regexes
Product: Xerces2-J
Version: 2.0.0
Platform: Other
OS/Version: Other
Status: NEW
Severity: Normal
Priority: Other
Component: XML Schema datatypes
AssignedTo: xerces-j-dev@xml.apache.org
ReportedBy: jjc@jclark.com
There's a bug in the handling of Unicode block names for blocks that lie outside
the BMP (i.e. with code points > 0xFFFF). For example, \p{IsGothic} doesn't work
as it should.
The bug is in org.apache.xerces.impl.xpath.regex.Token. In the declaration of
blockNames, there's a comment:
//missing Specials add manually
But it doesn't do this. The blockRanges string includes things like \u10300\u1032F,
which is completely bogus: \u takes exactly 4 hex digits, so \u10300 is read as
the character U+1030 followed by the literal digit 0.
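To make the escape problem concrete, here is a minimal sketch of how a Java compiler reads such a literal. The class and method names are made up for illustration, and Character.toChars and codePointAt assume Java 5 or later, so they are not something the Xerces code of this era could have used.

```java
class UnicodeEscapeDemo {
    // Written as if it were the single code point U+10300, but the compiler
    // consumes only four hex digits: this is U+1030 followed by the digit '0'.
    static String bogusLiteral() {
        return "\u10300";
    }

    // A supplementary code point has to be stored as a surrogate pair.
    static String supplementary(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String bogus = bogusLiteral();
        System.out.println(bogus.length());                        // 2
        System.out.println(Integer.toHexString(bogus.charAt(0)));  // 1030
        System.out.println(bogus.charAt(1));                       // 0

        String oldItalicStart = supplementary(0x10300);
        System.out.println(oldItalicStart.length());               // 2 (surrogate pair)
        System.out.println(Integer.toHexString(oldItalicStart.codePointAt(0))); // 10300
    }
}
```

Both strings have length 2, but only the second one actually encodes U+10300.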
The fix is to add a table of non-BMP block ranges, stored as ints:
static final int[] nonBmpBlockRanges = { 0x10300, 0x1032F, ... };
Then in Token.getRange(), do addRange for each of the ranges in
nonBmpBlockRanges.
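A rough sketch of what such a table and lookup might look like follows. It is not the actual Xerces code: the RangeList class is only a stand-in for whatever range token Token.getRange() builds, the names NON_BMP_BLOCK_NAMES and rangeForBlock are hypothetical, and only three supplementary-plane blocks are listed.

```java
import java.util.ArrayList;
import java.util.List;

class NonBmpBlockSketch {
    // Hypothetical table of block names, parallel to the range table below.
    static final String[] NON_BMP_BLOCK_NAMES = { "OldItalic", "Gothic", "Deseret" };

    // Flat (start, end) pairs of inclusive code points, kept as ints because
    // values above 0xFFFF cannot be written as a single char escape.
    static final int[] NON_BMP_BLOCK_RANGES = {
        0x10300, 0x1032F,   // Old Italic
        0x10330, 0x1034F,   // Gothic
        0x10400, 0x1044F,   // Deseret
    };

    /** Stand-in for a range token: a list of inclusive [start, end] pairs. */
    static final class RangeList {
        final List<int[]> ranges = new ArrayList<>();

        void addRange(int start, int end) {
            ranges.add(new int[] { start, end });
        }

        boolean contains(int codePoint) {
            for (int[] r : ranges) {
                if (r[0] <= codePoint && codePoint <= r[1]) {
                    return true;
                }
            }
            return false;
        }
    }

    /** Returns the range for a non-BMP block name, or null if it is not in the table. */
    static RangeList rangeForBlock(String blockName) {
        for (int i = 0; i < NON_BMP_BLOCK_NAMES.length; i++) {
            if (NON_BMP_BLOCK_NAMES[i].equals(blockName)) {
                RangeList range = new RangeList();
                range.addRange(NON_BMP_BLOCK_RANGES[2 * i], NON_BMP_BLOCK_RANGES[2 * i + 1]);
                return range;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        RangeList gothic = rangeForBlock("Gothic");
        System.out.println(gothic.contains(0x10330)); // true
        System.out.println(gothic.contains(0x10300)); // false (Old Italic)
    }
}
```

Keeping the ranges in an int[] rather than a String avoids the 4-hex-digit limit on \u escapes.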