Posted to j-dev@xerces.apache.org by bu...@apache.org on 2002/04/07 05:27:31 UTC

DO NOT REPLY [Bug 7806] New: - Non-BMP Unicode block names in regexes

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=7806>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

           Summary: Non-BMP Unicode block names in regexes
           Product: Xerces2-J
           Version: 2.0.0
          Platform: Other
        OS/Version: Other
            Status: NEW
          Severity: Normal
          Priority: Other
         Component: XML Schema datatypes
        AssignedTo: xerces-j-dev@xml.apache.org
        ReportedBy: jjc@jclark.com


Unicode block names for blocks outside the BMP (i.e. with code points > 0xFFFF) 
are handled incorrectly.  For example, \p{IsGothic} doesn't work as it should.

The bug is in org.apache.xerces.impl.xpath.regex.Token.  In the declaration of 
blockNames, there's a comment:

         //missing Specials add manually

But the code never does this. The blockRanges string includes things like 
\u10300\u1032F, which is completely bogus, since a \u escape takes exactly 4 hex 
digits: \u10300 is read as U+1030 followed by the character '0'.
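
To illustrate why such an escape cannot work, here is a small standalone sketch 
(not Xerces code; the class and field names are made up for this example, and it 
assumes Java 5 or later for codePointAt and Character.toChars):

```java
public class UnicodeEscapeDemo {
    // A Java Unicode escape consumes exactly four hex digits, so this literal
    // is the single char U+1030 followed by the ordinary character '0'.
    static final String BOGUS = "\u10300";

    // A code point above 0xFFFF, such as U+10300, has to be written as a
    // surrogate pair in a Java string literal.
    static final String OLD_ITALIC_START = "\uD800\uDF00";

    public static void main(String[] args) {
        System.out.println(BOGUS.length());                            // 2
        System.out.println(Integer.toHexString(BOGUS.charAt(0)));      // 1030
        System.out.println(BOGUS.charAt(1));                           // 0
        System.out.println(
            Integer.toHexString(OLD_ITALIC_START.codePointAt(0)));     // 10300
    }
}
```

Storing the ranges as int code points, rather than as chars in a string, avoids 
the problem entirely.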

The fix is to add a table of non-BMP block ranges, stored as (start, end) pairs 
of int code points:

static final int[] nonBmpBlockRanges = { 0x10300, 0x1032F, ... };

Then in Token.getRange(), call addRange for each (start, end) pair in 
nonBmpBlockRanges.
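
A rough sketch of the idea, outside of Xerces, might look like the following. 
Everything here is hypothetical: the class NonBmpBlockSketch, the RangeSet 
helper, and rangeFor() merely stand in for Token, its range tokens, and 
getRange(), whose real signatures are not shown in this report. Only three 
blocks are listed, for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class NonBmpBlockSketch {
    // Block names, in the same order as the (start, end) pairs below.
    static final String[] NON_BMP_BLOCK_NAMES = { "OldItalic", "Gothic", "Deseret" };

    // Flat table of (start, end) code points, kept as ints because these
    // values do not fit in a single Java char.
    static final int[] NON_BMP_BLOCK_RANGES = {
        0x10300, 0x1032F, // Old Italic
        0x10330, 0x1034F, // Gothic
        0x10400, 0x1044F, // Deseret
    };

    /** Hypothetical stand-in for a range token with an addRange method. */
    static final class RangeSet {
        private final List<int[]> ranges = new ArrayList<>();

        void addRange(int start, int end) {
            ranges.add(new int[] { start, end });
        }

        boolean contains(int codePoint) {
            for (int[] r : ranges) {
                if (codePoint >= r[0] && codePoint <= r[1]) {
                    return true;
                }
            }
            return false;
        }
    }

    /** Returns the range for a non-BMP block name, or null if it is unknown. */
    static RangeSet rangeFor(String blockName) {
        for (int i = 0; i < NON_BMP_BLOCK_NAMES.length; i++) {
            if (NON_BMP_BLOCK_NAMES[i].equals(blockName)) {
                RangeSet set = new RangeSet();
                set.addRange(NON_BMP_BLOCK_RANGES[2 * i], NON_BMP_BLOCK_RANGES[2 * i + 1]);
                return set;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        RangeSet gothic = rangeFor("Gothic");
        System.out.println(gothic.contains(0x10330)); // true: first Gothic code point
        System.out.println(gothic.contains(0x1032F)); // false: last Old Italic code point
    }
}
```

Keeping the names and ranges in parallel tables mirrors the blockNames and 
blockRanges arrangement described above, but any lookup structure would do.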
