Posted to j-dev@xerces.apache.org by "Josh Spiegel (JIRA)" <xe...@xml.apache.org> on 2009/11/10 02:47:32 UTC

[jira] Commented: (XERCESJ-1389) RegEx matching: ranges not computed correctly in "ignore case" mode

    [ https://issues.apache.org/jira/browse/XERCESJ-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775244#action_12775244 ] 

Josh Spiegel commented on XERCESJ-1389:
---------------------------------------

First of all, thanks for fixing this bug.

I was looking at this fix (831926) and I think there may be a problem, though I am not certain.  I apologize in advance if I am mistaken.

When interpreting a case-insensitive range, the code seems to add the lower-case and upper-case forms of each character in the range (see the new RegexParser.addCaseInsensitiveChar and RegexParser.addCaseInsensitiveCharRange).  However, it is my understanding that not all character case mappings in Unicode are invertible like this (http://unicode.org/faq/casemap_charprop.html#2).

For example, both capital K and the Kelvin sign have a lower-case of 'k':
    lower-case(['K' - 0x004B]) ==  'k' 
      AND 
    lower-case([Kelvin-sign - 0x212A]) ==  'k'

So, if I have the regular expression 'k' in case-insensitive mode, shouldn't it match both 'K' and the Kelvin sign?  Currently it seems it would only match 'k' or 'K'.
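
To make the concern concrete, here is a minimal sketch in plain Java. It is not the Xerces code: the class CaseVariants and its methods naive, fold and of are hypothetical names invented for this illustration. naive() mirrors the approach as I understand it (the character plus its direct lower-case and upper-case mappings), while of() brute-force scans the BMP for every character that folds to the same key. The fold used here (upper-case, then lower-case) is only an assumption about what "case-insensitive" should mean, not something taken from the schema spec.

```java
import java.util.Set;
import java.util.TreeSet;

public class CaseVariants {

    /** Naive variant set: the code point plus its direct lower/upper mappings. */
    static Set<Integer> naive(int cp) {
        Set<Integer> variants = new TreeSet<>();
        variants.add(cp);
        variants.add(Character.toLowerCase(cp));
        variants.add(Character.toUpperCase(cp));
        return variants;
    }

    /** Assumed fold key for this sketch: upper-case first, then lower-case. */
    static int fold(int cp) {
        return Character.toLowerCase(Character.toUpperCase(cp));
    }

    /** Every BMP code point whose fold key equals that of cp (brute-force scan). */
    static Set<Integer> of(int cp) {
        int key = fold(cp);
        Set<Integer> variants = new TreeSet<>();
        for (int c = 0; c <= 0xFFFF; c++) {
            if (fold(c) == key) {
                variants.add(c);
            }
        }
        return variants;
    }

    public static void main(String[] args) {
        int kelvin = 0x212A; // KELVIN SIGN

        // The Kelvin sign lower-cases to 'k', but 'k' upper-cases to 'K'.
        System.out.println(Character.toLowerCase(kelvin) == 'k'); // true
        System.out.println(Character.toUpperCase('k') == 'K');    // true

        // Starting from 'k', the naive set never reaches the Kelvin sign.
        System.out.println(naive('k').contains(kelvin)); // false
        System.out.println(of('k').contains(kelvin));    // true
    }
}
```

A full scan per character is far too slow for a real parser; a precomputed table of such equivalence classes would be the obvious alternative, but the point here is only that the two sets differ.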

Thanks.

> RegEx matching: ranges not computed correctly in "ignore case" mode
> -------------------------------------------------------------------
>
>                 Key: XERCESJ-1389
>                 URL: https://issues.apache.org/jira/browse/XERCESJ-1389
>             Project: Xerces2-J
>          Issue Type: Bug
>          Components: Other
>    Affects Versions: 2.9.1
>            Reporter: Radu Preotiuc-Pietro
>            Assignee: Khaled Noaman
>
> There are a couple of problems in interpreting character ranges in "case-insensitive" mode.
> When doing range subtraction (or negation), all the case-variants of the subtracted characters need to be considered. For example, "[^Q]" means, in case-insensitive mode, "any character except 'q' or 'Q'" but the regex engine matches both 'q' and 'Q' in this example.
> Also, in case-insensitive mode, all character classes must stay the same, so for example "\p{Lu}" would not match a lowercase letter, but the regex engine matches 'q'.
