Posted to j-dev@xerces.apache.org by "Michael Glavassevich (JIRA)" <xe...@xml.apache.org> on 2009/07/12 18:27:15 UTC

[jira] Commented: (XERCESJ-1383) Adding Unicode Normalization support to Xerces2-J

    [ https://issues.apache.org/jira/browse/XERCESJ-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730116#action_12730116 ] 

Michael Glavassevich commented on XERCESJ-1383:
-----------------------------------------------

Hi Richard, as I mentioned on the mailing list, what you have so far is looking good. I do have a few suggestions:

For performance, I think that if a piece of text has already been determined to be normalized by the normalization checker, you could pass the original string through without calling normalize. For example:

    public void comment(XMLString text, Augmentations augs) throws XNIException {
        boolean normalized = false;
        if (fCheckCharacters) {
            normalized = checkNormalized(text, 0);
        }
        if (fDocumentHandler != null) {
            // Only call normalize() if the checker has not already found the text normalized.
            if (fCharacterNormalization && !normalized) {
                fDocumentHandler.comment(normalize(text), augs);
            }
            else {
                fDocumentHandler.comment(text, augs);
            }
        }
    } // comment(XMLString,Augmentations)

I wonder if the new error message you added ("The XML characters are not fully normalized.") could include some context about the error (e.g. the sequence of text which isn't normalized). That would help users understand which portion(s) of the document they need to repair to make it normalized.
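
A minimal sketch of how such context could be produced, under several assumptions: it uses java.text.Normalizer (Java 6 and later) rather than whatever normalization code the patch contains, it treats NFC as a stand-in for "fully normalized" (the XML 1.1 definition is stricter), and the class name, method names and message wording are hypothetical. It locates the first character that differs from the NFC form and quotes the surrounding text with non-ASCII characters escaped, so that combining marks are visible in the message.

```java
import java.text.Normalizer;

public class NormalizationErrorContext {

    /** Returns the index of the first char that differs from the NFC form, or -1 if the text is already NFC. */
    static int firstDifference(String text) {
        if (Normalizer.isNormalized(text, Normalizer.Form.NFC)) {
            return -1;
        }
        String nfc = Normalizer.normalize(text, Normalizer.Form.NFC);
        int limit = Math.min(text.length(), nfc.length());
        for (int i = 0; i < limit; i++) {
            if (text.charAt(i) != nfc.charAt(i)) {
                return i;
            }
        }
        // The strings share a common prefix and differ only in length.
        return limit;
    }

    /** Replaces control and non-ASCII chars with four-digit hexadecimal escapes. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20 || c > 0x7e) {
                sb.append(String.format("\\u%04X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds an error message quoting up to 'context' chars on either side of the problem, or null if none. */
    static String describe(String text, int context) {
        int at = firstDifference(text);
        if (at < 0) {
            return null;
        }
        int start = Math.max(0, at - context);
        int end = Math.min(text.length(), at + context);
        return "The XML characters are not fully normalized near offset " + at
                + ": \"" + escape(text.substring(start, end)) + "\"";
    }

    public static void main(String[] args) {
        // 'e' followed by a combining acute accent is not in NFC.
        System.out.println(describe("cafe\u0301", 4));
    }
}
```

The offset here is relative to the string being checked; mapping it to a line and column in the document would need the locator information available in the pipeline, which this sketch does not attempt.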

Xerces guarantees that the String values in QNames and several other constructs have been internalized [1] (i.e. String.intern()). Applications rely on this, as do Xerces' internals, which in many places do reference comparison with '==' instead of .equals() for performance reasons. We need to make sure that when we normalize one of those constructs, the String value that we pass down the pipeline has been internalized. This can be accomplished using the SymbolTable [2].

[1] http://xerces.apache.org/xerces2-j/features.html#string-interning
[2] http://xerces.apache.org/xerces2-j/javadocs/xerces2/org/apache/xerces/util/SymbolTable.html
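
The interning concern can be illustrated with a self-contained sketch. To keep it runnable without the Xerces jar, it assumes String.intern() as a stand-in for the SymbolTable and java.text.Normalizer with NFC as a stand-in for the patch's normalizer; the class and method names are hypothetical. Inside Xerces the same role would presumably be played by SymbolTable.addSymbol(String), which returns the canonical instance for a symbol.

```java
import java.text.Normalizer;

public class InternedNormalization {

    /** Normalizes a name to NFC and returns the canonical (interned) instance. */
    static String normalizeName(String name) {
        String result = Normalizer.isNormalized(name, Normalizer.Form.NFC)
                ? name
                : Normalizer.normalize(name, Normalizer.Form.NFC);
        // Without this step the result would be a fresh String object, and any
        // downstream '==' comparison against an interned name would fail.
        return result.intern();
    }

    public static void main(String[] args) {
        String known = "caf\u00e9";        // string literals are interned
        String decomposed = "cafe\u0301";  // same name, decomposed form

        String raw = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println("equals without interning: " + raw.equals(known));
        System.out.println("==     without interning: " + (raw == known));
        System.out.println("==     with interning:    " + (normalizeName(decomposed) == known));
    }
}
```

The point of the design is that normalization produces a new String whenever the input changes, so the interning step has to come after normalization, not before it.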

> Adding Unicode Normalization support to Xerces2-J 
> --------------------------------------------------
>
>                 Key: XERCESJ-1383
>                 URL: https://issues.apache.org/jira/browse/XERCESJ-1383
>             Project: Xerces2-J
>          Issue Type: New Feature
>          Components: DOM (Level 3 Core), SAX
>    Affects Versions: 2.9.1
>         Environment: All
>            Reporter: Richard Kelly
>            Assignee: Michael Glavassevich
>         Attachments: CharacterNormalizer.java, CharacterNormalizer.patch
>
>
> This feature will add support for Unicode character normalization and normalization checking to Xerces.  Applications that use Xerces will be able to produce fully normalized XML documents and verify that any XML documents they process are fully normalized.
> Adding this functionality will allow Xerces to meet the XML 1.1 W3C Recommendation regarding character normalization and allow it to implement the optional character normalization and normalization checking features specified in the DOM Level 3 Core and SAX2.
> More specifically, the features to be implemented are:
> DOM Level 3 Core: "normalize-characters" [1]
> DOM Level 3 Core: "check-character-normalization" [2]
> SAX2: "unicode-normalization-checking" [3]
> [1] http://www.w3.org/TR/DOM-Level-3-Core/core.html#parameter-normalize-characters
> [2] http://www.w3.org/TR/DOM-Level-3-Core/core.html#parameter-check-character-normalization
> [3] http://www.saxproject.org/apidoc/org/xml/sax/package-summary.html
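
For reference, an application can probe whether a DOM implementation accepts the two DOM Level 3 parameters listed above by using the standard DOMConfiguration API. The sketch below assumes only the JAXP and org.w3c.dom classes shipped with the JDK; the class and method names are hypothetical, and the result depends on which parser is on the classpath, so no output is shown.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;

public class DomNormalizationProbe {

    /** Reports whether the DOM implementation accepts the given boolean parameter set to true. */
    static boolean supports(String parameter) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            DOMConfiguration config = doc.getDomConfig();
            return config.canSetParameter(parameter, Boolean.TRUE);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No usable DOM parser found", e);
        }
    }

    public static void main(String[] args) {
        String[] names = { "normalize-characters", "check-character-normalization" };
        for (String name : names) {
            System.out.println(name + " can be enabled: " + supports(name));
        }
    }
}
```

If canSetParameter returns true, the parameter can then be enabled with config.setParameter(name, Boolean.TRUE) before calling Document.normalizeDocument().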

-- 
This message is automatically generated by JIRA.