Posted to j-dev@xerces.apache.org by merlin <me...@merlin.org> on 2000/10/08 20:24:12 UTC

attribute value normalization

Hi,

It appears that Xerces-J does not perform the attribute value normalization
required by section 3.3.3 of the XML 1.0 recommendation:

http://www.w3.org/TR/1998/REC-xml-19980210#AVNormalize

        If the declared value is not CDATA, then the XML processor
        must further process the normalized attribute value
        by discarding any leading and trailing space (#x20)
        characters, and by replacing sequences of space (#x20)
        characters by a single space (#x20) character.

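Also, section 5.1 states that conformant non-validating processors are
required to use the declarations in the internal DTD subset to do this,
as well as to supply default attribute values and to include the
replacement text of internal entities.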

I have a crude implementation:

org.apache.xerces.framework.XMLAttrList

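    /* merlin@merlin.org hack */
    // Strips leading/trailing spaces and collapses runs of spaces (XML 1.0, 3.3.3).
    private int normalizeValue (int value) {
      String val = fStringPool.toString (value);
      int n = val.length (), i = 0, j = n;
      // skip leading spaces
      while ((i < j) && (val.charAt (i) == ' '))
        ++ i;
      // skip trailing spaces
      while ((j > i) && (val.charAt (j - 1) == ' '))
        -- j;
      // already normalized: return the original string pool index
      if ((i == 0) && (j == n) && (val.indexOf ("  ") < 0))
        return value;
      StringBuffer norm = new StringBuffer ();
      while (i < j) {
        // copy up to and including the first space of the next run,
        // or up to the end of the trimmed value if there is no further run
        int k = val.indexOf ("  ", i);
        if (k < 0)
          k = j;
        else if (k < j)
          ++ k;
        while (i < k)
          norm.append (val.charAt (i ++));
        // skip the remaining spaces of the run
        while ((i < j) && (val.charAt (i) == ' '))
          ++ i;
      }
      return fStringPool.addString (norm.toString ());
    } /* inefficient and is this the right place? */
    /* /merlin@merlin.org hack */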

    public int addAttr (QName attribute, int attValue, int attType,
                        boolean specified, boolean search) throws Exception {
        ...
        /* merlin@merlin.org hack */
        if (attType != fStringPool.addSymbol ("CDATA"))
          attValue = normalizeValue (attValue);
        /* /merlin@merlin.org hack */
        ...
    }

    public void setAttType(int attrIndex, int attTypeIndex) {
        ...
        /* merlin@merlin.org hack */
        if (attTypeIndex != fStringPool.addSymbol ("CDATA"))
          fAttValue[chunk][index] = normalizeValue (fAttValue[chunk][index]);
        /* /merlin@merlin.org hack */
    }

I also, however, had to remove the short-circuit exit from the validator
code:

org.apache.xerces.validators.common.XMLValidator

    private void validateElementAndAttributes(QName element,
                                              XMLAttrList attrList) throws Exception {
        /* merlin@merlin.org cut
        if ((fElementDepth >= 0 && fValidationFlagStack[fElementDepth] != 0 )|| 
        ...
        }
        /merlin@merlin.org cut */

The end result is that if you read the following with a non-validating parser:

        <!DOCTYPE doc [
        <!ATTLIST A id ID #IMPLIED
                    foo CDATA #IMPLIED
                    bar CDATA "baz"
        >]>
        <doc>
        <A id="  fo  o  " foo="  fo  o  " />
        </doc>

The id attribute should be exposed to the application as "fo o"
whereas the foo attribute will be unchanged. Obviously "fo o" is
not a valid ID value, but it serves as an example.
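
For anyone who wants to check the expected values without building
Xerces, here is a standalone sketch of the 3.3.3 rule quoted earlier,
operating on plain strings. The class and method names are made up for
illustration and are not part of Xerces. It assumes that tabs and
newlines have already been turned into #x20 by the first normalization
pass, so it only looks at the space character.

```java
// Illustrative only: AttrValueSketch and its methods are hypothetical
// names, not part of Xerces.
final class AttrValueSketch {

    // Extra step for non-CDATA attributes (XML 1.0, section 3.3.3):
    // drop leading and trailing #x20 and collapse runs of #x20 into one.
    static String collapseSpaces(String value) {
        StringBuilder out = new StringBuilder(value.length());
        boolean pendingSpace = false;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == ' ') {
                // Remember the space only if some text precedes it,
                // which discards leading spaces.
                pendingSpace = out.length() > 0;
            } else {
                if (pendingSpace) {
                    out.append(' ');
                    pendingSpace = false;
                }
                out.append(c);
            }
        }
        // A space still pending here is trailing and is dropped.
        return out.toString();
    }

    // CDATA values are left alone; every other declared type is collapsed.
    static String normalize(String declaredType, String value) {
        return "CDATA".equals(declaredType) ? value : collapseSpaces(value);
    }

    public static void main(String[] args) {
        System.out.println("[" + normalize("ID", "  fo  o  ") + "]");    // [fo o]
        System.out.println("[" + normalize("CDATA", "  fo  o  ") + "]"); // [  fo  o  ]
    }
}
```

The single pass with a pending-space flag avoids the repeated indexOf
scans of my hack above, but it works on String rather than on string
pool indices, so it is not a drop-in replacement.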

Removal of the short-circuit exit is equally important in allowing
ID resolution to work. In the above document, for example, the
internal DTD subset should allow a non-validating parser to locate
element A by identifier.
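
One way to observe both effects from the application side is to parse
the document above through the standard JAXP/DOM interfaces with
validation switched off and print what comes back. This is only a rough
check, not a Xerces test case: the class name is hypothetical, and what
it prints for id and for the ID lookup depends entirely on the parser
that sits behind JAXP.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical harness for illustration; not part of Xerces.
final class AttrNormalizationCheck {

    // The example document from this message.
    static final String XML =
        "<!DOCTYPE doc [\n"
      + "<!ATTLIST A id ID #IMPLIED\n"
      + "            foo CDATA #IMPLIED\n"
      + "            bar CDATA \"baz\"\n"
      + ">]>\n"
      + "<doc>\n"
      + "<A id=\"  fo  o  \" foo=\"  fo  o  \" />\n"
      + "</doc>\n";

    // Parses the example with a non-validating DOM parser.
    static Document parse() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false); // the default, stated for clarity
            return factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse example", e);
        }
    }

    // Returns the named attribute of the first A element.
    static String attributeOfA(String name) {
        Element a = (Element) parse().getElementsByTagName("A").item(0);
        return a.getAttribute(name);
    }

    public static void main(String[] args) {
        System.out.println("id  = [" + attributeOfA("id") + "]");
        System.out.println("foo = [" + attributeOfA("foo") + "]");
        System.out.println("bar = [" + attributeOfA("bar") + "]");
        // Succeeds only if the parser used the internal subset to
        // normalize the value and to mark the attribute as an ID.
        System.out.println("found by ID: "
            + (parse().getElementById("fo o") != null));
    }
}
```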

I don't have a sufficient grasp of the Xerces "big picture" to be
able to tell whether this is even remotely the right place to be doing
this, or what the consequences of eliminating the short circuit are,
but the changes do appear to have the correct effects.

Also, I'm not sure what the patch submission process is.

I built, btw, from a CVS snapshot taken the other day. This is needed
for XML signature processing, which places some fairly stringent
requirements on the XML parser.

Merlin