Posted to dev@ant.apache.org by Jon Stevens <la...@gmail.com> on 2010/06/26 02:02:41 UTC

bug in DOMElementWriter

Say you have an element like this:

<foo attr="&#10;" />

If you want to load that through an XMLFragment and then output it later,
the DOMElementWriter.encode() method doesn't handle non-printable
characters properly.

The quick fix for me in this case was to overload that method and add this:

                case '\n':
                    sb.append("&#10;");
                    break;

Obviously, you'd want a more complete encoding of all possible values.
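
A minimal sketch of what a somewhat fuller attribute encoder could look like,
extending the case '\n' idea above to \t and \r. The class AttributeEscaper and
the method escapeAttribute are hypothetical names for illustration; this is not
Ant's actual DOMElementWriter code. Each character reference is written as a
literal string rather than computed at runtime.

```java
// Hypothetical helper for illustration only; not Ant's DOMElementWriter.
final class AttributeEscaper {
    private AttributeEscaper() { }

    /** Escapes a string for use inside a double-quoted XML attribute value. */
    static String escapeAttribute(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                // Markup-significant characters.
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                // Whitespace that a parser would otherwise turn into a space
                // when it appears literally inside an attribute value.
                case '\t': sb.append("&#9;");   break;
                case '\n': sb.append("&#10;");  break;
                case '\r': sb.append("&#13;");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: <foo attr="before&#10;after" />
        System.out.println("<foo attr=\"" + escapeAttribute("before\nafter") + "\" />");
    }
}
```

This sketch does not reject characters that are illegal in XML 1.0; a complete
encoder would need to deal with those separately.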

jon

Re: bug in DOMElementWriter

Posted by Jesse Glick <je...@oracle.com>.
On 06/27/2010 01:35 PM, Dominique Devienne wrote:
> with \n, which is just like any other character*, the serializer doesn't do
> anything special, and the output then also contains a "plain" \n.

Jon is correct: &#10; or similar should be emitted for \n. Newlines are _not_
just like any other character; in XML attributes they are collapsed into a
generic whitespace sequence. For example,

<project default="run">
     <target name="run">
     <echo message="hello
there"/>
     <echo message="hello&#10;again"/>
     </target>
</project>

prints

hello there
hello
again

In general, there is a certain class of characters other than the usual &<>"'
which should be emitted in character entity form by any tool which purports to
create round-trippable XML. Others, like &#8;, are simply invalid in XML content
and cannot be encoded at all.
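
As a rough sketch of the "cannot be encoded at all" case, the check below
follows the Char production of the XML 1.0 specification. The class XmlCharCheck
and the method isLegalXml10 are hypothetical names, not part of Ant.

```java
// Hypothetical helper for illustration only.
final class XmlCharCheck {
    private XmlCharCheck() { }

    /**
     * Returns true if the code point matches the XML 1.0 Char production:
     * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
     */
    static boolean isLegalXml10(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isLegalXml10('\n')); // prints true: newline may be written as &#10;
        System.out.println(isLegalXml10(0x8));  // prints false: backspace has no legal form
    }
}
```

A writer could use a check like this to decide between emitting a character
reference and rejecting (or dropping) the character.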


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@ant.apache.org
For additional commands, e-mail: dev-help@ant.apache.org


Re: bug in DOMElementWriter

Posted by Jon Stevens <la...@gmail.com>.
>
> By now I tend to agree with Jon that DOMElementWriter should encode \n,
> \r and \t when writing attribute values (and only when writing attribute
> values).
>
> Stefan
>

Thank you.

jon

Re: bug in DOMElementWriter

Posted by Dominique Devienne <dd...@gmail.com>.
On Mon, Jun 28, 2010 at 4:03 AM, Stefan Bodewig <bo...@apache.org> wrote:
> By now I tend to agree with Jon that DOMElementWriter should encode \n,
> \r and \t when writing attribute values (and only when writing attribute values).

Despite giving an example involving nested text (so technically
correct ;), and mentioning whitespace normalization in passing, I now
see that I missed that Jon's issue was with attribute values. Apologies.
Now I stand corrected by Stefan and Jesse, and I can go back hiding in
my little corner where I should have remained. --DD



Re: bug in DOMElementWriter

Posted by Stefan Bodewig <bo...@apache.org>.
On 2010-06-29, Jon Stevens wrote:

> Maybe it is just me, but it seems vastly more efficient to just write out
> the correct string than call Integer.toHexString().

svn revision 959173 addresses this (ignore the change to Execute; it
was reverted in the next commit).

> Also, don't forget to update WHATSNEW since this is potentially a backwards
> incompatible change.

svn revision 959176

Stefan



Re: bug in DOMElementWriter

Posted by Jon Stevens <la...@gmail.com>.
Maybe it is just me, but it seems vastly more efficient to just write out
the correct string than call Integer.toHexString().

Also, don't forget to update WHATSNEW since this is potentially a backwards
incompatible change.

jon


On Tue, Jun 29, 2010 at 12:51 AM, Stefan Bodewig <bo...@apache.org> wrote:

> On 2010-06-28, Stefan Bodewig wrote:
>
> > By now I tend to agree with Jon that DOMElementWriter should encode \n,
> > \r and \t when writing attribute values (and only when writing attribute
> > values).
>
> svn revision 958857
>
> Stefan

Re: bug in DOMElementWriter

Posted by Stefan Bodewig <bo...@apache.org>.
On 2010-06-28, Stefan Bodewig wrote:

> By now I tend to agree with Jon that DOMElementWriter should encode \n,
> \r and \t when writing attribute values (and only when writing attribute
> values).

svn revision 958857

Stefan



Re: bug in DOMElementWriter

Posted by Stefan Bodewig <bo...@apache.org>.
On 2010-06-28, Stefan Bodewig wrote:

> [just echoing what Antoine and Dominique already said, Ant doesn't even
>       know you used an entity reference to specify the newline.]

Just read <http://www.w3.org/TR/2008/REC-xml-20081126/#AVNormalize> and
realized that what I said above isn't true.

If Ant sees a \n, \r or \t inside an attribute's value, then it must
have been an entity reference in the original input - otherwise
normalization would have replaced it with a space.

This also means

>> <foo attr="beforenewline&#10;afternewline">

and

>> <foo attr="beforenewline
>> afternewline" />

result in different attribute values being passed to the application
(one containing the newline and one containing a space instead).

By now I tend to agree with Jon that DOMElementWriter should encode \n,
\r and \t when writing attribute values (and only when writing attribute
values).
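
A minimal sketch of the difference described above, using the JDK's built-in
DOM parser. The class AttrNormalizationDemo and the helper attrOf are made-up
names for illustration and have nothing to do with Ant's own code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class for illustration only.
final class AttrNormalizationDemo {
    private AttrNormalizationDemo() { }

    /** Parses the given XML and returns the root element's "attr" value. */
    static String attrOf(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("attr");
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        // Newline written as a character reference: survives parsing.
        String viaReference = attrOf("<foo attr=\"beforenewline&#10;afternewline\" />");
        // Literal newline in the source text: normalized to a space.
        String viaLiteral = attrOf("<foo attr=\"beforenewline\nafternewline\" />");

        System.out.println(viaReference.indexOf('\n') >= 0);                  // prints true
        System.out.println(viaLiteral.equals("beforenewline afternewline")); // prints true
    }
}
```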

Stefan



Re: bug in DOMElementWriter

Posted by Stefan Bodewig <bo...@apache.org>.
On 2010-06-28, Jon Stevens wrote:

[just echoing what Antoine and Dominique already said, Ant doesn't even
      know you used an entity reference to specify the newline.]

> And what about:

> <foo attr="beforenewline&#10;afternewline">

> which ends up like this after 'echoxml'...

> <foo attr="beforenewline
> afternewline" />

Which just is correct, isn't it?  It is perfectly legal to have embedded
newlines inside an attribute even if they look silly.

Does the output cause any problems downstream?  It shouldn't since the
XML technically is the same.

Stefan



Re: bug in DOMElementWriter

Posted by Jon Stevens <la...@gmail.com>.
>
> Don't get hung up on the textual representation of the XML file. This
>
> <foo>&#10;</foo>
>
> and this
>
> <foo>
> </foo>
>
> is exactly the same thing as far as XML is concerned.
>
>
And what about:

<foo attr="beforenewline&#10;afternewline">

which ends up like this after 'echoxml'...

<foo attr="beforenewline
afternewline" />

?

jon

Re: bug in DOMElementWriter

Posted by Dominique Devienne <dd...@gmail.com>.
On Sun, Jun 27, 2010 at 8:55 PM, Jon Stevens <la...@gmail.com> wrote:
> However, the character that went into the attribute was not a \n, it was a
> &#10;. I'd expect ant to give me &#10; back out, not \n. The point of
> <echoxml> is to echo xml, is it not? In that case, the point here should be
> to echo out the encoded value as xml, not something that is useless.

Jon, in XML land &#10; *is* \n, whatever you say about it.

See http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

You *can* have a plain '\n' char (i.e. an actual LF, not '\\' and 'n')
in XML, and for the parser that's the *same*.

Furthermore, whatever you feed your <echoxml>-generated XML file to
will / should not care either whether it sees a '\n' or a "&#10;" if it
uses a compliant XML parser.

Don't get hung up on the textual representation of the XML file. This

<foo>&#10;</foo>

and this

<foo>
</foo>

is exactly the same thing as far as XML is concerned.
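
For element content (as opposed to attribute values), the equivalence can be
checked with a short sketch against the JDK's DOM parser. The class
TextContentDemo and the helper textOf are hypothetical names used only here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class for illustration only.
final class TextContentDemo {
    private TextContentDemo() { }

    /** Parses the given XML and returns the text content of the root element. */
    static String textOf(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        String viaReference = textOf("<foo>&#10;</foo>");
        String viaLiteral = textOf("<foo>\n</foo>");
        // Both forms reach the application as a single LF character.
        System.out.println(viaReference.equals(viaLiteral)); // prints true
    }
}
```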

If you absolutely want your &#10; in the <echoxml> output, you must
follow Antoine's advice.

I suggest you read more on XML. Again, Ant, for better or worse,
uses an XML parser, so it will only see '\n' and not your XML char entity.
--DD



Re: bug in DOMElementWriter

Posted by Jon Stevens <la...@gmail.com>.
On Sun, Jun 27, 2010 at 10:35 AM, Dominique Devienne <dd...@gmail.com> wrote:

> On Sat, Jun 26, 2010 at 6:54 PM, Jon Stevens <la...@gmail.com> wrote:
> > For example, attr="&amp;" comes out as attr="&amp;" and not attr="&"... I
> > don't have to write attr="&amp;amp;" to get what I want. The same is true
> > with attr="&gt;"... it comes out as attr="&gt;" instead of attr=">". This is
> > all because DOMElementWriter.encode() is smart about those entities.
> >
> > attr="&#10;" should come out as attr="&#10;", not attr="\n"
>
> Well, I'm afraid Antoine is right, and the comparison you make is not
> "fair".
>
> &, <, and > are "special" in XML, and must always be encoded in
> attribute values and textual content. \n is not.
>
> <echoxml> never sees the "&amp;" text, it sees whatever the XML parser
> reports, a "&", and the XML serializer Ant uses knows it must encode
> that char into "&amp;", thus it ends up back the way it was. But with
> \n, which is just like any other character*, the serializer doesn't do
> anything special, and the output then also contains a "plain" \n.


However, the character that went into the attribute was not a \n, it was a
&#10;. I'd expect ant to give me &#10; back out, not \n. The point of
<echoxml> is to echo xml, is it not? In that case, the point here should be
to echo out the encoded value as xml, not something that is useless.

jon

Re: bug in DOMElementWriter

Posted by Dominique Devienne <dd...@gmail.com>.
On Sat, Jun 26, 2010 at 6:54 PM, Jon Stevens <la...@gmail.com> wrote:
> For example, attr="&amp;" comes out as attr="&amp;" and not attr="&"... I
> don't have to write attr="&amp;amp;" to get what I want. The same is true
> with attr="&gt;"... it comes out as attr="&gt;" instead of attr=">". This is
> all because DOMElementWriter.encode() is smart about those entities.
>
> attr="&#10;" should come out as attr="&#10;", not attr="\n"

Well, I'm afraid Antoine is right, and the comparison you make is not "fair".

&, <, and > are "special" in XML, and must always be encoded in
attribute values and textual content. \n is not.

<echoxml> never sees the "&amp;" text, it sees whatever the XML parser
reports, a "&", and the XML serializer Ant uses knows it must encode
that char into "&amp;", thus it ends up back the way it was. But with
\n, which is just like any other character*, the serializer doesn't do
anything special, and the output then also contains a "plain" \n.

This is XML, and Ant can do nothing about it. The textual
representation of the "XML infoset" doesn't matter, what matters is
the info, and the XML parser doesn't always report the info as it was
in the text of the XML but as its equivalent. Most parsers offer
configurations that control how they report stuff, but you can never
get a fully exact representation of the XML text without digging into
the parser itself. --DD

* Well it's whitespace, so it could be "normalized" too.



Re: bug in DOMElementWriter

Posted by Jon Stevens <la...@gmail.com>.
So, here is what I'm trying to do... I'm including some XML which is
generated by Eclipse as part of the .launch config files for launch
profiles. That XML has &#10; in an attribute value.

I'd expect to be able to put whatever into an <echoxml> element and have it
output back out. I shouldn't have to double encode it. It doesn't matter
that the ant build file is also xml because the error is clearly in
DOMElementWriter.encode().
It encodes other stuff, but not &#*; entities.

For example, attr="&amp;" comes out as attr="&amp;" and not attr="&"... I
don't have to write attr="&amp;amp;" to get what I want. The same is true
with attr="&gt;"... it comes out as attr="&gt;" instead of attr=">". This is
all because DOMElementWriter.encode() is smart about those entities.

attr="&#10;" should come out as attr="&#10;", not attr="\n"

jon

On Sat, Jun 26, 2010 at 12:07 PM, Antoine Levy-Lambert <an...@gmx.de> wrote:

> Hello Jon,
>
> do not forget that the ant build file is also written in xml.
>
> To do what you want, you need to write this :
>
> <echoxml>
>  <test attr="&amp;#10;" />
> </echoxml>
>
>
>
> Jon Stevens wrote:
> > Here is a better example:
> >
> > <?xml version="1.0"?>
> > <project name="test" basedir=".">
> > <echoxml>
> >  <test attr="&#10;" />
> > </echoxml>
> > </project>
> >
> > [8][ //tmp ]% ant
> > Buildfile: build.xml
> > <?xml version="1.0" encoding="UTF-8"?>
> > <test attr="
> > " />
> >
> > BUILD SUCCESSFUL
> > Total time: 0 seconds
> >
> > Not what I was expecting.
> >
> > jon
> >
> >
> > On Fri, Jun 25, 2010

Re: bug in DOMElementWriter

Posted by Antoine Levy-Lambert <an...@gmx.de>.
Hello Jon,

do not forget that the ant build file is also written in xml.

To do what you want, you need to write this :

<echoxml>
 <test attr="&amp;#10;" />
</echoxml>



Jon Stevens wrote:
> Here is a better example:
>
> <?xml version="1.0"?>
> <project name="test" basedir=".">
> <echoxml>
>  <test attr="&#10;" />
> </echoxml>
> </project>
>
> [8][ //tmp ]% ant
> Buildfile: build.xml
> <?xml version="1.0" encoding="UTF-8"?>
> <test attr="
> " />
>
> BUILD SUCCESSFUL
> Total time: 0 seconds
>
> Not what I was expecting.
>
> jon
>
>
> On Fri, Jun 25, 2010 




Re: bug in DOMElementWriter

Posted by Jon Stevens <la...@gmail.com>.
Here is a better example:

<?xml version="1.0"?>
<project name="test" basedir=".">
<echoxml>
 <test attr="&#10;" />
</echoxml>
</project>

[8][ //tmp ]% ant
Buildfile: build.xml
<?xml version="1.0" encoding="UTF-8"?>
<test attr="
" />

BUILD SUCCESSFUL
Total time: 0 seconds

Not what I was expecting.

jon


On Fri, Jun 25, 2010 at 5:02 PM, Jon Stevens <la...@gmail.com> wrote:

> Say you have an element like this:
>
> <foo attr="&#10;" />
>
> If you want to load that through an XMLFragment and then output it later,
> the DOMElementWriter.encode() method doesn't case properly for non-printable
> characters.
>
> The quick fix for me in this case was to overload that method and add this:
>
>                 case '\n':
>                     sb.append("&#10;");
>                     break;
>
> Obviously, you'd want a more complete encoding of all possible values.
>
> jon
>
>