Posted to commons-dev@ws.apache.org by James M Snell <ja...@gmail.com> on 2006/08/24 02:49:03 UTC

Axiom, Stax and DTDs

While investigating a number of security concerns for the Abdera
project, I noticed that there were a number of problems with DTD
handling in the various StAX parser implementations.  For instance, if
you parse the following XML document with Axiom using the Woodstox
parser and then reserialize it, the resulting XML will be invalid.

Input:

  <?xml version="1.0" encoding="utf-8"?>
  <!DOCTYPE feed [
    <!ENTITY foo "bar">
    <!ENTITY bar "foo">
  ]>
  <feed xmlns="http://www.w3.org/2005/Atom" >
  </feed>

Output using Woodstox:

  <?xml version="1.0" encoding="utf-8"?>

    <!ENTITY foo "bar">
    <!ENTITY bar "foo">

  <feed xmlns="http://www.w3.org/2005/Atom" >
  </feed>

Output using Stax Reference Impl:

  <?xml version="1.0" encoding="utf-8"?>
  <!DOCTYPE feed [
    <!ENTITY foo "bar">
    <!ENTITY bar "foo">
  ]>
  <feed xmlns="http://www.w3.org/2005/Atom" >
  </feed>

Comparing these two, it would appear as if there is a bug in Woodstox.
Unfortunately, Woodstox is apparently acting exactly as the Stax spec
says it should and it's actually the Stax reference impl that's doing it
wrong... apparently.  So I had to dig a little deeper.

In StAXOMBuilder, the createDTD method calls parser.getText() to get the
DTD contents.  According to the Stax javadocs and spec, getText returns
the internal subset of the DTD, not the complete doctype declaration.
So while the stax reference implementation is doing what we want, it's
apparently not doing what the stax spec says it should be doing.
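
To see what a given parser hands back here, a small probe along the
following lines may help.  This is only a sketch: the class and method
names (DtdTextProbe, dtdText) are made up for illustration and it is not
the actual StAXOMBuilder code.  It simply reports whatever getText()
returns for the DTD event using whichever StAX implementation
XMLInputFactory.newInstance() picks up from the classpath.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Axiom: shows what getText() returns
// when the cursor is positioned on the DTD event.
public class DtdTextProbe {

    // The sample document from this mail.
    static final String DOC =
          "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n"
        + "<!DOCTYPE feed [\n"
        + "  <!ENTITY foo \"bar\">\n"
        + "  <!ENTITY bar \"foo\">\n"
        + "]>\n"
        + "<feed xmlns=\"http://www.w3.org/2005/Atom\" >\n"
        + "</feed>\n";

    /** Returns getText() for the first DTD event, or null if none is reported. */
    static String dtdText(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader =
            factory.createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.DTD) {
                    // Depending on the implementation this may be the
                    // internal subset only, or the whole declaration.
                    return reader.getText();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(dtdText(DOC));
    }
}
```

Running it once with each parser on the classpath should make the
difference between the implementations visible without going through
Axiom at all.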

According to the woodstox developers, there is currently no way of
getting to the complete DTD doctype declaration using the standardized
XMLStreamReader interface.  The XMLEventReader interface, however, works
just fine.
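
A minimal sketch of the event-based route, assuming the standard
javax.xml.stream.events.DTD event and its getDocumentTypeDeclaration()
method.  The class and method names (DoctypeViaEvents,
doctypeDeclaration) are hypothetical, and whether the returned string
really is the complete declaration still depends on the implementation
in use.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.DTD;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: reads the doctype declaration through the
// event API instead of the cursor API.
public class DoctypeViaEvents {

    // The sample document from this mail.
    static final String DOC =
          "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n"
        + "<!DOCTYPE feed [\n"
        + "  <!ENTITY foo \"bar\">\n"
        + "  <!ENTITY bar \"foo\">\n"
        + "]>\n"
        + "<feed xmlns=\"http://www.w3.org/2005/Atom\" >\n"
        + "</feed>\n";

    /** Returns the declaration text of the first DTD event, or null if none. */
    static String doctypeDeclaration(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLEventReader events =
            factory.createXMLEventReader(new StringReader(xml));
        try {
            while (events.hasNext()) {
                XMLEvent event = events.nextEvent();
                if (event instanceof DTD) {
                    return ((DTD) event).getDocumentTypeDeclaration();
                }
            }
            return null;
        } finally {
            events.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(doctypeDeclaration(DOC));
    }
}
```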

So where does this leave us?  Using Axiom and Woodstox to parse
documents containing doctype decls produces invalid XML; using Axiom and
the Stax ref impl requires relying on what is apparently either a bug or
a deliberate incompatibility with the spec.

Now, by this point you should note that I am using the word "apparently"
a lot.  That's because I'm basing this information off what one woodstox
developer told me and I've been unable to verify.

Another problem that I've noticed with the stax DTD handling is that
even when you tell it not to replace entity references, it will still
replace entity references found in attribute values.... which is more
than just slightly annoying.
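
The behaviour can be reproduced with a sketch like the one below.  It
assumes that "telling it not to replace entity references" means
setting the standard XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES
property to false; the class and method names (EntityReplacementProbe,
describe) and the tiny test document are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical probe: records how the same entity reference is reported
// in an attribute value and in element content.
public class EntityReplacementProbe {

    // Illustrative document: &foo; appears once in an attribute and
    // once in element content.
    static final String DOC =
          "<!DOCTYPE feed [ <!ENTITY foo \"bar\"> ]>\n"
        + "<feed title=\"&foo;\">&foo;</feed>";

    static List<String> describe(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Ask the parser to report entity references instead of expanding them.
        factory.setProperty(
            XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.FALSE);
        XMLStreamReader reader =
            factory.createXMLStreamReader(new StringReader(xml));
        List<String> seen = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        for (int i = 0; i < reader.getAttributeCount(); i++) {
                            seen.add("attribute "
                                + reader.getAttributeLocalName(i)
                                + "=" + reader.getAttributeValue(i));
                        }
                        break;
                    case XMLStreamConstants.ENTITY_REFERENCE:
                        seen.add("entity-reference " + reader.getLocalName());
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        seen.add("characters " + reader.getText());
                        break;
                    default:
                        break;
                }
            }
        } finally {
            reader.close();
        }
        return seen;
    }

    public static void main(String[] args) throws XMLStreamException {
        for (String line : describe(DOC)) {
            System.out.println(line);
        }
    }
}
```

Comparing the line recorded for the attribute with the one recorded
for the element content shows whether the two are treated differently.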

In any case, I wanted to report these issues.  In the very near future I
will also post some feedback on various experiences we've had developing
with Axiom and suggestions on how to make things better.

- James

---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@ws.apache.org
For additional commands, e-mail: commons-dev-help@ws.apache.org


Re: Axiom, Stax and DTDs

Posted by Eran Chinthaka <ch...@opensource.lk>.
Hi James,

Thanks for your valuable feedback on this.

We have also had problems with different implementations behaving
differently. IIRC, Rich once made a workaround by actually
checking which parser is being used underneath.
Do you think doing something like that would help to solve the problem?
If yes, I'm happy to implement it or to get a contribution from you guys.

(I will look into this, but I would appreciate it if you could create a JIRA
for it, giving the details found in this mail about DTD handling.)

In the meantime, if you come across problems in using Axiom, do
not hesitate to create JIRAs or post them here. I believe we should have
better coordination and cooperation between the two communities, i.e. Axiom
and Abdera, as we are both part of the ASF.

Thanks
-- Chinthaka

