Posted to xml-commons-dev@xerces.apache.org by Jack Bates <ms...@freezone.co.uk> on 2009/10/23 18:52:22 UTC

resolver should be able to parse catalog files without needing to resolve external entities?

I'm getting the following exception using the XML catalog resolver with
FOP,

[...]
DELEGATE_PUBLIC: -//W3C//DTD XHTML//EN
        file:/usr/share/xml/xhtml/schema/dtd/catalog.xml
resolvePublic(-//W3C//DTD XHTML 1.0 Transitional//EN,null)
Switching to delegated catalog(s):
        file:/usr/share/xml/xhtml/schema/dtd/1.0/catalog.xml
Parse catalog: file:/usr/share/xml/xhtml/schema/dtd/1.0/catalog.xml
Loading catalog: file:/usr/share/xml/xhtml/schema/dtd/1.0/catalog.xml
Default BASE: file:/usr/share/xml/xhtml/schema/dtd/1.0/catalog.xml
22-Oct-2009 5:04:13 PM org.apache.fop.cli.Main startFOP
SEVERE: Exception
javax.xml.transform.TransformerException: java.net.UnknownHostException: globaltranscorp.org
        at org.apache.fop.cli.InputHandler.transformTo(InputHandler.java:314)
        at org.apache.fop.cli.InputHandler.renderTo(InputHandler.java:146)
        at org.apache.fop.cli.Main.startFOP(Main.java:174)
        at org.apache.fop.cli.Main.main(Main.java:205)
[...]

I think it's the same issue reported here,
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=491091

- and here,
https://bugs.launchpad.net/ubuntu/+source/w3c-dtd-xhtml/+bug/400259

Is comment #4 correct?

"Arguably this is a bug in org.apache.xml.resolver, cf.
http://www.oasis-open.org/committees/download.php/14809/xml-catalogs.html#s.bootstrap - the resolver should be able to parse catalog files without needing to resolve external entities"

https://bugs.launchpad.net/ubuntu/+source/w3c-dtd-xhtml/+bug/400259/comments/4

Re: resolver should be able to parse catalog files without needing to resolve external entities?

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Jack,

Jack Bates <ms...@freezone.co.uk> wrote on 10/28/2009 07:48:23 PM:

> On Sat, 2009-10-24 at 15:10 -0400, Michael Glavassevich wrote:
> > The OASIS catalog DTD is included in resolver.jar and there is a
> > BootstrapResolver [1] which gets installed on the parser that reads
> > the catalog which can return this DTD. I'm sure the reason that isn't
> > happening is that the public and system IDs differ from the ones that
> > the resolver knows about. You're supposed to extend BootstrapResolver
> > (in your own application) if you need support for more than the
> > well-known public IDs and URIs for the catalog DTDs / schemas and set
> > an instance of this extension on the CatalogManager [2].
>
> Thank you Michael! After some more digging, I think the reason that the
> w3c-dtd-xhtml catalog.xml isn't using the well known catalog DTD public
> ID and URI,
>
> <!DOCTYPE catalog PUBLIC "-//OASIS//DTD XML Catalogs V1.0//EN"
>   "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
>
> is that it's trying to use a different DTD?
>
> <!DOCTYPE catalog PUBLIC "-//GlobalTransCorp//DTD XML Catalogs V1.0-Based Extension V1.0//EN"
>     "http://globaltranscorp.org/oasis/catalog/xml/tr9401.dtd">
>
> The xml-core package distributes "tr9401.dtd", in addition to
> "catalog.dtd" - here it is,
> http://www.sfu.ca/~jdbates/tmp/debian/200910280/tr9401.dtd
>
> - and here are the differences between it, and the "catalog.dtd"
> included in resolver.jar,
> http://www.sfu.ca/~jdbates/tmp/debian/200910280/diff
>
> I dunno if the w3c-dtd-xhtml catalog.xml actually requires this
> different DTD, but it sounds like if it does, and the system ID isn't
> accessible, then it will only be parsable by tools which extend
> BootstrapResolver to add support for this different DTD?

That's what I would do. I could imagine extending BootstrapResolver so that
it uses a secondary catalog resolver so that you just update a catalog file
instead of the code when you need a redirect for yet another catalog DTD.

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

Re: resolver should be able to parse catalog files without needing to resolve external entities?

Posted by Jack Bates <ms...@freezone.co.uk>.
On Sat, 2009-10-24 at 15:10 -0400, Michael Glavassevich wrote:
> The OASIS catalog DTD is included in resolver.jar and there is a
> BootstrapResolver [1] which gets installed on the parser that reads
> the catalog which can return this DTD. I'm sure the reason that isn't
> happening is that the public and system IDs differ from the ones that
> the resolver knows about. You're supposed to extend BootstrapResolver
> (in your own application) if you need support for more than the
> well-known public IDs and URIs for the catalog DTDs / schemas and set
> an instance of this extension on the CatalogManager [2].

Thank you Michael! After some more digging, I think the reason that the
w3c-dtd-xhtml catalog.xml isn't using the well known catalog DTD public
ID and URI,

<!DOCTYPE catalog PUBLIC "-//OASIS//DTD XML Catalogs V1.0//EN" 
  "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">

is that it's trying to use a different DTD?

<!DOCTYPE catalog PUBLIC "-//GlobalTransCorp//DTD XML Catalogs V1.0-Based Extension V1.0//EN"
    "http://globaltranscorp.org/oasis/catalog/xml/tr9401.dtd">

The xml-core package distributes "tr9401.dtd", in addition to
"catalog.dtd" - here it is,
http://www.sfu.ca/~jdbates/tmp/debian/200910280/tr9401.dtd

- and here are the differences between it, and the "catalog.dtd"
included in resolver.jar,
http://www.sfu.ca/~jdbates/tmp/debian/200910280/diff

I dunno if the w3c-dtd-xhtml catalog.xml actually requires this
different DTD, but it sounds like if it does, and the system ID isn't
accessible, then it will only be parsable by tools which extend
BootstrapResolver to add support for this different DTD?

Re: resolver should be able to parse catalog files without needing to resolve external entities?

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
The OASIS catalog DTD is included in resolver.jar and there is a
BootstrapResolver [1] which gets installed on the parser that reads the
catalog which can return this DTD. I'm sure the reason that isn't happening
is that the public and system IDs differ from the ones that the resolver
knows about. You're supposed to extend BootstrapResolver (in your own
application) if you need support for more than the well-known public IDs
and URIs for the catalog DTDs / schemas and set an instance of this
extension on the CatalogManager [2].
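
As a rough illustration of the extension point described above, here is a minimal sketch of a resolver that answers one additional catalog DTD public ID from a local copy. It deliberately uses only the JDK's org.xml.sax.EntityResolver so that it runs without resolver.jar; the class name ExtraCatalogDtdResolver, its register method, and the one-line placeholder DTD are assumptions made for the example, not part of the resolver library. In a real application the same lookup would presumably go into an override of resolveEntity in a BootstrapResolver subclass, which is then set on the CatalogManager [2].

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical helper: serves extra catalog DTDs from local copies. */
class ExtraCatalogDtdResolver implements EntityResolver {
    static final String GTC_PUBLIC_ID =
        "-//GlobalTransCorp//DTD XML Catalogs V1.0-Based Extension V1.0//EN";

    // Catalog prolog as quoted in this thread, followed by an empty catalog element.
    static final String SAMPLE_CATALOG =
        "<!DOCTYPE catalog PUBLIC \"" + GTC_PUBLIC_ID + "\""
        + " \"http://globaltranscorp.org/oasis/catalog/xml/tr9401.dtd\">"
        + "<catalog xmlns=\"urn:oasis:names:tc:entity:xmlns:xml:catalog\"/>";

    // Public ID -> DTD text. A real setup would more likely map to a local file.
    private final Map<String, String> dtdTextByPublicId = new HashMap<>();

    void register(String publicId, String dtdText) {
        dtdTextByPublicId.put(publicId, dtdText);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String dtdText = (publicId == null) ? null : dtdTextByPublicId.get(publicId);
        if (dtdText == null) {
            return null; // unknown ID: the parser falls back to the system ID
        }
        InputSource source = new InputSource(new StringReader(dtdText));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }

    /** Parses catalog text with the given resolver installed on the parser. */
    static Document parse(String xml, EntityResolver resolver) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(resolver);
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        ExtraCatalogDtdResolver resolver = new ExtraCatalogDtdResolver();
        // Placeholder DTD text, NOT the real tr9401.dtd.
        resolver.register(GTC_PUBLIC_ID, "<!ELEMENT catalog ANY>");
        Document doc = parse(SAMPLE_CATALOG, resolver);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

Because the resolver supplies the DTD text itself, the parser has no reason to contact globaltranscorp.org for this catalog; any public ID that has not been registered still goes through the parser's normal system ID lookup.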

Thanks.

[1] http://xml.apache.org/commons/components/apidocs/resolver/org/apache/xml/resolver/helpers/BootstrapResolver.html
[2] http://xml.apache.org/commons/components/apidocs/resolver/org/apache/xml/resolver/CatalogManager.html#setBootstrapResolver(org.apache.xml.resolver.helpers.BootstrapResolver)

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

Earl Hood <ea...@earlhood.com> wrote on 10/24/2009 01:22:50 PM:

> On October 23, 2009 at 17:19, someone wrote:
>
> > Here's an example of a catalog.xml file distributed in the Debian and
> > Ubuntu w3c-dtd-xhtml package,
> > http://www.sfu.ca/~jdbates/tmp/debian/200910230/catalog.xml
> >
> > It starts with,
> >
> > <?xml version='1.0'?>
> > <!DOCTYPE catalog PUBLIC "-//GlobalTransCorp//DTD XML Catalogs V1.0-Based Extension V1.0//EN"
> >     "http://globaltranscorp.org/oasis/catalog/xml/tr9401.dtd">
> > [...]
>
> I think they should fix it so the system identifier is set
> to a pathname on the local file system.
>
> Also, the public identifier used is not the standard public
> identifier, "-//OASIS//DTD XML Catalogs V1.1//EN".  So even
> if the resolver provided intrinsic recognition of
> the "-//OASIS//DTD XML Catalogs V1.1//EN" identifier, it
> would still be of no use in this case.
>
> One can argue that the w3c-dtd-xhtml package has a bug in
> its distribution since it provides no facility to resolve
> the DTD to the local file system.  The system identifier
> should be set to the pathname where the w3c-dtd-xhtml
> installer places the catalog DTD.
>
> > I understand comment #4,
> > https://bugs.launchpad.net/ubuntu/+source/w3c-dtd-xhtml/+bug/400259/comments/4
> >
> > - to be suggesting that org.apache.xml.resolver is not following the
> > encouragement of,
> > http://www.oasis-open.org/committees/download.php/14809/xml-catalogs.html#s.bootstrap
> >
> > "Implementations are encouraged to provide some sort of bootstrapping
> > functionality to resolve external identifiers and URIs that the
> > implementation needs to load catalog entry files.
>
> It is not a requirement:
>
>   Conformant processors are not required to be able to perform
>   resolution of those identifiers through the XML Catalog.
>
> The word "should" is used in other text instead of "must".  Also,
> the following is stated:
>
>   Users can avoid any problems that might arise by limiting the
>   external identifiers and URIs used to those that do not need
>   resolution. Note that this only applies to external identifiers and
>   URIs that must be resolved in order to load the catalog entry file.
>
> > - and to be suggesting that not following this encouragement is a bug
> >
> > Is maybe my understanding wrong - or either of these suggestions wrong?
>
> The recommendations of the Oasis document are beneficial, but
> they are only recommendations, not requirements.  So the "bug"
> reports are really enhancement requests.
>
> IMO, the work-around for the problem is easy, and is directly
> suggested by the Oasis document: Use system identifiers that
> are resolvable without the need of a catalog.
>
> I think the underlying technical problem of why the resolver library
> does not provide intrinsic resolution of the catalog DTD is that
> the library does not know where the DTD may be installed for any
> system that uses the resolver.  Since other software systems include
> the resolver in their distribution, the DTD itself may not even
> be available.
>
> A possible method of always knowing how to find the catalog DTD is
> for the resolver to include the DTD in the resolver.jar file itself.
> The resolver could register a custom (internal) resolver to the XML
> parser when reading catalog files so any references to the DTD can
> be resolved via a classpath resource lookup.  IMO, I'm not sure it
> is worth the effort to do this when simple work-arounds exist for
> the problem.
>
> I'm sure patches are welcome if anyone wants to implement this
> functionality.
>
> --ewh

Re: resolver should be able to parse catalog files without needing to resolve external entities?

Posted by Earl Hood <ea...@earlhood.com>.
On October 23, 2009 at 17:19, someone wrote:

> Here's an example of a catalog.xml file distributed in the Debian and
> Ubuntu w3c-dtd-xhtml package,
> http://www.sfu.ca/~jdbates/tmp/debian/200910230/catalog.xml
> 
> It starts with,
> 
> <?xml version='1.0'?>
> <!DOCTYPE catalog PUBLIC "-//GlobalTransCorp//DTD XML Catalogs V1.0-Based Extension V1.0//EN"
>     "http://globaltranscorp.org/oasis/catalog/xml/tr9401.dtd">
> [...]

I think they should fix it so the system identifier is set
to a pathname on the local file system.

Also, the public identifier used is not the standard public
identifier, "-//OASIS//DTD XML Catalogs V1.1//EN".  So even
if the resolver provided intrinsic recognition of
the "-//OASIS//DTD XML Catalogs V1.1//EN" identifier, it
would still be of no use in this case.

One can argue that the w3c-dtd-xhtml package has a bug in
its distribution since it provides no facility to resolve
the DTD to the local file system.  The system identifier
should be set to the pathname where the w3c-dtd-xhtml
installer places the catalog DTD.

> I understand comment #4,
> https://bugs.launchpad.net/ubuntu/+source/w3c-dtd-xhtml/+bug/400259/comments/4
> 
> - to be suggesting that org.apache.xml.resolver is not following the
> encouragement of,
> http://www.oasis-open.org/committees/download.php/14809/xml-catalogs.html#s.bootstrap
> 
> "Implementations are encouraged to provide some sort of bootstrapping
> functionality to resolve external identifiers and URIs that the
> implementation needs to load catalog entry files.

It is not a requirement:

  Conformant processors are not required to be able to perform
  resolution of those identifiers through the XML Catalog.

The word "should" is used in other text instead of "must".  Also,
the following is stated:

  Users can avoid any problems that might arise by limiting the
  external identifiers and URIs used to those that do not need
  resolution. Note that this only applies to external identifiers and
  URIs that must be resolved in order to load the catalog entry file.

> - and to be suggesting that not following this encouragement is a bug
> 
> Is maybe my understanding wrong - or either of these suggestions wrong?

The recommendations of the OASIS document are beneficial, but
they are only recommendations, not requirements.  So the "bug"
reports are really enhancement requests.

IMO, the work-around for the problem is easy, and is directly
suggested by the OASIS document: Use system identifiers that
are resolvable without the need of a catalog.

I think the underlying technical problem of why the resolver library
does not provide intrinsic resolution of the catalog DTD is that
the library does not know where the DTD may be installed for any
system that uses the resolver.  Since other software systems include
the resolver in their distribution, the DTD itself may not even
be available.

A possible method of always knowing how to find the catalog DTD is
for the resolver to include the DTD in the resolver.jar file itself.
The resolver could register a custom (internal) resolver to the XML
parser when reading catalog files so any references to the DTD can
be resolved via a classpath resource lookup.  IMO, I'm not sure it
is worth the effort to do this when simple work-arounds exist for
the problem.
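
For what it is worth, the classpath lookup idea can be sketched in a few lines. This is only an illustration built on the JDK's org.xml.sax.EntityResolver; the class name ClasspathDtdResolver and the system-ID-to-resource mapping are assumptions made for the example, and nothing here is taken from the resolver library's own code.

```java
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical resolver that serves known DTDs from classpath resources. */
class ClasspathDtdResolver implements EntityResolver {
    private final ClassLoader loader;
    // System ID -> resource path inside the jar (paths are illustrative).
    private final Map<String, String> resourceBySystemId;

    ClasspathDtdResolver(ClassLoader loader, Map<String, String> resourceBySystemId) {
        this.loader = loader;
        this.resourceBySystemId = new HashMap<>(resourceBySystemId);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String resource = (systemId == null) ? null : resourceBySystemId.get(systemId);
        if (resource == null) {
            return null; // not a DTD we bundle: leave it to the parser
        }
        InputStream in = loader.getResourceAsStream(resource);
        if (in == null) {
            return null; // mapping exists but the resource is missing from the classpath
        }
        InputSource source = new InputSource(in);
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }
}
```

An instance would be installed on the parser that reads the catalog files, for example with DocumentBuilder.setEntityResolver, so that references to the bundled DTD never leave the JVM. Returning null for anything unmapped keeps the parser's default behaviour for every other entity.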

I'm sure patches are welcome if anyone wants to implement this
functionality.

--ewh

Re: resolver should be able to parse catalog files without needing to resolve external entities?

Posted by Earl Hood <ea...@earlhood.com>.
On October 23, 2009 at 09:52, Jack Bates wrote:

> I'm getting the following exception using the XML catalog resolver with
> FOP,

> DELEGATE_PUBLIC: -//W3C//DTD XHTML//EN
>         file:/usr/share/xml/xhtml/schema/dtd/catalog.xml
> resolvePublic(-//W3C//DTD XHTML 1.0 Transitional//EN,null)
> Switching to delegated catalog(s):
>         file:/usr/share/xml/xhtml/schema/dtd/1.0/catalog.xml
> Parse catalog: file:/usr/share/xml/xhtml/schema/dtd/1.0/catalog.xml
> Loading catalog: file:/usr/share/xml/xhtml/schema/dtd/1.0/catalog.xml
> Default BASE: file:/usr/share/xml/xhtml/schema/dtd/1.0/catalog.xml
> 22-Oct-2009 5:04:13 PM org.apache.fop.cli.Main startFOP
> SEVERE: Exception
> javax.xml.transform.TransformerException: java.net.UnknownHostException: globaltranscorp.org

Is globaltranscorp.org resolvable on the system where you are running FOP?
Apparently the underlying network library of the JVM cannot resolve
the hostname.

As for the "resolve external entities" problem, it appears to
be a chicken-n-egg problem.  The resolver depends on the underlying
XML parser to parse the XML catalog file, but at that time,
the base entity resolution of the XML parser is being used since
the resolver is still bootstrapping itself.

If your catalog file contains a DOCTYPE declaration with a public and
system identifier, then the XML parser will try to resolve it, and if
the system identifier listed is not accessible, you will get an error.

All of this is a function of the XML parser itself and NOT the
resolver library.

In practice, I normally do not specify a doctype declaration for
catalog files to avoid the unnecessary overhead of parsing a DTD.
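
To make the parser's role concrete, here is a small sketch that records which system identifiers the JDK's default parser asks to have resolved while reading a catalog, once with the DOCTYPE from the failing catalog and once without any DOCTYPE. The class and method names (DoctypeLookupDemo, requestedSystemIds) are made up for the example, and the recording resolver hands back an empty DTD so that nothing touches the network.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class DoctypeLookupDemo {
    static final String BODY =
        "<catalog xmlns='urn:oasis:names:tc:entity:xmlns:xml:catalog'/>";
    static final String DOCTYPE =
        "<!DOCTYPE catalog PUBLIC"
        + " '-//GlobalTransCorp//DTD XML Catalogs V1.0-Based Extension V1.0//EN'"
        + " 'http://globaltranscorp.org/oasis/catalog/xml/tr9401.dtd'>";

    /** Parses the XML and returns every system ID the parser asked to resolve. */
    static List<String> requestedSystemIds(String xml) throws Exception {
        List<String> requested = new ArrayList<>();
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver((publicId, systemId) -> {
            requested.add(systemId);
            // Empty stand-in for the DTD, so the demo stays offline.
            return new InputSource(new StringReader(""));
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return requested;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("with DOCTYPE:    " + requestedSystemIds(DOCTYPE + BODY));
        System.out.println("without DOCTYPE: " + requestedSystemIds(BODY));
    }
}
```

With the DOCTYPE present, the parser requests the external DTD even though it is not validating; with no DOCTYPE there is nothing for it to fetch, which is why leaving it out of catalog files sidesteps the problem.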

If you absolutely need to have DTD validation of your catalog files,
make sure the system identifier is resolvable, and preferably to a
location on the local file system for better performance and to avoid
dependency on a remote system.

--ewh