Posted to j-dev@xerces.apache.org by Paul Eng <pa...@iclick.com> on 2000/06/05 21:10:19 UTC

Bug in Resolving External Entities?

Currently, the expandSystemId() method in DefaultEntityHandler uses the
java.net.URL class to resolve external entities.  This doesn't work for
me since I am using a scheme/protocol that is not recognized by the URL
class.  So, for example, if I have <!ENTITY % someEntity SYSTEM
"../index.ent"> in my DTD and this system ID gets passed in with
<resource:///source/test/.ent> as the base, the method cannot resolve it
since URL knows nothing about the <resource> scheme/protocol.

Is this considered to be a bug?  Documentation that I've seen says that
the system-identifier should be a URI (not URL).  If this is correct, we
would like to submit a URI class and a patch to DefaultEntityHandler to
use the URI class instead of URL.  If this is not correct, can someone
let me know?
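
A minimal sketch of the failure, for illustration only. It assumes a
stock JDK with no protocol handler registered for the "resource" scheme;
the class name UrlSchemeCheck and the sample identifiers are made up for
the example.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlSchemeCheck {
    /** Returns true if java.net.URL can be constructed from the identifier. */
    @SuppressWarnings("deprecation") // the String constructor is deprecated in recent JDKs
    static boolean urlAccepts(String spec) {
        try {
            new URL(spec); // requires a protocol handler for the scheme
            return true;
        } catch (MalformedURLException e) {
            return false;  // e.g. "unknown protocol"
        }
    }

    public static void main(String[] args) {
        // Hypothetical base identifier in a custom scheme.
        System.out.println(urlAccepts("resource:///source/test/base.dtd"));
        // Schemes the JDK ships handlers for.
        System.out.println(urlAccepts("file:///tmp/my-humble-doc.xml"));
        System.out.println(urlAccepts("http://localhost/dtd/mydtd.dtd"));
    }
}
```

Because the base itself cannot be turned into a URL, a relative system ID
such as "../index.ent" never gets a chance to be resolved against it.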

Paul

Re: Bug in Resolving External Entities?

Posted by earthlinks satellite <ea...@hotmail.com>.
unsubscribe please
----- Original Message ----- 
From: "Paul Eng" <pa...@iclick.com>
To: <xe...@xml.apache.org>
Sent: Tuesday, June 06, 2000 1:02 PM
Subject: Re: Bug in Resolving External Entities?


> Again, thanks for the reply.
> 
> Our feeling here is that if we have a dtd, say
> 'myscheme:///one/two/three/foo.dtd' and it contains <!ENTITY % myEntity
> SYSTEM "../../something.mod">, xerces should resolve the system ID to
> 'myscheme:///one/something.mod'. I say this because, as you point out,
> the XML spec defines system Ids as URIs. Therefore, xerces should follow
> the URI specification in RFC 2396 which outlines the steps required in
> resolving relative URIs.
> 
> As it stands now, the expandSystemId() method in DefaultEntityHandler
> fails to resolve the relative URI, and, as a result, we get
> '../../something.mod' passed into the resolveEntity() method in the
> class that we're using to implement EntityResolver. We see no reason for
> the expandSystemId() method to use java.net.URL to resolve a URI.
> Instead, it should use a URI class that follows the RFC 2396 spec.
> 
> Or am I missing something here?
> 
> Paul
> 
> Jason Crickmer wrote:
> >
> > Yes, system identifiers are URIs (XML Specification version 1.0
> > section 4.2.2).  URIs (generic syntax in RFC 2396, which updates
> > RFC 1738), as you well know, make no presumption other than that
> > the URI use 7-bit characters (US-ASCII)
> > and start with a scheme name, followed by a colon, followed by a
> > scheme-specific string.
> >
> > The problem is that most EVERY document you start reading will be from
> > a run-of-the-mill URL like 'file:///tmp/my-humble-doc.xml' or
> > 'http://localhost/dtd/mydtd.dtd'.  Thus, for that one document, all
> > relative paths that have no scheme specified should use the 'base
> > reference'.  If I used the former example and specified the location
> > of some other external entity as '../home/jcrickmer/entity.xml', then
> > it would be logical to conclude I meant
> > 'file:///tmp/../home/jcrickmer/entity.xml'.
> >
> > In my experience, I have found that knowing the base reference is one
> > of the most critical parts to resolving entities.  And, unless you
> > specifically read a document from a URI of your own scheme, you should
> > resolve all relative references with respect to the only base
> > reference you have (that of the document or document fragment
> > containing the entity (in the DTD) and not the entity reference (in
> > the document)).
> >
> > I hope this helps...
> >
> > Thanks,
> > Jason
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
> > For additional commands, e-mail: xerces-j-dev-help@xml.apache.org
> 
> --
> Paul Eng
> iClick, Inc.
> 120 Bloomingdale Road
> White Plains, NY 10605
> voice: (914) 872-8051
> fax: (914) 872-8100
> email: paul.eng@iclick.com
> http://www.iclick.com
> 

Re: [PATCH] Use URI to Resolve External Entities

Posted by Andy Clark <an...@apache.org>.
It's rather short notice but I decided to include the patch in the
tree so that it goes into the new build that we're trying to do
this week. Thanks again for the contribution, Paul! This class
will also improve our URI datatype validator.

-- 
Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org

Re: [PATCH] Use URI to Resolve External Entities

Posted by Andy Clark <an...@apache.org>.
Michael Mason wrote:
> 
> Andy Clark wrote:
> 
> > I applied the patches and found one difference with the old code. As
> > a convenience, our old code allowed the user to use DOS path names.
> > For example: c:\xerces\data\personal.xml
> 
> So what does one use instead of c:\foo\bar.xml ?

file:///c:/foo/bar.xml works. My question was not whether there
was an alternative to the standard DOS syntax but if we should
support it.

-- 
Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org

Re: [PATCH] Use URI to Resolve External Entities

Posted by Michael Mason <mg...@decisionsoft.com>.
Andy Clark wrote:

> I applied the patches and found one difference with the old code. As
> a convenience, our old code allowed the user to use DOS path names.
> For example: c:\xerces\data\personal.xml

So what does one use instead of c:\foo\bar.xml ?

Mike.

Re: [PATCH] Use URI to Resolve External Entities

Posted by Andy Clark <an...@apache.org>.
Paul Eng wrote:
> Attached is the updated source code with the Apache license and
> copyright info.
> 
> We have checked with the powers that be and have received their OK.

Kick ass! Thanks!

I applied the patches and found one difference with the old code. As 
a convenience, our old code allowed the user to use DOS path names. 
For example: c:\xerces\data\personal.xml

This causes the URI class to barf. So the question is: do
members of the group think that this should be allowed or
disallowed? If we disallow the DOS file convention we will be
more compliant with the spec but at the expense of convenience
for a large number of users.
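
For reference, a sketch of what keeping the convenience could look like:
detect a DOS-style drive path and rewrite it as a file URI before handing
it to a URI parser. The helper names are hypothetical, java.net.URI (from
later JDKs) stands in for the contributed URI class, and the drive-letter
test is only a heuristic, since "c:" also parses as a URI scheme. No
escaping of spaces or other reserved characters is attempted.

```java
import java.net.URI;

class DosPaths {
    /** True if the string parses as a URI; backslashes are not legal URI characters. */
    static boolean isValidUri(String s) {
        try {
            URI.create(s);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    /** Rewrites "c:\foo\bar.xml" style paths as file URIs; other input is returned unchanged. */
    static String toFileUri(String systemId) {
        boolean drivePath = systemId.length() >= 3
                && isAsciiLetter(systemId.charAt(0))
                && systemId.charAt(1) == ':'
                && (systemId.charAt(2) == '\\' || systemId.charAt(2) == '/');
        return drivePath ? "file:///" + systemId.replace('\\', '/') : systemId;
    }

    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    public static void main(String[] args) {
        String dos = "c:\\xerces\\data\\personal.xml";
        System.out.println(isValidUri(dos));            // rejected as written
        System.out.println(toFileUri(dos));             // rewritten form
        System.out.println(isValidUri(toFileUri(dos))); // accepted after rewriting
    }
}
```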

So, whatcha think, everyone?

-- 
Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org

Re: [PATCH] Use URI to Resolve External Entities

Posted by Paul Eng <pa...@iclick.com>.
Attached is the updated source code with the Apache license and
copyright info.

We have checked with the powers that be and have received their OK.

The URI class has been tested against the test oracle in the appendix of
RFC 2396 as well as all test cases found at
<http://www.ics.uci.edu/~fielding/url/>.
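
To give a flavour of that oracle, here is a table-driven sketch built from
a handful of the "normal" examples in Appendix C of RFC 2396 (base URI
http://a/b/c/d;p?q). It runs them against java.net.URI from later JDKs as
a stand-in, since the contributed class is not reproduced in this thread;
the class and method names are hypothetical.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

class Rfc2396Oracle {
    static final URI BASE = URI.create("http://a/b/c/d;p?q");

    /** Relative reference -> expected absolute URI, taken from RFC 2396 Appendix C. */
    static Map<String, String> cases() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("g", "http://a/b/c/g");
        m.put("./g", "http://a/b/c/g");
        m.put("g/", "http://a/b/c/g/");
        m.put("/g", "http://a/g");
        m.put("../g", "http://a/b/g");
        m.put("../../g", "http://a/g");
        m.put("g;x", "http://a/b/c/g;x");
        return m;
    }

    /** Counts the cases where resolution does not match the expected string. */
    static int failures() {
        int failed = 0;
        for (Map.Entry<String, String> e : cases().entrySet()) {
            String actual = BASE.resolve(e.getKey()).toString();
            if (!actual.equals(e.getValue())) {
                System.out.println(e.getKey() + " -> " + actual + " (expected " + e.getValue() + ")");
                failed++;
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        System.out.println("failures: " + failures());
    }
}
```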

Andy Clark wrote:
> 
> Paul Eng wrote:
> > Since I haven't heard otherwise, I'm posting this as a patch. The
> > first attachment is a patch to DefaultEntityHandler that uses a
> > URI class instead of java.net.URL to resolve the system ID for
> > external entities. The second attachment is the URI class.
> 
> As it stands, we can't accept the source code because it contains
> the copyright of the originating company, iClick. In order to
> contribute this code, this copyright must be removed and
> replaced with the standard Apache license and copyright info.
> You can use the leading comments in any of the Xerces source
> files for this purpose. You can even follow the section at the
> end of the Apache license which gives attribution to IBM for the
> donated source code in order to attribute iClick.
> 
> However, please realize that, by doing so, you are saying that
> you are allowed to contribute this source code to the project.
> Have you checked with your people that it's OK to contribute
> this code? If so, then just change the license agreement and
> resend the code -- I'll add it into the repository.
> 
> BTW, how much testing have you done on this source code to
> know that it accepts well-formed URIs, etc?
> 
> --
> Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org

Re: [PATCH] Use URI to Resolve External Entities

Posted by Andy Clark <an...@apache.org>.
Paul Eng wrote:
> Since I haven't heard otherwise, I'm posting this as a patch. The 
> first attachment is a patch to DefaultEntityHandler that uses a
> URI class instead of java.net.URL to resolve the system ID for 
> external entities. The second attachment is the URI class.

As it stands, we can't accept the source code because it contains
the copyright of the originating company, iClick. In order to
contribute this code, this copyright must be removed and 
replaced with the standard Apache license and copyright info.
You can use the leading comments in any of the Xerces source
files for this purpose. You can even follow the section at the
end of the Apache license which gives attribution to IBM for the
donated source code in order to attribute iClick.

However, please realize that, by doing so, you are saying that
you are allowed to contribute this source code to the project.
Have you checked with your people that it's OK to contribute
this code? If so, then just change the license agreement and
resend the code -- I'll add it into the repository.

BTW, how much testing have you done on this source code to
know that it accepts well-formed URIs, etc?

-- 
Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org

[PATCH] Use URI to Resolve External Entities

Posted by Paul Eng <pa...@iclick.com>.
Since I haven't heard otherwise, I'm posting this as a patch. The first
attachment is a patch to DefaultEntityHandler that uses a URI class
instead of java.net.URL to resolve the system ID for external entities. 
The second attachment is the URI class.


Paul Eng wrote:
> 
> Again, thanks for the reply.
> 
> Our feeling here is that if we have a dtd, say
> 'myscheme:///one/two/three/foo.dtd' and it contains <!ENTITY % myEntity
> SYSTEM "../../something.mod">, xerces should resolve the system ID to
> 'myscheme:///one/something.mod'. I say this because, as you point out,
> the XML spec defines system Ids as URIs. Therefore, xerces should follow
> the URI specification in RFC 2396 which outlines the steps required in
> resolving relative URIs.
> 
> As it stands now, the expandSystemId() method in DefaultEntityHandler
> fails to resolve the relative URI, and, as a result, we get
> '../../something.mod' passed into the resolveEntity() method in the
> class that we're using to implement EntityResolver. We see no reason for
> the expandSystemId() method to use java.net.URL to resolve a URI.
> Instead, it should use a URI class that follows the RFC 2396 spec.
> 
> Or am I missing something here?
> 
> Paul

Re: Bug in Resolving External Entities?

Posted by Paul Eng <pa...@iclick.com>.
Again, thanks for the reply.

Our feeling here is that if we have a dtd, say
'myscheme:///one/two/three/foo.dtd' and it contains <!ENTITY % myEntity
SYSTEM "../../something.mod">, xerces should resolve the system ID to
'myscheme:///one/something.mod'. I say this because, as you point out,
the XML spec defines system Ids as URIs. Therefore, xerces should follow
the URI specification in RFC 2396 which outlines the steps required in
resolving relative URIs. 

As it stands now, the expandSystemId() method in DefaultEntityHandler
fails to resolve the relative URI, and, as a result, we get
'../../something.mod' passed into the resolveEntity() method in the
class that we're using to implement EntityResolver. We see no reason for
the expandSystemId() method to use java.net.URL to resolve a URI. 
Instead, it should use a URI class that follows the RFC 2396 spec.
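
To make the expected result concrete, here is a sketch using
java.net.URI, which arrived in later JDKs and follows RFC 2396. It is a
stand-in for the URI class discussed in this thread, not that class
itself, and the name RelativeSystemId is hypothetical.

```java
import java.net.URI;

class RelativeSystemId {
    /**
     * Expands a system ID against the identifier of the entity that declared it.
     * An absolute system ID is returned as-is; a relative one is merged with the
     * base path and has its "." and ".." segments removed.
     */
    static URI expand(String baseId, String systemId) {
        return URI.create(baseId).resolve(systemId);
    }

    public static void main(String[] args) {
        URI expanded = expand("myscheme:///one/two/three/foo.dtd", "../../something.mod");
        // Scheme and path are printed separately; the full string form is not shown
        // because java.net.URI may render the empty authority ("///") differently.
        System.out.println(expanded.getScheme());
        System.out.println(expanded.getPath());
    }
}
```

No protocol handler is involved, so the unknown scheme is not an obstacle:
the scheme is carried over from the base and only the path is rewritten.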

Or am I missing something here?

Paul

Jason Crickmer wrote:
> 
> Yes, system identifiers are URIs (XML Specification version 1.0
> section 4.2.2).  URIs (generic syntax in RFC 2396, which updates
> RFC 1738), as you well know, make no presumption other than that
> the URI use 7-bit characters (US-ASCII)
> and start with a scheme name, followed by a colon, followed by a
> scheme-specific string.
> 
> The problem is that most EVERY document you start reading will be from
> a run-of-the-mill URL like 'file:///tmp/my-humble-doc.xml' or
> 'http://localhost/dtd/mydtd.dtd'.  Thus, for that one document, all
> relative paths that have no scheme specified should use the 'base
> reference'.  If I used the former example and specified the location
> of some other external entity as '../home/jcrickmer/entity.xml', then
> it would be logical to conclude I meant
> 'file:///tmp/../home/jcrickmer/entity.xml'.
> 
> In my experience, I have found that knowing the base reference is one
> of the most critical parts to resolving entities.  And, unless you
> specifically read a document from a URI of your own scheme, you should
> resolve all relative references with respect to the only base
> reference you have (that of the document or document fragment
> containing the entity (in the DTD) and not the entity reference (in
> the document)).
> 
> I hope this helps...
> 
> Thanks,
> Jason

-- 
Paul Eng
iClick, Inc.
120 Bloomingdale Road		
White Plains, NY 10605
voice: (914) 872-8051
fax: (914) 872-8100
email: paul.eng@iclick.com
http://www.iclick.com

Re: Bug in Resolving External Entities?

Posted by Jason Crickmer <jc...@hire.com>.
Paul wrote:
> Thanks for the reply.  Just as you suggested, we are currently
> implementing EntityResolver to resolve our own URIs.  This worked fine
> with absolute URIs since they contained the scheme, which we could check.
> It did not work with relative URI/path <../../something.ent>, so we
> resorted to building our own URI class and changing the
> DefaultEntityHandler to use the URI class to resolve it.  
> We were under the impression that the system ID for an external entity
> was a URI, and not a URL.  Is this incorrect?

Yes, system identifiers are URIs (XML Specification version 1.0
section 4.2.2).  URIs (generic syntax in RFC 2396, which updates
RFC 1738), as you well know, make no presumption other than that
the URI use 7-bit characters (US-ASCII)
and start with a scheme name, followed by a colon, followed by a
scheme-specific string.

The problem is that most EVERY document you start reading will be from
a run-of-the-mill URL like 'file:///tmp/my-humble-doc.xml' or
'http://localhost/dtd/mydtd.dtd'.  Thus, for that one document, all
relative paths that have no scheme specified should use the 'base
reference'.  If I used the former example and specified the location
of some other external entity as '../home/jcrickmer/entity.xml', then
it would be logical to conclude I meant
'file:///tmp/../home/jcrickmer/entity.xml'.

In my experience, I have found that knowing the base reference is one
of the most critical parts to resolving entities.  And, unless you
specifically read a document from a URI of your own scheme, you should
resolve all relative references with respect to the only base
reference you have (that of the document or document fragment
containing the entity (in the DTD) and not the entity reference (in
the document)).
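
A small sketch of that last point, assuming a document read from a file
URL whose DTD lives on an HTTP server. The entity name, the relative path
"ent/chapter.xml" and the class name are invented for the example, and
java.net.URI (from later JDKs) is used for the resolution step.

```java
import java.net.URI;

class DeclaringEntityBase {
    /** Resolves a system ID against the entity in which the declaration appears. */
    static URI resolveDeclared(URI declaringEntity, String systemId) {
        return declaringEntity.resolve(systemId);
    }

    public static void main(String[] args) {
        URI document = URI.create("file:///tmp/my-humble-doc.xml");
        URI dtd = URI.create("http://localhost/dtd/mydtd.dtd");

        // Declared inside the DTD: <!ENTITY chapter SYSTEM "ent/chapter.xml">
        // Correct base: the DTD, because that is where the declaration lives.
        System.out.println(resolveDeclared(dtd, "ent/chapter.xml"));

        // Using the document as the base would point somewhere else entirely,
        // even though the reference &chapter; appears in the document.
        System.out.println(resolveDeclared(document, "ent/chapter.xml").getPath());
    }
}
```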

I hope this helps...

Thanks,
Jason

Re: Bug in Resolving External Entities?

Posted by Paul Eng <pa...@iclick.com>.
Thanks for the reply.  Just as you suggested, we are currently
implementing EntityResolver to resolve our own URIs.  This worked fine
with absolute URIs since they contained the scheme, which we could check.
It did not work with relative URI/path <../../something.ent>, so we
resorted to building our own URI class and changing the
DefaultEntityHandler to use the URI class to resolve it.  
We were under the impression that the system ID for an external entity
was a URI, and not a URL.  Is this incorrect?

Paul

Jason Crickmer wrote:
> 
> Paul wrote:
> > Currently, the expandSystemId() method in DefaultEntityHandler uses the
> > java.net.URL class to resolve external entities.  This doesn't work for
> > me since I am using a scheme/protocol that is not recognized by the URL
> > class.  So, for example, if I have <!ENTITY % someEntity SYSTEM
> > "../index.ent"> in my DTD and this system ID gets passed in with
> > <resource:///source/test/.ent> as the base, the method cannot resolve it
> > since URL knows nothing about the <resource> scheme/protocol.
> >
> > Is this considered to be a bug?  Documentation that I've seen says that
> > the system-identifier should be a URI (not URL).  If this is correct, we
> > would like to submit a URI class and a patch to DefaultEntityHandler to
> > use the URI class instead of URL.  If this is not correct, can someone
> > let me know?
> 
> This is the correct behavior.  If you would like to resolve your own
> URIs, then you will need to create your own EntityResolver by
> implementing org.xml.sax.EntityResolver or extending DefaultHandler.
> 
> In our application of XML, we have created our own URI resolver by
> implementing EntityResolver.  In the resolveEntity(String,
> String):InputSource method, we do a simple check on the scheme to
> determine which method we should use to resolve the entity.
> 
> When you create your Parser, simply call
> setEntityResolver(EntityResolver)... now you are set to go!
> 
> Hope this helps,
> Jason

Re: Bug in Resolving External Entities?

Posted by Jason Crickmer <jc...@hire.com>.
Paul wrote:
> Currently, the expandSystemId() method in DefaultEntityHandler uses the
> java.net.URL class to resolve external entities.  This doesn't work for
> me since I am using a scheme/protocol that is not recognized by the URL
> class.  So, for example, if I have <!ENTITY % someEntity SYSTEM
> "../index.ent"> in my DTD and this system ID gets passed in with
> <resource:///source/test/.ent> as the base, the method cannot resolve it
> since URL knows nothing about the <resource> scheme/protocol.
> 
> Is this considered to be a bug?  Documentation that I've seen says that
> the system-identifier should be a URI (not URL).  If this is correct, we
> would like to submit a URI class and a patch to DefaultEntityHandler to
> use the URI class instead of URL.  If this is not correct, can someone
> let me know?

This is the correct behavior.  If you would like to resolve your own
URIs, then you will need to create your own EntityResolver by
implementing org.xml.sax.EntityResolver or extending DefaultHandler.

In our application of XML, we have created our own URI resolver by
implementing EntityResolver.  In the resolveEntity(String,
String):InputSource method, we do a simple check on the scheme to
determine which method we should use to resolve the entity.

When you create your Parser, simply call
setEntityResolver(EntityResolver)... now you are set to go!
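
A rough sketch of such a resolver, not our actual code. The "resource"
scheme, the in-memory map standing in for wherever the entities really
live, and the class name are all assumptions made for the example.

```java
import java.io.StringReader;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class SchemeEntityResolver implements EntityResolver {
    private static final String PREFIX = "resource://"; // hypothetical custom scheme
    private final Map<String, String> resources;        // path -> entity text (illustrative store)

    SchemeEntityResolver(Map<String, String> resources) {
        this.resources = resources;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        // Simple check on the scheme: anything else falls through to the parser's default.
        if (systemId == null || !systemId.startsWith(PREFIX)) {
            return null;
        }
        String text = resources.get(systemId.substring(PREFIX.length()));
        if (text == null) {
            return null;
        }
        InputSource source = new InputSource(new StringReader(text));
        source.setSystemId(systemId); // keeps the base available for nested relative references
        return source;
    }

    public static void main(String[] args) {
        SchemeEntityResolver resolver =
                new SchemeEntityResolver(Map.of("/source/index.ent", "<!ENTITY greeting 'hello'>"));
        System.out.println(resolver.resolveEntity(null, "resource:///source/index.ent") != null);
        System.out.println(resolver.resolveEntity(null, "http://localhost/dtd/mydtd.dtd") == null);
    }
}
```

Returning null tells the parser to open the system ID itself. Note that
the check only works when the system ID arrives with its scheme intact.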

Hope this helps,
Jason