Posted to j-dev@xerces.apache.org by Markus Breilmann <ma...@tamgroup.com> on 2000/06/29 21:31:58 UTC

BUG: setEntityResolver doesn't work with DOM

Hi all, hi Andy,

Sorry if this has been posted several times, but it didn't seem to make it 
to the list...

I've had problems using the setEntityResolver method with the DOM parser 
(all versions at least since 1.0.3). Though my EntityResolver would replace 
an external entity reference correctly, the parser would still crash with a 
file not found exception. I finally tracked the problem down to the 
org.apache.xerces.readers.DefaultEntityHandler class. The problem seems to 
be that although the EntityResolver.resolveEntity method is called (from 
method startReadingFromExternalEntity() @line 739 in version 1.1.2), the 
old systemId is still used, not reflecting the new one referred to by the 
new InputSource supplied by the EntityResolver.

I modified the code as follows and it seems to work 
(DefaultEntityHandler.startReadingFromExternalEntity, around line 740):

Instead of
         fSource = fResolver == null ? null :
             fResolver.resolveEntity(fPublicId, fSystemId);
         if (fSource == null) {
             fSource = new InputSource(fSystemId);
             if (fPublicId != null)
                 fSource.setPublicId(fPublicId);
         }

I used:
         if (fSource == null) {
             fSource = new InputSource(fSystemId);
             if (fPublicId != null)
                 fSource.setPublicId(fPublicId);
         } else {
             fSystemId = fSource.getSystemId();
         }

I assume the same problem exists for publicId; I was only dealing with 
system ids, so I haven't taken care of that yet.
Could somebody in the core team please take a look and evaluate/sanity 
check my change? It would be nice if a fix could make it into the next release.
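
A minimal sketch of the intent behind the change above, not the actual Xerces code: once the resolver has returned an InputSource, the system id used from then on should come from that source, with the original id kept only when the resolver returned null. The class and method names (EffectiveIdSketch, orFallback, effectiveSystemId) are hypothetical, and the null check on getSystemId() is an added assumption that is not part of the patch above; it covers a resolver that returns a stream without setting a system id.

```java
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration; not part of Xerces.
class EffectiveIdSketch {

    // Mirrors the fallback branch: build a source from the original ids
    // only when the resolver returned null.
    static InputSource orFallback(InputSource resolved, String publicId, String systemId) {
        if (resolved != null) {
            return resolved;
        }
        InputSource source = new InputSource(systemId);
        if (publicId != null) {
            source.setPublicId(publicId);
        }
        return source;
    }

    // Picks the system id to use afterwards: the resolved one when present,
    // otherwise the original (assumption: a null id means "unchanged").
    static String effectiveSystemId(InputSource resolved, String originalSystemId) {
        if (resolved == null || resolved.getSystemId() == null) {
            return originalSystemId;
        }
        return resolved.getSystemId();
    }

    public static void main(String[] args) {
        InputSource redirected = new InputSource("file:///local/copy.dtd");
        System.out.println(effectiveSystemId(redirected, "http://example.org/remote.dtd"));
        System.out.println(effectiveSystemId(null, "http://example.org/remote.dtd"));
    }
}
```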

Best regards,

Markus
Markus Breilmann                        markus.breilmann@tamgroup.com
Director of Technology                           tel: +1.415.455.5770
The Tamalpais Group, Inc.                        fax: +1.415.455.5771
11 Belle Avenue                                 web: www.tamgroup.com
CA 94960 San Anselmo, USA


Re: BUG: setEntityResolver doesn't work with DOM

Posted by Andy Clark <an...@apache.org>.
Norman Walsh wrote:
> |   1) entity resolver returns null
> 
> I see only one reasonable interpretation: resolution failed and the
> original system identifier should be used.

That's easy enough.

> |   2) entity resolver returns input source
> |      a) with only system id set
> 
> Open and use that system identifier.

With the restriction that the system identifier MUST be absolute.
That's a clarification that I would agree with and implement.

> | However, in your entity resolvers, you should either follow 1
> | or 2.d - 2.e, never anything in between. When you return a non-
> | null input source from the resolveEntity method, you are (now)
> | making a contract with the parser that you are opening the
> | stream to the entity (hence the "throws IOException" in the
> | interface).
> 
> Yes, that's the conclusion I reached as well, but I'd really like to
> see this fixed (if we agree it's a bug :-). I'd like to be able to
> return an absolute URI so that the resolver didn't have to get into
> opening documents. That's not really the resolver's job, IMHO. The

Well I would disagree with you here. Entity resolvers can and do
return open streams. If your entity resolvers don't open files,
though, no big deal.

A very common performance enhancement to XML applications is to
pre-load the grammar files into memory and then use the entity
resolver to read from the in-memory buffer.
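
As a rough illustration of that pre-loading idea, here is a sketch of a resolver that serves grammar files from an in-memory map and hands the parser an open byte stream. The class name InMemoryResolver and its preload method are hypothetical, and keying the cache on the system id is an assumption; a real application might key on the public id instead.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver for illustration: serves pre-loaded grammars from memory.
class InMemoryResolver implements EntityResolver {
    private final Map<String, byte[]> cache = new HashMap<>();

    // Stores a private copy of the grammar bytes under the given system id.
    void preload(String systemId, byte[] content) {
        cache.put(systemId, content.clone());
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        byte[] content = cache.get(systemId);
        if (content == null) {
            return null; // not cached: let the parser use the original system id
        }
        // Hand over an open byte stream so the parser never touches the file system.
        InputSource source = new InputSource(new ByteArrayInputStream(content));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }

    public static void main(String[] args) {
        InMemoryResolver resolver = new InMemoryResolver();
        byte[] dtd = "<!ELEMENT r (#PCDATA)>".getBytes(StandardCharsets.UTF_8);
        resolver.preload("http://example.org/r.dtd", dtd);
        System.out.println(resolver.resolveEntity(null, "http://example.org/r.dtd") != null);
        System.out.println(resolver.resolveEntity(null, "http://example.org/other.dtd") == null);
    }
}
```

Each call wraps the cached bytes in a fresh ByteArrayInputStream, so the same grammar can be served to any number of parses.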

-- 
Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org

Re: BUG: setEntityResolver doesn't work with DOM

Posted by Norman Walsh <nd...@nwalsh.com>.
/ Andy Clark <an...@apache.org> was heard to say:
| entity resolver is undefined. There should be rules for what an
| XML parser is supposed to do in the following cases:

There should definitely be rules. I haven't looked at the SAX
documentation recently to see where the ambiguities are. What follows
are my opinions.

|   1) entity resolver returns null

I see only one reasonable interpretation: resolution failed and the
original system identifier should be used.

|   2) entity resolver returns input source
|      a) with only system id set

Open and use that system identifier.

|      b) with only public id set

This is an error; the purpose of an entity resolver is to resolve the
entity to something explicit.

|      c) with only system and public id set

The public id is (would seem to me to be) irrelevant, the purpose of
an entity resolver is to resolve the entity to something explicit.

|      d) with byte stream set
|      e) with character stream set
| 
| In the case of 2.a - 2.c, what should the parser do? What if the
| systemId is relative? Should the parser expand it using the
| current base systemId? 

I don't think entityResolvers should return relative system
identifiers.  In fact, for OASIS Open Catalog-based resolvers, I think
it would clearly be an error. The base URI for relative system
identifiers in a TR9401 Catalog is *either* the base URI of the
catalog or the most recent BASE declaration in the catalog. If a
resolver returned a relative URI, it would be entirely impossible for
the parser to provide the right base.
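
To make the base-URI point concrete, here is a small sketch of a catalog-style resolver making a relative entry absolute against the catalog's own base before returning it, which is the step the parser cannot do on the resolver's behalf. CatalogBaseSketch and toAbsolute are hypothetical names and the URIs are made up; only java.net.URI from the JDK is used.

```java
import java.net.URI;

// Hypothetical helper for illustration; not taken from any catalog implementation.
class CatalogBaseSketch {

    // Resolves a catalog entry against the catalog base (or the most recent
    // BASE declaration) and refuses to hand back anything still relative.
    static String toAbsolute(String catalogBase, String entry) {
        URI resolved = URI.create(catalogBase).resolve(entry);
        if (!resolved.isAbsolute()) {
            throw new IllegalArgumentException("still relative: " + resolved);
        }
        return resolved.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAbsolute("http://example.org/catalogs/catalog.cat", "dtd/doc.dtd"));
        System.out.println(toAbsolute("http://example.org/catalogs/catalog.cat", "http://other.example/x.dtd"));
    }
}
```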

| Should the entity resolver get another
| chance to resolve the systemId once it's been expanded?

No, IMHO.

| Also,
| what publicId/systemId should the application see? The original
| one or the expanded/resolved one?

The resolved one.

| I'm looking at this code now but I can't say when I'll have a
| satisfactory solution.

I've looked at this code too. There's definitely some subtlety in
there. Or I don't have my head around Xerces very well yet. Or both.

| However, in your entity resolvers, you should either follow 1
| or 2.d - 2.e, never anything in between. When you return a non-
| null input source from the resolveEntity method, you are (now)
| making a contract with the parser that you are opening the
| stream to the entity (hence the "throws IOException" in the
| interface).

Yes, that's the conclusion I reached as well, but I'd really like to
see this fixed (if we agree it's a bug :-). I'd like to be able to
return an absolute URI so that the resolver didn't have to get into
opening documents. That's not really the resolver's job, IMHO. The
contract as I see it should simply be "parser: here's some stuff (a
public and/or system identifier, an entity name, or what-have-you);
tell me what absolute URI I should use to retrieve this
resource. resolver: I know nothing, use your best guess (returns null),
or here's the absolute URI."
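
A sketch of a resolver written to that proposed contract: it returns either null or an InputSource carrying only an absolute system identifier, and never opens anything itself. AbsoluteUriResolver and its lookup table are hypothetical. Note that this is exactly case 2.a from the list above, whose handling is the open question in this thread, so it shows the contract being argued for and is not something known to work with every parser.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver for illustration: maps public ids to absolute URIs.
class AbsoluteUriResolver implements EntityResolver {
    private final Map<String, String> publicIdToUri;

    AbsoluteUriResolver(Map<String, String> publicIdToUri) {
        // Reject relative targets up front; the parser could not rebase them.
        for (String uri : publicIdToUri.values()) {
            if (!URI.create(uri).isAbsolute()) {
                throw new IllegalArgumentException("not absolute: " + uri);
            }
        }
        this.publicIdToUri = new HashMap<>(publicIdToUri);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String uri = publicId == null ? null : publicIdToUri.get(publicId);
        // null means "I know nothing, use your best guess".
        return uri == null ? null : new InputSource(uri);
    }

    public static void main(String[] args) {
        Map<String, String> table = new HashMap<>();
        table.put("-//Example//DTD Doc//EN", "http://example.org/dtd/doc.dtd");
        AbsoluteUriResolver resolver = new AbsoluteUriResolver(table);
        System.out.println(resolver.resolveEntity("-//Example//DTD Doc//EN", "doc.dtd").getSystemId());
    }
}
```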

                                        Be seeing you,
                                          norm

-- 
Norman Walsh <nd...@nwalsh.com> | All things are contingent. And there is
http://nwalsh.com/            | chaos.--Spalding Gray

Re: BUG: setEntityResolver doesn't work with DOM

Posted by Matt Jones <jo...@nceas.ucsb.edu>.
Thanks, Andy.  As you indicated, this ambiguity in the SAX API is a
problem.  I was using the Oracle XML parser with this code earlier and
things worked fine, then broke when I switched to xerces, so it looks
like Oracle interprets your case 2.a differently.

I'll just switch to case 2.e and that will probably work for any parser
because it will always have a stream opened.  Of course, I don't see the
point in SAX InputSource having a constructor for the systemid alone if
the class requires the user to open a byte or character stream as well
to function correctly.  Maybe the SAX2 folks will consider an erratum of
some sort for this, at a minimum disallowing the return of an
InputSource with just a system ID or public ID alone.

Thanks again,
Matt

Andy Clark wrote:
> I'm guessing that the problem is happening when the resolveEntity
> method returns an InputSource object with only the systemId set.
> The real problem is that the behavior of an XML parser using the
> entity resolver is undefined. There should be rules for what an
> XML parser is supposed to do in the following cases:
> 
>   1) entity resolver returns null
>   2) entity resolver returns input source
>      a) with only system id set
>      b) with only public id set
>      c) with only system and public id set
>      d) with byte stream set
>      e) with character stream set
> 
> In the case of 2.a - 2.c, what should the parser do? What if the
> systemId is relative? Should the parser expand it using the
> current base systemId? Should the entity resolver get another
> chance to resolve the systemId once it's been expanded? Also,
> what publicId/systemId should the application see? The original
> one or the expanded/resolved one?
-- 
******************************************************************
Matt Jones                                    jones@nceas.ucsb.edu
http://www.nceas.ucsb.edu/    Ph: 907-789-0496   Fax: 425-920-2439
National Center for Ecological Analysis and Synthesis (NCEAS)
******************************************************************

Re: BUG: setEntityResolver doesn't work with DOM

Posted by Andy Clark <an...@apache.org>.
Matt Jones wrote:
> I had been assuming that the InputSource constructor would create the
> character stream for me from the systemid, but looking at the SAX code,
> that's not what happens.

I'm guessing that the problem is happening when the resolveEntity
method returns an InputSource object with only the systemId set.
The real problem is that the behavior of an XML parser using the
entity resolver is undefined. There should be rules for what an
XML parser is supposed to do in the following cases:

  1) entity resolver returns null
  2) entity resolver returns input source
     a) with only system id set
     b) with only public id set
     c) with only system and public id set
     d) with byte stream set
     e) with character stream set

In the case of 2.a - 2.c, what should the parser do? What if the
systemId is relative? Should the parser expand it using the
current base systemId? Should the entity resolver get another
chance to resolve the systemId once it's been expanded? Also,
what publicId/systemId should the application see? The original
one or the expanded/resolved one?

I'm looking at this code now but I can't say when I'll have a
satisfactory solution.

However, in your entity resolvers, you should either follow 1
or 2.d - 2.e, never anything in between. When you return a non-
null input source from the resolveEntity method, you are (now)
making a contract with the parser that you are opening the
stream to the entity (hence the "throws IOException" in the
interface).

Lastly, you should always set the byte stream on the input
source and not the character stream, unless that is *exactly*
what you want to happen. When you set the character stream,
you are denying the parser the ability to detect the entity's
encoding and decode it appropriately. If you don't know what
encodings your files will take, then don't just use any old
reader -- use an input stream and let the *parser* figure out
the encoding from the bytes being read.
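
To illustrate the byte stream versus character stream point, here is a sketch that parses the same ISO-8859-1 document both ways using the JAXP parser bundled with the JDK (an assumption made for convenience; the thread itself is about Xerces 1.x). ByteStreamSketch and its methods are hypothetical names. With the byte stream the parser reads the encoding declaration itself; with a Reader the caller has already picked a charset, and a wrong guess is likely to garble the non-ASCII character.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class for illustration.
class ByteStreamSketch {

    // A tiny document whose bytes are ISO-8859-1 and which declares that encoding.
    static byte[] latin1Document() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><r>caf\u00e9</r>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Recommended path: give the parser raw bytes and let it detect the encoding.
    static String parseFromBytes(byte[] bytes) {
        InputSource source = new InputSource();
        source.setByteStream(new ByteArrayInputStream(bytes));
        return parse(source);
    }

    // Character stream path: decoding happens here, with the caller's guess.
    static String parseFromReader(byte[] bytes, Charset guess) {
        InputSource source = new InputSource();
        source.setCharacterStream(new InputStreamReader(new ByteArrayInputStream(bytes), guess));
        return parse(source);
    }

    private static String parse(InputSource source) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(source);
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = latin1Document();
        System.out.println(parseFromBytes(bytes));
        System.out.println(parseFromReader(bytes, StandardCharsets.UTF_8));
    }
}
```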

--
Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org

Re: BUG: setEntityResolver doesn't work with DOM

Posted by Matt Jones <jo...@nceas.ucsb.edu>.
Hi Markus,

I reported a similar problem yesterday when using resolveEntity() with
the SAX parser.  I also saw my entity resolver successfully replace the
systemid, but the parser still used the old system id, rather than the
InputSource I provided, and thus couldn't find my DTD.  I found that if
I created an InputStreamReader for the systemid myself (in
resolveEntity()) and then used InputSource.setCharacterStream() to
associate the Reader with the InputSource, then xerces used the
InputSource I provided properly.

I had been assuming that the InputSource constructor would create the
character stream for me from the systemid, but looking at the SAX code,
that's not what happens.

Could your DOM EntityResolver problem be related to this too?

Matt

PS.  The inconsistency in the declarations of the SAX
EntityResolver.resolveEntity() and DefaultHandler.resolveEntity() still
exists.  Anybody else think this is an issue?

Markus Breilmann wrote:
> 
> Hi all, hi Andy,
> 
> Sorry if this has been posted several times, but it didn't seem to make it
> to the list...
> 
> I've had problems using the setEntityResolver method with the DOM parser
> (all versions at least since 1.0.3). Though my EntityResolver would replace
> an external entity reference correctly, the parser would still crash with a
> file not found exception. I finally tracked the problem down to the
> org.apache.xerces.readers.DefaultEntityHandler class. The problem seems to
> be that although the EntityResolver.resolveEntity method is called (from
> method startReadingFromExternalEntity() @line 739 in version 1.1.2), the
> old systemId is still used, not reflecting the new one referred to by the
> new InputSource supplied by the EntityResolver.
> 
> I modified the code as follows and it seems to work
> (DefaultEntityHandler.startReadingFromExternalEntity, around line 740):
> 
> Instead of
>          fSource = fResolver == null ? null :
>              fResolver.resolveEntity(fPublicId, fSystemId);
>          if (fSource == null) {
>              fSource = new InputSource(fSystemId);
>              if (fPublicId != null)
>                  fSource.setPublicId(fPublicId);
>          }
> 
> I used:
>          if (fSource == null) {
>              fSource = new InputSource(fSystemId);
>              if (fPublicId != null)
>                  fSource.setPublicId(fPublicId);
>          } else {
>              fSystemId = fSource.getSystemId();
>          }
> 
> I assume the same problem exists for publicId; I was only dealing with
> system ids, so I haven't taken care of that yet.
> Could somebody in the core team please take a look and evaluate/sanity
> check my change? It would be nice if a fix could make it into the next release.
> 
> Best regards,
> 
> Markus
> Markus Breilmann                        markus.breilmann@tamgroup.com
> Director of Technology                           tel: +1.415.455.5770
> The Tamalpais Group, Inc.                        fax: +1.415.455.5771
> 11 Belle Avenue                                 web: www.tamgroup.com
> CA 94960 San Anselmo, USA

-- 
******************************************************************
Matt Jones                                    jones@nceas.ucsb.edu
http://www.nceas.ucsb.edu/    Ph: 907-789-0496   Fax: 425-920-2439
National Center for Ecological Analysis and Synthesis (NCEAS)
******************************************************************