Posted to j-dev@xerces.apache.org by ne...@ca.ibm.com on 2003/03/11 00:05:28 UTC

determining the encoding of an external subset via XNI

Hi all,

In an attempt to generate some more discussion surrounding the issue I
raised in the message below, here are some ways by which we might move
forward.  For those who didn't see the previous thread, the Cole's Notes
version of the problem is that, as XNI is currently designed, there doesn't
seem to be any way of determining what the parser autodetected the encoding
of the DTD external subset to be--or any way of determining anything about
that encoding at all if the external subset doesn't happen to contain a
text decl.

Here are all the options that I've thought of:

1.  We could modify the XMLDTDHandler#startExternalSubset callback so that,
instead of looking like

      public void startExternalSubset(XMLResourceIdentifier identifier,
                                      Augmentations augs)

it looks like

      public void startExternalSubset(XMLResourceIdentifier identifier,
                                      String encoding, Augmentations augs)

This would make that callback much more symmetric to the startDocument
callback of the XMLDocumentHandler interface; unfortunately it has the
tremendous drawback of not being terribly backwards compatible.

2.  We could add a new callback to the XMLDTDHandler interface, something
like:

      public void externalSubsetEncoding(String encoding)

which we would advertise as occurring after the startExternalSubset
callback and before the textDecl call. While this would be far more
backward compatible, there's no precedent for anything like it in XNI;
also, the callback would only be useful for external subsets, since in all
other contexts we already have methods for conveying encoding information.

3.  We could use the Augmentations parameter of the startExternalSubset
callback.  This would preserve backward compatibility, but certainly
couldn't be accused of being beautiful; also, it would mark the first time
we've used Augmentations in Xerces for something at the level of a scanner.
So far, we've only employed that functionality in the context of schema
validation.

4.  We could amend the XMLLocator interface by adding a method like

      public String getEncoding()

on the lines of the SAX Locator2 interface.  This again would only be
really useful in this single context, since XNI goes out of its way
everywhere else to explicitly make provision for the passage of encoding
information; i.e., it doesn't seem to accord well with the overall design
of the API.
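
As a rough illustration of option 4, here is a minimal sketch of the
Locator2 pattern. It is an assumption-laden stand-in, not Xerces code:
LocatorSketch, EncodingLocatorSketch, FixedLocator and LocatorEncodings
are hypothetical names, and the sketch puts getEncoding() on a
sub-interface (as SAX did with Locator2) rather than on XMLLocator
itself, so that existing locator implementations would keep compiling.

```java
// Stand-in for the existing locator contract; the real XNI XMLLocator
// declares more methods than the two shown here.
interface LocatorSketch {
    String getSystemId();
    int getLineNumber();
}

// Hypothetical extension in the style of SAX's Locator2: the base
// interface is left untouched, so old implementations are unaffected.
interface EncodingLocatorSketch extends LocatorSketch {
    String getEncoding(); // may be null when the encoding is unknown
}

// Hypothetical scanner-side implementation that reports fixed values.
class FixedLocator implements EncodingLocatorSketch {
    private final String systemId;
    private final String encoding;

    FixedLocator(String systemId, String encoding) {
        this.systemId = systemId;
        this.encoding = encoding;
    }

    public String getSystemId() { return systemId; }
    public int getLineNumber() { return 1; }
    public String getEncoding() { return encoding; }
}

class LocatorEncodings {
    // Returns the encoding if the locator can report one, otherwise null.
    static String encodingOf(LocatorSketch locator) {
        if (locator instanceof EncodingLocatorSketch) {
            return ((EncodingLocatorSketch) locator).getEncoding();
        }
        return null;
    }
}

class LocatorEncodingDemo {
    public static void main(String[] args) {
        LocatorSketch locator = new FixedLocator("employee.dtd", "UTF-16");
        System.out.println(LocatorEncodings.encodingOf(locator));
    }
}
```

A handler written this way degrades to "encoding unknown" when it is
handed an older locator, which is the trade-off the sub-interface buys.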

I'll readily admit that none of these solutions is particularly attractive.
Thoughts, preferences, or more appealing solutions are thus even more than
usually welcome!

Cheers,
Neil
Neil Graham
XML Parser Development
IBM Toronto Lab
Phone:  905-413-3519, T/L 969-3519
E-mail:  neilg@ca.ibm.com


----- Forwarded by Neil Graham/Toronto/IBM on 03/10/2003 06:03 PM -----

From:     Neil Graham/Toronto/IBM@IBMCA
Date:     03/04/2003 11:13 PM
To:       xerces-j-dev@xml.apache.org
Subject:  another encoding issue



Hi all,

How does one determine the autodetected encoding of a DTD external subset?

Right now, our DTD scanner takes this information from the entity manager
in a (non-XNI) startEntity(name, resourceIdentifier, encoding) call but
drops the encoding information on the floor for entities whose names are
[dtd].

It sure would have been handy if the
XMLDTDHandler#startExternalSubset(XMLResourceIdentifier, Augmentations) had
also included an encoding parameter...
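
Short of changing that signature, the Augmentations argument the
callback already takes could in principle carry the value. The sketch
below is only an illustration under stated assumptions: it uses a local
stand-in (AugmentationsSketch) that mirrors the putItem/getItem methods
of org.apache.xerces.xni.Augmentations so that it compiles without
Xerces, the handler takes a plain system id instead of an
XMLResourceIdentifier, and the key name is hypothetical; nothing in
Xerces defines it.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in mirroring putItem/getItem of org.apache.xerces.xni.Augmentations.
interface AugmentationsSketch {
    Object putItem(String key, Object item);
    Object getItem(String key);
}

class MapAugmentations implements AugmentationsSketch {
    private final Map<String, Object> items = new HashMap<>();

    public Object putItem(String key, Object item) { return items.put(key, item); }
    public Object getItem(String key) { return items.get(key); }
}

// Hypothetical handler; the real callback takes an XMLResourceIdentifier.
class EncodingAwareDtdHandler {
    // Hypothetical key name, invented for this sketch.
    static final String ENCODING_KEY = "EXTERNAL_SUBSET_ENCODING";

    String externalSubsetEncoding; // null when the scanner supplied nothing

    void startExternalSubset(String systemId, AugmentationsSketch augs) {
        Object value = (augs == null) ? null : augs.getItem(ENCODING_KEY);
        externalSubsetEncoding = (value instanceof String) ? (String) value : null;
    }
}

class AugmentationsEncodingDemo {
    public static void main(String[] args) {
        // Scanner side: attach the autodetected encoding before the callback.
        AugmentationsSketch augs = new MapAugmentations();
        augs.putItem(EncodingAwareDtdHandler.ENCODING_KEY, "UTF-16");

        // Handler side: read it back, tolerating its absence.
        EncodingAwareDtdHandler handler = new EncodingAwareDtdHandler();
        handler.startExternalSubset("employee.dtd", augs);
        System.out.println(handler.externalSubsetEncoding);
    }
}
```

Handlers that never look at the key are unaffected, which is why this
route leaves existing implementations alone.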

Thoughts?

Cheers,
Neil
Neil Graham
XML Parser Development
IBM Toronto Lab
Phone:  905-413-3519, T/L 969-3519
E-mail:  neilg@ca.ibm.com





---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-dev-help@xml.apache.org


Re: determining the encoding of an external subset via XNI

Posted by Neeraj Bajaj <ne...@sun.com>.
Hi Neil,


Catching up on a week-old mail...

>Hi all,
>
>In an attempt to generate some more discussion surrounding the issue I
>raised in the message below, here are some ways by which we might move
>forward.  For those who didn't see the previous thread, the Cole's Notes
>version of the problem is that, as XNI is currently designed, there doesn't
>seem to be any way of determining what the parser autodetected the encoding
>of the DTD external subset to be--or any way of determining anything about
>that encoding at all if the external subset doesn't happen to contain a
>text decl.
>
>Here are all the options that I've thought of:
>
>1.  We could modify the XMLDTDHandler#startExternalSubset callback so that,
>instead of looking like
>
>      public void startExternalSubset(XMLResourceIdentifier identifier,
>Augmentations augs)
>
>it looks like
>
>      public void startExternalSubset(XMLResourceIdentifier identifier,
>String encoding, Augmentations augs)
>
>This would make that callback much more symmetric to the startDocument
>callback of the XMLDocumentHandler interface; unfortunately it has the
>tremendous drawback of not being terribly backwards compatible.
>
>2.  We could add a new callback to the XMLDTDHandler interface, something
>like:
>
>      public void externalSubsetEncoding(String encoding)
>
>which we would advertise as occurring after the startExternalSubset
>callback and before the textDecl call. While this would be far more
>backward compatible, there's no precedent for anything like it in XNI;
>also, the callback would only be useful for external subsets, since in all
>other contexts we already have methods for conveying encoding information.
>
>3.  We could use the Augmentations parameter of the startExternalSubset
>callback.  This would preserve backward compatibility, but certainly
>couldn't be accused of being beautiful; also, it would mark the first time
>we've used Augmentations in Xerces for something at the level of a scanner.
>So far, we've only employed that functionality in the context of schema
>validation.
>
>4.  We could amend the XMLLocator interface by adding a method like
>
>      public String getEncoding()
>
>on the lines of the SAX Locator2 interface.  This again would only be
>really useful in this single context, 
>

I like this solution best. First, it doesn't break anything. Second, it
is useful not only for the DTD but also for the instance document and
for external parsed entities. Third, it is more user-friendly.

With this change, an application can always rely on the Locator
interface to get the encoding of whatever is being parsed at that
moment: the document, the DTD, or an external parsed entity. One can
argue that the encoding of the instance document can already be
determined from the xmlDecl() and startDocument() callbacks, but that is
more painful for the user. Given the choice, nobody would want to write
code to work out the exact encoding when the parser can make it
available directly.

There is another use case:

<employee>
   &address;
</employee>

Here the document refers to an external entity. To determine the exact
encoding of the external parsed entity "address", the user has to depend
on the startGeneralEntity() and textDecl() callbacks.

But with the encoding information added to the Locator interface, the
user can always rely on "locator.getEncoding()" and doesn't need to
duplicate code in different places. That makes life easier for the user
and also solves the problem that started this whole thread :-)
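
For a single getEncoding() call to be right "at any point in time", the
scanner would have to keep the value current as it enters and leaves
entities such as &address;. One way that might look is a stack of
encodings, sketched below. EntityEncodingTracker and its methods are
hypothetical and not part of Xerces or XNI; the entity names and
encodings in the demo are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical scanner-side helper; not part of Xerces or XNI.
class EntityEncodingTracker {
    // A list used as a stack, because an encoding may be unknown (null).
    private final List<String> encodings = new ArrayList<>();

    // Called when the scanner enters the document, the DTD, or an entity.
    void startEntity(String name, String encoding) {
        encodings.add(encoding);
    }

    // Called when the scanner leaves the innermost open entity.
    void endEntity(String name) {
        encodings.remove(encodings.size() - 1);
    }

    // What a locator's getEncoding() could return at this point in the parse.
    String getEncoding() {
        return encodings.isEmpty() ? null : encodings.get(encodings.size() - 1);
    }
}

class EntityEncodingDemo {
    public static void main(String[] args) {
        EntityEncodingTracker tracker = new EntityEncodingTracker();
        tracker.startEntity("[xml]", "UTF-8");          // instance document
        tracker.startEntity("address", "ISO-8859-1");   // &address; expands
        System.out.println(tracker.getEncoding());      // inside the entity
        tracker.endEntity("address");
        System.out.println(tracker.getEncoding());      // back in the document
    }
}
```

With something like this behind the locator, application code would not
need to track startGeneralEntity() and textDecl() itself.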


Neeraj

