Posted to j-users@xerces.apache.org by Aurelien Pernoud <ap...@sopragroup.com> on 2003/01/08 14:56:22 UTC

How to keep entities unresolved in the result ?

Hi, I've posted this question on several mailing lists and newsgroups
without getting any clue or the start of an answer, so ANY idea (even just
saying "it's not possible") is welcome. I'm going mad :)

I'm using SAX in Java to parse an XHTML document.
Everything works fine, except that the parser always translates the entities
it finds into their equivalent characters in the characters() method:
&amp; becomes &
&nbsp; becomes a space
&eacute; becomes é
This is fine, but how do I get the reference back? In my case I must keep the
existing references, otherwise I get errors in the generated XHTML.

Moreover, depending on the encoding, some entities such as &#8217; are
translated to "?". I've set the encoding to ISO-8859-1, and couldn't find
which one to use to get the &#8217; back ...

I found a way to get the entity (startEntity / endEntity) using my own
handler for everything (DTD, content, error, lexical...), but it seems the
characters() method is called only after all the entities contained between
two elements have been translated, so I don't know how to do what I want...
I've searched the xml-dev archive and found an old thread about this, but
it didn't end as I wanted it to :)
http://lists.xml.org/archives/xml-dev/200005/msg00211.html

I simply want to keep the &#8217;, or whatever the entity was, in the
characters() method.

Even better, if it's possible, I'd like not to resolve entities at all,
because I don't work with them and resolving them causes me more trouble
than it solves (a typical error is "Undefined entity...").

Thanks in advance,
Aurelien


---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org


Re: How to keep entities unresolved in the result ?

Posted by Andy Clark <an...@apache.org>.
Aurelien Pernoud wrote:
> Everything works fine, except that the parser always translates the entities
> it finds into their equivalent characters in the characters() method:
> &amp; becomes &
> &nbsp; becomes a space
> &eacute; becomes é
> This is fine, but how do I get the reference back? In my case I must keep the
> existing references, otherwise I get errors in the generated XHTML.

As Joe mentioned, it's probably better to allow the
parser to do its job and pass the text of the entity
to the application. If you're dealing with XHTML,
then it should be the serializer's job to turn those
characters back into their entity references.

However...

If you want to know exactly what entity references
appear in the document (including character entity
refs like &#x20;) then you can turn on a feature in
Xerces to notify the application of all entity refs.
See the following page for information on the
feature:

   http://xml.apache.org/xerces2-j/features.html

But this would still pass on the characters between
the start/end entity ref calls. If you don't want
this, then you should extend the DOMParser or
SAXParser class to filter out this unwanted content.
However, realize that this would be a non-standard
way of dealing with these references.
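
A minimal sketch of that filtering idea, assuming a plain SAX
LexicalHandler (via DefaultHandler2) instead of a subclass of the Xerces
parser classes. The names EntityRefEcho and echo are hypothetical. The
handler writes "&name;" when an entity starts and drops the expanded
characters until it ends. As written it only covers general entities
declared in the DTD; as far as I know, built-in and character references
only reach startEntity when the Xerces notify-builtin-refs and
notify-char-refs features described on the page above are enabled.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper: copies text content, but writes entity references
// back as "&name;" instead of their expanded replacement text.
class EntityRefEcho extends DefaultHandler2 {
    private final StringBuilder out = new StringBuilder();
    private int entityDepth = 0; // > 0 while inside an entity expansion

    // The external DTD subset ("[dtd]") and parameter entities ("%name")
    // are also reported through startEntity/endEntity; skip those.
    private static boolean isGeneralEntity(String name) {
        return !name.equals("[dtd]") && !name.startsWith("%");
    }

    @Override
    public void startEntity(String name) {
        if (!isGeneralEntity(name)) return;
        if (entityDepth == 0) out.append('&').append(name).append(';');
        entityDepth++;
    }

    @Override
    public void endEntity(String name) {
        if (isGeneralEntity(name)) entityDepth--;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Drop the expanded text that arrives between startEntity/endEntity.
        if (entityDepth == 0) out.append(ch, start, length);
    }

    static String echo(String xml) throws Exception {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        EntityRefEcho handler = new EntityRefEcho();
        reader.setContentHandler(handler);
        // Standard SAX2 property for registering a LexicalHandler.
        reader.setProperty(
            "http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.out.toString();
    }
}
```

For example, calling echo on
<!DOCTYPE p [<!ENTITY eacute "&#233;">]><p>caf&eacute; ouvert</p>
should give back the text "caf&eacute; ouvert" rather than "café ouvert".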

> Moreover, depending on the encoding, some entities such as &#8217; are
> translated to "?". I've set the encoding to ISO-8859-1, and couldn't find
> which one to use to get the &#8217; back ...

The appearance of a '?' is either a display issue
(i.e. the font doesn't have a glyph for that char)
or a serialization issue (i.e. that character can
not be represented in the specified encoding). I'm
guessing your problem is the latter -- please use
an encoding that can represent all the Unicode
characters, like UTF-8.
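
To make the serialization case concrete, here is a small sketch using
only the JDK's own charset support (the class and method names are
hypothetical). It encodes a string to bytes and decodes it again:
U+2019, the character behind &#8217;, has no representation in
ISO-8859-1, so the encoder substitutes '?', while UTF-8 round-trips it
unchanged.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows what an encoding does to a character
// it cannot represent.
class EncodingDemo {
    // Encode to bytes in the given charset, then decode back to a String.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String quote = "\u2019"; // RIGHT SINGLE QUOTATION MARK, i.e. &#8217;
        // ISO-8859-1 cannot represent U+2019, so it is replaced by '?'.
        System.out.println(
            roundTrip(quote, StandardCharsets.ISO_8859_1).equals("?"));
        // UTF-8 can represent every Unicode character.
        System.out.println(
            roundTrip(quote, StandardCharsets.UTF_8).equals(quote));
    }
}
```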

-- 
Andy Clark * andyc@apache.org




Re: How to keep entities unresolved in the result ?

Posted by Joseph Kesselman <ke...@us.ibm.com>.
I don't believe "unresolved" is an option. 

You can get the parser to tell you what the boundaries of entities are -- 
SAX has startEntity/endEntity events, and the DOM has EntityReference 
nodes -- but if you don't want the contents, it's your responsibility to 
not look inside them. Xerces doesn't always generate EntityReference 
nodes; see http://xml.apache.org/xerces2-j/features.html for the 
create-entity-ref-nodes option which controls that. (But the default is to 
produce them.)
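
A rough sketch of the DOM side, assuming the standard JAXP entry point
rather than the Xerces DOMParser class directly (the class and method
names are hypothetical). Note that JAXP's own default is to expand
entity references, so the sketch turns expansion off explicitly; to my
understanding Xerces-based implementations map this setting onto
create-entity-ref-nodes, but treat that mapping as an assumption.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: lists the children of the root element so that
// EntityReference nodes can be told apart from ordinary text.
class EntityRefNodes {
    static List<String> childSummary(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Ask for EntityReference nodes to be kept in the tree.
        factory.setExpandEntityReferences(false);
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

        List<String> kinds = new ArrayList<>();
        for (Node n = doc.getDocumentElement().getFirstChild();
             n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ENTITY_REFERENCE_NODE) {
                kinds.add("ref:" + n.getNodeName());
            } else if (n.getNodeType() == Node.TEXT_NODE) {
                kinds.add("text:" + n.getNodeValue());
            }
        }
        return kinds;
    }
}
```

With a document that declares and uses an eacute entity, the summary
should contain an entry "ref:eacute" marking where the reference was.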

For efficiency reasons, built-in entity references such as &amp;, and 
numeric character references such as &#10;, are almost always expanded 
into the character they represent rather than being left as an entity 
reference. If you really can't live with that, there are parser feature 
options which will cause the parser to leave them in entity form. Again, 
this does not keep them from being expanded, it only tells you where the 
boundaries were.



______________________________________
Joe Kesselman  / IBM Research




Re: How to keep entities unresolved in the result ?

Posted by Joseph Kesselman <ke...@us.ibm.com>.
Forgot to mention: The better solution, in most cases of character 
references, is to allow the character to be expanded during parsing, 
operate on it as a Unicode character, and use a serializer which knows how 
to convert it back to a proper XML/HTML representation when you write the 
data back out.
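
As a rough illustration of the serializer side (a hand-rolled sketch
with hypothetical names, not the Xerces serializer): escape markup
characters, and write any character the output encoding cannot
represent as a numeric character reference.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: escapes text content for a given output encoding.
class CharRefEscaper {
    static String escape(String text, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            switch (cp) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default:
                    String s = new String(Character.toChars(cp));
                    if (encoder.canEncode(s)) {
                        sb.append(s); // representable: write as-is
                    } else {
                        // not representable: numeric character reference
                        sb.append("&#").append(cp).append(';');
                    }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: it&#8217;s café &amp; more
        System.out.println(
            escape("it\u2019s caf\u00e9 & more", StandardCharsets.ISO_8859_1));
    }
}
```

Here é survives as-is because ISO-8859-1 can represent it, while U+2019
comes back out as &#8217;. Mapping characters back to named references
such as &eacute; would additionally need a name table, which this
sketch leaves out.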

______________________________________
Joe Kesselman  / IBM Research

