Posted to j-users@xerces.apache.org by John Byrne <jo...@propylon.com> on 2008/04/22 17:22:03 UTC

encoded character references

Hi,

I'm using the AbstractXMLDocumentParser class from the XNI.

Is there an event raised in this class for specially encoded characters, 
for example &#x201d; ?

If I parse a document containing references such as this, and print the 
output, I get the interpretation of the character (double quotes in this 
case), rather than the original character sequence - so I'm wondering, 
at what level does this interpretation take place?

Is there a way I can get the parser, or one of its underlying objects, 
to notify me that it found &#x201d; as a raw sequence of characters?

Thanks in advance!

-John

---------------------------------------------------------------------
To unsubscribe, e-mail: j-users-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-users-help@xerces.apache.org

Re: encoded character references

Posted by Daniel Yokomizo <da...@gmail.com>.

On Tue, Apr 22, 2008 at 12:22 PM, John Byrne <jo...@propylon.com> wrote:
> Hi,
>
>  I'm using the AbstractXMLDocumentParser class from the XNI.
>
>  Is there an event raised in this class for specially encoded characters,
> for example &#x201d; ?
>
>  If I parse a document containing references such as this, and print the
> output, I get the interpretation of the character (double quotes in this
> case), rather than the original character sequence - so I'm wondering, at
> what level does this interpretation take place?
>
>  Is there a way I can get the parser, or one of its underlying objects, to
> notify me that it found &#x201d; as a raw sequence of characters?

I studied it for a while and found that if it appears in a text node,
you can use the LexicalHandler to be notified (IIRC). On attribute
nodes there's no event, but you can access the non-normalized value.
See the last thread on
http://mail-archives.apache.org/mod_mbox/xerces-j-users/200803.mbox/thread,
subject: "How to disable attribute normalization" for more details.

In the end I decided to "trick" Xerces, because my goal was to keep
the entities unevaluated. All I did was replace & with &amp; in
the source document (I decorated the source InputStream and
transformed it on the fly). If you want to follow this approach, I can
give you this class; it's licensed under the ASL.
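
A rough sketch of the kind of decorator described above, not Daniel's
actual class. It wraps a java.io.Reader rather than an InputStream (an
assumption made here to sidestep byte-level encoding concerns) and emits
&amp; for every & it reads, so a reference such as &#x201d; reaches the
application as literal text. The names AmpersandEscapingReader and
rootText are hypothetical. Note that this blanket replacement also
rewrites & inside CDATA sections and comments, so treat it as a starting
point only.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical decorator: turns every '&' in the source into "&amp;". */
class AmpersandEscapingReader extends FilterReader {
    private static final String TAIL = "amp;";
    // Index of the next TAIL character still to be emitted;
    // TAIL.length() means nothing is pending.
    private int pending = TAIL.length();

    AmpersandEscapingReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        if (pending < TAIL.length()) {
            return TAIL.charAt(pending++);
        }
        int c = in.read();
        if (c == '&') {
            pending = 0; // the '&' itself is returned now, "amp;" follows
        }
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n = 0;
        while (n < len) {
            // Once something has been delivered, do not risk blocking for more.
            if (n > 0 && pending == TAIL.length() && !in.ready()) {
                break;
            }
            int c = read();
            if (c == -1) {
                break;
            }
            buf[off + n++] = (char) c;
        }
        return n == 0 ? -1 : n;
    }

    @Override
    public boolean markSupported() {
        return false; // mark/reset would desynchronise the pending state
    }

    /** Parses xml through the decorator and returns the root element's text. */
    static String rootText(String xml) throws Exception {
        Reader source = new AmpersandEscapingReader(new StringReader(xml));
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(source));
        return doc.getDocumentElement().getTextContent();
    }
}
```

With this in place, rootText("<r>x&#x201d;y</r>") returns the literal
string x&#x201d;y rather than x, a right double quotation mark, and y.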

>  Thanks in advance!
>
>  -John

Best regards,
Daniel Yokomizo.


Re: encoded character references

Posted by Michael Glavassevich <mr...@ca.ibm.com>.

Hi John,

If you really need to know the boundaries of character references you
should enable the 'notify-char-refs' [1] feature. Note that this only
applies to the content of elements (i.e. not attribute values).
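
A minimal sketch of how the feature might be switched on through SAX,
assuming Apache Xerces2-J is the parser behind JAXP. The feature URI is
the one documented at [1]; the class name CharRefEvents and the event
strings are hypothetical. With the feature honoured, the boundaries are
reported through the LexicalHandler start/end entity callbacks, and the
entity name is expected to look like "#x201d" (an assumption about
Xerces behaviour worth verifying against your version). Other parsers
may reject or ignore the feature, so the sketch records that case
instead of failing.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

class CharRefEvents {
    static final String NOTIFY_CHAR_REFS =
            "http://apache.org/xml/features/scanner/notify-char-refs";
    static final String LEXICAL_HANDLER =
            "http://xml.org/sax/properties/lexical-handler";

    /** Parses xml and returns a flat log of entity and text events. */
    static List<String> parse(String xml) throws Exception {
        final List<String> events = new ArrayList<>();
        XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        try {
            reader.setFeature(NOTIFY_CHAR_REFS, true);
        } catch (SAXException e) {
            // Not Xerces, or a build that does not support the feature.
            events.add("feature-unavailable");
        }
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void startEntity(String name) {
                events.add("start:" + name);
            }

            @Override
            public void endEntity(String name) {
                events.add("end:" + name);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                events.add("text:" + new String(ch, start, length));
            }
        };
        reader.setContentHandler(handler);
        reader.setProperty(LEXICAL_HANDLER, handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return events;
    }
}
```

Calling CharRefEvents.parse("<r>x&#x201d;y</r>") yields the text events
in document order; any start:/end: entries between them mark where a
character reference was expanded.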

Thanks.

[1]
http://xerces.apache.org/xerces2-j/features.html#scanner.notify-char-refs

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

John Byrne <jo...@propylon.com> wrote on 04/22/2008 04:58:15 PM:

> "The distinction is syntactic, not semantic. Nothing that's looking at
> the semantic content of XML documents should care about it... and
> nothing should be looking at the purely syntactic details of XML except
> the parser. "
>
> True. And from that point of view, what I am working on is, in fact, a
> kind of parser, albeit a very specialized one. One of the things my
> parser needs to do is detect the presence of these character references.
> I need to distinguish between  &#65; and a letter A character. Now I
> could go and write the code to do this, but I thought since Xerces must
> already have a way of doing this, I'd go ahead and use that instead.
>
> I imagine that there is a callback method somewhere in the XNI API that
> handles the translation of these references into their "normative"
> representation.
>
> As regards the correctness of my design, all I can say is that I've
> given it quite a lot of thought, and I'm confident that my solution is
> the best option available to me. Unfortunately I'm not in a position to
> go into a lot of detail. While I do appreciate any and all advice, be it
> theoretical or otherwise, what I really need is a practical solution!

Re: encoded character references

Posted by John Byrne <jo...@propylon.com>.

"The distinction is syntactic, not semantic. Nothing that's looking at 
the semantic content of XML documents should care about it... and 
nothing should be looking at the purely syntactic details of XML except 
the parser. "

True. And from that point of view, what I am working on is, in fact, a 
kind of parser, albeit a very specialized one. One of the things my 
parser needs to do is detect the presence of these character references. 
I need to distinguish between  &#65; and a letter A character. Now I 
could go and write the code to do this, but I thought since Xerces must 
already have a way of doing this, I'd go ahead and use that instead.

I imagine that there is a callback method somewhere in the XNI API that 
handles the translation of these references into their "normative" 
representation.

As regards the correctness of my design, all I can say is that I've 
given it quite a lot of thought, and I'm confident that my solution is 
the best option available to me. Unfortunately I'm not in a position to 
go into a lot of detail. While I do appreciate any and all advice, be it 
theoretical or otherwise, what I really need is a practical solution!

keshlam@us.ibm.com wrote:
>
> > &amp; might be treated as being the same as &#38;, but these are both
> > distinct from ordinary text
>
> As far as XML is concerned,  neither is "distinct from ordinary text" 
> -- they're just representations of the & character.
>
> For comparison, consider &#65;. XML doesn't distinguish between this 
> and a simple capital-A character.
>
> The distinction is syntactic, not semantic. Nothing that's looking at 
> the semantic content of XML documents should care about it... and 
> nothing should be looking at the purely syntactic details of XML 
> except the parser.

Re: encoded character references

Posted by ke...@us.ibm.com.

> &amp; might be treated as being the same as &#38;, but these are both 
> distinct from ordinary text

As far as XML is concerned,  neither is "distinct from ordinary text" -- 
they're just representations of the & character.

For comparison, consider &#65;. XML doesn't distinguish between this and a 
simple capital-A character.
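
To illustrate the point, here is a small sketch using the JDK's standard
DOM API (the class name ReferenceEquivalence is made up for this
example). Once parsed, a character reference, a built-in entity
reference, a CDATA section, and the plain character all produce the same
text content.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class ReferenceEquivalence {
    /** Returns the text content of the root element of the given document. */
    static String text(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement()
                .getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // &#65; and a literal A are indistinguishable after parsing.
        System.out.println(text("<r>&#65;</r>").equals(text("<r>A</r>"))); // true

        // Four spellings of the ampersand, one resulting character.
        String[] forms = {"&amp;", "&#38;", "&#x26;", "<![CDATA[&]]>"};
        for (String form : forms) {
            System.out.println(form + " -> " + text("<r>" + form + "</r>"));
        }
    }
}
```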

The distinction is syntactic, not semantic. Nothing that's looking at the 
semantic content of XML documents should care about it... and nothing 
should be looking at the purely syntactic details of XML except the 
parser.

______________________________________
"... Three things see no end: A loop with exit code done wrong,
A semaphore untested, And the change that comes along. ..."
  -- "Threes" Rev 1.1 - Duane Elms / Leslie Fish
(http://www.ovff.org/pegasus/songs/threes-rev-11.html)

Re: encoded character references

Posted by John Byrne <jo...@propylon.com>.

I see what you mean. I'm actually not interested in exactly how the 
sequence is represented - whether it's &amp;, &#38;, &#x26;, or 
<![CDATA[&]]> - but I do need to know that a special character has been 
found.

&amp; might be treated as being the same as &#38;, but these are both 
distinct from ordinary text. When the parser encounters the string 
"foobar", it returns a String containing "foobar", but when it finds the 
string "&amp;" it returns a string containing "&".

What I need is to be notified that this translation has occurred, or 
that it is about to occur.

Is this possible?

keshlam@us.ibm.com wrote:
>
> As far as XML is concerned, numeric character references are identical 
> to the characters they represent. The XML APIs shouldn't make any 
> distinction.
>
> For implementation reasons, you *may* find that SAX delivers these as 
> separate characters() events. But that is not guaranteed.
>
> No XML application should ever care whether your document contains 
> &amp;, &#38;, &#x26;, or <![CDATA[&]]>... so I'd suggest that if you 
> think you need this information, your design is probably broken and 
> should be reconsidered.

Re: encoded character references

Posted by ke...@us.ibm.com.

As far as XML is concerned, numeric character references are identical to 
the characters they represent. The XML APIs shouldn't make any 
distinction.

For implementation reasons, you *may* find that SAX delivers these as 
separate characters() events. But that is not guaranteed.
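
Because the chunking is not guaranteed, a handler is safer if it
accumulates text and never infers anything from where one characters()
call ends and the next begins. A minimal sketch with the standard SAX
API follows; the class name TextCollector is hypothetical, and the chunk
counter is there only to let you observe what your parser happens to do.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int chunks = 0;

    @Override
    public void characters(char[] ch, int start, int length) {
        // Append every chunk; the boundaries carry no meaning.
        text.append(ch, start, length);
        chunks++;
    }

    /** Parses xml and returns the collector holding all character data. */
    static TextCollector collect(String xml) throws Exception {
        TextCollector collector = new TextCollector();
        SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new InputSource(new StringReader(xml)), collector);
        return collector;
    }

    String text() {
        return text.toString();
    }

    /** Number of characters() calls seen; parser-dependent. */
    int chunks() {
        return chunks;
    }
}
```

For the input <r>a&#65;b</r>, text() is aAb however many characters()
calls the parser chose to make.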

No XML application should ever care whether your document contains &amp;, 
&#38;, &#x26;, or <![CDATA[&]]>... so I'd suggest that if you think you 
need this information, your design is probably broken and should be 
reconsidered.
