Posted to c-dev@xerces.apache.org by "Madhavan, Sethu" <SM...@firstam.com> on 2001/02/08 21:17:30 UTC

Un-escaped Internal Entity References

Hi,
    I have an XML document that contains unescaped special characters such as
'&', '<', '>', etc.,
as shown below:
	<ExceptionInfo>
		<PackageName>RODT</PackageName>
		<LineNumber>279</LineNumber>
		<FunctionName>getReptOpt(const RWCString &a_sArg)</FunctionName>
		<FileName>CRODTProduct.cc</FileName>
		<DateTime>10/16/2000 16:25:00</DateTime>
		<ProcessId>92833</ProcessId>
		<ThreadId>5</ThreadId>
		<MessageId>1000002</MessageId>
		<HostName>im2kdev</HostName>
		<VerboseMessage> Report Options count <10 </VerboseMessage>
	</ExceptionInfo>

    I get a fatal error when I parse this document with the Xerces-C SAX parser
(as expected). I don't have control over the incoming document. Is there any
scanner flag that I could set to tell the scanner to substitute escape
sequences such as &amp; or &lt; for these characters? Or could I override the
functionality of some SAX parser related class to ensure that these special
characters are escaped appropriately before parsing commences?

Thanks in advance.
Sethu Madhavan .K
Senior Programmer/Analyst
First American Credco
San Diego, CA




Re: Un-escaped Internal Entity References

Posted by Kirk Wylie <ki...@radik.com>.
Assuming that you're dealing with something that is highly structured,
and for which you cannot get the author of the XML to produce correct
XML, you might be able to get by with running a regular expression pass
over it beforehand.

For example, if this situation ONLY occurs within <VerboseMessage>, you can
automatically wrap anything within it in a <![CDATA[ ... ]]> section using a
regular expression match, assuming you can preprocess the (malformed) XML
before attempting to really parse it.
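
A minimal sketch of that preprocessing step, using only the C++ standard
library. The helper name wrapInCData is hypothetical and not part of Xerces-C.
It assumes the target element has no attributes, holds text only (no child
elements, no existing CDATA sections), is never nested inside itself, and that
the element name contains no regular expression metacharacters.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not a Xerces-C API): wraps the text content of every
// <elementName>...</elementName> pair in a CDATA section so that stray '&'
// and '<' characters inside it no longer break well-formedness.
std::string wrapInCData(const std::string& xml, const std::string& elementName)
{
    // [\s\S]*? matches lazily, across newlines, up to the first closing tag.
    const std::regex pattern("<" + elementName + ">([\\s\\S]*?)</" + elementName + ">");

    std::string result;
    auto copiedUpTo = xml.cbegin();
    for (std::sregex_iterator it(xml.cbegin(), xml.cend(), pattern), end; it != end; ++it) {
        const std::smatch& match = *it;
        std::string content = match[1].str();

        // A literal "]]>" would close the CDATA section early, so split it
        // across two adjacent sections. The replacement is 15 characters
        // long; resume searching after it.
        for (std::string::size_type pos = content.find("]]>");
             pos != std::string::npos;
             pos = content.find("]]>", pos + 15)) {
            content.replace(pos, 3, "]]]]><![CDATA[>");
        }

        // Copy the untouched text before this match, then the wrapped element.
        result.append(copiedUpTo, match[0].first);
        result += "<" + elementName + "><![CDATA[" + content + "]]></" + elementName + ">";
        copiedUpTo = match[0].second;
    }
    // Copy whatever follows the last match.
    result.append(copiedUpTo, xml.cend());
    return result;
}
```

In the posted sample the same call would be needed for <FunctionName> as well,
since it contains a bare '&'. The corrected text could then be handed to the
parser from memory (Xerces-C provides MemBufInputSource for that) rather than
from the original file. If the stray characters can appear in arbitrary
elements, this approach is not sufficient.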

Kirk Wylie

-- 
Kirk Wylie  |  mailto:kirk@radik.com  |  http://www.radik.com