Posted to c-dev@xerces.apache.org by Colin Benson <co...@neteffect.com> on 2000/09/20 02:11:13 UTC

MemBufInputSource, whitespace, HandlerBase::characters()

Hello,

I expect that my question is going to be based in some unusually stupid
assumption on my part but that's never stopped me in the past...

I have a file containing XML content. The encoding is listed as UTF-8 (I
tried changing to ASCII without effect). In the same directory is the
corresponding DTD file. 

I run the SAXCount example on my file. It tells me that I have 16
characters. Indeed, when I look at the file, this is the count of character
data I see (ignoring leading tabs etc).

Then I modify the MemParse example. Specifically, instead of constructing a
MemBufInputSource from the static buffer in memory, I have the following
code fragment (please don't tell me it's ugly, I know it's ugly but that's
not what I'm writing about) ...

	char buf[10000];
	int f = _open("d:\\myfile.xml", _O_RDONLY); // my.dtd is in d:\
	int count = _read(f, buf, 9999);
	buf[count] = 0; // tack on an end of string marker
	MemBufInputSource* memBufIS = new MemBufInputSource
		(
		(const XMLByte*)buf,
		strlen(buf),
		gMemBufId,
		false
		);

Now the interesting thing here is that when I run with this code, MemParse
thinks that there are 30 or more characters in the file. If I set a
breakpoint in the characters() handler of MemParseHandlers:: I see that it
gets called for the 'real' characters (16 of them) and also for CR,HT
sequences. i.e. it is getting kicked for every bit of whitespace in the
file. That's certainly not what I want to happen.

So, I suspect that I've failed to understand something to do with encoding,
transcoding, decoding or flea coding. I just don't know which. Please help
me.

thanks

colin


Like as the waves make towards the pebbled shore...
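
A side note on the buffer handling in the fragment above, separate from the
whitespace question: _read() already returns the number of bytes read (or -1
on failure), so that count can be passed on directly instead of calling
strlen(), which would stop early at any zero byte in the data. A minimal
sketch in standard C++, assuming a hypothetical helper named readAllBytes
(not part of Xerces or of the MemParse sample):

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: reads a whole file as raw bytes, with no newline
// translation and no reliance on a terminating zero byte.
std::vector<unsigned char> readAllBytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        // Fail loudly instead of carrying on with an invalid handle.
        throw std::runtime_error("cannot open " + path);
    }
    // Copy every byte up to end of file into the vector.
    return std::vector<unsigned char>(std::istreambuf_iterator<char>(in),
                                      std::istreambuf_iterator<char>());
}
```

The resulting bytes.data() and bytes.size() could then be handed to the
MemBufInputSource constructor in place of buf and strlen(buf). With the
adopt flag left false, the vector has to stay alive for as long as the input
source is in use.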

Re: MemBufInputSource, whitespace, HandlerBase::characters()

Posted by Dean Roddey <dr...@charmedquark.com>.
Probably it's a validation thing. If you validate, then characters are
counted differently than when you are not validating. If you are validating,
then whitespace inside elements with CHILDREN-type content models is
reported as ignorable whitespace (through ignorableWhitespace()) rather than
through characters(). If you are not validating, then it all comes out as
characters.

See if that has something to do with it.

--------------------------
Dean Roddey
The CIDLib C++ Frameworks
Charmed Quark Software
droddey@charmedquark.com
http://www.charmedquark.com

"It takes two buttocks to make friction"
    - African Proverb

