Posted to c-dev@xerces.apache.org by Brendan Reville <br...@tenzing.com> on 2002/01/20 00:13:53 UTC

"Invalid document structure" exception?

hi all,

I have a text .xml file which I saved in Windows Notepad as a UTF-8 file.

However, when I try to parse it with Xerces, I get an "invalid document
structure" exception on line 1, character 1.  Any idea why I would be
getting this?


The very beginning of the file looks like this:

ï»¿<?xml versio

A binary dump of the same gives this:

ef bb bf 3c 3f 78 6d 6c   20 76 65 72 73 69 6f 6e

if I'm not mistaken.  I'm not sure what those first three bytes are, but I
didn't expect to see them there; they don't show up in Notepad, that's for
sure.

thanks

- Brendan


---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-c-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-c-dev-help@xml.apache.org

getElementsByTagName() Depth specification

Posted by Renji Panicker <re...@aliyance.com>.

Hi,

Is there any way to specify that one wants only the immediate children of
the current element, and not all the subchildren as well?

Or will I have to do a firstchild/nextchild loop for this?
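
For reference, the loop alternative mentioned above can be sketched as follows. This is only an illustration: Node, appendChild() and childrenByTagName() are hypothetical stand-ins so that the example compiles on its own. With the real DOM_Node class the same walk would use getFirstChild() and getNextSibling().

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a DOM node: just a name and the two links
// that the loop needs.
struct Node {
    std::string name;
    Node* firstChild;
    Node* nextSibling;
    explicit Node(std::string n)
        : name(std::move(n)), firstChild(nullptr), nextSibling(nullptr) {}
};

// Append `child` as the last child of `parent`.
void appendChild(Node& parent, Node& child) {
    assert(child.nextSibling == nullptr);  // child must not already be linked
    if (parent.firstChild == nullptr) {
        parent.firstChild = &child;
        return;
    }
    Node* last = parent.firstChild;
    while (last->nextSibling != nullptr) {
        last = last->nextSibling;
    }
    last->nextSibling = &child;
}

// Collect only the immediate children of `parent` named `tag`.
// The loop never descends, so grandchildren are not visited.
std::vector<const Node*> childrenByTagName(const Node& parent,
                                           const std::string& tag) {
    std::vector<const Node*> result;
    for (const Node* c = parent.firstChild; c != nullptr; c = c->nextSibling) {
        if (c->name == tag) {
            result.push_back(c);
        }
    }
    return result;
}
```

Because the walk follows only the sibling chain, a matching element nested one level deeper is skipped.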

-/renji



Re: Xerces size -- was Re: XMLCh to utf-8 conversion?

Posted by Dean Roddey <dr...@charmedquark.com>.

Unless something major has changed, the deal is that you should be able to
drop the DOM out, if you don't need it. You'll have to also drop out the
DOMParser. It used to be even more modular than that, but somewhere along
the line a SAX2 class started getting used inside the core, which may cause
problems.

But I designed it from the beginning to be built as separate chunks. There
are PROJ_XXXX type defines which are used for the standard VC++ import/export
reasons. If you look in the XercesDefs.h header, I believe that in there
there is a define that controls whether it's going to define them all (which
is what's required for a monolithic build), or whether they'll be defined on
a per-project basis, which is what is needed to build them separately.
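
For anyone who hasn't met the pattern, here is a minimal sketch of the import/export arrangement described above. Every name in it (DEMO_EXPORT, DEMO_IMPORT, PROJ_DEMOUTIL, DEMOUTIL_API, demoUtilVersion) is made up for illustration and is not the actual Xerces spelling. The real defines live in the Xerces headers and may be organized differently.

```cpp
#include <cassert>

// Hypothetical names throughout; these are not the real Xerces macros.
#if defined(_WIN32)
#  define DEMO_EXPORT __declspec(dllexport)
#  define DEMO_IMPORT __declspec(dllimport)
#else
#  define DEMO_EXPORT
#  define DEMO_IMPORT
#endif

// Normally set by the build settings of the one project that owns this
// layer; defined here only so that the sketch compiles on its own.
#define PROJ_DEMOUTIL 1

// The owning project exports the layer's symbols; every other project
// that includes the header sees them as imports.
#if defined(PROJ_DEMOUTIL)
#  define DEMOUTIL_API DEMO_EXPORT
#else
#  define DEMOUTIL_API DEMO_IMPORT
#endif

DEMOUTIL_API int demoUtilVersion() {
    return 1;
}
```

For a monolithic build, all the PROJ_ defines would be switched on together so that every layer exports from the single DLL.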

Originally, the layers were: Utils, scanner, validator, SAX/DOM, Parsers.
Each of those should be buildable as a DLL. However, the thing with the
circular SAX2 reference might not allow it anymore, because it would create
a circular link-time dependency between the separate DLLs. That should really be fixed.
Anyway, you may just have to arrange to build it as a monolithic chunk
still, but just leave out the DOM and DOMParser parts.

--------------------------
Dean Roddey
The Charmed Quark Controller
Charmed Quark Software
droddey@charmedquark.com
http://www.charmedquark.com

"If it don't have a control port, don't buy it!"


----- Original Message -----
From: "Brendan Reville" <br...@tenzing.com>
To: <xe...@xml.apache.org>
Sent: Monday, January 21, 2002 11:03 AM
Subject: RE: Xerces size -- was Re: XMLCh to utf-8 conversion?


>
> > > I'd be happy to go with the latest version of Xerces if either (a)
> > > the DLL was still small or (b) there was a decent way to build and
> > > link a minimal SAX-only version of Xerces.
> >
> > Hmm... I think it's possible. Other people can correct me, but I
> > believe Xerces DOM relies on SAX, but SAX has no reliance on DOM. So
> > you should be able to modify the Makefile to remove the DOM and IDOM
> > components (and the DOMParser and IDOMParser) and get a working
> > library.
>
> The last thread I saw on this, quite a while back, concluded that this was
> possible in theory, but in practice the two codebases were just too
> intertangled.  Perhaps this wasn't really the case?
>
> > I don't use windows, so I couldn't tell you how to do it. Perhaps
> > Tinny could?
>
> I'm all ears!
>
> - Brendan
>
>

RE: Xerces size -- was Re: XMLCh to utf-8 conversion?

Posted by Brendan Reville <br...@tenzing.com>.

> > I'd be happy to go with the latest version of Xerces if either (a)
> > the DLL was still small or (b) there was a decent way to build and
> > link a minimal SAX-only version of Xerces.
>
> Hmm... I think it's possible. Other people can correct me, but I
> believe Xerces DOM relies on SAX, but SAX has no reliance on DOM. So
> you should be able to modify the Makefile to remove the DOM and IDOM
> components (and the DOMParser and IDOMParser) and get a working
> library.

The last thread I saw on this, quite a while back, concluded that this was
possible in theory, but in practice the two codebases were just too
intertangled.  Perhaps this wasn't really the case?

> I don't use windows, so I couldn't tell you how to do it. Perhaps
> Tinny could?

I'm all ears!

- Brendan



Re: Xerces size -- was Re: XMLCh to utf-8 conversion?

Posted by "Jason E. Stewart" <ja...@openinformatics.com>.

"Brendan Reville" <br...@tenzing.com> writes:

> I'd be happy to go with the latest version of Xerces if either (a)
> the DLL was still small or (b) there was a decent way to build and
> link a minimal SAX-only version of Xerces.

Hmm... I think it's possible. Other people can correct me, but I
believe Xerces DOM relies on SAX, but SAX has no reliance on DOM. So
you should be able to modify the Makefile to remove the DOM and IDOM
components (and the DOMParser and IDOMParser) and get a working
library.

I don't use windows, so I couldn't tell you how to do it. Perhaps
Tinny could?

jas.


Xerces size -- was Re: XMLCh to utf-8 conversion?

Posted by Brendan Reville <br...@tenzing.com>.

> > I'm using xerces 1.2, hope that's ok.
>
> Good Lord, son!! That's bronze age equipment. I'd seriously
> upgrade. There have been many many improvements since then (like bugs
> fixed and memory leaks patched, etc).
>
> Besides, we won't be able to help you if you don't (Since everyone on
> this list probably runs 1.5.2 or 1.6.0 ;-)

I stuck with 1.2 because, last time I looked, the DLL had grown unreasonably
large to bundle with our little piece of software, and because Xerces
suddenly linked to a new version of winsock, which broke our previously
working build.

I'd be happy to go with the latest version of Xerces if either (a) the DLL
was still small or (b) there was a decent way to build and link a minimal
SAX-only version of Xerces.


As for the exception, I'm not sure; I've found some alternate unicode->utf8
java code which I might adapt.

thanks though

- Brendan



Debug Assertion Failed

Posted by Xerces Rule <xe...@bilbao.com>.

Hi again.

I succeeded in inserting a function to obtain the text
string from a Node:

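void myFunction (DOM_Node input, char *output) {
	// transcode() returns a new[]-allocated buffer that the caller must free.
	char *attrValC = input.getNodeValue().transcode();

	try {
		// The caller must make sure output is large enough for the text.
		strcpy(output,attrValC);
		delete [] attrValC;
	}
	catch (...) {
		delete [] attrValC;
		throw;
	}
}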

I've got no problems with the Release Program (VC++6.0 on
NT4.0), but when executing the Debug one, I get this pop up
error message:
*************
Debug Assertion Failed
Program: dbgheap.c
Line: 1044
Expression: _CrtIsValidHeapPointer(pUserData)
*************

I've been browsing the Internet, and there are more similar
cases. I would be extremely obliged if anybody could explain to me
whether this is a Xerces bug, an MS bug, or my own fault.

Regards.




Re: XMLCh to utf-8 conversion?

Posted by "Jason E. Stewart" <ja...@openinformatics.com>.

"Brendan Reville" <br...@tenzing.com> writes:

> Jason, thanks for the help.
> 
> 1. Unfortunately, I'm getting an unhandled exception (access
>    violation) somewhere inside the call to transcodeTo().  Any idea
>    why this would happen?
> 
> 2. I'm also a bit fuzzy about the value of UTF8_MAXLEN.  Would this
>    normally be 4 or so?

Ooops, sorry that's a Perl constant. It's actually 7 in the
pathological case.
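
To put numbers on the buffer size, here is a small sketch of the arithmetic. It assumes that XMLCh holds 16-bit UTF-16 code units and that the output is modern UTF-8, limited to code points up to U+10FFFF. Older definitions of UTF-8 allowed longer sequences, so treat this as an illustration and not a statement about what any particular transcoder emits. The function names are invented for the example.

```cpp
#include <cassert>
#include <cstddef>

// Number of bytes UTF-8 needs for one Unicode code point (up to U+10FFFF).
std::size_t utf8Length(unsigned long codePoint) {
    assert(codePoint <= 0x10FFFFUL);
    if (codePoint < 0x80UL)    return 1;
    if (codePoint < 0x800UL)   return 2;
    if (codePoint < 0x10000UL) return 3;
    return 4;
}

// Worst-case output size in bytes for `unitCount` UTF-16 code units,
// plus one byte for a terminating '\0'.
// A single code unit needs at most 3 bytes; a surrogate pair is two code
// units and needs 4 bytes, which is still within 3 bytes per unit.
std::size_t utf8BufferSize(std::size_t unitCount) {
    return unitCount * 3 + 1;
}
```

Under those assumptions, a multiplier of 4 (or 7) is more than enough, as long as one extra byte is reserved for the terminator.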

> I'm using xerces 1.2, hope that's ok.

Good Lord, son!! That's bronze age equipment. I'd seriously
upgrade. There have been many many improvements since then (like bugs
fixed and memory leaks patched, etc).

Besides, we won't be able to help you if you don't (Since everyone on
this list probably runs 1.5.2 or 1.6.0 ;-)

jas.


RE: XMLCh to utf-8 conversion?

Posted by Brendan Reville <br...@tenzing.com>.

Just an addendum to my last post...

> 1. Unfortunately, I'm getting an unhandled exception (access violation)
> somewhere inside the call to transcodeTo().  Any idea why this 
> would happen?

Here's some more detail:

unsigned int total_chars =
UTF8_TRANSCODER->transcodeTo((const XMLCh*) source, 
	(unsigned int) length,
	(XMLByte*) res,
	(unsigned int) length*UTF8_MAXLEN,
	charsEaten,
	//XMLTranscoder::UnRep_Throw
	XMLTranscoder::UnRep_RepChar  );

source: 66 00 00 00      (f)
length: 1

res is allocated fine, 
UTF8_MAXLEN is 4, 
charsEaten is 0

looks okay from my end, but maybe I'm missing something obvious?

- Brendan




RE: XMLCh to utf-8 conversion?

Posted by Brendan Reville <br...@tenzing.com>.

Jason, thanks for the help.

1. Unfortunately, I'm getting an unhandled exception (access violation)
somewhere inside the call to transcodeTo().  Any idea why this would happen?

2. I'm also a bit fuzzy about the value of UTF8_MAXLEN.  Would this normally
be 4 or so?


I'm using xerces 1.2, hope that's ok.

thanks!

- Brendan




> Here's a bit of C-ish typemap code that does it. Note that $source is
> the XMLCh* input string, and UTF8_TRANSCODER is a global variable set
> to a pre-allocated transcoder for UTF-8.
>
>   unsigned int charsEaten = 0;
>   int length  = XMLString::stringLen($source);      // string length
>   XMLByte* res = new XMLByte[length * UTF8_MAXLEN + 1];      // output string, +1 for the '\0' below
>   unsigned int total_chars =
>     UTF8_TRANSCODER->transcodeTo((const XMLCh*) $source,
> 				   (unsigned int) length,
> 				   (XMLByte*) res,
> 				   (unsigned int) length*UTF8_MAXLEN,
> 				   charsEaten,
> 				   XMLTranscoder::UnRep_Throw
> 				   );
>   res[total_chars] = '\0';
>
> Here's the code that allocates the transcoder:
>
> static XMLCh* UTF8_ENCODING = NULL;
> static XMLTranscoder* UTF8_TRANSCODER  = NULL;
>     XMLTransService::Codes failReason;
>     XMLPlatformUtils::Initialize(); // first we must create the transservice
>     UTF8_ENCODING = XMLString::transcode("UTF-8");
>     UTF8_TRANSCODER =
>       XMLPlatformUtils::fgTransService->makeNewTranscoderFor(UTF8_ENCODING,
>                                                              failReason,
>                                                              1024);
>     if (! UTF8_TRANSCODER) {
> 	croak("ERROR: XML::Xerces: INIT: Could not create UTF-8 transcoder");
>     }
>
> HTH,
> jas.
>

Re: XMLCh to utf-8 conversion?

Posted by "Jason E. Stewart" <ja...@openinformatics.com>.

"Brendan Reville" <br...@tenzing.com> writes:

> I have a .xml file encoded as utf-8, and I want to load some
> attribute values as utf-8 strings inside my own code.
> 
> given that AttributeList::getName() returns an XMLCh*, which I take
> to be Unicode characters, how can I then convert this back into a
> utf-8 string?

Here's a bit of C-ish typemap code that does it. Note that $source is
the XMLCh* input string, and UTF8_TRANSCODER is a global variable set
to a pre-allocated transcoder for UTF-8.

  unsigned int charsEaten = 0;
  int length  = XMLString::stringLen($source);      // string length
  XMLByte* res = new XMLByte[length * UTF8_MAXLEN + 1];      // output string, +1 for the '\0' below
  unsigned int total_chars =
    UTF8_TRANSCODER->transcodeTo((const XMLCh*) $source, 
				   (unsigned int) length,
				   (XMLByte*) res,
				   (unsigned int) length*UTF8_MAXLEN,
				   charsEaten,
				   XMLTranscoder::UnRep_Throw
				   );
  res[total_chars] = '\0';

Here's the code that allocates the transcoder:

static XMLCh* UTF8_ENCODING = NULL; 
static XMLTranscoder* UTF8_TRANSCODER  = NULL;
    XMLTransService::Codes failReason;
    XMLPlatformUtils::Initialize(); // first we must create the transservice
    UTF8_ENCODING = XMLString::transcode("UTF-8");
    UTF8_TRANSCODER =
      XMLPlatformUtils::fgTransService->makeNewTranscoderFor(UTF8_ENCODING,
                                                             failReason,
                                                             1024);
    if (! UTF8_TRANSCODER) {
	croak("ERROR: XML::Xerces: INIT: Could not create UTF-8 transcoder");
    }
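
For anyone who wants to see what the conversion itself involves without going through the transcoder, here is a hand-rolled sketch. It does not use any Xerces API: utf16ToUtf8() is a made-up helper, and it assumes the input is a sequence of 16-bit UTF-16 code units (held here in char16_t). Unpaired surrogates are replaced with U+FFFD, which is a choice made for this example only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Convert `length` UTF-16 code units to a UTF-8 std::string.
// Unpaired surrogates are replaced with U+FFFD.
std::string utf16ToUtf8(const char16_t* src, std::size_t length) {
    assert(src != nullptr || length == 0);
    std::string out;
    out.reserve(length * 3);  // worst case: 3 bytes per 16-bit code unit
    for (std::size_t i = 0; i < length; ++i) {
        char32_t cp = src[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate
            if (i + 1 < length && src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
                // Combine the pair into one code point above U+FFFF.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {  // stray low surrogate
            cp = 0xFFFD;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Returning a std::string sidesteps the question of who deletes the buffer and whether there is room for the terminator.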

HTH,
jas.


XMLCh to utf-8 conversion?

Posted by Brendan Reville <br...@tenzing.com>.

hi all,

I have a .xml file encoded as utf-8, and I want to load some attribute
values as utf-8 strings inside my own code.

given that AttributeList::getName() returns an XMLCh*, which I take to be
Unicode characters, how can I then convert this back into a utf-8 string?


thanks!

- Brendan



Re: "Invalid document structure" exception?

Posted by Dean Roddey <dr...@charmedquark.com>.

UTF-16 isn't Unicode either; it's an encoding. I was being loose in my
wording, which was my bad. UTF-16 is an encoding of Unicode that uses two
(or four) bytes per character, so it needs a byte order mark. UTF-8 is a
multi-byte encoding of Unicode that doesn't need any byte order mark, though
it's perfectly capable of encoding one. But the XML standard, which may have
been changed with respect to this issue since I originally wrote that
encoding-sensing code, said (I think) that the file had to start with either
0xFFFE or 0xFEFF followed by <?xml, or else <?xml had to be the first thing
in the file. I don't think it allowed for the BOM in other cases at the time,
or at least not explicitly, so it wasn't allowed for in the Xerces parser.

They may have updated the spec for that since then, since I've seen this
discussion a couple times, but I don't remember what the final decision was.
Given that the parser still chokes on it, either the decision was it wasn't
allowed, or no decision was made at all :-)
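
To make the sensing logic concrete, here is a simplified sketch of that kind of first-bytes check. It is not the Xerces code: the enum and function names are invented for illustration, and real detection covers more cases (UTF-32, EBCDIC, UTF-16 without a BOM) that are left out here.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type for this sketch.
enum class Sniffed {
    Utf16BigEndian,     // FE FF
    Utf16LittleEndian,  // FF FE
    Utf8WithBom,        // EF BB BF
    AsciiCompatible,    // starts directly with "<?xm"
    Unknown
};

// Guess the encoding family from the first few bytes of a document.
Sniffed sniffEncoding(const unsigned char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF) {
        return Sniffed::Utf16BigEndian;
    }
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
        return Sniffed::Utf16LittleEndian;
    }
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) {
        return Sniffed::Utf8WithBom;
    }
    if (size >= 4 && data[0] == 0x3C && data[1] == 0x3F &&
        data[2] == 0x78 && data[3] == 0x6D) {
        return Sniffed::AsciiCompatible;  // "<?xm" with no BOM in front
    }
    return Sniffed::Unknown;
}
```

A parser that lacks the EF BB BF branch falls through to Unknown on a Notepad-style UTF-8 file, which would be consistent with an error at line 1, character 1.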

--------------------------
Dean Roddey
The Charmed Quark Controller
Charmed Quark Software
droddey@charmedquark.com
http://www.charmedquark.com

"If it don't have a control port, don't buy it!"

----- Original Message -----
From: "Jason E. Stewart" <ja...@openinformatics.com>
To: <xe...@xml.apache.org>
Sent: Saturday, January 19, 2002 4:05 PM
Subject: Re: "Invalid document structure" exception?

> "Dean Roddey" <dr...@charmedquark.com> writes:
>
> > UTF-8 isn't Unicode, though it can encode Unicode.
>
> This isn't true, is it?
>
> from unicode.org:
>
>   Q: Can Unicode text be represented in more than one way?
>
>   A: Yes, there are several possible representations of Unicode data,
>   including UTF-8,  UTF-16 and UTF-32.
>
>   Q: What is a UTF?
>
>   A: A Unicode transformation format (UTF) is an algorithmic mapping
>   from every Unicode scalar value to a unique byte sequence.
>
> As I understand it, UTF-8 is one of the character encodings that are
> defined in the Unicode standard.
>
> jas.
>

Re: "Invalid document structure" exception?

Posted by "Jason E. Stewart" <ja...@openinformatics.com>.

"Dean Roddey" <dr...@charmedquark.com> writes:

> UTF-8 isn't Unicode, though it can encode Unicode.

This isn't true, is it?

from unicode.org:

  Q: Can Unicode text be represented in more than one way?
  
  A: Yes, there are several possible representations of Unicode data,
  including UTF-8,  UTF-16 and UTF-32. 
  
  Q: What is a UTF?
  
  A: A Unicode transformation format (UTF) is an algorithmic mapping
  from every Unicode scalar value to a unique byte sequence. 

As I understand it, UTF-8 is one of the character encodings that are
defined in the Unicode standard.

jas.


RE: "Invalid document structure" exception?

Posted by Brendan Reville <br...@tenzing.com>.

> Though my hex math isn't good enough to do it in my head, it looks like the
> UTF-8 version of the Unicode byte order mark. I know that there was some
> controversy as to whether that was legal, since the BOM is supposed to
> indicate byte order for Unicode, whereas UTF-8 has no byte order. OTOH, if
> you encode Unicode in UTF-8, is it required to strip the BOM, or just encode
> it as UTF-8?

Ah, this makes a lot of sense.  I just remembered that the file was
originally saved as Unicode, and then re-saved as UTF-8.  I guess those
bytes just remained intact in Notepad's conversion.

- Brendan



Re: "Invalid document structure" exception?

Posted by Dean Roddey <dr...@charmedquark.com>.

Though my hex math isn't good enough to do it in my head, it looks like the
UTF-8 version of the Unicode byte order mark. I know that there was some
controversy as to whether that was legal, since the BOM is supposed to
indicate byte order for Unicode, whereas UTF-8 has no byte order. OTOH, if
you encode Unicode in UTF-8, is it required to strip the BOM, or just encode
it as UTF-8?

Anyway, it looks like that's it. I don't know if the parser was updated to
handle that or not. It used to not, because I never foresaw such a thing, and
would have thought it was illegal by the XML spec, which says it's either
Unicode with a BOM or the first thing must be <?xml. UTF-8 isn't Unicode,
though it can encode Unicode.

Of course, if that's not what it is, and it's just some random garbage, then
ignore all of this :-)
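
For the record, the hex math does work out. U+FEFF is 1111 1110 1111 1111 in binary, and the three-byte UTF-8 form 1110xxxx 10xxxxxx 10xxxxxx splits those bits 4/6/6, giving EF BB BF. The sketch below does that arithmetic and then uses the result to skip a leading BOM before handing a buffer to a parser. Both function names are invented for the example, and whether skipping the BOM is the right workaround for a given Xerces version is an assumption, not something established in this thread.

```cpp
#include <cassert>
#include <cstddef>

// Encode one code point in the range U+0800..U+FFFF as three UTF-8 bytes.
void encodeThreeByte(unsigned int cp, unsigned char out[3]) {
    assert(cp >= 0x0800 && cp <= 0xFFFF);
    out[0] = static_cast<unsigned char>(0xE0 | (cp >> 12));          // 1110xxxx
    out[1] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));  // 10xxxxxx
    out[2] = static_cast<unsigned char>(0x80 | (cp & 0x3F));         // 10xxxxxx
}

// Return how many leading bytes to skip: 3 if the buffer starts with the
// UTF-8 encoding of U+FEFF, otherwise 0.
std::size_t utf8BomLength(const unsigned char* data, std::size_t size) {
    unsigned char bom[3];
    encodeThreeByte(0xFEFF, bom);
    if (size >= 3 && data[0] == bom[0] && data[1] == bom[1] && data[2] == bom[2]) {
        return 3;
    }
    return 0;
}
```

On the dump from the original message the function returns 3, so parsing would start at the 3c 3f 78 6d 6c ("<?xml") bytes.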

--------------------------
Dean Roddey
The Charmed Quark Controller
Charmed Quark Software
droddey@charmedquark.com
http://www.charmedquark.com

"If it don't have a control port, don't buy it!"


----- Original Message -----
From: "Brendan Reville" <br...@tenzing.com>
To: <xe...@xml.apache.org>
Sent: Saturday, January 19, 2002 3:13 PM
Subject: "Invalid document structure" exception?


> hi all,
>
> I have a text .xml file which I saved in Windows Notepad as a UTF-8 file.
>
> However, when I try to parse it with Xerces, I get an "invalid document
> structure" exception on line 1, character 1.  Any idea why I would be
> getting this?
>
>
> The very beginning of the file looks like this:
>
> ï»¿<?xml versio
>
> A binary dump of the same gives this:
>
> ef bb bf 3c 3f 78 6d 6c   20 76 65 72 73 69 6f 6e
>
> if I'm not mistaken.  I'm not sure what those first three bytes are, but I
> didn't expect to see them there; they don't show up in Notepad, that's for
> sure.
>
> thanks
>
> - Brendan
>
>