Posted to c-dev@xerces.apache.org by Greg Farrell <gf...@elandtech.com> on 2003/10/17 17:09:31 UTC

Linux xerces parser problem

Hi,

  we use the Xerces-C++ XML parser v1.7 to read in XML data. 

The following text is correctly parsed in windows,

<DATA>
	(†[ASSIGN,Found,TEXT,YES])|([ASSIGN,Found,TEXT,FALSE])    
</DATA>
 
however in linux the cross of Lorraine (†) character is stripped, as
is the data immediately after it. This also happens with xerces 2.3.
Can anyone suggest a way around this problem? Or even better, a fix
for it.

 thanks in advance,
    
      Greg


---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-c-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-c-dev-help@xml.apache.org


Re: Linux xerces parser problem

Posted by David Sheldon <dw...@decisionsoft.com>.
On Fri, Oct 17, 2003 at 04:58:40PM +0100, David Sheldon wrote:
> Correctly representing the character in UTF-8 might be possible,
> depending on what you are using to create the document, however I am not
> convinced this would be the easiest. In this case this character should
> be represented by the 4 bytes 0xe2, 0x98, 0xa8, 0x0a

Oops, it is just the first 3 bytes of this. The other byte is the line
end character from my test file. Sorry about that.

David
-- 
David Sheldon, Client Services        DecisionSoft Ltd.
Telephone: +44-1865-203192            http://www.decisionsoft.com



Re: Linux xerces parser problem

Posted by David Sheldon <dw...@decisionsoft.com>.
On Fri, Oct 17, 2003 at 04:09:31PM +0100, Greg Farrell wrote:
> The following text is correctly parsed in windows,
> 
> <DATA>
> 	(†[ASSIGN,Found,TEXT,YES])|([ASSIGN,Found,TEXT,FALSE])    
> </DATA>
>  
> however in linux the cross of Lorraine (†) character is stripped, as
> is the data immediately after it. This also happens with xerces 2.3.
> Can anyone suggest a way around this problem? Or even better, a fix
> for it.


I think that the problem here is character sets. I am assuming that your
XML does not have an XML declaration. According to [1], in the absence
of a declaration specifying an encoding (and no hint provided by an
external transport protocol) then UTF-8 should be assumed. 

Byte 134 (0x86) is not a valid lead byte for a UTF-8 multi-byte
sequence; it can only appear as a continuation byte. Hence your document
is not well-formed.
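To make the point about byte 134 concrete, here is a small sketch in Python (chosen only for illustration; the thread itself concerns Xerces-C). The sample bytes and the helper names are assumptions, not something taken from the original file.

```python
# Sample bytes modelled on the thread: byte 134 (0x86) sits where the
# special character was typed. Names here are illustrative only.
raw = b"(\x86[ASSIGN,Found,TEXT,YES])"

def is_continuation_byte(byte: int) -> bool:
    """True if the byte has the UTF-8 continuation pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80

def decodes_as_utf8(data: bytes) -> bool:
    """True if a strict UTF-8 decoder accepts the data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# 0x86 is 0b10000110, so it can only continue a sequence, never start one.
lead_is_continuation = is_continuation_byte(0x86)
accepted_as_utf8 = decodes_as_utf8(raw)

# The same bytes decode cleanly once a Windows codepage is assumed.
as_cp1251 = raw.decode("cp1251")
```

A strict decoder rejects the input outright; a parser that stops or drops data at that point would look much like the behaviour reported on Linux.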

In order for it to be well-formed, you should either add an XML
declaration stating that the text is in the Windows-1251 encoding,
correctly represent your character in UTF-8, or include it as a
character reference.

Adding the declaration would probably be the easiest, as it would just
involve adding the line

<?xml version="1.0" encoding="windows-1251"?>

at the beginning of your document (the version attribute is required in
an XML declaration). However, I am not sure that Xerces-C 1.7 would
support transcoding from the Windows codepage.
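As a rough illustration of what the declaration changes, the sketch below parses the same bytes with and without it. It uses Python's xml.etree.ElementTree rather than Xerces-C, so it says nothing about which encodings Xerces-C 1.7 can transcode; the sample bytes and the try_parse helper are assumptions.

```python
import xml.etree.ElementTree as ET

# Element content with the raw byte 0x86 in place of the special character.
body = b"<DATA>(\x86[ASSIGN,Found,TEXT,YES])</DATA>"

def try_parse(document: bytes):
    """Return the root element's text, or None if the parser rejects it."""
    try:
        return ET.fromstring(document).text
    except ET.ParseError:
        return None

# No declaration: the parser assumes UTF-8 and rejects byte 0x86.
undeclared = try_parse(body)

# With a declaration naming the codepage, the byte is transcoded.
declared = try_parse(b'<?xml version="1.0" encoding="windows-1251"?>' + body)
```

Whether a given Xerces-C build accepts the same declaration depends on the transcoder it was built with, which is the uncertainty raised above.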

Correctly representing the character in UTF-8 might be possible,
depending on what you are using to create the document, however I am not
convinced this would be the easiest. In this case this character should
be represented by the 4 bytes 0xe2, 0x98, 0xa8, 0x0a
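The UTF-8 bytes for a given character can be checked with a few lines of Python. This is only a sketch, and the utf8_hex helper is a made-up name. Note that Python's cp1251 and cp1252 codecs both map byte 0x86 to U+2020 (DAGGER) rather than U+2628, so the bytes to write depend on which character is actually intended.

```python
def utf8_hex(text: str) -> str:
    """Format the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"0x{b:02x}" for b in text.encode("utf-8"))

# U+2628 is the code point the thread gives for the cross of Lorraine.
cross_bytes = utf8_hex("\u2628")

# What byte 134 (0x86) becomes when read as Windows-1251.
from_codepage = b"\x86".decode("cp1251")
codepage_bytes = utf8_hex(from_codepage)
```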

According to [2], the character reference for the "cross of Lorraine"
is &#x2628;, so your document would be well-formed were it

<DATA>
	(&#x2628;[ASSIGN,Found,TEXT,YES])|([ASSIGN,Found,TEXT,FALSE])    
</DATA>
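The appeal of the character reference is that the file stays pure ASCII, so the encoding question never arises. A minimal check, again in Python with xml.etree.ElementTree standing in for Xerces-C (an assumption made for illustration):

```python
import xml.etree.ElementTree as ET

# The document from above, written with a numeric character reference.
doc = b"<DATA>(&#x2628;[ASSIGN,Found,TEXT,YES])|([ASSIGN,Found,TEXT,FALSE])</DATA>"

# Every byte is below 0x80, so the file reads the same as ASCII or UTF-8.
is_pure_ascii = all(byte < 0x80 for byte in doc)

# The parser expands the reference to the single character U+2628.
text = ET.fromstring(doc).text
second_char = text[1]
```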

I hope that this helps. I do find that a lot of problems with XML
documents are due to character set mismatches such as this.

David

[1] http://www.w3.org/TR/REC-xml#NT-EncodingDecl
[2] http://ppewww.ph.gla.ac.uk/~flavell/unicode/unidata26.html
-- 
David Sheldon, Client Services        DecisionSoft Ltd.
Telephone: +44-1865-203192            http://www.decisionsoft.com
