Posted to c-users@xerces.apache.org by "Wons, Jean-Baptiste" <Je...@kbcfp.com> on 2008/06/19 18:08:43 UTC
Sterling pound sign encoding with XML string
Hello.
I am not sure whether this is a bug in Xerces or whether I am using Xerces incorrectly.
This is my code:
<code>
#include <cstdio>
#include <string>
#include <iostream>
#include <xercesc/dom/DOM.hpp>
#include <xercesc/dom/DOMException.hpp>
#include <xercesc/dom/DOMImplementationRegistry.hpp>
#include <xercesc/framework/MemBufInputSource.hpp>
#include <xercesc/parsers/XercesDOMParser.hpp>
#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/XMLString.hpp>
using namespace std;
using namespace XERCES_CPP_NAMESPACE;

// Replace control characters (except LF and CR) and bytes above 127
// with XML numeric character references.
void replaceSpecialCharactersXML(std::string &s)
{
    string cp;
    cp.reserve(s.size() * 2);
    for (string::size_type i = 0; i < s.size(); i++)
    {
        const unsigned char c = s[i];
        if ((c < 32 && c != '\012' && c != '\015') || c > 127)
        {
            char buffer[16];  // "&#xff;" plus the terminator fits easily
            sprintf(buffer, "&#x%02x;", c);
            cp += buffer;
        }
        else
        {
            cp += c;
        }
    }
    s = cp;
}

int main()
{
    XMLPlatformUtils::Initialize();
    string aString0("This will crash ££££ ...");
    // Round trip: local code page -> UTF-16 -> local code page.
    XMLCh *fUnicodeForm = XMLString::transcode(aString0.c_str());
    char *pMsg = XMLString::transcode(fUnicodeForm);
    string res(pMsg);
    replaceSpecialCharactersXML(res);
    cout << aString0 << " -> " << pMsg << " -> " << res << endl;
    // Strings returned by transcode() must be released by the caller.
    XMLString::release(&pMsg);
    XMLString::release(&fUnicodeForm);
    XMLPlatformUtils::Terminate();
    return 0;
}
</code>
When I compile and run it, I get this output:
<output>
sh$ ./testxerces
This will crash ££££ ... -> This will crash ... -> This will crash  ...
</output>
When I transcode the £ sign to XMLCh, then transcode it back to a char*, it is transformed to 0x1a.
Is this a real bug, or am I missing something?
Regards,
Jean-Baptiste
--
This message may contain confidential, proprietary, or legally privileged information. No confidentiality or privilege is waived by any transmission to an unintended recipient. If you are not an intended recipient, please notify the sender and delete this message immediately. Any views expressed in this message are those of the sender, not those of any entity within the KBC Financial Products group of companies (together referred to as "KBC FP").
This message does not create any obligation, contractual or otherwise, on the part of KBC FP. It is not an offer (or solicitation of an offer) of, or a recommendation to buy or sell, any financial product. Any prices or other values included in this message are indicative only, and do not necessarily represent current market prices, prices at which KBC FP would enter into a transaction, or prices at which similar transactions may be carried on KBC FP's own books. The information contained in this message is provided "as is", without representations or warranties, express or implied, of any kind. Past performance is not indicative of future returns.
RE: Sterling pound sign encoding with XML string
Posted by "Wons, Jean-Baptiste" <Je...@kbcfp.com>.
Thanks, I will try that.
Regards,
JB
Re: Sterling pound sign encoding with XML string
Posted by David Bertoni <db...@apache.org>.
Wons, Jean-Baptiste wrote:
> Hello,
>
> I hardcoded the pound sign in this piece of code only to demonstrate the problem.
> In my program, I actually read the data from a file and then build an XML document with that data embedded in it.
That's a very important piece of information that you didn't mention in
your original post.
>
> Sometimes the file contains the pound sign (the file is encoded in ISO-8859-1).
> XMLString::transcode() does not handle it correctly, and when I transcode the XML back to ISO-8859-1, I get 0x1a.
XMLString::transcode() converts between UTF-16 and the local code page.
Unless you can guarantee that the local code page is ISO-8859-1, don't
use XMLString::transcode().
>
> Is there any way to work around this?
If you know the encoding of the data, then you should just create a
transcoder for that encoding. Take a look at
XMLTransService::makeNewXMLTranscoder() in
xercesc/util/TransService.hpp, and search through the code for examples
of how to use it.
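
As a rough illustration of why decoding with a fixed, known encoding is safer than going through the local code page: ISO-8859-1 maps each byte onto the Unicode code point of the same value, so the pound sign (byte 0xA3) always becomes U+00A3. The sketch below is plain C++ and does not use Xerces; latin1ToUtf16 and escapeNonAscii are hypothetical helper names, and char16_t is only a stand-in for XMLCh. In a real program, a transcoder obtained from the Xerces transcoding service would take the place of latin1ToUtf16; check TransService.hpp for the factory functions and signatures your version exposes.

```cpp
#include <cstdio>
#include <string>

// Decode ISO-8859-1 bytes to UTF-16 code units. Every ISO-8859-1 byte
// value equals its Unicode code point, so a widening copy is enough.
// (Hypothetical helper; char16_t is a stand-in for XMLCh.)
std::u16string latin1ToUtf16(const std::string& bytes)
{
    std::u16string out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes)
        out.push_back(static_cast<char16_t>(b));
    return out;
}

// Write code units outside printable ASCII as XML numeric character
// references, keeping tab, LF and CR unchanged. Markup characters such
// as '&' and '<' are NOT escaped here. Assumes the input contains no
// surrogate pairs, which holds for text decoded from ISO-8859-1.
std::string escapeNonAscii(const std::u16string& text)
{
    std::string out;
    for (char16_t cu : text) {
        const bool plain = (cu >= 0x20 && cu < 0x7F) ||
                           cu == '\t' || cu == '\n' || cu == '\r';
        if (plain) {
            out.push_back(static_cast<char>(cu));
        } else {
            char buf[16];
            std::snprintf(buf, sizeof buf, "&#x%X;", static_cast<unsigned>(cu));
            out += buf;
        }
    }
    return out;
}
```

Calling escapeNonAscii(latin1ToUtf16(data)) on bytes read from an ISO-8859-1 file gives the same result on every machine, because the local code page is never consulted.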
Dave
RE: Sterling pound sign encoding with XML string
Posted by "Wons, Jean-Baptiste" <Je...@kbcfp.com>.
Hello,
I hardcoded the pound sign in this piece of code only to demonstrate the problem.
In my program, I actually read the data from a file and then build an XML document with that data embedded in it.
Sometimes the file contains the pound sign (the file is encoded in ISO-8859-1).
XMLString::transcode() does not handle it correctly, and when I transcode the XML back to ISO-8859-1, I get 0x1a.
Is there any way to work around this?
Thanks,
Jean-Baptiste
Re: Sterling pound sign encoding with XML string
Posted by David Bertoni <db...@apache.org>.
Wons, Jean-Baptiste wrote:
> Hello.
>
> I am not sure if this is a bug in xerces or me not using xerces well.
>
> This is my code:
>
> <code>
>
> #include <string>
> #include <iostream>
>
> #include <xercesc/dom/DOM.hpp>
> #include <xercesc/dom/DOMException.hpp>
> #include <xercesc/dom/DOMImplementationRegistry.hpp>
> #include <xercesc/framework/MemBufInputSource.hpp>
> #include <xercesc/parsers/XercesDOMParser.hpp>
> #include <xercesc/util/PlatformUtils.hpp>
> #include <xercesc/util/XMLString.hpp>
>
>
> using namespace std;
> using namespace XERCES_CPP_NAMESPACE;
>
> void replaceSpecialCharactersXML(std::string &s)
> {
> string cp;
> unsigned int i;
> cp.reserve(s.size()*2);
> for (i = 0; i < s.size(); i++)
> {
> const unsigned char c = s[i];
>
> if ((c < 32 && c != '\012' && c != '\015') || c > 127)
> {
> char buffer[10000];
> sprintf(buffer, "&#x%02x;", c);
> cp += buffer;
> }
> else
> {
> cp += c;
> }
> }
> s = cp;
> }
>
>
> int main()
> {
> XMLPlatformUtils::Initialize();
> string aString0 ("This will crash ££££ ...");
> XMLCh* fUnicodeForm = XMLString::transcode(aString0.c_str());
> char *pMsg = XMLString::transcode(fUnicodeForm);
> string res(pMsg);
> replaceSpecialCharactersXML(res);
>
> cout << aString0 << " -> " << pMsg << " -> " << res << endl;
>
> return 0;
> }
>
> </code>
>
> When I compile and run, I have that output:
>
> <output>
> sh$ ./testxerces
> This will crash ££££ ... -> This will crash ... -> This will crash  ...
> </output>
I ran your code on Windows XP with the default Windows code page for
English and got the following result:
This will crash úúúú ... -> This will crash úúúú ... -> This will crash &#xa3;&#xa3;&#xa3;&#xa3; ...
The fact that your system displays "ú" instead of the pound sign is your
first clue that something is very wrong.
>
> When I transcode the £ sign to XMLCh, then transcode it back to a char*, it is transformed to 0x1a.
>
> Is it a real bug, or is it just me missing something ?
It's generally dangerous to transcode between the local code page and
Unicode because it's easy to lose data. It may be that your current
code page encodes the Unicode character U+00A3 Pound Sign as 0x1A,
although that seems unlikely. Without knowing anything about your
system's local code page, we can only guess. Also, if your code will
run on other systems, you can't make any assumptions about the local
code page.
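
To make the risk concrete: the same byte can stand for different characters depending on the code page. The sketch below is plain C++ with a hypothetical decodeByte helper and only a two-entry excerpt of CP850, picked here as an example of a console code page; it is not how Xerces transcodes. Byte 0xA3 is the pound sign in ISO-8859-1 but "ú" in CP850, where the pound sign is byte 0x9C instead.

```cpp
#include <cstdint>

// Two single-byte encodings, used for illustration only.
enum class CodePage { Latin1, Cp850 };

// Decode one byte to a Unicode code point. The CP850 branch is a
// deliberately tiny excerpt (hypothetical helper, not a full table):
// bytes it does not list come back as U+FFFD, the replacement character.
std::uint32_t decodeByte(CodePage page, unsigned char byte)
{
    if (byte < 0x80)
        return byte;                 // ASCII is shared by both code pages
    if (page == CodePage::Latin1)
        return byte;                 // ISO-8859-1: byte value == code point
    switch (byte) {                  // CP850 excerpt
        case 0x9C: return 0x00A3;    // POUND SIGN
        case 0xA3: return 0x00FA;    // LATIN SMALL LETTER U WITH ACUTE
        default:   return 0xFFFD;    // not covered by this excerpt
    }
}
```

A program that assumes one code page while the data or the terminal uses another will silently produce the wrong character, which is why the encoding has to be known rather than guessed.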
It's also dangerous to embed strings in your program with code units
outside of a very limited set, because they will be sensitive to the
compiler's idea of how characters are encoded. For example, you may be
using an editor that supports UTF-8 or ISO-8859-1, while your compiler
assumes some other encoding for the bytes of the source file. Since
your email arrived encoded in ISO-8859-1, perhaps your editor also uses
that encoding.
The best thing to do is to use Unicode strings throughout your code, and
only transcode to the local code page when you absolutely must, making
sure you never assume that any particular character can be represented
in the local code page. Also, construct hard-coded strings directly in
UTF-16, instead of embedding character string constants and transcoding
them. You can look at src/xercesc/util/XMLUni.cpp for some examples of
how to do that.
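
A minimal sketch of that idea, assuming char16_t as a stand-in for XMLCh so that it builds without Xerces; the names Utf16Unit, kFivePounds and utf16Length are hypothetical. The constant is spelled out as numeric code units, so it does not depend on the editor's encoding, the compiler's source character set, or the local code page.

```cpp
#include <cstddef>

// Stand-in for Xerces' XMLCh; an assumption so the sketch is self-contained.
typedef char16_t Utf16Unit;

// "£5" spelled out as UTF-16 code units, terminated by a zero unit.
const Utf16Unit kFivePounds[] = { 0x00A3, 0x0035, 0x0000 };

// Count the code units before the terminating zero.
std::size_t utf16Length(const Utf16Unit* s)
{
    std::size_t n = 0;
    while (s[n] != 0)
        ++n;
    return n;
}
```

With Xerces itself, the array element type would be XMLCh, and the array could be passed directly to functions that take const XMLCh*.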
Dave
RE: ICU service
Posted by "Luu, Richard" <ri...@siemens.com>.
Hi David,
I tried again with ICU 3.8.1. It failed with the same error.
I have posted this issue to the ICU support list.
PS: A quick question: when Xerces C++ is built with ICU, do the
sample programs (e.g. PParse) need any code changes to support ICU?
Thanks,
RLUU
Re: ICU service
Posted by David Bertoni <db...@apache.org>.
Luu, Richard wrote:
> Hi David,
>
> Yes, I got an error from 'make check install' during the ICU installation.
> The output is pasted below.
>
> .....
> make[1]: Making `all' in `testdata'
> make[2]: Entering directory `/plm/cynas/users/luur/icu/source/test/testdata'
> LD_LIBRARY_PATH=../../lib:../../stubdata:../../tools/ctestfw:$LD_LIBRARY_PATH ../../bin/genrb -k -q -s ../../test/testdata -d ../../test/testdata/out/build te.txt
> ../../test/testdata/te.txt:92: parse error. Stopped parsing with U_FILE_ACCESS_ERROR
> couldn't parse the file te.txt. Error:U_FILE_ACCESS_ERROR
> make[2]: *** [../../test/testdata/out/build/te.res] Error 4
> make[2]: Leaving directory `/plm/cynas/users/luur/icu/source/test/testdata'
> make[1]: *** [all-recursive] Error 2
> make[1]: Leaving directory `/plm/cynas/users/luur/icu/source/test'
> make: *** [check-recursive] Error 2
There could be many problems. Perhaps your ICU download is corrupt, or
your compiler is broken. I suggest you download the ICU distribution
again, verify it's OK, then build again.
By the way, ICU version 4.0 is only available as a development milestone
(d02), so you might want to download 3.8.1 and try building that instead.
Also, please post this problem to the ICU support list, and include the
exact Linux distribution you're using, along with the kernel version and
compiler version. Perhaps there is a bug in 4.0 that needs to be addressed.
Dave
RE: ICU service
Posted by "Luu, Richard" <ri...@siemens.com>.
Hi David,
Yes, I got an error from 'make check install' during the ICU installation.
The output is pasted below.
.....
make[1]: Making `all' in `testdata'
make[2]: Entering directory `/plm/cynas/users/luur/icu/source/test/testdata'
LD_LIBRARY_PATH=../../lib:../../stubdata:../../tools/ctestfw:$LD_LIBRARY_PATH ../../bin/genrb -k -q -s ../../test/testdata -d ../../test/testdata/out/build te.txt
../../test/testdata/te.txt:92: parse error. Stopped parsing with U_FILE_ACCESS_ERROR
couldn't parse the file te.txt. Error:U_FILE_ACCESS_ERROR
make[2]: *** [../../test/testdata/out/build/te.res] Error 4
make[2]: Leaving directory `/plm/cynas/users/luur/icu/source/test/testdata'
make[1]: *** [all-recursive] Error 2
make[1]: Leaving directory `/plm/cynas/users/luur/icu/source/test'
make: *** [check-recursive] Error 2
Re: ICU service
Posted by David Bertoni <db...@apache.org>.
Luu, Richard wrote:
> Hi all,
>
> I just built and installed ICU 4.0 and Xerces C++ 2.8 (libraries built
> with -ticu, plus the samples) on Linux. After the build finished, I ran
> PParse -? and it returned "Can't open transcoding service".
>
> What command do I use to enable or bring up the transcoding service?
This seems strange. I know that ICU 3.2.1 is the latest version of the
ICU that will work with Xerces-C 2.8 if you want to use the ICU as a
message loader, but I'm not aware of any problems using the ICU for the
transcoding service. Did you verify that the ICU built properly and was
installed correctly by running "make check install"?
Dave
ICU service
Posted by "Luu, Richard" <ri...@siemens.com>.
Hi all,
I just built and installed ICU 4.0 and Xerces C++ 2.8 (libraries built
with -ticu, plus the samples) on Linux. After the build finished, I ran
PParse -? and it returned "Can't open transcoding service".
What command do I use to enable or bring up the transcoding service?
Thanks for your help.
RLU
Re: adding xsi:schemaLocation attribute to the XML document
Posted by David Bertoni <db...@apache.org>.
Matteo Vega wrote:
> Hi,
>
> I am trying to create an XML file using DOMDocument. How can I add an xsi:schemaLocation attribute to the XML document?
You should take a look at the CreateDOMDocument sample application.
However, you will want to use the namespace-aware factory functions,
such as DOMDocument::createElementNS() and DOMElement::setAttributeNS().
Dave
adding xsi:schemaLocation attribute to the XML document
Posted by Matteo Vega <ve...@yahoo.com>.
Hi,
I am trying to create an XML file using DOMDocument. How can I add an xsi:schemaLocation attribute to the XML document?
Thank you,
-Matteo