Posted to j-users@xerces.apache.org by "Hondros, Constantine" <Co...@nl.compuware.com> on 2003/06/25 18:49:38 UTC

Problem extracting Japanese characters in straight SAX parse

I'm parsing a UTF-16 Japanese XML file with Xerces 2.4, using a simple class
that extends DefaultHandler. I am just trying to write certain CDATA
attribute values (these contain the Japanese characters) out to a file: very
simple, supposedly.

The problem is that there is some sort of encoding mischief going on: the UTF-16
Japanese characters in the CDATA attributes are coming out horribly mangled.

This is how I am initiating the parse:

	XMLReader parser = XMLReaderFactory.createXMLReader(DEFAULT_PARSER_NAME);
	parser.setFeature(VALIDATION_FEATURE_ID, false);
	parser.setContentHandler(this);
	parser.setErrorHandler(this);
	parser.setEntityResolver(new DTDResolver());
	FileReader reader = new FileReader(tocFile);
	InputSource source = new InputSource(reader);
	source.setEncoding("UTF-16");
	source.setSystemId(tocFile.getAbsolutePath());
	parser.parse(source);

and this (simplified) is how I am grabbing the Japanese characters (I am
appending them to a StringBuffer):

	public void startElement(String uri, String local, String qname,
			Attributes attrs) throws SAXException {
		myStringBuffer.append(attrs.getValue("myattribute"));
	}

So, two questions: should I be using a FileReader when I initiate the parse,
or some other object from the IO family?
And: is it naive to expect the characters to pop off the attrs parameter
without having to do some extra work?

Any hints greatly appreciated,

Constantine Hondros
  





---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org


Re: Problem extracting Japanese characters in straight SAX parse

Posted by Michael Glavassevich <mr...@apache.org>.
Hello Constantine,

It looks like your problem is with FileReader. It decodes the file using the
default character encoding for your system, which may be UTF-8, EBCDIC, or
something else. When you pass a Reader to the parser, any available
encoding information (including the encoding you set on the InputSource)
isn't used because the parser doesn't read from the underlying byte stream.
It only sees the transcoded characters.
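To make the transcoding effect concrete, here is a minimal sketch (not taken from the original code) that reads the same UTF-16 bytes through a Reader twice, once with the right charset and once with a wrong one. The class name TranscodingSketch and the choice of ISO-8859-1 as the "wrong" default are assumptions made purely for illustration; your platform default may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TranscodingSketch {

    // Reads every character a Reader produces when decoding the bytes with the given charset.
    static String readAll(byte[] bytes, Charset charset) throws IOException {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        String original = "\u65E5\u672C\u8A9E"; // three Japanese characters
        byte[] utf16Bytes = original.getBytes(StandardCharsets.UTF_16);

        // Decoding with the charset the bytes were written in recovers the text.
        String right = readAll(utf16Bytes, StandardCharsets.UTF_16);
        // Decoding with an unrelated charset (standing in for a platform default)
        // yields one bogus character per byte; the original text is lost.
        String wrong = readAll(utf16Bytes, StandardCharsets.ISO_8859_1);

        System.out.println(original.equals(right)); // prints "true"
        System.out.println(original.equals(wrong)); // prints "false"
    }
}
```

Once the Reader has produced the wrong characters, nothing downstream (parser or handler) can repair them, which is why the attribute values look mangled.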

Unless you have a good reason not to, you should let the parser detect
the encoding itself. For instance, you could create a FileInputStream
instead, and set this on your InputSource.
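A minimal sketch of that suggestion follows. It is self-contained rather than a drop-in for the original class: the class name ByteStreamParseSketch, the collect helper, and the sample document written in main are assumptions for illustration, and it obtains the XMLReader through the standard JAXP SAXParserFactory instead of a Xerces-specific parser name.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class ByteStreamParseSketch {

    // Concatenates the values of the named attribute from every element that carries it.
    static String collect(File xmlFile, final String attrName) throws Exception {
        final StringBuilder out = new StringBuilder();
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qname, Attributes attrs) {
                String value = attrs.getValue(attrName);
                if (value != null) {
                    out.append(value);
                }
            }
        });
        // Hand the parser raw bytes so it can detect the encoding itself
        // (from the byte order mark and the XML declaration).
        try (InputStream in = new FileInputStream(xmlFile)) {
            InputSource source = new InputSource(in);
            source.setSystemId(xmlFile.toURI().toString());
            parser.parse(source);
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Write a small UTF-16 sample file (hypothetical content) to parse.
        File file = File.createTempFile("toc", ".xml");
        file.deleteOnExit();
        try (Writer w = new OutputStreamWriter(new FileOutputStream(file), "UTF-16")) {
            w.write("<?xml version=\"1.0\" encoding=\"UTF-16\"?>"
                    + "<toc><item myattribute=\"\u65E5\u672C\u8A9E\"/></toc>");
        }
        String value = collect(file, "myattribute");
        System.out.println(value.equals("\u65E5\u672C\u8A9E")); // prints "true"
    }
}
```

With a byte stream the attribute values arrive in the handler as ordinary Java strings, so no extra work is needed in startElement. If the collected text is later written to a file, the output side needs an explicit encoding as well (for example an OutputStreamWriter with a named charset), since FileWriter has the same default-encoding behaviour as FileReader.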

Hope that helps.


--------------------
Michael Glavassevich
mrglavas@apache.org