Posted to j-users@xerces.apache.org by Jon Shoberg <js...@cbd.net> on 2001/12/31 16:29:05 UTC

Can Xerces Force Encoding Type ?

In the documents I am parsing there is no <?xml version='1.0'
encoding='ISO-8859-1'?> declaration.  Take a quick look at the following
URL:

http://dmoz.org/rdf/structure.example.txt

Xerces parses it OK except when it hits an n-tilde or similar character.  Is
there a way to force the encoding type within the Xerces SAX parser?  I haven't
seen this noted in the docs or archives.  If feasible, this would be a more
elegant solution than the overhead of writing a single line to a
1.2GB file every time it needs to be parsed.

Thanks

Jon


---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org


using xpath

Posted by tom john <cy...@yahoo.com>.
Hi,

I have an XML file similar to the following:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<addbook>
   <person>
      <firstname>AAA</firstname>
      <lastname>ZZZ</lastname>
      <telephone>11111</telephone>
   </person>
   <person>
      <firstname>BBB</firstname>
      <lastname>YYY</lastname>
      <telephone>22222</telephone>
   </person>
   <person>
      <firstname>CCC</firstname>
      <lastname>XXX</lastname>
      <telephone>33333</telephone>
   </person>
</addbook>

I would like to get the person node when I give the
first and the last name.

For example, if I give:
firstname = BBB & lastname = YYY

I should get:

<person>
  <firstname>BBB</firstname>
  <lastname>YYY</lastname>
  <telephone>22222</telephone>
</person>


Is it possible using XPath? How?
Hope someone can help.

tom
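
One way this could be done is sketched below. It assumes the standard
javax.xml.xpath API (part of the JDK since Java 5, so newer than this
thread) and assumes each <firstname> element is closed with
</firstname> so that the file is well-formed. The names PersonLookup,
findPerson, and parse are hypothetical, chosen only for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class PersonLookup {
    // Returns the first <person> whose firstname and lastname children
    // match, or null when no such element exists.
    static Node findPerson(Document doc, String first, String last) throws Exception {
        // The values are spliced into the expression as string literals;
        // this sketch assumes they contain no single-quote characters.
        String expr = "/addbook/person[firstname='" + first
                + "' and lastname='" + last + "']";
        return (Node) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.NODE);
    }

    // Helper that builds a DOM tree from an XML string.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }
}
```

The predicate [firstname='BBB' and lastname='YYY'] is what selects the
matching person; the returned Node can then be serialized or walked
like any other DOM node.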



RE: Can Xerces Force Encoding Type ?

Posted by Elliotte Rusty Harold <el...@metalab.unc.edu>.
>My document does not contain an XMLDecl,
>
>http://dmoz.org/rdf/structure.example.txt
>
>How can I force Xerces to use an encoding type rather than default (UTF-8)?
>Surely an over-ride parameter/function was developed.  Any resources on
>this? If the answer is in the docs or samples, it sure is buried ...
>

You can use a byte-order mark or external metadata (e.g. an HTTP 
Content-type header). You can then wrap the input stream in an 
InputSource whose encoding is set from that external metadata. 
Looking at your document, though, the server does not send an 
explicit charset:

Connected to dmoz.org.
Escape character is '^]'.
GET /rdf/structure.example.txt HTTP/1.0

HTTP/1.1 200 OK
Date: Wed, 02 Jan 2002 00:03:00 GMT
Server: Apache/1.3.9 (Unix)
Last-Modified: Thu, 28 Jan 1999 21:19:28 GMT
ETag: "5b58e3-80cc-36b0d460"
Accept-Ranges: bytes
Content-Length: 32972
Connection: close
Content-Type: text/plain

Here, the Content-type header doesn't include a charset. However, 
according to the HTTP 1.1 spec "When no explicit charset parameter is 
provided by the sender, media subtypes of the "text" type are defined 
to have a default charset value of "ISO-8859-1" when received via 
HTTP." so the external metadata here (text/plain) should be 
interpreted as meaning that the XML document is encoded in Latin-1. 
You'll need to call setEncoding() on the InputSource object to 
specify this. Then parse the InputSource object. See 
http://www.cafeconleche.org/books/xmljava/chapters/ch07s02.html#d0e8813 
for details and examples.
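
The setEncoding() approach described above might look like the
following minimal sketch. It assumes a parser obtained through JAXP
(SAXParserFactory) rather than a directly instantiated Xerces class,
and Latin1Parse and parseLatin1 are hypothetical names. The handler
only collects character data so the effect of the encoding is visible.

```java
import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class Latin1Parse {
    // Parses a byte stream that has no XML declaration, telling the
    // parser to decode it as ISO-8859-1 instead of the UTF-8 default.
    static String parseLatin1(InputStream in) throws Exception {
        InputSource source = new InputSource(in);
        source.setEncoding("ISO-8859-1");  // encoding taken from external metadata

        StringBuilder text = new StringBuilder();
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);  // accumulate all character data
            }
        });
        parser.parse(source);
        return text.toString();
    }
}
```

For a large file, the InputStream would typically be a FileInputStream
(ideally buffered); nothing has to be written to the file itself.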
-- 

+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo@metalab.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+
|          The XML Bible, 2nd Edition (Hungry Minds, 2001)           |
|              http://www.ibiblio.org/xml/books/bible2/              |
|   http://www.amazon.com/exec/obidos/ISBN=0764547607/cafeaulaitA/   |
+----------------------------------+---------------------------------+
|  Read Cafe au Lait for Java News:  http://www.cafeaulait.org/      |
|  Read Cafe con Leche for XML News: http://www.ibiblio.org/xml/     |
+----------------------------------+---------------------------------+



RE: Can Xerces Force Encoding Type ?

Posted by Jon Shoberg <js...@cbd.net>.
My document does not contain an XMLDecl,

http://dmoz.org/rdf/structure.example.txt

How can I force Xerces to use an encoding type rather than default (UTF-8)?
Surely an over-ride parameter/function was developed.  Any resources on
this? If the answer is in the docs or samples, it sure is buried ...

Thanks,
Jon

-----Original Message-----
From: Andy Clark [mailto:andyc@apache.org]
Sent: Tuesday, January 01, 2002 12:00 PM
To: xerces-j-user@xml.apache.org
Subject: Re: Can Xerces Force Encoding Type ?


Milind Gadre wrote:
> I may be missing something here, but shouldn't this be handled
> by the parser?

It is, but if the document does not contain an XMLDecl (that
"<?xml ..." line), then it has no way of reliably detecting
the document's encoding. Therefore, in the case where the
document does not specify the encoding it uses, the parser
MUST assume UTF-8 encoding. And the document in question
did not specify the encoding and tried to use characters
that were encoded in ISO Latin 1, not UTF-8.

--
Andy Clark * andyc@apache.org



Re: Can Xerces Force Encoding Type ?

Posted by Andy Clark <an...@apache.org>.
Milind Gadre wrote:
> I may be missing something here, but shouldn't this be handled 
> by the parser? 

It is, but if the document does not contain an XMLDecl (that
"<?xml ..." line), then it has no way of reliably detecting
the document's encoding. Therefore, in the case where the
document does not specify the encoding it uses, the parser
MUST assume UTF-8 encoding. And the document in question 
did not specify the encoding and tried to use characters
that were encoded in ISO Latin 1, not UTF-8.
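
To illustrate why the n-tilde trips the parser: in ISO-8859-1 it is
the single byte 0xF1, which UTF-8 treats as the lead byte of a
multi-byte sequence, so it is malformed unless continuation bytes
follow. The sketch below checks raw bytes with java.nio.charset and
does not go through Xerces; EncodingMismatch and isValidUtf8 are
hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class EncodingMismatch {
    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Passing the Latin-1 bytes of a string containing an n-tilde to
isValidUtf8 returns false, while the UTF-8 bytes of the same string
return true.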

-- 
Andy Clark * andyc@apache.org



Re: Can Xerces Force Encoding Type ?

Posted by Milind Gadre <mi...@ecplatforms.com>.
I may be missing something here, but shouldn't this be handled by the
parser? After all, it knows what the encoding attribute of the XML
document is.


> Jon Shoberg wrote:
> > Xerces parses it ok except for when it hits a n-tilde or like character.  Is
> > there a way to force the encoding type within Xerces-SAX parser?  I haven't
>
> Yes. Instead of using an InputStream, use a Reader that knows
> how to decode ISO-Latin-1 (ISO-8859-1). For example:
>
>   InputStream stream = /* ... */;
>   Reader reader = new InputStreamReader(stream, "ISO8859_1");
>   InputSource source = new InputSource(reader);
>
>   XMLReader parser = /* ... */;
>   parser.parse(source);
>
> --
> Andy Clark * andyc@apache.org
>






Re: Can Xerces Force Encoding Type ?

Posted by Andy Clark <an...@apache.org>.
Jon Shoberg wrote:
> Xerces parses it ok except for when it hits a n-tilde or like character.  Is
> there a way to force the encoding type within Xerces-SAX parser?  I haven't

Yes. Instead of using an InputStream, use a Reader that knows
how to decode ISO-Latin-1 (ISO-8859-1). For example:

  InputStream stream = /* ... */;
  // Decode the bytes as Latin-1 before the parser sees them
  Reader reader = new InputStreamReader(stream, "ISO8859_1");
  InputSource source = new InputSource(reader);

  XMLReader parser = /* ... */;
  parser.parse(source);

-- 
Andy Clark * andyc@apache.org
