Posted to j-users@xerces.apache.org by Jonathan Whitall <fi...@yahoo.com> on 2003/05/15 15:39:07 UTC

character encodings with Xerces

Hello,

I was wondering if Xerces can convert from one text
encoding to another specified one on the fly.  I have
some data that is stored in UTF-8 in a database, and I
want to be able to create text nodes which are in the
set of Latin-1.  If I pass UTF-8 to, say, the creator
of a text node, can it convert this automatically, or
do I have to lop off the bytes that I don't want
manually?

Thank you very much for your help,
Jonathan

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org

Re: character encodings with Xerces

Posted by Michael Rafael Glavassevich <mr...@engmail.uwaterloo.ca>.

Hi Jonathan,

You cannot use the serializer to filter out characters; if they cannot be
represented in the output encoding, they will be written as character
references. DOM has no such character filtering features either. The
simplest solution would be to wrap your own java.io.Reader around the
UTF-8 Reader which reads the data from your database, and then filter out
the code points which don't appear in Latin-1. See the JDK API docs for
java.lang.Character.UnicodeBlock.
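
A minimal sketch of such a wrapping Reader, assuming Java 7 or later. The
class name Latin1FilterReader and its filter helper are hypothetical, and
treating "Latin-1" as the Basic Latin and Latin-1 Supplement blocks
(U+0000 to U+00FF) is an assumption about what should be kept:

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical helper: drops every character outside U+0000..U+00FF. */
final class Latin1FilterReader extends FilterReader {

    Latin1FilterReader(Reader in) {
        super(in);
    }

    // Assumption: "Latin-1" means the two Unicode blocks below.
    private static boolean isLatin1(char c) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
        return block == Character.UnicodeBlock.BASIC_LATIN
                || block == Character.UnicodeBlock.LATIN_1_SUPPLEMENT;
    }

    @Override
    public int read() throws IOException {
        int c;
        // Skip characters until one is in range or the stream ends.
        while ((c = super.read()) != -1 && !isLatin1((char) c)) {
            // discard
        }
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = super.read(buf, off, len);
            if (n <= 0) {
                return n; // end of stream (or nothing read)
            }
            int kept = 0;
            for (int i = 0; i < n; i++) {
                char ch = buf[off + i];
                if (isLatin1(ch)) {
                    buf[off + kept++] = ch; // compact kept chars in place
                }
            }
            if (kept > 0) {
                return kept;
            }
            // The whole chunk was filtered out; read the next one.
        }
    }

    /** Convenience wrapper for trying the filter on an in-memory String. */
    static String filter(String s) {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[64];
        try (Reader r = new Latin1FilterReader(new StringReader(s))) {
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

In real use the wrapped Reader would be the one delivering the database
text rather than a StringReader. Surrogate pairs fall outside both blocks,
so both halves are dropped as well.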

-----------------------------
Michael Glavassevich
mrglavas@engmail.uwaterloo.ca
4B Computer Engineering
University of Waterloo

On Thu, 15 May 2003, Jonathan Whitall wrote:

> > If your application is reading the UTF-8 bytes coming
> > from the database and wants to create, for example, DOM
> > text nodes, then you need to convert the bytes into
> > Java Strings to create the nodes. But this is easy in
> > code.
> >
> > Don't confuse the input/output encoding of a document
> > with the encoding of the internal storage of those
> > characters. Internally, Java stores everything in
> > two-byte Unicode characters. Therefore, Xerces does NOT
> > create nodes in UTF-8 or ISO Latin-1 byte sequences.
> >
> > The parser only reads an XML document into an internal
> > format (e.g. SAX or DOM). For writing the document back
> > to a file (or stream), you would use a serializer with
> > the intended output encoding. The Xerces package comes
> > with serializers for this purpose.
> >
> > Does this answer your question?
>
> Hi,
>
> Yes, I am using DOM.  I did play around with
> XMLSerializer and was able to set the outbound
> encoding to Latin-1 without any problems.  The
> characters in question that weren't in the bounds of
> my outbound encoding got converted to entity
> representation (e.g. &#350;).  This is certainly
> better than sending the actual Unicode character, but
> what I really want to do is filter out all of these
> characters that don't fall within the bounds of
> Latin-1.  Is there a way to scan and inspect all of
> the entities in a particular document, or to
> automatically filter them out on serialization?
>
> Thanks,
> Jonathan

Re: xerces: no grammar found

Posted by "K. Venugopal" <k....@sun.com>.

Please refer to
http://xml.apache.org/xerces2-j/faqs.html
http://xml.apache.org/xerces2-j/features.html
http://xml.apache.org/xerces2-j/properties.html

for more info.

Regards
venu



K. Venugopal wrote:

>
> Hi Steve,
>
> You must have set the validation feature to true; set it to false.
>
> Regards
> venu
>
> Steve Guo wrote:
>
>>     I was trying to validate a simple example on schema from XML
>>     Bible, but I got
>>
>>     Document is invalid: no grammar found
>>
>>     Document root element "novel" must match DOCTYPE root "null"
>>
>>     My XML file does not use DOCTYPE. - why the error?
>>
>>     Thanks all

Re: xerces: no grammar found

Posted by "K. Venugopal" <k....@sun.com>.

please mail your xml and xsd file
-venu


Steve Guo wrote:

> I followed the book example and used the -v option (with the -V option
> there is no error). I want to validate the document against the schema,
> so I used -v. Did I misunderstand the options?
>
> "K. Venugopal" <k....@sun.com> wrote:
>
>
>     Hi Steve,
>
>     You must have set the validation feature to true; set it to false.
>
>     Regards
>     venu
>
>     Steve Guo wrote:
>
>>         I was trying to validate a simple example on schema from XML
>>         Bible, but I got
>>
>>         Document is invalid: no grammar found
>>
>>         Document root element "novel" must match DOCTYPE root "null"
>>
>>         My XML file does not use DOCTYPE. - why the error?
>>
>>         Thanks all

Re: xerces: no grammar found

Posted by Steve Guo <co...@yahoo.com>.

I followed the book example and used the -v option (with the -V option
there is no error). I want to validate the document against the schema,
so I used -v. Did I misunderstand the options?

"K. Venugopal" <k....@sun.com> wrote:

> Hi Steve,
>
> You must have set the validation feature to true; set it to false.
>
> Regards
> venu
>
> Steve Guo wrote:
>
> > I was trying to validate a simple example on schema from XML Bible, but I got
> >
> > Document is invalid: no grammar found
> >
> > Document root element "novel" must match DOCTYPE root "null"
> >
> > My XML file does not use DOCTYPE. - why the error?
> >
> > Thanks all


Re: xerces: no grammar found

Posted by "K. Venugopal" <k....@sun.com>.

Hi Steve,

You must have set the validation feature to true; set it to false.
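
A minimal sketch of why the validation feature produces this error,
assuming the JAXP parser bundled with the JDK (which is derived from
Xerces) rather than the Xerces command-line samples. With DTD validation
switched on and no DOCTYPE in the document, the parser has no grammar to
check against and reports errors. Attaching a schema through
javax.xml.validation is a swapped-in way to validate against an XSD
instead; the class name GrammarCheck and the inline XML and XSD are
hypothetical:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

/** Hypothetical demo: DTD validation without a DOCTYPE vs. schema validation. */
final class GrammarCheck {

    // Hypothetical document with no DOCTYPE declaration.
    static final String XML = "<novel><title>Example</title></novel>";

    // Hypothetical schema describing that document.
    static final String XSD =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
          + "<xs:element name='novel'><xs:complexType><xs:sequence>"
          + "<xs:element name='title' type='xs:string'/>"
          + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    /** Parses XML and returns the messages of all recoverable validation errors. */
    static List<String> parse(boolean dtdValidation, boolean withSchema) {
        final List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);        // required for schema validation
            f.setValidating(dtdValidation);   // this switch means DTD validation
            if (withSchema) {
                f.setSchema(SchemaFactory
                        .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                        .newSchema(new StreamSource(new StringReader(XSD))));
            }
            DocumentBuilder b = f.newDocumentBuilder();
            b.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) {
                    errors.add(e.getMessage());
                }
                @Override public void fatalError(SAXParseException e)
                        throws SAXParseException {
                    throw e;
                }
            });
            b.parse(new InputSource(new StringReader(XML)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("DTD validation, no DOCTYPE: " + parse(true, false));
        System.out.println("Schema validation:          " + parse(false, true));
    }
}
```

In Xerces itself, schema validation is controlled by a separate feature,
http://apache.org/xml/features/validation/schema, described on the
features page of the Xerces2-J documentation.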

Regards
venu

Steve Guo wrote:

>     I was trying to validate a simple example on schema from XML
>     Bible, but I got
>
>     Document is invalid: no grammar found
>
>     Document root element "novel" must match DOCTYPE root "null"
>
>     My XML file does not use DOCTYPE. - why the error?
>
>     Thanks all

xerces: no grammar found

Posted by Steve Guo <co...@yahoo.com>.

I was trying to validate a simple example on schema from XML Bible, but I got 

Document is invalid: no grammar found

Document root element "novel" must match DOCTYPE root "null"

My XML file does not use DOCTYPE. - why the error?

Thanks all



Re: character encodings with Xerces

Posted by Jonathan Whitall <fi...@yahoo.com>.

> If your application is reading the UTF-8 bytes coming
> from the database and wants to create, for example, DOM
> text nodes, then you need to convert the bytes into
> Java Strings to create the nodes. But this is easy in
> code.
>
> Don't confuse the input/output encoding of a document
> with the encoding of the internal storage of those
> characters. Internally, Java stores everything in
> two-byte Unicode characters. Therefore, Xerces does NOT
> create nodes in UTF-8 or ISO Latin-1 byte sequences.
>
> The parser only reads an XML document into an internal
> format (e.g. SAX or DOM). For writing the document back
> to a file (or stream), you would use a serializer with
> the intended output encoding. The Xerces package comes
> with serializers for this purpose.
>
> Does this answer your question?

Hi,

Yes, I am using DOM.  I did play around with
XMLSerializer and was able to set the outbound
encoding to Latin-1 without any problems.  The
characters in question that weren't in the bounds of
my outbound encoding got converted to entity
representation (e.g. &#350;).  This is certainly
better than sending the actual Unicode character, but
what I really want to do is filter out all of these
characters that don't fall within the bounds of
Latin-1.  Is there a way to scan and inspect all of
the entities in a particular document, or to
automatically filter them out on serialization?

Thanks,
Jonathan

Re: character encodings with Xerces

Posted by Andy Clark <an...@apache.org>.

Jonathan Whitall wrote:
> I was wondering if Xerces can convert from one text
> encoding to another specified one on the fly.  I have
> some data that is stored in UTF-8 in a database, and I
> want to be able to create text nodes which are in the
> set of Latin-1.  If I pass UTF-8 to, say, the creator
> of a text node, can it convert this automatically, or
> do I have to lop off the bytes that I don't want
> manually?

If your application is reading the UTF-8 bytes coming
from the database and wants to create, for example, DOM
text nodes, then you need to convert the bytes into
Java Strings to create the nodes. But this is easy in
code.
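
A minimal sketch of that conversion, assuming Java 7 or later and the
JAXP DOM implementation shipped with the JDK. The class and method names
are hypothetical, and the byte array stands in for whatever the database
driver returns:

```java
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Text;

/** Hypothetical helper: decode UTF-8 bytes, then build a DOM text node. */
final class Utf8ToTextNode {

    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    static Text textNodeFromUtf8(Document doc, byte[] utf8) {
        // Decode explicitly as UTF-8; the resulting String carries no
        // byte encoding of its own.
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        return doc.createTextNode(decoded);
    }

    public static void main(String[] args) {
        // 0xC5 0x9E is the UTF-8 encoding of U+015E.
        byte[] fromDatabase = {(byte) 0xC5, (byte) 0x9E};
        Text node = textNodeFromUtf8(newDocument(), fromDatabase);
        System.out.println(node.getData().length()); // one char, not two bytes
    }
}
```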

Don't confuse the input/output encoding of a document
with the encoding of the internal storage of those
characters. Internally, Java stores everything in
two-byte Unicode characters. Therefore, Xerces does NOT
create nodes in UTF-8 or ISO Latin-1 byte sequences.

The parser only reads an XML document into an internal
format (e.g. SAX or DOM). For writing the document back
to a file (or stream), you would use a serializer with
the intended output encoding. The Xerces package comes
with serializers for this purpose.
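
A minimal sketch of serializing with an explicit output encoding. It
uses the standard javax.xml.transform API as a stand-in, not the Xerces
XMLSerializer class, so it is an illustration of the idea rather than
the Xerces call sequence. The class name, element name, and sample text
are hypothetical:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical demo: write a DOM tree out as ISO-8859-1 (Latin-1). */
final class Latin1Serialize {

    static String serialize(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element root = doc.createElement("name");
            root.appendChild(doc.createTextNode(text));
            doc.appendChild(root);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            // The encoding is a property of the output, not of the nodes.
            t.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // U+015E has no Latin-1 byte, so it cannot be written directly.
        System.out.println(serialize("\u015eule"));
    }
}
```

Characters with no Latin-1 byte come out as numeric character
references instead of being dropped, so removing them has to happen
before serialization.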

Does this answer your question?

-- 
Andy Clark * andyc@apache.org
