Posted to users@jackrabbit.apache.org by anton_slutsky <as...@applevac.com> on 2007/03/09 16:58:48 UTC

SessionImpl.export*View() serialize with prolog set to UTF-8

Hello,

I'm running into a nasty little problem with the export*View() methods on
Session.  It looks like the OutputStream implementation has the encoding
part of the XML prolog hardcoded to UTF-8.  That's fine for serializing, but
it blows up when deserializing if I have any non-ASCII content (my app is
i18n'd to Spanish).  I've written a workaround and used my own
ContentHandler, but I'm wondering if UTF-8 is big enough for general usage?
And besides, I really don't want to have all those javax.xml.transform
imports all over the place doing nothing but supporting this workaround.

Thanks,
Anton
-- 
View this message in context: http://www.nabble.com/SessionImpl.export*View%28%29-serialize-with-prolog-set-to-UTF-8-tf3376445.html#a9397005
Sent from the Jackrabbit - Users mailing list archive at Nabble.com.


Re: [JCR Browser] SessionImpl.export*View() serialize with prolog set to UTF-8

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 3/9/07, anton_slutsky <as...@applevac.com> wrote:
> Jackrabbit's import is working fine.  The problem is with the export.  It
> doesn't matter what I have my properties encoded as.  The implementation
> always uses UTF-8 when serializing.  But I'll do some more research.

JCR string properties are always stored directly as Unicode
characters. It doesn't make sense to talk about the encoding of a
property. An octet encoding is only used when importing or exporting
content in XML format.
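
To illustrate the point, here is a minimal Java sketch (the class and method names are made up for this example and are not part of Jackrabbit). A string is just a sequence of Unicode characters; octets only appear once a charset is chosen for serialization.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class StringEncodingDemo {

    // Number of octets the text occupies once encoded with the given charset.
    static int octets(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // "Español", written with a Unicode escape so the source file
        // encoding does not matter.
        String value = "Espa\u00F1ol";

        // The string itself has no encoding, only characters.
        System.out.println(value.length());                               // 7
        // The same characters map to different octet sequences.
        System.out.println(octets(value, StandardCharsets.UTF_8));        // 8
        System.out.println(octets(value, StandardCharsets.ISO_8859_1));   // 7
        System.out.println(octets(value, StandardCharsets.UTF_16BE));     // 14
    }
}
```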

BR,

Jukka Zitting

Re: [JCR Browser] SessionImpl.export*View() serialize with prolog set to UTF-8

Posted by anton_slutsky <as...@applevac.com>.
Jackrabbit's import is working fine.  The problem is with the export.  It
doesn't matter what I have my properties encoded as.  The implementation
always uses UTF-8 when serializing.  But I'll do some more research.


Jukka Zitting wrote:
> 
> Hi,
> 
> On 3/9/07, anton_slutsky <as...@applevac.com> wrote:
>> Basically, Session.importXML() blows up when the following string is
>> present as a value of a property in my serialized XML: "Español".  The
>> "ñ" character is causing the problem.  With encoding="UTF-8", my SAX
>> parser complains about an invalid character.  If I set encoding to
>> UTF-16, the problem goes away.
> 
> What's the encoding of the XML document you're giving to the
> importXML() method? I.e. how is the "ñ" character encoded? The XML
> parser uses the declared encoding to transform the raw octet stream
> into characters, and it's an error if there is an octet that doesn't
> conform with the declared character encoding.
> 
> Did you try validating the XML document you're trying to import? Try
> validating the document at  http://www.validome.org/xml/ with the
> "Well-formedness only" option selected. You should get a green "The
> document is well-formed" result if everything is OK. Otherwise it's a
> problem in your document, not the Jackrabbit import implementation.
> 
> BR,
> 
> Jukka Zitting
> 
> 



Re: [JCR Browser] SessionImpl.export*View() serialize with prolog set to UTF-8

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 3/9/07, anton_slutsky <as...@applevac.com> wrote:
> Basically, Session.importXML() blows up when the following string is present
> as a value of a property in my serialized XML: "Español".  The "ñ" character
> is causing the problem.  With encoding="UTF-8", my SAX parser complains
> about an invalid character.  If I set encoding to UTF-16, the problem goes
> away.

What's the encoding of the XML document you're giving to the
importXML() method? I.e. how is the "ñ" character encoded? The XML
parser uses the declared encoding to transform the raw octet stream
into characters, and it's an error if there is an octet that doesn't
conform with the declared character encoding.
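
A small sketch of that failure mode, using only the JDK's built-in parser (the class and method names are invented for this example, and the behaviour described assumes the default JDK parser). It builds the same document three ways and checks which octet streams parse: Latin-1 octets under a UTF-8 declaration are rejected, while octets that match their declaration are accepted.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, for illustration only.
class DeclaredEncodingDemo {

    // Builds a tiny document whose prolog declares one encoding while the
    // octets are produced with a (possibly different) actual charset.
    static byte[] document(String declared, Charset actual) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + declared + "\"?>"
                + "<value>Espa\u00F1ol</value>";
        return xml.getBytes(actual);
    }

    // Returns true if the octet stream parses as well-formed XML.
    static boolean isWellFormed(byte[] octets) throws ParserConfigurationException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new ByteArrayInputStream(octets));
            return true;
        } catch (SAXException | IOException e) {
            // Malformed octet sequences surface as one of these two types.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Declaration and octets agree.
        System.out.println(isWellFormed(document("UTF-8", StandardCharsets.UTF_8)));
        // Declared UTF-8, but "ñ" is written as the single Latin-1 octet 0xF1.
        System.out.println(isWellFormed(document("UTF-8", StandardCharsets.ISO_8859_1)));
        // Same Latin-1 octets, this time declared honestly.
        System.out.println(isWellFormed(document("ISO-8859-1", StandardCharsets.ISO_8859_1)));
    }
}
```

If a document that fails on import looks like the second case, the likely culprit is whatever wrote or re-saved the file, not the declared encoding itself.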

Did you try validating the XML document you're trying to import? Try
validating the document at  http://www.validome.org/xml/ with the
"Well-formedness only" option selected. You should get a green "The
document is well-formed" result if everything is OK. Otherwise it's a
problem in your document, not the Jackrabbit import implementation.

BR,

Jukka Zitting

Re: [JCR Browser] SessionImpl.export*View() serialize with prolog set to UTF-8

Posted by anton_slutsky <as...@applevac.com>.
Hey Jukka,

Basically, Session.importXML() blows up when the following string is present
as a value of a property in my serialized XML: "Español".  The "ñ" character
is causing the problem.  With encoding="UTF-8", my SAX parser complains
about an invalid character.  If I set encoding to UTF-16, the problem goes
away.

It could be a parser issue, but still ...

Thanks!

Anton



Jukka Zitting wrote:
> 
> Hi,
> 
> On 3/9/07, anton_slutsky <as...@applevac.com> wrote:
>> I'm running into a nasty little problem with the export*View() methods on
>> Session.  It looks like the OutputStream implementation has the encoding
>> part of the XML prolog hardcoded to UTF-8.  That's fine for serializing,
>> but it blows up when deserializing if I have any non-ASCII content (my app
>> is i18n'd to Spanish).  I've written a workaround and used my own
>> ContentHandler, but I'm wondering if UTF-8 is big enough for general
>> usage?
> 
> Do you have some alternative in mind? I don't see anything wrong with
> UTF-8, it's as standard as it gets when working with Unicode and
> internationalized applications.
> 
> If you need some specific encoding, then your solution is correct,
> i.e. use your own ContentHandler that serializes the data in whatever
> encoding you want. Using javax.xml.transform.sax.SAXTransformerFactory
> is probably the easiest standard way to achieve that.
> 
> BR,
> 
> Jukka Zitting
> 
> 



Re: SessionImpl.export*View() serialize with prolog set to UTF-8

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 3/9/07, anton_slutsky <as...@applevac.com> wrote:
> I'm running into a nasty little problem with the export*View() methods on
> Session.  It looks like the OutputStream implementation has the encoding
> part of the XML prolog hardcoded to UTF-8.  That's fine for serializing, but
> it blows up when deserializing if I have any non-ASCII content (my app is
> i18n'd to Spanish).  I've written a workaround and used my own
> ContentHandler, but I'm wondering if UTF-8 is big enough for general usage?

Do you have some alternative in mind? I don't see anything wrong with
UTF-8, it's as standard as it gets when working with Unicode and
internationalized applications.

If you need some specific encoding, then your solution is correct,
i.e. use your own ContentHandler that serializes the data in whatever
encoding you want. Using javax.xml.transform.sax.SAXTransformerFactory
is probably the easiest standard way to achieve that.
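
A rough sketch of that approach, not necessarily the code used in the workaround discussed here: the class and method names are invented, and the SAX events are fired by hand so the example runs without a repository. With a real session, the handler would instead be passed to the ContentHandler variant of the export methods.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo class, for illustration only.
class EncodingHandlerDemo {

    // Creates a ContentHandler that serializes SAX events to the stream
    // using the requested encoding (also written into the XML declaration).
    static TransformerHandler newSerializer(OutputStream out, String encoding)
            throws TransformerConfigurationException {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, encoding);
        handler.setResult(new StreamResult(out));
        return handler;
    }

    // Serializes a one-element sample document in the given encoding.
    static byte[] serializeSample(String encoding)
            throws TransformerConfigurationException, SAXException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        TransformerHandler handler = newSerializer(out, encoding);

        // Assumption: with JCR, these hand-written events would be replaced by
        // something like session.exportSystemView(path, handler, false, false).
        handler.startDocument();
        handler.startElement("", "value", "value", new AttributesImpl());
        char[] text = "Espa\u00F1ol".toCharArray();
        handler.characters(text, 0, text.length);
        handler.endElement("", "value", "value");
        handler.endDocument();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] octets = serializeSample("ISO-8859-1");
        System.out.println(octets.length + " octets written");
    }
}
```

The only javax.xml.transform code lives in newSerializer(), so the imports can stay in one helper class instead of spreading through the application.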

BR,

Jukka Zitting