Posted to dev@lenya.apache.org by Andreas Hartmann <an...@apache.org> on 2005/08/22 11:44:29 UTC

Invalid UTF-8 error for non-ASCII meta data (was: Re: Reality check on 1.4x (long))

Angelo Turetta wrote:

[...]

> In the authoring page, click on 'Document type examples', and watch a 
> wonderful 'Invalid byte 2 of 3-byte UTF-8 sequence' error. I've found 
> the cause of the problem: every page that has non-ASCII chars in the 
> meta information fails almost every operation.

I can't reproduce this one ...
How about the others?

-- Andreas


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lenya.apache.org
For additional commands, e-mail: dev-help@lenya.apache.org


Re: Invalid UTF-8 error for non-ASCII meta data

Posted by Andreas Hartmann <an...@apache.org>.
Josias Thoeny wrote:
> On Mon, 2005-08-22 at 11:44 +0200, Andreas Hartmann wrote:
> 
>>Angelo Turetta wrote:
>>
>>[...]
>>
>>
>>>In the authoring page, click on 'Document type examples', and watch a 
>>>wonderful 'Invalid byte 2 of 3-byte UTF-8 sequence' error. I've found 
>>>the cause of the problem: every page that has non-ASCII chars in the 
>>>meta information fails almost every operation.
>>
>>I can't reproduce this one ...
>>How about the others?
> 
> 
> I have seen this error.
> I tried to debug this some time ago, but I didn't find out much.
> 
> To reproduce it, go to the site area of the default publication. Select
> e.g. the "Document Type Examples" page, copy it, and insert it
> somewhere.

Thanks for your comments, I can reproduce it.

I filed a bug:

http://issues.apache.org/bugzilla/show_bug.cgi?id=36341

-- Andreas




Re: Invalid UTF-8 error for non-ASCII meta data (was: Re: Reality check on 1.4x (long))

Posted by Josias Thoeny <jo...@wyona.com>.
On Mon, 2005-08-22 at 11:44 +0200, Andreas Hartmann wrote:
> Angelo Turetta wrote:
> 
> [...]
> 
> > In the authoring page, click on 'Document type examples', and watch a 
> > wonderful 'Invalid byte 2 of 3-byte UTF-8 sequence' error. I've found 
> > the cause of the problem: every page that has non-ASCII chars in the 
> > meta information fails almost every operation.
> 
> I can't reproduce this one ...
> How about the others?

I have seen this error.
I tried to debug this some time ago, but I didn't find out much.

To reproduce it, go to the site area of the default publication. Select
e.g. the "Document Type Examples" page, copy it, and insert it
somewhere.

Here is the relevant part of the stacktrace that I'm getting:

java.io.UTFDataFormatException: Invalid byte 2 of 3-byte UTF-8 sequence.
	at org.apache.lenya.cms.metadata.MetaDataImpl.loadValues(MetaDataImpl.java:169)
	at org.apache.lenya.cms.metadata.MetaDataImpl.<init>(MetaDataImpl.java:82)
	at org.apache.lenya.cms.metadata.LenyaMetaData.<init>(LenyaMetaData.java:74)
	at org.apache.lenya.cms.metadata.MetaDataManager.getLenyaMetaData(MetaDataManager.java:80)
	at org.apache.lenya.cms.metadata.MetaDataManager.replaceMetaData(MetaDataManager.java:151)
	at org.apache.lenya.cms.repository.RepositoryManagerImpl.copy(RepositoryManagerImpl.java:40)
	... 100 more
Caused by: java.io.UTFDataFormatException: Invalid byte 2 of 3-byte UTF-8 sequence.
	at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
	at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
	at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
	at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
	at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
	at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
	at org.apache.lenya.xml.DocumentHelper.readDocument(DocumentHelper.java:173)
	at org.apache.lenya.cms.cocoon.source.SourceUtil.readDOM(SourceUtil.java:161)
	at org.apache.lenya.cms.metadata.MetaDataImpl.getDocument(MetaDataImpl.java:260)
	at org.apache.lenya.cms.metadata.MetaDataImpl.loadValues(MetaDataImpl.java:139)
	... 105 more

Here is another way to get the same problem:
- Create a document with an umlaut in the navigation title (the umlaut
  will be written into the sitetree).
- Perform an operation that changes the sitetree (e.g. create another
  document).

I get the following stacktrace:
<snip/>
Caused by: java.io.UTFDataFormatException: Invalid byte 2 of 3-byte UTF-8 sequence.
	at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
	at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
	at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
	at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
	at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
	at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
	at org.apache.lenya.xml.DocumentHelper.readDocument(DocumentHelper.java:173)
	at org.apache.lenya.cms.cocoon.source.SourceUtil.readDOM(SourceUtil.java:161)
	at org.apache.lenya.cms.site.tree.DefaultSiteTree.<init>(DefaultSiteTree.java:83)
	... 51 more


It seems the problem occurs when Lenya reads a non-ASCII character from a
file and saves it again using DocumentHelper/SourceUtil. If the special
character comes from a web form, it seems to be saved correctly.

Can anyone else reproduce this?

I wonder if it might have something to do with the following code in
SourceUtil.java, around line 195:

    ....
    OutputStream oStream = source.getOutputStream();
    Writer writer = new OutputStreamWriter(oStream);
    DocumentHelper.writeDocument(document, writer);
    ....

The OutputStreamWriter uses the platform default encoding and does not
respect the encoding of the source. But that's just a guess; I'm actually
not sure whether it's a reading or a writing problem.

Or is something wrong with my setup?
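
To illustrate the guess about OutputStreamWriter, here is a minimal,
standalone sketch. It is not Lenya code: the class EncodingSketch and its
helpers write() and isValidUtf8() are made up for this example, and
ISO-8859-1 only stands in for a non-UTF-8 platform default. It writes a
string containing an umlaut once with ISO-8859-1 and once with UTF-8, and
checks whether a strict UTF-8 decoder accepts the resulting bytes. In
ISO-8859-1, "ä" is the single byte 0xE4, which in UTF-8 is the lead byte of
a 3-byte sequence, so a following ASCII byte would be an invalid second
byte, which at least matches the wording of the error.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingSketch {

    // Hypothetical helper: returns the bytes an OutputStreamWriter
    // produces for the given text and charset.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Hypothetical helper: true if the bytes decode as strict UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String title = "St\u00e4dte"; // "Städte", escaped to keep this file ASCII

        // Written with a non-UTF-8 encoding: 0xE4 is followed by 'd' (0x64).
        System.out.println(isValidUtf8(write(title, StandardCharsets.ISO_8859_1)));

        // Written with an explicit UTF-8 encoding: decodes cleanly.
        System.out.println(isValidUtf8(write(title, StandardCharsets.UTF_8)));
    }
}
```

If this turns out to be the cause, one thing to try would be passing the
encoding explicitly, e.g. new OutputStreamWriter(oStream, "UTF-8"),
assuming the documents are meant to be stored as UTF-8.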

Josias


