Posted to user@commons.apache.org by DECAFFMEYER MATHIEU <MA...@fortis.lu> on 2006/12/28 11:30:07 UTC

[Urgent] UTF-8 encoding problem

Hi,

I am using Jakarta Configuration to manipulate some XML files.
I have the following error when I open one of the files :


org.apache.commons.configuration.ConfigurationException: Octet 2 incorrect dans la séquence UTF-8 à 3-octets.
	at org.apache.commons.configuration.XMLConfiguration.load(XMLConfiguration.java:620)
	at org.apache.commons.configuration.XMLConfiguration.load(XMLConfiguration.java:578)
	at org.apache.commons.configuration.XMLConfiguration$XMLFileConfigurationDelegate.load(XMLConfiguration.java:1045)
	at org.apache.commons.configuration.AbstractFileConfiguration.load(AbstractFileConfiguration.java:280)
[...]




Caused by: java.io.UTFDataFormatException: Octet 2 incorrect dans la séquence UTF-8 à 3-octets.
	at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
	at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
	at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
[...]



The first lines of the file are:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE configuration [
<!ENTITY amp "&#x26;">
<!ENTITY lt "&#x3C;">
<!ENTITY minus "&#45;">
]>
[...]


I have an XML with exactly the same lines at the top,
and I have no problem with this one :
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE configuration [
<!ENTITY amp "&#x26;">
<!ENTITY lt "&#x3C;">
<!ENTITY minus "&#45;">
]>
[...]


What do you suggest I do?

Thank you for any help! It will be greatly appreciated!


============================================
Internet communications are not secure and therefore Fortis Banque Luxembourg S.A. does not accept legal responsibility for the contents of this message. The information contained in this e-mail is confidential and may be legally privileged. It is intended solely for the addressee. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. Nothing in the message is capable or intended to create any legally binding obligations on either party and it is not intended to provide legal advice.
============================================


Re: [Urgent] UTF-8 encoding problem

Posted by Thomas Thomas <de...@gmail.com>.
(this is my address from home)

Thank you all for your valuable comments,
they helped.

For some reason the String I was writing was not encoded as UTF-8.
I still can't figure out why.

I have added this code:


    byte[] codedString;
    String decoded = null;
    try {
        codedString = stopwords.getBytes();
        decoded = new String(codedString, "UTF-8");
    } catch (UnsupportedEncodingException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } // to bytes

    stopwords = decoded;
    writer.setProperty("searchIndex.stopwordList", stopwords);


It seems to work now!
Thank you all.
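
A note on the snippet above: getBytes() with no argument encodes with the platform default charset, so that round trip only behaves the same way on machines whose default happens to match. A hedged alternative is to name the charset wherever text crosses into bytes. The sketch below assumes Java 8 or later; the class name, the method name and the sample value are invented for illustration and are not taken from the original code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class Utf8WriteDemo {
    /** Serializes text through a Writer whose charset is stated explicitly. */
    static byte[] writeAsUtf8(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // try-with-resources closes (and therefore flushes) the writer
        // before the bytes are read back from the sink.
        try (Writer w = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String stopwords = "\u00e0 \u00e9t\u00e9"; // "à été", hypothetical sample value
        byte[] bytes = writeAsUtf8(stopwords);
        // Decoding with the same, explicitly named charset restores the string.
        String back = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(back.equals(stopwords)); // true
    }
}
```

The point of the design is that neither the encoding step nor the decoding step relies on a default that can differ between machines.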

Re: [Urgent] UTF-8 encoding problem

Posted by Thorbjørn Ravn Andersen <th...@gmail.com>.
DECAFFMEYER MATHIEU wrote on 28-12-2006 11:30:
>
> What do you suggest I do?
>
You have a broken xml-file where the byte stream is not a valid UTF-8 
stream.

Try loading the file in Internet Explorer and see what it says about the 
file.


-- 
  Thorbjørn

Re: [Configuration] UTF-8 encoding problem

Posted by Andrew Shirley <ak...@decisionsoft.co.uk>.
On Fri, Dec 29, 2006 at 01:00:51AM +1300, Simon Kitching wrote:
> On Thu, 2006-12-28 at 11:15 +0000, Andrew Shirley wrote:
> > On Thu, Dec 28, 2006 at 11:30:07AM +0100, DECAFFMEYER MATHIEU wrote:
> > > 
> > > Hi,
> > > 
> > > I am using Jakarta Configuration to manipulate some XML files.
> > > 
> > 
> > 
> > > 
> > > What do you suggest I do?
> > > 
> > > Thank you for any help! It will be greatly appreciated!
> > 
> > It may be that the file isn't actually UTF-8, i.e. it contains some
> > extended ASCII characters. The usual culprit in the UK is the pound
> > sign, but the euro is probably a good candidate as well. I would check
> > that you are only using the standard (i.e. < 128) ASCII characters.
> 
> The UTF-8 encoding can handle any character at all, not just ASCII.
> 

That is true; however, editing in UTF-8 is still not straightforward to
set up. When using XML, I would recommend restricting yourself to ASCII
(< 128) and handling any other requirements as an entity; this is far
less likely to break and is more portable.
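
As a rough illustration of that approach, the sketch below replaces every character above 127 with a numeric character reference (which needs no DTD declaration), so the resulting text is pure ASCII and reads the same under UTF-8, Latin-1 or Windows-1252. The class and method names are hypothetical, and the method does not escape markup characters such as & or <; it only deals with non-ASCII characters.

```java
final class XmlAsciiEscaper {
    /** Replaces each non-ASCII code point with a reference of the form &#xHEX; */
    static String toAsciiXml(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp < 128) {
                sb.append((char) cp);                    // plain ASCII passes through
            } else {
                sb.append(String.format("&#x%X;", cp));  // e.g. U+20AC -> &#x20AC;
            }
            i += Character.charCount(cp);                // step over surrogate pairs
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Pound sign and euro sign, written as Unicode escapes.
        System.out.println(toAsciiXml("\u00a3 and \u20ac")); // &#xA3; and &#x20AC;
    }
}
```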

Apologies if I confused the UTF-8/ASCII issue.

Andrew Shirley

---------------------------------------------------------------------
To unsubscribe, e-mail: commons-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-user-help@jakarta.apache.org


Re: [Configuration] UTF-8 encoding problem

Posted by Simon Kitching <sk...@apache.org>.
On Thu, 2006-12-28 at 11:15 +0000, Andrew Shirley wrote:
> On Thu, Dec 28, 2006 at 11:30:07AM +0100, DECAFFMEYER MATHIEU wrote:
> > 
> > Hi,
> > 
> > I am using Jakarta Configuration to manipulate some XML files.
> > 
> 
> 
> > 
> > What do you suggest I do?
> > 
> > Thank you for any help! It will be greatly appreciated!
> 
> It may be that the file isn't actually UTF-8, i.e. it contains some
> extended ASCII characters. The usual culprit in the UK is the pound
> sign, but the euro is probably a good candidate as well. I would check
> that you are only using the standard (i.e. < 128) ASCII characters.

The UTF-8 encoding can handle any character at all, not just ASCII.

The error message you are seeing is not being generated by
commons-configuration, but by the underlying xml parser:

> 
> Caused by: java.io.UTFDataFormatException: Octet 2 incorrect dans la
> séquence UTF-8 à 3-octets. 
>         at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown 

In other words, your input file is corrupt; the xml parser has
encountered a sequence of bytes that does not correspond to any valid
character. 

You will need to fix your input file so that it is valid UTF-8. There is
no way that the commons-configuration library can process your data if
the xml parser refuses to parse it.

One possibility is that the input file is actually encoded in an 8-bit
character encoding such as LATIN-1, NOT UTF-8 at all.

With UTF-8, any byte from 0 through 127 is an ASCII character, while a
byte from 128 through 255 is always part of a multibyte sequence
(two or more bytes) that represents a character that is NOT in the ASCII
set.

With an 8-bit encoding like LATIN-1, values from 128 to 255 are NOT
multibyte sequences, but instead represent a specific set of 128
"extended characters", and there is no way to represent a character that
is not in the set associated with that encoding.
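
To make the difference concrete, here is a minimal Java sketch, assuming Java 7 or later (for StandardCharsets); the class and method names are made up. It encodes the word "séquence" both ways and checks each byte array with a strict UTF-8 decoder. In Latin-1 the é is the single byte 0xE9, which a UTF-8 decoder reads as the lead byte of a 3-byte sequence; the byte that follows, 'q', is not a valid continuation byte. That would be consistent with the reported message, which says that byte 2 of a 3-byte UTF-8 sequence is invalid.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncodingDemo {
    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false; // the decoder rejected some byte sequence
        }
    }

    public static void main(String[] args) {
        String text = "s\u00e9quence"; // "séquence"
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        System.out.println(latin1.length);        // 8: one byte per character
        System.out.println(utf8.length);          // 9: é takes two bytes (0xC3 0xA9)
        System.out.println(isValidUtf8(utf8));    // true
        System.out.println(isValidUtf8(latin1));  // false: 0xE9 followed by 'q'
    }
}
```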

Regards,

Simon




Re: [Configuration] UTF-8 encoding problem

Posted by Andrew Shirley <ak...@decisionsoft.co.uk>.
On Thu, Dec 28, 2006 at 11:30:07AM +0100, DECAFFMEYER MATHIEU wrote:
> 
> Hi,
> 
> I am using Jakarta Configuration to manipulate some XML files.
> 


> 
> What do you suggest I do?
> 
> Thank you for any help! It will be greatly appreciated!

It may be that the file isn't actually UTF-8, i.e. it contains some
extended ASCII characters. The usual culprit in the UK is the pound
sign, but the euro is probably a good candidate as well. I would check
that you are only using the standard (i.e. < 128) ASCII characters.

Note that I don't mean that an entity for euro will be causing the
problem but an actual character (i.e. shift-3).

hope this helps

Andrew Shirley



Re: [Urgent] UTF-8 encoding problem

Posted by lu...@free.fr.
You can also try the GNU program recode by François Pinard (available under
GNU/Linux, Unix and other systems as well). For example, the following command
line should start converting file x.xml and write the result to y.xml, but
stop at the first character that is not really UTF-8:

   recode UTF-8..ISO-8859-15 < x.xml > y.xml

You could then look at the end of the y.xml file to see what comes just
before the offending character.
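
The same kind of check can be sketched in Java in case recode is not at hand. This is a rough sketch under stated assumptions: Java 7 or later, a file small enough to fit in memory, and class and method names that are made up for illustration. It reports the byte offset of the first sequence that a strict UTF-8 decoder rejects.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

final class Utf8Locator {
    /** Returns the offset of the first malformed UTF-8 sequence, or -1 if none. */
    static int firstInvalidOffset(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        // UTF-8 never yields more chars than it has bytes, so this is large enough.
        CharBuffer out = CharBuffer.allocate(bytes.length);
        CoderResult result = decoder.decode(in, out, true);
        // On an error the input position is left at the start of the bad sequence.
        return result.isError() ? in.position() : -1;
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(args[0])); // e.g. x.xml
        int offset = firstInvalidOffset(bytes);
        System.out.println(offset < 0
                ? "valid UTF-8"
                : "first invalid byte sequence at offset " + offset);
    }
}
```

Opening the file in a hex viewer at the reported offset should then show which character was saved in the wrong encoding.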

Luc

Quoting Tom <tb...@tbee.org>:

> You could also try and use an encoding aware editor to edit the XML
> file; for example XMLSpy or Eclipse with the web tools plugins. These
> editors will interpret the specified encoding value and save the XML in
> that encoding. Very convenient!



Re: [Urgent] UTF-8 encoding problem

Posted by Tom <tb...@tbee.org>.
You could also try using an encoding-aware editor to edit the XML
file; for example XMLSpy or Eclipse with the web tools plugins. These
editors will interpret the specified encoding value and save the XML in
that encoding. Very convenient!

Tom


Jan.Van-Stalle@ec.europa.eu wrote:
> Sounds more like an encoding problem to me.
>
> Try submitting an XML file containing only "simple" (< 128) characters; these are encoded the same way in UTF-8 and in other encoding schemes (like Windows-1252). If this works, I would think that the submitted XML file is not correctly UTF-8 encoded: since the XML header declares the document to be UTF-8, characters like é, è, or the euro sign have to be encoded differently than in those other schemes.
>
> Jan


RE: [Urgent] UTF-8 encoding problem

Posted by Ja...@ec.europa.eu.
Sounds more like an encoding problem to me.

Try submitting an XML file containing only "simple" (< 128) characters; these are encoded the same way in UTF-8 and in other encoding schemes (like Windows-1252). If this works, I would think that the submitted XML file is not correctly UTF-8 encoded: since the XML header declares the document to be UTF-8, characters like é, è, or the euro sign have to be encoded differently than in those other schemes.

Jan

------------------
Jan Van Stalle
DIGIT.B.03                 
tel +32 2 299 49 82
Bureau MO34 2/54


-----Original Message-----
From: Mark Diggory [mailto:mdiggory@gmail.com] 
Sent: Thursday, January 04, 2007 1:08 PM
To: Jakarta Commons Users List
Subject: Re: [Urgent] UTF-8 encoding problem


This looks more like an XML / Xerces Parsing issue, I would seek help there.
Sounds like you're placing non-UTF-8-encoded characters into your XML file.

-Mark



Re: [Urgent] UTF-8 encoding problem

Posted by Mark Diggory <md...@gmail.com>.
This looks more like an XML / Xerces Parsing issue, I would seek help there.
Sounds like you're placing non-UTF-8-encoded characters into your XML file.

-Mark
