Posted to dev@cocoon.apache.org by Bertrand Delacretaz <bd...@apache.org> on 2004/12/09 07:24:19 UTC

Re: svn commit: r111262 - in cocoon/branches/BRANCH_2_1_X/src: java/org/apache/cocoon/components/flow webapp/WEB-INF

On 9 Dec 04, at 01:03, Leszek Gawron wrote:

>>> ...
>>> +<?xml version="1.0" encoding="UTF-8"?>
>> ...
> This is a BOM (byte order mark). Some XML editors write it at the 
> beginning of a multibyte-encoded (e.g. UTF-8) XML file. The file I 
> committed is valid XML. Check in any XML editor/browser...

BOM has no meaning for UTF-8, see 
http://www.unicode.org/unicode/faq/utf_bom.html#BOM

It is certainly better *not* to use it, to avoid any confusion. On 
unixish OSes, many tools check the first four bytes of a file and 
expect them to be <?xm

-Bertrand
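
The prefix check described above can be sketched as follows. This is a
minimal illustration in Python, not code taken from any of the tools
mentioned in this thread; the function names are hypothetical.

```python
import codecs

# UTF-8 encoding of U+FEFF: the three bytes EF BB BF.
UTF8_BOM = codecs.BOM_UTF8


def starts_with_xml_decl(data: bytes) -> bool:
    """Return True if the first four bytes are exactly '<?xm'."""
    return data[:4] == b"<?xm"


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data begins with the UTF-8 BOM."""
    return data.startswith(UTF8_BOM)


# The same document, without and with a leading BOM.
plain = b'<?xml version="1.0" encoding="UTF-8"?><root/>'
with_bom = UTF8_BOM + plain
```

A tool that only performs the four-byte comparison accepts `plain` but
rejects `with_bom`, because the BOM pushes `<?xm` three bytes further in.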

Posted by Antonio Gallardo <ag...@agssa.net>.
On Thu, 9 December 2004, 2:21, Leszek Gawron said:
> By the way: it is a little bit different on win32. Some tools detect utf
> encoding by checking for BOM. If there is none - ANSI encoding is assumed.

....then switch to Linux! ;-)

Best Regards,

Antonio Gallardo


Posted by Leszek Gawron <lg...@mobilebox.pl>.
Bertrand Delacretaz wrote:
> On 9 Dec 04, at 09:49, Leszek Gawron wrote:
> 
>> ... Because Microsoft did it, and there is so much Notepad data out 
>> there, the UTF-8 BOM became a de facto standard and then a de jure 
>> standard. (Although the BOM is optional.)..
> 
> 
> hmm...not sure if notepad is the kind of reference that we want to use 
> here ;-)
> 
> Anyway, I think most or all our XML files are UTF-8 with no BOM, so it's 
> probably not a good idea to change.
This is not just a Notepad problem. At least two tools I use (UltraEdit 
and Araxis Merge) follow the same logic.

-- 
Leszek Gawron                                      lgawron@mobilebox.pl
Project Manager                                    MobileBox sp. z o.o.
+48 (61) 855 06 67                              http://www.mobilebox.pl
mobile: +48 (501) 720 812                       fax: +48 (61) 853 29 65

Posted by Bertrand Delacretaz <bd...@apache.org>.
On 9 Dec 04, at 09:49, Leszek Gawron wrote:
> ... Because Microsoft did it, and there is so much Notepad data out 
> there, the UTF-8 BOM became a de facto standard and then a de jure 
> standard. (Although the BOM is optional.)..

hmm...not sure if notepad is the kind of reference that we want to use 
here ;-)

Anyway, I think most or all our XML files are UTF-8 with no BOM, so 
it's probably not a good idea to change.

-Bertrand
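
One way to confirm that a source tree really contains no BOM-prefixed
XML files is sketched below. It is a hypothetical helper written for
illustration, not part of Cocoon or its build; the function names and
the "*.xml" default pattern are assumptions.

```python
import codecs
from pathlib import Path
from typing import List


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def find_bom_files(root: Path, pattern: str = "*.xml") -> List[Path]:
    """List files under root matching pattern that start with a UTF-8 BOM."""
    hits = []
    for path in sorted(root.rglob(pattern)):
        if not path.is_file():
            continue
        # Only the first three bytes are needed to recognise the BOM.
        with path.open("rb") as fh:
            if fh.read(3) == codecs.BOM_UTF8:
                hits.append(path)
    return hits
```

Calling `find_bom_files(Path("src"))` would return the offending paths,
and `strip_utf8_bom` can then be applied to each file's bytes before
writing them back.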

Posted by Leszek Gawron <lg...@mobilebox.pl>.
Vadim Gritsenko wrote:
> Antonio Gallardo wrote:
> 
>> On Thu, 9 December 2004, 2:49, Leszek Gawron said:
>>
>>> Bertrand Delacretaz wrote:
>>>
>>>> On 9 Dec 04, at 09:21, Leszek Gawron wrote:
>>>>
>>>>> ...By the way: it is a little bit different on win32. Some tools
>>>>> detect utf encoding by checking for BOM. If there is none - ANSI
>>>>> encoding is assumed...
>>>>
>>>>
>>>> AFAIU this is ok for 16-bit based encodings, not for UTF-8.
>>>>
>>> http://www.xencraft.com/resources/unicodebom.html
> 
> ...
> 
>>>
>>> M$ again.
>>
>>
>> This is the standard:
>>
>> http://www.zvon.org/tmRFC/RFC3023/Output/chapter8.html#sub1 :-D
> 
> 
> No, it's not. The standard is:
>   http://www.w3c.org/TR/2004/REC-xml11-20040204/#NT-XMLDecl
> 
> XML *must* start with '<?xml'. So no MS junk of any kind is allowed. 
> Please don't use Notepad - vim and FAR both have syntax highlighting and 
> do not write BOMs of any kind :)
I am not using Notepad :). Apparently even sophisticated win32 text 
editors (like my favourite, UltraEdit) support UTF-8 BOMs.

Posted by Vadim Gritsenko <va...@reverycodes.com>.
Antonio Gallardo wrote:
> On Thu, 9 December 2004, 2:49, Leszek Gawron said:
> 
>>Bertrand Delacretaz wrote:
>>
>>>On 9 Dec 04, at 09:21, Leszek Gawron wrote:
>>>
>>>>...By the way: it is a little bit different on win32. Some tools
>>>>detect utf encoding by checking for BOM. If there is none - ANSI
>>>>encoding is assumed...
>>>
>>>AFAIU this is ok for 16-bit based encodings, not for UTF-8.
>>>
>>http://www.xencraft.com/resources/unicodebom.html
...
>>
>>M$ again.
> 
> This is the standard:
> 
> http://www.zvon.org/tmRFC/RFC3023/Output/chapter8.html#sub1 :-D

No, it's not. The standard is:
   http://www.w3c.org/TR/2004/REC-xml11-20040204/#NT-XMLDecl

XML *must* start with '<?xml'. So no MS junk of any kind is allowed. Please 
don't use Notepad - vim and FAR both have syntax highlighting and do not write 
BOMs of any kind :)

Vadim

Posted by Antonio Gallardo <ag...@agssa.net>.
On Thu, 9 December 2004, 2:49, Leszek Gawron said:
> Bertrand Delacretaz wrote:
>> On 9 Dec 04, at 09:21, Leszek Gawron wrote:
>>
>>> ...By the way: it is a little bit different on win32. Some tools
>>> detect utf encoding by checking for BOM. If there is none - ANSI
>>> encoding is assumed...
>>
>>
>> AFAIU this is ok for 16-bit based encodings, not for UTF-8.
>>
>> -Bertrand
> http://www.xencraft.com/resources/unicodebom.html
> <quote>
> Even though UTF-8 does not need a BOM to indicate endianness, Microsoft
> Notepad began prepending a BOM to its UTF-8 text files. Actually, it is
> a conversion of U+FEFF to an encoding as UTF-8 serialized bytes: EF BB
> BF (or in 4GL: CHR(15711167)). There is some value in the BOM being used
> as a file signature, indicating the plain text file is encoded as
> Unicode UTF-8, as opposed to some other code page. That particular
> 3-byte sequence is unlikely to represent data in any other code page,
> given the text is supposed to be human readable in some language.
> However, there is some small possibility that it represents some string
> in some code page... Because Microsoft did it, and there is so much
> Notepad data out there, the UTF-8 BOM became a de facto standard and
> then a de jure standard. (Although the BOM is optional.)
> </quote>
>
> M$ again.

This is the standard:

http://www.zvon.org/tmRFC/RFC3023/Output/chapter8.html#sub1 :-D

Best Regards,

Antonio Gallardo.


Posted by Leszek Gawron <lg...@mobilebox.pl>.
Bertrand Delacretaz wrote:
> On 9 Dec 04, at 09:21, Leszek Gawron wrote:
> 
>> ...By the way: it is a little bit different on win32. Some tools 
>> detect utf encoding by checking for BOM. If there is none - ANSI 
>> encoding is assumed...
> 
> 
> AFAIU this is ok for 16-bit based encodings, not for UTF-8.
> 
> -Bertrand
http://www.xencraft.com/resources/unicodebom.html
<quote>
Even though UTF-8 does not need a BOM to indicate endianness, Microsoft 
Notepad began prepending a BOM to its UTF-8 text files. Actually, it is 
a conversion of U+FEFF to an encoding as UTF-8 serialized bytes: EF BB 
BF (or in 4GL: CHR(15711167)). There is some value in the BOM being used 
as a file signature, indicating the plain text file is encoded as 
Unicode UTF-8, as opposed to some other code page. That particular 
3-byte sequence is unlikely to represent data in any other code page, 
given the text is supposed to be human readable in some language. 
However, there is some small possibility that it represents some string 
in some code page... Because Microsoft did it, and there is so much 
Notepad data out there, the UTF-8 BOM became a de facto standard and 
then a de jure standard. (Although the BOM is optional.)
</quote>

M$ again.
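
The byte values in the quote can be checked directly. The short Python
sketch below is an illustration only (the variable names are arbitrary);
it encodes U+FEFF as UTF-8 and reads the result as a single integer,
which is the number passed to CHR(...) in the quoted text.

```python
# U+FEFF encoded as UTF-8 gives the three bytes EF BB BF.
bom_bytes = "\ufeff".encode("utf-8")

# Read as one big-endian integer, those three bytes are the value
# that appears in the quoted CHR(...) call.
bom_as_int = int.from_bytes(bom_bytes, "big")

print(bom_bytes.hex(" "))  # ef bb bf
print(bom_as_int)          # 15711167
```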

Posted by Bertrand Delacretaz <bd...@apache.org>.
On 9 Dec 04, at 09:21, Leszek Gawron wrote:
> ...By the way: it is a little bit different on win32. Some tools 
> detect utf encoding by checking for BOM. If there is none - ANSI 
> encoding is assumed...

AFAIU this is ok for 16-bit based encodings, not for UTF-8.

-Bertrand

Posted by Leszek Gawron <lg...@mobilebox.pl>.
Bertrand Delacretaz wrote:
> On 9 Dec 04, at 01:03, Leszek Gawron wrote:
> 
>>>> ...
>>>> +<?xml version="1.0" encoding="UTF-8"?>
>>>
>>> ...
>>
>> This is a BOM (byte order mark). Some XML editors write it at the 
>> beginning of a multibyte-encoded (e.g. UTF-8) XML file. The file I 
>> committed is valid XML. Check in any XML editor/browser...
> 
> 
> BOM has no meaning for UTF-8, see 
> http://www.unicode.org/unicode/faq/utf_bom.html#BOM
> 
> It is certainly better *not* to use it, to avoid any confusion. On 
> unixish OSes, many tools check the first four bytes of a file and expect 
> them to be <?xm
> 
> -Bertrand
OK. No problem.

By the way: it is a little bit different on win32. Some tools detect utf 
encoding by checking for BOM. If there is none - ANSI encoding is assumed.
