Posted to dev@cocoon.apache.org by Bertrand Delacretaz <bd...@apache.org> on 2004/12/09 07:24:19 UTC
Re: svn commit: r111262 - in cocoon/branches/BRANCH_2_1_X/src: java/org/apache/cocoon/components/flow webapp/WEB-INF
On 9 Dec 04, at 01:03, Leszek Gawron wrote:
>>> ...
>>> +<?xml version="1.0" encoding="UTF-8"?>
>> ...
> This is a BOM (byte order mark). Some XML editors write it at the
> beginning of a multibyte-encoded (e.g. UTF-8) XML file. The file I
> committed is valid XML. Check in any XML
> editor/browser...
BOM has no meaning for UTF-8, see
http://www.unicode.org/unicode/faq/utf_bom.html#BOM
It is certainly better *not* to use it, to avoid any confusion. On
unixish OSes, many tools check the first four bytes of a file and
expect them to be <?xm
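
To make the point concrete, here is a minimal Python sketch of the kind of
naive check such a tool might perform. The function name `starts_like_xml`
and the sample bytes are hypothetical and not taken from any particular tool.

```python
# Hypothetical sketch of a "first four bytes" sniffing check.
UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF serialized as UTF-8


def starts_like_xml(data: bytes) -> bool:
    """Return True if the first four bytes are exactly b'<?xm'."""
    return data[:4] == b"<?xm"


decl = b'<?xml version="1.0" encoding="UTF-8"?>\n<root/>'

print(starts_like_xml(decl))             # file without a BOM
print(starts_like_xml(UTF8_BOM + decl))  # same file with a leading BOM
```

With a BOM in front, the first four bytes are EF BB BF 3C, so a check of this
kind no longer recognizes the file.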
-Bertrand
Re: svn commit: r111262 - in cocoon/branches/BRANCH_2_1_X/src:
java/org/apache/cocoon/components/flow webapp/WEB-INF
Posted by Antonio Gallardo <ag...@agssa.net>.
On Thu, 9 December 2004, 2:21, Leszek Gawron wrote:
> By the way: it is a little bit different on win32. Some tools detect utf
> encoding by checking for BOM. If there is none - ANSI encoding is assumed.
....then switch to Linux! ;-)
Best Regards,
Antonio Gallardo
Re: svn commit: r111262 - in cocoon/branches/BRANCH_2_1_X/src:
java/org/apache/cocoon/components/flow webapp/WEB-INF
Posted by Leszek Gawron <lg...@mobilebox.pl>.
Bertrand Delacretaz wrote:
> On 9 Dec 04, at 09:49, Leszek Gawron wrote:
>
>> ... Because Microsoft did it, and there is so much Notepad data out
>> there, the UTF-8 BOM became a de facto standard and then a de jure
>> standard. (Although the BOM is optional.)..
>
>
> hmm...not sure if notepad is the kind of reference that we want to use
> here ;-)
>
> Anyway, I think most or all our XML files are UTF-8 with no BOM, so it's
> probably not a good idea to change.
It is not only a Notepad problem. At least two tools I use (UltraEdit
and Araxis Merge) follow the same logic.
--
Leszek Gawron lgawron@mobilebox.pl
Project Manager MobileBox sp. z o.o.
+48 (61) 855 06 67 http://www.mobilebox.pl
mobile: +48 (501) 720 812 fax: +48 (61) 853 29 65
Re: svn commit: r111262 - in cocoon/branches/BRANCH_2_1_X/src: java/org/apache/cocoon/components/flow webapp/WEB-INF
Posted by Bertrand Delacretaz <bd...@apache.org>.
On 9 Dec 04, at 09:49, Leszek Gawron wrote:
> ... Because Microsoft did it, and there is so much Notepad data out
> there, the UTF-8 BOM became a de facto standard and then a de jure
> standard. (Although the BOM is optional.)..
hmm...not sure if notepad is the kind of reference that we want to use
here ;-)
Anyway, I think most or all our XML files are UTF-8 with no BOM, so
it's probably not a good idea to change.
-Bertrand
Re: svn commit: r111262 - in cocoon/branches/BRANCH_2_1_X/src:
java/org/apache/cocoon/components/flow webapp/WEB-INF
Posted by Leszek Gawron <lg...@mobilebox.pl>.
Vadim Gritsenko wrote:
> Antonio Gallardo wrote:
>
>> On Thu, 9 December 2004, 2:49, Leszek Gawron wrote:
>>
>>> Bertrand Delacretaz wrote:
>>>
>>>> On 9 Dec 04, at 09:21, Leszek Gawron wrote:
>>>>
>>>>> ...By the way: it is a little bit different on win32. Some tools
>>>>> detect utf encoding by checking for BOM. If there is none - ANSI
>>>>> encoding is assumed...
>>>>
>>>>
>>>> AFAIU this is ok for 16-bit based encodings, not for UTF-8.
>>>>
>>> http://www.xencraft.com/resources/unicodebom.html
>
> ...
>
>>>
>>> M$ again.
>>
>>
>> This is the standard:
>>
>> http://www.zvon.org/tmRFC/RFC3023/Output/chapter8.html#sub1 :-D
>
>
> No, it's not. The standard is:
> http://www.w3c.org/TR/2004/REC-xml11-20040204/#NT-XMLDecl
>
> XML *must* start with '<?xml'. So no MS junk of any kind is allowed.
> Please don't use notepad - vim and far both have syntax highlight and do
> not write boms of any kind :)
I am not using notepad :). Apparently even sophisticated win32 text
editors (like my favourite UltraEdit) support UTF-8 BOMs.
Re: svn commit: r111262 - in cocoon/branches/BRANCH_2_1_X/src:
java/org/apache/cocoon/components/flow webapp/WEB-INF
Posted by Vadim Gritsenko <va...@reverycodes.com>.
Antonio Gallardo wrote:
> On Thu, 9 December 2004, 2:49, Leszek Gawron wrote:
>
>>Bertrand Delacretaz wrote:
>>
>>>On 9 Dec 04, at 09:21, Leszek Gawron wrote:
>>>
>>>>...By the way: it is a little bit different on win32. Some tools
>>>>detect utf encoding by checking for BOM. If there is none - ANSI
>>>>encoding is assumed...
>>>
>>>AFAIU this is ok for 16-bit based encodings, not for UTF-8.
>>>
>>http://www.xencraft.com/resources/unicodebom.html
...
>>
>>M$ again.
>
> This is the standard:
>
> http://www.zvon.org/tmRFC/RFC3023/Output/chapter8.html#sub1 :-D
No, it's not. The standard is:
http://www.w3c.org/TR/2004/REC-xml11-20040204/#NT-XMLDecl
XML *must* start with '<?xml'. So no MS junk of any kind is allowed. Please
don't use notepad - vim and far both have syntax highlight and do not write boms
of any kind :)
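
For files that were already saved with a BOM, removing it is straightforward.
A minimal Python sketch, assuming the file content is available as bytes;
`strip_utf8_bom` is a hypothetical helper name, not part of Cocoon.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM; other input is unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="UTF-8"?>'
print(strip_utf8_bom(with_bom)[:5])  # the declaration is now at offset 0
```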
Vadim
Re: svn commit: r111262 - in cocoon/branches/BRANCH_2_1_X/src:
java/org/apache/cocoon/components/flow webapp/WEB-INF
Posted by Antonio Gallardo <ag...@agssa.net>.
On Thu, 9 December 2004, 2:49, Leszek Gawron wrote:
> Bertrand Delacretaz wrote:
>> On 9 Dec 04, at 09:21, Leszek Gawron wrote:
>>
>>> ...By the way: it is a little bit different on win32. Some tools
>>> detect utf encoding by checking for BOM. If there is none - ANSI
>>> encoding is assumed...
>>
>>
>> AFAIU this is ok for 16-bit based encodings, not for UTF-8.
>>
>> -Bertrand
> http://www.xencraft.com/resources/unicodebom.html
> <quote>
> Even though UTF-8 does not need a BOM to indicate endianness, Microsoft
> Notepad began prepending a BOM to its UTF-8 text files. Actually, it is
> a conversion of U+FEFF to an encoding as UTF-8 serialized bytes: EF BB
> BF (or in 4GL: CHR(15711167)). There is some value in the BOM being used
> as a file signature, indicating the plain text file is encoded as
> Unicode UTF-8, as opposed to some other code page. That particular
> 3-byte sequence is unlikely to represent data in any other code page,
> given the text is supposed to be human readable in some language.
> However, there is some small possibility that it represents some string
> in some code page... Because Microsoft did it, and there is so much
> Notepad data out there, the UTF-8 BOM became a de facto standard and
> then a de jure standard. (Although the BOM is optional.)
> </quote>
>
> M$ again.
This is the standard:
http://www.zvon.org/tmRFC/RFC3023/Output/chapter8.html#sub1 :-D
Best Regards,
Antonio Gallardo.
Re: svn commit: r111262 - in cocoon/branches/BRANCH_2_1_X/src:
java/org/apache/cocoon/components/flow webapp/WEB-INF
Posted by Leszek Gawron <lg...@mobilebox.pl>.
Bertrand Delacretaz wrote:
> On 9 Dec 04, at 09:21, Leszek Gawron wrote:
>
>> ...By the way: it is a little bit different on win32. Some tools
>> detect utf encoding by checking for BOM. If there is none - ANSI
>> encoding is assumed...
>
>
> AFAIU this is ok for 16-bit based encodings, not for UTF-8.
>
> -Bertrand
http://www.xencraft.com/resources/unicodebom.html
<quote>
Even though UTF-8 does not need a BOM to indicate endianness, Microsoft
Notepad began prepending a BOM to its UTF-8 text files. Actually, it is
a conversion of U+FEFF to an encoding as UTF-8 serialized bytes: EF BB
BF (or in 4GL: CHR(15711167)). There is some value in the BOM being used
as a file signature, indicating the plain text file is encoded as
Unicode UTF-8, as opposed to some other code page. That particular
3-byte sequence is unlikely to represent data in any other code page,
given the text is supposed to be human readable in some language.
However, there is some small possibility that it represents some string
in some code page... Because Microsoft did it, and there is so much
Notepad data out there, the UTF-8 BOM became a de facto standard and
then a de jure standard. (Although the BOM is optional.)
</quote>
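
The byte values in the quote are easy to verify. A short Python sketch (the
variable names are only illustrative) that encodes U+FEFF as UTF-8 and reads
the three bytes as one big-endian integer, which is where the number in the
4GL example comes from:

```python
# Encode U+FEFF as UTF-8 and inspect the resulting bytes.
bom_bytes = "\ufeff".encode("utf-8")

print(bom_bytes.hex())                   # the three bytes in hexadecimal
print(int.from_bytes(bom_bytes, "big"))  # the same bytes as one integer
```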
M$ again.
Re: svn commit: r111262 - in cocoon/branches/BRANCH_2_1_X/src: java/org/apache/cocoon/components/flow webapp/WEB-INF
Posted by Bertrand Delacretaz <bd...@apache.org>.
On 9 Dec 04, at 09:21, Leszek Gawron wrote:
> ...By the way: it is a little bit different on win32. Some tools
> detect utf encoding by checking for BOM. If there is none - ANSI
> encoding is assumed...
AFAIU this is ok for 16-bit based encodings, not for UTF-8.
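
A small Python sketch of the difference, using only the standard codecs: in
UTF-16 the byte order of U+FEFF depends on endianness, so the mark tells a
reader which one was used, while UTF-8 has a single possible byte sequence.
The dictionary name `bom_by_encoding` is just for illustration.

```python
# Serialize U+FEFF in each encoding and compare the byte sequences.
bom_by_encoding = {
    name: "\ufeff".encode(name)
    for name in ("utf-16-le", "utf-16-be", "utf-8")
}

for name, raw in bom_by_encoding.items():
    print(name, raw.hex())
```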
-Bertrand
Re: svn commit: r111262 - in cocoon/branches/BRANCH_2_1_X/src:
java/org/apache/cocoon/components/flow webapp/WEB-INF
Posted by Leszek Gawron <lg...@mobilebox.pl>.
Bertrand Delacretaz wrote:
> On 9 Dec 04, at 01:03, Leszek Gawron wrote:
>
>>>> ...
>>>> +<?xml version="1.0" encoding="UTF-8"?>
>>>
>>> ...
>>
>> This is a BOM (byte order mark). Some XML editors write it at the
>> beginning of a multibyte-encoded (e.g. UTF-8) XML file. The file I
>> committed is valid XML. Check in any XML
>> editor/browser...
>
>
> BOM has no meaning for UTF-8, see
> http://www.unicode.org/unicode/faq/utf_bom.html#BOM
>
> It is certainly better *not* to use it, to avoid any confusion. On
> unixish OSes, many tools check the first four bytes of a file and expect
> them to be <?xm
>
> -Bertrand
OK. No problem.
By the way: it is a little bit different on win32. Some tools detect utf
encoding by checking for BOM. If there is none - ANSI encoding is assumed.
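
Roughly, the logic looks like the following Python sketch. This is only an
illustration of the behaviour described above, not the code of any real
editor; `guess_encoding` is a hypothetical name, and using cp1252 as the
"ANSI" code page is an assumption (the actual code page depends on the
Windows locale).

```python
import codecs


def guess_encoding(data: bytes, ansi_fallback: str = "cp1252") -> str:
    """Pick a codec from a leading BOM, else fall back to the ANSI code page."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # decodes UTF-8 and drops the BOM
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"     # the codec reads the BOM to pick the byte order
    return ansi_fallback    # assumption: no BOM means "ANSI"


raw = "d\u00e9c".encode("utf-8")  # "déc" as UTF-8, no BOM
with_bom = codecs.BOM_UTF8 + raw

print(raw.decode(guess_encoding(raw)))            # misread as cp1252
print(with_bom.decode(guess_encoding(with_bom)))  # read correctly as UTF-8
```

Without the BOM the two UTF-8 bytes of the accented letter are shown as two
cp1252 characters, which is the typical symptom on such tools.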