Posted to c-dev@xerces.apache.org by Th...@emc.com on 2008/03/18 15:38:02 UTC

BOM for UTF-8

Hello,
 
Has there been any discussion of, or thought given to, adding a BOM to a
UTF-8 serialized file when the developer explicitly sets the BOM feature?
By default the BOM should not be written, but it is useful for editors
that need to identify the underlying encoding without parsing the first
line.
 
Thanks,
Nicholas Thayer

RE: BOM for UTF-8

Posted by Th...@emc.com.
Thanks.  Sounds good. 

Nicholas D. Thayer
 
-----Original Message-----
From: Alberto Massari [mailto:amassari@datadirect.com] 
Sent: Wednesday, March 19, 2008 2:13 AM
To: c-dev@xerces.apache.org
Subject: Re: BOM for UTF-8

Thayer_Nicholas@emc.com wrote:
> Hello,
>  
> Has there been any discussion or thought to adding a BOM to a UTF-8 
> serialized file if the developer specifically set the BOM feature?  By 
> default, this should not exist, but the BOM is pretty useful for 
> certain editors to correctly identify the underlying encoding if they 
> are not parsing the first line.

Hi Nicholas,
as the "http://apache.org/xml/features/dom/byte-order-mark" feature is 
by default set to "false", it would be safe to print the BOM even when 
UTF-8 is the encoding. We'll make the change for Xerces 3.0.

Alberto


---------------------------------------------------------------------
To unsubscribe, e-mail: c-dev-unsubscribe@xerces.apache.org
For additional commands, e-mail: c-dev-help@xerces.apache.org





Re: BOM for UTF-8

Posted by Alberto Massari <am...@datadirect.com>.
Thayer_Nicholas@emc.com wrote:
> Hello,
>  
> Has there been any discussion or thought to adding a BOM to a UTF-8 
> serialized file if the developer specifically set the BOM feature?  By 
> default, this should not exist, but the BOM is pretty useful for 
> certain editors to correctly identify the underlying encoding if they 
> are not parsing the first line.

Hi Nicholas,
as the "http://apache.org/xml/features/dom/byte-order-mark" feature is 
by default set to "false", it would be safe to print the BOM even when 
UTF-8 is the encoding. We'll make the change for Xerces 3.0.

Alberto
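
The opt-in behaviour discussed above can be sketched without any Xerces
code. This is only an illustration under stated assumptions: the helper
name withOptionalBom and its writeBom parameter are hypothetical and are
not part of the Xerces-C API. The flag defaults to false, in the same
spirit as the byte-order-mark feature defaulting to "false".

```cpp
#include <string>

// The three bytes EF BB BF are the UTF-8 encoding of U+FEFF.
const std::string kUtf8Bom = "\xEF\xBB\xBF";

// Hypothetical helper (not a Xerces-C function): returns the bytes to
// write for a document that is already UTF-8 encoded. The BOM is only
// prepended when the caller explicitly asks for it.
std::string withOptionalBom(const std::string& utf8Xml, bool writeBom = false) {
    return writeBom ? kUtf8Bom + utf8Xml : utf8Xml;
}
```

Callers that never pass writeBom get exactly the bytes they passed in,
so existing output would stay unchanged under this design.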




RE: BOM for UTF-8

Posted by "Sahoglu, Ozgur" <Oz...@intuit.com>.
Hi Nicholas,

 

UTF-8 datastreams can contain a BOM. However, UTF-8 is byte oriented and
always has the same byte order. A BOM can be used as a signature, but it
will make no difference to the endianness of the bytestream. I agree
with you that it may be helpful to some applications to identify the
encoding form.
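
As a rough illustration of the "signature" use described above, a reader
could check for the three bytes EF BB BF at the start of the data. The
function name hasUtf8Bom is made up for this sketch; it only inspects
the leading bytes and does not validate the rest of the stream.

```cpp
#include <string>

// Returns true when the buffer starts with the UTF-8 signature
// EF BB BF. Because UTF-8 is byte oriented, this says nothing about
// byte order; it only hints that the data is UTF-8.
bool hasUtf8Bom(const std::string& bytes) {
    return bytes.size() >= 3 &&
           static_cast<unsigned char>(bytes[0]) == 0xEF &&
           static_cast<unsigned char>(bytes[1]) == 0xBB &&
           static_cast<unsigned char>(bytes[2]) == 0xBF;
}
```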

 

The danger, though, is that some recipients of UTF-8 encoded data do not
expect a BOM. Especially when UTF-8 is used in 8-bit environments, a BOM
will interfere with any protocol or file format that expects specific
ASCII characters at the beginning, such as the "#!" at the beginning of
Unix shell scripts.
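
The "#!" case can be made concrete with a small sketch. Both helper
names below are hypothetical. The point is only that a check for
leading ASCII bytes fails once a BOM is prepended, and that a consumer
which wants to tolerate the BOM has to strip it first.

```cpp
#include <string>

// True when the data begins with the two ASCII bytes "#!".
bool startsWithShebang(const std::string& bytes) {
    return bytes.compare(0, 2, "#!") == 0;
}

// Hypothetical helper: returns the data without a leading UTF-8 BOM
// (EF BB BF); data that has no BOM is returned unchanged.
std::string stripUtf8Bom(const std::string& bytes) {
    if (bytes.compare(0, 3, "\xEF\xBB\xBF") == 0) {
        return bytes.substr(3);
    }
    return bytes;
}
```

With these definitions, a script whose bytes are EF BB BF followed by
"#!/bin/sh" fails startsWithShebang until stripUtf8Bom has been applied.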

 

Cheers,

-Ozgur Sahoglu
