Posted to c-users@xerces.apache.org by Th...@emc.com on 2007/12/31 17:28:21 UTC

UTF-8 BOM generation option

Hello all,
 
I sent this same email to the c-dev list.  Its content applies from both
a user and a dev (mods) perspective, so I'm posting to this list as
well.
 
-----------------
 
I realize that the UTF-8 spec does not require the 3-byte BOM (0xEF 0xBB
0xBF) to be added to a UTF-8 encoded file, but some editors (MS, vim,
etc.) use this BOM when reading the XML file to determine encoding.  The
reality of the situation is that a number of UTF-8 files do contain a
BOM, and this trend seems to be becoming more prevalent over time (at
least with the XML datasets that I have been exposed to over the years).
 
Luckily, Xerces already handles BOM markers for UTF-8 files, so there is
no compatibility issue with Xerces being able to read its own generated
files.
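
For anyone unfamiliar with what "handling the BOM" means on the read
side, here is a minimal sketch in plain C++.  It is not taken from the
Xerces source; the function names hasUtf8Bom and utf8BomLength are made
up for illustration, and the only assumption is that the raw bytes of
the file are available in a std::string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a Xerces API): returns true if the buffer
// begins with the three UTF-8 BOM bytes EF BB BF.
bool hasUtf8Bom(const std::string& bytes) {
    return bytes.size() >= 3 &&
           static_cast<unsigned char>(bytes[0]) == 0xEF &&
           static_cast<unsigned char>(bytes[1]) == 0xBB &&
           static_cast<unsigned char>(bytes[2]) == 0xBF;
}

// Hypothetical helper: number of leading bytes a reader would skip
// before it reaches the XML declaration or the root element.
std::size_t utf8BomLength(const std::string& bytes) {
    return hasUtf8Bom(bytes) ? 3 : 0;
}
```

A reader that tolerates the BOM would simply start decoding at offset
utf8BomLength(bytes) instead of offset 0.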
 
My suggestion is to allow Xerces to generate a BOM for a UTF-8 encoded
file if it is explicitly asked to do so through the serializer (DOMWriter)
by setting the XMLUni::fgDOMWRTBOM feature.  Most people won't set this
feature, so by default generated UTF-8 files would still not contain
the BOM, but with this change the addition of a BOM to generated UTF-8
files would become an option for those who do want it.
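
To make the intended behaviour concrete, here is a rough sketch of the
opt-in logic in plain C++.  It does not use the Xerces API at all: the
Utf8WriteOptions struct and the encodeUtf8Output function are
hypothetical stand-ins for the serializer feature, and the document is
assumed to be already serialized to UTF-8 bytes.

```cpp
#include <cassert>
#include <string>

// Hypothetical options struct (not part of Xerces). writeBom stands in
// for the proposed opt-in feature and is off by default.
struct Utf8WriteOptions {
    bool writeBom = false;  // default: no BOM, same as current output
};

// Hypothetical function: returns the bytes that would be written for a
// document that has already been serialized to UTF-8.
std::string encodeUtf8Output(const std::string& xmlUtf8,
                             const Utf8WriteOptions& opts = Utf8WriteOptions()) {
    static const char kUtf8Bom[] = "\xEF\xBB\xBF";
    std::string out;
    if (opts.writeBom) {
        out.append(kUtf8Bom, 3);  // prepend EF BB BF only when requested
    }
    out += xmlUtf8;
    return out;
}
```

Callers who never touch writeBom get byte-for-byte the same output as
today; only callers who set it to true see the three extra leading
bytes.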
 
Since the Xerces code is well written, the code modifications would be
quite small to accommodate this change.
 
I can make the changes and submit them as a patch request, but first I
would like to generate a discussion about this topic to help determine
what the best implementation should be.  I'd ask that a pragmatic and
realistic viewpoint rather than a hard-line spec viewpoint be adopted,
since BOMs in UTF-8 encoded files are a reality and will not be going
away.
 
Thank you,
_Nicholas

RE: UTF-8 BOM generation option

Posted by Th...@emc.com.

Hello Keith,

Yes, interoperability is a primary concern.  That is why Xerces _must_
still be able to generate UTF-8 without a BOM for those parsers that do
not handle this case, and I believe not emitting the BOM in default
UTF-8 generation is the appropriate solution, given that the spec does
not require the BOM to be present for byte streams.

Therefore, the safest route is to not generate a BOM by default, because
all XML parsers should handle non-BOM UTF-8 XML files.  If the user's
environment is closed enough to verify that the available parsers will
handle UTF-8 files with a BOM, then the option to add the BOM to the
generated files should be available.

My suggestion results in zero impact for most users, but still allows
the option of BOM generation for those who request it.

Regards,
Nicholas

-----Original Message-----
From: Keith Mendoza [mailto:pantherse@gmail.com] 
Sent: Monday, December 31, 2007 11:14 AM
To: c-users@xerces.apache.org
Subject: Re: UTF-8 BOM generation option

Here's my 0.02 on this issue: I think we should look at the safest route
to take with this. As Nicholas stated, files containing this BOM are
becoming more prevalent. So if that's the case, I personally think that
Xerces (both Java and C versions) should just generate the BOM.

However, I also understand that this change could cause a potential
problem. One situation I see is applications using XML for some kind of
inter-process communication, not necessarily XML-RPC or SOAP. Suppose we
have one application using Xerces to parse the XML data it receives, and
another one NOT using Xerces and NOT supporting the 3-byte BOM. If the
Xerces-dependent application transmits the 3-byte BOM, will the other
application handle the data properly or not?

Hope this helps stir up the conversation,
Keith

-- 
www.savedbycuriosity.com

Re: UTF-8 BOM generation option

Posted by Keith Mendoza <pa...@gmail.com>.

Here's my 0.02 on this issue: I think we should look at the safest route to
take with this. As Nicholas stated, files containing this BOM are
becoming more prevalent. So if that's the case, I personally think that
Xerces (both Java and C versions) should just generate the BOM.

However, I also understand that this change could cause a potential problem.
One situation I see is applications using XML for some kind of inter-process
communication, not necessarily XML-RPC or SOAP. Suppose we have one
application using Xerces to parse the XML data it receives, and another one
NOT using Xerces and NOT supporting the 3-byte BOM. If the Xerces-dependent
application transmits the 3-byte BOM, will the other application handle the
data properly or not?
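
To illustrate the kind of breakage I have in mind, here is a small
sketch in plain C++. Both receivers are hypothetical and do not
correspond to any real parser: naiveAccepts models an application that
insists the first byte of the payload is '<', and tolerantAccepts models
one that skips a leading BOM before doing the same check.

```cpp
#include <cassert>
#include <string>

// Hypothetical strict receiver: accepts a payload only if its very
// first byte is '<'.
bool naiveAccepts(const std::string& payload) {
    return !payload.empty() && payload[0] == '<';
}

// Hypothetical tolerant receiver: skips a leading EF BB BF, if present,
// and then applies the same check.
bool tolerantAccepts(const std::string& payload) {
    const bool bom = payload.compare(0, 3, "\xEF\xBB\xBF") == 0;
    return naiveAccepts(bom ? payload.substr(3) : payload);
}
```

With a payload that starts with the BOM, the strict receiver rejects the
data while the tolerant one accepts it, which is exactly the
interoperability gap in question.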

Hope this helps stir up the conversation,
Keith

-- 
www.savedbycuriosity.com