Posted to dev@subversion.apache.org by Adal Chiriliuc <ad...@myrealbox.com> on 2004/03/06 22:04:40 UTC

Unicode byte-order mark (BOM)

There is a problem in the binary/text detector from Subversion 1.0.0 (Win32).
The Unicode standard defines a so-called byte-order mark. This is usually
placed at the beginning of a Unicode plain text file. This marker can
have these representations:

EF BB BF    - UTF-8
FE FF       - UTF-16/UCS-2, big endian
FF FE       - UTF-16/UCS-2, little endian
FF FE 00 00 - UTF-32/UCS-4, little endian
00 00 FE FF - UTF-32/UCS-4, big-endian
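
As an illustration only, here is a minimal Python sketch of checking a
buffer for these signatures. The helper name sniff_bom and the encoding
labels are hypothetical and are not part of Subversion. Per the Unicode
standard, FE FF is the big-endian UTF-16 mark and FF FE the little-endian
one. The 4-byte UTF-32 marks have to be tested before the 2-byte UTF-16
marks, because FF FE 00 00 also starts with FF FE.

```python
# Hypothetical helper, not Subversion code: map a leading BOM to an
# encoding label. Longer signatures come first so that the UTF-32LE mark
# (FF FE 00 00) is not mistaken for the UTF-16LE mark (FF FE).
BOMS = [
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xff\xfe", "UTF-16LE"),
    (b"\xfe\xff", "UTF-16BE"),
]


def sniff_bom(data: bytes):
    """Return the encoding label implied by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


if __name__ == "__main__":
    # "hi" with a little-endian UTF-16 BOM in front.
    print(sniff_bom(b"\xff\xfeh\x00i\x00"))
    # Plain ASCII has no BOM.
    print(sniff_bom(b"hi"))
```

One caveat of this approach: a UTF-16LE file whose first character is
U+0000 would be reported as UTF-32LE, so a BOM check alone is a hint
rather than proof.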

When you save a plain text file as Unicode from Notepad (Windows XP)
it adds this mark at the beginning of the file. But then if you add
that file to a Subversion repository, it's marked as
application/octet-stream. If you remove the byte-order mark and add it
again (under a different name, of course), it doesn't mark it as
application/octet-stream.

More info and some ideas on how to determine if a file is Unicode:
http://msdn.microsoft.com/library/en-us/intl/unicode_42jv.asp
http://msdn.microsoft.com/library/en-us/intl/unicode_81np.asp

Adal Chiriliuc


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Unicode byte-order mark (BOM)

Posted by Philip Martin <ph...@codematters.co.uk>.
Adal Chiriliuc <ad...@myrealbox.com> writes:

> EF BB BF    - UTF-8

Subversion could treat UTF-8 as text, but I'm not so sure about those
below.

> FE FF       - UTF-16/UCS-2, big endian
> FF FE       - UTF-16/UCS-2, little endian
> FF FE 00 00 - UTF-32/UCS-4, little endian
> 00 00 FE FF - UTF-32/UCS-4, big-endian

The problem is that Subversion's internal 3-way merge treats files as
byte streams and splits lines on a \n byte.  If such a byte occurs
anywhere other than the last byte of a multi-byte character the result
could be an invalid file.  Unless the internal diff library is made
multi-byte aware, these encodings need to be treated as binary.
Note: UTF-8 doesn't have this problem, it is safe for Subversion to
treat it as text.
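
To make the concern concrete, here is a small Python sketch (not
Subversion code, and the sample text is made up) that splits encoded data
on the \n byte the way a byte-oriented line splitter would. In UTF-16 the
byte 0x0A can be one half of an unrelated character such as U+010A, so
the split lands in the middle of a code unit. In UTF-8 every byte of a
multi-byte sequence is 0x80 or above, so 0x0A only ever means a real
newline.

```python
# U+010A (LATIN CAPITAL LETTER C WITH DOT ABOVE) followed by a newline.
text = "\u010a\n"

utf16 = text.encode("utf-16-le")  # bytes: 0A 01 0A 00
utf8 = text.encode("utf-8")       # bytes: C4 8A 0A

# Split on the raw newline byte, as a byte-stream line splitter would.
pieces16 = utf16.split(b"\n")
pieces8 = utf8.split(b"\n")

if __name__ == "__main__":
    print(pieces16)  # the 0x0A inside U+010A was treated as a line break
    print(pieces8)   # only the real newline caused a split
```

The UTF-16 pieces have odd lengths and cannot be decoded on their own,
while the UTF-8 piece before the newline is still valid UTF-8.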

Is there an external diff3 program that handles these multi-byte
encodings?

-- 
Philip Martin


Re: Unicode byte-order mark (BOM)

Posted by Greg Hudson <gh...@MIT.EDU>.
On Sat, 2004-03-06 at 17:04, Adal Chiriliuc wrote:
> EF BB BF    - UTF-8
> FE FF       - UTF-16/UCS-2, big endian
> FF FE       - UTF-16/UCS-2, little endian
> FF FE 00 00 - UTF-32/UCS-4, little endian
> 00 00 FE FF - UTF-32/UCS-4, big-endian
> 
> When you save a plain text file as Unicode from Notepad (Windows XP)
> it adds this mark at the beginning of the file. But then if you add
> that file to a Subversion repository, it's marked as
> application/octet-stream. If you remove the byte-order mark and add it
> again (under a different name, of course), it doesn't mark it as
> application/octet-stream.

That's perplexing.  Here's how we determine whether a file is binary
right now:

  /* Right now, this function is going to be really stupid.  It's
     going to examine the first block of data, and make sure that 85%
     of the bytes are such that their value is in the ranges 0x07-0x0D
     or 0x20-0x7F, and that 100% of those bytes is not 0x00.

     If those criteria are not met, we're calling it binary. */

For UTF-8 text, the byte-order marker might nudge the count of non-ASCII
bytes just enough to make the first 1024 bytes less than 85% ASCII, but
most of the time, it shouldn't matter.  For UTF-16 or UTF-32 text, there
are going to be a pile of zero bytes in there anyway, so it will look
binary regardless.
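
For readers who want to experiment, here is a rough Python sketch of the
heuristic as the comment describes it. It is not the actual Subversion
implementation: the function name looks_binary is hypothetical, treating
an empty block as text is an assumption, and the caller is assumed to
pass in only the first block (for example the first 1024 bytes).

```python
def looks_binary(block: bytes) -> bool:
    """Guess whether a block is binary, following the quoted comment.

    A block counts as text only if it has no 0x00 bytes and at least 85%
    of its bytes fall in 0x07-0x0D or 0x20-0x7F.
    """
    if not block:
        return False  # assumption: an empty file is treated as text
    if b"\x00" in block:
        return True
    texty = sum(1 for b in block if 0x07 <= b <= 0x0D or 0x20 <= b <= 0x7F)
    # Integer arithmetic avoids floating-point rounding at the 85% edge.
    return texty * 100 < len(block) * 85


if __name__ == "__main__":
    bom = b"\xef\xbb\xbf"
    print(looks_binary(b"plain ASCII text\n"))
    print(looks_binary(bom + b"hello"))          # very short file with BOM
    print(looks_binary(bom + b"x" * 100))        # longer file with BOM
    print(looks_binary("hello\n".encode("utf-16")))
```

In this sketch the three BOM bytes alone push a block below 85% only when
the block is shorter than 20 bytes, which matches the "most of the time,
it shouldn't matter" estimate. Any UTF-16 encoding of ASCII text trips
the zero-byte rule immediately.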

