Posted to users@subversion.apache.org by David Weintraub <qa...@gmail.com> on 2008/12/16 21:08:42 UTC

Enforcing UTF-8 Coding

I've been given the order to make sure all of our files are encoded in
UTF-8. There are several problems: First of all, we use Eclipse on
Windows as our development platform, and the default in that
application is to use the Windows code page, and setting that is up to
the user.

I could try compiling Java with the -encoding parameter. In Java 1.6,
the compile will fail when it hits a character that isn't properly
encoded. Actually, it fails when it finds bytes it can't interpret in
the given encoding. The encoding could still be wrong, so that bytes are
read as the wrong character, but there's no way for the compiler to know
that.
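
The distinction above can be illustrated with a short Python sketch. It is only an illustration of how strict decoding behaves, not of what javac does internally; the sample strings and variable names are made up for the example.

```python
# "é" saved under the Windows-1252 code page is the single byte 0xE9.
cp1252_bytes = u"caf\u00e9".encode("cp1252")

# Read as UTF-8, that byte starts a sequence that never completes,
# so a strict decoder notices the problem.
try:
    cp1252_bytes.decode("utf-8")
    detected = False
except UnicodeDecodeError:
    detected = True

# The two characters "Ã©" saved under Windows-1252 are the bytes
# 0xC3 0xA9, which also happen to be well-formed UTF-8 for "é".
# Decoding succeeds and silently yields a different character.
ambiguous_bytes = u"\u00c3\u00a9".encode("cp1252")
misread = ambiguous_bytes.decode("utf-8")
```

Here `detected` ends up true, while `misread` is a perfectly valid string that simply isn't what the author typed. Any byte-level check has this blind spot.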

Now comes the question of our HTML, CSS, XML, JavaScript, and XSL
files. Since the compiler doesn't run through these, how can I ensure
that these are properly encoded too?

If I had a way of checking the encoding of files, I could write a
pre-commit hook to reject the commit if the encoding on a file is wrong,
but how do I do this?

--
David Weintraub
qazwart@gmail.com

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=1065&dsMessageId=985227

To unsubscribe from this discussion, e-mail: [users-unsubscribe@subversion.tigris.org].


Re: Enforcing UTF-8 Coding

Posted by B Smith-Mannschott <bs...@gmail.com>.
On Tue, Dec 16, 2008 at 10:08 PM, David Weintraub <qa...@gmail.com> wrote:

> I've been given the order to make sure all of our files are encoded in
> UTF-8. There are several problems: First of all, we use Eclipse on
> Windows as our development platform, and the default in that
> application is to use the Windows code page, and setting that is up to
> the user.
>
> I could try compiling Java with the -encoding parameter. In Java 1.6,
> the compile will fail when it hits a character that isn't properly
> encoded. Actually, it fails when it finds bytes it can't interpret in
> the given encoding. The encoding could still be wrong, so that bytes are
> read as the wrong character, but there's no way for the compiler to know
> that.
>
> Now comes the question of our HTML, CSS, XML, JavaScript, and XSL
> files. Since the compiler doesn't run through these, how can I ensure
> that these are properly encoded too?
>
> If I had a way of checking the encoding of files, I could write a
> pre-commit hook to reject the commit if the encoding on a file is wrong,
> but how do I do this?
>

What I do in my hook script is something like this:

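def probably_utf8(file_like):
    # file_like must yield raw bytes (e.g. a file opened in binary mode).
    try:
        file_like.read().decode("UTF-8")
    except UnicodeDecodeError:
        # Some byte sequence in the file is not well-formed UTF-8.
        return False
    else:
        return True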

This won't catch every theoretically possible violation, but it's more than
good enough to keep everyone honest.
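
For context, here is a rough sketch of how a check like this might be wired into a Subversion pre-commit hook. It assumes the hook is a Python script, that `svnlook` is on the hook's PATH, and that `svnlook changed` prints paths that decode as UTF-8. The extension list and the helper names (`CHECKED_EXTENSIONS`, `is_valid_utf8`, `paths_to_check`, `main`) are hypothetical, chosen for this example only.

```python
import subprocess
import sys

# Hypothetical list of text file types to check; adjust to the project.
CHECKED_EXTENSIONS = (".java", ".html", ".css", ".xml", ".js", ".xsl")


def is_valid_utf8(data):
    """Return True if the given bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def paths_to_check(changed_output):
    """Pick file paths out of `svnlook changed` output.

    Each line carries status flags in the first columns and the path
    from column 4 onward. Deletions and directories are skipped.
    """
    paths = []
    for line in changed_output.splitlines():
        status, path = line[:4], line[4:]
        if status.startswith("D") or path.endswith("/"):
            continue
        if path.lower().endswith(CHECKED_EXTENSIONS):
            paths.append(path)
    return paths


def main(argv):
    # Subversion calls pre-commit with the repository path and the
    # transaction name as its two arguments.
    repos, txn = argv[1], argv[2]
    changed = subprocess.check_output(
        ["svnlook", "changed", "-t", txn, repos]).decode("utf-8")
    bad = [p for p in paths_to_check(changed)
           if not is_valid_utf8(subprocess.check_output(
               ["svnlook", "cat", "-t", txn, repos, p]))]
    for p in bad:
        # Hook output on stderr is relayed to the committing client.
        sys.stderr.write("Not valid UTF-8: %s\n" % p)
    return 1 if bad else 0
```

Saved as the repository's hooks/pre-commit script with a final `sys.exit(main(sys.argv))` line, a non-zero return rejects the commit. Files that are pure ASCII, or whose bytes happen to form valid UTF-8, will still pass regardless of what the editor thought it was writing.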

// ben

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=1065&dsMessageId=985295

