Posted to dev@hc.apache.org by Julius Davies <ju...@gmail.com> on 2006/12/11 20:36:42 UTC

Re: utf8 in the code/comments

Roland,

Sorry to muddy the waters.  It would be really nice if we could assume UTF-8
for everything.  HTML entities are not very readable except when using a
browser!

As part of the https hostname verification I've been working on, I've been
testing with some UTF-8 in the hostname.  I would really prefer to just
write "花子.co.jp" directly in the javadocs instead of
"&#x82b1;&#x5b50;.co.jp".

Can we assume UTF-8 for everything, but try to stick to the 0-127 range of
UTF-8 if at all possible (to be polite)?

-- 
yours,

Julius Davies
416-652-0183
http://juliusdavies.ca/

Re: utf8 in the code/comments

Posted by Roland Weber <ht...@dubioso.net>.
Hi Julius,

> You're right about international domain names using a special
> "punycode" (e.g. "花子.co.jp" actually becomes "xn--i8s592g.co.jp").
> But I can't find any reference to what then goes inside SSL certificate!
> Should it be "花子.co.jp" or "xn--i8s592g.co.jp"?

The latter. You can't expect all certificate validation software
to be punycode-aware. The idea of internationalized domain names
is that the "enabled" software presents an internationalized user
interface, but then converts to punycode so that all the backends
can be used unmodified. Or at least that's my interpretation. The
certificate has to work if somebody uses the punycode directly.
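
A minimal sketch of that interpretation, assuming the client converts the
requested hostname to its ASCII (punycode) form before comparing it with the
name in the certificate. The class and method names are made up for
illustration and are not part of HttpComponents; java.net.IDN is the standard
JDK class (Java 6 and later). Wildcard matching is deliberately left out.

```java
import java.net.IDN;
import java.util.Locale;

// Hypothetical helper, not the project's actual hostname verifier.
final class IdnHostnames {

    /** Converts a possibly internationalized hostname to lower-case ASCII (punycode) form. */
    static String toAsciiForm(String host) {
        // IDN.toASCII leaves labels that are already ASCII unchanged.
        return IDN.toASCII(host).toLowerCase(Locale.ROOT);
    }

    /** Compares the requested hostname with a certificate name after normalizing both. */
    static boolean sameHost(String requested, String certName) {
        return toAsciiForm(requested).equals(toAsciiForm(certName));
    }

    public static void main(String[] args) {
        // The hostname from this thread, written with escapes to keep the source ASCII-only.
        String unicodeHost = "\u82b1\u5b50.co.jp";
        String ascii = toAsciiForm(unicodeHost);
        System.out.println(ascii);                        // an "xn--" label followed by .co.jp
        System.out.println(sameHost(unicodeHost, ascii)); // both forms name the same host
    }
}
```

With this approach a certificate issued for the punycode name matches whether
the user typed the Unicode form or the punycode form.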

cheers,
  Roland


---------------------------------------------------------------------
To unsubscribe, e-mail: httpcomponents-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: httpcomponents-dev-help@jakarta.apache.org


Re: utf8 in the code/comments

Posted by Julius Davies <ju...@cucbc.com>.
Hi, Roland,


> > As part of the https hostname verification I've been working on, I've been
> > testing with some UTF-8 in the hostname.  I would really prefer to just
> > write "花子.co.jp" directly in the javadocs instead of
> > "&#x82b1;&#x5b50;.co.jp".
> 
> Isn't that hostname against all specs? I heard of something called
> punycode for international domain names ;-) (The second kanji means
> "child", right?) 
> 

You're right about international domain names using a special
"punycode" (e.g. "花子.co.jp" actually becomes "xn--i8s592g.co.jp").
But I can't find any reference to what then goes inside SSL certificate!
Should it be "花子.co.jp" or "xn--i8s592g.co.jp"?

I don't read or write Japanese at all, but 花子 is the name "Hanako",
which means flower child.  :-)


> Even if you write that in the source code, that
> doesn't mean everybody has a font installed to display it. Little
> squares indicating "undisplayable character" are even less readable
> than HTML character references.

Good point!


yours,

-- 
Julius Davies
Senior Application Developer, Technology Services
Credit Union Central of British Columbia
http://www.cucbc.com/
Tel: 416-652-0183
Cel: 647-232-7571

1441 Creekside Drive
Vancouver, BC
Canada
V6J 4S7




Re: utf8 in the code/comments

Posted by sebb <se...@gmail.com>.
On 12/12/06, Tatu Saloranta <co...@yahoo.com> wrote:
> --- Roland Weber <ht...@dubioso.net> wrote:
> ...

> As far as I understand this is correct: Subversion
> does (IMO) the smart thing, and does not try to outsmart
> its users. Thus, it doesn't try to do automatic
> linefeed conversion, for example, and I wouldn't
> expect it to try to do encoding changes either.

SVN does do line conversion if you set the property:

svn:eol-style native

This is very useful for text files that may be edited on various
different systems, as it ensures that differences work properly, no
matter which system they are edited or diffed on.

If you do want to apply the property, the file has to be in the
correct native format first, so best to do this on the OS where the
file was last updated.
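
To illustrate the idea only (this is not how Subversion implements it, and
the class below is hypothetical): as I understand it, with "native" the text
is kept in one normalized line-ending form and converted to the platform's
convention in the working copy. A rough sketch of that conversion:

```java
// Hypothetical illustration of line-ending normalization; not Subversion code.
final class EolStyle {

    /** Normalizes CRLF and lone CR line endings to LF. */
    static String toLf(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    /** Converts any mix of line endings to the given end-of-line sequence. */
    static String toNative(String text, String eol) {
        return toLf(text).replace("\n", eol);
    }

    public static void main(String[] args) {
        String mixed = "first\r\nsecond\rthird\n";
        // System.lineSeparator() is the running platform's convention.
        System.out.print(toNative(mixed, System.lineSeparator()));
    }
}
```

A file with mixed line endings cannot be converted consistently by a tool
that expects one style, which is presumably why the file has to be in the
correct native format before the property is set.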

(Dunno about encoding.)

Sebastian



Re: utf8 in the code/comments

Posted by Tatu Saloranta <co...@yahoo.com>.
--- Roland Weber <ht...@dubioso.net> wrote:
...
> I'd like to hear some more opinions on this. Does
> anybody know how
> well Subversion handles UTF-8 text files? No
> automatic conversion
> to local codepages on checkout or other unexpected
> surprises?

As far as I understand this is correct: Subversion
does (IMO) the smart thing, and does not try to outsmart
its users. Thus, it doesn't try to do automatic
linefeed conversion, for example, and I wouldn't
expect it to try to do encoding changes either.

For what it's worth, I would recommend using Unicode
escapes within source code strings for any
non-7-bit-ASCII characters. Comments are a bit trickier,
although with Javadoc comments one can use XML/HTML
character entities (as they are usually rendered to be
viewed using a browser).
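
A small sketch of both forms, assuming a throwaway helper (the class and
method names are invented here) that rewrites non-ASCII text as \uXXXX
escapes for string literals and as numeric character references for Javadoc:

```java
// Hypothetical helper for producing ASCII-only spellings of non-ASCII text.
final class NonAsciiEscaper {

    /** Rewrites every UTF-16 char above 0x7F as a Java unicode escape; ASCII passes through. */
    static String toJavaEscapes(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    /** Rewrites every code point above 0x7F as an HTML numeric character reference. */
    static String toHtmlReferences(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
            i += Character.charCount(cp); // step over surrogate pairs as one code point
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String host = "\u82b1\u5b50.co.jp";
        System.out.println(toJavaEscapes(host));    // form for a string literal
        System.out.println(toHtmlReferences(host)); // form for a Javadoc comment
    }
}
```

Either output can be pasted into a source file that then stays within the
0-127 range.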

-+ Tatu +-



 



Re: utf8 in the code/comments

Posted by Roland Weber <ht...@dubioso.net>.
Hi Julius,

> Sorry to muddy the waters.  It would be really nice if we could assume
> UTF-8
> for everything.  HTML entities are not very readable except when using a
> browser!

Agreed. Up until now, Odi's last name was the only case for non-ASCII
characters though :-)

> As part of the https hostname verification I've been working on, I've been
> testing with some UTF-8 in the hostname.  I would really prefer to just
> write "花子.co.jp" directly in the javadocs instead of
> "&#x82b1;&#x5b50;.co.jp".

Isn't that hostname against all specs? I heard of something called
punycode for international domain names ;-) (The second kanji means
"child", right?) Even if you write that in the source code, that
doesn't mean everybody has a font installed to display it. Little
squares indicating "undisplayable character" are even less readable
than HTML character references.

> Can we assume UTF-8 for everything, but try to stick to the 0-127 range of
> UTF-8 if at all possible (to be polite)?

I'd like to hear some more opinions on this. Does anybody know how
well Subversion handles UTF-8 text files? No automatic conversion
to local codepages on checkout or other unexpected surprises?
Everybody has a UTF-8 compatible editor that will not silently
convert to a different encoding? I'm sure I can figure out how to
use Emacs for that, in due time.

cheers,
  Roland



Re: utf8 in the code/comments

Posted by Julius Davies <ju...@cucbc.com>.
I've just realized why life was going so well for me.  My operating
system is currently set to:

LANG=en_CA.UTF-8

But sheesh, one can't depend on people having the OS set up in a
particular way.  Please forgive my naïveté!  :-p


Our build scripts ultimately end up calling "javac".  We should make
sure all calls to "javac" include the "-encoding UTF-8" option.
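
To show what goes wrong otherwise: without that option javac has
traditionally fallen back to the platform default encoding, so UTF-8 bytes
can be read as if they were some single-byte charset. The sketch below
simulates such a misread with ISO-8859-1 (picked only as an example); the
class name is made up.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration of an encoding mismatch; not part of any build script.
final class EncodingMismatch {

    /** Encodes text as UTF-8 and decodes the bytes as ISO-8859-1, as a misconfigured tool would. */
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String host = "\u82b1\u5b50.co.jp";
        String garbled = misread(host);
        System.out.println(host.length());        // 8: two kanji plus ".co.jp"
        System.out.println(garbled.length());     // 12: each kanji became three chars
        System.out.println(garbled.equals(host)); // false
    }
}
```

Pure ASCII text survives the misread unchanged, which is why the problem only
shows up once somebody commits a non-ASCII character.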

Meanwhile I'm going to fix my recent contribution to clean up lines such
as this:

DEFAULT.verify("花子.co.jp", x509);


yours,

Julius



On Tue, 2006-12-12 at 12:26 +0100, Oleg Kalnichevski wrote:
> On Tue, 2006-12-12 at 09:33 +0100, Ortwin Glück wrote:
> > 
> > Julius Davies wrote:
> > > Roland,
> > > 
> > > Sorry to muddy the waters.  It would be really nice if we could assume 
> > > UTF-8
> > > for everything.
> > 
> > Would be nice. The question is how well does it integrate with our IDEs. 
> > There is no point using UTF-8 for .java files if it breaks every second day.
> > Java files can not specify their encoding inline (contrary to XML for 
> > instance), so the encoding in use must be stored in some other meta-data 
> > place. The Eclipse .project files provide such a place, but are
> > currently not in SVN for all projects. But I know that Roland prefers
> > Emacs (of which I know nothing). Then there is diff/patch, whose
> > behaviour probably depends on environment variables. If we don't want to
> > make development a hell, then please let's use the least common 
> > denominator.
> > 
> > Ortwin
> > 
> 
> I agree with Odi: IDEs tend to be the largest troublemakers when it
> comes to using non-US-ASCII charsets in Java source files. We should
> assume UTF-8 encoding by default but nonetheless make an effort to
> escape all 'funny' characters just in case.
> 
> Oleg
> 
-- 
Julius Davies
Senior Application Developer, Technology Services
Credit Union Central of British Columbia
http://www.cucbc.com/
Tel: 416-652-0183
Cel: 647-232-7571

1441 Creekside Drive
Vancouver, BC
Canada
V6J 4S7




Re: utf8 in the code/comments

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Tue, 2006-12-12 at 09:33 +0100, Ortwin Glück wrote:
> 
> Julius Davies wrote:
> > Roland,
> > 
> > Sorry to muddy the waters.  It would be really nice if we could assume 
> > UTF-8
> > for everything.
> 
> Would be nice. The question is how well does it integrate with our IDEs. 
> There is no point using UTF-8 for .java files if it breaks every second day.
> Java files can not specify their encoding inline (contrary to XML for 
> instance), so the encoding in use must be stored in some other meta-data 
> place. The Eclipse .project files provide such a place, but are
> currently not in SVN for all projects. But I know that Roland prefers
> Emacs (of which I know nothing). Then there is diff/patch, whose
> behaviour probably depends on environment variables. If we don't want to
> make development a hell, then please let's use the least common 
> denominator.
> 
> Ortwin
> 

I agree with Odi: IDEs tend to be the largest troublemakers when it
comes to using non-US-ASCII charsets in Java source files. We should
assume UTF-8 encoding by default but nonetheless make an effort to
escape all 'funny' characters just in case.
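
One way to keep an eye on that, sketched under the assumption that a simple
check is run over the source bytes (the class below is hypothetical, not an
existing build step): report every line that contains a byte outside the
0-127 range, so the character can be escaped by hand.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical check for non-ASCII bytes in a source file's contents.
final class AsciiCheck {

    /** Returns the 1-based numbers of lines holding at least one byte outside 0-127. */
    static List<Integer> nonAsciiLines(byte[] source) {
        List<Integer> lines = new ArrayList<>();
        int line = 1;
        boolean flagged = false; // true once the current line has been reported
        for (byte b : source) {
            if (b == '\n') {
                line++;
                flagged = false;
            } else if ((b & 0x80) != 0 && !flagged) {
                lines.add(line);
                flagged = true;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String sample = "// Author: Ortwin Gl\u00fcck\nclass A {}\n";
        byte[] bytes = sample.getBytes(StandardCharsets.UTF_8);
        System.out.println(nonAsciiLines(bytes)); // only the first line is reported
    }
}
```

Because it works on raw bytes, the check gives the same answer regardless of
which encoding the file is actually in.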

Oleg





Re: utf8 in the code/comments

Posted by Ortwin Glück <od...@odi.ch>.

Julius Davies wrote:
> Roland,
> 
> Sorry to muddy the waters.  It would be really nice if we could assume 
> UTF-8
> for everything.

Would be nice. The question is how well does it integrate with our IDEs. 
There is no point using UTF-8 for .java files if it breaks every second day.
Java files can not specify their encoding inline (contrary to XML for 
instance), so the encoding in use must be stored in some other meta-data 
place. The Eclipse .project files provide such a place, but are
currently not in SVN for all projects. But I know that Roland prefers
Emacs (of which I know nothing). Then there is diff/patch, whose
behaviour probably depends on environment variables. If we don't want to
make development a hell, then please let's use the least common 
denominator.

Ortwin
