Posted to dev@poi.apache.org by Tetsuya Kitahata <te...@apache.org> on 2003/06/17 20:31:39 UTC

Proposal: All DOCUMENTS TO UTF-8

Hello,

I think that some of the current documents (e.g. news.xml, trans/es/*.xml)
use the "ISO-8859-1" encoding. However, my favorite text editor cannot
read them properly: characters such as umlauts and n-tildes come out
garbled.

I would like these particular letters to be converted into
Unicode escape sequences (\uxxxx).

Also, the XML declaration at the top of each file would change from
 <?xml version="1.0" encoding="ISO-8859-1"?>
to
 <?xml version="1.0" encoding="UTF-8"?>

Thirdly, I would like the translation guidelines to be changed
slightly to fit the above.

Any thoughts?

If there are no objections, I think I can do this myself
(just using "native2ascii" with code page 1252).

Sincerely,

-- Tetsuya (tetsuya@apache.org)
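
For readers unfamiliar with the conversion proposed above, here is a rough
Python sketch of what a native2ascii-style pass does: decode the file's bytes
as code page 1252, then write every non-ASCII character as a \uxxxx escape.
The function names are invented for illustration, and this is not the JDK
tool itself, only an imitation of the basic idea. Note that \uxxxx escapes
are understood by the Java compiler and by .properties files; an XML parser
treats them as literal text (XML's own mechanism is a character reference
such as &#xF1;).

```python
import re


def to_unicode_escapes(raw: bytes, source_encoding: str = "cp1252") -> str:
    """Decode raw bytes and replace each non-ASCII character with \\uxxxx.

    Hypothetical helper: it mimics the general idea of native2ascii and is
    not guaranteed to match the real tool's output byte for byte.
    """
    text = raw.decode(source_encoding)
    pieces = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            # Plain ASCII passes through unchanged.
            pieces.append(ch)
        elif cp <= 0xFFFF:
            pieces.append("\\u%04x" % cp)
        else:
            # Outside the Basic Multilingual Plane: emit a surrogate pair,
            # which is how Java represents such characters.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            pieces.append("\\u%04x\\u%04x" % (high, low))
    return "".join(pieces)


def from_unicode_escapes(escaped: str) -> str:
    """Reverse direction for BMP characters only (no surrogate handling)."""
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda m: chr(int(m.group(1), 16)),
        escaped,
    )
```

With this sketch, the cp1252 bytes for "España" become the pure-ASCII string
Espa\u00f1a, and from_unicode_escapes turns that back into "España".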

Re: Proposal: All DOCUMENTS TO UTF-8

Posted by Tetsuya Kitahata <te...@apache.org>.

I see. Thank you all for the comments and opinions.
I'll leave the /trans/xx/ directories as they are.

Sorry for my ignorance. We often use "native2ascii" (and its reverse)
or similar tools when dealing with l10n and i18n, e.g. for Java resource
bundles. However, that might not be common in Western Europe, the US, etc.

Right. I got it.

Again, thank you for all the comments.

Sincerely,

-- Tetsuya (tetsuya@apache.org)

P.S.
For example, my editor automatically combines "ör" into one 'kanji'
character and "ös" into another. This is the so-called 'multi-byte
problem'. Mail clients can deal with these characters properly as long as
the mail headers ("charset") are set appropriately, but text editors
cannot do this well. This means that I may not be able to 'commit' the
es, de, it, ... translations posted in Bugzilla (or as [PATCH]).
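
The effect described in this P.S. can be reproduced in a few lines of Python.
This is only a sketch: the function name is invented, and the sample text
"és" is an assumption chosen for illustration rather than the "ör" pair
mentioned above, because the ISO-8859-1 byte for "é" (0xE9) falls inside the
standard Shift_JIS lead-byte range. Which pairs collapse, and into which
character, depends on the encoding the editor assumes.

```python
def misread(text: str,
            written_as: str = "iso-8859-1",
            read_as: str = "shift_jis") -> str:
    """Encode text in one charset, then decode the bytes as another.

    Hypothetical helper that imitates an editor opening a Latin-1 file
    while assuming a Japanese multi-byte encoding.
    """
    raw = text.encode(written_as)
    # errors="replace" keeps the sketch from raising on byte sequences
    # that are not valid in the assumed encoding.
    return raw.decode(read_as, errors="replace")


# "é" becomes a lead byte and swallows the following "s",
# so two Latin-1 characters come back as a single character.
garbled = misread("és")
print(len("és"), "->", len(garbled))
```

Pure ASCII text survives the round trip unchanged, which is why the problem
only shows up at accented letters.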

---------------------------------------------------------------------

On Fri, 20 Jun 2003 15:15:26 -0400
(Subject: Re: Proposal: All DOCUMENTS TO UTF-8)
"Andrew C. Oliver" <ac...@apache.org> wrote:

> I'm inclined to agree.  Editors that do the \uxxx are not common in
> countries such as Spain, Mexico, the US.  This would constitute a nasty
> barrier to entry that would probably discourage contribution.
> 
> -Andy
> 
> On 6/19/03 6:11 PM, "Rainer Klute" <ra...@epost.de> wrote:
> 
> > On Wed, 18 Jun 2003 03:31:39 +0900 Tetsuya Kitahata <te...@apache.org>
> > wrote:
> >> I think that some of current documents (e.g. news.xml, trans/es/*.xml)
> >> have "ISO-8859-1" encoding style. However, my favorite text editor
> >> can not read them properly (garbled chars) at the point of
> >> "Umlauts" and "Ntildes" etc.
> > 
> > I think each file should have the encoding that fits best for its particular
> > language. So please leave ISO-8859-1 for the western languages and use UTF-8
> > or whatever fits best for Japanese.
> > 
> > Best regards
> > Rainer Klute
> > 
> >                          Rainer Klute IT-Consulting GmbH
> > Dipl.-Inform.
> > Rainer Klute             E-Mail:  klute@rainer-klute.de
> > Körner Grund 24          Telefon: +49 172 2324824
> > D-44143 Dortmund           Telefax: +49 231 5349423
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: poi-dev-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: poi-dev-help@jakarta.apache.org
> > 
> 
> -- 
> Andrew C. Oliver
> http://www.superlinksoftware.com/poi.jsp
> Custom enhancements and Commercial Implementation for Jakarta POI
> 
> http://jakarta.apache.org/poi
> For Java and Excel, Got POI?
> 
> 

-----------------------------------------------------
Tetsuya Kitahata --  Terra-International, Inc.
E-mail: kitahata@bb.mbn.or.jp : tetsuya@apache.org
http://www.terra-intl.com/
(Apache Jakarta Translation, Japanese)
http://jakarta.terra-intl.com/

Re: Proposal: All DOCUMENTS TO UTF-8

Posted by "Andrew C. Oliver" <ac...@apache.org>.

I'm inclined to agree.  Editors that handle \uxxxx escapes are not common
in countries such as Spain, Mexico, or the US.  This would constitute a
nasty barrier to entry that would probably discourage contribution.

-Andy

On 6/19/03 6:11 PM, "Rainer Klute" <ra...@epost.de> wrote:

> On Wed, 18 Jun 2003 03:31:39 +0900 Tetsuya Kitahata <te...@apache.org>
> wrote:
>> I think that some of current documents (e.g. news.xml, trans/es/*.xml)
>> have "ISO-8859-1" encoding style. However, my favorite text editor
>> can not read them properly (garbled chars) at the point of
>> "Umlauts" and "Ntildes" etc.
> 
> I think each file should have the encoding that fits best for its particular
> language. So please leave ISO-8859-1 for the western languages and use UTF-8
> or whatever fits best for Japanese.
> 
> Best regards
> Rainer Klute
> 
>                          Rainer Klute IT-Consulting GmbH
> Dipl.-Inform.
> Rainer Klute             E-Mail:  klute@rainer-klute.de
> Körner Grund 24          Telefon: +49 172 2324824
> D-44143 Dortmund           Telefax: +49 231 5349423
> 
> 

-- 
Andrew C. Oliver
http://www.superlinksoftware.com/poi.jsp
Custom enhancements and Commercial Implementation for Jakarta POI

http://jakarta.apache.org/poi
For Java and Excel, Got POI?

Re: Proposal: All DOCUMENTS TO UTF-8

Posted by Rainer Klute <ra...@epost.de>.

On Wed, 18 Jun 2003 03:31:39 +0900 Tetsuya Kitahata <te...@apache.org> wrote:
> I think that some of current documents (e.g. news.xml, trans/es/*.xml)
> have "ISO-8859-1" encoding style. However, my favorite text editor
> can not read them properly (garbled chars) at the point of 
> "Umlauts" and "Ntildes" etc.

I think each file should have the encoding that fits best for its particular language. So please leave ISO-8859-1 for the western languages and use UTF-8 or whatever fits best for Japanese.

Best regards
Rainer Klute

                           Rainer Klute IT-Consulting GmbH
  Dipl.-Inform.
  Rainer Klute             E-Mail:  klute@rainer-klute.de
  Körner Grund 24          Telefon: +49 172 2324824
D-44143 Dortmund           Telefax: +49 231 5349423

Re: Proposal: All DOCUMENTS TO UTF-8

Posted by Avik Sengupta <av...@apache.org>.

I can see why it would help to have documents in UTF-8 for
internationalisation. However, remember that for CVS, these documents
would have to be checked in as binary, else a big mess will happen. 

On Wed, 2003-06-18 at 01:40, Tetsuya Kitahata wrote:
> Well, this is why I stick to be the translations of the japanese
> off from apache.org.
> If I put the japanese translations (Shift_JIS or EUC_JP) on apache.org,
> I thought I had to convert them all to unicode escape seq style. I know
> how to deal with the unicode escape seq, but I was not sure the others
> could do the same. (Shift_JIS -> escape seq., escape seq. -> Shift_JIS)
> 
> Of course, I am thinking about what you said. So, I proposed
> "'translations guideline should be changed slightly to be fit to it??"
> 
> FYI:
> My favorite text editor is made in japan. Maybe most of the text editors
> created by the japanese can not deal with the Ntildes etc.
> (But, fortunately, my favorite Mail Client can deal with them, made in
> japan)
> 
> It goes for the codebase itself, too. Just Latin vs non-Latin problem.
> 
> Sincerely,
> 
> -- Tetsuya (tetsuya@apache.org)
> 
> P.S. Sorry, I do not want to mail in Kanji to apache.org
> mailing lists. It would be  just noisy. If you wish, I'll do it as a
> personal mail to you, attached.
> 
> ---------------------------------------------------------------------
> 
> On Tue, 17 Jun 2003 21:28:04 +0200
> (Subject: Re: Proposal: All DOCUMENTS TO UTF-8)
> Agustín Martín <ag...@terra.es> wrote:
> 
> > Even though I have no vote, I would say stick with ISO-8859-1 or similar.
> > Writing documentation in xml is fine, but if a translator has to start 
> > converting all his chars (ñ, á, é, í, ó, ú) to unicode, it will be a bit 
> > of a hassle.
> > 
> > ¿Tetsuya, out of curiosity, what is your favourite editor that doesn't 
> > support latin1?
> > 
> > If localized files/docs could use local STANDARD encodings, it would 
> > make life easier on translators. Tetsuya, you could write directly in 
> > kanji (is that the right name?) :-)
> > 
> > My 2 cents,
> >    Agustín.
> 
> 
> 
-- 
Avik Sengupta <av...@apache.org>

Re: Proposal: All DOCUMENTS TO UTF-8

Posted by Tetsuya Kitahata <te...@apache.org>.

Well, this is why I keep the Japanese translations off apache.org.
If I put the Japanese translations (Shift_JIS or EUC_JP) on apache.org,
I thought I would have to convert them all to Unicode escape sequences.
I know how to deal with Unicode escape sequences, but I was not sure that
others could do the same (Shift_JIS -> escape seq., escape seq. -> Shift_JIS).

Of course, I am thinking about what you said. That is why I proposed that
the translation guidelines be changed slightly to fit.

FYI:
My favorite text editor is made in Japan. Probably most text editors
made in Japan cannot deal with n-tildes etc.
(Fortunately, my favorite mail client, also made in Japan, can deal
with them.)

The same goes for the codebase itself. It is just a Latin vs. non-Latin
problem.

Sincerely,

-- Tetsuya (tetsuya@apache.org)

P.S. Sorry, I do not want to send mail in kanji to the apache.org
mailing lists. It would just be noise. If you wish, I'll send it to you
as a personal mail, attached.

---------------------------------------------------------------------

On Tue, 17 Jun 2003 21:28:04 +0200
(Subject: Re: Proposal: All DOCUMENTS TO UTF-8)
Agustín Martín <ag...@terra.es> wrote:

> Even though I have no vote, I would say stick with ISO-8859-1 or similar.
> Writing documentation in xml is fine, but if a translator has to start 
> converting all his chars (ñ, á, é, í, ó, ú) to unicode, it will be a bit 
> of a hassle.
> 
> ¿Tetsuya, out of curiosity, what is your favourite editor that doesn't 
> support latin1?
> 
> If localized files/docs could use local STANDARD encodings, it would 
> make life easier on translators. Tetsuya, you could write directly in 
> kanji (is that the right name?) :-)
> 
> My 2 cents,
>    Agustín.

Re: Proposal: All DOCUMENTS TO UTF-8

Posted by Agustín Martín <ag...@terra.es>.

Even though I have no vote, I would say stick with ISO-8859-1 or similar.
Writing documentation in xml is fine, but if a translator has to start 
converting all his chars (ñ, á, é, í, ó, ú) to unicode, it will be a bit 
of a hassle.

¿Tetsuya, out of curiosity, what is your favourite editor that doesn't 
support latin1?

If localized files/docs could use local STANDARD encodings, it would 
make life easier on translators. Tetsuya, you could write directly in 
kanji (is that the right name?) :-)

My 2 cents,
   Agustín.

Tetsuya Kitahata wrote:

>Hello,
>
>I think that some of current documents (e.g. news.xml, trans/es/*.xml)
>have "ISO-8859-1" encoding style. However, my favorite text editor
>can not read them properly (garbled chars) at the point of 
>"Umlauts" and "Ntildes" etc.
>
>I want these particular letters to be converted into 
>"Unicode Escape Sequence Style" (\uxxxx).
>
>Also, the header lines of the xmls, 
> <?xml version="1.0" encoding="ISO-8859-1"?>
> ->
> <?xml version="1.0" encoding="UTF-8"?>
>
>Thirdly, I want the guideline of translations to be slightly
>changed to be fit in these above.
>
>Any thoughts?
>
>If there's no objections, I think I can do this by myself.
>(Just using "native2ascii" with codepage 1252)
>
>Sincerely,
>
>-- Tetsuya (tetsuya@apache.org)
>
-- 
<Agustin/>

Agustín Martín Barbero
agusmb at netscape dot com
agusmba at terra dot com

Re: Proposal: All DOCUMENTS TO UTF-8

Posted by Tetsuya Kitahata <te...@apache.org>.

I tried to convert them to \uxxxx style, but somehow
I could not get it to work. That might be due to Forrest itself.

So I decided to just change news.xml and casestudies.xml from
ISO-8859-1 to UTF-8 and to use "character mnemonic entities".
I'll leave the /trans/xx/ directories as they are for a while.

Sincerely,

-- Tetsuya (tetsuya@apache.org)
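
As an illustration of the "character mnemonic entities" approach, here is a
minimal Python sketch that rewrites non-ASCII characters as named entities
such as &ntilde;. It is not the procedure actually used on news.xml; the
function name is invented, and it assumes that the document's DTD declares
the usual HTML/Latin-1 entity names, since a bare XML parser only knows
&amp;, &lt;, &gt;, &quot; and &apos;. Characters without a known name fall
back to numeric references, which any XML parser accepts.

```python
from html.entities import codepoint2name


def to_mnemonic_entities(text: str) -> str:
    """Replace each non-ASCII character with a named or numeric reference.

    Hypothetical helper. Markup characters such as & and < are left alone,
    so it is meant for text that is already well-formed XML.
    """
    pieces = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            # ASCII (including existing markup) passes through unchanged.
            pieces.append(ch)
        elif cp in codepoint2name:
            # e.g. U+00F1 -> &ntilde;
            pieces.append("&%s;" % codepoint2name[cp])
        else:
            # No mnemonic available: use a hexadecimal character reference.
            pieces.append("&#x%X;" % cp)
    return "".join(pieces)
```

Because the result is pure ASCII, the file's bytes are the same whether it
is read as ISO-8859-1 or as UTF-8, so only the encoding declaration needs
to change.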

---------------------------------------------------------------------

On Wed, 18 Jun 2003 19:42:45 +0900
(Subject: Re: Proposal: All DOCUMENTS TO UTF-8)
Tetsuya Kitahata <te...@apache.org> wrote:

> 
> On Wed, 18 Jun 2003 01:48:01 -0400
> (Subject: Re: Proposal: All DOCUMENTS TO UTF-8)
> "Andrew C. Oliver" <ac...@apache.org> wrote:
> 
> > Make sure the spanish and the such works...  Aside from that, I don't
> > care.
> 
> Okay, I think I have to care about the Spanish and German Translators
> (and other languages')
> 
> I've thought that folks in Latin area could memorize the unicode escape
> sequence char numbers correspond to those "a few" (or several) extra
> characters. (So, the translators would not feel such a burdon
> relatively, I guessed ... Still I'm not sure.)
> However, we can not put into our poor brains all of the unicode escape
> sequence char numbers correspond to *over one million* extra characters.
> (So, we have to use "native2ascii" or whatever, when put them
> into apache.org server and retrieve them from cvs, I guessed)
> 
> When we think about the issues of internationalization
> (localization) of codebase and docs etc., there might be two
> *BIG* hurdles.
> 1. English (en) vs. non-English Latins (de,it,fr,es ..) problem
> 2. Latins vs. non-Latins (ja,kr,tw,cn ..) problem
> 
> These might be *invisible* but *inevitable*. Please take these into
> consideration, too.
> 
> Also, I want to put this proposal (All DOCUMENTS TO UTF-8) to
> vote in a couple of days. Then, also I want Agustin to participate
> in this vote. Is it all right?
> 
> Sincerely,
> 
> -- Tetsuya (tetsuya@apache.org)
> 
> 

Re: Proposal: All DOCUMENTS TO UTF-8

Posted by Tetsuya Kitahata <te...@apache.org>.

On Wed, 18 Jun 2003 01:48:01 -0400
(Subject: Re: Proposal: All DOCUMENTS TO UTF-8)
"Andrew C. Oliver" <ac...@apache.org> wrote:

> Make sure the spanish and the such works...  Aside from that, I don't
> care.

Okay, I think I have to take care of the Spanish and German translators
(and those of other languages).

I had thought that people in the Latin-script area could memorize the
Unicode escape sequence numbers corresponding to those "few" (or several)
extra characters. (So I guessed that, relatively speaking, the translators
would not find it much of a burden ... Still, I'm not sure.)
However, we cannot fit into our poor brains all of the Unicode escape
sequence numbers corresponding to *over one million* extra characters.
(So I guessed that we have to use "native2ascii" or similar when putting
files on the apache.org server and retrieving them from CVS.)

When we think about internationalization (localization) of the codebase,
docs etc., there may be two *BIG* hurdles.
1. English (en) vs. non-English Latins (de,it,fr,es ..) problem
2. Latins vs. non-Latins (ja,kr,tw,cn ..) problem

These may be *invisible* but *inevitable*. Please take them into
consideration, too.

Also, I want to put this proposal (All DOCUMENTS TO UTF-8) to a
vote in a couple of days. I would also like Agustin to participate
in this vote. Is that all right?

Sincerely,

-- Tetsuya (tetsuya@apache.org)

Re: Proposal: All DOCUMENTS TO UTF-8

Posted by "Andrew C. Oliver" <ac...@apache.org>.

Make sure the Spanish and such still works...  Aside from that, I don't care.

On 6/17/03 2:31 PM, "Tetsuya Kitahata" <te...@apache.org> wrote:

> Hello,
> 
> I think that some of current documents (e.g. news.xml, trans/es/*.xml)
> have "ISO-8859-1" encoding style. However, my favorite text editor
> can not read them properly (garbled chars) at the point of
> "Umlauts" and "Ntildes" etc.
> 
> I want these particular letters to be converted into
> "Unicode Escape Sequence Style" (\uxxxx).
> 
> Also, the header lines of the xmls,
> <?xml version="1.0" encoding="ISO-8859-1"?>
> ->
> <?xml version="1.0" encoding="UTF-8"?>
> 
> Thirdly, I want the guideline of translations to be slightly
> changed to be fit in these above.
> 
> Any thoughts?
> 
> If there's no objections, I think I can do this by myself.
> (Just using "native2ascii" with codepage 1252)
> 
> Sincerely,
> 
> -- Tetsuya (tetsuya@apache.org)
> 
> 
> 

-- 
Andrew C. Oliver
http://www.superlinksoftware.com/poi.jsp
Custom enhancements and Commercial Implementation for Jakarta POI

http://jakarta.apache.org/poi
For Java and Excel, Got POI?