Posted to j-dev@xerces.apache.org by Milind Gadre <mi...@ecplatforms.com> on 2001/01/06 20:17:25 UTC

XML & International Character Sets

A very happy new year to all.

I was wondering if you could help answer a question I have about
multiple char sets in a single XML file.

I would like to use XML to have a single localization file for several
languages. There would be several elements of the form

    <language name="en" country="US">
        <message code="LOGIN_FAILED">Your login attempt failed for username {0}, password {1}</message>
    </language>

    <language name="ja" country="JP">
        <message code="LOGIN_FAILED">--IN JAPANESE--</message>
    </language>

This (single) file with several language elements would be handed off to
translators in each language.
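
The {0} and {1} placeholders in the sample message match the argument syntax of java.text.MessageFormat. The post does not name a formatter, so treating it as MessageFormat is an assumption. A minimal sketch of how such a pattern could be filled in (class name and sample values are hypothetical):

```java
import java.text.MessageFormat;

// Hypothetical helper; assumes the message patterns follow
// java.text.MessageFormat syntax, which the post does not confirm.
class LoginMessageSketch {

    // Substitutes the arguments for {0}, {1}, ... in the pattern.
    static String format(String pattern, Object... args) {
        return MessageFormat.format(pattern, args);
    }

    public static void main(String[] args) {
        String pattern =
            "Your login attempt failed for username {0}, password {1}";
        System.out.println(format(pattern, "alice", "secret"));
    }
}
```

Note that MessageFormat treats a single quote as an escape character, so translated patterns that contain apostrophes would need them doubled.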

The questions I have are the following

[0] Is this a good idea in the first place? The advantage I see is that
things are in one place - potentially leading to some code generation at
some point.

[1] Since the single file contains several languages, how would the file
be saved? I am assuming it would be saved as Unicode - (XML encoding:
UTF-16).

[2] Since the main file is saved as Unicode, it would not be possible
for each element to be in a different format, so do I need to specify an
xml:lang attribute for each language element?

[3] Finally, how do the Xerces-J encodings map into the java.util.Locale
object? Example:

    Xerces-J encoding: Japanese ISO-2022-JP (iso-2022-jp)

    java.util.Locale:
        language: ja ...???...
        country: JP ...???...
        variant: ...???...

Thanks in advance for any help.

Regards...

Milind Gadre
ecPlatforms, Inc
901 Mariner's Island Blvd, Suite 565
San Mateo, CA 94404
C: 510-919-0596
F: 815-352-0779
milind@ecplatforms.com



Re: Why is shift-jis being rejected?

Posted by Andy Clark <an...@apache.org>.
Milind Gadre wrote:
>     encoding="shift-jis"

Use "Shift_JIS". It's not case-sensitive but the dash needs
to be an underscore.
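
A minimal sketch of parsing a document that declares the recommended "Shift_JIS" spelling. It uses the JAXP DocumentBuilder bundled with the JDK (which is derived from Xerces) rather than the Xerces DOMParser class from the original post, and it assumes the JDK's Shift_JIS charset is installed. The class name and sample text are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical sketch: builds a small Shift_JIS document in memory
// and parses it with the JDK's built-in JAXP parser.
class ShiftJisParseSketch {

    // "konnichiwa" in hiragana, written with escapes so the source
    // file itself stays ASCII.
    static final String GREETING = "\u3053\u3093\u306B\u3061\u306F";

    static String parseGreeting() {
        String xml = "<?xml version=\"1.0\" encoding=\"Shift_JIS\"?>"
                   + "<message>" + GREETING + "</message>";
        // Encode the text with the same charset the declaration names.
        byte[] bytes = xml.getBytes(Charset.forName("Shift_JIS"));
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        // True when the parser decoded the Japanese text correctly.
        System.out.println(GREETING.equals(parseGreeting()));
    }
}
```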

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org

Why is shift-jis being rejected?

Posted by Milind Gadre <mi...@ecplatforms.com>.
I have an XML file containing a mix of Japanese characters and regular
English tags. The file has

    encoding="shift-jis"

This XML file displays perfectly in Internet Explorer 5. However, when I
try to parse the file using Xerces DOMParser, I get

    org.xml.sax.SAXParseException: The encoding "shift-jis" is not supported.

The Xerces-J FAQ lists shift-jis as one of the supported encodings.

Any idea what is going on?

Do I need the 'i18n.jar' file (supplied as part of JDK1.2.2) in my
classpath?

Regards...

Milind Gadre
ecPlatforms, Inc
901 Mariner's Island Blvd, Suite 565
San Mateo, CA 94404
C: 510-919-0596
F: 815-352-0779
milind@ecplatforms.com



Re: XML & International Character Sets

Posted by Milind Gadre <mi...@ecplatforms.com>.
Andy, thanks a ton for the detailed reply. After doing some research I
had come to some of the same conclusions, but your email opened my eyes
to other possibilities, especially the use of entity references.

Regards...

Milind Gadre
ecPlatforms, Inc
901 Mariner's Island Blvd, Suite 565
San Mateo, CA 94404
C: 510-919-0596
F: 815-352-0779
milind@ecplatforms.com



Re: XML & International Character Sets

Posted by Andy Clark <an...@apache.org>.
Milind Gadre wrote:
> [0] Is this a good idea in the first place? The advantage I see is that
> things are in one place - potentially leading to some code generation at
> some point.

It's totally up to you. I don't see a problem with it
unless you 1) have a very large file and maintaining such
a large file becomes cumbersome, or 2) you have multiple
people needing to work on the same file at the same time.

Perhaps a better idea would be to separate the various 
language translations into separate files and use entity
references in the main file to pull all of them together.
For example:

  <?xml version='1.0'?>
  <!-- MAIN TRANSLATION FILE -->
  <!DOCTYPE translations SYSTEM 'translations.dtd' [
   <!ENTITY english  SYSTEM 'messages_en_US.ent'>
   <!ENTITY japanese SYSTEM 'messages_ja_JP.ent'>
  ]>
  <translations>
   &english;
   &japanese;
  </translations>

  <?xml encoding='ISO-8859-1'?>
  <!-- English translations -->
  <language name='en' country='US'>
   <message code='LOGIN_FAILED'> ... </message>
  </language>
  
  <?xml encoding='Shift_JIS'?>
  <!-- Japanese translations -->
  <language name='ja' country='JP'>
   <message code='LOGIN_FAILED'> ... </message>
  </language>

This approach allows you to keep each language separate and
still use the most natural encoding method for that language
in each file. Note that the actual translations are kept in
XML entities -- they are *not* XML documents by themselves.
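
The layout above can be tried end to end with a sketch like the following. It writes the three files to a temporary directory and parses the main file with the JDK's JAXP DocumentBuilder. Assumptions: the external 'translations.dtd' is left out so the example is self-contained, the message texts are placeholders, the JDK's default JAXP settings (which allow external entities) are in effect, and the class name is hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical sketch: one main file pulls in two external entities,
// each stored in its own encoding.
class EntityMergeSketch {

    static void write(Path file, String text, String charset) throws Exception {
        Files.write(file, text.getBytes(Charset.forName(charset)));
    }

    static int countMessages() {
        try {
            // Temporary files are left behind; fine for a sketch.
            Path dir = Files.createTempDirectory("translations");
            write(dir.resolve("messages_en_US.ent"),
                  "<?xml encoding='ISO-8859-1'?>"
                + "<language name='en' country='US'>"
                + "<message code='LOGIN_FAILED'>Login failed</message>"
                + "</language>", "ISO-8859-1");
            write(dir.resolve("messages_ja_JP.ent"),
                  "<?xml encoding='Shift_JIS'?>"
                + "<language name='ja' country='JP'>"
                + "<message code='LOGIN_FAILED'>"
                + "\u30ED\u30B0\u30A4\u30F3\u5931\u6557</message>"
                + "</language>", "Shift_JIS");
            Path main = dir.resolve("translations.xml");
            write(main,
                  "<?xml version='1.0'?>"
                + "<!DOCTYPE translations ["
                + " <!ENTITY english  SYSTEM 'messages_en_US.ent'>"
                + " <!ENTITY japanese SYSTEM 'messages_ja_JP.ent'>"
                + "]>"
                + "<translations>&english;&japanese;</translations>", "UTF-8");
            // Relative entity paths resolve against the main file's location.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(main.toFile());
            return doc.getElementsByTagName("message").getLength();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countMessages() + " messages merged");
    }
}
```

Each entity file starts with a text declaration that names only its encoding, which is how the parser knows to decode the two files differently.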

> [1] Since the single file contains several languages, how would the file
> be saved? I am assuming it would be saved as Unicode - (XML encoding:
> UTF-16).

Both UTF-8 and UTF-16 are capable of encoding all of the 
codepoints in Unicode. And if you roll all of your
translations into a single file, then you have to use
one of these two encodings. However, if you keep them
separate, then each file can have its own encoding as I
mentioned previously.
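
A small sketch of the difference: mixed English and Japanese text survives a round trip through UTF-8 and UTF-16, but not through a single-language encoding such as ISO-8859-1. The class name and sample string are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: checks whether a charset can represent a string
// by encoding it to bytes and decoding it again.
class EncodingRoundTripSketch {

    static boolean roundTrips(String text, Charset charset) {
        // Characters the charset cannot encode are replaced on the way
        // out, so the decoded result no longer equals the input.
        return new String(text.getBytes(charset), charset).equals(text);
    }

    public static void main(String[] args) {
        String mixed = "Login failed / \u30ED\u30B0\u30A4\u30F3\u5931\u6557";
        System.out.println("UTF-8:      " + roundTrips(mixed, StandardCharsets.UTF_8));
        System.out.println("UTF-16:     " + roundTrips(mixed, StandardCharsets.UTF_16));
        System.out.println("ISO-8859-1: " + roundTrips(mixed, StandardCharsets.ISO_8859_1));
    }
}
```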

> [2] Since the main file is saved as Unicode, it would not be possible
> for each element to be in a different format, so do I need to specify an
> xml:lang attribute for each language element?

You don't have to but it's always nice to use xml:lang to
store that information. Remember that xml:lang is only a
hint to the application -- the parser does nothing with it.

> [3] Finally, how do the Xerces-J encodings map into the java.util.Locale
> object? Example:

Any I18N experts out there who know this off the top of their
heads?
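
One possible angle, offered as a sketch rather than an answer: an encoding name describes how characters are stored as bytes, not which language or region the text belongs to, so it may be simpler to build the Locale from the name and country attributes already present on each <language> element. The helper and class name below are hypothetical.

```java
import java.util.Locale;

// Hypothetical sketch: derives the Locale from the <language> element's
// name/country attributes instead of from the file's encoding.
class LocaleFromAttributesSketch {

    static Locale toLocale(String language, String country) {
        // The variant is left empty; the sample markup carries none.
        return new Locale.Builder()
                .setLanguage(language)
                .setRegion(country)
                .build();
    }

    public static void main(String[] args) {
        Locale japanese = toLocale("ja", "JP");
        System.out.println(japanese);               // language_COUNTRY form
        System.out.println(japanese.getLanguage()); // language code only
        System.out.println(japanese.getCountry());  // country code only
    }
}
```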

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org

RE: XML & International Character Sets

Posted by Jay Cain <ja...@cett.msstate.edu>.
Milind,

A0:
Generally, this is a very good idea. Having all the interpretations in one
file will make it easier to use XML technologies such as XSLT and XLinks.

A1:
You will probably need to use the UTF-16 encoding, but that depends on what
languages are used.

A2:
The xml:lang attribute was defined for exactly this situation. The xml:lang
attribute is not required to have mixed languages in an XML document, but it
is the standard way to denote the language of an element's content. You
may consider using the form

  <message xml:lang="en-US" code="LOGIN_FAILED">...</message>

and dropping the <language> element altogether.
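
Because the parser does nothing with xml:lang itself, the application has to read the attribute. A minimal sketch, assuming the JDK's JAXP DocumentBuilder and treating the attribute value as a language tag that Locale.forLanguageTag can interpret (the class name and message text are hypothetical):

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical sketch: reads xml:lang from a <message> element and
// turns it into a java.util.Locale.
class XmlLangSketch {

    static Locale localeOf(String xml) {
        try {
            Element message = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            // xml:lang values such as "en-US" are language tags.
            return Locale.forLanguageTag(message.getAttribute("xml:lang"));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<message xml:lang=\"en-US\" code=\"LOGIN_FAILED\">"
                   + "Login failed</message>";
        Locale locale = localeOf(xml);
        System.out.println(locale.getLanguage() + " / " + locale.getCountry());
    }
}
```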

A3:
Sorry, can't help you here.

- - - - -
Jay Cain
Center for Educational and Training Technology
Mississippi State University