
Posted to l10n@openoffice.apache.org by janI <ja...@apache.org> on 2013/03/16 00:47:30 UTC

Language codes ???

Hi

I am (as usual) confused. I have merged translation files from our sources,
SDF files and Pootle, and I have the following codes (directories):

af ar as ast be_BY bg bn bo br brx bs ca ca_XV cs cy da de dgo dz el
en_AU en_GB en_US en_ZA eo es et eu fa fi fr fur ga gd gl gu he hi hr
hu id is it ja jbo ka kab kk km kn ko kok ks ku ky lt lv mai mk ml mn
mni mr my nb ne nl nn nr nso ny oc om or pa_IN pap pl ps pt pt_BR pyg
ro ru rw sa_IN sat sc sd sh si sk sl so son sq sr ss st sv sw_TZ ta te
tg th tk tlh tn tr ts ug uk uz ve vi xh zh_CN zh_TW zu

(All the po files are available in "branches/l10n/main/l10ntools/lang" once
svn is finished)

Where can I find the relation between the directory names and the
languages (human names)? Someone (I think Andrea) mentioned they were
country codes.

I expected dialects within a language to be written as e.g. es_XX, and I
know there is an ongoing effort on translating to
   Catalan, Euskadi and Gallego
but I cannot find them, so I am afraid I have missed something for my test.
(Personal note to my friends up north: sorry for using the word "dialect",
but Spain is still one country.)

I am also a bit puzzled about pt_BR and ca_XV (google just gave me LO as
answer).

thanks a lot in advance for any help.
rgds
jan I.

Re: Language codes ???

Posted by Andrea Pescetti <pe...@apache.org>.

On 16/03/2013 janI wrote:
> 3 possibilities when inserting a language message that has not been
> translated:
> 1) Do not insert the message for this language
> 2) Insert the message with an empty string
> 3) Replace the string with the en-US string and insert that
> I think 3) is the most correct approach. Or is there an automatic fallback
> for non-existing strings, so that 1) would be the correct way?

Option 3 is surely the current and expected outcome, i.e., if a string 
is not translated then the English string is used instead.

But I don't know how we handle it internally now: whether a missing 
translation is automatically replaced by the English original at build 
time (which would suit your option 1), or whether we need to explicitly 
put the English version in place of the translation while leaving the PO 
files untranslated (option 3). If the replacement is automatic, then 
option 1 seems cleaner: just leave untranslated what is untranslated.
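
To make option 3 concrete, here is a minimal Python sketch of an explicit fallback. The data layout (a dictionary of message ids mapping to per-language texts) and the function name are assumptions made for illustration; this is not how genLang stores or writes its data.

```python
# Hypothetical layout: {message_id: {language_code: text}}.
# Not genLang's real data model, only an illustration of option 3.
FALLBACK_LANG = "en-US"


def fill_untranslated(messages, languages, fallback=FALLBACK_LANG):
    """Return a copy in which every missing or empty translation is
    replaced by the fallback-language text."""
    filled = {}
    for msg_id, texts in messages.items():
        source_text = texts[fallback]
        filled[msg_id] = {
            # texts.get(lang) is None when missing and "" when empty;
            # both are falsy, so "or" substitutes the English text.
            lang: texts.get(lang) or source_text
            for lang in languages
        }
    return filled


messages = {
    "FILE_OPEN": {"en-US": "Open", "de": "Öffnen", "da": ""},
    "FILE_SAVE": {"en-US": "Save", "de": "Speichern"},
}
result = fill_untranslated(messages, ["en-US", "de", "da"])
print(result["FILE_OPEN"]["da"])
print(result["FILE_SAVE"]["de"])
```

Under option 1 the loop would simply skip languages with no text, which only works if something downstream supplies the English string automatically.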

Regards,
   Andrea.

---------------------------------------------------------------------
To unsubscribe, e-mail: l10n-unsubscribe@openoffice.apache.org
For additional commands, e-mail: l10n-help@openoffice.apache.org

Re: Language codes ???

Posted by Aivaras Stepukonis <as...@gmail.com>.

In that case, I think it is better to offer information in English than 
none at all (this way the user retains the option of improvising a 
translation on his/her own).

Sincerely,

Aivaras

2013.03.16 12:50, janI wrote:
> On 16 March 2013 10:51, Andrea Pescetti <pe...@apache.org> wrote:
>
>> janI wrote:
>>
>>> I have the following codes (directories):
>>> af brx dz eu he ka ky my om ro ...
>>>
>>> Where  can I find the relation between the directory names and the
>>> languages (human names), someone (I think andrea) mentioned it was country
>>> codes ?
>>>
>> We don't use country codes, we rely on the LANGUAGE codes, which are ISO
>> standards. So, in general:
>> - if it is a two-letter code, look it up in ISO 639-1:
>> http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes ("af" -> "Afrikaans")
>> - if it is a three-letter code, use ISO 639-2 or (more complete, extends
>> 639-2) 639-3: http://en.wikipedia.org/wiki/List_of_ISO_639-3_codes ("pap" -> "Papiamento")
>>
>>
>>   I expected dialects within a language to be written as e.g. es_XX, and I
>>> know there is an ongoing effort on translating to
>>>      Catalan Euskadi and Gallego
>>>
>> No, this would be a dangerous approach! There is a lot of "political
>> correctness" at work here. Everything that is in ISO is a language. So all
>> languages spoken in Spain have equal dignity and their own codes. Catalan
>> is "ca", Basque/Euskadi is "eu", Gallego is "gl" and you listed all three
>> of them.
>>
>>
>>   I am also a bit puzzled about pt_BR and ca_XV
>> These are extensions made to accommodate language variants. Languages in
>> the form '[a-z]*_[A-Z]*' are an internal convention to be read as:
>> language_PLACE. So en_US means "English, as spoken in the US"; en_GB =
>> "English, as spoken in Great Britain"; pt_BR = "Portuguese, as spoken in
>> Brazil"; ca_XV = "Catalan, as spoken in Valencia [or Comunidad
>> Valenciana]". zh_CN and zh_TW are often called "simplified" and
>> "traditional" Chinese, instead of being linked to China and Taiwan as the
>> two codes would mean.
>>
> Thanks a lot for a very thorough answer.
>
> Most of our languages are not 100% translated, meaning a lot of strings are
> empty. When genLang generates source files with all languages (as today), I
> have 3 possibilities when inserting a language message that has not been
> translated:
>
> 1) Do not insert the message for this language
> 2) Insert the message with an empty string
> 3) Replace the string with the en-US string and insert that
>
> I think 3) is the most correct approach. Or is there an automatic fallback
> for non-existing strings, so that 1) would be the correct way?
>
>
> PS: this does of course not affect the .po files; they stay untranslated.
>
>> Regards,
>>    Andrea.

Re: Language codes ???

Posted by janI <ja...@apache.org>.

On 16 March 2013 10:51, Andrea Pescetti <pe...@apache.org> wrote:

> janI wrote:
>
>> I have the following codes (directories):
>> af brx dz eu he ka ky my om ro ...
>>
>> Where  can I find the relation between the directory names and the
>> languages (human names), someone (I think andrea) mentioned it was country
>> codes ?
>>
>
> We don't use country codes, we rely on the LANGUAGE codes, which are ISO
> standards. So, in general:
> - if it is a two-letter code, look it up in ISO 639-1:
> http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes ("af" -> "Afrikaans")
> - if it is a three-letter code, use ISO 639-2 or (more complete, extends
> 639-2) 639-3: http://en.wikipedia.org/wiki/List_of_ISO_639-3_codes ("pap" -> "Papiamento")
>
>
>  I expected dialects within a language to be written as e.g. es_XX, and I
>> know there is an ongoing effort on translating to
>>     Catalan Euskadi and Gallego
>>
>
> No, this would be a dangerous approach! There is a lot of "political
> correctness" at work here. Everything that is in ISO is a language. So all
> languages spoken in Spain have equal dignity and their own codes. Catalan
> is "ca", Basque/Euskadi is "eu", Gallego is "gl" and you listed all three
> of them.
>
>
>  I am also a bit puzzled about pt_BR and ca_XV
>>
>
> These are extensions made to accommodate language variants. Languages in
> the form '[a-z]*_[A-Z]*' are an internal convention to be read as:
> language_PLACE. So en_US means "English, as spoken in the US"; en_GB =
> "English, as spoken in Great Britain"; pt_BR = "Portuguese, as spoken in
> Brazil"; ca_XV = "Catalan, as spoken in Valencia [or Comunidad
> Valenciana]". zh_CN and zh_TW are often called "simplified" and
> "traditional" Chinese, instead of being linked to China and Taiwan as the
> two codes would mean.
>
Thanks a lot for a very thorough answer.

Most of our languages are not 100% translated, meaning a lot of strings are
empty. When genLang generates source files with all languages (as today), I
have 3 possibilities when inserting a language message that has not been
translated:

1) Do not insert the message for this language
2) Insert the message with an empty string
3) Replace the string with the en-US string and insert that

I think 3) is the most correct approach. Or is there an automatic fallback
for non-existing strings, so that 1) would be the correct way?


PS: this does of course not affect the .po files; they stay untranslated.

>
> Regards,
>   Andrea.

Re: Language codes ???

Posted by Xuacu <xu...@gmail.com>.

Hi!

2013/3/16 Andrea Pescetti <pe...@apache.org>:

A good explanation about ISO codes for languages!

>
>> I expected dialects within a language to be written as e.g. es_XX, and I
>> know there is an ongoing effort on translating to
>>     Catalan Euskadi and Gallego
>
>
> No, this would be a dangerous approach! There is a lot of "political
> correctness" at work here. Everything that is in ISO is a language. So all
> languages spoken in Spain have equal dignity and their own codes. Catalan is
> "ca", Basque/Euskadi is "eu", Gallego is "gl" and you listed all three of
> them.
>

And there are also the Asturian (ast) and the Aragonese (an)
languages, often forgotten because they don't have an official legal
status in Spain. But still we exist ;)

All the best
--
Xuacu

Re: Language codes ???

Posted by janI <ja...@apache.org>.

On Mar 19, 2013 1:45 PM, "Claudio Filho" <fi...@gmail.com> wrote:
>
> Hi
>
> 2013/3/19 Jürgen Schmidt <jo...@gmail.com>:
> > I think we have a mix of both, which was confusing to me as well at the
> > beginning. Pootle seems to use "_", whereas in the office sources
> > ("extras/l10n/source/...") we use "-", as we do for the language selection
> > in configure: --with-lang="en-US de es pt-BR ...".
>
> In other software (I remember Mozilla), they use "_".
>
> IMHO, we can change from "-" to "_", but we need to evaluate the cost
> of the change. Maybe open a branch only for adaptation for l10n, like janI
> is doing.

No need to; I think it is part of my integration, or maybe a second phase.

>
> Cheers,
> Claudio

Re: Language codes ???

Posted by Claudio Filho <fi...@gmail.com>.

Hi

2013/3/19 Jürgen Schmidt <jo...@gmail.com>:
> I think we have a mix of both, which was confusing to me as well at the
> beginning. Pootle seems to use "_", whereas in the office sources
> ("extras/l10n/source/...") we use "-", as we do for the language selection
> in configure: --with-lang="en-US de es pt-BR ...".

In other software (I remember Mozilla), they use "_".

IMHO, we can change from "-" to "_", but we need to evaluate the cost
of the change. Maybe open a branch only for adaptation for l10n, like janI
is doing.

Cheers,
Claudio

Re: Language codes ???

Posted by Jürgen Schmidt <jo...@gmail.com>.

On 3/18/13 11:06 PM, Rob Weir wrote:
> On Mon, Mar 18, 2013 at 4:10 PM, Andrea Pescetti <pe...@apache.org> wrote:
>> Rob Weir wrote:
>>>
>>> Do you know why we don't just follow the IETF's recommendations in
>>> this area?  They have a similar scheme, BCP 47, but use a hyphen
>>> rather than underscore, e.g., en-US, pt-BR.  This is what is used on
>>> the web in general, e.g., in HTTP headers.
>>> See:   http://www.rfc-editor.org/bcp/bcp47.txt
>>
>>
>> I have absolutely no idea, probably it just happened that someone chose a
>> convention for OpenOffice.
>>
> 
> If it is possible to synch up on the BCP 47 standard, it might have
> some advantages.  For example, it should make recommending a specific
> download for AOO very easy.  Most browsers put the user's locale into
> the HTTP request header "Accept-Language" using the BCP 47 format.
> They can even put multiple, prioritized languages.  For example, I.E.
> can send something like this:
> 
> Accept-Language: fr-FR,de-DE;q=0.5
> 
> That means it prefers French (with default weight q=1.0) but will also
> accept German, but with a lower weight.
> 
> If we were consistent with how we tag the languages, we could make
> better recommendations for users whose 1st language we don't support,
> using the same logic that websites do today.

I think we have a mix of both, which was confusing to me as well at the
beginning. Pootle seems to use "_", whereas in the office sources
("extras/l10n/source/...") we use "-", as we do for the language selection
in configure: --with-lang="en-US de es pt-BR ...".
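
As a rough illustration of the two spellings, the following Python sketch converts underscore-style directory names to the hyphen style and builds a --with-lang value from them. The helper names and the sample list are made up for this example; nothing here is taken from the actual build scripts.

```python
def to_hyphen(code):
    """Underscore style (as Pootle seems to use) -> hyphen style."""
    return code.replace("_", "-")


def to_underscore(code):
    """Hyphen style -> underscore style."""
    return code.replace("-", "_")


# Hypothetical sample of directory names in the underscore style.
pootle_dirs = ["en_US", "de", "es", "pt_BR"]

with_lang = " ".join(to_hyphen(d) for d in pootle_dirs)
print(f'--with-lang="{with_lang}"')
```

Codes without a place part, such as "de", pass through both helpers unchanged.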

Juergen


> 
> -Rob
> 
>>
>>> They even take it a step further, which might be useful in some cases.
>>> For example:  sr-Latn-RS means Serbian language written in Latin
>>> script, as used in Serbia.
>>
>>
>> In this case we have both, and we call them "sh" and "sr":
>> http://www.openoffice.org/download/legacy/other.html
>> But indeed we wouldn't be able to use this trick in other, similar cases.
>>
>>
>> Regards,
>>   Andrea.

Re: Language codes ???

Posted by Rob Weir <ro...@apache.org>.

On Mon, Mar 18, 2013 at 4:10 PM, Andrea Pescetti <pe...@apache.org> wrote:
> Rob Weir wrote:
>>
>> Do you know why we don't just follow the IETF's recommendations in
>> this area?  They have a similar scheme, BCP 47, but use a hyphen
>> rather than underscore, e.g., en-US, pt-BR.  This is what is used on
>> the web in general, e.g., in HTTP headers.
>> See:   http://www.rfc-editor.org/bcp/bcp47.txt
>
>
> I have absolutely no idea, probably it just happened that someone chose a
> convention for OpenOffice.
>

If it is possible to sync up with the BCP 47 standard, it might have
some advantages.  For example, it should make recommending a specific
download for AOO very easy.  Most browsers put the user's locale into
the HTTP request header "Accept-Language" using the BCP 47 format.
They can even list multiple, prioritized languages.  For example, I.E.
can send something like this:

Accept-Language: fr-FR,de-DE;q=0.5

That means it prefers French (with default weight q=1.0) but will also
accept German, but with a lower weight.

If we were consistent with how we tag the languages, we could make
better recommendations for users whose 1st language we don't support,
using the same logic that websites do today.
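
A minimal Python sketch of that logic follows. It parses an Accept-Language value into weighted tags and picks the first one we could serve, falling back from "de-DE" to plain "de". The function names and the list of available builds are assumptions for illustration, and the parser is simplified (it only understands the q parameter).

```python
def parse_accept_language(header):
    """Return (tag, weight) pairs, highest weight first."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        weight = 1.0  # default weight when no q= is given
        params = params.strip()
        if params.startswith("q="):
            try:
                weight = float(params[2:])
            except ValueError:
                weight = 0.0  # treat a malformed weight as "not wanted"
        prefs.append((tag.strip(), weight))
    # sorted() is stable, so equal weights keep their header order.
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


def pick_language(header, available):
    """Pick the best available code, or None if nothing matches."""
    lookup = {code.lower(): code for code in available}
    for tag, weight in parse_accept_language(header):
        if weight <= 0:
            continue
        tag = tag.lower()
        if tag in lookup:
            return lookup[tag]
        primary = tag.split("-")[0]  # "de-de" -> "de"
        if primary in lookup:
            return lookup[primary]
    return None


AVAILABLE = ["en-US", "de", "pt-BR"]  # hypothetical set of builds
print(parse_accept_language("fr-FR,de-DE;q=0.5"))
print(pick_language("fr-FR,de-DE;q=0.5", AVAILABLE))
```

With the header above and no French build in the list, the sketch recommends the German one, which is the "1st language we don't support" case.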

-Rob

>
>> They even take it a step further, which might be useful in some cases.
>> For example:  sr-Latn-RS means Serbian language written in Latin
>> script, as used in Serbia.
>
>
> In this case we have both, and we call them "sh" and "sr":
> http://www.openoffice.org/download/legacy/other.html
> But indeed we wouldn't be able to use this trick in other, similar cases.
>
>
> Regards,
>   Andrea.

Re: Language codes ???

Posted by Andrea Pescetti <pe...@apache.org>.

Rob Weir wrote:
> Do you know why we don't just follow the IETF's recommendations in
> this area?  They have a similar scheme, BCP 47, but use a hyphen
> rather than underscore, e.g., en-US, pt-BR.  This is what is used on
> the web in general, e.g., in HTTP headers.
> See:   http://www.rfc-editor.org/bcp/bcp47.txt

I have absolutely no idea; probably someone simply chose a convention 
for OpenOffice.

> They even take it a step further, which might be useful in some cases.
> For example:  sr-Latn-RS means Serbian language written in Latin
> script, as used in Serbia.

In this case we have both, and we call them "sh" and "sr":
http://www.openoffice.org/download/legacy/other.html
But indeed we wouldn't be able to use this trick in other, similar cases.

Regards,
   Andrea.

Re: Language codes ???

Posted by Rob Weir <ro...@apache.org>.

On Sat, Mar 16, 2013 at 5:51 AM, Andrea Pescetti <pe...@apache.org> wrote:
> janI wrote:
>>
>> I have the following codes (directories):
>> af brx dz eu he ka ky my om ro ...
>>
>> Where  can I find the relation between the directory names and the
>> languages (human names), someone (I think andrea) mentioned it was country
>> codes ?
>
>
> We don't use country codes, we rely on the LANGUAGE codes, which are ISO
> standards. So, in general:
> - if it is a two-letter code, look it up in ISO 639-1:
> http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes  ("af" -> "Afrikaans")
> - if it is a three-letter code, use ISO 639-2 or (more complete, extends
> 639-2) 639-3: http://en.wikipedia.org/wiki/List_of_ISO_639-3_codes ("pap" ->
> "Papiamento")
>
>
>> I expected dialects within a language to be written as e.g. es_XX, and I
>> know there is an ongoing effort on translating to
>>     Catalan Euskadi and Gallego
>
>
> No, this would be a dangerous approach! There is a lot of "political
> correctness" at work here. Everything that is in ISO is a language. So all
> languages spoken in Spain have equal dignity and their own codes. Catalan is
> "ca", Basque/Euskadi is "eu", Gallego is "gl" and you listed all three of
> them.
>
>
>> I am also a bit puzzled about pt_BR and ca_XV
>
>
> These are extensions made to accommodate language variants. Languages in the
> form '[a-z]*_[A-Z]*' are an internal convention to be read as:
> language_PLACE. So en_US means "English, as spoken in the US"; en_GB =
> "English, as spoken in Great Britain"; pt_BR = "Portuguese, as spoken in
> Brazil"; ca_XV = "Catalan, as spoken in Valencia [or Comunidad Valenciana]".
> zh_CN and zh_TW are often called "simplified" and "traditional" Chinese,
> instead of being linked to China and Taiwan as the two codes would mean.
>

Do you know why we don't just follow the IETF's recommendations in
this area?  They have a similar scheme, BCP 47, but use a hyphen
rather than underscore, e.g., en-US, pt-BR.  This is what is used on
the web in general, e.g., in HTTP headers.

See:   http://www.rfc-editor.org/bcp/bcp47.txt

They even take it a step further, which might be useful in some cases.
For example:  sr-Latn-RS means Serbian language written in Latin
script, as used in Serbia.
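
To show how such a tag breaks down, here is a deliberately simplified Python sketch that splits a tag into language, script and region. It only looks at subtag length and ignores the other kinds of subtag that BCP 47 defines (variants, extensions, private use), so it is an illustration and not a conforming parser; the function name is made up.

```python
def split_tag(tag):
    """Split a hyphen-separated tag into language, script and region."""
    parts = tag.split("-")
    result = {"language": parts[0].lower(), "script": None, "region": None}
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            # four letters: a script subtag such as "Latn"
            result["script"] = part.title()
        elif (len(part) == 2 and part.isalpha()) or (
            len(part) == 3 and part.isdigit()
        ):
            # two letters or three digits: a region subtag such as "RS"
            result["region"] = part.upper()
    return result


print(split_tag("sr-Latn-RS"))
print(split_tag("pt-BR"))
```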

-Rob



> Regards,
>   Andrea.

Re: Language codes ???

Posted by Andrea Pescetti <pe...@apache.org>.

janI wrote:
> I have the following codes (directories):
> af brx dz eu he ka ky my om ro ...
> Where  can I find the relation between the directory names and the
> languages (human names), someone (I think andrea) mentioned it was country
> codes ?

We don't use country codes, we rely on the LANGUAGE codes, which are ISO 
standards. So, in general:
- if it is a two-letter code, look it up in ISO 639-1: 
http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes  ("af" -> "Afrikaans")
- if it is a three-letter code, use ISO 639-2 or (more complete, extends 
639-2) 639-3: http://en.wikipedia.org/wiki/List_of_ISO_639-3_codes 
("pap" -> "Papiamento")
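
The lookup rule above can be sketched in a few lines of Python. The table is a small hand-written excerpt, not the full ISO lists, and the function names are made up for illustration; a real tool would load the complete ISO 639 tables instead.

```python
# Hand-written excerpt for illustration only.
LANGUAGE_NAMES = {
    "af": "Afrikaans",
    "ca": "Catalan",
    "eu": "Basque",
    "gl": "Galician",
    "ast": "Asturian",
    "pap": "Papiamento",
}


def which_iso_list(code):
    """Say which ISO 639 list to consult, based on the code's length."""
    if len(code) == 2:
        return "ISO 639-1"
    if len(code) == 3:
        return "ISO 639-2 / 639-3"
    raise ValueError(f"not a two- or three-letter code: {code!r}")


def language_name(code):
    """Human-readable name, or 'unknown' if not in the excerpt."""
    return LANGUAGE_NAMES.get(code, "unknown")


for code in ("af", "pap"):
    print(code, which_iso_list(code), language_name(code))
```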

> I expected dialects within a language to be written as e.g. es_XX, and I
> know there is an ongoing effort on translating to
>     Catalan Euskadi and Gallego

No, this would be a dangerous approach! There is a lot of "political 
correctness" at work here. Everything that is in ISO is a language. So 
all languages spoken in Spain have equal dignity and their own codes. 
Catalan is "ca", Basque/Euskadi is "eu", Gallego is "gl" and you listed 
all three of them.

> I am also a bit puzzled about pt_BR and ca_XV

These are extensions made to accommodate language variants. Codes of 
the form '[a-z]*_[A-Z]*' are an internal convention to be read as 
language_PLACE. So en_US means "English, as spoken in the US"; en_GB = 
"English, as spoken in Great Britain"; pt_BR = "Portuguese, as spoken in 
Brazil"; ca_XV = "Catalan, as spoken in Valencia [or Comunidad 
Valenciana]". zh_CN and zh_TW are usually referred to as "simplified" and 
"traditional" Chinese, rather than being tied to China and Taiwan as 
the two codes would suggest.
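
A small Python sketch of reading a directory name as language_PLACE is given below. The regular expression is my own, stricter variant of the pattern quoted above (which, read literally, would also accept empty parts): it assumes two or three lowercase letters, optionally followed by an underscore and two uppercase letters. The function name is hypothetical.

```python
import re

# Assumed shape: "xx" or "xxx", optionally followed by "_YY".
CODE_RE = re.compile(r"^([a-z]{2,3})(?:_([A-Z]{2}))?$")


def split_code(dirname):
    """Return (language, place); place is None for plain language codes."""
    match = CODE_RE.match(dirname)
    if match is None:
        raise ValueError(f"unexpected directory name: {dirname!r}")
    return match.group(1), match.group(2)


for name in ("pt_BR", "ca_XV", "ast"):
    print(name, split_code(name))
```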

Regards,
   Andrea.
