
Posted to dev@opennlp.apache.org by Jason Baldridge <ja...@gmail.com> on 2011/05/17 20:33:32 UTC

switch to ISO 639-2 codes for languages?

I think we should change to the three character convention for language
specific materials, e.g. "eng" rather than "en" for English.

http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes

Do others agree?

-- 
Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge

Re: switch to ISO 639-2 codes for languages?

Posted by Jörn Kottmann <ko...@gmail.com>.

Is there support for -3 in Java? Currently all we do is check that the
language is a valid two-letter code. The idea when we added it was that we
would one day be able to have language-dependent feature generation, but so
far we only do something special in the sentence detector for Thai.
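
For what the JDK itself offers here, a minimal sketch follows. It assumes only
java.util.Locale; the class and method names (LanguageCodes, isValidAlpha2,
toAlpha3) are hypothetical and are not OpenNLP code. Note that
Locale.getISO3Language() returns ISO 639-2/T codes for two-letter inputs, so
it does not by itself cover the full ISO 639-3 set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of OpenNLP.
final class LanguageCodes {

    // All two-letter ISO 639 language codes known to the running JDK.
    private static final Set<String> ALPHA2 =
            new HashSet<>(Arrays.asList(Locale.getISOLanguages()));

    // The kind of check described above: is this a known two-letter code?
    static boolean isValidAlpha2(String code) {
        return code != null && ALPHA2.contains(code);
    }

    // Maps a two-letter code to the three-letter code the JDK knows for it.
    static String toAlpha3(String alpha2) {
        if (!isValidAlpha2(alpha2)) {
            throw new IllegalArgumentException("Not a two-letter ISO 639 code: " + alpha2);
        }
        return Locale.forLanguageTag(alpha2).getISO3Language();
    }

    public static void main(String[] args) {
        System.out.println(toAlpha3("en")); // prints "eng"
        System.out.println(toAlpha3("th")); // prints "tha"
    }
}
```

Codes that exist only in ISO 639-3 would need a separate lookup table, since
the JDK mapping starts from the two-letter list.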

Jörn

On 5/17/11 8:50 PM, Benson Margulies wrote:
> -2 is pretty useless. Use -3 if you want to switch.
>
> On Tue, May 17, 2011 at 2:40 PM, Oleg Tikhonov<ol...@apache.org>  wrote:
>> My two cents, tesseract-ocr also uses ISO 639-3 and it would be great for
>> those who builds the solutions such as openNLP + tesseract.
>>
>> -Oleg
>>
>> On Tue, May 17, 2011 at 9:33 PM, Jason Baldridge
>> <ja...@gmail.com>wrote:
>>
>>> I think we should change to the three character convention for language
>>> specific materials, e.g. "eng" rather than "en" for English.
>>>
>>> http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes
>>>
>>> Do others agree?
>>>
>>> --
>>> Jason Baldridge
>>> Assistant Professor, Department of Linguistics
>>> The University of Texas at Austin
>>> http://www.jasonbaldridge.com
>>> http://twitter.com/jasonbaldridge
>>>

Re: switch to ISO 639-2 codes for languages?

Posted by Chris Collins <ch...@yahoo.com>.

The Nutch language classifier uses alpha2. Most systems I have used in the past (albeit not NLP-oriented) typically use alpha2. Also, the language names are explicitly called out when users of OpenNLP load a model, so this would be one more place existing users would have to change (consider that an API incompatibility).

What's the net gain you're interested in by moving to alpha3?

C
On May 17, 2011, at 12:16 PM, Jason Baldridge wrote:

> Sure. So change that to be ISO 639-3.
> 
> On Tue, May 17, 2011 at 1:50 PM, Benson Margulies <bi...@gmail.com>wrote:
> 
>> -2 is pretty useless. Use -3 if you want to switch.
>> 
>> On Tue, May 17, 2011 at 2:40 PM, Oleg Tikhonov <ol...@apache.org> wrote:
>>> My two cents, tesseract-ocr also uses ISO 639-3 and it would be great for
>>> those who builds the solutions such as openNLP + tesseract.
>>> 
>>> -Oleg
>>> 
>>> On Tue, May 17, 2011 at 9:33 PM, Jason Baldridge
>>> <ja...@gmail.com>wrote:
>>> 
>>>> I think we should change to the three character convention for language
>>>> specific materials, e.g. "eng" rather than "en" for English.
>>>> 
>>>> http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes
>>>> 
>>>> Do others agree?
>>>> 
>>>> --
>>>> Jason Baldridge
>>>> Assistant Professor, Department of Linguistics
>>>> The University of Texas at Austin
>>>> http://www.jasonbaldridge.com
>>>> http://twitter.com/jasonbaldridge
>>>> 
>>> 
>> 
> 
> 
> 
> -- 
> Jason Baldridge
> Assistant Professor, Department of Linguistics
> The University of Texas at Austin
> http://www.jasonbaldridge.com
> http://twitter.com/jasonbaldridge

Re: switch to ISO 639-2 codes for languages?

Posted by Jason Baldridge <ja...@gmail.com>.

Sure. So change that to be ISO 639-3.

On Tue, May 17, 2011 at 1:50 PM, Benson Margulies <bi...@gmail.com>wrote:

> -2 is pretty useless. Use -3 if you want to switch.
>
> On Tue, May 17, 2011 at 2:40 PM, Oleg Tikhonov <ol...@apache.org> wrote:
> > My two cents, tesseract-ocr also uses ISO 639-3 and it would be great for
> > those who builds the solutions such as openNLP + tesseract.
> >
> > -Oleg
> >
> > On Tue, May 17, 2011 at 9:33 PM, Jason Baldridge
> > <ja...@gmail.com>wrote:
> >
> >> I think we should change to the three character convention for language
> >> specific materials, e.g. "eng" rather than "en" for English.
> >>
> >> http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes
> >>
> >> Do others agree?
> >>
> >> --
> >> Jason Baldridge
> >> Assistant Professor, Department of Linguistics
> >> The University of Texas at Austin
> >> http://www.jasonbaldridge.com
> >> http://twitter.com/jasonbaldridge
> >>
> >
>

Re: switch to ISO 639-2 codes for languages?

Posted by Jörn Kottmann <ko...@gmail.com>.

On 5/18/11 12:11 AM, James Kosin wrote:
> +1
>
> But I'd like to see more mapping of languages to default encoding types
> as well for each language.  Or automatic support in java for the
> language and encoding via the OS first and override options for those
> performing multiple languages than the native.

Making the encoding dependent on the language is not really well-defined:
which encoding do I end up with when I specify French as the language?

The encoding could default to the platform encoding and additionally be
overridden by the user. We decided that the user must always specify the
encoding, because then they need to think about which encoding the
training/test data is in.

Since training is often done for foreign languages, I believe this prevents
many people from just running with an incorrect default encoding.

Anyway, I also use OS X, where MacRoman is the default encoding, which is
simply incompatible with all the training data I have.
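
To illustrate why an explicit encoding matters, here is a small sketch. It is
not OpenNLP code: the class ExplicitEncoding and the method decode are
hypothetical, and ISO-8859-1 stands in for "some wrong platform default"
because MacRoman is not guaranteed to be available in every JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class ExplicitEncoding {

    // Decodes raw bytes with an encoding chosen by the caller,
    // never with the platform default.
    static String decode(byte[] data, Charset encoding) {
        return new String(data, encoding);
    }

    public static void main(String[] args) {
        // "J\u00f6rn" contains o-umlaut, which takes two bytes in UTF-8.
        byte[] utf8 = "J\u00f6rn".getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8, StandardCharsets.UTF_8);
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // prints 4
        System.out.println(wrong.length()); // prints 5: the umlaut became two characters
    }
}
```

The same bytes give different text depending on the charset, which is the
failure mode that forcing the user to name the encoding is meant to avoid.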

Jörn

Re: switch to ISO 639-2 codes for languages?

Posted by James Kosin <ja...@gmail.com>.

+1

But I'd also like to see a mapping from each language to a default
encoding. Or automatic support in Java for the language and encoding via
the OS first, with override options for those working with languages other
than the native one.

James

On 5/17/2011 4:45 PM, Jason Baldridge wrote:
> +1
>
> On Tue, May 17, 2011 at 3:39 PM, Jörn Kottmann <ko...@gmail.com> wrote:
>
>> I can see that, so switching the language codes I think should be something
>> that should be done when we do bigger changes anyway. Maybe for 1.6
>> together
>> with a switch to opennlp-ml and maybe bigger changes in our feature
>> generation
>> code.
>>
>> Jörn
>>
>>
>> On 5/17/11 10:32 PM, Benson Margulies wrote:
>>
>>> there are important distinctions missing in the twos. Farsi / Dari/
>>> etc and others.
>>>
>>> On May 17, 2011, at 4:25 PM, "Jörn Kottmann"<ko...@gmail.com>  wrote:
>>>
>>>> Is there support for -3 in java? Currently all we do is a check that the
>>>> language is
>>>> a valid 2 letter code. The idea was when we added it that we will be able
>>>> to have language dependent feature generation one day, but up to today we
>>>> only do something special in the sentence detector for thai.
>>>>
>>>> Jörn
>>>>
>>>> On 5/17/11 8:50 PM, Benson Margulies wrote:
>>>>
>>>>> -2 is pretty useless. Use -3 if you want to switch.
>>>>>
>>>>> On Tue, May 17, 2011 at 2:40 PM, Oleg Tikhonov<ol...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> My two cents, tesseract-ocr also uses ISO 639-3 and it would be great
>>>>>> for
>>>>>> those who builds the solutions such as openNLP + tesseract.
>>>>>>
>>>>>> -Oleg
>>>>>>
>>>>>> On Tue, May 17, 2011 at 9:33 PM, Jason Baldridge
>>>>>> <ja...@gmail.com>wrote:
>>>>>>
>>>>>>> I think we should change to the three character convention for
>>>>>>> language
>>>>>>> specific materials, e.g. "eng" rather than "en" for English.
>>>>>>>
>>>>>>> http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes
>>>>>>>
>>>>>>> Do others agree?
>>>>>>>
>>>>>>> --
>>>>>>> Jason Baldridge
>>>>>>> Assistant Professor, Department of Linguistics
>>>>>>> The University of Texas at Austin
>>>>>>> http://www.jasonbaldridge.com
>>>>>>> http://twitter.com/jasonbaldridge
>>>>>>>
>>>>>>>
>

Re: switch to ISO 639-2 codes for languages?

Posted by Jason Baldridge <ja...@gmail.com>.

+1

On Tue, May 17, 2011 at 3:39 PM, Jörn Kottmann <ko...@gmail.com> wrote:

> I can see that, so switching the language codes I think should be something
> that should be done when we do bigger changes anyway. Maybe for 1.6
> together
> with a switch to opennlp-ml and maybe bigger changes in our feature
> generation
> code.
>
> Jörn
>
>
> On 5/17/11 10:32 PM, Benson Margulies wrote:
>
>> there are important distinctions missing in the twos. Farsi / Dari/
>> etc and others.
>>
>> On May 17, 2011, at 4:25 PM, "Jörn Kottmann"<ko...@gmail.com>  wrote:
>>
>>> Is there support for -3 in java? Currently all we do is a check that the
>>> language is
>>> a valid 2 letter code. The idea was when we added it that we will be able
>>> to have language dependent feature generation one day, but up to today we
>>> only do something special in the sentence detector for thai.
>>>
>>> Jörn
>>>
>>> On 5/17/11 8:50 PM, Benson Margulies wrote:
>>>
>>>> -2 is pretty useless. Use -3 if you want to switch.
>>>>
>>>> On Tue, May 17, 2011 at 2:40 PM, Oleg Tikhonov<ol...@apache.org>
>>>> wrote:
>>>>
>>>>> My two cents, tesseract-ocr also uses ISO 639-3 and it would be great
>>>>> for
>>>>> those who builds the solutions such as openNLP + tesseract.
>>>>>
>>>>> -Oleg
>>>>>
>>>>> On Tue, May 17, 2011 at 9:33 PM, Jason Baldridge
>>>>> <ja...@gmail.com>wrote:
>>>>>
>>>>>> I think we should change to the three character convention for
>>>>>> language
>>>>>> specific materials, e.g. "eng" rather than "en" for English.
>>>>>>
>>>>>> http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes
>>>>>>
>>>>>> Do others agree?
>>>>>>
>>>>>> --
>>>>>> Jason Baldridge
>>>>>> Assistant Professor, Department of Linguistics
>>>>>> The University of Texas at Austin
>>>>>> http://www.jasonbaldridge.com
>>>>>> http://twitter.com/jasonbaldridge
>>>>>>
>>>>>>
>

Re: switch to ISO 639-2 codes for languages?

Posted by Jörn Kottmann <ko...@gmail.com>.

I can see that, so I think switching the language codes should be done
when we make bigger changes anyway. Maybe for 1.6, together with a switch
to opennlp-ml and maybe bigger changes in our feature generation code.

Jörn

On 5/17/11 10:32 PM, Benson Margulies wrote:
> there are important distinctions missing in the twos. Farsi / Dari/
> etc and others.
>
> On May 17, 2011, at 4:25 PM, "Jörn Kottmann"<ko...@gmail.com>  wrote:
>
>> Is there support for -3 in java? Currently all we do is a check that the
>> language is
>> a valid 2 letter code. The idea was when we added it that we will be able
>> to have language dependent feature generation one day, but up to today we
>> only do something special in the sentence detector for thai.
>>
>> Jörn
>>
>> On 5/17/11 8:50 PM, Benson Margulies wrote:
>>> -2 is pretty useless. Use -3 if you want to switch.
>>>
>>> On Tue, May 17, 2011 at 2:40 PM, Oleg Tikhonov<ol...@apache.org>   wrote:
>>>> My two cents, tesseract-ocr also uses ISO 639-3 and it would be great for
>>>> those who builds the solutions such as openNLP + tesseract.
>>>>
>>>> -Oleg
>>>>
>>>> On Tue, May 17, 2011 at 9:33 PM, Jason Baldridge
>>>> <ja...@gmail.com>wrote:
>>>>
>>>>> I think we should change to the three character convention for language
>>>>> specific materials, e.g. "eng" rather than "en" for English.
>>>>>
>>>>> http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes
>>>>>
>>>>> Do others agree?
>>>>>
>>>>> --
>>>>> Jason Baldridge
>>>>> Assistant Professor, Department of Linguistics
>>>>> The University of Texas at Austin
>>>>> http://www.jasonbaldridge.com
>>>>> http://twitter.com/jasonbaldridge
>>>>>

Re: switch to ISO 639-2 codes for languages?

Posted by Benson Margulies <bi...@gmail.com>.

There are important distinctions missing in the two-letter codes: Farsi /
Dari, and others.
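
A tiny sketch of that loss of distinction, using a hand-written excerpt
rather than a full code table: ISO 639-3 has "pes" for Iranian Persian
(Farsi) and "prs" for Dari, while the only two-letter code available is "fa",
which belongs to the Persian macrolanguage. The class name PersianCodes and
the map are hypothetical.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical excerpt for illustration; not a complete ISO 639 table.
final class PersianCodes {

    // ISO 639-3 code -> closest two-letter code (that of the macrolanguage).
    static final Map<String, String> ALPHA3_TO_ALPHA2 = new LinkedHashMap<>();
    static {
        ALPHA3_TO_ALPHA2.put("pes", "fa"); // Iranian Persian (Farsi)
        ALPHA3_TO_ALPHA2.put("prs", "fa"); // Dari
    }

    // The two-letter codes that remain after collapsing the three-letter ones.
    static Set<String> distinctAlpha2() {
        return new HashSet<>(ALPHA3_TO_ALPHA2.values());
    }

    public static void main(String[] args) {
        System.out.println(ALPHA3_TO_ALPHA2.size());  // prints 2
        System.out.println(distinctAlpha2().size());  // prints 1
    }
}
```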

On May 17, 2011, at 4:25 PM, "Jörn Kottmann" <ko...@gmail.com> wrote:

> Is there support for -3 in java? Currently all we do is a check that the
> language is
> a valid 2 letter code. The idea was when we added it that we will be able
> to have language dependent feature generation one day, but up to today we
> only do something special in the sentence detector for thai.
>
> Jörn
>
> On 5/17/11 8:50 PM, Benson Margulies wrote:
>> -2 is pretty useless. Use -3 if you want to switch.
>>
>> On Tue, May 17, 2011 at 2:40 PM, Oleg Tikhonov<ol...@apache.org>  wrote:
>>> My two cents, tesseract-ocr also uses ISO 639-3 and it would be great for
>>> those who builds the solutions such as openNLP + tesseract.
>>>
>>> -Oleg
>>>
>>> On Tue, May 17, 2011 at 9:33 PM, Jason Baldridge
>>> <ja...@gmail.com>wrote:
>>>
>>>> I think we should change to the three character convention for language
>>>> specific materials, e.g. "eng" rather than "en" for English.
>>>>
>>>> http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes
>>>>
>>>> Do others agree?
>>>>
>>>> --
>>>> Jason Baldridge
>>>> Assistant Professor, Department of Linguistics
>>>> The University of Texas at Austin
>>>> http://www.jasonbaldridge.com
>>>> http://twitter.com/jasonbaldridge
>>>>
>

Re: switch to ISO 639-2 codes for languages?

Posted by Benson Margulies <bi...@gmail.com>.

-2 is pretty useless. Use -3 if you want to switch.

On Tue, May 17, 2011 at 2:40 PM, Oleg Tikhonov <ol...@apache.org> wrote:
> My two cents, tesseract-ocr also uses ISO 639-3 and it would be great for
> those who builds the solutions such as openNLP + tesseract.
>
> -Oleg
>
> On Tue, May 17, 2011 at 9:33 PM, Jason Baldridge
> <ja...@gmail.com>wrote:
>
>> I think we should change to the three character convention for language
>> specific materials, e.g. "eng" rather than "en" for English.
>>
>> http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes
>>
>> Do others agree?
>>
>> --
>> Jason Baldridge
>> Assistant Professor, Department of Linguistics
>> The University of Texas at Austin
>> http://www.jasonbaldridge.com
>> http://twitter.com/jasonbaldridge
>>
>

Re: switch to ISO 639-2 codes for languages?

Posted by Oleg Tikhonov <ol...@apache.org>.

My two cents: tesseract-ocr also uses ISO 639-3, and it would be great for
those who build solutions such as OpenNLP + tesseract.

-Oleg

On Tue, May 17, 2011 at 9:33 PM, Jason Baldridge
<ja...@gmail.com>wrote:

> I think we should change to the three character convention for language
> specific materials, e.g. "eng" rather than "en" for English.
>
> http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes
>
> Do others agree?
>
> --
> Jason Baldridge
> Assistant Professor, Department of Linguistics
> The University of Texas at Austin
> http://www.jasonbaldridge.com
> http://twitter.com/jasonbaldridge
>

Re: switch to ISO 639-2 codes for languages?

Posted by Jörn Kottmann <ko...@gmail.com>.

I created a jira for this issue:
https://issues.apache.org/jira/browse/OPENNLP-176

Jörn

On 5/17/11 8:33 PM, Jason Baldridge wrote:
> I think we should change to the three character convention for language
> specific materials, e.g. "eng" rather than "en" for English.
>
> http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes
>
> Do others agree?
>

Re: switch to ISO 639-2 codes for languages?

Posted by Jason Baldridge <ja...@gmail.com>.

Fewer clashes and clearer naming. More up-to-date with current standards.

On Tue, May 17, 2011 at 1:39 PM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 5/17/11 8:33 PM, Jason Baldridge wrote:
>
>> I think we should change to the three character convention for language
>> specific materials, e.g. "eng" rather than "en" for English.
>>
>> http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes
>>
>> Do others agree?
>>
> I do not really have an opinion here, why do you think three
> letter codes are better?
>
> Jörn
>

Re: switch to ISO 639-2 codes for languages?

Posted by Jörn Kottmann <ko...@gmail.com>.

On 5/17/11 8:33 PM, Jason Baldridge wrote:
> I think we should change to the three character convention for language
> specific materials, e.g. "eng" rather than "en" for English.
>
> http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes
>
> Do others agree?
>
I do not really have an opinion here, why do you think three
letter codes are better?

Jörn