Posted to user@tika.apache.org by Jan Høydahl / Cominvent <ja...@cominvent.com> on 2010/08/20 22:07:07 UTC

Adding languages to LanguageIdentifier

Hi,

What is the procedure to add a language profile to LanguageIdentifier? Do we use Wikipedia as the training set?

I'd like to add some languages relevant to Norway.
In Norway there are two official written forms of Norwegian: Bokmål (nb) and Nynorsk (nn). These are recommended for use instead of the generic "no" tag.

We also have a third language, Sami, which itself comprises several variants, including Northern Sami and Southern Sami. The referenced ISO 639 list (http://www.w3.org/WAI/ER/IG/ert/iso639.htm) is obsolete, as it lists neither of these. A better list is http://www.loc.gov/standards/iso639-2/php/code_list.php

What if we have a requirement to represent regional variants such as en-US and en-GB? ISO 639 on its own does not cover these. Perhaps it would be better to switch to RFC 5646 and the IANA Language Subtag Registry (http://rishida.net/utils/subtags/), which builds on ISO 639-1 and ISO 639-2 but allows for region subtags as well?
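For illustration, RFC 5646 (BCP 47) tags with region subtags can already be parsed with Java's standard java.util.Locale.forLanguageTag; the tag values below are just examples, not anything LanguageIdentifier supports today:

```java
import java.util.Locale;

public class LanguageTagDemo {
    public static void main(String[] args) {
        // A BCP 47 tag splits into a primary language subtag and a region subtag.
        Locale gb = Locale.forLanguageTag("en-GB");
        System.out.println(gb.getLanguage()); // "en"
        System.out.println(gb.getCountry());  // "GB"

        // Plain language subtags work too, including nn (Nynorsk)
        // and se (Northern Sami).
        System.out.println(Locale.forLanguageTag("nn").getLanguage()); // "nn"
        System.out.println(Locale.forLanguageTag("se").getLanguage()); // "se"
    }
}
```

So the parsing side of RFC 5646 comes for free on the JVM; the open question is only which subtags the profiles themselves should use.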

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com


Re: Adding languages to LanguageIdentifier

Posted by Jan Høydahl / Cominvent <ja...@cominvent.com>.
Hi,

Picking up on this thread again.

I created TIKA-546 "Add ability to create language profiles to tika-app".
Do you think this is a viable route?

However, when I look for the class org.apache.nutch.analysis.lang.NGramProfile in trunk, it is gone. Is there an existing initiative to port language profile creation over to Tika?
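For reference, the core of what NGramProfile did can be sketched in a few lines; this is a rough reimplementation under my own assumptions (the class and method names are made up, and I'm assuming the usual scheme of padding each word with a separator character before counting 3-character windows):

```java
import java.util.HashMap;
import java.util.Map;

// Rough sketch of trigram-profile creation in the spirit of the old
// org.apache.nutch.analysis.lang.NGramProfile: pad each word with a
// separator character and count every 3-character window.
public class TrigramProfileSketch {

    public static Map<String, Integer> trigramCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) continue;
            String padded = "_" + word + "_"; // '_' marks word boundaries
            for (int i = 0; i + 3 <= padded.length(); i++) {
                counts.merge(padded.substring(i, i + 3), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

A real profile would then sort these counts and write out the most frequent entries, but the counting step above is the essential part.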

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 24. aug. 2010, at 16.57, Jukka Zitting wrote:

> Hi,
> 
> On Tue, Aug 24, 2010 at 4:50 PM, Jan Høydahl / Cominvent
> <ja...@cominvent.com> wrote:
>> Does anyone have an answer to the question I posted last week?
>> I know how to generate profiles for Nutch, but not for Tika.
> 
> It's the same process; you just need to post-process the Nutch profile
> files so that they contain only three-letter n-grams, since that is the
> standard n-gram size Tika currently uses.
> 
> Any sufficiently representative corpus of text should be good enough
> for the language profiles. It would also be good to include some
> simple test cases that we can use to verify that future updates to the
> language profiles won't break things.
> 
> BR,
> 
> Jukka Zitting


Re: Adding languages to LanguageIdentifier

Posted by Jan Høydahl / Cominvent <ja...@cominvent.com>.
Hi,

Thanks for the answer. That's easy enough.

I cannot find any documentation of what the original training texts were. Shouldn't those be checked into svn, so the profiles could be rebuilt if the algorithm or format changes?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 24. aug. 2010, at 16.57, Jukka Zitting wrote:

> Hi,
> 
> On Tue, Aug 24, 2010 at 4:50 PM, Jan Høydahl / Cominvent
> <ja...@cominvent.com> wrote:
>> Does anyone have an answer to the question I posted last week?
>> I know how to generate profiles for Nutch, but not for Tika.
> 
> It's the same process; you just need to post-process the Nutch profile
> files so that they contain only three-letter n-grams, since that is the
> standard n-gram size Tika currently uses.
> 
> Any sufficiently representative corpus of text should be good enough
> for the language profiles. It would also be good to include some
> simple test cases that we can use to verify that future updates to the
> language profiles won't break things.
> 
> BR,
> 
> Jukka Zitting


Re: Adding languages to LanguageIdentifier

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Tue, Aug 24, 2010 at 4:50 PM, Jan Høydahl / Cominvent
<ja...@cominvent.com> wrote:
> Does anyone have an answer to the question I posted last week?
> I know how to generate profiles for Nutch, but not for Tika.

It's the same process; you just need to post-process the Nutch profile
files so that they contain only three-letter n-grams, since that is the
standard n-gram size Tika currently uses.

Any sufficiently representative corpus of text should be good enough
for the language profiles. It would also be good to include some
simple test cases that we can use to verify that future updates to the
language profiles won't break things.

BR,

Jukka Zitting
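The post-processing step Jukka describes can be sketched as a small filter. Assuming the profile files use a simple one-entry-per-line "ngram count" layout (an assumption on my part; the class name below is made up for illustration), keeping only the three-letter entries looks like:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: trim a Nutch-style n-gram profile down to the
// 3-grams Tika expects. Assumes a "ngram<whitespace>count" line format;
// blank lines and '#' comment lines are dropped.
public class TrigramFilter {

    public static List<String> keepTrigrams(List<String> profileLines) {
        List<String> out = new ArrayList<>();
        for (String line : profileLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) continue;
            String[] parts = trimmed.split("\\s+");
            // keep only entries whose n-gram is exactly three characters
            if (parts.length == 2 && parts[0].length() == 3) {
                out.add(trimmed);
            }
        }
        return out;
    }
}
```

Running the existing Nutch profile through something like this, then saving the result under the language's code, would give a profile in the shape Tika expects.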

Re: Adding languages to LanguageIdentifier

Posted by Jan Høydahl / Cominvent <ja...@cominvent.com>.
Hi,

Does anyone have an answer to the question I posted last week?
I know how to generate profiles for Nutch, but not for Tika.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 20. aug. 2010, at 22.07, Jan Høydahl / Cominvent wrote:

> Hi,
> 
> What is the procedure to add a language profile to LanguageIdentifier? Do we use Wikipedia as the training set?
> 
> I'd like to add some languages relevant to Norway.
> In Norway there are two official written forms of Norwegian: Bokmål (nb) and Nynorsk (nn). These are recommended for use instead of the generic "no" tag.
> 
> We also have a third language, Sami, which itself comprises several variants, including Northern Sami and Southern Sami. The referenced ISO 639 list (http://www.w3.org/WAI/ER/IG/ert/iso639.htm) is obsolete, as it lists neither of these. A better list is http://www.loc.gov/standards/iso639-2/php/code_list.php
> 
> What if we have a requirement to represent regional variants such as en-US and en-GB? ISO 639 on its own does not cover these. Perhaps it would be better to switch to RFC 5646 and the IANA Language Subtag Registry (http://rishida.net/utils/subtags/), which builds on ISO 639-1 and ISO 639-2 but allows for region subtags as well?
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
>