Posted to user@tika.apache.org by Jan Høydahl / Cominvent <ja...@cominvent.com> on 2010/11/08 22:43:04 UTC

Re: Adding languages to LanguageIdentifier

Hi,

Picking up on this thread again.

I created TIKA-546 "Add ability to create language profiles to tika-app".
Do you think this is a viable route?

But when I try to find the class org.apache.nutch.analysis.lang.NGramProfile in trunk, it is gone. Is there already an existing initiative to port language profile creation over to Tika?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 24. aug. 2010, at 16.57, Jukka Zitting wrote:

> Hi,
> 
> On Tue, Aug 24, 2010 at 4:50 PM, Jan Høydahl / Cominvent
> <ja...@cominvent.com> wrote:
>> Does anyone have an answer to this question that I posted last week?
>> I know how to generate profiles for Nutch, but not for Tika.
> 
> It's the same thing, you just need to postprocess the Nutch profile
> files to only contain three-letter ngrams as that's what Tika
> currently uses as the standard ngram size.
> 
> Any sufficiently representative corpus of text should be good enough
> for the language profiles. It would also be good to include some
> simple test cases that we can use to verify that future updates to the
> language profiles won't break things.
> 
> BR,
> 
> Jukka Zitting
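
For reference, the postprocessing step Jukka describes above (trimming a Nutch profile down to three-letter ngrams) could be sketched roughly as follows. This is only an illustration, not code from Tika or Nutch: the class name TrigramProfileFilter and the method keepTrigrams are hypothetical, and the sketch assumes that a profile file is plain UTF-8 text with one "<ngram> <count>" entry per line plus optional comment lines starting with '#'. Please check that assumption against your actual profile files before relying on it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika or Nutch.
public class TrigramProfileFilter {

    // Keeps comment lines and entries whose ngram is exactly three chars long.
    // Assumed line format: "<ngram> <count>"; lines starting with '#' are comments.
    static List<String> keepTrigrams(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (line.startsWith("#")) {
                out.add(line);          // preserve header/comment lines
                continue;
            }
            int space = line.lastIndexOf(' ');
            if (space < 0) {
                continue;               // skip blank or malformed lines
            }
            String ngram = line.substring(0, space);
            if (ngram.length() == 3) {  // length in Java chars
                out.add(line);
            }
        }
        return out;
    }

    // Usage: java TrigramProfileFilter <input profile> <output profile>
    public static void main(String[] args) throws IOException {
        List<String> in = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        Files.write(Paths.get(args[1]), keepTrigrams(in), StandardCharsets.UTF_8);
    }
}
```

The filter leaves the counts untouched and only drops entries of other lengths, so the relative frequencies among the remaining three-letter ngrams stay as they were generated.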