Posted to user@tika.apache.org by Jan Høydahl / Cominvent <ja...@cominvent.com> on 2010/11/08 22:43:04 UTC
Re: Adding languages to LanguageIdentifier
Hi,
Picking up on this thread again.
I created TIKA-546 "Add ability to create language profiles to tika-app".
Do you think this is a viable route?
But when I look for the class org.apache.nutch.analysis.lang.NGramProfile in trunk, it is gone. Is there already an initiative to port language profile creation over to Tika?
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
On 24. aug. 2010, at 16.57, Jukka Zitting wrote:
> Hi,
>
> On Tue, Aug 24, 2010 at 4:50 PM, Jan Høydahl / Cominvent
> <ja...@cominvent.com> wrote:
>> Does anyone have an answer to this question that I posted last week?
>> I know how to generate profiles for Nutch, but not for Tika.
>
> It's the same thing; you just need to postprocess the Nutch profile
> files so they contain only three-letter ngrams, as that's what Tika
> currently uses as the standard ngram size.
>
> Any sufficiently representative corpus of text should be good enough
> for the language profiles. It would also be good to include some
> simple test cases that we can use to verify that future updates to the
> language profiles won't break things.
>
> BR,
>
> Jukka Zitting
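
The post-processing step described in the quoted reply (reducing a Nutch profile to three-letter ngrams) could be sketched roughly as below. This is only an illustration: it assumes the profile is a text file with one whitespace-separated "ngram count" pair per line plus optional "#" comment lines, and the function name keep_trigrams and the sample entries are hypothetical, not part of Nutch or Tika.

```python
def keep_trigrams(lines, size=3):
    """Yield comment/blank lines unchanged and only those entries
    whose n-gram is exactly `size` characters long.

    Assumed (not verified) input format: "<ngram> <count>" per line,
    with "#" starting a comment line.
    """
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            yield line  # preserve header, comments and blank lines
            continue
        parts = stripped.split()
        # Keep the entry only if it looks like "<ngram> <count>"
        # and the n-gram has the wanted length.
        if len(parts) == 2 and len(parts[0]) == size:
            yield line


# Hypothetical sample data, made up for illustration only.
sample = [
    "# hypothetical profile header",
    "_ 500",
    "_t 120",
    "_th 90",
    "the 80",
    "the_ 60",
    "he_ 40",
]

result = list(keep_trigrams(sample))
for entry in result:
    print(entry)
```

To filter a real file, the same generator could be fed the lines of the input profile and its output written to a new file; whether the resulting file matches what Tika expects should be checked against an existing Tika profile.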