Posted to dev@nutch.apache.org by Michael Baessler <mb...@michael-baessler.de> on 2007/08/15 14:00:25 UTC

Using Nutch LanguageIdentifierPlugin in Apache UIMA

Hi,

I'm one of the Apache UIMA committers and while searching for an open 
source language detection technology I found the
Nutch LanguageIdentifierPlugin.

First, a short introduction to UIMA:
UIMA stands for Unstructured Information Management Architecture and is 
a component architecture and software framework implementation
for the analysis of unstructured content like text, video and audio 
data. The framework has a pluggable architecture to build a chain of
analysis engines to analyze the content. For further and more detailed 
information about UIMA, please refer to the Apache UIMA homepage:
http://incubator.apache.org/uima/

We are interested in wrapping such a language identifier as a UIMA 
analysis engine, so that it can be used as a step in an analysis chain 
for text content.
We created a UIMA sandbox to host such analysis engines, so that everybody 
can pick the engines they are interested in and build an analysis chain 
for their needs.

Now my questions:
Is there a place where I can find some more details about how your 
language identification works?
Would it be possible to share the language identification technology so 
that we can wrap it as a UIMA analysis engine? My current understanding 
is that it is only available within Nutch, not separately.

Since both projects are hosted on Apache, I don't see any license issues 
when using your technology. :-)

Thanks for your answers in advance!

-- Michael





Re: Using Nutch LanguageIdentifierPlugin in Apache UIMA

Posted by Michael Baessler <mb...@michael-baessler.de>.
Thanks for your reply.

I will talk to Jukka Zitting and Chris Mattmann about the language 
detection components.

-- Michael

Andrzej Bialecki wrote:
> Michael Baessler wrote:
>> Hi,
>>
>> I'm one of the Apache UIMA committers and while searching for an open 
>> source language detection technology I found the
>> Nutch LanguageIdentifierPlugin.
>
>
> Hello Michael,
>
>
>> Now my questions:
>> Is there a place where I can find some more details about how your 
>> language identification works?
>
> It uses character n-gram models of different languages, i.e. 
> histograms of relative frequencies of character groups. It builds a 
> similar model for the text under examination, and then compares its 
> model to other pre-defined models. The best match wins. This method is 
> described in a paper by Cavnar and Trenkle 
> (http://citeseer.ist.psu.edu/68861.html).
>
> This works very well even for short texts, and doesn't require any 
> linguistic knowledge. However, it works poorly for texts that mix 
> sections in different languages, texts in a language that has no 
> pre-defined model, and extremely short texts.
>
>
>> Would it be possible to share the language identification technology 
>> so that we can wrap it as a UIMA analysis engine? My current 
>> understanding is that it is only available within Nutch, not 
>> separately.
>
> There is a grass-roots effort underway to extract portions of Nutch 
> related to content parsing into a separate framework, called Tika. 
> Jukka Zitting and Chris Mattmann would be the right people to talk to.
>
>>
>> Since both projects are hosted on Apache, I don't see any license 
>> issues when using your technology. :-)
>
> Neither do I. AFAIK, ASF encourages maximum re-use of Apache 
> components over external ones.
>


Re: Using Nutch LanguageIdentifierPlugin in Apache UIMA

Posted by Andrzej Bialecki <ab...@getopt.org>.
Michael Baessler wrote:
> Hi,
> 
> I'm one of the Apache UIMA committers and while searching for an open 
> source language detection technology I found the
> Nutch LanguageIdentifierPlugin.


Hello Michael,


> Now my questions:
> Is there a place where I can find some more details about how your 
> language identification works?

It uses character n-gram models of different languages, i.e. histograms 
of relative frequencies of character groups. It builds a similar model 
for the text under examination, and then compares its model to other 
pre-defined models. The best match wins. This method is described in a 
paper by Cavnar and Trenkle (http://citeseer.ist.psu.edu/68861.html).

This works very well even for short texts, and doesn't require any 
linguistic knowledge. However, it works poorly for texts that mix 
sections in different languages, texts in a language that has no 
pre-defined model, and extremely short texts.
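
To make the description above concrete, here is a minimal Python sketch 
of the rank-order profile comparison described in the Cavnar and Trenkle 
paper. It is not the Nutch plugin's code. The function names 
(ngram_profile, out_of_place, identify), the 1- to 3-character n-gram 
range, the 300-entry profile cutoff and the toy training sentences are 
all assumptions made for illustration.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Return the top_k most frequent character n-grams, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries (assumed padding character)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so results are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the model get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def identify(text, models):
    """Return the language whose profile is closest to the text's profile."""
    doc_profile = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(doc_profile, models[lang]))


# Toy training data (hypothetical); real models need far larger samples.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and then the fox runs away from the dog",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann läuft der fuchs vor dem hund weg",
}
MODELS = {lang: ngram_profile(sample) for lang, sample in TRAINING.items()}

if __name__ == "__main__":
    print(identify("the dog and the fox", MODELS))
```

Note that identify always returns the nearest model, so a text in a 
language without a model is still assigned some language. A real 
implementation would need a distance threshold to report "unknown".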


> Would it be possible to share the language identification technology so 
> that we can wrap it as a UIMA analysis engine? My current understanding 
> is that it is only available within Nutch, not separately.

There is a grass-roots effort underway to extract portions of Nutch 
related to content parsing into a separate framework, called Tika. Jukka 
Zitting and Chris Mattmann would be the right people to talk to.

> 
> Since both projects are hosted on Apache, I don't see any license issues 
> when using your technology. :-)

Neither do I. AFAIK, ASF encourages maximum re-use of Apache components 
over external ones.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com