Posted to user@nutch.apache.org by Yossi Tamari <yo...@pipl.com> on 2017/10/24 09:26:48 UTC

Usage of Tika LanguageIdentifier in language-identifier plugin

Hi

 

The language-identifier plugin uses
org.apache.tika.language.LanguageIdentifier for extracting the language from
the document text. There are two issues with that:

1.	LanguageIdentifier is deprecated in Tika.
2.	It does not support CJK languages (and, I suspect, a lot of other
languages; see
https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
and it doesn't even fail gracefully with them: in my experience, Chinese
was recognized as Italian.

 

Since LanguageIdentifier was superseded in Tika by
org.apache.tika.language.detect.LanguageDetector, it seems obvious to make
that change in the plugin as well. However, the design of LanguageDetector
is terrible: it makes the implementation non-reentrant, meaning the full
language model would have to be reloaded on each call to the detector.

 

For my needs, I have modified the plugin to use
com.optimaize.langdetect.LanguageDetector directly, which is what Tika's
LanguageDetector uses internally (at least by default). My question is
whether that is a change that should be made to the official plugin. 

 

Thanks,

               Yossi.

Re: Usage of Tika LanguageIdentifier in language-identifier plugin

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Yossi,

> does not separate the Detector object, which contains the model and should be reused, from the
> text writer object, which should be request specific.

But shouldn't a call to reset() make it ready for re-use (the Detector object including the writer)?

But I agree that a reentrant function may be easier to integrate. Nutch plugins also need to be
thread-safe, esp. parsers and parse filters if running in a multi-threaded parsing fetcher.
Without a reentrant function and without a 100% stateless detector, the only way is to use a
ThreadLocal instance of the detector. At first glance, the optimaize detector seems to be stateless.
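
A rough sketch of the ThreadLocal approach, under stated assumptions: ResettableDetector is a hypothetical stand-in (it is not Tika's or optimaize's class) for a detector that keeps the model and the per-request text in one object, and its toy "model" only counts a few stop words. Each thread loads the model once and calls reset() before every document.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical detector that mixes the model with request-specific state.
final class ResettableDetector {
    // Counts how often the (pretend-expensive) model load has run.
    static final AtomicInteger MODEL_LOADS = new AtomicInteger();

    private final Map<String, String> stopWordToLang = new HashMap<>(); // the "model"
    private final StringBuilder text = new StringBuilder();             // request state

    void loadModels() {
        MODEL_LOADS.incrementAndGet();
        stopWordToLang.put("the", "en");
        stopWordToLang.put("and", "en");
        stopWordToLang.put("der", "de");
        stopWordToLang.put("und", "de");
    }

    // Drops the text of the previous request but keeps the model.
    void reset() { text.setLength(0); }

    void addText(CharSequence s) { text.append(s).append(' '); }

    String detect() {
        int en = 0, de = 0;
        for (String w : text.toString().toLowerCase(Locale.ROOT).split("\\s+")) {
            String lang = stopWordToLang.get(w);
            if ("en".equals(lang)) en++;
            else if ("de".equals(lang)) de++;
        }
        if (en == de) return "unknown";
        return en > de ? "en" : "de";
    }
}

final class PerThreadDetection {
    // One detector per thread: the model is loaded on a thread's first use only.
    private static final ThreadLocal<ResettableDetector> DETECTOR =
        ThreadLocal.withInitial(() -> {
            ResettableDetector d = new ResettableDetector();
            d.loadModels();
            return d;
        });

    static String detectLanguage(String text) {
        ResettableDetector d = DETECTOR.get();
        d.reset();      // clear whatever the previous document left behind
        d.addText(text);
        return d.detect();
    }
}
```

Calling PerThreadDetection.detectLanguage(text) from a parse filter would then cost one model load per thread rather than one per document; whether that is acceptable depends on how many parser threads are running.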

> I chose optimaize mainly because Tika did. Using langid instead should be very simple, but the
> fact that the project has not seen a single commit in the last 4 years, and the usage numbers are
> also quite low, gives me pause...

Of course, maintenance and the community around a project are important factors. CLD2 is also not really
maintained; in addition, its models are fixed, and no code is available to retrain them.

> what I have done locally

In any case, it would be great if you could open an issue on Jira and a pull request on GitHub.
Which way to go can then be discussed further.

Thanks,
Sebastian


On 10/24/2017 01:05 PM, Yossi Tamari wrote:
> Why not LanguageDetector: The API does not separate the Detector object, which contains the model and should be reused, from the text writer object, which should be request specific. The same API Object instance contains references to both. In code terms, both loadModels() and addText() are non-static members of LanguageDetector.
> 
> Developing another language-identifier-optimaize is basically what I have done locally, but it seems to me having both in the Nutch repository would just be confusing for users. 99% of the code would also be duplicated (the relevant code is about 5 lines).
> 
> I chose optimaize mainly because Tika did. Using langid instead should be very simple, but the fact that the project has not seen a single commit in the last 4 years, and the usage numbers are also quite low, gives me pause...

RE: Usage of Tika LanguageIdentifier in language-identifier plugin

Posted by Yossi Tamari <yo...@pipl.com>.

Why not LanguageDetector: The API does not separate the Detector object, which contains the model and should be reused, from the text writer object, which should be request specific. The same API Object instance contains references to both. In code terms, both loadModels() and addText() are non-static members of LanguageDetector.
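
To illustrate the separation described above, here is a minimal sketch in which the model and the request-specific text live in different objects. SharedLanguageModel and TextProbe are hypothetical names invented for this example; they are not part of Tika or optimaize, and the stop-word "model" is only a toy.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical immutable model: loaded once, then safe to share between threads.
final class SharedLanguageModel {
    private final Map<String, String> stopWordToLang;

    private SharedLanguageModel(Map<String, String> m) {
        this.stopWordToLang = Collections.unmodifiableMap(m);
    }

    // Stands in for the expensive load that should happen only once.
    static SharedLanguageModel load() {
        Map<String, String> m = new HashMap<>();
        m.put("the", "en");
        m.put("and", "en");
        m.put("le", "fr");
        m.put("et", "fr");
        return new SharedLanguageModel(m);
    }

    // Request-specific state lives in its own short-lived object.
    TextProbe newProbe() { return new TextProbe(this); }

    String langOf(String word) { return stopWordToLang.get(word); }
}

// Hypothetical per-request object: cheap to create, never shared.
final class TextProbe {
    private final SharedLanguageModel model;
    private final Map<String, Integer> hits = new HashMap<>();

    TextProbe(SharedLanguageModel model) { this.model = model; }

    TextProbe addText(CharSequence s) {
        for (String w : s.toString().toLowerCase(Locale.ROOT).split("\\s+")) {
            String lang = model.langOf(w);
            if (lang != null) hits.merge(lang, 1, Integer::sum);
        }
        return this;
    }

    // Returns the language with the most stop-word hits, or "unknown" if none.
    String detect() {
        String best = "unknown";
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : hits.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

With this shape, a plugin could call SharedLanguageModel.load() once during setup and then use model.newProbe().addText(text).detect() per document, with no reset and no per-thread copies.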

Developing another language-identifier-optimaize is basically what I have done locally, but it seems to me having both in the Nutch repository would just be confusing for users. 99% of the code would also be duplicated (the relevant code is about 5 lines).

I chose optimaize mainly because Tika did. Using langid instead should be very simple, but the fact that the project has not seen a single commit in the last 4 years, and the usage numbers are also quite low, gives me pause...


> -----Original Message-----
> From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com]
> Sent: 24 October 2017 13:18
> To: user@nutch.apache.org
> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
> 
> Hi Yossi,
> 
> sorry, while reading quickly I thought it was about the old LanguageIdentifier.
> 
> > it is not possible to initialize the detector in setConf and then reuse it
> 
> Could you explain why? The API/interface should allow you to get an instance
> and call loadModels(), shouldn't it?
> 
> >>> For my needs, I have modified the plugin to use
> >>> com.optimaize.langdetect.LanguageDetector directly, which is what
> 
> Of course, that's also possible. Or just add a plugin
> language-identifier-optimaize.
> 
> Btw., I recently had a look at various open source language identifier
> implementations and would prefer
> langid (a port from Python/C) because it's faster and has better precision:
>   https://github.com/carrotsearch/langid-java.git
>   https://github.com/saffsd/langid.c.git
>   https://github.com/saffsd/langid.py.git
> Of course, CLD2 (https://github.com/CLD2Owners/cld2.git) is unbeaten (but it's
> C++).
> 
> Thanks,
> Sebastian

Re: Usage of Tika LanguageIdentifier in language-identifier plugin

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Yossi,

sorry, while reading quickly I thought it was about the old LanguageIdentifier.

> it is not possible to initialize the detector in setConf and then reuse it

Could you explain why? The API/interface should allow you to get an instance and call loadModels(), shouldn't it?

>>> For my needs, I have modified the plugin to use
>>> com.optimaize.langdetect.LanguageDetector directly, which is what

Of course, that's also possible. Or just add a plugin language-identifier-optimaize.

Btw., I recently had a look at various open source language identifier implementations and would prefer
langid (a port from Python/C) because it's faster and has better precision:
  https://github.com/carrotsearch/langid-java.git
  https://github.com/saffsd/langid.c.git
  https://github.com/saffsd/langid.py.git
Of course, CLD2 (https://github.com/CLD2Owners/cld2.git) is unbeaten (but it's C++).

Thanks,
Sebastian

On 10/24/2017 11:46 AM, Yossi Tamari wrote:
> Hi Sebastian,
> 
> Please reread the second paragraph of my email 😊.
> In short, it is not possible to initialize the detector in setConf and then reuse it, and initializing it per call would be extremely slow.
> 
> 	Yossi.

RE: Usage of Tika LanguageIdentifier in language-identifier plugin

Posted by Yossi Tamari <yo...@pipl.com>.

Hi Sebastian,

Please reread the second paragraph of my email 😊.
In short, it is not possible to initialize the detector in setConf and then reuse it, and initializing it per call would be extremely slow.

	Yossi.


> -----Original Message-----
> From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com]
> Sent: 24 October 2017 12:41
> To: user@nutch.apache.org
> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
> 
> Hi Yossi,
> 
> why not port it to use
> 
> http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDetector.html
> 
> The upgrade to Tika 1.16 is already in progress (NUTCH-2439).
> 
> Sebastian

Re: Usage of Tika LanguageIdentifier in language-identifier plugin

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Yossi,

why not port it to use
   http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDetector.html

The upgrade to Tika 1.16 is already in progress (NUTCH-2439).

Sebastian
