Posted to dev@nutch.apache.org by "Sebastian Nagel (Jira)" <ji...@apache.org> on 2019/11/15 10:51:00 UTC
[jira] [Updated] (NUTCH-2449) Usage of Tika LanguageIdentifier in language-identifier plugin
[ https://issues.apache.org/jira/browse/NUTCH-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-2449:
-----------------------------------
Fix Version/s: 1.17
> Usage of Tika LanguageIdentifier in language-identifier plugin
> --------------------------------------------------------------
>
> Key: NUTCH-2449
> URL: https://issues.apache.org/jira/browse/NUTCH-2449
> Project: Nutch
> Issue Type: Improvement
> Components: plugin
> Affects Versions: 1.13
> Reporter: Yossi Tamari
> Priority: Major
> Fix For: 1.17
>
>
> The language-identifier plugin uses org.apache.tika.language.LanguageIdentifier for extracting the language from the document text. There are two issues with that:
> # LanguageIdentifier is deprecated in Tika.
> # It does not support CJK languages (and, I suspect, many other languages; see https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes), and it doesn’t even fail gracefully on them: in my experience, Chinese was recognized as Italian.
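
The "fail gracefully" point above could be sketched roughly as follows. This is only an illustration, not code from Nutch or Tika: the class `ScriptGuard`, its method names, and the 0.5 threshold are all hypothetical. The idea is to check, using only the JDK's `Character.UnicodeScript`, whether a text is mostly written in CJK scripts before passing it to an identifier that has no CJK profiles, so that the caller can report "unknown" instead of a wrong Latin-script language.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper, not part of Nutch or Tika.
public class ScriptGuard {

    /** True if the code point belongs to a script used by Chinese, Japanese, or Korean. */
    static boolean isCjk(int codePoint) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        return script == UnicodeScript.HAN
            || script == UnicodeScript.HIRAGANA
            || script == UnicodeScript.KATAKANA
            || script == UnicodeScript.HANGUL;
    }

    /** Fraction of the letters in the text that are CJK; 0.0 if the text has no letters. */
    static double cjkRatio(String text) {
        int letters = 0;
        int cjk = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // ignore digits, punctuation, whitespace
            }
            letters++;
            if (isCjk(cp)) {
                cjk++;
            }
        }
        return letters == 0 ? 0.0 : (double) cjk / letters;
    }

    /**
     * True if the text is predominantly CJK, so an identifier without CJK
     * profiles should be skipped. The 0.5 threshold is an arbitrary assumption.
     */
    static boolean shouldSkipLatinOnlyIdentifier(String text) {
        return cjkRatio(text) > 0.5;
    }

    public static void main(String[] args) {
        String chinese = "\u8fd9\u662f\u4e2d\u6587";          // a short Chinese phrase
        String italian = "Questa \u00e8 una frase italiana";  // an Italian sentence
        System.out.println("skip for Chinese sample: " + shouldSkipLatinOnlyIdentifier(chinese));
        System.out.println("skip for Italian sample: " + shouldSkipLatinOnlyIdentifier(italian));
    }
}
```

A guard like this only avoids the wrong answer; it does not identify the language. Actually detecting CJK languages would still require replacing the deprecated identifier with a detector that supports them.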
--
This message was sent by Atlassian Jira
(v8.3.4#803005)