Posted to dev@nutch.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/06/05 15:46:00 UTC

[jira] [Commented] (NUTCH-2449) Usage of Tika LanguageIdentifier in language-identifier plugin

    [ https://issues.apache.org/jira/browse/NUTCH-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501969#comment-16501969 ] 

ASF GitHub Bot commented on NUTCH-2449:
---------------------------------------

sebastian-nagel commented on issue #233: NUTCH-2449: Replace Tika LanguageIdentifier in language-identifier
URL: https://github.com/apache/nutch/pull/233#issuecomment-394760001
 
 
   Hi @YossiTamari, I've finally found the time to test the PR. Fetching your branch failed, so to resolve the conflicts I created a [new branch](https://github.com/sebastian-nagel/nutch/tree/YossiTamari-NUTCH-2449) and applied your patch there. One trivial fix: `langmapping.properties` (used to parse the HTML lang attribute) still needs to be copied to runtime. Everything works fine! If there are no objections, I'll merge soon. Thanks!
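
   As a rough illustration of why a mapping file matters when parsing the HTML lang attribute, here is a minimal sketch that is not taken from the PR. It assumes a properties layout of `code = alias, alias, ...`; the sample entries, the class name `LangMappingSketch`, and both helper methods are hypothetical, and the real `langmapping.properties` and plugin code may differ.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;

class LangMappingSketch {

  // Hypothetical sample content (assumption): "code = alias, alias, ...".
  static final String SAMPLE =
      "en = en, eng, english\n"
    + "zh = zh, chi, zho, chinese\n";

  // Builds a reverse index: every lower-cased alias points to its language code.
  static Map<String, String> loadAliases(String propertiesText) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(propertiesText));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    Map<String, String> aliases = new HashMap<>();
    for (String code : props.stringPropertyNames()) {
      aliases.put(code.toLowerCase(Locale.ROOT), code);
      for (String alias : props.getProperty(code).split(",")) {
        String a = alias.trim().toLowerCase(Locale.ROOT);
        if (!a.isEmpty()) {
          aliases.put(a, code);
        }
      }
    }
    return aliases;
  }

  // Maps a raw lang attribute value to a known code, or returns null if unknown.
  static String normalize(Map<String, String> aliases, String langAttr) {
    if (langAttr == null) {
      return null;
    }
    String value = langAttr.trim().toLowerCase(Locale.ROOT);
    String code = aliases.get(value);
    if (code == null) {
      // Fall back to the primary subtag, e.g. "en-US" or "en_US" -> "en".
      code = aliases.get(value.split("[-_]", 2)[0]);
    }
    return code;
  }

  public static void main(String[] args) {
    Map<String, String> aliases = loadAliases(SAMPLE);
    System.out.println(normalize(aliases, "en-US"));
    System.out.println(normalize(aliases, "Chinese"));
    System.out.println(normalize(aliases, "xx"));
  }
}
```

   Without the mapping file at runtime, a lookup like this has nothing to match against, so the lang attribute could not be resolved to a language code.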



> Usage of Tika LanguageIdentifier in language-identifier plugin
> --------------------------------------------------------------
>
>                 Key: NUTCH-2449
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2449
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>    Affects Versions: 1.13
>            Reporter: Yossi Tamari
>            Priority: Major
>
> The language-identifier plugin uses org.apache.tika.language.LanguageIdentifier to extract the language from the document text. There are two issues with that:
> # LanguageIdentifier is deprecated in Tika.
> # It does not support CJK languages (and, I suspect, many other languages; see https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes), and it doesn’t even fail gracefully on them: in my experience, Chinese was recognized as Italian.


