Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2011/08/09 17:24:27 UTC

[jira] [Commented] (NUTCH-619) Another Language Identifier Plugin using Unicode code point range

    [ https://issues.apache.org/jira/browse/NUTCH-619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081699#comment-13081699 ] 

Lewis John McGibbney commented on NUTCH-619:
--------------------------------------------

If language identification is delegated to Apache Tika, will all of the above points be considered and addressed?

Apache Tika is understandably still evolving (and this issue quite clearly is not); however, I think the points made above about linguistic properties should be considered in any language identification process.

If, on the other hand, we can confirm that the above points will be addressed, then I suggest we close this issue and note that it has been superseded by NUTCH-1075.

> Another Language Identifier Plugin using Unicode code point range
> -----------------------------------------------------------------
>
>                 Key: NUTCH-619
>                 URL: https://issues.apache.org/jira/browse/NUTCH-619
>             Project: Nutch
>          Issue Type: Wish
>            Reporter: Vinci
>
> After checking the language-identifier plugin, I found that its internal implementation is inefficient for languages that can be clearly identified from their Unicode code points (e.g. CJK languages).
> If Nutch works with Unicode, can anybody write a language identifier based on Unicode code point ranges? The map is here:
> http://en.wikipedia.org/wiki/Basic_Multilingual_Plane
> You can also refer to NutchAnalysis.jj for some of the language code ranges.
> * Some later-added or rare characters, including some CJK characters, are located in the SIP (Supplementary Ideographic Plane).
> * Maybe a special property should be set if characters from multiple languages are detected (languages that use something other than the English alphabet). My suggestion here is to let the CJK locale be the default case, as those languages need a bi-gram or other analyzer for better indexing.
> ** CJK characters are very difficult to divide further because they share Han characters. If you really want to identify the specific member of CJK, you need to use the language identifier plugin.
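
For illustration only, the code-point-range idea in the description above could be sketched roughly as follows. This is not code from Nutch or Tika: the class name ScriptRangeSketch and its two methods are hypothetical, and the sketch assumes Java 7 or later so that the standard Character.UnicodeScript lookup can stand in for a hand-written range table. It walks the text by code point, so Han characters in the SIP are handled, and it treats any Han, Hiragana, Katakana or Hangul letter as a reason to fall back to CJK analysis, as the description suggests for mixed text.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Nutch or Tika. */
final class ScriptRangeSketch {

    /** Returns the most frequent script among the letters in the text, or UNKNOWN if there are none. */
    static UnicodeScript dominantScript(String text) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        // Iterate by code point rather than by char, so that supplementary-plane
        // characters (for example Han characters in the SIP) are counted once each.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // ignore digits, punctuation and whitespace
            }
            counts.merge(UnicodeScript.of(cp), 1, Integer::sum);
        }
        UnicodeScript best = UnicodeScript.UNKNOWN;
        int bestCount = 0;
        for (Map.Entry<UnicodeScript, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** Assumption: a single CJK letter is enough to prefer a CJK (e.g. bi-gram) analyzer. */
    static boolean containsCjk(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            switch (UnicodeScript.of(cp)) {
                case HAN:
                case HIRAGANA:
                case KATAKANA:
                case HANGUL:
                    return true;
                default:
                    break;
            }
        }
        return false;
    }
}
```

A caller could check containsCjk(text) first and only hand the text to the n-gram based language identifier when the specific language (Chinese, Japanese or Korean) actually matters, since the script alone cannot separate languages that share Han characters.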

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira