Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2015/09/01 23:49:47 UTC

[jira] [Commented] (TIKA-1723) Integrate language-detector into Tika

    [ https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14726266#comment-14726266 ] 

Ken Krugler commented on TIKA-1723:
-----------------------------------

Hi Tim - I just attached a new version of my patch, which addresses the issues raised above. There are still some TODOs, but I'd like to get this committed so you can do the move to tika-langdetect. Let me know what you think.

> Integrate language-detector into Tika
> -------------------------------------
>
>                 Key: TIKA-1723
>                 URL: https://issues.apache.org/jira/browse/TIKA-1723
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 1.11
>            Reporter: Ken Krugler
>            Assignee: Ken Krugler
>            Priority: Minor
>         Attachments: TIKA-1723-2.patch, TIKA-1723-3.patch, TIKA-1723.patch, TIKA-1723v2.patch
>
>
> The language-detector project at https://github.com/optimaize/language-detector is faster, supports more languages (70 vs 13), and has better accuracy than the built-in language detector.
> This is a stab at integrating it, with some initial findings. There are a number of issues this raises, especially if [~chrismattmann] moves forward with turning language detection into a pluggable extension point.
> I'll add comments with results below.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)