Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2010/08/27 18:09:53 UTC

[jira] Resolved: (TIKA-501) Encoding based language estimate wrong for UTF-8 plaintext

     [ https://issues.apache.org/jira/browse/TIKA-501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler resolved TIKA-501.
------------------------------

    Resolution: Fixed

http://svn.apache.org/viewvc?view=revision&revision=990186

> Encoding based language estimate wrong for UTF-8 plaintext
> ----------------------------------------------------------
>
>                 Key: TIKA-501
>                 URL: https://issues.apache.org/jira/browse/TIKA-501
>             Project: Tika
>          Issue Type: Bug
>          Components: cli
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>            Assignee: Ken Krugler
>            Priority: Minor
>             Fix For: 0.8
>
>         Attachments: TIKA-501.patch
>
>
> When the CLI tool is run on a plain-text file with metadata output, the "Content-Language:" value comes from an encoding-based language estimate.
> That estimate is not reliable: it detects nothing for UTF-8 and reports English for ISO-8859-1.
> Jukka wrote:
> {quote}
> We already dropped encoding-based language estimates from the HTML
> parser, and I think we should do the same also for plain text
> documents.
> {quote}
> Chris, Paul and Ingo already +1'ed this on the mailing list.
> PS: I think it is not obvious that "Content-Language" is not based on the LanguageIdentifier feature, and it would make sense to clarify this. However, another issue has been filed to enable true language identification from the CLI as well, which would fill this gap.
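
To illustrate why an encoding-based estimate behaves as the report describes, here is a minimal sketch. It is not Tika's actual code: the class `EncodingLanguageGuess`, the method `guessLanguage`, and the charset-to-language table are all hypothetical and chosen only to reproduce the reported symptoms (nothing for UTF-8, English for ISO-8859-1).

```java
import java.util.HashMap;
import java.util.Map;

public class EncodingLanguageGuess {

    // Hypothetical lookup table, for illustration only (not Tika's real table).
    // It maps a charset name to the language "most often" written in it.
    private static final Map<String, String> CHARSET_TO_LANGUAGE = new HashMap<>();

    static {
        CHARSET_TO_LANGUAGE.put("ISO-8859-1", "en");
        CHARSET_TO_LANGUAGE.put("Shift_JIS", "ja");
        // UTF-8 has no entry: it can encode text in any language,
        // so the charset alone says nothing about the language.
    }

    // Returns a language code guessed from the charset name alone,
    // or null when the charset gives no hint. The text is never inspected.
    public static String guessLanguage(String charsetName) {
        return CHARSET_TO_LANGUAGE.get(charsetName);
    }

    public static void main(String[] args) {
        // A Norwegian document stored as ISO-8859-1 is still reported as "en",
        // because only the charset name is consulted.
        System.out.println("ISO-8859-1 -> " + guessLanguage("ISO-8859-1"));
        // Any UTF-8 document yields no estimate at all.
        System.out.println("UTF-8 -> " + guessLanguage("UTF-8"));
    }
}
```

Because the guess never looks at the document content, it cannot tell Norwegian from English in ISO-8859-1, which is why content-based identification is the more sensible source for this metadata.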
