Posted to dev@tika.apache.org by "Tyler Palsulich (JIRA)" <ji...@apache.org> on 2014/07/21 15:50:39 UTC

[jira] [Closed] (TIKA-1050) Charset detection gives wrong results for GB18030 encoding

     [ https://issues.apache.org/jira/browse/TIKA-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich closed TIKA-1050.
---------------------------------

       Resolution: Cannot Reproduce
    Fix Version/s: 1.6
         Assignee: Tyler Palsulich

The attached file is detected as GB18030 with a 1.6-SNAPSHOT build of tika-app (output below), so I'm closing this issue. Let me know if you're still having problems, Amit.

{code}
➜ java -jar tika-app/target/tika-app-1.6-SNAPSHOT.jar Test\ data-GB.txt
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Length" content="403"/>
<meta name="Content-Encoding" content="GB18030"/>
<meta name="Content-Type" content="text/plain; charset=GB18030"/>
<meta name="resourceName" content="Test data-GB.txt"/>
{code}

> Charset detection gives wrong results for GB18030 encoding
> ----------------------------------------------------------
>
>                 Key: TIKA-1050
>                 URL: https://issues.apache.org/jira/browse/TIKA-1050
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.2
>            Reporter: Amit Gupta
>            Assignee: Tyler Palsulich
>            Priority: Critical
>             Fix For: 1.6
>
>         Attachments: Test data-GB.txt
>
>
> CharsetDetector gives IBM866 as the charset for a text file that is in GB18030.
> GB18030 gets a lower confidence than IBM866.
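
For background on why the two charsets can be confused at all: IBM866 (cp866) assigns a character to every byte value, so any GB18030 byte stream also decodes as IBM866 without error, and a detector has to rank the candidates statistically rather than by validity. Here is a minimal sketch of that, assuming Python's built-in codecs as a stand-in. It does not use Tika's CharsetDetector, and the sample string and helper names are made up for illustration (the sample is not the attached file).

```python
# Illustration only: Python's built-in codecs, not Tika's CharsetDetector.
SAMPLE = "字符集检测测试"  # hypothetical sample text, not the attached file


def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if data decodes under the encoding with no errors."""
    try:
        data.decode(encoding, errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def candidate_charsets(data: bytes, encodings=("gb18030", "cp866")):
    """List the encodings under which data is a valid byte sequence."""
    return [enc for enc in encodings if decodes_cleanly(data, enc)]


raw = SAMPLE.encode("gb18030")

# Both charsets accept the bytes, so validity alone cannot pick a winner.
print(candidate_charsets(raw))

# The same bytes read as IBM866 give Cyrillic and box-drawing characters.
print(raw.decode("cp866"))
```

Since both decodings succeed, the choice comes down to how well the decoded text matches each charset's expected statistics, which is the part that was reported as ranking IBM866 above GB18030.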



--
This message was sent by Atlassian JIRA
(v6.2#6252)