Posted to dev@tika.apache.org by "Tyler Palsulich (JIRA)" <ji...@apache.org> on 2014/07/21 15:50:39 UTC
[jira] [Closed] (TIKA-1050) Charset detection gives wrong results for GB18030 encoding
[ https://issues.apache.org/jira/browse/TIKA-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tyler Palsulich closed TIKA-1050.
---------------------------------
Resolution: Cannot Reproduce
Fix Version/s: 1.6
Assignee: Tyler Palsulich
The attached file is detected as GB18030 by tika-app 1.6-SNAPSHOT (output below), so I'm closing this issue. Amit, let me know if you're still having problems.
{code}
➜ java -jar tika-app/target/tika-app-1.6-SNAPSHOT.jar Test\ data-GB.txt
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Length" content="403"/>
<meta name="Content-Encoding" content="GB18030"/>
<meta name="Content-Type" content="text/plain; charset=GB18030"/>
<meta name="resourceName" content="Test data-GB.txt"/>
{code}
> Charset detection gives wrong results for GB18030 encoding
> ----------------------------------------------------------
>
> Key: TIKA-1050
> URL: https://issues.apache.org/jira/browse/TIKA-1050
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.2
> Reporter: Amit Gupta
> Assignee: Tyler Palsulich
> Priority: Critical
> Fix For: 1.6
>
> Attachments: Test data-GB.txt
>
>
> CharsetDetector reports IBM866 as the charset for a text file that is encoded in GB18030.
> GB18030 gets a lower confidence than IBM866.
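
For context on the report above, here is a minimal sketch of why the two charsets can compete at all. It uses only the JDK and does not call Tika's CharsetDetector. The class name, helper method, and sample string are hypothetical and are not taken from the attached file. IBM866 is a single-byte code page that assigns a character to every byte value, so GB18030 bytes also decode under it without error. A detector therefore cannot rule IBM866 out on validity alone and has to rank the candidates by confidence.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class CharsetAmbiguityDemo {

    // Strictly decode the bytes; return null if they are not valid in the charset.
    public static String strictDecode(byte[] bytes, String charsetName) {
        try {
            return Charset.forName(charsetName)
                    .newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample text (four CJK characters), not the attachment.
        String sample = "\u4e2d\u6587\u6d4b\u8bd5";
        byte[] gbBytes = sample.getBytes(Charset.forName("GB18030"));

        // Both decodes succeed; only the GB18030 one recovers the original text.
        String asGb = strictDecode(gbBytes, "GB18030");
        String asIbm866 = strictDecode(gbBytes, "IBM866");

        System.out.println("GB18030 round trip ok: " + sample.equals(asGb));
        System.out.println("IBM866 decode valid:   " + (asIbm866 != null));
        System.out.println("IBM866 decode matches: " + sample.equals(asIbm866));
    }
}
```

The sketch only shows that both decodings are structurally valid. It says nothing about how CharsetDetector computes its confidence values.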
--
This message was sent by Atlassian JIRA
(v6.2#6252)