Posted to dev@tika.apache.org by "Jeremy McLain (JIRA)" <ji...@apache.org> on 2014/03/20 18:13:46 UTC

[jira] [Closed] (TIKA-1262) parseToString fails to detect content-type / charset

     [ https://issues.apache.org/jira/browse/TIKA-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremy McLain closed TIKA-1262.
-------------------------------

       Resolution: Not A Problem
    Fix Version/s: 1.5

See comment by Jukka Zitting.

> parseToString fails to detect content-type / charset
> ----------------------------------------------------
>
>                 Key: TIKA-1262
>                 URL: https://issues.apache.org/jira/browse/TIKA-1262
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.5
>         Environment: Java 1.7; Windows 7 64 bit
>            Reporter: Jeremy McLain
>             Fix For: 1.5
>
>         Attachments: ChineseTextExtraction.java, GB2312.txt, russian-koi8-r.txt
>
>
> The code that demonstrates this bug is in the attachment ChineseTextExtraction.java.
> Observed behavior:
> Tika.parseToString(InputStream, Metadata) incorrectly detects 'application/octet-stream' for the Content-Type and returns an empty string for the contents.
> Expected behavior:
> It should detect 'text/plain' for the Content-Type and return a Unicode string of the contents of the file.
> Notes:
> GB2312.txt is a plain text file containing Chinese text encoded in GB2312, a very common charset that Tika should handle without problems. In fact, the CharsetDetector class on its own accurately detects the charset as GB18030, which is a superset of GB2312, and CharsetDetector.getString() converts the GB2312 bytes to Unicode just fine. I don't understand why the Tika facade fails.
> Edit:
> I have the same issue with the file russian-koi8-r.txt. KOI8-R is also a common charset, so this does not appear to be specific to GB2312. ISO-8859-1 (English) files seem to work fine.
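
For illustration only, here is a minimal JDK-only sketch of the charset relationship described in the notes above. It is not the attached ChineseTextExtraction.java and does not use Tika. The class name CharsetRoundTrip and the helper decode() are hypothetical, and the sketch assumes the running JDK ships the GB2312, GB18030 and KOI8-R charsets. It shows that bytes encoded as GB2312 decode correctly as GB18030 (the superset relationship), that KOI8-R round-trips, and that decoding with the wrong charset does not recover the text.

```java
import java.nio.charset.Charset;

public class CharsetRoundTrip {

    // Hypothetical helper: decode raw bytes with an explicitly named charset.
    public static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the source file
        // encoding does not matter.
        String chinese = "\u4e2d\u6587";
        byte[] gb2312Bytes = chinese.getBytes(Charset.forName("GB2312"));

        // GB18030 is a superset of GB2312, so the same bytes decode to the same text.
        System.out.println(decode(gb2312Bytes, "GB18030").equals(chinese));

        // A Russian word encoded and decoded as KOI8-R.
        String russian = "\u0440\u0443\u0441\u0441\u043a\u0438\u0439";
        byte[] koi8Bytes = russian.getBytes(Charset.forName("KOI8-R"));
        System.out.println(decode(koi8Bytes, "KOI8-R").equals(russian));

        // Decoding the GB2312 bytes as ISO-8859-1 does not recover the original text.
        System.out.println(decode(gb2312Bytes, "ISO-8859-1").equals(chinese));
    }
}
```

The charset is named explicitly here. Detecting it from the bytes, which is what the report is about, is a separate step that this sketch does not attempt.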



--
This message was sent by Atlassian JIRA
(v6.2#6252)