Posted to dev@tika.apache.org by "Antoine Libert (JIRA)" <ji...@apache.org> on 2013/04/10 16:38:15 UTC

[jira] [Created] (TIKA-1103) Tika.parseToString(InputStream) does not output the same content as parseToString(File)

Antoine Libert created TIKA-1103:
------------------------------------

             Summary: Tika.parseToString(InputStream) does not output the same content as parseToString(File)
                 Key: TIKA-1103
                 URL: https://issues.apache.org/jira/browse/TIKA-1103
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.3, 1.2, 1.1
         Environment: Windows 7 x64
            Reporter: Antoine Libert


Tika.parseToString(...) returns different results depending on the argument type for the following PDF file (the iPhone user guide in German; the bug also occurs with the French version).

http://manuals.info.apple.com/de_DE/iphone_benutzerhandbuch.pdf

1.3 parseToString(File) : actual content (good)
1.2 parseToString(File) : actual content (good)
1.1 parseToString(File) : actual content (good)

1.3 parseToString(InputStream) : empty
1.2 parseToString(InputStream) : PDF binary shown as text
1.1 parseToString(InputStream) : PDF binary shown as text


Simple test case:

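import java.io.File;
import org.apache.tika.Tika;
import org.apache.tika.io.TikaInputStream;
import static org.junit.Assert.assertTrue;

Tika tika = new Tika();
File f = new File("iphone_benutzerhandbuch.pdf");
TikaInputStream is2 = TikaInputStream.get(f);
String st2 = tika.parseToString(is2); // parsed from the InputStream
String stt2 = tika.parseToString(f); // parsed from the File
assertTrue(st2.equals(stt2)); // fails: the two strings differ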


