Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2013/04/10 17:50:15 UTC

[jira] [Commented] (TIKA-1103) Tika.parseToString(InputStream) does not output the same content as parseToString(File)

    [ https://issues.apache.org/jira/browse/TIKA-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627923#comment-13627923 ] 

Ken Krugler commented on TIKA-1103:
-----------------------------------

One possible explanation is that when you pass an InputStream to Tika, it can't use the file name suffix to help detect the file type. In the test case from the issue description, what happens if you explicitly process the input stream with the PDF parser?
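
To make the reasoning concrete, here is a toy sketch of how a type detector might combine a content check with a file name hint. It is not Tika's detection code: the class DetectSketch and the method detect are hypothetical names made up for this illustration, and the "leading junk" input is an invented case, not a claim about the attached PDF. The only outside fact it relies on is that PDF files normally begin with the bytes "%PDF-".

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical illustration only; not part of Tika.
class DetectSketch {

    // Returns a media type guess from the first bytes of the content and,
    // if one is known, the file name. Pass null when only a stream is available.
    public static String detect(byte[] head, String nameOrNull) {
        // Content-based check: look for the "%PDF-" signature at offset 0.
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (startsWith(head, magic)) {
            return "application/pdf";
        }
        // Name-based hint: only possible when the caller knows the file name.
        if (nameOrNull != null
                && nameOrNull.toLowerCase(Locale.ROOT).endsWith(".pdf")) {
            return "application/pdf";
        }
        // Nothing matched: fall back to the generic binary type.
        return "application/octet-stream";
    }

    // True if data begins with exactly the bytes in prefix.
    public static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] clean = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        // Invented input: two newlines before the signature.
        byte[] junk = "\n\n%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);

        System.out.println(detect(clean, null));        // signature found
        System.out.println(detect(junk, "guide.pdf"));  // rescued by the name
        System.out.println(detect(junk, null));         // no signature, no name
    }
}
```

In this sketch the third call is the interesting one: with neither a recognizable signature nor a name, the content is treated as generic binary data. If something similar is happening here, handing the stream directly to org.apache.tika.parser.pdf.PDFParser would bypass detection and show whether the parser itself copes with the file.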
                
> Tika.parseToString(InputStream) does not output the same content as parseToString(File)
> ---------------------------------------------------------------------------------------
>
>                 Key: TIKA-1103
>                 URL: https://issues.apache.org/jira/browse/TIKA-1103
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.1, 1.2, 1.3
>         Environment: Windows 7 x64
> Java 6 Update 33
>            Reporter: Antoine Libert
>
> Tika.parseToString(...) outputs different results for the following PDF file (the iPhone user guide in German; the bug also happens with the French version).
> http://manuals.info.apple.com/de_DE/iphone_benutzerhandbuch.pdf
> 1.3 parseToString(File) : actual content (good)
> 1.2 parseToString(File) : actual content (good)
> 1.1 parseToString(File) : actual content (good)
> 1.3 parseToString(InputStream) : empty
> 1.2 parseToString(InputStream) : PDF binary shown as text
> 1.1 parseToString(InputStream) : PDF binary shown as text
> Simple test case:
> Tika tika = new Tika();
> File f = new File("iphone_benutzerhandbuch.pdf");
> TikaInputStream is2 = TikaInputStream.get(f);
> String st2 = tika.parseToString(is2); // inputstream
> String stt2 = tika.parseToString(f); // file
> assertTrue(st2.equals(stt2)); // false
