Posted to dev@tika.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2021/09/08 06:37:00 UTC

[jira] [Commented] (TIKA-3545) TIKA PDF parsing issues

    [ https://issues.apache.org/jira/browse/TIKA-3545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411720#comment-17411720 ] 

Tilman Hausherr commented on TIKA-3545:
---------------------------------------

Please attach the PDF. Also try copying and pasting the text from the PDF in Adobe to see whether the strange characters show up there too.

> TIKA PDF parsing issues
> -----------------------
>
>                 Key: TIKA-3545
>                 URL: https://issues.apache.org/jira/browse/TIKA-3545
>             Project: Tika
>          Issue Type: Bug
>          Components: parser, tika-server
>    Affects Versions: 1.21
>         Environment: Tested on DEV env
>            Reporter: Priya
>            Priority: Major
>         Attachments: 365.jpg
>
>
> I am using the tika-core 1.21 and tika-parsers 1.21 jar files as Tika dependencies in ManifoldCF 2.14 to crawl some files. Some of the PDF files are not parsed correctly.
> Strange characters appear in the extracted text. I also tried Tika jar versions 1.24 and 1.27, but that didn't help (with 1.27 the files were not even extracted correctly).
>
> I also checked the document content, and it seems to be fine.
> Can anybody help me with this?
> The attached image shows the strange characters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)