Posted to dev@tika.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2021/09/08 06:37:00 UTC
[jira] [Commented] (TIKA-3545) TIKA PDF parsing issues
[ https://issues.apache.org/jira/browse/TIKA-3545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411720#comment-17411720 ]
Tilman Hausherr commented on TIKA-3545:
---------------------------------------
Please attach the PDF. Also, try copying and pasting the text from Adobe.
> TIKA PDF parsing issues
> -----------------------
>
> Key: TIKA-3545
> URL: https://issues.apache.org/jira/browse/TIKA-3545
> Project: Tika
> Issue Type: Bug
> Components: parser, tika-server
> Affects Versions: 1.21
> Environment: Tested on DEV env
> Reporter: Priya
> Priority: Major
> Attachments: 365.jpg
>
>
> I am using the tika-core 1.21 and tika-parsers 1.21 jar files as Tika dependencies in ManifoldCF 2.14 to crawl some files. Some of the PDF files are not parsed correctly.
> When these PDF files are parsed, strange characters appear in the extracted text. I also tried changing the Tika jar versions to 1.24 and 1.27 (with 1.27, the files were not even extracted correctly).
>
> I also checked the document content, and it seems to be fine.
> Can anybody help me with this?
> An image showing the strange characters is attached for reference.
> Changing the Tika version did not help.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)