Posted to dev@tika.apache.org by "Tyler Palsulich (JIRA)" <ji...@apache.org> on 2015/03/15 21:42:38 UTC

[jira] [Commented] (TIKA-1174) Invalid characters in filtered PDF output

    [ https://issues.apache.org/jira/browse/TIKA-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362545#comment-14362545 ] 

Tyler Palsulich commented on TIKA-1174:
---------------------------------------

Thank you for reporting this, [~mattsheppard]! [~tallison@apache.org] or [~tilman], any comment on this PDF encoding issue?

> Invalid characters in filtered PDF output
> -----------------------------------------
>
>                 Key: TIKA-1174
>                 URL: https://issues.apache.org/jira/browse/TIKA-1174
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>         Environment: Mac OS X 10.8.5, Java 1.7u40 (but also seen on CentOS5)
>            Reporter: Matt Sheppard
>            Priority: Minor
>         Attachments: map_sp_1c_a4.pdf
>
>
> The PDF document at http://www.logan.qld.gov.au/__data/assets/pdf_file/0010/9496/map_sp_1a_a4.pdf produces invalid characters in the output when filtered by Tika 1.4.
> {noformat}
> >
> /opt/funnelback/mbin/java/bin/java -jar tika-app-1.4.jar map_sp_1c_a4.pdf | head -n 40
> ERROR - Error: Could not parse predefined CMAP file for 'nullžf °-ˇžl,¡ì$1-UCS2'
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> [snip]
> <p>Cycle network
> </p>
> <p>
> </p>
> <p>HILEY
> </p>
> {noformat}
> Is there any proper way to avoid this, or is the best approach to strip such characters from Tika's output?
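
On the stripping option raised above, here is a minimal sketch in plain Java of what post-filtering could look like. It assumes the offending characters are code points outside the XML 1.0 "Char" range (control characters, unpaired surrogates, U+FFFE/U+FFFF); that is an assumption, not something confirmed for this PDF. If the characters are valid code points that were simply mis-decoded after the CMAP failure, this filter will leave them in place. The class and method names (XmlCharFilter, stripInvalidXmlChars) are hypothetical and are not part of Tika.

```java
// Hypothetical helper, not part of Tika: removes code points that the
// XML 1.0 specification does not allow in a document.
final class XmlCharFilter {

    // True if the code point matches the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the text with every invalid code point dropped.
    // Iterates by code point so that valid surrogate pairs survive intact,
    // while unpaired surrogates fall in the excluded range and are removed.
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

A caller would pass the extracted text through XmlCharFilter.stripInvalidXmlChars before indexing or serializing it. Dropping characters silently hides the underlying decoding problem, so this is a workaround at best rather than a fix for the CMAP error itself.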



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)