
Posted to dev@tika.apache.org by "Matt Sheppard (JIRA)" <ji...@apache.org> on 2013/09/20 07:30:54 UTC

[jira] [Created] (TIKA-1174) Invalid characters in filtered PDF output

Matt Sheppard created TIKA-1174:
-----------------------------------

             Summary: Invalid characters in filtered PDF output
                 Key: TIKA-1174
                 URL: https://issues.apache.org/jira/browse/TIKA-1174
             Project: Tika
          Issue Type: Bug
         Environment: Mac OS X 10.8.5, Java 1.7u40 (but also seen on CentOS5)
            Reporter: Matt Sheppard
            Priority: Minor


The PDF document at http://www.logan.qld.gov.au/__data/assets/pdf_file/0010/9496/map_sp_1a_a4.pdf produces invalid characters in the output when filtered by Tika 1.4.

{noformat}
> /opt/funnelback/mbin/java/bin/java -jar tika-app-1.4.jar map_sp_1c_a4.pdf | head -n 40
ERROR - Error: Could not parse predefined CMAP file for 'nullžf °-ˇžl,¡ì$1-UCS2'
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>


[snip]

<p>Cycle network
</p>
<p>
</p>
<p>HILEY

</p>
{noformat}

Is there any proper way to avoid this, or is the best approach to strip such characters from Tika's output?
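
For reference, the "strip such characters" fallback could look roughly like the sketch below. It assumes the offending characters are code points outside the XML 1.0 Char production (for example control characters such as U+0000 or unpaired surrogates), which the report does not confirm. The class and method names (XmlCharFilter, stripInvalidXmlChars) are hypothetical and not part of Tika; this would run as a post-processing step on the extracted text.

```java
// Hypothetical helper, not part of Tika: removes code points that are
// not permitted by the XML 1.0 "Char" production.
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** Returns true if the code point is allowed in an XML 1.0 document. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of text with every disallowed code point dropped. */
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // codePointAt returns the bare surrogate value for an unpaired
            // surrogate, which falls outside the allowed ranges and is dropped.
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

This only hides the symptom. If the bad characters come from the CMAP parsing error shown above, the extracted text may still be wrong or incomplete after filtering.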
