Posted to dev@tika.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2021/03/11 18:27:00 UTC

[jira] [Closed] (TIKA-3307) extracted text strings have repeated characters

     [ https://issues.apache.org/jira/browse/TIKA-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr closed TIKA-3307.
---------------------------------
    Resolution: Not A Bug

> extracted text strings have repeated characters
> -----------------------------------------------
>
>                 Key: TIKA-3307
>                 URL: https://issues.apache.org/jira/browse/TIKA-3307
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Paul Tyson
>            Priority: Major
>         Attachments: WSHP-PRC025F-EN_07132019.pdf
>
>
> Extracted text from some PDF files includes strings with repeated (doubled) characters.
> To reproduce the problem, download the attached PDF file and run the following command:
> {code:java}
> java -jar ./tika-app-1.25.jar -T WSHP-PRC025F-EN_07132019.pdf | egrep '(.)\1(.)\2'
> {code}
> The bad strings all seem to be headings, so perhaps something in the font or other style features is causing the problem.
> First detected in version 1.19; retested with 1.25. Earlier versions were not tested.
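
For readers unfamiliar with the pattern in the command above, here is a minimal sketch of what it selects: lines that contain two consecutive doubled characters. The helper name `find_doubled` and the sample lines are hypothetical and are not taken from the attached PDF. It uses `grep -E`, which should behave the same as the `egrep` in the report. Back-references in extended regular expressions are an extension supported by GNU and BSD grep rather than something POSIX guarantees.

```shell
# Hypothetical helper (not part of Tika): read text on stdin and print only
# the lines that contain two consecutive doubled characters, e.g. "IInn".
find_doubled() {
  grep -E '(.)\1(.)\2'
}

# Made-up sample lines standing in for the output of "tika-app -T".
printf '%s\n' 'IInnttrroodduuccttiioonn' 'Normal heading' | find_doubled
```

The pattern is only a heuristic. An ordinary word such as "bookkeeper" also matches (the "ookk" run), so any hits still need a manual check against the source PDF.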



--
This message was sent by Atlassian Jira
(v8.3.4#803005)