Posted to dev@tika.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2021/03/01 18:29:00 UTC

[jira] [Commented] (TIKA-3307) extracted text strings have repeated characters

    [ https://issues.apache.org/jira/browse/TIKA-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17293083#comment-17293083 ] 

Tilman Hausherr commented on TIKA-3307:
---------------------------------------

suppressDuplicateOverlappingText is set to false by default.
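
For readers unfamiliar with the option: some PDFs fake bold text by drawing each glyph twice with a tiny offset, and with suppression off both copies end up in the extracted text. The following is only a simplified sketch of that idea, not PDFBox's or Tika's actual code. The Glyph class, the extract method, the one-third-of-width tolerance, and the sample coordinates are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OverlapDedupSketch {

    /** Hypothetical stand-in for a positioned glyph on a page. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;
        final float width;

        Glyph(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Concatenates glyph text in drawing order. When suppressDuplicates is
     * true, a glyph is skipped if the same character was already emitted at
     * (almost) the same position.
     */
    static String extract(List<Glyph> glyphs, boolean suppressDuplicates) {
        Map<String, List<Glyph>> seen = new HashMap<>();
        StringBuilder out = new StringBuilder();
        for (Glyph g : glyphs) {
            if (suppressDuplicates) {
                List<Glyph> same = seen.computeIfAbsent(g.text, k -> new ArrayList<>());
                float tolerance = g.width / 3.0f; // assumed tolerance
                boolean duplicate = false;
                for (Glyph prev : same) {
                    if (Math.abs(prev.x - g.x) < tolerance
                            && Math.abs(prev.y - g.y) < tolerance) {
                        duplicate = true;
                        break;
                    }
                }
                if (duplicate) {
                    continue; // overlapping copy of an already emitted glyph
                }
                same.add(g);
            }
            out.append(g.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A "fake bold" heading: every glyph is drawn twice, 0.3 units apart.
        List<Glyph> heading = List.of(
                new Glyph("T", 10.0f, 100f, 6f), new Glyph("T", 10.3f, 100f, 6f),
                new Glyph("i", 16.0f, 100f, 3f), new Glyph("i", 16.3f, 100f, 3f),
                new Glyph("k", 19.0f, 100f, 5f), new Glyph("k", 19.3f, 100f, 5f),
                new Glyph("a", 24.0f, 100f, 5f), new Glyph("a", 24.3f, 100f, 5f));
        System.out.println(extract(heading, false));
        System.out.println(extract(heading, true));
    }
}
```

With suppression off the sample heading comes out with every character doubled; with it on, the overlapping copies are dropped. In Tika the switch should be reachable through PDFParserConfig.setSuppressDuplicateOverlappingText(true), which in turn configures the PDFBox text stripper.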

> extracted text strings have repeated characters
> -----------------------------------------------
>
>                 Key: TIKA-3307
>                 URL: https://issues.apache.org/jira/browse/TIKA-3307
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Paul Tyson
>            Priority: Major
>         Attachments: WSHP-PRC025F-EN_07132019.pdf
>
>
> Extracted text from some PDF files includes some strings with repeated (doubled) characters.
> To reproduce the problem, download the attached PDF file and run the following command:
> {code:java}
> java -jar ./tika-app-1.25.jar -T WSHP-PRC025F-EN_07132019.pdf | egrep '(.)\1(.)\2'
> {code}
> The bad strings all seem to be headings, so perhaps something in the font or other style features is causing the problem.
> First detected in version 1.19, retested with 1.25. Did not test earlier versions.
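
The egrep pattern in the report, (.)\1(.)\2, matches two doubled characters in a row (for example "TTii"). A rough Java equivalent, in case the same check is wanted in a test rather than on the command line, might look like the sketch below. The class and method names are made up for illustration, and the check is only a heuristic: ordinary words such as "coffee" match too.

```java
import java.util.regex.Pattern;

public class DoubledCharCheck {

    // Same expression as the egrep call in the report: one character
    // repeated, immediately followed by another character repeated.
    private static final Pattern DOUBLED_PAIRS = Pattern.compile("(.)\\1(.)\\2");

    /** Returns true if the line contains two doubled characters in a row. */
    static boolean looksDoubled(String line) {
        return DOUBLED_PAIRS.matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(looksDoubled("IInnssttaallllaattiioonn")); // true
        System.out.println(looksDoubled("Installation"));             // false
        System.out.println(looksDoubled("coffee"));                   // true (false positive)
    }
}
```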


