Posted to dev@pdfbox.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2020/01/09 19:54:00 UTC

[jira] [Commented] (PDFBOX-4737) Text extraction is gibberish

    [ https://issues.apache.org/jira/browse/PDFBOX-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012180#comment-17012180 ] 

Tilman Hausherr commented on PDFBOX-4737:
-----------------------------------------

Regarding the file mentioned in PDFBOX-4549 ([^obfuscateTest_Duplicate_2_3.pdf]): text extraction currently returns nothing for it, although Adobe does return some text.

If we did "strict" text extraction, we could still hit files that are purposely obfuscated; see the comment by mkl. IMHO this isn't the job of PDFBox and should be done by the caller.
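
To illustrate what a caller-side check might look like, here is a minimal sketch. It is not part of PDFBox: the class name ExtractedTextCheck, both method names and the threshold are hypothetical, and the heuristic (counting replacement, control, private-use and unassigned characters) is only an assumption about what "gibberish" tends to look like when a font has no usable Unicode mapping. It works on an already extracted String, for example the result of PDFTextStripper.getText(), and it would not catch purposely obfuscated files whose glyphs map to ordinary but wrong letters.

```java
// Hypothetical caller-side helper; not a PDFBox API.
final class ExtractedTextCheck {

    private ExtractedTextCheck() {
    }

    /**
     * Returns the fraction of non-whitespace code points that are unlikely
     * to appear in real text (U+FFFD, control, private use, unassigned,
     * lone surrogates). Returns 0.0 for empty or whitespace-only input.
     */
    static double suspiciousRatio(String text) {
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // layout whitespace says nothing about mapping quality
            }
            total++;
            int type = Character.getType(cp);
            if (cp == 0xFFFD
                    || type == Character.CONTROL
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED
                    || type == Character.SURROGATE) {
                suspicious++;
            }
        }
        return total == 0 ? 0.0 : (double) suspicious / total;
    }

    /** The threshold is an arbitrary value chosen by the caller. */
    static boolean looksLikeGibberish(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }
}
```

A caller could then decide for itself whether to keep or discard the text, e.g. by calling ExtractedTextCheck.looksLikeGibberish(text, 0.3) and tuning the threshold against its own documents.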

> Text extraction is gibberish
> ----------------------------
>
>                 Key: PDFBOX-4737
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4737
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.18
>            Reporter: Jorge Spinsanti
>            Priority: Major
>         Attachments: noUnicodeMapping.pdf, obfuscateTest_Duplicate_2_3.pdf
>
>
> As discussed in https://issues.apache.org/jira/browse/PDFBOX-4549, there are many PDFs for which text extraction returns gibberish.
> Perhaps you could add two modes (strict/lax) to text extraction, so that gibberish can be avoided when it is not useful. I have attached a file to help analyze the problem.
> [^noUnicodeMapping.pdf]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
