Posted to dev@pdfbox.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2022/04/01 15:14:00 UTC

[jira] [Commented] (PDFBOX-5406) Assumption of Identity Not Valid for Text Extraction

    [ https://issues.apache.org/jira/browse/PDFBOX-5406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515961#comment-17515961 ] 

Tilman Hausherr commented on PDFBOX-5406:
-----------------------------------------

Yes, sometimes we get trash. But there are also cases where Adobe Reader produces trash, and some files have a /ToUnicode map and still return trash.

We don't have a "strict" setting because there's no simple solution. Instead, use a word dictionary to detect whether the extracted text is trash and, if it is, run OCR.
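
A rough sketch of the dictionary check, for illustration only. This is not PDFBox code: the class name ExtractionSanityCheck, its methods, the tiny word list and the 0.5 threshold are all assumptions, and a real check would load a proper word list for the document's language. The input would be whatever the caller got back from text extraction (for example the string returned by PDFTextStripper.getText).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, not part of PDFBox: scores extracted text against a word list. */
final class ExtractionSanityCheck {

    private ExtractionSanityCheck() {
    }

    /**
     * Returns the fraction of alphabetic tokens (two or more letters) that
     * appear in the dictionary. Returns 0.0 when the text has no such tokens.
     */
    static double knownWordRatio(String text, Set<String> dictionary) {
        int total = 0;
        int known = 0;
        // Split on any run of characters that are not Unicode letters.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.length() < 2) {
                continue; // skip empty strings and single letters
            }
            total++;
            if (dictionary.contains(token)) {
                known++;
            }
        }
        return total == 0 ? 0.0 : (double) known / total;
    }

    /** minRatio is an assumed threshold that has to be tuned on real documents. */
    static boolean looksLikeGarbage(String text, Set<String> dictionary, double minRatio) {
        return knownWordRatio(text, dictionary) < minRatio;
    }

    public static void main(String[] args) {
        // Toy dictionary; a real one would hold tens of thousands of lowercase words.
        Set<String> dictionary = new HashSet<>(Arrays.asList(
                "the", "text", "of", "this", "document", "is", "readable"));
        String good = "The text of this document is readable.";
        String bad = "Xqz ghtr plm wvk zzrt qpl.";
        System.out.println(looksLikeGarbage(good, dictionary, 0.5));
        System.out.println(looksLikeGarbage(bad, dictionary, 0.5));
    }
}
```

If looksLikeGarbage returns true, the caller can discard the extracted text and fall back to OCR. The check is a heuristic: text in a language the dictionary does not cover, or pages that are mostly numbers, will also score low.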

> Assumption of Identity Not Valid for Text Extraction
> ----------------------------------------------------
>
>                 Key: PDFBOX-5406
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5406
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.24
>            Reporter: Michael Tighe
>            Priority: Major
>
> PDF BOX issue 1090 (closed years ago) makes an assumption that can lead to serious issues when the text extraction process returns garbage.
> Version: PDFBOX v2.0.24
> PDFBOX -> PDFont.java -> loadUnicodeCMap line 150
> The code distinctly KNOWS that there is no UNICODE map.
> It then makes a number of guesses, runs out of options, and explicitly makes an assumption that silently creates bad output.
>     LOG.warn("Invalid ToUnicode CMap in font " + getName());
>     ...
>     LOG.warn("Using predefined identity CMap instead");
> Every document that I've seen that produces that WARNING has bad text returned for the document when you use PDFBOX to do text extraction.
> My logic is that the CMap is being ignored by the producer of that PDF, and assuming that it's possible to use the reverse causes silent failure on the part of PDFBOX.  The software package calling PDFBOX gets no warning that there is an issue.
> I propose that this code throw an exception rather than a warning.
> That way the extraction caller KNOWS that the text is wrong.
> I have examples identical to those shown in the original issue.
> Is there any more recent work on this issue?  E.g., parameters that could be set to say "I want perfect extraction or no extraction"? 


