Posted to dev@pdfbox.apache.org by "Michael Tighe (Jira)" <ji...@apache.org> on 2022/03/31 16:12:00 UTC

[jira] [Created] (PDFBOX-5406) Assumption of Identity Not Valid for Text Extraction

Michael Tighe created PDFBOX-5406:
-------------------------------------

             Summary: Assumption of Identity Not Valid for Text Extraction
                 Key: PDFBOX-5406
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5406
             Project: PDFBox
          Issue Type: Bug
    Affects Versions: 2.0.24
            Reporter: Michael Tighe


PDFBOX issue 1090 (closed years ago) makes an assumption that can cause serious problems: the text extraction process silently returns garbage.

Version: PDFBOX v2.0.24

PDFBOX -> PDFont.java -> loadUnicodeCMap line 150

The code distinctly KNOWS that there is no UNICODE map.

It then makes a number of guesses, runs out of options, and explicitly makes an assumption that silently creates bad output:

    LOG.warn("Invalid ToUnicode CMap in font " + getName());
    ...
    LOG.warn("Using predefined identity CMap instead");

Every document I have seen that triggers that warning yields bad text when PDFBOX is used for text extraction.

My reasoning is that the producer of that PDF ignored the CMap, so assuming that the mapping can be applied in reverse makes PDFBOX fail silently. The software package calling PDFBOX gets no indication that there is an issue.

I propose that this code throw an exception rather than log a warning.

That way the extraction caller KNOWS that the text is wrong.
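
To make the proposal concrete, here is a minimal sketch of the behaviour I have in mind. It is not PDFBOX code: UnicodeResolver, its Policy enum and MissingUnicodeMappingException are hypothetical names invented for this illustration, and the "identity" fallback is simplified to treating the character code as a Unicode code point.

```java
import java.util.Map;

// Hypothetical exception type; nothing with this name exists in PDFBOX.
class MissingUnicodeMappingException extends RuntimeException {
    MissingUnicodeMappingException(String message) {
        super(message);
    }
}

// Hypothetical stand-in for the part of a font that maps codes to text.
class UnicodeResolver {
    enum Policy { LENIENT, STRICT }

    private final String fontName;
    private final Map<Integer, String> toUnicode; // null = no usable ToUnicode map
    private final Policy policy;

    UnicodeResolver(String fontName, Map<Integer, String> toUnicode, Policy policy) {
        this.fontName = fontName;
        this.toUnicode = toUnicode;
        this.policy = policy;
    }

    // Returns the text for a character code, or null if a map exists
    // but has no entry for that code.
    String resolve(int code) {
        if (toUnicode != null) {
            return toUnicode.get(code);
        }
        if (policy == Policy.STRICT) {
            // Proposed behaviour: fail loudly instead of guessing.
            throw new MissingUnicodeMappingException(
                "No usable ToUnicode map in font " + fontName + " (code " + code + ")");
        }
        // Lenient behaviour: identity assumption (simplified for this sketch).
        return new String(Character.toChars(code));
    }
}
```

With Policy.LENIENT the caller receives text either way and cannot tell a guess from a real mapping; with Policy.STRICT the caller receives an exception and can decide to reject the document or fall back to OCR.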

I have examples identical to those shown in the original issue.

Is there any more recent work on this issue?  E.g., parameters that could be set to say "I want perfect extraction or no extraction"? 
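
In the absence of such a parameter, one caller-side workaround might be to watch the log output for the fallback message. The sketch below uses only java.util.logging; IdentityFallbackDetector is a hypothetical class name, and the approach assumes that PDFBOX's log output actually reaches java.util.logging in the caller's setup, which depends on the logging configuration and which I have not verified for every case.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.logging.Handler;
import java.util.logging.LogRecord;

// Hypothetical helper: remembers whether the identity-CMap fallback
// warning was logged while this handler was attached.
class IdentityFallbackDetector extends Handler {
    // Message text copied from the warning quoted above.
    static final String MARKER = "Using predefined identity CMap instead";

    private final AtomicBoolean seen = new AtomicBoolean(false);

    @Override
    public void publish(LogRecord record) {
        String message = record.getMessage();
        if (message != null && message.contains(MARKER)) {
            seen.set(true);
        }
    }

    @Override
    public void flush() {
        // Nothing is buffered.
    }

    @Override
    public void close() {
        // No resources to release.
    }

    boolean fallbackSeen() {
        return seen.get();
    }

    void reset() {
        seen.set(false);
    }
}
```

The caller would attach the handler to the relevant logger before extraction, call fallbackSeen() afterwards, and discard the text if it returns true. The flag is shared by everything logged while the handler is attached, so documents processed concurrently cannot be told apart this way; it is a stopgap, not a substitute for an exception.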



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
