Posted to dev@pdfbox.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2022/11/06 20:25:00 UTC

[jira] [Updated] (PDFBOX-5540) export:text creates jibberish / malformed output

     [ https://issues.apache.org/jira/browse/PDFBOX-5540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5540:
------------------------------------
    Affects Version/s: 3.0.0 PDFBox
                           (was: 3.0.0 JBIG2)

> export:text creates jibberish / malformed output
> ------------------------------------------------
>
>                 Key: PDFBOX-5540
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5540
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 3.0.0 PDFBox
>         Environment: Same on Windows, Linux and macOS
>            Reporter: Alfons
>            Priority: Minor
>         Attachments: test.pdf, test.txt
>
>
> I am using PDFBox as part of Tika, and some PDFs produce unreadable text output. Copying text from the same files in Adobe, macOS Preview, or browsers works as expected.
> I also tried "re-encoding" the PDF by editing and saving it with Acrobat, in case the original PDF creator was at fault, and tried running PDFBox with different encodings, but the output remained mostly unchanged.
> I attached the PDF and the text that PDFBox produces from it. I ran PDFBox via the CLI as follows:
> {code:java}
> root % java -jar pdfbox-app-3.0.0-alpha3.jar export:text -i test.pdf          
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font 
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font 
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font 
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font 
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead {code}
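
The warnings above say that the font's ToUnicode CMap was rejected and a predefined identity CMap was used instead. A minimal sketch of why such a fallback can yield unreadable text, assuming the font addresses glyphs by codes that do not coincide with Unicode values. This is not PDFBox code: the class name, the lookup table, and the codes are hypothetical, and whether the attached test.pdf behaves this way is not confirmed here.

```java
import java.util.Map;

// Illustration only: contrasts a ToUnicode lookup with an identity fallback.
class IdentityFallbackDemo {

    // Hypothetical ToUnicode table: character code -> Unicode text.
    static final Map<Integer, String> TO_UNICODE = Map.of(
            0x0033, "P",
            0x0027, "D",
            0x0029, "F");

    // With a usable ToUnicode table, each code is looked up explicitly;
    // unknown codes become U+FFFD (replacement character).
    static String decodeWithToUnicode(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(TO_UNICODE.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    // With an identity mapping, each code is taken as the Unicode value itself.
    static String decodeWithIdentity(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append((char) code);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = {0x0033, 0x0027, 0x0029};
        System.out.println(decodeWithToUnicode(codes)); // PDF
        System.out.println(decodeWithIdentity(codes));  // 3')
    }
}
```

In this sketch the same three codes read as "PDF" through the table and as "3')" through the identity mapping, which is the kind of mismatch that would explain why other viewers copy the text correctly while the extracted output is unreadable.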


