Posted to dev@pdfbox.apache.org by "Maruan Sahyoun (JIRA)" <ji...@apache.org> on 2018/08/02 10:20:00 UTC
[jira] [Comment Edited] (PDFBOX-4284) LibreOffice6 PDF Conversion broke PDFTextStripper result
[ https://issues.apache.org/jira/browse/PDFBOX-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16566581#comment-16566581 ]
Maruan Sahyoun edited comment on PDFBOX-4284 at 8/2/18 10:19 AM:
-----------------------------------------------------------------
I see what you mean, but the text extraction is correct, although not what you expected: in the 6.0 version the font information for the glyph {{u}} maps it to {{ui}}, whereas in the 5.2 version {{u}} is correctly mapped to {{u}}. That's why you don't see the extra {{i}} on screen: the glyph is used for the visual representation, whereas the Unicode mapping is used for the text extraction.
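To illustrate the mechanism, here is a minimal sketch in plain Java (no PDFBox involved). The class name {{ToUnicodeSketch}}, the character codes 1 to 3 and both maps are made up for illustration and are not taken from the attached PDFs. The point is only that the same character codes select the same glyphs on screen, while the extracted text depends entirely on the code-to-Unicode map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model: the same character codes feed rendering and text extraction. */
final class ToUnicodeSketch {

    /** Extraction ignores glyph shapes and concatenates the mapped strings. */
    static String extract(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            // Unmapped codes become U+FFFD in this sketch;
            // a real extractor may use other fallbacks.
            sb.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical character codes for the glyphs s, u, r
        int[] codes = {1, 2, 3};

        // Mapping as one would expect it: each glyph maps to its own character
        Map<Integer, String> correct = new LinkedHashMap<>();
        correct.put(1, "s");
        correct.put(2, "u");
        correct.put(3, "r");

        // Same font, but the code for the glyph "u" maps to the string "ui"
        Map<Integer, String> broken = new LinkedHashMap<>(correct);
        broken.put(2, "ui");

        System.out.println(extract(codes, correct)); // prints "sur"
        System.out.println(extract(codes, broken));  // prints "suir"
    }
}
```

Both runs would draw identical glyphs on screen, since the glyph selection does not consult the map at all; only the extracted string changes.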
You can inspect the font information (the text uses {{F1}}) in the PDFDebugger. Compare the following paths for the two PDFs in question:
- Root/Pages/Kids/[0]/Resources/Font/F1
- Root/Pages/Kids/[0]/Resources/Font/F1/ToUnicode
So IMHO it is the PDF conversion in LibreOffice that is broken, not the text extraction in PDFBox.
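If you prefer to check the mapping outside the debugger, a rough sketch along the following lines could list suspicious entries once the decoded content of a {{ToUnicode}} stream has been copied out as text. It assumes the mappings are written as {{beginbfchar}}/{{endbfchar}} sections with destinations in the Basic Multilingual Plane; {{bfrange}} sections are ignored. The class name {{BfCharCheck}}, its helper methods and the sample CMap fragment are hypothetical and not taken from the attached files.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Lists bfchar entries whose destination is longer than one character. */
final class BfCharCheck {

    // One bfchar entry: <source code> <destination as UTF-16BE hex>
    private static final Pattern ENTRY =
            Pattern.compile("<([0-9A-Fa-f]+)>\\s*<([0-9A-Fa-f]+)>");

    /** Collects code -> text for every beginbfchar ... endbfchar section. */
    static Map<Integer, String> parseBfChars(String cmap) {
        Map<Integer, String> result = new TreeMap<>();
        int start = cmap.indexOf("beginbfchar");
        while (start >= 0) {
            int end = cmap.indexOf("endbfchar", start);
            if (end < 0) {
                break; // unterminated section: stop rather than guess
            }
            Matcher m = ENTRY.matcher(cmap.substring(start, end));
            while (m.find()) {
                int code = Integer.parseInt(m.group(1), 16);
                result.put(code, decodeUtf16(m.group(2)));
            }
            start = cmap.indexOf("beginbfchar", end);
        }
        return result;
    }

    /** Decodes a hex string as a sequence of 16-bit UTF-16BE code units. */
    static String decodeUtf16(String hex) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            sb.append((char) Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical fragment: code 02 maps to "ui" (U+0075 U+0069)
        String cmap = "2 beginbfchar\n<01> <0073>\n<02> <00750069>\nendbfchar\n";
        parseBfChars(cmap).forEach((code, text) -> {
            if (text.length() > 1) {
                System.out.printf("code 0x%02X -> \"%s\"%n", code, text);
            }
        });
    }
}
```

Multi-character destinations are legitimate for ligatures such as "fi", so the entries reported here are only candidates to compare between the two PDFs, not proof of an error.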
> LibreOffice6 PDF Conversion broke PDFTextStripper result
> ----------------------------------------------------------
>
> Key: PDFBOX-4284
> URL: https://issues.apache.org/jira/browse/PDFBOX-4284
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 3.0.0 PDFBox
> Environment: Windows 10 and CentOS7
> Reporter: David KELLER
> Priority: Major
> Labels: features
> Attachments: libreoffice_5.2-font.png, libreoffice_5.2.pdf, libreoffice_5.2.txt, libreoffice_6.0-font.png, libreoffice_6.0.pdf, libreoffice_6.0.txt, original-document.docx
>
>
> Here is the test program:
> {code:java}
> import java.io.File;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.text.PDFTextStripper;
>
> public class ExtractTextPdfTest {
>
>     public static void main(String[] args) throws Exception {
>         // #7272
>         // String documentIn = "c:\\data\\test\\libreoffice_5.2.pdf";
>         String documentIn = "c:\\data\\test\\libreoffice_6.0.pdf";
>
>         // Load the PDF and print the text extracted from all pages
>         try (PDDocument pdDocument = PDDocument.load(new File(documentIn))) {
>             PDFTextStripper stripper = new PDFTextStripper();
>             String content = stripper.getText(pdDocument);
>             System.out.println(content);
>         }
>     }
> }
> {code}
>
> 1/ Run PDFTextStripper on a Word document converted to PDF by LibreOffice 5.2
> Result:
> {quote}Réf : #chrono# Le #date#
> Affaire suivie par :
> #recipient.salutation#
> #recipient.name#
> #recipient.streetNumber#
> #recipient.streetName#
> #recipient.zipCode#
> #recipient.locality#
> #object#
> #recipient.salutation#,
> Nous avons bien reçu votre candidature pour le poste de…………………………. et nous vous
> remercions de l’intérêt que vous portez à notre administration.
> Afin d'examiner votre candidature de manière plus complète, nous souhaiterions vous rencontrer.
> Aussi, nous vous proposons un rendez-vous en nos locaux avec M ... , responsable du service de ... , le
> ... à ... heures.
> Nous vous prions d’agréer, #recipient.salutation#, l’expression de nos salutations distinguées.
> Le Maire,
> #signature#
> {quote}
>
> 2/ Run PDFTextStripper on the same Word document converted to PDF by LibreOffice 6.0
>
> Result:
> {quote}Réf : Destinataire
> Affaire suiiiie aar : Adresse
> Code Postal
> Ville
> Paris, le 25/07/2018
> Madame, Moinsieuir
> Nous avons le plaisir de vous informer que suite à la Commission d’Attribution de Logement
> qui s’est tenue le xx/xx/xxxx, nous avons décidé de vous attribuer le logement situé au xx
> rue xxxxxxxxxxxxxxxxxxxx, 75 000 Paris.
> Les caractéristiuies de ce logemeint soint les suiiiaintes :
> Suirface habitable :
> Tyae de logemeint :
> Garage/Parkiing :
> Mointaint dui loyer :
> Mointaint des charges :
> Mointaint dui déaôt de garainte :
> Date d’eintrée dains les lieuix :
> Les s mointaints arécisés soint à déduiire, le cas échéaint, de l'aide aui logemeint (APL, AL) calcuilée et
> commuiiniiuiée aar iotre Caisse d'allocatoins familiales.
> Vouis aiez juisiui’aui xx/xx/xx aouir inouis siginifer l’acceatatoin de ce logemeint aar letre
> recommaindée aiec accuisé de réceatoin.
> Vouis ariaint d’agréer, Madame, Moinsieuir, l’exaressioin de mes saluitatoins distinguiées.
> Le Maire,
> #siginatuire#
> {quote}
>
>