Posted to dev@pdfbox.apache.org by "Brian Carrier (JIRA)" <ji...@apache.org> on 2009/03/31 20:08:50 UTC

[jira] Resolved: (PDFBOX-444) Incorrect Diacritic Merging/Placement

     [ https://issues.apache.org/jira/browse/PDFBOX-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Carrier resolved PDFBOX-444.
----------------------------------

    Resolution: Fixed

Patch applied and checked into trunk.  Updated regression tests also checked in.

Sending        trunk/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java
Sending        trunk/src/main/java/org/apache/pdfbox/util/TextPosition.java
Sending        trunk/test/input/Garcia2004_thesis.pdf-sorted.txt
Sending        trunk/test/input/Garcia2004_thesis.pdf.txt
Sending        trunk/test/input/cweb.pdf-sorted.txt
Sending        trunk/test/input/cweb.pdf.txt
Transmitting file data ......
Committed revision 760554.



> Incorrect Diacritic Merging/Placement
> -------------------------------------
>
>                 Key: PDFBOX-444
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-444
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Justin LeFebvre
>         Attachments: 03_2_SSL-sorted.txt, 03_2_SSL-unsorted.txt, 03_2_SSL.pdf, Diacritic_fix.diff
>
>
> While looking at the spacing issue in PDFBOX-77, I found a separate issue with the placement of diacritic characters in the file 03_2_SSL.pdf, which I have attached here.
> The issue is that separate TextPositions are used to render a character and its diacritic. For example, in the word
> And¨ erung, the diacritic should appear over the A and not after the d. This output occurs when the -sort option is enabled. Without -sort, the word is extracted as
> ¨Anderung. This is still not correct, because the A and the diacritic should be merged so that they take up one character's width of space. This occurs throughout the document.
> Currently, PDFBox does handle merging of diacritic characters, but it assumes that the TextPosition for the diacritic comes after the TextPosition it is supposed to be merged with. In this file,
> the diacritic TextPosition comes beforehand.
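
To illustrate the ordering problem described in the report, here is a minimal sketch of merging a diacritic with its base character whether the diacritic arrives before or after it. This is not the applied patch and does not use the PDFBox API: Glyph, DiacriticMergeSketch, combiningForm and merge are hypothetical names, the horizontal-overlap test is an assumed heuristic, and only a handful of spacing diacritics are mapped. It covers the unsorted stream order only, not the effect of -sort.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a positioned glyph; this is NOT PDFBox's TextPosition.
final class Glyph {
    final String text;
    final float x;      // left edge of the glyph
    final float width;  // horizontal extent of the glyph

    Glyph(String text, float x, float width) {
        this.text = text;
        this.x = x;
        this.width = width;
    }

    // Assumed heuristic: two glyphs belong together if their x-ranges intersect.
    boolean overlaps(Glyph other) {
        return x < other.x + other.width && other.x < x + width;
    }
}

final class DiacriticMergeSketch {
    // Maps a spacing diacritic to its combining form, or returns null for other text.
    static String combiningForm(String s) {
        switch (s) {
            case "\u00A8": return "\u0308"; // diaeresis
            case "\u00B4": return "\u0301"; // acute accent
            case "\u0060": return "\u0300"; // grave accent
            default: return null;
        }
    }

    // Merges diacritics with an overlapping base character in stream order.
    static String merge(List<Glyph> glyphs) {
        StringBuilder out = new StringBuilder();
        Glyph lastBase = null;
        List<Glyph> pending = new ArrayList<>(); // diacritics seen before their base
        for (Glyph g : glyphs) {
            String mark = combiningForm(g.text);
            if (mark == null) {
                // Base character: emit unrelated pending diacritics unchanged first.
                for (Glyph p : pending) {
                    if (!p.overlaps(g)) out.append(p.text);
                }
                out.append(g.text);
                // Attach the earlier diacritics that sit over this base.
                for (Glyph p : pending) {
                    if (p.overlaps(g)) out.append(combiningForm(p.text));
                }
                pending.clear();
                lastBase = g;
            } else if (lastBase != null && g.overlaps(lastBase)) {
                out.append(mark); // diacritic follows its base
            } else {
                pending.add(g);   // diacritic precedes its base (the case reported here)
            }
        }
        for (Glyph p : pending) out.append(p.text); // no base found: keep as-is
        // NFC composes base + combining mark into one code point where one exists.
        return Normalizer.normalize(out.toString(), Normalizer.Form.NFC);
    }
}
```

With a diaeresis glyph at x=0 followed by an A at x=0, merge yields a single precomposed character, so the two glyphs occupy one character's width in the extracted text. A diaeresis that overlaps no base is passed through unchanged.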
