Posted to dev@pdfbox.apache.org by "Kenneth Glidden (JIRA)" <ji...@apache.org> on 2009/03/05 22:39:56 UTC

[jira] Updated: (PDFBOX-430) Incorrect diacritic placement in text extraction

     [ https://issues.apache.org/jira/browse/PDFBOX-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kenneth Glidden updated PDFBOX-430:
-----------------------------------

    Attachment: pdfbox-430-diffs.txt
                TextPosition.java
                PDFTextStripper.java

For the record, and to establish authorship and transfer of rights to the ASF, I've attached TextPosition.java and PDFTextStripper.java, which contain the changes that Brian checked in for me on 18/Feb/09.  I wrote the algorithms therein, and Brian graciously reviewed and refactored the code.

I also attached pdfbox-430-diffs.txt, which records the diffs.

> Incorrect diacritic placement in text extraction
> ------------------------------------------------
>
>                 Key: PDFBOX-430
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-430
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Brian Carrier
>         Attachments: pdfbox-430-diffs.txt, PDFTextStripper.java, TextPosition.java
>
>
> Some PDF files store diacritics (accents over characters) as separate text elements. Such a file essentially writes a chunk of text, then backs up and places the diacritic over one of the characters in that chunk. During text extraction, the current design cannot attach the diacritic to a character inside the chunk, so the diacritic is emitted after the chunk instead.
> The debug-diac2.pdf file in PDFBOX-429 shows this problem. 
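
To illustrate the behavior described in the issue, here is a minimal sketch of one way a separately drawn diacritic could be attached to the character it sits over. It is not the code in the attached TextPosition.java or PDFTextStripper.java. The Glyph class, its fields, and the "center of the accent falls within the base character" rule are all assumptions made for this example, and the sketch assumes the accent arrives as a Unicode combining mark (such as U+0301) rather than a spacing accent.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

class DiacriticMergeSketch {

    /** Hypothetical positioned glyph; NOT the PDFBox TextPosition class. */
    static final class Glyph {
        final String text;
        final float x;      // left edge on the page
        final float width;  // horizontal advance

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }

        /** Assumption: diacritics arrive as single combining marks. */
        boolean isDiacritic() {
            return text.length() == 1
                    && Character.getType(text.charAt(0)) == Character.NON_SPACING_MARK;
        }
    }

    /** Returns the index of the most recent base glyph under the accent, or -1. */
    static int findOverlapped(List<Glyph> bases, Glyph accent) {
        float center = accent.x + accent.width / 2;
        for (int i = bases.size() - 1; i >= 0; i--) {
            Glyph b = bases.get(i);
            if (center >= b.x && center <= b.x + b.width) {
                return i;
            }
        }
        return -1;
    }

    /** Builds text in stream order, attaching each accent to the glyph it overlaps. */
    static String extract(List<Glyph> glyphs) {
        List<Glyph> bases = new ArrayList<>();
        List<StringBuilder> texts = new ArrayList<>();
        for (Glyph g : glyphs) {
            int target = g.isDiacritic() ? findOverlapped(bases, g) : -1;
            if (target >= 0) {
                // Append the combining mark directly after its base character.
                texts.get(target).append(g.text);
            } else {
                // Ordinary glyph, or an accent with nothing under it: keep stream order.
                bases.add(g);
                texts.add(new StringBuilder(g.text));
            }
        }
        StringBuilder out = new StringBuilder();
        for (StringBuilder t : texts) {
            out.append(t);
        }
        // Compose base + combining mark into a single code point where one exists.
        return Normalizer.normalize(out, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("c", 0f, 5f));
        glyphs.add(new Glyph("a", 5f, 5f));
        glyphs.add(new Glyph("f", 10f, 5f));
        glyphs.add(new Glyph("e", 15f, 5f));
        // The accent is drawn after the chunk, but positioned back over the "e".
        glyphs.add(new Glyph("\u0301", 16f, 3f));
        System.out.println(extract(glyphs).equals("caf\u00e9")); // prints true
    }
}
```

Without the overlap check, the accent in the example would simply be appended after the whole chunk, which is the symptom this issue reports. A real implementation would also have to consider vertical position, rotated text, and spacing accent characters, none of which this sketch handles.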
