Posted to dev@pdfbox.apache.org by "Justin LeFebvre (JIRA)" <ji...@apache.org> on 2009/03/19 18:44:50 UTC

[jira] Created: (PDFBOX-444) Incorrect Diacritic Merging/Placement

Incorrect Diacritic Merging/Placement
-------------------------------------

                 Key: PDFBOX-444
                 URL: https://issues.apache.org/jira/browse/PDFBOX-444
             Project: PDFBox
          Issue Type: Bug
            Reporter: Justin LeFebvre
         Attachments: 03_2_SSL-sorted.txt, 03_2_SSL-unsorted.txt, 03_2_SSL.pdf

While looking into the spacing issue reported in PDFBOX-77, I found a separate problem with the placement of diacritic characters in the attached file 03_2_SSL.pdf.
The problem is that the PDF uses separate TextPositions to render a character and its diacritic. For example, with the -sort option enabled, the word "Änderung" is extracted as

And¨ erung

where the diacritic should sit over the A, not after the d. Without -sort, the extracted word looks like this:

¨Anderung

This is still not correct, because the A and the diacritic should be merged so that together they take up one character's width. This occurs throughout the document.

PDFBox already merges diacritic characters, but it assumes that the TextPosition for the diacritic comes after the TextPosition it should be merged with. In this file, the diacritic's TextPosition comes first.
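
To illustrate the ordering problem described above, here is a minimal, self-contained Java sketch. It is not the code from any PDFBox patch: the Glyph class, the toCombining and overlaps helpers, and the merge method are hypothetical names invented for this example, and the horizontal-overlap test is an assumed heuristic. The sketch attaches a spacing diacritic to the base character it overlaps, whether the diacritic arrives before or after that base, and then uses java.text.Normalizer to compose the pair into a single character.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: Glyph and merge() are hypothetical names,
// not PDFBox classes and not the contents of any attached patch.
final class DiacriticMergeSketch {

    /** Hypothetical stand-in for a TextPosition: text, left edge and width. */
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /** Returns the combining form of a spacing diacritic, or null for other text. */
    static String toCombining(String text) {
        if (text.length() != 1) {
            return null;
        }
        switch (text.charAt(0)) {
            case '\u00A8': return "\u0308"; // diaeresis
            case '\u00B4': return "\u0301"; // acute accent
            default: return null;
        }
    }

    /** True if the two glyphs share some horizontal extent (assumed heuristic). */
    static boolean overlaps(Glyph a, Glyph b) {
        return a.x < b.x + b.width && b.x < a.x + a.width;
    }

    /** Merges diacritics with the base glyph they overlap, in either order. */
    static String merge(List<Glyph> glyphs) {
        StringBuilder out = new StringBuilder();
        Glyph lastBase = null; // most recent non-diacritic glyph written to out
        Glyph pending = null;  // diacritic still waiting for the base that follows it
        for (Glyph g : glyphs) {
            String mark = toCombining(g.text);
            if (mark != null) {
                if (lastBase != null && overlaps(lastBase, g)) {
                    out.append(mark); // diacritic came after its base
                } else {
                    if (pending != null) {
                        out.append(pending.text); // earlier diacritic never found a base
                    }
                    pending = g; // diacritic came first: wait for the next base
                }
                continue;
            }
            out.append(g.text);
            if (pending != null) {
                if (overlaps(pending, g)) {
                    out.append(toCombining(pending.text)); // attach after the base
                } else {
                    // No overlap: keep the spacing diacritic in front, unmerged.
                    out.insert(out.length() - g.text.length(), pending.text);
                }
                pending = null;
            }
            lastBase = g;
        }
        if (pending != null) {
            out.append(pending.text);
        }
        // NFC composes base + combining mark into one code point where one exists.
        return Normalizer.normalize(out, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
            new Glyph("\u00A8", 10f, 4f), // diaeresis arrives before its base
            new Glyph("A", 9f, 7f),
            new Glyph("n", 16f, 5f));
        System.out.println(merge(glyphs));
    }
}
```

The sketch converts the spacing diacritic to its combining form and appends it after the base, because Unicode places combining marks after the character they modify. Real extraction code would also need to compare vertical positions and handle more diacritics than the two listed here.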

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (PDFBOX-444) Incorrect Diacritic Merging/Placement

Posted by "Brian Carrier (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Carrier resolved PDFBOX-444.
----------------------------------

    Resolution: Fixed

Patch applied and checked into trunk.  Updated regression tests also checked in.

Sending        trunk/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java
Sending        trunk/src/main/java/org/apache/pdfbox/util/TextPosition.java
Sending        trunk/test/input/Garcia2004_thesis.pdf-sorted.txt
Sending        trunk/test/input/Garcia2004_thesis.pdf.txt
Sending        trunk/test/input/cweb.pdf-sorted.txt
Sending        trunk/test/input/cweb.pdf.txt
Transmitting file data ......
Committed revision 760554.






[jira] Updated: (PDFBOX-444) Incorrect Diacritic Merging/Placement

Posted by "Justin LeFebvre (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin LeFebvre updated PDFBOX-444:
-----------------------------------

    Attachment: 03_2_SSL-unsorted.txt
                03_2_SSL-sorted.txt
                03_2_SSL.pdf




[jira] Updated: (PDFBOX-444) Incorrect Diacritic Merging/Placement

Posted by "Justin LeFebvre (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin LeFebvre updated PDFBOX-444:
-----------------------------------

    Attachment: Diacritic_fix.diff

The attached diff file, Diacritic_fix.diff, contains the code changes that fix this issue. Note: this fix causes the regression tests to fail; however, Brian and I have reviewed the log file and confirmed that the failing lines are actually improvements over the previous output.




[jira] Updated: (PDFBOX-444) Incorrect Diacritic Merging/Placement

Posted by "Justin LeFebvre (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin LeFebvre updated PDFBOX-444:
-----------------------------------

    Component/s: Text extraction

