Posted to dev@pdfbox.apache.org by "Justin LeFebvre (JIRA)" <ji...@apache.org> on 2009/03/19 18:44:50 UTC
[jira] Created: (PDFBOX-444) Incorrect Diacritic Merging/Placement
Incorrect Diacritic Merging/Placement
-------------------------------------
Key: PDFBOX-444
URL: https://issues.apache.org/jira/browse/PDFBOX-444
Project: PDFBox
Issue Type: Bug
Reporter: Justin LeFebvre
Attachments: 03_2_SSL-sorted.txt, 03_2_SSL-unsorted.txt, 03_2_SSL.pdf
While looking at the spacing issue in PDFBOX-77, I found a separate issue with the placement of diacritic characters in the file 03_2_SSL.pdf, which I have attached here.
The issue is that separate TextPositions are used to render a character and its diacritic. For example, the word
And¨ erung should have its diacritic over the A character and not after the d. This happens when the -sort option is enabled. Without that option, the extracted word looks like this:
¨Anderung. This is still not correct, because the A and the diacritic should be merged to take up one character's width of space. This occurs throughout the document.
Currently, PDFBox does merge diacritic characters, but it assumes that the TextPosition for the diacritic occurs after the TextPosition it is supposed to be merged with, whereas in this file
the diacritic TextPosition comes beforehand.
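To make the ordering problem concrete, here is a minimal sketch of order-independent merging. It is not the attached patch and does not use PDFBox's TextPosition API: the Glyph class, the toCombining table, and the overlap test are hypothetical stand-ins chosen for illustration, and the only library call relied on is java.text.Normalizer from the JDK.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.List;

final class DiacriticMergeSketch {

    /** Hypothetical positioned glyph; a stand-in, not PDFBox's TextPosition. */
    static final class Glyph {
        final String text;
        final float x;      // left edge of the glyph
        final float width;  // horizontal extent of the glyph

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /** Returns the combining form of a spacing diacritic, or 0 if it is not one. */
    static char toCombining(String s) {
        if (s.length() != 1) {
            return 0;
        }
        switch (s.charAt(0)) {
            case '\u00A8': return '\u0308'; // diaeresis
            case '\u00B4': return '\u0301'; // acute accent
            case '\u0060': return '\u0300'; // grave accent
            case '\u02C6': return '\u0302'; // circumflex
            default:       return 0;
        }
    }

    /** True if the horizontal extents of the two glyphs overlap. */
    static boolean overlaps(Glyph a, Glyph b) {
        return a.x < b.x + b.width && b.x < a.x + a.width;
    }

    /** Merges each diacritic with its base glyph, whichever of the two comes first. */
    static String merge(List<Glyph> glyphs) {
        StringBuilder out = new StringBuilder();
        Glyph pending = null;   // diacritic seen before any base that could claim it
        Glyph lastBase = null;  // most recent base glyph already written to out
        for (Glyph g : glyphs) {
            char combining = toCombining(g.text);
            if (combining != 0) {
                if (lastBase != null && overlaps(g, lastBase)) {
                    out.append(combining);         // diacritic after its base
                } else {
                    if (pending != null) {
                        out.append(pending.text);  // earlier diacritic was never claimed
                        lastBase = null;
                    }
                    pending = g;
                }
            } else {
                if (pending != null && overlaps(pending, g)) {
                    // Diacritic before its base: write the base first, then the mark.
                    out.append(g.text).append(toCombining(pending.text));
                } else {
                    if (pending != null) {
                        out.append(pending.text);  // keep the stray diacritic as-is
                    }
                    out.append(g.text);
                }
                pending = null;
                lastBase = g;
            }
        }
        if (pending != null) {
            out.append(pending.text);
        }
        // Compose base + combining mark into a single code point where one exists.
        return Normalizer.normalize(out, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Diacritic first, then the base glyph it sits over, as in the reported file.
        List<Glyph> glyphs = Arrays.asList(
            new Glyph("\u00A8", 11f, 4f),
            new Glyph("A", 10f, 7f),
            new Glyph("n", 17f, 5f));
        System.out.println(merge(glyphs));
    }
}
```

With the input in main, merge returns the composed Ä followed by n, so the diacritic no longer occupies a position of its own. A real fix would also need a vertical check and a tolerance on the overlap test; both are left out here to keep the sketch short.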
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PDFBOX-444) Incorrect Diacritic Merging/Placement
Posted by "Brian Carrier (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Brian Carrier resolved PDFBOX-444.
----------------------------------
Resolution: Fixed
Patch applied and checked into trunk. Updated regression tests also checked in.
Sending trunk/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java
Sending trunk/src/main/java/org/apache/pdfbox/util/TextPosition.java
Sending trunk/test/input/Garcia2004_thesis.pdf-sorted.txt
Sending trunk/test/input/Garcia2004_thesis.pdf.txt
Sending trunk/test/input/cweb.pdf-sorted.txt
Sending trunk/test/input/cweb.pdf.txt
Transmitting file data ......
Committed revision 760554.
[jira] Updated: (PDFBOX-444) Incorrect Diacritic Merging/Placement
Posted by "Justin LeFebvre (JIRA)" <ji...@apache.org>.
Justin LeFebvre updated PDFBOX-444:
-----------------------------------
Attachment: 03_2_SSL-unsorted.txt, 03_2_SSL-sorted.txt, 03_2_SSL.pdf
[jira] Updated: (PDFBOX-444) Incorrect Diacritic Merging/Placement
Posted by "Justin LeFebvre (JIRA)" <ji...@apache.org>.
Justin LeFebvre updated PDFBOX-444:
-----------------------------------
Attachment: Diacritic_fix.diff
The attached diff file, Diacritic_fix.diff, contains the code changes that fix this issue. Note: this fix causes the regression tests to fail. However, Brian and I have reviewed the log file and confirmed that the failing lines are actually improvements over the previous output.
[jira] Updated: (PDFBOX-444) Incorrect Diacritic Merging/Placement
Posted by "Justin LeFebvre (JIRA)" <ji...@apache.org>.
Justin LeFebvre updated PDFBOX-444:
-----------------------------------
Component/s: Text extraction