Posted to dev@pdfbox.apache.org by "Justin LeFebvre (JIRA)" <ji...@apache.org> on 2009/06/19 19:32:07 UTC
[jira] Commented: (PDFBOX-439) Incorrect text for Exolab.pdf in Regression Test

    [ https://issues.apache.org/jira/browse/PDFBOX-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721893#action_12721893 ] 

Justin LeFebvre commented on PDFBOX-439:
----------------------------------------

This issue appears to stem from the way the file produces bold text: it draws the same characters several times, overlapping with very slight differences in position, to achieve the visual effect. Because we only analyze the TJ positions, these redundant characters end up in the extracted text. There is already code in PDFTextStripper:processTextPosition(TP) designed to remove duplicate text, but it only removes TextPositions that both overlap and have the same string value. That is a problem for this file because, for some reason, it contains TJ positions such as "JAV" and "VA" where the Vs overlap. The current suppress-duplicates code never detects that overlap, so both "JAV" and "VA" are written to the output.
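
To make the limitation concrete, here is a minimal sketch of duplicate suppression keyed on the whole string. It is a simplified model and not the PDFBox source: the Run class, the extract method and the fixed tolerance are all hypothetical stand-ins for TextPosition and the real suppression logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SameStringSuppressionSketch {

    /** Hypothetical stand-in for a text run; not PDFBox's TextPosition. */
    static final class Run {
        final String text;
        final double x;
        final double y;

        Run(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Drops a run only if an earlier run with the SAME text starts nearby. */
    static String extract(List<Run> runs, double tolerance) {
        Map<String, List<Run>> seenByText = new HashMap<>();
        StringBuilder out = new StringBuilder();
        for (Run r : runs) {
            List<Run> sameText = seenByText.computeIfAbsent(r.text, k -> new ArrayList<>());
            boolean duplicate = false;
            for (Run p : sameText) {
                if (Math.abs(p.x - r.x) < tolerance && Math.abs(p.y - r.y) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                sameText.add(r);
                out.append(r.text);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Identical strings drawn twice with a tiny offset: the copy is removed.
        System.out.println(extract(Arrays.asList(
                new Run("JAVA", 0.0, 0.0), new Run("JAVA", 0.2, 0.0)), 1.0)); // JAVA
        // "JAV" and "VA" share an overlapping V but differ as strings: both survive.
        System.out.println(extract(Arrays.asList(
                new Run("JAV", 0.0, 0.0), new Run("VA", 20.0, 0.0)), 1.0)); // JAVVA
    }
}
```

Because the lookup is by string value, "JAV" and "VA" are never compared with each other, which matches the behaviour described above.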

Both Brian and I tried to fix this by looking for any sort of overlapping characters on the page, but it turned out to be a bigger hassle than expected due to problems with rotated text and the lack of reported individual character widths for some files.
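
For illustration only, a character-level overlap check could look roughly like the sketch below. It deliberately assumes away the two problems mentioned above: text is unrotated and every glyph has the same known width. The Run and Glyph classes, the glyphWidth parameter and the tolerance are hypothetical and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class GlyphOverlapSketch {

    /** Hypothetical text run: a string and the position of its first glyph. */
    static final class Run {
        final String text;
        final double x;
        final double y;

        Run(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Hypothetical single glyph with its own position. */
    static final class Glyph {
        final char c;
        final double x;
        final double y;

        Glyph(char c, double x, double y) {
            this.c = c;
            this.x = x;
            this.y = y;
        }
    }

    /** True if an already kept glyph has the same character at nearly the same spot. */
    static boolean overlapsKept(List<Glyph> kept, Glyph g, double tolerance) {
        for (Glyph k : kept) {
            if (k.c == g.c
                    && Math.abs(k.x - g.x) < tolerance
                    && Math.abs(k.y - g.y) < tolerance) {
                return true;
            }
        }
        return false;
    }

    /** Splits runs into glyphs (assumed uniform width) and drops overlapping repeats. */
    static String extract(List<Run> runs, double glyphWidth, double tolerance) {
        List<Glyph> kept = new ArrayList<>();
        StringBuilder out = new StringBuilder();
        for (Run r : runs) {
            for (int i = 0; i < r.text.length(); i++) {
                Glyph g = new Glyph(r.text.charAt(i), r.x + i * glyphWidth, r.y);
                if (!overlapsKept(kept, g, tolerance)) {
                    kept.add(g);
                    out.append(g.c);
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The V of "VA" lands on the V of "JAV" and is dropped; the A is kept.
        System.out.println(extract(Arrays.asList(
                new Run("JAV", 0.0, 0.0), new Run("VA", 20.3, 0.0)), 10.0, 1.0)); // JAVA
    }
}
```

Once rotation or missing per-character widths come into play, the glyph positions computed here are no longer reliable, which is where the approach became difficult.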

> Incorrect text for Exolab.pdf in Regression Test
> ------------------------------------------------
>
>                 Key: PDFBOX-439
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-439
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Justin LeFebvre
>
> When looking through text for an unrelated issue, I noticed that the file Exolab.pdf in the regression test produced the following line,
> JAJAVVAA CODINING STANDAG STANDARD.......................................................................................................................1
> when the line should say,
> JAVA CODING STANDARD .......................................................................................................................1
> Also this line,
> 5 COD5 CODE EXAMPLMPLES................................S ...................................................................................................................................26
> should be
> 5 CODE EXAMPLES...................................................................................................................................26
> However, Adobe has trouble with this one as well. 
> These two issues only occurred when the file was run with the -sort option enabled. 
> However, in both the unsorted and sorted tests this line was improperly handled:
> APPENDIX A : DOCUMENT HISTORYT HISTORYT HISTORY...................................................................................................33 
> should produce
> APPENDIX A : DOCUMENT HISTORY ...................................................................................................33
> I ran into this test using the current trunk. 
> The Exolab.pdf file is located in the ..\source\trunk\test\input folder 
