
Posted to dev@pdfbox.apache.org by "Justin LeFebvre (JIRA)" <ji...@apache.org> on 2009/03/17 15:22:50 UTC

[jira] Created: (PDFBOX-439) Incorrect text for Exolab.pdf in Regression Test

Incorrect text for Exolab.pdf in Regression Test
------------------------------------------------

                 Key: PDFBOX-439
                 URL: https://issues.apache.org/jira/browse/PDFBOX-439
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
            Reporter: Justin LeFebvre


When looking through text for an unrelated issue, I noticed that the file Exolab.pdf in the regression test produced the following line,

JAJAVVAA CODINING STANDAG STANDARD.......................................................................................................................1
when the line should say,
JAVA CODING STANDARD .......................................................................................................................1

Also this line,

5 COD5 CODE EXAMPLMPLES................................S ...................................................................................................................................26
should be
5 CODE EXAMPLES...................................................................................................................................26
However, Adobe has trouble with this one as well. 

These two issues only occurred when the file was run with the -sort option enabled. 

However, in both the unsorted and sorted tests, this line was improperly handled:

APPENDIX A : DOCUMENT HISTORYT HISTORYT HISTORY...................................................................................................33 
should produce
APPENDIX A : DOCUMENT HISTORY ...................................................................................................33

I ran into this while testing with the current trunk. 

The Exolab.pdf file is located in the ..\source\trunk\test\input folder 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PDFBOX-439) Incorrect text for Exolab.pdf in Regression Test

Posted by "Brian Carrier (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722610#action_12722610 ] 

Brian Carrier commented on PDFBOX-439:
--------------------------------------

A few more details on what we tried:
- Our goal was to detect the overlaps based on text coordinates and use logic similar to how we are currently detecting and merging in diacritics (see PDFBOX-444). 
- There is an open question about how we efficiently search through existing TextPositions to find the overlap, because we are not storing them in sorted order.  We initially took a basic approach of comparing each new TextPosition with all existing TextPositions, and this caused the regression tests to take 4 times as long.  Storing them in sorted order would make the search more efficient, but there has been a desire to preserve the non-sorted order of the text chunks.
- In general, the merging approach worked, except that we found some files in the regression tests that had character widths of 0 and others with very large widths. The zero widths occur because the character width is not currently being calculated in processEncodedText() for rotated text; we could not find the source of the very large widths.
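
To make the trade-off in the points above concrete, here is a rough sketch of one way the lookup could avoid a full scan while still keeping the chunks in arrival order: a side index of grid buckets next to the ordered list. This is not what we implemented and it is not PDFBox code. The Glyph class, the CELL size, and the 0.3 tolerance factor are all assumptions made up for the illustration. Zero widths are skipped and the tolerance is capped, since those were the two width problems we hit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OverlapIndexSketch {

    /** Hypothetical single-character box; NOT the PDFBox TextPosition class. */
    static final class Glyph {
        final String ch;
        final float x;
        final float y;
        final float width;

        Glyph(String ch, float x, float y, float width) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    // Assumed bucket size in page units; a real value would need tuning.
    private static final float CELL = 20f;

    // Arrival order is preserved here and used for output.
    private final List<Glyph> inOrder = new ArrayList<>();
    // Side index used only for lookups, keyed by grid cell.
    private final Map<Long, List<Glyph>> buckets = new HashMap<>();

    private static long cell(float v) {
        return (long) Math.floor(v / CELL);
    }

    private static long key(long cx, long cy) {
        return (cx << 32) ^ (cy & 0xffffffffL);
    }

    /** True when the same character already sits at roughly the same position. */
    private boolean overlapsExisting(Glyph g) {
        if (g.width <= 0f) {
            return false; // zero width (e.g. rotated text): no reliable tolerance
        }
        // Cap the tolerance so a very large width cannot reach past neighbouring cells.
        float tol = Math.min(g.width * 0.3f, CELL);
        long cx = cell(g.x);
        long cy = cell(g.y);
        for (long dx = -1; dx <= 1; dx++) {
            for (long dy = -1; dy <= 1; dy++) {
                List<Glyph> bucket = buckets.get(key(cx + dx, cy + dy));
                if (bucket == null) {
                    continue;
                }
                for (Glyph e : bucket) {
                    if (e.ch.equals(g.ch)
                            && Math.abs(e.x - g.x) <= tol
                            && Math.abs(e.y - g.y) <= tol) {
                        return true;
                    }
                }
            }
        }
        return false;
    }

    /** Adds the glyph unless it overlaps an identical one; returns whether it was kept. */
    boolean add(Glyph g) {
        if (overlapsExisting(g)) {
            return false;
        }
        inOrder.add(g);
        buckets.computeIfAbsent(key(cell(g.x), cell(g.y)), k -> new ArrayList<>()).add(g);
        return true;
    }

    /** Concatenates the kept glyphs in arrival order. */
    String text() {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : inOrder) {
            sb.append(g.ch);
        }
        return sb.toString();
    }
}
```

Each new glyph is compared only against glyphs in its own and the eight neighbouring cells, so the cost per lookup depends on local density rather than on the total number of chunks on the page.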




[jira] Commented: (PDFBOX-439) Incorrect text for Exolab.pdf in Regression Test

Posted by "Justin LeFebvre (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721893#action_12721893 ] 

Justin LeFebvre commented on PDFBOX-439:
----------------------------------------

Apparently, this issue occurs because of the way this file produces bold text. It seems to draw many of the same characters on top of each other with very slight differences in position in order to achieve the visual effect. However, since we are just analyzing the TJ positions, we end up printing these unnecessary characters when extracting the text. There is currently code in PDFTextStripper:processTextPosition(TP) that is designed to remove duplicate text; however, it only removes TextPositions that both overlap and have the same string value. This is problematic when dealing with this file because, for some reason, it has some TJ positions that are, for example, "JAV" and "VA" where the Vs overlap. In the current suppress-duplicates code the overlap is never found, and both the "JAV" and the "VA" are written to the output. 
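
To illustrate the limitation, here is a small standalone sketch of a duplicate check that requires both the same string and nearly the same position. It is a simplified model of the behaviour described above, not the actual PDFTextStripper code. The Chunk class, the tolerance value, and the coordinates are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class DuplicateSuppressionSketch {

    /** Hypothetical stand-in for a text chunk: its string and the position where it starts. */
    static final class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Drops a chunk only when an earlier kept chunk has the SAME string
     * and starts within `tolerance` of it in both x and y.
     */
    static List<Chunk> suppressExactDuplicates(List<Chunk> chunks, float tolerance) {
        List<Chunk> kept = new ArrayList<>();
        for (Chunk candidate : chunks) {
            boolean duplicate = false;
            for (Chunk earlier : kept) {
                if (earlier.text.equals(candidate.text)
                        && Math.abs(earlier.x - candidate.x) < tolerance
                        && Math.abs(earlier.y - candidate.y) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    /** Concatenates the chunk strings in list order. */
    static String join(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            sb.append(c.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same string drawn twice with a slight offset: the second copy is removed.
        List<Chunk> exact = new ArrayList<>();
        exact.add(new Chunk("JAVA", 10.0f, 100f));
        exact.add(new Chunk("JAVA", 10.2f, 100f));
        System.out.println(join(suppressExactDuplicates(exact, 1.0f)));

        // "JAV" and "VA" share an overlapping V, but the strings differ,
        // so neither chunk is recognised as a duplicate and the V appears twice.
        List<Chunk> partial = new ArrayList<>();
        partial.add(new Chunk("JAV", 10.0f, 100f));
        partial.add(new Chunk("VA", 30.2f, 100f));
        System.out.println(join(suppressExactDuplicates(partial, 1.0f)));
    }
}
```

The second case is the one this file triggers: catching it would need a comparison per character rather than per chunk, which is where the width problems mentioned below come in.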

Both Brian and I worked on trying to fix that situation by looking for any sort of overlapping characters on the page, but this turned out to be a bigger hassle than expected due to problems with rotated text and the lack of reported individual character widths for some files. 
