Posted to dev@pdfbox.apache.org by "Brian Carrier (JIRA)" <ji...@apache.org> on 2008/11/17 22:18:44 UTC
[jira] Commented: (PDFBOX-374) text areas not properly being sorted because of page rotation

    [ https://issues.apache.org/jira/browse/PDFBOX-374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648342#action_12648342 ] 

Brian Carrier commented on PDFBOX-374:
--------------------------------------

After reviewing the patches in PDFBOX-363 and finding more examples that the previous patch in this entry did not fix, I have attached a new patch. Note: the landscape_rot90.pdf file that was later attached to PDFBOX-363 is one such example; the previous patch does not handle it, but this patch does.

This patch moves all knowledge about page rotation and text direction into the TextPosition class. The text matrix is now relied on instead of the page rotation value. New APIs were added so that callers can get text-direction-adjusted coordinates. The behavior of the original APIs is unchanged for other parts of PDFBox. Other code was adjusted accordingly. I also did some cleanup in PDFStreamEngine and PDFTextStripper to remove unused variables and to rename some variables so that their contents are easier to understand.
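
To illustrate the idea of deriving text direction from the text matrix rather than from the page rotation value, here is a minimal sketch. It is not the code from the attached patch: the class and method names (TextDirectionSketch, direction, adjust) are hypothetical, and it assumes the usual PDF convention of a text matrix [a b; c d] with translation (tx, ty), an origin in the lower left, and text directions that are multiples of 90 degrees.

```java
/** Illustrative sketch only; names and layout are assumptions, not the patch's API. */
class TextDirectionSketch {

    /**
     * Derives the text direction in degrees (0, 90, 180 or 270) from the
     * first row (a, b) of the text matrix, snapped to the nearest right angle.
     */
    static int direction(float a, float b) {
        double degrees = Math.toDegrees(Math.atan2(b, a));
        long snapped = Math.round(degrees / 90.0) * 90;
        // Normalize into the range 0..359 (atan2 can yield negative angles).
        return (int) (((snapped % 360) + 360) % 360);
    }

    /**
     * Maps the matrix translation (tx, ty), given in page space with the
     * origin in the lower left, to coordinates as seen when reading the text:
     * result[0] grows along the reading direction and result[1] grows
     * downward from the top of the text's own orientation.
     */
    static float[] adjust(int dir, float tx, float ty, float pageWidth, float pageHeight) {
        switch (dir) {
            case 0:
                return new float[] { tx, pageHeight - ty };
            case 90:
                return new float[] { ty, tx };
            case 180:
                return new float[] { pageWidth - tx, ty };
            case 270:
                return new float[] { pageHeight - ty, pageWidth - tx };
            default:
                throw new IllegalArgumentException("direction must be 0, 90, 180 or 270: " + dir);
        }
    }

    public static void main(String[] args) {
        // Text matrix [0 1; -1 0]: the text advances up the page.
        int dir = direction(0f, 1f);
        float[] p = adjust(dir, 100f, 200f, 612f, 792f);
        System.out.println(dir + " -> x=" + p[0] + ", y=" + p[1]);
    }
}
```

Because the direction comes from the matrix itself, a caller that sorts on the adjusted coordinates does not need to know the page rotation, so there is no second place where rotation could be applied again.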

Some regression tests now fail, but in most cases the new output is better:
- The two mismatches in "hexnumberproblem.pdf" are because the new code produces better output.
- The mismatches in ocalc.pdf are all because the new code produces better output. 
- The mismatches in test_rotate_270.pdf are because the new code puts "t" on its own line, which causes every line after it to mismatch. The previous version of the code produced better results in this case, but it is not clear how. The text is at an angle relative to the other text, and the height of "t" is such that it is equivalent to being on another line of text. I tried to adjust the code so that it was more liberal with making new lines, but that caused many other failures in the regression tests.

Note that the regression tests do not currently sort the text based on location, so the page rotation issues are not tested.  New regression tests must be created. 
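
As a rough idea of what a location-sorted comparison could look like in a new regression test, here is a small sketch that orders text pieces top to bottom and then left to right. Everything in it is an assumption made for illustration: the Glyph class, its xDirAdj and yDirAdj fields, and the inReadingOrder helper are hypothetical stand-ins, not PDFBox classes, and the comparison is exact rather than tolerant of small baseline differences.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Illustrative sketch only; Glyph is a hypothetical stand-in, not a PDFBox type. */
class PositionSortSketch {

    /** A piece of text with coordinates already adjusted for text direction. */
    static final class Glyph {
        final String text;
        final float xDirAdj; // distance along the reading direction
        final float yDirAdj; // distance down from the top, in reading orientation

        Glyph(String text, float xDirAdj, float yDirAdj) {
            this.text = text;
            this.xDirAdj = xDirAdj;
            this.yDirAdj = yDirAdj;
        }
    }

    /** Top-to-bottom first, then left-to-right within equal y values. */
    static final Comparator<Glyph> READING_ORDER =
            Comparator.<Glyph>comparingDouble(g -> g.yDirAdj)
                      .thenComparingDouble(g -> g.xDirAdj);

    /** Returns the glyph texts concatenated in reading order. */
    static String inReadingOrder(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs); // leave the input untouched
        sorted.sort(READING_ORDER);
        StringBuilder out = new StringBuilder();
        for (Glyph g : sorted) {
            out.append(g.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("b", 20f, 10f),
                new Glyph("c", 10f, 30f),
                new Glyph("a", 10f, 10f));
        System.out.println(inReadingOrder(glyphs));
    }
}
```

A real test would probably need some tolerance when deciding that two pieces of text share a line; the exact comparison here keeps the comparator consistent but treats any difference in y as a different line.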


> text areas not properly being sorted because of page rotation
> -------------------------------------------------------------
>
>                 Key: PDFBOX-374
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-374
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Brian Carrier
>         Attachments: rotation.pdf, text-rotation-081117.zip
>
>
> When PDFTextStripper is set to sort the text before outputting, the sorting is not correct if a page rotation exists. The reason is that both TextPositionComparator and PDFStreamEngine take the rotation into account, so the rotation has been applied twice by the time the comparison is done in TextPositionComparator.
> Also, the rotation code in PDFStreamEngine does not seem consistent. I verified that the code for 0 and 90 degrees works, but the 180 and 270 cases do not seem consistent with the goal of adjusting the X and Y values so that 0,0 is in the upper left, which is what the 0 and 90 code does. I do not have examples of 180 and 270 to test with. There are no comments in this section, so I have been guessing about its purpose.
> The attached patches:
> - Remove the rotation from TextPositionComparator
> - Add comments and change the 180 and 270 cases to make them consistent with 0 and 90.
