Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2013/05/26 14:26:20 UTC

[jira] [Commented] (PDFBOX-1553) Offset of extracted coordinates

    [ https://issues.apache.org/jira/browse/PDFBOX-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13667302#comment-13667302 ] 

Andreas Lehmkühler commented on PDFBOX-1553:
--------------------------------------------

I had a look at your code, and it seems that the calculation of the rectangle for the character is wrong:

DocumentRect rect = new DocumentRect(
    text.getX(),
    text.getY() - text.getHeight(),
    text.getX() + text.getWidth(),
    text.getY());

But to be consistent with the calculation of the document rectangle it should be:

DocumentRect rect = new DocumentRect(
    text.getX(),
    text.getY(),
    text.getX() + text.getWidth(),
    text.getY() + text.getHeight());
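
To make the difference between the two variants concrete, here is a minimal, self-contained sketch. It does not use PDFBox or the attached Parser.java: Rect is a hypothetical stand-in for DocumentRect, its field names (left, top, right, bottom) are only an assumption about the constructor's argument order, and the numbers in main are made-up values standing in for text.getX(), text.getY(), text.getWidth() and text.getHeight().

```java
// Hypothetical stand-in for the reporter's DocumentRect. The field names
// (left, top, right, bottom) are an assumption about the argument order.
final class RectSketch {

    static final class Rect {
        final float left, top, right, bottom;

        Rect(float left, float top, float right, float bottom) {
            this.left = left;
            this.top = top;
            this.right = right;
            this.bottom = bottom;
        }

        @Override
        public String toString() {
            return "Rect[" + left + ", " + top + ", " + right + ", " + bottom + "]";
        }
    }

    // Variant from the attached parser: the rectangle spans y - height .. y.
    static Rect original(float x, float y, float width, float height) {
        return new Rect(x, y - height, x + width, y);
    }

    // Variant suggested in this comment: the rectangle spans y .. y + height.
    static Rect suggested(float x, float y, float width, float height) {
        return new Rect(x, y, x + width, y + height);
    }

    public static void main(String[] args) {
        // Made-up sample values, not taken from the attached PDFs.
        float x = 10f, y = 20f, width = 5f, height = 8f;
        System.out.println("original:  " + original(x, y, width, height));
        System.out.println("suggested: " + suggested(x, y, width, height));
    }
}
```

Both variants produce the same horizontal extent; they differ only by a vertical shift of exactly one character height, so every extracted rectangle moves by text.getHeight() along the y axis.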

                
> Offset of extracted coordinates
> -------------------------------
>
>                 Key: PDFBOX-1553
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1553
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.8.0
>         Environment: Linux Ubuntu 64 bit, Java
>            Reporter: Vitalie Bureanu
>            Priority: Minor
>              Labels: offset
>         Attachments: EnSt10_offset.pdf, EnSt11_offset.pdf, Extracted coordinates of rects.jpg, Parser.java, Selection in Adobe Reader.png
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hello,
> Preamble: We are glad to use PDFBox, and I am personally grateful to all the developers who sustain this project. It is good work, guys!
> We have one problem. For our application we extract text from PDFs character by character, with the respective coordinates for each character (see the attached Parser).
> After this we group the characters into words. We noticed that for some PDF documents the extracted rect coordinates have a strange "offset" (see the screenshots).
> The offset seems to be incremental (not sure): at the top left corner of the document the coordinates are close to the real character positions, but at the bottom right corner the offset is about 0.5 cm.
> If I make a selection in Adobe Reader, everything looks fine.
> I attached two pdf files with offset to this post.
> If you want to see the offset "in action" you can use our service to do it at http://pdf2data.cloudforpeople.com/ (Please do not consider it as advertising)
> Could you please test these files and tell me whether it is really a bug?
> How can we resolve it?
> Thanks,
> Vitalie
