Posted to dev@pdfbox.apache.org by "Sébastien Dailly (Created JIRA)" <ji...@apache.org> on 2011/11/15 14:53:53 UTC

[jira] [Created] (PDFBOX-1170) Strange behavior in TextPositionComparator

Strange behavior in TextPositionComparator
------------------------------------------

                 Key: PDFBOX-1170
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1170
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.6.0, 1.7.0
         Environment: Windows
            Reporter: Sébastien Dailly
            Priority: Minor


When extracting text from the PDF (see attachment) with setSortByPosition(true), the output follows neither the visual position of the elements nor the document structure.

Here is the output of PDFTextStripper :

11111 333333333333333 : 222222222 

The expected output would be :

11111 : 222222222 333333333333333 

The string « 11111 : » is defined in only one instruction :

 [(1) -9.555729866 (1) 17.5939998627 (1) 3.5597500801 (1) 1.9403500557 (1) 4.1794600487 ( ) -0.1493600011 (:) -4.7775301933 ( ) 250 ] TJ

How can it be explained that the 3... is inserted in the middle of it?

(Note: the PDF has been deflated and edited to anonymise the text. I also removed a picture, which explains the XRef error.)
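
For reference, here is a minimal sketch of how the TJ array quoted above advances the text position. It is not PDFBox code. It only applies the PDF rule that each number in a TJ array is subtracted from the current horizontal position, in thousandths of a text space unit. The class and method names are hypothetical, and the glyph width (500 units for every glyph) and font size (1) are assumptions, since the real font metrics are not part of this report. Character and word spacing are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: lays out the glyphs of one TJ array along the x axis. */
public class TjAdvanceSketch {

    /** One shown glyph and the horizontal offset at which it starts. */
    public static final class Glyph {
        public final String text;
        public final double x;

        Glyph(String text, double x) {
            this.text = text;
            this.x = x;
        }
    }

    // Matches either a simple literal string "(...)" or a number.
    // Escaped parentheses inside strings are not handled in this sketch.
    private static final Pattern TOKEN =
            Pattern.compile("\\(([^)]*)\\)|(-?\\d+(?:\\.\\d+)?)");

    /**
     * glyphWidth is an ASSUMED width (in 1/1000 text space units) used for
     * every glyph; a real extractor would read it from the font.
     */
    public static List<Glyph> layout(String tjArray, double glyphWidth, double fontSize) {
        List<Glyph> glyphs = new ArrayList<>();
        double x = 0.0;
        Matcher m = TOKEN.matcher(tjArray);
        while (m.find()) {
            if (m.group(1) != null) {
                // A string: show each character, then advance by its width.
                for (char c : m.group(1).toCharArray()) {
                    glyphs.add(new Glyph(String.valueOf(c), x));
                    x += glyphWidth / 1000.0 * fontSize;
                }
            } else {
                // A number: subtract it from the current position.
                x -= Double.parseDouble(m.group(2)) / 1000.0 * fontSize;
            }
        }
        return glyphs;
    }

    /** True if every glyph starts to the right of the previous one. */
    public static boolean isMonotonic(List<Glyph> glyphs) {
        for (int i = 1; i < glyphs.size(); i++) {
            if (glyphs.get(i).x <= glyphs.get(i - 1).x) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String tj = "[(1) -9.555729866 (1) 17.5939998627 (1) 3.5597500801 (1) "
                + "1.9403500557 (1) 4.1794600487 ( ) -0.1493600011 (:) "
                + "-4.7775301933 ( ) 250 ]";
        List<Glyph> glyphs = layout(tj, 500.0, 1.0);
        for (Glyph g : glyphs) {
            System.out.printf("'%s' at x=%.4f%n", g.text, g.x);
        }
        System.out.println("left to right: " + isMonotonic(glyphs));
    }
}
```

Under these assumptions the adjustments inside the array are small compared with a glyph width, so the glyphs of « 11111 : » are placed strictly left to right. That is why I would not expect a sort by position to put another string between them, unless the real widths or the surrounding text matrices differ a lot from what is assumed here.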

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (PDFBOX-1170) Strange behavior in TextPositionComparator

Posted by "Sébastien Dailly (Updated JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sébastien Dailly updated PDFBOX-1170:
-------------------------------------

    Attachment: output.pdf

The document
                
> Strange behavior in TextPositionComparator
> ------------------------------------------
>
>                 Key: PDFBOX-1170
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1170
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.6.0, 1.7.0
>         Environment: Windows
>            Reporter: Sébastien Dailly
>            Priority: Minor
>         Attachments: output.pdf
>
>
> When extracting text from the PDF (see attachment) with setSortByPosition(true), the output follows neither the visual position of the elements nor the document structure.
> Here is the output of PDFTextStripper :
> 11111 333333333333333 : 222222222 
> The expected output would be :
> 11111 : 222222222 333333333333333 
> The string « 11111 : » is defined in only one instruction :
>  [(1) -9.555729866 (1) 17.5939998627 (1) 3.5597500801 (1) 1.9403500557 (1) 4.1794600487 ( ) -0.1493600011 (:) -4.7775301933 ( ) 250 ] TJ
> How can it be explained that the 3... is inserted in the middle of it?
> (Note: the PDF has been deflated and edited to anonymise the text. I also removed a picture, which explains the XRef error.)
