Posted to dev@pdfbox.apache.org by "Sébastien Dailly (Created JIRA)" <ji...@apache.org> on 2011/11/15 14:53:53 UTC
[jira] [Created] (PDFBOX-1170) Strange behavior in TextPositionComparator
Strange behavior in TextPositionComparator
------------------------------------------
Key: PDFBOX-1170
URL: https://issues.apache.org/jira/browse/PDFBOX-1170
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.6.0, 1.7.0
Environment: Windows
Reporter: Sébastien Dailly
Priority: Minor
When extracting text from the PDF (see attachment) with setSortByPosition(true), the output follows neither the visual position of the elements nor the document structure.
Here is the output of PDFTextStripper:
11111 333333333333333 : 222222222
The expected output would be:
11111 : 222222222 333333333333333
The string "11111 : " is drawn by a single instruction:
[(1) -9.555729866 (1) 17.5939998627 (1) 3.5597500801 (1) 1.9403500557 (1) 4.1794600487 ( ) -0.1493600011 (:) -4.7775301933 ( ) 250 ] TJ
How can the 3... string end up inserted in the middle of it?
(Note: the PDF has been deflated and edited to anonymise the text. I also removed a picture, which explains the XRef error.)
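To make the point about the single instruction concrete, here is a minimal sketch in plain Java (no PDFBox involved) that reads the TJ array quoted above, collects the text it shows, and sums its position adjustments. In a TJ array each number is expressed in thousandths of a text space unit and is subtracted from the current horizontal position. The class and method names are hypothetical, the tokenizer only handles simple strings without escapes or nested parentheses, and the font size of 12 is an assumption, since the report does not state it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: inspects a TJ array given as text.
final class TjAdjustments {
    // Group 1: a simple literal string "(...)" (no escapes, no nested parentheses).
    // Group 2: a decimal number (a position adjustment).
    private static final Pattern TOKEN =
            Pattern.compile("\\(([^()\\\\]*)\\)|(-?\\d+(?:\\.\\d+)?)");

    // Concatenates the string operands, i.e. the text the instruction shows.
    static String shownText(String tjArray) {
        StringBuilder text = new StringBuilder();
        Matcher m = TOKEN.matcher(tjArray);
        while (m.find()) {
            if (m.group(1) != null) {
                text.append(m.group(1));
            }
        }
        return text.toString();
    }

    // Sums the numeric operands (thousandths of a text space unit).
    static double sumAdjustments(String tjArray) {
        double sum = 0.0;
        Matcher m = TOKEN.matcher(tjArray);
        while (m.find()) {
            if (m.group(2) != null) {
                sum += Double.parseDouble(m.group(2));
            }
        }
        return sum;
    }

    // Horizontal displacement caused by the adjustments alone, in text space
    // units, assuming 100% horizontal scaling. Positive adjustments move left.
    static double displacement(double adjustmentSum, double fontSize) {
        return -adjustmentSum / 1000.0 * fontSize;
    }

    public static void main(String[] args) {
        String tj = "[(1) -9.555729866 (1) 17.5939998627 (1) 3.5597500801 (1) "
                + "1.9403500557 (1) 4.1794600487 ( ) -0.1493600011 (:) "
                + "-4.7775301933 ( ) 250 ] TJ";
        double fontSize = 12.0; // assumed; not given in the report
        double sum = sumAdjustments(tj);
        System.out.println("text: \"" + shownText(tj) + "\"");
        System.out.println("sum of adjustments: " + sum);
        System.out.println("displacement: " + displacement(sum, fontSize));
    }
}
```

Under these assumptions the adjustments between the glyphs are each well under 2% of the font size, and the trailing 250 shifts the position by only a quarter of the font size, so the instruction itself leaves no gap wide enough to hold the 3... string.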
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PDFBOX-1170) Strange behavior in TextPositionComparator
Posted by "Sébastien Dailly (Updated JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sébastien Dailly updated PDFBOX-1170:
-------------------------------------
Attachment: output.pdf
The document