Posted to dev@pdfbox.apache.org by "Wolfgang Fahl (JIRA)" <ji...@apache.org> on 2018/01/05 21:01:00 UTC
[jira] [Created] (PDFBOX-4054) allow to access positions of text
extracted by
Wolfgang Fahl created PDFBOX-4054:
-------------------------------------
Summary: allow to access positions of text extracted by
Key: PDFBOX-4054
URL: https://issues.apache.org/jira/browse/PDFBOX-4054
Project: PDFBox
Issue Type: Improvement
Affects Versions: 1.8.13
Environment: any
Reporter: Wolfgang Fahl
Priority: Critical
https://stackoverflow.com/questions/25109969/how-to-extract-a-paragraph-from-a-pdf-file-and-store-its-position/48119163?noredirect=1#comment83218312_48119163
describes a need that pdftotext -bbox-layout fulfills by supplying structural information
for the text extraction.
There has been no PDFBox answer to that question for a while, so I assume such a feature is missing.
A similar approach would be a useful improvement to PDFBox and is much wanted for certain applications, e.g. when the position of a text on a page is important for its meaning.
The poppler XHTML output supplies, for example:
<flow>
  <block xMin="333.000000" yMin="270.150000" xMax="360.004000" yMax="275.150000">
    <line xMin="333.000000" yMin="270.150000" xMax="360.004000" yMax="275.150000">
      <word xMin="333.000000" yMin="270.150000" xMax="342.896500" yMax="275.150000">Your</word>
      <word xMin="347.047500" yMin="270.150000" xMax="360.004000" yMax="275.150000">Bank</word>
    </line>
  </block>
</flow>
flow/block/line/word is a hierarchy, and you get position information (a bounding box) for each block, line and word.
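In the sample above, the box of the line coincides with the smallest box enclosing its two words. Assuming that is how the enclosing boxes are derived in general (the sample suggests it, but I have not checked poppler's code), the relationship can be sketched as follows. The class names BBox and BBoxDemo are made up for this illustration and are not part of PDFBox or poppler.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Axis-aligned box with the same four attributes as the sample output (hypothetical helper). */
final class BBox {
    final double xMin, yMin, xMax, yMax;

    BBox(double xMin, double yMin, double xMax, double yMax) {
        this.xMin = xMin;
        this.yMin = yMin;
        this.xMax = xMax;
        this.yMax = yMax;
    }

    /** Smallest box that contains both this box and the other one. */
    BBox union(BBox other) {
        return new BBox(Math.min(xMin, other.xMin), Math.min(yMin, other.yMin),
                        Math.max(xMax, other.xMax), Math.max(yMax, other.yMax));
    }

    /** Union of a non-empty list of boxes, e.g. all words of a line. */
    static BBox unionOf(List<BBox> boxes) {
        if (boxes.isEmpty()) {
            throw new IllegalArgumentException("at least one box is required");
        }
        BBox result = boxes.get(0);
        for (BBox box : boxes.subList(1, boxes.size())) {
            result = result.union(box);
        }
        return result;
    }

    @Override
    public String toString() {
        return String.format(Locale.ROOT, "xMin=%.6f yMin=%.6f xMax=%.6f yMax=%.6f",
                             xMin, yMin, xMax, yMax);
    }
}

final class BBoxDemo {
    public static void main(String[] args) {
        // Word boxes copied from the sample fragment.
        BBox your = new BBox(333.000000, 270.150000, 342.896500, 275.150000);
        BBox bank = new BBox(347.047500, 270.150000, 360.004000, 275.150000);
        System.out.println(BBox.unionOf(Arrays.asList(your, bank)));
    }
}
```

Running BBoxDemo prints the union of the two word boxes, which can be compared against the attributes of the <line> element in the sample.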
PDFBox could supply similar information via callbacks.
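To make the callback idea more concrete, here is a rough sketch of what such an interface might look like. Everything in it is hypothetical: LayoutCallback, WordCollector and SampleReplay are names invented for this illustration, none of them exists in PDFBox, and SampleReplay merely replays the sample fragment in place of a real extractor.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical callback interface; nothing here is existing PDFBox API. */
interface LayoutCallback {
    /** Called when a block or line starts; box is {xMin, yMin, xMax, yMax}. */
    void begin(String level, double[] box);

    /** Called once per extracted word with the word's own box. */
    void word(String text, double[] box);

    /** Called when the most recently begun block or line ends. */
    void end(String level);
}

/** Example consumer: records each word with the xMin/yMin of its box and its nesting depth. */
final class WordCollector implements LayoutCallback {
    final List<String> words = new ArrayList<>();
    private int depth = 0;

    @Override
    public void begin(String level, double[] box) {
        depth++;
    }

    @Override
    public void word(String text, double[] box) {
        words.add(text + "@" + box[0] + "," + box[1] + " depth=" + depth);
    }

    @Override
    public void end(String level) {
        depth--;
    }
}

/** Stand-in for an extractor: replays the sample fragment as callback events. */
final class SampleReplay {
    static void replay(LayoutCallback callback) {
        // In the sample, block and line happen to have identical boxes.
        double[] blockAndLineBox = {333.0, 270.15, 360.004, 275.15};
        callback.begin("block", blockAndLineBox);
        callback.begin("line", blockAndLineBox);
        callback.word("Your", new double[] {333.0, 270.15, 342.8965, 275.15});
        callback.word("Bank", new double[] {347.0475, 270.15, 360.004, 275.15});
        callback.end("line");
        callback.end("block");
    }

    public static void main(String[] args) {
        WordCollector collector = new WordCollector();
        replay(collector);
        collector.words.forEach(System.out::println);
    }
}
```

With an interface of this kind, an application that cares about where text sits on the page would only need to implement the callbacks it is interested in, while plain text extraction stays unchanged.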