Posted to dev@pdfbox.apache.org by "Brian Carrier (JIRA)" <ji...@apache.org> on 2008/11/17 22:50:44 UTC

[jira] Commented: (PDFBOX-106) Line Breaks

    [ https://issues.apache.org/jira/browse/PDFBOX-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648366#action_12648366 ] 

Brian Carrier commented on PDFBOX-106:
--------------------------------------

PDF stores text in chunks, and a chunk is at most one line long (a single line is frequently broken up into several chunks). Each chunk is given coordinates on the page where it should be placed. PDFBox currently checks whether the coordinates of one chunk are below those of the previous chunk and uses that to decide where to add newlines. 
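
As a rough illustration of that newline rule, here is a minimal sketch. It is not PDFBox's actual code: the Chunk class, the joinChunks method and the SAME_LINE_TOLERANCE value are all hypothetical, and y is assumed to grow downward from the top of the page.

```java
import java.util.List;

// Illustrative sketch only; none of these names come from PDFBox.
final class LineBreakSketch {

    // Hypothetical positioned piece of text.
    static final class Chunk {
        final String text;
        final float x; // left edge on the page
        final float y; // baseline, assumed to grow downward from the top of the page

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Assumed tolerance: chunks whose baselines differ by less than this
    // are treated as being on the same line.
    static final float SAME_LINE_TOLERANCE = 2.0f;

    // Concatenates chunks in order, adding a newline whenever a chunk
    // sits below the previous one.
    static String joinChunks(List<Chunk> chunks) {
        StringBuilder out = new StringBuilder();
        Chunk previous = null;
        for (Chunk chunk : chunks) {
            if (previous != null && chunk.y - previous.y > SAME_LINE_TOLERANCE) {
                out.append('\n');
            }
            out.append(chunk.text);
            previous = chunk;
        }
        return out.toString();
    }
}
```

With this rule every visual line in the PDF ends in a newline, which is why the extracted text keeps the page's line breaks even in the middle of a sentence.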

To do what you suggest (which isn't a bad idea), PDFBox would need to guess where paragraph boundaries are, which is much harder (since some paragraphs are denoted by indents, some by an extra line, etc.). 
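
To give an idea of what such guessing might look like, here is a small sketch that merges lines into paragraphs using the two cues mentioned above (an indent or an extra vertical gap). Everything in it is an assumption made for illustration: the Line class, the toParagraphs method, the threshold values, and the idea that the caller already knows the left margin and the normal line spacing. Real documents would need much more than this.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; none of these names come from PDFBox.
final class ParagraphSketch {

    // Hypothetical already-assembled line of text with its position.
    static final class Line {
        final String text;
        final float x; // left edge of the line
        final float y; // baseline, assumed to grow downward from the top of the page

        Line(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Assumed thresholds, chosen arbitrarily for the example.
    static final float INDENT_THRESHOLD = 10.0f; // extra left offset that counts as an indent
    static final float GAP_FACTOR = 1.5f;        // multiple of normal spacing that counts as an extra line

    // Joins consecutive lines with a space and starts a new paragraph
    // when a line is indented or follows a larger-than-normal gap.
    static List<String> toParagraphs(List<Line> lines, float leftMargin, float lineSpacing) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Line previous = null;
        for (Line line : lines) {
            boolean indented = line.x - leftMargin > INDENT_THRESHOLD;
            boolean extraGap = previous != null
                    && line.y - previous.y > GAP_FACTOR * lineSpacing;
            if (current.length() > 0 && (indented || extraGap)) {
                paragraphs.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(line.text.trim());
            previous = line;
        }
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }
}
```

Even this toy version shows the difficulty: the thresholds depend on font size and layout, and cues such as centered headings, multiple columns or hanging indents would all be misread.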

This issue can be closed or turned into a "feature request".



> Line Breaks
> -----------
>
>                 Key: PDFBOX-106
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-106
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1348208
> Originally submitted by nobody on 2005-11-04 04:11.
> I'm investigating methods of extracting text from pdf
> files, I have an issue though - when viewing the PDF in
> acrobat the lines obviously break at the edge of the
> page, though the next line is clearly of the same sentence.
> When the text is extracted into a .txt file, the line
> breaks are still there, is there a method where I can
> make sure that the text doesn't break onto the next
> line in the text document? or is the text encoded in
> the pdf forced into a line break and hence this problem
> cannot be resolved?
> Thanks
> Dan Eastley
> dj (at) eastley.net
