Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/10/11 01:07:34 UTC

[jira] [Closed] (PDFBOX-34) Retain white lines while extracting text

     [ https://issues.apache.org/jira/browse/PDFBOX-34?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson closed PDFBOX-34.
-----------------------------
    Resolution: Won't Fix

We'll take a patch for this, but it's unlikely to be added otherwise.

> Retain white lines while extracting text
> ----------------------------------------
>
>                 Key: PDFBOX-34
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-34
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Text extraction
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=1116196
> Originally submitted by croky on 2005-02-04 06:12.
> In the output of PDFBox, the white lines between
> paragraphs are lost. For me the white lines are
> important info to detect different paragraphs in texts,
> so I would like to know if there is a way to keep the
> white lines in the output. I hope it doesn't take much
> effort to make this possible; I'll have a look at it
> myself as well.
> [comment on SourceForge]
> Originally sent by nobody.
> Logged In: NO 
> Hey Ben,
> Could you tell us how the Parser/Stripper works?
> I tried extending the Stripper class, but I can only pick up paragraph ends when there's a drastic difference in spacing.
> Maybe that'll help all of us out. Thanks.
> [comment on SourceForge]
> Originally sent by lord0.
> Logged In: YES 
> user_id=1037043
> I concur with Croky's request. For my usage it is vital that
> I can somehow detect paragraphs. Preserving the white lines
> would allow this.
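
The paragraph detection requested above could be approximated outside of PDFBox by comparing the vertical gaps between extracted lines. The following is a minimal sketch, not PDFBox code: the class ParagraphGapSketch, the method withBlankLines, and the gapFactor threshold are hypothetical names, and it assumes the caller already has each line's text and vertical position in reading order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ParagraphGapSketch {

    /**
     * Re-inserts blank lines between paragraphs (hypothetical helper).
     * texts[i] is the text of line i and baselines[i] its vertical
     * position on the page; both arrays are assumed to be in reading order.
     */
    public static List<String> withBlankLines(String[] texts, double[] baselines, double gapFactor) {
        List<String> out = new ArrayList<>();
        if (texts.length == 0) {
            return out;
        }
        // Vertical distance between each pair of consecutive lines.
        double[] gaps = new double[texts.length - 1];
        for (int i = 1; i < texts.length; i++) {
            gaps[i - 1] = Math.abs(baselines[i] - baselines[i - 1]);
        }
        // Take the middle element of the sorted gaps as the usual line spacing.
        double[] sorted = gaps.clone();
        Arrays.sort(sorted);
        double typical = sorted.length == 0 ? 0.0 : sorted[sorted.length / 2];

        out.add(texts[0]);
        for (int i = 1; i < texts.length; i++) {
            // A gap clearly larger than usual is treated as a paragraph break.
            if (typical > 0 && gaps[i - 1] > gapFactor * typical) {
                out.add("");
            }
            out.add(texts[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] texts = {
            "First paragraph, line 1",
            "First paragraph, line 2",
            "Second paragraph, line 1",
            "Second paragraph, line 2"
        };
        double[] baselines = {100, 112, 136, 148};
        for (String line : withBlankLines(texts, baselines, 1.5)) {
            System.out.println(line);
        }
    }
}
```

With the sample values in main, the gap between the second and third line is twice the usual spacing, so one empty line is emitted between them. The factor of 1.5 is an arbitrary starting point and would likely need tuning per document.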



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)