Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2018/11/12 17:18:00 UTC

[jira] [Commented] (PDFBOX-4376) Get text within pdf by paragraphs.

    [ https://issues.apache.org/jira/browse/PDFBOX-4376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16684116#comment-16684116 ] 

Tilman Hausherr commented on PDFBOX-4376:
-----------------------------------------

You can delimit paragraphs by using {{stripper.setParagraphStart("XXXX")}} and {{stripper.setParagraphEnd("YYYY")}}. I tested this with your file. However, this won't work the way you wish, because what you consider to be one paragraph is really five. I managed to solve this by calling {{stripper.setDropThreshold(5)}}. However, it is possible that other, separate paragraphs will now be merged together.
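Once the stripper wraps each paragraph in markers, the text returned by {{stripper.getText(document)}} still has to be cut into paragraphs. The following is only a minimal sketch of that post-processing step, using plain Java and no PDFBox classes. The class name {{ParagraphSplitter}}, the method {{splitParagraphs}} and the markers {{[[P]]}} / {{[[/P]]}} are assumptions made for illustration. Different markers are used here because the sample text itself contains "XXXX", which would collide with a start marker of the same value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
final class ParagraphSplitter {

    // Returns the text found between each start/end marker pair.
    // Line breaks inside a paragraph are collapsed to single spaces.
    static List<String> splitParagraphs(String text, String start, String end) {
        Pattern pattern = Pattern.compile(
                Pattern.quote(start) + "(.*?)" + Pattern.quote(end),
                Pattern.DOTALL);
        Matcher matcher = pattern.matcher(text);
        List<String> paragraphs = new ArrayList<>();
        while (matcher.find()) {
            String body = matcher.group(1).replaceAll("\\s+", " ").trim();
            if (!body.isEmpty()) {
                paragraphs.add(body);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Stand-in for the stripper output; in real use this string would
        // come from stripper.getText(document) after setParagraphStart("[[P]]")
        // and setParagraphEnd("[[/P]]") have been called (assumed markers).
        String extracted = "[[P]]For your card ending in: XXXX\n"
                + "second line of the same paragraph[[/P]]\n"
                + "[[P]]Another paragraph[[/P]]\n";
        for (String paragraph : splitParagraphs(extracted, "[[P]]", "[[/P]]")) {
            System.out.println(paragraph);
        }
    }
}
```

Whatever markers are chosen, they should be strings that cannot occur in the document text; otherwise the split will cut paragraphs in the wrong places.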

Another alternative might be to use the structure information in the PDF. It is present in your PDF, but it is poorly supported by PDFBox, and nobody on the core team is a specialist in this topic.

You can open your file with PDFDebugger, then choose "View", "Choose internal structure" in the menu, then go to Root/StructTreeRoot and find out what's there.

> Get text within pdf by paragraphs.
> ----------------------------------
>
>                 Key: PDFBOX-4376
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4376
>             Project: PDFBox
>          Issue Type: Improvement
>            Reporter: Kaushlendra Singh
>            Priority: Major
>         Attachments: Sample.pdf
>
>
> There is a scenario in which I have to fetch the text within a PDF page paragraph by paragraph, not line by line. All these text paragraphs are built from text frames created by the InDesign editor.
> For example: in the attached PDF document, my requirement is to fetch the complete text of the bounding box all at once, along with its coordinates, starting from "For your card ending in: XXXX" and ending at "purchases into gateways."


