Posted to dev@pdfbox.apache.org by Shriram <sh...@yahoo.com> on 2012/03/06 08:26:50 UTC

Extracting text between two bookmarks using Apache PdfBox

I am using Apache PDFBox to read a PDF document whose hierarchy is defined by its bookmarks. The hierarchy is a tree with content only at the leaf level. When I try to extract the text between two leaf-level bookmarks (using Stripper.setStartBookmark(), Stripper.setEndBookmark() and Stripper.writeText()), I get the text of the whole page instead. In short, my problem is similar to the one described at http://www.java-forums.org/advanced-java/51032-pdox-1-6-0-extract-text-between-2-bookmarks-same-page-sos.html
Is there a way to extract the contents between two bookmarks? If so, what should I change in my code?
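
Editorial sketch, not part of the original thread: as far as I can tell, the PDFBox 1.x text stripper resolves the start and end bookmarks to page numbers, which would explain why two bookmarks on the same page return the whole page. One possible workaround, assuming each bookmark's destination carries a vertical position (for example via PDPageXYZDestination.getTop()), is to collect the page's text lines together with their y coordinates and keep only those that fall between the two positions. The Java below shows only that filtering step on plain data; BookmarkSlice, Line and between are hypothetical names, not PDFBox API, and the coordinates are assumed to be in one system with y growing upward.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: filters already-extracted lines
// by vertical position. Assumes y grows upward (PDF user space).
final class BookmarkSlice {

    /** One line of page text with the y coordinate of its baseline. */
    static final class Line {
        final float y;
        final String text;

        Line(float y, String text) {
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Returns the lines lying at or below startTop and strictly above endTop,
     * in the order given, each followed by a newline.
     */
    static String between(List<Line> lines, float startTop, float endTop) {
        StringBuilder out = new StringBuilder();
        for (Line line : lines) {
            if (line.y <= startTop && line.y > endTop) {
                out.append(line.text).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for one page's extracted lines.
        List<Line> page = new ArrayList<Line>();
        page.add(new Line(690f, "Section 1.1"));
        page.add(new Line(670f, "Text that belongs to section 1.1."));
        page.add(new Line(590f, "Section 1.2"));
        page.add(new Line(570f, "Text that belongs to section 1.2."));

        // Assumed destination tops: 700 for the first bookmark, 600 for the next.
        System.out.print(between(page, 700f, 600f));
    }
}
```

In a real program the lines and their positions would have to come from the stripper itself (for instance by subclassing it and recording text positions), and the bookmark coordinates would need converting into the same coordinate system first; both steps are left out here.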

Re: Extracting text between two bookmarks using Apache PdfBox

Posted by Shriram <sh...@yahoo.com>.

Thank you for your reply, Edson Alves Pereira, but I am sure that my PDF is structured properly, i.e. there are several bookmarks within a single page, with some text between the leaf-level bookmarks. It is not the other way round, i.e. a page within a bookmark.

________________________________
From: Edson Alves Pereira <lo...@gmail.com>
To: dev@pdfbox.apache.org; Shriram <sh...@yahoo.com>
Sent: Tuesday, March 6, 2012 11:43 PM
Subject: Re: Extracting text between two bookmarks using Apache PdfBox

It is possible that your whole page is inside a bookmark; check how your PDF is structured.

On Tue, Mar 6, 2012 at 4:26 AM, Shriram <sh...@yahoo.com> wrote:

> I am using Apache PDFBox to read a PDF document which has a hierarchy, which is defined by the bookmarks. The hierarchy is in a tree form with contents only at the leaf level. When I try to extract the text between two leaf level bookmarks (using Stripper.setStartBookmark(), Stripper.setEndBookmark() and Stripper.writeText()), I get the text in the whole page instead. In short, my problem is similar to that mentioned in http://www.java-forums.org/advanced-java/51032-pdox-1-6-0-extract-text-between-2-bookmarks-same-page-sos.html
> Is there a way to extract the contents between two bookmarks? If so, what should be the change in my code?

Re: Extracting text between two bookmarks using Apache PdfBox

Posted by Edson Alves Pereira <lo...@gmail.com>.

It is possible that your whole page is inside a bookmark; check how your
PDF is structured.

On Tue, Mar 6, 2012 at 4:26 AM, Shriram <sh...@yahoo.com> wrote:

> I am using Apache PDFBox to read a PDF document which has a hierarchy,
> which is defined by the bookmarks. The hierarchy is in a tree form with
> contents only at the leaf level. When I try to extract the text between two
> leaf level bookmarks (using Stripper.setStartBookmark(),
> Stripper.setEndBookmark() and Stripper.writeText()), I get the text in the
> whole page instead. In short, my problem is similar to that mentioned in
> http://www.java-forums.org/advanced-java/51032-pdox-1-6-0-extract-text-between-2-bookmarks-same-page-sos.html
> Is there a way to extract the contents between two bookmarks? If so, what
> should be the change in my code?