You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pdfbox.apache.org by "Adam Nichols (JIRA)" <ji...@apache.org> on 2010/08/12 22:13:22 UTC
[jira] Closed: (PDFBOX-68) Extracting Text in/between Bookmarks
[ https://issues.apache.org/jira/browse/PDFBOX-68?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adam Nichols closed PDFBOX-68.
------------------------------
Assignee: Adam Nichols
Resolution: Duplicate
Duplicate of PDFBOX-126
> Extracting Text in/between Bookmarks
> ------------------------------------
>
> Key: PDFBOX-68
> URL: https://issues.apache.org/jira/browse/PDFBOX-68
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Assignee: Adam Nichols
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1230940
> Originally submitted by hasan_mushtaq on 2005-07-01 06:35.
> The Text is not properly extracted between the
> bookmarks. Description is as follows
> I used the setStartBookmark and the setEndBookmark
> functions of the PDFTextStripper class to set the
> begining and end of the part of text which i wanted to
> extract. The first bookmark (Introduction) was present
> in the second page and the second bokmark (Results) was
> also present in the second page. let me explain
>
> <page>
> some text before intro blah blah blah ...
>
> Introduction. -- BOOKMARK
> The real aim of
> intro text is to give a simple intro..
>
> Results. -- BOOKMARK
> the result is here
> </page>
>
> The text that i got was
>
> <textExtracted>
> intro text is to give a simple intro..
>
> Results. -- BOOKMARK
>
> the result is here
> </textExtracted>
>
> but the exact result should be
>
> <correctResult>
> The real aim of
> intro text is to give a simple intro..
> </correctResult>
>
> In the <textExtracted> we see that it has skipped some
> text in the intro and has gone untill the end of the page.
>
> It might be a problem with how the bookmarks were
> created for this pdf. But I have tried with other pdf
> files as well and they don't give the exact text
> between bookmarks. If the first bookmark is on the
> first page and the second on the second then I would
> get all the text of the two pages irrespective of the
> location of the bookmarks. I think I need to understand
> how this information is present in the structure of a
> PDF File and how pdfbox accesses it, is it a problem in
> the creation of the bookmarks when the pdf is made or
> we are missing something? I will be looking at the
> PDFReference Guide to understand. Meanwhile any help
> regarding the solution of this issue and the info on
> how bookmarks are represented in a pdf structure would
> be highly appreciated..
>
> Best Regards
> Hasan
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.