Posted to dev@pdfbox.apache.org by "Adam Nichols (JIRA)" <ji...@apache.org> on 2010/08/12 22:13:22 UTC
[jira] Closed: (PDFBOX-68) Extracting Text in/between Bookmarks

     [ https://issues.apache.org/jira/browse/PDFBOX-68?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Nichols closed PDFBOX-68.
------------------------------

      Assignee: Adam Nichols
    Resolution: Duplicate

Duplicate of PDFBOX-126

> Extracting Text in/between Bookmarks
> ------------------------------------
>
>                 Key: PDFBOX-68
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-68
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Assignee: Adam Nichols
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1230940
> Originally submitted by hasan_mushtaq on 2005-07-01 06:35.
> The text is not properly extracted between the
> bookmarks. The description is as follows.
> I used the setStartBookmark and setEndBookmark
> functions of the PDFTextStripper class to set the
> beginning and end of the part of the text which I wanted to
> extract. The first bookmark (Introduction) was on
> the second page and the second bookmark (Results) was
> also on the second page. Let me explain:
>  
> <page> 
> some text before intro blah blah blah ... 
>  
> Introduction. -- BOOKMARK 
> The real aim of 
> intro text is to give a simple intro.. 
>  
> Results. -- BOOKMARK 
> the result is here 
> </page> 
>  
> The text that I got was
>  
> <textExtracted> 
> intro text is to give a simple intro.. 
>  
> Results. -- BOOKMARK 
>  
> the result is here  
> </textExtracted> 
>  
> But the exact result should be
>  
> <correctResult> 
> The real aim of 
> intro text is to give a simple intro.. 
> </correctResult> 
>  
> In the <textExtracted> we see that it has skipped some
> text in the intro and has gone until the end of the page.
>  
> It might be a problem with how the bookmarks were
> created for this PDF, but I have tried other PDF
> files as well and they don't give the exact text
> between bookmarks either. If the first bookmark is on the
> first page and the second is on the second page, then I
> get all the text of both pages irrespective of the
> location of the bookmarks. I think I need to understand
> how this information is stored in the structure of a
> PDF file and how PDFBox accesses it. Is it a problem in
> the creation of the bookmarks when the PDF is made, or
> are we missing something? I will be looking at the
> PDF Reference guide to understand. Meanwhile, any help
> with solving this issue, and any info on
> how bookmarks are represented in the PDF structure, would
> be highly appreciated.
>  
> Best Regards 
> Hasan
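
The expected behaviour described in the report can be pinned down with a small, library-free sketch. It is only an illustration of the desired slicing, not how PDFBox resolves bookmarks: a PDF outline item points at a destination (a page, optionally with coordinates), not at a line of text. The class name BookmarkSlice and the method between are hypothetical, and matching a bookmark by the start of a line is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BookmarkSlice {

    // Hypothetical helper (not part of PDFBox): returns the lines strictly
    // between the first line starting with startTitle and the next line
    // starting with endTitle. Both heading lines are excluded.
    static List<String> between(List<String> lines, String startTitle, String endTitle) {
        List<String> out = new ArrayList<>();
        boolean inside = false;
        for (String line : lines) {
            if (!inside) {
                // Skip everything up to and including the start heading.
                if (line.startsWith(startTitle)) {
                    inside = true;
                }
                continue;
            }
            // Stop at the end heading without emitting it.
            if (line.startsWith(endTitle)) {
                break;
            }
            out.add(line);
        }
        return out;
    }

    public static void main(String[] args) {
        // The page from the report, one entry per line of text.
        List<String> page = Arrays.asList(
                "some text before intro blah blah blah ...",
                "Introduction.",
                "The real aim of",
                "intro text is to give a simple intro..",
                "Results.",
                "the result is here");
        for (String line : between(page, "Introduction.", "Results.")) {
            System.out.println(line);
        }
    }
}
```

Run on the sample page, this prints the two lines shown in <correctResult> above, with neither heading and nothing from after "Results.".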
