Posted to dev@pdfbox.apache.org by "rey bernal (JIRA)" <ji...@apache.org> on 2015/11/01 22:22:27 UTC

[jira] [Updated] (PDFBOX-3079) Extracting text between bookmarks not working

     [ https://issues.apache.org/jira/browse/PDFBOX-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

rey bernal updated PDFBOX-3079:
-------------------------------
    Attachment: Test.java
                test.pdf

Expected result for subsection 1:

--- Subsection 1:
This is subsection 1 
1. one 
2. two 
3. three 

Actual result for subsection 1:

--- Subsection 1:
 
Main Section 
This is the main section. 
1. one 
2. two 
3. three 
Subsection 1 
This is subsection 1 
1. one 
2. two 
3. three 
Subsection 1.1 
This is subsection 1.1 
1. one 
2. two 
3. three 
Subsection 1.2 
This is subsection 1.2 
1. one 
2. two 
3. three 
Subsection 2 
This is subsection 2 
1. one 
2. two 
3. three 
Subsection 2.1 
This is subsection 2.1 
1. one 
2. two 
3. three 
Subsection 2.2 
This is subsection 2.2 

> Extracting text between bookmarks not working
> ---------------------------------------------
>
>                 Key: PDFBOX-3079
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3079
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.0
>         Environment: Windows
>            Reporter: rey bernal
>            Priority: Critical
>              Labels: features
>             Fix For: 2.0.0
>
>         Attachments: Test.java, test.pdf
>
>
> org.apache.pdfbox.text.PDFTextStripper does not really support extracting the content between two bookmarks. From the code in pdfbox-parent/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java, it is clear that the stripper uses the bookmarks the caller provides only to determine which pages to extract content from.
> There is a business need to extract only the text that lies strictly between two bookmarks. Refer to the attached example program (Test.java) and sample file (test.pdf).
> Extracting any of the sections on the first page returns the entire first page instead of only the content under that bookmark.
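
Until the stripper supports this directly, one possible workaround is to trim the extracted page text afterwards. The sketch below is not part of PDFBox and is not the code in Test.java. The class BookmarkSlicer and its method between are hypothetical names, and it assumes each bookmark title appears on a line of its own in the extracted text, as it does in the output above. It works on a plain String, for example one already returned by PDFTextStripper.getText.

```java
// Hypothetical helper, not a PDFBox class: slices already-extracted text
// by the bookmark titles that appear in it as whole lines.
final class BookmarkSlicer {
    private BookmarkSlicer() {}

    /**
     * Returns the lines that follow the first line equal to startTitle,
     * up to (but not including) the next line equal to endTitle.
     * Pass null as endTitle to read to the end of the text.
     * Lines are trimmed, because the extracted text carries trailing spaces.
     */
    static String between(String pageText, String startTitle, String endTitle) {
        StringBuilder out = new StringBuilder();
        boolean inside = false;
        for (String line : pageText.split("\\R")) {
            String trimmed = line.trim();
            if (!inside) {
                // Whole-line comparison, so "Subsection 1" does not match
                // the line "Subsection 1.1".
                if (trimmed.equals(startTitle)) {
                    inside = true;
                }
                continue;
            }
            if (endTitle != null && trimmed.equals(endTitle)) {
                break;
            }
            out.append(trimmed).append('\n');
        }
        return out.toString();
    }
}
```

Calling BookmarkSlicer.between(text, "Subsection 1", "Subsection 1.1") on the actual result above would keep only the four lines expected for subsection 1. The approach is fragile: it fails if a title wraps across lines, and a body line identical to the end title cuts the slice short, so it is only a stopgap for documents laid out like test.pdf.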


