Posted to dev@pdfbox.apache.org by "rey bernal (JIRA)" <ji...@apache.org> on 2015/11/01 22:22:27 UTC
[jira] [Updated] (PDFBOX-3079) Extracting text between bookmarks not working
[ https://issues.apache.org/jira/browse/PDFBOX-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
rey bernal updated PDFBOX-3079:
-------------------------------
Attachment: Test.java, test.pdf
Expected result for subsection 1:
--- Subsection 1:
This is subsection 1
1. one
2. two
3. three
Actual result for subsection 1:
--- Subsection 1:
Main Section
This is the main section.
1. one
2. two
3. three
Subsection 1
This is subsection 1
1. one
2. two
3. three
Subsection 1.1
This is subsection 1.1
1. one
2. two
3. three
Subsection 1.2
This is subsection 1.2
1. one
2. two
3. three
Subsection 2
This is subsection 2
1. one
2. two
3. three
Subsection 2.1
This is subsection 2.1
1. one
2. two
3. three
Subsection 2.2
This is subsection 2.2
> Extracting text between bookmarks not working
> ---------------------------------------------
>
> Key: PDFBOX-3079
> URL: https://issues.apache.org/jira/browse/PDFBOX-3079
> Project: PDFBox
> Issue Type: Improvement
> Components: Text extraction
> Affects Versions: 2.0.0
> Environment: Windows
> Reporter: rey bernal
> Priority: Critical
> Labels: features
> Fix For: 2.0.0
>
> Attachments: Test.java, test.pdf
>
>
> org.apache.pdfbox.text.PDFTextStripper does not really support extracting content between bookmarks. From the code in pdfbox-parent/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java, it is clear that the bookmarks the user provides are used only to determine which pages to extract content from.
> There is a business need to extract the text that lies strictly between bookmarks. Refer to the attached example program and sample file.
> Extracting any of the sections on the first page returns the entire first page instead of only the content under that bookmark.
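
The behaviour requested above can be approximated after extraction. The following is only a minimal sketch, not part of PDFBox and not the attached Test.java: it assumes the page text has already been extracted as a string and that each bookmark title appears verbatim on its own line, as it does in the sample output. The class name BookmarkTextSlicer and the method between are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: slices already-extracted page text between two
// bookmark titles. Assumes each title occupies a whole line of the text.
public class BookmarkTextSlicer {

    /**
     * Returns the lines strictly between the first line equal to startTitle
     * and the next line equal to endTitle. If endTitle is null or never
     * appears, everything after startTitle is returned. If startTitle never
     * appears, the result is an empty string.
     */
    public static String between(String pageText, String startTitle, String endTitle) {
        List<String> kept = new ArrayList<>();
        boolean inside = false;
        for (String line : pageText.split("\\R")) {
            String trimmed = line.trim();
            if (!inside) {
                // Whole-line comparison, so "Subsection 1" does not match "Subsection 1.1".
                if (trimmed.equals(startTitle)) {
                    inside = true;
                }
                continue;
            }
            if (endTitle != null && trimmed.equals(endTitle)) {
                break;
            }
            kept.add(line);
        }
        return String.join("\n", kept);
    }

    public static void main(String[] args) {
        // Page text modelled on the "actual result" reported in this issue.
        String pageText = String.join("\n",
                "Main Section", "This is the main section.", "1. one", "2. two", "3. three",
                "Subsection 1", "This is subsection 1", "1. one", "2. two", "3. three",
                "Subsection 1.1", "This is subsection 1.1", "1. one", "2. two", "3. three");
        System.out.println("--- Subsection 1:");
        System.out.println(between(pageText, "Subsection 1", "Subsection 1.1"));
    }
}
```

Matching whole lines keeps a title such as "Subsection 1" from also matching "Subsection 1.1". The approach breaks down if a title is repeated in the body text or is wrapped across lines in the PDF, so it is a workaround rather than a substitute for position-aware extraction inside the stripper.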
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)