Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/10/10 21:45:35 UTC
[jira] [Updated] (PDFBOX-1143) PDFTextStripper doesn't process text annotations
[ https://issues.apache.org/jira/browse/PDFBOX-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
John Hewson updated PDFBOX-1143:
--------------------------------
Affects Version/s: 1.7.0
> PDFTextStripper doesn't process text annotations
> ------------------------------------------------
>
> Key: PDFBOX-1143
> URL: https://issues.apache.org/jira/browse/PDFBOX-1143
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.7.0
> Reporter: Michael McCandless
> Priority: Minor
> Fix For: 2.0.0
>
>
> Users are able to add annotations (comments) to a PDF, and PDFBox
> processes them correctly: you can retrieve them via
> PDPage.getAnnotations.
> But PDFTextStripper currently doesn't extract the text from
> annotations.
> I think it should, at least optionally?
> I think we'd add a boolean (shouldProcessAnnotations?), and if
> enabled, we'd visit the annotations that have sub-type FreeText, and
> extract what text we can (Subject, TitlePopup, Contents, maybe
> RichContents?), associate the .getRectangle with the text to make a
> TextPosition, and then somehow associate that with the right
> "article" (so that annotations "over" a given article are rendered
> with it).
> Alternatively we just put all annotations into their own "article"?
> I'm not familiar enough with PDF text positioning or PDFTextStripper
> to work out a real patch here... but I think this approach should
> work?
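
The approach proposed in the description could be sketched roughly as follows. This is a minimal illustration only, not a patch and not PDFBox code: `Note` and `PositionedText` are hypothetical stand-ins for an annotation and for a positioned piece of text, and `shouldProcessAnnotations` is simply the flag name floated in the issue. It shows the filtering on the FreeText sub-type, the gathering of Subject, TitlePopup and Contents, and the pairing of the text with the annotation's rectangle origin.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AnnotationTextSketch {

    /** Hypothetical stand-in for an annotation; NOT a PDFBox class. */
    static final class Note {
        final String subtype, subject, titlePopup, contents;
        final float x, y; // lower-left corner of the annotation rectangle

        Note(String subtype, String subject, String titlePopup,
             String contents, float x, float y) {
            this.subtype = subtype;
            this.subject = subject;
            this.titlePopup = titlePopup;
            this.contents = contents;
            this.x = x;
            this.y = y;
        }
    }

    /** Hypothetical stand-in for text tied to a page position. */
    static final class PositionedText {
        final String text;
        final float x, y;

        PositionedText(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Collects text from FreeText annotations when the flag is enabled.
     * The result is kept separate from the page text, which corresponds
     * to the "own article" alternative mentioned in the issue.
     */
    static List<PositionedText> extract(List<Note> notes,
                                        boolean shouldProcessAnnotations) {
        List<PositionedText> out = new ArrayList<>();
        if (!shouldProcessAnnotations) {
            return out; // feature is opt-in, as suggested in the issue
        }
        for (Note n : notes) {
            if (!"FreeText".equals(n.subtype)) {
                continue; // only the FreeText sub-type is visited
            }
            StringBuilder sb = new StringBuilder();
            for (String part : new String[] {n.subject, n.titlePopup, n.contents}) {
                if (part != null && !part.trim().isEmpty()) {
                    if (sb.length() > 0) {
                        sb.append(' ');
                    }
                    sb.append(part.trim());
                }
            }
            if (sb.length() > 0) {
                out.add(new PositionedText(sb.toString(), n.x, n.y));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Note> notes = Arrays.asList(
            new Note("FreeText", "Review", null, "Fix typo", 10f, 20f),
            new Note("Link", null, null, "not extracted", 0f, 0f));
        for (PositionedText t : extract(notes, true)) {
            System.out.println(t.text + " @ (" + t.x + ", " + t.y + ")");
        }
    }
}
```

Merging the extracted text into the article that each annotation sits "over" would additionally need a comparison of the annotation rectangle against the article bounds, which this sketch leaves out.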