Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/10/10 21:45:35 UTC

[jira] [Updated] (PDFBOX-1143) PDFTextStripper doesn't process text annotations

     [ https://issues.apache.org/jira/browse/PDFBOX-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-1143:
--------------------------------
    Affects Version/s: 1.7.0

> PDFTextStripper doesn't process text annotations
> ------------------------------------------------
>
>                 Key: PDFBOX-1143
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1143
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.7.0
>            Reporter: Michael McCandless
>            Priority: Minor
>             Fix For: 2.0.0
>
>
> Users are able to add annotations (comments) to a PDF, and PDFBox
> processes them correctly: you can retrieve them via
> PDPage.getAnnotations.
> But PDFTextStripper currently doesn't extract the text from
> annotations.
> I think it should, optionally?
> I think we'd add a boolean (shouldProcessAnnotations?), and if
> enabled, we'd visit the annotations that have sub-type FreeText,
> extract what text we can (Subject, TitlePopup, Contents, maybe
> RichContents?), use each annotation's getRectangle() to make a
> TextPosition for that text, and then somehow associate that with the
> right "article" (so that annotations "over" a given article are
> rendered with it).
> Alternatively, could we just put all annotations into their own
> "article"?
> I'm not familiar enough with PDF text positioning or PDFTextStripper
> to work out a real patch here, but I think this approach should
> work?
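
For illustration only, the simpler alternative from the description (put
all annotation text into its own "article") could be sketched roughly as
below. This is not PDFBox code and not a patch: SketchAnnotation and
AnnotationTextSketch are hypothetical stand-ins, the setter name is just
the boolean suggested in the issue, and sorting by the rectangle's
position is an assumption about what a sensible reading order might be.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a PDF annotation; NOT a PDFBox class.
final class SketchAnnotation {
    final String subtype;   // e.g. "FreeText" or "Link"
    final String subject;   // may be null
    final String title;     // may be null
    final String contents;  // may be null
    final float x;          // lower-left x of the annotation rectangle
    final float y;          // lower-left y of the annotation rectangle

    SketchAnnotation(String subtype, String subject, String title,
                     String contents, float x, float y) {
        this.subtype = subtype;
        this.subject = subject;
        this.title = title;
        this.contents = contents;
        this.x = x;
        this.y = y;
    }
}

// Hypothetical stand-in for the stripper; NOT PDFTextStripper itself.
final class AnnotationTextSketch {
    // The boolean suggested in the issue; off by default so that
    // existing output would not change.
    private boolean shouldProcessAnnotations = false;

    void setShouldProcessAnnotations(boolean value) {
        this.shouldProcessAnnotations = value;
    }

    // Returns the page's articles, plus one extra trailing "article"
    // holding the text of all FreeText annotations when enabled.
    List<String> extractArticles(List<String> pageArticles,
                                 List<SketchAnnotation> annotations) {
        List<String> result = new ArrayList<>(pageArticles);
        if (!shouldProcessAnnotations) {
            return result;
        }
        // Assumption: PDF user space has y growing upwards, so a larger
        // y means higher on the page. Order top-to-bottom, then
        // left-to-right.
        List<SketchAnnotation> sorted = new ArrayList<>(annotations);
        sorted.sort(Comparator
                .comparingDouble((SketchAnnotation a) -> -a.y)
                .thenComparingDouble(a -> a.x));

        StringBuilder article = new StringBuilder();
        for (SketchAnnotation a : sorted) {
            if (!"FreeText".equals(a.subtype)) {
                continue; // only the sub-type named in the issue
            }
            for (String part : new String[] {a.subject, a.title, a.contents}) {
                if (part != null && !part.isEmpty()) {
                    if (article.length() > 0) {
                        article.append('\n');
                    }
                    article.append(part);
                }
            }
        }
        if (article.length() > 0) {
            result.add(article.toString());
        }
        return result;
    }
}
```

Keeping the annotation text in a separate trailing article avoids the
harder problem of deciding which existing article an annotation sits
"over"; that association would need real TextPosition handling, which
this sketch does not attempt.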



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)