Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2013/12/10 02:16:07 UTC
[jira] [Resolved] (TIKA-973) PDF form data isn't included in extracted content.
[ https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-973.
------------------------------
Resolution: Fixed
Fix Version/s: 1.5
Fixed in r1549727.
> PDF form data isn't included in extracted content.
> --------------------------------------------------
>
> Key: TIKA-973
> URL: https://issues.apache.org/jira/browse/TIKA-973
> Project: Tika
> Issue Type: Bug
> Components: general
> Affects Versions: 1.2
> Reporter: Michael Graessle
> Priority: Minor
> Fix For: 1.5
>
> Attachments: TIKA-973-patch.tar.gz, TIKA-973.patch.tar.gz, i-9_screenshot.png
>
>
> When extracting content from PDFs, PDF form data isn't extracted.
> The following code extracts this data via PDFBox, but it seems like something Tika should be doing itself.
> PDDocumentCatalog docCatalog = load.getDocumentCatalog();
> if (docCatalog != null) {
>     PDAcroForm acroForm = docCatalog.getAcroForm();
>     if (acroForm != null) {
>         @SuppressWarnings("unchecked")
>         List<PDField> fields = acroForm.getFields();
>         if (fields != null && fields.size() > 0) {
>             documentContent.append(" ");
>             for (PDField field : fields) {
>                 if (field.getValue() != null) {
>                     documentContent.append(field.getValue());
>                     documentContent.append(" ");
>                 }
>             }
>         }
>     }
> }
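
The collection logic in the reporter's snippet (append each non-null field value, separated by spaces) can be tried without a PDF at hand. The following is only a sketch under stated assumptions: `FormValueSketch`, `FormField`, and `collectValues` are hypothetical names invented for illustration and stand in for the PDFBox types; they are not part of Tika or PDFBox, and this is not the change committed in r1549727.

```java
import java.util.Arrays;
import java.util.List;

public class FormValueSketch {

    /** Hypothetical stand-in for a PDF form field; not a PDFBox or Tika type. */
    static final class FormField {
        private final String name;
        private final String value; // null when the field was left blank

        FormField(String name, String value) {
            this.name = name;
            this.value = value;
        }

        String getValue() {
            return value;
        }
    }

    /** Appends every non-null field value to the content, separated by spaces. */
    static String collectValues(String content, List<FormField> fields) {
        StringBuilder documentContent = new StringBuilder(content);
        if (fields != null && !fields.isEmpty()) {
            documentContent.append(" ");
            for (FormField field : fields) {
                if (field.getValue() != null) {
                    documentContent.append(field.getValue());
                    documentContent.append(" ");
                }
            }
        }
        return documentContent.toString();
    }

    public static void main(String[] args) {
        List<FormField> fields = Arrays.asList(
                new FormField("lastName", "Doe"),
                new FormField("middleInitial", null), // blank field is skipped
                new FormField("firstName", "Jane"));
        System.out.println(collectValues("Page text", fields));
    }
}
```

As in the snippet above, a blank field contributes nothing to the output, and a trailing space is left after the last value.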
--
This message was sent by Atlassian JIRA
(v6.1.4#6159)