Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2013/06/27 03:30:21 UTC

[jira] [Updated] (TIKA-973) PDF form data isn't included in extracted content.

     [ https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-973:
-----------------------------

    Attachment: TIKA-973-patch.tar.gz

Patch attached.  It dumps the contents of PDF forms at the end of the document.

AcroForm field names are carried as metadata in attribute values.  The basic format is an <ol>.
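
To make the output shape concrete, here is a minimal sketch of how field name/value pairs could be rendered as an <ol> with the field name in an attribute. It is not taken from the patch: the class name FormFieldListSketch, the use of one <li> per field, and the attribute name "fieldName" are all assumptions made for illustration, and the patch itself may use different element and attribute names. It uses only the Java standard library, with a plain Map standing in for the AcroForm fields.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FormFieldListSketch {

    // Escape the characters that are unsafe in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Render field name/value pairs as an ordered list.
    // "fieldName" is a hypothetical attribute name chosen for this sketch.
    static String render(Map<String, String> fields) {
        StringBuilder out = new StringBuilder("<ol>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (e.getValue() == null) {
                continue; // skip fields that have no value, as the issue's snippet does
            }
            out.append("<li fieldName=\"").append(escape(e.getKey())).append("\">")
               .append(escape(e.getValue()))
               .append("</li>");
        }
        return out.append("</ol>").toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the list order is stable.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("applicant.name", "A & B Ltd");
        fields.put("applicant.phone", null);
        System.out.println(render(fields));
    }
}
```

Keeping the field name in an attribute rather than in the element text means plain-text consumers see only the values, while XHTML consumers can still recover which field each value came from.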

Let me know how this looks.

Thank you, Ben Litchfield, for org.apache.pdfbox.examples.fdf.PrintFields

                
> PDF form data isn't included in extracted content.
> --------------------------------------------------
>
>                 Key: TIKA-973
>                 URL: https://issues.apache.org/jira/browse/TIKA-973
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.2
>            Reporter: Michael Graessle
>            Priority: Minor
>         Attachments: TIKA-973-patch.tar.gz
>
>
> When extracting content from PDFs, PDF form data isn't extracted.
> The following code extracts this data via PDFBox, but it seems like something Tika should be doing.
> PDDocumentCatalog docCatalog = load.getDocumentCatalog();
> if (docCatalog != null) {
>   PDAcroForm acroForm = docCatalog.getAcroForm();
>   if (acroForm != null) {
>     @SuppressWarnings("unchecked")
>     List<PDField> fields = acroForm.getFields();
>     if (fields != null && fields.size() > 0) {
>       documentContent.append(" ");
>       for (PDField field : fields) {
>         if (field.getValue() != null) {
>           documentContent.append(field.getValue());
>           documentContent.append(" ");
>         }
>       }
>     }
>   }
> }
