Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2016/03/02 03:26:18 UTC
[jira] [Resolved] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-1857.
-------------------------------
Resolution: Fixed
[~pascal.essiembre], thank you for this pull request! I made a few modifications, but we now have basic XFA processing, thanks to you. To obtain the XFA-only behavior, you'll need to do something like this:
{noformat}
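import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;

// If the PDF contains an XFA form, extract only the XFA content.
ParseContext context = new ParseContext();
PDFParserConfig config = new PDFParserConfig();
config.setIfXFAExtractOnlyXFA(true);
context.set(PDFParserConfig.class, config);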
{noformat}
[~msahyoun], thank you, again, for helping me understand XFA and AcroForms!
For posterity, here are some areas for improvement in XFA parsing:
* handle metadata stored in <desc> section (govdocs1: 754282.pdf, 982106.pdf)
* handle PDF metadata (access permissions, etc.) in <pdf> element
* extract different types of URIs as metadata
* add extraction of <image> data (govdocs1: 754282.pdf)
* add computation of traversal order for fields
* figure out when text extracted from XFA fields duplicates text extracted from the rest of the PDF, and do this efficiently
* avoid duplication with <speak> and <tooltip> elements
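To make the last item concrete, here is a minimal sketch of one way to skip <speak> and <tooltip> text while collecting the rest. It is an illustration only, not the code committed for this issue: the class name XfaTextSketch and the method extractText are hypothetical, it uses the JDK's StAX reader on an XML string rather than Tika's own XFA handling, and it assumes element names can be matched case-insensitively by local name.

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Tika.
final class XfaTextSketch {

    // Lower-cased local names of elements whose text should be ignored.
    private static final Set<String> SKIPPED = Set.of("speak", "tooltip");

    private XfaTextSketch() {
    }

    /** Returns the character data of the XML, space-separated, minus skipped elements. */
    static String extractText(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Report adjacent text as a single CHARACTERS event.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        // Do not process DTDs or external entities from untrusted documents.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);

        StringBuilder out = new StringBuilder();
        int skipDepth = 0; // > 0 while inside a skipped element or its descendants
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName().toLowerCase(Locale.ROOT);
                    if (skipDepth > 0 || SKIPPED.contains(name)) {
                        skipDepth++;
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (skipDepth > 0) {
                        skipDepth--;
                    }
                } else if (event == XMLStreamConstants.CHARACTERS && skipDepth == 0) {
                    String text = reader.getText().trim();
                    if (!text.isEmpty()) {
                        if (out.length() > 0) {
                            out.append(' ');
                        }
                        out.append(text);
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML", e);
        }
        return out.toString();
    }
}
```

For a field with a caption of "Name", a <speak> and a <toolTip> child, and a value of "Alice", extractText should return "Name Alice". Tracking a depth counter rather than a boolean keeps the skip correct if a skipped element ever contains nested markup.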
> Enhance PDFParser to extract text from XFA forms
> ------------------------------------------------
>
> Key: TIKA-1857
> URL: https://issues.apache.org/jira/browse/TIKA-1857
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Pascal Essiembre
> Priority: Trivial
> Labels: patch
> Fix For: 1.13
>
> Attachments: 041617_filled_out.pdf, govdocs1_xfas.zip, xfa_in_govdocs1.txt
>
>
> Extract text from PDF Forms (XFA). Information about XFA: https://en.wikipedia.org/wiki/XFA
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)