Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2016/03/02 03:26:18 UTC

[jira] [Resolved] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

     [ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-1857.
-------------------------------
    Resolution: Fixed

[~pascal.essiembre], thank you for this pull request!  I made a few modifications, but we now have basic XFA processing, thanks to you.  To obtain the XFA-only behavior, you'll need to do something like this:

{noformat}
        ParseContext context = new ParseContext();
        PDFParserConfig config = new PDFParserConfig();
        config.setIfXFAExtractOnlyXFA(true);
        context.set(PDFParserConfig.class, config);
{noformat}
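
For context, here is a minimal sketch of how that ParseContext might be passed through a full parse call. It assumes Tika 1.13 or later with the tika-parsers module on the classpath; the class name XfaOnlyExample, the helper method names, and the choice of AutoDetectParser with BodyContentHandler are illustrative assumptions, not part of the original comment.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical example class; only the PDFParserConfig/ParseContext calls
// come from the comment above.
public class XfaOnlyExample {

    // Builds a ParseContext that requests XFA-only extraction from the PDF parser.
    static ParseContext xfaOnlyContext() {
        PDFParserConfig config = new PDFParserConfig();
        config.setIfXFAExtractOnlyXFA(true);
        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, config);
        return context;
    }

    // Parses the given file and returns the extracted body text.
    static String extractText(Path pdf) throws Exception {
        Parser parser = new AutoDetectParser();
        // A write limit of -1 disables BodyContentHandler's default character cap.
        BodyContentHandler handler = new BodyContentHandler(-1);
        try (InputStream in = Files.newInputStream(pdf)) {
            parser.parse(in, handler, new Metadata(), xfaOnlyContext());
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumed usage: pass the path to a PDF as the first argument.
        System.out.println(extractText(Paths.get(args[0])));
    }
}
```

As the setter name suggests, the flag should only matter when the PDF actually contains an XFA form; other PDFs would be expected to go through the regular extraction path.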

[~msahyoun], thank you, again, for helping me understand XFA and Acroforms!

For posterity, here are some areas for improvement in XFA parsing:
 *     handle metadata stored in <desc> section (govdocs1: 754282.pdf, 982106.pdf)
 *     handle pdf metadata (access permissions, etc.) in <pdf> element
 *     extract different types of uris as metadata
 *     add extraction of <image> data (govdocs1: 754282.pdf)
 *     add computation of traversal order for fields
 *     figure out when text extracted from XFA fields duplicates text already
       extracted from the rest of the PDF, and do so efficiently
 *     avoid duplication with <speak> and <tooltip> elements


> Enhance PDFParser to extract text from XFA forms
> ------------------------------------------------
>
>                 Key: TIKA-1857
>                 URL: https://issues.apache.org/jira/browse/TIKA-1857
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Pascal Essiembre
>            Priority: Trivial
>              Labels: patch
>             Fix For: 1.13
>
>         Attachments: 041617_filled_out.pdf, govdocs1_xfas.zip, xfa_in_govdocs1.txt
>
>
> Extract text from PDF Forms (XFA).  Information about XFA: https://en.wikipedia.org/wiki/XFA
