Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2016/02/16 20:50:18 UTC

[jira] [Comment Edited] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

    [ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149014#comment-15149014 ] 

Tim Allison edited comment on TIKA-1857 at 2/16/16 7:49 PM:
------------------------------------------------------------

from TIKA-1607's [comment|https://issues.apache.org/jira/browse/TIKA-1607?focusedCommentId=15148914&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15148914]

bq. In the case of XFA forms, the form IS the content. 

Got it.  Doh.  Thank you. 

As I look at a few of these docs from govdocs1 w/ XFA data, it looks like the form also contains the PDF's standard metadata (author, etc.), which is not necessarily stored in the older mechanism, the COSDictionary.  govdocs1's {{517660.pdf}} shows this: the author and title can be extracted from the XFA, but that info is not extracted with our current methods.
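
To make the point above concrete, here is a rough sketch of reading field values out of XFA XML once it has already been pulled from the PDF. It is not Tika's or PDFBox's code and uses only the JDK's DOM parser. The class name {{XfaFieldSketch}}, the method {{leafFields}}, and the sample XML with its {{author}} and {{title}} elements are all made up for illustration; real XFA packets are laid out differently and vary from document to document.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Tika or PDFBox.
public class XfaFieldSketch {

    /** Collects the text of every non-empty leaf element, keyed by local element name. */
    public static Map<String, String> leafFields(String xfaXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations, since the XML comes from an untrusted PDF.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xfaXml)));

        Map<String, String> fields = new LinkedHashMap<>();
        collect(doc.getDocumentElement(), fields);
        return fields;
    }

    // Depth-first walk; only elements without child elements count as fields.
    private static void collect(Element element, Map<String, String> fields) {
        boolean hasChildElement = false;
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) {
                hasChildElement = true;
                collect((Element) child, fields);
            }
        }
        if (!hasChildElement) {
            String text = element.getTextContent().trim();
            if (!text.isEmpty()) {
                // Keep the first value seen when a name repeats.
                fields.putIfAbsent(element.getLocalName(), text);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Invented sample data; only the xfa-data namespace URI is real.
        String sample =
            "<xfa:data xmlns:xfa=\"http://www.xfa.org/schema/xfa-data/1.0/\">"
            + "<form1><author>Jane Doe</author><title>Sample Form</title></form1>"
            + "</xfa:data>";
        System.out.println(leafFields(sample));
    }
}
```

A parser could then copy selected entries into its metadata in addition to emitting the field text as content. Which XFA element names correspond to author and title in a given file is an assumption here and would need checking against real documents such as {{517660.pdf}}.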

bq. I'll support whichever way you pick, but I personally can't see use cases where extracting that workaround message is the intent when using Tika. I do see value in keeping the entire DOM though. Maybe you can do as you suggest, but "in addition" to returning the XFA text as the content?

Y, that would be in addition.  Thank you, again.




was (Author: tallison@mitre.org):
from TIKA-1607's [comment|https://issues.apache.org/jira/browse/TIKA-1607?focusedCommentId=15148914&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15148914]

bq. In the case of XFA forms, the form IS the content. 

Got it.  Doh.  Thank you. 

As I look at a few of these docs from govdocs1 w/ XFA data, it looks like the form also contains the PDF's standard metadata (author, etc.), which is not necessarily stored in the older mechanism, the COSDictionary.

bq. I'll support whichever way you pick, but I personally can't see use cases where extracting that workaround message is the intent when using Tika. I do see value in keeping the entire DOM though. Maybe you can do as you suggest, but "in addition" to returning the XFA text as the content?

Y, that would be in addition.  Thank you, again.



> Enhance PDFParser to extract text from XFA forms
> ------------------------------------------------
>
>                 Key: TIKA-1857
>                 URL: https://issues.apache.org/jira/browse/TIKA-1857
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Pascal Essiembre
>            Priority: Trivial
>              Labels: patch
>             Fix For: 1.13
>
>         Attachments: 041617_filled_out.pdf, xfa_in_govdocs1.txt
>
>
> Extract text from PDF Forms (XFA).  Information about XFA: https://en.wikipedia.org/wiki/XFA


