Posted to dev@tika.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/12/24 04:01:01 UTC

[jira] [Commented] (TIKA-2222) Contributing a XFDL Parser

    [ https://issues.apache.org/jira/browse/TIKA-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15774207#comment-15774207 ] 

ASF GitHub Bot commented on TIKA-2222:
--------------------------------------

GitHub user essiembre opened a pull request:

    https://github.com/apache/tika/pull/143

    New XFDL parser for TIKA-2222 contributed by pascal.essiembre

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/essiembre/tika TIKA-2222

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/143.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #143
    
----
commit f6acb7c9b509e98c76c520123e79941071a08ea6
Author: Pascal Essiembre <pa...@norconex.com>
Date:   2016-12-24T03:58:48Z

    New XFDL parser for TIKA-2222 contributed by pascal.essiembre

----


> Contributing a XFDL Parser
> --------------------------
>
>                 Key: TIKA-2222
>                 URL: https://issues.apache.org/jira/browse/TIKA-2222
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>         Environment: Any.
>            Reporter: Pascal Essiembre
>            Priority: Minor
>
> I am considering contributing an XFDL parser, but I first have a few questions. Feel free to close this issue and let me know if this is not the proper channel for such questions.
> XFDL files are XML-based forms that can be stored as regular text or base64-encoded. They contain form field labels, field values, formulas, screen coordinates, etc. Not all of it is relevant, so the default XML parser will extract too much text.
> My question is about what to store as metadata vs. content.
> Some people may want to capture only the form field values, while others may consider the field labels just as important. Because people may be interested in different things, I am thinking of storing each as a separate metadata entry. Doing so may mean that little or no "content" is extracted, which can also be a nuisance to some. So, would it be acceptable to store specific values as both metadata entries (structured) and content (unstructured)? That approach would be the most flexible for users, but is there a concern with having the same information stored in two locations?
> As for the metadata entries themselves, I am thinking this XFDL XML representation...
> {code:xml}
>       <field sid="FieldID">
>          …
>          <value>This is a value.</value>
>          <label>This is a label</label>
>          …
>       </field>
> {code}
> …could be stored as flattened metadata like this:
> {code}
> field.FieldID.value = This is a value.
> field.FieldID.label = This is a label
> {code}
> Any comments on this approach?
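
The flattening proposed in the quoted issue can be sketched as follows. This is only an illustration in Python, not the parser from the pull request: the function names `flatten_fields` and `local_name`, the `<form>` wrapper element in the sample, and the choice to emit the same text as both metadata and content are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix that ElementTree puts on a tag."""
    return tag.rsplit("}", 1)[-1]


def flatten_fields(xfdl_xml, wanted=("value", "label")):
    """Return (metadata, content) for the <field> elements in xfdl_xml.

    metadata maps 'field.<sid>.<child>' to the child's text (structured);
    content is the same text joined by newlines (unstructured).
    """
    root = ET.fromstring(xfdl_xml)
    metadata = {}
    content_parts = []
    for elem in root.iter():
        if local_name(elem.tag) != "field":
            continue
        sid = elem.get("sid")
        if sid is None:
            continue  # without an id there is no stable key to build
        for child in elem:
            name = local_name(child.tag)
            if name not in wanted:
                continue  # skip formulas, coordinates, and so on
            text = (child.text or "").strip()
            if not text:
                continue
            metadata[f"field.{sid}.{name}"] = text
            content_parts.append(text)
    return metadata, "\n".join(content_parts)


# Hypothetical sample: the <form> wrapper is assumed, the <field> is from the issue.
SAMPLE = """<form>
  <field sid="FieldID">
    <value>This is a value.</value>
    <label>This is a label</label>
  </field>
</form>"""

metadata, content = flatten_fields(SAMPLE)
for key, text in metadata.items():
    print(f"{key} = {text}")
```

Passing `wanted=("value",)` would keep only the field values, which is one way a caller could choose between the two preferences mentioned in the issue.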



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)