Posted to dev@tika.apache.org by "Pascal Essiembre (JIRA)" <ji...@apache.org> on 2016/12/21 18:10:00 UTC

[jira] [Created] (TIKA-2222) Contributing a XFDL Parser

Pascal Essiembre created TIKA-2222:
--------------------------------------

             Summary: Contributing a XFDL Parser
                 Key: TIKA-2222
                 URL: https://issues.apache.org/jira/browse/TIKA-2222
             Project: Tika
          Issue Type: Improvement
          Components: parser
         Environment: Any.
            Reporter: Pascal Essiembre
            Priority: Minor


I am considering contributing an XFDL parser, but I have a few questions first. Feel free to close this issue and let me know if this is not the proper channel for such questions.

XFDL files are XML-based forms that can be stored as regular text or base64 encoded. They contain form field labels, field values, formulas, screen coordinates, etc. Not all of this is relevant, so the default XML parser extracts too much text.

My question is about what to store as metadata vs. content.

Some people may want to capture only the form field values, while others may feel that capturing the field labels is just as important. Because people may be interested in different things, I am thinking of storing each as a separate metadata entry. As a result, little or no "content" may be extracted, which can also be a nuisance to some. So, would it be acceptable to store specific values as both metadata entries (structured) and content (unstructured)? That approach would be the most flexible for users, but is there a concern with having the same information stored in two places?
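
To make the question concrete, here is a rough sketch of what I mean by recording each field twice. This is not Tika API: the class name DualOutputSketch, the key names, and the "label: value" content line are placeholders I made up for illustration, and a plain Map and StringBuilder stand in for whatever the parser would really write to.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not Tika API: each extracted form field is recorded
// twice, once as structured metadata entries and once as plain-text content.
final class DualOutputSketch {
    // Stand-in for the structured metadata a parser would populate.
    final Map<String, String> metadata = new LinkedHashMap<>();
    // Stand-in for the unstructured text content a parser would emit.
    final StringBuilder content = new StringBuilder();

    void addField(String id, String label, String value) {
        // Structured: one entry per piece of information (placeholder key names).
        metadata.put("field." + id + ".label", label);
        metadata.put("field." + id + ".value", value);
        // Unstructured: the same information as one line of text.
        content.append(label).append(": ").append(value).append('\n');
    }

    public static void main(String[] args) {
        DualOutputSketch out = new DualOutputSketch();
        out.addField("FieldID", "This is a label", "This is a value.");
        System.out.println(out.metadata);
        System.out.print(out.content);
    }
}
```

Users who only care about values could read the metadata entries, while users who just want searchable text would still get non-empty content. The cost is the duplication I am asking about.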

As for the metadata entries themselves, I am thinking this XFDL XML representation...

{code:xml}
      <field sid="FieldID">
         …
         <value>This is a value.</value>
         <label>This is a label</label>
         …
      </field>
{code}

…could be stored as flattened metadata like this:

{code}
field.FieldID.value= This is a value.
field.FieldID.label= This is a label
{code}
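
The flattening above could be sketched roughly as follows, using only the JDK's DOM parser. This is an illustration under assumptions, not a proposed implementation: XfdlFieldFlattener is a hypothetical name, a plain Map stands in for Tika's metadata object, the input is assumed to be un-encoded XML, and the field elements are assumed to carry no namespace prefix.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of Tika: flattens <field> elements into
// "field.<sid>.value" and "field.<sid>.label" entries.
final class XfdlFieldFlattener {

    static Map<String, String> flatten(String xml) {
        Map<String, String> entries = new LinkedHashMap<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations so untrusted input cannot trigger XXE.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            // Assumption: elements are named exactly "field" (no namespace prefix).
            NodeList fields = doc.getElementsByTagName("field");
            for (int i = 0; i < fields.getLength(); i++) {
                Element field = (Element) fields.item(i);
                String sid = field.getAttribute("sid");
                // Keep only direct <value> and <label> children; skip everything else.
                for (Node child = field.getFirstChild(); child != null; child = child.getNextSibling()) {
                    if (child.getNodeType() != Node.ELEMENT_NODE) {
                        continue;
                    }
                    String name = child.getNodeName();
                    if (name.equals("value") || name.equals("label")) {
                        entries.put("field." + sid + "." + name, child.getTextContent().trim());
                    }
                }
            }
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
        return entries;
    }

    public static void main(String[] args) {
        String xml = "<form><field sid=\"FieldID\">"
                + "<value>This is a value.</value>"
                + "<label>This is a label</label>"
                + "</field></form>";
        flatten(xml).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Run on the sample field, this prints field.FieldID.value=This is a value. followed by field.FieldID.label=This is a label. Formulas, coordinates, and other child elements are skipped, which is the filtering the default XML parser does not do.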

Any comments on this approach?


