Posted to dev@tika.apache.org by "Christopher Creutzig (JIRA)" <ji...@apache.org> on 2017/08/14 13:47:00 UTC
[jira] [Created] (TIKA-2442) Non-terminal interactive form fields not handled recursively

Christopher Creutzig created TIKA-2442:
------------------------------------------

             Summary: Non-terminal interactive form fields not handled recursively
                 Key: TIKA-2442
                 URL: https://issues.apache.org/jira/browse/TIKA-2442
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.14
            Reporter: Christopher Creutzig


(I am not sure whether this is a Tika or a PDFBox problem. I looked for a form extractor in PDFBox, but the app API does not have one. PDFDebugger does show me the expected tree structure.)

The attached PDF has a non-terminal field named “parent” with two children, “child1” and “child2”. According to section 8.6 of the PDF spec, the fully qualified field names should be parent.child1 and parent.child2. That is the output given by pdftk:

> pdftk simple-form.pdf dump_data_fields
---
FieldType: Text
FieldName: parent.child1
FieldFlags: 0
FieldValue: child1 value
FieldJustification: Left
---
FieldType: Text
FieldName: parent.child2
FieldFlags: 0
FieldValue: child2 value
FieldJustification: Left

Tika with the ToXMLContentHandler, however, seems to silently ignore the children and returns only a parent with no value.

Calling code:

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.ToXMLContentHandler;

class readAsXHTML {
  // Parses the given file with Tika and returns the XHTML output as a string.
  public static String readAsXHTML(String filename) throws Exception {
    ToXMLContentHandler handler = new ToXMLContentHandler();
    Detector detector = new DefaultDetector();
    Parser parser = new AutoDetectParser(detector);
    ParseContext context = new ParseContext();
    Metadata metadata = new Metadata();

    // try-with-resources closes the stream even if parsing fails
    try (InputStream fh = new FileInputStream(filename)) {
      parser.parse(fh, handler, metadata, context);
      return handler.toString();
    }
  }
}


Abbreviated output:

<body><div class="page"><p />
</div>
<div class="acroform"><ol>	<li>parent: </li>
</ol>
</div>
</body>

Expected:
<body><div class="page"><p />
</div>
<div class="acroform"><ol>
  <li>parent.child1: child1 value</li>
  <li>parent.child2: child2 value</li>
</ol>
</div>
</body>
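
For illustration, here is a minimal sketch of the recursive traversal that would produce the expected names. The Field class and the collect method are hypothetical stand-ins to show the naming rule (each partial name joined to its ancestors with a period); they are not Tika or PDFBox types, and this is not how either library is implemented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FieldNames {
  // Hypothetical stand-in for a form field; not a PDFBox or Tika class.
  static class Field {
    final String partialName;
    final String value; // null for non-terminal fields
    final List<Field> kids = new ArrayList<>();

    Field(String partialName, String value) {
      this.partialName = partialName;
      this.value = value;
    }

    Field add(Field kid) {
      kids.add(kid);
      return this;
    }
  }

  // Records "fully qualified name -> value" for every terminal field at or below `field`.
  static void collect(Field field, String prefix, Map<String, String> out) {
    String name = prefix.isEmpty() ? field.partialName : prefix + "." + field.partialName;
    if (field.kids.isEmpty()) {
      // Terminal field: it carries the value.
      out.put(name, field.value);
      return;
    }
    // Non-terminal field: descend into the children instead of stopping here.
    for (Field kid : field.kids) {
      collect(kid, name, out);
    }
  }

  public static void main(String[] args) {
    // Same shape as the attached simple-form.pdf: one parent, two children.
    Field parent = new Field("parent", null)
        .add(new Field("child1", "child1 value"))
        .add(new Field("child2", "child2 value"));

    Map<String, String> fields = new LinkedHashMap<>();
    collect(parent, "", fields);
    for (Map.Entry<String, String> e : fields.entrySet()) {
      System.out.println(e.getKey() + ": " + e.getValue());
    }
  }
}
```

Run against the tree above, this prints one line per terminal field, matching the two list items in the expected output.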



