Posted to dev@tika.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2018/03/09 18:04:00 UTC

[jira] [Comment Edited] (TIKA-2442) Non-terminal interactive form fields not handled recursively

    [ https://issues.apache.org/jira/browse/TIKA-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393296#comment-16393296 ] 

Tilman Hausherr edited comment on TIKA-2442 at 3/9/18 6:04 PM:
---------------------------------------------------------------

Isn't this issue solved? (I stumbled upon it while searching for something else)


was (Author: tilman):
Isn't this issue solved? (I stumbled up it while searching for something else)

> Non-terminal interactive form fields not handled recursively
> ------------------------------------------------------------
>
>                 Key: TIKA-2442
>                 URL: https://issues.apache.org/jira/browse/TIKA-2442
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.14
>            Reporter: Christopher Creutzig
>            Priority: Major
>         Attachments: simple-form.pdf
>
>
> (I am not sure if this is a Tika or a PDFBox problem; I tried finding a form extractor in PDFBox, but the app api does not have one. PDFDebugger does show me the expected tree structure.)
> The attached PDF has a non-terminal field named “parent” and two children, “child1” and “child2.” According to the PDF spec in section 8.6, the fully qualified field names should be parent.child1 and parent.child2. That is the output given by pdftk:
> > pdftk simple-form.pdf dump_data_fields
> ---
> FieldType: Text
> FieldName: parent.child1
> FieldFlags: 0
> FieldValue: child1 value
> FieldJustification: Left
> ---
> FieldType: Text
> FieldName: parent.child2
> FieldFlags: 0
> FieldValue: child2 value
> FieldJustification: Left
> Tika with the ToXMLContentHandler seems to silently ignore the children, however, returning only a parent with no value.
> Calling code:
> import java.io.FileInputStream;
> import org.apache.tika.detect.DefaultDetector;
> import org.apache.tika.detect.Detector;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.ToXMLContentHandler;
> class readAsXHTML {
>   public static String readAsXHTML(String filename) throws Exception {
>     ToXMLContentHandler handler = new ToXMLContentHandler();
>     Detector detector = new DefaultDetector();
>     Parser parser = new AutoDetectParser(detector);
>     ParseContext context = new ParseContext();
>     Metadata metadata = new Metadata();
>     // try-with-resources closes the stream even if parsing fails
>     try (FileInputStream fh = new FileInputStream(filename)) {
>       parser.parse(fh, handler, metadata, context);
>       return handler.toString();
>     }
>   }
> }
> Abbreviated output:
> <body><div class="page"><p />
> </div>
> <div class="acroform"><ol>	<li>parent: </li>
> </ol>
> </div>
> </body>
> Expected:
> <body><div class="page"><p />
> </div>
> <div class="acroform"><ol>
>   <li>parent.child1: child1 value</li>
>   <li>parent.child2: child2 value</li>
> </ol>
> </div>
> </body>
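
For reference, the naming rule described in the quoted report (a terminal field's fully qualified name is its ancestors' partial names joined with periods) can be sketched as a small recursive walk. This is only an illustration under stated assumptions: the Field class below is a hypothetical stand-in for a form field node, not a PDFBox or Tika type, and the sketch says nothing about how either library actually traverses an AcroForm.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FieldTreeSketch {

    // Hypothetical form field node (assumption; not a PDFBox or Tika class).
    static final class Field {
        final String partialName;
        final String value; // null for non-terminal fields in this sketch
        final List<Field> kids = new ArrayList<>();

        Field(String partialName, String value) {
            this.partialName = partialName;
            this.value = value;
        }

        Field add(Field kid) {
            kids.add(kid);
            return this;
        }
    }

    // Records "fully.qualified.name" -> value for every terminal field under 'field'.
    static void collect(Field field, String prefix, Map<String, String> out) {
        String name = prefix.isEmpty()
                ? field.partialName
                : prefix + "." + field.partialName;
        if (field.kids.isEmpty()) {
            // Terminal field: emit it under its fully qualified name.
            out.put(name, field.value);
            return;
        }
        // Non-terminal field: descend, passing the qualified name as the new prefix.
        for (Field kid : field.kids) {
            collect(kid, name, out);
        }
    }

    public static void main(String[] args) {
        // Mirrors the structure reported for simple-form.pdf.
        Field parent = new Field("parent", null)
                .add(new Field("child1", "child1 value"))
                .add(new Field("child2", "child2 value"));

        Map<String, String> out = new LinkedHashMap<>();
        collect(parent, "", out);
        out.forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```

With the two-child tree from the report, the walk yields the entries parent.child1 and parent.child2, matching the pdftk field names quoted above.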


