
Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/08/14 14:36:00 UTC

[jira] [Commented] (TIKA-2442) Non-terminal interactive form fields not handled recursively

    [ https://issues.apache.org/jira/browse/TIKA-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125756#comment-16125756 ] 

Tim Allison commented on TIKA-2442:
-----------------------------------

Thank you for opening this.  I need to look more carefully to find the source, but this is definitely a bug.  We should be parsing these recursively.

> Non-terminal interactive form fields not handled recursively
> ------------------------------------------------------------
>
>                 Key: TIKA-2442
>                 URL: https://issues.apache.org/jira/browse/TIKA-2442
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.14
>            Reporter: Christopher Creutzig
>         Attachments: simple-form.pdf
>
>
> (I am not sure if this is a Tika or a PDFBox problem; I tried finding a form extractor in PDFBox, but the app api does not have one. PDFDebugger does show me the expected tree structure.)
> The attached PDF has a non-terminal field named “parent” and two children, “child1” and “child2.” According to the PDF spec in section 8.6, the fully qualified field names should be parent.child1 and parent.child2. That is the output given by pdftk:
> > pdftk simple-form.pdf dump_data_fields
> ---
> FieldType: Text
> FieldName: parent.child1
> FieldFlags: 0
> FieldValue: child1 value
> FieldJustification: Left
> ---
> FieldType: Text
> FieldName: parent.child2
> FieldFlags: 0
> FieldValue: child2 value
> FieldJustification: Left
> Tika with the ToXMLContentHandler seems to silently ignore the children, however, returning only a parent with no value.
> Calling code:
> import java.io.FileInputStream;
> import org.apache.tika.detect.DefaultDetector;
> import org.apache.tika.detect.Detector;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.ToXMLContentHandler;
> class readAsXHTML {
>   // Parses the file with Tika and returns the XHTML output as a string.
>   public static String readAsXHTML(String filename) throws Exception {
>     ToXMLContentHandler handler = new ToXMLContentHandler();
>     Detector detector = new DefaultDetector();
>     Parser parser = new AutoDetectParser(detector);
>     ParseContext context = new ParseContext();
>     Metadata metadata = new Metadata();
>     // try-with-resources closes the stream even if parsing fails
>     try (FileInputStream fh = new FileInputStream(filename)) {
>       parser.parse(fh, handler, metadata, context);
>       return handler.toString();
>     }
>   }
> }
> Abbreviated output:
> <body><div class="page"><p />
> </div>
> <div class="acroform"><ol>	<li>parent: </li>
> </ol>
> </div>
> </body>
> Expected:
> <body><div class="page"><p />
> </div>
> <div class="acroform"><ol>
>   <li>parent.child1: child1 value</li>
>   <li>parent.child2: child2 value</li>
> </ol>
> </div>
> </body>
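
The recursive handling requested in the issue can be illustrated with a minimal sketch. It assumes a hypothetical FormField class (not a Tika or PDFBox type) that models only a field's partial name, value, and kids, and it joins partial names with a period to build the fully qualified names that pdftk reports:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a PDF form field; not a PDFBox or Tika class. */
final class FormField {
    final String partialName;                        // the /T entry
    final String value;                              // the /V entry, or null if absent
    final List<FormField> kids = new ArrayList<>();  // child fields; empty for a terminal field

    FormField(String partialName, String value) {
        this.partialName = partialName;
        this.value = value;
    }

    FormField add(FormField kid) {
        kids.add(kid);
        return this;
    }
}

final class FieldTreeWalk {
    /** Collects "fully.qualified.name: value" for every terminal field under the roots. */
    static List<String> flatten(List<FormField> roots) {
        List<String> out = new ArrayList<>();
        for (FormField root : roots) {
            walk(root, "", out);
        }
        return out;
    }

    private static void walk(FormField field, String prefix, List<String> out) {
        String name = prefix.isEmpty() ? field.partialName : prefix + "." + field.partialName;
        if (field.kids.isEmpty()) {
            // Terminal field: emit its qualified name and value.
            out.add(name + ": " + (field.value == null ? "" : field.value));
            return;
        }
        // Non-terminal field: recurse into each kid with the extended prefix.
        for (FormField kid : field.kids) {
            walk(kid, name, out);
        }
    }

    public static void main(String[] args) {
        FormField parent = new FormField("parent", null)
                .add(new FormField("child1", "child1 value"))
                .add(new FormField("child2", "child2 value"));
        flatten(Collections.singletonList(parent)).forEach(System.out::println);
    }
}
```

For a tree shaped like the one described for simple-form.pdf, this yields one line per child, matching the list items under "Expected" in the issue.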



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

RE: [jira] [Commented] (TIKA-2442) Non-terminal interactive form fields not handled recursively

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Thank you, Maruan!  I opened PDFBOX-3898 after breaking out the spec...I may be misreading it, tho!

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org

Re: [jira] [Commented] (TIKA-2442) Non-terminal interactive form fields not handled recursively

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

Hi Tim,

> Am 15.08.2017 um 17:31 schrieb Allison, Timothy B. <ta...@mitre.org>:
> 
> All,
>  I can't tell if the triggering file is corrupt or how we want to handle it on the PDFBox side.  The problem is that the parent node is a PDTextField -- a PDTerminalField -- so we don't/can't look for children, even though it actually does have pointers in Kids.

I had a quick look with the debugger and the file looks fine. There is nothing wrong with a non-terminal field having a field type /FT while its kids (terminal fields) have none. In that case the kids take the field type from the parent.

Which version of PDFBox is Tika 1.14 on?

BR
Maruan
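
The point about kids taking /FT from their parent can be sketched as a lookup that walks up the parent chain. This is only an illustration: FieldDict is a hypothetical, simplified dictionary type (not a PDFBox class), and the value "Tx" for a text field is assumed for the example file. Such a lookup is only appropriate for inheritable entries such as /FT, not for the partial name /T.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified field dictionary; real PDF dictionaries are richer. */
final class FieldDict {
    final Map<String, String> entries = new HashMap<>();
    final FieldDict parent;  // the /Parent reference, or null at the top of the hierarchy

    FieldDict(FieldDict parent) {
        this.parent = parent;
    }

    FieldDict put(String key, String value) {
        entries.put(key, value);
        return this;
    }
}

final class InheritedAttributes {
    /** Returns the nearest value for key, checking the field first and then its ancestors. */
    static String lookup(FieldDict field, String key) {
        for (FieldDict d = field; d != null; d = d.parent) {
            String v = d.entries.get(key);
            if (v != null) {
                return v;
            }
        }
        return null;  // not set anywhere in the chain
    }

    public static void main(String[] args) {
        // The parent carries /FT; the child does not, as in the reported file.
        FieldDict parent = new FieldDict(null).put("T", "parent").put("FT", "Tx");
        FieldDict child1 = new FieldDict(parent).put("T", "child1").put("V", "child1 value");
        System.out.println(lookup(child1, "FT"));
    }
}
```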



FW: [jira] [Commented] (TIKA-2442) Non-terminal interactive form fields not handled recursively

Posted by "Allison, Timothy B." <ta...@mitre.org>.

All,
  I can't tell if the triggering file is corrupt or how we want to handle it on the PDFBox side.  The problem is that the parent node is a PDTextField -- a PDTerminalField -- so we don't/can't look for children, even though it actually does have pointers in Kids.

The output from PrintFields is:

1 top-level fields were found on the form
|--parent.parent = ,  type=org.apache.pdfbox.pdmodel.interactive.form.PDTextField
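
One plausible way to avoid classifying such a node as terminal is to look at its kids rather than at its own /FT entry. The sketch below assumes that a kid carrying a partial name (/T) is itself a field, while a kid without one is a widget annotation. Node and FieldKind are hypothetical names for illustration; this is not the PDFBox implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified dictionary node; not a PDFBox class. */
final class Node {
    final Map<String, String> entries = new HashMap<>();
    final List<Node> kids = new ArrayList<>();  // the /Kids array

    Node put(String key, String value) {
        entries.put(key, value);
        return this;
    }

    Node kid(Node k) {
        kids.add(k);
        return this;
    }
}

final class FieldKind {
    /**
     * Treats a node as non-terminal when at least one kid has a partial name (/T),
     * i.e. the kid is assumed to be a field rather than a widget annotation.
     * Whether the node itself has /FT is deliberately not consulted.
     */
    static boolean isNonTerminal(Node node) {
        for (Node k : node.kids) {
            if (k.entries.containsKey("T")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Shape of the reported file: the parent has /FT and its kids are named fields.
        Node parent = new Node().put("T", "parent").put("FT", "Tx")
                .kid(new Node().put("T", "child1"))
                .kid(new Node().put("T", "child2"));
        System.out.println(isNonTerminal(parent));
    }
}
```

Under this test, a text field whose kids are only widget annotations would still be treated as terminal, while the parent node from the reported file would be descended into.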
