Posted to user@tika.apache.org by Jim Idle <ji...@proofpoint.com> on 2017/06/04 08:33:52 UTC

Detecting document format/parsing problems

When using direct Java calls and the AutoDetectParser, I notice that if a document is deliberately (malware) or accidentally (some bug, say) corrupt or badly formatted, the underlying parsers will often log an error, but this is not passed on by Tika.

Any examples out there on how I can be informed of parsing errors? Basically I would like to know that the document has format problems and as much info as I can about what is wrong (though in fact I could live with just counting the number of errors if that's all that can be done), but I don't want to stop the parse if the underlying parser can recover (good to know if it aborts before finishing though).

Jim

RE: Detecting document format/parsing problems

Posted by Jim Idle <ji...@proofpoint.com>.
Tim,

Thanks for the advice. This sounds like what I want, and I will give it a go and get back to you. I have not really looked at RecursiveParserWrapper in depth; when I first looked at it, I seem to recall there was a reason I could not use it. That may no longer be the case, though, so I will look again.

Jim

RE: Detecting document format/parsing problems

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Hi Jim,

On a second read, I don't _think_ there's a good way to do this currently, although there are subtleties in how "underlying parsers" deal with different types of errors.

For example, if PDFBox's parser logs an "I can't find the Unicode mapping for Font X" message, you're right: Tika doesn't let you know about this because Tika itself doesn't know about it.

If, however, the dependent parser throws an exception that can be recovered from, Tika sometimes does know about this and will let you know. For example, Tika's PDFParser might catch an IOException on page 3 and then try to parse page 4; it will throw the page 3 exception after it has finished parsing the document.
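
To make the "catch, keep going, throw at the end" behavior concrete, here is a minimal plain-Java sketch of the pattern. It is not Tika's actual PDFParser code; the class, interface, and method names are all hypothetical, and the `out` list simply stands in for whatever receives the extracted content.

```java
import java.io.IOException;
import java.util.List;

// Illustrative only: every name here is hypothetical, none of it is Tika API.
class DeferredExceptionSketch {

    /** Hypothetical per-page extraction step that may fail. */
    interface PageExtractor {
        String extract(int pageNumber) throws IOException;
    }

    /**
     * Tries every page, appending whatever can be read to {@code out}.
     * The first IOException is remembered and rethrown only after the
     * last page has been attempted, so later pages are not lost.
     */
    static void parse(int pageCount, PageExtractor extractor, List<String> out)
            throws IOException {
        IOException firstFailure = null;
        for (int page = 1; page <= pageCount; page++) {
            try {
                out.add(extractor.extract(page));
            } catch (IOException e) {
                if (firstFailure == null) {
                    firstFailure = e; // keep going; report once parsing is done
                }
            }
        }
        if (firstFailure != null) {
            throw firstFailure;
        }
    }
}
```

A caller that wants both the recovered content and the error would wrap the call in a try/catch: the content collected in `out` is still available inside the catch block, and the caught exception indicates that the parse was not clean.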

Generally speaking, with embedded documents, Tika's AutoDetectParser's legacy behavior has been to swallow exceptions.  So, if you're trying to identify exceptions in embedded files (e.g. macros), I'd strongly recommend using the RecursiveParserWrapper (-J option in tika-app, /rmeta endpoint in tika-server).  Unlike the AutoDetectParser, the RecursiveParserWrapper catches exceptions and records them in a field in the metadata [1].

That's the behavior if a parser throws an exception on an embedded document.  However, if a parent document (let's say a .doc file) has problems handling an embedded InputStream (say with an embedded image), that exception will be stored in the metadata of the .doc file[2].
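
Since both kinds of exception end up as metadata fields, counting them comes down to scanning the metadata of each document in the result. Below is a rough plain-Java sketch, assuming each document's metadata has already been converted to a map from key to list of values (roughly the shape of the JSON that /rmeta returns). The key prefix and the idea of matching on it are assumptions on my part, not something confirmed above; please check the actual key names behind the constants at [1] and [2] before relying on them.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: models metadata as plain maps rather than using Tika classes.
class ExceptionFieldCounter {

    // ASSUMPTION: exception-related metadata keys share this prefix.
    // Verify against the constants referenced at [1] and [2].
    static final String ASSUMED_EXCEPTION_PREFIX = "X-TIKA:EXCEPTION:";

    /**
     * Returns, for each metadata key starting with {@code prefix}, the total
     * number of values recorded under that key across all documents
     * (the container document plus its embedded documents).
     */
    static Map<String, Integer> countExceptionFields(
            List<Map<String, List<String>>> metadataList, String prefix) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, List<String>> metadata : metadataList) {
            for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
                if (entry.getKey().startsWith(prefix)) {
                    counts.merge(entry.getKey(), entry.getValue().size(), Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

An empty result would mean no exception fields were recorded; it would not mean the underlying parsers logged nothing, since logged-only problems never reach the metadata.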

In short, things are complicated.  Please let us know if we can modify our code or documentation to help your use cases.

Best,

             Tim


[1] https://tika.apache.org/1.15/api/org/apache/tika/parser/RecursiveParserWrapper.html#EMBEDDED_EXCEPTION

[2] https://tika.apache.org/1.15/api/org/apache/tika/metadata/TikaCoreProperties.html#TIKA_META_EXCEPTION_EMBEDDED_STREAM
