Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2015/10/09 13:57:26 UTC

[jira] [Comment Edited] (TIKA-1764) Provide information on failed document parsing in ParsingEmbeddedDocumentExtractor

    [ https://issues.apache.org/jira/browse/TIKA-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950277#comment-14950277 ] 

Tim Allison edited comment on TIKA-1764 at 10/9/15 11:57 AM:
-------------------------------------------------------------

Yes, I completely agree that we all need to see when embedded documents are failing.  The RecursiveParserWrapper allowed me to discover TIKA-1651, for example, and I suspect that there are lots of other discoveries to be made with embedded objects.

I think I now remember why I haven't gotten around to fixing this...

The problem with logging the full metadata at that point in the code is that, when parsing via the standard AutoDetectParser, the metadata object contains no information about the container document.  So, all you'd get would be the detected mime type, the embedded object's name, and any metadata that was pulled out before the parse failed.  In short, without other changes in our code, there would be no way to link that stacktrace or the metadata back to the source document with the AutoDetectParser.

I (re)tested this just now to confirm.  I truncated a ppt file and zipped it up.  This is what I got at that point in the code:
{noformat}
inside parsingEmbdeddedExtractor: date ; 2015-10-09T11:27:28Z
inside parsingEmbdeddedExtractor: embeddedRelationshipId ; testPPT_comment_broken.ppt
inside parsingEmbdeddedExtractor: X-Parsed-By ; org.apache.tika.parser.DefaultParser
inside parsingEmbdeddedExtractor: meta:save-date ; 2015-10-09T11:27:28Z
inside parsingEmbdeddedExtractor: modified ; 2015-10-09T11:27:28Z
inside parsingEmbdeddedExtractor: resourceName ; testPPT_comment_broken.ppt
inside parsingEmbdeddedExtractor: dcterms:modified ; 2015-10-09T11:27:28Z
inside parsingEmbdeddedExtractor: Last-Modified ; 2015-10-09T11:27:28Z
inside parsingEmbdeddedExtractor: Content-Length ; 63760
inside parsingEmbdeddedExtractor: Last-Save-Date ; 2015-10-09T11:27:28Z
inside parsingEmbdeddedExtractor: Content-Type ; application/vnd.ms-powerpoint
{noformat}

So, the only way to get the container doc's information would be to cache it as you're parsing the embedded documents and transmit that information through the ParseContext.  This is exactly what the RecursiveParserWrapper does, so I'm not sure that we'd want to modify anything within Tika to solve this problem; I think the existing solution is sufficient.
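
To make the "cache it and pass it through the context" idea concrete, here is a minimal sketch in plain Java.  It does not use Tika's classes: {{SketchContext}}, {{ContainerInfo}}, {{EmbeddedFailureSketch}} and the container name {{test.zip}} are all hypothetical stand-ins, and the failure is simulated.  It only shows how container information stored in a type-keyed context could be picked up in the catch block when an embedded entry fails.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a parse context: a type-keyed bag of objects. */
final class SketchContext {
    private final Map<Class<?>, Object> values = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        values.put(key, value);
    }

    /** Returns the stored object for the key, or null if nothing was set. */
    <T> T get(Class<T> key) {
        return key.cast(values.get(key));
    }
}

/** Hypothetical holder for what the caller knows about the container document. */
final class ContainerInfo {
    final String path;

    ContainerInfo(String path) {
        this.path = path;
    }
}

final class EmbeddedFailureSketch {
    /** Collected warnings; real code would hand these to a logger instead. */
    static final List<String> WARNINGS = new ArrayList<>();

    /** Simulates parsing one embedded entry; names ending in "_broken.ppt" fail. */
    static void parseEmbedded(String entryName, SketchContext context) {
        try {
            if (entryName.endsWith("_broken.ppt")) {
                throw new IllegalStateException("truncated file");
            }
        } catch (IllegalStateException e) {
            // Look up the cached container information so the failure can be
            // tied back to the source document rather than silently skipped.
            ContainerInfo container = context.get(ContainerInfo.class);
            String containerPath = (container == null) ? "<unknown container>" : container.path;
            WARNINGS.add(containerPath + "/" + entryName + ": " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        SketchContext context = new SketchContext();
        // Cache the container's identity before descending into its entries.
        context.set(ContainerInfo.class, new ContainerInfo("test.zip"));
        parseEmbedded("testPPT_comment_broken.ppt", context);
        System.out.println(WARNINGS);
    }
}
```

Without the {{context.set(...)}} call, the warning can only name the embedded entry itself, which is the situation described above for the standard AutoDetectParser.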

If you're using Solr Cell... I opened a ticket a while ago to parameterize the use of the RecursiveParserWrapper in Solr Cell/DIH (SOLR-7229), but I haven't worked on it at all.  If you'd like to help on that by giving feedback on what you'd need, I think the Solr community would be receptive.  We had very quick commits on SOLR-7189 and SOLR-7231.

As a side note, I would very strongly encourage you to support SOLR-7632 and move Tika out of the same JVM that is sending updates to Solr.  I don't think this should be the default, but I do think that users should be able to configure the use of tika-server instead of the current embedded use of Tika.
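
For a rough idea of what "Tika out of the same JVM" could look like from the client side, here is a small sketch that builds an HTTP request for a separately running tika-server.  The assumptions are mine, not from the ticket: a server at {{http://localhost:9998}}, its {{/rmeta}} endpoint (which returns metadata for the container and each embedded document as JSON), and Java 11's {{java.net.http}} client.  {{TikaServerRequestSketch}} and {{buildRequest}} are hypothetical names, and the sketch only builds the request; it does not send it.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class TikaServerRequestSketch {
    /**
     * Builds a PUT request carrying the raw document bytes to the
     * server's /rmeta endpoint. baseUrl is assumed to have no trailing slash.
     */
    static HttpRequest buildRequest(String baseUrl, byte[] document) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/rmeta"))
                .header("Accept", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder bytes; a real caller would read the file to be parsed.
        HttpRequest request = buildRequest("http://localhost:9998", new byte[] {1, 2, 3});
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending the request (for example with {{HttpClient.newHttpClient().send(...)}}) is left out on purpose.  The point is only that a parser crash or runaway memory use would then hit the server process rather than the JVM that is sending updates to Solr.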

Finally, speaking of embedded documents, if you have any friends over on Kite, I'd encourage them to look at Kite's failure to handle embedded documents [here|https://github.com/kite-sdk/kite/issues/397].  There's every chance they've fixed this by now, but as of July, no dice.


 



> Provide information on failed document parsing in ParsingEmbeddedDocumentExtractor
> ----------------------------------------------------------------------------------
>
>                 Key: TIKA-1764
>                 URL: https://issues.apache.org/jira/browse/TIKA-1764
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 1.5, 1.10
>            Reporter: Odilo Oehmichen
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The {{ParsingEmbeddedDocumentExtractor}} delegates the parsing of documents to a {{Parser}} instance.
> If this parser fails with a {{TikaException}}, the extractor class returns silently:
> {code}
>  catch (TikaException e) {
>             // TODO: can we log a warning somehow?
>             // Could not parse the entry, just skip the content
>         }
> {code}
> This behaviour makes it very hard to detect parsing problems.
> As the {{TODO}} in the source already states, please add some logging of the exception here.


