Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2021/04/13 18:58:00 UTC

[jira] [Created] (TIKA-3352) Add a handler for json output from the /tika endpoint

Tim Allison created TIKA-3352:
---------------------------------

             Summary: Add a handler for json output from the /tika endpoint
                 Key: TIKA-3352
                 URL: https://issues.apache.org/jira/browse/TIKA-3352
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison


I've been focusing mostly on the {{/rmeta}} endpoint. However, for the many users who aren't as cognizant of the wild and crazy things that can happen with embedded files (i.e., the rest of the world), it would be useful to have some of the advantages of the {{/rmeta}} endpoint without the complexity.

A json handler for the {{/tika}} endpoint would return text + metadata in a single response (for those who don't want to parse the xhtml). It would include "late metadata", that is, metadata that is only added after content extraction has begun and that therefore does not appear in our usual xhtml output. It would also store the stacktrace (if the -s/--stackTrace command-line option is selected) in a metadata field, as is done in {{/rmeta}}, so that users would get what they could from a failed parse and be able to align parse exceptions with the detected mime type.

Unlike /rmeta, this proposal would not include stacktraces from embedded files.
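
To make the proposed shape concrete, here is a rough client-side sketch of how such a response might be consumed. It is only an illustration under assumptions: the issue does not define the output format, so the single-object layout, the {{X-TIKA:content}} key (borrowed from the {{/rmeta}} convention), the exception key name, the helper name summarize_response, and the sample payload are all hypothetical.

```python
import json

# Assumed field names. The issue does not specify them; the content key
# follows the /rmeta convention and the exception key is hypothetical.
CONTENT_KEY = "X-TIKA:content"
EXCEPTION_KEY = "X-TIKA:EXCEPTION:container_exception"


def summarize_response(body: str) -> dict:
    """Split a single-object json response into text, stacktrace, and metadata."""
    record = json.loads(body)
    if not isinstance(record, dict):
        # /rmeta returns a list (one entry per embedded file); this sketch
        # assumes the proposed /tika output is one object for the container.
        raise ValueError("expected a single json object")
    metadata = {
        key: value
        for key, value in record.items()
        if key not in (CONTENT_KEY, EXCEPTION_KEY)
    }
    return {
        "text": record.get(CONTENT_KEY, ""),
        "stack_trace": record.get(EXCEPTION_KEY),  # None if the parse succeeded
        "mime_type": record.get("Content-Type"),
        "metadata": metadata,
    }


# Made-up payload for a parse that failed partway through: partial text,
# the detected mime type, and the stacktrace all arrive together.
sample = json.dumps(
    {
        "Content-Type": "application/pdf",
        CONTENT_KEY: "Hello from the container file.\n",
        EXCEPTION_KEY: "org.apache.tika.exception.EncryptedDocumentException: ...",
    }
)
```

With a payload like this, a caller could log the stacktrace next to the detected mime type and still keep whatever text was extracted before the failure, without walking a list of embedded documents.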



--
This message was sent by Atlassian Jira
(v8.3.4#803005)