Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2015/03/02 22:53:05 UTC

[jira] [Comment Edited] (TIKA-944) Extend tika-server API to be consistent with tika-app CLI

    [ https://issues.apache.org/jira/browse/TIKA-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343665#comment-14343665 ] 

Tim Allison edited comment on TIKA-944 at 3/2/15 9:53 PM:
----------------------------------------------------------

Some items that came to mind:

# There's a slight disconnect in how we handle extraction from embedded docs:
#* Tika-app commandline -t extracts embedded content
#* Tika-app gui does not
#* /tika does not
# We also can't currently specify a tika config file on the command line for tika-server (easy fix).
# tika-server has a hardcoded substitution of the HtmlParser for the XMLParser, so application/xml content is handled by the HtmlParser:
{noformat}
 Map<MediaType, Parser> parsers = parser.getParsers();
 parsers.put(MediaType.APPLICATION_XML, new HtmlParser());
 parser.setParsers(parsers);
{noformat}

We should probably clean this up, but I'm not sure how to do that while respecting backwards compatibility. Frankly, 25% of the exceptions on govdocs1 now come from the XMLParser hitting non-compliant XML, so this choice makes quite a bit of sense. :)
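
One possible middle ground, offered only as a rough sketch and not as what tika-server does: check whether the content is well-formed XML first, and fall back to a lenient parser only when that check fails. The example below uses JDK classes alone; the class name XmlFallbackSketch, the ParserChoice enum, and the choose method are hypothetical stand-ins for illustration, not Tika API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class XmlFallbackSketch {

    /** Hypothetical labels standing in for a strict XML parser and a lenient HTML parser. */
    public enum ParserChoice { STRICT_XML, LENIENT_HTML }

    /** Returns true if the JDK's XML parser accepts the bytes as well-formed XML. */
    public static boolean isWellFormedXml(byte[] content) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Rethrow parse errors instead of letting the default handler print to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignore warnings */ }
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new ByteArrayInputStream(content));
            return true;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            return false;
        }
    }

    /** Picks the strict parser for well-formed input and the lenient one otherwise. */
    public static ParserChoice choose(byte[] content) {
        return isWellFormedXml(content) ? ParserChoice.STRICT_XML : ParserChoice.LENIENT_HTML;
    }

    public static void main(String[] args) {
        byte[] good = "<root><a>text</a></root>".getBytes(StandardCharsets.UTF_8);
        byte[] bad = "<root><a>text</root>".getBytes(StandardCharsets.UTF_8);
        System.out.println(choose(good));
        System.out.println(choose(bad));
    }
}
```

The obvious cost is that well-formed documents get parsed twice, and the input has to be buffered or re-readable, so this would need measuring before anyone relied on it.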


was (Author: tallison@mitre.org):
There's a slight disconnect in how we handle extraction from embedded docs:

* Tika-app commandline -t extracts embedded content
* Tika-app gui does not
* /tika does not

> Extend tika-server API to be consistent with tika-app CLI
> ---------------------------------------------------------
>
>                 Key: TIKA-944
>                 URL: https://issues.apache.org/jira/browse/TIKA-944
>             Project: Tika
>          Issue Type: New Feature
>          Components: server
>    Affects Versions: 1.1
>         Environment: Any
>            Reporter: Jason Judge
>            Assignee: Chris A. Mattmann
>              Labels: exposed-functionality, tika-server
>
> The tika-server API (web service) provides a limited set of functionality compared to the tika-app command-line version. Notable things missing are:
> 1. Language recognition.
> 2. Output in various formats (JSON for metadata, XHTML for the extracted text).
> Those are the two main things that would be useful to me, but ideally the server should be able to provide all the functionality that the command-line app does, taking the command-line as the model to follow.


