Posted to dev@tika.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/01/06 02:03:00 UTC

[jira] [Commented] (TIKA-2542) Support in tika-server for getting plain text and metadata at the same time

    [ https://issues.apache.org/jira/browse/TIKA-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314293#comment-16314293 ] 

ASF GitHub Bot commented on TIKA-2542:
--------------------------------------

mcaracuel opened a new pull request #216: Implementation of TIKA-2542 by mcaracuel
URL: https://github.com/apache/tika/pull/216
 
 
   



> Support in tika-server for getting plain text and metadata at the same time
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-2542
>                 URL: https://issues.apache.org/jira/browse/TIKA-2542
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, server
>    Affects Versions: 1.17
>            Reporter: Manolo Caracuel
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 1.18
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> It would be good to have a way to get a file's extracted plain text together with its detected metadata. Currently, the metadata is returned only when the request has an Accept header of text/xml or text/html, but then the body is not plain text because it also contains HTML elements.
> I propose that a request to /tika/plain with an Accept header of text/xml return an XHTML document with the metadata in the head's meta elements and the plain text in the body.
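
To illustrate the proposal, here is a minimal sketch of how a client might split such a response into metadata and plain text. It is only an assumption of what the output could look like: the sample XHTML, the metadata names, and the helper split_metadata_and_text are hypothetical and are not taken from the pull request.

```python
import xml.etree.ElementTree as ET

# XHTML namespace, written in ElementTree's "{uri}tag" notation.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical response body for the proposed /tika/plain + "Accept: text/xml"
# combination; the real output of the pull request may differ.
SAMPLE_RESPONSE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta name="Content-Type" content="application/pdf"/>
    <meta name="dc:title" content="Example"/>
    <title>Example</title>
  </head>
  <body>Hello world.
Second line.</body>
</html>"""


def split_metadata_and_text(xhtml):
    """Return (metadata dict, plain text) from an XHTML document.

    Hypothetical helper: metadata is read from <meta name=... content=...>
    elements, and the text is everything inside <body>.
    """
    root = ET.fromstring(xhtml)

    metadata = {}
    for meta in root.iter(XHTML_NS + "meta"):
        name = meta.get("name")
        if name is not None:
            # A repeated name overwrites the earlier value in this sketch.
            metadata[name] = meta.get("content", "")

    body = root.find(XHTML_NS + "body")
    text = "".join(body.itertext()).strip() if body is not None else ""
    return metadata, text


if __name__ == "__main__":
    metadata, text = split_metadata_and_text(SAMPLE_RESPONSE)
    print(metadata)
    print(text)
```

Because the body would carry only plain text, a client under this assumption needs no HTML stripping: it reads the meta elements once and takes the body text as-is.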


