Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2019/02/21 14:23:00 UTC

[jira] [Comment Edited] (TIKA-2755) Allow Tika to skip extraction of <img> tags in HTML

    [ https://issues.apache.org/jira/browse/TIKA-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774107#comment-16774107 ] 

Tim Allison edited comment on TIKA-2755 at 2/21/19 2:22 PM:
------------------------------------------------------------

{noformat}
~s/^[\r\n]|[\r\n]$//
~s/[\r\n]{2}/\n/
{noformat}

But seriously, I don't.  Those are artifacts of the xhtml->text conversion.  Maybe take a look at what we're doing in the ToTextHandler?
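
For anyone who wants to apply the two substitutions above on the client side, here is a minimal Java sketch. It assumes the intent is to strip line breaks at the very start and end of the extracted text and to collapse runs of consecutive line breaks into a single \n, so the patterns are slightly looser than the one-shot Perl-style expressions. The class and method names are hypothetical and are not part of Tika.

```java
import java.util.regex.Pattern;

// Hypothetical helper for post-processing extracted text; not a Tika class.
final class NewlineCleanup {
    // A run of CR/LF characters at the very start or very end of the input.
    private static final Pattern EDGE_NEWLINES =
            Pattern.compile("\\A[\\r\\n]+|[\\r\\n]+\\z");
    // Any run of two or more CR/LF characters (this also matches a single "\r\n" pair).
    private static final Pattern REPEATED_NEWLINES =
            Pattern.compile("[\\r\\n]{2,}");

    public static String clean(String text) {
        // First drop leading/trailing line breaks, then collapse the remaining runs.
        String trimmed = EDGE_NEWLINES.matcher(text).replaceAll("");
        return REPEATED_NEWLINES.matcher(trimmed).replaceAll("\n");
    }

    public static void main(String[] args) {
        // Sample input modelled on the response quoted in the issue, with extra line breaks.
        String raw = "\n[image: Return to the homepage]\n\nThis is a test\n";
        System.out.println(clean(raw));
    }
}
```

This only tidies the line breaks after the fact. It does not change what the xhtml->text conversion emits, and it leaves the [image: ...] text in place.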


was (Author: tallison@mitre.org):

~s/^[\r\n]|[\r\n]$//
~s/[\r\n]{2}/\n/

But seriously, I don't.  Those are artifacts of the xhtml->text conversion.  Maybe take a look at what we're doing in the ToTextHandler?

> Allow Tika to skip extraction of <img> tags in HTML
> ---------------------------------------------------
>
>                 Key: TIKA-2755
>                 URL: https://issues.apache.org/jira/browse/TIKA-2755
>             Project: Tika
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 1.19.1
>            Reporter: Harinder
>            Priority: Major
>         Attachments: TestForImageTag.html
>
>
> We are using Tika Server to extract text from HTML files. Tika extracts the alt text of image tags present in HTML files as _[image: this is the alt text of the image]_. This ends up in Solr and shows up in the results when we generate document summaries at query time (via Solr’s highlight functionality).
> If you PUT the attached HTML file to /tika, it will return the following response
> {code:java}
> [image: Return to the homepage]
> This is a test{code}
> It would be nice to have just this instead
> {code:java}
> This is a test {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)