Posted to dev@tika.apache.org by "Harinder (JIRA)" <ji...@apache.org> on 2018/10/12 18:17:00 UTC
[jira] [Created] (TIKA-2755) Allow Tika to skip extraction of <img> tags in HTML
Harinder created TIKA-2755:
------------------------------
Summary: Allow Tika to skip extraction of <img> tags in HTML
Key: TIKA-2755
URL: https://issues.apache.org/jira/browse/TIKA-2755
Project: Tika
Issue Type: Improvement
Components: server
Affects Versions: 1.19.1
Reporter: Harinder
Attachments: TestForImageTag.html
We are using Tika Server to extract text from HTML files. Tika extracts the alt text of image tags present in HTML files as _[image: this is the alt text of the image]_. This ends up in Solr and shows up in the results when we generate document summaries at query time (via Solr’s highlight functionality).
If you PUT the attached HTML file to /tika, it returns the following response:
{code:java}
[image: Return to the homepage]
This is a test{code}
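For anyone who wants to reproduce the request, here is a minimal sketch in Python. It assumes a Tika Server instance on localhost at the default port 9998; the helper names are hypothetical, and SAMPLE_HTML is only a stand-in for the attached TestForImageTag.html, whose exact contents are not reproduced here.

```python
import urllib.request

# Assumption: Tika Server is running locally on its default port (9998).
TIKA_URL = "http://localhost:9998/tika"

# Hypothetical stand-in for the attached TestForImageTag.html.
SAMPLE_HTML = (
    b"<html><body>"
    b'<img src="logo.png" alt="Return to the homepage">'
    b"<p>This is a test</p>"
    b"</body></html>"
)


def build_tika_request(html_bytes, url=TIKA_URL):
    """Build (but do not send) a PUT request asking Tika for plain text."""
    return urllib.request.Request(
        url,
        data=html_bytes,
        method="PUT",
        headers={"Content-Type": "text/html", "Accept": "text/plain"},
    )


def extract_text(html_bytes, url=TIKA_URL):
    """Send the HTML to Tika Server and return the extracted text."""
    with urllib.request.urlopen(build_tika_request(html_bytes, url)) as resp:
        return resp.read().decode("utf-8")
```

Calling extract_text(SAMPLE_HTML) against a running server should return text that includes the [image: ...] line described in this issue.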
It would be nice if the response were just this instead:
{code:java}
This is a test {code}
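Until something like this is supported, one possible client-side workaround is to strip the markers from the extracted text before it is indexed in Solr. The sketch below is not a Tika feature; the function name is hypothetical, and it assumes the marker always has the form [image: ...] with no closing bracket inside the alt text.

```python
import re

# Matches "[image: <alt text>]" plus any whitespace that follows it.
# Assumption: the alt text itself never contains a "]" character.
IMAGE_MARKER = re.compile(r"\[image: [^\]]*\]\s*")


def strip_image_markers(text):
    """Remove Tika's "[image: ...]" alt-text markers from extracted text."""
    return IMAGE_MARKER.sub("", text)


extracted = "[image: Return to the homepage]\nThis is a test"
print(strip_image_markers(extracted))  # prints "This is a test"
```

Note that this would also remove genuine document text that happens to match the same pattern, which is one reason an option in Tika itself would be preferable.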