Posted to dev@manifoldcf.apache.org by "Donald Van den Driessche (JIRA)" <ji...@apache.org> on 2018/10/19 11:26:00 UTC
[jira] [Created] (CONNECTORS-1550) HTML Tag mapping
Donald Van den Driessche created CONNECTORS-1550:
----------------------------------------------------
Summary: HTML Tag mapping
Key: CONNECTORS-1550
URL: https://issues.apache.org/jira/browse/CONNECTORS-1550
Project: ManifoldCF
Issue Type: Wish
Components: Elastic Search connector, Tika extractor, Web connector
Affects Versions: ManifoldCF 2.10
Reporter: Donald Van den Driessche
I’ll be crawling a website with the standard Web connector. I want to extract only the content of certain HTML tags, such as <h1>, <h2> and <p>.
I’ve set up an HTML extractor transformation connector and the internal Tika transformation connector, but I can’t find anywhere to map the content of these tags to the output.
Do I have to write my own transformation connector to extract the content of these tags, or is there a built-in solution?
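To make the request concrete, here is a minimal standalone sketch of the kind of tag-to-field mapping I have in mind. It is not ManifoldCF code and does not use any ManifoldCF or Tika API. It uses only Python's standard html.parser module, and the names TagTextExtractor, extract_tag_fields and WANTED_TAGS are hypothetical, chosen just for this illustration.

```python
from html.parser import HTMLParser

# Tags whose text should be kept (the ones named in this request).
# Hypothetical constant, for illustration only.
WANTED_TAGS = {"h1", "h2", "p"}


class TagTextExtractor(HTMLParser):
    """Collects the text inside each wanted tag, grouped by tag name."""

    def __init__(self):
        super().__init__()
        self.fields = {tag: [] for tag in WANTED_TAGS}
        self._current = None  # wanted tag currently open, if any
        self._buffer = []     # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case.
        if tag in WANTED_TAGS:
            # A new wanted tag implicitly ends an unclosed one (e.g. <p>).
            self._flush()
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._flush()

    def handle_data(self, data):
        # Text inside inline children (<b>, <a>, ...) is kept as well.
        if self._current is not None:
            self._buffer.append(data)

    def _flush(self):
        if self._current is not None:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.fields[self._current].append(text)
        self._current = None
        self._buffer = []

    def close(self):
        super().close()
        self._flush()


def extract_tag_fields(html):
    """Return a dict mapping each wanted tag name to its list of texts."""
    parser = TagTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.fields
```

Each key of the returned dictionary (h1, h2, p) is what I would like to map to a separate field in the output, for example in the Elastic Search index.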
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)