Posted to dev@manifoldcf.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2019/05/10 19:36:00 UTC
[jira] [Created] (CONNECTORS-1605) Update HTML Extractor connector
Karl Wright created CONNECTORS-1605:
---------------------------------------
Summary: Update HTML Extractor connector
Key: CONNECTORS-1605
URL: https://issues.apache.org/jira/browse/CONNECTORS-1605
Project: ManifoldCF
Issue Type: Improvement
Affects Versions: ManifoldCF 2.9.1
Reporter: Olivier Tavard
Assignee: Karl Wright
Fix For: ManifoldCF 2.10
Attachments: fix_englobing_tag_selection.txt, global_patch.txt, html_extractor_transformation_connector.txt, patch_HTML_extractor_connector_05_06_19.txt, patch_html_extractor_08_14_18.txt, patch_html_extractor_fix_logs_08_10_18.txt
Hi,
I developed a transformation connector based on Jsoup. Its goal is simply to select an encompassing tag in an HTML document for text extraction. Within that tag, the connector lets you remove subparts that you do not want: for example, all tags matching declared tag types or specific attribute names.
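To illustrate the idea, here is a minimal sketch of this kind of extraction written directly against Jsoup. It is not the attached connector code: the class name HtmlExtractSketch, the method extract, and the choice to express the encompassing tag and the exclusions as CSS selectors are all assumptions made for the example. It needs the Jsoup jar on the classpath.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper, for illustration only (not part of ManifoldCF).
final class HtmlExtractSketch {

    private HtmlExtractSketch() {
    }

    /**
     * Returns the text found inside the first element matching rootSelector,
     * after dropping every descendant matched by one of the exclusion selectors.
     * Returns an empty string when no element matches rootSelector.
     */
    static String extract(String html, String rootSelector, List<String> exclusions) {
        Document doc = Jsoup.parse(html);

        // Pick the encompassing tag; first() yields null when nothing matches.
        Element root = doc.select(rootSelector).first();
        if (root == null) {
            return "";
        }

        // Remove the unwanted subparts from the DOM before reading the text.
        for (String exclusion : exclusions) {
            root.select(exclusion).remove();
        }

        // text() returns the whitespace-normalized text of the remaining subtree.
        return root.text();
    }
}
```

For example, calling extract(html, "div#content", Arrays.asList("script", "p.ad")) would keep only the text inside the element with id "content", minus any script elements and any paragraphs of class "ad".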
The code is released under the Apache License 2.0 and is attached.
It still needs some work, including code refactoring, class renaming, and unit tests, which I can do if you are interested in the code.
The documentation is here:
https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/237240321/HTML+Extractor+Transformation+connector
It does not use any libraries other than those already present in the MCF project. It relies on the Jsoup library in the lib folder.
Best regards,
Olivier
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)