Posted to dev@manifoldcf.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2019/05/10 19:37:00 UTC

[jira] [Updated] (CONNECTORS-1605) Update HTML Extractor connector

     [ https://issues.apache.org/jira/browse/CONNECTORS-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright updated CONNECTORS-1605:
------------------------------------
    Fix Version/s:     (was: ManifoldCF 2.10)

> Update HTML Extractor connector
> -------------------------------
>
>                 Key: CONNECTORS-1605
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1605
>             Project: ManifoldCF
>          Issue Type: Improvement
>    Affects Versions: ManifoldCF 2.13
>            Reporter: Olivier Tavard
>            Assignee: Karl Wright
>            Priority: Minor
>         Attachments: fix_englobing_tag_selection.txt, global_patch.txt, html_extractor_transformation_connector.txt, patch_HTML_extractor_connector_05_06_19.txt, patch_html_extractor_08_14_18.txt, patch_html_extractor_fix_logs_08_10_18.txt
>
>
> Hi,
> I developed a transformation connector based on Jsoup. Its goal is simply to select an encompassing tag in an HTML document for text extraction. Inside this tag, the connector lets you remove subparts that you do not want: for example, all tags matching declared types or specific attribute names.
> The code is under the Apache License 2.0 and is attached.
> It needs some work, including code refactoring, renaming classes, and unit tests, which I will be able to do if you are interested in the code.
> The documentation is here:
> https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/237240321/HTML+Extractor+Transformation+connector
>  
> It does not use any libraries other than those already present in the MCF project. It relies on the Jsoup library in the lib folder.
> Best regards,
> Olivier
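
For readers who want a concrete picture of the "encompassing tag plus removal of unwanted subparts" idea described in the quoted message, here is a minimal sketch. It is not the attached connector code and does not use Jsoup: to stay runnable with the JDK alone it swaps in the standard XML DOM parser, so it only handles well-formed XHTML. The class name EncompassingTagSketch, the extract method, and its parameters (keepTag, dropTags, dropClass) are all hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; the real connector is based on Jsoup.
class EncompassingTagSketch {

    // Returns the text of the first <keepTag> element after removing every
    // descendant whose tag name is in dropTags or whose "class" attribute
    // equals dropClass. Returns "" when keepTag is not found.
    static String extract(String xhtml, String keepTag,
                          Set<String> dropTags, String dropClass) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            // Assumption: malformed input is reported as a bad argument.
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }

        NodeList roots = doc.getElementsByTagName(keepTag);
        if (roots.getLength() == 0) {
            return "";
        }
        Element root = (Element) roots.item(0);

        // Collect matches first: the NodeList is live, so removing nodes
        // while iterating over it would skip elements.
        NodeList all = root.getElementsByTagName("*");
        List<Element> toRemove = new ArrayList<>();
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (dropTags.contains(e.getTagName())
                    || dropClass.equals(e.getAttribute("class"))) {
                toRemove.add(e);
            }
        }
        for (Element e : toRemove) {
            Node parent = e.getParentNode();
            if (parent != null) {
                parent.removeChild(e);
            }
        }

        // Collapse runs of whitespace left behind by the removed elements.
        return root.getTextContent().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String page = "<html><body><div><p>Hello</p> <nav>menu</nav> "
                + "<p class=\"ad\">buy</p> <p>world</p></div></body></html>";
        Set<String> dropTags = new HashSet<>(Arrays.asList("nav"));
        System.out.println(extract(page, "div", dropTags, "ad"));
    }
}
```

With Jsoup, the same three steps (select the encompassing element, remove the unwanted descendants, read the remaining text) would normally be written with CSS selectors and would tolerate real-world, non-well-formed HTML, which this stand-in does not.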



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)