Posted to dev@manifoldcf.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2018/11/21 08:51:00 UTC

[jira] [Commented] (CONNECTORS-1557) HTML Tag extractor

    [ https://issues.apache.org/jira/browse/CONNECTORS-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16694406#comment-16694406 ] 

Karl Wright commented on CONNECTORS-1557:
-----------------------------------------

The best way to deliver the code is as a patch attachment to a ticket like this.

I hope that the transformer you wrote is consistent with the other transformers that ManifoldCF provides, e.g. the HTML Extractor and the Metadata Adjuster. Generally we are not fond of transformers that take on more than the most basic part of what could be structured as a multi-part transformation. From your description, it sounds like you have extended the HTML Extractor and added functionality similar to what the Metadata Adjuster does. If that's true, it might be better to contribute only the CSS-selector-based extraction as an extension to the HTML Extractor, and let the Metadata Adjuster handle the field mappings.

Please let me know how you want to proceed.


> HTML Tag extractor
> ------------------
>
>                 Key: CONNECTORS-1557
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1557
>             Project: ManifoldCF
>          Issue Type: New Feature
>            Reporter: Donald Van den Driessche
>            Priority: Major
>
> I wrote an HTML Tag extractor, based on the HTML Extractor.
> I needed to extract specific HTML tags and transfer them to their own field in my output repository.
> Input
>  * Englobing tag (CSS selector)
>  * Blacklist (CSS selector)
>  * Fieldmapping (CSS selector)
>  * Strip HTML
> Process
>  * Retrieve Englobing tag
>  * Remove blacklist
>  * Map selected CSS selectors in Fieldmapping (arrays if multiple finds) + strip HTML (if requested)
>  * Englobing tag minus blacklist: strip HTML (if requested) and return as output (content)
> How can I best deliver the source code?
>  
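
For illustration only, the process listed in the issue description (retrieve the englobing tag, remove the blacklist, map fields, return the stripped content) could be sketched roughly as below. This is not the reporter's code and does not use any ManifoldCF API. The class name TagExtractorSketch, the Result holder, and the extract method are hypothetical. Because the JDK has no CSS selector engine, the sketch swaps in XPath expressions over well-formed XHTML as a stand-in for CSS selectors; a real transformer would more likely rely on an HTML parser with CSS selector support. It also assumes that field selectors are evaluated inside the englobing tag after blacklist removal, and it always strips markup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch; XPath stands in for the CSS selectors named in the issue.
final class TagExtractorSketch {

    // Hypothetical result holder: main content plus one value list per mapped field.
    static final class Result {
        final String content;
        final Map<String, List<String>> fields;

        Result(String content, Map<String, List<String>> fields) {
            this.content = content;
            this.fields = fields;
        }
    }

    static Result extract(String xhtml, String enclosingPath,
                          List<String> blacklistPaths, Map<String, String> fieldPaths) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, List<String>> fields = new LinkedHashMap<>();

            // 1. Retrieve the englobing (enclosing) element.
            Node root = (Node) xpath.evaluate(enclosingPath, doc, XPathConstants.NODE);
            if (root == null) {
                return new Result("", fields);
            }

            // 2. Remove every blacklisted element found inside it.
            for (String path : blacklistPaths) {
                for (Node hit : select(xpath, path, root)) {
                    if (hit.getParentNode() != null) {
                        hit.getParentNode().removeChild(hit);
                    }
                }
            }

            // 3. Map each field to the text of its matches (a list, since there may be several).
            for (Map.Entry<String, String> entry : fieldPaths.entrySet()) {
                List<String> values = new ArrayList<>();
                for (Node hit : select(xpath, entry.getValue(), root)) {
                    values.add(text(hit));
                }
                fields.put(entry.getKey(), values);
            }

            // 4. Return the remaining text of the enclosing element as the content.
            return new Result(text(root), fields);
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse or query the document", ex);
        }
    }

    // Copies matches into a list so that nodes can be removed safely while iterating.
    private static List<Node> select(XPath xpath, String path, Node context) throws Exception {
        NodeList hits = (NodeList) xpath.evaluate(path, context, XPathConstants.NODESET);
        List<Node> copy = new ArrayList<>();
        for (int i = 0; i < hits.getLength(); i++) {
            copy.add(hits.item(i));
        }
        return copy;
    }

    // Strips markup by taking the text content and collapsing whitespace.
    private static String text(Node node) {
        return node.getTextContent().replaceAll("\\s+", " ").trim();
    }
}
```

In this sketch the blacklist and field expressions are written relative to the enclosing element (for example ".//h1"). The field map it returns is the part that, per the comment above, could instead be left to the Metadata Adjuster.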


