Posted to dev@nutch.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2019/02/15 16:34:00 UTC

[jira] [Commented] (NUTCH-1870) Generic xsl parser plugin

    [ https://issues.apache.org/jira/browse/NUTCH-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769480#comment-16769480 ] 

ASF GitHub Bot commented on NUTCH-1870:
---------------------------------------

sebastian-nagel commented on pull request #439: NUTCH-1870 XSL parse filter
URL: https://github.com/apache/nutch/pull/439
 
 
   - apply patch contributed by @albinscode
   - load configuration files from classpath and address thread-safety
   
   Note: not ready yet:
   - TODOs in code
   - unit tests fail (with DOM built by tagsoup parser)
   - see also open points in [NUTCH-1870](https://issues.apache.org/jira/browse/NUTCH-1870)
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Generic xsl parser plugin
> -------------------------
>
>                 Key: NUTCH-1870
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1870
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>    Affects Versions: 1.9
>            Reporter: Albinscode
>            Priority: Major
>         Attachments: NUTCH-1870-trunk-v3.patch, NUTCH-1870-trunk-v4.patch, nutch-site.xml, xsl-parse-plugin.patch, xsl-parse-plugin2.patch
>
>
> The aim of this plugin is to use XSLT to extract metadata from HTML DOM structures.
> | Your data | --> | parse-html plugin or Tika plugin | --> | DOM structure | --> | XSLT plugin |
>
> The main advantages are:
> - You don't have to write any Java code, only XSLT and configuration.
> - It can process a DOM structure given as a DocumentFragment (see NekoHTML and TagSoup).
> - It is an HtmlParseFilter-compatible plugin and can be plugged in like any other plugin (parse-js, parse-swf, etc.).
>
> This topic has been discussed at http://www.mail-archive.com/dev%40nutch.apache.org/msg15257.html
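
For readers unfamiliar with the idea described in the issue, here is a minimal sketch of extracting metadata from a DOM with XSLT, using only the standard JAXP API. It is not the code from the patch: the class name `XsltMetadataSketch`, the stylesheet, and the `name=value` output format are all hypothetical. It also assumes well-formed XHTML input parsed by the JDK parser, whereas in Nutch the DOM would come from parse-html or Tika.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XsltMetadataSketch {

    // Hypothetical stylesheet: writes one "name=value" line per extracted field.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>"
      + "title=<xsl:value-of select='normalize-space(/html/head/title)'/>"
      + "<xsl:text>&#10;</xsl:text>"
      + "author=<xsl:value-of select=\"/html/head/meta[@name='author']/@content\"/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Parses the (well-formed) markup into a DOM, then applies the stylesheet to that DOM.
    static String extract(String xhtml, String xsl) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document dom = dbf.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));

        // A Transformer is not thread-safe; create one per call (or per thread).
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xsl)));

        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(dom), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><title> Example page </title>"
            + "<meta name='author' content='Jane'/></head><body/></html>";
        System.out.println(extract(page, XSL));
    }
}
```

The extraction rules live entirely in the stylesheet, so a different site needs a different XSLT file rather than new Java code, which is the main advantage the description claims.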



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)