Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2017/06/02 05:26:04 UTC
[jira] [Commented] (NUTCH-2389) Precise data parsing using Jsoup CSS selectors
[ https://issues.apache.org/jira/browse/NUTCH-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034162#comment-16034162 ]
Lewis John McGibbney commented on NUTCH-2389:
---------------------------------------------
[~kaidul], I think that the plugin should be a ParseFilter. Do you happen to have a pull request we can review? Unit tests would also be VERY welcome. Thank you.
> Precise data parsing using Jsoup CSS selectors
> ----------------------------------------------
>
> Key: NUTCH-2389
> URL: https://issues.apache.org/jira/browse/NUTCH-2389
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Affects Versions: 2.3
> Reporter: Kaidul Islam
> Assignee: Kaidul Islam
> Fix For: 2.4
>
> Original Estimate: 0.05h
> Remaining Estimate: 0.05h
>
> As far as I know, Nutch 1.x and 2.x currently have no feature to extract or parse exact content from specific websites. For my current project I've developed a plugin, {{parse-jsoup}}, that uses Jsoup to extract precise content for site-specific crawling, driven by detailed XML configuration (field name, CSS selector, attribute, extraction rules, data type, default value, etc.).
> Please let me know if this feature seems relevant and is not already present in Nutch. I also plan to port it to Nutch 1.x.
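
To make the per-field configuration described above concrete, here is a minimal Java sketch of what one rule (field name, CSS selector, attribute, data type, default value) might look like once loaded from the XML. The class name FieldRule, its fields, and the convert method are all hypothetical and are not taken from the parse-jsoup plugin. The Jsoup lookup itself is only indicated in a comment, so the sketch covers just the type conversion and default-value fallback.

```java
// Hypothetical model of a single extraction rule; none of these names
// come from the actual parse-jsoup plugin.
final class FieldRule {
    enum DataType { STRING, INTEGER, DOUBLE, BOOLEAN }

    final String name;         // output field name
    final String cssSelector;  // CSS selector locating the element
    final String attribute;    // attribute to read, or null for the element text
    final DataType type;       // target data type
    final Object defaultValue; // used when the value is missing or malformed

    FieldRule(String name, String cssSelector, String attribute,
              DataType type, Object defaultValue) {
        this.name = name;
        this.cssSelector = cssSelector;
        this.attribute = attribute;
        this.type = type;
        this.defaultValue = defaultValue;
    }

    // With Jsoup, the raw string would typically come from something like
    // doc.select(cssSelector).first(), followed by text() or attr(attribute).
    // Here the raw value is passed in so the sketch needs no dependencies.
    Object convert(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return defaultValue;
        }
        String value = raw.trim();
        try {
            switch (type) {
                case INTEGER: return Integer.valueOf(value);
                case DOUBLE:  return Double.valueOf(value);
                case BOOLEAN: return Boolean.valueOf(value);
                default:      return value;
            }
        } catch (NumberFormatException e) {
            // Malformed numbers fall back to the configured default.
            return defaultValue;
        }
    }
}
```

A rule such as new FieldRule("price", "span.price", null, FieldRule.DataType.DOUBLE, 0.0) would turn the extracted text " 19.99 " into a Double and fall back to 0.0 when the element is absent or its text is not numeric.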
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)