Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2014/11/02 01:15:34 UTC
[jira] [Commented] (NUTCH-1644) Should have a parser that uses xpath
[ https://issues.apache.org/jira/browse/NUTCH-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14193609#comment-14193609 ]
Lewis John McGibbney commented on NUTCH-1644:
---------------------------------------------
[~talat], the patch you put here is kinda wild.
* The formatting of the XML is all over the place.
* It includes solr4-schema.xml, which no longer exists in 2.X.
* It includes references to article titles, authors and content within the above schema as well as in solr-mapping.xml.
* It includes a bunch of local plugin nutch-site.xml, which I am not sure fits in with the existing plugin configuration.
* The package names are com.atlantbh.nutch where they should be org.apache.nutch.
* The Java code is not formatted correctly.
* This appears to be an IndexingFilter as well...
* There seems to be an awful lot of code! Same with the XML!
* It is a patch for Git, not for SVN.
Thanks for uploading, but I feel that this needs a lot of work.
> Should have a parser that uses xpath
> ------------------------------------
>
> Key: NUTCH-1644
> URL: https://issues.apache.org/jira/browse/NUTCH-1644
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Affects Versions: 2.2.1
> Reporter: cihad güzel
> Assignee: Lewis John McGibbney
> Labels: parser, xpath
> Fix For: 2.4
>
> Attachments: NUTCH-1644.patch, filter-xpath.patch
>
>
> Users may want to parse some URLs via XPath, for example on blog or news web sites. There should be a plugin that parses using XPath.