Posted to dev@nutch.apache.org by "Nathan Gass (JIRA)" <ji...@apache.org> on 2012/11/09 17:54:12 UTC

[jira] [Commented] (NUTCH-1495) -normalize and -filter for updatedb command in nutch 2.x

    [ https://issues.apache.org/jira/browse/NUTCH-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494113#comment-13494113 ] 

Nathan Gass commented on NUTCH-1495:
------------------------------------

P.S. I have only tested the filter part of the patch. The new normalizations are not yet tested, and I have already noticed at least one bug in the normalization part of the patch. I'll add a new patch as soon as I get around to testing normalizations.
                
> -normalize and -filter for updatedb command in nutch 2.x
> --------------------------------------------------------
>
>                 Key: NUTCH-1495
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1495
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.2
>            Reporter: Nathan Gass
>         Attachments: patch-updatedb-normalize-filter-2012-11-09.txt
>
>
> As far as I can see, in Nutch 1.x you could change your URL filters and normalizers during the crawl and update the db using crawldb -normalize -filter. There does not seem to be a way to achieve the same in Nutch 2.x?
> Anyway, I went ahead and tried to implement -normalize and -filter for the Nutch 2.x updatedb command. I have no experience with any of the technologies used, including Java, so please check the attached code carefully before using it. I'd be very interested to hear whether this is the right approach, and any other comments are welcome.
