Posted to dev@nutch.apache.org by "byron miller (JIRA)" <ji...@apache.org> on 2005/10/25 16:49:08 UTC

[jira] Commented: (NUTCH-49) Flag for generate to fetch only new pages to complement the -refetchonly flag

    [ http://issues.apache.org/jira/browse/NUTCH-49?page=comments#action_12355864 ] 

byron miller commented on NUTCH-49:
-----------------------------------

Can something like this be adapted to use the regex filter as well? It would be nice to say "new only" and also match URLs of a given type, a given link score, or some other expression (not just the very topN).
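
To make the request concrete, here is a rough sketch of the selection logic being asked for. It is not Nutch code: PageEntry, select_new_pages and all of their parameters are hypothetical names invented for illustration, and the real fetchlist generation works against the webdb rather than an in-memory list.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class PageEntry:
    """Hypothetical stand-in for a webdb page record (not a Nutch class)."""
    url: str
    score: float
    fetched: bool  # True if the webdb says the page was already fetched


def select_new_pages(
    pages: Iterable[PageEntry],
    url_pattern: Optional[str] = None,
    min_score: float = 0.0,
    top_n: Optional[int] = None,
) -> List[PageEntry]:
    """Keep only unfetched pages, optionally narrowed by URL regex and score."""
    regex = re.compile(url_pattern) if url_pattern else None
    selected = [
        p for p in pages
        if not p.fetched                               # "new only"
        and (regex is None or regex.search(p.url))     # optional URL regex
        and p.score >= min_score                       # optional score floor
    ]
    # Highest-scoring pages first, then apply the optional topN cut-off.
    selected.sort(key=lambda p: p.score, reverse=True)
    return selected if top_n is None else selected[:top_n]
```

For example, select_new_pages(pages, url_pattern=r"\.html$", min_score=1.0) would keep only unfetched .html pages with a score of at least 1.0, so topN becomes one criterion among several rather than the only one.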



> Flag for generate to fetch only new pages to complement the -refetchonly flag
> -----------------------------------------------------------------------------
>
>          Key: NUTCH-49
>          URL: http://issues.apache.org/jira/browse/NUTCH-49
>      Project: Nutch
>         Type: New Feature
>   Components: fetcher
>     Reporter: Luke Baker
>     Priority: Minor
>  Attachments: fetchnewonly.patch
>
> It would be useful, especially for research/testing purposes, to have a flag for the FetchListTool that makes sure the fetchlist only includes URLs that have not already been fetched (according to the information from the webdb that you're generating the fetchlist from).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira