Posted to dev@nutch.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2005/04/22 20:38:23 UTC

[jira] Commented: (NUTCH-49) Flag for generate to fetch only new pages to complement the -refetchonly flag

     [ http://issues.apache.org/jira/browse/NUTCH-49?page=comments#action_63531 ]
     
Doug Cutting commented on NUTCH-49:
-----------------------------------

This seems like reasonable functionality to add.

However, the code needs a little cleanup.  We should at least use constants for the different modes.  Better yet would be a type-safe enumeration, e.g., a nested class like:

  public static final class Fetch {
    private final String name;

    // private constructor: only the constants below can ever exist
    private Fetch(String name) { this.name = name; }

    public String toString() {
      return this.getClass().getName() + ":" + name;
    }

    public static final Fetch ALL = new Fetch("ALL");
    public static final Fetch NEW = new Fetch("NEW");
    public static final Fetch OLD = new Fetch("OLD");
  }

Then fetch with something like:

   new FetchListTool(..., Fetch.ALL, ...);
   ...

The above code is of course untested, needs javadoc, etc.
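
For illustration only, here is a minimal self-contained sketch of how such a mode constant might drive the selection. FetchModeSketch, Page, and select() are hypothetical names and not part of Nutch, and the sketch assumes NEW means "never fetched" and OLD means "already fetched" (the -refetchonly case):

```java
import java.util.ArrayList;
import java.util.List;

public class FetchModeSketch {

  /** Type-safe enumeration of fetchlist modes. */
  public static final class Fetch {
    private final String name;

    // private constructor: no instances beyond the three constants
    private Fetch(String name) { this.name = name; }

    public String toString() { return "Fetch:" + name; }

    public static final Fetch ALL = new Fetch("ALL");
    public static final Fetch NEW = new Fetch("NEW");
    public static final Fetch OLD = new Fetch("OLD");
  }

  /** Hypothetical page record: a URL plus whether it was fetched before. */
  public static final class Page {
    public final String url;
    public final boolean fetched;

    public Page(String url, boolean fetched) {
      this.url = url;
      this.fetched = fetched;
    }
  }

  /** Hypothetical selection step: keep the URLs of pages matching the mode. */
  public static List<String> select(List<Page> pages, Fetch mode) {
    List<String> urls = new ArrayList<String>();
    for (Page p : pages) {
      // identity comparison is safe because the constants are singletons
      boolean keep = mode == Fetch.ALL
          || (mode == Fetch.NEW && !p.fetched)
          || (mode == Fetch.OLD && p.fetched);
      if (keep) {
        urls.add(p.url);
      }
    }
    return urls;
  }

  public static void main(String[] args) {
    List<Page> pages = new ArrayList<Page>();
    pages.add(new Page("http://example.org/a", true));
    pages.add(new Page("http://example.org/b", false));
    System.out.println(Fetch.NEW + " -> " + select(pages, Fetch.NEW));
  }
}
```

Because the constructor is private, a caller cannot pass an invalid mode, which is the main advantage over plain int or String constants.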

Doug

> Flag for generate to fetch only new pages to complement the -refetchonly flag
> -----------------------------------------------------------------------------
>
>          Key: NUTCH-49
>          URL: http://issues.apache.org/jira/browse/NUTCH-49
>      Project: Nutch
>         Type: New Feature
>   Components: fetcher
>     Reporter: Luke Baker
>     Priority: Minor
>  Attachments: fetchnewonly.patch
>
> It would be useful, especially for research/testing purposes, to have a flag for the FetchListTool that makes sure the fetchlist includes only URLs that have not already been fetched (according to the information in the webdb that you're generating the fetchlist from).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira