Posted to dev@nutch.apache.org by Rod Taylor <rb...@sitesell.com> on 2005/10/05 17:09:40 UTC

Nutch Contract Work

We are interested in hiring a contractor to work on the map/reduce
branch, finish some outstanding items, and help us work through
performance and exception-recovery issues in a full-web crawl.

Any and all patches created for us for Nutch would be expected to be
merged back into the Nutch svn repository after going through the
appropriate discussion and review with the Nutch community.

Some (most?) of these may be trivial or easy to do.

The immediate items on our list include:
      * A new implementation of segread for map/reduce.
      * The ability to use multiple weighted temporary directories.
        (Some areas are preferred, with fallback to others when space
        is constrained.)
      * If the temporary directories are out of space, attempt to free
        used but no longer required space. Be more aggressive in
        temporary space recovery when low.
      * Reduction of exceptions in general. We regularly have exceptions
        in the log file which may or may not be important. Determine
        their importance and if they don't matter, catch 'em and change
        them to a simple log line.
      * Advice on our configuration and how it can be improved to meet
        our needs.
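
To make the weighted temporary directory item concrete, here is a
minimal sketch in Java of one way the selection could work. The class
WeightedTempDirs and its methods are hypothetical and not part of
Nutch. It assumes Java 8 or later and that a higher weight means a more
preferred area. The free-space lookup is passed in so the logic can be
exercised without real disks.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

// Hypothetical helper, not part of Nutch: picks a temporary directory by
// preference weight, falling back to lower-weighted ones when space is short.
class WeightedTempDirs {
    // One candidate directory and its preference weight (higher = preferred).
    static final class Candidate {
        final File dir;
        final int weight;

        Candidate(File dir, int weight) {
            this.dir = dir;
            this.weight = weight;
        }
    }

    private final List<Candidate> candidates = new ArrayList<>();
    private final ToLongFunction<File> freeSpace;

    // The free-space lookup is injectable so tests need no real file system.
    WeightedTempDirs(ToLongFunction<File> freeSpace) {
        this.freeSpace = freeSpace;
    }

    // Default: ask the file system how many bytes are usable.
    WeightedTempDirs() {
        this(File::getUsableSpace);
    }

    WeightedTempDirs add(String path, int weight) {
        candidates.add(new Candidate(new File(path), weight));
        return this;
    }

    // Returns the highest-weighted directory with at least requiredBytes
    // free, or null when no candidate has enough space.
    File pick(long requiredBytes) {
        Candidate best = null;
        for (Candidate c : candidates) {
            if (freeSpace.applyAsLong(c.dir) < requiredBytes) {
                continue; // too full, fall back to another area
            }
            if (best == null || c.weight > best.weight) {
                best = c;
            }
        }
        return best == null ? null : best.dir;
    }
}
```

A null result from pick would be the natural point to trigger the more
aggressive space recovery described in the next item.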

Additional items being considered (possible contract extension):
      * Currently Nutch has "downtime" while processing updatedb /
        generate. Create a way of eliminating this, possibly by
        generating multiple segments and overlapping with updatedb /
        generate.
      * Allow the work that Crawl does to be specified in the
        configuration file (eliminating parts of the process and
        possibly adding others). Say, a list like "generate, fetch,
        parse, segdump, run some shell script or touch a file indicating
        item is ready". This is coupled with the previous point of
        running a constant crawl.
      * Additional suggestions that you make which could help improve
        our Nutch performance and reliability.
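
As an illustration of the configurable step list, the following Java
sketch parses a comma-separated value such as "generate, fetch, parse"
and runs the named steps in order. CrawlPipeline, register, parse and
run are hypothetical names, not existing Nutch classes or methods, and
each step is assumed to be a plain Runnable supplied by the caller.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of Nutch: runs crawl steps in the order
// named by a comma-separated configuration value.
class CrawlPipeline {
    // Known step names mapped to the work each one performs.
    private final Map<String, Runnable> registry = new LinkedHashMap<>();

    CrawlPipeline register(String name, Runnable step) {
        registry.put(name, step);
        return this;
    }

    // Splits the configured list into step names, trimming whitespace and
    // rejecting any name that has not been registered.
    List<String> parse(String spec) {
        List<String> names = new ArrayList<>();
        for (String raw : spec.split(",")) {
            String name = raw.trim();
            if (name.isEmpty()) {
                continue; // tolerate stray commas
            }
            if (!registry.containsKey(name)) {
                throw new IllegalArgumentException("Unknown crawl step: " + name);
            }
            names.add(name);
        }
        return names;
    }

    // Validates the whole list first, then runs each step in order.
    void run(String spec) {
        for (String name : parse(spec)) {
            registry.get(name).run();
        }
    }
}
```

Validating the full list before running anything means a typo in the
configuration fails fast instead of partway through a crawl cycle.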

regards,
	Rod Taylor

	rbt@sitesell.com
	416-977-8778

-- 
Rod Taylor <rb...@sitesell.com>