Posted to user@nutch.apache.org by Chris Anderson <jc...@grabb.it> on 2008/06/28 22:12:14 UTC

stripped down crawl

Hi,

I'm running a crawl which uses my own parsers (via the Hadoop streaming
jar), so I have no use for the Lucene / full-text side of Nutch. Of
course, I'm still using Nutch to maintain the ongoing crawldb. My
question: can I leave the parsing stage out of my CrawlTask and still
have Nutch generate the next depth's URLs?
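
As far as I can tell, the Fetcher.isParsing() check in my loop below
just reads the fetcher.parse property, roughly like this (sketch from
memory of the source, and I may have the default value wrong):

// my understanding of what Fetcher.isParsing(conf) boils down to:
boolean parsing = conf.getBoolean("fetcher.parse", true);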


Here is what I'm doing currently:
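
The imports and setup around the loop are roughly the following (the
log name, paths, and limits are just placeholders for my local setup):

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;

Log LOG = LogFactory.getLog("MyCrawl");     // placeholder log name
Configuration conf = NutchConfiguration.create();
Path crawlDb = new Path("crawl/crawldb");   // placeholder paths
Path segments = new Path("crawl/segments");
Path rootUrlDir = new Path("urls");
int depth = 5;                              // placeholder limits
long topN = 1000;
int threads = 10;

And the loop itself: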

Injector injector = new Injector(conf);
Generator generator = new Generator(conf);
Fetcher fetcher = new Fetcher(conf);
ParseSegment parseSegment = new ParseSegment(conf);
CrawlDb crawlDbTool = new CrawlDb(conf);

// initialize crawlDb
injector.inject(crawlDb, rootUrlDir);
for (int i = 0; i < depth; i++) {
  // generate new segment
  Path segment = generator.generate(crawlDb, segments, -1, topN,
      System.currentTimeMillis(), false, false);
  if (segment == null) {
    LOG.info("Stopping at depth=" + i + " - no more URLs to fetch.");
    break;
  }
  // fetch it
  fetcher.fetch(segment, threads);

  // parse in a separate pass, unless the fetcher already parsed inline
  // (i.e. unless fetcher.parse is true)
  if (!Fetcher.isParsing(conf)) {
    parseSegment.parse(segment);
  }
  // fold the segment's fetch/parse output back into the crawldb
  crawlDbTool.update(crawlDb, new Path[]{segment}, true, false);
}
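
Ideally I'd drop the ParseSegment pass entirely, since my streaming job
does the parsing, so the body of the loop would shrink to just this
(a sketch only -- this is exactly the part I'm not sure still feeds new
URLs into the crawldb):

fetcher.fetch(segment, threads);
// no parseSegment.parse(segment) here - my own hadoop streaming job
// handles the fetched content
crawlDbTool.update(crawlDb, new Path[]{segment}, true, false);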

-- 
Chris Anderson
http://jchris.mfdz.com