Posted to user@nutch.apache.org by Chris Anderson <jc...@grabb.it> on 2008/06/28 22:12:14 UTC
stripped down crawl
Hi,
I'm running a crawl that uses my own parsers (via the Hadoop streaming
jar), so I have no use for the Lucene / full-text side of Nutch; I'm
only using Nutch to maintain the ongoing crawldb. My question is: can
I leave the parsing stage out of my crawl task and still have Nutch
generate the next depth's URLs?
Here is what I'm doing currently:
Injector injector = new Injector(conf);
Generator generator = new Generator(conf);
Fetcher fetcher = new Fetcher(conf);
ParseSegment parseSegment = new ParseSegment(conf);
CrawlDb crawlDbTool = new CrawlDb(conf);

// initialize crawlDb
injector.inject(crawlDb, rootUrlDir);

int i;
for (i = 0; i < depth; i++) {
  // generate new segment
  Path segment = generator.generate(crawlDb, segments, -1, topN,
      System.currentTimeMillis(), false, false);
  if (segment == null) {
    LOG.info("Stopping at depth=" + i + " - no more URLs to fetch.");
    break;
  }
  // fetch it
  fetcher.fetch(segment, threads);
  // parse it, unless the fetcher already parsed inline
  if (!Fetcher.isParsing(conf)) {
    parseSegment.parse(segment);
  }
  // update crawldb
  crawlDbTool.update(crawlDb, new Path[]{segment}, true, false);
}
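
As far as I understand it, the crawldb update gets its new URLs from
the outlink data that parsing writes into the segment, so dropping the
parse step entirely would also drop next-depth URL discovery. What
should work (a sketch, not tested against your setup) is letting the
Fetcher parse inline instead of running the separate ParseSegment
pass, which the Fetcher.isParsing(conf) check above already skips in
that case:

```xml
<!-- nutch-site.xml (sketch): have the Fetcher parse as it fetches,
     so the standalone ParseSegment step in the loop is skipped -->
<property>
  <name>fetcher.parse</name>
  <value>true</value>
</property>
```

That avoids the extra MapReduce pass but still produces the parse
output the crawldb update needs; it does not remove parsing itself.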
--
Chris Anderson
http://jchris.mfdz.com