Posted to user@nutch.apache.org by Winton Davies <wd...@cs.stanford.edu> on 2008/07/01 21:14:05 UTC

nutch crawl : file:/// vs http://localhost/

Soooooo...

Executive Summary: Funnel all files through a webserver if you want page 
weighting (OPIC) and anchor text to be indexed and used.
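
Concretely, the two setups differ only in the seed list; the paths below 
are just the ones from my runs:

    # urls/seeds.txt for the file-system crawl
    file:///enwiki/

    # urls/seeds.txt for the webserver crawl (same files, served by httpd)
    http://localhost/enwiki/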

I just did some experiments after unsuccessfully trying to run 
invertlinks on the segments built from file:///enwiki/. (I'd 
originally crawled with db.ignore.internal.links turned on.)
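
In case anyone wants to reproduce this: on a stock config Nutch won't 
fetch file: URLs at all, and the invertlinks step itself is just the 
standard invocation. Roughly (crawl/ paths are illustrative):

    # stock regex-urlfilter.txt excludes file: URLs; comment out this rule:
    #   -^(file|ftp|mailto):
    # and make sure plugin.includes lists protocol-file, not just protocol-http

    bin/nutch invertlinks crawl/linkdb -dir crawl/segments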

It seems that for the intranet crawl, using file:/// as the hierarchy 
rather than http://localhost/somelinktosamefiles/ results in no OPIC 
scoring and no anchor text. You also have to set 
db.ignore.internal.links to false, since with a single host (or the 
file: scheme) every link counts as internal and would otherwise be 
dropped.
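
That override goes in conf/nutch-site.xml; the description text here is 
mine:

    <property>
      <name>db.ignore.internal.links</name>
      <value>false</value>
      <description>Keep same-host links. With file:/// or http://localhost/
      every link is "internal", so ignoring them kills anchor text and
      OPIC score propagation.</description>
    </property>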

My initial thought was that crawling the file system should be faster 
than pulling the same files through http://localhost/. Examining the 
hadoop log, the fetch times for the same set of 889 pages are 5 and 7 
seconds respectively. Same machine, and the http://localhost/ run had 
the potential advantage that the files were already in the OS cache 
from the file:/// run.
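
Both runs were plain one-shot crawls, along these lines (seed dirs and 
depth are illustrative; the runs differed only in the seeds):

    bin/nutch crawl urls-file -dir crawl-file -depth 2
    bin/nutch crawl urls-http -dir crawl-http -depth 2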

Any explanations from anyone? Comments?

Cheers,
  Winton