Posted to user@nutch.apache.org by Winton Davies <wd...@cs.stanford.edu> on 2008/07/01 21:14:05 UTC
nutch crawl : file:/// vs http://localhost/
Soooooo...
Executive summary: funnel all files through a web server if you want
page weighting (OPIC scoring) and anchor text to be indexed and used.
I just ran some experiments after unsuccessfully trying to run
invertlinks on the segments built from file:///enwiki/. (I had
originally crawled with ignore-internal-links turned on.)
It seems that for an intranet crawl, using file:/// as the hierarchy
rather than http://localhost/somelinktosamefiles/ results in no OPIC
scoring and no anchor text. You also have to disable
<db.ignore.internal.links> by setting it to false.
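
For reference, a minimal sketch of that override as it would go in
conf/nutch-site.xml (the property name comes from the message above;
the description comment is my own gloss, not from the Nutch defaults):

```xml
<!-- conf/nutch-site.xml: keep links between pages on the same host
     (e.g. everything under http://localhost/) in the linkdb, which is
     what feeds OPIC scoring and anchor-text extraction. -->
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>
```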
My initial thought was that crawling the file system should be faster
than pulling the same files through http://localhost/. Examining the
Hadoop log, the times for the same set of 889 pages are 5 and 7
seconds respectively, on the same machine, with the http://localhost/
run having the potential advantage that the earlier file:/// crawl had
already pulled the files into the OS cache.
Any explanations from anyone? Comments?
Cheers,
Winton