Posted to user@nutch.apache.org by Dan Kinder <dk...@gmail.com> on 2014/06/16 23:31:05 UTC

Clarifications regarding re-crawl and Nutch2 storage

Hi there,

My company currently runs a full-web crawler (focused on written content, including text extracted from PDFs, Word docs, etc., to support our product). It's fully proprietary (including the indexing solution) and fairly old.

We're looking at potentially upgrading, and I've been reading quite a bit about Nutch. It seems promising, but there are some questions I've had trouble finding answers to in the existing wikis and blog posts. My apologies if I just haven't dug deep enough on these; feel free to point me to resources.

1) The Nutch examples generally update the link database, generate new segments, fetch them, and repeat. Can these steps run continuously and simultaneously, so that we're constantly using our crawl bandwidth? (I.e., is there any problem with generating new segments while fetches and db updates are still running?) I ask mainly because we want to keep the dataset as fresh as possible; most of the docs suggest a large crawl can take on the order of weeks, so a newly discovered link might not get indexed until the following cycle, a month or two after we grab or inject it.
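For concreteness, the cycle I mean is the batch loop from the 1.x tutorial, roughly as below (a rough sketch; the crawl/ paths and the -topN value are just placeholders, and I gather 2.x replaces segments with batch ids but has equivalent steps):

  bin/nutch inject crawl/crawldb urls/                         # seed / add new URLs
  bin/nutch generate crawl/crawldb crawl/segments -topN 50000  # pick the next batch
  SEGMENT=crawl/segments/$(ls -1 crawl/segments | tail -1)     # newest segment
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT                    # fold results back into the crawldb
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments       # refresh the link database

What I'd like to confirm is whether successive rounds of this loop can safely overlap, e.g. generating and fetching segment N+1 while segment N is still being parsed and folded back in.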

2) I see that Nutch 1 is tied to Hadoop as a backend, whereas Nutch 2 allows pluggable backends via Gora. Yet I'm getting the (possibly false) impression that HDFS/Hadoop is still somehow involved in Nutch 2 (there's still a crawlDir and such referenced here: http://wiki.apache.org/nutch/Nutch2Cassandra; FYI, we're most interested in a Cassandra backend right now). If that's true, how does Hadoop fit in? Is Hadoop/HDFS used for job distribution and intermediate data, while all permanent data lives in Cassandra?
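For reference, my (possibly wrong) reading of that wiki page is that the Cassandra backend is selected purely through configuration while the MapReduce jobs still run on Hadoop; something like this in conf/nutch-site.xml, plus a server address in conf/gora.properties:

  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.cassandra.store.CassandraStore</value>
  </property>

  # conf/gora.properties
  gora.cassandrastore.servers=localhost:9160

So my guess is "Hadoop handles the jobs and temporary data, Cassandra holds the web table" -- but I'd like to confirm that's the intended split.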

3) What is Nutch's behavior for non-200 HTTP status codes? More broadly, are there any controls over how often previously fetched links are retried (perhaps depending on their return code, whether they changed, pagerank/score, etc.) and how soon newly discovered links are fetched? My reading so far suggests that with the default 30-day fetch interval we would simply re-crawl every single link each cycle; if that's true, it seems like we'd often be re-fetching pages that haven't changed.
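For context, my impression from skimming nutch-default.xml is that the knobs in this area look roughly like the following (property names as I understand them; values are just illustrative), and that switching to the adaptive schedule is what makes the interval respond to whether a page actually changed:

  <!-- conf/nutch-site.xml -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>2592000</value> <!-- 30 days, the default re-fetch interval -->
  </property>
  <property>
    <name>db.fetch.schedule.class</name>
    <!-- default is org.apache.nutch.crawl.DefaultFetchSchedule -->
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
  <property>
    <name>db.fetch.retry.max</name>
    <value>3</value> <!-- as I understand it: retries for transient errors before a URL is marked gone -->
  </property>

Is that the right set of controls, or is there something finer-grained (per status code, per score) that I'm missing?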

Thanks!
-dan