Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2007/01/31 00:26:56 UTC

[Nutch Wiki] Update of "MonitoringNutchCrawls" by MikeBrzozowski

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by MikeBrzozowski:
http://wiki.apache.org/nutch/MonitoringNutchCrawls

New page:
= Monitoring Nutch Crawls =

So you've got Nutch all configured and turned it loose on your site, but your itchy trigger finger just needs to know how well it's working? Here are a couple of ways to keep an eye on your crawl:

== Monitoring network traffic ==

One way is to watch Nutch suck up your bandwidth as it crawls its way around. If you look at a graph of historical bandwidth usage, you should see it spike up and hold at a fairly consistent plateau, with a dip each time a segment completes (while Nutch is merging segments it doesn't use any bandwidth).

Some tools for this:
 * [http://www.ntop.org/overview.html ntop] (Linux, Windows) - A nifty program that gives you a Web-based history of your machine's bandwidth usage. Installation can be fiddly, and the website isn't terribly helpful with install instructions, so you might get lucky or you might not.
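If you'd rather not install anything, you can read a rough download rate straight from the kernel's interface counters. Here's a minimal sketch, assuming Linux (it reads /proc/net/dev) and an interface named eth0 -- adjust both to your setup:

```shell
#!/bin/sh
# Sketch: print the approximate download rate for an interface by sampling
# the received-bytes counter in /proc/net/dev at a fixed interval.
# Assumes Linux; interface name and interval are arguments (defaults shown).
IFACE=${1:-eth0}
INTERVAL=${2:-10}

# Extract the received-bytes counter for $IFACE from /proc/net/dev.
rx_bytes() {
  awk -v iface="$IFACE:" '$1 == iface {print $2}' /proc/net/dev
}

prev=`rx_bytes`
while :
do
  sleep "$INTERVAL"
  cur=`rx_bytes`
  echo "`expr \( $cur - $prev \) / $INTERVAL` bytes/sec on $IFACE"
  prev=$cur
done
```

While a crawl is fetching, the rate should sit near your plateau; when it drops to roughly zero for a while, Nutch is probably between segments.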

== Monitoring fetch statistics ==

Of course, the bandwidth alone doesn't tell the whole story. How many pages are you retrieving? How many failed?

Here's a quick little shell script to do this; I'm sure people can improve on it--if so, edit this page!

{{{
#!/bin/sh
# Report running fetch/failure counts from the crawl log every 60 seconds.
echo "Monitoring nohup.out crawl progress..."
while :
do
  echo "Tried `grep -c 'fetching' nohup.out` pages; `grep -c 'failed' nohup.out` failed."
  sleep 60
done
}}}

=== To run this script: ===
 1. Save this script as something like monitorCrawl.sh
 2. Run your preferred crawl script with nohup, like this: {{{nohup <nutch crawl command or script> &}}}
 3. By default, this will output to nohup.out in the working directory. From the same directory, run: {{{sh monitorCrawl.sh}}}

This will give you minute-by-minute stats on how many pages Nutch tried to fetch and how many failed with errors (e.g. 404, server unreachable).
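If you'd rather get a one-shot summary than a running loop, the same grep counts can be turned into a failure rate. A small sketch, assuming the log contains the same "fetching" and "failed" strings the monitor script looks for (the script name and percentage output are my own additions):

```shell
#!/bin/sh
# One-shot summary: count fetch attempts and failures in the crawl log
# and print the failure rate as an integer percentage.
LOG=${1:-nohup.out}
tried=`grep -c 'fetching' "$LOG"`
failed=`grep -c 'failed' "$LOG"`
if [ "$tried" -gt 0 ]; then
  rate=`expr $failed \* 100 / $tried`
else
  rate=0
fi
echo "Tried $tried pages; $failed failed ($rate%)."
```

Save it as something like crawlSummary.sh and run it from the crawl's working directory, or pass a log path as the first argument.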