Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2006/08/23 20:50:54 UTC

[Nutch Wiki] Update of "RunNutchInEclipse" by RenaudRichardet

The following page has been changed by RenaudRichardet:
http://wiki.apache.org/nutch/RunNutchInEclipse

New page:
= RunNutchInEclipse =

== Tested with ==
 * Nutch release 0.8
 * Eclipse 3.2
 * Java 1.4 (should work with 1.5, though)
 * Ubuntu (should work on most platforms, though)

== Warning ==

Setting up Nutch to run in Eclipse can be tricky, and most of the time it is much faster to edit Nutch in Eclipse but run the scripts from the command line (my 2 cents).
Debugging Nutch in Eclipse is very powerful, but again you might be quicker just looking at the logs (logs/hadoop.log).
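
When going the log route, a small filter can help pick out the interesting lines. The following is only a sketch: `LogScan` is a hypothetical helper (not part of Nutch), it assumes the log4j level appears in each line surrounded by spaces, and it uses a more recent Java than the 1.4 listed above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: keeps only the WARN/ERROR/FATAL lines of a log.
class LogScan {

    /** Returns the lines whose text contains a WARN, ERROR or FATAL level marker. */
    static List<String> problems(List<String> lines) {
        List<String> hits = new ArrayList<String>();
        for (String line : lines) {
            // Assumption: the level is printed with a space on each side, e.g. "... WARN  crawl.Crawl - ..."
            if (line.contains(" WARN ") || line.contains(" ERROR ") || line.contains(" FATAL ")) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the log file mentioned above, relative to the working directory.
        String path = args.length > 0 ? args[0] : "logs/hadoop.log";
        for (String hit : problems(Files.readAllLines(Paths.get(path)))) {
            System.out.println(hit);
        }
    }
}
```

Run it from the Nutch directory, or pass the path to the log file as the first argument.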

== Steps ==

=== Install Nutch ===
 * Grab a fresh release of Nutch 0.8 or make a fresh checkout of Nutch 0.8 from svn
 * Do not build Nutch now. Make sure you have no .project and .classpath files in the Nutch directory

=== Create a new java project in Eclipse ===
 * File > New > Project > Java project > click Next
 * select "Create project from existing source" and use the location where you downloaded Nutch
 * click on Next, and wait while Eclipse is scanning the folders
 * add the folder "conf" to the classpath (important)
 * set output dir to "build", create it if necessary
 * DO NOT add build to classpath
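
To verify that "conf" really ended up on the classpath, a quick check like the one below can be run as a Java Application from the new project. This is a minimal sketch: `ClasspathCheck` is a hypothetical helper (not part of Nutch), and the two file names are assumed to be the configuration files shipped in the conf folder.

```java
import java.net.URL;

// Hypothetical helper, not part of Nutch: reports whether a resource is visible on the classpath.
class ClasspathCheck {

    /** Returns the URL the resource was found at, or null if it is not on the classpath. */
    static URL locate(String resourceName) {
        return ClasspathCheck.class.getClassLoader().getResource(resourceName);
    }

    public static void main(String[] args) {
        // Assumption: these files live directly in Nutch's conf folder.
        String[] expected = { "nutch-default.xml", "nutch-site.xml" };
        for (String name : expected) {
            URL url = locate(name);
            System.out.println(name + " -> " + (url == null ? "NOT on classpath" : url.toString()));
        }
    }
}
```

If a file is reported as missing, the "conf" folder is most likely not on the project's classpath yet.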

=== Configure Nutch ===
 * see the [http://lucene.apache.org/nutch/tutorial8.html Tutorial]
 * make sure Nutch is configured correctly before testing it in Eclipse

=== Build Nutch ===
 * make sure that there are no files in the build/ dir. Delete them if necessary
 * right click on build.xml and select "Run as..." > "Ant build"
 * Eclipse will start to build Nutch. Check the progress in the Console

=== Create launcher ===
 * Menu Run > "Run..."
 * create "New" for "Java Application"
 * set in Main class
{{{
org.apache.nutch.crawl.Crawl
}}}
 * on tab Arguments, Program Arguments
{{{
urls -dir crawl -depth 3 -topN 50
}}}
 * in VM arguments
{{{
-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
}}}
 * click on "Run"
 * if all works, you should see Nutch getting busy crawling
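
For reference, the launch configuration above corresponds roughly to the following code. It is only a sketch: `CrawlLauncher` is a hypothetical wrapper (not part of Nutch), and it assumes that setting the two system properties before any logging is initialised has the same effect as passing them as VM arguments.

```java
import java.lang.reflect.Method;

// Hypothetical wrapper, not part of Nutch: mirrors the launch configuration in code.
class CrawlLauncher {

    /** The program arguments from the launcher, split on spaces. */
    static String[] programArguments() {
        return "urls -dir crawl -depth 3 -topN 50".split(" ");
    }

    public static void main(String[] args) throws Exception {
        // Counterpart of the -D VM arguments; assumed to run before logging starts up.
        System.setProperty("hadoop.log.dir", "logs");
        System.setProperty("hadoop.log.file", "hadoop.log");

        // Loaded reflectively so this file compiles without Nutch on the classpath;
        // Class.forName throws ClassNotFoundException if the Nutch classes are missing.
        Class<?> crawl = Class.forName("org.apache.nutch.crawl.Crawl");
        Method main = crawl.getMethod("main", String[].class);
        main.invoke(null, (Object) programArguments());
    }
}
```

The relative paths (urls, crawl, logs) resolve against the working directory of the launch, which in Eclipse is normally the project directory.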

== Debug Nutch in Eclipse ==
 * Set some breakpoints and debug a crawl
 * It can be tricky to find out where to set the breakpoints, because there's a lot of threading in Nutch and Hadoop. Here are a few good places to set breakpoints:
{{{
Fetcher [line: 371] - run
Fetcher [line: 438] - fetch
Fetcher$FetcherThread [line: 149] - run()
Generator [line: 281] - generate
Generator$Selector [line: 119] - map
OutlinkExtractor [line: 111] - getOutlinks
}}}

== If things do not work... ==
Yes, Nutch and Eclipse can be difficult companions sometimes ;-)

=== plugin dir not found ===
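Copy the plugins directory to a safe place after the initial build and reference the copy in the plugin.folders property (defined in conf/nutch-default.xml and normally overridden in conf/nutch-site.xml):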
{{{
<property>
  <name>plugin.folders</name>
  <value>/home/....../nutch-0.8/build_backup/plugins</value>
</property>
}}}
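
To confirm that the configured folder really contains plugins, a check along these lines may help. It is a sketch under assumptions: `PluginFolderCheck` is a hypothetical helper (not part of Nutch), and it treats every sub-directory holding a plugin.xml descriptor as one plugin.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: lists the plugins found below a plugin folder.
class PluginFolderCheck {

    /** Names of the sub-directories that contain a plugin.xml file. */
    static List<String> pluginNames(File pluginRoot) {
        List<String> names = new ArrayList<String>();
        File[] children = pluginRoot.listFiles(); // null if the path is missing or not a directory
        if (children == null) {
            return names;
        }
        for (File child : children) {
            if (new File(child, "plugin.xml").isFile()) {
                names.add(child.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: PluginFolderCheck <value of plugin.folders>");
            return;
        }
        List<String> names = pluginNames(new File(args[0]));
        System.out.println(names.size() + " plugin(s) found: " + names);
    }
}
```

Pass the same path that is used as the value of plugin.folders; a result of zero plugins suggests the path is wrong or the copy is incomplete.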

=== classNotFound ===
 * open the class itself, right-click
 * refresh the build dir
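
To see whether a class is visible to the project at all, a lookup like the following can be run from inside the same Eclipse project. Again only a sketch: `ClassLookup` is a hypothetical helper (not part of Nutch), and the default class name is simply the main class used in the launcher above.

```java
// Hypothetical helper, not part of Nutch: tests whether a class can be found on the classpath.
class ClassLookup {

    /** Returns true if the named class can be located by this class's class loader. */
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initialisers.
            Class.forName(className, false, ClassLookup.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "org.apache.nutch.crawl.Crawl";
        System.out.println(name + (isLoadable(name) ? " is loadable" : " was NOT found"));
    }
}
```

Pass the fully qualified name of the class from the error message as the first argument.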

Credits: RenaudRichardet