Posted to dev@nutch.apache.org by Rick Moynihan <ri...@calicojack.co.uk> on 2008/08/04 19:33:07 UTC

Nutch is resistant to automated testing

Hi all,

A colleague I have been working with has developed a plugin to index 
content with Nutch.  Though it does the job admirably, the complexity 
and design of Nutch have proven resistant to writing automated tests 
for this component.

I'm desperately trying to write some JUnit unit/integration tests for 
this component, but Nutch doesn't make this simple, and I fear that 
this, amongst other things, is a barrier to Nutch adoption.

What I want to do is:

- Set up a Jetty server within the test with the content I want to index 
(easy enough with CrawlDBTestUtil).
- Configure a crawl (i.e. fetch, index, merge, dedup etc...) and 
override the default configuration with my plugin and settings.
- Store the index (preferably in memory, but on disk is OK).
- Assert that particular searches return the expected items etc...


At first I thought this would be a simple matter of using 
CrawlDBTestUtil to establish the server side, then using
org.apache.nutch.crawl.Crawl to perform all the relevant steps resulting 
in an index of the content, which I can then run assertions on via 
NutchBean.

Ideally I'd like to create just one Configuration object, override the 
settings as I wish, and then pass this object into Crawl and NutchBean 
appropriately.
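
As a rough illustration of that shape, here is a minimal sketch in plain 
Java.  Everything in it is a hypothetical stand-in and not Nutch API: 
Config plays the role of the Configuration object, CrawlDriver the role 
of an instance-based Crawl, Searcher the role of NutchBean, and the 
"sketch.index.lowercase" key is invented for the example.  The only 
point is that one configuration object is built once and handed to both 
the crawl side and the search side.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-ins only; none of these types exist in Nutch.
class SharedConfigSketch {

    /** Stand-in for a Configuration object: a bag of key/value overrides. */
    static final class Config {
        private final Map<String, String> values = new HashMap<>();

        Config set(String key, String value) {
            values.put(key, value);
            return this;
        }

        String get(String key, String fallback) {
            return values.getOrDefault(key, fallback);
        }
    }

    /** Stand-in for an instance-based crawl driver (no static main). */
    static final class CrawlDriver {
        private final Config conf;

        CrawlDriver(Config conf) {
            this.conf = conf;
        }

        /** "Crawls" a url -> content map and returns a toy in-memory index. */
        Map<String, String> run(Map<String, String> pages) {
            boolean lower = Boolean.parseBoolean(conf.get("sketch.index.lowercase", "false"));
            Map<String, String> index = new LinkedHashMap<>();
            for (Map.Entry<String, String> page : pages.entrySet()) {
                String text = page.getValue();
                index.put(page.getKey(), lower ? text.toLowerCase(Locale.ROOT) : text);
            }
            return index;
        }
    }

    /** Stand-in for the search side, built from the same Config instance. */
    static final class Searcher {
        private final Config conf;
        private final Map<String, String> index;

        Searcher(Config conf, Map<String, String> index) {
            this.conf = conf;
            this.index = index;
        }

        /** Returns the URLs whose indexed text contains the term. */
        List<String> search(String term) {
            boolean lower = Boolean.parseBoolean(conf.get("sketch.index.lowercase", "false"));
            String needle = lower ? term.toLowerCase(Locale.ROOT) : term;
            List<String> hits = new ArrayList<>();
            for (Map.Entry<String, String> doc : index.entrySet()) {
                if (doc.getValue().contains(needle)) {
                    hits.add(doc.getKey());
                }
            }
            return hits;
        }
    }

    public static void main(String[] args) {
        // One configuration object, overridden once, shared by both sides.
        Config conf = new Config().set("sketch.index.lowercase", "true");

        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("http://localhost/a.html", "Nutch plugin test page");
        pages.put("http://localhost/b.html", "Unrelated content");

        Map<String, String> index = new CrawlDriver(conf).run(pages);
        System.out.println(new Searcher(conf, index).search("PLUGIN"));
    }
}
```

With a driver of this kind, a JUnit test would reduce to building the 
configuration, running the crawl, and asserting on the search results.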

Sadly, however, org.apache.nutch.crawl.Crawl isn't really a reusable 
class: it has only a static main method, which performs all the 
operations in one batch.  This design makes the class hard to reuse 
within my test, and leaves me with the following options:

- Call the main method and pass it an ugly array of Strings to do what I 
require.  This is made uglier by assumptions underlying the design of 
this component (configuration files on the classpath etc...).  It also 
allows little or no reuse of configuration with other parts of the code 
(e.g. NutchBean).

- Copy/Paste/Modify Crawl into my test.  The code in Crawl recently 
changed to account for Hadoop 0.17, so I don't really want to do this 
only to find that the API changes again.  Plus I believe that tests 
should be simple to read.  Explicitly performing 30 steps in order to 
test a component isn't a good idea, as the reader can no longer see the 
forest for the trees.

CrawlDBTestUtil is a step in the right direction, but more work is 
needed.  Is it possible to get this marked as a bug/feature-request and 
fixed in time for 1.0?

Thanks again for your help.

R.