Posted to user@nutch.apache.org by Rick Moynihan <ri...@calicojack.co.uk> on 2008/08/07 12:07:02 UTC

Nutch is resistant to automated testing

I first posted this to Nutch-Dev, but had no response; so I'm reposting 
it here.  If you've already seen it, apologies for the dupe.

Hi all,

A colleague I have been working with has developed a plugin to index
content with Nutch.  Though it does the job admirably, the complexity
and design of Nutch have proven resistant to easily writing automated
tests for this component.

I'm desperately trying to write some JUnit unit/integration tests for
this component; however, Nutch doesn't make this simple enough, and I
fear this, amongst other things, is a barrier to Nutch adoption.

What I want to do is:

- Set up a Jetty server within the test with the content I want to index
(easy enough with CrawlDBTestUtil; see the sketch after this list)
- Configure a crawl (i.e. fetch, index, merge, dedup, etc.) and
override the configuration with my plugin and configuration.
- Store the index (preferably in memory, but on disk is OK).
- Assert that particular searches return items, etc.
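
For the first two items, something like this is the level of ceremony
involved (a minimal sketch using the CrawlDBTestUtil helpers; the port
and paths are illustrative):

// Sketch only: serve test content over HTTP and write a seed list.
// Assumes imports from org.apache.hadoop.fs, org.apache.nutch.crawl
// and org.mortbay.jetty.
Configuration conf = CrawlDBTestUtil.createConfiguration();
FileSystem fs = FileSystem.get(conf);
Path urlPath = new Path("build/test/test-site/urls");

Server server = CrawlDBTestUtil.getServer(
    conf.getInt("content.server.port", 50000),
    "build/test/data/test-site");
server.start();

List<String> urls = new ArrayList<String>();
urls.add("http://127.0.0.1:"
    + conf.getInt("content.server.port", 50000) + "/document-a");
CrawlDBTestUtil.generateSeedList(fs, urlPath, urls);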


At first I thought this would be a simple matter of using
CrawlDBTestUtil to establish the server side, then using
org.apache.nutch.crawl.Crawl to perform all the relevant steps resulting
in an index of the content, which I can then run assertions on via
NutchBean.

Ideally I'd like to create just one Configuration object, override the
settings as I wish, and then pass this object into Crawl and NutchBean
appropriately.

Sadly, however, org.apache.nutch.crawl.Crawl isn't really a class in any
reusable sense: it only has a static main method which performs all the
operations in batch.  This design makes the class hard to reuse within
the context of my test, which leaves me with the following options:

- Call the main method and pass it an ugly array of Strings to do what I
require (see the sketch after this list).  This is ugly also because of
assumptions underlying the design of this component (configuration files
on the classpath, etc.), and it allows little or no reuse of
configuration with other parts of the code (e.g. NutchBean).

- Copy/paste/modify Crawl into my test.  The code in Crawl recently
changed to account for Hadoop 0.17, so I don't really want to do this
only to find the API changes again.  Plus, I believe that tests should
be simple to read; explicitly performing 30 steps in order to test a
component isn't a good idea, as it loses the forest for the trees.
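
To illustrate the first option, the call looks something like this
(flags are from the crawl tool's usage message as I recall it, so treat
it as a sketch):

// Batch-style invocation: an ugly array of Strings, with the
// configuration picked up implicitly from the classpath.  Flags as I
// recall them from the usage message; they may differ in trunk.
Crawl.main(new String[] {
    "build/test/test-site/urls",  // seed url directory
    "-dir", "build/test/crawl",   // where crawldb/segments/index land
    "-depth", "1",
    "-topN", "10"
});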

CrawlDBTestUtil is a step in the right direction, but more work is
needed.  Is it possible to get this marked as a bug/feature-request and
fixed in time for 1.0?

Thanks again for your help.

R.




Re: Nutch is resistant to automated testing

Posted by Rick Moynihan <ri...@calicojack.co.uk>.
brainstorm wrote:
> I think that sharing some (preliminary/broken or working) code
> you're actually writing, along with:
> 
> 1) Expected results
> 2) Actual results
> 
> could be useful to start diagnosing your problem. IMHO, there's
> nothing more specific than the actual test code ;)
> 
> Regards,
> Roman

Hi Roman, thanks for your response.  Here is some code which I've copied 
and modified from the beginnings of the test case I started to write, 
but abandoned due to the complexity of the process.  I'm sure there are 
many mistakes in the code below; it was never finished, and I've made 
some minor mods in my mail client.  I abandoned it when I realised I was 
essentially rewriting Crawl inside my test.  If you scroll down further 
you can see a pseudocode TestCase that alludes to the kind of thing I'd 
like to see, to make life easier for developers extending Nutch!

First the code I wrote:

import java.io.IOException;
import java.util.ArrayList;

import junit.framework.TestCase;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.crawl.CrawlDBTestUtil;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.parse.ParseException;
import org.apache.nutch.protocol.ProtocolNotFound;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;
import org.mortbay.jetty.Server;

public class TestMyParserPlugin extends TestCase {

    // This system property is defined in ./src/plugin/build-plugin.xml
    private String fileSeparator = System.getProperty("file.separator");
    private String sampleDir = System.getProperty("test.data", ".");
    public static final Log LOG =
        LogFactory.getLog(TestMyParserPlugin.class.getName());
    final static Path testdir = new Path("build/test/test-site");
    Configuration conf;
    FileSystem fs;
    Path crawldbPath;
    Path segmentsPath;
    Path urlPath;
    Server server;

    /**
     * Default Constructor.
     *
     * @param name
     *          The name of this {@link TestCase}.
     */
    public TestMyParserPlugin(String name) {
        super(name);
        System.out.println("Constructing TestMyParserPlugin");
    }

    protected void setUp() throws Exception {
        System.out.println("setting Up");
        conf = CrawlDBTestUtil.createConfiguration();
        conf.set("plugin.includes", "myplugin|other-plugins|query-basic");
        fs = FileSystem.get(conf);
        fs.delete(testdir);
        urlPath = new Path(testdir, "urls");
        crawldbPath = new Path(testdir, "crawldb");
        segmentsPath = new Path(testdir, "segments");
        server = CrawlDBTestUtil.getServer(
            conf.getInt("content.server.port", 50000),
            "build/test/data/test-site");
        server.start();
        System.out.println("setUp!");
    }

    protected void tearDown() throws InterruptedException, IOException {
        server.stop();
        fs.delete(testdir);
        System.out.println("tornDown");
    }

    // Add a single URL to the list of local files to get.
    private void addUrl(ArrayList<String> urls, String page) {
        urls.add("http://127.0.0.1:"
            + server.getListeners()[0].getPort() + "/" + page);
        System.out.println("added Url: " + page);
    }

    /**
     * Test the full cycle of Parse, Index, Query against the test
     * documents.  Load the Parser, run the Indexer, and then do a Query
     * with some assertions against it that we know the results of.
     */
    public void testParseIndexQueryCycle() throws ProtocolNotFound,
            ParseException, IOException {
        System.out.println("testParseIndexQueryCycle");

        // generate seedlist
        ArrayList<String> urls = new ArrayList<String>();
        addUrl(urls, "document-a");
        addUrl(urls, "document-b");
        addUrl(urls, "document-c");
        addUrl(urls, "document-d");
        CrawlDBTestUtil.generateSeedList(fs, urlPath, urls);

        // inject
        Injector injector = new Injector(conf);
        injector.inject(crawldbPath, urlPath);

        // generate
        Generator g = new Generator(conf);
        Path generatedSegment = g.generate(crawldbPath, segmentsPath, 1,
            Long.MAX_VALUE, Long.MAX_VALUE, false, false);

        // fetch (with parsing enabled)
        conf.setBoolean("fetcher.parse", true);
        Fetcher fetcher = new Fetcher(conf);
        fetcher.fetch(generatedSegment, 1);

        NutchBean nb = new NutchBean(conf);

        System.out.println("About to query NutchBean");

        // run some queries to assert content was indexed properly and
        // is queryable
        Hits hits = nb.search(
            Query.parse("content only found in document-a", conf), 10);

        // pseudo code below...
        // code to assert specific documents were found etc...
        // maybe also code to assert specific values were indexed into
        // specific fields in the index.
        // assertHitsIncludes("document-a");

        hits = nb.search(
            Query.parse("content in document-a and document-b", conf), 10);

        // assertHitsIncludes(new String[] { "document-a", "document-b" });
    }
}


Note that I'm not entirely sure how the Crawl process works, so I'm not 
100% sure the code above recreates the behaviour of a Nutch crawl (I 
suspect it doesn't).  But as a plugin developer I shouldn't need to be 
concerned with these details.  Rather, I just want to operate at the 
level of abstraction my plugin requires.
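
For reference, these seem to be roughly the steps missing between my
fetch and the NutchBean query, based on my reading of Crawl.java; the
exact signatures may have drifted with the Hadoop 0.17 changes, so this
is a sketch rather than tested code (it continues from the fields and
variables in the test above):

// Post-fetch steps as Crawl appears to perform them: updatedb,
// invertlinks, index, dedup.  Signatures from my reading of the
// 0.9-era tools; they may not match trunk exactly.
Path linkdbPath = new Path(testdir, "linkdb");
Path indexesPath = new Path(testdir, "indexes");

new CrawlDb(conf).update(crawldbPath,
    new Path[] { generatedSegment }, true, true);       // updatedb
new LinkDb(conf).invert(linkdbPath,
    new Path[] { generatedSegment }, true, true, false); // invertlinks
new Indexer(conf).index(indexesPath, crawldbPath, linkdbPath,
    new Path[] { generatedSegment });                   // index
new DeleteDuplicates(conf).dedup(new Path[] { indexesPath }); // dedup
// ...plus a final IndexMerger.merge(...) to produce the single index
// that NutchBean reads.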

Compare the above with a hypothetical TestCase that assumes the 
presence of some minimal helpers and a NutchTestCase class extending 
JUnit's TestCase:

public class TestMyParserPlugin extends NutchTestCase {

    private Configuration conf;
    private Server server;
    private List<String> urls;

    protected void setUp() throws Exception {
        System.out.println("setting Up");

        conf = CrawlDBTestUtil.createConfigurationWithInMemoryIndex();
        conf.set("plugin.includes", "myplugin|other-plugins|query-basic");

        server = CrawlDBTestUtil.getServer(
            conf.getInt("content.server.port", 50000),
            "build/test/data/test-site");
        server.start();
        System.out.println("setUp!");

        // generate seedlist
        urls = new ArrayList<String>();
        addUrl(urls, "document-a");
        addUrl(urls, "document-b");
        addUrl(urls, "document-c");
        addUrl(urls, "document-d");
    }

    protected void tearDown() throws InterruptedException, IOException {
        server.stop();
    }

    // Add a single URL to the list of local files to get.
    private void addUrl(List<String> urls, String page) {
        urls.add(urlifyDocument(page));
        System.out.println("added Url: " + page);
    }

    private String urlifyDocument(String document) {
        return "http://127.0.0.1:"
            + server.getListeners()[0].getPort() + "/" + document;
    }

    public void testParseIndexQueryCycle() throws Exception {
        System.out.println("testParseIndexQueryCycle");

        Crawl crawler = new Crawl(conf);

        // perform crawl of urls with MyParserPlugin and generate an
        // in-memory index according to conf
        crawler.crawl(urls);

        // run some queries on the in-memory index
        NutchBean nb = new NutchBean(conf);

        Hits hits = nb.search(
            Query.parse("content only found in document-a", conf), 10);

        // handy assert methods included on super class NutchTestCase
        assertHitsIncludes("document-a", hits);

        hits = nb.search(
            Query.parse("content in document-a and document-b", conf), 10);
        assertHitsIncludes(new String[] { urlifyDocument("document-a"),
            urlifyDocument("document-b") }, hits);
        // check that the query only returned two results
        assertNumberOfHitsReturned(2, hits);
    }
}

I don't think we require major changes to support the above code, which 
is far clearer (and requires less knowledge of Nutch internals).

In the above code, I reuse org.apache.nutch.crawl.Crawl, passing it my 
Configuration object directly.  Currently this isn't easily done 
because, like all of the classes driven from the command line, Crawl 
consists primarily of a static main method and is hence not easily 
reusable.

Making each of these classes a proper object with a defined interface 
seems like a good idea, particularly as it would allow developers to 
more easily compose operations such as index merging with deduping, 
rather than relying on a custom shell script.  Arguably this is 
something which should be available in Nutch, as it's something almost 
everyone has to deal with.

I think implementing these tools as proper objects in their own right 
(with main methods delegating to the object) makes sense: it would 
enable this kind of functionality to be developed in Java, where it can 
be more easily tested, whilst exposing what seem to be essentially 
primitive (i.e. fundamental) index-management operations for further 
Java-side integration.
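
Something like this, purely hypothetical (the constructor and crawl 
method are invented names for illustration; only 
NutchConfiguration.create() is the real API):

import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

// Hypothetical refactoring: the batch main delegates to a reusable
// object that holds its own Configuration.
public class Crawl {
    private final Configuration conf;

    public Crawl(Configuration conf) {
        this.conf = conf;
    }

    // Runs inject/generate/fetch/updatedb/invertlinks/index/dedup/merge
    // against this.conf.  Body omitted: it is the logic currently
    // inlined in the static main.
    public void crawl(List<String> seedUrls) throws Exception {
        // ...
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        // Illustrative argument handling only: treat args as seed urls.
        new Crawl(conf).crawl(Arrays.asList(args));
    }
}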

Anyway, you might want to take my suggestions with a pinch of salt, as I 
don't know too much about the details underlying Nutch; I just feel that 
it could do more to make things easier for those getting started.  I 
hope the Nutch community takes this as fair criticism: Nutch/Hadoop 
solve a lot of hard problems, but there is room to make life easier for 
those trying to build systems on top of them.

In addition, a NutchTestCase class that extended JUnit's TestCase with a 
suite of assertion methods specific to Nutch would make things a little 
nicer.  I've hinted at a few in the source code above; perhaps we could 
think up some more.
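
For example, assertHitsIncludes might look something like this (the 
helper itself and the bean field on NutchTestCase are hypothetical; 
getDetails and getValue are from the org.apache.nutch.searcher API as I 
understand it):

// Hypothetical NutchTestCase helper.  Assumes the superclass holds a
// NutchBean in a 'bean' field.  Passes if any returned hit's URL
// contains the given fragment.
protected void assertHitsIncludes(String urlFragment, Hits hits)
        throws IOException {
    for (int i = 0; i < hits.getLength(); i++) {
        HitDetails details = bean.getDetails(hits.getHit(i));
        if (details.getValue("url").indexOf(urlFragment) >= 0) {
            return;
        }
    }
    fail("no hit matching '" + urlFragment + "' amongst "
        + hits.getLength() + " hits");
}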

Do these ideas seem worthwhile?  Are they something we could get onto 
the development agenda?

Thanks again,

R.




Re: Nutch is resistant to automated testing

Posted by brainstorm <br...@gmail.com>.
I think that sharing some (preliminary/broken or working) code
you're actually writing, along with:

1) Expected results
2) Actual results

could be useful to start diagnosing your problem. IMHO, there's
nothing more specific than the actual test code ;)

Regards,
Roman
