Posted to dev@nutch.apache.org by Michael Gottesman <go...@reed.edu> on 2008/05/27 20:53:31 UTC

Patch Nutch -> Hadoop 0.17

Hello. I am currently developing a patch so that Nutch can be used as a 
job jar in a Hadoop 0.17 framework. The task turned out not to be that 
complicated: it mostly involves replacing calls to deprecated methods 
that were removed in Hadoop 0.17 and parameterizing certain methods and 
classes, as sketched further below, so the diff is not that long. If you 
could give me some advice/hints on the following, it would be much 
appreciated, since I could then finish the task and submit it to JIRA as 
a patch:

The build compiles, but two unit tests still fail and we cannot seem to 
find the cause. They are:

    * TestCrawlDbMerger.java
    * TestDeleteDuplicates.java

I have tracked down the failure in TestCrawlDbMerger to a difference in 
fetchTimes between Url10 and Url20: the actual value is consistently 
10 seconds behind the expected one.

I have not had as much of an opportunity to examine why 
TestDeleteDuplicates fails.

The diff of my changes is at this address <http://pastie.caboo.se/204167>.
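
To give a rough idea of the kind of parameterization involved, a reducer 
written against the generic mapred API ends up looking something like the 
sketch below (a simplified illustration only - ExampleReducer is made up 
and is not part of my diff):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;

// Illustrative only: the raw Reducer interface becomes Reducer<K2, V2, K3, V3>,
// so key/value types are declared once and the casts inside reduce() go away.
public class ExampleReducer extends MapReduceBase
    implements Reducer<Text, CrawlDatum, Text, CrawlDatum> {

  public void reduce(Text key, Iterator<CrawlDatum> values,
                     OutputCollector<Text, CrawlDatum> output, Reporter reporter)
      throws IOException {
    while (values.hasNext()) {
      // values.next() is already a CrawlDatum - no cast from Writable needed
      output.collect(key, values.next());
    }
  }
}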

Thank you so much in advance,

Michael

Re: Patch Nutch -> Hadoop 0.17

Posted by Andrzej Bialecki <ab...@getopt.org>.
Michael Gottesman wrote:
> Hello. I am currently developing a patch so that Nutch can be used as a 
> job jar in a Hadoop 0.17 framework. The task turned out not to be that 
> complicated: it mostly involves replacing calls to deprecated methods 
> that were removed in Hadoop 0.17 and parameterizing certain methods and 
> classes, so the diff is not that long. If you could give me some 
> advice/hints on the following, it would be much appreciated, since I 
> could then finish the task and submit it to JIRA as a patch:
> 
> The build compiles, but two unit tests still fail and we cannot seem to 
> find the cause. They are:
> 
>    * TestCrawlDbMerger.java
>    * TestDeleteDuplicates.java
> 
> I have tracked down the failure in TestCrawlDbMerger to a difference in 
> fetchTimes between Url10 and Url20: the actual value is consistently 
> 10 seconds behind the expected one.

This may be caused by an actual bug in CrawlDbMerger.Merger. Hadoop has 
a largely undocumented feature (which is otherwise very beneficial) 
where key and value instances are reused as much as possible. However, 
CrawlDbMerger:73 assigns the result reference to a CrawlDatum instance 
returned from the Iterator. Because of this reuse policy, Hadoop may 
later reuse that particular instance and fill it with another record's 
data, so the reference no longer points to the value it was meant to keep.
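
In other words, the problematic pattern looks roughly like this (an 
illustrative sketch only, not the actual CrawlDbMerger code - the 
pickNewest() helper is made up, and getFetchTime() just stands in for 
the real comparison):

// Sketch of the pitfall: keeping a reference to the object returned by
// values.next() is unsafe, because Hadoop may reuse that instance.
private CrawlDatum pickNewest(Iterator<CrawlDatum> values) {
  CrawlDatum best = null;
  while (values.hasNext()) {
    CrawlDatum val = values.next();  // may be the SAME object every iteration,
                                     // refilled with the next record's fields
    if (best == null || val.getFetchTime() > best.getFetchTime()) {
      // best = val;                 // unsafe: "best" would alias the reused object
      best = new CrawlDatum();       // safe: take a private copy instead
      best.set(val);
    }
  }
  return best;
}

Anything you want to keep across iterations has to be copied out of the 
object the Iterator hands you, which is what the patch below does.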

Please try this patch and see if it helps:


Index: src/java/org/apache/nutch/crawl/CrawlDbMerger.java
===================================================================
--- src/java/org/apache/nutch/crawl/CrawlDbMerger.java  (revision 638779)
+++ src/java/org/apache/nutch/crawl/CrawlDbMerger.java  (working copy)
@@ -64,13 +64,15 @@
 
      public void reduce(Text key, Iterator<CrawlDatum> values, OutputCollector<Text, CrawlDatum> output, Reporter reporter)
              throws IOException {
-      CrawlDatum res = null;
+      CrawlDatum res = new CrawlDatum();
+      boolean resSet = false;
        long resTime = 0L;
        meta.clear();
        while (values.hasNext()) {
          CrawlDatum val = values.next();
-         if (res == null) {
-          res = val;
+         if (!resSet) {
+          res.set(val);
+          resSet = true;
            resTime = schedule.calculateLastFetchTime(res);
            meta.putAll(res.getMetaData());
            continue;
@@ -80,12 +80,13 @@
          if (valTime > resTime) {
            // collect all metadata, newer values override older values
            meta.putAll(val.getMetaData());
-          res = val;
+          res.set(val);
            resTime = valTime ;
          } else {
            // insert older metadata before newer
            val.getMetaData().putAll(meta);
-          meta = val.getMetaData();
+          meta.clear();
+          meta.putAll(val.getMetaData());
          }
        }
        res.setMetaData(meta);


> The diff of my changes is at this address <http://pastie.caboo.se/204167>.

This patch skips the unpacking of the Hadoop scripts - IMHO this is a 
mistake. While Hadoop artifacts should not be packaged into the Nutch 
jar / job files, they still need to be unpacked to provide the means of 
running test jobs during development (or to run single-node messy 
installs ;) ).


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com