Posted to dev@nutch.apache.org by David Podunavac <da...@wyona.com> on 2006/08/25 16:26:29 UTC

reading crawl dir from nutch-default.xml

Hi

I think this patch will make Nutch much easier to configure: the crawl dir
will be read from nutch-default.xml instead of being resolved as a path
relative to the directory the command was executed from.
So nutch-default.xml will have its
<property>
  <name>searcher.dir</name>
  <value>PATH_TO_CRAWL_DIR</value>
  <description>
and this value will be used as the crawl directory instead.
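
To illustrate the lookup described above, here is a minimal, self-contained sketch. It is not Nutch code: java.util.Properties stands in for the Nutch Configuration object, the class and method names are hypothetical, and the path /data/nutch/crawl is made up. The fallback to a "crawl-" + timestamp name when searcher.dir is unset is an assumption added for illustration (the timestamp format is also assumed); the patch below does not include such a fallback.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Properties;

// Hypothetical helper, not part of Nutch. Properties stands in for Configuration.
final class CrawlDirResolver {

  // Returns the configured searcher.dir, or a timestamped default if it is unset or blank.
  static String resolveCrawlDir(Properties conf) {
    String configured = conf.getProperty("searcher.dir");
    if (configured == null || configured.trim().isEmpty()) {
      // Assumed fallback: a "crawl-<timestamp>" directory name.
      return "crawl-" + new SimpleDateFormat("yyyyMMddHHmmss").format(new Date());
    }
    return configured.trim();
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty("searcher.dir", "/data/nutch/crawl"); // made-up path
    System.out.println(resolveCrawlDir(conf));             // configured value
    System.out.println(resolveCrawlDir(new Properties())); // timestamped fallback
  }
}
```

Keeping a fallback like this would mean an unset or empty searcher.dir does not turn into a null path.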

Index: nutch-0.8/src/java/org/apache/nutch/crawl/Crawl.java
===================================================================
--- nutch-0.8/src/java/org/apache/nutch/crawl/Crawl.java       (revision 436809)
+++ nutch-0.8/src/java/org/apache/nutch/crawl/Crawl.java       (working copy)
@@ -53,10 +53,12 @@

     Configuration conf = NutchConfiguration.create();
     conf.addDefaultResource("crawl-tool.xml");
+    conf.addDefaultResource("nutch-default.xml");
     JobConf job = new NutchJob(conf);

     Path rootUrlDir = null;
-    Path dir = new Path("crawl-" + getDate());
+    String path2crawlDir = conf.get("searcher.dir");
+    Path dir = new Path(path2crawlDir);
     int threads = job.getInt("fetcher.threads.fetch", 10);
     int depth = 5;
     int topN = Integer.MAX_VALUE;


And this patch makes CrawlDbReader find that crawl directory:

Index: nutch-0.8/src/java/org/apache/nutch/crawl/CrawlDbReader.java
===================================================================
--- nutch-0.8/src/java/org/apache/nutch/crawl/CrawlDbReader.java       (revision 436809)
+++ nutch-0.8/src/java/org/apache/nutch/crawl/CrawlDbReader.java       (working copy)
@@ -406,8 +406,10 @@
       return;
     }
     String param = null;
-    String crawlDb = args[0];
+    //String crawlDb = args[0];
     Configuration conf = NutchConfiguration.create();
+    conf.addDefaultResource("nutch-default.xml");
+    String crawlDb = conf.get("searcher.dir") + "/crawldb";
     for (int i = 1; i < args.length; i++) {
       if (args[i].equals("-stats")) {
         dbr.processStatJob(crawlDb, conf);
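
One thing to watch in the hunk above: crawlDb no longer comes from args[0], but the option loop still starts at index 1, so an option passed as the first argument (for example -stats) would be skipped. A possible refinement, sketched here with hypothetical names and no Nutch dependencies, is to treat a leading non-option argument as an optional override and otherwise fall back to searcher.dir + "/crawldb". This is only an illustration of the idea, not part of the patch.

```java
// Hypothetical helper, not part of Nutch: decides where the crawldb path comes
// from and at which index option parsing should start.
final class CrawlDbArgs {

  // True if the first argument looks like a positional path rather than an option.
  private static boolean hasPositionalDb(String[] args) {
    return args.length > 0 && !args[0].startsWith("-");
  }

  // Uses args[0] when it is a path, otherwise the configured searcher.dir + "/crawldb".
  static String resolveCrawlDb(String[] args, String searcherDir) {
    return hasPositionalDb(args) ? args[0] : searcherDir + "/crawldb";
  }

  // Index of the first option: 1 if a positional path was given, else 0.
  static int firstOptionIndex(String[] args) {
    return hasPositionalDb(args) ? 1 : 0;
  }

  public static void main(String[] args) {
    String[] example = {"-stats"};
    String searcherDir = "/data/nutch/crawl"; // made-up value of searcher.dir
    System.out.println(resolveCrawlDb(example, searcherDir));
    for (int i = firstOptionIndex(example); i < example.length; i++) {
      System.out.println("option: " + example[i]);
    }
  }
}
```

With this shape, existing invocations that pass the crawldb path explicitly would keep working, while invocations without it would pick up the configured directory.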



WDYT

thanks

David