Posted to dev@nutch.apache.org by Paul Baclace <pe...@baclace.net> on 2005/09/17 01:54:27 UTC

mapred patch for improved error message and some javadoc comments

Here is a patch for improving the error message that is displayed
when an intranet crawl commandline has a file instead of a directory
of files containing URLs.

The old error msg:
   java.io.IOException: No input files in: [Ljava.io.File;@c24c0

Obviously, the default toString() of an array says nothing useful: it prints only the array type and a hash code, not the paths.
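To illustrate why the old message is unhelpful, here is a minimal standalone sketch. It is not part of the patch: the class name ArrayMessageDemo, both helper methods, and the file name urls.txt are made up for the example. It also assumes Java 5 or later, since Arrays.toString() does not exist in earlier JDKs.

```java
import java.io.File;
import java.util.Arrays;

// Hypothetical demo class; not part of Nutch or of the patch.
class ArrayMessageDemo {

    /** Builds the message the way the old code did, relying on the array's own toString(). */
    static String oldMessage(File[] dirs) {
        // String concatenation calls Object.toString() on the array,
        // which yields the type descriptor plus a hash code.
        return "No input files in: " + dirs;
    }

    /** One possible alternative: print the array elements instead. */
    static String readableMessage(File[] dirs) {
        // Arrays.toString() calls toString() on each element (File prints its path).
        return "No input files in: " + Arrays.toString(dirs);
    }

    public static void main(String[] args) {
        File[] dirs = { new File("urls.txt") };  // hypothetical input path
        System.out.println(oldMessage(dirs));
        System.out.println(readableMessage(dirs));
    }
}
```

The patch itself takes a different route and prints the JobConf, so that the list of conf files is shown rather than the input paths.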

The new error msg:
   java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml , mapred-default.xml , \tmp\nutch\mapred\local\job_4oob2p.xml , nutch-site.xml

My aim here is to expose the order in which conf files are searched.  It's not ideal since it exposes a transient xml file (job_4oob2p.xml) and there is no mention of the particular property involved, but the property name is not known to the method producing the error msg.

However, emitting the sequence of conf files can help a lot when the map in one's mind does not match the territory.  I'm particularly keen on config structure, as evidenced by the PropDiff tool and Java Tip I wrote to help administer WebLogic clusters:

   http://www.baclace.net/java/prop_diff/index.html

A similar tool for finding the difference and union of xml conf files might be handy.
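As a rough idea of what such a tool could look like, here is a small sketch. Everything in it is an assumption made for illustration: the class name ConfDiffSketch, its methods, and the premise that each conf file holds <property> elements with <name> and <value> children. It uses only the JDK's DOM parser, needs Java 7 or later, and is not an existing Nutch utility.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch; not part of Nutch.
class ConfDiffSketch {

    /** Reads name/value pairs from every <property> element (assumed file layout). */
    static Map<String, String> load(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new TreeMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element property = (Element) nodes.item(i);
                String name = childText(property, "name");
                if (name != null) {
                    props.put(name, childText(property, "value"));
                }
            }
            return props;
        } catch (Exception e) {
            throw new IllegalStateException("Cannot parse conf XML", e);
        }
    }

    /** Returns the trimmed text of the first child element with the given tag, or null. */
    private static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent().trim();
    }

    /** Union of two files; values from the second argument win on conflict. */
    static Map<String, String> union(Map<String, String> base, Map<String, String> override) {
        Map<String, String> merged = new TreeMap<>(base);
        merged.putAll(override);
        return merged;
    }

    /** Entries of b that are missing from a or carry a different value there. */
    static Map<String, String> diff(Map<String, String> a, Map<String, String> b) {
        Map<String, String> changed = new TreeMap<>();
        for (Map.Entry<String, String> e : b.entrySet()) {
            if (!a.containsKey(e.getKey()) || !Objects.equals(a.get(e.getKey()), e.getValue())) {
                changed.put(e.getKey(), e.getValue());
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for a defaults file and a site file.
        String defaults = "<conf><property><name>a</name><value>1</value></property>"
            + "<property><name>b</name><value>2</value></property></conf>";
        String site = "<conf><property><name>b</name><value>3</value></property>"
            + "<property><name>c</name><value>4</value></property></conf>";
        Map<String, String> d = load(defaults);
        Map<String, String> s = load(site);
        System.out.println("union: " + union(d, s));
        System.out.println("diff:  " + diff(d, s));
    }
}
```

Letting the second file win in union() mirrors the defaults-then-site override order described above. A real tool would read the files from disk in the order the conf object searches them.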

The patch follows.

Index: ./src/java/org/apache/nutch/mapred/InputFormatBase.java
===================================================================
--- ./src/java/org/apache/nutch/mapred/InputFormatBase.java	(revision 289645)
+++ ./src/java/org/apache/nutch/mapred/InputFormatBase.java	(working copy)
@@ -46,8 +46,17 @@
                                                 Reporter reporter)
      throws IOException;

-  /** Subclasses may override to, e.g., select only files matching a regular
-   * expression.*/
+  /** List input directories.
+   * Subclasses may override to, e.g., select only files matching a regular
+   * expression.
+   * Property mapred.input.subdir, if set, names a subdirectory that
+   * is appended to all input dirs specified by job, and if the given fs
+   * lists those too, each is added to the returned array of File.
+   * @param fs
+   * @param job
+   * @return array of File objects, never zero length.
+   * @throws IOException if zero items.
+   */
    protected File[] listFiles(NutchFileSystem fs, JobConf job)
      throws IOException {
      File[] dirs = job.getInputDirs();
@@ -73,7 +82,7 @@
      }

      if (result.size() == 0) {
-      throw new IOException("No input files in: "+job.getInputDirs());
+      throw new IOException("No input directories specified in: "+job);
      }
      return (File[])result.toArray(new File[result.size()]);
    }
Index: ./src/java/org/apache/nutch/util/NutchConf.java
===================================================================
--- ./src/java/org/apache/nutch/util/NutchConf.java	(revision 289645)
+++ ./src/java/org/apache/nutch/util/NutchConf.java	(working copy)
@@ -30,7 +30,8 @@
  import javax.xml.transform.stream.StreamResult;

  /** Provides access to Nutch configuration parameters.
- *
+ * <p>An ordered list of configuration parameter files with
+ * default and always-overrides site parameters.
   * <p>Default values for all parameters are specified in a file named
   * <tt>nutch-default.xml</tt> located on the classpath.  Overrides for these
   * defaults should be in an optional file named <tt>nutch-site.xml</tt>, also
@@ -36,8 +37,10 @@
   * defaults should be in an optional file named <tt>nutch-site.xml</tt>, also
   * located on the classpath.  Typically these files reside in the
   * <tt>conf/</tt> subdirectory at the top-level of a Nutch installation.
+ * <p>The resource files are read upon first access of values (set, get,
+ * or write) after {@link #addConfResource(String)} or
+ * {@link #addConfResource(File)}.
   */
-
  public class NutchConf {
    private static final Logger LOG =
      LogFormatter.getLogger("org.apache.nutch.util.NutchConf");
@@ -57,7 +60,7 @@
      resourceNames.add("nutch-site.xml");
    }

-  /** A new configuration with the same settings as another. */
+  /** A new configuration with the same settings cloned from another. */
    public NutchConf(NutchConf other) {
      this.resourceNames = (ArrayList)other.resourceNames.clone();
      if (other.properties != null)
@@ -394,6 +397,25 @@
      }
    }

+
+  public String toString() {
+    StringBuffer sb = new StringBuffer(resourceNames.size()*30);
+    sb.append("NutchConf: ");
+    ListIterator i = resourceNames.listIterator();
+    while (i.hasNext()) {
+      if (i.nextIndex() != 0) {
+        sb.append(" , ");
+      }
+      Object obj = i.next();
+      if (obj instanceof File) {
+        sb.append((File)obj);
+      } else {
+        sb.append((String)obj);
+      }
+    }
+    return sb.toString();
+  }
+
    /** For debugging.  List non-default properties to the terminal and exit. */
    public static void main(String[] args) throws Exception {
      get().write(System.out);

Re: mapred patch for improved error message and some javadoc comments

Posted by Doug Cutting <cu...@nutch.org>.
Paul Baclace wrote:
> Here is a patch for improving the error message that is displayed
> when an intranet crawl commandline has a file instead of a directory
> of files containing URLs.

I have committed this to the mapred branch.

Thanks, Paul!

Doug