Posted to dev@nutch.apache.org by Lincoln Ritter <li...@lincolnritter.com> on 2008/06/12 00:25:48 UTC
SegmentMerger "no input paths" problem and "special files/directories"
Greetings,
I'm running nutch trunk with the patch for hadoop 0.17 from NUTCH-634
(http://issues.apache.org/jira/browse/NUTCH-634)
I've run into a problem merging segments:
$ ./bin/nutch mergesegs crawl/segments_merge -dir crawl/segments/
08/06/11 14:32:35 INFO segment.SegmentMerger: Merging 3 segments to
crawl/segments_merge/20080611143235
08/06/11 14:32:35 INFO segment.SegmentMerger: SegmentMerger: adding
hdfs://localhost:54310/user/lritter/crawl/segments/20080611135945
08/06/11 14:32:35 INFO segment.SegmentMerger: SegmentMerger: adding
hdfs://localhost:54310/user/lritter/crawl/segments/20080611141414
08/06/11 14:32:35 INFO segment.SegmentMerger: SegmentMerger: adding
hdfs://localhost:54310/user/lritter/crawl/segments/_logs
08/06/11 14:32:35 INFO segment.SegmentMerger: SegmentMerger: using
segment data from:
java.io.IOException: No input paths specified in input
at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:173)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:705)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:605)
at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:648)
This looks to be the same (or similar) issue as:
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg10999.html
In my case, the merger treats the '_logs' directory as valid fodder
for merging, which is clearly not the case. I'm assuming here that
underscore-prefixed names are "reserved" by Nutch.
With this assumption, I can make a filter that screens these out. I
have done this and attached a patch against trunk below.
While the patch fixes my immediate problem, it makes me a little
nervous that I'm designating underscore-prefixed entries as "special"
in a pretty ad hoc way. Is there a "real" way to determine whether or
not a directory contains segment data?
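For what it's worth, a content-based check might look something like
the sketch below. The subdirectory names (crawl_generate, crawl_fetch,
and so on) are my assumption of the standard segment parts, and I'm
using java.io.File instead of Hadoop's FileSystem just to keep the
sketch self-contained:

```java
import java.io.File;

public class SegmentDirCheck {

    // Assumed names of the subdirectories a Nutch segment normally
    // contains; treat this list as illustrative, not authoritative.
    static final String[] SEGMENT_PARTS = {
        "crawl_generate", "crawl_fetch", "crawl_parse",
        "content", "parse_data", "parse_text"
    };

    // A directory "looks like" a segment if it holds at least one of
    // the standard segment subdirectories.
    public static boolean looksLikeSegment(File dir) {
        if (!dir.isDirectory()) {
            return false;
        }
        for (String part : SEGMENT_PARTS) {
            if (new File(dir, part).isDirectory()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Build a fake segment and a fake _logs directory to compare.
        File tmp = new File(System.getProperty("java.io.tmpdir"), "segcheck");
        File seg = new File(tmp, "20080611135945");
        new File(seg, "crawl_generate").mkdirs();
        File logs = new File(tmp, "_logs");
        logs.mkdirs();

        System.out.println(looksLikeSegment(seg));   // true
        System.out.println(looksLikeSegment(logs));  // false
    }
}
```

The real filter would of course go through fs.getFileStatus() and
listPaths() as in the patch; the point is just that checking for the
expected segment parts avoids blessing the underscore convention.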
Thanks!
-lincoln
--
lincolnritter.com
--- PATCH ---
Index: src/java/org/apache/nutch/segment/SegmentMerger.java
===================================================================
--- src/java/org/apache/nutch/segment/SegmentMerger.java (revision 666871)
+++ src/java/org/apache/nutch/segment/SegmentMerger.java (working copy)
@@ -626,7 +626,7 @@
boolean normalize = false;
for (int i = 1; i < args.length; i++) {
if (args[i].equals("-dir")) {
-      Path[] files = fs.listPaths(new Path(args[++i]), HadoopFSUtil.getPassDirectoriesFilter(fs));
+      Path[] files = fs.listPaths(new Path(args[++i]), HadoopFSUtil.getPassNormalDirectoriesFilter(fs));
for (int j = 0; j < files.length; j++)
segs.add(files[j]);
} else if (args[i].equals("-filter")) {
Index: src/java/org/apache/nutch/util/HadoopFSUtil.java
===================================================================
--- src/java/org/apache/nutch/util/HadoopFSUtil.java (revision 666871)
+++ src/java/org/apache/nutch/util/HadoopFSUtil.java (working copy)
@@ -51,6 +51,23 @@
};
}
+
+  /**
+   * Returns a PathFilter that passes directories that are not "special" through.
+   */
+  public static PathFilter getPassNormalDirectoriesFilter(final FileSystem fs) {
+    return new PathFilter() {
+      public boolean accept(final Path path) {
+        try {
+          FileStatus status = fs.getFileStatus(path);
+          return status.isDir() && !status.getPath().getName().startsWith("_");
+        } catch (IOException ioe) {
+          return false;
+        }
+      }
+    };
+  }
/**
* Turns an array of FileStatus into an array of Paths.
--- END PATCH ---