Posted to dev@nutch.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/05/27 19:43:12 UTC
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304651#comment-15304651 ]
ASF GitHub Bot commented on NUTCH-2184:
---------------------------------------
Github user naegelejd commented on a diff in the pull request:
https://github.com/apache/nutch/pull/95#discussion_r64957448
--- Diff: src/java/org/apache/nutch/indexer/IndexingJob.java ---
@@ -155,43 +161,146 @@ public void index(Path crawlDb, Path linkDb, List<Path> segments,
counter.getName());
}
long end = System.currentTimeMillis();
- LOG.info("Indexer: finished at " + sdf.format(end) + ", elapsed: "
- + TimingUtil.elapsedTime(start, end));
+ LOG.info("Indexer: finished at {}, elapsed: {}", sdf.format(end),
+ TimingUtil.elapsedTime(start, end));
} finally {
FileSystem.get(job).delete(tmp, true);
}
}
public int run(String[] args) throws Exception {
- if (args.length < 2) {
- System.err
- //.println("Usage: Indexer <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]");
- .println("Usage: Indexer <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize] [-addBinaryContent] [-base64]");
- IndexWriters writers = new IndexWriters(getConf());
- System.err.println(writers.describe());
- return -1;
- }
-
- final Path crawlDb = new Path(args[0]);
- Path linkDb = null;
-
- final List<Path> segments = new ArrayList<Path>();
- String params = null;
-
- boolean noCommit = false;
- boolean deleteGone = false;
- boolean filter = false;
- boolean normalize = false;
- boolean addBinaryContent = false;
- boolean base64 = false;
+ // boolean options
+ Option helpOpt = new Option("h", "help", false, "show this help message");
+ // argument options
+ @SuppressWarnings("static-access")
+ Option crawldbOpt = OptionBuilder
+ .withArgName("crawldb")
+ .hasArg()
+ .withDescription(
+ "a crawldb directory to use with this tool (optional)")
+ .create("crawldb");
+ @SuppressWarnings("static-access")
+ Option linkdbOpt = OptionBuilder
+ .withArgName("linkdb")
+ .hasArg()
+ .withDescription(
+ "a linkdb directory to use with this tool (optional)")
+ .create("linkdb");
+ @SuppressWarnings("static-access")
+ Option paramsOpt = OptionBuilder
+ .withArgName("params")
+ .hasArg()
+ .withDescription(
+ "key value parameters to be used with this tool e.g. k1=v1&k2=v2... (optional)")
+ .create("params");
+ @SuppressWarnings("static-access")
+ Option segOpt = OptionBuilder
+ .withArgName("segment")
+ .hasArgs()
+ .withDescription("the segment(s) to use (either this or --segmentDir is mandatory)")
+ .create("segment");
+ @SuppressWarnings("static-access")
+ Option segmentDirOpt = OptionBuilder
+ .withArgName("segmentDir")
+ .hasArg()
+ .withDescription(
+ "directory containing one or more segments to be used with this tool "
+ + "(either this or --segment is mandatory)")
+ .create("segmentDir");
+ @SuppressWarnings("static-access")
+ Option noCommitOpt = OptionBuilder
+ .withArgName("noCommit")
+ .withDescription(
+ "do the commits once and for all the reducers in one go (optional)")
--- End diff --
This description is backward: the "-noCommit" option tells the Indexer *not* to do a final commit after the job finishes.
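A minimal sketch of the semantics described above, using only the standard library rather than the Commons CLI code in the diff. The class name, the method name, and the suggested description wording are all hypothetical and not taken from the pull request; the only thing assumed from the comment is that passing "-noCommit" means the final commit is skipped.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of IndexingJob.
class NoCommitSketch {

  // One possible corrected description (an assumption, not the PR's final text).
  static final String NO_COMMIT_DESCRIPTION =
      "do not send a final commit to the index writers after the job finishes (optional)";

  // Returns true when a final commit should happen, i.e. when -noCommit is absent.
  static boolean shouldCommit(String[] args) {
    return !Arrays.asList(args).contains("-noCommit");
  }

  public static void main(String[] args) {
    // Without the flag, the final commit is performed.
    System.out.println(shouldCommit(new String[] {"-segment", "seg1"}));
    // With the flag, the final commit is skipped.
    System.out.println(shouldCommit(new String[] {"-segment", "seg1", "-noCommit"}));
  }
}
```

The point of the sketch is only that the flag negates the default behaviour, so the help text should say what is turned off rather than what is done.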
> Enable IndexingJob to function with no crawldb
> ----------------------------------------------
>
> Key: NUTCH-2184
> URL: https://issues.apache.org/jira/browse/NUTCH-2184
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.13
>
> Attachments: NUTCH-2184.patch, NUTCH-2184v2.patch
>
>
> Sometimes when working with distributed team(s), we have found that we can 'lose' data structures which are currently considered critical, e.g. crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no accompanying crawldb or linkdb.
> Absence of the latter is OK as linkdb is optional; however, crawldb is currently mandatory in [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java].
> This ticket should enhance the IndexerMapReduce code to support the use case where you ONLY have segments and want to force an index for every record present.