Posted to dev@nutch.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/05/27 19:43:12 UTC

[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

    [ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304651#comment-15304651 ] 

ASF GitHub Bot commented on NUTCH-2184:
---------------------------------------

Github user naegelejd commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/95#discussion_r64957448
  
    --- Diff: src/java/org/apache/nutch/indexer/IndexingJob.java ---
    @@ -155,43 +161,146 @@ public void index(Path crawlDb, Path linkDb, List<Path> segments,
                 counter.getName());
           }
           long end = System.currentTimeMillis();
    -      LOG.info("Indexer: finished at " + sdf.format(end) + ", elapsed: "
    -          + TimingUtil.elapsedTime(start, end));
    +      LOG.info("Indexer: finished at {}, elapsed: {}", sdf.format(end),
    +          TimingUtil.elapsedTime(start, end));
         } finally {
           FileSystem.get(job).delete(tmp, true);
         }
       }
     
       public int run(String[] args) throws Exception {
    -    if (args.length < 2) {
    -      System.err
    -      //.println("Usage: Indexer <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]");
    -      .println("Usage: Indexer <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize] [-addBinaryContent] [-base64]");
    -      IndexWriters writers = new IndexWriters(getConf());
    -      System.err.println(writers.describe());
    -      return -1;
    -    }
    -
    -    final Path crawlDb = new Path(args[0]);
    -    Path linkDb = null;
    -
    -    final List<Path> segments = new ArrayList<Path>();
    -    String params = null;
    -
    -    boolean noCommit = false;
    -    boolean deleteGone = false;
    -    boolean filter = false;
    -    boolean normalize = false;
    -    boolean addBinaryContent = false;
    -    boolean base64 = false;
    +    // boolean options
    +    Option helpOpt = new Option("h", "help", false, "show this help message");
    +    // argument options
    +    @SuppressWarnings("static-access")
    +    Option crawldbOpt = OptionBuilder
    +    .withArgName("crawldb")
    +    .hasArg()
    +    .withDescription(
    +        "a crawldb directory to use with this tool (optional)")
    +    .create("crawldb");
    +    @SuppressWarnings("static-access")
    +    Option linkdbOpt = OptionBuilder
    +    .withArgName("linkdb")
    +    .hasArg()
    +    .withDescription(
    +        "a linkdb directory to use with this tool (optional)")
    +    .create("linkdb");
    +    @SuppressWarnings("static-access")
    +    Option paramsOpt = OptionBuilder
    +    .withArgName("params")
    +    .hasArg()
    +    .withDescription(
    +        "key value parameters to be used with this tool e.g. k1=v1&k2=v2... (optional)")
    +    .create("params");
    +    @SuppressWarnings("static-access")
    +    Option segOpt = OptionBuilder
    +    .withArgName("segment")
    +    .hasArgs()
    +    .withDescription("the segment(s) to use (either this or --segmentDir is mandatory)")
    +    .create("segment");
    +    @SuppressWarnings("static-access")
    +    Option segmentDirOpt = OptionBuilder
    +    .withArgName("segmentDir")
    +    .hasArg()
    +    .withDescription(
    +        "directory containing one or more segments to be used with this tool "
    +            + "(either this or --segment is mandatory)")
    +    .create("segmentDir");
    +    @SuppressWarnings("static-access")
    +    Option noCommitOpt = OptionBuilder
    +    .withArgName("noCommit")
    +    .withDescription(
    +        "do the commits once and for all the reducers in one go (optional)")
    --- End diff --
    
    This description is backward: the "-noCommit" option tells the Indexer *not* to do a final commit after the job finishes.
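
To make the intended semantics concrete, here is a minimal, self-contained sketch. It is not the Nutch implementation: the class name `NoCommitSketch`, the helper `shouldCommit`, and the suggested description wording are all hypothetical and only illustrate that passing `-noCommit` should suppress the final commit rather than trigger one.

```java
import java.util.Arrays;

// Hypothetical sketch, not code from IndexingJob: all names here are illustrative.
class NoCommitSketch {

  // One possible corrected wording for the option description (an assumption,
  // not text taken from the pull request).
  static final String NO_COMMIT_DESCRIPTION =
      "do not commit to the index after the job finishes (optional)";

  // Returns true when a final commit should be issued, i.e. when the
  // -noCommit flag is absent from the arguments.
  static boolean shouldCommit(String[] args) {
    return !Arrays.asList(args).contains("-noCommit");
  }

  public static void main(String[] args) {
    System.out.println("-noCommit: " + NO_COMMIT_DESCRIPTION);
    System.out.println("commit after job: " + shouldCommit(args));
  }
}
```

Under this reading, the flag defaults to committing, and the description should say what the flag turns off.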


> Enable IndexingJob to function with no crawldb
> ----------------------------------------------
>
>                 Key: NUTCH-2184
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2184
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.13
>
>         Attachments: NUTCH-2184.patch, NUTCH-2184v2.patch
>
>
> Sometimes when working with distributed team(s), we have found that we can 'lose' data structures which are currently considered critical, e.g. crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no accompanying crawldb or linkdb.
> Absence of the latter is OK as linkdb is optional; however, crawldb is currently mandatory in [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java].
> This ticket should enhance the IndexerMapReduce code to support the use case where you ONLY have segments and want to force an index for every record present.


