Posted to dev@nutch.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/03/02 05:24:18 UTC

[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

    [ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175004#comment-15175004 ] 

ASF GitHub Bot commented on NUTCH-2184:
---------------------------------------

GitHub user lewismc opened a pull request:

    https://github.com/apache/nutch/pull/95

    NUTCH-2184 Enable IndexingJob to function with no crawldb

    OK folks, this issue addresses https://issues.apache.org/jira/browse/NUTCH-2184 by
     * rebasing the [NUTCH-2184v2.patch](https://issues.apache.org/jira/secure/attachment/12784260/NUTCH-2184v2.patch) against master branch
     * making the IndexerMapReduceMapper and IndexerMapReduceReducer in IndexerMapReduce code explicit so that these functions can be tested
     * adding in some mrunit tests for testing the IndexerMapReduceMapper and IndexerMapReduceReducer
     * removing some trivial imports which are unused
     * formatting ivy.xml which has somehow (again) become a dog's dinner
     * adding default constructor to NutchIndexAction()
    
    If you have any questions, please let me know. I would really appreciate it if people could pull this code and try it out in their test or local environment.
    Thanks, also thanks Markus for the original suggestions for tests, etc.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lewismc/nutch NUTCH-2184

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/95.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #95
    
----
commit c4429eb7e4a33fc619cea5e5d6c26f54969e4f55
Author: Lewis John McGibbney <le...@jpl.nasa.gov>
Date:   2016-03-02T04:21:52Z

    NUTCH-2184 Enable IndexingJob to function with no crawldb

----


> Enable IndexingJob to function with no crawldb
> ----------------------------------------------
>
>                 Key: NUTCH-2184
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2184
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.12
>
>         Attachments: NUTCH-2184.patch, NUTCH-2184v2.patch
>
>
> Sometimes when working with distributed team(s), we have found that we can 'lose' data structures which are currently considered critical, e.g. crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no accompanying crawldb or linkdb. 
> Absence of the latter is OK as linkdb is optional; however, crawldb is currently mandatory in [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java].
> This ticket should enhance the IndexerMapReduce code to support the use case where you ONLY have segments and want to force an index for every record present.
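
The use case described above can be illustrated with a minimal, self-contained sketch. It is not the Nutch code and does not use any Nutch classes: `SegmentOnlyIndexingSketch`, `UrlRecord`, `shouldIndex` and the `crawlDbSupplied` flag are all hypothetical names, chosen only to show the idea that a crawldb entry is demanded only when a crawldb was actually supplied, while segment data stays mandatory.

```java
import java.util.Optional;

public class SegmentOnlyIndexingSketch {

    /** Hypothetical stand-in for the per-URL values an indexing reducer might collect. */
    static final class UrlRecord {
        final Optional<String> dbStatus; // from the crawldb; empty when no crawldb entry exists
        final boolean fetchSucceeded;    // from the segment
        final String parseText;          // from the segment; null if the page was never parsed

        UrlRecord(Optional<String> dbStatus, boolean fetchSucceeded, String parseText) {
            this.dbStatus = dbStatus;
            this.fetchSucceeded = fetchSucceeded;
            this.parseText = parseText;
        }
    }

    /** Decides whether a record should be sent to the index (assumed rule, for illustration). */
    static boolean shouldIndex(UrlRecord record, boolean crawlDbSupplied) {
        // Segment data is always required.
        if (!record.fetchSucceeded || record.parseText == null) {
            return false;
        }
        // Only insist on a crawldb entry when a crawldb was supplied to the job.
        if (crawlDbSupplied && !record.dbStatus.isPresent()) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        UrlRecord segmentOnly = new UrlRecord(Optional.empty(), true, "some parsed text");
        System.out.println(shouldIndex(segmentOnly, true));  // prints "false"
        System.out.println(shouldIndex(segmentOnly, false)); // prints "true"
    }
}
```

With a mandatory crawldb the segment-only record is dropped; once the crawldb is treated as optional, the same record is indexed.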



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)