Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2014/02/18 11:03:20 UTC

[jira] [Commented] (NUTCH-1706) IndexerMapReduce does not remove db_redir_temp etc

    [ https://issues.apache.org/jira/browse/NUTCH-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13903916#comment-13903916 ] 

Sebastian Nagel commented on NUTCH-1706:
----------------------------------------

Hi [~markus17], point 2 is definitely a problem: in a sample crawl (seed: {{http://nutch.apache.org/}}), one of the 2 fetch_notmodified items is lost during indexing (test data attached).
{code}
# 1. index only "old" segments
% bin/nutch index -Ddummy.path=index2013.txt crawl/crawldb \
    crawl/segments/20131115203640/ \
    crawl/segments/20131115203847/ \
    -deleteGone

# 2. also include "new" segment containing refetches
% bin/nutch index -Ddummy.path=index2014.txt crawl/crawldb \
    crawl/segments/20131115203640/ \
    crawl/segments/20131115203847/ \
    crawl/segments/20140217140849/ \
    -deleteGone

# 3. since the "new" segment contains only "successful" refetches (of fetch_success or fetch_notmodified)
#    both indexes should contain exactly the same number of documents. But they do not!
% diff index2013.txt index2014.txt 
26d25
< add   http://tika.apache.org/
{code}
The second not-modified page ({{http://nutch.apache.org/}}) is indexed. Running the debugger showed that the order of values in the reduce function differs between the two pages, even in local mode. We should take this seriously and check whether we can guarantee that the newest values are always preferred (similar to SegmentMerger).

Nevertheless, a fetch_notmodified datum should never overwrite any other fetch datum. The attached patch adds this check back; apart from that, it is identical to [~markus17]'s patch.
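
As a rough illustration of that rule (this is not the attached patch), the selection could be sketched as below. {{Status}}, {{Datum}}, {{isPreferred}}, {{select}} and {{shouldDelete}} are hypothetical names for this sketch only, not Nutch classes or methods, and the choice of statuses that trigger a delete is an assumption. The idea is that the result must not depend on the order in which the reducer sees the values, and that the delete decision is made only after all values have been looked at.

```java
import java.util.List;

final class FetchDatumSelector {

    /** Hypothetical stand-ins for fetch statuses; not Nutch's CrawlDatum constants. */
    enum Status { FETCH_SUCCESS, FETCH_NOTMODIFIED, FETCH_GONE, FETCH_REDIR_TEMP }

    /** Hypothetical minimal datum: a status plus the time of the fetch. */
    static final class Datum {
        final Status status;
        final long fetchTime;

        Datum(Status status, long fetchTime) {
            this.status = status;
            this.fetchTime = fetchTime;
        }
    }

    /** True if candidate should replace current as the selected fetch datum. */
    static boolean isPreferred(Datum candidate, Datum current) {
        boolean candidateNotModified = candidate.status == Status.FETCH_NOTMODIFIED;
        boolean currentNotModified = current.status == Status.FETCH_NOTMODIFIED;
        if (candidateNotModified != currentNotModified) {
            // A not-modified datum never overwrites another fetch datum,
            // but any other fetch datum replaces a not-modified one.
            return currentNotModified;
        }
        // Same kind on both sides: prefer the newer fetch.
        return candidate.fetchTime > current.fetchTime;
    }

    /** Selects one fetch datum; the result does not depend on iteration order
     *  (apart from ties on fetchTime). */
    static Datum select(List<Datum> values) {
        Datum chosen = null;
        for (Datum d : values) {
            if (chosen == null || isPreferred(d, chosen)) {
                chosen = d;
            }
        }
        return chosen;
    }

    /** Delete decision, taken only after all values have been gathered.
     *  Which statuses count as "delete" is an assumption of this sketch. */
    static boolean shouldDelete(List<Datum> values) {
        Datum chosen = select(values);
        return chosen != null
            && (chosen.status == Status.FETCH_GONE
                || chosen.status == Status.FETCH_REDIR_TEMP);
    }
}
```

With an older fetch_success and a newer fetch_notmodified for the same URL, {{select}} returns the fetch_success datum in either order, so the page would be kept in the index rather than dropped.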

> IndexerMapReduce does not remove db_redir_temp etc
> --------------------------------------------------
>
>                 Key: NUTCH-1706
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1706
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.7
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Blocker
>             Fix For: 1.8
>
>         Attachments: NUTCH-1706-trunk-v2.patch, NUTCH-1706-trunk.patch, nutch-1706-testdata.tgz
>
>
> The code path in IndexerMapReduce is wrong: the delete code should run after all reducer values have been gathered.


