Posted to dev@nutch.apache.org by "Sebastian Nagel (Commented) (JIRA)" <ji...@apache.org> on 2012/01/25 17:42:44 UTC

[jira] [Commented] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

    [ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193115#comment-13193115 ] 

Sebastian Nagel commented on NUTCH-1113:
----------------------------------------

I had a look at the attached segment dumps: the merged data is far larger than the unmerged data.
And there are 739 identical linked CrawlDatum objects. Maybe this is an artifact of NUTCH-1252?

{quote}
it seems that all pages that:

A) Have already been fetched *AND
B) Are set as the location of a redirect in subsequent iterations through the crawl process

will be "lost" after a segment merge.
{quote}
I've run into the same situation.

{quote}
SegmentMerger makes the assumption that all values from the newest segment are preferable, so in this case, there's a crawl_fetch segment piece for this URL in two segments. In the first segment, it's marked Status 33 (fetch success) and in the second segment, it's marked Status 67 (linked), so the status 67 overwrites the status 33 crawl_fetch segment piece. From there, the URL data is excluded (correctly) from the index, because it's not marked as fetch success.
{quote}
I think this assumption is OK, but it is necessary to preserve more than just one CrawlDatum (the latest):
# at least the latest out of {FETCH_SUCCESS, FETCH_GONE, FETCH_RETRY, FETCH_REDIR*}
# possibly the latest FETCH_NOTMODIFIED (when re-indexing all segments, IndexerMapReduce does not index documents that have only a FETCH_NOTMODIFIED)
# possibly all linked CrawlDatums in crawl_fetch of the latest segment (similarly to those in crawl_parse)
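
To make the three rules above concrete, here is a rough sketch of the selection for the crawl_fetch entries of a single URL. It is not the SegmentMerger code and does not use Nutch classes: MergeSketch, Datum, Status, newestWins and select are hypothetical names, segmentTime is an assumed stand-in for segment ordering, the expansion of FETCH_REDIR* into a temporary and a permanent redirect value is an assumption, and "latest segment" is taken to mean the newest segment that has an entry for the URL.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: none of these types are Nutch classes.
final class MergeSketch {

    // Names follow the list above; splitting FETCH_REDIR* into two values is an assumption.
    enum Status {
        FETCH_SUCCESS, FETCH_GONE, FETCH_RETRY, FETCH_REDIR_TEMP, FETCH_REDIR_PERM,
        FETCH_NOTMODIFIED, LINKED
    }

    // One crawl_fetch entry for a single URL; a larger segmentTime means a newer segment.
    static final class Datum {
        final long segmentTime;
        final Status status;

        Datum(long segmentTime, Status status) {
            this.segmentTime = segmentTime;
            this.status = status;
        }
    }

    // Rule 1 statuses: the outcome of an actual fetch attempt.
    static boolean isFetchOutcome(Status s) {
        switch (s) {
            case FETCH_SUCCESS:
            case FETCH_GONE:
            case FETCH_RETRY:
            case FETCH_REDIR_TEMP:
            case FETCH_REDIR_PERM:
                return true;
            default:
                return false;
        }
    }

    // Returns whichever entry comes from the newer segment.
    static Datum newer(Datum current, Datum candidate) {
        return (current == null || candidate.segmentTime > current.segmentTime) ? candidate : current;
    }

    // Simplified model of the behavior described in the quote: only the newest entry survives.
    static Datum newestWins(List<Datum> entries) {
        Datum newest = null;
        for (Datum d : entries) {
            newest = newer(newest, d);
        }
        return newest;
    }

    // Proposed selection: latest fetch outcome, latest FETCH_NOTMODIFIED,
    // and all linked entries from the newest segment.
    static List<Datum> select(List<Datum> entries) {
        Datum latestOutcome = null;
        Datum latestNotModified = null;
        long newestSegment = Long.MIN_VALUE;
        for (Datum d : entries) {
            newestSegment = Math.max(newestSegment, d.segmentTime);
            if (isFetchOutcome(d.status)) {
                latestOutcome = newer(latestOutcome, d);
            } else if (d.status == Status.FETCH_NOTMODIFIED) {
                latestNotModified = newer(latestNotModified, d);
            }
        }
        List<Datum> kept = new ArrayList<>();
        if (latestOutcome != null) kept.add(latestOutcome);
        if (latestNotModified != null) kept.add(latestNotModified);
        for (Datum d : entries) {
            if (d.status == Status.LINKED && d.segmentTime == newestSegment) kept.add(d);
        }
        return kept;
    }
}
```

For the case from the quote (FETCH_SUCCESS in the first segment, linked in the second), newestWins in this model returns only the linked entry, while select keeps both, so the fetch success would still be available to the indexer.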
                
> Merging segments causes URLs to vanish from crawldb/index?
> ----------------------------------------------------------
>
>                 Key: NUTCH-1113
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1113
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.3
>            Reporter: Edward Drapkin
>             Fix For: 1.5
>
>         Attachments: merged_segment_output.txt, unmerged_segment_output.txt
>
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
> nutch generate crawldb/ segments/ -normalize
> nutch fetch `ls -d segments/* | tail -1`
> nutch parse `ls -d segments/* | tail -1`
> nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/ (I forward ported the lucene indexing code from Nutch 1.1).
> When I crawl with merging segments, I lose about 20% of the URLs that wind up in the index vs. when I crawl without merging the segments.  Somehow the segment merger causes me to lose ~20% of my crawl database!
