Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2011/03/26 13:56:05 UTC

[jira] [Resolved] (NUTCH-971) IndexMerger produces indexes itself cannot merge anymore

     [ https://issues.apache.org/jira/browse/NUTCH-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-971.
---------------------------------

    Resolution: Won't Fix

Nutch 1.3 and 2.0 rely on SOLR for indexing and search. This patch deals with the legacy Lucene-based indexing and won't be applied to the code.
Nutch users are encouraged to migrate to SOLR for indexing, as this is what will be maintained in future versions of Nutch.
Your patch should still be useful for users who have to use 1.2 or older versions; thanks for sharing it.

> IndexMerger produces indexes itself cannot merge anymore
> --------------------------------------------------------
>
>                 Key: NUTCH-971
>                 URL: https://issues.apache.org/jira/browse/NUTCH-971
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.2
>            Reporter: Gabriele Kahlout
>            Priority: Minor
>              Labels: patch
>             Fix For: 1.3
>
>         Attachments: IndexMerger-part.diff
>
>
> Here's what I do:
> 1. index the fetched segs
> $ rm -r $new_indexes $temp_indexes
> $ bin/nutch index $new_indexes $it_crawldb crawl/linkdb crawl/segments/*
>  
> I examine the index with luke and it's as expected.
> 2. merge the new index with the previous
> $ bin/nutch merge $temp_indexes $new_indexes $indexes
> IndexMerger: starting at 2011-03-26 10:24:58
> IndexMerger: merging indexes to: crawl/temp_indexes
> Adding file:/Users/simpatico/nutch-1.2/crawl/new_indexes/part-00000
> IndexMerger: finished at 2011-03-26 10:24:59, elapsed: 00:00:01
> On the first iteration, when $indexes is empty, it works fine by essentially duplicating $new_indexes into $temp_indexes.
> But on the 2nd iteration, after I mv $temp_indexes $indexes[1], the merged index $temp_indexes contains only $new_indexes and nothing from $indexes, which indeed still contains the data from the previous iteration. That is, it doesn't merge.
> This unexpected merge behavior is NOT symmetric, i.e.
> $ bin/nutch merge $temp_indexes $indexes $new_indexes
> IndexMerger: starting at 2011-03-26 10:32:15
> IndexMerger: merging indexes to: crawl/temp_indexes
> Adding file:/Users/simpatico/nutch-1.2/crawl/new_indexes/part-00000
> IndexMerger: finished at 2011-03-26 10:32:16, elapsed: 00:00:01
> The moral of the story is that a merged index cannot be merged with another, i.e. bin/nutch merge is meant to merge only 2 indices generated with bin/nutch index (or solrindex, perhaps).
> The only difference I can tell between the 2 indices is that the merged index doesn't contain the file index_done (and a hidden companion), but adding those to the merged index before merging it again doesn't solve the problem either.
> The workaround to make the merged index equivalent to an index generated by bin/nutch index seems to be putting it in a "part" subdirectory:
> $ bin/nutch merge crawl/temp_indexes/part-1 crawl/indexes crawl/new_indexes
> IndexMerger: starting at 2011-03-26 11:18:10
> IndexMerger: merging indexes to: crawl/temp_indexes/part-1
> Adding file:/Users/simpatico/nutch-1.2/crawl/indexes/part-1
> Adding file:/Users/simpatico/nutch-1.2/crawl/new_indexes/part-00000
> IndexMerger: finished at 2011-03-26 11:18:12, elapsed: 00:00:01
> Where was this documented? I'd recommend not documenting it, but rather having IndexMerger handle it, as in the attached patch.
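
For users staying on 1.2, the "part" subdirectory workaround described in the report could be wrapped in a small shell helper along these lines. This is only a sketch: the function name merge_iteration, the NUTCH variable, and the handling of a missing $indexes directory on the first iteration are assumptions made for illustration and are not part of the report or of the attached patch.

```shell
#!/bin/sh
# Hypothetical helper: one iteration of the index-then-merge loop from the report.
# NUTCH is an assumed override hook; it defaults to the bin/nutch script.
NUTCH="${NUTCH:-bin/nutch}"

merge_iteration() {
    indexes="$1"       # accumulated index from previous iterations
    new_indexes="$2"   # output of `bin/nutch index` for this iteration
    temp_indexes="$3"  # scratch location for the merged result

    rm -rf "$temp_indexes"
    if [ -d "$indexes" ]; then
        # Write the merged index into a "part" subdirectory so that the
        # result has the same layout as an index produced by `bin/nutch index`
        # and can itself be merged again on the next iteration.
        "$NUTCH" merge "$temp_indexes/part-1" "$indexes" "$new_indexes" || return 1
    else
        # First iteration (assumption): nothing to merge with yet.
        "$NUTCH" merge "$temp_indexes/part-1" "$new_indexes" || return 1
    fi

    # Replace the old accumulated index only after the merge succeeded.
    rm -rf "$indexes"
    mv "$temp_indexes" "$indexes"
}
```

It would be called once per crawl iteration, e.g. merge_iteration crawl/indexes crawl/new_indexes crawl/temp_indexes, after bin/nutch index has produced crawl/new_indexes.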

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira