Posted to dev@nutch.apache.org by "Gabriele Kahlout (JIRA)" <ji...@apache.org> on 2011/03/26 11:34:05 UTC

[jira] [Created] (NUTCH-971) IndexMerger produces indexes itself cannot merge anymore

IndexMerger produces indexes itself cannot merge anymore
--------------------------------------------------------

                 Key: NUTCH-971
                 URL: https://issues.apache.org/jira/browse/NUTCH-971
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 1.2
            Reporter: Gabriele Kahlout
            Priority: Minor
             Fix For: 1.3


Here's what I do:

1. index the fetched segs
$ rm -r $new_indexes $temp_indexes
$ bin/nutch index $new_indexes $it_crawldb crawl/linkdb crawl/segments/*
 
I examine the index with luke and it's as expected.

2. merge the new index with the previous
$ bin/nutch merge $temp_indexes $new_indexes $indexes
IndexMerger: starting at 2011-03-26 10:24:58
IndexMerger: merging indexes to: crawl/temp_indexes
Adding file:/Users/simpatico/nutch-1.2/crawl/new_indexes/part-00000
IndexMerger: finished at 2011-03-26 10:24:59, elapsed: 00:00:01

On the first iteration, when $indexes is empty, it works fine, essentially duplicating $new_indexes into $temp_indexes.
But on the 2nd iteration, after I mv $temp_indexes $indexes[1], the merged index $temp_indexes contains only $new_indexes and nothing from $indexes, which still contains the data from the previous iteration. That is, it doesn't merge.
This unexpected merge behavior is NOT symmetric, i.e.

$ bin/nutch merge $temp_indexes $indexes $new_indexes
IndexMerger: starting at 2011-03-26 10:32:15
IndexMerger: merging indexes to: crawl/temp_indexes
Adding file:/Users/simpatico/nutch-1.2/crawl/new_indexes/part-00000
IndexMerger: finished at 2011-03-26 10:32:16, elapsed: 00:00:01

The moral of the story is that a merged index cannot be merged with another, i.e. bin/nutch merge is meant to merge only 2 indexes generated with bin/nutch index (or solrindex, perhaps).
The only difference I can tell between the 2 indexes is that the merged index doesn't contain the file index_done (and a hidden companion), but adding those to the merged index before merging it again doesn't solve the problem either.

The workaround to make the merged index equivalent to an index generated by bin/nutch index seems to be putting it in a "part" subdirectory:

$ bin/nutch merge crawl/temp_indexes/part-1 crawl/indexes crawl/new_indexes
IndexMerger: starting at 2011-03-26 11:18:10
IndexMerger: merging indexes to: crawl/temp_indexes/part-1
Adding file:/Users/simpatico/nutch-1.2/crawl/indexes/part-1
Adding file:/Users/simpatico/nutch-1.2/crawl/new_indexes/part-00000
IndexMerger: finished at 2011-03-26 11:18:12, elapsed: 00:00:01

Where is this documented? Rather than documenting it, I'd recommend having IndexMerger handle it, as in the attached patch.
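
Until IndexMerger does this itself, the workaround can be scripted. What follows is only a rough shell sketch of the "put it in a part subdirectory" idea, not the attached patch: the helper name ensure_part_dir and the default name part-00000 are assumptions made for illustration, and neither is part of Nutch.

```shell
# Hypothetical helper (not part of Nutch): print an output path that ends
# in a "part-*" directory, appending a default one if it is missing.
ensure_part_dir() {
  out=${1%/}                      # drop a trailing slash, if any
  case "${out##*/}" in
    part-*) printf '%s\n' "$out" ;;             # already a part directory
    *)      printf '%s\n' "$out/part-00000" ;;  # assumed default part name
  esac
}

# Usage with the paths from this report (requires a Nutch 1.2 checkout):
#   bin/nutch merge "$(ensure_part_dir crawl/temp_indexes)" crawl/indexes crawl/new_indexes
```

The helper only rewrites the output path string; whether the resulting merged index can then be merged again is what the session above suggests, not something the sketch checks.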

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (NUTCH-971) IndexMerger produces indexes itself cannot merge anymore

Posted by "Gabriele Kahlout (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriele Kahlout updated NUTCH-971:
-----------------------------------

    Attachment: IndexMerger-part.diff

Checks if output path ends in a part dir and if not adds it.

> IndexMerger produces indexes itself cannot merge anymore
> --------------------------------------------------------
>
>                 Key: NUTCH-971
>                 URL: https://issues.apache.org/jira/browse/NUTCH-971
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.2
>            Reporter: Gabriele Kahlout
>            Priority: Minor
>              Labels: patch
>             Fix For: 1.3
>
>         Attachments: IndexMerger-part.diff
>

[jira] [Resolved] (NUTCH-971) IndexMerger produces indexes itself cannot merge anymore

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-971.
---------------------------------

    Resolution: Won't Fix

1.3 and 2.0 rely on SOLR for the indexing and search. This patch deals with the legacy Lucene-based indexing and won't be applied to the code.
Nutch-users are encouraged to migrate to SOLR for indexing as this will be maintained in future versions of Nutch.
Your patch should be useful for users who have to use 1.2 or older versions though, thanks for sharing it.


[jira] [Updated] (NUTCH-971) IndexMerger produces indexes itself cannot merge anymore

Posted by "Gabriele Kahlout (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriele Kahlout updated NUTCH-971:
-----------------------------------

    Attachment: IndexMerger-part.diff

Checks if the output index path ends with a part directory and if not adds one.


[jira] [Closed] (NUTCH-971) IndexMerger produces indexes itself cannot merge anymore

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche closed NUTCH-971.
-------------------------------



[jira] [Updated] (NUTCH-971) IndexMerger produces indexes itself cannot merge anymore

Posted by "Gabriele Kahlout (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriele Kahlout updated NUTCH-971:
-----------------------------------

    Comment: was deleted

(was: Checks if the output index path ends with a part directory and if not adds one.)


[jira] [Commented] (NUTCH-971) IndexMerger produces indexes itself cannot merge anymore

Posted by "Gabriele Kahlout (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011627#comment-13011627 ] 

Gabriele Kahlout commented on NUTCH-971:
----------------------------------------

I expect that if I install Solr and replace index with solrindex in step 1, the merge should work. Am I correct?


[jira] [Commented] (NUTCH-971) IndexMerger produces indexes itself cannot merge anymore

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011629#comment-13011629 ] 

Julien Nioche commented on NUTCH-971:
-------------------------------------

You will need to reindex your docs using the solrindex command. There is no need for merging the indices as SOLR will do that by updating its existing index. 


[jira] [Updated] (NUTCH-971) IndexMerger produces indexes itself cannot merge anymore

Posted by "Gabriele Kahlout (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriele Kahlout updated NUTCH-971:
-----------------------------------

    Attachment:     (was: IndexMerger-part.diff)
