Posted to user@nutch.apache.org by Carl Dorestos <ca...@gmail.com> on 2006/04/02 20:34:26 UTC

Merging Nutch crawls under 0.8-dev

Hi all,

I'd appreciate your help with this question. I am using Nutch/Hadoop 0.8 (of
3/31/06) with DFS. I want to merge multiple crawls and search the
combined content.

For example, I'd like to be able to:
- Crawl 1 million URLs into a directory crawlA (with directories segments,
crawldb, linkdb, indexes, index)
- Similarly, crawl a different 1 million URLs into a directory crawlB
- Then combine the contents or the indexes and be able to search the
contents of all 2 million URLs

I searched this list and found similar questions, but none of the answers
worked for me, as some of them were specific to pre-0.8 Nutch. I have tried
several things already.

I made a new directory crawl-all (with empty subdirectories segments,
crawldb, linkdb), then I copied crawlA/segments/<timestampA> and
crawlB/segments/<timestampB> into crawl-all/segments. Then I issued the
command bin/nutch index crawl-all/indexes crawl-all/linkdb crawl-all/crawldb
crawl-all/segments/<timestampA> crawl-all/segments/<timestampB>. What I got
is just an almost empty crawl-all/indexes directory (about 100 bytes in
all).
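
Roughly, that first attempt looks like the shell sketch below. It is only an
illustration under stated assumptions: the timestamps are made-up placeholders
for <timestampA> and <timestampB>, empty stand-in directories replace the real
crawls, plain local cp stands in for whatever copy is appropriate on DFS, and
the index command is only printed, not run, so no Nutch installation is needed.

```shell
# Scratch location and placeholder segment names (both hypothetical).
WORK="${WORK:-$(mktemp -d)}"
TS_A="${TS_A:-20060401000000}"   # stands in for <timestampA>
TS_B="${TS_B:-20060402000000}"   # stands in for <timestampB>

# Empty stand-ins for the two existing crawls, so the sketch is self-contained.
mkdir -p "$WORK/crawlA/segments/$TS_A" "$WORK/crawlB/segments/$TS_B"

# New combined directory with empty segments, crawldb and linkdb.
mkdir -p "$WORK/crawl-all/segments" "$WORK/crawl-all/crawldb" "$WORK/crawl-all/linkdb"

# Copy one segment from each crawl into crawl-all/segments.
cp -r "$WORK/crawlA/segments/$TS_A" "$WORK/crawl-all/segments/"
cp -r "$WORK/crawlB/segments/$TS_B" "$WORK/crawl-all/segments/"

# The index command exactly as issued in this message (printed, not executed).
INDEX_CMD="bin/nutch index crawl-all/indexes crawl-all/linkdb crawl-all/crawldb crawl-all/segments/$TS_A crawl-all/segments/$TS_B"
echo "$INDEX_CMD"
```

Note that crawl-all/crawldb and crawl-all/linkdb are still empty at the point
where the index command is issued.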

I also tried to index each segment separately into one common indexes
directory (crawl-all/indexes), but the second time I issued the index command
I got an error saying that the directory (crawl-all/indexes) already exists.

I am sure someone must have been able to merge the results of multiple
crawls using 0.8. I'd appreciate your help, and please provide details.

Thanks.

Carl