Posted to user@nutch.apache.org by Kai_testing Middleton <ka...@yahoo.com> on 2007/07/16 22:51:19 UTC

four nutch merge commands: mergedb, mergesegs, mergelinkdb, merge

I've been reviewing the four different merge commands (as of nutch v0.9):

$ nutch | grep merg
  mergedb           merge crawldb-s, with optional filtering
  mergesegs         merge several segments, with optional filtering and slicing
  mergelinkdb       merge linkdb-s, with optional filtering
  merge             merge several segment indexes

Here are the javadocs:
mergedb -- http://lucene.apache.org/nutch/apidocs/org/apache/nutch/crawl/CrawlDbMerger.html
mergesegs -- http://lucene.apache.org/nutch/apidocs/org/apache/nutch/segment/SegmentMerger.html
mergelinkdb -- http://lucene.apache.org/nutch/apidocs/org/apache/nutch/crawl/LinkDbMerger.html
merge -- http://lucene.apache.org/nutch/apidocs/org/apache/nutch/indexer/IndexMerger.html

Naively: why are there four merge commands? Are some subsets of the others?  Are they used in conjunction? What are the usage scenarios of each?

I notice that Andrzej wrote the first three, and they have wiki entries (pretty much the same as the javadoc):
(I found these from http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg03588.html)
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergedb
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergelinkdb
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergesegs
It seems most of the nutch-user discussions I've seen so far relate to the simple merge command.  Are the first three "advanced commands"?

Re: four nutch merge commands: mergedb, mergesegs, mergelinkdb, merge

Posted by Doğacan Güney <do...@gmail.com>.
Hi

On 7/16/07, Kai_testing Middleton <ka...@yahoo.com> wrote:
> I've been reviewing the four different merge commands (as of nutch v0.9):
>
> $ nutch | grep merg
>   mergedb           merge crawldb-s, with optional filtering
>   mergesegs         merge several segments, with optional filtering and slicing
>   mergelinkdb       merge linkdb-s, with optional filtering
>   merge             merge several segment indexes
>
> Here are the javadocs:
> mergedb -- http://lucene.apache.org/nutch/apidocs/org/apache/nutch/crawl/CrawlDbMerger.html
> mergesegs -- http://lucene.apache.org/nutch/apidocs/org/apache/nutch/segment/SegmentMerger.html
> mergelinkdb -- http://lucene.apache.org/nutch/apidocs/org/apache/nutch/crawl/LinkDbMerger.html
> merge -- http://lucene.apache.org/nutch/apidocs/org/apache/nutch/indexer/IndexMerger.html
>
> Naively: why are there four merge commands? Are some subsets of the others?  Are they used in conjunction? What are the usage scenarios of each?

Each is used in a different scenario:

mergedb: although its name does not say so, it merges crawldb-s. So
think of it as "mergecrawldb".

mergesegs: merges segments. It merges the <segment>/{content,
crawl_fetch, crawl_generate, crawl_parse, parse_data, parse_text}
information from different segments.

merge: merges Lucene indexes. After an index job, you end up with an
indexes directory with a bunch of part-<num> directories inside.
The merge command takes such a directory and produces a single index.
A single index performs better (I think). You could say that merge is
poorly named; it should have been called mergeindexes or something.

mergelinkdb: should be obvious; it merges linkdb-s.

So none of them is a subset of another. They all have different
purposes. It is kind of confusing to have a "merge" command that only
merges indexes, so perhaps we can add a mergeindexes command, keep
merge for some time (noting that it has been deprecated), then remove
it.
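
To make the index case concrete, here is a minimal sketch of a merge
call. It assumes the argument order "output index first, then the
indexes directory", which is how I remember the 0.9 usage string; run
"nutch merge" with no arguments to confirm. The crawl/index and
crawl/indexes paths and the merge_indexes helper are made up for
illustration.

```shell
#!/bin/sh
# Dry run by default: NUTCH expands to "echo nutch", so the command is only
# printed. Set NUTCH=bin/nutch (path is an assumption) to really run it.
NUTCH="${NUTCH:-echo nutch}"

# merge_indexes OUTPUT_INDEX INDEXES_DIR (hypothetical helper, not part of Nutch)
merge_indexes() {
  # $1 = directory for the single merged index (assumed argument order)
  # $2 = "indexes" directory holding the part-<num> subdirectories
  $NUTCH merge "$1" "$2"
}

merge_indexes crawl/index crawl/indexes
```

With the default dry run this just prints the command line it would
have executed, so it is safe to try before pointing it at real data.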

>
> I notice that Andrzej wrote the first three, and they have wiki entries (pretty much the same as the javadoc):
> (I found these from http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg03588.html)
> http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergedb
> http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergelinkdb
> http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergesegs
> It seems most of the nutch-user discussions I've seen so far relate to the simple merge command.  Are the first three "advanced commands"?


-- 
Doğacan Güney

Re: four nutch merge commands: mergedb, mergesegs, mergelinkdb, merge

Posted by Andrzej Bialecki <ab...@getopt.org>.
Kai_testing Middleton wrote:
> I've been reviewing the four different merge commands (as of nutch v0.9):
> 
> $ nutch | grep merg
>   mergedb           merge crawldb-s, with optional filtering
>   mergesegs         merge several segments, with optional filtering and slicing
>   mergelinkdb       merge linkdb-s, with optional filtering
>   merge             merge several segment indexes
> 
> Here are the javadocs:
> mergedb -- http://lucene.apache.org/nutch/apidocs/org/apache/nutch/crawl/CrawlDbMerger.html
> mergesegs -- http://lucene.apache.org/nutch/apidocs/org/apache/nutch/segment/SegmentMerger.html
> mergelinkdb -- http://lucene.apache.org/nutch/apidocs/org/apache/nutch/crawl/LinkDbMerger.html
> merge -- http://lucene.apache.org/nutch/apidocs/org/apache/nutch/indexer/IndexMerger.html
> 
> Naively: why are there four merge commands? Are some subsets of the others?  Are they used in conjunction? What are the usage scenarios of each?
> 
> I notice that Andrzej wrote the first three, and they have wiki entries (pretty much the same as the javadoc):
> (I found these from http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg03588.html)
> http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergedb
> http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergelinkdb
> http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergesegs
> It seems most of the nutch-user discussions I've seen so far relate to the simple merge command.  Are the first three "advanced commands"?  
>

They serve different purposes. Let's assume that somehow you've got two 
crawldb-s, e.g. you ran two crawls with different seed lists and 
different filters. Now you want to take these collections of URLs and 
create one big crawl. Then you would use mergedb to merge the crawldb-s, 
mergelinkdb to merge the linkdb-s, and mergesegs to merge the segments ;)

And a simple "merge" merges the indexes of multiple segments, which is a 
performance-related step in the regular Nutch work cycle.
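
A minimal sketch of that two-crawl scenario, under some assumptions:
the directories crawl1, crawl2 and merged are hypothetical, the
merge_crawls helper is not part of Nutch, and the argument order
(output first, then inputs) follows the 0.9 usage strings as I
remember them. I believe mergesegs accepts -dir more than once; if it
does not, list the segment paths individually instead.

```shell
#!/bin/sh
# Dry run by default: NUTCH expands to "echo nutch", so each command is only
# printed. Set NUTCH=bin/nutch (path is an assumption) to really run them.
NUTCH="${NUTCH:-echo nutch}"

# merge_crawls OUT CRAWL1 CRAWL2 (hypothetical helper, not part of Nutch)
merge_crawls() {
  out=$1; a=$2; b=$3
  # crawldb-s: output crawldb first, then the crawldb-s to merge
  $NUTCH mergedb "$out/crawldb" "$a/crawldb" "$b/crawldb"
  # linkdb-s: same argument shape as mergedb
  $NUTCH mergelinkdb "$out/linkdb" "$a/linkdb" "$b/linkdb"
  # segments: -dir names a directory that contains segments
  # (repeating -dir is an assumption, see above)
  $NUTCH mergesegs "$out/segments" -dir "$a/segments" -dir "$b/segments"
}

merge_crawls merged crawl1 crawl2
```

The merged crawl would still need to be indexed afterwards; this
sketch covers only the three database and segment merges.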

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com