Posted to user@nutch.apache.org by Murat Ali Bayir <mu...@agmlab.com> on 2006/08/01 19:35:41 UTC

linkdbmerge

Hi everybody, I want to know how the mergelinkdb function works. Assume that
we have two linkdbs. In the first one,
URLx is referred to by URLa, URLb and URLc; in the second one, the same URLx
is referred to by URLa and URLk. I want to
know the structure of the output linkdb.
Does it contain one entry for URLx, referred to by URLa, URLb, URLc and
URLk, or
does it just append the second linkdb to the first one and contain two
entries for URLx, as given below?
URLx <- URLa, URLb, URLc
..
..
..
URLx <- URLa, URLk

Re: linkdbmerge

Posted by Andrzej Bialecki <ab...@getopt.org>.

Murat Ali Bayir wrote:
> Assume that we have no restriction for db.max.inlinks, and we have two
> crawls: we run crawl_depth1, then continue the same crawl with crawl_depth2.
> There are two ways of obtaining the final linkdb.
> The first one is to run
>
> ./nutch invertlinks linkdb_depth1 segment_depth1
> ./nutch invertlinks linkdb_depth2 segment_depth2
> ./nutch mergelinkdb final_linkdb1 linkdb_depth1 linkdb_depth2
>
> and the second one is to run
>
> ./nutch invertlinks final_linkdb2 segment_depth1 segment_depth2
>
> Is there any difference between final_linkdb1 and final_linkdb2? I
> mean, is the merge operation lossless in this case?

It should be - if it's not then it's a bug that needs to be fixed.
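
As a rough illustration of why the two routes should agree, here is a small Python sketch. It is a toy model, not Nutch code: a segment is assumed to be a list of (source, target) link pairs, a linkdb is a dict from target URL to a set of source URLs, and the helper names invert_links and merge_linkdbs are made up for this example. It ignores db.max.inlinks and any filtering Nutch may apply.

```python
from collections import defaultdict

def invert_links(*segments):
    """Build a toy linkdb (target -> set of sources) from one or more segments.

    Hypothetical model: each segment is a list of (source, target) pairs.
    """
    linkdb = defaultdict(set)
    for segment in segments:
        for source, target in segment:
            linkdb[target].add(source)
    return dict(linkdb)

def merge_linkdbs(*linkdbs):
    """Merge toy linkdbs by taking the union of inlinks per target URL."""
    merged = defaultdict(set)
    for linkdb in linkdbs:
        for target, sources in linkdb.items():
            merged[target] |= sources
    return dict(merged)

segment_depth1 = [("URLa", "URLx"), ("URLb", "URLx"), ("URLc", "URLx")]
segment_depth2 = [("URLa", "URLx"), ("URLk", "URLx"), ("URLx", "URLy")]

# Route 1: invert each segment separately, then merge the two linkdbs.
final_linkdb1 = merge_linkdbs(invert_links(segment_depth1),
                              invert_links(segment_depth2))
# Route 2: invert both segments in one pass.
final_linkdb2 = invert_links(segment_depth1, segment_depth2)

print(final_linkdb1 == final_linkdb2)  # True
```

Because a set union does not depend on how the inputs are grouped, both routes give the same inlink sets in this model. Once a cap on inlinks is applied, that argument no longer holds automatically.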

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: linkdbmerge

Posted by Murat Ali Bayir <mu...@agmlab.com>.

Assume that we have no restriction for db.max.inlinks, and we have two
crawls: we run crawl_depth1, then continue the same crawl with crawl_depth2.
There are two ways of obtaining the final linkdb.
The first one is to run

./nutch invertlinks linkdb_depth1 segment_depth1
./nutch invertlinks linkdb_depth2 segment_depth2
./nutch mergelinkdb final_linkdb1 linkdb_depth1 linkdb_depth2

and the second one is to run

./nutch invertlinks final_linkdb2 segment_depth1 segment_depth2

Is there any difference between final_linkdb1 and final_linkdb2? I
mean, is the merge operation lossless in this case?


Andrzej Bialecki wrote:

> Murat Ali Bayir wrote:
>
>> Hi everybody, I want to know how the mergelinkdb function works. Assume
>> that we have two linkdbs. In the first one,
>> URLx is referred to by URLa, URLb and URLc; in the second one, the same
>> URLx is referred to by URLa and URLk. I want to
>> know the structure of the output linkdb.
>> Does it contain one entry for URLx, referred to by URLa, URLb, URLc and
>> URLk, or
>> does it just append the second linkdb to the first one and contain two
>> entries for URLx, as given below?
>> URLx <- URLa, URLb, URLc
>> ..
>> ..
>> ..
>> URLx <- URLa, URLk
>>
>>
>
> No, these two entries are merged into one (that's why the name :) ). 
> At any given time, in a valid linkdb there is exactly zero or one 
> entries for any given target URL.
>
> You should note that there is a limit set on how many inlinks we are 
> going to store for any given URL (db.max.inlinks) - which may lead to 
> some surprises. If e.g. the linkdbA already hit that limit, and the 
> other linkdbB didn't, then two scenarios are possible - either you get 
> the list just containing all links from linkdbA and none from linkdbB, 
> or you get the list containing all links from linkdbB plus some links 
> from linkdbA ...
>

Re: linkdbmerge

Posted by Andrzej Bialecki <ab...@getopt.org>.

Murat Ali Bayir wrote:
> Hi everybody, I want to know how the mergelinkdb function works. Assume
> that we have two linkdbs. In the first one,
> URLx is referred to by URLa, URLb and URLc; in the second one, the same
> URLx is referred to by URLa and URLk. I want to
> know the structure of the output linkdb.
> Does it contain one entry for URLx, referred to by URLa, URLb, URLc and
> URLk, or
> does it just append the second linkdb to the first one and contain two
> entries for URLx, as given below?
> URLx <- URLa, URLb, URLc
> ..
> ..
> ..
> URLx <- URLa, URLk
>
>

No, these two entries are merged into one (that's why the name :) ). At 
any given time, in a valid linkdb there is exactly zero or one entries 
for any given target URL.
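
To make the "one entry per target URL" point concrete, here is a minimal Python sketch of the merge. It is only a model of the behaviour, not the Nutch implementation: a linkdb is assumed to be a dict from target URL to a set of inlink URLs, and merge_linkdbs is a hypothetical helper name.

```python
from collections import defaultdict

def merge_linkdbs(*linkdbs):
    """Merge toy linkdbs (dict: target URL -> set of inlink URLs).

    Hypothetical helper for illustration only, not the Nutch API.
    Inlinks for the same target are unioned, so each target keeps one entry.
    """
    merged = defaultdict(set)
    for linkdb in linkdbs:
        for target, inlinks in linkdb.items():
            merged[target] |= set(inlinks)
    return dict(merged)

linkdb1 = {"URLx": {"URLa", "URLb", "URLc"}}
linkdb2 = {"URLx": {"URLa", "URLk"}}

merged = merge_linkdbs(linkdb1, linkdb2)
print(len(merged))             # 1
print(sorted(merged["URLx"]))  # ['URLa', 'URLb', 'URLc', 'URLk']
```

URLa appears in both inputs but only once in the result, and URLx has a single entry.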

You should note that there is a limit set on how many inlinks we are 
going to store for any given URL (db.max.inlinks) - which may lead to 
some surprises. If e.g. the linkdbA already hit that limit, and the 
other linkdbB didn't, then two scenarios are possible - either you get 
the list just containing all links from linkdbA and none from linkdbB, 
or you get the list containing all links from linkdbB plus some links 
from linkdbA ...
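
The order dependence can be sketched in Python as follows. This is a simplified model under a stated assumption, not how Nutch necessarily does it: inlinks are taken in arrival order and anything past the cap is dropped. The function merge_capped and its max_inlinks parameter are hypothetical names standing in for the db.max.inlinks limit.

```python
def merge_capped(first, second, max_inlinks):
    """Merge two toy inlink lists for one target URL, keeping at most max_inlinks.

    Assumption for illustration: links are kept in arrival order and
    everything beyond the cap is silently dropped.
    """
    kept = []
    for url in list(first) + list(second):
        if url not in kept and len(kept) < max_inlinks:
            kept.append(url)
    return kept

linkdb_a = ["URLa", "URLb", "URLc"]  # already at the cap of 3
linkdb_b = ["URLa", "URLk"]          # below the cap

a_first = merge_capped(linkdb_a, linkdb_b, max_inlinks=3)
b_first = merge_capped(linkdb_b, linkdb_a, max_inlinks=3)

print(a_first)  # ['URLa', 'URLb', 'URLc']
print(b_first)  # ['URLa', 'URLk', 'URLb']
```

With linkdbA read first, URLk from linkdbB is lost; with linkdbB read first, all of its links survive and only some of linkdbA's fit under the cap.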

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com