Posted to user@nutch.apache.org by Carl Cerecke <ca...@nzs.com> on 2007/10/04 02:31:31 UTC

invertlinks not getting all links in segments

Hi,

I have nearly 10M pages in over 200 segments. After creating the linkdb 
by running invertlinks, and then dumping the linkdb with readlinkdb, I 
noticed that many links were missing from the linkdb.

When reading the information on a fetched page from the segment, I could 
see many outlinks, but none of those outlinks made it into the linkdb.

I crawled the site separately, limiting the crawl to that site only, and 
the links were in the linkdb correctly. But in the large crawl, they 
don't make it into the linkdb.

Any suggestions?


Re: invertlinks not getting all links in segments

Posted by Doğacan Güney <do...@gmail.com>.
Hi

On 10/4/07, Carl Cerecke <ca...@nzs.com> wrote:
> Hi,
>
> I have nearly 10M pages in over 200 segments. After creating the linkdb
> by running invertlinks, and then dumping the linkdb with readlinkdb, I
> noticed that many links were missing from the linkdb.
>
> When reading the information on a fetched page from the segment, I could
> see many outlinks, but none of those outlinks made it into the linkdb.
>
> I crawled the site separately, limiting the crawl to that site only, and
> the links were in the linkdb correctly. But in the large crawl, they
> don't make it into the linkdb.
>
> Any suggestions?
>
>

The linkdb stores at most db.max.inlinks inlinks per entry. If more
links than that point to a page, the extra ones are dropped.
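
To illustrate the effect of such a cap, here is a minimal Python sketch. It is
not Nutch code: the function invert_links, the max_inlinks parameter, and the
example URLs are made up for illustration, and keeping the first inlinks seen
is an assumption of this sketch.

```python
from collections import defaultdict


def invert_links(outlinks_by_page, max_inlinks):
    """Turn {source: [targets]} into {target: [sources]}.

    At most max_inlinks sources are kept per target; any further
    inlinks to that target are silently dropped (hypothetical policy:
    keep the first ones encountered).
    """
    inlinks = defaultdict(list)
    for source, targets in outlinks_by_page.items():
        for target in targets:
            if len(inlinks[target]) < max_inlinks:
                inlinks[target].append(source)
            # Otherwise the cap is reached and this inlink is discarded.
    return dict(inlinks)


# Three pages link to the same hub page; with a cap of 2, one inlink is lost.
pages = {
    "http://a.example/": ["http://hub.example/"],
    "http://b.example/": ["http://hub.example/"],
    "http://c.example/": ["http://hub.example/", "http://a.example/"],
}
linkdb = invert_links(pages, max_inlinks=2)
for target, sources in linkdb.items():
    print(target, sources)
```

In this sketch the outlink from http://c.example/ to the hub is visible in the
input but absent from the inverted result, which matches the symptom described
above. In a small single-site crawl a target may stay under the cap, while in
a large crawl the same target can exceed it.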


-- 
Doğacan Güney