Posted to user@nutch.apache.org by Fuad Efendi <fu...@efendi.ca> on 2006/02/01 04:15:54 UTC

RE: How many data have you got?

>> When I performed a whole-web crawl test according to the tutorial, I got
>> Number of pages: 36668
>> Number of links: 46721.
>> Then how many have you got?

>I only played around with Nutch some month ago, and I got as many as 500.000
>pages and several million links within a few days over my home DSL line. Your
>crawler might be stuck somewhere ...?

Number of pages is probably the number of Page instances, i.e. the number of
successfully retrieved web pages.
Number of links is probably the total number of Link instances in WebDB,
including links to pages that have not been retrieved and multiple links to
the same Page instance.

Different pages may contain different links (with different anchor text and
even different URLs) to the same Page instance; page equality is defined by
the MD5 hash (a checksum of all bytes in the plain HTTP response).
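
To illustrate why the two counts can differ, here is a minimal sketch in
Python. It is not Nutch code: the content_id helper, the URLs and the sample
data are all made up for illustration, and it assumes that page identity is
simply the MD5 of the raw response bytes, as described above.

```python
import hashlib


def content_id(body: bytes) -> str:
    """MD5 hex digest of the raw response bytes, used here as page identity."""
    return hashlib.md5(body).hexdigest()


# Hypothetical fetched responses: URL -> raw bytes of the response.
fetched = {
    "http://example.com/": b"<html>home</html>",
    "http://example.com/index.html": b"<html>home</html>",  # same bytes, other URL
    "http://example.com/about": b"<html>about</html>",
}

# Hypothetical links: (source URL, target URL, anchor text).
links = [
    ("http://example.com/about", "http://example.com/", "Home"),
    ("http://example.com/about", "http://example.com/index.html", "Start page"),
    ("http://example.com/", "http://example.com/about", "About us"),
    ("http://example.com/", "http://example.org/", "Partner"),  # target not fetched
]

# Pages are counted by content hash, so identical bodies collapse into one page.
pages = {content_id(body) for body in fetched.values()}

print("pages:", len(pages))
print("links:", len(links))
```

In this toy data three URLs were fetched, but two of them return identical
bytes, so they count as one page, while every link is still counted
separately, including the one pointing to a page that was never fetched.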

A single page may have hundreds of links, including links to foreign hosts.

Nutch 0.7.1