Posted to user@nutch.apache.org by Thomas Anderson <t....@gmail.com> on 2011/02/21 13:32:15 UTC

Is it possible to estimate data size to be crawled?

I need to crawl pages from a designated website, but it is pointless to
crawl the whole site; only a small amount of data will be crawled for
the experiment.

Is there any way to estimate how much data will be crawled depending on
the crawl parameters, e.g. level (depth), topN? Or is there any
documentation that covers this?
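
For illustration, the kind of estimate I have in mind is a simple upper
bound: each fetch round retrieves at most topN pages, so depth rounds
fetch at most depth * topN pages, and multiplying by an assumed average
page size gives a rough data volume. A minimal sketch of that
arithmetic (the 50 KB average page size here is just a guess on my
part, not something Nutch reports):

    // Back-of-the-envelope upper bound on crawl volume.
    // Assumption: each round fetches at most topN pages (-topN) and
    // the crawl runs for `depth` rounds (-depth).
    public class CrawlSizeEstimate {
        public static void main(String[] args) {
            int depth = 3;                 // number of fetch rounds (-depth)
            int topN = 1000;               // max pages fetched per round (-topN)
            long avgPageBytes = 50 * 1024; // assumed average page size (a guess)

            long maxPages = (long) depth * topN;   // upper bound on fetched pages
            long maxBytes = maxPages * avgPageBytes;

            System.out.printf("At most %d pages, roughly %.1f MB%n",
                    maxPages, maxBytes / (1024.0 * 1024.0));
        }
    }

Of course the real crawl may fetch far fewer pages than this bound if
the site is small, so I am wondering whether there is a more principled
way to estimate it.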

I would appreciate any suggestions.