Posted to user@nutch.apache.org by Paul Escobar <pa...@gmail.com> on 2014/12/04 18:30:41 UTC

Fetch multiple hosts or domains at the same time

Hi,

In our Nutch configuration, we have 5 sites registered in
regex-urlfilter.txt. In the seed file, we have loaded the URLs extracted
from each site's sitemap (grouped by domain).
In nutch-site.xml we set the following configuration:

<property>
  <name>generate.max.count</name>
  <value>72</value>
  <description>The maximum number of urls in a single
  fetchlist.  -1 if unlimited. The urls are counted according
  to the value of the parameter generator.count.mode.
  </description>
</property>

<property>
  <name>generate.count.mode</name>
  <value>domain</value>
  <description>Determines how the URLs are counted for generator.max.count.
  Default value is 'host' but can be 'domain'. Note that we do not count
  per IP in the new version of the Generator.
  </description>
</property>

Additionally, in the crawl script we changed the following line:
sizeFetchlist=`expr $numSlaves \* 50000`

to this line:
sizeFetchlist=`expr $numSlaves \* 360`
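
For reference, here is a rough sketch of the arithmetic behind that value. It assumes 360 comes from 5 sites times the generate.max.count of 72, and that numSlaves is 1. The names numSites and maxCountPerDomain are only illustrative; they do not exist in the crawl script.

```shell
# Illustrative values only: numSites and maxCountPerDomain are hypothetical
# names, and numSlaves=1 is an assumption about our setup.
numSites=5
maxCountPerDomain=72
numSlaves=1

# Fetchlist share per slave if every site contributes its full quota.
perSlave=`expr $numSites \* $maxCountPerDomain`

# Same expression shape as the line in the crawl script.
sizeFetchlist=`expr $numSlaves \* $perSlave`
echo "sizeFetchlist=$sizeFetchlist"
```

Under these assumptions the fetchlist has room for the full quota of all 5 sites, so its size alone should not limit it to four sites.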

With this configuration we get the following messages in the log when the
crawl process starts:

Host or domain site1 has more than 72 URLs for all 1 segments. Additional
URLs won't be included in the fetchlist.
Host or domain site2 has more than 72 URLs for all 1 segments. Additional
URLs won't be included in the fetchlist.
Host or domain site3 has more than 72 URLs for all 1 segments. Additional
URLs won't be included in the fetchlist.
Host or domain site4 has more than 72 URLs for all 1 segments. Additional
URLs won't be included in the fetchlist.

So we see that only four sites are being fetched:
-activeThreads=50, spinWaiting=48, fetchQueues.totalSize=277,
fetchQueues.getQueueCount=4

These four sites are the ones with the most URLs in the seed file. After
some time, the fetchQueues.getQueueCount value drops to 1, and only the
site with the most URLs in the seed file is still being fetched:
-activeThreads=50, spinWaiting=49, fetchQueues.totalSize=43,
fetchQueues.getQueueCount=1

What is the correct configuration to fetch URLs simultaneously from every
site configured in the regex-urlfilter.txt file?

Thanks.

Paul Escobar Mossos