Posted to agent@nutch.apache.org by Daniele Menozzi <me...@ngi.it> on 2005/09/26 18:23:56 UTC

Pages/s rate decreasing

Hi all, I'm trying to fetch a few million pages, but I've run into some
performance problems.
I'm using a P4 1700 with 768MB of RAM and a 10Mb connection.
I've changed these configuration values in nutch-site.xml:

<property>
  <name>fetcher.threads.fetch</name>
  <value>25</value>
</property>

<property>
  <name>http.max.delays</name>
  <value>1</value>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
</property>

<property>
  <name>io.sort.factor</name>
  <value>10</value>
</property>

<property>
  <name>io.sort.mb</name>
  <value>1</value>
</property>

<property>
  <name>indexer.maxMergeDocs</name>
  <value>20</value>
</property>

<property>
  <name>indexer.termIndexInterval</name>
  <value>64</value>
</property>

and I've also added the following line into bin/nutch:
JAVA_HEAP_MAX=-Xmx750M

It seems like a good configuration. When I run the fetch command, I get these log messages:

050926 181531 status: segment 20050924151836, 100 pages, 11 errors, 1277608 bytes, 11755 ms
050926 181531 status: 8.507018 pages/s, 849.11206 kb/s, 12776.08 bytes/page
050926 181537 status: segment 20050924151836, 200 pages, 17 errors, 2620277 bytes, 18157 ms
050926 181537 status: 11.015036 pages/s, 1127.4392 kb/s, 13101.385 bytes/page
050926 181548 status: segment 20050924151836, 300 pages, 26 errors, 4243689 bytes, 28657 ms
050926 181548 status: 10.468647 pages/s, 1156.9187 kb/s, 14145.63 bytes/page
050926 181557 status: segment 20050924151836, 400 pages, 32 errors, 5515098 bytes, 38102 ms
050926 181557 status: 10.4981365 pages/s, 1130.8252 kb/s, 13787.745 bytes/page
050926 181607 status: segment 20050924151836, 500 pages, 44 errors, 6678319 bytes, 48464 ms
050926 181607 status: 10.3169365 pages/s, 1076.5592 kb/s, 13356.638 bytes/page

but, after a few thousand pages, the rate decreases steadily:

050926 180746 status: segment 20050924151836, 6400 pages, 566 errors,85809551 bytes, 853401 ms
050926 180746 status: 7.4994054 pages/s, 785.5476 kb/s, 13407.742 bytes/page
050926 180807 status: segment 20050924151836, 6500 pages, 581 errors,87133135 bytes, 874799 ms
050926 180807 status: 7.4302783 pages/s, 778.1532 kb/s, 13405.098 bytes/page
050926 180823 status: segment 20050924151836, 6600 pages, 589 errors, 88789053 bytes, 890686 ms
050926 180823 status: 7.410019 pages/s, 778.79803 kb/s, 13452.888 bytes/page
050926 180841 status: segment 20050924151836, 6700 pages, 594 errors, 90286731 bytes, 908720 ms
050926 180841 status: 7.3730083 pages/s, 776.21826 kb/s, 13475.631 bytes/page
050926 180901 status: segment 20050924151836, 6800 pages, 601 errors, 91663461 bytes, 928498 ms
050926 180901 status: 7.323656 pages/s, 771.268 kb/s, 13479.921 bytes/page
050926 181014 status: segment 20050924151836, 7200 pages, 627 errors,96922711 bytes, 1001732 ms
050926 181014 status: 7.187551 pages/s, 755.8995 kb/s, 13461.487 bytes/page
050926 181037 status: segment 20050924151836, 7300 pages, 637 errors, 98478215 bytes, 1024844 ms
050926 181037 status: 7.1230354 pages/s, 750.7104 kb/s, 13490.167 bytes/page
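
A note on reading these numbers: the pages/s value in each status line appears to be a cumulative average (total pages divided by total elapsed time; for example, 100 pages / 11.755 s = 8.507), so it understates how slow the most recent pages were. Below is a minimal Python sketch, assuming the status-line format shown in the logs above, that derives the rate between consecutive status lines instead. The name interval_rates and the regular expression are illustrative and not part of Nutch.

```python
import re

# Matches the first line of each status pair; tolerates a missing space after "errors,".
STATUS = re.compile(
    r"status: segment \d+, (\d+) pages, \d+ errors,\s*\d+ bytes, (\d+) ms"
)

def interval_rates(log_lines):
    """Return (pages_so_far, pages_per_second_since_previous_status_line) tuples."""
    samples = []
    for line in log_lines:
        match = STATUS.search(line)
        if match:
            samples.append((int(match.group(1)), int(match.group(2))))
    rates = []
    for (pages0, ms0), (pages1, ms1) in zip(samples, samples[1:]):
        if ms1 > ms0:  # skip duplicate or out-of-order samples
            rates.append((pages1, (pages1 - pages0) * 1000.0 / (ms1 - ms0)))
    return rates

# Three status lines copied from the log above.
log = [
    "050926 180746 status: segment 20050924151836, 6400 pages, 566 errors,85809551 bytes, 853401 ms",
    "050926 180807 status: segment 20050924151836, 6500 pages, 581 errors,87133135 bytes, 874799 ms",
    "050926 180823 status: segment 20050924151836, 6600 pages, 589 errors, 88789053 bytes, 890686 ms",
]

rates = interval_rates(log)
for pages, rate in rates:
    print(f"up to {pages} pages: {rate:.2f} pages/s in the last interval")
```

For the interval between 6400 and 6500 pages this gives roughly 4.7 pages/s, noticeably lower than the cumulative 7.43 pages/s reported at that point.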


and I cannot understand how to get a steady 10 pages/s rate (or even a higher one!). I've read this page
http://wiki.apache.org/nutch/HardwareRequirements
and it states that it is possible, with 25 fetchers, to download at (more or less) 4Mbit per second
with hardware similar to mine.
So, how can I set up Nutch to fetch at a higher rate?
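
For a rough sense of how much of the link is actually in use: the kb/s figure in the logs is consistent with kilobits per second computed as bytes * 8 / 1024 / seconds. The Python sketch below reproduces that figure for the last status line above and compares it with the link capacity, assuming "10Mb" means 10,000,000 bit/s. The function names are illustrative only.

```python
def throughput_kbit_per_s(total_bytes, elapsed_ms):
    """Cumulative throughput in kbit/s (1 kbit = 1024 bits, which matches the log values)."""
    return total_bytes * 8 / 1024 / (elapsed_ms / 1000)

def link_share(total_bytes, elapsed_ms, link_bits_per_s=10_000_000):
    """Fraction of the link used; the 10,000,000 bit/s default is an assumption."""
    bits_per_s = total_bytes * 8 / (elapsed_ms / 1000)
    return bits_per_s / link_bits_per_s

# Values from the last status line: 98478215 bytes in 1024844 ms.
rate = throughput_kbit_per_s(98478215, 1024844)
share = link_share(98478215, 1024844)
print(f"{rate:.1f} kbit/s, {share:.1%} of the link")
```

Under those assumptions the fetch is using well under a tenth of the available bandwidth, which suggests the connection itself is not the limit here.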


Thank you so much!!!!!
	Menoz


-- 
		      Free Software Enthusiast
		 Debian Powered Linux User #332564 
		     http://menoz.homelinux.org

Re: Pages/s rate decreasing

Posted by Daniele Menozzi <me...@ngi.it>.
On  13:01:15 26/Sep , Jay Pound wrote:
> I crawl with 100 threads, but you need to have 10% available for outbound
> traffic so if your connection is a cablemodem with only 512kb upload you
> will not be able to download at 10mbit but 5mbit, your numbers are low, most

I also have 10Mb upload..
What machine do you usually use?

> of the machines I run can crawl 30pages a sec with 3mbit 2xT1's and 42pages
> a sec with 9mbit cablemodem connection, just run more threads, that is if
> your hardware can handle it.

but I thought my machine was powerful enough to run 50/100 threads. I
cannot understand it, because the CPU is barely used, and even memory is
mostly unused...


> you can also try and run the -noparse flag and
> parse the data later or on a different faster machine!!!

ok, I'll try this, but it seems to have no relevant effect..

Thank you!!!