Posted to user@nutch.apache.org by purpureleaf <pu...@gmail.com> on 2007/08/09 11:37:44 UTC

Fetcher gets slower and slower in one run of crawling

Hi, I have worked with nutch for some time. One thing I have always been
curious about is that when crawling, the fetcher's speed gets slower and
slower, no matter what configuration I use.
My last test gave this: (just one site, to keep the problem simple)

OS : winxp
java : 1.6.0.2
nutch: 0.9
cpu : AMD 1800
mem : 1G
network : 3m adsl

site : wikipedia.org
threads per site : 30
server.delay : 0.5

It starts at about 6 pages/s, but drops to 4 within minutes, then gets
slower and slower. I have run it for 8 hours; only 2 pages/s were left, and
it was still slowing down.
But if I stop it and start another run, it returns to full speed (then slows
down again). I am OK with 2 pages/s for one site, but I do hope it will keep
that speed.

I found that some people on this list have had the same problem, but I
can't find an answer.
Is nutch designed to work this way?

Thanks!


Re: Fetcher gets slower and slower in one run of crawling

Posted by purpureleaf <pu...@gmail.com>.
Just tested it again :(

With 2 threads, a 1-second delay and a 0.5-second delay both got about 1.3
pages/s, but it was not dropping.

Maybe I will try the download. But now I really wonder how Google can index
wikipedia. Actually, many wikipedia sites share the same IP, and Google has
indexed 120M pages of it. It must have done something.




Re: Fetcher gets slower and slower in one run of crawling

Posted by Martin Kuen <ma...@gmail.com>.
hi,

hm :( . . . okay
Can you see the pages/sec value still decreasing?

Probably I am wrong . . . but if you start your crawl with a low number of
threads (1 or 2), you should immediately see a value which is very close to
what you'd expect - considering the "server.delay" property. If this
is not true --> I am wrong.
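
A quick way to check this (a sketch - the segment path is just a
placeholder, and this assumes the usual Nutch 0.9 fetcher usage of
"bin/nutch fetch <segment> [-threads n]"):

  bin/nutch fetch crawl/segments/20070809 -threads 2

If my assumption holds, the reported pages/sec should sit near the value
implied by "server.delay" right from the start instead of drifting down.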

Regarding wikipedia: The English wikipedia has somewhat more than 1,900,000
articles now. This number doesn't take into account all the revisions made
to them. If I recall correctly, a full dump of the English wikipedia would
be around 600 GB. However, the actual content (the most up-to-date articles)
fits into a 2.5 GB download (bz2 compressed). This download excludes things
like images, user discussions, revisions, and so on. But with this download
you're ready to set up your own wikipedia mirror.


Cheers



Re: Fetcher gets slower and slower in one run of crawling

Posted by purpureleaf <pu...@gmail.com>.
Hi, that sounds like it could be the cause, but I just tested it again.
server.delay = 1s doesn't result in 1 page/s - it is almost the same speed.
Confused :(
I really didn't mean to hammer wikipedia; I just wanted a site with enough
pages to test against.

So with more than 12M pages on wikipedia, I guess it is almost impossible to
crawl wikipedia online.
How does google do this?




Re: Fetcher gets slower and slower in one run of crawling

Posted by purpureleaf <pu...@gmail.com>.
Oh? Hmm, this makes sense.
I thought it meant every thread would wait 0.5s after another thread.
But at the beginning it is much higher than 2 pages/s, so the fetcher works
in such a way that the more pages it gets, the closer it comes to 2 pages/s?
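
(If that's right, then hypothetically: if the threads fetch an initial
burst of B pages before the delay kicks in and then settle at r pages/s,
the reported average after t seconds would be

  avg(t) = (B + r*t) / t

which starts high and drifts down toward r - e.g. with B = 30 and r = 2,
avg(60) = (30 + 120) / 60 = 2.5 pages/s, while avg(3600) is already about
2.008 pages/s. Illustrative numbers only.)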



Re: Fetcher gets slower and slower in one run of crawling

Posted by Martin Kuen <ma...@gmail.com>.
hi there,

the property "server.delay" is the delay for one site (e.g. wikipedia). So,
if you have a delay of 0.5 you'll fetch 2 pages per second.

In my opinion there is something in the fetcher's code that doesn't make it
obey this rule at the very beginning . . . probably at start-up 30 threads
start immediately without caring about this setting, which could cause a
high pages/sec value in the beginning . . . but then the rule is applied
correctly and this averaged value (pages/sec) gets corrected in a
step-by-step manner - however, I have no evidence for this assumption.

If you look around the Fetcher's code (or maybe at the http-plugin - don't
remember) you'll find a config property called
"protocol.plugin.check.blocking". If you set it to false you'll override the
"server.delay" property. The result of this action is that you'll start
"hammering" the wikipedia site.
I tried to achieve the same by setting "server.delay" to 0 . . . however
. . . things didn't work well (I didn't investigate too much - I found the
"check.blocking" property, which worked?!).
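
For reference, both knobs can be set in conf/nutch-site.xml. A minimal
sketch - assuming the Nutch 0.9 property names from nutch-default.xml
(there the delay is called "fetcher.server.delay"); double-check your copy:

  <?xml version="1.0"?>
  <configuration>
    <property>
      <name>fetcher.server.delay</name>
      <!-- seconds to wait between requests to the same server -->
      <value>1.0</value>
    </property>
    <property>
      <name>protocol.plugin.check.blocking</name>
      <!-- false disables the per-host blocking check (impolite!) -->
      <value>true</value>
    </property>
  </configuration>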

Btw. I propose that you should not run (large) crawls on the
wikipedia sites. The wiki guys don't like it. If you're just running a test
and fetching a few pages . . . ok . . . but a crawl of 8 hours . . . hmm
. . . not just a few pages, right?
Furthermore, a "server.delay" of 0.5 doesn't really appear polite to me . . .

Ok, so what? If you're interested in indexing the wikipedia articles, you
can set up wikipedia on your local computer . . .
http://en.wikipedia.org/wiki/Wikipedia:Database_download
Then you can run your fetch on your local machine or in your intranet and
you'll just be limited by the speed of the machine powering the MediaWiki
application. I tried this with the German wikipedia dump and it took a
little bit more than 33 hours (AMD Athlon 2600 dual-core, 2GB RAM, WinXP,
java 1.5, nutch 0.9, ~614,000 articles, ~5.3 pages per second). I didn't
really care about performance, so I think this could be faster.


cheers






Re: Fetcher gets slower and slower in one run of crawling

Posted by purpureleaf <pu...@gmail.com>.
Hi, thanks for your reply.

Yes, I was fetching from wikipedia only; I did this just to test this
slowing-down effect. But not too heavily, I think - 4 pages/s, and it still
gets slower and slower, forever. So is the fetcher supposed to be slower
than 1 page/s (per site)?
I watched my bandwidth: it used less than 20 KB/s, way below anything my
provider would worry about.





Re: Fetcher gets slower and slower in one run of crawling

Posted by Dennis Kubes <ku...@apache.org>.
If this is stalling on only a few fetching tasks, check the logs; more
than likely it is fetching many pages from a single site (e.g. amazon,
wikipedia, cnn) and the politeness settings (which you want to keep) are
slowing it down.

If it is stalling on many tasks but on a single machine, check the hardware
of that machine.  We have seen hard disk speed decrease dramatically right
before a disk dies.  On linux, do something like hdparm -tT /dev/hda, where
hda is the device to check.  Average speeds for SATA should be in the 75
MB/s range for disk reads and the 7000+ range for cached reads.
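
For example (a sketch - /dev/hda is just a placeholder; substitute the
device you want to test, and run it a few times for stable numbers):

  hdparm -T /dev/hda   # cached reads (memory/bus path)
  hdparm -t /dev/hda   # buffered disk reads (actual platter speed)

A buffered-read figure far below the range above on otherwise idle hardware
is a warning sign.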

Another thing: you may be maxing out your bandwidth and your provider is
throttling you?

Dennis Kubes


Re: Fetcher gets slower and slower in one run of crawling

Posted by purpureleaf <pu...@gmail.com>.
Thanks for your reply.
The JDK's GC? But it wasn't using much memory.



Re: Fetcher gets slower and slower in one run of crawling

Posted by Brian Demers <br...@gmail.com>.
Have you tried changing the GC options?
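
For example (a sketch - this assumes your bin/nutch script honors the
NUTCH_HEAPSIZE and NUTCH_OPTS environment variables, and the flags are
illustrative Sun JDK 6 options, not tuned values):

  export NUTCH_HEAPSIZE=512                              # heap size in MB
  export NUTCH_OPTS="-verbose:gc -XX:+UseConcMarkSweepGC"
  bin/nutch fetch crawl/segments/20070809 -threads 30

-verbose:gc will at least show whether GC pauses grow as the run ages.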
