Posted to user@nutch.apache.org by jc <jv...@gmail.com> on 2013/02/28 23:44:46 UTC

a lot of threads spinwaiting

Hi guys,

I'm sorry if this question has been answered before, I looked but didn't
find anything. 

This is my scenario (only relevant settings I think):
seed urls: about 60 homepages from different domains
generate.max.count = 10000
fetcher.threads.per.host = 3   I'm trying to be polite here :-)
partition.url.mode = byHost
fetcher.threads.fetch = 200
fetcher.threads.per.queue = 1
topN = 1000000
depth = 1

Since the very beginning I've had a lot of spinwaiting threads (I'm not sure
those are actually threads, because the log doesn't really say)

194/200 spinwaiting/active, 166 pages, 3 errors, 4.7 3.8 pages/s , 1471 1412
kb/s, 10000 URLs in 19 queues

I don't understand why there are 19 queues. Is it maybe that only 19
websites are being fetched? Anyway, why are 194 out of 200 active threads
spinwaiting?

Thanks a lot in advance for your time.

Regards,
jc



--
View this message in context: http://lucene.472066.n3.nabble.com/a-lot-of-threads-spinwaiting-tp4043801.html
Sent from the Nutch - User mailing list archive at Nabble.com.

RE: a lot of threads spinwaiting

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

Regarding politeness, 3 threads per queue is not really polite :)

Cheers

 
 
-----Original message-----
> From:jc <jv...@gmail.com>
> Sent: Fri 01-Mar-2013 15:08
> To: user@nutch.apache.org
> Subject: Re: a lot of threads spinwaiting
> 
> Hi Roland and lufeng,
> 
> Thank you very much for your replies. I already tested lufeng's advice,
> with results pretty much as expected.
> 
> By the way, my Nutch installation is based on version 2.1 with HBase as
> the crawldb storage.
> 
> Roland, maybe the fetcher.server.delay param has something to do with that
> as well. I set it to 3 secs; would setting it to 0 be impolite?
> 
> All the info you provided has helped me a lot. Only one issue remains
> unfixed: there are more than 60 URLs from different hosts in my seed file,
> but only 20 queues. It may look as if the other 40 hosts have no more URLs
> to generate, but I really haven't seen any URL coming from those hosts
> since the creation of the crawldb.
> 
> Based on my limited experience, the following params should allow 60
> queues for my vertical crawl. Am I missing something?
> 
> topN = 1 million
> fetcher.threads.per.queue = 3
> fetcher.threads.per.host = 3 (just in case, I remember you told me to use
> per.queue instead)
> fetcher.threads.fetch = 200
> seed urls of different hosts = 60 or more (regex-urlfilter.txt allows only
> urls from these hosts, they're all there, I checked)
> crawldb record count > 1 million
> 
> Thanks again for all your help
> 
> Regards,
> JC
> 
> 
> 
> 

Re: a lot of threads spinwaiting

Posted by jc <jv...@gmail.com>.
Thanks a lot for all your answers; this really is an active community.

Roland, I had that problem once, but it's not the case here. I'll try to look
into the crawldb, though HBase is not as friendly for filtering as I would
like it to be. I'm still a newbie there.

Regards,
JC




Re: a lot of threads spinwaiting

Posted by Roland <ro...@rvh-gmbh.de>.
Hi JC,

I think Markus already answered about politeness :) But without delay it
will be worse :)

Do these missing URLs match one of the filtering regexes?
Take a look at .../conf/regex-urlfilter.txt, I had a problem with this 
regex:
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
It will just silently drop all URLs with GET parameters.
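A quick way to see the effect (a minimal sketch; the pattern is the stock rule above, but the test URLs are made-up examples):

```python
import re

# The stock rule "-[?*!@=]" rejects any URL containing one of these characters.
skip = re.compile(r"[?*!@=]")

urls = [
    "http://example.com/page.html",    # kept
    "http://example.com/list?page=2",  # dropped: contains '?' and '='
]
for url in urls:
    verdict = "DROP" if skip.search(url) else "KEEP"
    print(verdict, url)
```

So a seed URL whose homepage redirects to something like /index.php?lang=en would be filtered out before it ever reaches a fetch queue.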

--Roland


Am 01.03.2013 15:08, schrieb jc:
> Hi Roland and lufeng,
>
> Thank you very much for your replies. I already tested lufeng's advice,
> with results pretty much as expected.
>
> By the way, my Nutch installation is based on version 2.1 with HBase as
> the crawldb storage.
>
> Roland, maybe the fetcher.server.delay param has something to do with that
> as well. I set it to 3 secs; would setting it to 0 be impolite?
>
> All the info you provided has helped me a lot. Only one issue remains
> unfixed: there are more than 60 URLs from different hosts in my seed file,
> but only 20 queues. It may look as if the other 40 hosts have no more URLs
> to generate, but I really haven't seen any URL coming from those hosts
> since the creation of the crawldb.
>
> Based on my limited experience, the following params should allow 60
> queues for my vertical crawl. Am I missing something?
>
> topN = 1 million
> fetcher.threads.per.queue = 3
> fetcher.threads.per.host = 3 (just in case, I remember you told me to use
> per.queue instead)
> fetcher.threads.fetch = 200
> seed urls of different hosts = 60 or more (regex-urlfilter.txt allows only
> urls from these hosts, they're all there, I checked)
> crawldb record count > 1 million
>
> Thanks again for all your help
>
> Regards,
> JC

Re: a lot of threads spinwaiting

Posted by jc <jv...@gmail.com>.
Hi Roland and lufeng,

Thank you very much for your replies. I already tested lufeng's advice, with
results pretty much as expected.

By the way, my Nutch installation is based on version 2.1 with HBase as the
crawldb storage.

Roland, maybe the fetcher.server.delay param has something to do with that
as well. I set it to 3 secs; would setting it to 0 be impolite?

All the info you provided has helped me a lot. Only one issue remains
unfixed: there are more than 60 URLs from different hosts in my seed file,
but only 20 queues. It may look as if the other 40 hosts have no more URLs
to generate, but I really haven't seen any URL coming from those hosts since
the creation of the crawldb.

Based on my limited experience, the following params should allow 60 queues
for my vertical crawl. Am I missing something?

topN = 1 million
fetcher.threads.per.queue = 3
fetcher.threads.per.host = 3 (just in case, I remember you told me to use
per.queue instead)
fetcher.threads.fetch = 200
seed urls of different hosts = 60 or more (regex-urlfilter.txt allows only
urls from these hosts, they're all there, I checked)
crawldb record count > 1 million

Thanks again for all your help

Regards,
JC




Re: a lot of threads spinwaiting

Posted by Roland <ro...@rvh-gmbh.de>.
Hi jc,

and one thing to add: check the robots.txt file of your crawled hosts, 
maybe they are limiting your fetches with delays:
Crawl-delay: 10
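One back-of-the-envelope consequence (hypothetical arithmetic, using the 19 queues and the 3-second delay mentioned in this thread): with a per-host delay, the total fetch rate is capped by the number of queues divided by the delay, no matter how many threads are configured:

```python
# Fetch-rate ceiling imposed by per-host delays (rough model, not Nutch code).
queues = 19      # hosts currently being fetched, from the log line
delay_s = 3.0    # fetcher.server.delay, or a robots.txt Crawl-delay

# One active thread per queue, one fetch every delay_s seconds per host:
max_rate = queues / delay_s
print(f"ceiling: {max_rate:.1f} pages/s across all {queues} hosts")
```

A Crawl-delay of 10 s would lower that ceiling further, to 1.9 pages/s, which is why the observed rate stays far below what 200 threads might suggest.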

--Roland


Am 01.03.2013 03:32, schrieb feng lu:
> Hi jc
>
> <<
> I don't understand why there are 19 queues, is it maybe that only 19
> websites are being fetched?
> >>
> Because each queue handles FetchItems that come from the same queue ID (be
> it a proto/hostname, proto/IP, or proto/domain pair), and the queue ID is
> created based on the queueMode argument. So there may be 19 different
> queue IDs in FetchItemQueues.
>
> <<
>   Anyways, why is it that there are 194 spinwaiting out of 200 active
> threads?
> >>
> First of all, I see that the parameter "fetcher.threads.per.host" has been
> replaced by "fetcher.threads.per.queue" in Nutch 1.6. There are 200
> fetching threads that can fetch items from any host, but all remaining
> items come from just 19 different hosts, and the total URL count is 10000.
> Since each queue holds items with the same queue ID, the log indicates
> that only 6 threads are still fetching; the other 13 queues have already
> been emptied, probably because they were small and didn't take much time.
>
> Thanks
> lufeng
>
>
> On Fri, Mar 1, 2013 at 6:44 AM, jc <jv...@gmail.com> wrote:
>
>> Hi guys,
>>
>> I'm sorry if this question has been answered before, I looked but didn't
>> find anything.
>>
>> This is my scenario (only relevant settings I think):
>> seed urls: about 60 homepages from different domains
>> generate.max.count = 10000
>> fetcher.threads.per.host = 3   I'm trying to be polite here :-)
>> partition.url.mode = byHost
>> fetcher.threads.fetch = 200
>> fetcher.threads.per.queue = 1
>> topN = 1000000
>> depth = 1
>>
>> Since the very beginning I've had a lot of spinwaiting threads (I'm not
>> sure those are actually threads, because the log doesn't really say)
>>
>> 194/200 spinwaiting/active, 166 pages, 3 errors, 4.7 3.8 pages/s , 1471
>> 1412
>> kb/s, 10000 URLs in 19 queues
>>
>> I don't understand why there are 19 queues, is it maybe that only 19
>> websites are being fetched? Anyways, why is it that there are 194
>> spinwaiting out of 200 active threads?
>>
>> Thanks a lot in advance for your time.
>>
>> Regards,
>> jc
>>
>>
>>
>>
>
>


Re: a lot of threads spinwaiting

Posted by feng lu <am...@gmail.com>.
Hi jc

<<
I don't understand why there are 19 queues, is it maybe that only 19
websites are being fetched?
>>
Because each queue handles FetchItems that come from the same queue ID (be
it a proto/hostname, proto/IP, or proto/domain pair), and the queue ID is
created based on the queueMode argument. So there may be 19 different queue
IDs in FetchItemQueues.

<<
 Anyways, why is it that there are 194 spinwaiting out of 200 active
threads?
>>
First of all, I see that the parameter "fetcher.threads.per.host" has been
replaced by "fetcher.threads.per.queue" in Nutch 1.6. There are 200 fetching
threads that can fetch items from any host, but all remaining items come
from just 19 different hosts, and the total URL count is 10000. Since each
queue holds items with the same queue ID, the log indicates that only 6
threads are still fetching; the other 13 queues have already been emptied,
probably because they were small and didn't take much time.
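That arithmetic can be sketched as follows (a rough model using the numbers from the log line above, not actual Nutch code):

```python
# Rough model of fetcher concurrency under per-queue limits (not Nutch code).
threads_total = 200      # fetcher.threads.fetch
queues_total = 19        # queues reported in the log
threads_per_queue = 1    # fetcher.threads.per.queue

# Hard cap: at most one active thread per queue.
max_active = min(threads_total, queues_total * threads_per_queue)

# If 13 of the 19 queues have already been emptied, only 6 stay busy.
queues_left = queues_total - 13
active = queues_left * threads_per_queue
spinwaiting = threads_total - active
print(f"max active: {max_active}, active now: {active}, "
      f"spinwaiting: {spinwaiting}")
```

With those numbers the model reproduces the reported "194/200 spinwaiting/active" figure: all other threads have nothing to do but wait for a queue slot.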

Thanks
lufeng


On Fri, Mar 1, 2013 at 6:44 AM, jc <jv...@gmail.com> wrote:

> Hi guys,
>
> I'm sorry if this question has been answered before, I looked but didn't
> find anything.
>
> This is my scenario (only relevant settings I think):
> seed urls: about 60 homepages from different domains
> generate.max.count = 10000
> fetcher.threads.per.host = 3   I'm trying to be polite here :-)
> partition.url.mode = byHost
> fetcher.threads.fetch = 200
> fetcher.threads.per.queue = 1
> topN = 1000000
> depth = 1
>
> Since the very beginning I've had a lot of spinwaiting threads (I'm not
> sure those are actually threads, because the log doesn't really say)
>
> 194/200 spinwaiting/active, 166 pages, 3 errors, 4.7 3.8 pages/s , 1471
> 1412
> kb/s, 10000 URLs in 19 queues
>
> I don't understand why there are 19 queues, is it maybe that only 19
> websites are being fetched? Anyways, why is it that there are 194
> spinwaiting out of 200 active threads?
>
> Thanks a lot in advance for your time.
>
> Regards,
> jc
>
>
>
>



-- 
Don't Grow Old, Grow Up... :-)