Posted to user@nutch.apache.org by kaveh minooie <ka...@plutoz.com> on 2012/02/06 23:22:35 UTC

Thread spinWaiting, utilizing bandwidth and connection time out error

So I found the exception:

2012-02-06 12:20:36,614 ERROR org.apache.nutch.protocol.httpclient.Http: 
org.apache.commons.httpclient.ConnectTimeoutException: The host did not 
accept the connection within timeout of 50000 ms

which would have been understandable except for the fact that I am barely 
using 10% of my bandwidth (I understand that it could be busy on the other 
end, but I don't think that is the case here). I don't seem to be able to 
utilize my bandwidth correctly or completely. I am defining 4 available 
slots per datanode for map (and reduce, for that matter), and during tasks 
like updatedb or generate I can see that the entire CPU capacity on all 
the nodes is being used (CPU idle time goes to 0.0), so I don't think 
increasing the available slots would gain me anything. Now the only thing 
I can think of to increase the fetcher's performance is to increase the 
number of threads. When I use anything more than 16 (I tried 32, and I 
have quad-core CPUs), I get the above-mentioned exception for almost 95% 
of the pages. When I run it with 16 threads I am able to fetch most of the 
pages (more than 90%), but when I look at the log file, most of the time 
15 out of 16 threads are spinwaiting:

2012-02-06 12:20:24,686 INFO org.apache.nutch.fetcher.Fetcher: -activeThreads=16, spinWaiting=13, fetchQueues.totalSize=800
2012-02-06 12:20:50,555 INFO org.apache.nutch.fetcher.Fetcher: -activeThreads=16, spinWaiting=16, fetchQueues.totalSize=800
2012-02-06 12:20:52,612 INFO org.apache.nutch.fetcher.Fetcher: -activeThreads=16, spinWaiting=15, fetchQueues.totalSize=799
2012-02-06 12:21:07,040 INFO org.apache.nutch.fetcher.Fetcher: -activeThreads=16, spinWaiting=14, fetchQueues.totalSize=799
2012-02-06 12:21:08,156 INFO org.apache.nutch.fetcher.Fetcher: -activeThreads=16, spinWaiting=14, fetchQueues.totalSize=800
2012-02-06 12:21:09,220 INFO org.apache.nutch.fetcher.Fetcher: -activeThreads=16, spinWaiting=16, fetchQueues.totalSize=800
2012-02-06 12:21:10,248 INFO org.apache.nutch.fetcher.Fetcher: -activeThreads=16, spinWaiting=16, fetchQueues.totalSize=800
2012-02-06 12:21:11,289 INFO org.apache.nutch.fetcher.Fetcher: -activeThreads=16, spinWaiting=16, fetchQueues.totalSize=800
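Rather than eyeballing individual lines, the spinWaiting values can be aggregated over the whole log to see how often the threads sit idle. A sketch (the printf lines below are stand-ins for real Fetcher entries; the actual log path varies by Hadoop setup, e.g. somewhere under the tasktracker's userlogs directory):

```shell
# Count how often each spinWaiting value appears in Fetcher log output.
# In practice, pipe in the real task log instead of these sample lines.
printf '%s\n' \
  'INFO org.apache.nutch.fetcher.Fetcher: -activeThreads=16, spinWaiting=13, fetchQueues.totalSize=800' \
  'INFO org.apache.nutch.fetcher.Fetcher: -activeThreads=16, spinWaiting=16, fetchQueues.totalSize=800' \
  'INFO org.apache.nutch.fetcher.Fetcher: -activeThreads=16, spinWaiting=16, fetchQueues.totalSize=800' |
  grep -o 'spinWaiting=[0-9]*' |   # extract the spinWaiting=N token
  cut -d= -f2 |                    # keep only the number
  sort | uniq -c | sort -rn        # frequency count, most common first
```

A high count at or near the thread total (16 here) confirms that most threads spend most of their time waiting rather than fetching.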


These are excerpts from the log files. I am trying to crawl somewhere 
between 3000 and 4000 sites in each run, so I don't think the threads are 
waiting due to the politeness factor. And when I check the firewall, I see 
that I am only using a fraction of the available bandwidth, and even that 
is only true at the beginning of the fetching process; it sharply drops to 
between 1/3 and 1/4 of the initial value. I want to be able to use my 
entire bandwidth. What am I doing wrong? What should I do?
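For reference, these are the knobs that seem relevant in nutch-site.xml. This is only a sketch with illustrative values (the property names are from the standard Nutch configuration, and http.timeout is presumably where the 50000 ms connect timeout above comes from):

```xml
<!-- Illustrative values only; tune for your own crawl. -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>16</value>
  <description>Total fetcher threads per fetch task.</description>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>1</value>
  <description>Threads allowed on the same host queue at once;
  raising this trades politeness for throughput.</description>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>Seconds between successive requests to the same host;
  with few hosts in the queue, this alone can keep threads spinwaiting.</description>
</property>
<property>
  <name>http.timeout</name>
  <value>50000</value>
  <description>Network timeout in milliseconds.</description>
</property>
```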

Thanks,

On 02/06/2012 12:36 PM, Markus Jelsma wrote:
>
>> Yes, I think I did get the right mapper, since this is what I could see
>> in 50030/taskdetails.jsp:
>>
>> 	attempt_201202041349_0021_m_000004_0
>> and I looked at the folder of the same name on that datanode machine.
>
> Odd. But without the stack trace we cannot figure out what's going on.
> Usually, when throughput drops and errors rise, you have crossed the limits
> of your hardware, network and/or settings. Always check syslog as well.
>
> Anyway, did you view the log file in its entirety, via the GUI or the
> shell? It must be there, or some magic is happening.
>
>>
>> And the number of exceptions does not increase linearly but
>> exponentially: for example, if I run the same thing with 32 threads, the
>> number of exceptions would be 69 and the number of successes would be
>> only 1 or 2. It seems like a race scenario to me that causes them all to
>> fail, but I don't know...
>>
>> On 02/06/2012 12:08 PM, Markus Jelsma wrote:
>>>> Thanks Markus.
>>>>
>>>> When I am running the fetcher, sometimes in the job details I see
>>>> something like this (this is on 50030/jobtasks.jsp btw):
>>>>
>>>>       16 threads, 1 queues, 15 URLs queued, 55 pages, 1 errors, 0.0 (1)
>>>> pages/s, 19.0 (11815) kbits/s,
>>>>
>>>> When I click on the counters, I see this:
>>>>
>>>> FetcherStatus
>>>>
>>>> 	exception 	2
>>>> 	success 	69
>>>>
>>>> That was why I was looking for the Nutch log files, which, thanks to
>>>> Markus, I found. But when I looked at the syslog file corresponding to
>>>> the job ID and the attempt that generated the above message, there was
>>>> no exception in the log files.
>>>
>>> Did you get the right mapper? Each mapper has its own log file.
>>>
>>>> So does anyone know what this means? BTW, the number of "exception"
>>>> counters increases exponentially when I increase the number of fetcher
>>>> threads (like using 32 instead of 16).
>>>
>>> Well, that makes sense. If, for example, one out of 10 fetches fails per
>>> thread, then you get twice as many failures when you double the number
>>> of threads. There are always exceptions unless you are in an ideal world.
>>>
>>>> If someone could explain to me the mechanism that causes this, I would
>>>> very much appreciate it.
>>>>
>>>> Thanks,
>>>>
>>>> On 02/06/2012 07:46 AM, Markus Jelsma wrote:
>>>>> There's a userlogs directory on your datanodes. You can also view the
>>>>> logs through your web GUI.
>>>>>
>>>>> On Sunday 05 February 2012 00:09:16 kaveh minooie wrote:
>>>>>> Hi everyone
>>>>>>
>>>>>> Does anybody know how I can see the Nutch logs when Nutch is run on
>>>>>> top of Hadoop in a multi-node environment? In the Hadoop log
>>>>>> directories, I only have datanode and tasktracker log files, but they
>>>>>> don't have any Nutch entries in them.
>>>>>>
>>>>>> Thanks,

-- 
Kaveh Minooie

www.plutoz.com