Posted to user@nutch.apache.org by Michael Coffey <mc...@yahoo.com.INVALID> on 2016/12/09 01:15:30 UTC

Fetcher "hung while processing"

I sometimes get a bunch of warning messages that say "Thread #x hung while processing <url>".
Is this just a normal thing to see occasionally, or should I look for a resolution? I do have an example where the same host shows up in several of these messages, which puzzles me. I thought there should be only one thread per host, since I specified fetcher.threads.per.queue=1.
Here is an example log showing the first 20 of 50 hung threads. Note that http://shinystat.com and http://fabulous.com show up more than once.

2016-12-09 00:47:29,559 WARN [main] org.apache.nutch.fetcher.Fetcher: Aborting with 50 hung threads.
2016-12-09 00:47:29,560 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #0 hung while processing https://www.hugedomains.com/domain_search.cfm?catSearch=434
2016-12-09 00:47:29,561 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #1 hung while processing http://fabulous.com/informationcenter/index.htm?formcode%5Bobjective%5D=&formcode%5Bevent%5D=&formcode%5Bregistrytime%5D=1481233769&formcode%5Bcertificate%5D=dfd737bc4490a09d4786cb0e87a15ba6&formdata%5Bqid%5D=820
2016-12-09 00:47:29,561 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #2 hung while processing http://shinystat.com/it/pro/info_pro.html
2016-12-09 00:47:29,561 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #3 hung while processing http://events.stanford.edu/byCategory/13/
2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #4 hung while processing https://www.ladesk.com/pricing/hosted/terms-and-conditions/
2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #5 hung while processing http://shinystat.com/en/opt-out_free.html
2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #6 hung while processing http://shinystat.com/fr/biz/info_biz.html
2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #7 hung while processing http://fabulous.com/informationcenter/index.htm?formcode%5Bobjective%5D=&formcode%5Bevent%5D=&formcode%5Bregistrytime%5D=1481233769&formcode%5Bcertificate%5D=dfd737bc4490a09d4786cb0e87a15ba6&formdata%5Bqid%5D=88
2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #8 hung while processing http://www.youronlinechoices.com/sk/slovnik-pojmov
2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #9 hung while processing https://twitter.com/sakura_ope
2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #10 hung while processing http://europa.eu/european-union/topics/culture_en
2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #11 hung while processing http://www.youronlinechoices.com/ee/opt-out-help
2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #12 hung while processing https://www.hugedomains.com/domain_search.cfm?catSearch=437
2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #13 hung while processing http://hosted.ap.org/dynamic/stories/U/US_OBIT_JOHN_GLENN?SITE=AP&SECTION=HOME&TEMPLATE=DEFAULT
2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #14 hung while processing http://static.fc2.com/sh_css/common/base.css?1200605
2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #15 hung while processing https://www.hugedomains.com/terms.cfm
2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #16 hung while processing https://www.ladesk.com/comparisons/
2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #17 hung while processing http://hu.statcounter.com/features/
2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #18 hung while processing http://europa.eu/european-union/about-eu/working_el
2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #19 hung while processing http://www.atinternet.com/es/recursos/
2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #20 hung while processing http://ietf.org/rfc/rfc2026.txt
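
A quick way to see which hosts occupy more than one hung thread is to tally the warnings per hostname. The following is only a minimal sketch: the helper name `hung_hosts` is made up for illustration, the sample lines are copied from the log above, and grouping by hostname assumes the fetcher queues are keyed by host.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches the warning format shown in the log above.
HUNG = re.compile(r"Thread #(\d+) hung while processing (\S+)")

def hung_hosts(log_lines):
    """Count how many hung threads were processing each hostname."""
    counts = Counter()
    for line in log_lines:
        match = HUNG.search(line)
        if match:
            counts[urlsplit(match.group(2)).hostname] += 1
    return counts

# Four lines taken from the log above.
sample = [
    "2016-12-09 00:47:29,561 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #2 hung while processing http://shinystat.com/it/pro/info_pro.html",
    "2016-12-09 00:47:29,561 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #3 hung while processing http://events.stanford.edu/byCategory/13/",
    "2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #5 hung while processing http://shinystat.com/en/opt-out_free.html",
    "2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #6 hung while processing http://shinystat.com/fr/biz/info_biz.html",
]

counts = hung_hosts(sample)
repeated = {host: n for host, n in counts.items() if n > 1}
print(repeated)
```

Feeding the whole task log through `hung_hosts` instead of the four sample lines would list every host that appears on more than one hung thread.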


Re: Fetcher "hung while processing"

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Michael,

> I have less than the default amount of memory allocated: mapreduce.map.memory.mb=500; mapreduce.map.java.opts=-Xmx300m.

300 MB of Java heap is probably not enough. A fetcher with 50 threads is expected to keep in memory:
- 50 open connections with 50 buffers to hold the fetched content
- the queues with URLs to fetch next
- parsed and cached robots.txt rules

AFAIK, Hadoop also needs a certain amount of memory to buffer the output before it's spilled to
disk.

If a task runs low on memory, it naturally becomes slower because CPU is used for garbage
collection. That could explain why the hang-ups happen only sometimes and somewhat unpredictably.
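
To make the memory argument concrete, here is a rough back-of-the-envelope sketch. Only the heap size and the thread counts come from this thread. The 100 MB sort buffer is assumed to be Hadoop's default for mapreduce.task.io.sort.mb, and the 1 MB per-thread content buffer is a made-up round figure (the real value depends on http.content.limit and the pages fetched).

```python
# Figures from the thread.
HEAP_MB = 300            # -Xmx300m
THREADS = 50             # fetcher threads

# Assumptions for illustration only.
SORT_BUFFER_MB = 100     # assumed default of mapreduce.task.io.sort.mb
CONTENT_BUFFER_MB = 1.0  # hypothetical per-thread buffer for fetched content

def remaining_heap_mb(heap_mb, threads, sort_buffer_mb, content_buffer_mb):
    """Heap left for URL queues, robots.txt rules and everything else."""
    return heap_mb - sort_buffer_mb - threads * content_buffer_mb

left_50 = remaining_heap_mb(HEAP_MB, THREADS, SORT_BUFFER_MB, CONTENT_BUFFER_MB)
left_200 = remaining_heap_mb(HEAP_MB, 200, SORT_BUFFER_MB, CONTENT_BUFFER_MB)
print(f"50 threads:  {left_50:.0f} MB left")
print(f"200 threads: {left_200:.0f} MB left")
```

Under these assumptions the 200-thread run would leave nothing for the queues and the robots.txt cache, which would fit the garbage-collection explanation, but the real numbers need to be measured.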

Best,
Sebastian


On 12/16/2016 08:17 PM, Michael Coffey wrote:
> Hello again, here is a lot more detail on my issue.
> 
> I have had the same problem with many different runs, but certainly not all of them. Typically, it seems to happen on one of several nodes of the cluster (but not the same node each time). One time, I used numThreads=200, instead of 50, and I got 200 hung threads at the end of the fetching task.
> Nutch version is 1.12. It is running distributed on Hadoop with anywhere from 2 to 7 slave nodes. I have no explicit timeout settings in mapred-site.xml, nutch-site.xml, or the crawl script. I have the following fetcher.* props: fetcher.threads.per.queue=1; fetcher.timelimit.mins=180; fetcher.server.delay=1.0. I also have generate.max.count=25.
> 
> I have less than the default amount of memory allocated: mapreduce.map.memory.mb=500; mapreduce.map.java.opts=-Xmx300m.
> 
> Looking at the logs, I don't see anything very obvious. In the 200-thread case, the word "error" appeared only about 70 times. 
> 
> Here is an example describing 3 of the 50 hung threads of one run that attempts to fetch about 300 resources.
> 2016-12-16 04:22:45,644 WARN [main] org.apache.nutch.fetcher.Fetcher: Aborting with 50 hung threads.
> 2016-12-16 04:22:45,645 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #0 hung while processing http://aktr.jp/news/x-girl-sports-x-aktr-02-%E3%83%97%E3%83%AD%E3%82%B8%E3%82%A7%E3%82%AF%E3%83%88%E3%82%B5%E3%82%A4%E3%83%88%E5%85%AC%E9%96%8B/
> 2016-12-16 04:22:45,647 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #1 hung while processing http://www.googletagmanager.com/ns.html?id=GTM-MF52KS
> 2016-12-16 04:22:45,648 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #2 hung while processing http://mashable.com/2016/12/14/teens-drinking-down/
> 
> Earlier in the log:
> 2016-12-16 04:12:38,315 INFO [FetcherThread] org.apache.nutch.fetcher.FetcherThread: fetching http://aktr.jp/news/x-girl-sports-x-aktr-02-%E3%83%97%E3%83%AD%E3%82%B8%E3%82%A7%E3%82%AF%E3%83%88%E3%82%B5%E3%82%A4%E3%83%88%E5%85%AC%E9%96%8B/ (queue crawl delay=1000ms)
> 
> And there was no other mention of aktr.jp/news/x-girl-sports-x-aktr. However, there were 24 aktr.jp URLs that it tried to fetch, 4 of which ended up on hung threads.
> 
> 2016-12-16 04:12:33,427 INFO [FetcherThread] org.apache.nutch.fetcher.FetcherThread: fetching http://www.googletagmanager.com/ns.html?id=GTM-MF52KS (queue crawl delay=1000ms)
> 
> And there was no other mention of googletagmanager.com.
> 2016-12-16 04:12:57,991 INFO [FetcherThread] org.apache.nutch.fetcher.FetcherThread: fetching http://mashable.com/2016/12/14/teens-drinking-down/ (queue crawl delay=1000ms)
> 
> And there was no other mention of mashable.com/2016/12/14/teens. There were 17 mashable.com URLs that it tried to fetch, 2 of which ended up on hung threads.
> 
> I get only about 9 mentions of the word "error" in the syslog for this run. They include some failures to get robots.txt for various sites, some NoHttpResponseException, and some ConnectTimeoutException. None of those errors relate to the URLs I mentioned above.
> 


Re: Fetcher "hung while processing"

Posted by Michael Coffey <mc...@yahoo.com.INVALID>.
Hello again, here is a lot more detail on my issue.

I have had the same problem with many different runs, but certainly not all of them. Typically, it seems to happen on one of several nodes of the cluster (but not the same node each time). One time, I used numThreads=200, instead of 50, and I got 200 hung threads at the end of the fetching task.
Nutch version is 1.12. It is running distributed on Hadoop with anywhere from 2 to 7 slave nodes. I have no explicit timeout settings in mapred-site.xml, nutch-site.xml, or the crawl script. I have the following fetcher.* props: fetcher.threads.per.queue=1; fetcher.timelimit.mins=180; fetcher.server.delay=1.0. I also have generate.max.count=25.

I have less than the default amount of memory allocated: mapreduce.map.memory.mb=500; mapreduce.map.java.opts=-Xmx300m.

Looking at the logs, I don't see anything very obvious. In the 200-thread case, the word "error" appeared only about 70 times. 

Here is an example describing 3 of the 50 hung threads of one run that attempts to fetch about 300 resources.
2016-12-16 04:22:45,644 WARN [main] org.apache.nutch.fetcher.Fetcher: Aborting with 50 hung threads.
2016-12-16 04:22:45,645 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #0 hung while processing http://aktr.jp/news/x-girl-sports-x-aktr-02-%E3%83%97%E3%83%AD%E3%82%B8%E3%82%A7%E3%82%AF%E3%83%88%E3%82%B5%E3%82%A4%E3%83%88%E5%85%AC%E9%96%8B/
2016-12-16 04:22:45,647 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #1 hung while processing http://www.googletagmanager.com/ns.html?id=GTM-MF52KS
2016-12-16 04:22:45,648 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #2 hung while processing http://mashable.com/2016/12/14/teens-drinking-down/

Earlier in the log:
2016-12-16 04:12:38,315 INFO [FetcherThread] org.apache.nutch.fetcher.FetcherThread: fetching http://aktr.jp/news/x-girl-sports-x-aktr-02-%E3%83%97%E3%83%AD%E3%82%B8%E3%82%A7%E3%82%AF%E3%83%88%E3%82%B5%E3%82%A4%E3%83%88%E5%85%AC%E9%96%8B/ (queue crawl delay=1000ms)

And there was no other mention of aktr.jp/news/x-girl-sports-x-aktr. However, there were 24 aktr.jp URLs that it tried to fetch, 4 of which ended up on hung threads.

2016-12-16 04:12:33,427 INFO [FetcherThread] org.apache.nutch.fetcher.FetcherThread: fetching http://www.googletagmanager.com/ns.html?id=GTM-MF52KS (queue crawl delay=1000ms)

And there was no other mention of googletagmanager.com.
2016-12-16 04:12:57,991 INFO [FetcherThread] org.apache.nutch.fetcher.FetcherThread: fetching http://mashable.com/2016/12/14/teens-drinking-down/ (queue crawl delay=1000ms)

And there was no other mention of mashable.com/2016/12/14/teens. There were 17 mashable.com URLs that it tried to fetch, 2 of which ended up on hung threads.
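
The gap between each "fetching" line and the matching "hung while processing" warning can be worked out directly from the timestamps quoted above. This is a small sketch using only those timestamps; the helper name `seconds_between` is made up for illustration.

```python
from datetime import datetime

# Timestamp layout used in the log lines above (comma before the milliseconds).
FMT = "%Y-%m-%d %H:%M:%S,%f"

def seconds_between(start, end):
    """Elapsed seconds between two log timestamps."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()

# (time of the "fetching" line, time of the "hung" warning), copied from the log.
pairs = {
    "aktr.jp": ("2016-12-16 04:12:38,315", "2016-12-16 04:22:45,645"),
    "googletagmanager.com": ("2016-12-16 04:12:33,427", "2016-12-16 04:22:45,647"),
    "mashable.com": ("2016-12-16 04:12:57,991", "2016-12-16 04:22:45,648"),
}

gaps = {host: seconds_between(start, end) for host, (start, end) in pairs.items()}
for host, gap in gaps.items():
    print(f"{host}: {gap:.1f} s between fetch start and abort")
```

All three URLs had been in flight for roughly ten minutes when the fetcher aborted.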

I get only about 9 mentions of the word "error" in the syslog for this run. They include some failures to get robots.txt for various sites, some NoHttpResponseException, and some ConnectTimeoutException. None of those errors relate to the URLs I mentioned above.


      From: Sebastian Nagel <wa...@googlemail.com>
 To: user@nutch.apache.org 
 Sent: Friday, December 9, 2016 12:42 PM
 Subject: Re: Fetcher "hung while processing"
   
Hi Michael,

> What other post-fetch actions are there?

Well, the fetched content is spilled to disk which may also become slow in pathological cases.

But I think it's more important to analyze what happened with the URLs before. The logs
should contain a message "fetching ..." for every hanging URL. When does it happen?

If possible, let us know about
- Nutch version
- environment (local, distributed)
- configuration, esp. if not the default:
    mapreduce.task.timeout
    fetcher.threads.timeout.divisor
    http.timeout
  and, if in doubt, all other modified
    fetcher.*
  properties

Is the problem reproducible, or does it happen only sometimes?

Thanks,
Sebastian


Re: Fetcher "hung while processing"

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Michael,

> What other post-fetch actions are there?

Well, the fetched content is spilled to disk which may also become slow in pathological cases.

But I think it's more important to analyze what happened with the URLs before. The logs
should contain a message "fetching ..." for every hanging URL. When does it happen?

If possible, let us know about
- Nutch version
- environment (local, distributed)
- configuration, esp. if not the default:
    mapreduce.task.timeout
    fetcher.threads.timeout.divisor
    http.timeout
  and, if in doubt, all other modified
    fetcher.*
  properties
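
The first two properties matter because, as far as I understand it, the fetcher declares its threads hung once no new request has started for mapreduce.task.timeout divided by fetcher.threads.timeout.divisor. The sketch below only illustrates that assumed rule. The function name is hypothetical, and the defaults (600000 ms and a divisor of 2) are assumptions that should be checked against the installed Hadoop and Nutch versions.

```python
def hung_thread_timeout_ms(task_timeout_ms=600_000, divisor=2):
    """Assumed rule: idle time after which the fetcher aborts with hung threads.

    task_timeout_ms stands for mapreduce.task.timeout and divisor for
    fetcher.threads.timeout.divisor; both defaults are assumptions.
    """
    return task_timeout_ms // divisor

default_timeout = hung_thread_timeout_ms()
print(f"assumed default: {default_timeout / 60_000:.0f} minutes without a new request")
```

If that rule holds, raising mapreduce.task.timeout or lowering the divisor would give slow fetches more time before the abort, but it would not explain why they are slow.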

Is the problem reproducible, or does it happen only sometimes?

Thanks,
Sebastian

On 12/09/2016 04:58 PM, Michael Coffey wrote:
> The property fetcher.parse is false and I pass -noParsing to the fetch command. What other post-fetch actions are there?


Re: Fetcher "hung while processing"

Posted by Michael Coffey <mc...@yahoo.com.INVALID>.
The property fetcher.parse is false and I pass -noParsing to the fetch command. What other post-fetch actions are there?


      From: Sebastian Nagel <wa...@googlemail.com>
 To: user@nutch.apache.org 
 Sent: Friday, December 9, 2016 12:58 AM
 Subject: Re: Fetcher "hung while processing"
   
Hi Michael,

What about the property fetcher.parse?

The queue is unblocked after a page has been fetched but before parsing.
If the parser is hanging or one of the post-fetch actions takes too long,
it may happen that there are multiple URLs from the same host still in
process.

Sebastian

On 12/09/2016 02:15 AM, Michael Coffey wrote:
> I sometimes get a bunch of warning messages that say Thread #x hung while processing <url>
> Is this just a normal thing to see occasionally, or should I look to find some resolution? I do have an example where the same host shows up on a multitude of these messages, which puzzles me. I think there should be only one thread per host, due to me specifying fetcher.threads.per.queue=1
> Here is example log showing the first 20 of 50 hung threads. Note that http://shinystat.com and http://fabulous.com show up more than once.
> 
> 2016-12-09 00:47:29,559 WARN [main] org.apache.nutch.fetcher.Fetcher: Aborting with 50 hung threads.
> 2016-12-09 00:47:29,560 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #0 hung while processing https://www.hugedomains.com/domain_search.cfm?catSearch=434
> 2016-12-09 00:47:29,561 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #1 hung while processing http://fabulous.com/informationcenter/index.htm?formcode%5Bobjective%5D=&formcode%5Bevent%5D=&formcode%5Bregistrytime%5D=1481233769&formcode%5Bcertificate%5D=dfd737bc4490a09d4786cb0e87a15ba6&formdata%5Bqid%5D=820
> 2016-12-09 00:47:29,561 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #2 hung while processing http://shinystat.com/it/pro/info_pro.html
> 2016-12-09 00:47:29,561 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #3 hung while processing http://events.stanford.edu/byCategory/13/
> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #4 hung while processing https://www.ladesk.com/pricing/hosted/terms-and-conditions/
> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #5 hung while processing http://shinystat.com/en/opt-out_free.html
> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #6 hung while processing http://shinystat.com/fr/biz/info_biz.html
> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #7 hung while processing http://fabulous.com/informationcenter/index.htm?formcode%5Bobjective%5D=&formcode%5Bevent%5D=&formcode%5Bregistrytime%5D=1481233769&formcode%5Bcertificate%5D=dfd737bc4490a09d4786cb0e87a15ba6&formdata%5Bqid%5D=88
> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #8 hung while processing http://www.youronlinechoices.com/sk/slovnik-pojmov
> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #9 hung while processing https://twitter.com/sakura_ope
> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #10 hung while processing http://europa.eu/european-union/topics/culture_en
> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #11 hung while processing http://www.youronlinechoices.com/ee/opt-out-help
> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #12 hung while processing https://www.hugedomains.com/domain_search.cfm?catSearch=437
> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #13 hung while processing http://hosted.ap.org/dynamic/stories/U/US_OBIT_JOHN_GLENN?SITE=AP&SECTION=HOME&TEMPLATE=DEFAULT
> 2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #14 hung while processing http://static.fc2.com/sh_css/common/base.css?1200605
> 2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #15 hung while processing https://www.hugedomains.com/terms.cfm
> 2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #16 hung while processing https://www.ladesk.com/comparisons/
> 2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #17 hung while processing http://hu.statcounter.com/features/
> 2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #18 hung while processing http://europa.eu/european-union/about-eu/working_el
> 2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #19 hung while processing http://www.atinternet.com/es/recursos/
> 2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #20 hung while processing http://ietf.org/rfc/rfc2026.txt
> 
> 

Re: Fetcher "hung while processing"

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Michael,

What about the property fetcher.parse? Is it set to true?

The queue is unblocked after a page has been fetched but before it is
parsed. If the parser hangs or one of the post-fetch actions takes too
long, multiple URLs from the same host may still be in process at the
same time, even with fetcher.threads.per.queue=1.
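
To illustrate the ordering described above, here is a minimal sketch in
plain Java. It is a toy model, not Nutch's fetcher code: the class
QueueUnblockSketch, the method peakInProcess and all variable names are
made up for this example. A Semaphore with one permit stands in for a
host queue limited to one thread, and a latch stands in for a parser
that does not return. The only point is that releasing the queue before
"parsing" lets several URLs of one host pile up in process.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model only: none of these names come from the Nutch code base.
class QueueUnblockSketch {

    /** Returns the peak number of URLs of ONE host that were in process at once. */
    static int peakInProcess(int urlCount) {
        Semaphore hostQueue = new Semaphore(1);                 // models one thread per host queue
        CountDownLatch allFetched = new CountDownLatch(urlCount);
        CountDownLatch parserReleased = new CountDownLatch(1);  // stands in for a hanging parser
        AtomicInteger inProcess = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService threads = Executors.newFixedThreadPool(urlCount);

        for (int i = 0; i < urlCount; i++) {
            threads.execute(() -> {
                try {
                    hostQueue.acquire();                        // only one fetch per host at a time
                    peak.accumulateAndGet(inProcess.incrementAndGet(), Math::max);
                    hostQueue.release();                        // queue unblocked: fetched, not yet parsed
                    allFetched.countDown();
                    parserReleased.await();                     // simulated parsing blocks here
                    inProcess.decrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        try {
            allFetched.await();                                 // every URL fetched, none parsed yet
            int result = peak.get();
            parserReleased.countDown();                         // let the simulated parser finish
            threads.shutdown();
            threads.awaitTermination(10, TimeUnit.SECONDS);
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            threads.shutdownNow();
            return peak.get();
        }
    }
}
```

Calling QueueUnblockSketch.peakInProcess(3) returns 3: the three URLs
are fetched strictly one after another, yet all three are still in
process while the simulated parser is stuck. If the release were moved
to after the parsing step, the same call would never get past the first
URL until the parser returned.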

Sebastian

On 12/09/2016 02:15 AM, Michael Coffey wrote:
> I sometimes get a bunch of warning messages that say Thread #x hung while processing <url>
> Is this just a normal thing to see occasionally, or should I look to find some resolution? I do have an example where the same host shows up on a multitude of these messages, which puzzles me. I think there should be only one thread per host, due to me specifying fetcher.threads.per.queue=1
> Here is example log showing the first 20 of 50 hung threads. Note that http://shinystat.com and http://fabulous.com show up more than once.
> 
> 2016-12-09 00:47:29,559 WARN [main] org.apache.nutch.fetcher.Fetcher: Aborting with 50 hung threads.
> 2016-12-09 00:47:29,560 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #0 hung while processing https://www.hugedomains.com/domain_search.cfm?catSearch=434
> 2016-12-09 00:47:29,561 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #1 hung while processing http://fabulous.com/informationcenter/index.htm?formcode%5Bobjective%5D=&formcode%5Bevent%5D=&formcode%5Bregistrytime%5D=1481233769&formcode%5Bcertificate%5D=dfd737bc4490a09d4786cb0e87a15ba6&formdata%5Bqid%5D=820
> 2016-12-09 00:47:29,561 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #2 hung while processing http://shinystat.com/it/pro/info_pro.html
> 2016-12-09 00:47:29,561 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #3 hung while processing http://events.stanford.edu/byCategory/13/
> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #4 hung while processing https://www.ladesk.com/pricing/hosted/terms-and-conditions/
> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #5 hung while processing http://shinystat.com/en/opt-out_free.html
> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #6 hung while processing http://shinystat.com/fr/biz/info_biz.html
> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #7 hung while processing http://fabulous.com/informationcenter/index.htm?formcode%5Bobjective%5D=&formcode%5Bevent%5D=&formcode%5Bregistrytime%5D=1481233769&formcode%5Bcertificate%5D=dfd737bc4490a09d4786cb0e87a15ba6&formdata%5Bqid%5D=88
> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #8 hung while processing http://www.youronlinechoices.com/sk/slovnik-pojmov
> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #9 hung while processing https://twitter.com/sakura_ope
> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #10 hung while processing http://europa.eu/european-union/topics/culture_en
> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #11 hung while processing http://www.youronlinechoices.com/ee/opt-out-help
> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #12 hung while processing https://www.hugedomains.com/domain_search.cfm?catSearch=437
> 2016-12-09 00:47:29,562 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #13 hung while processing http://hosted.ap.org/dynamic/stories/U/US_OBIT_JOHN_GLENN?SITE=AP&SECTION=HOME&TEMPLATE=DEFAULT
> 2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #14 hung while processing http://static.fc2.com/sh_css/common/base.css?1200605
> 2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #15 hung while processing https://www.hugedomains.com/terms.cfm
> 2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #16 hung while processing https://www.ladesk.com/comparisons/
> 2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #17 hung while processing http://hu.statcounter.com/features/
> 2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #18 hung while processing http://europa.eu/european-union/about-eu/working_el
> 2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #19 hung while processing http://www.atinternet.com/es/recursos/
> 2016-12-09 00:47:29,563 WARN [main] org.apache.nutch.fetcher.Fetcher: Thread #20 hung while processing http://ietf.org/rfc/rfc2026.txt
> 
>