Posted to user@nutch.apache.org by Dennis Kubes <nu...@dragonflymc.com> on 2006/08/04 21:24:11 UTC

Fetch jumps to 1.0 complete

Hi Everybody,

I am running some fetches, and some of the tasks are going along fine and 
are about 20% complete, and then they will mysteriously jump to 100% 
complete.  Each time I get the "Aborting with N hung threads" message in 
the logs.  Is anybody else seeing this?  Is there any way to get around 
the fetcher just stopping like this?

Dennis

Re: Fetch jumps to 1.0 complete

Posted by Andrzej Bialecki <ab...@getopt.org>.
Dennis Kubes wrote:
> I am currently implementing a patch for the older 0.8 code that allows 
> pages with a crawl delay > x seconds to be ignored, where the number of 
> seconds is configurable.  What do you think the best way to return 
> from HttpBase would be?  Would it be to throw an HttpException, or 
> return a ProtocolStatus with, say, GONE or something like that?

In the latest patch in NUTCH-339 I added a ProtocolStatus.WOULDBLOCK, 
and a section in Fetcher2 which is supposed to handle that - although 
after I removed the block/unblockAddr from lib-http there is no code in 
that patch that uses this status code.
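
For illustration, a minimal sketch of what that check could look like in 
HttpBase - the property name and the robotRules/getCrawlDelay accessors 
here are assumptions, not the exact 0.8 API:

    // Skip pages whose robots.txt Crawl-Delay exceeds a configurable cap,
    // returning a status instead of sleeping or throwing.
    long maxDelay = conf.getInt("http.max.crawl.delay", 30) * 1000L;
    long crawlDelay = robotRules.getCrawlDelay(url);  // hypothetical, in ms
    if (crawlDelay > maxDelay) {
      return new ProtocolOutput(null,
          new ProtocolStatus(ProtocolStatus.WOULDBLOCK, url.toString()));
    }

Returning a status keeps the fetch loop moving and records why the page 
was skipped; a thrown HttpException would look like an ordinary transient 
failure and could be retried.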

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Fetch jumps to 1.0 complete

Posted by Dennis Kubes <nu...@dragonflymc.com>.
I am currently implementing a patch for the older 0.8 code that allows 
pages with a crawl delay > x seconds to be ignored, where the number of 
seconds is configurable.  What do you think the best way to return from 
HttpBase would be?  Would it be to throw an HttpException, or return a 
ProtocolStatus with, say, GONE or something like that?

Dennis
>
> Dennis Kubes wrote:
>> I added some test code that hacks the delay down to 30 seconds when 
>> the robots.txt delay is greater than 30 seconds.  It prints out the 
>> original delay value.  Here is the output I am seeing:
>>
>> task_0005_m_000005_0 Someone is setting way too long of a delay value...520 seconds
>> task_0005_m_000005_0 Someone is setting way too long of a delay value...520 seconds
>>
>> So far it has hit 4 of 5 fetcher threads on a single machine.  I am 
>> pretty sure this is what is causing the hung threads.  I have a crawl 
>> running now.  I will update on its status later.  It is now 3am here 
>> so for now must sleep. :-P
>>
>> Dennis
>>
>> Andrzej Bialecki wrote:
>>> Dennis Kubes wrote:
>>>> Just a thought going through the fetcher code.  If the robots.txt 
>>>> specifies a delay >= the task timeout value, the task thread will 
>>>> sleep for that amount of time and eventually be considered a "hung 
>>>> thread" even though it is really just sleeping.  Of course I could 
>>>> be reading the code wrong.  It is about 2am here.  I will test this 
>>>> concept tomorrow to see if that is actually what is happening with 
>>>> the hung threads.
>>>
>>> For the fetcher to die, all threads would have to end up in this 
>>> state. But this sort of rings a bell - this may be an unintended 
>>> consequence of implementing Crawl-Delay support ...
>>>
>>> NUTCH-339 now compiles and is lightly tested. Threads don't block 
>>> there; instead they put fetchlist entries on a time-sorted queue and 
>>> continue working on other items. So, this condition never occurs.
>>>

Re: Fetch jumps to 1.0 complete

Posted by Dennis Kubes <nu...@dragonflymc.com>.
Looks like that fixed it.  Here is some of the output I saw.

task_0005_m_000005_0 Someone is setting way too long of a delay value...520 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...520 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...520 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...520 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...500 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...500 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...500 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...500 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...500 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...500 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...500 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...500 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...500 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...500 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...500 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...500 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...500 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...500 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...500 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...500 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...500 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...500 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...500 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...500 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...500 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...500 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...500 seconds
task_0005_m_000003_0 Someone is setting way too long of a delay value...60 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...120 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...300 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...300 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...300 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...1200 seconds
task_0005_m_000003_0 Someone is setting way too long of a delay value...60 seconds
task_0005_m_000003_0 Someone is setting way too long of a delay value...60 seconds
task_0005_m_000003_0 Someone is setting way too long of a delay value...120 seconds
task_0005_m_000003_0 Someone is setting way too long of a delay value...120 seconds
task_0005_m_000003_0 Someone is setting way too long of a delay value...120 seconds
task_0005_m_000003_0 Someone is setting way too long of a delay value...120 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...120 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...120 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...120 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...120 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...120 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...120 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...120 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...120 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...120 seconds
task_0005_m_000003_0 Someone is setting way too long of a delay value...100 seconds
task_0005_m_000003_0 Someone is setting way too long of a delay value...3600 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...500 seconds

Dennis Kubes wrote:
> I added some test code that hacks the delay down to 30 seconds when 
> the robots.txt delay is greater than 30 seconds.  It prints out the 
> original delay value.  Here is the output I am seeing:
>
> task_0005_m_000005_0 Someone is setting way too long of a delay value...520 seconds
> task_0005_m_000005_0 Someone is setting way too long of a delay value...520 seconds
>
> So far it has hit 4 of 5 fetcher threads on a single machine.  I am 
> pretty sure this is what is causing the hung threads.  I have a crawl 
> running now.  I will update on its status later.  It is now 3am here 
> so for now must sleep. :-P
>
> Dennis
>
> Andrzej Bialecki wrote:
>> Dennis Kubes wrote:
>>> Just a thought going through the fetcher code.  If the robots.txt 
>>> specifies a delay >= the task timeout value, the task thread will 
>>> sleep for that amount of time and eventually be considered a "hung 
>>> thread" even though it is really just sleeping.  Of course I could 
>>> be reading the code wrong.  It is about 2am here.  I will test this 
>>> concept tomorrow to see if that is actually what is happening with 
>>> the hung threads.
>>
>> For the fetcher to die, all threads would have to end up in this 
>> state. But this sort of rings a bell - this may be an unintended 
>> consequence of implementing Crawl-Delay support ...
>>
>> NUTCH-339 now compiles and is lightly tested. Threads don't block 
>> there; instead they put fetchlist entries on a time-sorted queue and 
>> continue working on other items. So, this condition never occurs.
>>

Re: Fetch jumps to 1.0 complete

Posted by Dennis Kubes <nu...@dragonflymc.com>.
I added some test code that hacks the delay down to 30 seconds when the 
robots.txt delay is greater than 30 seconds.  It prints out the original 
delay value.  Here is the output I am seeing:

task_0005_m_000005_0 Someone is setting way too long of a delay value...520 seconds
task_0005_m_000005_0 Someone is setting way too long of a delay value...520 seconds

So far it has hit 4 of 5 fetcher threads on a single machine.  I am 
pretty sure this is what is causing the hung threads.  I have a crawl 
running now.  I will update on its status later.  It is now 3am here so 
for now must sleep. :-P
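
For reference, the hack looks more or less like this (my sketch, not the 
exact test code; robotRules and getTaskId are stand-in names):

    // Cap the robots.txt Crawl-Delay at 30 seconds and log the original
    // value so overly long delays show up in the task logs.
    long delay = robotRules.getCrawlDelay();  // hypothetical accessor, ms
    long maxDelay = 30 * 1000L;
    if (delay > maxDelay) {
      LOG.info(getTaskId() + " Someone is setting way too long of a delay"
          + " value..." + (delay / 1000) + " seconds");
      delay = maxDelay;  // sleep at most 30s instead of the requested delay
    }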

Dennis

Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>> Just a thought going through the fetcher code.  If the robots.txt 
>> specifies a delay >= the task timeout value, the task thread will 
>> sleep for that amount of time and eventually be considered a "hung 
>> thread" even though it is really just sleeping.  Of course I could be 
>> reading the code wrong.  It is about 2am here.  I will test this 
>> concept tomorrow to see if that is actually what is happening with 
>> the hung threads.
>
> For the fetcher to die, all threads would have to end up in this 
> state. But this sort of rings a bell - this may be an unintended 
> consequence of implementing Crawl-Delay support ...
>
> NUTCH-339 now compiles and is lightly tested. Threads don't block 
> there; instead they put fetchlist entries on a time-sorted queue and 
> continue working on other items. So, this condition never occurs.
>

Re: Fetch jumps to 1.0 complete

Posted by Andrzej Bialecki <ab...@getopt.org>.
Dennis Kubes wrote:
> Just a thought going through the fetcher code.  If the robots.txt 
> specifies a delay >= the task timeout value, the task thread will 
> sleep for that amount of time and eventually be considered a "hung 
> thread" even though it is really just sleeping.  Of course I could be 
> reading the code wrong.  It is about 2am here.  I will test this 
> concept tomorrow to see if that is actually what is happening with the 
> hung threads.

For the fetcher to die, all threads would have to end up in this state. 
But this sort of rings a bell - this may be an unintended consequence of 
implementing Crawl-Delay support ...

NUTCH-339 now compiles and is lightly tested. Threads don't block there; 
instead they put fetchlist entries on a time-sorted queue and continue 
working on other items. So, this condition never occurs.
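
The idea is easy to model with java.util.concurrent.DelayQueue - a sketch 
of the queueing approach only, not the actual Fetcher2 code:

    import java.util.concurrent.DelayQueue;
    import java.util.concurrent.Delayed;
    import java.util.concurrent.TimeUnit;

    // A fetchlist entry that becomes available only after the host's
    // crawl delay has expired.
    class DelayedFetchItem implements Delayed {
      final String url;
      final long readyAt;  // absolute time (ms) when the host may be hit

      DelayedFetchItem(String url, long delayMs) {
        this.url = url;
        this.readyAt = System.currentTimeMillis() + delayMs;
      }

      public long getDelay(TimeUnit unit) {
        return unit.convert(readyAt - System.currentTimeMillis(),
                            TimeUnit.MILLISECONDS);
      }

      public int compareTo(Delayed other) {
        long diff = getDelay(TimeUnit.MILLISECONDS)
                  - other.getDelay(TimeUnit.MILLISECONDS);
        return diff < 0 ? -1 : diff > 0 ? 1 : 0;
      }
    }

Worker threads put() throttled entries back on a 
DelayQueue<DelayedFetchItem> and poll() it for ready ones, so no thread 
ever sleeps for the full Crawl-Delay and none can be mistaken for hung.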

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Fetch jumps to 1.0 complete

Posted by Dennis Kubes <nu...@dragonflymc.com>.
Here is an interesting article to support my theory.

http://nikitathespider.com/articles/RobotsTxt.html

It says that 1% of crawl delays are 900 seconds or greater.  That is 
consistent with what I am seeing.

Dennis

Dennis Kubes wrote:
> Just a thought going through the fetcher code.  If the robots.txt 
> specifies a delay >= the task timeout value, the task thread will 
> sleep for that amount of time and eventually be considered a "hung 
> thread" even though it is really just sleeping.  Of course I could be 
> reading the code wrong.  It is about 2am here.  I will test this 
> concept tomorrow to see if that is actually what is happening with the 
> hung threads.
>
> Dennis
>
> Dennis Kubes wrote:
>> I spoke too soon.  It just took longer to hang. Still testing.
>>
>> Dennis
>>
>> Andrzej Bialecki wrote:
>>> Dennis Kubes wrote:
>>>> Well, I eliminated the regular expressions, changed the timeout 
>>>> value on http to 5000, and set the max delays to 5.  Although I still 
>>>> have some tasks running slower and I am getting a few more timeout 
>>>> errors (which is ok for what I am doing), it seems to have moved 
>>>> beyond the point at which it was failing.  As soon as I get this 
>>>> running automatically in production I am going to try to implement 
>>>> the 339 patch.
>>>
>>> Caveat: the patch in NUTCH-339 represents work in progress; it 
>>> doesn't even compile. I'm going to update it shortly to a 
>>> lightly-tested, compilable version.
>>>

Re: Fetch jumps to 1.0 complete

Posted by Dennis Kubes <nu...@dragonflymc.com>.
Just a thought from going through the fetcher code.  If the robots.txt 
specifies a delay >= the task timeout value, the task thread will sleep 
for that amount of time and eventually be considered a "hung thread" 
even though it is really just sleeping.  Of course I could be reading 
the code wrong.  It is about 2am here.  I will test this concept 
tomorrow to see if that is actually what is happening with the hung threads.
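
A tiny model of why a sleeping thread would trip the detector - only the 
general pattern, not the actual Fetcher code; activeThreads, taskTimeout 
and lastActivity are stand-ins:

    import java.util.concurrent.atomic.AtomicLong;

    // The watchdog only sees elapsed time since the last recorded fetch;
    // it cannot tell a long Crawl-Delay sleep from a dead thread.
    AtomicLong lastActivity = new AtomicLong(System.currentTimeMillis());
    int activeThreads = 5;               // stand-in for the fetcher's counter
    long taskTimeout = 10 * 60 * 1000L;  // e.g. a 10-minute task timeout

    // Workers call lastActivity.set(System.currentTimeMillis()) after each
    // fetch - but while honoring a 900s Crawl-Delay they just sleep.

    while (activeThreads > 0) {
      Thread.sleep(1000);
      if (System.currentTimeMillis() - lastActivity.get() > taskTimeout) {
        System.out.println("Aborting with " + activeThreads + " hung threads");
        break;  // the task simply ends, so progress jumps to 100%
      }
    }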

Dennis

Dennis Kubes wrote:
> I spoke too soon.  It just took longer to hang. Still testing.
>
> Dennis
>
> Andrzej Bialecki wrote:
>> Dennis Kubes wrote:
>>> Well, I eliminated the regular expressions, changed the timeout 
>>> value on http to 5000, and set the max delays to 5.  Although I still 
>>> have some tasks running slower and I am getting a few more timeout 
>>> errors (which is ok for what I am doing), it seems to have moved 
>>> beyond the point at which it was failing.  As soon as I get this 
>>> running automatically in production I am going to try to implement 
>>> the 339 patch.
>>
>> Caveat: the patch in NUTCH-339 represents work in progress; it 
>> doesn't even compile. I'm going to update it shortly to a 
>> lightly-tested, compilable version.
>>

Re: Fetch jumps to 1.0 complete

Posted by Dennis Kubes <nu...@dragonflymc.com>.
I spoke too soon.  It just took longer to hang. Still testing.

Dennis

Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>> Well, I eliminated the regular expressions, changed the timeout 
>> value on http to 5000, and set the max delays to 5.  Although I still 
>> have some tasks running slower and I am getting a few more timeout 
>> errors (which is ok for what I am doing), it seems to have moved 
>> beyond the point at which it was failing.  As soon as I get this 
>> running automatically in production I am going to try to implement 
>> the 339 patch.
>
> Caveat: the patch in NUTCH-339 represents work in progress; it 
> doesn't even compile. I'm going to update it shortly to a 
> lightly-tested, compilable version.
>

Re: Fetch jumps to 1.0 complete

Posted by Andrzej Bialecki <ab...@getopt.org>.
Dennis Kubes wrote:
> Well, I eliminated the regular expressions, changed the timeout 
> value on http to 5000, and set the max delays to 5.  Although I still 
> have some tasks running slower and I am getting a few more timeout 
> errors (which is ok for what I am doing), it seems to have moved 
> beyond the point at which it was failing.  As soon as I get this 
> running automatically in production I am going to try to implement 
> the 339 patch.

Caveat: the patch in NUTCH-339 represents work in progress; it doesn't 
even compile. I'm going to update it shortly to a lightly-tested, 
compilable version.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Fetch jumps to 1.0 complete

Posted by Dennis Kubes <nu...@dragonflymc.com>.
Well, I eliminated the regular expressions, changed the timeout value 
on http to 5000, and set the max delays to 5.  Although I still have some 
tasks running slower and I am getting a few more timeout errors (which is 
ok for what I am doing), it seems to have moved beyond the point at which 
it was failing.  As soon as I get this running automatically in 
production I am going to try to implement the 339 patch.
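
For reference, those two knobs in nutch-site.xml - assuming they map to 
the stock http.timeout and http.max.delays properties from 
nutch-default.xml:

    <!-- give up on a connection after 5 seconds -->
    <property>
      <name>http.timeout</name>
      <value>5000</value>
    </property>

    <!-- wait for a busy host at most 5 times before giving up on the page -->
    <property>
      <name>http.max.delays</name>
      <value>5</value>
    </property>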

Dennis

Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>> I moved our "production" system off of the most recent dev branches 
>> and onto the 0.8 release.  I only noticed it recently, although it may 
>> have been happening before and I just didn't notice it.  The one 
>> change I did make that may have made it worse was removing the 
>> crawl-urlfilter regular expressions -[?*!@=] and
>
> This shouldn't be a problem unless your fetcher hangs on regex 
> processing. This is relatively easy to check - pick one of the 
> remaining longer-running tasks and do a couple of thread dumps; you 
> will see what occupies most of the time. If you can hook up a profiler 
> in a fast-sampling mode, then it's even better ...
>
>> -.*(/.+?)/.*?\1/.*?\1/.  Andrzej, didn't you say awhile back, when we 
>> were looking at regular expressions for a different stalling problem, 
>> that you don't use these in your production systems?
>
> True, I don't - I'm using only a combination of prefix/suffix filters. 
> Prefix filters give me the domains of interest, and suffix filters 
> give me (more or less) mime types I'm interested in. Any other border 
> cases I can hardcode in a separate urlfilter, thus avoiding regexes 
> completely.
>

Re: Fetch jumps to 1.0 complete

Posted by Andrzej Bialecki <ab...@getopt.org>.
Dennis Kubes wrote:
> I moved our "production" system off of the most recent dev branches 
> and onto the 0.8 release.  I only noticed it recently, although it may 
> have been happening before and I just didn't notice it.  The one 
> change I did make that may have made it worse was removing the 
> crawl-urlfilter regular expressions -[?*!@=] and

This shouldn't be a problem unless your fetcher hangs on regex 
processing. This is relatively easy to check - pick one of the remaining 
longer-running tasks and do a couple of thread dumps; you will see what 
occupies most of the time. If you can hook up a profiler in a 
fast-sampling mode, then it's even better ...
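
A quick way to check a single rule offline first - a self-contained 
timing harness (my sketch, not part of Nutch) run against a URL with 
repeated path segments:

    import java.util.regex.Pattern;

    // Times one crawl-urlfilter rule; rules with backreferences like \1
    // can backtrack badly on long, near-matching URLs.
    public class RegexTimer {
      public static void main(String[] args) {
        Pattern rule = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
        StringBuilder url = new StringBuilder("http://example.com");
        for (int i = 0; i < 20; i++)
          url.append("/seg").append(i % 3);  // repeating-ish path
        long start = System.nanoTime();
        boolean matched = rule.matcher(url).find();
        System.out.println("matched=" + matched + " in "
            + (System.nanoTime() - start) / 1000000 + " ms");
      }
    }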

> -.*(/.+?)/.*?\1/.*?\1/.  Andrzej, didn't you say awhile back, when we 
> were looking at regular expressions for a different stalling problem, 
> that you don't use these in your production systems?

True, I don't - I'm using only a combination of prefix/suffix filters. 
Prefix filters give me the domains of interest, and suffix filters give 
me (more or less) mime types I'm interested in. Any other border cases I 
can hardcode in a separate urlfilter, thus avoiding regexes completely.
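
For anyone wanting to try that setup: with the urlfilter-prefix and 
urlfilter-suffix plugins enabled, the config files look roughly like this 
(illustrative entries only; check the plugins' default files for the 
exact syntax):

    # prefix-urlfilter.txt: only URLs starting with these prefixes pass
    http://www.example.com/
    http://docs.example.org/

    # suffix-urlfilter.txt: filter URLs by extension (a mime-type proxy)
    .html
    .pdf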

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Fetch jumps to 1.0 complete

Posted by Dennis Kubes <nu...@dragonflymc.com>.
I moved our "production" system off of the most recent dev branches and 
onto the 0.8 release.  I only noticed it recently, although it may have 
been happening before and I just didn't notice it.  The one change I did 
make that may have made it worse was removing the crawl-urlfilter regular 
expressions -[?*!@=] and -.*(/.+?)/.*?\1/.*?\1/.  Andrzej, didn't you say 
awhile back, when we were looking at regular expressions for a different 
stalling problem, that you don't use these in your production systems?

Dennis

Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>> Hi Everybody,
>>
>> I am running some fetches, and some of the tasks are going along fine 
>> and are about 20% complete, and then they will mysteriously jump to 
>> 100% complete.  Each time I get the "Aborting with N hung threads" 
>> message in the logs.  Is anybody else seeing this?  Is there any way 
>> to get around the fetcher just stopping like this?
>
> Please see NUTCH-339. Is this something that you noticed only 
> recently, or was it like that before? There might have been some 
> recent change that made this problem worse (it was always there to 
> some degree) ...
>

Re: Fetch jumps to 1.0 complete

Posted by Andrzej Bialecki <ab...@getopt.org>.
Dennis Kubes wrote:
> Hi Everybody,
>
> I am running some fetches, and some of the tasks are going along fine 
> and are about 20% complete, and then they will mysteriously jump to 
> 100% complete.  Each time I get the "Aborting with N hung threads" 
> message in the logs.  Is anybody else seeing this?  Is there any way 
> to get around the fetcher just stopping like this?

Please see NUTCH-339. Is this something that you noticed only recently, 
or was it like that before? There might have been some recent change 
that made this problem worse (it was always there to some degree) ...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com