Posted to user@nutch.apache.org by Chip Calhoun <cc...@aip.org> on 2018/04/17 14:45:01 UTC

Nutch fetching times out at 3 hours, not sure why.

I'm crawling a list of roughly 2600 URLs, all on my local server, but the fetcher only gets through around 1000 of them. It quits after exactly 3 hours (give or take a few milliseconds) with this message in the log:

2018-04-13 15:50:48,885 INFO  fetcher.FetchItemQueues - * queue: https://history.aip.org >> dropping!

I've seen that 3 hours is the default in some Nutch installations, but I've got my fetcher.timelimit.mins set to -1. I'm sure I'm missing something obvious. Any thoughts would be greatly appreciated. Thank you.

Chip Calhoun
Digital Archivist
Niels Bohr Library & Archives
American Institute of Physics
One Physics Ellipse
College Park, MD  20740-3840  USA
Tel: +1 301-209-3180
Email: ccalhoun@aip.org
https://www.aip.org/history-programs/niels-bohr-library

Re: Nutch fetching times out at 3 hours, not sure why.

Posted by Chip Calhoun <cc...@aip.org>.

Hi Sebastian,


Yes, that explains it! Now I wish I'd pasted my crawl command in the first place. I'll leave it alone for now, but if it becomes an issue again I know where to check. Thank you.


Chip

________________________________
From: Sebastian Nagel <wa...@googlemail.com>
Sent: Monday, April 30, 2018 4:53:20 PM
To: user@nutch.apache.org
Subject: Re: Nutch fetching times out at 3 hours, not sure why.

Hi Chip,

got it, you probably run bin/crawl which has the option:
  --time-limit-fetch <time_limit_fetch> Number of minutes allocated to the fetching [default: 180]

It's good to have a time limit, in case a single server responds too slowly.

Best,
Sebastian

On 04/30/2018 09:04 PM, Chip Calhoun wrote:
> Hi Sebastian,
>
> Thank you! Increasing my fetcher.threads.per.queue both fixed my crawl and saved me a lot of time.
>
> I'm still bewildered by the original problem, though. Both my fetcher.timelimit.mins and my fetcher.max.exceptions.per.queue are set to -1. I'll ignore it unless it causes a problem for my other cores.
>
> Chip
>
> -----Original Message-----
> From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com]
> Sent: Monday, April 30, 2018 12:21 PM
> To: user@nutch.apache.org
> Subject: Re: Nutch fetching times out at 3 hours, not sure why.
>
> Hi,
>
> if you still see the log message
>
>    fetcher.FetchItemQueues - * queue: https://history.aip.org >> dropping!
>
> then it can be only
>  - fetcher.timelimit.mins
>  - fetcher.max.exceptions.per.queue
>
>> I crawl a list of roughly 2600 URLs all on my local server
>
> If this is the case you can crawl more aggressively, see
>   fetcher.server.delay
> or even fetch in parallel from your host, see
>   fetcher.threads.per.queue
>
> Best,
> Sebastian
>
> On 04/30/2018 04:44 PM, Chip Calhoun wrote:
>> I'm still experimenting with this. I had been crawling with a depth of 1 because I don't need anything outside my URLs list, but I tried with a depth of 10. It went through a crawl loop that ended after 3 hours, then a second 3 hour crawl loop, then a third shorter loop. It still stopped 5 URLs short of crawling every URL in my list, though it crawled a few I hadn't included.
>>
>> Are these 3 hour loops standard for large crawls?
>>
>> -----Original Message-----
>> From: Chip Calhoun [mailto:ccalhoun@aip.org]
>> Sent: Tuesday, April 17, 2018 3:27 PM
>> To: user@nutch.apache.org
>> Subject: RE: Nutch fetching times out at 3 hours, not sure why.
>>
>> I'm on 1.12, and mine also defaulted at -1. It does not fail at the same URL, or even at the same point in a URL's fetcher loop; it really seems to be time based.
>>
>> -----Original Message-----
>> From: Sadiki Latty [mailto:slatty@uottawa.ca]
>> Sent: Tuesday, April 17, 2018 1:43 PM
>> To: user@nutch.apache.org
>> Subject: RE: Nutch fetching times out at 3 hours, not sure why.
>>
>> Which version are you running? That value is defaulted to -1 in my current version (1.14)  so shouldn't be something you should have needed to change. My crawls, by default, go for as much as even 12 hours with little to no tweaking necessary from the nutch-default. Something else is causing it. Is it always the same URL that it fails at?
>>
>> -----Original Message-----
>> From: Chip Calhoun [mailto:ccalhoun@aip.org]
>> Sent: April-17-18 10:45 AM
>> To: user@nutch.apache.org
>> Subject: Nutch fetching times out at 3 hours, not sure why.
>>
>> I crawl a list of roughly 2600 URLs all on my local server, and I'm only crawling around 1000 of them. The fetcher quits after exactly 3 hours (give or take a few milliseconds) with this message in the log:
>>
>> 2018-04-13 15:50:48,885 INFO  fetcher.FetchItemQueues - * queue: https://history.aip.org >> dropping!
>>
>> I've seen that 3 hours is the default in some Nutch installations, but I've got my fetcher.timelimit.mins set to -1. I'm sure I'm missing something obvious. Any thoughts would be greatly appreciated. Thank you.
>>
>> Chip Calhoun
>> Digital Archivist
>> Niels Bohr Library & Archives
>> American Institute of Physics
>> One Physics Ellipse
>> College Park, MD  20740-3840  USA
>> Tel: +1 301-209-3180
>> Email: ccalhoun@aip.org
>> https://www.aip.org/history-programs/niels-bohr-library
>>
>

Re: Nutch fetching times out at 3 hours, not sure why.

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Chip,

Got it: you are probably running bin/crawl, which has the option:
  --time-limit-fetch <time_limit_fetch> Number of minutes allocated to the fetching [default: 180]

It's good to have a time limit, in case a single server responds too slowly.
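
A minimal sketch of why a -1 in the configuration file might never take effect. It assumes (this is not confirmed anywhere in the thread) that bin/crawl hands its fetch time limit to the fetcher as an override of fetcher.timelimit.mins. The function and dictionary names are hypothetical and only model the precedence; this is not Nutch code.

```python
def effective_setting(name, site_config, overrides):
    """Return the value a job would see: an override wins over the config file."""
    if name in overrides:
        return overrides[name]
    return site_config[name]


# Value set by hand in the site configuration (-1 means "no time limit").
site_config = {"fetcher.timelimit.mins": -1}

# Assumption: the crawl script injects its own default of 180 minutes.
script_overrides = {"fetcher.timelimit.mins": 180}

limit = effective_setting("fetcher.timelimit.mins", site_config, script_overrides)
print(limit)  # 180
```

Under that assumption, the 180 from the script is what the fetcher sees, regardless of the value in the configuration file.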

Best,
Sebastian

RE: Nutch fetching times out at 3 hours, not sure why.

Posted by Chip Calhoun <cc...@aip.org>.

Hi Sebastian,

Thank you! Increasing my fetcher.threads.per.queue both fixed my crawl and saved me a lot of time.

I'm still bewildered by the original problem, though. Both my fetcher.timelimit.mins and my fetcher.max.exceptions.per.queue are set to -1. I'll ignore it unless it causes a problem for my other cores.

Chip

Re: Nutch fetching times out at 3 hours, not sure why.

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi,

If you still see the log message

   fetcher.FetchItemQueues - * queue: https://history.aip.org >> dropping!

then the cause can only be one of
 - fetcher.timelimit.mins
 - fetcher.max.exceptions.per.queue
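
As a rough illustration of how those two settings could lead to a queue being dropped, here is a simplified model. It is a sketch only: the function name and the exact comparisons are assumptions, and the real fetcher logic may differ in detail. The one thing taken from the thread is that -1 is meant to disable a limit.

```python
import time


def should_drop_queue(now_ms, deadline_ms, exceptions, max_exceptions):
    """Simplified model: -1 disables a check; either limit alone triggers a drop."""
    timed_out = deadline_ms != -1 and now_ms >= deadline_ms
    too_many_errors = max_exceptions != -1 and exceptions >= max_exceptions
    return timed_out or too_many_errors


start_ms = int(time.time() * 1000)
deadline_ms = start_ms + 180 * 60 * 1000  # hypothetical 180-minute limit

# Just before the deadline, with exception checking disabled: keep fetching.
print(should_drop_queue(deadline_ms - 1, deadline_ms, 0, -1))  # False
# At the deadline: the remaining URLs of the queue would be dropped.
print(should_drop_queue(deadline_ms, deadline_ms, 0, -1))      # True
# Both checks disabled: the queue is never dropped.
print(should_drop_queue(deadline_ms, -1, 50, -1))              # False
```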

> I crawl a list of roughly 2600 URLs all on my local server

If this is the case you can crawl more aggressively, see
  fetcher.server.delay
or even fetch in parallel from your host, see
  fetcher.threads.per.queue
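
To get a feel for how much these two settings matter when every URL sits on one host, a back-of-the-envelope estimate may help. This is a rough model, not how Nutch schedules requests internally: the 5-second delay and the 1-second response time are hypothetical values, and the model simply assumes that threads divide the work evenly.

```python
def estimated_minutes(num_urls, delay_s, response_s, threads=1):
    """Rough fetch time for one host: each URL costs delay + response time."""
    seconds_per_url = delay_s + response_s
    return num_urls * seconds_per_url / threads / 60


# 2600 URLs, hypothetical 5 s politeness delay, hypothetical 1 s response time.
print(estimated_minutes(2600, 5.0, 1.0))             # 260.0
# Same crawl with a much shorter delay.
print(estimated_minutes(2600, 0.5, 1.0))             # 65.0
# Same crawl with the original delay but four parallel threads on the host.
print(estimated_minutes(2600, 5.0, 1.0, threads=4))  # 65.0
```

With numbers in that range, a single-threaded crawl of one host does not fit into a 3-hour window, while a shorter delay or a few parallel threads brings it well under.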

Best,
Sebastian

RE: Nutch fetching times out at 3 hours, not sure why.

Posted by Chip Calhoun <cc...@aip.org>.

I'm still experimenting with this. I had been crawling with a depth of 1 because I don't need anything outside my URLs list, but I tried with a depth of 10. It went through a crawl loop that ended after 3 hours, then a second 3-hour crawl loop, then a third, shorter loop. It still stopped 5 URLs short of crawling every URL in my list, though it crawled a few I hadn't included.

Are these 3 hour loops standard for large crawls?
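
The pattern of two full loops followed by a shorter one is roughly what the numbers reported in this thread would predict if each loop is cut off at 3 hours. A quick check, using only the approximate figures from the posts (about 2600 URLs, about 1000 fetched per 3-hour loop) and assuming a constant fetch rate:

```python
import math

total_urls = 2600      # "roughly 2600 URLs" from the original post
urls_per_round = 1000  # "around 1000" fetched before the 3-hour cutoff
limit_minutes = 180    # length of one full loop

rounds = math.ceil(total_urls / urls_per_round)
minutes_per_url = limit_minutes / urls_per_round
leftover = total_urls - (rounds - 1) * urls_per_round
last_round_minutes = leftover * minutes_per_url

print(rounds, leftover, round(last_round_minutes))  # 3 600 108
```

So three loops would be expected, with the last one taking a bit under two hours.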

RE: Nutch fetching times out at 3 hours, not sure why.

Posted by Chip Calhoun <cc...@aip.org>.

I'm on 1.12, and mine also defaulted to -1. It does not fail at the same URL, or even at the same point in a URL's fetcher loop; it really seems to be time-based.

RE: Nutch fetching times out at 3 hours, not sure why.

Posted by Sadiki Latty <sl...@uottawa.ca>.

Which version are you running? That value defaults to -1 in my current version (1.14), so you shouldn't have needed to change it. By default, my crawls run for as long as 12 hours with little to no tweaking of nutch-default. Something else is causing it. Does it always fail at the same URL?
