Posted to user@nutch.apache.org by brian4 <bq...@gmail.com> on 2014/03/13 08:12:33 UTC

How to have nutch 2 retry 503 errors

Whenever I try to crawl a large enough list, the website will sometimes
return 503 errors for pages:
fetch of ... failed with: Http code=503

These pages are not down and can still be accessed.  However, even if I do
more rounds of crawling, Nutch does not seem to retry fetching these pages.
This results in only being able to crawl a fraction of the total pages,
e.g. 32,000/37,000.

I am using Nutch 2 and protocol-httpclient.

Looking at the code, I see these pages should be marked for retry and their
status changed to "unfetched".  I checked the database and found the status
is indeed changed to unfetched, but they are nevertheless not re-fetched in
subsequent iterations.

What am I missing that keeps these pages from being re-fetched?

I have this loop:

DEPTH=3

for ((a=1; a <= DEPTH ; a++))
do

  echo `date` ": Iteration $a of $DEPTH"

  echo "Generating a new fetchlist"
  $NUTCH_BIN/nutch generate -crawlId "$CRAWL_ID"

  echo `date` ": Fetching : "
  $NUTCH_BIN/nutch fetch -all -crawlId $CRAWL_ID -threads 50

  echo `date` ": Parsing : "
  $NUTCH_BIN/nutch parse -all -crawlId $CRAWL_ID

  echo `date` ": Updating Database"
  $NUTCH_BIN/nutch updatedb -crawlId $CRAWL_ID

done
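
For reference, the stored status can also be checked from the command line
with readdb instead of going to the backing store directly (a sketch; the
URL below is just a placeholder):

  # inspect a single record's status and retry count
  $NUTCH_BIN/nutch readdb -crawlId $CRAWL_ID -url http://example.com/some-503-page

  # or print overall status counts for the crawl
  $NUTCH_BIN/nutch readdb -crawlId $CRAWL_ID -stats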

Re: How to have nutch 2 retry 503 errors

Posted by brian4 <bq...@gmail.com>.
I think it is because I am crawling a single host; eventually the site
throttles the crawler's connections and returns 503 for connection
attempts.

I can work around this by crawling in smaller batches and pausing for a bit
between batches (and probably also by reducing threads or delaying
connections per host, as sketched below), but since I can't guarantee I'll
never encounter a 503 again, I wanted to be sure Nutch handles it correctly
by re-fetching the page in a subsequent round of crawling.
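
The knobs I mean are something like this in nutch-site.xml (an untested
sketch; the property names are from nutch-default.xml, the values are just
examples):

  <!-- seconds to wait between successive requests to the same host -->
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>

  <!-- one connection per host at a time; fetcher.server.delay is only
       honored when this is 1 -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>1</value>
  </property>

  <!-- cap URLs per host in each generated fetchlist (smaller batches) -->
  <property>
    <name>generate.max.count</name>
    <value>2000</value>
  </property>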

Re: How to have nutch 2 retry 503 errors

Posted by Talat Uyarer <ta...@uyarer.com>.
Hi Brian,

What does your outgoing network topology look like? This error may be
caused by your firewall or something similar.

Talat


2014-03-13 9:12 GMT+02:00 brian4 <bq...@gmail.com>:

> Whenever I try to crawl a large enough list, the website will sometimes
> return 503 errors for pages:
> [...]



-- 
Talat UYARER
Websitesi: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304

RE: How to have nutch 2 retry 503 errors

Posted by brian4 <bq...@gmail.com>.
Thanks, which setting is it to change the default from 24 hours?  I can't
seem to find any such property listed in nutch-default.xml.

RE: How to have nutch 2 retry 503 errors

Posted by Markus Jelsma <ma...@openindex.io>.
If a fetch fails due to a transient error, the retry count is increased and
the record is retried 24 hours later, by default. Errors like these happen
all the time: even if a browser can access the page, at that moment Nutch
could not.
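
As for which setting: db.fetch.retry.max (default 3, listed in
nutch-default.xml) caps how many times a record with recoverable errors is
retried before it is marked gone. The 24-hour delay itself, from a quick
look at AbstractFetchSchedule.setPageRetrySchedule, appears to be hardcoded
rather than exposed as a property. A sketch for nutch-site.xml:

  <property>
    <name>db.fetch.retry.max</name>
    <value>5</value>
    <description>Maximum number of times a page that failed with a
    recoverable error is generated for fetch before it is given up on.
    </description>
  </property>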
 
-----Original message-----
> From:brian4 <bq...@gmail.com>
> Sent: Thursday 13th March 2014 8:12
> To: user@nutch.apache.org
> Subject: How to have nutch 2 retry 503 errors
> 
> Whenever I try to crawl a large enough list, the website will sometimes
> return 503 errors for pages:
> [...]

Re: Re: How to have nutch 2 retry 503 errors

Posted by brian4 <bq...@gmail.com>.
Don't worry, I am the owner :)
But that's good etiquette to keep in mind for when I crawl other sites.