Posted to user@nutch.apache.org by Tim Fletcher <zi...@gmail.com> on 2011/11/21 17:43:28 UTC

Retrieve HTTP Status code from crawl

Hi All,

I'm trying to get the HTTP status code associated with each crawled page,
but I can't find a way to do this.

I have tried reading the status from CrawlDatum.PARSE_DIR_NAME; however,
this gives me values such as "Status: 67 (linked)".

Also, is it possible to extract data about things like 301/302
redirects? For example, I would like to trace the redirect path from page 1
to page 2 (i.e. all the intermediary pages followed).

Any help on how to get the "raw" HTTP status codes would be
much appreciated.

Regards,
Tim

Re: Retrieve HTTP Status code from crawl

Posted by Markus Jelsma <ma...@openindex.io>.

AFAIK Nutch doesn't store the HTTP code at all. Instead, it encodes the
outcome as a single status byte. You can check the CrawlDatum class for the
status codes and their meaning.
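
As a rough illustration of what those status bytes look like, here is a small
self-contained lookup table. The numeric values and names are what I recall
from the Nutch 1.x CrawlDatum source, so treat them as an assumption and check
them against the CrawlDatum class in your version. The class name
CrawlStatusNames is made up for this sketch and is not part of Nutch.

```java
import java.util.Map;

public class CrawlStatusNames {
    // Assumed values, recalled from Nutch 1.x CrawlDatum; verify against your version.
    private static final Map<Integer, String> NAMES = Map.ofEntries(
        Map.entry(0x01, "db_unfetched"),
        Map.entry(0x02, "db_fetched"),
        Map.entry(0x03, "db_gone"),
        Map.entry(0x04, "db_redir_temp"),
        Map.entry(0x05, "db_redir_perm"),
        Map.entry(0x06, "db_notmodified"),
        Map.entry(0x21, "fetch_success"),
        Map.entry(0x22, "fetch_retry"),
        Map.entry(0x23, "fetch_redir_temp"),
        Map.entry(0x24, "fetch_redir_perm"),
        Map.entry(0x25, "fetch_gone"),
        Map.entry(0x26, "fetch_notmodified"),
        Map.entry(0x41, "signature"),
        Map.entry(0x42, "injected"),
        Map.entry(0x43, "linked")
    );

    // Returns the symbolic name for a status byte, or "unknown" if it is not in the table.
    public static String name(int status) {
        return NAMES.getOrDefault(status, "unknown");
    }

    public static void main(String[] args) {
        // 67 is 0x43, which is why the dump shows "Status: 67 (linked)".
        System.out.println("Status: 67 (" + name(67) + ")");
    }
}
```

Note that none of these values is an HTTP code: a 301 and a 302 only show up
as the "redir_perm" and "redir_temp" variants.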

However, if you must, you can modify the Fetcher to store the ProtocolStatus
value in the CrawlDatum metadata.
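
The idea, sketched without any Nutch classes: keep the code in a per-URL
metadata map under a dedicated key at fetch time, then read it back later.
Everything here is a stand-in. UrlRecord plays the role of CrawlDatum and its
metadata, and the key "http.status.code" is a hypothetical name, not a Nutch
constant. In a real patch you would do the equivalent inside the Fetcher, using
whatever status value the protocol plugin exposes.

```java
import java.util.HashMap;
import java.util.Map;

public class HttpStatusInMetadata {
    // Hypothetical metadata key chosen for this sketch; not defined by Nutch.
    public static final String HTTP_STATUS_KEY = "http.status.code";

    // Stand-in for a per-URL crawl record with free-form metadata.
    public static final class UrlRecord {
        public final String url;
        public final Map<String, String> metadata = new HashMap<>();

        public UrlRecord(String url) {
            this.url = url;
        }
    }

    // Called at fetch time: remember the status code alongside the record.
    public static void recordHttpStatus(UrlRecord record, int httpCode) {
        record.metadata.put(HTTP_STATUS_KEY, Integer.toString(httpCode));
    }

    // Called when reading the crawl data back: -1 means no code was stored.
    public static int readHttpStatus(UrlRecord record) {
        String value = record.metadata.get(HTTP_STATUS_KEY);
        return value == null ? -1 : Integer.parseInt(value);
    }

    public static void main(String[] args) {
        UrlRecord page = new UrlRecord("http://example.com/page1");
        recordHttpStatus(page, 301);
        System.out.println(page.url + " -> " + readHttpStatus(page));
    }
}
```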

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350