Posted to user@nutch.apache.org by J S <ve...@hotmail.com> on 2005/06/14 09:35:38 UTC

fetch returns error 400

Hi,

During my Intranet crawl, Nutch reports an error 400 for the following URL:

050614 075430 fetch of http://planetbp.bp.com/general/aptrix/bani.nsf/Content/XXXXPS%5FMB%5F090605%5CXXXXps%5FManagement+Briefing%5F090605 failed with: org.apache.nutch.protocol.http.HttpError: HTTP Error: 400

If I go to the page in my browser it works fine. However, as you can see 
from the headers below, the first GET does return a 400, but the URL is then 
rewritten to append ?OpenDocument to the end, and the next GET request is 
successful.

Is there something I can do to get round this?

Thanks for any help.

JS.

Here are the headers:

http://planetbp.bp.com/general/aptrix/bani.nsf/Content/XXXXPS%5FMB%5F090605%5CXXXXps%5FManagement+Briefing%5F090605

GET /general/aptrix/bani.nsf/Content/XXXXPS%5FMB%5F090605%5CXXXXps%5FManagement+Briefing%5F090605 HTTP/1.1
Host: planetbp.bp.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.7.5) Gecko/20041110 Firefox/1.0
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

HTTP/1.x 400 Bad Request
Server: Lotus-Domino
Date: Tue, 14 Jun 2005 07:24:38 GMT
Connection: close
Expires: Tue, 01 Jan 1980 06:00:00 GMT
Content-Type: text/html; charset=US-ASCII
Content-Length: 526
Cache-Control: no-cache
----------------------------------------------------------
http://planetbp.bp.com/general/aptrix/bani.nsf/Content/XXXXPS%5FMB%5F090605%5CXXXXps%5FManagement+Briefing%5F090605?OpenDocument

GET /general/aptrix/bani.nsf/Content/XXXXPS%5FMB%5F090605%5CXXXXps%5FManagement+Briefing%5F090605?OpenDocument HTTP/1.1
Host: planetbp.bp.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.7.5) Gecko/20041110 Firefox/1.0
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
If-Modified-Since: Tue, 14 Jun 2005 07:24:05 GMT

HTTP/1.x 200 OK
Server: Lotus-Domino
Date: Tue, 14 Jun 2005 07:24:39 GMT
Last-Modified: Tue, 14 Jun 2005 07:24:37 GMT
Expires: Tue, 01 Jan 1980 06:00:00 GMT
Content-Type: text/html; charset=US-ASCII
Content-Length: 16061
Cache-Control: no-cache
----------------------------------------------------------
http://planetbp.bp.com/favicon.ico

GET /favicon.ico HTTP/1.1
Host: planetbp.bp.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.7.5) Gecko/20041110 Firefox/1.0
Accept: image/png,*/*;q=0.5
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

HTTP/1.x 404 Not Found
Server: Lotus-Domino
Date: Tue, 14 Jun 2005 07:24:39 GMT
Connection: close
Pragma: no-cache
Cache-Control: no-cache
Expires: Tue, 14 Jun 2005 07:24:39 GMT
Content-Type: text/html
Content-Length: 159
----------------------------------------------------------
http://planetbp.bp.com/favicon.ico

GET /favicon.ico HTTP/1.1
Host: planetbp.bp.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.7.5) Gecko/20041110 Firefox/1.0
Accept: image/png,*/*;q=0.5
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

HTTP/1.x 404 Not Found
Server: Lotus-Domino
Date: Tue, 14 Jun 2005 07:24:39 GMT
Connection: close
Pragma: no-cache
Cache-Control: no-cache
Expires: Tue, 14 Jun 2005 07:24:39 GMT
Content-Type: text/html
Content-Length: 159
----------------------------------------------------------
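
The rewrite visible in the trace above (the bare URL gets a 400, the same URL with ?OpenDocument gets a 200) can be sketched as a small URL-rewriting rule. This is only an illustration in Python, not Nutch code: the function name add_open_document is hypothetical, and the rule assumes that every query-less URL under a .nsf/ path addresses a document, which may not hold for views or other Domino resources. Hooking such a rule into the crawler's URL handling is not shown.

```python
from urllib.parse import urlsplit, urlunsplit


def add_open_document(url: str) -> str:
    """Append ?OpenDocument to a Domino-style URL that has no query string.

    Hypothetical helper for illustration only. Assumption: any URL whose
    path goes through a .nsf database and carries no query is a document.
    """
    parts = urlsplit(url)
    # Leave URLs alone if they are not inside a .nsf database
    # or if they already carry a query string.
    if ".nsf/" not in parts.path.lower() or parts.query:
        return url
    # urlsplit/urlunsplit do not decode or re-encode the path,
    # so the %5F and %5C escapes are preserved as-is.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, "OpenDocument", parts.fragment)
    )


if __name__ == "__main__":
    failing = (
        "http://planetbp.bp.com/general/aptrix/bani.nsf/Content/"
        "XXXXPS%5FMB%5F090605%5CXXXXps%5FManagement+Briefing%5F090605"
    )
    print(add_open_document(failing))
```

Applied to the failing URL from the log, this yields the second URL in the trace, the one that returned 200 OK. URLs outside a .nsf path, such as /favicon.ico, are returned unchanged.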