Posted to user@nutch.apache.org by J S <ve...@hotmail.com> on 2005/06/14 09:35:38 UTC
fetch returns error 400
Hi,
During my Intranet crawl, Nutch reports an error 400 for the following URL:
050614 075430 fetch of http://planetbp.bp.com/general/aptrix/bani.nsf/Content/XXXXPS%5FMB%5F090605%5CXXXXps%5FManagement+Briefing%5F090605 failed with: org.apache.nutch.protocol.http.HttpError: HTTP Error: 400
If I go to the page in my browser it works fine. However, as you can see
from the headers below, the first GET does return a 400 but then a rewrite
is done to append ?OpenDocument on to the end of the URL, and the next GET
request is successful.
Is there something I can do to get around this?
Thanks for any help.
JS.
Here are the headers:
http://planetbp.bp.com/general/aptrix/bani.nsf/Content/XXXXPS%5FMB%5F090605%5CXXXXps%5FManagement+Briefing%5F090605
GET /general/aptrix/bani.nsf/Content/XXXXPS%5FMB%5F090605%5CXXXXps%5FManagement+Briefing%5F090605 HTTP/1.1
Host: planetbp.bp.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.7.5) Gecko/20041110 Firefox/1.0
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
HTTP/1.x 400 Bad Request
Server: Lotus-Domino
Date: Tue, 14 Jun 2005 07:24:38 GMT
Connection: close
Expires: Tue, 01 Jan 1980 06:00:00 GMT
Content-Type: text/html; charset=US-ASCII
Content-Length: 526
Cache-Control: no-cache
----------------------------------------------------------
http://planetbp.bp.com/general/aptrix/bani.nsf/Content/XXXXPS%5FMB%5F090605%5CXXXXps%5FManagement+Briefing%5F090605?OpenDocument
GET /general/aptrix/bani.nsf/Content/XXXXPS%5FMB%5F090605%5CXXXXps%5FManagement+Briefing%5F090605?OpenDocument HTTP/1.1
Host: planetbp.bp.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.7.5) Gecko/20041110 Firefox/1.0
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
If-Modified-Since: Tue, 14 Jun 2005 07:24:05 GMT
HTTP/1.x 200 OK
Server: Lotus-Domino
Date: Tue, 14 Jun 2005 07:24:39 GMT
Last-Modified: Tue, 14 Jun 2005 07:24:37 GMT
Expires: Tue, 01 Jan 1980 06:00:00 GMT
Content-Type: text/html; charset=US-ASCII
Content-Length: 16061
Cache-Control: no-cache
----------------------------------------------------------
http://planetbp.bp.com/favicon.ico
GET /favicon.ico HTTP/1.1
Host: planetbp.bp.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.7.5) Gecko/20041110 Firefox/1.0
Accept: image/png,*/*;q=0.5
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
HTTP/1.x 404 Not Found
Server: Lotus-Domino
Date: Tue, 14 Jun 2005 07:24:39 GMT
Connection: close
Pragma: no-cache
Cache-Control: no-cache
Expires: Tue, 14 Jun 2005 07:24:39 GMT
Content-Type: text/html
Content-Length: 159
----------------------------------------------------------
http://planetbp.bp.com/favicon.ico
GET /favicon.ico HTTP/1.1
Host: planetbp.bp.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.7.5) Gecko/20041110 Firefox/1.0
Accept: image/png,*/*;q=0.5
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
HTTP/1.x 404 Not Found
Server: Lotus-Domino
Date: Tue, 14 Jun 2005 07:24:39 GMT
Connection: close
Pragma: no-cache
Cache-Control: no-cache
Expires: Tue, 14 Jun 2005 07:24:39 GMT
Content-Type: text/html
Content-Length: 159
----------------------------------------------------------
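In the trace above, the bare document URL gets a 400 and the same URL with ?OpenDocument appended gets a 200. One possible workaround is to apply that rewrite to the URL before it is fetched. The following is only a rough Python sketch of the rewrite itself, not something Nutch provides: the function name add_open_document is made up for illustration, and the rule "path contains .nsf/ and has no query string" is an assumption based on this one URL. Other Domino URLs (views, forms) may need a different command or none at all, and how to hook such a rewrite into the Nutch fetcher is not covered here.

```python
from urllib.parse import urlsplit, urlunsplit


def add_open_document(url: str) -> str:
    """Append ?OpenDocument to a Domino .nsf URL that has no query string.

    Hypothetical helper: the ".nsf/" test is a guess based on the single
    URL in this thread, not a general rule for Lotus Domino servers.
    """
    parts = urlsplit(url)
    # Leave the URL alone if it is not inside a Notes database (.nsf)
    # or if it already carries a query string such as ?OpenDocument.
    if ".nsf/" in parts.path.lower() and not parts.query:
        parts = parts._replace(query="OpenDocument")
    # urlsplit/urlunsplit do not decode or re-encode the path, so the
    # %5F and %5C escapes and the "+" are passed through unchanged.
    return urlunsplit(parts)


if __name__ == "__main__":
    failing = (
        "http://planetbp.bp.com/general/aptrix/bani.nsf/Content/"
        "XXXXPS%5FMB%5F090605%5CXXXXps%5FManagement+Briefing%5F090605"
    )
    print(add_open_document(failing))
```

URLs that already have a query string, and URLs outside an .nsf database such as /favicon.ico, are returned unchanged, so the rewrite can be applied to every URL in the crawl without affecting the ones that already work.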