Posted to user@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2015/01/26 23:38:38 UTC

ProtocolStatus 16 'Exception' for particular domain

Hi Folks,
I'm working on obtaining forum data posted for various topics from across a
number of web sites.
An example would be the technology-related posts from
http://www.hackforums.net.
If I take the above site as an example and attempt to use parsechecker, I
get the following with protocol-http:

./bin/nutch parsechecker -dumpText "http://www.hackforums.net"
fetching: http://www.hackforums.net
http.proxy.host = null
http.proxy.port = 8080
http.timeout = 10000
http.content.limit = -1
http.agent = kilchattan/Nutch-1.10-SNAPSHOT (A targeted crawl of hackforums
technology discussion; myemail@gmail.com)
http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Fetch failed with protocol status: exception(16), lastModified=0: Http
code=403, url=http://www.hackforums.net

Here ProtocolStatus 16 [0] indicates "Unspecified exception occured.
Further information may be provided in args.". Looking further, we see
that the HTTP code is 403, which is generally accepted to mean "...that the
server can be reached and understood the request, but refuses to take any
further action." [1].
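
As a minimal sketch (not part of Nutch), the failure line above can be split
into its protocol status and its HTTP code with a few lines of Python. The
function name parse_fetch_failure and the regular expression are assumptions
modelled only on the single message shown here; other Nutch versions may
word the message differently.

```python
import re

# Pattern modelled on the one failure line quoted above (an assumption,
# not a documented Nutch format). \s+ tolerates the mail-wrapped line break.
FAILURE_RE = re.compile(
    r"protocol status:\s*(?P<name>\w+)\((?P<code>\d+)\)"
    r".*?Http\s+code=(?P<http>\d+)",
    re.DOTALL,
)


def parse_fetch_failure(message):
    """Return the status name, status code and HTTP code, or None if no match."""
    match = FAILURE_RE.search(message)
    if match is None:
        return None
    return {
        "status_name": match.group("name"),
        "status_code": int(match.group("code")),
        "http_code": int(match.group("http")),
    }


message = (
    "Fetch failed with protocol status: exception(16), lastModified=0: Http\n"
    "code=403, url=http://www.hackforums.net"
)
print(parse_fetch_failure(message))
```

Separating the two numbers makes it clearer that 16 is Nutch's own generic
status, while 403 is what the remote server actually answered.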

If I use protocol-httpclient with debug logging enabled, I get some more log
detail:

2015-01-26 13:22:27,094 INFO  httpclient.Http - http.agent =
kilchattan/Nutch-1.10-SNAPSHOT (A targeted crawl of hackforums technology
discussion)
2015-01-26 13:22:27,094 INFO  httpclient.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2015-01-26 13:22:27,094 INFO  httpclient.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2015-01-26 13:22:27,097 DEBUG params.DefaultHttpParams - Set parameter
http.connection.timeout = 10000
2015-01-26 13:22:27,097 DEBUG params.DefaultHttpParams - Set parameter
http.socket.timeout = 10000
2015-01-26 13:22:27,098 DEBUG params.DefaultHttpParams - Set parameter
http.socket.sendbuffer = 8192
2015-01-26 13:22:27,098 DEBUG params.DefaultHttpParams - Set parameter
http.socket.receivebuffer = 8192
2015-01-26 13:22:27,098 DEBUG params.DefaultHttpParams - Set parameter
http.connection-manager.max-total = 10
2015-01-26 13:22:27,099 DEBUG params.DefaultHttpParams - Set parameter
http.connection-manager.max-per-host = {HostConfiguration[]=10}
2015-01-26 13:22:27,099 DEBUG params.DefaultHttpParams - Set parameter
http.connection-manager.timeout = 10000
2015-01-26 13:22:27,101 DEBUG params.DefaultHttpParams - Set parameter
http.default-headers = [User-Agent: nutch/Nutch-1.10-SNAPSHOT (A targeted
crawl of hackforums technology discussion)^M
, Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3^M
, Accept-Charset: utf-8,ISO-8859-1;q=0.7,*;q=0.7^M
, Accept:
text/html,application/xml;q=0.9,application/xhtml+xml,text/xml;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5^M
, Accept-Encoding: x-gzip, gzip, deflate^M
]
2015-01-26 13:22:27,141 DEBUG params.DefaultHttpParams - Set parameter
http.protocol.version = HTTP/1.0
2015-01-26 13:22:27,141 DEBUG params.DefaultHttpParams - Set parameter
http.protocol.unambiguous-statusline = false
2015-01-26 13:22:27,141 DEBUG params.DefaultHttpParams - Set parameter
http.protocol.single-cookie-header = false
2015-01-26 13:22:27,141 DEBUG params.DefaultHttpParams - Set parameter
http.protocol.strict-transfer-encoding = false
2015-01-26 13:22:27,141 DEBUG params.DefaultHttpParams - Set parameter
http.protocol.reject-head-body = false
2015-01-26 13:22:27,141 DEBUG params.DefaultHttpParams - Set parameter
http.protocol.warn-extra-input = false
2015-01-26 13:22:27,141 DEBUG params.DefaultHttpParams - Set parameter
http.protocol.status-line-garbage-limit = 2147483647
2015-01-26 13:22:27,141 DEBUG params.DefaultHttpParams - Set parameter
http.protocol.content-charset = UTF-8
2015-01-26 13:22:27,141 DEBUG params.DefaultHttpParams - Set parameter
http.protocol.cookie-policy = compatibility
2015-01-26 13:22:27,141 DEBUG params.DefaultHttpParams - Set parameter
http.protocol.single-cookie-header = true
2015-01-26 13:22:27,147 DEBUG httpclient.MultiThreadedHttpConnectionManager
- HttpConnectionManager.getConnection:  config = HostConfiguration[host=
http://www.hackforums.net], timeout = 10000
2015-01-26 13:22:27,147 DEBUG httpclient.MultiThreadedHttpConnectionManager
- Allocating new connection, hostConfig=HostConfiguration[host=
http://www.hackforums.net]
2015-01-26 13:22:27,151 DEBUG httpclient.HttpConnection - Open connection
to www.hackforums.net:80
2015-01-26 13:22:27,168 DEBUG wire.header - >> "GET
/forumdisplay.php?fid=107 HTTP/1.0[\r][\n]"
2015-01-26 13:22:27,168 DEBUG httpclient.HttpMethodBase - Adding Host
request header
2015-01-26 13:22:27,180 DEBUG wire.header - >> "User-Agent:
kilchattan/Nutch-1.10-SNAPSHOT (A targeted crawl of hackforums technology
discussion)[\r][\n]"
2015-01-26 13:22:27,180 DEBUG wire.header - >> "Accept-Language:
en-us,en-gb,en;q=0.7,*;q=0.3[\r][\n]"
2015-01-26 13:22:27,180 DEBUG wire.header - >> "Accept-Charset:
utf-8,ISO-8859-1;q=0.7,*;q=0.7[\r][\n]"
2015-01-26 13:22:27,181 DEBUG wire.header - >> "Accept:
text/html,application/xml;q=0.9,application/xhtml+xml,text/xml;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5[\r][\n]"
2015-01-26 13:22:27,181 DEBUG wire.header - >> "Accept-Encoding: x-gzip,
gzip, deflate[\r][\n]"
2015-01-26 13:22:27,181 DEBUG wire.header - >> "Host: www.hackforums.net
[\r][\n]"
2015-01-26 13:22:27,181 DEBUG wire.header - >> "[\r][\n]"
2015-01-26 13:22:27,275 DEBUG wire.header - << "HTTP/1.1 403
Forbidden[\r][\n]"
2015-01-26 13:22:27,275 DEBUG wire.header - << "HTTP/1.1 403
Forbidden[\r][\n]"
2015-01-26 13:22:27,276 DEBUG wire.header - << "Date: Mon, 26 Jan 2015
21:22:27 GMT[\r][\n]"
2015-01-26 13:22:27,276 DEBUG wire.header - << "Content-Type: text/html;
charset=iso-8859-1[\r][\n]"
2015-01-26 13:22:27,276 DEBUG wire.header - << "Connection: close[\r][\n]"
2015-01-26 13:22:27,277 DEBUG wire.header - << "Set-Cookie:
__cfduid=d894e0a2345d2520c9973fc502200a75f1422307347; expires=Tue,
26-Jan-16 21:22:27 GMT; path=/; domain=.hackforums.net; HttpOnly[\r][\n]"
2015-01-26 13:22:27,277 DEBUG wire.header - << "Server:
cloudflare-nginx[\r][\n]"
2015-01-26 13:22:27,277 DEBUG wire.header - << "CF-RAY:
1aefc417ea590d8b-SJC[\r][\n]"
2015-01-26 13:22:27,277 DEBUG wire.header - << "Content-Encoding:
gzip[\r][\n]"
2015-01-26 13:22:27,277 DEBUG wire.header - << "[\r][\n]"
2015-01-26 13:22:27,281 DEBUG cookie.CookieSpec - Unrecognized cookie
attribute: name=HttpOnly, value=null
2015-01-26 13:22:27,281 DEBUG httpclient.HttpMethodBase - Cookie accepted:
"__cfduid=d894e0a2345d2520c9973fc502200a75f1422307347"
2015-01-26 13:22:27,288 DEBUG wire.content - <<
"[0x1f][0x8b][0x8][0x0][0x0][0x0][0x0][0x0][0x0][0x3],[0xcb][0xc1][\n]"
2015-01-26 13:22:27,288 DEBUG wire.content - <<
"[0x2]![0x10][0x6][0xe0]W[0xf9][0x1f]`[0xd3][0xa0]N][0xa3]`o[0xb][0x9d]":[0xe8]()[0xb5][0x8e]L#bO[0x1f]D[0xf7][0xef][0xbb][0xed][0xb7];[0x9c]Y|[0xe]![0x16][0x9c]DX[0xee][0xd8][0xe0][0xca][\r]k~$[0x85][0x8f][0xf0]/[0xa6]g[0xc][0xf0][0x3][0x83][0x9b]`^&[0x1c][0xb9][0x15][0x95]1[0x81][0x5][0xf3]e1[0xbf]A[0xae]@e@
[0x19][0xc4]E[0x1d])[0xda][0x1b]N[0x91]T[0xeb][0xc1][0xda][0xde][0xbb][0xe9]Y?[0xce][0x10][0xaf][0xf6]OLM[0xf5][0xb][0x0][0x0][0xff][0xff][0x3][0x0][0xa0][0xd8][0xd8][0x84][0x87][0x0][0x0][0x0]"
2015-01-26 13:22:27,288 DEBUG httpclient.HttpMethodBase - Should close
connection in response to directive: close
2015-01-26 13:22:27,288 DEBUG httpclient.HttpConnection - Releasing
connection back to connection manager.
2015-01-26 13:22:27,288 DEBUG httpclient.MultiThreadedHttpConnectionManager
- Freeing connection, hostConfig=HostConfiguration[host=
http://www.hackforums.net]
2015-01-26 13:22:27,288 DEBUG util.IdleConnectionHandler - Adding
connection at: 1422307347288
2015-01-26 13:22:27,288 DEBUG httpclient.MultiThreadedHttpConnectionManager
- Notifying no-one, there are no waiting threads
2015-01-26 13:22:27,488 DEBUG util.ObjectCache - No object cache found for
conf=Configuration: core-default.xml, core-site.xml, nutch-default.xml,
nutch-site.xml, instantiating a new object cache
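
The wire.content lines in the log above are the gzip-compressed response
body (note the "Content-Encoding: gzip" header), and that body may carry the
reason for the refusal. The sketch below shows one way to turn the wire-log
notation back into bytes and decompress it. It assumes the notation seen in
this log: [0xNN] for non-printable bytes, [\r] and [\n] for CR and LF, and
every other character standing for itself. The helper names wire_to_bytes
and gunzip_wire are hypothetical. Line wrapping added by a mail client can
alter the bytes, so text copied from a message like this one may not decode
cleanly; working from the original log file is safer.

```python
import re
import zlib

# One token per match: a hex escape, an escaped CR or LF, or a plain character.
# "." does not match real newlines, so line breaks in the pasted log are skipped.
TOKEN_RE = re.compile(r"\[0x([0-9a-fA-F]{1,2})\]|\[\\r\]|\[\\n\]|(.)")


def wire_to_bytes(text):
    """Convert wire-log notation (the text between the quotes) to raw bytes."""
    out = bytearray()
    for match in TOKEN_RE.finditer(text):
        token = match.group(0)
        if match.group(1) is not None:
            out.append(int(match.group(1), 16))
        elif token == "[\\r]":
            out.append(0x0D)
        elif token == "[\\n]":
            out.append(0x0A)
        else:
            out.extend(token.encode("latin-1"))
    return bytes(out)


def gunzip_wire(text):
    """Decompress a gzip body given in wire-log notation."""
    # wbits = 16 + MAX_WBITS selects the gzip container; a decompressobj
    # returns whatever it can decode even if the stream is truncated.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    return decoder.decompress(wire_to_bytes(text))


# The first bytes logged above are the gzip magic number 0x1f 0x8b.
print(wire_to_bytes("[0x1f][0x8b][0x8][0x0]")[:2] == b"\x1f\x8b")
```

To use it, paste the quoted part of each wire.content entry, concatenated in
order, into gunzip_wire and print the result.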

So again I get 403 Forbidden, but no further log information as to why.
I wonder if anyone else is able to obtain content from this particular
website? I also notice that the server returns "Server: cloudflare-nginx"
and therefore wonder whether Cloudflare is the reason for the 403?
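
One way to test that idea outside Nutch is to request the same URL with
different User-Agent strings and compare the answers. The sketch below uses
only the Python standard library. The names forbidden_via_cloudflare and
probe are hypothetical, and the check is only a heuristic: the headers show
that a response passed through Cloudflare, not whether Cloudflare or the
origin server decided to refuse the request.

```python
import urllib.error
import urllib.request


def forbidden_via_cloudflare(status, headers):
    """Heuristic: True for a 403 whose headers show it came via Cloudflare."""
    lowered = {name.lower(): value for name, value in headers.items()}
    server = lowered.get("server", "").lower()
    return status == 403 and ("cloudflare" in server or "cf-ray" in lowered)


def probe(url, user_agent, timeout=10):
    """Fetch url with the given User-Agent; return (status, headers dict)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, dict(response.headers.items())
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the error still carries
        # the status code and the response headers.
        return err.code, dict(err.headers.items())


# Headers copied from the debug log above.
logged = {"Server": "cloudflare-nginx", "CF-RAY": "1aefc417ea590d8b-SJC"}
print(forbidden_via_cloudflare(403, logged))
```

Calling probe(url, agent) once with the crawler's agent string and once with
a browser-like one, then passing each result to forbidden_via_cloudflare,
would show whether the User-Agent alone changes the outcome.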

Thanks for any discussion on the topic.
Lewis


[0]
https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/protocol/ProtocolStatus.java#L53
[1] http://en.wikipedia.org/wiki/HTTP_403

-- 
*Lewis*