Posted to dev@nutch.apache.org by Vishal Shah <vi...@rediff.co.in> on 2007/06/21 12:06:27 UTC

http.content.limit not respected when the Content-Type header has charset attributes

 Hi,
 
  Many of the URLs we crawl return response headers that look like this:
 
Connection: close
Date: Thu, 21 Jun 2007 09:28:42 GMT
Accept-Ranges: bytes
ETag: "2c0c3-650-cc1eb800"
Server: Apache/2.0.40 (Red Hat Linux)
Content-Length: 1616
Content-Type: text/html; charset=ISO-8859-1
Last-Modified: Mon, 09 Apr 2007 13:13:04 GMT
Client-Date: Thu, 21 Jun 2007 07:42:10 GMT
Client-Peer: 202.141.129.22:80
Client-Response-Num: 1
 
With these headers, the ctype variable in HttpResponse.java is set to
"text/html; charset=ISO-8859-1" (in both protocol-http and
protocol-httpclient), so the mimeType lookup in that class fails. This is
the piece of code I am talking about:
 
      /*
       * Extract the content type from the response and then look for its
       * mimetype preferences specified in mime-type.xml
       */
      String ctype = headers.get(Response.CONTENT_TYPE);
      int downloadSize = 0;
      if (ctype != null
          && (mimeType = http.getMimeTypes().forName(ctype)) != null) {
 
Here ctype should really be just "text/html". Because it is currently
"text/html; charset=ISO-8859-1", the lookup finds no match and the mimeType
variable comes out null. As a result, neither the content limit specified
in mimetypes.xml nor the http.content.limit setting is respected for these
documents.
 
One solution is to split ctype on ";" and use only the first part to look
up the mimeType. Does anyone have any other ideas?
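
To make the proposed split concrete, here is a minimal sketch of it. The
class ContentTypeUtil and the method baseMimeType are hypothetical names used
only for illustration; they are not part of Nutch. The sketch also trims
whitespace and lower-cases the result, which is an assumption on my part
about what the lookup would want, not something the existing code does.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch: strips parameters such as
// "; charset=ISO-8859-1" from a Content-Type header value.
class ContentTypeUtil {

    /**
     * Returns the media type portion of a Content-Type header value,
     * or null if the input is null or contains no media type.
     */
    static String baseMimeType(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Everything before the first ';' is the media type itself.
        int semicolon = contentType.indexOf(';');
        String base = (semicolon >= 0)
                ? contentType.substring(0, semicolon)
                : contentType;
        // Assumption: normalize whitespace and case before the lookup.
        base = base.trim().toLowerCase(Locale.ROOT);
        return base.isEmpty() ? null : base;
    }

    public static void main(String[] args) {
        // Prints "text/html" for both inputs.
        System.out.println(baseMimeType("text/html; charset=ISO-8859-1"));
        System.out.println(baseMimeType("text/html"));
    }
}
```

The idea would be to pass the result of such a helper, rather than the raw
header value, to the forName(ctype) call quoted above, while leaving the
original header untouched so the charset is still available to the parsers.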
 
-vishal.