Posted to dev@nutch.apache.org by "Stefan Groschupf (JIRA)" <ji...@apache.org> on 2005/12/09 22:14:08 UTC

[jira] Updated: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)

     [ http://issues.apache.org/jira/browse/NUTCH-135?page=all ]

Stefan Groschupf updated NUTCH-135:
-----------------------------------

    Attachment: contentProperties_patch.txt

As Doug suggested, here is a patch that uses a TreeMap with String.CASE_INSENSITIVE_ORDER to solve the problem of case-insensitive HTTP headers, or more generally case-insensitive content metadata.
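To illustrate the underlying behavior, here is a minimal sketch, not taken from the attached patch. The class name CaseInsensitiveHeaderDemo and the helper newHeaderMap are hypothetical, and the code uses generics for brevity. It contrasts a plain java.util.Properties lookup with a TreeMap ordered by String.CASE_INSENSITIVE_ORDER:

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical demo class, not part of Nutch or of the attached patch.
class CaseInsensitiveHeaderDemo {

    // Hypothetical helper: a map whose String keys compare without regard to case.
    static Map<String, String> newHeaderMap() {
        return new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    }

    public static void main(String[] args) {
        // Properties is backed by a hash table, so key lookup is case sensitive.
        Properties props = new Properties();
        props.setProperty("content-type", "text/html");
        System.out.println(props.getProperty("Content-Type")); // prints "null"

        // The TreeMap locates keys via the comparator, so case is ignored.
        Map<String, String> headers = newHeaderMap();
        headers.put("content-type", "text/html");
        System.out.println(headers.get("Content-Type")); // prints "text/html"
    }
}
```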
In general I see two different ways to solve the problem. The first is to leave the API as it is and extend the Properties object, overriding its methods so that they use a TreeMap behind the scenes. This solution would also require copying data back and forth between the Properties object and the TreeMap several times, since the Nutch code passes a Properties object to the Content constructor. The other choice is to change the API of the Content object so that it clearly documents that a different object, one that behaves differently from Properties, is used. The downside of this solution is that it requires many small changes across the Nutch code base.
However, I decided on the clean way, the second one, since I don't like code that does things behind the scenes that developers would not expect. So I introduced a tiny ContentProperties object and changed the Content constructor to take a ContentProperties object instead of a java.util.Properties object. The new ContentProperties has an API similar to that of the Properties class but uses case-insensitive keys. I changed all classes that use the Content object to use the new ContentProperties up to object instantiation, and I also extended the Content test case to test that case-insensitive keys are now supported.
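For readers without the attachment at hand, a rough sketch of what such a class could look like follows. This is an assumption-laden illustration, not the code in contentProperties_patch.txt: the class name ContentPropertiesSketch and every method signature here are hypothetical, chosen only to mirror the getProperty/setProperty style of java.util.Properties.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch; the real ContentProperties in the patch may differ.
class ContentPropertiesSketch {

    // Keys are compared case-insensitively by the TreeMap's comparator.
    private final Map<String, String> values =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    // Stores a value; returns the previous value for an equal-ignoring-case key, or null.
    public String setProperty(String key, String value) {
        return values.put(key, value);
    }

    // Returns the value for the key regardless of its case, or null if absent.
    public String getProperty(String key) {
        return values.get(key);
    }

    // Same as above, but falls back to defaultValue when the key is absent.
    public String getProperty(String key, String defaultValue) {
        String value = values.get(key);
        return value != null ? value : defaultValue;
    }

    // Names of all stored properties.
    public Set<String> propertyNames() {
        return values.keySet();
    }

    public int size() {
        return values.size();
    }
}
```

Composition is used here instead of subclassing so that callers only see the Properties-like methods. Writing the same header twice with different capitalization overwrites the single existing entry instead of creating a second one.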
Feel free to suggest constructive improvements, but please let us get this done as soon as possible, since from my point of view this is a critical issue. All test cases pass on my box, but please double-check before committing.

> http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)
> ------------------------------------------------------------------------------------------------
>
>          Key: NUTCH-135
>          URL: http://issues.apache.org/jira/browse/NUTCH-135
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.7.1, 0.7
>     Reporter: Stefan Groschupf
>     Priority: Critical
>      Fix For: 0.8-dev, 0.7.2-dev
>  Attachments: contentProperties_patch.txt
>
> As described in issue NUTCH-133, some web servers return HTTP header names whose case does not match the standard spelling.
> This has many negative side effects. For example, querying the content type from the metadata returns null even when the web server does return a content type, because the key is not in the standard form, e.g. it is lower case. It also affects the PDF parser, which queries the content length, etc.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira