Posted to dev@nutch.apache.org by "Soren Scott (JIRA)" <ji...@apache.org> on 2015/07/20 18:31:04 UTC

[jira] [Closed] (NUTCH-2044) Support for an expanded HttpHeaders list

     [ https://issues.apache.org/jira/browse/NUTCH-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Soren Scott closed NUTCH-2044.
------------------------------
    Resolution: Unresolved

This could be fixed by instantiating an empty metadata dictionary rather than enforcing some out-of-date default list. So if anyone's looking, there you go.

I still hold that relying on that default key list limits the kinds of things one can do with the crawl data. From our side, losing any indication of the Accept headers, etc., makes acting on the crawl data harder, and the additional systems that would otherwise be needed seem a bit brute-force and rude.
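
A minimal sketch of the "start from an empty metadata dictionary" idea, for anyone picking this up. It is not Nutch code: HeaderCapture and captureAll are hypothetical names, the input is assumed to be a plain map of header name to values (the shape java.net.HttpURLConnection#getHeaderFields returns), and Accept-Ranges and X-Custom-Thing are only illustrative headers. The point is that every header the server sent is kept, with no check against a predefined key list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not part of Nutch): keeps every response header as-is. */
public class HeaderCapture {

    /**
     * Copies all headers into a fresh, initially empty map instead of
     * filtering them against a fixed list of known names. Keys compare
     * case-insensitively because HTTP header names are case-insensitive.
     */
    public static Map<String, List<String>> captureAll(Map<String, List<String>> rawHeaders) {
        Map<String, List<String>> metadata = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        for (Map.Entry<String, List<String>> entry : rawHeaders.entrySet()) {
            String name = entry.getKey();
            if (name == null) {
                // HttpURLConnection stores the status line under a null key; skip it.
                continue;
            }
            // Merge values when the same header appears under different casings.
            metadata.computeIfAbsent(name, k -> new ArrayList<>()).addAll(entry.getValue());
        }
        return metadata;
    }

    public static void main(String[] args) {
        // Illustrative input only; a real fetcher would pass the headers it received.
        Map<String, List<String>> raw = new LinkedHashMap<>();
        raw.put("Accept-Ranges", Arrays.asList("bytes"));
        raw.put("X-Custom-Thing", Arrays.asList("a"));
        raw.put("x-custom-thing", Arrays.asList("b"));

        Map<String, List<String>> metadata = captureAll(raw);
        // Non-standard keys survive and can be looked up in any casing.
        System.out.println(metadata.get("accept-ranges"));
        System.out.println(metadata.get("X-CUSTOM-THING"));
    }
}
```

The trade-off in this sketch is that nothing normalizes or validates the key names, so downstream consumers have to tolerate non-standard headers.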

> Support for an expanded HttpHeaders list
> ----------------------------------------
>
>                 Key: NUTCH-2044
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2044
>             Project: Nutch
>          Issue Type: Improvement
>          Components: metadata
>            Reporter: Soren Scott
>            Priority: Minor
>
> Is there currently any consideration for either a) expanding the current HttpHeaders list in [HttpHeaders.java|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/metadata/HttpHeaders.java] to include at least the current permanent or provisional headers, or b) revising that handler to iterate over an arbitrary set of key-value pairs for the headers, either as a configurable widget or something along those lines?
> I am mostly interested in the Accept headers to help inform some additional actions on the fetched responses, but even for an accurate assessment of the crawls, the full set of headers returned for a request is important. I know that we frown on non-standard keys but, again, imperfect world :).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)