Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2006/07/24 17:08:14 UTC

[jira] Updated: (NUTCH-167) Observation of <META NAME="ROBOTS" CONTENT="NOARCHIVE"> directive

     [ http://issues.apache.org/jira/browse/NUTCH-167?page=all ]

Andrzej Bialecki  updated NUTCH-167:
------------------------------------

    Attachment: patch.txt

This patch implements support for Pragma: no-cache and Robots: noarchive.

Three "cache policies" are supported in this patch:

* CACHE_FORBIDDEN_CONTENT: for pages that specify "noarchive", only summaries will be shown; cached content won't be displayed.

* CACHE_FORBIDDEN_ALL: for pages that specify "noarchive", neither summaries nor cached content will be shown, although the pages will still appear in the list of matching results.

* CACHE_FORBIDDEN_NONE: even for pages that specify "noarchive", Nutch will ignore the directive and show both summaries and cached content. This is the current (broken?) behavior.
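
To make the three policies concrete, here is a minimal sketch in Java of the decision they describe. It is not the code in patch.txt: the class name CachePolicySketch, the enum, and the helper methods isNoArchive, showSummary and showCachedContent are all hypothetical names chosen for illustration; only the three policy names come from the description above.

```java
import java.util.Locale;

// Illustrative only: NOT the code from patch.txt. All names except the
// three policy constants are assumptions made for this sketch.
public class CachePolicySketch {

    /** The three policies named in the patch description. */
    enum CachePolicy { CACHE_FORBIDDEN_NONE, CACHE_FORBIDDEN_CONTENT, CACHE_FORBIDDEN_ALL }

    /** True if a robots META content value contains the "noarchive" token (case-insensitive). */
    static boolean isNoArchive(String robotsContent) {
        if (robotsContent == null) {
            return false;
        }
        for (String token : robotsContent.toLowerCase(Locale.ROOT).split(",")) {
            if (token.trim().equals("noarchive")) {
                return true;
            }
        }
        return false;
    }

    /** Summaries are hidden only for "noarchive" pages under CACHE_FORBIDDEN_ALL. */
    static boolean showSummary(CachePolicy policy, boolean noArchive) {
        return !(noArchive && policy == CachePolicy.CACHE_FORBIDDEN_ALL);
    }

    /** Cached content of a "noarchive" page is shown only under CACHE_FORBIDDEN_NONE. */
    static boolean showCachedContent(CachePolicy policy, boolean noArchive) {
        return !noArchive || policy == CachePolicy.CACHE_FORBIDDEN_NONE;
    }

    public static void main(String[] args) {
        boolean noArchive = isNoArchive("NOINDEX, NOARCHIVE");
        for (CachePolicy policy : CachePolicy.values()) {
            System.out.println(policy
                    + ": summary=" + showSummary(policy, noArchive)
                    + ", cached=" + showCachedContent(policy, noArchive));
        }
    }
}
```

In all three cases the page itself stays in the list of matching results; the policy only controls what is displayed alongside the hit.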

Since this patch is important for legal reasons, I'd like to commit it soon, before the 0.8 release.

> Observation of <META NAME="ROBOTS" CONTENT="NOARCHIVE"> directive
> -----------------------------------------------------------------
>
>                 Key: NUTCH-167
>                 URL: http://issues.apache.org/jira/browse/NUTCH-167
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, web gui
>    Affects Versions: 0.7.1
>            Reporter: Ed Whittaker
>            Priority: Critical
>         Attachments: patch.txt
>
>
> Though not strictly a bug, this issue is potentially serious for users of Nutch who deploy live systems and who might be threatened with legal action for caching copies of copyrighted material. The major search engines all observe this directive (even though apparently it's not standard), so there's every reason why Nutch should too.
