Posted to user@nutch.apache.org by Manoharam Reddy <ma...@gmail.com> on 2007/06/11 07:37:33 UTC

Why Nutch is indexing HTTP 302 pages

I find in the search results that lots of HTTP 302 pages have been
indexed. This is decreasing the quality of search results. Is there
any way to disable indexing such pages?

I want only HTTP 200 OK pages to be indexed.

Re: Why Nutch is indexing HTTP 302 pages

Posted by Doğacan Güney <do...@gmail.com>.

On 6/11/07, Manoharam Reddy <ma...@gmail.com> wrote:
> I find in the search results that lots of HTTP 302 pages have been
> indexed. This is decreasing the quality of search results. Is there
> any way to disable indexing such pages?
>
> I want only HTTP 200 OK pages to be indexed.
>

If you run the fetcher and the parser as separate steps, the parser
has no way of knowing which status code a page returned. Most 302
responses carry some HTML body (usually something like "this page will
redirect here"), so the parser assumes it is meaningful HTML and
parses it. The fetcher doesn't have this problem when it does the
parsing itself: it only parses pages that returned 200.

You can fix this by storing the status code in Content's metadata and
then parsing only pages whose status code is 200. Alternatively, Nutch
already stores a page's headers in Content's metadata, so you can
check whether that metadata contains a "Location" header and skip the
page if it does.
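
A rough sketch of that check, in case it helps. This is not Nutch
code: a plain Map stands in for Content's metadata so the example runs
on its own, and the class name RedirectFilter, the method names, and
the "x-http-status" key are all made up for illustration. In a real
setup the same logic would read from Content's metadata before the
parser is invoked.

```java
import java.util.Map;

// Sketch only: a plain Map stands in for the metadata of Nutch's Content.
final class RedirectFilter {
    // Hypothetical key under which the fetcher would store the status code.
    static final String STATUS_KEY = "x-http-status";

    private RedirectFilter() {}

    // Case-insensitive lookup, because HTTP header names are case-insensitive.
    static String header(Map<String, String> metadata, String name) {
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            if (e.getKey().equalsIgnoreCase(name)) {
                return e.getValue();
            }
        }
        return null;
    }

    // Returns true only for pages that look like real 200 responses.
    static boolean shouldParse(Map<String, String> metadata) {
        String status = header(metadata, STATUS_KEY);
        if (status != null) {
            // Preferred check: an explicit status code was recorded.
            return "200".equals(status.trim());
        }
        // Fallback: a Location header suggests a redirect response.
        return header(metadata, "Location") == null;
    }
}
```

The status code is checked first because it is unambiguous. The
Location test is only a fallback heuristic for content that was
fetched without a recorded status code.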

-- 
Doğacan Güney