Posted to user@nutch.apache.org by Tuğcem Oral <tu...@gmail.com> on 2014/11/11 14:17:58 UTC
Nutch 1.6 find original url or redirected ones
Hi all,
I wonder how I can find the original URL after a fetch hits a redirect.
The original URLs are all in my seed list, but I cannot tell which URL was
redirected to which. In the Fetcher phase I expected to read the original from
Nutch.WRITABLE_REPR_URL_KEY, but it is overwritten with the redirected URL.
Any suggestions on how to read this mapping from the crawldb, segments, or linkdb?
PS: I only crawl the first-level pages (depth: 1) of the seed list.
Best,
Tugcem.
--
TO