Posted to user@nutch.apache.org by Dennis Kubes <ku...@apache.org> on 2010/06/01 15:33:56 UTC

Re: Storing website urls instead of complete urls in index

Hi Shreemoyee,

Each page is stored in the various Nutch files under a key, and that key
is the URL, so stripping the URL down to just its domain wouldn't work
unless you only had a single page per domain.  All Nutch programs,
including the generator and fetcher, work off of the URL as the key.
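
To see why, here is a minimal sketch in plain Java, not Nutch code. A HashMap stands in for a URL-keyed store, and the class name KeyCollision, the method distinctKeys, and the sample URLs are made up for illustration. Keyed by full URL the two pages stay separate; keyed by host they collapse into a single entry.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class KeyCollision {
    // Counts distinct keys when each URL is stored under its full form
    // or, if hostOnly is true, under just its host.
    public static int distinctKeys(String[] urls, boolean hostOnly) {
        Map<String, String> store = new HashMap<>();
        for (String url : urls) {
            String key = hostOnly ? URI.create(url).getHost() : url;
            // A later page with the same key overwrites the earlier one.
            store.put(key, url);
        }
        return store.size();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/foo/bar/page.html",
            "http://example.com/other.html"
        };
        System.out.println("full URL keys:  " + distinctKeys(urls, false));
        System.out.println("host-only keys: " + distinctKeys(urls, true));
    }
}
```

With host-only keys the second page replaces the first, which is the "single page per domain" limitation described above.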

You can extract the host from the key using something like this:

// URLUtil is in org.apache.nutch.util
String host = URLUtil.getHost(key.toString());

If you want to store it during the fetch/parse, I would suggest
putting it in the crawl or parse metadata.  To do that, though, you
may have to modify the Fetcher job.
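
As a rough sketch of that idea in plain Java: the full URL stays the key and the host rides along as a metadata entry. Nothing here is Nutch API. The nested Map is only a stand-in for the crawl or parse metadata, and the class name HostMetadata, the method names, and the "host" field name are hypothetical. hostOf only approximates URLUtil.getHost, using java.net.URI.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class HostMetadata {
    // Hypothetical metadata field name, not something Nutch defines.
    public static final String HOST_KEY = "host";

    // Returns the lowercased host of a URL, or null if the URL
    // cannot be parsed or has no host component.
    public static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // Stores an entry keyed by the full URL and records the host
    // in that entry's metadata map.
    public static void store(Map<String, Map<String, String>> db, String url) {
        Map<String, String> meta = db.computeIfAbsent(url, k -> new HashMap<>());
        String host = hostOf(url);
        if (host != null) {
            meta.put(HOST_KEY, host);
        }
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> db = new HashMap<>();
        store(db, "http://example.com/foo/bar/page.html");
        store(db, "http://example.com/other.html");
        // Both pages keep their own key; each carries the host as metadata.
        db.forEach((url, meta) -> System.out.println(url + " -> " + meta.get(HOST_KEY)));
    }
}
```

Because the key is untouched, anything that works off the URL keeps working, and the host is available wherever the metadata is read.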

Dennis

On 05/30/2010 02:22 AM, Shreemoyee Sarkar wrote:
> Hi,
>
> Is it possible to store the website url instead of complete url without
> affecting the crawl?
>
> e.g. store http://example.com instead of
> http://example.com/foo/bar/page.html
>
> would the generate URLs and fetch for the subsequent depths go smoothly?
>
> Thanks
> Shreemoyee
>
>