Posted to user@nutch.apache.org by Martin Aesch <ma...@googlemail.com> on 2014/05/24 14:15:53 UTC

Nutch fetch local files with arbitrary mapped URLs

Hi all,

I have a bunch of HTML files sitting in my file system. I know the http:// URL of each HTML file.

If I just fetch from my file system, I will get file:// URLs, but I would like to map them to their http:// addresses or to any arbitrary address.

Is there any reasonably non-hackish way of doing that?

Thanks,
Martin


Re: Nutch fetch local files with arbitrary mapped URLs

Posted by Martin Aesch <ma...@googlemail.com>.
Thanks, Bayu.

I have crawl data from different sources (Wikipedia, Common Crawl,
etc.) in different formats. For Common Crawl, manipulating DNS seems
problematic: I also do not have all URLs of a domain, so it would be
too complex and way too hackish. I am looking for a clean way to inject
the URLs into the fetcher, i.e. to map a subset of them.

Currently, I am thinking of having a Cassandra database with key-value
url-content pairs and intervening directly where FetcherThread
actually fetches: if I have the page in my database (or wherever), take
it from there, otherwise fetch it over HTTP. But then the politeness
rules would have to be bypassed, which looks somewhat difficult, and I
would have to change Nutch core functionality.
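
To make the idea concrete, here is a minimal sketch of the lookup-then-fallback logic in plain Java. It is not Nutch code and does not use any Nutch API: the class StoreFirstFetcher, its Result type and the in-memory map (a stand-in for the Cassandra table) are all hypothetical names chosen for illustration. The fromStore flag is the piece a caller could use to skip politeness delays for pages that never touched the network.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch: answer from a local url -> content store, else fall back to HTTP. */
class StoreFirstFetcher {
    // Stand-in for the Cassandra key-value table (assumption: URL string as key).
    private final Map<String, byte[]> store = new ConcurrentHashMap<>();
    // Stand-in for the regular HTTP fetch path; injected so it can be swapped out.
    private final Function<String, byte[]> httpFetch;

    StoreFirstFetcher(Function<String, byte[]> httpFetch) {
        this.httpFetch = httpFetch;
    }

    /** Registers locally available content under the URL it should appear as. */
    void put(String url, byte[] content) {
        store.put(url, content);
    }

    /** Content plus a flag telling the caller whether the network was used. */
    static final class Result {
        final byte[] content;
        final boolean fromStore;

        Result(byte[] content, boolean fromStore) {
            this.content = content;
            this.fromStore = fromStore;
        }
    }

    Result fetch(String url) {
        byte[] local = store.get(url);
        if (local != null) {
            // Served locally: no remote host was contacted.
            return new Result(local, true);
        }
        // Not in the store: use the normal fetch path.
        return new Result(httpFetch.apply(url), false);
    }
}
```

Where exactly such a check would hook into FetcherThread is left open here; that part still means touching the core.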

Best regards,
Martin





-----Original Message-----
From: Bayu Widyasanyata <bw...@gmail.com>
Reply-to: user@nutch.apache.org
To: user@nutch.apache.org
Subject: Re: Nutch fetch local files with arbitrary mapped URLs
Date: Sun, 25 May 2014 21:45:48 +0700

Hi Martin,

Just put the files inside a web server's "docroot" and serve them as
ordinary web pages.

If their URLs are fixed, you can create a local hostname that is resolved
by a local DNS (not by the Internet DNS).

Hope it helps.
---
wassalam,
[bayu]

/sent from Android phone/


Re: Nutch fetch local files with arbitrary mapped URLs

Posted by Bayu Widyasanyata <bw...@gmail.com>.
Hi Martin,

Just put the files inside a web server's "docroot" and serve them as
ordinary web pages.

If their URLs are fixed, you can create a local hostname that is resolved
by a local DNS (not by the Internet DNS).
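
As a rough illustration of this setup, the sketch below uses the JDK's built-in com.sun.net.httpserver.HttpServer to serve files from a docroot, picking the subdirectory from the Host header. The layout docroot/<hostname>/<path> and the class name DocrootServer are assumptions made for the example; any ordinary web server with virtual hosts would do the same job.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch: serve docroot/<host>/<path> for requests to http://<host>/<path>. */
class DocrootServer {

    /** Maps Host header and request path to a file, or null if it would leave the host's directory. */
    static Path resolve(Path docroot, String host, String requestPath) {
        Path root = docroot.normalize();
        String h = host.replaceFirst(":\\d+$", "");            // drop an optional :port
        Path base = root.resolve(h).normalize();
        if (!base.startsWith(root) || base.equals(root)) {
            return null;                                        // empty or escaping host name
        }
        Path target = base.resolve(requestPath.replaceFirst("^/+", "")).normalize();
        return target.startsWith(base) ? target : null;         // reject ../ traversal
    }

    static HttpServer start(Path docroot, int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/", exchange -> {
            String host = exchange.getRequestHeaders().getFirst("Host");
            Path file = (host == null)
                    ? null
                    : resolve(docroot, host, exchange.getRequestURI().getPath());
            if (file == null || !Files.isRegularFile(file)) {
                exchange.sendResponseHeaders(404, -1);          // -1: no response body
                exchange.close();
                return;
            }
            byte[] body = Files.readAllBytes(file);
            // Assumption: everything in the docroot is HTML, as in the original question.
            exchange.getResponseHeaders().set("Content-Type", "text/html");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

For the crawler to reach it, each hostname has to resolve to the machine running the server (for example via a hosts-file entry or a local DNS), and the server has to listen on the port that the http:// URLs imply.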

Hope it helps.
---
wassalam,
[bayu]

/sent from Android phone/