Posted to user@nutch.apache.org by Martin Aesch <ma...@googlemail.com> on 2014/05/24 14:15:53 UTC
Nutch fetch local files with arbitrary mapped URLs
Hi all,
I have a bunch of HTML files sitting in my file system. I know the http:// URL of each HTML file.
If I just fetch from my file system, I will have file:// URLs, but I would like to map them to the http:// address or to any arbitrary address.
Is there a reasonably clean, non-hackish way to do that?
Thanks,
Martin
Re: Nutch fetch local files with arbitrary mapped URLs
Posted by Martin Aesch <ma...@googlemail.com>.
Thanks, Bayu.
I have crawl data from different sources (Wikipedia, Common Crawl,
etc.) in different formats. For Common Crawl, manipulating DNS seems
problematic: I also do not have all URLs of a domain, so it would be too
complex and way too hackish. I am looking for a clean way to inject the
URLs into the fetcher, i.e. to map a subset of them.
Currently, I am thinking of keeping a Cassandra database of key-value
(URL, content) pairs and hooking in directly where FetcherThread
actually fetches: if the URL is in my database (or whatever store), take
the content from there, otherwise fetch over HTTP. But then the
politeness rules would have to be bypassed for those URLs, which looks
rather difficult, and I would have to change Nutch core functionality.
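The lookup-then-fallback idea can be sketched roughly as follows. This is only an illustration under assumptions, not Nutch code: the class LocalFirstFetcher, its methods, and the example URLs are hypothetical, a plain Map stands in for the Cassandra table, and a Function stands in for the regular HTTP fetch. It does not touch FetcherThread or the politeness handling.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of "serve from local store, else fetch over HTTP". Not part of Nutch. */
class LocalFirstFetcher {
    // Stand-in for the key-value store (e.g. a Cassandra table): URL -> raw content.
    private final Map<String, byte[]> localStore;
    // Stand-in for the normal HTTP fetch; only called when the store has no entry.
    private final Function<String, byte[]> httpFetch;
    // Counts how many requests were answered locally (no network access, so no politeness delay needed).
    private int localHits = 0;

    LocalFirstFetcher(Map<String, byte[]> localStore, Function<String, byte[]> httpFetch) {
        this.localStore = localStore;
        this.httpFetch = httpFetch;
    }

    /** Returns the stored content for the URL if present, otherwise delegates to the HTTP fetch. */
    byte[] fetch(String url) {
        byte[] stored = localStore.get(url);
        if (stored != null) {
            localHits++;
            return stored;
        }
        return httpFetch.apply(url);
    }

    int localHits() {
        return localHits;
    }

    public static void main(String[] args) {
        Map<String, byte[]> store = new HashMap<>();
        store.put("http://example.org/a.html",
                "<html>local copy</html>".getBytes(StandardCharsets.UTF_8));

        // The fallback here is a fake that only labels the URL; a real one would do an HTTP request.
        LocalFirstFetcher fetcher = new LocalFirstFetcher(
                store, url -> ("fetched " + url).getBytes(StandardCharsets.UTF_8));

        System.out.println(new String(fetcher.fetch("http://example.org/a.html"), StandardCharsets.UTF_8));
        System.out.println(new String(fetcher.fetch("http://example.org/b.html"), StandardCharsets.UTF_8));
    }
}
```

Keeping the store and the fallback behind small interfaces means the lookup could in principle live in a separate component rather than in core code, though whether that fits the fetcher is exactly the open question here.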
Best regards,
Martin
-----Original Message-----
From: Bayu Widyasanyata <bw...@gmail.com>
Reply-to: user@nutch.apache.org
To: user@nutch.apache.org
Subject: Re: Nutch fetch local files with arbitrary mapped URLs
Date: Sun, 25 May 2014 21:45:48 +0700
Hi Martin,
Just put the files inside the "docroot" of an ordinary web server and serve them from there.
If their URLs are fixed, you can create a local hostname resolved by
local DNS (not by Internet DNS).
Hope it helps.
---
wassalam,
[bayu]
/sent from Android phone/
Re: Nutch fetch local files with arbitrary mapped URLs
Posted by Bayu Widyasanyata <bw...@gmail.com>.
Hi Martin,
Just put the files inside the "docroot" of an ordinary web server and serve them from there.
If their URLs are fixed, you can create a local hostname resolved by
local DNS (not by Internet DNS).
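As a rough illustration of the docroot idea, the sketch below computes where a file would have to be placed so that a local web server answers for its original http:// URL. Everything in it is an assumption for illustration: the class DocrootLayout, the method targetPath, the one-directory-per-hostname layout, and the example paths are hypothetical, and query strings are ignored.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: maps a known http:// URL to a file location under a per-host docroot. */
class DocrootLayout {

    /**
     * Returns docrootBase/host/path for the given URL.
     * An empty path or a path ending in "/" is mapped to index.html (an assumed server default).
     */
    static Path targetPath(Path docrootBase, String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }

        String path = uri.getPath();
        if (path == null || path.isEmpty()) {
            path = "/index.html";
        } else if (path.endsWith("/")) {
            path = path + "index.html";
        }

        Path hostRoot = docrootBase.resolve(host).normalize();
        // Drop the leading "/" so the URL path resolves relative to the host's docroot.
        Path target = hostRoot.resolve(path.substring(1)).normalize();
        if (!target.startsWith(hostRoot)) {
            // Guards against paths such as /../../etc/passwd escaping the docroot.
            throw new IllegalArgumentException("URL path escapes docroot: " + url);
        }
        return target;
    }

    public static void main(String[] args) {
        Path base = Paths.get("/srv/www");
        System.out.println(targetPath(base, "http://example.org/dir/page.html"));
        System.out.println(targetPath(base, "http://example.org/"));
    }
}
```

With the files laid out this way, each hostname still has to resolve to the local web server for the crawler to reach them under their original URLs.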
Hope it helps.
---
wassalam,
[bayu]
/sent from Android phone/