Posted to dev@nutch.apache.org by Dawid Weiss <da...@cs.put.poznan.pl> on 2005/08/12 19:15:28 UTC
Injecting documents manually.
Has anyone considered or implemented injecting static pages with a
different URL scheme? I mean the rare scenario when you have tons of
static HTML pages and would want to avoid rerouting queries through your
own Web server, but rather fetch them directly from disk prefixing their
disk path with a given URL prefix.
I looked at the problem briefly (I admit) and it seems it'd require some
manual coding because of the split between the indexer and fetcher pipelines.
Any comments and suggestions are very welcome.
Dawid
Re: Injecting documents manually.
Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
Thanks, this helps.
D.
Andrzej Bialecki wrote:
> Andy Liu wrote:
>
>> This is built into Nutch. Instead of injecting http:// URLs, use
>> file://, and Nutch will use protocol-file to fetch the files locally.
>>
>> Andy
>
>
> Also, there is a tool I created to skip importing these URLs into
> the database first... Take a look at
> http://issues.apache.org/jira/browse/NUTCH-68
>
> So, you can do the following:
>
> cd target
>
> find . -exec printf "file:/`pwd`/{}\n" \; > url.lst
>
> and convert this url.lst to a new segment containing a fetchlist.
>
Re: Injecting documents manually.
Posted by Andrzej Bialecki <ab...@getopt.org>.
Andy Liu wrote:
> This is built into Nutch. Instead of injecting http:// URLs, use
> file://, and Nutch will use protocol-file to fetch the files locally.
>
> Andy
Also, there is a tool I created to skip importing these URLs into
the database first... Take a look at
http://issues.apache.org/jira/browse/NUTCH-68
So, you can do the following:
cd target
find . -exec printf "file:/`pwd`/{}\n" \; > url.lst
and convert this url.lst to a new segment containing a fetchlist.
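
For reference, here is a minimal sketch of one way to build such a URL
list. It is not the command above and not part of Nutch: the function
name make_file_urls is hypothetical. It assumes a POSIX shell with find
and sed. It differs in two ways: it lists regular files only (no
directories), and it prefixes each absolute path with "file://", so the
result takes the three-slash form file:///path. The command above would
instead concatenate "file:/" with the output of pwd, giving
file://path/./name, which some URL parsers may read as a host name.

```shell
# Sketch only: print one file:// URL per regular file under a directory.
# make_file_urls is a hypothetical helper name, not a Nutch tool.
make_file_urls() {
    # Resolve the argument to an absolute directory path.
    root=$(cd -- "$1" && pwd) || return 1
    # An absolute path starts with "/", so prefixing "file://" yields
    # file:///path/to/file. Paths are NOT percent-encoded here, so names
    # containing spaces or "%" would need extra handling.
    find "$root" -type f | sed 's|^|file://|'
}
```

Usage would look like "make_file_urls target > url.lst". Whether a given
Nutch version prefers file:/path or file:///path is something to check
against its protocol-file plugin and URL filters.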
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Injecting documents manually.
Posted by Andy Liu <an...@gmail.com>.
This is built into Nutch. Instead of injecting http:// URLs, use
file://, and Nutch will use protocol-file to fetch the files locally.
Andy
On 8/12/05, Dawid Weiss <da...@cs.put.poznan.pl> wrote:
>
> Has anyone considered or implemented injecting static pages with a
> different URL scheme? I mean the rare scenario when you have tons of
> static HTML pages and would want to avoid rerouting queries through your
> own Web server, but rather fetch them directly from disk prefixing their
> disk path with a given URL prefix.
>
> I looked at the problem briefly (I admit) and it seems it'd require some
> manual coding because of the split between the indexer and fetcher pipelines.
>
> Any comments and suggestions are very welcome.
> Dawid