Posted to dev@nutch.apache.org by Dawid Weiss <da...@cs.put.poznan.pl> on 2005/08/12 19:15:28 UTC

Injecting documents manually.

Has anyone considered or implemented injecting static pages with a 
different URL scheme? I mean the rare scenario where you have tons of 
static HTML pages and want to avoid routing requests through your 
own Web server, fetching them directly from disk instead and prefixing 
their disk path with a given URL prefix.

I looked at the problem only briefly (I admit), and it seems it would 
require some manual coding because of the split between the indexer and 
fetcher pipelines.

Any comments and suggestions are very welcome.
Dawid



Re: Injecting documents manually.

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
Thanks, this helps.
D.

Andrzej Bialecki wrote:
> Andy Liu wrote:
> 
>> This is built into Nutch. Instead of injecting http:// URLs, use
>> file:// URLs, and Nutch will use the protocol-file plugin to fetch
>> the files locally.
>>
>> Andy
> 
> 
> Also, there is a tool I created that skips importing these URLs into 
> the database first. Take a look at 
> http://issues.apache.org/jira/browse/NUTCH-68
> 
> So, you can do the following:
> 
> cd target
> 
> find . -exec printf "file:/`pwd`/{}\n" \; > url.lst
> 
> and convert this url.lst to a new segment containing a fetchlist.
> 

Re: Injecting documents manually.

Posted by Andrzej Bialecki <ab...@getopt.org>.
Andy Liu wrote:
> This is built into Nutch. Instead of injecting http:// URLs, use
> file:// URLs, and Nutch will use the protocol-file plugin to fetch
> the files locally.
> 
> Andy

Also, there is a tool I created that skips importing these URLs into 
the database first. Take a look at 
http://issues.apache.org/jira/browse/NUTCH-68

So, you can do the following:

cd target

find . -exec printf "file:/`pwd`/{}\n" \; > url.lst

and convert this url.lst to a new segment containing a fetchlist.
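
The listing step can also be wrapped in a small shell function. This is
only a sketch under a few assumptions: make_url_list is a made-up helper
name (it is not part of Nutch or of NUTCH-68), the output uses the
file:///absolute/path form, and only regular files are listed so that
directories do not end up in the list. Paths are not percent-encoded, so
file names with spaces or other special characters would need extra care.

```shell
#!/bin/sh
# Sketch: print every regular file under a directory as a file: URL,
# one URL per line. make_url_list is a hypothetical helper name.
make_url_list() {
    # Resolve the argument to an absolute path so the URLs are absolute.
    root=$(cd "$1" && pwd) || return 1
    # -type f skips directories. {} is passed to printf as its own
    # argument, so this does not rely on find expanding {} inside a
    # longer string. "$root" starts with "/", giving file:///... URLs.
    find "$root" -type f -exec printf 'file://%s\n' {} \;
}
```

For example, "make_url_list target > url.lst" would write the list for
the target directory without having to cd into it first.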

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Injecting documents manually.

Posted by Andy Liu <an...@gmail.com>.
This is built into Nutch. Instead of injecting http:// URLs, use
file:// URLs, and Nutch will use the protocol-file plugin to fetch
the files locally.

Andy

On 8/12/05, Dawid Weiss <da...@cs.put.poznan.pl> wrote:
> 
> Has anyone considered or implemented injecting static pages with a
> different URL scheme? I mean the rare scenario where you have tons of
> static HTML pages and want to avoid routing requests through your
> own Web server, fetching them directly from disk instead and prefixing
> their disk path with a given URL prefix.
> 
> I looked at the problem only briefly (I admit), and it seems it would
> require some manual coding because of the split between the indexer and
> fetcher pipelines.
> 
> Any comments and suggestions are very welcome.
> Dawid