Posted to user@nutch.apache.org by Benjamin Higgins <bh...@gmail.com> on 2006/09/21 19:19:51 UTC

Forcing refetch and index of specified files

How can I instruct Nutch to refetch specific files and then update the index
entries for those files?

I am indexing files on a fileserver and I am able to produce a report of
changed files about every 30 minutes.

I'd like to feed that into Nutch at approximately the same interval so I can
keep the index up-to-date.

Thanks.

Ben

Re: Forcing refetch and index of specified files

Posted by Tomi NA <he...@gmail.com>.

On 9/21/06, Andrzej Bialecki <ab...@getopt.org> wrote:
> Benjamin Higgins wrote:
> > How can I instruct Nutch to refetch specific files and then update the
> > index
> > entries for those files?
> >
> > I am indexing files on a fileserver and I am able to produce a report of
> > changed files about every 30 minutes.
> >
> > I'd like to feed that into Nutch at approximately the same interval so
> > I can
> > keep the index up-to-date.
> >
> > Thanks.
>
> Conceptually this should be easy - you just need to generate a fetchlist
> directly from your list of changed files, and not through
> injecting/generating from a crawldb.
>
> I wrote a tool for 0.7 which does this - look at the NUTCH-68 issue in
> JIRA. This would have to be ported to 0.8 - check how Injector does this
> in the first stage, when it converts a simple text file to a MapFile.

Would an algorithm like this make any sense:
for each URL in txt file
  if URL in crawldb
    update the date to "now()+1" in its crawl datum
  else
    use existing inject logic to inject the new url

After that, it's only a matter of running the recrawl script with -adddays 0.
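
To make the idea concrete, here is a rough sketch of that loop in Python. It is only a model: the crawldb is stood in for by a plain dict mapping each URL to its next fetch time (epoch seconds), and mark_for_refetch and due_for_fetch are made-up names, not Nutch APIs. It also assumes an entry is selected once its fetch time is no later than the current time, so it sets the time to "now" rather than guessing the units of "now()+1".

```python
import time


def mark_for_refetch(crawldb, changed_urls, now=None):
    """Make every changed URL due for fetching.

    crawldb is a stand-in dict {url: next_fetch_time}; a real crawldb
    stores far more per entry. Returns (updated, injected) URL lists.
    """
    now = time.time() if now is None else now
    updated, injected = [], []
    for raw in changed_urls:
        url = raw.strip()
        if not url or url.startswith("#"):
            continue  # skip blank lines and comments in the report
        if url in crawldb:
            updated.append(url)   # known URL: only its date changes
        else:
            injected.append(url)  # unknown URL: stands in for injection
        crawldb[url] = now
    return updated, injected


def due_for_fetch(crawldb, now):
    """Return, sorted, the URLs whose fetch time is not after `now`."""
    return sorted(url for url, when in crawldb.items() if when <= now)
```

Calling mark_for_refetch(db, open("changed.txt")) and then due_for_fetch(db, time.time()) would list the changed files among whatever else is already due; changed.txt is a hypothetical name for the 30-minute report.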

t.n.a.

Re: Forcing refetch and index of specified files

Posted by Andrzej Bialecki <ab...@getopt.org>.

Benjamin Higgins wrote:
> How can I instruct Nutch to refetch specific files and then update the 
> index
> entries for those files?
>
> I am indexing files on a fileserver and I am able to produce a report of
> changed files about every 30 minutes.
>
> I'd like to feed that into Nutch at approximately the same interval so 
> I can
> keep the index up-to-date.
>
> Thanks.

Conceptually this should be easy - you just need to generate a fetchlist 
directly from your list of changed files, and not through 
injecting/generating from a crawldb.

I wrote a tool for 0.7 which does this - look at the NUTCH-68 issue in 
JIRA. This would have to be ported to 0.8 - check how Injector does this 
in the first stage, when it converts a simple text file to a MapFile.
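
As a rough illustration of the text-file-to-fetchlist step (not the NUTCH-68 code, and not Nutch's Injector), the Python sketch below reads one URL per line, drops blanks, comments and duplicates, and returns the entries in sorted key order, which is the order a MapFile expects its keys to be written in. The function name build_fetchlist and the per-entry fields are placeholders invented for this example.

```python
def build_fetchlist(lines):
    """Turn an iterable of text lines (one URL each) into a sorted
    list of (url, datum) pairs.

    The datum dict is a hypothetical placeholder for whatever
    per-URL record the real fetchlist would carry.
    """
    urls = set()
    for line in lines:
        url = line.strip()
        if url and not url.startswith("#"):
            urls.add(url)  # the set removes duplicate report entries
    # Keys are emitted in ascending order, as a MapFile writer requires.
    return [(url, {"status": "unfetched"}) for url in sorted(urls)]
```

For example, build_fetchlist(open("changed.txt")) would give one entry per changed file, ready to be written out by whatever port of the tool does the actual serialization; changed.txt is again a hypothetical file name.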

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com