Posted to user@nutch.apache.org by Benjamin Higgins <bh...@gmail.com> on 2006/09/21 19:19:51 UTC
Forcing refetch and index of specified files
How can I instruct Nutch to refetch specific files and then update the index
entries for those files?
I am indexing files on a fileserver and I am able to produce a report of
changed files about every 30 minutes.
I'd like to feed that into Nutch at approximately the same interval so I can
keep the index up-to-date.
Thanks.
Ben
Re: Forcing refetch and index of specified files
Posted by Tomi NA <he...@gmail.com>.
On 9/21/06, Andrzej Bialecki <ab...@getopt.org> wrote:
> Benjamin Higgins wrote:
> > How can I instruct Nutch to refetch specific files and then update the index
> > entries for those files?
> >
> > I am indexing files on a fileserver and I am able to produce a report of
> > changed files about every 30 minutes.
> >
> > I'd like to feed that into Nutch at approximately the same interval so I can
> > keep the index up-to-date.
> >
> > Thanks.
>
> Conceptually this should be easy - you just need to generate a fetchlist
> directly from your list of changed files, and not through
> injecting/generating from a crawldb.
>
> I wrote a tool for 0.7 which does this - look at the NUTCH-68 issue in
> JIRA. This would have to be ported to 0.8 - check how Injector does this
> in the first stage, when it converts a simple text file to a MapFile.
Would an algorithm like this make any sense?

for each URL in the text file:
    if the URL is in the crawldb:
        update the date in its crawl datum to "now()+1"
    else:
        use the existing inject logic to inject the new URL

After that, it's only a matter of running the recrawl script with -adddays 0.
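
To make the idea above concrete, here is a rough Python sketch of the same loop. It is only a toy model, not Nutch code: the crawldb is stood in for by a plain dict mapping each URL to its next fetch time, and the names mark_for_refetch and due_for_fetch are made up for illustration. The post does not say what unit the "+1" in "now()+1" is in, so the sketch simply sets the due time to the current time, which is an assumption.

```python
import time


def mark_for_refetch(crawldb, changed_urls, now=None):
    """Mark every changed URL as due for fetching.

    crawldb is a toy stand-in (dict: URL -> next fetch time in epoch
    seconds), NOT the real Nutch CrawlDb. Returns two lists: URLs that
    were already known and got updated, and URLs that were newly added.
    """
    now = time.time() if now is None else now
    updated, injected = [], []
    for raw in changed_urls:
        url = raw.strip()
        if not url:
            continue  # skip blank lines in the input
        if url in crawldb:
            updated.append(url)   # known URL: only its date changes
        else:
            injected.append(url)  # unknown URL: the "inject" branch
        crawldb[url] = now        # assumption: "due now" is good enough
    return updated, injected


def due_for_fetch(crawldb, now):
    """Return the URLs whose fetch time has been reached, sorted."""
    return sorted(url for url, when in crawldb.items() if when <= now)


if __name__ == "__main__":
    db = {"file:///srv/a.txt": 9000.0, "file:///srv/b.txt": 5000.0}
    changed = ["file:///srv/b.txt", "file:///srv/c.txt"]
    print(mark_for_refetch(db, changed, now=2000.0))
    print(due_for_fetch(db, 2000.0))
```

In this model the untouched entry (a.txt) stays out of the next fetch because its date is still in the future, while the changed and the new file both become due.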
t.n.a.
Re: Forcing refetch and index of specified files
Posted by Andrzej Bialecki <ab...@getopt.org>.
Benjamin Higgins wrote:
> How can I instruct Nutch to refetch specific files and then update the index
> entries for those files?
>
> I am indexing files on a fileserver and I am able to produce a report of
> changed files about every 30 minutes.
>
> I'd like to feed that into Nutch at approximately the same interval so I can
> keep the index up-to-date.
>
> Thanks.
Conceptually this should be easy - you just need to generate a fetchlist
directly from your list of changed files, and not through
injecting/generating from a crawldb.
I wrote a tool for 0.7 which does this - look at the NUTCH-68 issue in
JIRA. This would have to be ported to 0.8 - check how Injector does this
in the first stage, when it converts a simple text file to a MapFile.
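
As a possible first step toward that, the changed-files report could be turned into the kind of simple text file mentioned above, with one URL per line. The Python sketch below is only an illustration under assumptions: it assumes the report lists absolute POSIX paths, one per line, and that those paths can be addressed as file: URLs. The function name report_to_seed_urls is hypothetical, and the sketch does not touch any Nutch API or build a MapFile.

```python
from urllib.parse import quote


def report_to_seed_urls(report_lines):
    """Convert absolute POSIX paths into file: URLs, one per input line.

    Blank lines are skipped and duplicates are dropped, keeping the
    first occurrence. Relative paths are rejected because they cannot
    be turned into an unambiguous file: URL.
    """
    urls = []
    seen = set()
    for line in report_lines:
        path = line.strip()
        if not path:
            continue
        if not path.startswith("/"):
            raise ValueError(f"expected an absolute path, got: {path!r}")
        # Percent-encode spaces and other unsafe characters, keep "/".
        url = "file://" + quote(path)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    report = ["/srv/share/report 1.doc", "", "/srv/share/a.txt"]
    for url in report_to_seed_urls(report):
        print(url)
```

Writing the returned URLs to a text file gives a flat list that a fetchlist tool or an inject step could take as input.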
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com