Posted to user@nutch.apache.org by Hong Li <ce...@gmail.com> on 2006/03/19 15:36:37 UTC

automatically fetch new added contents of given website?

Greetings,

Can anyone tell me how to regularly fetch a given website and grab its
newly added content? I am using nutch crawl to get the first complete
copy of our website's content, to replace a MySQL-based search, but I can't
figure out how to run Nutch a second time, since it always complains that
the crawl directory already exists.

TIA,

Re: automatically fetch new added contents of given website?

Posted by Hong Li <ce...@gmail.com>.
Thanks. From what I've read so far, it seems there is a way to inject new URLs into
the existing webdb, then fetch those newly added URLs, index them, and merge with the
existing version? Is this the same as what you said, "generate segments and do it"?


On 3/19/06, Raghavendra Prabhu <rr...@gmail.com> wrote:
>
> You cannot run the crawl step once more.
>
> You have to generate segments and do it (run the fetch, updatedb, and index steps yourself).
>
> The normal crawl command cannot be used for this.
>
>
>
>
> On 3/19/06, Hong Li <ce...@gmail.com> wrote:
> >
> > Greetings,
> >
> > Can anyone tell me how to regularly fetch a given website and grab its
> > newly added content? I am using nutch crawl to get the first complete
> > copy of our website's content, to replace a MySQL-based search, but I can't
> > figure out how to run Nutch a second time, since it always complains that
> > the crawl directory already exists.
> >
> > TIA,
> >
> >
>
>

Re: automatically fetch new added contents of given website?

Posted by Raghavendra Prabhu <rr...@gmail.com>.
You cannot run the crawl step once more.

You have to generate segments and do it (run the fetch, updatedb, and index steps yourself).

The normal crawl command cannot be used for this.
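
To make that concrete, here is a minimal sketch of the generate-segments-and-fetch
cycle against an existing webdb, assuming the Nutch 0.7-era command-line tools; the
"db" and "segments" directories are illustrative and exact options may differ by
version:

  # pick the URLs that are due for fetching into a new segment
  bin/nutch generate db segments

  # the newest directory under segments/ is the one just generated
  segment=`ls -d segments/2* | tail -1`

  # fetch the pages in that segment
  bin/nutch fetch $segment

  # fold the fetch results (new links, fetch times) back into the webdb
  bin/nutch updatedb db $segment

  # index the freshly fetched segment
  bin/nutch index $segment

Wrapping these steps in a shell script run from cron is one way to pick up newly
added pages on a regular schedule.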




On 3/19/06, Hong Li <ce...@gmail.com> wrote:
>
> Greetings,
>
> Can anyone tell me how to regularly fetch a given website and grab its
> newly added content? I am using nutch crawl to get the first complete
> copy of our website's content, to replace a MySQL-based search, but I can't
> figure out how to run Nutch a second time, since it always complains that
> the crawl directory already exists.
>
> TIA,
>
>