Posted to user@nutch.apache.org by Kumar Limbu <ku...@gmail.com> on 2005/12/19 02:52:12 UTC

How to recrawl urls

Hi everyone,

I have browsed through the Nutch documentation, but I have not found enough
information on how to recrawl the URLs that I have already crawled. Do we
have to do the recrawling ourselves, or will the Nutch application do it?

More information in this regard would be highly appreciated. Thank you very
much.

--
Keep on smiling :) Kumar

Re: How to recrawl urls

Posted by Stefan Groschupf <sg...@media-style.com>.
Do the steps manually as described here:

http://wiki.apache.org/nutch/SimpleMapReduceTutorial
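
Roughly, the manual recrawl loop from that tutorial looks like the sketch
below (a sketch only, assuming the MapReduce-era layout with a crawldb and
a segments directory; paths and options may differ in your checkout):

  # Generate a fetchlist from the crawldb into a fresh segment
  bin/nutch generate crawl/crawldb crawl/segments
  segment=`ls -d crawl/segments/2* | tail -1`

  # Fetch the pages in that segment
  bin/nutch fetch $segment

  # Fold the fetch results and newly found outlinks back into the crawldb
  bin/nutch updatedb crawl/crawldb $segment

Re-running this cycle later refetches whatever has come due.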






Re: How to recrawl urls

Posted by Arun Kaundal <ar...@gmail.com>.
Hi Giang,
   But if I want to run the CrawlTool manually, say every hour, it throws
an error like "Crawl directory already exists". If I comment out that
statement, I get a number of errors like "Directory already exists". What
should I do?
   Please show me a way out...



Re: How to recrawl urls

Posted by Nguyen Ngoc Giang <gi...@gmail.com>.
  The scheme of intranet crawling is like this: First, you create a webdb
using WebDBAdminTool. After that, you inject a seed URL using WebDBInjector.
The seed URL is inserted into your webdb, marked with the current date and
time. Then you create a fetch list using FetchListTool. The FetchListTool
reads all URLs in the webdb that are due to be crawled and puts them into
the fetchlist. Next, the Fetcher crawls all URLs in the fetchlist. Finally,
once crawling is finished, UpdateDatabaseTool extracts all outlinks and puts
them into the webdb. Newly extracted outlinks get the current date and time,
while the just-crawled URLs have their next fetch date set 30 days ahead
(this actually happens in FetchListTool). So on the next run the newly
extracted links will be crawled, but not the just-crawled URLs. And so on
and so forth.
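
In command form, that cycle is roughly the sketch below (assuming a Nutch
0.7-style layout with db/ and segments/ directories; the seed file urls.txt
is hypothetical, and exact options may vary by version):

  # One-time setup: create the webdb and inject the seed URLs
  bin/nutch admin db -create
  bin/nutch inject db -urlfile urls.txt

  # Recurring cycle: generate a fetchlist, fetch it, update the webdb
  bin/nutch generate db segments
  s=`ls -d segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch updatedb db $s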

  Therefore, as long as the crawler is still alive after 30 days (or
whatever threshold you set), all the "just-crawled" URLs will come due and
be recrawled. That's why we need to keep a crawler running at that time.
This could be done using a cron job, I think.
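
For example, a crontab entry along these lines would rerun a recrawl script
nightly (the script path and log file here are hypothetical):

  # m h dom mon dow  command
  0 2 * * * /opt/nutch/bin/recrawl.sh >> /var/log/nutch-recrawl.log 2>&1

The script itself would just repeat the generate/fetch/updatedb cycle
sketched above.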

  Regards,
   Giang




Re: How to recrawl urls

Posted by Kumar Limbu <ku...@gmail.com>.
Hi Nguyen,

Thank you for your information, but I would like to confirm it. I do see a
variable that defines the next fetch interval, but I am not sure about it.
If anyone has more information in this regard, please let me know.

Thank you in advance,






--
Keep on smiling :) Kumar

Re: How to recrawl urls

Posted by Nguyen Ngoc Giang <gi...@gmail.com>.
  As I understand it, by default all links in Nutch are recrawled after 30
days, as long as your Nutch process is still running. FetchListTool takes
care of this setting. So maybe you can write a script (and put it in cron?)
to reactivate the crawler.
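
The 30-day default is controlled by a fetch-interval property in the Nutch
configuration, which is presumably the variable Kumar saw. A sketch of
overriding it in conf/nutch-site.xml, assuming the property is named
db.default.fetch.interval as in nutch-default.xml:

  <property>
    <name>db.default.fetch.interval</name>
    <value>30</value>
    <description>Default number of days between re-fetches of a page.
    </description>
  </property>

Lowering the value makes FetchListTool consider pages due sooner.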

  Regards,
   Giang

