Posted to user@nutch.apache.org by Stefan Groschupf <sg...@media-style.com> on 2005/03/03 11:09:33 UTC

Re: [Nutch-general] Updating Intranet

Paul,
I do not understand what you mean.
When you use the crawl command, you should already have an updated index
at the end.
If you would like to reindex, perhaps because you plan to use more plugins,
simply delete index* in your segment folders and use the nutch index command.
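
As a rough illustration of the cleanup step only, here is a minimal Python sketch that finds and removes index* entries inside each segment folder. The function names and the assumed layout (one sub-folder per segment under a segments directory) are hypothetical and not taken from Nutch itself; the nutch index command still has to be run separately afterwards.

```python
import shutil
from pathlib import Path


def find_segment_indexes(segments_dir):
    """Return every path matching index* directly inside each segment folder."""
    root = Path(segments_dir)
    return sorted(
        path
        for segment in root.iterdir()
        if segment.is_dir()
        for path in segment.glob("index*")
    )


def delete_segment_indexes(segments_dir, dry_run=True):
    """Delete the index* entries; with dry_run=True only report them."""
    targets = find_segment_indexes(segments_dir)
    if not dry_run:
        for path in targets:
            if path.is_dir():
                shutil.rmtree(path)   # index stored as a directory
            else:
                path.unlink()         # marker or plain file such as index.done
    return targets
```

Calling delete_segment_indexes with the default dry_run=True first shows what would be removed before anything is deleted.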
HTH
Stefan
On 02.03.2005 at 20:49, sub paul wrote:

> Hi,
>
> I was trying to find out how to update my index after I have done the
> initial intranet crawl.
>
> Should I use the same procedure as whole-web crawl to crawl my list of 
> websites?
>
> Regards,
> Paul
>
>
-----------information technology-------------------
company:     http://www.media-style.com
forum:           http://www.text-mining.org
blog:	             http://www.find23.net


Re: [Nutch-general] Updating Intranet

Posted by sub paul <su...@gmail.com>.
Thanks Stefan,

I will give it a whirl.

Paul



On Thu, 3 Mar 2005 20:18:54 +0100, Stefan Groschupf <sg...@media-style.com> wrote:
> 
> > Use the command described in the internet crawling tool!
> >
> sorry... nonsense.
> ... use the commands described in the internet crawling tutorial.
> http://incubator.apache.org/nutch/tutorial.html#Whole-web+Crawling
> 
>

Re: [Nutch-general] Updating Intranet

Posted by Stefan Groschupf <sg...@media-style.com>.
> Use the command described in the internet crawling tool!
>
Sorry, that was nonsense.
I meant: use the commands described in the internet crawling tutorial:
http://incubator.apache.org/nutch/tutorial.html#Whole-web+Crawling


Re: [Nutch-general] Updating Intranet

Posted by Stefan Groschupf <sg...@media-style.com>.
Paul,
So you mean you want to refetch the pages.

Use the command described in the internet crawling tool!
Set the regex for your intranet, and in the nutch config file change the
lifetime of your pages from 30 days to 1 day.
Then you can generate a fetchlist daily (via a cron job) that contains the
pages from yesterday and any new pages.
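
To make the effect of the page lifetime concrete, here is a small Python sketch of the selection rule. It is not Nutch code: the function name, the dictionary of last fetch times, and the example URLs are assumptions made only for illustration. A page is due when it has never been fetched or when its last fetch is at least one lifetime old.

```python
from datetime import datetime, timedelta


def build_fetchlist(pages, now, lifetime_days=1):
    """Pick URLs that were never fetched or whose last fetch has expired.

    pages maps URL -> last fetch time (None for a page not fetched yet).
    """
    lifetime = timedelta(days=lifetime_days)
    due = []
    for url, last_fetched in pages.items():
        if last_fetched is None or now - last_fetched >= lifetime:
            due.append(url)
    return sorted(due)


# Hypothetical page database for an intranet crawl.
now = datetime(2005, 3, 4, 6, 0)
pages = {
    "http://intranet.example/a": datetime(2005, 3, 3, 6, 0),  # fetched yesterday
    "http://intranet.example/b": datetime(2005, 3, 4, 5, 0),  # fetched an hour ago
    "http://intranet.example/new": None,                      # never fetched
}
daily = build_fetchlist(pages, now, lifetime_days=1)
monthly = build_fetchlist(pages, now, lifetime_days=30)
```

With a lifetime of 1 day, the page fetched yesterday and the new page are due; with a lifetime of 30 days, only the new page is.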

HTH
Stefan


On 03.03.2005 at 19:55, sub paul wrote:

> Hi Stefan,
>
> I meant when I want to refetch the new pages, and add those pages to
> the index. How can I do that?
>
> It seems like intranet crawl using the bin/nutch crawl command is a
> one time deal. You get whatever you want, and if you want to fetch
> again, and index again ( more pages ), you start over.
>
> I want to fetch only the pages that are not in the index anymore.
>
> Thanks Stefan for your help.
>
> Regards,
> Paul
>
>
>


Re: [Nutch-general] Updating Intranet

Posted by sub paul <su...@gmail.com>.
Hi Stefan,

I meant when I want to refetch the new pages, and add those pages to
the index. How can I do that?

It seems like intranet crawl using the bin/nutch crawl command is a
one time deal. You get whatever you want, and if you want to fetch
again, and index again ( more pages ), you start over.

I want to fetch only the pages that are not in the index anymore. 

Thanks Stefan for your help.

Regards,
Paul


