Posted to user@nutch.apache.org by Neeti Gupta <ne...@yahoo.com> on 2009/06/24 13:52:47 UTC

recrawling

We have built a crawler that visits various sites, and I want it to recrawl
each site as soon as it is updated. Can anyone help me find out how to tell
when a site has been updated and it is time to crawl it again?
-- 
View this message in context: http://www.nabble.com/recrawling-tp24183356p24183356.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: recrawling

Posted by Neeti Gupta <ne...@yahoo.com>.

But are there any rules we can use to decide when to recrawl a website, so
that we get its updated content as soon as possible?



Otis Gospodnetic-2 wrote:
> 
> 
> Neeti,
> 
> I don't think there is a way to know when a regular web site has been
> updated.  You can issue GET or HEAD requests and look at the Last-Modified
> date, but this is not 100% reliable.  You can fetch and compare content,
> but that's not 100% reliable either.  If you are indexing blogs, then you
> can get "pings" when they update, or can rely on detecting changes in
> their feeds.
> 
>  Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 

-- 
View this message in context: http://www.nabble.com/recrawling-tp24183356p24474563.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: recrawling

Posted by Otis Gospodnetic <og...@yahoo.com>.

Neeti,

I don't think there is a way to know when a regular web site has been updated.  You can issue GET or HEAD requests and look at the Last-Modified date, but this is not 100% reliable.  You can fetch and compare content, but that's not 100% reliable either.  If you are indexing blogs, then you can get "pings" when they update, or can rely on detecting changes in their feeds.
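
As a rough illustration of the two checks mentioned above (comparing the Last-Modified header and comparing fetched content), here is a minimal Python sketch. It is not Nutch code: the PageState record and the function names are hypothetical, and it assumes the crawler stored the previous Last-Modified value and a hash of the previous body. Neither signal is fully reliable, so the result is only a hint.

```python
import hashlib
from dataclasses import dataclass
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


@dataclass
class PageState:
    """What the crawler remembered from the previous fetch (hypothetical record)."""
    last_modified: Optional[str]  # raw Last-Modified header value, if any
    content_hash: str             # SHA-256 hex digest of the previous body


def content_digest(body: bytes) -> str:
    """Hash the raw body so two fetches can be compared cheaply."""
    return hashlib.sha256(body).hexdigest()


def header_says_newer(previous: Optional[str], current: Optional[str]) -> Optional[bool]:
    """Compare two Last-Modified values.

    Returns True or False when both values parse as HTTP dates, and None
    when a value is missing or malformed and so gives no usable signal.
    """
    if not previous or not current:
        return None
    try:
        return parsedate_to_datetime(current) > parsedate_to_datetime(previous)
    except (TypeError, ValueError):
        return None


def looks_updated(state: PageState, headers: Mapping[str, str], body: bytes) -> bool:
    """Heuristic: accept the header when it says "newer", else fall back to the hash.

    The header lookup is case-sensitive here; a real client would normalize names.
    """
    if header_says_newer(state.last_modified, headers.get("Last-Modified")):
        return True
    return content_digest(body) != state.content_hash


# Example with made-up values: same body, but the server reports a later date.
old = PageState("Wed, 24 Jun 2009 11:52:47 GMT", content_digest(b"<html>v1</html>"))
changed = looks_updated(old, {"Last-Modified": "Thu, 25 Jun 2009 08:00:00 GMT"},
                        b"<html>v1</html>")
```

Pages that embed timestamps, ads, or session tokens will hash differently on every fetch, so in practice the body would probably need to be normalized (or only the main text hashed) before comparing.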

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


