Posted to user@nutch.apache.org by Neeti Gupta <ne...@yahoo.com> on 2009/06/24 13:52:47 UTC
recrawling
We have built a crawler that visits various sites, and I want it to recrawl
sites as soon as they are updated. Can anyone help me find out how to tell
when a site has been updated and it is time to crawl it again?
--
View this message in context: http://www.nabble.com/recrawling-tp24183356p24183356.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: recrawling
Posted by Neeti Gupta <ne...@yahoo.com>.
But are there any rules by which we can decide when to recrawl a website, so
that we get its updated content as soon as possible?
Otis Gospodnetic-2 wrote:
>
>
> Neeti,
>
> I don't think there is a way to know when a regular web site has been
> updated. You can issue GET or HEAD requests and look at the Last-Modified
> date, but this is not 100% reliable. You can fetch and compare content,
> but that's not 100% reliable either. If you are indexing blogs, then you
> can get "pings" when they update, or can rely on detecting changes in
> their feeds.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
Re: recrawling
Posted by Otis Gospodnetic <og...@yahoo.com>.
Neeti,
I don't think there is a way to know when a regular web site has been updated. You can issue GET or HEAD requests and look at the Last-Modified date, but this is not 100% reliable. You can fetch and compare content, but that's not 100% reliable either. If you are indexing blogs, then you can get "pings" when they update, or can rely on detecting changes in their feeds.
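As a rough illustration of the two checks described above, here is a minimal Python sketch. It is not Nutch code, and the function names (head_last_modified, header_says_newer, body_changed) are hypothetical helpers made up for this example. It assumes the server may or may not send a Last-Modified header, so the header check can come back "unknown" and the content hash serves as a fallback. Neither signal is fully reliable, for the reasons given above.

```python
import hashlib
import urllib.request
from email.utils import parsedate_to_datetime
from typing import Optional


def head_last_modified(url: str, timeout: float = 10.0) -> Optional[str]:
    """Issue a HEAD request and return the raw Last-Modified header, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Last-Modified")


def header_says_newer(previous: Optional[str], current: Optional[str]) -> Optional[bool]:
    """Compare two Last-Modified values.

    Returns True/False when both values parse as dates, and None when the
    header is missing or malformed (i.e. it gives no usable answer).
    """
    if not previous or not current:
        return None
    try:
        return parsedate_to_datetime(current) > parsedate_to_datetime(previous)
    except (TypeError, ValueError):
        return None


def body_changed(previous_digest: str, body: bytes) -> bool:
    """Compare a SHA-256 digest of the freshly fetched body with the stored one.

    Pages with ads, timestamps or session tokens will look changed on every
    fetch, which is one reason this check is not fully reliable either.
    """
    return hashlib.sha256(body).hexdigest() != previous_digest
```

A crawler could store the header value and digest from the previous fetch, call head_last_modified first, and download and hash the full body only when header_says_newer returns True or None.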
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
> From: Neeti Gupta <ne...@yahoo.com>
> To: nutch-user@lucene.apache.org
> Sent: Wednesday, June 24, 2009 7:52:47 AM
> Subject: recrawling
>
>
> We have built a crawler that visits various sites, and I want it to recrawl
> sites as soon as they are updated. Can anyone help me find out how to tell
> when a site has been updated and it is time to crawl it again?
> --
> View this message in context:
> http://www.nabble.com/recrawling-tp24183356p24183356.html
> Sent from the Nutch - User mailing list archive at Nabble.com.