Posted to user@nutch.apache.org by ShivaKarthik S <sh...@gmail.com> on 2011/10/12 08:39:59 UTC

Reg: Comparing two segments

Hi all,
  I have 5 URLs that I crawled today, so I have a crawl segment for
those 5 URLs. I am planning to crawl the same URLs again tomorrow, which
will give me a second segment at the same depth. I would like to compare
these two crawls to check whether the content has been updated. Can
anyone please tell me how to compare the different crawls, or how to
find out whether the content has changed?

I know one possible solution is an RSS feed, but I have chosen URLs that
do not have RSS feed or Sitemap support.

-- 
Thanks and Regards
Shiva

Re: Reg: Comparing two segments

Posted by Marek Bachmann <m....@uni-kassel.de>.
On 12.10.2011 08:39, ShivaKarthik S wrote:
> Hi all,
>    I have 5 URLs that I crawled today, so I have a crawl segment for
> those 5 URLs. I am planning to crawl the same URLs again tomorrow, which
> will give me a second segment at the same depth. I would like to compare
> these two crawls to check whether the content has been updated. Can
> anyone please tell me how to compare the different crawls, or how to
> find out whether the content has changed?
>
> I know one possible solution is an RSS feed, but I have chosen URLs that
> do not have RSS feed or Sitemap support.
>

I think if you only want to know IF there were changes, you can see it
in the status of the URL in the crawldb.
When the URLs are scheduled for fetching again and there were no changes,
they should have the status 6 (not_modified).
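
To make the status check concrete, here is a rough Python sketch that reads a text dump of the crawldb and lists the URLs whose status code is 6. It assumes each record in the dump starts with the URL, a tab, and "Version:", followed by a "Status: <code> (<name>)" line. That layout, the sample URLs, the status labels, and the helper names (parse_statuses, split_by_modification) are all illustrative assumptions, so please check them against the dump your own Nutch version produces.

```python
import re

# Hypothetical sample of a crawldb text dump. The record layout and the
# status labels are assumptions; compare them with your own dump output.
SAMPLE_DUMP = (
    "http://example.com/a\tVersion: 7\n"
    "Status: 6 (db_notmodified)\n"
    "Signature: 0123abcd\n"
    "\n"
    "http://example.com/b\tVersion: 7\n"
    "Status: 2 (db_fetched)\n"
    "Signature: 89ef4567\n"
)

# Matches lines such as "Status: 6 (db_notmodified)".
STATUS_RE = re.compile(r"^Status:\s*(\d+)\s*\(([^)]*)\)")


def parse_statuses(dump_text):
    """Map each URL in the dump to its (status_code, status_name) pair."""
    statuses = {}
    current_url = None
    for line in dump_text.splitlines():
        # Assumed record header: "<url>\tVersion: <n>".
        if "\tVersion:" in line:
            current_url = line.split("\t", 1)[0]
            continue
        match = STATUS_RE.match(line)
        if match and current_url is not None:
            statuses[current_url] = (int(match.group(1)), match.group(2))
    return statuses


def split_by_modification(statuses, not_modified_code=6):
    """Return (unchanged, other) URL lists based on the status code."""
    unchanged = sorted(
        url for url, (code, _) in statuses.items() if code == not_modified_code
    )
    other = sorted(url for url in statuses if url not in unchanged)
    return unchanged, other
```

Calling split_by_modification(parse_statuses(text)) on the dump taken after the second crawl would give one list of URLs reported as not modified and one list of everything else. The second list only says that a URL has some other status, so it is a starting point for a closer look and not proof that the content changed.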