Posted to dev@nutch.apache.org by "Feng (Michael) Ji" <fj...@yahoo.com> on 2005/08/04 05:30:42 UTC

digest field in Nutch index directory

Hi there,

I found an interesting field in the Nutch index directory
called "digest". It seems to be a hashed signature of the
fetched page content. Is that true?

I checked my guess by looking at the same page in two
different crawl rounds. The value of this field is
the same in both segments.

Essentially, I plan to check whether a page I am crawling
has been updated. If there is no change (meaning it has
not been updated yet), I won't re-index the page in my
search engine. To do this, I will compare the "digest"
fields of the two fetched copies of the same URL.
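
The comparison described above could be sketched roughly as
follows. This is only an illustration under assumptions: it
assumes the digest is an MD5 hash of the raw page content
(not confirmed here), and the class DigestCheck and its
methods md5Hex and needsReindex are made-up names, not part
of the Nutch API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not a Nutch class: remembers the last digest
// seen for each URL and reports whether the content has changed.
public class DigestCheck {

    // Digest recorded per URL in the previous crawl round
    // (an in-memory stand-in for whatever store is actually used).
    private final Map<String, String> previousDigests = new HashMap<>();

    // Assumption: the digest is an MD5 hash of the raw content bytes.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(content);
            StringBuilder sb = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Returns true if the URL is new or its digest differs from the
    // one recorded earlier; records the new digest either way.
    boolean needsReindex(String url, byte[] content) {
        String current = md5Hex(content);
        String previous = previousDigests.put(url, current);
        return !current.equals(previous);
    }

    public static void main(String[] args) {
        DigestCheck check = new DigestCheck();
        String url = "http://example.com/";
        byte[] round1 = "same page".getBytes(StandardCharsets.UTF_8);
        byte[] round2 = "same page".getBytes(StandardCharsets.UTF_8);
        byte[] round3 = "edited page".getBytes(StandardCharsets.UTF_8);

        System.out.println(check.needsReindex(url, round1)); // first time seen
        System.out.println(check.needsReindex(url, round2)); // identical content
        System.out.println(check.needsReindex(url, round3)); // content changed
    }
}
```

One caveat with this kind of check: a digest over the raw
bytes changes on any difference at all, so a page that embeds
a timestamp or a rotating ad would look updated on every
fetch.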

Is this the right approach? Does Nutch provide an API
call to check whether a particular web page has been
updated?

thanks,

Michael


		