Posted to dev@nutch.apache.org by "Feng (Michael) Ji" <fj...@yahoo.com> on 2005/08/04 05:30:42 UTC
digest field in Nutch index directory
Hi there,
I found an interesting field in the Nutch index directory
called "digest". It seems to be a hashed signature of a
fetched page's content. Is that true?
I checked my guess by looking at the same page in two
different crawl rounds. The value of this field is
the same in both segments.
Essentially, I plan to check whether a page I am crawling
has been updated. If there is no change (meaning no
update yet), I won't index the page into my search
engine. To do this, I will compare the
"digest" fields of the two fetched copies of the same URL.
Is this the right approach? Does Nutch provide an API
call to check whether a particular web page has been
updated?
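To make the comparison concrete, here is a minimal sketch in plain Java
of what I have in mind. It assumes the digest is a hash (MD5 here) of the
raw page bytes. The class and method names (DigestCheck, digestOf,
hasChanged) are made up for illustration and are not part of the Nutch API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not a Nutch class: hashes page content and
// compares the digests from two crawl rounds.
class DigestCheck {

    // Returns the MD5 hash of the content as a lowercase hex string.
    // Assumption: the "digest" field is a hash of the raw page bytes.
    static String digestOf(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(content);
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // A page counts as changed if there is no earlier digest for its URL
    // or if the earlier digest differs from the new one.
    static boolean hasChanged(String oldDigest, String newDigest) {
        return oldDigest == null || !oldDigest.equals(newDigest);
    }

    public static void main(String[] args) {
        byte[] firstCrawl = "<html>same page</html>".getBytes(StandardCharsets.UTF_8);
        byte[] secondCrawl = "<html>same page</html>".getBytes(StandardCharsets.UTF_8);

        String oldDigest = digestOf(firstCrawl);
        String newDigest = digestOf(secondCrawl);

        if (hasChanged(oldDigest, newDigest)) {
            System.out.println("page changed, index it again");
        } else {
            System.out.println("page unchanged, skip indexing");
        }
    }
}
```

One caveat with this approach: any byte-level difference, such as a
timestamp or rotating ad in the page, produces a different digest even
when the main text is the same.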
Thanks,
Michael