Posted to user@nutch.apache.org by weishenyun <wl...@yahoo.com.cn> on 2012/08/21 11:44:51 UTC

What is the Nutch page-update mechanism after recrawl

Hi everyone,
       I want to know how Nutch updates a page after a recrawl. For example, a
page was fetched successfully and stored in the DB or file system by the last
crawl command, but it returns 404 when the same page is recrawled. Will Nutch
use this 404 response to update the information stored from the earlier
successful fetch? What about other cases, such as 301, 302, or 503?
      Thanks in advance.



--
View this message in context: http://lucene.472066.n3.nabble.com/What-is-the-Nutch-page-update-mechanism-after-recrawl-tp4002366.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: What is the Nutch page-update mechanism after recrawl

Posted by feng lu <am...@gmail.com>.
I think Nutch uses the fetch-related status to update the db-related
status. If the recrawled URL is gone (404), the fetch-related status is
STATUS_FETCH_GONE, and the former page is updated to STATUS_DB_GONE. If the
URL fails temporarily, Nutch tries it again on later crawls until the number
of retries reaches db.fetch.retry.max.
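
The update rule described above can be sketched in a few lines of Java. This
is only an illustration, not Nutch's actual code: the class, enums, and
update() method are hypothetical stand-ins, and two behaviours are assumptions
that the thread does not confirm (a temporary failure under the limit leaves
the page marked unfetched so it is tried again, and reaching the limit marks
it gone).

```java
// Illustrative sketch only: hypothetical stand-ins, not Nutch's real classes.
final class RecrawlUpdateSketch {

    // Outcome of the latest fetch attempt. GONE mirrors STATUS_FETCH_GONE
    // from the thread; RETRY is an assumed name for a temporary failure.
    enum FetchStatus { SUCCESS, GONE, RETRY }

    // State kept in the db. GONE mirrors STATUS_DB_GONE from the thread.
    enum DbStatus { UNFETCHED, FETCHED, GONE }

    // A db entry: its status plus the number of failed attempts so far.
    static final class DbEntry {
        final DbStatus status;
        final int retries;

        DbEntry(DbStatus status, int retries) {
            this.status = status;
            this.retries = retries;
        }
    }

    // Computes the new db entry from the old one and the latest fetch result.
    // retryMax plays the role of db.fetch.retry.max.
    static DbEntry update(DbEntry old, FetchStatus fetch, int retryMax) {
        switch (fetch) {
            case SUCCESS:
                // A good fetch resets the retry counter.
                return new DbEntry(DbStatus.FETCHED, 0);
            case GONE:
                // A 404 overrides the earlier successful state.
                return new DbEntry(DbStatus.GONE, old.retries);
            case RETRY: {
                int retries = old.retries + 1;
                // Assumption: under the limit the page is tried again later;
                // at the limit it is given up on and marked gone.
                DbStatus next = retries < retryMax ? DbStatus.UNFETCHED : DbStatus.GONE;
                return new DbEntry(next, retries);
            }
            default:
                throw new IllegalArgumentException("unknown fetch status: " + fetch);
        }
    }

    public static void main(String[] args) {
        DbEntry entry = new DbEntry(DbStatus.FETCHED, 0);
        // Simulate three recrawls that all fail temporarily, with a limit of 3.
        for (int i = 1; i <= 3; i++) {
            entry = update(entry, FetchStatus.RETRY, 3);
            System.out.println("recrawl " + i + ": " + entry.status + ", retries=" + entry.retries);
        }
    }
}
```

Under these assumptions the earlier successful state is not kept once the
page is reported gone; only the status recorded in the db changes.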

On Tue, Aug 21, 2012 at 6:18 PM, weishenyun <wl...@yahoo.com.cn> wrote:

> Hi IT_ailen,
>        I know what 404 means, and I also know about the adaptive fetch
> schedule. But I want to know what Nutch does when it meets errors during a
> recrawl. As another example, say a page was fetched successfully and then
> recrawled three times, and all three recrawls returned 404 or other
> errors. Will Nutch use the error response to update the information from
> the earlier successful fetch?



-- 
Don't Grow Old, Grow Up... :-)

Re: What is the Nutch page-update mechanism after recrawl

Posted by weishenyun <wl...@yahoo.com.cn>.
Hi IT_ailen,
       I know what 404 means, and I also know about the adaptive fetch
schedule. But I want to know what Nutch does when it meets errors during a
recrawl. As another example, say a page was fetched successfully and then
recrawled three times, and all three recrawls returned 404 or other
errors. Will Nutch use the error response to update the information from
the earlier successful fetch?




Re: What is the Nutch page-update mechanism after recrawl

Posted by IT_ailen <zy...@gmail.com>.
Hello, weishenyun,
     As far as I know, when re-crawling a page, Nutch sends some additional
parameters to the destination server (such as the last update time), with
which the server can decide either to return status 304 (unchanged) or to
respond with the newly modified page. Status 404 means the page you are
fetching is gone.
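
In plain HTTP, the exchange described above is a conditional GET: the client
sends an If-Modified-Since header carrying the time of its last fetch, and
the server answers 304 if nothing has changed since then. The sketch below
only builds such a request with the JDK's own java.net.http API (Java 11+)
and does not send it. It is not Nutch code; the class name, method names, and
example URL are made up for illustration, and whether Nutch sends this header
depends on the protocol plugin and version in use.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Illustrative sketch only: a hypothetical helper, not part of Nutch.
final class ConditionalGetSketch {

    // HTTP dates use a fixed English format and are always expressed in GMT.
    private static final DateTimeFormatter HTTP_DATE =
        DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
            .withZone(ZoneOffset.UTC);

    // Builds a GET request that asks for the page only if it changed after lastFetch.
    static HttpRequest conditionalGet(URI uri, Instant lastFetch) {
        return HttpRequest.newBuilder(uri)
            .header("If-Modified-Since", HTTP_DATE.format(lastFetch))
            .GET()
            .build();
    }

    // 304 Not Modified means the copy from the earlier fetch is still current.
    static boolean isUnchanged(int statusCode) {
        return statusCode == 304;
    }

    public static void main(String[] args) {
        HttpRequest request = conditionalGet(
            URI.create("http://example.com/page.html"),   // placeholder URL
            Instant.parse("2012-08-21T09:44:51Z"));
        System.out.println(request.headers().firstValue("If-Modified-Since").orElse("(none)"));
    }
}
```

A 304 response carries no body, so a crawler that receives it can keep the
content it already stored.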

Best regards,
     Ailen
On 2012-08-21 17:44, weishenyun [via Lucene] wrote:
> Hi everyone,
>        I want to know how Nutch updates a page after a recrawl. For
> example, a page was fetched successfully and stored in the DB or file
> system by the last crawl command, but it returns 404 when the same page
> is recrawled. Will Nutch use this 404 response to update the information
> stored from the earlier successful fetch? What about other cases, such as
> 301, 302, or 503?
>       Thanks in advance.





-----
I'm what I am.