Posted to user@nutch.apache.org by julien <ju...@hotmail.fr> on 2015/03/18 16:25:30 UTC

How to get the status page after crawl?

Hello,

After a Nutch crawl, I would like to retrieve the status of every URL.
Do you know how I can get the HTTP status codes of the URLs (200, 404, 503, ...)?

Thanks



--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-the-status-page-after-crawl-tp4193761.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: [MASSMAIL]Re: How to get the status page after crawl?

Posted by Jorge Luis Betancourt González <jl...@uci.cu>.

In short, you're after a parse plugin. These are quite common, since we usually want to extract more data from a web page than just all of its text in one field. My general advice is to browse the source code of the plugins shipped with Nutch; they are usually a great starting point. You could also pick the plugin most similar to the one you're trying to write and build on it. If the plugins you're studying come with tests, even better: you can learn a lot about a plugin by taking a peek at its tests.

One of the great things about Nutch is that you can write your plugins in peace without even looking at the source of the parts you don't need, which is nice for beginners.
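
To make the idea of "more data than just all the text in one field" concrete, here is a minimal, self-contained sketch. It deliberately does not use Nutch's real plugin interfaces: `ParseFilter`, `TitleFilter`, `parse` and the field names `content` and `title` are all hypothetical stand-ins, chosen only to show the general shape of a parse-stage extension that adds one extra field. A real plugin should follow the shipped plugins and the wiki pages instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtraFieldSketch {

    /** Hypothetical stand-in for a parse-stage extension point; NOT Nutch's real interface. */
    interface ParseFilter {
        void filter(String html, Map<String, String> fields);
    }

    /** Copies the page's <title> text into its own field, separate from the body text. */
    static class TitleFilter implements ParseFilter {
        private static final Pattern TITLE = Pattern.compile(
                "<title[^>]*>(.*?)</title>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

        @Override
        public void filter(String html, Map<String, String> fields) {
            Matcher m = TITLE.matcher(html);
            if (m.find()) {
                // Assumed field name "title"; a real plugin would pick its own key.
                fields.put("title", m.group(1).trim());
            }
        }
    }

    /** Builds the default single text field, then lets the filter add extra fields. */
    static Map<String, String> parse(String html, ParseFilter filter) {
        Map<String, String> fields = new LinkedHashMap<>();
        // Crude tag stripping, for illustration only: all text ends up in one field.
        String text = html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
        fields.put("content", text);
        filter.filter(html, fields);
        return fields;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Status page</title></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.println(parse(html, new TitleFilter()));
    }
}
```

The point of the sketch is only the separation of concerns: the filter sees the raw page, adds its field, and never touches the rest of the pipeline.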

Regards,

----- Original Message -----
From: "Mohammed Omer" <be...@gmail.com>
To: user@nutch.apache.org, "julien alvez" <ju...@hotmail.fr>
Sent: Thursday, March 19, 2015 12:09:29 AM
Subject: [MASSMAIL]Re: How to get the status page after crawl?

There's a straightforward tutorial on writing a plugin to add custom
fields, albeit based on v0.9, at
http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html

More info on plugins at:
http://wiki.apache.org/nutch/AboutPlugins
http://wiki.apache.org/nutch/WritingPluginExample-1.2

Thank you,

Mo

Re: How to get the status page after crawl?

Posted by Mohammed Omer <be...@gmail.com>.

There's a straightforward tutorial on writing a plugin to add custom
fields, albeit based on v0.9, at
http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html

More info on plugins at:
http://wiki.apache.org/nutch/AboutPlugins
http://wiki.apache.org/nutch/WritingPluginExample-1.2

Thank you,

Mo
