Posted to user@nutch.apache.org by kiran chitturi <ch...@gmail.com> on 2013/03/05 18:37:46 UTC

Parse statistics in Nutch

Hi!

We already get fetcher statistics using (readdb -stats), but could we also
include parse statistics in that output?

It would be very helpful to know how many documents are successfully
parsed, and we could try different methods to reparse if we see a lot of
failing documents.

The only way I know to find out how many documents were parsed is to check
Solr for how many documents are indexed.

What do you think of this?

-- 
Kiran Chitturi

Re: Parse statistics in Nutch

Posted by kiran chitturi <ch...@gmail.com>.

Thanks, Lewis. I will give this a try.


On Tue, Mar 5, 2013 at 12:59 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> There are a few things you can do, Kiran.
> My preference is to use custom counters for successfully and unsuccessfully
> parsed docs within the ParserJob or equivalent. I would be surprised if
> this were not already there, however.
> It is not much trouble to add counters to something like this. We already
> do it in InjectorJob, for instance, to make explicit the number of filtered
> URLs and the number of URLs injected after filtering and normalization.
>
> --
> *Lewis*
>



-- 
Kiran Chitturi

Re: Parse statistics in Nutch

Posted by Lewis John Mcgibbney <le...@gmail.com>.

There are a few things you can do, Kiran.
My preference is to use custom counters for successfully and unsuccessfully
parsed docs within the ParserJob or equivalent. I would be surprised if
this were not already there, however.
It is not much trouble to add counters to something like this. We already
do it in InjectorJob, for instance, to make explicit the number of filtered
URLs and the number of URLs injected after filtering and normalization.
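
As a rough illustration of the counter idea, here is a minimal sketch that
tallies parse successes and failures. It is not taken from the Nutch source:
the class name, the "ParserStatus" group, the "success"/"failed" counter names
and the parse() rule are all hypothetical, and the small Counters class is only
a stand-in so the example runs without Hadoop. Inside a real Hadoop mapper the
equivalent call would be context.getCounter(group, name).increment(1).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ParseCounterSketch {

    // Hypothetical counter group and names; not taken from the Nutch source.
    static final String GROUP = "ParserStatus";
    static final String SUCCESS = "success";
    static final String FAILED = "failed";

    /** Stand-in for Hadoop job counters so the sketch runs without Hadoop. */
    static class Counters {
        private final Map<String, Long> values = new TreeMap<>();

        void increment(String group, String name, long amount) {
            values.merge(group + "." + name, amount, Long::sum);
        }

        long get(String group, String name) {
            return values.getOrDefault(group + "." + name, 0L);
        }
    }

    /** Hypothetical parser: treats null or blank content as a parse failure. */
    static boolean parse(String content) {
        return content != null && !content.trim().isEmpty();
    }

    /** Parses each document and records the outcome in the counters. */
    static Counters parseAll(List<String> documents) {
        Counters counters = new Counters();
        for (String doc : documents) {
            // In a Hadoop mapper, each branch would instead call
            // context.getCounter(GROUP, SUCCESS or FAILED).increment(1).
            if (parse(doc)) {
                counters.increment(GROUP, SUCCESS, 1);
            } else {
                counters.increment(GROUP, FAILED, 1);
            }
        }
        return counters;
    }

    public static void main(String[] args) {
        Counters c = parseAll(Arrays.asList("<html>ok</html>", "", "plain text", "   "));
        System.out.println("success=" + c.get(GROUP, SUCCESS));
        System.out.println("failed=" + c.get(GROUP, FAILED));
    }
}
```

With real Hadoop counters, the totals would show up in the job's counter
output once the job finishes, so no separate pass over the data is needed to
see how many documents failed to parse.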

-- 
*Lewis*