Posted to dev@nutch.apache.org by Jaydeep Bagrecha <ba...@usc.edu> on 2015/02/08 23:22:36 UTC

572:Crawl statistics for each repository ?

Is there a way to crawl all 3 repositories together and get statistics for each one individually?

OR

Do we have to crawl each repository separately and get its statistics from corresponding crawldb?

Thanks,
Jaydeep



Re: 572:Crawl statistics for each repository ?

Posted by feng lu <am...@gmail.com>.
Hi Jaydeep

You can use the following command to get statistics for each host when using one
database to crawl multiple repositories.

bin/nutch readdb crawldb/crawldb/ -stats -sort
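
If the per-host summary needs further processing (for example, regrouping
hosts into repositories), one option is to dump the crawldb as text with
"bin/nutch readdb <crawldb> -dump <out_dir>" and count statuses per host in a
small script. The sketch below is only an illustration: it assumes each dumped
record starts with the URL followed by a tab and contains a line such as
"Status: 2 (db_fetched)", and the function name stats_per_host and the
example.org hosts are made up for this example.

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urlparse

# Assumed record layout of a plain-text crawldb dump (not verified against
# every Nutch version):
#   <url>\tVersion: 7
#   Status: 2 (db_fetched)
#   ...
STATUS_RE = re.compile(r"^Status:\s*\d+\s*\((\w+)\)")


def stats_per_host(lines):
    """Count crawldb status names per host from the lines of a text dump."""
    counts = defaultdict(Counter)
    host = None
    for line in lines:
        # A record is assumed to start with the URL at the beginning of a line.
        if line.startswith(("http://", "https://")):
            url = line.split("\t", 1)[0].strip()
            host = urlparse(url).hostname
            continue
        match = STATUS_RE.match(line)
        if match and host is not None:
            counts[host][match.group(1)] += 1
            host = None  # count at most one status per record
    return counts


# Hypothetical sample input standing in for the contents of a dump file.
sample = """http://repo-a.example.org/\tVersion: 7
Status: 2 (db_fetched)
Fetch time: Mon Feb 09 00:00:00 UTC 2015

http://repo-a.example.org/data\tVersion: 7
Status: 1 (db_unfetched)

http://repo-b.example.org/\tVersion: 7
Status: 2 (db_fetched)
"""

if __name__ == "__main__":
    result = stats_per_host(sample.splitlines())
    for name in sorted(result):
        print(name, dict(result[name]))
```

To use it on a real dump, the lines of the dumped part files would be passed
in instead of the sample string. If one repository spans several hosts, the
per-host counters can be summed afterwards.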

On Mon, Feb 9, 2015 at 12:01 PM, Jaydeep Bagrecha <ba...@usc.edu> wrote:

> Thanks.
>
> *P.S*
> The question was:-
> *Given M (repo)repositories(M corresponding seedlist urls),find crawl
> statistics(number of fetched/unfetched urls,etc)for each repo separately?*
>
> So,Is there a way to crawl all M repo together(include eg:-domain name of
> all m in regex-urlfilter.txt file) and get statistics for each one
> individually.
>
> OR
>
>
> Do we have to crawl each repo separately(include domain name of  only 1
> repo in regex-urlfilter.txt)and get its statistics from corresponding
> crawldb?
>
>
>
>
>
> Thanks,
> Jaydeep Bagrecha


-- 
Don't Grow Old, Grow Up... :-)

Re: 572:Crawl statistics for each repository ?

Posted by Jaydeep Bagrecha <ba...@usc.edu>.
Thanks.

P.S.
The question was:
Given M repositories (with M corresponding seed list URLs), how can we find crawl statistics (number of fetched/unfetched URLs, etc.) for each repository separately?

So, is there a way to crawl all M repositories together (e.g., by including the domain names of all M in the regex-urlfilter.txt file) and get statistics for each one individually?

OR

Do we have to crawl each repository separately (including the domain name of only one repository in regex-urlfilter.txt) and get its statistics from the corresponding crawldb?





Thanks,
Jaydeep Bagrecha



> On Feb 8, 2015, at 6:24 PM, Mattmann, Chris A (3980) <ch...@jpl.nasa.gov> wrote:
> 
> Hi Jaydeep,
> 
> Please qualify what this question is about - I know what it’s
> about but you have provided very little detail for anyone else
> on this list to discern it.
> 
> The short answer is no: crawldb stats are per crawl.
> 
> Cheers,
> Chris
> 

Re: 572:Crawl statistics for each repository ?

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Hi Jaydeep,

Please qualify what this question is about - I know what it’s
about but you have provided very little detail for anyone else
on this list to discern it.

The short answer is no: crawldb stats are per crawl.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





