Posted to user@nutch.apache.org by imehesz <im...@gmail.com> on 2013/02/25 20:28:12 UTC

Nutch status info on each domain individually

hello,

I can finally run Nutch (+Solr) with Java. My only remaining question is: how can
I check whether a particular domain has been crawled?

Let's say I have 300 sites to crawl and index.
So far my work-around was to execute a simple Solr query for each domain
URL, and see if the indexing timestamp in the Solr DB is greater than the
Nutch crawling start date-time. It works, but I'm curious if there is a
better way to do this. 
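
The per-domain check described above could be scripted roughly as follows. This is a minimal sketch, not a tested integration: the Solr URL, the `host` and `tstamp` field names (worth checking against your own schema.xml) and the helper names `build_query` and `indexed_since` are all assumptions.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed Solr endpoint; adjust to your installation.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_query(host, crawl_start):
    """Build a Solr URL that counts documents of `host` indexed since `crawl_start`."""
    since = crawl_start.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    params = {
        "q": "*:*",
        # Assumed field names: `host` (site host) and `tstamp` (indexing time).
        "fq": ['host:"%s"' % host, "tstamp:[%s TO *]" % since],
        "rows": 0,   # only the hit count is needed, not the documents
        "wt": "json",
    }
    return SOLR_SELECT + "?" + urlencode(params, doseq=True)


def indexed_since(host, crawl_start):
    """Return how many documents of `host` were indexed after `crawl_start`."""
    with urlopen(build_query(host, crawl_start)) as resp:
        return json.load(resp)["response"]["numFound"]
```

Calling `indexed_since(host, start)` for each of the 300 hosts and listing those that return 0 would give the sites that were not indexed in the current run.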

thanks,
--iM



--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-status-info-on-each-domain-individually-tp4042815.html
Sent from the Nutch - User mailing list archive at Nabble.com.

RE: Nutch status info on each domain individually

Posted by Markus Jelsma <ma...@openindex.io>.
Well, you can always use the DomainStatistics utility to get the raw numbers on hosts, domains and TLDs, but this won't tell you whether a domain has been fully crawled, because the crawling frontier can always change.

You can be sure that everything (disregarding URL filters) has been crawled if no more records are selected for fetching before the already fetched records become eligible for refetch (after the default interval).
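
As a rough illustration of that criterion: the sketch below assumes a record is selected once its next fetch time has passed, and it uses a hypothetical `(url, next_fetch_time)` layout for data pulled out of the crawldb. The function names are made up for this example and are not part of Nutch.

```python
from datetime import datetime, timedelta


def due_records(records, now):
    """Return the records that would still be selected at time `now`.

    `records` is an iterable of (url, next_fetch_time) pairs (hypothetical layout).
    """
    return [(url, when) for url, when in records if when <= now]


def frontier_exhausted(records, now):
    """True if nothing is due, i.e. every known URL is waiting for its refetch."""
    return not due_records(records, now)


now = datetime(2013, 2, 25, 20, 0)
records = [
    ("http://example.com/", now + timedelta(days=30)),       # fetched, refetch later
    ("http://example.com/new", now - timedelta(minutes=5)),  # newly found, due now
]
print(frontier_exhausted(records, now))  # the second record is still due
```

Since every fetch can discover new links, the check only holds for the moment it is run.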

NUTCH-1325 does a better job of providing per-host stats than the current DomainStatistics, but the patch is uncommitted. It'll work, though.

https://issues.apache.org/jira/browse/NUTCH-1325
 
-----Original message-----
> From:Tejas Patil <te...@gmail.com>
> Sent: Mon 25-Feb-2013 20:46
> To: user@nutch.apache.org
> Subject: Re: Nutch status info on each domain individually
> 
> I can't think of any existing Nutch utility which can be used here. Maybe dumping
> the crawldb and then grepping over it would be reasonable if the number
> of hosts is large and the crawldb is small. This will be a bad idea if it
> has to be done after every Nutch cycle on a large crawldb.
> 
> If you are ready to write a little code, then it becomes easy:
> 1. Write some code to query the index so that you do not have to do that
> manually. OR
> 2. Write a MapReduce job to read the crawldb, wherein the mapper emits the
> host of each URL.
> 
> #1 is the better deal in terms of execution time.
> 
> Thanks,
> Tejas Patil
> 
> 

Re: Nutch status info on each domain individually

Posted by Tejas Patil <te...@gmail.com>.
I can't think of any existing Nutch utility which can be used here. Maybe dumping
the crawldb and then grepping over it would be reasonable if the number
of hosts is large and the crawldb is small. This will be a bad idea if it
has to be done after every Nutch cycle on a large crawldb.

If you are ready to write a little code, then it becomes easy:
1. Write some code to query the index so that you do not have to do that
manually. OR
2. Write a MapReduce job to read the crawldb, wherein the mapper emits the
host of each URL.

#1 is the better deal in terms of execution time.
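
For a small crawldb, the dump-and-grep idea and the mapper from option 2 can be approximated in a few lines of plain Python. This is only a sketch: it assumes a text dump (for example from `bin/nutch readdb <crawldb> -dump <out_dir>`) in which each record starts with a `<url><TAB>Version: N` line followed by a `Status: N (db_...)` line. The sample data and function names are hypothetical.

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urlsplit

STATUS_RE = re.compile(r"^Status:\s+\d+\s+\((\w+)\)")


def map_records(lines):
    """Mapper: yield (host, status) for each record of a text crawldb dump."""
    host = None
    for line in lines:
        if "\tVersion:" in line:
            # Record header: the URL comes before the first tab.
            host = urlsplit(line.split("\t", 1)[0]).hostname
        else:
            match = STATUS_RE.match(line)
            if match and host:
                yield host, match.group(1)
                host = None


def reduce_counts(pairs):
    """Reducer: count the statuses seen for each host."""
    counts = defaultdict(Counter)
    for host, status in pairs:
        counts[host][status] += 1
    return counts


# Hypothetical excerpt of a dump, in the assumed format.
SAMPLE = """http://example.com/\tVersion: 7
Status: 2 (db_fetched)

http://example.com/a\tVersion: 7
Status: 1 (db_unfetched)

http://example.org/\tVersion: 7
Status: 2 (db_fetched)
"""

counts = reduce_counts(map_records(SAMPLE.splitlines()))
for host, per_status in sorted(counts.items()):
    print(host, dict(per_status))
```

A host with only `db_fetched` records and none `db_unfetched` has nothing pending at the time of the dump.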

Thanks,
Tejas Patil

