Posted to user@nutch.apache.org by Benjamin Higgins <bh...@gmail.com> on 2006/04/18 19:52:54 UTC

Index statistics

Hi, I looked through the FAQ but found nothing about getting basic index
statistics, like quite simply, how many pages are in the index.

How can I figure that out?

Thanks,
Ben

Re: Index statistics

Posted by TDLN <di...@gmail.com>.

Luke (http://www.getopt.org/luke/) comes in handy for those purposes.

Rgrds, Thomas


--
D-SEN Software Engineering - www.dsen.nl

Re: Index statistics

Posted by Andrzej Bialecki <ab...@getopt.org>.

TDLN wrote:
> I think the nutch readdb command only gives statistics for the crawldb
> (crawled Pages) and not the index.
>
>   

That's correct. You can use the Lucene API to retrieve the number of
documents in the index; it's quite simple.

Something like this (you can use BeanShell (BSH) to run it as a script, or
compile it as its own class). I'm writing this from memory, so there may be
some errors:

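import org.apache.lucene.index.IndexReader;

public class IndexStats {
    public static void main(String[] args) throws Exception {
        // args[0] is the path to the Lucene index directory.
        IndexReader ir = IndexReader.open(args[0]);
        try {
            // numDocs() counts the documents in the index, excluding deleted ones.
            System.out.println("Number of documents: " + ir.numDocs());
        } finally {
            // Release the index files even if reading the count fails.
            ir.close();
        }
    }
}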


-- 
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: Index statistics

Posted by TDLN <di...@gmail.com>.

I think the nutch readdb command only gives statistics for the crawldb
(crawled Pages) and not the index.

Rgrds, Thomas

On 4/18/06, Michael Levy <Lu...@gmail.com> wrote:
> Ben, how about this:
> bin/nutch readdb crawled/db -stats
> where crawled is the directory holding the index?
>
> Here's a good article that covers the topic:
> http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
>
> Benjamin Higgins wrote:
> > Hi, I looked through the FAQ but found nothing about getting basic index
> > statistics, like quite simply, how many pages are in the index.
> >
> > How can I figure that out?
> >
> > Thanks,
> > Ben
> >
> >
>


--
D-SEN Software Engineering - www.dsen.nl

Re: Index statistics

Posted by Michael Levy <Lu...@gmail.com>.

Ben, how about this:
bin/nutch readdb crawled/db -stats
where crawled is the directory holding the index?

Here's a good article that covers the topic:
http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
