Posted to user@nutch.apache.org by Benjamin Higgins <bh...@gmail.com> on 2006/04/18 19:52:54 UTC
Index statistics
Hi, I looked through the FAQ but found nothing about getting basic index
statistics, such as, quite simply, how many pages are in the index.
How can I figure that out?
Thanks,
Ben
Re: Index statistics
Posted by TDLN <di...@gmail.com>.
Luke (http://www.getopt.org/luke/) comes in handy for those purposes.
Rgrds, Thomas
--
D-SEN Software Engineering - www.dsen.nl
Re: Index statistics
Posted by Andrzej Bialecki <ab...@getopt.org>.
TDLN wrote:
> I think the nutch readdb command only gives statistics for the crawldb
> (crawled pages) and not the index.
>
>
That's correct. You can use the Lucene API to retrieve the number of
documents in the index; it's quite simple.
Something like this (you can use BeanShell (BSH) to run it as a script, or
compile it as its own class). I'm writing this from memory, so there may be
some errors:
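import org.apache.lucene.index.IndexReader;

public class IndexStats {
    public static void main(String[] args) throws Exception {
        // args[0]: path to the Lucene index directory
        IndexReader ir = IndexReader.open(args[0]);
        try {
            // numDocs() counts live documents; deleted ones are excluded
            System.out.println("Number of documents: " + ir.numDocs());
        } finally {
            ir.close();
        }
    }
}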
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Index statistics
Posted by TDLN <di...@gmail.com>.
I think the nutch readdb command only gives statistics for the crawldb
(crawled pages) and not the index.
Rgrds, Thomas
Re: Index statistics
Posted by Michael Levy <Lu...@gmail.com>.
Ben, how about this:
bin/nutch readdb crawled/db -stats
where crawled is the directory holding the index?
Here's a good article that covers the topic:
http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html