Posted to user@nutch.apache.org by Bryan Woliner <br...@gmail.com> on 2005/12/06 02:26:48 UTC

Number of URLs in segment fetchlist vs. Number of URLs in index

How is the number of URLs in a group of segments' fetchlists related
to the number of URLs in an index?

Specifically, when I call the following command using the "segments2"
directory, I find out that there are 166 entries in 15 segments:

$ bin/nutch segread -list -dir segments

However, when I tried to prune the index of the same "segments2"
directory, using the following command, it tells me that 15 of 45
documents have been deleted:

$ bin/nutch org.apache.nutch.tools.PruneIndexTool segments2 -dryrun \
    -queries queries.txt -showfields url,title

-------------------------

What I don't understand is how the number of entries went from 166
in the fetchlists for this folder of segments to only 45 in the
indexes. I'm positive that there were not 121 duplicate URLs (or
anywhere near that amount).

Thanks,
Bryan