Posted to user@nutch.apache.org by Paul Tomblin <pt...@xcski.com> on 2009/08/11 20:10:02 UTC
How do I get all the documents in the index without searching?
I want to iterate through all the documents that are in the crawl,
programmatically. The only code I can find does searches. I don't
want to search for a term, I want everything. Is there a way to do
this?
--
http://www.linkedin.com/in/paultomblin
Re: How do I get all the documents in the index without searching?
Posted by Alex McLintock <al...@gmail.com>.
Try looking at how the indexers work. They *do* iterate through all
the documents in the crawl (or rather, one segment at a time). However,
they do it the Hadoop way, as MapReduce jobs...
2009/8/11 Paul Tomblin <pt...@xcski.com>:
> I want to iterate through all the documents that are in the crawl,
> programmatically. The only code I can find does searches. I don't
> want to search for a term, I want everything. Is there a way to do
> this?
Re: How do I get all the documents in the index without searching?
Posted by Paul Tomblin <pt...@xcski.com>.
On Tue, Aug 11, 2009 at 2:10 PM, Paul Tomblin<pt...@xcski.com> wrote:
> I want to iterate through all the documents that are in the crawl,
> programmatically. The only code I can find does searches. I don't
> want to search for a term, I want everything. Is there a way to do
> this?
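To answer my own question, what I ended up doing was:
// Needs org.apache.lucene.index.IndexReader and org.apache.lucene.document.Document
IndexReader reader = IndexReader.open(indexDir.getAbsolutePath());
// Document numbers run up to maxDoc(); numDocs() is smaller once documents are deleted
for (int i = 0; i < reader.maxDoc(); i++)
{
    if (reader.isDeleted(i)) continue; // skip deleted documents
    Document doc = reader.document(i);
}
reader.close();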
Now that I have the Document, I have to figure out how to process it
further to get the actual contents, but I assume that I need to go
back to the segment for that.
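
A note on loop bounds when walking a Lucene index by document number: ids
run from 0 to maxDoc() - 1, and deleted documents leave holes, so numDocs()
(the count of live documents) can be smaller than the highest id in use. The
toy model below is plain Java with no Lucene dependency. ToyIndex and
liveDocuments() are hypothetical names made up for illustration; only the
method names maxDoc(), numDocs(), isDeleted() and document() mirror
IndexReader. Treat it as a sketch of the idea, not as Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical stand-in for a Lucene index; not part of Lucene or Nutch.
final class ToyIndex {
    private final String[] docs;                 // slot i holds the document with id i
    private final BitSet deleted = new BitSet(); // bit i set means id i is deleted

    ToyIndex(String... docs) {
        this.docs = docs.clone();
    }

    void delete(int id) {
        if (id < 0 || id >= docs.length) {
            throw new IndexOutOfBoundsException("no such document id: " + id);
        }
        deleted.set(id);
    }

    boolean isDeleted(int id) {
        return deleted.get(id);
    }

    // One greater than the largest possible document id.
    int maxDoc() {
        return docs.length;
    }

    // Number of live (non-deleted) documents.
    int numDocs() {
        return docs.length - deleted.cardinality();
    }

    String document(int id) {
        if (isDeleted(id)) {
            throw new IllegalArgumentException("deleted document: " + id);
        }
        return docs[id];
    }

    // Visit every live document: bound the loop by maxDoc() and skip the holes.
    List<String> liveDocuments() {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < maxDoc(); i++) {
            if (isDeleted(i)) {
                continue;
            }
            out.add(document(i));
        }
        return out;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex("a", "b", "c", "d");
        index.delete(1);
        System.out.println(index.numDocs() + " live of " + index.maxDoc()); // 3 live of 4
        System.out.println(index.liveDocuments());                          // [a, c, d]
    }
}
```

In this example a loop bounded by numDocs() would stop after id 2, so it
would never reach "d", and it would also try to load the deleted id 1.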
--
http://www.linkedin.com/in/paultomblin