Posted to user@nutch.apache.org by Paul Tomblin <pt...@xcski.com> on 2009/08/11 20:10:02 UTC

How do I get all the documents in the index without searching?

I want to iterate through all the documents that are in the crawl,
programmatically.  The only code I can find does searches.  I don't
want to search for a term; I want everything.  Is there a way to do
this?

-- 
http://www.linkedin.com/in/paultomblin

Re: How do I get all the documents in the index without searching?

Posted by Alex McLintock <al...@gmail.com>.

Try looking at how the indexers work. They *do* iterate through all
the documents in the crawl (or rather, one segment at a time). However,
they do it in a Hadoop way...



2009/8/11 Paul Tomblin <pt...@xcski.com>:
> I want to iterate through all the documents that are in the crawl,
> programmatically.  The only code I can find does searches.  I don't
> want to search for a term; I want everything.  Is there a way to do
> this?

Re: How do I get all the documents in the index without searching?

Posted by Paul Tomblin <pt...@xcski.com>.

On Tue, Aug 11, 2009 at 2:10 PM, Paul Tomblin<pt...@xcski.com> wrote:
> I want to iterate through all the documents that are in the crawl,
> programmatically.  The only code I can find does searches.  I don't
> want to search for a term; I want everything.  Is there a way to do
> this?

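To answer my own question, what I ended up doing was:
            // Needs org.apache.lucene.index.IndexReader and
            // org.apache.lucene.document.Document.
            IndexReader reader = IndexReader.open(indexDir.getAbsolutePath());
            // Document numbers run up to maxDoc(); numDocs() counts only live docs.
            for (int i = 0; i < reader.maxDoc(); i++)
            {
                if (reader.isDeleted(i)) continue;  // skip deleted slots
                Document doc = reader.document(i);
            }
            reader.close();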
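
A note on bounding a loop over document numbers: in Lucene, document
numbers run from 0 to maxDoc() - 1, while numDocs() counts only the
documents that have not been deleted. Once an index contains deletions
(Nutch's dedup step can mark duplicates as deleted), a loop bounded by
numDocs() can stop before the last live document, and fetching a deleted
slot is, as far as I know, rejected by the reader. The sketch below is a
toy model of that behaviour, not Lucene code: ToyIndex and all of its
methods are hypothetical stand-ins that only mirror the names maxDoc,
numDocs, isDeleted and document.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for an index reader. This is NOT the Lucene API; it only
// models the difference between maxDoc() and numDocs().
public class ToyIndex {
    private final String[] slots;     // one entry per document number
    private final boolean[] deleted;  // true if that slot has been deleted

    public ToyIndex(String[] slots, boolean[] deleted) {
        this.slots = slots;
        this.deleted = deleted;
    }

    // One greater than the largest document number.
    public int maxDoc() { return slots.length; }

    // Number of live (non-deleted) documents.
    public int numDocs() {
        int live = 0;
        for (boolean d : deleted) {
            if (!d) live++;
        }
        return live;
    }

    public boolean isDeleted(int i) { return deleted[i]; }

    // Refuses deleted slots (assumed here to mirror the real reader).
    public String document(int i) {
        if (deleted[i]) {
            throw new IllegalArgumentException("deleted document " + i);
        }
        return slots[i];
    }

    // Visit every live document: bound by maxDoc(), skip deleted slots.
    public List<String> allLive() {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < maxDoc(); i++) {
            if (isDeleted(i)) continue;
            out.add(document(i));
        }
        return out;
    }

    public static void main(String[] args) {
        ToyIndex idx = new ToyIndex(
            new String[] {"a", "b", "c", "d"},
            new boolean[] {false, true, false, false});
        System.out.println(idx.numDocs() + " live of " + idx.maxDoc());
        System.out.println(idx.allLive());
    }
}
```

With slot 1 deleted, numDocs() is 3 and maxDoc() is 4, so a loop that
stops at numDocs() would never reach slot 3 and would hit the deleted
slot 1 on the way; the maxDoc()/isDeleted() loop returns a, c and d.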

Now that I have the Document, I have to figure out how to process it
further to get the actual contents, but I assume that I need to go
back to the segment for that.



-- 
http://www.linkedin.com/in/paultomblin