Posted to java-user@lucene.apache.org by Georger Araujo <ge...@gmail.com> on 2011/02/12 15:07:32 UTC

Iterating over all documents in an index

Hi,
I want to iterate over all documents in a given index. I've found the
following piece of code [1]:

IndexReader reader = // create IndexReader
for (int i=0; i<reader.maxDoc(); i++) {
    if (reader.isDeleted(i))
        continue;

    Document doc = reader.document(i);
    String docId = doc.get("docId");

    // do something with docId here...
}

I implemented it in my code and it worked fine. After that, I found out
about MatchAllDocsQuery.
I am not concerned with scoring or sorting - all I want to do is iterate
over all documents in the index and collect their terms. My ultimate goal is
to build a bag-of-words representation of all documents and their terms so
that I can run a clustering algorithm on it. I've also found out about
Mahout's built-in vector creation utility [2], but I need to do this task
from my own code.
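
For reference, a minimal sketch of what the MatchAllDocsQuery route could
look like against the Lucene 3.x API used in this thread (the anonymous
Collector and the "docId" field carry over from the snippet above; treat it
as an illustration, not a drop-in):

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Scorer;

IndexSearcher searcher = new IndexSearcher(reader);
searcher.search(new MatchAllDocsQuery(), new Collector() {
    private IndexReader segmentReader;

    @Override
    public void setScorer(Scorer scorer) {
        // scoring is irrelevant when visiting every document
    }

    @Override
    public void setNextReader(IndexReader reader, int docBase) {
        this.segmentReader = reader; // per-segment reader
    }

    @Override
    public void collect(int doc) throws IOException {
        // doc is segment-local; deleted documents are never passed in
        Document d = segmentReader.document(doc);
        String docId = d.get("docId");
        // do something with docId here...
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return true; // iteration order does not matter here
    }
});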

What is the recommended approach?

[1] http://stackoverflow.com/questions/2311845/is-it-possible-to-iterate-through-documents-stored-in-lucene-index
[2] https://cwiki.apache.org/confluence/display/MAHOUT/Creating%20Vectors%20from%20Text

Regards,

Georger

Re: Iterating over all documents in an index

Posted by Georger Araujo <ge...@gmail.com>.
Hi Erick,
Thanks for the input. I didn't really do the doc.get() in my code, because
what I'm really looking for are the stemmed terms in the index. I only
borrowed the idea from the snippet. Here's what my code looks like now:

Map<String,Integer> termMap = new TreeMap<String,Integer>();
List<String> filenames = new ArrayList<String>( reader.maxDoc() );

for ( int i = 0; i < reader.maxDoc(); i++ ) {
    if ( reader.isDeleted( i ) )
        continue;

    Document doc = reader.document( i );
    // add(), not add(i, ...): once a deleted doc has been skipped, the
    // list is shorter than i and add(i, ...) throws IndexOutOfBoundsException
    filenames.add( doc.get( "filename" ) );

    // null when the field was indexed without term vectors
    TermFreqVector tfVector = reader.getTermFreqVector( i, "filecontents" );
    if ( tfVector == null )
        continue;

    String[] terms = tfVector.getTerms(); // fetch the array once
    for ( int j = 0; j < terms.length; j++ ) {
        // Add terms to the map...
    }
}
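
For concreteness, the elided loop body might look like this (a sketch; it
assumes termMap is meant to hold document frequencies, i.e. how many
documents contain each term):

for ( int j = 0; j < terms.length; j++ ) {
    Integer count = termMap.get( terms[j] );
    // first sighting of a term starts at 1; otherwise bump the doc count
    termMap.put( terms[j], count == null ? 1 : count + 1 );
}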

The point is that I need to read the term frequency vector for each
document. I actually want the stemmed terms. From that and the filenames
(stored in the "filename" field), I was able to build a matrix that looks
like this:
        term1  term2  ...  termn
doc1      1      1    ...    0
doc2      0      1    ...    0
...
docn      1      0    ...    1

I already have this working. I just don't know if I'm doing it the
simplest/quickest way, though - and that's going to be an issue because I
intend to run the clustering algorithm on matrices as big as 5000 x 100000.
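
An aside at that scale (a common approach, not something from the thread):
a 5000 x 100000 term-document matrix is almost entirely zeros, so a sparse
row per document keeps memory manageable. A minimal sketch, reusing the
tfVector from the loop above and a hypothetical global termIndex map:

Map<String,Integer> termIndex = new HashMap<String,Integer>(); // term -> column
Map<Integer,Integer> row = new TreeMap<Integer,Integer>();     // column -> freq

String[] terms = tfVector.getTerms();
int[] freqs = tfVector.getTermFrequencies(); // parallel to getTerms()
for ( int j = 0; j < terms.length; j++ ) {
    Integer col = termIndex.get( terms[j] );
    if ( col == null ) {
        col = termIndex.size();          // next free column
        termIndex.put( terms[j], col );
    }
    row.put( col, freqs[j] );            // or 1 for a purely binary matrix
}
// collect one such row per live document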
Regards,

Georger

2011/2/12 Erick Erickson <er...@gmail.com>

> Be aware that when you do a doc.get(), the fields are the
> *stored* fields in their original, unanalyzed form. Is that really
> what you want? Or do you want the tokenized form of the fields?
>
> If the latter, you might look at the Luke code; it reconstructs all the fields
> in the document from the terms that are actually indexed. Note two
> things: 1> it's slow. You're really undoing all the work that went into
> inverting the index in the first place.
> 2> it's lossy. For instance, a term that's been stemmed will only have
> the stemmed version in the index. Is that OK?
>
> Best
> Erick
>
> On Sat, Feb 12, 2011 at 9:07 AM, Georger Araujo
> <ge...@gmail.com> wrote:
> > Hi,
> > I want to iterate over all documents in a given index. I've found the
> > following piece of code [1]:
> >
> > IndexReader reader = // create IndexReader
> > for (int i=0; i<reader.maxDoc(); i++) {
> >    if (reader.isDeleted(i))
> >        continue;
> >
> >    Document doc = reader.document(i);
> >    String docId = doc.get("docId");
> >
> >    // do something with docId here...
> > }
> >
> > I implemented it in my code and it worked fine. After that, I found out
> > about MatchAllDocsQuery.
> > I am not concerned with scoring or sorting - all I want to do is iterate
> > over all documents in the index and collect their terms. My ultimate goal
> > is to build a bag-of-words representation of all documents and their terms
> > so that I can run a clustering algorithm on it. I've also found out about
> > Mahout's built-in vector creation utility [2], but I need to do this task
> > from my own code.
> >
> > What is the recommended approach?
> >
> > [1] http://stackoverflow.com/questions/2311845/is-it-possible-to-iterate-through-documents-stored-in-lucene-index
> > [2] https://cwiki.apache.org/confluence/display/MAHOUT/Creating%20Vectors%20from%20Text
> >
> > Regards,
> >
> > Georger
> >

Re: Iterating over all documents in an index

Posted by Erick Erickson <er...@gmail.com>.
Be aware that when you do a doc.get(), the fields are the
*stored* fields in their original, unanalyzed form. Is that really
what you want? Or do you want the tokenized form of the fields?

If the latter, you might look at the Luke code; it reconstructs all the fields
in the document from the terms that are actually indexed. Note two
things: 1> it's slow. You're really undoing all the work that went into
inverting the index in the first place.
2> it's lossy. For instance, a term that's been stemmed will only have
the stemmed version in the index. Is that OK?
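
(For anyone curious what that reconstruction looks like in practice, here
is a rough sketch against the Lucene 3.x API of this thread - the field
name "filecontents" and the document number docNo are assumptions. It also
shows why it's slow: every indexed term of the field is scanned just to
rebuild one document.)

import java.io.IOException;
import java.util.TreeMap;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.TermPositions;

// Rebuild the token stream of document docNo for one field by scanning
// every indexed term of that field - the Luke-style approach.
TreeMap<Integer,String> tokens = new TreeMap<Integer,String>();
TermEnum termEnum = reader.terms(new Term("filecontents", ""));
try {
    do {
        Term t = termEnum.term();
        if (t == null || !"filecontents".equals(t.field()))
            break; // ran past the field's terms
        TermPositions tp = reader.termPositions(t);
        try {
            if (tp.skipTo(docNo) && tp.doc() == docNo) {
                for (int k = 0; k < tp.freq(); k++)
                    tokens.put(tp.nextPosition(), t.text());
            }
        } finally {
            tp.close();
        }
    } while (termEnum.next());
} finally {
    termEnum.close();
}
// tokens now maps position -> (stemmed) term text, i.e. the lossy
// reconstruction described above.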

Best
Erick

On Sat, Feb 12, 2011 at 9:07 AM, Georger Araujo
<ge...@gmail.com> wrote:
> Hi,
> I want to iterate over all documents in a given index. I've found the
> following piece of code [1]:
>
> IndexReader reader = // create IndexReader
> for (int i=0; i<reader.maxDoc(); i++) {
>    if (reader.isDeleted(i))
>        continue;
>
>    Document doc = reader.document(i);
>    String docId = doc.get("docId");
>
>    // do something with docId here...
> }
>
> I implemented it in my code and it worked fine. After that, I found out
> about MatchAllDocsQuery.
> I am not concerned with scoring or sorting - all I want to do is iterate
> over all documents in the index and collect their terms. My ultimate goal is
> to build a bag-of-words representation of all documents and their terms so
> that I can run a clustering algorithm on it. I've also found out about
> Mahout's built-in vector creation utility [2], but I need to do this task
> from my own code.
>
> What is the recommended approach?
>
> [1] http://stackoverflow.com/questions/2311845/is-it-possible-to-iterate-through-documents-stored-in-lucene-index
> [2] https://cwiki.apache.org/confluence/display/MAHOUT/Creating%20Vectors%20from%20Text
>
> Regards,
>
> Georger
>
