Posted to java-user@lucene.apache.org by Chun Wei Ho <cw...@gmail.com> on 2006/01/26 09:15:12 UTC

Getting the document number (with IndexReader)

I am attempting to prune an index by getting each document in turn and
then checking/deleting it:

IndexReader ir = IndexReader.open(path);
for(int i=0;i<ir.numDocs();i++) {
	Document doc = ir.document(i);
	if(thisDocShouldBeDeleted(doc)) {
		ir.delete(docNum); // <- I need the docNum for doc.
	}
}

How do I get the docNum for the IndexReader.delete() function in the above
case? Is there an API function I am missing? I am working with a merged
index over different segments so the docNum might not be in running
sequence with the counter i.

In general, is there a better way to do this sort of thing?

Thanks!

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Getting the document number (with IndexReader)

Posted by Paul Elschot <pa...@xs4all.nl>.
On Friday 27 January 2006 02:36, Chun Wei Ho wrote:
> Thanks for the info :) One last related question.
> 
> If I delete documents using an IndexReader, can I assume that the
> internal document numbers of other undeleted documents (obtained using
> the same IndexReader instance) will not change until I call
> IndexReader.close()?

On the same IndexReader, yes.

Regards,
Paul Elschot



Re: Getting the document number (with IndexReader)

Posted by Chun Wei Ho <cw...@gmail.com>.
Thanks for the info :) One last related question.

If I delete documents using an IndexReader, can I assume that the
internal document numbers of other undeleted documents (obtained using
the same IndexReader instance) will not change until I call
IndexReader.close()?



Re: Getting the document number (with IndexReader)

Posted by Paul Elschot <pa...@xs4all.nl>.
On Thursday 26 January 2006 19:44, Chris Hostetter wrote:
> 
> : > The document number is the variable i in this case.
> : If the document number is the variable i (enumerated from numDocs()),
> : what's the difference between numDocs() and maxDoc() in this case? I
> : was previously under the impression that the internal docNum might be
> : different to the counter.
> 
> Iterating between 0 and maxDoc()-1 will give you the range of all possible
> doc ids, but some of those docs may have already been deleted.  I believe
> that is what you want to do ... you can check whether a doc is deleted using
> IndexReader.isDeleted(i).
> 
> numDocs() is implemented as maxDoc() - deletedDocs.count(), so I don't
> think it ever makes sense to iterate up to numDocs().
> 
> : I'm doing something akin to a rangeQuery, where I delete documents
> : within a certain range (in addition to other criteria). Is it better
> : to do a query on the range, mark all the docNums getting them with
> : Hits.id(), and then retrieve docs and test for deletion according to
> : that?
> 
> Take a look at the way RangeFilter.bits() is implemented.  If you
> cut/paste that code and replace the call to bits.set(termDocs.doc()); with
> reader.delete(termDocs.doc()); I think you'll have exactly what you want.
> 
> Or, since cutting/pasting code is "A Bad Thing" from a maintenance/bug
> fixing standpoint, you could just call RangeFilter.bits(reader) yourself,
> and then iterate over the set bits and call delete on each one.

Perhaps an extra rewrite method with a term visitor argument?

Regards,
Paul Elschot



Re: Getting the document number (with IndexReader)

Posted by Chris Hostetter <ho...@fucit.org>.
: > The document number is the variable i in this case.
: If the document number is the variable i (enumerated from numDocs()),
: what's the difference between numDocs() and maxDoc() in this case? I
: was previously under the impression that the internal docNum might be
: different to the counter.

Iterating between 0 and maxDoc()-1 will give you the range of all possible
doc ids, but some of those docs may have already been deleted.  I believe
that is what you want to do ... you can check whether a doc is deleted using
IndexReader.isDeleted(i).

numDocs() is implemented as maxDoc() - deletedDocs.count(), so I don't
think it ever makes sense to iterate up to numDocs().
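
To make the numDocs()/maxDoc() distinction concrete, here is a minimal
sketch that does not use Lucene at all. ToyReader and PruneDemo are
hypothetical stand-ins written only for illustration; they assume the
behaviour described above (deleting marks a slot, document numbers are never
reassigned on the same reader, and fetching a deleted document is an error).

```java
import java.util.BitSet;

// Hypothetical stand-in for an index reader; this is NOT the Lucene API.
class ToyReader {
    private final String[] docs;                  // slot i holds the document numbered i
    private final BitSet deleted = new BitSet();  // bit i set means document i is deleted

    ToyReader(String... docs) { this.docs = docs; }

    // One greater than the largest document number, deleted slots included.
    int maxDoc() { return docs.length; }

    // Number of live (non-deleted) documents.
    int numDocs() { return docs.length - deleted.cardinality(); }

    boolean isDeleted(int n) { return deleted.get(n); }

    // Marks the slot as deleted; no other document is renumbered.
    void delete(int n) { deleted.set(n); }

    String document(int n) {
        if (deleted.get(n)) {
            throw new IllegalArgumentException("document " + n + " is deleted");
        }
        return docs[n];
    }
}

class PruneDemo {
    // Deletes every live document whose text starts with prefix; returns how many were deleted.
    static int prune(ToyReader reader, String prefix) {
        int removed = 0;
        for (int i = 0; i < reader.maxDoc(); i++) {   // bound is maxDoc(), not numDocs()
            if (reader.isDeleted(i)) {
                continue;                             // skip slots that are already deleted
            }
            if (reader.document(i).startsWith(prefix)) {
                reader.delete(i);                     // the loop counter is the document number
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        ToyReader reader = new ToyReader("old-a", "new-b", "old-c", "new-d");
        reader.delete(1);                             // simulate an earlier deletion
        int removed = prune(reader, "old");
        // prints "removed=2 numDocs=1 maxDoc=4"
        System.out.println("removed=" + removed
                + " numDocs=" + reader.numDocs() + " maxDoc=" + reader.maxDoc());
    }
}
```

Had the loop stopped at numDocs() (3 after the first deletion), slot 3 would
never have been examined, which is the problem with the loop bound in the
original question.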

: I'm doing something akin to a rangeQuery, where I delete documents
: within a certain range (in addition to other criteria). Is it better
: to do a query on the range, mark all the docNums getting them with
: Hits.id(), and then retrieve docs and test for deletion according to
: that?

Take a look at the way RangeFilter.bits() is implemented.  If you
cut/paste that code and replace the call to bits.set(termDocs.doc()); with
reader.delete(termDocs.doc()); I think you'll have exactly what you want.

Or, since cutting/pasting code is "A Bad Thing" from a maintenance/bug
fixing standpoint, you could just call RangeFilter.bits(reader) yourself,
and then iterate over the set bits and call delete on each one.
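
Iterating over the set bits can be sketched with plain java.util.BitSet,
with no Lucene dependency. SetBitsDemo and setBits are hypothetical names
used only for this illustration; in a real pruning loop the marked line
would be the place to call delete on the reader.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class SetBitsDemo {
    // Returns the index of every set bit in ascending order.
    static List<Integer> setBits(BitSet bits) {
        List<Integer> docNums = new ArrayList<>();
        // nextSetBit returns -1 once there are no more set bits.
        for (int doc = bits.nextSetBit(0); doc >= 0; doc = bits.nextSetBit(doc + 1)) {
            docNums.add(doc);   // a pruning loop would delete document `doc` here
        }
        return docNums;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet();
        bits.set(2);
        bits.set(5);
        bits.set(7);
        System.out.println(setBits(bits));   // prints "[2, 5, 7]"
    }
}
```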


-Hoss




Re: Getting the document number (with IndexReader)

Posted by Paul Elschot <pa...@xs4all.nl>.
On Thursday 26 January 2006 09:47, Chun Wei Ho wrote:
> Hi,
> 
> Thanks for the help, just a few more questions:
> 
> On 1/26/06, Paul Elschot <pa...@xs4all.nl> wrote:
> > On Thursday 26 January 2006 09:15, Chun Wei Ho wrote:
> > > I am attempting to prune an index by getting each document in turn and
> > > then checking/deleting it:
> > >
> > > IndexReader ir = IndexReader.open(path);
> > > for(int i=0;i<ir.numDocs();i++) {
> > >       Document doc = ir.document(i);
> > >       if(thisDocShouldBeDeleted(doc)) {
> > >               ir.delete(docNum); // <- I need the docNum for doc.
> > >       }
> > > }
> > >
> > > How do I get the docNum for the IndexReader.delete() function in the above
> > > case? Is there an API function I am missing? I am working with a merged
> >
> > The document number is the variable i in this case.
> If the document number is the variable i (enumerated from numDocs()),
> what's the difference between numDocs() and maxDoc() in this case? I
> was previously under the impression that the internal docNum might be
> different to the counter.

Iirc, the difference between maxDoc() and numDocs() is the number of
deleted documents. Check the javadocs to be sure.

> 
> > > index over different segments so the docNum might not be in running
> > > sequence with the counter i.
> > > In general, is there a better way to do this sort of thing?
> >
> > This code:
> >
> >         Document doc = ir.document(i);
> >
> > normally retrieves all the stored fields of the document and that is
> > quite costly. In case you know that the document(s) to be deleted
> > match(es) a Term, it's better to use IndexReader.delete(Term).
> 
> I'm doing something akin to a rangeQuery, where I delete documents
> within a certain range (in addition to other criteria). Is it better
> to do a query on the range, mark all the docNums getting them with
> Hits.id(), and then retrieve docs and test for deletion according to
> that?

In that case it is faster to use the Terms generated inside the range query
and then use these on IndexReader.delete(Term).
To generate the terms have a look at the source code of the rewrite()
method of RangeQuery here:
http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/java/org/apache/lucene/search/
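
The idea of walking the terms inside a range and deleting by each term can
be sketched without Lucene, assuming the term dictionary is kept in sorted
order. ToyRangeDelete, deleteRange and the sample date strings in main are
hypothetical and only illustrate the approach; this is not the code of
RangeQuery.rewrite().

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model of a sorted term dictionary; this is NOT the Lucene API.
class ToyRangeDelete {
    // terms maps each term text to the document numbers that contain it.
    // Returns the document numbers marked deleted for terms in [lower, upper], both inclusive.
    static BitSet deleteRange(TreeMap<String, int[]> terms, String lower, String upper) {
        BitSet deleted = new BitSet();
        // subMap visits only the terms inside the range, in sorted order.
        for (Map.Entry<String, int[]> entry : terms.subMap(lower, true, upper, true).entrySet()) {
            for (int doc : entry.getValue()) {
                deleted.set(doc);   // stands in for deleting by this term
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        TreeMap<String, int[]> terms = new TreeMap<>();
        terms.put("20060101", new int[] {0});
        terms.put("20060115", new int[] {1, 4});
        terms.put("20060201", new int[] {2});
        terms.put("20060301", new int[] {3});
        System.out.println(deleteRange(terms, "20060110", "20060201"));   // prints "{1, 2, 4}"
    }
}
```

No stored fields are read at any point; only the term dictionary and the
per-term document lists are touched.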

Regards,
Paul Elschot



Re: Getting the document number (with IndexReader)

Posted by Chun Wei Ho <cw...@gmail.com>.
Hi,

Thanks for the help, just a few more questions:

On 1/26/06, Paul Elschot <pa...@xs4all.nl> wrote:
> On Thursday 26 January 2006 09:15, Chun Wei Ho wrote:
> > I am attempting to prune an index by getting each document in turn and
> > then checking/deleting it:
> >
> > IndexReader ir = IndexReader.open(path);
> > for(int i=0;i<ir.numDocs();i++) {
> >       Document doc = ir.document(i);
> >       if(thisDocShouldBeDeleted(doc)) {
> >               ir.delete(docNum); // <- I need the docNum for doc.
> >       }
> > }
> >
> > How do I get the docNum for the IndexReader.delete() function in the above
> > case? Is there an API function I am missing? I am working with a merged
>
> The document number is the variable i in this case.
If the document number is the variable i (enumerated from numDocs()),
what's the difference between numDocs() and maxDoc() in this case? I
was previously under the impression that the internal docNum might be
different to the counter.

> > index over different segments so the docNum might not be in running
> > sequence with the counter i.
> > In general, is there a better way to do this sort of thing?
>
> This code:
>
>         Document doc = ir.document(i);
>
> normally retrieves all the stored fields of the document and that is
> quite costly. In case you know that the document(s) to be deleted
> match(es) a Term, it's better to use IndexReader.delete(Term).

I'm doing something akin to a rangeQuery, where I delete documents
within a certain range (in addition to other criteria). Is it better
to do a query on the range, mark all the docNums getting them with
Hits.id(), and then retrieve docs and test for deletion according to
that?

Thanks for the help



Re: Getting the document number (with IndexReader)

Posted by Paul Elschot <pa...@xs4all.nl>.
On Thursday 26 January 2006 09:15, Chun Wei Ho wrote:
> I am attempting to prune an index by getting each document in turn and
> then checking/deleting it:
> 
> IndexReader ir = IndexReader.open(path);
> for(int i=0;i<ir.numDocs();i++) {
> 	Document doc = ir.document(i);
> 	if(thisDocShouldBeDeleted(doc)) {
> 		ir.delete(docNum); // <- I need the docNum for doc.
> 	}
> }
> 
> How do I get the docNum for the IndexReader.delete() function in the above
> case? Is there an API function I am missing? I am working with a merged

The document number is the variable i in this case.

> index over different segments so the docNum might not be in running
> sequence with the counter i.
> 
> In general, is there a better way to do this sort of thing?

This code:

 	Document doc = ir.document(i);

normally retrieves all the stored fields of the document and that is
quite costly. In case you know that the document(s) to be deleted
match(es) a Term, it's better to use IndexReader.delete(Term).
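
To show why deleting by term avoids loading stored fields, here is a minimal
sketch of a toy inverted index. ToyTermIndex, addPosting and the
"status:stale" style term strings are hypothetical and not part of Lucene;
the sketch only assumes that each term maps directly to the document numbers
that contain it.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy inverted index; this is NOT the Lucene API.
class ToyTermIndex {
    private final Map<String, int[]> postings = new HashMap<>();  // term -> document numbers
    private final BitSet deleted = new BitSet();

    void addPosting(String term, int... docNums) {
        postings.put(term, docNums);
    }

    // Marks every document listed under the term as deleted.
    // Returns how many documents were newly marked.
    int delete(String term) {
        int count = 0;
        for (int doc : postings.getOrDefault(term, new int[0])) {
            if (!deleted.get(doc)) {
                deleted.set(doc);
                count++;
            }
        }
        return count;
    }

    boolean isDeleted(int doc) { return deleted.get(doc); }
}

class DeleteByTermDemo {
    public static void main(String[] args) {
        ToyTermIndex index = new ToyTermIndex();
        index.addPosting("status:stale", 0, 3);
        index.addPosting("status:fresh", 1, 2);
        System.out.println(index.delete("status:stale"));   // prints "2"
        System.out.println(index.isDeleted(3));              // prints "true"
    }
}
```

The lookup goes from term to document numbers directly, so no document
contents are ever fetched.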

Regards,
Paul Elschot

