Posted to dev@lucene.apache.org by Giulio Cesare Solaroli <gi...@gmail.com> on 2004/07/16 09:21:27 UTC

Deleting a document with an IndexWriter open

Dear developers,

is there any architectural reason why an IndexWriter could not
delete a document?

I understand that the IndexReader (besides its strange naming for this
feature) is the right class to use to delete a document, but this
raises a huge problem for me.

We add almost 50,000 documents a day, while deleting a similar number
of old documents over the same period.
We index new documents in batch every 5 minutes while deleting the old
ones and optimize the index twice a day, in order to keep good
performance for the queries and the number of index files under
control.

In this situation, I try to keep the same IndexWriter open as much as
possible, in order to avoid any unnecessary fragmentation of the
index.
Before indexing any document, I can check to see if the document has
already been inserted, but I am not able to delete it without closing
the IndexWriter, opening an IndexReader, deleting the document,
closing the IndexReader and opening the IndexWriter again.

This arrangement seems reasonable if updated documents are scarce, but
it does not seem feasible with a high rate of updated documents.

I would prefer to avoid deleting all updated documents from the index
before opening the IndexWriter because the updating and indexing
procedure would get much more complex, and because it would introduce a
significant time gap during which a previously available document is no
longer available in the index.

Do you confirm my idea that keeping an IndexWriter open as much as
possible while indexing batches of documents is a "good thing"?

Is there any chance of ever seeing a deleteDocument method in the
IndexWriter class, or should I start planning how to handle the update
of documents in another way?

Thank you very much for your attention.

Regards,

Giulio Cesare Solaroli

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Deleting a document with an IndexWriter open

Posted by Christoph Goller <go...@detego-software.de>.
Christiaan Fluit wrote:
> Christoph Goller wrote:
> 
>> 1) Keep an IndexReader/Searcher open on your index in order to guarantee
>> read access and a consistent index during the whole process.
>>
>> 2) Open a new IndexReader and delete all the documents that you want to
>> update.
>>
>> 3) Close the IndexReader (makes the deletions visible for any new
>> readers/writers but not for the still opened Searcher/Reader).
> 
> 
> But what happens when a query evaluated by the IndexReader/IndexSearcher 
> mentioned in (1) results in a hit that has already been deleted by the 
> second IndexReader? I.e. you try to access a document through one 
> IndexReader that has already been deleted by another IndexReader.
> 
> I would expect some kind of exception because the document data can no 
> longer be found, or am I missing something? I hope I am because your 
> solution does sound very attractive ;)

The IndexReader opened in step 1 keeps a copy of its version of the index
until it is closed. This means it will never see any changes made by
another reader or a writer.
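
The snapshot behaviour described here can be pictured with a small toy
model. This is only a sketch in plain Java, not Lucene code: ToyIndex and
ToyReader are hypothetical names, and copying the document set when a
reader is opened is an assumption made purely for illustration. It says
nothing about how Lucene itself keeps a reader's view stable.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model only: ToyIndex and ToyReader are hypothetical, not Lucene classes.
final class ToyIndex {
    private final Set<String> committed = new HashSet<>();

    void add(String id) { committed.add(id); }

    // A reader takes its own copy of the committed state when it is opened.
    ToyReader openReader() { return new ToyReader(this, committed); }

    // Called when a reader is closed: publishes its deletions to future readers.
    void commitDeletes(Set<String> deleted) { committed.removeAll(deleted); }
}

final class ToyReader {
    private final ToyIndex index;
    private final Set<String> view;                       // private snapshot
    private final Set<String> pendingDeletes = new HashSet<>();

    ToyReader(ToyIndex index, Set<String> committed) {
        this.index = index;
        this.view = new HashSet<>(committed);
    }

    boolean contains(String id) { return view.contains(id); }

    // Deletions are visible to this reader at once, to others only after close().
    void delete(String id) {
        if (view.remove(id)) pendingDeletes.add(id);
    }

    void close() { index.commitDeletes(pendingDeletes); }
}
```

In this model a searcher opened before the deletion keeps reporting the
document, while a reader opened after the deleting reader was closed no
longer sees it.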




Re: Deleting a document with an IndexWriter open

Posted by Christiaan Fluit <Ch...@aduna.biz>.
Christoph Goller wrote:
> 1) Keep an IndexReader/Searcher open on your index in order to guarantee
> read access and a consistent index during the whole process.
> 
> 2) Open a new IndexReader and delete all the documents that you want to
> update.
> 
> 3) Close the IndexReader (makes the deletions visible for any new
> readers/writers but not for the still opened Searcher/Reader).

But what happens when a query evaluated by the IndexReader/IndexSearcher 
mentioned in (1) results in a hit that has already been deleted by the 
second IndexReader? I.e. you try to access a document through one 
IndexReader that has already been deleted by another IndexReader.

I would expect some kind of exception because the document data can no 
longer be found, or am I missing something? I hope I am because your 
solution does sound very attractive ;)


Kind regards,

Chris
--




Re: Deleting a document with an IndexWriter open

Posted by Doug Cutting <cu...@apache.org>.
Christoph Goller wrote:
> Giulio Cesare Solaroli wrote:
>> is there any architectural reason why an IndexWriter could not
>> delete a document?
> 
> There are such reasons. Maybe Doug can give additional
> insight.

No, you did a great job of describing the issues.  Thanks!

Doug



Re: Deleting a document with an IndexWriter open

Posted by Dmitry Serebrennikov <dm...@earthlink.net>.
Doug Cutting wrote:

>
> Then you need to ensure that the index has no deletions, and 
> optimize it if it has any, to remove them.  This is probably most 
> safely done as the first step, rather than the last.

Good point. I didn't think about this.

>
> I'm not sure this method has many advantages over what Christoph 
> originally suggested in:
>
> http://www.mail-archive.com/lucene-dev%40jakarta.apache.org/msg06165.html

Yes, I agree that it's not too different. The main benefit I see, and I 
think this may be significant for some applications, is that in 
Christoph's original method new documents must be iterated over twice - 
in his steps 2 and 4. This may be a problem for some applications 
because it requires buffering newly arrived documents somewhere - 
something that Lucene will not directly help with. That means people may 
have to write substantial external code to support this usage (or 
perhaps use a database, file system, etc).

With the modification I'm proposing, the documents can be added to the 
index as they arrive. No buffering is required and documents are handled 
exactly once. The "buffering" occurs instead on document ids to be 
deleted, which is much easier to do and one can even use the BitSet 
class (or Filter) supplied with Lucene.
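
Buffering document numbers rather than documents can be sketched with
java.util.BitSet alone. The class below is a hypothetical helper, not
something Lucene provides; the name PendingDeletes and its methods are
assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical helper: remembers which document numbers to delete later.
final class PendingDeletes {
    private final BitSet marked = new BitSet();

    // Record a document number; marking the same number twice is harmless.
    void mark(int docNumber) { marked.set(docNumber); }

    int size() { return marked.cardinality(); }

    // Returns the marked numbers in ascending order and empties the buffer.
    List<Integer> drain() {
        List<Integer> result = new ArrayList<>();
        for (int n = marked.nextSetBit(0); n >= 0; n = marked.nextSetBit(n + 1)) {
            result.add(n);
        }
        marked.clear();
        return result;
    }
}
```

One bit per document keeps the buffer small even for large batches, and
the numbers are only meaningful for the reader that produced them.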

>
> Doug
>
>




Re: Deleting a document with an IndexWriter open

Posted by Giulio Cesare Solaroli <gi...@gmail.com>.
So, I was not thinking that much differently. :-]

Giulio Cesare

On Tue, 20 Jul 2004 14:37:11 +0200, Christoph Goller
<go...@detego-software.de> wrote:
> Giulio Cesare Solaroli wrote:
> > Hi all,
> >
> > I would like to submit a "think different" approach to this problem
> > for evaluation for you developers.
> >
> > Would it be possible to just mark the relevant documents as "deleted"
> > (instead of deleting them altogether) with an IndexWriter used for
> > inserting new documents?
> >
> > "marking" a document as deleted would leave it on the index, but it
> > would not include it in any result set.
> >
> > At a later time, an IndexReader could be opened to really delete all
> > "marked" documents.
> >
> > Is this approach compatible with the Lucene architecture?
> 
> It's already done in a quite similar fashion, but the roles of
> IndexReader and IndexWriter are exchanged. If you call delete with
> an IndexReader, the document(s) are only marked as deleted. You might
> even call undeleteAll to undo the deletion. It's only with the next
> explicit optimize or implicit merge from an IndexWriter that the
> document(s) are irreversibly deleted.
> 
> Christoph
> 
> 
> 
> 



Re: Deleting a document with an IndexWriter open

Posted by Christoph Goller <go...@detego-software.de>.
Giulio Cesare Solaroli wrote:
> Hi all,
> 
> I would like to submit a "think different" approach to this problem
> for evaluation for you developers.
> 
> Would it be possible to just mark the relevant documents as "deleted"
> (instead of deleting them altogether) with an IndexWriter used for
> inserting new documents?
> 
> "marking" a document as deleted would leave it on the index, but it
> would not include it in any result set.
> 
> At a later time, an IndexReader could be opened to really delete all
> "marked" documents.
> 
> Is this approach compatible with the Lucene architecture?

It's already done in a quite similar fashion, but the roles of
IndexReader and IndexWriter are exchanged. If you call delete with
an IndexReader, the document(s) are only marked as deleted. You might
even call undeleteAll to undo the deletion. It's only with the next
explicit optimize or implicit merge from an IndexWriter that the
document(s) are irreversibly deleted.
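
The difference between marking and physically removing a document can be
shown with a toy model. This is a sketch in plain Java, not Lucene's
implementation: ToySegment is a hypothetical class, and its merge()
method is only a stand-in for what an optimize or merge does to deleted
documents.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy model only: not how Lucene stores segments.
final class ToySegment {
    private final List<String> docs = new ArrayList<>();
    private final BitSet deleted = new BitSet();

    // Appends a document and returns its document number.
    int add(String doc) { docs.add(doc); return docs.size() - 1; }

    // Deleting only sets a mark; the document data stays in place.
    void delete(int docNumber) { deleted.set(docNumber); }

    // Clears every mark, so all documents become visible again.
    void undeleteAll() { deleted.clear(); }

    boolean isDeleted(int docNumber) { return deleted.get(docNumber); }

    int maxDoc() { return docs.size(); }                          // includes marked docs
    int numDocs() { return docs.size() - deleted.cardinality(); } // live docs only

    // Stand-in for optimize/merge: marked documents are dropped for good.
    ToySegment merge() {
        ToySegment merged = new ToySegment();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i)) merged.add(docs.get(i));
        }
        return merged;
    }
}
```

Before merge() an undeleteAll() brings every marked document back; after
it there is nothing left to bring back.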

Christoph




Re: Deleting a document with an IndexWriter open

Posted by Giulio Cesare Solaroli <gi...@gmail.com>.
Hi all,

I would like to submit a "think different" approach to this problem
for evaluation by you developers.

Would it be possible to just mark the relevant documents as "deleted"
(instead of deleting them altogether) with an IndexWriter used for
inserting new documents?

"marking" a document as deleted would leave it on the index, but it
would not include it in any result set.

At a later time, an IndexReader could be opened to really delete all
"marked" documents.

Is this approach compatible with the Lucene architecture?

Regards,

Giulio Cesare



On Mon, 19 Jul 2004 20:44:26 -0700, Doug Cutting <cu...@apache.org> wrote:
> Dmitry Serebrennikov wrote:
> > Doug Cutting wrote:
> >
> >> Dmitry Serebrennikov wrote:
> >>
> >>> So here's a modified sequence of operations, perhaps a bit more
> >>> efficient than proposed by Christoph:
> >>> 1) Open an IndexReader for searching - S. Keep it open until the
> >>> transaction is committed.
> >>> 2) Open a second IndexReader for deletions - D.
> >>> 3) Create a filter bitset F (or use any other mechanism for storing
> >>> document numbers to be deleted)
> >>> 4) Open an IndexWriter for new documents - W.
> >>> 5) As documents come in, add them using W. Find their old versions in
> >>> D and record their document numbers in F. D will not show any new
> >>> documents, only documents present at the time D was created.
> >>> 6) Close W.
> >>> 7) Use D to delete all documents marked in F.
> >>> 8) Close D.
> >>
> >>
> >>
> >> What happens if there are deletions in S and D, and then, in step 5,
> >> as documents are added to W and segments are merged, documents are
> >> renumbered?  Wouldn't that invalidate F?  Currently we don't permit
> >> one to delete documents from an IndexReader while an IndexWriter is
> >> open, to prevent this sort of thing.  Am I missing something?
> >
> >
> > I was assuming that there would never be deletions in S.
> 
> Then you need to ensure that the index has no deletions, and
> optimize it if it has any, to remove them.  This is probably most safely
> done as the first step, rather than the last.
> 
> I'm not sure this method has many advantages over what Christoph
> originally suggested in:
> 
> http://www.mail-archive.com/lucene-dev%40jakarta.apache.org/msg06165.html
> 
> 
> 
> Doug
> 



Re: Deleting a document with an IndexWriter open

Posted by Doug Cutting <cu...@apache.org>.
Dmitry Serebrennikov wrote:
> Doug Cutting wrote:
> 
>> Dmitry Serebrennikov wrote:
>>
>>> So here's a modified sequence of operations, perhaps a bit more 
>>> efficient than proposed by Christoph:
>>> 1) Open an IndexReader for searching - S. Keep it open until the 
>>> transaction is committed.
>>> 2) Open a second IndexReader for deletions - D.
>>> 3) Create a filter bitset F (or use any other mechanism for storing 
>>> document numbers to be deleted)
>>> 4) Open an IndexWriter for new documents - W.
>>> 5) As documents come in, add them using W. Find their old versions in 
>>> D and record their document numbers in F. D will not show any new 
>>> documents, only documents present at the time D was created.
>>> 6) Close W.
>>> 7) Use D to delete all documents marked in F.
>>> 8) Close D.
>>
>>
>>
>> What happens if there are deletions in S and D, and then, in step 5, 
>> as documents are added to W and segments are merged, documents are 
>> renumbered?  Wouldn't that invalidate F?  Currently we don't permit 
>> one to delete documents from an IndexReader while an IndexWriter is 
>> open, to prevent this sort of thing.  Am I missing something?
> 
> 
> I was assuming that there would never be deletions in S.

Then you need to ensure that the index has no deletions, and 
optimize it if it has any, to remove them.  This is probably most safely 
done as the first step, rather than the last.

I'm not sure this method has many advantages over what Christoph 
originally suggested in:

http://www.mail-archive.com/lucene-dev%40jakarta.apache.org/msg06165.html

Doug




Re: Deleting a document with an IndexWriter open

Posted by Dmitry Serebrennikov <dm...@earthlink.net>.
Doug Cutting wrote:

> Dmitry Serebrennikov wrote:
>
>> So here's a modified sequence of operations, perhaps a bit more 
>> efficient than proposed by Christoph:
>> 1) Open an IndexReader for searching - S. Keep it open until the 
>> transaction is committed.
>> 2) Open a second IndexReader for deletions - D.
>> 3) Create a filter bitset F (or use any other mechanism for storing 
>> document numbers to be deleted)
>> 4) Open an IndexWriter for new documents - W.
>> 5) As documents come in, add them using W. Find their old versions in 
>> D and record their document numbers in F. D will not show any new 
>> documents, only documents present at the time D was created.
>> 6) Close W.
>> 7) Use D to delete all documents marked in F.
>> 8) Close D.
>
>
> What happens if there are deletions in S and D, and then, in step 5, 
> as documents are added to W and segments are merged, documents are 
> renumbered?  Wouldn't that invalidate F?  Currently we don't permit 
> one to delete documents from an IndexReader while an IndexWriter is 
> open, to prevent this sort of thing.  Am I missing something?

I was assuming that there would never be deletions in S. As for D, 
since it was opened prior to W, I thought that would guarantee that it 
would not be affected by anything done in W, including optimizations.
Ok, I think I see the point of confusion. I wasn't suggesting that 
documents are actually deleted in step 5. Instead their doc ids are 
recorded for later deletion. Actual deletion occurs in step 7, after W 
is closed in 6. So, since S is never used for deletion, we have only one 
writer (W) or one reader with deletes (D) open at a time. I think this 
should work. No?

Dmitry.

>
> Doug
>




Re: Deleting a document with an IndexWriter open

Posted by Doug Cutting <cu...@apache.org>.
Dmitry Serebrennikov wrote:
> So here's a modified sequence of operations, perhaps a bit more 
> efficient than proposed by Christoph:
> 1) Open an IndexReader for searching - S. Keep it open until the 
> transaction is committed.
> 2) Open a second IndexReader for deletions - D.
> 3) Create a filter bitset F (or use any other mechanism for storing 
> document numbers to be deleted)
> 4) Open an IndexWriter for new documents - W.
> 5) As documents come in, add them using W. Find their old versions in D 
> and record their document numbers in F. D will not show any new 
> documents, only documents present at the time D was created.
> 6) Close W.
> 7) Use D to delete all documents marked in F.
> 8) Close D.

What happens if there are deletions in S and D, and then, in step 5, as 
documents are added to W and segments are merged, documents are 
renumbered?  Wouldn't that invalidate F?  Currently we don't permit one 
to delete documents from an IndexReader while an IndexWriter is open, to 
prevent this sort of thing.  Am I missing something?
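
The renumbering concern can be made concrete with a few lines of plain
Java. This is a toy illustration, not Lucene code: RenumberDemo and its
compact() method are hypothetical, and compaction here simply stands in
for a merge that drops previously deleted documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

final class RenumberDemo {
    // Stand-in for a merge: surviving documents are packed together, so
    // every document after a deleted one ends up with a smaller number.
    static List<String> compact(List<String> docs, BitSet deleted) {
        List<String> packed = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i)) packed.add(docs.get(i));
        }
        return packed;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("a", "b", "c", "d");
        BitSet deleted = new BitSet();
        deleted.set(1);                        // "b" was deleted earlier

        int recorded = docs.indexOf("c");      // F remembers this number for "c"
        List<String> merged = compact(docs, deleted);

        System.out.println("recorded number: " + recorded);
        System.out.println("document at that number after merge: " + merged.get(recorded));
    }
}
```

The number recorded for "c" names "d" once the deleted "b" has been
squeezed out, which is exactly how a stored set of document numbers
would become invalid.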

Doug



Re: Deleting a document with an IndexWriter open

Posted by Christoph Goller <go...@detego-software.de>.
Dmitry Serebrennikov wrote:
> Another solution that works well in some applications is to rely on 
> document number. This number will remain the same for the life of an 
> IndexReader. This number is also always larger for documents added 
> later. So given two documents with the same ID, the one with the highest 
> document number is the latest one. The rest can be deleted. One way to 
> store a list of documents easily is to use a filter (which could also be 
> serialized to disk if needed). This filter would only be valid for the 
> IndexReader used to create it.
> 
> So here's a modified sequence of operations, perhaps a bit more 
> efficient than proposed by Christoph:
> 1) Open an IndexReader for searching - S. Keep it open until the 
> transaction is committed.
> 2) Open a second IndexReader for deletions - D.
> 3) Create a filter bitset F (or use any other mechanism for storing 
> document numbers to be deleted)
> 4) Open an IndexWriter for new documents - W.
> 5) As documents come in, add them using W. Find their old versions in D 
> and record their document numbers in F. D will not show any new 
> documents, only documents present at the time D was created.
> 6) Close W.
> 7) Use D to delete all documents marked in F.
> 8) Close D.
> 
> Step 8 commits the transaction. At this point, another IndexReader S2 
> can be created and all new searches can go to that. Once all searches 
> using S are done, S can be closed.
> 
> Would this work? I think it might. Does anyone see any holes in this? This 
> can even allow multiple Ws to be used concurrently, and perhaps even 
> multiple machines can be utilized that write to the same index, but I'm 
> not sure if this is desirable.

The proposed mechanism could indeed be made thread-safe, and an efficient
multithreaded update would be possible. That's probably what you have in
mind. However, having more than one IndexWriter is not possible and not
required, since IndexWriter is already optimized for multithreading. Well,
I think you know this anyway; I add it just for other listeners.

> Yea, this would be a great thing to have available in Lucene...
> Dmitry.

One could add a class called IndexUpdate that could handle all that.
There should be a way to specify a field or set of fields for
identifying duplicate documents.

Christoph




Re: Deleting a document with an IndexWriter open

Posted by Dmitry Serebrennikov <dm...@earthlink.net>.
Another solution that works well in some applications is to rely on 
document number. This number will remain the same for the life of an 
IndexReader. This number is also always larger for documents added 
later. So given two documents with the same ID, the one with the highest 
document number is the latest one. The rest can be deleted. One way to 
store a list of documents easily is to use a filter (which could also be 
serialized to disk if needed). This filter would only be valid for the 
IndexReader used to create it.

So here's a modified sequence of operations, perhaps a bit more 
efficient than proposed by Christoph:
1) Open an IndexReader for searching - S. Keep it open until the 
transaction is committed.
2) Open a second IndexReader for deletions - D.
3) Create a filter bitset F (or use any other mechanism for storing 
document numbers to be deleted)
4) Open an IndexWriter for new documents - W.
5) As documents come in, add them using W. Find their old versions in D 
and record their document numbers in F. D will not show any new 
documents, only documents present at the time D was created.
6) Close W.
7) Use D to delete all documents marked in F.
8) Close D.

Step 8 commits the transaction. At this point, another IndexReader S2 
can be created and all new searches can go to that. Once all searches 
using S are done, S can be closed.
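
The sequence above can be walked through on a toy model. This is a sketch
in plain Java, not Lucene code: BatchUpdateSketch is a hypothetical name,
documents are reduced to their application key, and the sketch assumes
that the index holds no earlier deletions, so that nothing is renumbered
while W is open. The search reader S is left out because it takes no
part in the update itself.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy walk-through of steps 2 to 8; not Lucene code.
final class BatchUpdateSketch {
    // Returns the keys that are live after the batch, in document-number order.
    static List<String> apply(List<String> committed, List<String> incoming) {
        List<String> d = new ArrayList<>(committed);  // step 2: D sees only old documents
        BitSet f = new BitSet();                       // step 3: F
        List<String> w = new ArrayList<>();            // step 4: W collects new documents

        for (String key : incoming) {                  // step 5
            w.add(key);
            for (int n = 0; n < d.size(); n++) {
                if (d.get(n).equals(key)) f.set(n);    // old version found in D
            }
        }

        List<String> all = new ArrayList<>(d);         // step 6: closing W appends the
        all.addAll(w);                                 // new documents (no renumbering assumed)

        List<String> live = new ArrayList<>();         // steps 7 and 8: apply F, commit
        for (int n = 0; n < all.size(); n++) {
            if (!f.get(n)) live.add(all.get(n));
        }
        return live;
    }
}
```

Note that this toy does not remove duplicates inside a single incoming
batch; two new documents with the same key would both survive.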

Would this work? I think it might. Does anyone see any holes in this? This 
can even allow multiple Ws to be used concurrently, and perhaps even 
multiple machines can be utilized that write to the same index, but I'm 
not sure if this is desirable.

Yea, this would be a great thing to have available in Lucene...
Dmitry.


Christoph Goller wrote:

> Giulio Cesare Solaroli wrote:
>
>> I have been thinking about this for a while, but could not find out a
>> reasonable solution.
>> The basic problems are:
>> - where do I (safely) store the index of the documents that needs to 
>> be deleted?
>> - how can I uniquely identify the Lucene documents that I have to
>> delete, given that there are different Lucene document matching a
>> single "real" document?
>>
>> The second problem could be "easily" solved adding a kind of version
>> field (stored in the Lucene index) that is incremented every time a
>> new version of a document is inserted. In this way, when searching for
>> duplicated documents (using the "real" document ID) I will find a set
>> of Lucene documents and I could delete all but the one with the
>> highest version number.
>
>
> You need unique document ids. They may either be produced by the
> fulltext-Index (example 1) or they may come from outside (example 2):
>
> 1) You could use a unique id for every document added to the Lucene index
> (a kind of counter for the number of added documents). You have to provide
> this number by yourself. It is not provided by Lucene! We are doing this
> in some applications. This unique id is stored in a dedicated field and in
> your database you associate this unique id with your document. If you change
> your document in the database, you find the unique id there and thus you know
> which document to delete in the Lucene index. If the changed document is added
> to the Lucene-Index, you get a new unique id and store this one with the changed
> document in your database.
>
> 2) In another application we store a url of each document in the Lucene index.
> If the document underlying the url has changed, we know which document to delete
> in the Lucene index simply via the url and we store the new version of the
> document again with a url-field.
>
> Christoph
>
>




Re: Deleting a document with an IndexWriter open

Posted by Christoph Goller <go...@detego-software.de>.
Giulio Cesare Solaroli wrote:
> I have been thinking about this for a while, but could not find out a
> reasonable solution.
> The basic problems are:
> - where do I (safely) store the index of the documents that needs to be deleted?
> - how can I uniquely identify the Lucene documents that I have to
> delete, given that there are different Lucene document matching a
> single "real" document?
> 
> The second problem could be "easily" solved adding a kind of version
> field (stored in the Lucene index) that is incremented every time a
> new version of a document is inserted. In this way, when searching for
> duplicated documents (using the "real" document ID) I will find a set
> of Lucene documents and I could delete all but the one with the
> highest version number.

You need unique document ids. They may either be produced by the
fulltext-Index (example 1) or they may come from outside (example 2):

1) You could use a unique id for every document added to the Lucene index
(a kind of counter for the number of added documents). You have to provide
this number by yourself. It is not provided by Lucene! We are doing this
in some applications. This unique id is stored in a dedicated field and in
your database you associate this unique id with your document. If you change
your document in the database, you find the unique id there and thus you know
which document to delete in the Lucene index. If the changed document is added
to the Lucene-Index, you get a new unique id and store this one with the changed
document in your database.

2) In another application we store a url of each document in the Lucene index.
If the document underlying the url has changed, we know which document to delete
in the Lucene index simply via the url and we store the new version of the 
document again with a url-field.
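
The bookkeeping behind example 1 might look roughly like the following
sketch. It is plain Java with no Lucene calls: IndexIdRegistry is a
hypothetical helper, the in-memory map stands in for the database
column that would hold the id, and the counter is the application's own,
as described above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: hands out application-level ids.
final class IndexIdRegistry {
    private long nextId = 1;
    private final Map<String, Long> currentIdByDbKey = new HashMap<>();

    // Assigns a fresh id to the new version of a database record and
    // returns the id of the version it replaces, or -1 if there was none.
    long assign(String dbKey) {
        Long previous = currentIdByDbKey.put(dbKey, nextId++);
        return previous == null ? -1L : previous;
    }

    // Id of the version that should currently be in the index, or -1.
    long currentId(String dbKey) {
        return currentIdByDbKey.getOrDefault(dbKey, -1L);
    }
}
```

The value returned by assign() is the id to look up in the dedicated
field when deleting the old version; for example 2 the url itself plays
that role and no counter is needed.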

Christoph




Re: Deleting a document with an IndexWriter open

Posted by Giulio Cesare Solaroli <gi...@gmail.com>.
On Fri, 16 Jul 2004 15:07:11 +0200, Christoph Goller
<go...@detego-software.de> wrote:
> Giulio Cesare Solaroli wrote:
> > This is the main problem; in my current arrangement, it is quite
> > difficult to find out the documents that need to be updated in
> > advance; it would have been much easier to find out whether every
> > single document was a new entry or a document already present, and
> > thus to update (instead of insert).
> 
> You could perhaps do it the other way round, first add all modified
> documents and then delete the old versions.

I have been thinking about this for a while, but could not find a
reasonable solution.
The basic problems are:
- where do I (safely) store the index of the documents that need to be deleted?
- how can I uniquely identify the Lucene documents that I have to
delete, given that there are different Lucene documents matching a
single "real" document?

The second problem could be "easily" solved adding a kind of version
field (stored in the Lucene index) that is incremented every time a
new version of a document is inserted. In this way, when searching for
duplicated documents (using the "real" document ID) I will find a set
of Lucene documents and I could delete all but the one with the
highest version number.
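
A minimal sketch of the "delete all but the highest version" idea, in
plain Java without Lucene: StaleVersions is a hypothetical name, and the
two parallel arrays are an assumed stand-in for the stored "real" ID and
version fields of each Lucene document, indexed by document number.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

final class StaleVersions {
    // realIds[n] and versions[n] describe the document with number n.
    // Returns the document numbers of every version except the newest per real ID.
    static BitSet find(String[] realIds, int[] versions) {
        Map<String, Integer> newest = new HashMap<>();  // real ID -> number of newest doc
        BitSet stale = new BitSet();
        for (int n = 0; n < realIds.length; n++) {
            Integer best = newest.get(realIds[n]);
            if (best == null) {
                newest.put(realIds[n], n);              // first version seen
            } else if (versions[n] > versions[best]) {
                stale.set(best);                        // older version loses
                newest.put(realIds[n], n);
            } else {
                stale.set(n);                           // this one is the older version
            }
        }
        return stale;
    }
}
```

The resulting set only tells which documents to delete; it does not
answer where that set can be kept safely across a crash.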

The real problem is where to keep a list of documents to be deleted. I
could keep a list in memory, but if my application crashes (or, more
often, we kill it), I will have duplicate documents in the index.
I could store it in the DB (where all the real documents are), but in
this case I could only store the real ID, as the DocID of the Lucene
index could change.

This is probably feasible, but with quite a high overhead.

Giulio Cesare Solaroli



Re: Deleting a document with an IndexWriter open

Posted by Christoph Goller <go...@detego-software.de>.
Giulio Cesare Solaroli wrote:
> This is the main problem; in my current arrangement, it is quite
> difficult to find out the documents that need to be updated in
> advance; it would have been much easier to find out whether every
> single document was a new entry or a document already present, and
> thus to update (instead of insert).

You could perhaps do it the other way round, first add all modified
documents and then delete the old versions.




Re: Deleting a document with an IndexWriter open

Posted by Giulio Cesare Solaroli <gi...@gmail.com>.
Hi Christoph,

On Fri, 16 Jul 2004 13:50:51 +0200, Christoph Goller
<go...@detego-software.de> wrote:
>[snip on good reasons why an IndexWriter can not delete documents]
> 
>> 
> If you want to do several updates at the same time, the most efficient
> way would be to:
> 
> 1) Keep an IndexReader/Searcher open on your index in order to guarantee
> read access and a consistent index during the whole process.
> 
> 2) Open a new IndexReader and delete all the documents that you want to
> update.

This is the main problem; in my current arrangement, it is quite
difficult to find out in advance which documents need to be updated;
it would be much easier to find out, for every single document,
whether it is a new entry or a document already present, and thus
whether to update (instead of insert).

I can try to find a better way to list the updated documents, but I
was hoping to solve this problem by a different route.

[...]
> > Do you confirm my idea that keeping an IndexWriter open as much as
> > possible while indexing batches of documents is a "good thing"?
> 
> Yes. IndexWriter works with a RAMDirectory as cache. If you close
> it after each document and open a new one, you enforce unnecessary
> write operations to your hard disk.
> 
> > Is there any option to ever see a deleteDocument method in the
> > IndexWriter class
> 
> Probably not. I guess you either have to update every document separately
> as described in your email (open and close a reader and writer for each
> document), or do it in the way I describe above (more efficient).

I am not competent enough to suggest any possible solution for this
problem, but I hope that developers with the required knowledge will take
this option into consideration for future versions of Lucene, as it
would really simplify some tasks for users.



Re: Deleting a document with an IndexWriter open

Posted by Christoph Goller <go...@detego-software.de>.
Giulio Cesare Solaroli wrote:
> Dear developers,
> 
> is there any architectural reason why an IndexWriter could not
> delete a document?

There are such reasons. Maybe Doug can give additional
insight. Here is what I think:

One reason I see is that there is no such thing as a unique
document id in Lucene. The IndexReader is the object through which an
index is accessed, and search is also done through a reader. The document
ids used by one IndexReader/IndexSearcher instance are unique/valid only
with regard to this instance, and the reader/searcher has no way of
changing document ids. However, calling optimize on an index with
deletions will change document ids: some documents will have different
ids after the optimize than before. This has no effect on an existing
reader instance, only on IndexReader instances created after the
optimize.
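
A toy illustration of that renumbering, in plain Java. This is not
Lucene code: ToyDocIds and its methods are made up for the sketch, and
a "document id" here is simply a position in a list. Deleting only
marks a slot; the ids shift when optimize compacts the list.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (NOT Lucene code): a document id is just a list position. */
class ToyDocIds {
    private final List<String> keys = new ArrayList<>();
    private final List<Boolean> deleted = new ArrayList<>();

    /** Adds a document and returns the id it currently has. */
    int add(String key) {
        keys.add(key);
        deleted.add(false);
        return keys.size() - 1;
    }

    /** Marks a document as deleted; the ids of other documents stay put. */
    void delete(int docId) {
        deleted.set(docId, true);
    }

    /** Drops deleted documents, which shifts the ids of later documents. */
    void optimize() {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < keys.size(); i++) {
            if (!deleted.get(i)) kept.add(keys.get(i));
        }
        keys.clear();
        keys.addAll(kept);
        deleted.clear();
        for (int i = 0; i < keys.size(); i++) deleted.add(false);
    }

    /** Returns the current id of the live document with this key, or -1. */
    int idOf(String key) {
        for (int i = 0; i < keys.size(); i++) {
            if (!deleted.get(i) && keys.get(i).equals(key)) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        ToyDocIds index = new ToyDocIds();
        index.add("a");
        index.add("b");
        index.add("c");
        System.out.println(index.idOf("c")); // 2
        index.delete(1);                     // delete "b"
        System.out.println(index.idOf("c")); // still 2
        index.optimize();
        System.out.println(index.idOf("c")); // 1
    }
}
```

An id remembered before the optimize (2 for "c") would point at the
wrong slot afterwards, which is why an id is only meaningful relative
to the reader that handed it out.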

Of course an application can take care of unique document ids and store
them in a dedicated field. The ids could be, e.g., URLs. Whether this
unique id or some other term is used to specify the document(s) for
deletion, index access as provided by a reader is needed to do the
deletion. IndexWriter currently does not have these capabilities.

So the only solution to the update problem is to build a wrapper around
Lucene that handles reading, writing, and updating. And this is what you
are actually doing :-)

> I understand that the IndexReader (besides its strange naming for this
> feature) is the right class to use to delete a document, but this
> raises a huge problem for me.
> 
> We add almost 50.000 documents a day, while deleting a similar amount
> of old documents over the same period.
> We index new documents in batch every 5 minutes while deleting the old
> ones and optimize the index twice a day, in order to keep good
> performance for the queries and the number of index files under
> control.
> 
> In this situation, I try to keep the same IndexWriter open as much as
> possible, in order to avoid any unnecessary fragmentation of the
> index.
> Before indexing any document, I can check to see if the document has
> already been inserted, but I am not able to delete it without closing
> the IndexWriter, opening an IndexReader, deleting the document,
> closing the IndexReader and opening the IndexWriter again.
> 
> This arrangement seems reasonable if updated documents are scarce, but
> doesn't seem feasible to work with a high rate of updated documents.
> 
> I would prefer to avoid deleting all updated documents from the index
> before opening the IndexWriter because the updating and indexing
> procedure would get much more complex, and because I will introduce a
> significant time gap where a previously available document is no longer
> available in the index.

If you want to do several updates at the same time, the most efficient
way would be to:

1) Keep an IndexReader/Searcher open on your index in order to guarantee
read access and a consistent index during the whole process.

2) Open a new IndexReader and delete all the documents that you want to
update.

3) Close the IndexReader (makes the deletions visible for any new
readers/writers but not for the still opened Searcher/Reader).

4) Open an IndexWriter and add all modified documents.

5) Close the IndexWriter (makes the insertions visible for any new
readers/writers but not for the still opened Searcher/Reader).

6) Substitute the IndexReader/Searcher with a new one to make
the changes visible.
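
The six steps can be sketched with a toy model in plain Java. This is
not Lucene code: ToyIndex, Doc, openSearcher, deleteAll, addAll and
find are names invented for the sketch, and the only assumption carried
over from the steps above is that an open searcher keeps seeing the
index as it was when the searcher was opened.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Toy model (NOT Lucene code): every committed state is an immutable list. */
class ToyIndex {
    static final class Doc {
        final String key;
        final String body;
        Doc(String key, String body) { this.key = key; this.body = body; }
    }

    // Replaced wholesale on every commit, so old snapshots never change.
    private List<Doc> committed = Collections.emptyList();

    /** Steps 1 and 6: a searcher is the committed state at open time. */
    List<Doc> openSearcher() {
        return committed;
    }

    /** Steps 2 and 3: delete all documents with these keys, then commit. */
    void deleteAll(Set<String> keys) {
        List<Doc> next = new ArrayList<>();
        for (Doc d : committed) {
            if (!keys.contains(d.key)) next.add(d);
        }
        committed = Collections.unmodifiableList(next);
    }

    /** Steps 4 and 5: add all modified documents, then commit. */
    void addAll(List<Doc> docs) {
        List<Doc> next = new ArrayList<>(committed);
        next.addAll(docs);
        committed = Collections.unmodifiableList(next);
    }

    /** Looks a key up through a given searcher; null if not visible. */
    static String find(List<Doc> searcher, String key) {
        for (Doc d : searcher) {
            if (d.key.equals(key)) return d.body;
        }
        return null;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addAll(Collections.singletonList(new Doc("doc1", "old text")));

        List<Doc> searcher = index.openSearcher();           // step 1
        index.deleteAll(Collections.singleton("doc1"));      // steps 2-3
        index.addAll(Collections.singletonList(
                new Doc("doc1", "new text")));               // steps 4-5
        System.out.println(find(searcher, "doc1"));          // old text
        searcher = index.openSearcher();                     // step 6
        System.out.println(find(searcher, "doc1"));          // new text
    }
}
```

In this model the searcher kept open in step 1 serves the old version
of doc1 for the whole batch, so queries going through it never hit a
window in which the document is missing.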

> Do you confirm my idea that keeping an IndexWriter open as much as
> possible while indexing batches of documents is a "good thing"?

Yes. IndexWriter works with a RAMDirectory as cache. If you close
it after each document and open a new one, you enforce unnecessary
write operations to your hard disk.
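
A toy count of the difference, in plain Java. This is not Lucene code:
ToyBufferedWriter is invented for the sketch and flushes only on
close, whereas a real writer may also flush on its own when its buffer
fills up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model (NOT Lucene code): buffer in memory, write to "disk" on close. */
class ToyBufferedWriter {
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> disk; // one element per simulated disk write

    ToyBufferedWriter(List<List<String>> disk) {
        this.disk = disk;
    }

    /** Touches memory only. */
    void addDocument(String doc) {
        buffer.add(doc);
    }

    /** Writes everything buffered so far as a single unit. */
    void close() {
        if (!buffer.isEmpty()) {
            disk.add(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    /** One writer kept open for the whole batch. */
    static int writesWithOneWriter(List<String> docs) {
        List<List<String>> disk = new ArrayList<>();
        ToyBufferedWriter writer = new ToyBufferedWriter(disk);
        for (String doc : docs) writer.addDocument(doc);
        writer.close();
        return disk.size();
    }

    /** A fresh writer opened and closed for every document. */
    static int writesWithWriterPerDocument(List<String> docs) {
        List<List<String>> disk = new ArrayList<>();
        for (String doc : docs) {
            ToyBufferedWriter writer = new ToyBufferedWriter(disk);
            writer.addDocument(doc);
            writer.close();
        }
        return disk.size();
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(writesWithOneWriter(docs));         // 1
        System.out.println(writesWithWriterPerDocument(docs)); // 5
    }
}
```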

> Is there any option to ever see a deleteDocument method in the
> IndexWriter class

Probably not. I guess you either have to update every document separately
as described in your email (open and close a reader and writer for each
document), or do it in the way I describe above (more efficient).

Christoph



Re: Deleting a document with an IndexWriter open

Posted by Giulio Cesare Solaroli <gi...@gmail.com>.
On Fri, 16 Jul 2004 10:00:08 +0200, Christiaan Fluit
<ch...@aduna.biz> wrote:
> [snip]
> 
> That's exactly what we do. It's not optimal but it works.
> 
> My guess would be that the chosen architecture makes it possible to
> query the index while it is simultaneously being updated. I believe
> (correct me if I'm wrong) that you can add documents to an index and
> query it at the same time.

I can confirm this. That is what we were doing to find out if a
document ready to be indexed was already present on the index.

> Only the IndexReader won't see the new
> documents that have been added after it has been created.

That is also correct; we have already found a way around this problem
as we refresh the main IndexReader every few minutes to keep it
aligned with the content of the index.

> I can imagine
> that this is much harder to realize when IndexWriter can not only add
> documents but also delete them.

As far as I understand this, the IndexReader freezes a view of the
whole index at the moment it is opened, in order to keep a consistent
environment while it is running. This should isolate it from updates to
the index, both new documents and deleted documents.

But I don't know Lucene internals enough to bet a single cent on my
previous statement. :-]

> Can any of the Lucene designers tell me why you chose to create an
> IndexWriter and an IndexReader class instead of a single Index class
> that handles all these aspects for you? It seems to me that this way you
> could have solved a lot of these issues internally in Lucene, instead of
> leaving it up to the integrator.


Giulio Cesare Solaroli



Re: Deleting a document with an IndexWriter open

Posted by Christiaan Fluit <Ch...@aduna.biz>.
Giulio Cesare Solaroli wrote:
> is there any architectural reason why an IndexWriter could not
> delete a document?

[snip]

> In this situation, I try to keep the same IndexWriter open as much as
> possible, in order to avoid any unnecessary fragmentation of the
> index.
> Before indexing any document, I can check to see if the document has
> already been inserted, but I am not able to delete it without closing
> the IndexWriter, opening an IndexReader, deleting the document,
> closing the IndexReader and opening the IndexWriter again.

That's exactly what we do. It's not optimal but it works.

My guess would be that the chosen architecture makes it possible to 
query the index while it is simultaneously being updated. I believe 
(correct me if I'm wrong) that you can add documents to an index and 
query it at the same time. Only the IndexReader won't see the new 
documents that have been added after it has been created. I can imagine 
that this is much harder to realize when IndexWriter can not only add 
documents but also delete them.

Can any of the Lucene designers tell me why you chose to create an 
IndexWriter and an IndexReader class instead of a single Index class 
that handles all these aspects for you? It seems to me that this way you 
could have solved a lot of these issues internally in Lucene, instead of 
leaving it up to the integrator.


Kind regards,

Chris
--

