You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Cam Bazz <ca...@gmail.com> on 2008/08/08 23:39:29 UTC

delete by doc id

hello,

what would happen if I modified the class IndexWriter, and made the delete
by id method public?

I have two fields in my documents and I got to be able to delete by those
two fields, (by query in other words) and I do not wish to go trunk version.

I am getting quite desperate, and if not found a solution I will have to
make my documents with 3 fields, a, b and a + b so I can delete by a and b.

Best.

could there be a side effect?

Best.

-c.b.

Re: delete by doc id

Posted by Cam Bazz <ca...@gmail.com>.
Hello Andy,

Thanks for your input. I understand what you are saying and trying to use
lucene as a relational db is a little too far, however,
in certain specialized areas, lucene works better than relational databases.
If you can setup the scheme, so that it is non-normalized, and if you dont
need too much updates, it really works nice.

I have read this paper: http://db.cs.yale.edu/vldb07hstore.pdf , in which
the authors claim 2 orders of magnitude speed increase (between 80 and 100
to be exact) by using specialized databases - (lucene is not mentioned)

I am using lucene to store graphs. So far I have tried many different
databases, ranging from mysql,oracle,postgres,objectivity/db,db4o and many
other. oracle and mysql start crying after 10M records. object relational
db's are good when it comes to traversal, but locating things are hard, and
if the data is coming in random distributions ingest rates are terribly
slow.

so far, lucene has been very fast. both for ingest, searching. its only
drawback is delete and update. (which I understand why)

I have been bugging lucene developers with similar questions for the past
year. With each release lucene is getting better and better.
And by adding deleteByQuery feature, i think it will be complete. (will
never have to depend on docid's again)

To solve my problem I have added a third field, which is not stored and
combination of the other two fields. Works nicely, but some overhead I
guess.

Is there any way that the developers can integrate the deleteByQuery into
2.3 release? It would be greatly appreciated, and I believe many other
people are waiting for the same feature.

Best Regards,
-C.B.





On Sat, Aug 9, 2008 at 1:55 AM, Andy Triana <an...@gmail.com> wrote:

> I rarely submit but I've been seeing this sort of thing more and more
> on this board.
>
> It seems that there is a need to treat Lucene as if it were a data
> storage or database like repository, where in
> fact it isn't.
>
> In our case, for large indexes we run either a parallel process to
> create the index or a nightly process. Always treating
> the index as a disposable lookup mechanism.
>
> When it comes to deleting items from an index, we simply just mark
> those items in a separate list or data structure and then filter them
> away until the index is refreshed the next go around.
>
> My point is, that it seems like folks want Lucene to act like a
> relational database when in fact it is not meant to be used that way.
>
> Perhaps I'm wrong and upgrades will allow for efficient deletions, but
> it has always been clear to me that Lucene is strictly
> an index and should be treated as a feature of your storage
> repository, i.e. database, file system, web, whatever. But not relied
> upon for that very storage or management of that storage.
>
> The deletion issue is true of almost any indexing engine I've used in
> the past, i.e. DT Search, Verity, etc.
>
> Am I missing something? For us it has been a "best practice" to treat
> Lucene as described.
>
> //andy
>
>
> On Fri, Aug 8, 2008 at 2:39 PM, Cam Bazz <ca...@gmail.com> wrote:
> > hello,
> >
> > what would happen if I modified the class IndexWriter, and made the
> delete
> > by id method public?
> >
> > I have two fields in my documents and I got to be able to delete by those
> > two fields, (by query in other words) and I do not wish to go trunk
> version.
> >
> > I am getting quite desperate, and if not found a solution I will have to
> > make my documents with 3 fields, a, b and a + b so I can delete by a and
> b.
> >
> > Best.
> >
> > could there be a side effect?
> >
> > Best.
> >
> > -c.b.
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: delete by doc id

Posted by Andy Triana <an...@gmail.com>.
I rarely submit but I've been seeing this sort of thing more and more
on this board.

It seems that there is a need to treat Lucene as if it were a data
storage or database like repository, where in
fact it isn't.

In our case, for large indexes we run either a parallel process to
create the index or a nightly process. Always treating
the index as a disposable lookup mechanism.

When it comes to deleting items from an index, we simply just mark
those items in a separate list or data structure and then filter them
away until the index is refreshed the next go around.

My point is, that it seems like folks want Lucene to act like a
relational database when in fact it is not meant to be used that way.

Perhaps I'm wrong and upgrades will allow for efficient deletions, but
it has always been clear to me that Lucene is strictly
an index and should be treated as a feature of your storage
repository, i.e. database, file system, web, whatever. But not relied
upon for that very storage or management of that storage.

The deletion issue is true of almost any indexing engine I've used in
the past, i.e. DT Search, Verity, etc.

Am I missing something? For us it has been a "best practice" to treat
Lucene as described.

//andy


On Fri, Aug 8, 2008 at 2:39 PM, Cam Bazz <ca...@gmail.com> wrote:
> hello,
>
> what would happen if I modified the class IndexWriter, and made the delete
> by id method public?
>
> I have two fields in my documents and I got to be able to delete by those
> two fields, (by query in other words) and I do not wish to go trunk version.
>
> I am getting quite desperate, and if not found a solution I will have to
> make my documents with 3 fields, a, b and a + b so I can delete by a and b.
>
> Best.
>
> could there be a side effect?
>
> Best.
>
> -c.b.
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: delete by doc id

Posted by Cam Bazz <ca...@gmail.com>.
I get the id's to delete from a query in indexsearcher.

I think I am going trunk, hope it wont cause a lot of pain.

best.
-C.B.

On Sat, Aug 9, 2008 at 2:30 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

>
> It's risky.
>
> How would you get the IDs to know which ones to delete?  A separate reader
> running on the side?
>
> The problem is, as IndexWriter merges segments, the IDs shift.  Any reader
> you have already open won't see this shift (until you reopen it), so you
> could end up deleting the wrong IDs.
>
> Mike
>
>
> Cam Bazz wrote:
>
>  hello,
>>
>> what would happen if I modified the class IndexWriter, and made the delete
>> by id method public?
>>
>> I have two fields in my documents and I got to be able to delete by those
>> two fields, (by query in other words) and I do not wish to go trunk
>> version.
>>
>> I am getting quite desperate, and if not found a solution I will have to
>> make my documents with 3 fields, a, b and a + b so I can delete by a and
>> b.
>>
>> Best.
>>
>> could there be a side effect?
>>
>> Best.
>>
>> -c.b.
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: delete by doc id

Posted by Michael McCandless <lu...@mikemccandless.com>.
It's risky.

How would you get the IDs to know which ones to delete?  A separate  
reader running on the side?

The problem is, as IndexWriter merges segments, the IDs shift.  Any  
reader you have already open won't see this shift (until you reopen  
it), so you could end up deleting the wrong IDs.

Mike

Cam Bazz wrote:

> hello,
>
> what would happen if I modified the class IndexWriter, and made the  
> delete
> by id method public?
>
> I have two fields in my documents and I got to be able to delete by  
> those
> two fields, (by query in other words) and I do not wish to go trunk  
> version.
>
> I am getting quite desperate, and if not found a solution I will  
> have to
> make my documents with 3 fields, a, b and a + b so I can delete by a  
> and b.
>
> Best.
>
> could there be a side effect?
>
> Best.
>
> -c.b.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org