Posted to java-user@lucene.apache.org by Greg Gershman <gr...@yahoo.com> on 2006/02/13 15:47:04 UTC

Help with mass delete from large index

I'm trying to delete a large number of documents
(~15 million) from a large index (30+ million
documents).  I've started with an optimized index, and
a list of docIDs (our own unique identifier for a
document, not a Lucene doc number) to pass to the
IndexReader.delete(Term t) method.  I've had a few
different problems.

The following code is inside the loop that iterates
through the document IDs:

    try {
        Term t = new Term("docID", String.valueOf(docID));
        deletedCount += indexReader.delete(t);
    } catch (Exception e) {
        System.out.println("Error while deleting docID#" + docID);
        e.printStackTrace();
    }

In order to commit the deletions, I also close and
reopen the IndexReader periodically.
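
For reference, the surrounding driver looks roughly like
this (a simplified sketch: the docID file name and the
index path are stand-ins for what we actually use):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    public class IndexEraser {
        public static void main(String[] args) throws Exception {
            // One docID per line; the file name is made up.
            BufferedReader ids = new BufferedReader(new FileReader("docids.txt"));
            IndexReader indexReader = IndexReader.open("/path/to/index");
            int deletedCount = 0;
            int sinceReopen = 0;
            String docID;
            while ((docID = ids.readLine()) != null) {
                try {
                    Term t = new Term("docID", docID);
                    deletedCount += indexReader.delete(t);
                } catch (Exception e) {
                    System.out.println("Error while deleting docID#" + docID);
                    e.printStackTrace();
                }
                // Closing the reader commits the pending deletions;
                // the reopen interval is discussed below.
                if (++sinceReopen >= 500000) {
                    indexReader.close();
                    indexReader = IndexReader.open("/path/to/index");
                    sinceReopen = 0;
                }
            }
            indexReader.close();
            ids.close();
            System.out.println("Deleted " + deletedCount + " documents");
        }
    }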

At first I was reopening the IndexReader after every
500K documents deleted.  The problem was that after
~60-75K deletions, the delete call began to throw a
NullPointerException:

Error while deleting docID#27136356
java.lang.NullPointerException
        at java.lang.String.compareTo(String.java:402)
        at org.apache.lucene.index.Term.compareTo(Term.java:76)
        at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:143)
        at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:132)
        at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:51)
        at org.apache.lucene.index.IndexReader.termDocs(IndexReader.java:364)
        at org.apache.lucene.index.IndexReader.delete(IndexReader.java:449)
        at IndexEraser.main(IndexEraser.java:32)

After a little fiddling around, I tried reducing the
interval between reopens to 5000, and most of the
NullPointerExceptions went away.

A test search of the resulting, unoptimized index
worked fine.

I then optimized the index to reduce its size.  Now,
instead of getting data back for many of the results,
I get a null value.

Any ideas?  I'm really confused, and the only other
option I can think of is to reindex the documents I
need, which would take much longer than deleting the
ones I don't.

Thanks!

Greg Gershman




Re: Help with mass delete from large index

Posted by "Michael D. Curtin" <mi...@curtin.com>.
Chandramohan wrote:

>>perform such a cull again, you might make several
>>distinct indexes (one per 
>>day, per week, per whatever) during that reindexing
>>so the next time will be 
>>much easier.
> 
> How would you search and consolidate the results
> across multiple indexes?  Hits from each index will
> have independent scoring.

Frankly, I ignore the scores in my application.  The data itself isn't English 
prose, so the TF/IDF calculations are stretched at best as a measure of 
relevance.  I presort the documents to be in "relevance" order (a popularity 
metric), then specify index ordering for the results.

If that wouldn't work for your application, it seems to me that large-enough 
sub-sections *would* produce equivalent scores.  That is, if the sub-indexes 
were big enough, one could directly compare scores, so a simple merge would 
work.  If the total document corpus is small, then the need for sub-indexes 
isn't there anyhow.
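
For what it's worth, here is a rough sketch of how I'd wire that up in 
1.4: a MultiSearcher over the sub-indexes, with index ordering instead of 
scores (the paths, field names and query are all invented):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Searchable;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.TermQuery;

    public class SubIndexSearch {
        public static void main(String[] args) throws Exception {
            // One searcher per sub-index (per day, per week, whatever).
            Searchable[] subs = {
                new IndexSearcher("/indexes/week1"),
                new IndexSearcher("/indexes/week2")
            };
            MultiSearcher searcher = new MultiSearcher(subs);

            // Index order reflects the presorted "relevance" order, so
            // the TF/IDF scores never enter into it.
            Hits hits = searcher.search(
                new TermQuery(new Term("body", "lucene")), Sort.INDEXORDER);
            for (int i = 0; i < hits.length() && i < 10; i++) {
                System.out.println(hits.doc(i).get("docID"));
            }
            searcher.close();
        }
    }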

--MDC



Re: Help with mass delete from large index

Posted by Chandramohan <cl...@yahoo.com>.
> perform such a cull again, you might make several
> distinct indexes (one per 
> day, per week, per whatever) during that reindexing
> so the next time will be 
> much easier.

How would you search and consolidate the results
across multiple indexes?  Hits from each index will
have independent scoring.

CL

--- "Michael D. Curtin" <mi...@curtin.com> wrote:

> Now that it's already in 1 index, I'm afraid you
> can't just delete a few 
> files.  On the other hand, if it's only a one-time
> thing, reindexing with only 
> the docs you want shouldn't be too bad.  If you
> think you might ever need to 
> perform such a cull again, you might make several
> distinct indexes (one per 
> day, per week, per whatever) during that reindexing
> so the next time will be 
> much easier.
> 
> Good luck!
> 
> --MDC





Re: Help with mass delete from large index

Posted by Chris Hostetter <ho...@fucit.org>.
: I can create a test case; should I include an index
: along with it (it could be rather large)?

the ideal test case creates the index in its constructor or setUp method.
since the index is going to be totally artificial, the data doesn't
matter, just the term you want to delete on (and the terms can probably be
sequential, based on your description of the problem)

take a look at BaseTestRangeFilter for an example of what i mean...

http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/test/org/apache/lucene/search/BaseTestRangeFilter.java?rev=150661&view=markup

(too bad i declared the index in that class to be of type RAMDirectory,
otherwise you could just subclass it)
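
something along these lines, maybe (a rough sketch; the sizes and the
"docID" field just mirror your description):

    import junit.framework.TestCase;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.RAMDirectory;

    public class TestMassDelete extends TestCase {

        private static final int NUM_DOCS = 100000;
        private RAMDirectory dir;

        // build a purely artificial index with sequential docIDs
        protected void setUp() throws Exception {
            dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true);
            for (int i = 0; i < NUM_DOCS; i++) {
                Document doc = new Document();
                doc.add(Field.Keyword("docID", String.valueOf(i)));
                writer.addDocument(doc);
            }
            writer.optimize();
            writer.close();
        }

        // delete a large fraction by term; every delete should hit exactly once
        public void testMassDelete() throws Exception {
            IndexReader reader = IndexReader.open(dir);
            for (int i = 0; i < NUM_DOCS / 2; i++) {
                Term t = new Term("docID", String.valueOf(i));
                assertEquals("docID#" + i, 1, reader.delete(t));
            }
            reader.close();
        }
    }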


-Hoss




Re: Help with mass delete from large index

Posted by Greg Gershman <gr...@yahoo.com>.
I can create a test case; should I include an index
along with it (it could be rather large)?

I'm running the deletion process again with the latest
nightly build.  So far I haven't seen any of the
previous problems, so perhaps there is already a fix
in place.

Thanks!

Greg

--- Daniel Naber <lu...@danielnaber.de>
wrote:

> On Monday 13 February 2006 19:42, Greg Gershman wrote:
> 
> > I'm still wondering if anyone has any thoughts on the
> > NullPointerException and/or the delete/optimize
> > problems I'm having.  They seem to be very real
> > issues.
> 
> I haven't seen this before (and don't remember
> anyone on the list 
> mentioning it), but if you could create a test case
> that reproduces it (on 
> an artificial test index), it can probably be
> debugged by someone. Also, 
> did you try the same with Lucene 1.9?
> 
> Regards
>  Daniel
> 
> -- 
> http://www.danielnaber.de





Re: Help with mass delete from large index

Posted by Daniel Naber <lu...@danielnaber.de>.
On Monday 13 February 2006 19:42, Greg Gershman wrote:

> I'm still wondering if anyone has any thoughts on the
> NullPointerException and/or the delete/optimize
> problems I'm having.  They seem to be very real
> issues.

I haven't seen this before (and don't remember anyone on the list 
mentioning it), but if you could create a test case that reproduces it (on 
an artificial test index), it can probably be debugged by someone. Also, 
did you try the same with Lucene 1.9?

Regards
 Daniel

-- 
http://www.danielnaber.de



Re: Help with mass delete from large index

Posted by Greg Gershman <gr...@yahoo.com>.
Thanks, that is the way things will be done in the
future.

I'm still wondering if anyone has any thoughts on the
NullPointerException and/or the delete/optimize
problems I'm having.  They seem to be very real
issues.

Greg

--- "Michael D. Curtin" <mi...@curtin.com> wrote:

> Greg Gershman wrote:
> 
> > No problem; this is not meant to be a regular
> > operation, rather it's a (hopefully) one-time thing
> > till the index can be restructured.
> > 
> > The data is chronological in nature, deleting
> > everything before a specific point in time.  The index
> > is optimized, so is it possible to remove specific
> > files?  I'm open to other suggestions as to how to
> > approach this.
> 
> Now that it's already in 1 index, I'm afraid you
> can't just delete a few 
> files.  On the other hand, if it's only a one-time
> thing, reindexing with only 
> the docs you want shouldn't be too bad.  If you
> think you might ever need to 
> perform such a cull again, you might make several
> distinct indexes (one per 
> day, per week, per whatever) during that reindexing
> so the next time will be 
> much easier.
> 
> Good luck!
> 
> --MDC





Re: Help with mass delete from large index

Posted by "Michael D. Curtin" <mi...@curtin.com>.
Greg Gershman wrote:

> No problem; this is not meant to be a regular
> operation, rather it's a (hopefully) one-time thing
> till the index can be restructured.
> 
> The data is chronological in nature, deleting
> everything before a specific point in time.  The index
> is optimized, so is it possible to remove specific
> files?  I'm open to other suggestions as to how to
> approach this.

Now that it's already in 1 index, I'm afraid you can't just delete a few 
files.  On the other hand, if it's only a one-time thing, reindexing with only 
the docs you want shouldn't be too bad.  If you think you might ever need to 
perform such a cull again, you might make several distinct indexes (one per 
day, per week, per whatever) during that reindexing so the next time will be 
much easier.

Good luck!

--MDC



Re: Help with mass delete from large index

Posted by Greg Gershman <gr...@yahoo.com>.
No problem; this is not meant to be a regular
operation, rather it's a (hopefully) one-time thing
till the index can be restructured.

The data is chronological in nature, deleting
everything before a specific point in time.  The index
is optimized, so is it possible to remove specific
files?  I'm open to other suggestions as to how to
approach this.
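
(For what it's worth, if the dates were indexed as sortable keyword
terms, the whole cull could be driven off a TermEnum rather than an
explicit docID list.  A rough sketch; the "date" field and the yyyyMMdd
format are assumptions, not necessarily what our index contains:)

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    public class DeleteBeforeCutoff {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open("/path/to/index");
            String cutoff = "20060101";  // everything before this date goes
            // terms() positions the enumeration at the first term >= date:""
            TermEnum terms = reader.terms(new Term("date", ""));
            int deleted = 0;
            try {
                while (terms.term() != null
                        && terms.term().field().equals("date")
                        && terms.term().text().compareTo(cutoff) < 0) {
                    deleted += reader.delete(terms.term());
                    if (!terms.next()) break;
                }
            } finally {
                terms.close();
                reader.close();  // closing the reader commits the deletions
            }
            System.out.println("Deleted " + deleted + " documents");
        }
    }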

Also, I neglected to mention I'm using version 1.4.3.

Greg 


--- "Michael D. Curtin" <mi...@curtin.com> wrote:

> Greg Gershman wrote:
> 
> > I'm trying to delete a large number of documents
> > (~15 million) from a large index (30+ million
> > documents).  I've started with an optimized index, and
> > a list of docIDs (our own unique identifier for a
> > document, not a Lucene doc number) to pass to the
> > IndexReader.delete(Term t) method.  I've had a few
> > different problems.
> > ...
> > Any ideas?  I'm really confused, and the only other
> > option I can think of is to reindex the documents I
> > need, which would take much longer than deleting the
> > ones I don't.
> 
> Maybe it would be useful to take a step back up the
> tree of abstractions here 
> and reexamine why you're deleting such a large
> fraction of your index, 
> particularly if you're doing it on a regular basis. 
> For example, is there a 
> chronological or other "natural" break in the data
> such that you could make 2 
> indexes with ~15M docs each in the first place, then
> just delete a few index 
> *files* instead of 15M documents, one at a time?
> 
> --MDC





Re: Help with mass delete from large index

Posted by "Michael D. Curtin" <mi...@curtin.com>.
Greg Gershman wrote:

> I'm trying to delete a large number of documents
> (~15 million) from a large index (30+ million
> documents).  I've started with an optimized index, and
> a list of docIDs (our own unique identifier for a
> document, not a Lucene doc number) to pass to the
> IndexReader.delete(Term t) method.  I've had a few
> different problems.
> ...
> Any ideas?  I'm really confused, and the only other
> option I can think of is to reindex the documents I
> need, which would take much longer than deleting the
> ones I don't.

Maybe it would be useful to take a step back up the tree of abstractions here 
and reexamine why you're deleting such a large fraction of your index, 
particularly if you're doing it on a regular basis.  For example, is there a 
chronological or other "natural" break in the data such that you could make 2 
indexes with ~15M docs each in the first place, then just delete a few index 
*files* instead of 15M documents, one at a time?

--MDC



Re: Help with mass delete from large index

Posted by Greg Gershman <gr...@yahoo.com>.
I tried the same operation with the nightly 1.9 build
and it worked: no NPEs during the deletes, and after
optimization, search worked fine.

I did a little bit of debugging; a call to getField
returned null, so I think it was more than just the
Term value that was missing.  As the error only
occurred after optimization, it must have something to
do with the document numbers...

Is it still worth creating a test case, or since it's
fixed in 1.9, just moving on?

Greg

--- Otis Gospodnetic <ot...@yahoo.com>
wrote:

> I have seen this error in my Simpy logs before....
> at least the NPE in compareTo (I don't recall the
> rest of the stack).
> Have you tried debugging this?  I suppose the Term
> field or value is null somehow... not sure why.
> 
> Otis
> P.S.
> Deleting files - don't :)
> 
> [rest of quoted message snipped; it repeated the original post in full]





Re: Help with mass delete from large index

Posted by Otis Gospodnetic <ot...@yahoo.com>.
I have seen this error in my Simpy logs before.... at least the NPE in compareTo (I don't recall the rest of the stack).
Have you tried debugging this?  I suppose the Term field or value is null somehow... not sure why.

Otis
P.S.
Deleting files - don't :)

----- Original Message ----
From: Greg Gershman <gr...@yahoo.com>
To: java-user@lucene.apache.org
Sent: Mon 13 Feb 2006 09:47:04 AM EST
Subject: Help with mass delete from large index

[original message snipped; it is quoted in full at the top of this page]




