Posted to java-user@lucene.apache.org by Jason Corekin <ja...@gmail.com> on 2013/12/14 07:28:59 UTC

deleteDocuments(Term... terms) takes a long time to do nothing.

Let me start by stating that I am almost certain that I am doing something
wrong, and I hope that I am, because if not there is a VERY large bug in
Lucene.  What I am trying to do is use the method

deleteDocuments(Term... terms)

out of the IndexWriter class to delete several arrays of Term objects, each
fed to it from a separate Thread.  Each array has around 460k+ Term objects
in it.  The issue is that after running for around 30 minutes or more the
method finishes, I then run a commit, and nothing changes in my files.
To be fair, I am running a custom Directory implementation that might be
causing problems, but I do not think that is the case, as I do not even
see any of my Directory methods in the stack trace.  In fact, when I set
breakpoints inside the delete methods of my Directory implementation they
never even get hit.  To be clear, replacing the custom Directory
implementation with a standard one is not an option due to the nature of
the data, which is made up of terabytes of small (1k and less) files.  So,
if the issue is in the Directory implementation, I have to figure out how
to fix it.


Below are the pieces of code that I think are relevant to this issue, as
well as a copy of the stack trace of the thread that was doing work when I
paused the debug session.  As you are likely to notice, the thread is
called a DBCloner because it is being used to clone the underlying
index-based database (needed to avoid storing trillions of files directly
on disk).  The idea is to duplicate the selected group of terms into a new
database and then delete the original terms from the original database.
The duplication works wonderfully, but no matter what I do, including
cutting the program down to one thread, I cannot shrink the database, and
the deletes take drastically too long.


In an attempt to be as helpful as possible, I will say this.  I have been
tracing this problem for a few days and have seen that

BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BytesRef)

is where the majority of the execution time is spent.  I have also noticed
that this method returns false MUCH more often than it returns true.  I
have been trying to figure out how the mechanics of this process work, just
in case the issue was not in my code and I might have been able to find the
problem.  But I have yet to find the problem in either Lucene 4.5.1 or
Lucene 4.6.  If anyone has any ideas as to what I might be doing wrong, I
would really appreciate reading what you have to say.  Thanks in advance.



Jason



private void cloneDB() throws QueryNodeException {

    Document doc;
    ArrayList<String> fileNames;
    int start = docRanges[(threadNumber * 2)];
    int stop = docRanges[(threadNumber * 2) + 1];

    try {
        fileNames = new ArrayList<String>(docsPerThread);
        for (int i = start; i < stop; i++) {
            doc = searcher.doc(i);
            try {
                adder.addDoc(doc);
                fileNames.add(doc.get("FileName"));
            } catch (TransactionExceptionRE | TransactionException | LockConflictException te) {
                adder.txnAbort();
                System.err.println(Thread.currentThread().getName()
                        + ": Adding a message failed, retrying.");
            }
        }

        deleters[threadNumber].deleteTerms("FileName", fileNames);
        deleters[threadNumber].commit();

    } catch (IOException | ParseException ex) {
        Logger.getLogger(DocReader.class.getName()).log(Level.SEVERE, null, ex);
    }
}


public void deleteTerms(String dbField, ArrayList<String> fieldTexts) throws IOException {
    Term[] terms = new Term[fieldTexts.size()];
    for (int i = 0; i < fieldTexts.size(); i++) {
        terms[i] = new Term(dbField, fieldTexts.get(i));
    }
    writer.deleteDocuments(terms);
}


public void deleteDocuments(Term... terms) throws IOException





Thread [DB Cloner 2] (Suspended)
    owns: BufferedUpdatesStream  (id=54)
    owns: IndexWriter  (id=49)
    FST<T>.readFirstRealTargetArc(long, Arc<T>, BytesReader) line: 979
    FST<T>.findTargetArc(int, Arc<T>, Arc<T>, BytesReader) line: 1220
    BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BytesRef) line: 1679
    BufferedUpdatesStream.applyTermDeletes(Iterable<Term>, ReadersAndUpdates, SegmentReader) line: 414
    BufferedUpdatesStream.applyDeletesAndUpdates(ReaderPool, List<SegmentCommitInfo>) line: 283
    IndexWriter.applyAllDeletesAndUpdates() line: 3112
    IndexWriter.applyDeletesAndPurge(boolean) line: 4641
    DocumentsWriter$ApplyDeletesEvent.process(IndexWriter, boolean, boolean) line: 673
    IndexWriter.processEvents(Queue<Event>, boolean, boolean) line: 4665
    IndexWriter.processEvents(boolean, boolean) line: 4657
    IndexWriter.deleteDocuments(Term...) line: 1421
    DocDeleter.deleteTerms(String, ArrayList<String>) line: 95
    DBCloner.cloneDB() line: 233
    DBCloner.run() line: 133
    Thread.run() line: 744

Re: deleteDocuments(Term... terms) takes a long time to do nothing.

Posted by Michael McCandless <lu...@mikemccandless.com>.
OK I'm glad it's resolved.

Another way to handle the "expire old documents" case would be to index
into separate indices by time, and use MultiReader to search all of
them.

E.g. maybe one index per day.  That way, deleting a day just means
you don't pass that index to MultiReader.
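
A rough sketch of what that could look like, assuming plain
FSDirectory-backed per-day indexes (the DailyIndexSearcher class and the
directory names are made up for illustration):

import java.io.File;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class DailyIndexSearcher {

    // Opens one reader per day that should still be searchable and wraps them
    // in a MultiReader; "expiring" a day just means leaving it out of the list
    // (and eventually removing its directory).
    public static IndexSearcher open(File baseDir, String... days) throws Exception {
        IndexReader[] readers = new IndexReader[days.length];
        for (int i = 0; i < days.length; i++) {
            readers[i] = DirectoryReader.open(FSDirectory.open(new File(baseDir, days[i])));
        }
        return new IndexSearcher(new MultiReader(readers));
    }
}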

Mike McCandless

http://blog.mikemccandless.com




Re: deleteDocuments(Term... terms) takes a long time to do nothing.

Posted by Jason Corekin <ja...@gmail.com>.
Mike,



Thank you for your help.  Below are a few comments to directly reply to
your questions, but in general your suggestions helped get me on the right
track, and I believe that I have been able to solve the Lucene component
of my problems.  The short answer is that when I had previously tried to
search by query, I used the filenames stored in each document as the
query, which was essentially equivalent to deleting by term.  Your email
helped me to realize this and in turn change my query to be time-range
based, which now takes seconds to run.
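
For reference, the time-range delete boils down to something like the
sketch below; it assumes each document was also indexed with a numeric
LongField (called "Timestamp" here, a made-up name) next to FileName:

import java.io.IOException;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;

public class TimeRangeDelete {

    // Deletes every document whose "Timestamp" LongField falls inside
    // [startMillis, endMillis]; the field name is a placeholder.
    public static void deleteRange(IndexWriter writer, long startMillis, long endMillis)
            throws IOException {
        Query query = NumericRangeQuery.newLongRange("Timestamp", startMillis, endMillis, true, true);
        writer.deleteDocuments(query);
        writer.commit();
    }
}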



Thank You



Jason Corekin



>It sounds like there are at least two issues.

>

>First, that it takes so long to do the delete.

>

>Unfortunately, deleting by Term is at heart a costly operation.  It

>entails up to one disk seek per segment in your index; a custom

>Directory impl that makes seeking costly would slow things down, or if

>the OS doesn't have enough RAM to cache the "hot" pages (if your Dir

>impl is using the OS).  Is seeking somehow costly in your custom Dir

>impl?



No, seeks are not slow at all.

>

>If you are deleting ~1M terms in ~30 minutes that works out to ~2 msec

>per Term, which may actually be expected.

>

>How many terms in your index?  Can you run CheckIndex and post the output?

In the main test case that was causing problems I believe that there are
around 3.7 million terms, and this is tiny in comparison to what will need
to be held.  Unfortunately, I forgot to save the CheckIndex output that I
created from this test set while the problem was occurring, and now that
the problem is solved I do not think it is worth going back to recreate it.



>

>You could index your ID field using MemoryPostingsFormat, which should

>be a good speedup, but will consume more RAM.

>

>Is it possible to delete by query instead?  Ie, create a query that

>matches the 460K docs and pass that to

>IndexWriter.deleteDocuments(Query).

>

Thanks so much for this suggestion, I had thought of it on my own.



>Also, try passing fewer ids at once to Lucene, e.g. break the 460K

>into smaller chunks.  Lucene buffers up all deleted terms from one

>call, and then applies them, so my guess is you're using way too much

>intermediate memory by passign 460K in a single call.



This does not seem to be the issue now, but I will keep it in mind.

>

>Instead of indexing everything into one index, and then deleting tons

>of docs to "clone" to a new index, why not just index to two separate

>indices to begin with?

>

The clone idea is only a test; the final design is to be able to copy date
ranges of data out of the main index and into secondary indexes that will
be backed up and removed from the main system at a regular interval.  The
copy component of this idea seems to work just fine; it's getting the
deletion from the main index to work that is giving me all the trouble.



>The second issue is that after all that work, nothing in fact changed.

> For that, I think you should make a small test case that just tries

>to delete one document, and iterate/debug until that works.  Your

>StringField indexing line looks correct; make sure you're passing

>precisely the same field name and value?  Make sure you're not

>deleting already-deleted documents?  (Your for loop seems to ignore

>already deleted documents).



This was caused by an incorrect use of the underlying data structure.  This
is partially fixed now and is what I am currently working on.  I have it
fixed enough to identify that it should no longer be related to Lucene.



>

>Mike McCandless



Re: deleteDocuments(Term... terms) takes a long time to do nothing.

Posted by Jason Corekin <ja...@gmail.com>.
Mike,

Thanks for the input; it will take me some time to digest and try
everything you wrote about.  I will post back the answers to your questions
and the results from the suggestions you made once I have gone over
everything.  Thanks for the quick reply,

Jason



Re: deleteDocuments(Term... terms) takes a long time to do nothing.

Posted by Michael McCandless <lu...@mikemccandless.com>.
It sounds like there are at least two issues.

First, that it takes so long to do the delete.

Unfortunately, deleting by Term is at heart a costly operation.  It
entails up to one disk seek per segment in your index; a custom
Directory impl that makes seeking costly would slow things down, as would
an OS that doesn't have enough RAM to cache the "hot" pages (if your Dir
impl goes through the OS).  Is seeking somehow costly in your custom Dir
impl?

If you are deleting ~1M terms in ~30 minutes that works out to ~2 msec
per Term, which may actually be expected.
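(For the sizes in this thread, ~460K terms at ~2 msec each works out to
roughly 920 seconds, i.e. about 15 minutes per array, before any other
overhead.)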

How many terms in your index?  Can you run CheckIndex and post the output?
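(For a plain FSDirectory index that would be something along the lines of:
java -cp lucene-core-4.6.0.jar org.apache.lucene.index.CheckIndex /path/to/index.)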

You could index your ID field using MemoryPostingsFormat, which should
be a good speedup, but will consume more RAM.
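
For example, wiring that up in 4.6 could look roughly like this untested
sketch; MemoryPostingsFormat lives in the lucene-codecs module if I'm not
mistaken, and the MemoryIdFieldConfig class and analyzer choice are just
placeholders:

import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene46.Lucene46Codec;
import org.apache.lucene.codecs.memory.MemoryPostingsFormat;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

public class MemoryIdFieldConfig {

    // Keeps the FileName term dictionary entirely in RAM while every other
    // field stays on the default postings format.
    public static IndexWriterConfig create() {
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, new KeywordAnalyzer());
        iwc.setCodec(new Lucene46Codec() {
            private final PostingsFormat memory = new MemoryPostingsFormat();

            @Override
            public PostingsFormat getPostingsFormatForField(String field) {
                return "FileName".equals(field) ? memory : super.getPostingsFormatForField(field);
            }
        });
        return iwc;
    }
}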

Is it possible to delete by query instead?  I.e., create a query that
matches the 460K docs and pass that to
IndexWriter.deleteDocuments(Query).

Also, try passing fewer ids at once to Lucene, e.g. break the 460K
into smaller chunks.  Lucene buffers up all deleted terms from one
call, and then applies them, so my guess is you're using way too much
intermediate memory by passing 460K in a single call.
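
For example, a simple way to batch them (an illustrative sketch; the
ChunkedDeleter name and the chunk size are up to you):

import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class ChunkedDeleter {

    // Applies the deletes in fixed-size batches instead of one 460K-term call,
    // so the buffered-delete memory stays bounded.
    public static void deleteInChunks(IndexWriter writer, List<Term> terms, int chunkSize)
            throws IOException {
        for (int from = 0; from < terms.size(); from += chunkSize) {
            List<Term> chunk = terms.subList(from, Math.min(from + chunkSize, terms.size()));
            writer.deleteDocuments(chunk.toArray(new Term[chunk.size()]));
        }
    }
}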

Instead of indexing everything into one index, and then deleting tons
of docs to "clone" to a new index, why not just index to two separate
indices to begin with?

The second issue is that after all that work, nothing in fact changed.
 For that, I think you should make a small test case that just tries
to delete one document, and iterate/debug until that works.  Your
StringField indexing line looks correct; make sure you're passing
precisely the same field name and value?  Make sure you're not
deleting already-deleted documents?  (Your for loop seems to ignore
already deleted documents).
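
Something along these lines would do as a starting point; it is an
untested sketch against a RAMDirectory, so it leaves the custom Directory
out of the picture at first:

import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class SingleDeleteTest {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_46, new KeywordAnalyzer()));

        // Index one document with the same kind of field used in the real code.
        Document doc = new Document();
        doc.add(new StringField("FileName", "some/file.txt", Field.Store.YES));
        writer.addDocument(doc);
        writer.commit();

        // Delete it with exactly the same field name and value.
        writer.deleteDocuments(new Term("FileName", "some/file.txt"));
        writer.commit();

        DirectoryReader reader = DirectoryReader.open(dir);
        System.out.println("numDocs after delete: " + reader.numDocs()); // expect 0
        reader.close();
        writer.close();
        dir.close();
    }
}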

Mike McCandless

http://blog.mikemccandless.com




Re: deleteDocuments(Term... terms) takes a long time to do nothing.

Posted by Jason Corekin <ja...@gmail.com>.
I knew that I had forgotten something.  Below is the line that I use to
create the field that I am trying to delete the entries with.  I hope this
avoids some confusion.  Thank you very much to anyone who takes the time
to read these messages.

doc.add(new StringField("FileName",filename, Field.Store.YES));

