Posted to solr-user@lucene.apache.org by Patrick Johnstone <pj...@gmail.com> on 2009/06/24 16:39:59 UTC

Delete, Commit, Add Interaction

We're indexing a potentially large collection of documents into smaller
subgroups we call "collections".  Each document has a field that
identifies the collection it belongs to, in addition to a unique
document id field:

<add>
   <doc>
      <field name="id">foo-1</field>
      <field name="collection">foo</field>
      ......
   </doc>
   <doc>
      <field name="id">foo-2</field>
      <field name="collection">foo</field>
      .....
   </doc>

   ..... etc.
</add>

"collection" and "id" are defined in schema.xml as string fields.

When a collection is being added to the index, it's possible that
there is an existing "foo" collection in the index that needs to be
replaced.  The ids in the new collection will reuse many of the ids
in the old collection, but the replacement is not a document-for-document
replacement process -- there may be more or fewer documents
in the new collection.

So the replacement operation goes as follows:

<delete>
   <query>collection:foo</query>
</delete>
<commit waitFlush="true" waitSearcher="true" />
<add>
   <doc>
      .....
</add>
<commit waitFlush="true" waitSearcher="true" />

Each of these XML commands happens on a separate HTTP connection.
If the collection doesn't already exist in the index, then the delete
is essentially a noop.

Finally, here's the behavior we're seeing.  In some cases, usually when
the index is starting to get larger (approaching 500,000 documents),
the above procedure will fail to add anything to the index.  That is, none
of the commands return an error code, there is no indication of a problem
in the log files and the process DOES take some amount of time to
complete.  But at the end of the process, there are no documents in
the index whose collection is "foo".  This can happen whether or not
there is an existing "foo" collection already in the index -- in fact, the
typical case is that there is not.

So my question is:  Is there any chance that the delete, commit, and add
commands are interacting in such a way as to cause the add to happen
before the delete so that the add is just replacing the existing "foo"
documents and then the delete is coming along and deleting everything?

My understanding is that the wait attributes to the commit command should
flush the delete out to the index before the add can start but I have
no knowledge of the true sequencing of events in either Solr or Lucene.

If this is happening, how can I know when the delete has been processed
before initiating the add process?

Thanks,

Patrick Johnstone

Re: Delete, Commit, Add Interaction

Posted by Chris Hostetter <ho...@fucit.org>.
: Jul 4, 2009 12:38:43 PM org.apache.solr.update.processor.LogUpdateProcessor finish
: INFO: {} 0 0
: Jul 4, 2009 12:38:43 PM org.apache.solr.core.SolrCore execute
: INFO: [] webapp=/solr path=/update params={} status=0 QTime=0 
: 
: ...that was a delete (not sure why the msg from LogUpdateProcessor is 
: empty) then something like this from the commit...

...it was fat finger user error on my part, the log msg from delete should 
look like...

Jul 4, 2009 12:46:30 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {deleteByQuery=name:foo} 0 12
Jul 4, 2009 12:46:30 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={} status=0 QTime=12 



-Hoss


Re: Delete, Commit, Add Interaction

Posted by Chris Hostetter <ho...@fucit.org>.
: <delete>
:    <query>collection:foo</query>
: </delete>
: <commit waitFlush="true" waitSearcher="true" />
: <add>
:    <doc>
:       .....
: </add>
: <commit waitFlush="true" waitSearcher="true" />

	...

: Finally, here's the behavior we're seeing.  In some cases, usually when
: the index is starting to get larger (approaching 500,000 documents),
: the above procedure will fail to add anything to the index.  That is, none
: of the commands return an error code, there is no indication of a problem
: in the log files and the process DOES take some amount of time to

That really shouldn't happen.  if you were using embedded solr, or some 
crazy UpdateProcessor, i can imagine encountering a code path 
where your adds got processed before your delete -- but not if you are 
using HTTP to send XML like that each in a separate HTTP Connection as you 
describe.

: If this is happening, how can I know when the delete has been processed
: before initiating the add process?

When the <commit> command after the delete returns a 200 status code, the 
delete is done.  *DONE* Done, completely done, over and done nothing funky 
going on under the covers done.
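
A minimal sketch of that sequencing, in Python, assuming a client that sends each XML command as its own blocking HTTP request and only moves on once the previous one has answered 200. The function names (replace_collection, http_post) and the default URL are hypothetical and not taken from the original post; the collection name is pasted into the query unescaped, which is only safe for simple values like "foo".

```python
import urllib.request


def replace_collection(post, collection, docs_xml):
    """Run delete -> commit -> add -> commit, stopping at the first non-200 reply.

    `post` is any callable that takes an XML string and returns an HTTP
    status code. Returns the (step, status) pairs for the steps attempted.
    """
    commit = '<commit waitFlush="true" waitSearcher="true" />'
    steps = [
        ("delete", "<delete><query>collection:%s</query></delete>" % collection),
        ("commit", commit),
        ("add", "<add>%s</add>" % docs_xml),
        ("commit", commit),
    ]
    results = []
    for name, body in steps:
        status = post(body)  # blocks until the server has answered this request
        results.append((name, status))
        if status != 200:
            break  # do not send later steps after a failure
    return results


def http_post(xml, url="http://localhost:8983/solr/update"):
    """POST one XML command. The URL is an assumed default; adjust as needed.

    urllib raises HTTPError for most non-2xx replies, so callers may want
    to catch that as well as checking the returned status.
    """
    req = urllib.request.Request(
        url,
        data=xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Calling replace_collection(http_post, "foo", docs_xml) would run the four steps in order; passing a stub in place of http_post lets the ordering be checked without a running server.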

can you post some of your log messages from one of these problematic 
instances?  I'm particularly interested in the INFO level messages from the 
LogUpdateProcessor.finish and SolrCore.execute that say things like...

Jul 4, 2009 12:38:43 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {} 0 0
Jul 4, 2009 12:38:43 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={} status=0 QTime=0 

...that was a delete (not sure why the msg from LogUpdateProcessor is 
empty) then something like this from the commit...

Jul 4, 2009 12:39:55 PM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=false,waitFlush=true,waitSearcher=true)
Jul 4, 2009 12:39:55 PM org.apache.solr.search.SolrIndexSearcher <init>
INFO: Opening Searcher@15ccfb1 main
   < ... snip a bunch of logging about autowarming various caches ... >
Jul 4, 2009 12:39:55 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {commit=} 0 50
Jul 4, 2009 12:39:55 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={} status=0 QTime=50 

...and then a bunch of adds...

Jul 4, 2009 12:41:37 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[SP2514N, 6H500F0]} 0 24
Jul 4, 2009 12:41:37 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={} status=0 QTime=24 
Jul 4, 2009 12:41:37 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[F8V7067-APL-KIT, IW-02]} 0 9

...which should be followed by another commit getting logged.

These log messages are all from the example running in jetty, your log 
format may vary.  What I'm particularly interested in is the timestamps on 
these log messages so if you can turn on millisecond time resolution that 
would be best ... i want to see when exactly the delete/commit/add/commit 
commands are getting executed.
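
One way to pull that ordering out of a log is sketched below in Python. It assumes the two-line format shown in the samples above (a timestamp/class/method line followed by an "INFO: {...}" line) and one-second resolution; the function name update_events and both regular expressions are illustrative only, and a different log format would need different patterns.

```python
import re
from datetime import datetime

# Header line, e.g.
# "Jul 4, 2009 12:46:30 PM org.apache.solr.update.processor.LogUpdateProcessor finish"
HEADER = re.compile(
    r"^(\w{3} \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M) (\S+) (\S+)\s*$"
)
# Message line, e.g. "INFO: {deleteByQuery=name:foo} 0 12"
UPDATE = re.compile(r"^INFO: \{(\w*)")


def update_events(lines):
    """Return (timestamp, kind) pairs for LogUpdateProcessor messages, in log order.

    `kind` is the first key inside the braces ("deleteByQuery", "commit",
    "add", ...) or "unknown" when the braces are empty.
    """
    events = []
    stamp = None
    for line in lines:
        header = HEADER.match(line)
        if header:
            # Only remember the timestamp if the next line belongs to
            # LogUpdateProcessor; other loggers are skipped.
            if header.group(2).endswith("LogUpdateProcessor"):
                stamp = datetime.strptime(header.group(1), "%b %d, %Y %I:%M:%S %p")
            else:
                stamp = None
            continue
        message = UPDATE.match(line)
        if message and stamp is not None:
            events.append((stamp, message.group(1) or "unknown"))
        stamp = None
    return events
```

Printing the returned pairs for a problematic run would show whether the delete, commit, add, and commit entries appear in the expected order.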




-Hoss