Posted to solr-user@lucene.apache.org by Shawn Heisey <so...@elyograg.org> on 2011/10/20 19:00:56 UTC

Query/Delete performance difference between straight HTTP and SolrJ

I've got two build systems for my Solr index that I wrote.  The first 
one is in Perl and uses GET/POST requests via HTTP, the second is in 
Java using SolrJ.  I've noticed a performance discrepancy when 
processing every one of my delete records, currently about 25000 of 
them.  It takes about 5 seconds in Perl and a minute or more via SolrJ.  
In the Perl system, I do a full delete like this once an hour.  The 
performance impact of doing it once an hour in the SolrJ version has 
forced me to do it only once per day.  The normal delete process in both 
cases looks for new delete records and removes just those.  It happens 
every two minutes in the Perl program and every minute in the Java program.

What might be causing the problem, and what information can I collect to 
help with the diagnosis?  Here's a start:

In both systems, the deletes are done by breaking the list of deleted 
IDs into smaller chunks, doing a query for each chunk, and, if results 
are found, issuing a delete for the same query.  This entire process is 
completed sequentially on all seven shards.  In the Perl system, it's 
done using HTTP POST calls, the query using /select and the delete using 
/update and deleteByQuery.  The query looks like what's below, only a lot 
longer.  The did field is a tlong:

did:(281472047+OR+281472023+OR+281472022+OR+281472021+OR+281472020+OR+281472019
+OR+281472018+OR+281472017+OR+281472016+OR+276514457+OR+281472031+OR ... )

In the Perl system, I limit each query to 1024 values, my 
maxBooleanClauses.  In the SolrJ system, I have limited it to an even 
1000.  By adding detailed logging to the Java program, I have determined 
that it is the query part that's slow, not the delete part.  Each chunk 
of 1000 takes a few seconds to return results, and most of those queries 
come back with numFound=0.  In the Perl program a commit is executed for 
each shard, but waitSearcher and waitFlush are set to false.  In the 
SolrJ program, the commit happens later in the code and is not counted.
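
For the curious, the chunking logic amounts to something like the sketch 
below (simplified Java, not my exact production code; the method name is 
invented here):

     import java.util.List;

     /**
      * Build one chunk of the delete query, e.g. did:(1 OR 2 OR 3).
      * Simplified sketch of the logic described above.
      */
     public static String buildDidQuery(List<Long> didValues, int offset,
             int chunkSize)
     {
         StringBuilder sb = new StringBuilder("did:(");
         int end = Math.min(offset + chunkSize, didValues.size());
         for (int i = offset; i < end; i++)
         {
             if (i > offset)
             {
                 sb.append(" OR ");
             }
             sb.append(didValues.get(i));
         }
         sb.append(')');
         return sb.toString();
     }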

There are some differences in the Solr implementation that each build 
system talks to.  The Java program talks to a pair of Solr 3.4.0 servers 
running CentOS 6 (ext4).  The Perl program talks to a pair of Xen/CentOS 
5 (ext3) machines (identical hardware) that each host a set of CentOS 5 
virtual machines running Solr 3.2.0.  On CentOS 6, each server houses 
three of the large shards in separate cores, and one of them also hosts 
the seventh, smaller shard.  In the VM environment, the distribution is 
the same, except that each shard lives in a virtual machine with its own 
Solr instance.

The CentOS 6 machines have 32GB of RAM.  The physical hosts for the VM 
environment have recently been upgraded to 64GB and each VM's memory 
reservation increased, but I can tell you for sure that even when they 
only had 32GB, the full delete still took only a few seconds in the 
Perl program.

Logfiles below.  Something to note: At 3:33:42, the idx_delete table was 
trimmed from over 40000 entries to just under 25000 - so the Perl code 
handled a lot more entries, but did it MUCH faster.  I've changed the 
company name to REDACTED in the SolrJ logs, but otherwise left them alone.

LOG: Thu Oct 20 03:03:01 2011: /------====== Start 03:03, wday:4, yday:292
LOG: Thu Oct 20 03:03:01 2011: There are 40170 entries in idx_delete
DBG: Thu Oct 20 03:03:02 2011: MAX(id) is 12998586
LOG: Thu Oct 20 03:03:02 2011: This run - 40170 entries in idx_delete
LOG: Thu Oct 20 03:03:02 2011: Retrieved 40170 entries for deletion
LOG: Thu Oct 20 03:03:03 2011: Shard 0: skipped
DBG: Thu Oct 20 03:03:03 2011: Shard 1: delete ok
LOG: Thu Oct 20 03:03:03 2011: Shard 1: committed
DBG: Thu Oct 20 03:03:04 2011: Shard 2: delete ok
LOG: Thu Oct 20 03:03:04 2011: Shard 2: committed
DBG: Thu Oct 20 03:03:05 2011: Shard 3: delete ok
LOG: Thu Oct 20 03:03:05 2011: Shard 3: committed
LOG: Thu Oct 20 03:03:06 2011: Shard 4: skipped
LOG: Thu Oct 20 03:03:07 2011: Shard 5: skipped
LOG: Thu Oct 20 03:03:07 2011: Shard inc: skipped
LOG: Thu Oct 20 03:03:10 2011: Erased flatDelete file
LOG: Thu Oct 20 03:03:10 2011: Wrote 40170 entries to flatDelete
LOG: Thu Oct 20 03:03:10 2011: \------======  End  03:03

Oct 20, 2011 3:34:00 AM com.REDACTED.idxbuild.Main updateChain
INFO: /---- Running update on chain a
Oct 20, 2011 3:34:00 AM com.REDACTED.idxbuild.solr.IdxChain buildQuery
INFO: chain.a: buildQuery: SELECT * FROM idx_delete WHERE (id > 0 AND id 
<= 12998597)
Oct 20, 2011 3:34:02 AM com.REDACTED.idxbuild.REDACTEDChain doDelete
INFO: chain.a: deleted up to 1000 docs
Oct 20, 2011 3:34:05 AM com.REDACTED.idxbuild.REDACTEDChain doDelete
INFO: chain.a: deleted up to 1000 docs
Oct 20, 2011 3:34:07 AM com.REDACTED.idxbuild.REDACTEDChain doDelete
INFO: chain.a: deleted up to 1000 docs
Oct 20, 2011 3:34:09 AM com.REDACTED.idxbuild.REDACTEDChain doDelete
INFO: chain.a: deleted up to 1000 docs
Oct 20, 2011 3:34:11 AM com.REDACTED.idxbuild.REDACTEDChain doDelete
INFO: chain.a: deleted up to 1000 docs
Oct 20, 2011 3:34:14 AM com.REDACTED.idxbuild.REDACTEDChain doDelete
INFO: chain.a: deleted up to 1000 docs
Oct 20, 2011 3:34:16 AM com.REDACTED.idxbuild.REDACTEDChain doDelete
INFO: chain.a: deleted up to 1000 docs
Oct 20, 2011 3:34:18 AM com.REDACTED.idxbuild.REDACTEDChain doDelete
INFO: chain.a: deleted up to 1000 docs
Oct 20, 2011 3:34:20 AM com.REDACTED.idxbuild.REDACTEDChain doDelete
INFO: chain.a: deleted up to 1000 docs
Oct 20, 2011 3:34:23 AM com.REDACTED.idxbuild.REDACTEDChain doDelete
INFO: chain.a: deleted up to 1000 docs
Oct 20, 2011 3:34:25 AM com.REDACTED.idxbuild.REDACTEDChain doDelete
INFO: chain.a: deleted up to 1000 docs
Oct 20, 2011 3:34:27 AM com.REDACTED.idxbuild.REDACTEDChain doDelete
INFO: chain.a: deleted up to 1000 docs
Oct 20, 2011 3:34:29 AM com.REDACTED.idxbuild.REDACTEDChain doDelete
INFO: chain.a: deleted up to 1000 docs
Oct 20, 2011 3:34:32 AM com.REDACTED.idxbuild.REDACTEDChain doDelete
INFO: chain.a: deleted up to 1000 docs
Oct 20, 2011 3:34:34 AM com.REDACTED.idxbuild.REDACTEDChain doDelete
INFO: chain.a: deleted up to 1000 docs
Oct 20, 2011 3:34:36 AM com.REDACTED.idxbuild.REDACTEDChain doDelete
INFO: chain.a: deleted up to 1000 docs
Oct 20, 2011 3:34:38 AM com.REDACTED.idxbuild.REDACTEDChain doDelete
INFO: chain.a: deleted up to 1000 docs
Oct 20, 2011 3:34:40 AM com.REDACTED.idxbuild.REDACTEDChain doDelete
INFO: chain.a: deleted up to 1000 docs
Oct 20, 2011 3:34:43 AM com.REDACTED.idxbuild.REDACTEDChain doDelete
INFO: chain.a: deleted up to 1000 docs
Oct 20, 2011 3:34:45 AM com.REDACTED.idxbuild.REDACTEDChain doDelete
INFO: chain.a: deleted up to 1000 docs
Oct 20, 2011 3:34:47 AM com.REDACTED.idxbuild.REDACTEDChain doDelete
INFO: chain.a: deleted up to 1000 docs
Oct 20, 2011 3:34:50 AM com.REDACTED.idxbuild.REDACTEDChain doDelete
INFO: chain.a: deleted up to 1000 docs
Oct 20, 2011 3:34:52 AM com.REDACTED.idxbuild.REDACTEDChain doDelete
INFO: chain.a: deleted up to 1000 docs
Oct 20, 2011 3:34:55 AM com.REDACTED.idxbuild.REDACTEDChain doDelete
INFO: chain.a: deleted up to 1000 docs
Oct 20, 2011 3:34:57 AM com.REDACTED.idxbuild.REDACTEDChain doDelete
INFO: chain.a: deleted up to 920 docs

Thanks,
Shawn


Re: Query/Delete performance difference between straight HTTP and SolrJ

Posted by Shawn Heisey <so...@elyograg.org>.
On 10/26/2011 10:29 AM, Shawn Heisey wrote:
> One possible thing I can do to make the Java code even faster is to 
> set rows to zero before doing the query, since I only need numFound, 
> not the actual results.  The Perl code does NOT do this, and yet it's 
> super fast.

It turns out I already thought of this and I DO set rows to zero.

     private static final String SOLR_QT = "qt";
     private static final String SOLR_ROWS = "rows";
     private static final String NO_STATS_QUERY_TYPE = "lbcheck";

...

     /**
      * Get the count of all documents matching a query.
      *
      * @param query the query
      * @return the number of matching documents
      * @throws IdxException
      */
     public long getCount(String query) throws IdxException
     {
         SolrQuery sq = new SolrQuery();
         sq.setParam(SOLR_QT, NO_STATS_QUERY_TYPE);
         sq.setParam(SOLR_ROWS, "0");
         sq.setQuery(query);
         QueryResponse qr = null;
         try
         {
             qr = _solrCore.query(sq);
         }
         catch (Exception e)
         {
             throw new IdxException("Query '" + query + "' failed on "
                     + _prefix + _name, e);
         }
         if (qr == null)
         {
             throw new IdxException("Count for '" + query + "' failed on "
                     + _prefix);
         }
         else
         {
             long numFound = qr.getResults().getNumFound();
             int qTime = qr.getQTime();
             LOG.info(_prefix + _name + ": query QTime=" + qTime
                     + ",numFound=" + numFound);
             return numFound;
         }
     }

And since someone might ask how I actually do the delete, see below.

     /**
      * Delete by query.
      *
      * @param query the query to delete by
      * @throws IdxException
      */
     public void deleteByQuery(String query) throws IdxException
     {
         if (getCount(query) > 0)
         {
             try
             {
                 UpdateResponse ur = _solrCore.deleteByQuery(query);
                 LOG.info(_prefix + _name + ": done deleting " + ur);
                 _needsCommit = true;
             }
             catch (Exception e)
             {
                 throw new IdxException("deleteByQuery failed on " + _prefix
                         + _name, e);
             }
         }
     }



Re: Query/Delete performance difference between straight HTTP and SolrJ

Posted by Shawn Heisey <so...@elyograg.org>.
On 10/27/2011 1:36 AM, Michael Kuhlmann wrote:
> Why do you first query for these documents? Why don't you just delete 
> them? Solr won't complain if no documents are affected by your delete 
> query, and you'll get the number of affected documents in your 
> response anyway. When deleting, Solrj nearly does nothing on its own, 
> it just sends the POST request and analyzes the simple response. The 
> behaviour in a get request is similar. We do thousands of update, 
> delete and get requests per minute using Solrj without problems, so your 
> timing problems must come from somewhere else. -Kuli 

When you do a delete blind, you have to follow it up with a commit.  On 
my larger shards, which contain data older than approximately one week, 
a commit is resource intensive and takes 10 to 30 seconds.  As much as 
75% of the time, there are no updates to my larger shards (10.7 million 
records each); most of the activity happens on the small shard with the 
newest data (usually under 500000 records), which I call the 
incremental.  On almost every update run, there are changes to the 
incremental, but doing a commit on that shard rarely takes more than a 
second or two.

The long commit times on the larger indexes are a result of cache 
warming, and almost all of that time is spent warming the filter cache.  
The answer to the next obvious question: autowarmCount=4 on that cache, 
with a maximum size of 64.  We are working as fast as we can on reducing 
the complexity and size of our filter queries, but it will require 
significant changes in our application.

Thanks,
Shawn


Re: Query/Delete performance difference between straight HTTP and SolrJ

Posted by Michael Kuhlmann <ku...@solarier.de>.
Sorry, I was wrong.

On 10/27/2011 09:36, Michael Kuhlmann wrote:
> and you'll get the number of affected documents in your response anyway.

That's not true; you don't get the affected document count. It's still
true, though, that you don't need to check for documents first, at least
as long as you don't need that count somewhere else.

-Kuli

Re: Query/Delete performance difference between straight HTTP and SolrJ

Posted by Michael Kuhlmann <ku...@solarier.de>.
On 10/26/2011 18:29, Shawn Heisey wrote:
> For inserting, I do use a Collection of SolrInputDocuments.  The delete
> process grabs values from idx_delete, does a query like the above (the
> part that's slow in Java), then if any documents are found, issues a
> deleteByQuery with the same string.

Why do you first query for these documents? Why don't you just delete
them? Solr won't complain if no documents are affected by your delete query,
and you'll get the number of affected documents in your response anyway.

When deleting, Solrj nearly does nothing on its own, it just sends the
POST request and analyzes the simple response. The behaviour in a get
request is similar. We do thousands of update, delete and get requests
per minute using Solrj without problems, so your timing problems must come
from somewhere else.
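
In Solrj that amounts to almost nothing (a rough sketch, assuming an
already-created SolrServer named "server"):

     import java.io.IOException;
     import org.apache.solr.client.solrj.SolrServer;
     import org.apache.solr.client.solrj.SolrServerException;

     // Rough sketch: no count query first, just the delete itself.
     // The delete is harmless when nothing matches the query.
     public void deleteBlind(SolrServer server, String query)
             throws SolrServerException, IOException
     {
         server.deleteByQuery(query);
         // commit once at the end of the whole run, not per chunk
     }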

-Kuli

Re: Query/Delete performance difference between straight HTTP and SolrJ

Posted by Shawn Heisey <so...@elyograg.org>.
On 10/26/2011 1:30 AM, Michael Kuhlmann wrote:
> Hi,
>
> On 10/25/2011 23:53, Shawn Heisey wrote:
>> On 10/20/2011 11:00 AM, Shawn Heisey wrote:
>>> [...] I've noticed a performance discrepancy when
>>> processing every one of my delete records, currently about 25000 of
>>> them.
> I didn't understand what a delete record is. Do you delete records in
> Solr? This shouldn't be done using records (what is a record in this
> case? A document?); use a query for that.
>
> Or do you add documents that you call delete records?

A record is an entry in the idx_delete table in the database.  When 
something is deleted from the main database table, there's a trigger 
that inserts its did value (for document id), one of the unique IDs we 
have on each document, into idx_delete.  The build system uses this 
table to process deletes from Solr.  Please see the first message in 
this thread for full details.

>> I've managed to make this somewhat better by using multiple threads to
>> do all the deletes on the six large static indexes at once, but that
>> shouldn't be required.  The Perl version doesn't do them at the same time.
> Are you sure? I don't know about the Perl client, but maybe it's doing
> the network operation in the background?
>
> In a single-thread environment, the client has to wait, doing nothing,
> until each request has been completely sent to the server.
> Multiple threads can help you a lot here.
>
> You can check this by monitoring your client's CPU load.

The Perl programs use LWP::Simple and LWP::Simple::Post and have no 
threading or process forking of any kind.  I am not using a Perl/Solr 
API, I construct the URLs myself from saved templates and send them as a 
browser would.  I'll check and see if my superiors will let me post my 
code publicly.  If not, I may be able to redact it a bit and send it 
unicast to an interested party.

>> 10:27<  cedrichurst>  the only difference i could see is deserializing
>> the java binary object
> This is true, but only in theory. Serializing and deserializing is so
> fast that it shouldn't have an impact.
>
> If you really want to be sure, use a SolrInputDocument instead of
> annotated classes when sending documents, but as I said, this shouldn't
> matter much.
>
> What's more important: Don't send single documents but rather use
> add(Collection) with multiple documents at once. At least if I
> understood you correctly, you want to send 25000 documents for update...

This is not for *adding* documents.  It's for making a query that looks 
like the following, with up to 1000 clauses instead of four:

did:(1 OR 2 OR 3 OR 4)

For inserting, I do use a Collection of SolrInputDocuments.  The delete 
process grabs values from idx_delete, does a query like the above (the 
part that's slow in Java), then if any documents are found, issues a 
deleteByQuery with the same string.  The Perl code uses a POST request 
for both the query and the delete, text/xml for the latter.

One possible thing I can do to make the Java code even faster is to set 
rows to zero before doing the query, since I only need numFound, not the 
actual results.  The Perl code does NOT do this, and yet it's super fast.

Any other ideas?

Thanks,
Shawn


Re: Query/Delete performance difference between straight HTTP and SolrJ

Posted by Michael Kuhlmann <ku...@solarier.de>.
Hi,

On 10/25/2011 23:53, Shawn Heisey wrote:
> On 10/20/2011 11:00 AM, Shawn Heisey wrote:
>> [...] I've noticed a performance discrepancy when
>> processing every one of my delete records, currently about 25000 of
>> them.

I didn't understand what a delete record is. Do you delete records in
Solr? This shouldn't be done using records (what is a record in this
case? A document?); use a query for that.

Or do you add documents that you call delete records?

> I've managed to make this somewhat better by using multiple threads to
> do all the deletes on the six large static indexes at once, but that
> shouldn't be required.  The Perl version doesn't do them at the same time.

Are you sure? I don't know about the Perl client, but maybe it's doing
the network operation in the background?

In a single-thread environment, the client has to wait, doing nothing,
until each request has been completely sent to the server.
Multiple threads can help you a lot here.

You can check this by monitoring your client's CPU load.

> 10:27 < cedrichurst> the only difference i could see is deserializing
> the java binary object

This is true, but only in theory. Serializing and deserializing is so
fast that it shouldn't have an impact.

If you really want to be sure, use a SolrInputDocument instead of
annotated classes when sending documents, but as I said, this shouldn't
matter much.

What's more important: Don't send single documents but rather use
add(Collection) with multiple documents at once. At least if I
understood you correctly, you want to send 25000 documents for update...
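
A quick sketch of what I mean (the "did" field name is just taken from
your example):

     import java.util.ArrayList;
     import java.util.Collection;
     import java.util.List;
     import org.apache.solr.client.solrj.SolrServer;
     import org.apache.solr.common.SolrInputDocument;

     // Sketch: collect the documents first, then send them all in one
     // request instead of one request per document.
     public void addBatch(SolrServer server, List<Long> ids) throws Exception
     {
         Collection<SolrInputDocument> docs =
                 new ArrayList<SolrInputDocument>();
         for (Long id : ids)
         {
             SolrInputDocument doc = new SolrInputDocument();
             doc.addField("did", id);
             docs.add(doc);
         }
         server.add(docs); // one HTTP request for the whole batch
     }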


-Kuli

Re: Query/Delete performance difference between straight HTTP and SolrJ

Posted by Shawn Heisey <so...@elyograg.org>.
On 10/27/2011 5:56 AM, Michael Sokolov wrote:
> From everything you've said, it certainly sounds like a low-level I/O 
> problem in the client, not a server slowdown of any sort.  Maybe Perl 
> is using the same connection over and over (keep-alive) and Java is 
> not.  I really don't know.  One thing I've heard is that 
> StreamingUpdateSolrServer (I think that's what it's called) can give 
> better throughput for large request batches.  If you're not using 
> that, you may be having problems w/closing and re-opening connections?

Although I can't claim to know for sure, I'm fairly sure that the simple 
LWP classes I'm using don't do keepalive unless you specifically 
configure the user agent to do so.  I'll look into it some more.

The StreamingUpdateSolrServer documentation says it is only recommended 
for use with the /update handler, not for queries.  I'm not having a 
problem with the deletes themselves; they go pretty fast.  It's all of 
the queries before each delete that are relatively slow, and doing those 
queries really adds up.  With multithreading, it does all the shards at 
once, but it can still only query for a limited number of values at a 
time due to maxBooleanClauses.  Now I'm checking and deleting 1000 
values at a time, on all shards simultaneously.  I use 
CommonsHttpSolrServer, and each of those objects is created only once, 
when the program first starts up.
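
In case connection handling does turn out to matter: I believe 
CommonsHttpSolrServer pools connections on its own by default, but if I 
wanted to force the issue, something like this untested sketch (URL 
invented) would share one pooled HttpClient across all of the shard 
objects:

     import java.net.MalformedURLException;
     import org.apache.commons.httpclient.HttpClient;
     import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
     import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

     // Untested sketch: one shared, pooled HttpClient so every shard's
     // CommonsHttpSolrServer can reuse connections (keep-alive).
     public class ServerFactory
     {
         private final HttpClient sharedClient =
                 new HttpClient(new MultiThreadedHttpConnectionManager());

         public CommonsHttpSolrServer makeShardServer(String url)
                 throws MalformedURLException
         {
             return new CommonsHttpSolrServer(url, sharedClient);
         }
     }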

I figure there are three possibilities:

1) A glaring inefficiency in CommonsHttpSolrServer queries as compared 
to a straight HTTP POST request.
2) The compartmentalization provided by the virtual machine architecture 
creates an odd synergy that is not present when there are only two Solr 
instances on physical machines instead of eight of them (seven shards 
plus a search broker) on virtual machines.
3) The extra physical memory on the servers with virtualization is 
granting more of a disk-cache-related performance improvement than the 
lack of virtualization on the others.

Only the first of those possible problems is something that can be 
determined or fixed without migrating the other servers to my new 
system.  I'm having one other problem with the new build program.  I 
haven't figured out exactly what that problem is, so I am very reluctant 
to switch everything over.  So far it seems to be related to the MySQL 
JDBC connector or my attempt at threading, not Solr.

I mentioned that the hardware is identical except for memory.  That's 
not quite true - the servers accessed by the Java program are better.  
One of them has a slightly faster CPU than its counterpart with 
virtualization, and they all have 1TB hard drives as opposed to the 
mixed 500GB & 750GB drives in the other servers.  All of the servers are 
Dell 2950s with six-drive RAID10 arrays.



Re: Query/Delete performance difference between straight HTTP and SolrJ

Posted by Shawn Heisey <so...@elyograg.org>.
On 10/27/2011 5:56 AM, Michael Sokolov wrote:
> From everything you've said, it certainly sounds like a low-level I/O 
> problem in the client, not a server slowdown of any sort.  Maybe Perl 
> is using the same connection over and over (keep-alive) and Java is 
> not.  I really don't know.  One thing I've heard is that 
> StreamingUpdateSolrServer (I think that's what it's called) can give 
> better throughput for large request batches.  If you're not using 
> that, you may be having problems w/closing and re-opening connections?

I turned off the perl build system and had the Java program take over 
full build duties for both index chains.  It's been designed so one copy 
of the program can keep any number of index chains up to date 
simultaneously.

On the most recent hourly run, the servers without virtualization took 
50 seconds, while the servers with virtualization and more memory took 
only 16 seconds.  It looks like this problem has nothing to do with 
SolrJ; it's due to the 1000-clause queries actually taking a long time 
to execute.  The 16 second runtime is still longer than the last run by 
the Perl program (12 seconds), but I am also executing an index rebuild 
in the build cores on those servers, so I'm not overly concerned by that.

At this point there isn't any way for me to know whether the speedup 
with the old server builds is due to the extra memory (OS disk cache) or 
due to some quirk of virtualization.  I'm really hoping it's due to the 
extra memory, because I really don't want to go back to a virtualized 
environment.  I'll be able to figure it out after I eliminate my current 
bug and complete the migration.

Thank you very much to everyone who offered assistance.  It helped me 
make sure my testing was as unbiased as I could achieve.

Shawn


Re: Query/Delete performance difference between straight HTTP and SolrJ

Posted by Michael Sokolov <so...@ifactory.com>.
 From everything you've said, it certainly sounds like a low-level I/O 
problem in the client, not a server slowdown of any sort.  Maybe Perl is 
using the same connection over and over (keep-alive) and Java is not.  I 
really don't know.  One thing I've heard is that 
StreamingUpdateSolrServer (I think that's what it's called) can give 
better throughput for large request batches.  If you're not using that, 
you may be having problems w/closing and re-opening connections?

-Mike

On 10/26/2011 9:56 PM, Shawn Heisey wrote:
> On 10/26/2011 6:16 PM, Michael Sokolov wrote:
>> Have you checked to see when you are committing?  Is the pattern the 
>> same in both instances?  If you are committing after each delete 
>> request in Java, but not in Perl, that could slow things down.
>
> Due to the multithreading of delete requests, I now have the full 
> delete down to 10-15 seconds instead of a minute or more.  This is now 
> an acceptable time, but I am completely mystified as to why the Perl 
> code can do it without multithreading just as fast, and often faster.  
> The Java code is long-running, and the Perl code is started by cron.  
> If you look back to the first message on the thread, you'll see commit 
> messages in the Perl log, but those commits are done with the wait 
> options set to false.  That's an extra step the Java code isn't doing 
> - and it's STILL faster.


Re: Query/Delete performance difference between straight HTTP and SolrJ

Posted by Shawn Heisey <so...@elyograg.org>.
On 10/26/2011 6:16 PM, Michael Sokolov wrote:
> Have you checked to see when you are committing?  Is the pattern the 
> same in both instances?  If you are committing after each delete 
> request in Java, but not in Perl, that could slow things down.

The commit happens separately, not during the delete process.  The Java 
logs I pasted did not include the other things that happen afterwards, or 
the commit, which can take another 10-30 seconds.

Here's the outer-level code that does the full update cycle.  It does 
deletes, reinserts (documents that have been changed), and inserts (new 
content), then a commit.  The innermost commit method (called from the 
code below through a couple of object levels) emits log messages of its 
own, and those indicate that no commits happen until after everything 
else is done.

     /**
      * Do all the updates.
      *
      * @throws IdxException
      */
     public synchronized void updateIndex(boolean fullUpdate,
             boolean useBuildCore) throws IdxException
     {
         refreshFlags();
         if (fullUpdate)
         {
             _fullDelete = true;
             _fullReinsert = true;
         }

         if (_dailyOptimizeStarted)
         {
             LOG.info(_lp
                     + "Skipping delete and reinsert - optimization underway.");
         }
         else
         {
             doDelete(_fullDelete, useBuildCore);
             doReinsert(_fullReinsert, useBuildCore);
             turnOffFullUpdate();
         }
         doInsert(useBuildCore);
         doCommit(useBuildCore);
     }

Due to the multithreading of delete requests, I now have the full delete 
down to 10-15 seconds instead of a minute or more.  This is now an 
acceptable time, but I am completely mystified as to why the Perl code 
can do it without multithreading just as fast, and often faster.  The 
Java code is long-running, and the Perl code is started by cron.  If you 
look back to the first message on the thread, you'll see commit messages 
in the Perl log, but those commits are done with the wait options set to 
false.  That's an extra step the Java code isn't doing - and it's STILL 
faster.
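
For what it's worth, SolrJ does have a commit variant that takes the 
same wait options; a one-line sketch, using the same _solrCore object as 
my earlier snippets:

     // Sketch: commit without waiting for the flush or the new searcher,
     // which is what the Perl code asks for over HTTP.
     _solrCore.commit(false, false); // waitFlush=false, waitSearcher=false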

Thanks,
Shawn


Re: Query/Delete performance difference between straight HTTP and SolrJ

Posted by Michael Sokolov <so...@ifactory.com>.
Have you checked to see when you are committing?  Is the pattern the 
same in both instances?  If you are committing after each delete request 
in Java, but not in Perl, that could slow things down.

On 10/25/2011 5:53 PM, Shawn Heisey wrote:
> On 10/20/2011 11:00 AM, Shawn Heisey wrote:
>> I've got two build systems for my Solr index that I wrote.  The first 
>> one is in Perl and uses GET/POST requests via HTTP, the second is in 
>> Java using SolrJ.  I've noticed a performance discrepancy when 
>> processing every one of my delete records, currently about 25000 of 
>> them.  It takes about 5 seconds in Perl and a minute or more via 
>> SolrJ.  In the Perl system, I do a full delete like this once an 
>> hour.  The performance impact of doing it once an hour in the SolrJ 
>> version has forced me to do it only once per day.  The normal delete 
>> process in both cases looks for new delete records and removes just 
>> those.  It happens every two minutes in the Perl program and every 
>> minute in the Java program. 


Re: Query/Delete performance difference between straight HTTP and SolrJ

Posted by Shawn Heisey <so...@elyograg.org>.
On 10/20/2011 11:00 AM, Shawn Heisey wrote:
> I've got two build systems for my Solr index that I wrote.  The first 
> one is in Perl and uses GET/POST requests via HTTP, the second is in 
> Java using SolrJ.  I've noticed a performance discrepancy when 
> processing every one of my delete records, currently about 25000 of 
> them.  It takes about 5 seconds in Perl and a minute or more via 
> SolrJ.  In the Perl system, I do a full delete like this once an 
> hour.  The performance impact of doing it once an hour in the SolrJ 
> version has forced me to do it only once per day.  The normal delete 
> process in both cases looks for new delete records and removes just 
> those.  It happens every two minutes in the Perl program and every 
> minute in the Java program.

I've managed to make this somewhat better by using multiple threads to 
do all the deletes on the six large static indexes at once, but that 
shouldn't be required.  The Perl version doesn't do them at the same time.
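
The threading itself is nothing fancy.  Stripped way down, it looks 
roughly like this - a sketch only, where IdxShard stands in for my real 
per-shard wrapper objects and the real code has more error handling:

     import java.util.ArrayList;
     import java.util.List;
     import java.util.concurrent.Callable;
     import java.util.concurrent.ExecutorService;
     import java.util.concurrent.Executors;
     import java.util.concurrent.Future;

     // Sketch: run the same chunked delete on every shard at once.
     public void deleteOnAllShards(List<IdxShard> shards, final String query)
             throws Exception
     {
         ExecutorService pool = Executors.newFixedThreadPool(shards.size());
         List<Future<Void>> futures = new ArrayList<Future<Void>>();
         for (final IdxShard shard : shards)
         {
             futures.add(pool.submit(new Callable<Void>()
             {
                 public Void call() throws Exception
                 {
                     shard.deleteByQuery(query);
                     return null;
                 }
             }));
         }
         for (Future<Void> f : futures)
         {
             f.get(); // wait for every shard to finish
         }
         pool.shutdown();
     }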

I asked on the #solr IRC channel.  Only one person responded, and he 
didn't really know how to help me.  He did say one thing that intrigues me:

10:27 < cedrichurst> the only difference i could see is deserializing 
the java binary object

Any thoughts from anyone else?  If deserializing is slow, is there any 
way to avoid it or speed it up?

Thanks,
Shawn