Posted to solr-user@lucene.apache.org by Robert Stewart <bs...@gmail.com> on 2012/11/28 19:02:11 UTC

Indexing performance with solrj vs. direct lucene API

I have a project where I am porting an existing application from direct
Lucene API usage to SOLR and the SOLRJ client API.

The problem I have is that indexing is 2-5x slower using SOLRJ+SOLR
than using direct Lucene API.

I am sending batches of between 200 and 500 documents per call to
add() using SOLRJ.

I tried adjusting SOLR parameters for indexing, but it did not make any
difference.

Documents are identical (same fields) in both cases.

The settings for tokenizing/analyzing/indexing/storing are nearly
identical for each field in Lucene and SOLR.

What could be the possible bottleneck in this case?  Can there be
significant overhead in unpacking a batch of documents in the request?
Is there some SOLR overhead in the update handler?

I have tried both SOLR 3.6 and 4.0 with very similar results.

When using SOLR 4.0 I have transaction logging (for NRT search) turned off.

I am also NOT using a unique ID field.

Performance for indexing 200 documents is around 250ms on SOLR, about
60ms on Lucene.

The response time measured around the call to the SOLRJ add() method
and the response time logged in the SOLR log are nearly the same, so
there is very little network overhead in this case.

Is this a typical amount of overhead for SOLRJ+SOLR vs. the local Lucene API?

The reason it matters is that the application needs to rebuild the
index once per day, which currently takes about 45 minutes.  Using
SOLRJ+SOLR it will take several hours, which is a show stopper in this
case.

Thanks.
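
As a rough check on the figures above, the per-batch timings (250ms vs.
60ms for 200 documents) can be scaled up to the daily rebuild. This is
only a back-of-the-envelope sketch: it assumes rebuild time scales
linearly with per-batch indexing time, and the class and method names
are hypothetical, not taken from the original post.

```java
public class RebuildEstimate {
    /** Slowdown factor of the slower path relative to the faster one, from per-batch timings. */
    public static double slowdown(double slowMsPerBatch, double fastMsPerBatch) {
        return slowMsPerBatch / fastMsPerBatch;
    }

    /** Projected rebuild time, assuming it scales linearly with per-batch time (an assumption). */
    public static double projectedMinutes(double currentMinutes, double slowdownFactor) {
        return currentMinutes * slowdownFactor;
    }

    public static void main(String[] args) {
        // Figures quoted in the post: 250ms (SOLR) vs. 60ms (Lucene) per 200 documents,
        // and a 45 minute rebuild with direct Lucene.
        double factor = slowdown(250.0, 60.0);
        double minutes = projectedMinutes(45.0, factor);
        System.out.printf("slowdown: %.2fx, projected rebuild: %.1f hours%n",
                factor, minutes / 60.0);
    }
}
```

Under that assumption the slowdown is a little over 4x and the rebuild
grows from 45 minutes to roughly three hours, which is consistent with
the "2-5x" and "several hours" estimates.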

Re: Indexing performance with solrj vs. direct lucene API

Posted by Mark Bennett <mb...@ideaeng.com>.

Hi Robert,

SolrJ is sending data over a socket, so that might explain some of the lag.
Are your SolrJ app and the Solr server running on the same physical
machine?

I thought Mark M's idea sounded good.

One other idea:

When initializing SolrJ's connection for normal searching you probably use
HttpSolrServer.

But when doing massive updates, you might consider using
ConcurrentUpdateSolrServer instead.
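
To illustrate why that might help: ConcurrentUpdateSolrServer buffers
added documents in a queue and streams them to Solr from background
threads, so the indexing loop does not wait on every HTTP round trip.
The sketch below is NOT SolrJ code. It is a plain-Java illustration of
that queue-and-workers pattern, and QueuedBatchSender, its parameters,
and the sender callback are hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical illustration of a buffered, multi-threaded sender; not part of SolrJ. */
public class QueuedBatchSender<T> {
    private final BlockingQueue<T> queue;
    private final List<Thread> workers = new ArrayList<>();
    private volatile boolean closed = false;

    /** batchSize must be at least 1; sender stands in for the network call. */
    public QueuedBatchSender(int queueSize, int threadCount, int batchSize,
                             Consumer<List<T>> sender) {
        this.queue = new ArrayBlockingQueue<>(queueSize);
        for (int i = 0; i < threadCount; i++) {
            Thread t = new Thread(() -> drainLoop(batchSize, sender));
            t.start();
            workers.add(t);
        }
    }

    /** Blocks only while the queue is full, so the producer rarely waits on the sender. */
    public void add(T doc) throws InterruptedException {
        queue.put(doc);
    }

    /** Each worker pulls up to batchSize queued items and hands them to the sender. */
    private void drainLoop(int batchSize, Consumer<List<T>> sender) {
        try {
            while (!closed || !queue.isEmpty()) {
                T first = queue.poll(50, TimeUnit.MILLISECONDS);
                if (first == null) {
                    continue; // nothing queued yet; re-check the exit condition
                }
                List<T> batch = new ArrayList<>(batchSize);
                batch.add(first);
                queue.drainTo(batch, batchSize - 1);
                sender.accept(batch);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Stops accepting work and waits until everything already queued has been sent. */
    public void close() throws InterruptedException {
        closed = true;
        for (Thread t : workers) {
            t.join();
        }
    }
}
```

In SolrJ 4.0 the real class takes the Solr URL, a queue size, and a
thread count in its constructor; good values for the last two depend
on the hardware and are worth measuring rather than guessing.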

--
Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513


Re: Indexing performance with solrj vs. direct lucene API

Posted by Mark Miller <ma...@gmail.com>.

One difference is that Solr will call update rather than add by
default. If you are willing to ensure unique IDs yourself, you can specify
overwrite=false (I think that's the one) and it will use add instead.

- Mark
