Posted to solr-user@lucene.apache.org by "Pisarev, Vitaliy" <vi...@hp.com> on 2014/02/12 15:52:01 UTC

Solr performance with commitWithin seems too good to be true. I am afraid I am missing something

I am running a very simple performance experiment in which I post 2000 documents to my application, which in turn persists them to a relational DB and sends them to Solr for indexing (synchronously, in the same request).
I am testing 3 use cases:

  1.  No indexing at all: ~45 sec to post 2000 documents
  2.  Indexing included, commit after each add: ~8 minutes (!) to post and index 2000 documents
  3.  Indexing included, commitWithin 1 ms: ~55 seconds (!) to post and index 2000 documents
The 3rd result does not make any sense to me: I would expect the behavior to be similar to that in point 2. At first I thought that the documents were not really committed, but I could actually see them being added by executing some queries during the experiment (via the Solr web UI).
I am worried that I am missing something very big. The code I use for point 2:
SolrInputDocument doc = // get doc
SolrServer solrConnection = // get connection
solrConnection.add(doc);
solrConnection.commit();
Whereas the code for point 3:
SolrInputDocument doc = // get doc
SolrServer solrConnection = // get connection
// According to the API documentation, there is no need to call commit explicitly with this API
solrConnection.add(doc, 1); // second argument is commitWithin, in milliseconds
Is it possible that committing after each add will degrade performance by a factor of 40?
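
For reference, the "factor of 40" can be checked against the three timings above. The following is a minimal sketch of that arithmetic, assuming the ~45 sec baseline is the cost of everything except indexing; the class and method names are made up for illustration.

```java
/** Back-of-the-envelope arithmetic for the three timings reported in this post. */
final class CommitOverhead {

    /** Extra time per document attributable to indexing, in milliseconds. */
    static double perDocOverheadMs(double totalSec, double baselineSec, int docs) {
        return (totalSec - baselineSec) * 1000.0 / docs;
    }

    public static void main(String[] args) {
        int docs = 2000;
        double baselineSec = 45;        // case 1: no indexing at all
        double commitEachSec = 8 * 60;  // case 2: commit after each add
        double commitWithinSec = 55;    // case 3: commitWithin 1 ms

        double hard = perDocOverheadMs(commitEachSec, baselineSec, docs);
        double soft = perDocOverheadMs(commitWithinSec, baselineSec, docs);

        System.out.printf("commit per add: %.1f ms/doc%n", hard);
        System.out.printf("commitWithin:   %.1f ms/doc%n", soft);
        System.out.printf("overhead ratio: %.1fx%n", hard / soft);
    }
}
```

Under that assumption the indexing overhead works out to 217.5 ms per document with a commit after each add and 5 ms per document with commitWithin, a ratio of roughly 43.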


Re: Solr performance with commitWithin seems too good to be true. I am afraid I am missing something

Posted by Joel Bernstein <jo...@gmail.com>.
Yes, committing after each document will greatly degrade performance. I
typically use autoCommit and autoSoftCommit to set the time interval
between commits, but commitWithin should have a similar effect. I often
see throughput of 2000+ docs per second when loading with auto commits.
When you commit explicitly after each document, the commits happen
too frequently and overwork the indexing process.

Joel Bernstein
Search Engineer at Heliosearch


Re: Solr performance with commitWithin seems too good to be true. I am afraid I am missing something

Posted by Jack Krupansky <ja...@basetechnology.com>.
The explicit commit will cause your app to be delayed until that commit 
completes, and then Solr would be idle until that request completion makes 
its way back to your app and you submit another request which finds its way 
to Solr, maybe a few ms. That includes network latency. That interval of 
time could well be more than enough for the short-interval autoCommit or 
commitWithin to run in the background and in parallel with the request 
return to your app and the submission by your app of the subsequent request.

The magic of asynchronous operation in a parallel and distributed computing 
environment, coupled with multi-core processors and parallel threads.

-- Jack Krupansky
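
The difference between waiting for every commit and letting commits run in the background can be illustrated with a toy model. This is a minimal sketch and not Solr's implementation: the 20 ms commit cost, the class name, and the "at most one pending commit" rule are all assumptions made for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model: a client that blocks on each commit vs. one that defers commits. */
final class DeferredCommitDemo {
    static final long COMMIT_COST_MS = 20; // assumed cost of one commit

    /** Stand-in for an expensive commit; counts how many were performed. */
    static void expensiveCommit(AtomicInteger commits) {
        try {
            Thread.sleep(COMMIT_COST_MS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        commits.incrementAndGet();
    }

    /** The client waits for a commit after every add. Returns client-side time in ms. */
    static long blockingMillis(int docs, AtomicInteger commits) {
        long start = System.nanoTime();
        for (int i = 0; i < docs; i++) {
            expensiveCommit(commits);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    /** The client only requests a commit; a background thread performs it. */
    static long deferredMillis(int docs, AtomicInteger commits) {
        ExecutorService committer = Executors.newSingleThreadExecutor();
        AtomicBoolean pending = new AtomicBoolean(false);
        long start = System.nanoTime();
        for (int i = 0; i < docs; i++) {
            // Schedule a commit only if none is already waiting to start.
            if (pending.compareAndSet(false, true)) {
                committer.execute(() -> {
                    pending.set(false); // adds arriving from now on need a new commit
                    expensiveCommit(commits);
                });
            }
        }
        long clientMs = (System.nanoTime() - start) / 1_000_000;
        committer.shutdown();
        try {
            committer.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return clientMs;
    }

    public static void main(String[] args) {
        AtomicInteger blocking = new AtomicInteger();
        long blockingMs = blockingMillis(50, blocking);
        System.out.println("blocking: " + blockingMs + " ms, " + blocking.get() + " commits");

        AtomicInteger deferred = new AtomicInteger();
        long deferredMs = deferredMillis(50, deferred);
        System.out.println("deferred: " + deferredMs + " ms, " + deferred.get() + " commits");
    }
}
```

In this model the blocking client pays for one commit per document, while the deferred client returns almost immediately and the background thread ends up doing far fewer commits, because requests that arrive while a commit is still waiting are folded into it.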


Re: Solr performance with commitWithin seems too good to be true. I am afraid I am missing something

Posted by Erick Erickson <er...@gmail.com>.
Here's some additional background that may shed light on the
performance.

http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Best,
Erick


Re: Solr performance with commitWithin seems too good to be true. I am afraid I am missing something

Posted by Dmitry Kan <so...@gmail.com>.
Cross-posting my answer from SO:

According to this wiki:

https://wiki.apache.org/solr/NearRealtimeSearch

the commitWithin is a soft commit by default. Soft commits are very
efficient in terms of making the added documents immediately searchable.
But they are not on disk yet: the documents are being
committed into RAM. In this setup you would use updateLog to make the Solr
instance crash tolerant.

What you do in point 2 is a hard commit, i.e. you flush the added documents to
disk. Doing this after each document add is very expensive. Instead,
post a bunch of documents and then issue a hard commit, or even have your
autoCommit set to some reasonable value, like 10 min or 1 hour (depending on
your user expectations).
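
The "post a bunch of documents, then commit" advice can be sketched as a small batching wrapper. This is only an illustration under stated assumptions: BatchingIndexer and its callback are hypothetical names, and the callback stands in for whatever client call actually sends the batch and issues the hard commit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Buffers documents and hands them over in batches, one commit per batch. */
final class BatchingIndexer<T> {
    private final int batchSize;
    private final Consumer<List<T>> addAndCommit; // stand-in for add(batch) + commit()
    private final List<T> buffer = new ArrayList<>();
    private int commits = 0;

    BatchingIndexer(int batchSize, Consumer<List<T>> addAndCommit) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.addAndCommit = addAndCommit;
    }

    /** Buffers one document; flushes automatically when the batch is full. */
    void add(T doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is buffered as one batch; does nothing if the buffer is empty. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        addAndCommit.accept(new ArrayList<>(buffer));
        buffer.clear();
        commits++;
    }

    int commits() {
        return commits;
    }

    public static void main(String[] args) {
        BatchingIndexer<String> indexer =
                new BatchingIndexer<>(500, batch -> { /* send batch and commit here */ });
        for (int i = 0; i < 2000; i++) {
            indexer.add("doc-" + i);
        }
        indexer.flush(); // push any remainder
        System.out.println("commits issued: " + indexer.commits()); // prints "commits issued: 4"
    }
}
```

With a batch size of 500, the 2000 documents from the experiment would trigger 4 commits instead of 2000. The batch size is an arbitrary choice here and would need tuning against how quickly documents must become durable.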



-- 
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: twitter.com/dmitrykan

RE: Solr performance with commitWithin seems too good to be true. I am afraid I am missing something

Posted by "Pisarev, Vitaliy" <vi...@hp.com>.
I absolutely agree and I even read the NRT page before posting this question.

The thing that baffles me is this:

Doing a commit after each add kills the performance.
On the other hand, when I use commitWithin and specify an (absurd) 1 ms delay, I expect the behavior to be equivalent to making a commit, from a functional perspective.

Seeing that there is no magic in the world, I am trying to understand what price I am actually paying when using the commitWithin feature. On the one hand it commits almost immediately; on the other hand, it performs wonderfully. Where is the catch?


Re: Solr performance with commitWithin seems too good to be true. I am afraid I am missing something

Posted by Mark Miller <ma...@gmail.com>.
Doing a standard commit after every document is a Solr anti-pattern.

commitWithin is a “near-realtime” commit in recent versions of Solr and not a standard commit.

https://cwiki.apache.org/confluence/display/solr/Near+Real+Time+Searching

- Mark

http://about.me/markrmiller
