Posted to general@lucene.apache.org by Ron Ratovsky <ro...@correlsense.com> on 2010/08/26 16:58:55 UTC

Performance Optimizations and Expected Benchmark Results

Hi everyone,
My colleague and I are fairly new to Lucene. We've been playing around with
it for a while, but we're far from being experts.
We want to use Lucene to allow full text search on the objects our
application produces.
The application operates at a fairly high throughput. Without indexing,
we manage to process about 10k objects per second. On average, the data is
about 2 KB in size, containing several dozen fields.
When we run the application with indexing on, our throughput drops to 1-2k
ops/sec.
While we expect the performance to drop, we were wondering whether there's a
way to boost the performance we get.
I'm not sure what information is required in order to help us out, so I'd
appreciate it if you could mention whatever is needed.
Thanks,
Ron

Re: Performance Optimizations and Expected Benchmark Results

Posted by Ted Dunning <te...@gmail.com>.
I think you missed Jenny's point.

She was asking whether you could buffer up a hundred or a thousand (or more)
items and index them all at once.  This doesn't require that you go to the
data store, just that you have a buffer that sticks around for a few minutes
(or, given your apparent indexing rate, a few seconds).
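
A minimal sketch of that buffering pattern, assuming a recent Lucene API
(class names such as TextField and IndexWriterConfig postdate this thread;
the batch size, queue capacity, and field names are illustrative, not from
the discussion):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BufferedIndexer {
    private static final int BATCH_SIZE = 1000;

    private final BlockingQueue<Document> buffer = new ArrayBlockingQueue<>(10000);
    private final IndexWriter writer;

    public BufferedIndexer(String indexPath) throws Exception {
        writer = new IndexWriter(FSDirectory.open(Paths.get(indexPath)),
                new IndexWriterConfig(new StandardAnalyzer()));
    }

    // Producers call this as objects are created; it never touches the
    // data store, only the in-memory buffer.
    public void submit(String id, String body) throws InterruptedException {
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));
        doc.add(new TextField("body", body, Field.Store.NO));
        buffer.put(doc);  // blocks if the buffer is full (backpressure)
    }

    // A single background thread drains the buffer and indexes whole
    // batches, paying the commit cost once per batch instead of per item.
    public void drainLoop() throws Exception {
        List<Document> batch = new ArrayList<>(BATCH_SIZE);
        while (!Thread.currentThread().isInterrupted()) {
            batch.add(buffer.take());              // wait for at least one item
            buffer.drainTo(batch, BATCH_SIZE - 1); // grab up to a full batch
            for (Document doc : batch) {
                writer.addDocument(doc);           // cheap: buffered in RAM
            }
            writer.commit();                       // one flush per batch
            batch.clear();
        }
    }
}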

On Sun, Aug 29, 2010 at 1:06 AM, Ron Ratovsky <ro...@correlsense.com> wrote:

> The indexing is done synchronously with saving the data.
> Doing it asynchronously is actually slower, since the data to be indexed
> then has to be read back from the data store.
>
> On Thu, Aug 26, 2010 at 18:58, Jenny Brown <sk...@gmail.com> wrote:
>
> > Do you index as you go along, or do you batch your updates to the
> > index?  Sometimes doing a large batch at once can improve total
> > throughput, compared with singles.
> >
> >
> >
>

Re: Performance Optimizations and Expected Benchmark Results

Posted by Ron Ratovsky <ro...@correlsense.com>.
The indexing is done synchronously with saving the data.
Doing it asynchronously is actually slower, since the data to be indexed
then has to be read back from the data store.

On Thu, Aug 26, 2010 at 18:58, Jenny Brown <sk...@gmail.com> wrote:

> Do you index as you go along, or do you batch your updates to the
> index?  Sometimes doing a large batch at once can improve total
> throughput, compared with singles.
>
>
>

Re: Performance Optimizations and Expected Benchmark Results

Posted by Jenny Brown <sk...@gmail.com>.
Do you index as you go along, or do you batch your updates to the
index?  Sometimes doing a large batch at once can improve total
throughput, compared with singles.


On Thu, Aug 26, 2010 at 9:58 AM, Ron Ratovsky <ro...@correlsense.com> wrote:
> Hi everyone,
> My colleague and I are fairly new to Lucene. We've been playing around with
> it for a while, but we're far from being experts.
> We want to use Lucene to allow full text search on the objects our
> application produces.
> The application operates at a fairly high throughput. Without indexing,
> we manage to process about 10k objects per second. On average, the data is
> about 2 KB in size, containing several dozen fields.
> When we run the application with indexing on, our throughput drops to 1-2k
> ops/sec.
> While we expect the performance to drop, we were wondering whether there's a
> way to boost the performance we get.
> I'm not sure what information is required in order to help us out, so I'd
> appreciate it if you could mention whatever is needed.
> Thanks,
> Ron
>

Re: Performance Optimizations and Expected Benchmark Results

Posted by Ted Dunning <te...@gmail.com>.
Jenny is correct.  Opening and closing the index is expensive.  The reason
is that most updates are memory-only, but closing an index forces writes to
disk, which involves expensive serialization.
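
A small sketch contrasting the two patterns, assuming a recent Lucene API
(exact constructor signatures have changed between Lucene versions, so treat
this as illustrative rather than canonical):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;

import java.io.Closeable;
import java.io.IOException;

public class WriterPatterns {

    // Slow pattern: every add pays the full cost of close(), which flushes
    // segments to disk and syncs them.
    static void indexOneSlow(Directory dir, Analyzer analyzer, Document doc)
            throws IOException {
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
        writer.addDocument(doc);
        writer.close();  // expensive serialization on every single document
    }

    // Fast pattern: one long-lived writer; adds accumulate in RAM and the
    // disk cost is paid once per commit, not once per document.
    static class LongLivedIndexer implements Closeable {
        private final IndexWriter writer;

        LongLivedIndexer(Directory dir, Analyzer analyzer) throws IOException {
            writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
        }

        void add(Document doc) throws IOException {
            writer.addDocument(doc);  // memory-only until the next commit
        }

        void checkpoint() throws IOException {
            writer.commit();          // one flush for the whole batch
        }

        @Override
        public void close() throws IOException {
            writer.close();
        }
    }
}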

On Mon, Aug 30, 2010 at 8:42 AM, Jenny Brown <sk...@gmail.com> wrote:

> On Mon, Aug 30, 2010 at 2:53 AM, Ron Ratovsky <ro...@correlsense.com>
> wrote:
> > Hi Ted and Jenny,
> > Thanks for both your responses.
> > Regarding Jenny's question: the answer is yes. There's no problem
> > processing the objects in batches. I'd be interested to know why that
> > would affect performance.
>
> I'm not 100% confident on this, but in my experience, repeatedly
> opening and closing the index is the slow operation -- adding
> documents to it is not.  I get better performance by having a routine
> that runs every 5 minutes, and adds a batch of documents at once,
> rather than trying to add individual items as they come in via an
> irregularly timed stream.  Even if it ran once a minute, batching
> still gives me better results than individual items.
>
> I don't pretend to know why.  :)  It made sense when I developed the
> code but that was a few years ago.  I now only remember what worked,
> not the full explanation of why.
>
>
> Jenny
>

Re: Performance Optimizations and Expected Benchmark Results

Posted by Jenny Brown <sk...@gmail.com>.
On Mon, Aug 30, 2010 at 2:53 AM, Ron Ratovsky <ro...@correlsense.com> wrote:
> Hi Ted and Jenny,
> Thanks for both your responses.
> Regarding Jenny's question: the answer is yes. There's no problem
> processing the objects in batches. I'd be interested to know why that would
> affect performance.

I'm not 100% confident on this, but in my experience, repeatedly
opening and closing the index is the slow operation -- adding
documents to it is not.  I get better performance by having a routine
that runs every 5 minutes, and adds a batch of documents at once,
rather than trying to add individual items as they come in via an
irregularly timed stream.  Even if it ran once a minute, batching
still gives me better results than individual items.

I don't pretend to know why.  :)  It made sense when I developed the
code but that was a few years ago.  I now only remember what worked,
not the full explanation of why.


Jenny
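
A sketch of that timer-driven batching, assuming a recent Lucene API.
SearcherManager postdates this thread; it lets searchers pick up newly
committed documents without closing and reopening the index, which is
exactly the expensive step identified above. The period parameter stands in
for Jenny's five minutes:

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.SearcherManager;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ScheduledCommitter {

    // Documents are added to the writer as they arrive (cheap, memory-only);
    // this task makes them durable and searchable once per period.
    public static ScheduledExecutorService start(IndexWriter writer,
                                                 SearcherManager searchers,
                                                 long periodMinutes) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                writer.commit();          // flush everything added since last run
                searchers.maybeRefresh(); // expose the new documents to queries
            } catch (Exception e) {
                e.printStackTrace();      // real code would log and alert
            }
        }, periodMinutes, periodMinutes, TimeUnit.MINUTES);
        return scheduler;
    }
}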

Re: Performance Optimizations and Expected Benchmark Results

Posted by Ron Ratovsky <ro...@correlsense.com>.
Hi Ted and Jenny,
Thanks for both your responses.
Regarding Jenny's question: the answer is yes. There's no problem
processing the objects in batches. I'd be interested to know why that would
affect performance.
As for the numbers and calculations, Ted, thanks for that.
They really made us realize that our requirements are not clear enough.
Clearly we still have work to do before we can give actual numbers, but once
we have them, I'll post back here.

On Sun, Aug 29, 2010 at 21:17, Ted Dunning <te...@gmail.com> wrote:

> So the current state of your problem is this:
>
> a) desired indexing speed = 10,000 objects per second (peak rate) (ish)
>
> b) total number of objects = 10,000,000
>
> This gives a desired from-scratch indexing time of 1000 seconds, about 17
> minutes
>
> c) object life = 2+ days
>
> but 2 days is about 170,000 seconds.  At 10,000 objects per second, that
> would be 1.7 billion objects, about 170x the actual size.  So your rate
> assumptions have massive peak / valley ratios.
>
> If all 10 million objects turn over in 2 days, the average indexing speed
> only needs to be 60 objects per second.  I suspect that this is actually
> considerably higher than you meant to imply, but we can still use it.
>
> d) desired latency before search = a few minutes (call it 100 seconds)
>
> So it sounds like your objects arrive in batches or like you reprocess all
> of your objects frequently.
>
> My question about incremental indexing had to do with whether you really
> needed to re-index everything from scratch every time or whether it would
> be feasible to simply index new objects as they arrive.  Moreover, if you
> dedicate a single index per day of data and the only deletion policy is
> mass expiration, then you can simply delete an index to accomplish all
> deletion.
>
> You earlier said that you could pretty easily achieve 1000 objects per
> second indexing speed.  If we assume that your data arrives every 30
> seconds in a batch of about 2000 objects, then the indexing for this batch
> should take about 2 seconds.  That seems to give you at least a 15:1 safety
> margin at the cost of implementing a buffer that can store a few thousand
> objects.
>
> Why doesn't that work for you?
>
>
> On Sun, Aug 29, 2010 at 1:18 AM, Ron Ratovsky <ro...@correlsense.com>
> wrote:
>
> > Answers are inline below.
> >
> > On Fri, Aug 27, 2010 at 22:05, Ted Dunning <te...@gmail.com> wrote:
> >
> > > Can you say a bit more about your application?  How many objects total
> > > are there?
> >
> > Our goal is to hold a few tens of millions of objects at any given time.
> >
> > > What is an object lifetime?
> >
> > At minimum, 2 days. It can increase depending on the application load
> > (the lower the load, the longer objects live).
> >
> > > How soon must an object be searchable?
> >
> > Preferably ASAP, but a few minutes should suffice. We don't want to start
> > generating a backlog, since it'll just keep growing.
> >
> > > Can the index be built incrementally?
> >
> > I'm not entirely sure what you mean by that.
> >
> > > What is your search speed/throughput requirement?
> >
> > Currently, I don't have exact numbers. I imagine the load on the search
> > side would be fairly low, but I don't know how to quantify it yet.
> >
>

Re: Performance Optimizations and Expected Benchmark Results

Posted by Ted Dunning <te...@gmail.com>.
So the current state of your problem is this:

a) desired indexing speed = 10,000 objects per second (peak rate) (ish)

b) total number of objects = 10,000,000

This gives a desired from-scratch indexing time of 1000 seconds, about 17
minutes

c) object life = 2+ days

but 2 days is about 170,000 seconds.  At 10,000 objects per second, that
would be 1.7 billion objects, about 170x the actual size.  So your rate
assumptions have massive peak / valley ratios.

If all 10 million objects turn over in 2 days, the average indexing speed
only needs to be 60 objects per second.  I suspect that this is actually
considerably higher than you meant to imply, but we can still use it.

d) desired latency before search = a few minutes (call it 100 seconds)

So it sounds like your objects arrive in batches or like you reprocess all
of your objects frequently.

My question about incremental indexing had to do with whether you really
needed to re-index everything from scratch every time or whether it would be
feasible to simply index new objects as they arrive.  Moreover, if you
dedicate a single index per day of data and the only deletion policy is mass
expiration, then you can simply delete an index to accomplish all deletion.

You earlier said that you could pretty easily achieve 1000 objects per
second indexing speed.  If we assume that your data arrives every 30 seconds
in a batch of about 2000 objects, then the indexing for this batch should
take about 2 seconds.  That seems to give you at least a 15:1 safety margin
at the cost of implementing a buffer that can store a few thousand objects.

Why doesn't that work for you?
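
A sketch of the index-per-day idea, assuming a recent Lucene API; the
directory layout (one subdirectory per date under a root path) and the
delete-by-directory expiration are illustrative, not something specified in
the thread:

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

public class DailyIndexes {
    private final Path root;  // e.g. /data/index/2010-08-29, /data/index/2010-08-30

    public DailyIndexes(Path root) {
        this.root = root;
    }

    // New documents always go into today's index; no per-document deletes.
    public Path todaysIndexDir() {
        return root.resolve(LocalDate.now().toString());
    }

    // Searches span whichever daily indexes still exist on disk.
    public IndexReader openCombinedReader() throws IOException {
        List<IndexReader> readers = new ArrayList<>();
        try (DirectoryStream<Path> days = Files.newDirectoryStream(root)) {
            for (Path day : days) {
                readers.add(DirectoryReader.open(FSDirectory.open(day)));
            }
        }
        return new MultiReader(readers.toArray(new IndexReader[0]));
    }

    // Mass expiration: dropping a whole day is just deleting its directory,
    // far cheaper than deleting millions of individual documents.
    public void expire(LocalDate day) throws IOException {
        Path dir = root.resolve(day.toString());
        if (!Files.exists(dir)) return;
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder())
                 .forEach(p -> p.toFile().delete());
        }
    }
}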


On Sun, Aug 29, 2010 at 1:18 AM, Ron Ratovsky <ro...@correlsense.com> wrote:

> Answers are inline below.
>
> On Fri, Aug 27, 2010 at 22:05, Ted Dunning <te...@gmail.com> wrote:
>
> > Can you say a bit more about your application?  How many objects total
> > are there?
>
> Our goal is to hold a few tens of millions of objects at any given time.
>
> > What is an object lifetime?
>
> At minimum, 2 days. It can increase depending on the application load
> (the lower the load, the longer objects live).
>
> > How soon must an object be searchable?
>
> Preferably ASAP, but a few minutes should suffice. We don't want to start
> generating a backlog, since it'll just keep growing.
>
> > Can the index be built incrementally?
>
> I'm not entirely sure what you mean by that.
>
> > What is your search speed/throughput requirement?
>
> Currently, I don't have exact numbers. I imagine the load on the search
> side would be fairly low, but I don't know how to quantify it yet.
>

Re: Performance Optimizations and Expected Benchmark Results

Posted by Ron Ratovsky <ro...@correlsense.com>.
Answers are inline below.

On Fri, Aug 27, 2010 at 22:05, Ted Dunning <te...@gmail.com> wrote:

> Can you say a bit more about your application?  How many objects total are
> there?

Our goal is to hold a few tens of millions of objects at any given time.

> What is an object lifetime?

At minimum, 2 days. It can increase depending on the application load (the
lower the load, the longer objects live).

> How soon must an object be searchable?

Preferably ASAP, but a few minutes should suffice. We don't want to start
generating a backlog, since it'll just keep growing.

> Can the index be built incrementally?

I'm not entirely sure what you mean by that.

> What is your search speed/throughput requirement?

Currently, I don't have exact numbers. I imagine the load on the search side
would be fairly low, but I don't know how to quantify it yet.

Re: Performance Optimizations and Expected Benchmark Results

Posted by Ted Dunning <te...@gmail.com>.
Can you say a bit more about your application?  How many objects total are
there?

What is an object lifetime?

How soon must an object be searchable?

Can the index be built incrementally?

What is your search speed/throughput requirement?

On Thu, Aug 26, 2010 at 8:58 AM, Ron Ratovsky <ro...@correlsense.com> wrote:

> Hi everyone,
> My colleague and I are fairly new to Lucene. We've been playing around with
> it for a while, but we're far from being experts.
> We want to use Lucene to allow full text search on the objects our
> application produces.
> The application operates at a fairly high throughput. Without indexing,
> we manage to process about 10k objects per second. On average, the data is
> about 2 KB in size, containing several dozen fields.
> When we run the application with indexing on, our throughput drops to 1-2k
> ops/sec.
> While we expect the performance to drop, we were wondering whether there's a
> way to boost the performance we get.
> I'm not sure what information is required in order to help us out, so I'd
> appreciate it if you could mention whatever is needed.
> Thanks,
> Ron
>