Posted to solr-user@lucene.apache.org by Angel Todorov <at...@gmail.com> on 2015/05/21 10:07:52 UTC

Indexing gets significantly slower after every batch commit

hi guys,

I'm crawling a file system folder and indexing 10 million docs, and I am
adding them in batches of 5000, committing every 50 000 docs. The problem I
am facing is that after each commit, the documents per sec that are indexed
gets less and less.
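
Roughly, the indexing loop looks like the sketch below - a simplified SolrJ
illustration, not my actual crawler; the core URL, field names and the
listFiles/readFile helpers are placeholders:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchIndexer {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient client =
                     new HttpSolrClient("http://localhost:8983/solr/mycore")) {
                List<SolrInputDocument> batch = new ArrayList<>();
                long indexed = 0;
                for (String path : listFiles()) {          // stand-in for the crawler
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", path);
                    doc.addField("content", readFile(path));
                    batch.add(doc);
                    if (batch.size() == 5000) {            // one update request per 5000 docs
                        client.add(batch);
                        batch.clear();
                    }
                    if (++indexed % 50000 == 0) {          // hard commit every 50 000 docs
                        client.commit();
                    }
                }
                if (!batch.isEmpty()) client.add(batch);   // flush the last partial batch
                client.commit();                           // final commit
            }
        }

        // placeholders for the file system crawler
        static List<String> listFiles() { return new ArrayList<>(); }
        static String readFile(String path) { return ""; }
    }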

If I do not commit at all, I can index those docs very quickly, and then I
commit once at the end, but once i start indexing docs _after_ that (for
example new files get added to the folder), indexing is also slowing down a
lot.

Is it normal that the SOLR indexing speed depends on the number of
documents that are _already_ indexed? I think it shouldn't matter if i
start from scratch or I index a document in a core that already has a
couple of million docs. Looks like SOLR is either doing something in a
linear fashion, or there is some magic config parameter that I am not aware
of.

I've read all perf docs, and I've tried changing mergeFactor,
autowarmCounts, and the buffer sizes - to no avail.
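
(For reference, these are the solrconfig.xml settings I mean - the values
below are just examples, not my actual config:)

    <!-- indexConfig section of solrconfig.xml - example values only -->
    <indexConfig>
      <ramBufferSizeMB>256</ramBufferSizeMB>
      <mergeFactor>10</mergeFactor>
    </indexConfig>

    <!-- query section: cache autowarming -->
    <filterCache class="solr.FastLRUCache"
                 size="512" initialSize="512" autowarmCount="0"/>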

I am using SOLR 5.1

Thanks !
Angel

Re: Indexing gets significantly slower after every batch commit

Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/22/2015 3:15 AM, Angel Todorov wrote:
> Thanks for the feedback guys. What i am going to try now is deploying my
> SOLR server on a physical machine with more RAM, and checking out this
> scenario there. I have some suspicion it could well be a hypervisor issue,
> but let's see. Just for the record - I've noticed those issues on a Win
> 2008R2 VM with 8 GB of RAM and 2 cores.
> 
> I don't see anything strange in the logs. One thing that I need to change,
> though, is the verbosity of logs in the console - looks like by default
> SOLR outputs text in the log for every single document that's indexed, as
> well as for every query that's executed.

Bare metal will always perform better than a virtual machine.  Also,
Solr is *highly* threaded and really likes to have a lot of CPU cores.

Solr doesn't output a log line for every document indexed, unless you
are only including one document in each update request.  You should
definitely batch your updates -- put a few hundred or a few thousand of
them in each update request.  There is overhead to each request beyond
just the logging ... maximize the work done by each one.
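
One way to do that from SolrJ (just a sketch - the queue size and thread
count below are arbitrary) is ConcurrentUpdateSolrClient, which buffers
documents and sends them in batches from background threads:

    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class StreamingIndexer {
        public static void main(String[] args) throws Exception {
            // queue of 10000 docs, 4 sender threads - illustrative values
            ConcurrentUpdateSolrClient client = new ConcurrentUpdateSolrClient(
                "http://localhost:8983/solr/mycore", 10000, 4);
            for (int i = 0; i < 100000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                client.add(doc);              // buffered, sent in batches by the client
            }
            client.blockUntilFinished();      // wait for queued updates to flush
            client.commit();
            client.close();
        }
    }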

I don't know that I would run Solr on Windows in production.  Windows
lags behind the free operating systems in memory management and
filesystem capabilities.  It's not that a Windows server is a BAD
environment, it's just that there are better ones that won't cost you money.

Thanks,
Shawn


Re: Indexing gets significantly slower after every batch commit

Posted by Siegfried Goeschl <sg...@gmx.at>.
Hi Angel,

a while ago I had issues with a VMware VM - somehow snapshots were created regularly, which dragged down the machine. So I think it is a good idea to baseline the performance on a physical box before moving to VMs, production boxes or whatever is thrown at you

Cheers,

Siegfried Goeschl

> On 22 May 2015, at 11:15, Angel Todorov <at...@gmail.com> wrote:
> 
> Thanks for the feedback guys. What i am going to try now is deploying my
> SOLR server on a physical machine with more RAM, and checking out this
> scenario there. I have some suspicion it could well be a hypervisor issue,
> but let's see. Just for the record - I've noticed those issues on a Win
> 2008R2 VM with 8 GB of RAM and 2 cores.
> 
> I don't see anything strange in the logs. One thing that I need to change,
> though, is the verbosity of logs in the console - looks like by default
> SOLR outputs text in the log for every single document that's indexed, as
> well as for every query that's executed.
> 
> Angel
> 
> 
> On Fri, May 22, 2015 at 1:03 AM, Erick Erickson <er...@gmail.com>
> wrote:
> 
>> bq: Which is logical as index growth and time needed to put something
>> to it is log(n)
>> 
>> Not really. Solr indexes to segments, each segment is a fully
>> consistent "mini index".
>> When a segment gets flushed to disk, a new one is started. Of course
>> there'll be a
>> _little bit_ of added overhead, but it shouldn't be all that noticeable.
>> 
>> Furthermore, they're "append only". In the past, when I've indexed the
>> Wiki example,
>> my indexing speed actually goes faster.
>> 
>> So on the surface this sounds very strange to me. Are you seeing
>> anything at all in the
>> Solr logs that's suspicious?
>> 
>> Best,
>> Erick
>> 
>> On Thu, May 21, 2015 at 12:22 PM, Sergey Shvets <se...@bintime.com>
>> wrote:
>>> Hi Angel
>>> 
>>> We also noticed that kind of performance degradation in our workloads.
>>> 
>>> Which is logical as index growth and time needed to put something to it
>> is
>>> log(n)
>>> 
>>> 
>>> 
>>> On Thursday, 21 May 2015, Angel Todorov wrote:
>>> 
>>>> hi Shawn,
>>>> 
>>>> Thanks a bunch for your feedback. I've played with the heap size, but I
>>>> don't see any improvement. Even if i index, say , a million docs, and
>> the
>>>> throughput is about 300 docs per sec, and then I shut down solr
>> completely
>>>> - after I start indexing again, the throughput is dropping below 300.
>>>> 
>>>> I should probably experiment with sharding those documents to multiple
>> SOLR
>>>> cores - that should help, I guess. I am talking about something like
>> this:
>>>> 
>>>> 
>>>> 
>> https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
>>>> 
>>>> Thanks,
>>>> Angel
>>>> 
>>>> 
>>>> On Thu, May 21, 2015 at 11:36 AM, Shawn Heisey <apache@elyograg.org>
>>>> wrote:
>>>> 
>>>>> On 5/21/2015 2:07 AM, Angel Todorov wrote:
>>>>>> I'm crawling a file system folder and indexing 10 million docs, and
>> I
>>>> am
>>>>>> adding them in batches of 5000, committing every 50 000 docs. The
>>>>> problem I
>>>>>> am facing is that after each commit, the documents per sec that are
>>>>> indexed
>>>>>> gets less and less.
>>>>>> 
>>>>>> If I do not commit at all, I can index those docs very quickly, and
>>>> then
>>>>> I
>>>>>> commit once at the end, but once i start indexing docs _after_ that
>>>> (for
>>>>>> example new files get added to the folder), indexing is also slowing
>>>>> down a
>>>>>> lot.
>>>>>> 
>>>>>> Is it normal that the SOLR indexing speed depends on the number of
>>>>>> documents that are _already_ indexed? I think it shouldn't matter
>> if i
>>>>>> start from scratch or I index a document in a core that already has
>> a
>>>>>> couple of million docs. Looks like SOLR is either doing something
>> in a
>>>>>> linear fashion, or there is some magic config parameter that I am
>> not
>>>>> aware
>>>>>> of.
>>>>>> 
>>>>>> I've read all perf docs, and I've tried changing mergeFactor,
>>>>>> autowarmCounts, and the buffer sizes - to no avail.
>>>>>> 
>>>>>> I am using SOLR 5.1
>>>>> 
>>>>> Have you changed the heap size?  If you use the bin/solr script to
>> start
>>>>> it and don't change the heap size with the -m option or another
>> method,
>>>>> Solr 5.1 runs with a default size of 512MB, which is *very* small.
>>>>> 
>>>>> I bet you are running into problems with frequent and then ultimately
>>>>> constant garbage collection, as Java attempts to free up enough memory
>>>>> to allow the program to continue running.  If that is what is
>> happening,
>>>>> then eventually you will see an OutOfMemoryError exception.  The
>>>>> solution is to increase the heap size.  I would probably start with at
>>>>> least 4G for 10 million docs.
>>>>> 
>>>>> Thanks,
>>>>> Shawn
>>>>> 
>>>>> 
>>>> 
>> 


Re: Indexing gets significantly slower after every batch commit

Posted by Angel Todorov <at...@gmail.com>.
Thanks for the feedback guys. What i am going to try now is deploying my
SOLR server on a physical machine with more RAM, and checking out this
scenario there. I have some suspicion it could well be a hypervisor issue,
but let's see. Just for the record - I've noticed those issues on a Win
2008R2 VM with 8 GB of RAM and 2 cores.

I don't see anything strange in the logs. One thing that I need to change,
though, is the verbosity of logs in the console - looks like by default
SOLR outputs text in the log for every single document that's indexed, as
well as for every query that's executed.
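
(I will probably tune that in server/resources/log4j.properties - something
like the lines below, though I haven't checked the exact appender names in
the stock 5.1 file; levels can also be changed at runtime from the Admin UI
under Logging > Level:)

    # raise the console threshold so per-request INFO lines stay in solr.log only
    log4j.appender.CONSOLE.Threshold=WARN
    # or lower the overall verbosity
    # log4j.rootLogger=WARN, file, CONSOLE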

Angel


On Fri, May 22, 2015 at 1:03 AM, Erick Erickson <er...@gmail.com>
wrote:

> bq: Which is logical as index growth and time needed to put something
> to it is log(n)
>
> Not really. Solr indexes to segments, each segment is a fully
> consistent "mini index".
> When a segment gets flushed to disk, a new one is started. Of course
> there'll be a
> _little bit_ of added overhead, but it shouldn't be all that noticeable.
>
> Furthermore, they're "append only". In the past, when I've indexed the
> Wiki example,
> my indexing speed actually goes faster.
>
> So on the surface this sounds very strange to me. Are you seeing
> anything at all in the
> Solr logs that's suspicious?
>
> Best,
> Erick
>
> On Thu, May 21, 2015 at 12:22 PM, Sergey Shvets <se...@bintime.com>
> wrote:
> > Hi Angel
> >
> > We also noticed that kind of performance degradation in our workloads.
> >
> > Which is logical as index growth and time needed to put something to it
> is
> > log(n)
> >
> >
> >
> > On Thursday, 21 May 2015, Angel Todorov wrote:
> >
> >> hi Shawn,
> >>
> >> Thanks a bunch for your feedback. I've played with the heap size, but I
> >> don't see any improvement. Even if i index, say , a million docs, and
> the
> >> throughput is about 300 docs per sec, and then I shut down solr
> completely
> >> - after I start indexing again, the throughput is dropping below 300.
> >>
> >> I should probably experiment with sharding those documents to multiple
> SOLR
> >> cores - that should help, I guess. I am talking about something like
> this:
> >>
> >>
> >>
> https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
> >>
> >> Thanks,
> >> Angel
> >>
> >>
> >> On Thu, May 21, 2015 at 11:36 AM, Shawn Heisey <apache@elyograg.org>
> >> wrote:
> >>
> >> > On 5/21/2015 2:07 AM, Angel Todorov wrote:
> >> > > I'm crawling a file system folder and indexing 10 million docs, and
> I
> >> am
> >> > > adding them in batches of 5000, committing every 50 000 docs. The
> >> > problem I
> >> > > am facing is that after each commit, the documents per sec that are
> >> > indexed
> >> > > gets less and less.
> >> > >
> >> > > If I do not commit at all, I can index those docs very quickly, and
> >> then
> >> > I
> >> > > commit once at the end, but once i start indexing docs _after_ that
> >> (for
> >> > > example new files get added to the folder), indexing is also slowing
> >> > down a
> >> > > lot.
> >> > >
> >> > > Is it normal that the SOLR indexing speed depends on the number of
> >> > > documents that are _already_ indexed? I think it shouldn't matter
> if i
> >> > > start from scratch or I index a document in a core that already has
> a
> >> > > couple of million docs. Looks like SOLR is either doing something
> in a
> >> > > linear fashion, or there is some magic config parameter that I am
> not
> >> > aware
> >> > > of.
> >> > >
> >> > > I've read all perf docs, and I've tried changing mergeFactor,
> >> > > autowarmCounts, and the buffer sizes - to no avail.
> >> > >
> >> > > I am using SOLR 5.1
> >> >
> >> > Have you changed the heap size?  If you use the bin/solr script to
> start
> >> > it and don't change the heap size with the -m option or another
> method,
> >> > Solr 5.1 runs with a default size of 512MB, which is *very* small.
> >> >
> >> > I bet you are running into problems with frequent and then ultimately
> >> > constant garbage collection, as Java attempts to free up enough memory
> >> > to allow the program to continue running.  If that is what is
> happening,
> >> > then eventually you will see an OutOfMemoryError exception.  The
> >> > solution is to increase the heap size.  I would probably start with at
> >> > least 4G for 10 million docs.
> >> >
> >> > Thanks,
> >> > Shawn
> >> >
> >> >
> >>
>

Re: Indexing gets significantly slower after every batch commit

Posted by Erick Erickson <er...@gmail.com>.
bq: Which is logical as index growth and time needed to put something
to it is log(n)

Not really. Solr indexes to segments, each segment is a fully
consistent "mini index".
When a segment gets flushed to disk, a new one is started. Of course
there'll be a
_little bit_ of added overhead, but it shouldn't be all that noticeable.

Furthermore, they're "append only". In the past, when I've indexed the
Wiki example,
my indexing speed actually goes faster.

So on the surface this sounds very strange to me. Are you seeing
anything at all in the
Solr logs that's suspicious?

Best,
Erick

On Thu, May 21, 2015 at 12:22 PM, Sergey Shvets <se...@bintime.com> wrote:
> Hi Angel
>
> We also noticed that kind of performance degradation in our workloads.
>
> Which is logical as index growth and time needed to put something to it is
> log(n)
>
>
>
> On Thursday, 21 May 2015, Angel Todorov wrote:
>
>> hi Shawn,
>>
>> Thanks a bunch for your feedback. I've played with the heap size, but I
>> don't see any improvement. Even if i index, say , a million docs, and the
>> throughput is about 300 docs per sec, and then I shut down solr completely
>> - after I start indexing again, the throughput is dropping below 300.
>>
>> I should probably experiment with sharding those documents to multiple SOLR
>> cores - that should help, I guess. I am talking about something like this:
>>
>>
>> https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
>>
>> Thanks,
>> Angel
>>
>>
>> On Thu, May 21, 2015 at 11:36 AM, Shawn Heisey <apache@elyograg.org>
>> wrote:
>>
>> > On 5/21/2015 2:07 AM, Angel Todorov wrote:
>> > > I'm crawling a file system folder and indexing 10 million docs, and I
>> am
>> > > adding them in batches of 5000, committing every 50 000 docs. The
>> > problem I
>> > > am facing is that after each commit, the documents per sec that are
>> > indexed
>> > > gets less and less.
>> > >
>> > > If I do not commit at all, I can index those docs very quickly, and
>> then
>> > I
>> > > commit once at the end, but once i start indexing docs _after_ that
>> (for
>> > > example new files get added to the folder), indexing is also slowing
>> > down a
>> > > lot.
>> > >
>> > > Is it normal that the SOLR indexing speed depends on the number of
>> > > documents that are _already_ indexed? I think it shouldn't matter if i
>> > > start from scratch or I index a document in a core that already has a
>> > > couple of million docs. Looks like SOLR is either doing something in a
>> > > linear fashion, or there is some magic config parameter that I am not
>> > aware
>> > > of.
>> > >
>> > > I've read all perf docs, and I've tried changing mergeFactor,
>> > > autowarmCounts, and the buffer sizes - to no avail.
>> > >
>> > > I am using SOLR 5.1
>> >
>> > Have you changed the heap size?  If you use the bin/solr script to start
>> > it and don't change the heap size with the -m option or another method,
>> > Solr 5.1 runs with a default size of 512MB, which is *very* small.
>> >
>> > I bet you are running into problems with frequent and then ultimately
>> > constant garbage collection, as Java attempts to free up enough memory
>> > to allow the program to continue running.  If that is what is happening,
>> > then eventually you will see an OutOfMemoryError exception.  The
>> > solution is to increase the heap size.  I would probably start with at
>> > least 4G for 10 million docs.
>> >
>> > Thanks,
>> > Shawn
>> >
>> >
>>

Re: Indexing gets significantly slower after every batch commit

Posted by Sergey Shvets <se...@bintime.com>.
Hi Angel

We also noticed that kind of performance degradation in our workloads.

Which is logical, as the index grows and the time needed to put something
into it is log(n).



On Thursday, 21 May 2015, Angel Todorov wrote:

> hi Shawn,
>
> Thanks a bunch for your feedback. I've played with the heap size, but I
> don't see any improvement. Even if i index, say , a million docs, and the
> throughput is about 300 docs per sec, and then I shut down solr completely
> - after I start indexing again, the throughput is dropping below 300.
>
> I should probably experiment with sharding those documents to multiple SOLR
> cores - that should help, I guess. I am talking about something like this:
>
>
> https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
>
> Thanks,
> Angel
>
>
> On Thu, May 21, 2015 at 11:36 AM, Shawn Heisey <apache@elyograg.org>
> wrote:
>
> > On 5/21/2015 2:07 AM, Angel Todorov wrote:
> > > I'm crawling a file system folder and indexing 10 million docs, and I
> am
> > > adding them in batches of 5000, committing every 50 000 docs. The
> > problem I
> > > am facing is that after each commit, the documents per sec that are
> > indexed
> > > gets less and less.
> > >
> > > If I do not commit at all, I can index those docs very quickly, and
> then
> > I
> > > commit once at the end, but once i start indexing docs _after_ that
> (for
> > > example new files get added to the folder), indexing is also slowing
> > down a
> > > lot.
> > >
> > > Is it normal that the SOLR indexing speed depends on the number of
> > > documents that are _already_ indexed? I think it shouldn't matter if i
> > > start from scratch or I index a document in a core that already has a
> > > couple of million docs. Looks like SOLR is either doing something in a
> > > linear fashion, or there is some magic config parameter that I am not
> > aware
> > > of.
> > >
> > > I've read all perf docs, and I've tried changing mergeFactor,
> > > autowarmCounts, and the buffer sizes - to no avail.
> > >
> > > I am using SOLR 5.1
> >
> > Have you changed the heap size?  If you use the bin/solr script to start
> > it and don't change the heap size with the -m option or another method,
> > Solr 5.1 runs with a default size of 512MB, which is *very* small.
> >
> > I bet you are running into problems with frequent and then ultimately
> > constant garbage collection, as Java attempts to free up enough memory
> > to allow the program to continue running.  If that is what is happening,
> > then eventually you will see an OutOfMemoryError exception.  The
> > solution is to increase the heap size.  I would probably start with at
> > least 4G for 10 million docs.
> >
> > Thanks,
> > Shawn
> >
> >
>

Re: Indexing gets significantly slower after every batch commit

Posted by Angel Todorov <at...@gmail.com>.
hi Shawn,

Thanks a bunch for your feedback. I've played with the heap size, but I
don't see any improvement. Even if i index, say , a million docs, and the
throughput is about 300 docs per sec, and then I shut down solr completely
- after I start indexing again, the throughput is dropping below 300.

I should probably experiment with sharding those documents to multiple SOLR
cores - that should help, I guess. I am talking about something like this:

https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
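
(Something along these lines, assuming I move to SolrCloud mode - the
collection name and shard count are just examples:)

    http://localhost:8983/solr/admin/collections?action=CREATE&name=docs&numShards=4&replicationFactor=1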

Thanks,
Angel


On Thu, May 21, 2015 at 11:36 AM, Shawn Heisey <ap...@elyograg.org> wrote:

> On 5/21/2015 2:07 AM, Angel Todorov wrote:
> > I'm crawling a file system folder and indexing 10 million docs, and I am
> > adding them in batches of 5000, committing every 50 000 docs. The
> problem I
> > am facing is that after each commit, the documents per sec that are
> indexed
> > gets less and less.
> >
> > If I do not commit at all, I can index those docs very quickly, and then
> I
> > commit once at the end, but once i start indexing docs _after_ that (for
> > example new files get added to the folder), indexing is also slowing
> down a
> > lot.
> >
> > Is it normal that the SOLR indexing speed depends on the number of
> > documents that are _already_ indexed? I think it shouldn't matter if i
> > start from scratch or I index a document in a core that already has a
> > couple of million docs. Looks like SOLR is either doing something in a
> > linear fashion, or there is some magic config parameter that I am not
> aware
> > of.
> >
> > I've read all perf docs, and I've tried changing mergeFactor,
> > autowarmCounts, and the buffer sizes - to no avail.
> >
> > I am using SOLR 5.1
>
> Have you changed the heap size?  If you use the bin/solr script to start
> it and don't change the heap size with the -m option or another method,
> Solr 5.1 runs with a default size of 512MB, which is *very* small.
>
> I bet you are running into problems with frequent and then ultimately
> constant garbage collection, as Java attempts to free up enough memory
> to allow the program to continue running.  If that is what is happening,
> then eventually you will see an OutOfMemoryError exception.  The
> solution is to increase the heap size.  I would probably start with at
> least 4G for 10 million docs.
>
> Thanks,
> Shawn
>
>

Re: Indexing gets significantly slower after every batch commit

Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/21/2015 2:07 AM, Angel Todorov wrote:
> I'm crawling a file system folder and indexing 10 million docs, and I am
> adding them in batches of 5000, committing every 50 000 docs. The problem I
> am facing is that after each commit, the documents per sec that are indexed
> gets less and less.
> 
> If I do not commit at all, I can index those docs very quickly, and then I
> commit once at the end, but once i start indexing docs _after_ that (for
> example new files get added to the folder), indexing is also slowing down a
> lot.
> 
> Is it normal that the SOLR indexing speed depends on the number of
> documents that are _already_ indexed? I think it shouldn't matter if i
> start from scratch or I index a document in a core that already has a
> couple of million docs. Looks like SOLR is either doing something in a
> linear fashion, or there is some magic config parameter that I am not aware
> of.
> 
> I've read all perf docs, and I've tried changing mergeFactor,
> autowarmCounts, and the buffer sizes - to no avail.
> 
> I am using SOLR 5.1

Have you changed the heap size?  If you use the bin/solr script to start
it and don't change the heap size with the -m option or another method,
Solr 5.1 runs with a default size of 512MB, which is *very* small.
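
For example, to start with a 4GB heap:

    bin/solr start -m 4g        (on Windows: bin\solr.cmd start -m 4g)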

I bet you are running into problems with frequent and then ultimately
constant garbage collection, as Java attempts to free up enough memory
to allow the program to continue running.  If that is what is happening,
then eventually you will see an OutOfMemoryError exception.  The
solution is to increase the heap size.  I would probably start with at
least 4G for 10 million docs.

Thanks,
Shawn