Posted to solr-user@lucene.apache.org by John Davis <jo...@gmail.com> on 2019/06/01 06:27:09 UTC

Solr Heap Usage

I've read a bunch of the wikis on Solr heap usage and wanted to confirm my
understanding of what Solr uses the heap for:

1. Indexing new documents - only until they are committed? If not, how long
are new documents kept in the heap?

2. Merging segments - does Solr load the entire segment into memory or only
chunks of it? If the latter, how large are these chunks?

3. Queries, facets, caches - anything else major?

John

Re: Solr Heap Usage

Posted by Shawn Heisey <ap...@elyograg.org>.
On 6/2/2019 4:35 PM, John Davis wrote:
> If we assume there is no query load then effectively this boils down to
> most effective way for adding a large number of documents to the solr
> index. I've looked through SolrJ, DIH and others -- is the bottomline
> across all of them to "batch updates" and not commit as long as possible?

If you want maximum indexing speed, you'll need to batch updates and 
send multiple batches in parallel.  I cannot tell you how much 
concurrency you need; you'll have to experiment.  I would probably start 
with the same number of threads as you have CPU cores in your Solr server, 
then try 1.5 times and 2 times that number and see which works better. 
I'd even try 3 or 4 times the CPU count, just to see how it behaves.

As long as commits are not happening in rapid succession, I wouldn't 
worry too much about them interfering with indexing speed.  Commits 
that don't open a searcher should probably be no more frequent than 
every minute or two; commits that DO open a new searcher should be less 
frequent than that.

Thanks,
Shawn
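
For concreteness, a minimal SolrJ sketch of the batch-and-parallelize approach
described above (not from the thread: the URL, collection, field names, thread
count, batch size and document count are placeholders to tune for your own
setup, and no explicit commit is issued -- autocommit or commitWithin would
handle visibility):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ParallelBatchIndexer {
        public static void main(String[] args) throws Exception {
            // Placeholder URL and collection name.
            SolrClient client =
                new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build();
            // Start with one indexing thread per CPU core, then experiment with 1.5x, 2x, ...
            int threads = Runtime.getRuntime().availableProcessors();
            int batchSize = 1000;
            ExecutorService pool = Executors.newFixedThreadPool(threads);

            for (int t = 0; t < threads; t++) {
                final int threadId = t;
                pool.submit(() -> {
                    List<SolrInputDocument> batch = new ArrayList<>(batchSize);
                    for (int i = 0; i < 100_000; i++) {            // placeholder document count
                        SolrInputDocument doc = new SolrInputDocument();
                        doc.addField("id", "doc-" + threadId + "-" + i);
                        doc.addField("body_txt", "example document " + i);
                        batch.add(doc);
                        if (batch.size() == batchSize) {
                            client.add(batch);                      // one request per batch, no commit
                            batch.clear();
                        }
                    }
                    if (!batch.isEmpty()) {
                        client.add(batch);                          // flush the final partial batch
                    }
                    return null;
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            client.close();
        }
    }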

Re: Solr Heap Usage

Posted by Greg Harris <ha...@gmail.com>.
+1 for Eclipse MAT. YourKit is another option. Heap dumps are invaluable
but a pain. If you’re just interested in overall heap and GC analysis, I use
GCViewer, which is usually all you need. I do heap dumps when there are
large deviations from expectations and it is not obvious why.
Greg
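
As a side note on heap dumps: besides pointing jmap or jcmd at the Solr process,
a JVM can write a dump of itself programmatically. A minimal sketch using the
standard HotSpotDiagnosticMXBean (it only dumps the JVM it runs in, so it assumes
you can run it inside the process you want to inspect; the output path is a
placeholder):

    import java.lang.management.ManagementFactory;
    import com.sun.management.HotSpotDiagnosticMXBean;

    public class HeapDumper {
        public static void main(String[] args) throws Exception {
            HotSpotDiagnosticMXBean diag =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            // live=true forces a GC first, so the dump only contains reachable objects
            // rather than uncollected garbage.
            diag.dumpHeap("/tmp/heap.hprof", true);
        }
    }

For a running Solr you would more typically run
jmap -dump:live,format=b,file=heap.hprof <pid> against the Solr process and open
the result in Eclipse MAT or YourKit.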

On Fri, Jun 7, 2019 at 11:30 AM John Davis <jo...@gmail.com>
wrote:

> What would be the best way to understand where heap is being used?
>
> On Tue, Jun 4, 2019 at 9:31 PM Greg Harris <ha...@gmail.com> wrote:
>
> > Just a couple of points I’d make here. I did some testing a while back in
> > which if no commit is made, (hard or soft) there are internal memory
> > structures holding tlogs and it will continue to get worse the more docs
> > that come in. I don’t know if that’s changed in further versions. I’d
> > recommend doing commits with some amount of frequency in indexing heavy
> > apps, otherwise you are likely to have heap issues. I personally would
> > advocate for some of the points already made. There are too many
> variables
> > going on here and ways to modify stuff to make sizing decisions and think
> > you’re doing anything other than a pure guess if you don’t test and
> > monitor. I’d advocate for a process in which testing is done regularly to
> > figure out questions like number of shards/replicas, heap size, memory
> etc.
> > Hard data, good process and regular testing will trump guesswork every
> time
> >
> > Greg
> >
> > On Tue, Jun 4, 2019 at 9:22 AM John Davis <jo...@gmail.com>
> > wrote:
> >
> > > You might want to test with softcommit of hours vs 5m for heavy
> indexing
> > +
> > > light query -- even though there is internal memory structure overhead
> > for
> > > no soft commits, in our testing a 5m soft commit (via commitWithin) has
> > > resulted in a very very large heap usage which I suspect is because of
> > > other overhead associated with it.
> > >
> > > On Tue, Jun 4, 2019 at 8:03 AM Erick Erickson <erickerickson@gmail.com
> >
> > > wrote:
> > >
> > > > I need to update that, didn’t understand the bits about retaining
> > > internal
> > > > memory structures at the time.
> > > >
> > > > > On Jun 4, 2019, at 2:10 AM, John Davis <jo...@gmail.com>
> > > > wrote:
> > > > >
> > > > > Erick - These conflict, what's changed?
> > > > >
> > > > > So if I were going to recommend settings, they’d be something like
> > > this:
> > > > > Do a hard commit with openSearcher=false every 60 seconds.
> > > > > Do a soft commit every 5 minutes.
> > > > >
> > > > > vs
> > > > >
> > > > > Index-heavy, Query-light
> > > > > Set your soft commit interval quite long, up to the maximum latency
> > you
> > > > can
> > > > > stand for documents to be visible. This could be just a couple of
> > > minutes
> > > > > or much longer. Maybe even hours with the capability of issuing a
> > hard
> > > > > commit (openSearcher=true) or soft commit on demand.
> > > > >
> > > >
> > >
> >
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Sun, Jun 2, 2019 at 8:58 PM Erick Erickson <
> > erickerickson@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > >>> I've looked through SolrJ, DIH and others -- is the bottomline
> > > > >>> across all of them to "batch updates" and not commit as long as
> > > > possible?
> > > > >>
> > > > >> Of course it’s more complicated than that ;)….
> > > > >>
> > > > >> But to start, yes, I urge you to batch. Here’s some stats:
> > > > >> https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/
> > > > >>
> > > > >> Note that at about 100 docs/batch you hit diminishing returns.
> > > > _However_,
> > > > >> that test was run on a single shard collection, so if you have 10
> > > shards
> > > > >> you’d
> > > > >> have to send 1,000 docs/batch. I wouldn’t sweat that number much,
> > just
> > > > >> don’t
> > > > >> send one at a time. And there are the usual gotchas if your
> > documents
> > > > are
> > > > >> 1M .vs. 1K.
> > > > >>
> > > > >> About committing. No, don’t hold off as long as possible. When you
> > > > commit,
> > > > >> segments are merged. _However_, the default 100M internal buffer
> > size
> > > > means
> > > > >> that segments are written anyway even if you don’t hit a commit
> > point
> > > > when
> > > > >> you have 100M of index data, and merges happen anyway. So you
> won’t
> > > save
> > > > >> anything on merging by holding off commits.
> > > > >> And you’ll incur penalties. Here’s more than you want to know
> about
> > > > >> commits:
> > > > >>
> > > > >>
> > > >
> > >
> >
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> > > > >>
> > > > >> But some key take-aways… If for some reason Solr abnormally
> > > > >> terminates, the accumulated documents since the last hard
> > > > >> commit are replayed. So say you don’t commit for an hour of
> > > > >> furious indexing and someone does a “kill -9”. When you restart
> > > > >> Solr it’ll try to re-index all the docs for the last hour. Hard
> > > commits
> > > > >> with openSearcher=false aren’t all that expensive. I usually set
> > mine
> > > > >> for a minute and forget about it.
> > > > >>
> > > > >> Transaction logs hold a window, _not_ the entire set of operations
> > > > >> since time began. When you do a hard commit, the current tlog is
> > > > >> closed and a new one opened and ones that are “too old” are
> deleted.
> > > If
> > > > >> you never commit you have a huge transaction log to no good
> purpose.
> > > > >>
> > > > >> Also, while indexing, in order to accommodate “Real Time Get”, all
> > > > >> the docs indexed since the last searcher was opened have a pointer
> > > > >> kept in memory. So if you _never_ open a new searcher, that
> internal
> > > > >> structure can get quite large. So in bulk-indexing operations, I
> > > > >> suggest you open a searcher every so often.
> > > > >>
> > > > >> Opening a new searcher isn’t terribly expensive if you have no
> > > > autowarming
> > > > >> going on. Autowarming as defined in solrconfig.xml in filterCache,
> > > > >> queryResultCache
> > > > >> etc.
> > > > >>
> > > > >> So if I were going to recommend settings, they’d be something like
> > > this:
> > > > >> Do a hard commit with openSearcher=false every 60 seconds.
> > > > >> Do a soft commit every 5 minutes.
> > > > >>
> > > > >> I’d actually be surprised if you were able to measure differences
> > > > between
> > > > >> those settings and just hard commit with openSearcher=true every
> 60
> > > > >> seconds and soft commit at -1 (never)…
> > > > >>
> > > > >> Best,
> > > > >> Erick
> > > > >>
> > > > >>> On Jun 2, 2019, at 3:35 PM, John Davis <
> johndavis925254@gmail.com>
> > > > >> wrote:
> > > > >>>
> > > > >>> If we assume there is no query load then effectively this boils
> > down
> > > to
> > > > >>> most effective way for adding a large number of documents to the
> > solr
> > > > >>> index. I've looked through SolrJ, DIH and others -- is the
> > bottomline
> > > > >>> across all of them to "batch updates" and not commit as long as
> > > > possible?
> > > > >>>
> > > > >>> On Sun, Jun 2, 2019 at 7:44 AM Erick Erickson <
> > > erickerickson@gmail.com
> > > > >
> > > > >>> wrote:
> > > > >>>
> > > > >>>> Oh, there are about a zillion reasons ;).
> > > > >>>>
> > > > >>>> First of all, most tools that show heap usage also count
> > uncollected
> > > > >>>> garbage. So your 10G could actually be much less “live” data.
> > Quick
> > > > way
> > > > >> to
> > > > >>>> test is to attach jconsole to the running Solr and hit the
> button
> > > that
> > > > >>>> forces a full GC.
> > > > >>>>
> > > > >>>> Another way is to reduce your heap when you start Solr (on a
> test
> > > > system
> > > > >>>> of course) until bad stuff happens, if you reduce it to very
> close
> > > to
> > > > >> what
> > > > >>>> Solr needs, you’ll get slower as more and more cycles are spent
> on
> > > GC,
> > > > >> if
> > > > >>>> you reduce it a little more you’ll get OOMs.
> > > > >>>>
> > > > >>>> You can take heap dumps of course to see where all the memory is
> > > being
> > > > >>>> used, but that’s tricky as it also includes garbage.
> > > > >>>>
> > > > >>>> I’ve seen cache sizes (filterCache in particular) be something
> > that
> > > > uses
> > > > >>>> lots of memory, but that requires queries to be fired. Each
> > > > filterCache
> > > > >>>> entry can take up to roughly maxDoc/8 bytes + overhead….
> > > > >>>>
> > > > >>>> A classic error is to sort, group or facet on a docValues=false
> > > field.
> > > > >>>> Starting with Solr 7.6, you can add an option to fields to throw
> > an
> > > > >> error
> > > > >>>> if you do this, see:
> > > https://issues.apache.org/jira/browse/SOLR-12962
> > > > .
> > > > >>>>
> > > > >>>> In short, there’s not enough information until you dive in and
> > test
> > > > >>>> bunches of stuff to tell.
> > > > >>>>
> > > > >>>> Best,
> > > > >>>> Erick
> > > > >>>>
> > > > >>>>
> > > > >>>>> On Jun 2, 2019, at 2:22 AM, John Davis <
> > johndavis925254@gmail.com>
> > > > >>>> wrote:
> > > > >>>>>
> > > > >>>>> This makes sense, any ideas why lucene/solr will use 10g heap
> > for a
> > > > 20g
> > > > >>>>> index.My hypothesis was merging segments was trying to read it
> > all
> > > > but
> > > > >> if
> > > > >>>>> that's not the case I am out of ideas. The one caveat is we are
> > > > trying
> > > > >> to
> > > > >>>>> add the documents quickly (~1g an hour) but if lucene does
> write
> > > 100m
> > > > >>>>> segments and does streaming merge it shouldn't matter?
> > > > >>>>>
> > > > >>>>> On Sat, Jun 1, 2019 at 9:24 AM Walter Underwood <
> > > > wunder@wunderwood.org
> > > > >>>
> > > > >>>>> wrote:
> > > > >>>>>
> > > > >>>>>>> On May 31, 2019, at 11:27 PM, John Davis <
> > > > johndavis925254@gmail.com>
> > > > >>>>>> wrote:
> > > > >>>>>>>
> > > > >>>>>>> 2. Merging segments - does solr load the entire segment in
> > memory
> > > > or
> > > > >>>>>> chunks
> > > > >>>>>>> of it? if later how large are these chunks
> > > > >>>>>>
> > > > >>>>>> No, it does not read the entire segment into memory.
> > > > >>>>>>
> > > > >>>>>> A fundamental part of the Lucene design is streaming posting
> > lists
> > > > >> into
> > > > >>>>>> memory and processing them sequentially. The same amount of
> > memory
> > > > is
> > > > >>>>>> needed for small or large segments. Each posting list is in
> > > > >> document-id
> > > > >>>>>> order. The merge is a merge of sorted lists, writing a new
> > posting
> > > > >> list
> > > > >>>> in
> > > > >>>>>> document-id order.
> > > > >>>>>>
> > > > >>>>>> wunder
> > > > >>>>>> Walter Underwood
> > > > >>>>>> wunder@wunderwood.org
> > > > >>>>>> http://observer.wunderwood.org/  (my blog)
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>
> > > > >>>>
> > > > >>
> > > > >>
> > > >
> > > >
> > >
> >
>

Re: Solr Heap Usage

Posted by John Davis <jo...@gmail.com>.
What would be the best way to understand where heap is being used?
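
One of the usual suspects called out in the advice quoted further down is the
filterCache, where each entry can cost roughly maxDoc/8 bytes. A back-of-the-envelope
sketch with hypothetical numbers (not measurements from this thread) of what that
adds up to:

    public class FilterCacheEstimate {
        public static void main(String[] args) {
            long maxDoc = 100_000_000L;            // hypothetical number of docs in the core
            long cacheEntries = 512;               // hypothetical filterCache size
            long bytesPerEntry = maxDoc / 8;       // one bit per document, plus overhead
            double totalGb = (double) cacheEntries * bytesPerEntry / (1024 * 1024 * 1024);
            System.out.printf("~%d MB per entry, ~%.1f GB for %d entries (plus overhead)%n",
                bytesPerEntry / (1024 * 1024), totalGb, cacheEntries);
        }
    }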

On Tue, Jun 4, 2019 at 9:31 PM Greg Harris <ha...@gmail.com> wrote:

> Just a couple of points I’d make here. I did some testing a while back in
> which if no commit is made, (hard or soft) there are internal memory
> structures holding tlogs and it will continue to get worse the more docs
> that come in. I don’t know if that’s changed in further versions. I’d
> recommend doing commits with some amount of frequency in indexing heavy
> apps, otherwise you are likely to have heap issues. I personally would
> advocate for some of the points already made. There are too many variables
> going on here and ways to modify stuff to make sizing decisions and think
> you’re doing anything other than a pure guess if you don’t test and
> monitor. I’d advocate for a process in which testing is done regularly to
> figure out questions like number of shards/replicas, heap size, memory etc.
> Hard data, good process and regular testing will trump guesswork every time
>
> Greg
>
> On Tue, Jun 4, 2019 at 9:22 AM John Davis <jo...@gmail.com>
> wrote:
>
> > You might want to test with softcommit of hours vs 5m for heavy indexing
> +
> > light query -- even though there is internal memory structure overhead
> for
> > no soft commits, in our testing a 5m soft commit (via commitWithin) has
> > resulted in a very very large heap usage which I suspect is because of
> > other overhead associated with it.
> >
> > On Tue, Jun 4, 2019 at 8:03 AM Erick Erickson <er...@gmail.com>
> > wrote:
> >
> > > I need to update that, didn’t understand the bits about retaining
> > internal
> > > memory structures at the time.
> > >
> > > > On Jun 4, 2019, at 2:10 AM, John Davis <jo...@gmail.com>
> > > wrote:
> > > >
> > > > Erick - These conflict, what's changed?
> > > >
> > > > So if I were going to recommend settings, they’d be something like
> > this:
> > > > Do a hard commit with openSearcher=false every 60 seconds.
> > > > Do a soft commit every 5 minutes.
> > > >
> > > > vs
> > > >
> > > > Index-heavy, Query-light
> > > > Set your soft commit interval quite long, up to the maximum latency
> you
> > > can
> > > > stand for documents to be visible. This could be just a couple of
> > minutes
> > > > or much longer. Maybe even hours with the capability of issuing a
> hard
> > > > commit (openSearcher=true) or soft commit on demand.
> > > >
> > >
> >
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> > > >
> > > >
> > > >
> > > >
> > > > On Sun, Jun 2, 2019 at 8:58 PM Erick Erickson <
> erickerickson@gmail.com
> > >
> > > > wrote:
> > > >
> > > >>> I've looked through SolrJ, DIH and others -- is the bottomline
> > > >>> across all of them to "batch updates" and not commit as long as
> > > possible?
> > > >>
> > > >> Of course it’s more complicated than that ;)….
> > > >>
> > > >> But to start, yes, I urge you to batch. Here’s some stats:
> > > >> https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/
> > > >>
> > > >> Note that at about 100 docs/batch you hit diminishing returns.
> > > _However_,
> > > >> that test was run on a single shard collection, so if you have 10
> > shards
> > > >> you’d
> > > >> have to send 1,000 docs/batch. I wouldn’t sweat that number much,
> just
> > > >> don’t
> > > >> send one at a time. And there are the usual gotchas if your
> documents
> > > are
> > > >> 1M .vs. 1K.
> > > >>
> > > >> About committing. No, don’t hold off as long as possible. When you
> > > commit,
> > > >> segments are merged. _However_, the default 100M internal buffer
> size
> > > means
> > > >> that segments are written anyway even if you don’t hit a commit
> point
> > > when
> > > >> you have 100M of index data, and merges happen anyway. So you won’t
> > save
> > > >> anything on merging by holding off commits.
> > > >> And you’ll incur penalties. Here’s more than you want to know about
> > > >> commits:
> > > >>
> > > >>
> > >
> >
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> > > >>
> > > >> But some key take-aways… If for some reason Solr abnormally
> > > >> terminates, the accumulated documents since the last hard
> > > >> commit are replayed. So say you don’t commit for an hour of
> > > >> furious indexing and someone does a “kill -9”. When you restart
> > > >> Solr it’ll try to re-index all the docs for the last hour. Hard
> > commits
> > > >> with openSearcher=false aren’t all that expensive. I usually set
> mine
> > > >> for a minute and forget about it.
> > > >>
> > > >> Transaction logs hold a window, _not_ the entire set of operations
> > > >> since time began. When you do a hard commit, the current tlog is
> > > >> closed and a new one opened and ones that are “too old” are deleted.
> > If
> > > >> you never commit you have a huge transaction log to no good purpose.
> > > >>
> > > >> Also, while indexing, in order to accommodate “Real Time Get”, all
> > > >> the docs indexed since the last searcher was opened have a pointer
> > > >> kept in memory. So if you _never_ open a new searcher, that internal
> > > >> structure can get quite large. So in bulk-indexing operations, I
> > > >> suggest you open a searcher every so often.
> > > >>
> > > >> Opening a new searcher isn’t terribly expensive if you have no
> > > autowarming
> > > >> going on. Autowarming as defined in solrconfig.xml in filterCache,
> > > >> queryResultCache
> > > >> etc.
> > > >>
> > > >> So if I were going to recommend settings, they’d be something like
> > this:
> > > >> Do a hard commit with openSearcher=false every 60 seconds.
> > > >> Do a soft commit every 5 minutes.
> > > >>
> > > >> I’d actually be surprised if you were able to measure differences
> > > between
> > > >> those settings and just hard commit with openSearcher=true every 60
> > > >> seconds and soft commit at -1 (never)…
> > > >>
> > > >> Best,
> > > >> Erick
> > > >>
> > > >>> On Jun 2, 2019, at 3:35 PM, John Davis <jo...@gmail.com>
> > > >> wrote:
> > > >>>
> > > >>> If we assume there is no query load then effectively this boils
> down
> > to
> > > >>> most effective way for adding a large number of documents to the
> solr
> > > >>> index. I've looked through SolrJ, DIH and others -- is the
> bottomline
> > > >>> across all of them to "batch updates" and not commit as long as
> > > possible?
> > > >>>
> > > >>> On Sun, Jun 2, 2019 at 7:44 AM Erick Erickson <
> > erickerickson@gmail.com
> > > >
> > > >>> wrote:
> > > >>>
> > > >>>> Oh, there are about a zillion reasons ;).
> > > >>>>
> > > >>>> First of all, most tools that show heap usage also count
> uncollected
> > > >>>> garbage. So your 10G could actually be much less “live” data.
> Quick
> > > way
> > > >> to
> > > >>>> test is to attach jconsole to the running Solr and hit the button
> > that
> > > >>>> forces a full GC.
> > > >>>>
> > > >>>> Another way is to reduce your heap when you start Solr (on a test
> > > system
> > > >>>> of course) until bad stuff happens, if you reduce it to very close
> > to
> > > >> what
> > > >>>> Solr needs, you’ll get slower as more and more cycles are spent on
> > GC,
> > > >> if
> > > >>>> you reduce it a little more you’ll get OOMs.
> > > >>>>
> > > >>>> You can take heap dumps of course to see where all the memory is
> > being
> > > >>>> used, but that’s tricky as it also includes garbage.
> > > >>>>
> > > >>>> I’ve seen cache sizes (filterCache in particular) be something
> that
> > > uses
> > > >>>> lots of memory, but that requires queries to be fired. Each
> > > filterCache
> > > >>>> entry can take up to roughly maxDoc/8 bytes + overhead….
> > > >>>>
> > > >>>> A classic error is to sort, group or facet on a docValues=false
> > field.
> > > >>>> Starting with Solr 7.6, you can add an option to fields to throw
> an
> > > >> error
> > > >>>> if you do this, see:
> > https://issues.apache.org/jira/browse/SOLR-12962
> > > .
> > > >>>>
> > > >>>> In short, there’s not enough information until you dive in and
> test
> > > >>>> bunches of stuff to tell.
> > > >>>>
> > > >>>> Best,
> > > >>>> Erick
> > > >>>>
> > > >>>>
> > > >>>>> On Jun 2, 2019, at 2:22 AM, John Davis <
> johndavis925254@gmail.com>
> > > >>>> wrote:
> > > >>>>>
> > > >>>>> This makes sense, any ideas why lucene/solr will use 10g heap
> for a
> > > 20g
> > > >>>>> index.My hypothesis was merging segments was trying to read it
> all
> > > but
> > > >> if
> > > >>>>> that's not the case I am out of ideas. The one caveat is we are
> > > trying
> > > >> to
> > > >>>>> add the documents quickly (~1g an hour) but if lucene does write
> > 100m
> > > >>>>> segments and does streaming merge it shouldn't matter?
> > > >>>>>
> > > >>>>> On Sat, Jun 1, 2019 at 9:24 AM Walter Underwood <
> > > wunder@wunderwood.org
> > > >>>
> > > >>>>> wrote:
> > > >>>>>
> > > >>>>>>> On May 31, 2019, at 11:27 PM, John Davis <
> > > johndavis925254@gmail.com>
> > > >>>>>> wrote:
> > > >>>>>>>
> > > >>>>>>> 2. Merging segments - does solr load the entire segment in
> memory
> > > or
> > > >>>>>> chunks
> > > >>>>>>> of it? if later how large are these chunks
> > > >>>>>>
> > > >>>>>> No, it does not read the entire segment into memory.
> > > >>>>>>
> > > >>>>>> A fundamental part of the Lucene design is streaming posting
> lists
> > > >> into
> > > >>>>>> memory and processing them sequentially. The same amount of
> memory
> > > is
> > > >>>>>> needed for small or large segments. Each posting list is in
> > > >> document-id
> > > >>>>>> order. The merge is a merge of sorted lists, writing a new
> posting
> > > >> list
> > > >>>> in
> > > >>>>>> document-id order.
> > > >>>>>>
> > > >>>>>> wunder
> > > >>>>>> Walter Underwood
> > > >>>>>> wunder@wunderwood.org
> > > >>>>>> http://observer.wunderwood.org/  (my blog)
> > > >>>>>>
> > > >>>>>>
> > > >>>>
> > > >>>>
> > > >>
> > > >>
> > >
> > >
> >
>

Re: Solr Heap Usage

Posted by Greg Harris <ha...@gmail.com>.
Just a couple of points I’d make here. I did some testing a while back in
which, if no commit is made (hard or soft), there are internal memory
structures holding tlogs, and they continue to grow the more docs that come
in. I don’t know whether that’s changed in later versions. I’d recommend
doing commits with some frequency in indexing-heavy apps, otherwise you are
likely to have heap issues. I personally would second some of the points
already made. There are too many variables in play, and too many ways to
modify things, for sizing decisions to be anything other than a pure guess
if you don’t test and monitor. I’d advocate a process in which testing is
done regularly to figure out questions like number of shards/replicas, heap
size, memory etc. Hard data, a good process and regular testing will trump
guesswork every time.

Greg

On Tue, Jun 4, 2019 at 9:22 AM John Davis <jo...@gmail.com> wrote:

> You might want to test with softcommit of hours vs 5m for heavy indexing +
> light query -- even though there is internal memory structure overhead for
> no soft commits, in our testing a 5m soft commit (via commitWithin) has
> resulted in a very very large heap usage which I suspect is because of
> other overhead associated with it.
>
> On Tue, Jun 4, 2019 at 8:03 AM Erick Erickson <er...@gmail.com>
> wrote:
>
> > I need to update that, didn’t understand the bits about retaining
> internal
> > memory structures at the time.
> >
> > > On Jun 4, 2019, at 2:10 AM, John Davis <jo...@gmail.com>
> > wrote:
> > >
> > > Erick - These conflict, what's changed?
> > >
> > > So if I were going to recommend settings, they’d be something like
> this:
> > > Do a hard commit with openSearcher=false every 60 seconds.
> > > Do a soft commit every 5 minutes.
> > >
> > > vs
> > >
> > > Index-heavy, Query-light
> > > Set your soft commit interval quite long, up to the maximum latency you
> > can
> > > stand for documents to be visible. This could be just a couple of
> minutes
> > > or much longer. Maybe even hours with the capability of issuing a hard
> > > commit (openSearcher=true) or soft commit on demand.
> > >
> >
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> > >
> > >
> > >
> > >
> > > On Sun, Jun 2, 2019 at 8:58 PM Erick Erickson <erickerickson@gmail.com
> >
> > > wrote:
> > >
> > >>> I've looked through SolrJ, DIH and others -- is the bottomline
> > >>> across all of them to "batch updates" and not commit as long as
> > possible?
> > >>
> > >> Of course it’s more complicated than that ;)….
> > >>
> > >> But to start, yes, I urge you to batch. Here’s some stats:
> > >> https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/
> > >>
> > >> Note that at about 100 docs/batch you hit diminishing returns.
> > _However_,
> > >> that test was run on a single shard collection, so if you have 10
> shards
> > >> you’d
> > >> have to send 1,000 docs/batch. I wouldn’t sweat that number much, just
> > >> don’t
> > >> send one at a time. And there are the usual gotchas if your documents
> > are
> > >> 1M .vs. 1K.
> > >>
> > >> About committing. No, don’t hold off as long as possible. When you
> > commit,
> > >> segments are merged. _However_, the default 100M internal buffer size
> > means
> > >> that segments are written anyway even if you don’t hit a commit point
> > when
> > >> you have 100M of index data, and merges happen anyway. So you won’t
> save
> > >> anything on merging by holding off commits.
> > >> And you’ll incur penalties. Here’s more than you want to know about
> > >> commits:
> > >>
> > >>
> >
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> > >>
> > >> But some key take-aways… If for some reason Solr abnormally
> > >> terminates, the accumulated documents since the last hard
> > >> commit are replayed. So say you don’t commit for an hour of
> > >> furious indexing and someone does a “kill -9”. When you restart
> > >> Solr it’ll try to re-index all the docs for the last hour. Hard
> commits
> > >> with openSearcher=false aren’t all that expensive. I usually set mine
> > >> for a minute and forget about it.
> > >>
> > >> Transaction logs hold a window, _not_ the entire set of operations
> > >> since time began. When you do a hard commit, the current tlog is
> > >> closed and a new one opened and ones that are “too old” are deleted.
> If
> > >> you never commit you have a huge transaction log to no good purpose.
> > >>
> > >> Also, while indexing, in order to accommodate “Real Time Get”, all
> > >> the docs indexed since the last searcher was opened have a pointer
> > >> kept in memory. So if you _never_ open a new searcher, that internal
> > >> structure can get quite large. So in bulk-indexing operations, I
> > >> suggest you open a searcher every so often.
> > >>
> > >> Opening a new searcher isn’t terribly expensive if you have no
> > autowarming
> > >> going on. Autowarming as defined in solrconfig.xml in filterCache,
> > >> queryResultCache
> > >> etc.
> > >>
> > >> So if I were going to recommend settings, they’d be something like
> this:
> > >> Do a hard commit with openSearcher=false every 60 seconds.
> > >> Do a soft commit every 5 minutes.
> > >>
> > >> I’d actually be surprised if you were able to measure differences
> > between
> > >> those settings and just hard commit with openSearcher=true every 60
> > >> seconds and soft commit at -1 (never)…
> > >>
> > >> Best,
> > >> Erick
> > >>
> > >>> On Jun 2, 2019, at 3:35 PM, John Davis <jo...@gmail.com>
> > >> wrote:
> > >>>
> > >>> If we assume there is no query load then effectively this boils down
> to
> > >>> most effective way for adding a large number of documents to the solr
> > >>> index. I've looked through SolrJ, DIH and others -- is the bottomline
> > >>> across all of them to "batch updates" and not commit as long as
> > possible?
> > >>>
> > >>> On Sun, Jun 2, 2019 at 7:44 AM Erick Erickson <
> erickerickson@gmail.com
> > >
> > >>> wrote:
> > >>>
> > >>>> Oh, there are about a zillion reasons ;).
> > >>>>
> > >>>> First of all, most tools that show heap usage also count uncollected
> > >>>> garbage. So your 10G could actually be much less “live” data. Quick
> > way
> > >> to
> > >>>> test is to attach jconsole to the running Solr and hit the button
> that
> > >>>> forces a full GC.
> > >>>>
> > >>>> Another way is to reduce your heap when you start Solr (on a test
> > system
> > >>>> of course) until bad stuff happens, if you reduce it to very close
> to
> > >> what
> > >>>> Solr needs, you’ll get slower as more and more cycles are spent on
> GC,
> > >> if
> > >>>> you reduce it a little more you’ll get OOMs.
> > >>>>
> > >>>> You can take heap dumps of course to see where all the memory is
> being
> > >>>> used, but that’s tricky as it also includes garbage.
> > >>>>
> > >>>> I’ve seen cache sizes (filterCache in particular) be something that
> > uses
> > >>>> lots of memory, but that requires queries to be fired. Each
> > filterCache
> > >>>> entry can take up to roughly maxDoc/8 bytes + overhead….
> > >>>>
> > >>>> A classic error is to sort, group or facet on a docValues=false
> field.
> > >>>> Starting with Solr 7.6, you can add an option to fields to throw an
> > >> error
> > >>>> if you do this, see:
> https://issues.apache.org/jira/browse/SOLR-12962
> > .
> > >>>>
> > >>>> In short, there’s not enough information until you dive in and test
> > >>>> bunches of stuff to tell.
> > >>>>
> > >>>> Best,
> > >>>> Erick
> > >>>>
> > >>>>
> > >>>>> On Jun 2, 2019, at 2:22 AM, John Davis <jo...@gmail.com>
> > >>>> wrote:
> > >>>>>
> > >>>>> This makes sense, any ideas why lucene/solr will use 10g heap for a
> > 20g
> > >>>>> index.My hypothesis was merging segments was trying to read it all
> > but
> > >> if
> > >>>>> that's not the case I am out of ideas. The one caveat is we are
> > trying
> > >> to
> > >>>>> add the documents quickly (~1g an hour) but if lucene does write
> 100m
> > >>>>> segments and does streaming merge it shouldn't matter?
> > >>>>>
> > >>>>> On Sat, Jun 1, 2019 at 9:24 AM Walter Underwood <
> > wunder@wunderwood.org
> > >>>
> > >>>>> wrote:
> > >>>>>
> > >>>>>>> On May 31, 2019, at 11:27 PM, John Davis <
> > johndavis925254@gmail.com>
> > >>>>>> wrote:
> > >>>>>>>
> > >>>>>>> 2. Merging segments - does solr load the entire segment in memory
> > or
> > >>>>>> chunks
> > >>>>>>> of it? if later how large are these chunks
> > >>>>>>
> > >>>>>> No, it does not read the entire segment into memory.
> > >>>>>>
> > >>>>>> A fundamental part of the Lucene design is streaming posting lists
> > >> into
> > >>>>>> memory and processing them sequentially. The same amount of memory
> > is
> > >>>>>> needed for small or large segments. Each posting list is in
> > >> document-id
> > >>>>>> order. The merge is a merge of sorted lists, writing a new posting
> > >> list
> > >>>> in
> > >>>>>> document-id order.
> > >>>>>>
> > >>>>>> wunder
> > >>>>>> Walter Underwood
> > >>>>>> wunder@wunderwood.org
> > >>>>>> http://observer.wunderwood.org/  (my blog)
> > >>>>>>
> > >>>>>>
> > >>>>
> > >>>>
> > >>
> > >>
> >
> >
>

Re: Solr Heap Usage

Posted by John Davis <jo...@gmail.com>.
You might want to test with a soft commit interval of hours vs. 5 minutes for
heavy indexing + light query -- even though there is internal memory structure
overhead when no soft commits happen, in our testing a 5-minute soft commit
(via commitWithin) resulted in very large heap usage, which I suspect is
because of other overhead associated with it.
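
For reference, a minimal SolrJ sketch of what "via commitWithin" looks like
(placeholder URL, collection, field and interval; 300000 ms stands in for the
5-minute window discussed here):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class CommitWithinExample {
        public static void main(String[] args) throws Exception {
            try (SolrClient client =
                     new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
                List<SolrInputDocument> batch = new ArrayList<>();
                for (int i = 0; i < 1000; i++) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", "doc-" + i);
                    batch.add(doc);
                }
                // Ask Solr to make these docs searchable within 5 minutes without the
                // client ever issuing an explicit commit.
                client.add(batch, 300_000);
            }
        }
    }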

On Tue, Jun 4, 2019 at 8:03 AM Erick Erickson <er...@gmail.com>
wrote:

> I need to update that, didn’t understand the bits about retaining internal
> memory structures at the time.
>
> > On Jun 4, 2019, at 2:10 AM, John Davis <jo...@gmail.com>
> wrote:
> >
> > Erick - These conflict, what's changed?
> >
> > So if I were going to recommend settings, they’d be something like this:
> > Do a hard commit with openSearcher=false every 60 seconds.
> > Do a soft commit every 5 minutes.
> >
> > vs
> >
> > Index-heavy, Query-light
> > Set your soft commit interval quite long, up to the maximum latency you
> can
> > stand for documents to be visible. This could be just a couple of minutes
> > or much longer. Maybe even hours with the capability of issuing a hard
> > commit (openSearcher=true) or soft commit on demand.
> >
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> >
> >
> >
> >
> > On Sun, Jun 2, 2019 at 8:58 PM Erick Erickson <er...@gmail.com>
> > wrote:
> >
> >>> I've looked through SolrJ, DIH and others -- is the bottomline
> >>> across all of them to "batch updates" and not commit as long as
> possible?
> >>
> >> Of course it’s more complicated than that ;)….
> >>
> >> But to start, yes, I urge you to batch. Here’s some stats:
> >> https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/
> >>
> >> Note that at about 100 docs/batch you hit diminishing returns.
> _However_,
> >> that test was run on a single shard collection, so if you have 10 shards
> >> you’d
> >> have to send 1,000 docs/batch. I wouldn’t sweat that number much, just
> >> don’t
> >> send one at a time. And there are the usual gotchas if your documents
> are
> >> 1M .vs. 1K.
> >>
> >> About committing. No, don’t hold off as long as possible. When you
> commit,
> >> segments are merged. _However_, the default 100M internal buffer size
> means
> >> that segments are written anyway even if you don’t hit a commit point
> when
> >> you have 100M of index data, and merges happen anyway. So you won’t save
> >> anything on merging by holding off commits.
> >> And you’ll incur penalties. Here’s more than you want to know about
> >> commits:
> >>
> >>
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> >>
> >> But some key take-aways… If for some reason Solr abnormally
> >> terminates, the accumulated documents since the last hard
> >> commit are replayed. So say you don’t commit for an hour of
> >> furious indexing and someone does a “kill -9”. When you restart
> >> Solr it’ll try to re-index all the docs for the last hour. Hard commits
> >> with openSearcher=false aren’t all that expensive. I usually set mine
> >> for a minute and forget about it.
> >>
> >> Transaction logs hold a window, _not_ the entire set of operations
> >> since time began. When you do a hard commit, the current tlog is
> >> closed and a new one opened and ones that are “too old” are deleted. If
> >> you never commit you have a huge transaction log to no good purpose.
> >>
> >> Also, while indexing, in order to accommodate “Real Time Get”, all
> >> the docs indexed since the last searcher was opened have a pointer
> >> kept in memory. So if you _never_ open a new searcher, that internal
> >> structure can get quite large. So in bulk-indexing operations, I
> >> suggest you open a searcher every so often.
> >>
> >> Opening a new searcher isn’t terribly expensive if you have no
> autowarming
> >> going on. Autowarming as defined in solrconfig.xml in filterCache,
> >> queryResultCache
> >> etc.
> >>
> >> So if I were going to recommend settings, they’d be something like this:
> >> Do a hard commit with openSearcher=false every 60 seconds.
> >> Do a soft commit every 5 minutes.
> >>
> >> I’d actually be surprised if you were able to measure differences
> between
> >> those settings and just hard commit with openSearcher=true every 60
> >> seconds and soft commit at -1 (never)…
> >>
> >> Best,
> >> Erick
> >>
> >>> On Jun 2, 2019, at 3:35 PM, John Davis <jo...@gmail.com>
> >> wrote:
> >>>
> >>> If we assume there is no query load then effectively this boils down to
> >>> most effective way for adding a large number of documents to the solr
> >>> index. I've looked through SolrJ, DIH and others -- is the bottomline
> >>> across all of them to "batch updates" and not commit as long as
> possible?
> >>>
> >>> On Sun, Jun 2, 2019 at 7:44 AM Erick Erickson <erickerickson@gmail.com
> >
> >>> wrote:
> >>>
> >>>> Oh, there are about a zillion reasons ;).
> >>>>
> >>>> First of all, most tools that show heap usage also count uncollected
> >>>> garbage. So your 10G could actually be much less “live” data. Quick
> way
> >> to
> >>>> test is to attach jconsole to the running Solr and hit the button that
> >>>> forces a full GC.
> >>>>
> >>>> Another way is to reduce your heap when you start Solr (on a test
> system
> >>>> of course) until bad stuff happens, if you reduce it to very close to
> >> what
> >>>> Solr needs, you’ll get slower as more and more cycles are spent on GC,
> >> if
> >>>> you reduce it a little more you’ll get OOMs.
> >>>>
> >>>> You can take heap dumps of course to see where all the memory is being
> >>>> used, but that’s tricky as it also includes garbage.
> >>>>
> >>>> I’ve seen cache sizes (filterCache in particular) be something that
> uses
> >>>> lots of memory, but that requires queries to be fired. Each
> filterCache
> >>>> entry can take up to roughly maxDoc/8 bytes + overhead….
> >>>>
> >>>> A classic error is to sort, group or facet on a docValues=false field.
> >>>> Starting with Solr 7.6, you can add an option to fields to throw an
> >> error
> >>>> if you do this, see: https://issues.apache.org/jira/browse/SOLR-12962
> .
> >>>>
> >>>> In short, there’s not enough information until you dive in and test
> >>>> bunches of stuff to tell.
> >>>>
> >>>> Best,
> >>>> Erick
> >>>>
> >>>>
> >>>>> On Jun 2, 2019, at 2:22 AM, John Davis <jo...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> This makes sense, any ideas why lucene/solr will use 10g heap for a
> 20g
> >>>>> index.My hypothesis was merging segments was trying to read it all
> but
> >> if
> >>>>> that's not the case I am out of ideas. The one caveat is we are
> trying
> >> to
> >>>>> add the documents quickly (~1g an hour) but if lucene does write 100m
> >>>>> segments and does streaming merge it shouldn't matter?
> >>>>>
> >>>>> On Sat, Jun 1, 2019 at 9:24 AM Walter Underwood <
> wunder@wunderwood.org
> >>>
> >>>>> wrote:
> >>>>>
> >>>>>>> On May 31, 2019, at 11:27 PM, John Davis <
> johndavis925254@gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> 2. Merging segments - does solr load the entire segment in memory
> or
> >>>>>> chunks
> >>>>>>> of it? if later how large are these chunks
> >>>>>>
> >>>>>> No, it does not read the entire segment into memory.
> >>>>>>
> >>>>>> A fundamental part of the Lucene design is streaming posting lists
> >> into
> >>>>>> memory and processing them sequentially. The same amount of memory
> is
> >>>>>> needed for small or large segments. Each posting list is in
> >> document-id
> >>>>>> order. The merge is a merge of sorted lists, writing a new posting
> >> list
> >>>> in
> >>>>>> document-id order.
> >>>>>>
> >>>>>> wunder
> >>>>>> Walter Underwood
> >>>>>> wunder@wunderwood.org
> >>>>>> http://observer.wunderwood.org/  (my blog)
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>
> >>
>
>

Re: Solr Heap Usage

Posted by Erick Erickson <er...@gmail.com>.
I need to update that; I didn’t understand the bits about retaining internal memory structures at the time.

> On Jun 4, 2019, at 2:10 AM, John Davis <jo...@gmail.com> wrote:
> 
> Erick - These conflict, what's changed?
> 
> So if I were going to recommend settings, they’d be something like this:
> Do a hard commit with openSearcher=false every 60 seconds.
> Do a soft commit every 5 minutes.
> 
> vs
> 
> Index-heavy, Query-light
> Set your soft commit interval quite long, up to the maximum latency you can
> stand for documents to be visible. This could be just a couple of minutes
> or much longer. Maybe even hours with the capability of issuing a hard
> commit (openSearcher=true) or soft commit on demand.
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> 
> 
> 
> 
> On Sun, Jun 2, 2019 at 8:58 PM Erick Erickson <er...@gmail.com>
> wrote:
> 
>>> I've looked through SolrJ, DIH and others -- is the bottomline
>>> across all of them to "batch updates" and not commit as long as possible?
>> 
>> Of course it’s more complicated than that ;)….
>> 
>> But to start, yes, I urge you to batch. Here’s some stats:
>> https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/
>> 
>> Note that at about 100 docs/batch you hit diminishing returns. _However_,
>> that test was run on a single shard collection, so if you have 10 shards
>> you’d
>> have to send 1,000 docs/batch. I wouldn’t sweat that number much, just
>> don’t
>> send one at a time. And there are the usual gotchas if your documents are
>> 1M .vs. 1K.
>> 
>> About committing. No, don’t hold off as long as possible. When you commit,
>> segments are merged. _However_, the default 100M internal buffer size means
>> that segments are written anyway even if you don’t hit a commit point when
>> you have 100M of index data, and merges happen anyway. So you won’t save
>> anything on merging by holding off commits.
>> And you’ll incur penalties. Here’s more than you want to know about
>> commits:
>> 
>> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>> 
>> But some key take-aways… If for some reason Solr abnormally
>> terminates, the accumulated documents since the last hard
>> commit are replayed. So say you don’t commit for an hour of
>> furious indexing and someone does a “kill -9”. When you restart
>> Solr it’ll try to re-index all the docs for the last hour. Hard commits
>> with openSearcher=false aren’t all that expensive. I usually set mine
>> for a minute and forget about it.
>> 
>> Transaction logs hold a window, _not_ the entire set of operations
>> since time began. When you do a hard commit, the current tlog is
>> closed and a new one opened and ones that are “too old” are deleted. If
>> you never commit you have a huge transaction log to no good purpose.
>> 
>> Also, while indexing, in order to accommodate “Real Time Get”, all
>> the docs indexed since the last searcher was opened have a pointer
>> kept in memory. So if you _never_ open a new searcher, that internal
>> structure can get quite large. So in bulk-indexing operations, I
>> suggest you open a searcher every so often.
>> 
>> Opening a new searcher isn’t terribly expensive if you have no autowarming
>> going on. Autowarming as defined in solrconfig.xml in filterCache,
>> queryResultCache
>> etc.
>> 
>> So if I were going to recommend settings, they’d be something like this:
>> Do a hard commit with openSearcher=false every 60 seconds.
>> Do a soft commit every 5 minutes.
>> 
>> I’d actually be surprised if you were able to measure differences between
>> those settings and just hard commit with openSearcher=true every 60
>> seconds and soft commit at -1 (never)…
>> 
>> Best,
>> Erick
>> 
>>> On Jun 2, 2019, at 3:35 PM, John Davis <jo...@gmail.com>
>> wrote:
>>> 
>>> If we assume there is no query load then effectively this boils down to
>>> most effective way for adding a large number of documents to the solr
>>> index. I've looked through SolrJ, DIH and others -- is the bottomline
>>> across all of them to "batch updates" and not commit as long as possible?
>>> 
>>> On Sun, Jun 2, 2019 at 7:44 AM Erick Erickson <er...@gmail.com>
>>> wrote:
>>> 
>>>> Oh, there are about a zillion reasons ;).
>>>> 
>>>> First of all, most tools that show heap usage also count uncollected
>>>> garbage. So your 10G could actually be much less “live” data. Quick way
>> to
>>>> test is to attach jconsole to the running Solr and hit the button that
>>>> forces a full GC.
>>>> 
>>>> Another way is to reduce your heap when you start Solr (on a test system
>>>> of course) until bad stuff happens, if you reduce it to very close to
>> what
>>>> Solr needs, you’ll get slower as more and more cycles are spent on GC,
>> if
>>>> you reduce it a little more you’ll get OOMs.
>>>> 
>>>> You can take heap dumps of course to see where all the memory is being
>>>> used, but that’s tricky as it also includes garbage.
>>>> 
>>>> I’ve seen cache sizes (filterCache in particular) be something that uses
>>>> lots of memory, but that requires queries to be fired. Each filterCache
>>>> entry can take up to roughly maxDoc/8 bytes + overhead….
>>>> 
>>>> A classic error is to sort, group or facet on a docValues=false field.
>>>> Starting with Solr 7.6, you can add an option to fields to throw an
>> error
>>>> if you do this, see: https://issues.apache.org/jira/browse/SOLR-12962.
>>>> 
>>>> In short, there’s not enough information until you dive in and test
>>>> bunches of stuff to tell.
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>> 
>>>>> On Jun 2, 2019, at 2:22 AM, John Davis <jo...@gmail.com>
>>>> wrote:
>>>>> 
>>>>> This makes sense, any ideas why lucene/solr will use 10g heap for a 20g
>>>>> index.My hypothesis was merging segments was trying to read it all but
>> if
>>>>> that's not the case I am out of ideas. The one caveat is we are trying
>> to
>>>>> add the documents quickly (~1g an hour) but if lucene does write 100m
>>>>> segments and does streaming merge it shouldn't matter?
>>>>> 
>>>>> On Sat, Jun 1, 2019 at 9:24 AM Walter Underwood <wunder@wunderwood.org
>>> 
>>>>> wrote:
>>>>> 
>>>>>>> On May 31, 2019, at 11:27 PM, John Davis <jo...@gmail.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>> 2. Merging segments - does solr load the entire segment in memory or
>>>>>> chunks
>>>>>>> of it? if later how large are these chunks
>>>>>> 
>>>>>> No, it does not read the entire segment into memory.
>>>>>> 
>>>>>> A fundamental part of the Lucene design is streaming posting lists
>> into
>>>>>> memory and processing them sequentially. The same amount of memory is
>>>>>> needed for small or large segments. Each posting list is in
>> document-id
>>>>>> order. The merge is a merge of sorted lists, writing a new posting
>> list
>>>> in
>>>>>> document-id order.
>>>>>> 
>>>>>> wunder
>>>>>> Walter Underwood
>>>>>> wunder@wunderwood.org
>>>>>> http://observer.wunderwood.org/  (my blog)
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>> 


Re: Solr Heap Usage

Posted by John Davis <jo...@gmail.com>.
Erick - These conflict, what's changed?

So if I were going to recommend settings, they’d be something like this:
Do a hard commit with openSearcher=false every 60 seconds.
Do a soft commit every 5 minutes.

vs

Index-heavy, Query-light
Set your soft commit interval quite long, up to the maximum latency you can
stand for documents to be visible. This could be just a couple of minutes
or much longer. Maybe even hours with the capability of issuing a hard
commit (openSearcher=true) or soft commit on demand.
https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/




On Sun, Jun 2, 2019 at 8:58 PM Erick Erickson <er...@gmail.com>
wrote:

> > I've looked through SolrJ, DIH and others -- is the bottomline
> > across all of them to "batch updates" and not commit as long as possible?
>
> Of course it’s more complicated than that ;)….
>
> But to start, yes, I urge you to batch. Here’s some stats:
> https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/
>
> Note that at about 100 docs/batch you hit diminishing returns. _However_,
> that test was run on a single shard collection, so if you have 10 shards
> you’d
> have to send 1,000 docs/batch. I wouldn’t sweat that number much, just
> don’t
> send one at a time. And there are the usual gotchas if your documents are
> 1M .vs. 1K.
>
> About committing. No, don’t hold off as long as possible. When you commit,
> segments are merged. _However_, the default 100M internal buffer size means
> that segments are written anyway even if you don’t hit a commit point when
> you have 100M of index data, and merges happen anyway. So you won’t save
> anything on merging by holding off commits.
> And you’ll incur penalties. Here’s more than you want to know about
> commits:
>
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> But some key take-aways… If for some reason Solr abnormally
> terminates, the accumulated documents since the last hard
> commit are replayed. So say you don’t commit for an hour of
> furious indexing and someone does a “kill -9”. When you restart
> Solr it’ll try to re-index all the docs for the last hour. Hard commits
> with openSearcher=false aren’t all that expensive. I usually set mine
> for a minute and forget about it.
>
> Transaction logs hold a window, _not_ the entire set of operations
> since time began. When you do a hard commit, the current tlog is
> closed and a new one opened and ones that are “too old” are deleted. If
> you never commit you have a huge transaction log to no good purpose.
>
> Also, while indexing, in order to accommodate “Real Time Get”, all
> the docs indexed since the last searcher was opened have a pointer
> kept in memory. So if you _never_ open a new searcher, that internal
> structure can get quite large. So in bulk-indexing operations, I
> suggest you open a searcher every so often.
>
> Opening a new searcher isn’t terribly expensive if you have no autowarming
> going on. Autowarming as defined in solrconfig.xml in filterCache,
> queryResultCache
> etc.
>
> So if I were going to recommend settings, they’d be something like this:
> Do a hard commit with openSearcher=false every 60 seconds.
> Do a soft commit every 5 minutes.
>
> I’d actually be surprised if you were able to measure differences between
> those settings and just hard commit with openSearcher=true every 60
> seconds and soft commit at -1 (never)…
>
> Best,
> Erick
>
> > On Jun 2, 2019, at 3:35 PM, John Davis <jo...@gmail.com>
> wrote:
> >
> > If we assume there is no query load then effectively this boils down to
> > most effective way for adding a large number of documents to the solr
> > index. I've looked through SolrJ, DIH and others -- is the bottomline
> > across all of them to "batch updates" and not commit as long as possible?
> >
> > On Sun, Jun 2, 2019 at 7:44 AM Erick Erickson <er...@gmail.com>
> > wrote:
> >
> >> Oh, there are about a zillion reasons ;).
> >>
> >> First of all, most tools that show heap usage also count uncollected
> >> garbage. So your 10G could actually be much less “live” data. Quick way
> to
> >> test is to attach jconsole to the running Solr and hit the button that
> >> forces a full GC.
> >>
> >> Another way is to reduce your heap when you start Solr (on a test system
> >> of course) until bad stuff happens, if you reduce it to very close to
> what
> >> Solr needs, you’ll get slower as more and more cycles are spent on GC,
> if
> >> you reduce it a little more you’ll get OOMs.
> >>
> >> You can take heap dumps of course to see where all the memory is being
> >> used, but that’s tricky as it also includes garbage.
> >>
> >> I’ve seen cache sizes (filterCache in particular) be something that uses
> >> lots of memory, but that requires queries to be fired. Each filterCache
> >> entry can take up to roughly maxDoc/8 bytes + overhead….
> >>
> >> A classic error is to sort, group or facet on a docValues=false field.
> >> Starting with Solr 7.6, you can add an option to fields to throw an
> error
> >> if you do this, see: https://issues.apache.org/jira/browse/SOLR-12962.
> >>
> >> In short, there’s not enough information until you dive in and test
> >> bunches of stuff to tell.
> >>
> >> Best,
> >> Erick
> >>
> >>
> >>> On Jun 2, 2019, at 2:22 AM, John Davis <jo...@gmail.com>
> >> wrote:
> >>>
> >>> This makes sense, any ideas why lucene/solr will use 10g heap for a 20g
> >>> index.My hypothesis was merging segments was trying to read it all but
> if
> >>> that's not the case I am out of ideas. The one caveat is we are trying
> to
> >>> add the documents quickly (~1g an hour) but if lucene does write 100m
> >>> segments and does streaming merge it shouldn't matter?
> >>>
> >>> On Sat, Jun 1, 2019 at 9:24 AM Walter Underwood <wunder@wunderwood.org
> >
> >>> wrote:
> >>>
> >>>>> On May 31, 2019, at 11:27 PM, John Davis <jo...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> 2. Merging segments - does solr load the entire segment in memory or
> >>>> chunks
> >>>>> of it? if later how large are these chunks
> >>>>
> >>>> No, it does not read the entire segment into memory.
> >>>>
> >>>> A fundamental part of the Lucene design is streaming posting lists
> into
> >>>> memory and processing them sequentially. The same amount of memory is
> >>>> needed for small or large segments. Each posting list is in
> document-id
> >>>> order. The merge is a merge of sorted lists, writing a new posting
> list
> >> in
> >>>> document-id order.
> >>>>
> >>>> wunder
> >>>> Walter Underwood
> >>>> wunder@wunderwood.org
> >>>> http://observer.wunderwood.org/  (my blog)
> >>>>
> >>>>
> >>
> >>
>
>

Re: Solr Heap Usage

Posted by Erick Erickson <er...@gmail.com>.
> I've looked through SolrJ, DIH and others -- is the bottomline
> across all of them to "batch updates" and not commit as long as possible?

Of course it’s more complicated than that ;)….

But to start, yes, I urge you to batch. Here’s some stats:
https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/

Note that at about 100 docs/batch you hit diminishing returns. _However_,
that test was run on a single shard collection, so if you have 10 shards you’d
have to send 1,000 docs/batch. I wouldn’t sweat that number much, just don’t
send one at a time. And there are the usual gotchas if your documents are
1M vs. 1K.

About committing. No, don’t hold off as long as possible. When you commit,
segments are merged. _However_, the default 100M internal buffer size means
that segments are written anyway once you have 100M of index data, even if
you don’t hit a commit point, and merges happen anyway. So you won’t save
anything on merging by holding off commits.
And you’ll incur penalties. Here’s more than you want to know about 
commits: 
https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

But some key take-aways… If for some reason Solr abnormally 
terminates, the accumulated documents since the last hard
commit are replayed. So say you don’t commit for an hour of
furious indexing and someone does a “kill -9”. When you restart
Solr it’ll try to re-index all the docs for the last hour. Hard commits
with openSearcher=false aren’t all that expensive. I usually set mine
for a minute and forget about it.

Transaction logs hold a window, _not_ the entire set of operations
since time began. When you do a hard commit, the current tlog is
closed and a new one opened and ones that are “too old” are deleted. If
you never commit you have a huge transaction log to no good purpose.

Also, while indexing, in order to accommodate “Real Time Get”, all
the docs indexed since the last searcher was opened have a pointer
kept in memory. So if you _never_ open a new searcher, that internal
structure can get quite large. So in bulk-indexing operations, I
suggest you open a searcher every so often.

Opening a new searcher isn’t terribly expensive if you have no autowarming
going on. Autowarming is defined in solrconfig.xml on the filterCache,
queryResultCache, etc.

So if I were going to recommend settings, they’d be something like this:
Do a hard commit with openSearcher=false every 60 seconds.
Do a soft commit every 5 minutes.

I’d actually be surprised if you were able to measure differences between
those settings and just hard commit with openSearcher=true every 60 seconds and soft commit at -1 (never)…

Best,
Erick
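
Those two intervals are normally configured server-side in solrconfig.xml (the
autoCommit and autoSoftCommit maxTime settings) rather than driven from a client.
Purely to make the two flavours concrete, here is a hedged SolrJ sketch (placeholder
URL and collection) of a hard commit that doesn’t open a searcher and a soft commit
that does:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.UpdateRequest;

    public class CommitFlavours {
        public static void main(String[] args) throws Exception {
            try (SolrClient client =
                     new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {

                // Hard commit with openSearcher=false: flushes segments and lets old tlogs
                // be dropped, without paying the cost of opening a new searcher.
                UpdateRequest hardCommit = new UpdateRequest();
                hardCommit.setAction(AbstractUpdateRequest.ACTION.COMMIT,
                    /* waitFlush */ true, /* waitSearcher */ false);
                hardCommit.setParam("openSearcher", "false");
                hardCommit.process(client);

                // Soft commit: makes recently indexed documents visible to searches.
                client.commit(/* waitFlush */ true, /* waitSearcher */ true, /* softCommit */ true);
            }
        }
    }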

> On Jun 2, 2019, at 3:35 PM, John Davis <jo...@gmail.com> wrote:
> 
> If we assume there is no query load then effectively this boils down to
> most effective way for adding a large number of documents to the solr
> index. I've looked through SolrJ, DIH and others -- is the bottomline
> across all of them to "batch updates" and not commit as long as possible?
> 
> On Sun, Jun 2, 2019 at 7:44 AM Erick Erickson <er...@gmail.com>
> wrote:
> 
>> Oh, there are about a zillion reasons ;).
>> 
>> First of all, most tools that show heap usage also count uncollected
>> garbage. So your 10G could actually be much less “live” data. Quick way to
>> test is to attach jconsole to the running Solr and hit the button that
>> forces a full GC.
>> 
>> Another way is to reduce your heap when you start Solr (on a test system
>> of course) until bad stuff happens, if you reduce it to very close to what
>> Solr needs, you’ll get slower as more and more cycles are spent on GC, if
>> you reduce it a little more you’ll get OOMs.
>> 
>> You can take heap dumps of course to see where all the memory is being
>> used, but that’s tricky as it also includes garbage.
>> 
>> I’ve seen cache sizes (filterCache in particular) be something that uses
>> lots of memory, but that requires queries to be fired. Each filterCache
>> entry can take up to roughly maxDoc/8 bytes + overhead….
>> 
>> A classic error is to sort, group or facet on a docValues=false field.
>> Starting with Solr 7.6, you can add an option to fields to throw an error
>> if you do this, see: https://issues.apache.org/jira/browse/SOLR-12962.
>> 
>> In short, there’s not enough information until you dive in and test
>> bunches of stuff to tell.
>> 
>> Best,
>> Erick
>> 
>> 
>>> On Jun 2, 2019, at 2:22 AM, John Davis <jo...@gmail.com>
>> wrote:
>>> 
>>> This makes sense, any ideas why lucene/solr will use 10g heap for a 20g
>>> index.My hypothesis was merging segments was trying to read it all but if
>>> that's not the case I am out of ideas. The one caveat is we are trying to
>>> add the documents quickly (~1g an hour) but if lucene does write 100m
>>> segments and does streaming merge it shouldn't matter?
>>> 
>>> On Sat, Jun 1, 2019 at 9:24 AM Walter Underwood <wu...@wunderwood.org>
>>> wrote:
>>> 
>>>>> On May 31, 2019, at 11:27 PM, John Davis <jo...@gmail.com>
>>>> wrote:
>>>>> 
>>>>> 2. Merging segments - does solr load the entire segment in memory or
>>>> chunks
>>>>> of it? if later how large are these chunks
>>>> 
>>>> No, it does not read the entire segment into memory.
>>>> 
>>>> A fundamental part of the Lucene design is streaming posting lists into
>>>> memory and processing them sequentially. The same amount of memory is
>>>> needed for small or large segments. Each posting list is in document-id
>>>> order. The merge is a merge of sorted lists, writing a new posting list
>> in
>>>> document-id order.
>>>> 
>>>> wunder
>>>> Walter Underwood
>>>> wunder@wunderwood.org
>>>> http://observer.wunderwood.org/  (my blog)
>>>> 
>>>> 
>> 
>> 


Re: Solr Heap Usage

Posted by John Davis <jo...@gmail.com>.
If we assume there is no query load, then effectively this boils down to
the most effective way of adding a large number of documents to the Solr
index. I've looked through SolrJ, DIH and others -- is the bottom line
across all of them to "batch updates" and not commit for as long as possible?

On Sun, Jun 2, 2019 at 7:44 AM Erick Erickson <er...@gmail.com>
wrote:

> Oh, there are about a zillion reasons ;).
>
> First of all, most tools that show heap usage also count uncollected
> garbage. So your 10G could actually be much less “live” data. Quick way to
> test is to attach jconsole to the running Solr and hit the button that
> forces a full GC.
>
> Another way is to reduce your heap when you start Solr (on a test system
> of course) until bad stuff happens, if you reduce it to very close to what
> Solr needs, you’ll get slower as more and more cycles are spent on GC, if
> you reduce it a little more you’ll get OOMs.
>
> You can take heap dumps of course to see where all the memory is being
> used, but that’s tricky as it also includes garbage.
>
> I’ve seen cache sizes (filterCache in particular) be something that uses
> lots of memory, but that requires queries to be fired. Each filterCache
> entry can take up to roughly maxDoc/8 bytes + overhead….
>
> A classic error is to sort, group or facet on a docValues=false field.
> Starting with Solr 7.6, you can add an option to fields to throw an error
> if you do this, see: https://issues.apache.org/jira/browse/SOLR-12962.
>
> In short, there’s not enough information until you dive in and test
> bunches of stuff to tell.
>
> Best,
> Erick
>
>
> > On Jun 2, 2019, at 2:22 AM, John Davis <jo...@gmail.com>
> wrote:
> >
> > This makes sense, any ideas why lucene/solr will use 10g heap for a 20g
> > index.My hypothesis was merging segments was trying to read it all but if
> > that's not the case I am out of ideas. The one caveat is we are trying to
> > add the documents quickly (~1g an hour) but if lucene does write 100m
> > segments and does streaming merge it shouldn't matter?
> >
> > On Sat, Jun 1, 2019 at 9:24 AM Walter Underwood <wu...@wunderwood.org>
> > wrote:
> >
> >>> On May 31, 2019, at 11:27 PM, John Davis <jo...@gmail.com>
> >> wrote:
> >>>
> >>> 2. Merging segments - does solr load the entire segment in memory or
> >> chunks
> >>> of it? if later how large are these chunks
> >>
> >> No, it does not read the entire segment into memory.
> >>
> >> A fundamental part of the Lucene design is streaming posting lists into
> >> memory and processing them sequentially. The same amount of memory is
> >> needed for small or large segments. Each posting list is in document-id
> >> order. The merge is a merge of sorted lists, writing a new posting list
> in
> >> document-id order.
> >>
> >> wunder
> >> Walter Underwood
> >> wunder@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>
>
>

Re: Solr Heap Usage

Posted by Erick Erickson <er...@gmail.com>.
Oh, there are about a zillion reasons ;).

First of all, most tools that show heap usage also count uncollected garbage. So your 10G could actually be much less “live” data. Quick way to test is to attach jconsole to the running Solr and hit the button that forces a full GC.

Another way is to reduce your heap when you start Solr (on a test system of course) until bad stuff happens. If you reduce it to very close to what Solr needs, you’ll get slower as more and more cycles are spent on GC; if you reduce it a little more, you’ll get OOMs.

You can take heap dumps of course to see where all the memory is being used, but that’s tricky as it also includes garbage.

I’ve seen cache sizes (filterCache in particular) be something that uses lots of memory, but that requires queries to be fired. Each filterCache entry can take up to roughly maxDoc/8 bytes + overhead….
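
To make that maxDoc/8 figure concrete, here is a small worked example. The index size and cache size are invented, and the real footprint can be smaller because small result sets are stored as doc-id lists rather than full bitsets:

    public class FilterCacheEstimate {
      public static void main(String[] args) {
        long maxDoc = 100_000_000L;       // hypothetical number of docs in the index
        int cacheEntries = 512;           // hypothetical filterCache size in solrconfig.xml
        long bytesPerEntry = maxDoc / 8;  // one bit per document when stored as a bitset
        long worstCase = bytesPerEntry * cacheEntries;
        System.out.printf("~%.1f MB per entry, ~%.1f GB if all %d entries are full bitsets%n",
            bytesPerEntry / 1e6, worstCase / 1e9, cacheEntries);
        // Prints roughly: ~12.5 MB per entry, ~6.4 GB if all 512 entries are full bitsets
      }
    }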

A classic error is to sort, group or facet on a docValues=false field. Starting with Solr 7.6, you can add an option to fields to throw an error if you do this, see: https://issues.apache.org/jira/browse/SOLR-12962.

In short, there’s not enough information until you dive in and test bunches of stuff to tell.

Best,
Erick


> On Jun 2, 2019, at 2:22 AM, John Davis <jo...@gmail.com> wrote:
> 
> This makes sense, any ideas why lucene/solr will use 10g heap for a 20g
> index.My hypothesis was merging segments was trying to read it all but if
> that's not the case I am out of ideas. The one caveat is we are trying to
> add the documents quickly (~1g an hour) but if lucene does write 100m
> segments and does streaming merge it shouldn't matter?
> 
> On Sat, Jun 1, 2019 at 9:24 AM Walter Underwood <wu...@wunderwood.org>
> wrote:
> 
>>> On May 31, 2019, at 11:27 PM, John Davis <jo...@gmail.com>
>> wrote:
>>> 
>>> 2. Merging segments - does solr load the entire segment in memory or
>> chunks
>>> of it? if later how large are these chunks
>> 
>> No, it does not read the entire segment into memory.
>> 
>> A fundamental part of the Lucene design is streaming posting lists into
>> memory and processing them sequentially. The same amount of memory is
>> needed for small or large segments. Each posting list is in document-id
>> order. The merge is a merge of sorted lists, writing a new posting list in
>> document-id order.
>> 
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 


Re: Solr Heap Usage

Posted by John Davis <jo...@gmail.com>.
This makes sense. Any ideas why Lucene/Solr will use a 10G heap for a 20G
index? My hypothesis was that merging segments was trying to read it all,
but if that's not the case I am out of ideas. The one caveat is that we are
trying to add documents quickly (~1G an hour), but if Lucene writes 100M
segments and does a streaming merge, it shouldn't matter?

On Sat, Jun 1, 2019 at 9:24 AM Walter Underwood <wu...@wunderwood.org>
wrote:

> > On May 31, 2019, at 11:27 PM, John Davis <jo...@gmail.com>
> wrote:
> >
> > 2. Merging segments - does solr load the entire segment in memory or
> chunks
> > of it? if later how large are these chunks
>
> No, it does not read the entire segment into memory.
>
> A fundamental part of the Lucene design is streaming posting lists into
> memory and processing them sequentially. The same amount of memory is
> needed for small or large segments. Each posting list is in document-id
> order. The merge is a merge of sorted lists, writing a new posting list in
> document-id order.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>

Re: Solr Heap Usage

Posted by Walter Underwood <wu...@wunderwood.org>.
> On May 31, 2019, at 11:27 PM, John Davis <jo...@gmail.com> wrote:
> 
> 2. Merging segments - does solr load the entire segment in memory or chunks
> of it? if later how large are these chunks

No, it does not read the entire segment into memory.

A fundamental part of the Lucene design is streaming posting lists into memory and processing them sequentially. The same amount of memory is needed for small or large segments. Each posting list is in document-id order. The merge is a merge of sorted lists, writing a new posting list in document-id order.
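
As a toy illustration of that sorted-list merge (this is not Lucene's actual code; real postings are block-encoded and streamed to disk rather than collected in a list), note that only the current position in each input needs to be examined at any step:

    import java.util.ArrayList;
    import java.util.List;

    public class PostingMergeSketch {
      // Merge two posting lists that are already in document-id order.
      static List<Integer> merge(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>(a.size() + b.size());
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
          out.add(a.get(i) <= b.get(j) ? a.get(i++) : b.get(j++));
        }
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
      }

      public static void main(String[] args) {
        // In a real merge, doc ids from each segment are remapped into the new segment's id space.
        System.out.println(merge(List.of(1, 4, 9), List.of(2, 5, 7, 11)));
        // [1, 2, 4, 5, 7, 9, 11]
      }
    }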

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


Re: Solr Heap Usage

Posted by Shawn Heisey <ap...@elyograg.org>.
On 6/1/2019 12:27 AM, John Davis wrote:
> I've read a bunch of the wiki's on solr heap usage and wanted to confirm my
> understanding of what all does solr use the heap for:

This is something that's not straightforward to answer.  It would not be 
wrong to say that Solr uses the Java heap for everything it does ... but 
saying that doesn't help you.

It's extremely difficult to predict in advance exactly how much heap you 
need to give to Solr.

https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

We can (and sometimes do) make specific recommendations to users that 
provide us with a wealth of information about their setup ... but you 
should know that those recommendations are always given with caveats. 
There's a good chance that things will actually work with less heap than 
we mention -- we're going to aim for larger values simply because the 
performance implications of a heap that's too small are orders of 
magnitude worse than one that's too large.

In practice, the way I deal with heap sizing is to start with a large 
value that seems big enough to work, and then analyze GC logs to try and 
determine whether it needs to be changed.  The initial value is mostly 
arbitrary, influenced by experience.

Most of Solr's functionality is provided by Lucene, which is a 
programming API for search.  For me, Lucene, and Solr's usage of Lucene, 
is mostly a black box - precisely how it functions internally is unknown 
to me.  The source code is available, but it would take a very in-depth 
study to actually understand it.

> 1. Indexing new documents - until committed? if not how long are the new
> documents kept in heap?

Lucene sets aside a buffer to hold data that will be flushed to a new 
segment.  Solr's default for this buffer size is 100MB.  That buffer is 
flushed when it fills up, not just on commit.  The segments produced by
default are smaller than 100MB, so clearly Lucene does not store the 
data internally in the precise format that it ends up on disk.

Additional memory is needed for indexing beyond that 100MB buffer for 
all the manipulations that Lucene and Solr must perform.
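
For reference, that 100MB buffer is Solr's ramBufferSizeMB setting; at the Lucene level it is the IndexWriter RAM buffer. A minimal Lucene sketch (the path is a placeholder, and this is not how Solr itself wires things up) of where that knob lives:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class RamBufferSketch {
      public static void main(String[] args) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        cfg.setRAMBufferSizeMB(100.0);   // the same 100MB default Solr exposes as ramBufferSizeMB
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("/tmp/test-index")), cfg)) {
          Document doc = new Document();
          doc.add(new StringField("id", "1", Field.Store.YES));
          writer.addDocument(doc);       // buffered in RAM; flushed to a new segment when the buffer fills
          writer.commit();
        }
      }
    }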

> 2. Merging segments - does solr load the entire segment in memory or chunks
> of it? if later how large are these chunks

Again, this is Lucene, so I don't know in detail.  I can optimize an 
index that is much larger than all the memory in the system, so it 
cannot be loading all the data into memory.  I don't think it's 
enormously RAM-hungry, but it does hit the CPU pretty hard.  The fastest 
I have ever seen segment merging proceed is at about 30 megabytes per 
second, with 20 megabytes per second being more common.  Virtually all 
modern disks are capable of faster transfer rates than 30MB/s, 
especially RAID10 volumes and SSD -- the disk is not the bottleneck.

> 3. Queries, facets, caches - anything else major?

Facets, grouping, and sorting are all RAM-hungry processes whose memory 
usage is greatly improved by using docValues in the field definition -- 
because docValues is already exactly the right data for those processes.

Was this wiki page one of the things you read?  I wrote it:

https://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap

Thanks,
Shawn

Re: Solr Heap Usage

Posted by Erick Erickson <er...@gmail.com>.
This may be more than you need, but:

https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Advice hasn’t changed since 2012….

Best,
Erick

> On Jun 1, 2019, at 7:52 AM, Jörn Franke <jo...@gmail.com> wrote:
> 
> I recommend to setup JMeter Test cases of most common scenarios that you can run on the Solr cluster to evaluate the performance. Then you can also simulate concurrencies of scenarios and users etc.
> 
>> Am 01.06.2019 um 08:27 schrieb John Davis <jo...@gmail.com>:
>> 
>> I've read a bunch of the wiki's on solr heap usage and wanted to confirm my
>> understanding of what all does solr use the heap for:
>> 
>> 1. Indexing new documents - until committed? if not how long are the new
>> documents kept in heap?
>> 
>> 2. Merging segments - does solr load the entire segment in memory or chunks
>> of it? if later how large are these chunks
>> 
>> 3. Queries, facets, caches - anything else major?
>> 
>> John


Re: Solr Heap Usage

Posted by Jörn Franke <jo...@gmail.com>.
I recommend setting up JMeter test cases for the most common scenarios that you can run against the Solr cluster to evaluate performance. Then you can also simulate concurrency of scenarios and users, etc.

> Am 01.06.2019 um 08:27 schrieb John Davis <jo...@gmail.com>:
> 
> I've read a bunch of the wiki's on solr heap usage and wanted to confirm my
> understanding of what all does solr use the heap for:
> 
> 1. Indexing new documents - until committed? if not how long are the new
> documents kept in heap?
> 
> 2. Merging segments - does solr load the entire segment in memory or chunks
> of it? if later how large are these chunks
> 
> 3. Queries, facets, caches - anything else major?
> 
> John