Posted to solr-user@lucene.apache.org by Rahul Goswami <ra...@gmail.com> on 2019/07/03 04:53:42 UTC

Re: SolrCloud indexing triggers merges and timeouts

Hi Shawn,

Thank you for the detailed suggestions. However, I would like to
understand the maxMergeCount and maxThreadCount params better. The
documentation
<https://lucene.apache.org/solr/guide/7_3/indexconfig-in-solrconfig.html#mergescheduler>
mentions that

maxMergeCount : The maximum number of simultaneous merges that are allowed.
maxThreadCount : The maximum number of simultaneous merge threads that
should be running at once

Since one thread can only do one merge at any given point in time, how does
setting maxMergeCount greater than maxThreadCount help? I am having
difficulty wrapping my head around this, and would appreciate it if you could
help clear it up for me.

Thanks,
Rahul

On Thu, Jun 13, 2019 at 7:33 AM Shawn Heisey <ap...@elyograg.org> wrote:

> On 6/6/2019 9:00 AM, Rahul Goswami wrote:
> > *OP Reply* : Total 48 GB per node... I couldn't see another software
> > using a lot of memory.
> > I am honestly not sure about the reason for change of directory factory
> > to SimpleFSDirectoryFactory. But I was told that with mmap at one point we
> > started to see the shared memory usage on Windows go up significantly,
> > intermittently freezing the system.
> > Could the choice of DirectoryFactory here be a factor for the long
> > updates/frequent merges?
>
> With about 24GB of RAM to cache 1.4TB of index data, you're never going
> to have good performance.  Any query you do is probably going to read
> more than 24GB of data from the index, which means that it cannot all
> come from memory; some of it must come from disk, which is incredibly
> slow compared to memory.
>
> MMap is more efficient than "simple" filesystem access.  I do not know
> if you would see markedly better performance, but getting rid of the
> DirectoryFactory config and letting Solr choose its default might help.
>
> > How many total documents (maxDoc, not numDoc) are in that 1.4 TB of
> > space?
> > *OP Reply:* Also, there are nearly 12.8 million total docs (maxDoc, NOT
> > numDoc) in that 1.4 TB space
>
> Unless you're doing faceting or grouping on fields with extremely high
> cardinality, which I find to be rarely useful except for data mining,
> 24GB of heap for 12.8 million docs seems very excessive.  I was
> expecting this number to be something like 500 million or more ... that
> small document count must mean each document is HUGE.  Can you take
> steps to reduce the index size, perhaps by setting stored, indexed,
> and/or docValues to "false" on some of your fields, and having your
> application go to the system of record for full details on each
> document?  You will have to reindex after making changes like that.
>
> >> Can you share the GC log that Solr writes?
> > *OP Reply:*  Please find the GC logs and thread dumps at this location
> > https://drive.google.com/open?id=1slsYkAcsH7OH-7Pma91k6t5T72-tIPlw
>
> The larger GC log was unrecognized by both gcviewer and gceasy.io ... the
> smaller log shows heap usage about 10GB, but it only covers 10 minutes,
> so it's not really conclusive for diagnosis.  The first thing I can
> suggest to try is to reduce the heap size to 12GB ... but I do not know
> if that's actually going to work.  Indexing might require more memory.
> The idea here is to make more memory available to the OS disk cache ...
> with your index size, you're probably going to need to add memory to the
> system (not the heap).
>
> > Another observation is that the CPU usage reaches around 70% (through
> > manual monitoring) when the indexing starts and the merges are observed.
> > It is well below 50% otherwise.
>
> Indexing will increase load, and that increase is often very
> significant.  Adding memory to the system is your best bet for better
> performance.  I'd want 1TB of memory for a 1.4TB index ... but I know
> that memory sizes that high are extremely expensive, and for most
> servers, not even possible.  512GB or 256GB is more attainable, and
> would have better performance than 48GB.
>
> > Also, should something be altered with the mergeScheduler setting ?
> > "mergeScheduler":{
> >          "class":"org.apache.lucene.index.ConcurrentMergeScheduler",
> >          "maxMergeCount":2,
> >          "maxThreadCount":2},
>
> Do not configure maxThreadCount beyond 1 unless your data is on SSD.  It
> will slow things down a lot because standard disks must move the disk
> head to read/write from different locations, and head moves take time.
> SSD can do I/O from any location without pauses, so more threads would
> probably help performance rather than hurt it.
>
> Increase maxMergeCount to 6 -- at 2, large merges will probably stop
> indexing entirely.  With a larger number, Solr can keep indexing even
> when there's a huge segment merge happening.
>
> Thanks,
> Shawn
>
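
As a rough check on the sizing figures quoted above (a 1.4 TB index, 12.8
million docs by maxDoc, and about 24 GB left for the OS disk cache), the
arithmetic can be sketched as below. Decimal units (1 TB = 10^12 bytes) are
an assumption here, and the variable names are purely illustrative.

```python
# Figures taken from the quoted discussion; decimal units are assumed.
index_bytes = 1_400_000_000_000   # 1.4 TB of index data
max_doc = 12_800_000              # maxDoc reported for that index
os_cache_bytes = 24_000_000_000   # ~24 GB available to the OS disk cache

# Average on-disk index space per document.
avg_doc_bytes = index_bytes / max_doc

# Fraction of the index that could sit in the disk cache at once.
cache_fraction = os_cache_bytes / index_bytes

print(f"average index space per document: {avg_doc_bytes / 1000:.0f} kB")
print(f"share of index that fits in cache: {cache_fraction:.1%}")
```

Under these assumptions each document accounts for roughly a hundred
kilobytes of index space, and less than 2% of the index can be cached at any
one time, which is consistent with the advice to shrink the index or add
memory.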

Re: SolrCloud indexing triggers merges and timeouts

Posted by Rahul Goswami <ra...@gmail.com>.
Upon further investigation of this issue, I see the following log lines
during the indexing process:

2019-06-06 22:24:56.203 INFO  (qtp1169794610-5652)
[c:UM_IndexServer_MailArchiv_Spelle_66AC8340-4734-438A-9D1D-A84B659B1623
s:shard22 r:core_node87
x:UM_IndexServer_MailArchiv_Spelle_66AC8340-4734-438A-9D1D-A84B659B1623_shard22_replica_n84]
org.apache.solr.update.LoggingInfoStream [FP][qtp1169794610-5652]: trigger
flush: activeBytes=352402600 deleteBytes=279 vs limit=104857600
2019-06-06 22:24:56.203 INFO  (qtp1169794610-5652)
[c:UM_IndexServer_MailArchiv_Spelle_66AC8340-4734-438A-9D1D-A84B659B1623
s:shard22 r:core_node87
x:UM_IndexServer_MailArchiv_Spelle_66AC8340-4734-438A-9D1D-A84B659B1623_shard22_replica_n84]
org.apache.solr.update.LoggingInfoStream [FP][qtp1169794610-5652]: thread
state has 352402600 bytes; docInRAM=1
2019-06-06 22:24:56.204 INFO  (qtp1169794610-5652)
[c:UM_IndexServer_MailArchiv_Spelle_66AC8340-4734-438A-9D1D-A84B659B1623
s:shard22 r:core_node87
x:UM_IndexServer_MailArchiv_Spelle_66AC8340-4734-438A-9D1D-A84B659B1623_shard22_replica_n84]
org.apache.solr.update.LoggingInfoStream [FP][qtp1169794610-5652]: 1 in-use
non-flushing threads states
2019-06-06 22:24:56.204 INFO  (qtp1169794610-5652)
[c:UM_IndexServer_MailArchiv_Spelle_66AC8340-4734-438A-9D1D-A84B659B1623
s:shard22 r:core_node87

I have the following questions:
1) Does the log line that says "thread state has 352402600 bytes; docInRAM=1"
mean that the buffer was flushed to disk with only one huge document?
2) If yes, does this flush create a segment with just one document?
3) Heap dump analysis shows large (>350 MB) instances of
DocumentsWriterPerThread. Does one instance of this class correspond to one
document?
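
For reference, the byte counts in the "trigger flush" log line can be
converted to more readable units with a few lines of Python. This is only
arithmetic on the logged values. That the limit corresponds to a
ramBufferSizeMB setting of 100 is an assumption and is not stated in the
logs.

```python
MIB = 1024 * 1024  # bytes per mebibyte

def to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return num_bytes / MIB

active_bytes = 352402600  # activeBytes from the "trigger flush" line
limit_bytes = 104857600   # limit from the same line

print(f"limit  = {to_mib(limit_bytes):.1f} MiB")
print(f"active = {to_mib(active_bytes):.1f} MiB")
print(f"ratio  = {active_bytes / limit_bytes:.2f}x the limit")
```

The limit works out to exactly 100 MiB, while the single buffered document
(docInRAM=1) occupies a little over 336 MiB, more than three times the
limit.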


Help is much appreciated.

Thanks,
Rahul



Re: SolrCloud indexing triggers merges and timeouts

Posted by Rahul Goswami <ra...@gmail.com>.
Shawn, Erick,
Thank you for the explanation. The merge scheduler params make sense now.

Thanks,
Rahul


Re: SolrCloud indexing triggers merges and timeouts

Posted by Erick Erickson <er...@gmail.com>.
Two more tidbits to add to Shawn’s explanation:

There are heuristics built in to ConcurrentMergeScheduler.
From the Javadocs:
* If it's an SSD,
*  {@code maxThreadCount} is set to {@code max(1, min(4, cpuCoreCount/2))},
*  otherwise 1.  Note that detection only currently works on
*  Linux; other platforms will assume the index is not on an SSD.
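
The heuristic in that Javadoc excerpt can be written out as a small Python
sketch. The function name and the on_ssd flag are illustrative only (they
are not Lucene API), and the integer division mirrors how cpuCoreCount/2
behaves for Java ints.

```python
def default_max_thread_count(cpu_core_count: int, on_ssd: bool) -> int:
    """Default maxThreadCount as described in the quoted Javadoc."""
    if not on_ssd:
        # Non-SSD, or a platform where SSD detection does not work.
        return 1
    # Java's cpuCoreCount/2 is integer division, hence // here.
    return max(1, min(4, cpu_core_count // 2))

for cores in (1, 2, 4, 8, 16):
    print(cores, default_max_thread_count(cores, on_ssd=True))
```

So on an SSD the default grows with the core count but is capped at 4, and
anywhere SSD detection is unavailable it stays at 1.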

Second, TieredMergePolicy (the default) merges in “tiers” that
are of similar size. So you can have multiple merges going on
at the same time on disjoint sets of segments.

Best,
Erick



Re: SolrCloud indexing triggers merges and timeouts

Posted by Shawn Heisey <ap...@elyograg.org>.
On 7/2/2019 10:53 PM, Rahul Goswami wrote:
> Hi Shawn,
> 
> Thank you for the detailed suggestions. Although, I would like to
> understand the maxMergeCount and maxThreadCount params better. The
> documentation
> <https://lucene.apache.org/solr/guide/7_3/indexconfig-in-solrconfig.html#mergescheduler>
> mentions
> that
> 
> maxMergeCount : The maximum number of simultaneous merges that are allowed.
> maxThreadCount : The maximum number of simultaneous merge threads that
> should be running at once
> 
> Since one thread can only do 1 merge at any given point of time, how does
> maxMergeCount being greater than maxThreadCount help anyway? I am having
> difficulty wrapping my head around this, and would appreciate if you could
> help clear it for me.

The maxMergeCount setting controls the number of merges that can be 
*scheduled* at the same time.  As soon as that number of merges is 
reached, the indexing thread(s) will be paused until the number of 
merges in the schedule drops below this number.  This ensures that no 
more merges will be scheduled.

By setting maxMergeCount higher than the number of merges that are 
expected in the schedule, you can ensure that indexing will never be 
paused.  It would require very atypical merge policy settings for the 
number of scheduled merges to ever reach six.  On my own indexing, I 
reached three scheduled merges quite frequently.  The default setting 
for maxMergeCount is three.

The maxThreadCount setting controls how many of the scheduled merges 
will be simultaneously executed.  With index data on standard spinning 
disks, you do not want to increase this number beyond 1, or you will 
have a performance problem due to thrashing disk heads.  If your data is 
on SSD, you can make it larger than 1.
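
The relationship between the two settings, as described above, can be
illustrated with a toy model in Python. This is only a sketch of the
described behavior, not Lucene's actual scheduler code, and the MergeState
and merge_state names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class MergeState:
    running: int           # merges executing right now
    waiting: int           # merges scheduled but not yet executing
    indexing_paused: bool  # whether indexing threads are held back

def merge_state(scheduled: int, max_merge_count: int,
                max_thread_count: int) -> MergeState:
    """Toy model: maxThreadCount caps execution, maxMergeCount caps the schedule."""
    running = min(scheduled, max_thread_count)
    waiting = scheduled - running
    # Indexing pauses as soon as the schedule reaches maxMergeCount.
    indexing_paused = scheduled >= max_merge_count
    return MergeState(running, waiting, indexing_paused)

# Settings from the original question: two scheduled merges stall indexing.
print(merge_state(scheduled=2, max_merge_count=2, max_thread_count=2))
# Suggested settings for spinning disks: three scheduled merges, no stall.
print(merge_state(scheduled=3, max_merge_count=6, max_thread_count=1))
```

In this model, raising maxMergeCount does not make merges run any faster. It
only lets more of them wait in the schedule before indexing is paused, which
is why it can usefully exceed maxThreadCount.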

Thanks,
Shawn