Posted to solr-user@lucene.apache.org by 苗海泉 <ms...@gmail.com> on 2018/02/27 10:36:55 UTC

When the number of collections exceeds one thousand, indexing speed drops sharply

I have encountered a serious problem while using Solr. We run Solr 6.0,
our daily data volume is about 500 billion documents, and we create a
new collection every hour, so there are more than a thousand collections
online across 49 Solr nodes. With fewer than about 800 collections,
indexing is still very fast, but at around 1,100 collections the Solr
indexing rate drops sharply: a program that originally indexed at about
2-3 million TPS falls to only a few hundred or even tens of TPS. By the
way, each Solr node is assigned 32G of memory. We checked memory, CPU,
disk IO, and network IO usage, and all of them look normal. Has anyone
encountered a similar situation, or any good ideas for tracking this
issue down? If you have run into a similar problem, please share your
solution. Thank you very much.
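
For readers unfamiliar with this kind of setup: hourly rolling collections like the ones described are normally created through the Solr Collections API. A minimal sketch against Solr 6 (the collection name and config set name here are hypothetical; the shard and replica counts are the ones given later in the thread):

curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=logs_2018022710&numShards=25&replicationFactor=1&collection.configName=logs_conf"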

Re: When the number of collections exceeds one thousand, indexing speed drops sharply

Posted by 苗海泉 <ms...@gmail.com>.
Thank you for your advice on GC tools. Which one would you suggest for our situation?



-- 
==============================
联创科技
知行如一
==============================

Re: When the number of collections exceeds one thousand, indexing speed drops sharply

Posted by Shawn Heisey <el...@elyograg.org>.
On 2/28/2018 2:53 AM, 苗海泉 wrote:
> Thanks for your detailed advice, the monitor product you are talking about
> is good, but our solr system is running on a private network and seems to
> be unusable at all, with no single downloadable application for analyzing
> specific gc logs.

For analyzing GC logs, the GCViewer app is useful.  With some practice 
(learning to disable irrelevant information) you can pinpoint problems.  
It also compiles statistics about GC intervals, which can be very 
helpful.  It is an executable jar.

https://github.com/chewiebug/GCViewer

But I have found an even easier tool for general use:

http://gceasy.io/

I still find value in GCViewer, but most of the time the information I'm 
after is provided by gceasy, and it's a lot easier to decipher.

Possible disadvantage for gceasy: it's an online tool. So you have to 
copy the log out of disconnected networks into a machine with Internet 
access.  I don't anticipate any sort of privacy problems with them -- 
logs that you upload are not kept very long, and GC logs don't contain 
anything sensitive anyway.

Thanks,
Shawn
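
A practical footnote to the tools above: Solr's bin/solr.in.sh exposes GC-logging flags as GC_LOG_OPTS, and the resulting solr_gc.log is the file GCViewer and gceasy consume. A sketch for the Java 8 JVMs that Solr 6 typically runs on (the flag set mirrors Solr's shipped defaults of that era; the path and the GCViewer jar version are illustrative):

# bin/solr.in.sh - GC logging flags, similar to what Solr 6 ships by default
GC_LOG_OPTS="-verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails \
-XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps \
-XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime"

# GCViewer is an executable jar, so opening a log is a single command
java -jar gcviewer-1.36.jar /var/solr/logs/solr_gc.log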


Re: When the number of collections exceeds one thousand, indexing speed drops sharply

Posted by Emir Arnautović <em...@sematext.com>.
If you are only after visualising GC, there are several tools that you can download, or you can upload your logs to an online visualiser. If you would like to monitor the whole host/Solr/JVM, Sematext’s SPM also comes in an on-premises version, where you install and host your own monitoring infrastructure: https://sematext.com/spm/#on-premises

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 28 Feb 2018, at 10:53, 苗海泉 <ms...@gmail.com> wrote:
> 
> Thanks for your detailed advice. The monitoring product you mentioned
> looks good, but our Solr system runs on a private network, so it seems
> we cannot use it at all, and we have not found a standalone,
> downloadable application for analyzing GC logs.
> 
> 2018-02-28 16:57 GMT+08:00 Emir Arnautović <emir.arnautovic@sematext.com>:
> 
>> Hi,
>> I would start with the following:
>> 1. have dedicated nodes for the ZK ensemble - those do not have to be
>> powerful nodes (maybe 2-4 cores and 8GB RAM)
>> 2. reduce the heap size to a value below the margin where the JVM can
>> use compressed oops - 31GB should be a safe size
>> 3. shard each collection across all nodes
>> 4. increase the rollover interval to 2h so you keep the shard
>> size/number as they are today.
>> 5. experiment with slightly larger rollover intervals (e.g. 3h) if query
>> latency is still acceptable. That will result in fewer shards that are
>> slightly larger.
>> 
>> In any case, monitor your cluster to see how changes affect it. Not sure
>> what you currently use for monitoring, but manually scanning GC logs is
>> not fun. You can check out our monitoring tool if you don’t have one or if
>> it does not give you enough visibility: https://sematext.com/spm/
>> 
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 28 Feb 2018, at 02:42, 苗海泉 <ms...@gmail.com> wrote:
>>> 
>>> Thank you. I looked into the memory footprint: I set the recovery
>>> threshold at 75%, and heap occupancy is at about 76%. Also, our
>>> ZooKeeper does not run on dedicated servers, which may be the cause
>>> of the instability.
>>> 
>>> What else do you recommend I check?
>>> 
>>> 2018-02-27 22:37 GMT+08:00 Emir Arnautović <emir.arnautovic@sematext.com>:
>>> 
>>>> This does not show much: only that your heap is around 75% (24-25GB).
>>>> I was thinking that you should compare metrics (heap/GC as well) when
>>>> running without issues and when running with issues, and see if
>>>> something can be concluded.
>>>> About instability: Do you run ZK on dedicated nodes?
>>>> 
>>>> Emir
>>>> --
>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>> 
>>>> 
>>>> 
>>>>> On 27 Feb 2018, at 14:43, 苗海泉 <ms...@gmail.com> wrote:
>>>>> 
>>>>> Thank you. We originally used 49 shards on 49 nodes, but later found
>>>>> that in this setup Solr and ZooKeeper often disconnected; with so many
>>>>> nodes ZooKeeper made Solr unstable, so we reduced it to 25 shards. If
>>>>> performance cannot keep up later, we will need to increase it again.
>>>>> 
>>>>> When indexing is very slow we do not find any errors in Solr or
>>>>> ZooKeeper; building the index is simply slow. The logs show that
>>>>> automatic commits are slow, but the main cause may not lie in the
>>>>> commit itself.
>>>>> 
>>>>> I am sorry, I do not know how to check the Java heap utilization.
>>>>> Judging from the GC log, GC times are not long. I have pasted the
>>>>> log below:
>>>>> 
>>>>> 
>>>>> {Heap before GC invocations=1144021 (full 72):
>>>>> garbage-first heap   total 33554432K, used 26982419K
>> [0x00007f1478000000,
>>>>> 0x00007f1478808000, 0x00007f1c78000000)
>>>>> region size 8192K, 204 young (1671168K), 26 survivors (212992K)
>>>>> Metaspace       used 41184K, capacity 41752K, committed 67072K,
>> reserved
>>>>> 67584K
>>>>> 2018-02-27T21:43:01.793+0800: 4668016.044: [GC pause (G1 Evacuation
>>>> Pause)
>>>>> (young)
>>>>> Desired survivor size 109051904 bytes, new threshold 1 (max 15)
>>>>> - age   1:  113878760 bytes,  113878760 total
>>>>> - age   2:   21264744 bytes,  135143504 total
>>>>> - age   3:   17020096 bytes,  152163600 total
>>>>> - age   4:   26870864 bytes,  179034464 total
>>>>> , 0.0579794 secs]
>>>>> [Parallel Time: 46.9 ms, GC Workers: 18]
>>>>>    [GC Worker Start (ms): Min: 4668016046.1, Avg: 4668016046.3, Max:
>>>>> 4668016046.4, Diff: 0.3]
>>>>>    [Ext Root Scanning (ms): Min: 2.4, Avg: 6.5, Max: 46.3, Diff: 43.9,
>>>>> Sum: 116.9]
>>>>>    [Update RS (ms): Min: 0.0, Avg: 3.4, Max: 6.0, Diff: 6.0, Sum:
>> 62.0]
>>>>>       [Processed Buffers: Min: 0, Avg: 6.3, Max: 16, Diff: 16, Sum:
>>>> 113]
>>>>>    [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.5]
>>>>>    [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0,
>>>>> Sum: 0.0]
>>>>>    [Object Copy (ms): Min: 0.1, Avg: 23.8, Max: 25.5, Diff: 25.5, Sum:
>>>>> 428.1]
>>>>>    [Termination (ms): Min: 0.0, Avg: 12.7, Max: 13.5, Diff: 13.5, Sum:
>>>>> 228.9]
>>>>>       [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum:
>>>> 18]
>>>>>    [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.4, Diff: 0.4,
>> Sum:
>>>>> 1.2]
>>>>>    [GC Worker Total (ms): Min: 46.4, Avg: 46.6, Max: 46.7, Diff: 0.3,
>>>>> Sum: 838.0]
>>>>>    [GC Worker End (ms): Min: 4668016092.8, Avg: 4668016092.8, Max:
>>>>> 4668016092.8, Diff: 0.0]
>>>>> [Code Root Fixup: 0.2 ms]
>>>>> [Code Root Purge: 0.0 ms]
>>>>> [Clear CT: 0.3 ms]
>>>>> [Other: 10.7 ms]
>>>>>    [Choose CSet: 0.0 ms]
>>>>>    [Ref Proc: 5.9 ms]
>>>>>    [Ref Enq: 0.2 ms]
>>>>>    [Redirty Cards: 0.2 ms]
>>>>>    [Humongous Register: 2.2 ms]
>>>>>    [Humongous Reclaim: 0.4 ms]
>>>>>    [Free CSet: 0.4 ms]
>>>>> [Eden: 1424.0M(1424.0M)->0.0B(1552.0M) Survivors: 208.0M->80.0M Heap:
>>>>> 25.7G(32.0G)->24.3G(32.0G)]
>>>>> Heap after GC invocations=1144022 (full 72):
>>>>> garbage-first heap   total 33554432K, used 25489656K
>> [0x00007f1478000000,
>>>>> 0x00007f1478808000, 0x00007f1c78000000)
>>>>> region size 8192K, 10 young (81920K), 10 survivors (81920K)
>>>>> Metaspace       used 41184K, capacity 41752K, committed 67072K,
>> reserved
>>>>> 67584K
>>>>> }
>>>>> [Times: user=0.84 sys=0.01, real=0.05 secs]
>>>>> 2018-02-27T21:43:01.851+0800: 4668016.102: Total time for which
>>>> application
>>>>> threads were stopped: 0.0661383 seconds, Stopping threads took:
>> 0.0004141
>>>>> seconds
>>>>> 2018-02-27T21:43:02.092+0800: 4668016.343: [GC concurrent-mark-end,
>>>>> 2.5757061 secs]
>>>>> 2018-02-27T21:43:02.100+0800: 4668016.351: [GC remark
>>>>> 2018-02-27T21:43:02.100+0800: 4668016.351: [Finalize Marking, 0.0016508
>>>>> secs] 2018-02-27T21:43:02.102+0800: 4668016.352: [GC ref-proc,
>> 0.0277818
>>>>> secs] 2018-02-27T21:43:02.129+0800: 4668016.380: [Unloading, 0.0118102
>>>>> secs], 0.0704296 secs]
>>>>> [Times: user=0.85 sys=0.04, real=0.07 secs]
>>>>> 2018-02-27T21:43:02.171+0800: 4668016.422: Total time for which
>>>> application
>>>>> threads were stopped: 0.0785762 seconds, Stopping threads took:
>> 0.0006159
>>>>> seconds
>>>>> 2018-02-27T21:43:02.178+0800: 4668016.429: [GC cleanup 24G->24G(32G),
>>>>> 0.0391915 secs]
>>>>> [Times: user=0.64 sys=0.00, real=0.04 secs]
>>>>> 2018-02-27T21:43:02.218+0800: 4668016.469: Total time for which
>>>> application
>>>>> threads were stopped: 0.0470020 seconds, Stopping threads took:
>> 0.0001684
>>>>> seconds
>>>>> 2018-02-27T21:43:02.540+0800: 4668016.791: Total time for which
>>>> application
>>>>> threads were stopped: 0.0074829 seconds, Stopping threads took:
>> 0.0004834
>>>>> seconds
>>>>> {Heap before GC invocations=1144023 (full 72):
>>>>> garbage-first heap   total 33554432K, used 27078904K
>> [0x00007f1478000000,
>>>>> 0x00007f1478808000, 0x00007f1c78000000)
>>>>> region size 8192K, 204 young (1671168K), 10 survivors (81920K)
>>>>> Metaspace       used 41184K, capacity 41752K, committed 67072K,
>> reserved
>>>>> 67584K
>>>>> 2018-02-27T21:43:04.076+0800: 4668018.326: [GC pause (G1 Evacuation
>>>> Pause)
>>>>> (young)
>>>>> Desired survivor size 109051904 bytes, new threshold 15 (max 15)
>>>>> - age   1:   47719032 bytes,   47719032 total
>>>>> , 0.0554183 secs]
>>>>> [Parallel Time: 48.0 ms, GC Workers: 18]
>>>>>    [GC Worker Start (ms): Min: 4668018329.0, Avg: 4668018329.1, Max:
>>>>> 4668018329.3, Diff: 0.3]
>>>>>    [Ext Root Scanning (ms): Min: 2.9, Avg: 5.7, Max: 47.4, Diff: 44.6,
>>>>> Sum: 103.0]
>>>>>    [Update RS (ms): Min: 0.0, Avg: 14.3, Max: 16.2, Diff: 16.2, Sum:
>>>>> 257.6]
>>>>>       [Processed Buffers: Min: 0, Avg: 17.4, Max: 22, Diff: 22, Sum:
>>>> 314]
>>>>>    [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.5]
>>>>>    [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0,
>>>>> Sum: 0.0]
>>>>>    [Object Copy (ms): Min: 0.1, Avg: 10.9, Max: 11.9, Diff: 11.8, Sum:
>>>>> 196.9]
>>>>>    [Termination (ms): Min: 0.0, Avg: 16.6, Max: 17.6, Diff: 17.6, Sum:
>>>>> 299.1]
>>>>>       [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum:
>>>> 18]
>>>>>    [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0,
>> Sum:
>>>>> 0.5]
>>>>>    [GC Worker Total (ms): Min: 47.5, Avg: 47.6, Max: 47.8, Diff: 0.3,
>>>>> Sum: 857.6]
>>>>>    [GC Worker End (ms): Min: 4668018376.7, Avg: 4668018376.8, Max:
>>>>> 4668018376.8, Diff: 0.0]
>>>>> [Code Root Fixup: 0.2 ms]
>>>>> [Code Root Purge: 0.0 ms]
>>>>> [Clear CT: 0.2 ms]
>>>>> [Other: 7.1 ms]
>>>>>    [Choose CSet: 0.0 ms]
>>>>>    [Ref Proc: 2.3 ms]
>>>>>    [Ref Enq: 0.2 ms]
>>>>>    [Redirty Cards: 0.2 ms]
>>>>>    [Humongous Register: 2.2 ms]
>>>>>    [Humongous Reclaim: 0.4 ms]
>>>>>    [Free CSet: 0.4 ms]
>>>>> [Eden: 1552.0M(1552.0M)->0.0B(1488.0M) Survivors: 80.0M->144.0M Heap:
>>>>> 25.8G(32.0G)->24.4G(32.0G)]
>>>>> Heap after GC invocations=1144024 (full 72):
>>>>> garbage-first heap   total 33554432K, used 25550050K
>> [0x00007f1478000000,
>>>>> 0x00007f1478808000, 0x00007f1c78000000)
>>>>> region size 8192K, 18 young (147456K), 18 survivors (147456K)
>>>>> Metaspace       used 41184K, capacity 41752K, committed 67072K,
>> reserved
>>>>> 67584K
>>>>> }
>>>>> [Times: user=0.82 sys=0.00, real=0.05 secs]
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 2018-02-27 20:58 GMT+08:00 Emir Arnautović <emir.arnautovic@sematext.com>:
>>>>> 
>>>>>> Ah, so there are ~560 shards per node and not all nodes are indexing
>>>>>> at the same time. Why is that? You can get better throughput if you
>>>>>> index on all nodes. If you are happy with the shard size, you can
>>>>>> create a new collection with 49 shards every 2h, keep everything else
>>>>>> the same, and index on all nodes.
>>>>>> 
>>>>>> Back to the main question: what is the heap utilisation? When you
>>>>>> restart a node, what is the heap utilisation? Do you see any errors
>>>>>> in your logs? Do you see any errors in the ZK logs?
>>>>>> 
>>>>>> Emir
>>>>>> --
>>>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>>>> Solr & Elasticsearch Consulting Support Training -
>> http://sematext.com/
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 27 Feb 2018, at 13:22, 苗海泉 <ms...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Thanks for your reply again.
>>>>>>> There may be a misunderstanding: we have 49 Solr nodes, each
>>>>>>> collection has 25 shards, and each shard has only one replica of the
>>>>>>> data; there are no extra copies, and I have reduced some of the
>>>>>>> caches. If you need the metric data, I can look it up and report it.
>>>>>>> In addition, ours is an append-only system; there are no update or
>>>>>>> delete operations.
>>>>>>> 
>>>>>>> 2018-02-27 20:05 GMT+08:00 Emir Arnautović <emir.arnautovic@sematext.com>:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> It is hard to tell without looking more into your metrics. It seems
>>>>>>>> to me that you are reaching the limits of your cluster. I would
>>>>>>>> double-check whether memory is the issue. If I got it right, you
>>>>>>>> have ~1120 shards per node. It takes some heap just to keep them
>>>>>>>> open. If you have some caches enabled and it is an append-only
>>>>>>>> system, old shards will keep their caches until reloaded.
>>>>>>>> Probably will not make much difference, but with 25x2=50 shards and
>>>>>>>> 49 nodes, one node will need to handle double the indexing load.
>>>>>>>> 
>>>>>>>> Emir
>>>>>>>> --
>>>>>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>>>>>> Solr & Elasticsearch Consulting Support Training -
>>>> http://sematext.com/
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On 27 Feb 2018, at 12:54, 苗海泉 <ms...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> In addition, we found that the rate was normal when the number of
>>>>>>>>> collections was kept below 936, and that indexing became slower
>>>>>>>>> and slower at around 984. So far we could only work around this by
>>>>>>>>> temporarily deleting older collections, but now we need more
>>>>>>>>> collections online. This problem has puzzled us for a long time
>>>>>>>>> with no good solution; any ideas would be greatly appreciated.
>>>>>>>>> 
>>>>>>>>> 2018-02-27 19:46 GMT+08:00 苗海泉 <ms...@gmail.com>:
>>>>>>>>> 
>>>>>>>>>> Thank you for the reply.
>>>>>>>>>> Each collection has 25 shards with one replica, and each Solr
>>>>>>>>>> node holds about 5T on disk.
>>>>>>>>>> GC was checked and modified as follows:
>>>>>>>>>> SOLR_JAVA_MEM="-Xms32768m -Xmx32768m "
>>>>>>>>>> GC_TUNE=" \
>>>>>>>>>> -XX:+UseG1GC \
>>>>>>>>>> -XX:+PerfDisableSharedMem \
>>>>>>>>>> -XX:+ParallelRefProcEnabled \
>>>>>>>>>> -XX:G1HeapRegionSize=8m \
>>>>>>>>>> -XX:MaxGCPauseMillis=250 \
>>>>>>>>>> -XX:InitiatingHeapOccupancyPercent=75 \
>>>>>>>>>> -XX:+UseLargePages \
>>>>>>>>>> -XX:+AggressiveOpts \
>>>>>>>>>> -XX:+UseLargePages"
>>>>>>>>>> 
>>>>>>>>>> 2018-02-27 19:27 GMT+08:00 Emir Arnautović <
>>>>>>>> emir.arnautovic@sematext.com>:
>>>>>>>>>> 
>>>>>>>>>>> Hi,
>>>>>>>>>>> To get a more complete picture, can you tell us how many
>>>>>>>>>>> shards/replicas you have per collection? Also, what is the index
>>>>>>>>>>> size on disk? Did you check GC?
>>>>>>>>>>> 
>>>>>>>>>>> BTW, using a 32GB heap prevents the JVM from using compressed
>>>>>>>>>>> oops, resulting in less usable memory than a 31GB heap.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Emir
>>>>>>>>>>> --
>>>>>>>>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>>>>>>>>> Solr & Elasticsearch Consulting Support Training -
>>>>>>>> http://sematext.com/
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> ==============================
>>>>>>>>>> 联创科技
>>>>>>>>>> 知行如一
>>>>>>>>>> ==============================
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> ==============================
>>>>>>>>> 联创科技
>>>>>>>>> 知行如一
>>>>>>>>> ==============================
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> ==============================
>>>>>>> 联创科技
>>>>>>> 知行如一
>>>>>>> ==============================
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> ==============================
>>>>> 联创科技
>>>>> 知行如一
>>>>> ==============================
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> ==============================
>>> 联创科技
>>> 知行如一
>>> ==============================
>> 
>> 
> 
> 
> -- 
> ==============================
> 联创科技
> 知行如一
> ==============================
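
A note on the shard arithmetic in the exchange quoted above: with the figures given in the thread, 1,100 collections x 25 shards spread over 49 nodes works out to 1100 * 25 / 49, or about 561 shards per node, which is where the corrected "~560 shards per node" figure comes from; the earlier "~1120" estimate had assumed two copies of each shard (25 x 2 = 50 shards per collection).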


Re: When the number of collections exceeds one thousand, indexing speed drops sharply

Posted by 苗海泉 <ms...@gmail.com>.
Thanks for your detailed advice. The monitoring product you mentioned
looks good, but our Solr system runs on a private network, so it seems
we cannot use it at all, and we have not found a standalone,
downloadable application for analyzing GC logs.



-- 
==============================
联创科技
知行如一
==============================

Re: When the number of collections exceeds one thousand, indexing speed drops sharply

Posted by Emir Arnautović <em...@sematext.com>.
Hi,
I would start with the following:
1. have dedicated nodes for the ZK ensemble - those do not have to be powerful nodes (maybe 2-4 cores and 8GB RAM)
2. reduce the heap size to a value below the margin where the JVM can use compressed oops - 31GB should be a safe size
3. shard each collection across all nodes
4. increase the rollover interval to 2h so you keep the shard size/number as they are today.
5. experiment with slightly larger rollover intervals (e.g. 3h) if query latency is still acceptable. That will result in fewer shards that are slightly larger.

In any case, monitor your cluster to see how changes affect it. Not sure what you currently use for monitoring, but manually scanning GC logs is not fun. You can check out our monitoring tool if you don’t have one or if it does not give you enough visibility: https://sematext.com/spm/

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/





Re: When the number of collections exceeds one thousand, indexing speed drops sharply

Posted by 苗海泉 <ms...@gmail.com>.
Thank you. I looked into the memory footprint: I set the recovery
threshold at 75%, and heap occupancy is at about 76%. Also, our
ZooKeeper does not run on dedicated servers, which may be the cause
of the instability.

What else do you recommend I check?
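
For context: heap occupancy (~76%) sitting right at InitiatingHeapOccupancyPercent=75 means G1 will tend to keep starting concurrent marking cycles, which matches the concurrent-mark entries in the GC log quoted earlier in the thread. A sketch of the settings with only the changes suggested in this thread applied (not a tested recommendation; it lowers the heap to 31GB per Emir's compressed-oops advice and drops the -XX:+UseLargePages flag that was duplicated in the original GC_TUNE):

SOLR_JAVA_MEM="-Xms31744m -Xmx31744m"
GC_TUNE=" \
-XX:+UseG1GC \
-XX:+PerfDisableSharedMem \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=250 \
-XX:InitiatingHeapOccupancyPercent=75 \
-XX:+UseLargePages \
-XX:+AggressiveOpts"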

> >> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >>
> >>
> >>
> >>> On 27 Feb 2018, at 13:22, 苗海泉 <ms...@gmail.com> wrote:
> >>>
> >>> Thanks  for you reply again.
> >>> I just said that you may have some misunderstanding, we have 49 solr
> >> nodes,
> >>> each collection has 25 shards, each shard has only one replica of the
> >> data,
> >>> there is no copy, and I reduce the part of the cache. If you need the
> >>> metric data, I can check Come out to tell you, in addition we are only
> >>> additional system, there will not be any change action.
> >>>
> >>> 2018-02-27 20:05 GMT+08:00 Emir Arnautović <
> emir.arnautovic@sematext.com
> >>> :
> >>>
> >>>> Hi,
> >>>> It is hard to tell without looking more into your metrics. It seems to
> >> me
> >>>> that you are reaching limits of your cluster. I would doublecheck if
> >> memory
> >>>> is the issue. If I got it right, you have ~1120 shards per node. It
> >> takes
> >>>> some heap just to keep them open. If you have some caches enabled and
> >> if it
> >>>> is append only system, old shards will keep caches until reloaded.
> >>>> Probably will not make much diff, but with 25x2=50 shards and 49
> nodes,
> >>>> one node will need to handle double indexing load.
> >>>>
> >>>> Emir
> >>>> --
> >>>> Monitoring - Log Management - Alerting - Anomaly Detection
> >>>> Solr & Elasticsearch Consulting Support Training -
> http://sematext.com/
> >>>>
> >>>>
> >>>>
> >>>>> On 27 Feb 2018, at 12:54, 苗海泉 <ms...@gmail.com> wrote:
> >>>>>
> >>>>> In addition, we found that the rate was normal when the number of
> >>>>> collections was kept below 936 and the speed was slower and slower at
> >>>> 984.
> >>>>> Therefore, we could only temporarily delete the older collection, but
> >> now
> >>>>> we need more Online collection, there has been no good way to confuse
> >> us
> >>>>> for a long time, very much hope to give a solution to the problem of
> >>>> ideas,
> >>>>> greatly appreciated
> >>>>>
> >>>>> 2018-02-27 19:46 GMT+08:00 苗海泉 <ms...@gmail.com>:
> >>>>>
> >>>>>> Thank you for reply.
> >>>>>> One collection has 25 shard one replica, one solr node has about 5T
> on
> >>>>>> desk.
> >>>>>> GC is checked ,and modify as follow :
> >>>>>> SOLR_JAVA_MEM="-Xms32768m -Xmx32768m "
> >>>>>> GC_TUNE=" \
> >>>>>> -XX:+UseG1GC \
> >>>>>> -XX:+PerfDisableSharedMem \
> >>>>>> -XX:+ParallelRefProcEnabled \
> >>>>>> -XX:G1HeapRegionSize=8m \
> >>>>>> -XX:MaxGCPauseMillis=250 \
> >>>>>> -XX:InitiatingHeapOccupancyPercent=75 \
> >>>>>> -XX:+UseLargePages \
> >>>>>> -XX:+AggressiveOpts \
> >>>>>> -XX:+UseLargePages"
> >>>>>>
> >>>>>> 2018-02-27 19:27 GMT+08:00 Emir Arnautović <
> >>>> emir.arnautovic@sematext.com>:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>> To get more complete picture, can you tell us how many
> >> shards/replicas
> >>>> do
> >>>>>>> you have per collection? Also what is index size on disk? Did you
> >>>> check GC?
> >>>>>>>
> >>>>>>> BTW, using 32GB heap prevents you from using compressed oops,
> >> resulting
> >>>>>>> in less memory available than 31GB.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Emir
> >>>>>>> --
> >>>>>>> Monitoring - Log Management - Alerting - Anomaly Detection
> >>>>>>> Solr & Elasticsearch Consulting Support Training -
> >>>> http://sematext.com/
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> On 27 Feb 2018, at 11:36, 苗海泉 <ms...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> I encountered a more serious problem in the process of using solr.
> >> We
> >>>>>>> use
> >>>>>>>> the solr version is 6.0, our daily amount of data is about 500
> >> billion
> >>>>>>>> documents, create a collection every hour, the online collection
> of
> >>>> more
> >>>>>>>> than a thousand, 49 solr nodes. If the collection in less than
> 800,
> >>>> the
> >>>>>>>> speed is still very fast, if the collection of the number of 1100
> or
> >>>> so,
> >>>>>>>> the construction of solr index will drop sharply, one of the
> >> original
> >>>>>>>> program speed of about 2-3 million TPS, Dropped to only a few
> >> hundred
> >>>> or
> >>>>>>>> even tens of TPS, who have encountered a similar situation, there
> is
> >>>> no
> >>>>>>>> good idea to find this issue. By the way, solr a node memory we
> >>>> assigned
> >>>>>>>> 32G,We checked the memory, cpu, disk IO, network IO occupancy is
> no
> >>>>>>>> problem, belong to the normal state. Which friend encountered a
> >>>> similar
> >>>>>>>> problem, please inform the solution, thank you very much.
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> ==============================
> >>>>>> 联创科技
> >>>>>> 知行如一
> >>>>>> ==============================
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> ==============================
> >>>>> 联创科技
> >>>>> 知行如一
> >>>>> ==============================
> >>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> ==============================
> >>> 联创科技
> >>> 知行如一
> >>> ==============================
> >>
> >>
> >
> >
> > --
> > ==============================
> > 联创科技
> > 知行如一
> > ==============================
>
>


-- 
==============================
联创科技
知行如一
==============================

Re: When the number of collections exceeds one thousand, the construction of indexing speed drops sharply

Posted by Emir Arnautović <em...@sematext.com>.
This does not show much: only that your heap is at around 75% (24-25GB). What I had in mind was comparing metrics (heap and GC included) between periods when the cluster runs without issues and periods when it struggles, and seeing if anything can be concluded from the difference.
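
As a rough sketch of how you could capture that (assuming the node runs Solr's stock Jetty start.jar process; adjust the PID lookup to your setup), jstat can sample heap occupancy over time:

# Sample GC utilisation of the Solr JVM every 5 seconds. The O column
# is old-generation occupancy as a percentage of its capacity; if it
# sits near your 75% IHOP mark between collections, the heap is
# effectively full.
SOLR_PID=$(pgrep -f start.jar | head -n 1)
jstat -gcutil "$SOLR_PID" 5000
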
About instability: Do you run ZK on dedicated nodes?

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/




Re: When the number of collections exceeds one thousand, the construction of indexing speed drops sharply

Posted by 苗海泉 <ms...@gmail.com>.
Thank you. We originally ran 49 shards on 49 nodes, but later found that with that setup Solr often disconnected from ZooKeeper; with so many nodes registered, ZooKeeper made Solr unstable, so we reduced it to 25 shards. If throughput cannot keep up later, we will need to increase the shard count again.

When indexing is very slow we find no errors in either Solr or ZooKeeper; only index building is slow. The logs show that automatic commits are slow, but the main cause may not lie in the commits themselves.

I am sorry, I do not know how to check Java heap utilization. Judging from the GC log, GC times are not long. I have posted the log:


{Heap before GC invocations=1144021 (full 72):
 garbage-first heap   total 33554432K, used 26982419K [0x00007f1478000000,
0x00007f1478808000, 0x00007f1c78000000)
  region size 8192K, 204 young (1671168K), 26 survivors (212992K)
 Metaspace       used 41184K, capacity 41752K, committed 67072K, reserved
67584K
2018-02-27T21:43:01.793+0800: 4668016.044: [GC pause (G1 Evacuation Pause)
(young)
Desired survivor size 109051904 bytes, new threshold 1 (max 15)
- age   1:  113878760 bytes,  113878760 total
- age   2:   21264744 bytes,  135143504 total
- age   3:   17020096 bytes,  152163600 total
- age   4:   26870864 bytes,  179034464 total
, 0.0579794 secs]
   [Parallel Time: 46.9 ms, GC Workers: 18]
      [GC Worker Start (ms): Min: 4668016046.1, Avg: 4668016046.3, Max:
4668016046.4, Diff: 0.3]
      [Ext Root Scanning (ms): Min: 2.4, Avg: 6.5, Max: 46.3, Diff: 43.9,
Sum: 116.9]
      [Update RS (ms): Min: 0.0, Avg: 3.4, Max: 6.0, Diff: 6.0, Sum: 62.0]
         [Processed Buffers: Min: 0, Avg: 6.3, Max: 16, Diff: 16, Sum: 113]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.5]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0,
Sum: 0.0]
      [Object Copy (ms): Min: 0.1, Avg: 23.8, Max: 25.5, Diff: 25.5, Sum:
428.1]
      [Termination (ms): Min: 0.0, Avg: 12.7, Max: 13.5, Diff: 13.5, Sum:
228.9]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 18]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.4, Diff: 0.4, Sum:
1.2]
      [GC Worker Total (ms): Min: 46.4, Avg: 46.6, Max: 46.7, Diff: 0.3,
Sum: 838.0]
      [GC Worker End (ms): Min: 4668016092.8, Avg: 4668016092.8, Max:
4668016092.8, Diff: 0.0]
   [Code Root Fixup: 0.2 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.3 ms]
   [Other: 10.7 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 5.9 ms]
      [Ref Enq: 0.2 ms]
      [Redirty Cards: 0.2 ms]
      [Humongous Register: 2.2 ms]
      [Humongous Reclaim: 0.4 ms]
      [Free CSet: 0.4 ms]
   [Eden: 1424.0M(1424.0M)->0.0B(1552.0M) Survivors: 208.0M->80.0M Heap:
25.7G(32.0G)->24.3G(32.0G)]
Heap after GC invocations=1144022 (full 72):
 garbage-first heap   total 33554432K, used 25489656K [0x00007f1478000000,
0x00007f1478808000, 0x00007f1c78000000)
  region size 8192K, 10 young (81920K), 10 survivors (81920K)
 Metaspace       used 41184K, capacity 41752K, committed 67072K, reserved
67584K
}
 [Times: user=0.84 sys=0.01, real=0.05 secs]
2018-02-27T21:43:01.851+0800: 4668016.102: Total time for which application
threads were stopped: 0.0661383 seconds, Stopping threads took: 0.0004141
seconds
2018-02-27T21:43:02.092+0800: 4668016.343: [GC concurrent-mark-end,
2.5757061 secs]
2018-02-27T21:43:02.100+0800: 4668016.351: [GC remark
2018-02-27T21:43:02.100+0800: 4668016.351: [Finalize Marking, 0.0016508
secs] 2018-02-27T21:43:02.102+0800: 4668016.352: [GC ref-proc, 0.0277818
secs] 2018-02-27T21:43:02.129+0800: 4668016.380: [Unloading, 0.0118102
secs], 0.0704296 secs]
 [Times: user=0.85 sys=0.04, real=0.07 secs]
2018-02-27T21:43:02.171+0800: 4668016.422: Total time for which application
threads were stopped: 0.0785762 seconds, Stopping threads took: 0.0006159
seconds
2018-02-27T21:43:02.178+0800: 4668016.429: [GC cleanup 24G->24G(32G),
0.0391915 secs]
 [Times: user=0.64 sys=0.00, real=0.04 secs]
2018-02-27T21:43:02.218+0800: 4668016.469: Total time for which application
threads were stopped: 0.0470020 seconds, Stopping threads took: 0.0001684
seconds
2018-02-27T21:43:02.540+0800: 4668016.791: Total time for which application
threads were stopped: 0.0074829 seconds, Stopping threads took: 0.0004834
seconds
{Heap before GC invocations=1144023 (full 72):
 garbage-first heap   total 33554432K, used 27078904K [0x00007f1478000000,
0x00007f1478808000, 0x00007f1c78000000)
  region size 8192K, 204 young (1671168K), 10 survivors (81920K)
 Metaspace       used 41184K, capacity 41752K, committed 67072K, reserved
67584K
2018-02-27T21:43:04.076+0800: 4668018.326: [GC pause (G1 Evacuation Pause)
(young)
Desired survivor size 109051904 bytes, new threshold 15 (max 15)
- age   1:   47719032 bytes,   47719032 total
, 0.0554183 secs]
   [Parallel Time: 48.0 ms, GC Workers: 18]
      [GC Worker Start (ms): Min: 4668018329.0, Avg: 4668018329.1, Max:
4668018329.3, Diff: 0.3]
      [Ext Root Scanning (ms): Min: 2.9, Avg: 5.7, Max: 47.4, Diff: 44.6,
Sum: 103.0]
      [Update RS (ms): Min: 0.0, Avg: 14.3, Max: 16.2, Diff: 16.2, Sum:
257.6]
         [Processed Buffers: Min: 0, Avg: 17.4, Max: 22, Diff: 22, Sum: 314]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.5]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0,
Sum: 0.0]
      [Object Copy (ms): Min: 0.1, Avg: 10.9, Max: 11.9, Diff: 11.8, Sum:
196.9]
      [Termination (ms): Min: 0.0, Avg: 16.6, Max: 17.6, Diff: 17.6, Sum:
299.1]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 18]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0, Sum:
0.5]
      [GC Worker Total (ms): Min: 47.5, Avg: 47.6, Max: 47.8, Diff: 0.3,
Sum: 857.6]
      [GC Worker End (ms): Min: 4668018376.7, Avg: 4668018376.8, Max:
4668018376.8, Diff: 0.0]
   [Code Root Fixup: 0.2 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.2 ms]
   [Other: 7.1 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 2.3 ms]
      [Ref Enq: 0.2 ms]
      [Redirty Cards: 0.2 ms]
      [Humongous Register: 2.2 ms]
      [Humongous Reclaim: 0.4 ms]
      [Free CSet: 0.4 ms]
   [Eden: 1552.0M(1552.0M)->0.0B(1488.0M) Survivors: 80.0M->144.0M Heap:
25.8G(32.0G)->24.4G(32.0G)]
Heap after GC invocations=1144024 (full 72):
 garbage-first heap   total 33554432K, used 25550050K [0x00007f1478000000,
0x00007f1478808000, 0x00007f1c78000000)
  region size 8192K, 18 young (147456K), 18 survivors (147456K)
 Metaspace       used 41184K, capacity 41752K, committed 67072K, reserved
67584K
}
 [Times: user=0.82 sys=0.00, real=0.05 secs]





-- 
==============================
联创科技
知行如一
==============================

Re: When the number of collections exceeds one thousand, the construction of indexing speed drops sharply

Posted by Emir Arnautović <em...@sematext.com>.
Ah, so there are ~560 shards per node, and not all nodes are indexing at the same time. Why is that? You can get better throughput by indexing on all nodes. If you are happy with the shard size, you could instead create a new collection with 49 shards every 2 hours, keep everything else the same, and index on all nodes.
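
For example (only a sketch; the host, collection name, and config set below are placeholders), the hourly job could ask the Collections API for one shard per node:

# Create the next time-boxed collection with 49 shards, one replica,
# and at most one core per node, so all 49 nodes index in parallel:
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=logs_2018022712&numShards=49&replicationFactor=1&maxShardsPerNode=1&collection.configName=logs_conf"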

Back to the main question: what is the heap utilisation? What is it right after you restart a node? Do you see any errors in your logs? Do you see any errors in the ZK logs?

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/




Re: When the number of collections exceeds one thousand, the construction of indexing speed drops sharply

Posted by 苗海泉 <ms...@gmail.com>.
Thanks for your reply again.
I think there may be a misunderstanding: we have 49 Solr nodes, each collection has 25 shards, and each shard has only a single replica of the data; there are no extra copies. I have also reduced some of the caches. If you need the metric data, I can look it up and report back. In addition, ours is an append-only system; there are no update or delete operations.


-- 
==============================
联创科技
知行如一
==============================

Re: When the number of collections exceeds one thousand, the construction of indexing speed drops sharply

Posted by Emir Arnautović <em...@sematext.com>.
Hi,
It is hard to tell without looking more into your metrics. It seems to me that you are reaching the limits of your cluster. I would double-check whether memory is the issue. If I got it right, you have ~1120 shards per node. It takes some heap just to keep them open. If you have some caches enabled, and if it is an append-only system, old shards will keep their caches until reloaded.
It probably will not make much difference, but with 25x2=50 shard replicas and 49 nodes, one node will need to handle a double indexing load.
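
If it helps, here is a rough way to check how many cores a single node is holding open (a sketch; host and port are placeholders, and the grep count is only an approximation):

# CoreAdmin STATUS lists every core hosted on the node; each entry
# carries a "name" field, so counting those roughly gives the number
# of open cores:
curl -s "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json" | grep -o '"name"' | wc -l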

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/




Re: When the number of collections exceeds one thousand, the construction of indexing speed drops sharply

Posted by 苗海泉 <ms...@gmail.com>.
In addition, we found that the indexing rate stayed normal while the number of collections was kept below 936, and became slower and slower from around 984. So far we have only been able to work around this by temporarily deleting the older collections, but now we need to keep more collections online. This has puzzled us for a long time and we have found no good way out; any ideas for tracking down this problem would be greatly appreciated.
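
For reference, the workaround is just the Collections API DELETE action (the collection name below is a placeholder for whichever hour we age out):

# Drop the oldest hourly collection to stay under the threshold:
curl "http://localhost:8983/solr/admin/collections?action=DELETE&name=logs_2018012300"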


-- 
==============================
联创科技
知行如一
==============================

Re: When the number of collections exceeds one thousand, the construction of indexing speed drops sharply

Posted by 苗海泉 <ms...@gmail.com>.
Thank you for the reply.
Each collection has 25 shards with one replica, and each Solr node holds about 5T on disk.
We checked GC and modified the settings as follows:
SOLR_JAVA_MEM="-Xms32768m -Xmx32768m "
GC_TUNE=" \
-XX:+UseG1GC \
-XX:+PerfDisableSharedMem \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=250 \
-XX:InitiatingHeapOccupancyPercent=75 \
-XX:+UseLargePages \
-XX:+AggressiveOpts \
-XX:+UseLargePages"


-- 
==============================
联创科技
知行如一
==============================

Re: When the number of collections exceeds one thousand, the construction of indexing speed drops sharply

Posted by Emir Arnautović <em...@sematext.com>.
Hi,
To get a more complete picture, can you tell us how many shards/replicas you have per collection? Also, what is the index size on disk? Did you check GC?

BTW, using a 32GB heap prevents the JVM from using compressed oops, resulting in less usable memory than a 31GB heap.
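
You can verify this on your own JVM (a quick check; the exact cutoff can vary slightly between JVM builds):

# At -Xmx32g the JVM silently turns compressed oops off; at -Xmx31g
# they remain enabled:
java -Xmx32g -XX:+PrintFlagsFinal -version | grep UseCompressedOops
java -Xmx31g -XX:+PrintFlagsFinal -version | grep UseCompressedOops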

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/


