Posted to user@ignite.apache.org by Raymond Wilson <ra...@trimble.com> on 2021/01/13 05:48:36 UTC

Ever increasing startup times as data grow in persistent storage

We have noticed that the startup time for our server nodes has been slowly
increasing as the amount of data stored in the persistent store grows.

This appears to be closely related to recovery of WAL changes that were not
checkpointed at the time the node was stopped.

After enabling debug logging we see that the WAL file is scanned and, for
every cache, all partitions in the cache are examined; if there are any
uncommitted changes in the WAL file then the partition is updated (I assume
this requires reading the partition itself as part of this process).

We now have ~150 GB of data in our persistent store and we see WAL recovery
take between 5 and 10 minutes to complete, during which the node is
unavailable.

We use fairly large WAL files (512 MB per segment) with 10 segments, and WAL
archiving enabled.

We anticipate the data in persistent storage growing to terabytes, and if the
startup time continues to grow with storage size then this makes deploys and
restarts difficult.

Until now we have been using the default checkpoint interval of 3 minutes,
which may mean we have significant uncheckpointed data in the WAL files. We
are moving to a 1 minute checkpoint interval but don't yet know whether this
improves startup times. We also use the default 1024 partitions per cache,
though some partitions may be large.
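
For context, the relevant parts of our Ignite.NET data storage configuration
look roughly like the sketch below (an illustrative sketch of where these
settings live, not our exact production code):

using System;
using Apache.Ignite.Core;
using Apache.Ignite.Core.Configuration;

public static class NodeStartup
{
    public static IIgnite Start()
    {
        var cfg = new IgniteConfiguration
        {
            DataStorageConfiguration = new DataStorageConfiguration
            {
                // Checkpoint every minute instead of the 3 minute default.
                CheckpointFrequency = TimeSpan.FromMinutes(1),

                // 512 MB WAL segments, 10 of them in the work directory.
                WalSegmentSize = 512 * 1024 * 1024,
                WalSegments = 10,

                DefaultDataRegionConfiguration = new DataRegionConfiguration
                {
                    Name = "default",
                    PersistenceEnabled = true
                }
            }
        };

        return Ignition.Start(cfg);
    }
}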

Can anyone confirm whether this is expected behaviour, and recommend ways to
resolve it?

Will reducing the checkpointing interval help?
Is the entire content of a partition read while applying WAL changes?
Does anyone else have this issue?

Thanks,
Raymond.


-- 
Raymond Wilson
Solution Architect, Civil Construction Software Systems (CCSS)
11 Birmingham Drive | Christchurch, New Zealand
raymond_wilson@trimble.com


Re: Ever increasing startup times as data grow in persistent storage

Posted by Raymond Wilson <ra...@trimble.com>.
We have been continuing to monitor this issue and have experimented more
with deactivation versus Ignition.Stop(). There is some evidence that a
graceful shutdown does not perform a checkpoint, in that startup times are
not necessarily improved on restart. We'll continue to investigate this, but
confirmation that a checkpoint definitely is performed on deactivation
would be good.
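
For reference, the 'graceful' shutdown path we are testing is essentially the
following (a simplified sketch of the idea rather than our actual code; note
that SetActive(false) deactivates the whole cluster, not just the local node):

using Apache.Ignite.Core;

public static class GracefulShutdown
{
    public static void DeactivateAndStop(IIgnite ignite)
    {
        // Deactivating the cluster should leave the persistent store and WAL
        // in a consistent state before the process exits.
        ignite.GetCluster().SetActive(false);

        // Stop the local node without cancelling in-flight work.
        Ignition.Stop(ignite.Name, false);
    }
}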

We see some instances where a node takes 15 minutes to restore WAL changes
(reported by the 'Finished restoring partition state for local groups' log
statement), despite us reducing the checkpoint interval to 30 seconds. Our
ingest rate is not changing significantly, which suggests there is a
correlation between the size of a partition and the time taken to finish
restoring partition state, rather than between the size of un-checkpointed
changes and that time. Can someone confirm whether this is expected, or not?

Size of data in the persistent store is around 150 GB.

Thanks,
Raymond.




Re: Ever increasing startup times as data grow in persistent storage

Posted by andrei <ae...@gmail.com>.
Hi,

I don't think there are any other options at the moment other than the 
ones you mentioned.

However, you can also create your own application logic that checks the 
topology and activates the cluster when all nodes from the baseline are 
online. For example, additional Java code that runs when starting a server
node, as sketched below.
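
A rough sketch of that idea, written here against the Ignite.NET API since you
are using the .NET client (untested; the 1 second polling interval is
arbitrary, and it assumes the baseline topology is readable while the cluster
is still inactive):

using System.Collections.Generic;
using System.Linq;
using System.Threading;
using Apache.Ignite.Core;

public static class AutoActivator
{
    // Poll the topology and activate the cluster once every baseline node
    // is back online.
    public static void ActivateWhenBaselineIsComplete(IIgnite ignite)
    {
        var cluster = ignite.GetCluster();

        while (!cluster.IsActive())
        {
            var baseline = cluster.GetBaselineTopology();
            var online = new HashSet<object>(
                cluster.GetNodes().Select(n => n.ConsistentId));

            if (baseline != null &&
                baseline.All(b => online.Contains(b.ConsistentId)))
            {
                cluster.SetActive(true);
                break;
            }

            Thread.Sleep(1000);
        }
    }
}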

In case you require any changes to the current Ignite implementation, 
you can create a thread in the Ignite developer list:

http://apache-ignite-developers.2346864.n4.nabble.com/

BR,
Andrei



Re: Ever increasing startup times as data grow in persistent storage

Posted by Raymond Wilson <ra...@trimble.com>.
Hi Andrei,

I would like to see Ignite support a graceful shutdown scenario like the one
you get with deactivation, but one that does not need to be manually
reactivated.

We run a pretty agile process and it is not uncommon to have multiple
deploys to production throughout a week. This is a pretty automated affair
(essentially push-button) and it works well, except for the WAL rescan on
startup.

Today there are two approaches we can take for a deployment:

1. Stop the nodes (which is what we currently do), leaving the WAL and
persistent store inconsistent. This requires a rescan of the WAL before the
grid is auto re-activated on startup. The time to do this is increasing
with the size of the persistent store - it does not appear to be related to
the size of the WAL.
2. Deactivate the grid, which leaves the WAL and persistent store in a
consistent state. This requires manual re-activation on restart, but does
not incur the increasing WAL restart cost.

Is an option like the one below possible?

3. Suspend the grid, which performs the same steps deactivation does to
make the WAL and persistent store consistent, but which leaves the grid
activated so the manual activation process is not required on restart.

Thanks,
Raymond.



Re: Ever increasing startup times as data grow in persistent storage

Posted by andrei <ae...@gmail.com>.
Hi,

Yes, that was to be expected. The main autoactivation scenario is 
cluster restart. If you are using manual deactivation, you should also 
manually activate your cluster.

BR,
Andrei


Re: Ever increasing startup times as data grow in persistent storage

Posted by Raymond Wilson <ra...@trimble.com>.
We have been experimenting with using deactivation to shut down the grid to
reduce the time for the grid to start up again.

It appears there is a downside to this: once deactivated, the grid does not
appear to auto-activate once the baseline topology is achieved, which means we
will need to run through a bootstrapping protocol of ensuring the grid has
restarted correctly before activating it once again.

The baseline topology documentation at
https://ignite.apache.org/docs/latest/clustering/baseline-topology does not
cover this condition.

Is this expected?

Thanks,
Raymond.



Re: Ever increasing startup times as data grow in persistent storage

Posted by Pavel Tupitsyn <pt...@apache.org>.
Raymond,

Please use ICluster.SetActive [1] instead; the API linked above is obsolete.


[1]
https://ignite.apache.org/releases/latest/dotnetdoc/api/Apache.Ignite.Core.Cluster.ICluster.html?#Apache_Ignite_Core_Cluster_ICluster_SetActive_System_Boolean_
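
For example, assuming an IIgnite instance named 'ignite':

var cluster = ignite.GetCluster();
cluster.SetActive(false);   // deactivate, e.g. before a planned shutdown
// ... later, once the nodes have restarted ...
cluster.SetActive(true);    // activate again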


Re: Ever increasing startup times as data grow in persistent storage

Posted by Raymond Wilson <ra...@trimble.com>.
Of course. Obvious! :)

Sent from my iPhone


Re[4]: Ever increasing startup times as data grow in persistent storage

Posted by Zhenya Stanilovsky <ar...@mail.ru>.



 
>Is there an API version of the cluster deactivation?
 
https://github.com/apache/ignite/blob/master/modules/platforms/dotnet/Apache.Ignite.Core.Tests/Cache/PersistentStoreTestObsolete.cs#L131
 
>On Wed, Jan 13, 2021 at 8:28 PM Zhenya Stanilovsky < arzamas123@mail.ru > wrote:
>>
>>
>> 
>>>Hi Zhenya,
>>> 
>>>Thanks for confirming performing checkpoints more often will help here.
>>Hi Raymond !
>>> 
>>>I have established this configuration so will experiment with the settings a little.
>>> 
>>>On a related note, is there any way to automatically trigger a checkpoint, for instance as a pre-shutdown activity?
>> 
>>If you shut down your cluster gracefully (i.e. with deactivation [1]), a subsequent start will not trigger WAL reads.
>> 
>>[1]  https://www.gridgain.com/docs/latest/administrators-guide/control-script#deactivating-cluster
>> 
>>>Checkpoints seem to be much faster than the process of applying WAL updates.
>>> 
>>>Raymond.  
>>>On Wed, Jan 13, 2021 at 8:07 PM Zhenya Stanilovsky < arzamas123@mail.ru > wrote:
>>>>
>>>>
>>>>
>>>> 
>>>>>We have noticed that startup time for our server nodes has been slowly increasing in time as the amount of data stored in the persistent store grows.
>>>>> 
>>>>>This appears to be closely related to recovery of WAL changes that were not checkpointed at the time the node was stopped.
>>>>> 
>>>>>After enabling debug logging we see that the WAL file is scanned, and for every cache, all partitions in the cache are examined, and if there are any uncommitted changes in the WAL file then the partition is updated (I assume this requires reading of the partition itself as a part of this process).
>>>>> 
>>>>>We now have ~150Gb of data in our persistent store and we see WAL update times between 5-10 minutes to complete, during which the node is unavailable.
>>>>> 
>>>>>We use fairly large WAL files (512Mb) and use 10 segments, with WAL archiving enabled.
>>>>> 
>>>>>We anticipate data in persistent storage to grow to Terabytes, and if the startup time continues to grow as storage grows then this makes deploys and restarts difficult.
>>>>> 
>>>>>Until now we have been using the default checkpoint time out of 3 minutes which may mean we have significant uncheckpointed data in the WAL files. We are moving to 1 minute checkpoint but don't yet know if this improve startup times. We also use the default 1024 partitions per cache, though some partitions may be large. 
>>>>> 
>>>>>Can anyone confirm this is expected behaviour and recommendations for resolving it?
>>>>> 
>>>>>Will reducing checking pointing intervals help?
>>>> 
>>>>yes, it will help. Check  https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood
>>>>>Is the entire content of a partition read while applying WAL changes?
>>>> 
>>>>don`t think so, may be someone else suggest here?
>>>>>Does anyone else have this issue?
>>>>> 
>>>>>Thanks,
>>>>>Raymond.
>>>>> 
>>>>>  --
>>>>>
>>>>>Raymond Wilson
>>>>>Solution Architect, Civil Construction Software Systems (CCSS)
>>>>>11 Birmingham Drive |  Christchurch, New Zealand
>>>>>raymond_wilson@trimble.com
>>>>>         
>>>>> 
>>>> 
>>>> 
>>>> 
>>>>  
>>> 
>>>  --
>>>
>>>Raymond Wilson
>>>Solution Architect, Civil Construction Software Systems (CCSS)
>>>11 Birmingham Drive |  Christchurch, New Zealand
>>>raymond_wilson@trimble.com
>>>         
>>> 
>> 
>> 
>> 
>>  
> 
>  --
>
>Raymond Wilson
>Solution Architect, Civil Construction Software Systems (CCSS)
>11 Birmingham Drive |  Christchurch, New Zealand
>raymond_wilson@trimble.com
>         
> 
 
 
 
 

Re: Re[2]: Ever increasing startup times as data grow in persistent storage

Posted by Raymond Wilson <ra...@trimble.com>.
Is there an API version of the cluster deactivation?

On Wed, Jan 13, 2021 at 8:28 PM Zhenya Stanilovsky <ar...@mail.ru>
wrote:

>
>
>
>
> Hi Zhenya,
>
> Thanks for confirming performing checkpoints more often will help here.
>
> Hi Raymond !
>
>
> I have established this configuration so will experiment with settings
> little.
>
> On a related note, is there any way to automatically trigger a checkpoint,
> for instance as a pre-shutdown activity?
>
>
> If you shutdown your cluster gracefully = with deactivation [1] further
> start will not trigger wal readings.
>
> [1]
> https://www.gridgain.com/docs/latest/administrators-guide/control-script#deactivating-cluster
>
>
> Checkpoints seem to be much faster than the process of applying WAL
> updates.
>
> Raymond.
>
> On Wed, Jan 13, 2021 at 8:07 PM Zhenya Stanilovsky <arzamas123@mail.ru
> <//...@mail.ru>> wrote:
>
>
>
>
>
>
> We have noticed that startup time for our server nodes has been slowly
> increasing in time as the amount of data stored in the persistent store
> grows.
>
> This appears to be closely related to recovery of WAL changes that were
> not checkpointed at the time the node was stopped.
>
> After enabling debug logging we see that the WAL file is scanned, and for
> every cache, all partitions in the cache are examined, and if there are any
> uncommitted changes in the WAL file then the partition is updated (I assume
> this requires reading of the partition itself as a part of this process).
>
> We now have ~150Gb of data in our persistent store and we see WAL update
> times between 5-10 minutes to complete, during which the node is
> unavailable.
>
> We use fairly large WAL files (512Mb) and use 10 segments, with WAL
> archiving enabled.
>
> We anticipate data in persistent storage to grow to Terabytes, and if the
> startup time continues to grow as storage grows then this makes deploys and
> restarts difficult.
>
> Until now we have been using the default checkpoint time out of 3 minutes
> which may mean we have significant uncheckpointed data in the WAL files. We
> are moving to 1 minute checkpoint but don't yet know if this improve
> startup times. We also use the default 1024 partitions per cache, though
> some partitions may be large.
>
> Can anyone confirm this is expected behaviour and recommendations for
> resolving it?
>
> Will reducing checking pointing intervals help?
>
>
> yes, it will help. Check
> https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood
>
> Is the entire content of a partition read while applying WAL changes?
>
>
> don`t think so, may be someone else suggest here?
>
> Does anyone else have this issue?
>
> Thanks,
> Raymond.
>
>
> --
> <http://www.trimble.com/>
> Raymond Wilson
> Solution Architect, Civil Construction Software Systems (CCSS)
> 11 Birmingham Drive | Christchurch, New Zealand
> raymond_wilson@trimble.com
> <ht...@trimble.com>
>
>
>
> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
>
>
>
>
>
>
>
>
> --
> <http://www.trimble.com/>
> Raymond Wilson
> Solution Architect, Civil Construction Software Systems (CCSS)
> 11 Birmingham Drive | Christchurch, New Zealand
> raymond_wilson@trimble.com
> <//...@trimble.com>
>
>
>
> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
>
>
>
>
>
>


-- 
<http://www.trimble.com/>
Raymond Wilson
Solution Architect, Civil Construction Software Systems (CCSS)
11 Birmingham Drive | Christchurch, New Zealand
raymond_wilson@trimble.com

<https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>

Re: Re[2]: Ever increasing startup times as data grow in persistent storage

Posted by Raymond Wilson <ra...@trimble.com>.
Thanks, Zhenya.

Currently we call Ignition.Stop() with the flag to allow jobs to complete.
I assume that when using deactivation we don't need to call that, or is it still
a good idea as a belt-and-braces shutdown for the grid?
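
(For illustration only, a sketch of the shutdown order being discussed, using the
.NET API names; whether both steps are needed is exactly the question above.)

using Apache.Ignite.Core;

var ignite = Ignition.TryGetIgnite();

// 1. Deactivate the whole cluster so a checkpoint runs while data is consistent.
ignite.GetCluster().SetActive(false);

// 2. Deactivation does not stop the node process itself, so the node is still
//    stopped afterwards; cancel: false lets running jobs finish.
Ignition.Stop(ignite.Name, cancel: false);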

Raymond

On Wed, Jan 13, 2021 at 8:28 PM Zhenya Stanilovsky <ar...@mail.ru>
wrote:

>
>
>
>
> Hi Zhenya,
>
> Thanks for confirming performing checkpoints more often will help here.
>
> Hi Raymond !
>
>
> I have established this configuration so will experiment with settings
> little.
>
> On a related note, is there any way to automatically trigger a checkpoint,
> for instance as a pre-shutdown activity?
>
>
> If you shutdown your cluster gracefully = with deactivation [1] further
> start will not trigger wal readings.
>
> [1]
> https://www.gridgain.com/docs/latest/administrators-guide/control-script#deactivating-cluster
>
>
> Checkpoints seem to be much faster than the process of applying WAL
> updates.
>
> Raymond.
>
> On Wed, Jan 13, 2021 at 8:07 PM Zhenya Stanilovsky <arzamas123@mail.ru
> <//...@mail.ru>> wrote:
>
>
>
>
>
>
> We have noticed that startup time for our server nodes has been slowly
> increasing in time as the amount of data stored in the persistent store
> grows.
>
> This appears to be closely related to recovery of WAL changes that were
> not checkpointed at the time the node was stopped.
>
> After enabling debug logging we see that the WAL file is scanned, and for
> every cache, all partitions in the cache are examined, and if there are any
> uncommitted changes in the WAL file then the partition is updated (I assume
> this requires reading of the partition itself as a part of this process).
>
> We now have ~150Gb of data in our persistent store and we see WAL update
> times between 5-10 minutes to complete, during which the node is
> unavailable.
>
> We use fairly large WAL files (512Mb) and use 10 segments, with WAL
> archiving enabled.
>
> We anticipate data in persistent storage to grow to Terabytes, and if the
> startup time continues to grow as storage grows then this makes deploys and
> restarts difficult.
>
> Until now we have been using the default checkpoint time out of 3 minutes
> which may mean we have significant uncheckpointed data in the WAL files. We
> are moving to 1 minute checkpoint but don't yet know if this improve
> startup times. We also use the default 1024 partitions per cache, though
> some partitions may be large.
>
> Can anyone confirm this is expected behaviour and recommendations for
> resolving it?
>
> Will reducing checking pointing intervals help?
>
>
> yes, it will help. Check
> https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood
>
> Is the entire content of a partition read while applying WAL changes?
>
>
> don`t think so, may be someone else suggest here?
>
> Does anyone else have this issue?
>
> Thanks,
> Raymond.
>
>
> --
> <http://www.trimble.com/>
> Raymond Wilson
> Solution Architect, Civil Construction Software Systems (CCSS)
> 11 Birmingham Drive | Christchurch, New Zealand
> raymond_wilson@trimble.com
> <ht...@trimble.com>
>
>
>
> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
>
>
>
>
>
>
>
>
> --
> <http://www.trimble.com/>
> Raymond Wilson
> Solution Architect, Civil Construction Software Systems (CCSS)
> 11 Birmingham Drive | Christchurch, New Zealand
> raymond_wilson@trimble.com
> <//...@trimble.com>
>
>
>
> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
>
>
>
>
>
>


-- 
<http://www.trimble.com/>
Raymond Wilson
Solution Architect, Civil Construction Software Systems (CCSS)
11 Birmingham Drive | Christchurch, New Zealand
raymond_wilson@trimble.com

<https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>

Re[2]: Ever increasing startup times as data grow in persistent storage

Posted by Zhenya Stanilovsky <ar...@mail.ru>.


 
>Hi Zhenya,
> 
>Thanks for confirming performing checkpoints more often will help here.
Hi Raymond !
> 
>I have established this configuration so will experiment with settings little.
> 
>On a related note, is there any way to automatically trigger a checkpoint, for instance as a pre-shutdown activity?
 
If you shut down your cluster gracefully, i.e. with deactivation [1], a further start will not trigger WAL reads.
 
[1] https://www.gridgain.com/docs/latest/administrators-guide/control-script#deactivating-cluster
 
>Checkpoints seem to be much faster than the process of applying WAL updates.
> 
>Raymond.  
>On Wed, Jan 13, 2021 at 8:07 PM Zhenya Stanilovsky < arzamas123@mail.ru > wrote:
>>
>>
>>
>> 
>>>We have noticed that startup time for our server nodes has been slowly increasing in time as the amount of data stored in the persistent store grows.
>>> 
>>>This appears to be closely related to recovery of WAL changes that were not checkpointed at the time the node was stopped.
>>> 
>>>After enabling debug logging we see that the WAL file is scanned, and for every cache, all partitions in the cache are examined, and if there are any uncommitted changes in the WAL file then the partition is updated (I assume this requires reading of the partition itself as a part of this process).
>>> 
>>>We now have ~150Gb of data in our persistent store and we see WAL update times between 5-10 minutes to complete, during which the node is unavailable.
>>> 
>>>We use fairly large WAL files (512Mb) and use 10 segments, with WAL archiving enabled.
>>> 
>>>We anticipate data in persistent storage to grow to Terabytes, and if the startup time continues to grow as storage grows then this makes deploys and restarts difficult.
>>> 
>>>Until now we have been using the default checkpoint time out of 3 minutes which may mean we have significant uncheckpointed data in the WAL files. We are moving to 1 minute checkpoint but don't yet know if this improve startup times. We also use the default 1024 partitions per cache, though some partitions may be large. 
>>> 
>>>Can anyone confirm this is expected behaviour and recommendations for resolving it?
>>> 
>>>Will reducing checking pointing intervals help?
>> 
>>yes, it will help. Check  https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood
>>>Is the entire content of a partition read while applying WAL changes?
>> 
>>don`t think so, may be someone else suggest here?
>>>Does anyone else have this issue?
>>> 
>>>Thanks,
>>>Raymond.
>>> 
>>>  --
>>>
>>>Raymond Wilson
>>>Solution Architect, Civil Construction Software Systems (CCSS)
>>>11 Birmingham Drive |  Christchurch, New Zealand
>>>raymond_wilson@trimble.com
>>>         
>>> 
>> 
>> 
>> 
>>  
> 
>  --
>
>Raymond Wilson
>Solution Architect, Civil Construction Software Systems (CCSS)
>11 Birmingham Drive |  Christchurch, New Zealand
>raymond_wilson@trimble.com
>         
> 
 
 
 
 

Re: Ever increasing startup times as data grow in persistent storage

Posted by Raymond Wilson <ra...@trimble.com>.
Hi Zhenya,

Thanks for confirming that performing checkpoints more often will help here.

I have established this configuration, so I will experiment with the settings a
little.

On a related note, is there any way to automatically trigger a checkpoint,
for instance as a pre-shutdown activity? Checkpoints seem to be much faster
than the process of applying WAL updates.

Raymond.

On Wed, Jan 13, 2021 at 8:07 PM Zhenya Stanilovsky <ar...@mail.ru>
wrote:

>
>
>
>
>
> We have noticed that startup time for our server nodes has been slowly
> increasing in time as the amount of data stored in the persistent store
> grows.
>
> This appears to be closely related to recovery of WAL changes that were
> not checkpointed at the time the node was stopped.
>
> After enabling debug logging we see that the WAL file is scanned, and for
> every cache, all partitions in the cache are examined, and if there are any
> uncommitted changes in the WAL file then the partition is updated (I assume
> this requires reading of the partition itself as a part of this process).
>
> We now have ~150Gb of data in our persistent store and we see WAL update
> times between 5-10 minutes to complete, during which the node is
> unavailable.
>
> We use fairly large WAL files (512Mb) and use 10 segments, with WAL
> archiving enabled.
>
> We anticipate data in persistent storage to grow to Terabytes, and if the
> startup time continues to grow as storage grows then this makes deploys and
> restarts difficult.
>
> Until now we have been using the default checkpoint time out of 3 minutes
> which may mean we have significant uncheckpointed data in the WAL files. We
> are moving to 1 minute checkpoint but don't yet know if this improve
> startup times. We also use the default 1024 partitions per cache, though
> some partitions may be large.
>
> Can anyone confirm this is expected behaviour and recommendations for
> resolving it?
>
> Will reducing checking pointing intervals help?
>
>
> yes, it will help. Check
> https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood
>
> Is the entire content of a partition read while applying WAL changes?
>
>
> don`t think so, may be someone else suggest here?
>
> Does anyone else have this issue?
>
> Thanks,
> Raymond.
>
>
> --
> <http://www.trimble.com/>
> Raymond Wilson
> Solution Architect, Civil Construction Software Systems (CCSS)
> 11 Birmingham Drive | Christchurch, New Zealand
> raymond_wilson@trimble.com
> <//...@trimble.com>
>
>
>
> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
>
>
>
>
>
>


-- 
<http://www.trimble.com/>
Raymond Wilson
Solution Architect, Civil Construction Software Systems (CCSS)
11 Birmingham Drive | Christchurch, New Zealand
raymond_wilson@trimble.com

<https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>

Re: Ever increasing startup times as data grow in persistent storage

Posted by Zhenya Stanilovsky <ar...@mail.ru>.



 
>We have noticed that startup time for our server nodes has been slowly increasing in time as the amount of data stored in the persistent store grows.
> 
>This appears to be closely related to recovery of WAL changes that were not checkpointed at the time the node was stopped.
> 
>After enabling debug logging we see that the WAL file is scanned, and for every cache, all partitions in the cache are examined, and if there are any uncommitted changes in the WAL file then the partition is updated (I assume this requires reading of the partition itself as a part of this process).
> 
>We now have ~150Gb of data in our persistent store and we see WAL update times between 5-10 minutes to complete, during which the node is unavailable.
> 
>We use fairly large WAL files (512Mb) and use 10 segments, with WAL archiving enabled.
> 
>We anticipate data in persistent storage to grow to Terabytes, and if the startup time continues to grow as storage grows then this makes deploys and restarts difficult.
> 
>Until now we have been using the default checkpoint time out of 3 minutes which may mean we have significant uncheckpointed data in the WAL files. We are moving to 1 minute checkpoint but don't yet know if this improve startup times. We also use the default 1024 partitions per cache, though some partitions may be large. 
> 
>Can anyone confirm this is expected behaviour and recommendations for resolving it?
> 
>Will reducing checking pointing intervals help?
 
Yes, it will help. Check https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood
>Is the entire content of a partition read while applying WAL changes?
 
I don't think so; maybe someone else can comment here?
>Does anyone else have this issue?
> 
>Thanks,
>Raymond.
> 
>  --
>
>Raymond Wilson
>Solution Architect, Civil Construction Software Systems (CCSS)
>11 Birmingham Drive |  Christchurch, New Zealand
>raymond_wilson@trimble.com
>         
> 
 
 
 
 

Re: Ever increasing startup times as data grow in persistent storage

Posted by Raymond Wilson <ra...@trimble.com>.
We are currently using AI 2.8.1 with the C# client.

On Wed, Jan 13, 2021 at 8:12 PM Kirill Tkalenko <tk...@yandex.ru>
wrote:

> Hello, Raymond! What version are you using?
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


-- 
<http://www.trimble.com/>
Raymond Wilson
Solution Architect, Civil Construction Software Systems (CCSS)
11 Birmingham Drive | Christchurch, New Zealand
raymond_wilson@trimble.com

<https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>

Re: Ever increasing startup times as data grow in persistent storage

Posted by Kirill Tkalenko <tk...@yandex.ru>.
Hello, Raymond! What version are you using?



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Ever increasing startup times as data grow in persistent storage

Posted by Naveen <na...@gmail.com>.
Hi Raymond

It does block writes until the checkpoint is complete, but this only happens
when we restart our nodes; at that point all of the requests that piled up
during the shutdown get processed, which is when bulk data ingestion happens.
Otherwise, for normal day-to-day real-time operations it does not really hurt
us, since we do not have any bulk writes.

Thanks




--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Ever increasing startup times as data grow in persistent storage

Posted by Raymond Wilson <ra...@trimble.com>.
Hi Naveen,

We currently have two data regions: a small one for ingest, set to 128 Mb, and
a larger one for requests (4 Gb). We leave the checkpoint page buffer size at
the default value, so this will be 1 Gb for the larger region and possibly
128 Mb for the smaller region (if I recall the rules correctly). I'm planning
to combine them to improve checkpointing behaviour.
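
(A rough sketch of that two-region layout in the .NET configuration; the region
names are placeholders and the explicit CheckpointPageBufferSize values are
illustrative only, since the message above says the defaults are used.)

using Apache.Ignite.Core.Configuration;

var storageCfg = new DataStorageConfiguration
{
    DataRegionConfigurations = new[]
    {
        new DataRegionConfiguration
        {
            Name = "Ingest",                               // small ingest region
            PersistenceEnabled = true,
            MaxSize = 128L * 1024 * 1024,                  // 128 Mb
            CheckpointPageBufferSize = 64L * 1024 * 1024   // illustrative explicit value
        },
        new DataRegionConfiguration
        {
            Name = "Requests",                             // larger request region
            PersistenceEnabled = true,
            MaxSize = 4L * 1024 * 1024 * 1024,             // 4 Gb
            CheckpointPageBufferSize = 512L * 1024 * 1024  // illustrative explicit value
        }
    }
};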

I guess what you're saying is by setting the buffer size smaller, you cap
the volume of WAL updates that need to be applied when restarting your
nodes?

Sounds like something worth trying...

I guess the risk here is that if the number of dirty pages hits that limit
during a checkpoint, then Ignite will block writes until that checkpoint is
complete and schedule another checkpoint immediately after the first has
completed. Do you see that occurring in your system?

Thanks,
Raymond

On Wed, Jan 13, 2021 at 7:52 PM Naveen <na...@gmail.com> wrote:

> Hi Raymond
>
> Did you try checkpointPageBufferSize instead of time interval, we have used
> 24MB as checkpointPageBufferSize , working fine for us, we also have close
> to 12 TB of data and does take good 6 to 10 mts to bring up the node and
> become cluster active
> Regarding the no of partitions also, 128 partitions should do and its doing
> good for us
>
> Thanks
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


-- 
<http://www.trimble.com/>
Raymond Wilson
Solution Architect, Civil Construction Software Systems (CCSS)
11 Birmingham Drive | Christchurch, New Zealand
raymond_wilson@trimble.com

<https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>

Re: Ever increasing startup times as data grow in persistent storage

Posted by Naveen <na...@gmail.com>.
Hi Raymond

Did you try checkpointPageBufferSize instead of the time interval? We have used
24 MB as the checkpointPageBufferSize and it is working fine for us. We also
have close to 12 TB of data, and it takes a good 6 to 10 minutes to bring up a
node and make the cluster active.
Regarding the number of partitions, 128 partitions should do; that is working
well for us.
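
(A hedged sketch of those two settings via the .NET API, assuming they map to
the CheckpointPageBufferSize and affinity-function properties shown here; the
cache name is a placeholder.)

using Apache.Ignite.Core.Cache.Affinity.Rendezvous;
using Apache.Ignite.Core.Cache.Configuration;
using Apache.Ignite.Core.Configuration;

var storage = new DataStorageConfiguration
{
    DefaultDataRegionConfiguration = new DataRegionConfiguration
    {
        Name = "Default",
        PersistenceEnabled = true,
        CheckpointPageBufferSize = 24L * 1024 * 1024   // 24 MB, as described above
    }
};

var cacheCfg = new CacheConfiguration("exampleCache")  // placeholder cache name
{
    AffinityFunction = new RendezvousAffinityFunction { Partitions = 128 }
};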

Thanks



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/