You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@cassandra.apache.org by Carl Mueller <ca...@smartthings.com.INVALID> on 2019/04/17 17:26:47 UTC

2.1.9 --> 2.2.13 upgrade node startup after upgrade very slow

We are doing a ton of upgrades to get out of 2.1.x. We've done probably
20-30 clusters so far and have not encountered anything like this yet.

After upgrade of a node, the restart takes a long time. like 10 minutes
long. ALmost all of our other nodes took less than 2 minutes to upgrade
(aside from sstableupgrades).

The startup stalls on a particular table, it is the largest table at about
300GB, but we have upgraded other clusters with about that much data
without this 8-10 minute delay. We have the ability to roll back the node,
and the restart as a 2.1.x node is normal with no delays.

Alas this is a prod cluster so we are going to try to sstable load the data
on a lower environment and try to replicate the delay. If we can, we will
turn on debug logging.

This occurred on the first node we tried to upgrade. It is possible it is
limited to only this node, but we are gunshy to play around with upgrades
in prod.

We have an automated upgrading program that flushes, snapshots, shuts down
gossip, drains before upgrade, suppressed autostart on upgrade, and has
worked about as flawlessly as one could hope for so far for 2.1->2.2 and
2.2-> 3.11 upgrades.

INFO  [main] 2019-04-16 17:22:17,004 ColumnFamilyStore.java:389 -
Initializing zzzz.access_token
INFO  [main] 2019-04-16 17:22:17,096 ColumnFamilyStore.java:389 -
Initializing zzzz.refresh_token
INFO  [main] 2019-04-16 17:28:52,929 ColumnFamilyStore.java:389 -
Initializing zzzz.userid
INFO  [main] 2019-04-16 17:28:52,930 ColumnFamilyStore.java:389 -
Initializing zzzz.access_token_by_auth

You can see the 6:30 delay in the startup log above. All the other
keyspace/tables initialize in under a second.

Re: 2.1.9 --> 2.2.13 upgrade node startup after upgrade very slow

Posted by Carl Mueller <ca...@smartthings.com.INVALID>.

Will try if we get to replicate the node upgrade on that node, or if we
replicate in a lower env.

Thanks


On Wed, Apr 17, 2019 at 1:49 PM Jon Haddad <jo...@jonhaddad.com> wrote:

> Let me be more specific - run the async java profiler and generate a
> flame graph to determine where CPU time is spent.
>
> On Wed, Apr 17, 2019 at 11:36 AM Jon Haddad <jo...@jonhaddad.com> wrote:
> >
> > Run the async java profiler on the node to determine what it's doing:
> > https://github.com/jvm-profiling-tools/async-profiler
> >
> > On Wed, Apr 17, 2019 at 11:31 AM Carl Mueller
> > <ca...@smartthings.com.invalid> wrote:
> > >
> > > No, we just did the package upgrade 2.1.9 --> 2.2.13
> > >
> > > It definitely feels like some indexes are being recalculated or the
> entire sstables are being scanned due to suspected corruption.
> > >
> > >
> > > On Wed, Apr 17, 2019 at 12:32 PM Jeff Jirsa <jj...@gmail.com> wrote:
> > >>
> > >> There was a time when changing some of the parameters (especially
> bloom filter FP ratio) would cause the bloom filters to be rebuilt on
> startup if the sstables didnt match what was in the schema, leading to a
> delay like that and similar logs. Any chance you changed the schema on that
> table since the last time you restarted it?
> > >>
> > >>
> > >>
> > >> On Wed, Apr 17, 2019 at 10:30 AM Carl Mueller <
> carl.mueller@smartthings.com.invalid> wrote:
> > >>>
> > >>> Oh, the table in question is SizeTiered, had about 10 sstables
> total, it was JBOD across two data directories.
> > >>>
> > >>> On Wed, Apr 17, 2019 at 12:26 PM Carl Mueller <
> carl.mueller@smartthings.com> wrote:
> > >>>>
> > >>>> We are doing a ton of upgrades to get out of 2.1.x. We've done
> probably 20-30 clusters so far and have not encountered anything like this
> yet.
> > >>>>
> > >>>> After upgrade of a node, the restart takes a long time. like 10
> minutes long. ALmost all of our other nodes took less than 2 minutes to
> upgrade (aside from sstableupgrades).
> > >>>>
> > >>>> The startup stalls on a particular table, it is the largest table
> at about 300GB, but we have upgraded other clusters with about that much
> data without this 8-10 minute delay. We have the ability to roll back the
> node, and the restart as a 2.1.x node is normal with no delays.
> > >>>>
> > >>>> Alas this is a prod cluster so we are going to try to sstable load
> the data on a lower environment and try to replicate the delay. If we can,
> we will turn on debug logging.
> > >>>>
> > >>>> This occurred on the first node we tried to upgrade. It is possible
> it is limited to only this node, but we are gunshy to play around with
> upgrades in prod.
> > >>>>
> > >>>> We have an automated upgrading program that flushes, snapshots,
> shuts down gossip, drains before upgrade, suppressed autostart on upgrade,
> and has worked about as flawlessly as one could hope for so far for
> 2.1->2.2 and 2.2-> 3.11 upgrades.
> > >>>>
> > >>>> INFO  [main] 2019-04-16 17:22:17,004 ColumnFamilyStore.java:389 -
> Initializing zzzz.access_token
> > >>>> INFO  [main] 2019-04-16 17:22:17,096 ColumnFamilyStore.java:389 -
> Initializing zzzz.refresh_token
> > >>>> INFO  [main] 2019-04-16 17:28:52,929 ColumnFamilyStore.java:389 -
> Initializing zzzz.userid
> > >>>> INFO  [main] 2019-04-16 17:28:52,930 ColumnFamilyStore.java:389 -
> Initializing zzzz.access_token_by_auth
> > >>>>
> > >>>> You can see the 6:30 delay in the startup log above. All the other
> keyspace/tables initialize in under a second.
> > >>>>
> > >>>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
> For additional commands, e-mail: user-help@cassandra.apache.org
>
>

Re: 2.1.9 --> 2.2.13 upgrade node startup after upgrade very slow

Posted by Jon Haddad <jo...@jonhaddad.com>.

Let me be more specific - run the async java profiler and generate a
flame graph to determine where CPU time is spent.

On Wed, Apr 17, 2019 at 11:36 AM Jon Haddad <jo...@jonhaddad.com> wrote:
>
> Run the async java profiler on the node to determine what it's doing:
> https://github.com/jvm-profiling-tools/async-profiler
>
> On Wed, Apr 17, 2019 at 11:31 AM Carl Mueller
> <ca...@smartthings.com.invalid> wrote:
> >
> > No, we just did the package upgrade 2.1.9 --> 2.2.13
> >
> > It definitely feels like some indexes are being recalculated or the entire sstables are being scanned due to suspected corruption.
> >
> >
> > On Wed, Apr 17, 2019 at 12:32 PM Jeff Jirsa <jj...@gmail.com> wrote:
> >>
> >> There was a time when changing some of the parameters (especially bloom filter FP ratio) would cause the bloom filters to be rebuilt on startup if the sstables didnt match what was in the schema, leading to a delay like that and similar logs. Any chance you changed the schema on that table since the last time you restarted it?
> >>
> >>
> >>
> >> On Wed, Apr 17, 2019 at 10:30 AM Carl Mueller <ca...@smartthings.com.invalid> wrote:
> >>>
> >>> Oh, the table in question is SizeTiered, had about 10 sstables total, it was JBOD across two data directories.
> >>>
> >>> On Wed, Apr 17, 2019 at 12:26 PM Carl Mueller <ca...@smartthings.com> wrote:
> >>>>
> >>>> We are doing a ton of upgrades to get out of 2.1.x. We've done probably 20-30 clusters so far and have not encountered anything like this yet.
> >>>>
> >>>> After upgrade of a node, the restart takes a long time. like 10 minutes long. ALmost all of our other nodes took less than 2 minutes to upgrade (aside from sstableupgrades).
> >>>>
> >>>> The startup stalls on a particular table, it is the largest table at about 300GB, but we have upgraded other clusters with about that much data without this 8-10 minute delay. We have the ability to roll back the node, and the restart as a 2.1.x node is normal with no delays.
> >>>>
> >>>> Alas this is a prod cluster so we are going to try to sstable load the data on a lower environment and try to replicate the delay. If we can, we will turn on debug logging.
> >>>>
> >>>> This occurred on the first node we tried to upgrade. It is possible it is limited to only this node, but we are gunshy to play around with upgrades in prod.
> >>>>
> >>>> We have an automated upgrading program that flushes, snapshots, shuts down gossip, drains before upgrade, suppressed autostart on upgrade, and has worked about as flawlessly as one could hope for so far for 2.1->2.2 and 2.2-> 3.11 upgrades.
> >>>>
> >>>> INFO  [main] 2019-04-16 17:22:17,004 ColumnFamilyStore.java:389 - Initializing zzzz.access_token
> >>>> INFO  [main] 2019-04-16 17:22:17,096 ColumnFamilyStore.java:389 - Initializing zzzz.refresh_token
> >>>> INFO  [main] 2019-04-16 17:28:52,929 ColumnFamilyStore.java:389 - Initializing zzzz.userid
> >>>> INFO  [main] 2019-04-16 17:28:52,930 ColumnFamilyStore.java:389 - Initializing zzzz.access_token_by_auth
> >>>>
> >>>> You can see the 6:30 delay in the startup log above. All the other keyspace/tables initialize in under a second.
> >>>>
> >>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
For additional commands, e-mail: user-help@cassandra.apache.org

Re: 2.1.9 --> 2.2.13 upgrade node startup after upgrade very slow

Posted by Jon Haddad <jo...@jonhaddad.com>.

Run the async java profiler on the node to determine what it's doing:
https://github.com/jvm-profiling-tools/async-profiler

On Wed, Apr 17, 2019 at 11:31 AM Carl Mueller
<ca...@smartthings.com.invalid> wrote:
>
> No, we just did the package upgrade 2.1.9 --> 2.2.13
>
> It definitely feels like some indexes are being recalculated or the entire sstables are being scanned due to suspected corruption.
>
>
> On Wed, Apr 17, 2019 at 12:32 PM Jeff Jirsa <jj...@gmail.com> wrote:
>>
>> There was a time when changing some of the parameters (especially bloom filter FP ratio) would cause the bloom filters to be rebuilt on startup if the sstables didnt match what was in the schema, leading to a delay like that and similar logs. Any chance you changed the schema on that table since the last time you restarted it?
>>
>>
>>
>> On Wed, Apr 17, 2019 at 10:30 AM Carl Mueller <ca...@smartthings.com.invalid> wrote:
>>>
>>> Oh, the table in question is SizeTiered, had about 10 sstables total, it was JBOD across two data directories.
>>>
>>> On Wed, Apr 17, 2019 at 12:26 PM Carl Mueller <ca...@smartthings.com> wrote:
>>>>
>>>> We are doing a ton of upgrades to get out of 2.1.x. We've done probably 20-30 clusters so far and have not encountered anything like this yet.
>>>>
>>>> After upgrade of a node, the restart takes a long time. like 10 minutes long. ALmost all of our other nodes took less than 2 minutes to upgrade (aside from sstableupgrades).
>>>>
>>>> The startup stalls on a particular table, it is the largest table at about 300GB, but we have upgraded other clusters with about that much data without this 8-10 minute delay. We have the ability to roll back the node, and the restart as a 2.1.x node is normal with no delays.
>>>>
>>>> Alas this is a prod cluster so we are going to try to sstable load the data on a lower environment and try to replicate the delay. If we can, we will turn on debug logging.
>>>>
>>>> This occurred on the first node we tried to upgrade. It is possible it is limited to only this node, but we are gunshy to play around with upgrades in prod.
>>>>
>>>> We have an automated upgrading program that flushes, snapshots, shuts down gossip, drains before upgrade, suppressed autostart on upgrade, and has worked about as flawlessly as one could hope for so far for 2.1->2.2 and 2.2-> 3.11 upgrades.
>>>>
>>>> INFO  [main] 2019-04-16 17:22:17,004 ColumnFamilyStore.java:389 - Initializing zzzz.access_token
>>>> INFO  [main] 2019-04-16 17:22:17,096 ColumnFamilyStore.java:389 - Initializing zzzz.refresh_token
>>>> INFO  [main] 2019-04-16 17:28:52,929 ColumnFamilyStore.java:389 - Initializing zzzz.userid
>>>> INFO  [main] 2019-04-16 17:28:52,930 ColumnFamilyStore.java:389 - Initializing zzzz.access_token_by_auth
>>>>
>>>> You can see the 6:30 delay in the startup log above. All the other keyspace/tables initialize in under a second.
>>>>
>>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
For additional commands, e-mail: user-help@cassandra.apache.org

Re: 2.1.9 --> 2.2.13 upgrade node startup after upgrade very slow

Posted by Carl Mueller <ca...@smartthings.com.INVALID>.

No, we just did the package upgrade 2.1.9 --> 2.2.13

It definitely feels like some indexes are being recalculated or the entire
sstables are being scanned due to suspected corruption.


On Wed, Apr 17, 2019 at 12:32 PM Jeff Jirsa <jj...@gmail.com> wrote:

> There was a time when changing some of the parameters (especially bloom
> filter FP ratio) would cause the bloom filters to be rebuilt on startup if
> the sstables didnt match what was in the schema, leading to a delay like
> that and similar logs. Any chance you changed the schema on that table
> since the last time you restarted it?
>
>
>
> On Wed, Apr 17, 2019 at 10:30 AM Carl Mueller
> <ca...@smartthings.com.invalid> wrote:
>
>> Oh, the table in question is SizeTiered, had about 10 sstables total, it
>> was JBOD across two data directories.
>>
>> On Wed, Apr 17, 2019 at 12:26 PM Carl Mueller <
>> carl.mueller@smartthings.com> wrote:
>>
>>> We are doing a ton of upgrades to get out of 2.1.x. We've done probably
>>> 20-30 clusters so far and have not encountered anything like this yet.
>>>
>>> After upgrade of a node, the restart takes a long time. like 10 minutes
>>> long. ALmost all of our other nodes took less than 2 minutes to upgrade
>>> (aside from sstableupgrades).
>>>
>>> The startup stalls on a particular table, it is the largest table at
>>> about 300GB, but we have upgraded other clusters with about that much data
>>> without this 8-10 minute delay. We have the ability to roll back the node,
>>> and the restart as a 2.1.x node is normal with no delays.
>>>
>>> Alas this is a prod cluster so we are going to try to sstable load the
>>> data on a lower environment and try to replicate the delay. If we can, we
>>> will turn on debug logging.
>>>
>>> This occurred on the first node we tried to upgrade. It is possible it
>>> is limited to only this node, but we are gunshy to play around with
>>> upgrades in prod.
>>>
>>> We have an automated upgrading program that flushes, snapshots, shuts
>>> down gossip, drains before upgrade, suppressed autostart on upgrade, and
>>> has worked about as flawlessly as one could hope for so far for 2.1->2.2
>>> and 2.2-> 3.11 upgrades.
>>>
>>> INFO  [main] 2019-04-16 17:22:17,004 ColumnFamilyStore.java:389 -
>>> Initializing zzzz.access_token
>>> INFO  [main] 2019-04-16 17:22:17,096 ColumnFamilyStore.java:389 -
>>> Initializing zzzz.refresh_token
>>> INFO  [main] 2019-04-16 17:28:52,929 ColumnFamilyStore.java:389 -
>>> Initializing zzzz.userid
>>> INFO  [main] 2019-04-16 17:28:52,930 ColumnFamilyStore.java:389 -
>>> Initializing zzzz.access_token_by_auth
>>>
>>> You can see the 6:30 delay in the startup log above. All the other
>>> keyspace/tables initialize in under a second.
>>>
>>>
>>>

Re: 2.1.9 --> 2.2.13 upgrade node startup after upgrade very slow

Posted by Jeff Jirsa <jj...@gmail.com>.

There was a time when changing some of the parameters (especially bloom
filter FP ratio) would cause the bloom filters to be rebuilt on startup if
the sstables didnt match what was in the schema, leading to a delay like
that and similar logs. Any chance you changed the schema on that table
since the last time you restarted it?



On Wed, Apr 17, 2019 at 10:30 AM Carl Mueller
<ca...@smartthings.com.invalid> wrote:

> Oh, the table in question is SizeTiered, had about 10 sstables total, it
> was JBOD across two data directories.
>
> On Wed, Apr 17, 2019 at 12:26 PM Carl Mueller <
> carl.mueller@smartthings.com> wrote:
>
>> We are doing a ton of upgrades to get out of 2.1.x. We've done probably
>> 20-30 clusters so far and have not encountered anything like this yet.
>>
>> After upgrade of a node, the restart takes a long time. like 10 minutes
>> long. ALmost all of our other nodes took less than 2 minutes to upgrade
>> (aside from sstableupgrades).
>>
>> The startup stalls on a particular table, it is the largest table at
>> about 300GB, but we have upgraded other clusters with about that much data
>> without this 8-10 minute delay. We have the ability to roll back the node,
>> and the restart as a 2.1.x node is normal with no delays.
>>
>> Alas this is a prod cluster so we are going to try to sstable load the
>> data on a lower environment and try to replicate the delay. If we can, we
>> will turn on debug logging.
>>
>> This occurred on the first node we tried to upgrade. It is possible it is
>> limited to only this node, but we are gunshy to play around with upgrades
>> in prod.
>>
>> We have an automated upgrading program that flushes, snapshots, shuts
>> down gossip, drains before upgrade, suppressed autostart on upgrade, and
>> has worked about as flawlessly as one could hope for so far for 2.1->2.2
>> and 2.2-> 3.11 upgrades.
>>
>> INFO  [main] 2019-04-16 17:22:17,004 ColumnFamilyStore.java:389 -
>> Initializing zzzz.access_token
>> INFO  [main] 2019-04-16 17:22:17,096 ColumnFamilyStore.java:389 -
>> Initializing zzzz.refresh_token
>> INFO  [main] 2019-04-16 17:28:52,929 ColumnFamilyStore.java:389 -
>> Initializing zzzz.userid
>> INFO  [main] 2019-04-16 17:28:52,930 ColumnFamilyStore.java:389 -
>> Initializing zzzz.access_token_by_auth
>>
>> You can see the 6:30 delay in the startup log above. All the other
>> keyspace/tables initialize in under a second.
>>
>>
>>

Re: 2.1.9 --> 2.2.13 upgrade node startup after upgrade very slow

Posted by Carl Mueller <ca...@smartthings.com.INVALID>.

Oh, the table in question is SizeTiered, had about 10 sstables total, it
was JBOD across two data directories.

On Wed, Apr 17, 2019 at 12:26 PM Carl Mueller <ca...@smartthings.com>
wrote:

> We are doing a ton of upgrades to get out of 2.1.x. We've done probably
> 20-30 clusters so far and have not encountered anything like this yet.
>
> After upgrade of a node, the restart takes a long time. like 10 minutes
> long. ALmost all of our other nodes took less than 2 minutes to upgrade
> (aside from sstableupgrades).
>
> The startup stalls on a particular table, it is the largest table at about
> 300GB, but we have upgraded other clusters with about that much data
> without this 8-10 minute delay. We have the ability to roll back the node,
> and the restart as a 2.1.x node is normal with no delays.
>
> Alas this is a prod cluster so we are going to try to sstable load the
> data on a lower environment and try to replicate the delay. If we can, we
> will turn on debug logging.
>
> This occurred on the first node we tried to upgrade. It is possible it is
> limited to only this node, but we are gunshy to play around with upgrades
> in prod.
>
> We have an automated upgrading program that flushes, snapshots, shuts down
> gossip, drains before upgrade, suppressed autostart on upgrade, and has
> worked about as flawlessly as one could hope for so far for 2.1->2.2 and
> 2.2-> 3.11 upgrades.
>
> INFO  [main] 2019-04-16 17:22:17,004 ColumnFamilyStore.java:389 -
> Initializing zzzz.access_token
> INFO  [main] 2019-04-16 17:22:17,096 ColumnFamilyStore.java:389 -
> Initializing zzzz.refresh_token
> INFO  [main] 2019-04-16 17:28:52,929 ColumnFamilyStore.java:389 -
> Initializing zzzz.userid
> INFO  [main] 2019-04-16 17:28:52,930 ColumnFamilyStore.java:389 -
> Initializing zzzz.access_token_by_auth
>
> You can see the 6:30 delay in the startup log above. All the other
> keyspace/tables initialize in under a second.
>
>
>