Posted to dev@cassandra.apache.org by Mick Semb Wever <mc...@apache.org> on 2022/11/09 19:22:08 UTC

Should we change 4.1 to G1 and offheap_objects ?

Any objections to making these changes, at the very last minute, for
4.1-rc1 ?
This is CASSANDRA-12029 and CASSANDRA-7486

Provided we figure out patches for them in the next day or two.

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Jeff Jirsa <jj...@gmail.com>.

I'll withdraw my comment about friendliness of g1 vs cms. I think it's too
late to sneak it in, but wouldn't object formally.

offheap_objects is way too late to change given we shipped the alpha in May
and there are known, long lived bugs that multiple users have reported and
wouldn't have been tested in the alpha, so I'd vote -1 on a release with
that change on the basis that we hadn't done a valid alpha/beta/testing
with that config.

On Thu, Nov 10, 2022 at 8:56 AM Jon Haddad <ru...@apache.org>
wrote:

> +1 to switching to G1.
>
> No opinion about offheap objects.
>
> On 2022/11/09 19:22:08 Mick Semb Wever wrote:
> > Any objections to making these changes, at the very last minute, for
> > 4.1-rc1 ?
> > This is CASSANDRA-12029 and CASSANDRA-7486
> >
> > Provided we figure out patches for them in the next day or two.
> >
>

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Jon Haddad <ru...@apache.org>.

+1 to switching to G1.   

No opinion about offheap objects.

On 2022/11/09 19:22:08 Mick Semb Wever wrote:
> Any objections to making these changes, at the very last minute, for
> 4.1-rc1 ?
> This is CASSANDRA-12029 and CASSANDRA-7486
> 
> Provided we figure out patches for them in the next day or two.
>

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Elliott Sims <el...@backblaze.com>.

From a user PoV, I'd call G1 drastically friendlier than CMS in that it
tends to be well-behaved under a variety of workloads and heap sizes right
out of the box without the kind of dark-art tuning and overnight surprises
you get with CMS.  Granted the smallest heap I have now is 2GB, but that's
not really small by 2022 standards and it seems to be the minimum
recommended in the docs (though the config calculator will go as low as 1GB)

It feels like the sort of change that wouldn't be a bad surprise in an RC,
but maybe a bit too big of a change for backporting to 4.0.

On Wed, Nov 9, 2022 at 4:30 PM Josh McKenzie <jm...@apache.org> wrote:

> fwiw, the "CMS is friendlier for small heaps with C*" conclusion may no
> longer be accurate; a lot of work has gone into G1 since the last time
> we've covered the topic as a project. Never mind the changes in C*.
>
> Lots of moving targets.
>
> On Wed, Nov 9, 2022, at 6:13 PM, Brad wrote:
>
> The default garbage collector in Java 11 is G1. It's designed to be
> self-tuning, so I'd call it friendly.  We have run Java 8 and 11 on G1 in
> production on all of our 1,000+ clusters for several years.
>
> I'd agree with Jeremiah that it's worth changing in trunk at the very
> least and consider backporting.
>
> On Wed, Nov 9, 2022 at 5:10 PM Brandon Williams <dr...@gmail.com> wrote:
>
> If CMS is gone, is there a friendlier alternative to G1?
>
> On Wed, Nov 9, 2022 at 3:53 PM Josh McKenzie <jm...@apache.org> wrote:
> >
> > My recollection (and brief sleuthing now) surfaces: we've gone back and
> forth on the G1 vs. CMS debate over the years and I think we settled on "it
> all depends on your environment, workload, and you need to tune it anyway.
> It might be worth having a 'default' mode that selects one of the two based
> on heap size unless otherwise specified".
> >
> > I certainly wouldn't make changes to any defaults on a release between
> beta and rc personally.
> >
> > On Wed, Nov 9, 2022, at 4:20 PM, Jeff Jirsa wrote:
> >
> > G1 you can argue for with the changes in the JDK, though it's MUCH  less
> friendly to small heaps (e.g. probably our default simple user).
> >
> > Offheap memtables are different though. If someone wants to attest that
> offheap_objects get the same level of rigorous testing as the existing
> default, that'd be useful, but I'm pretty sure that's not true, and bugs
> like https://issues.apache.org/jira/browse/CASSANDRA-12125  (which
> remains undiagnosed) reinforce that it's less commonly used and may have
> latent undiscovered bugs for default users.
> >
> >
> >
> >
> >
> > On Wed, Nov 9, 2022 at 11:23 AM Mick Semb Wever <mc...@apache.org> wrote:
> >
> > Any objections to making these changes, at the very last minute, for
> 4.1-rc1 ?
> > This is CASSANDRA-12029 and CASSANDRA-7486
> >
> > Provided we figure out patches for them in the next day or two.
> >
> >
>
>
>

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Josh McKenzie <jm...@apache.org>.

fwiw, the "CMS is friendlier for small heaps with C*" conclusion may no longer be accurate; a lot of work has gone into G1 since the last time we've covered the topic as a project. Never mind the changes in C*.

Lots of moving targets.

On Wed, Nov 9, 2022, at 6:13 PM, Brad wrote:
> The default garbage collector in Java 11 is G1. It's designed to be self-tuning, so I'd call it friendly.  We have run Java 8 and 11 on G1 in production on all of our 1,000+ clusters for several years.
> 
> I'd agree with Jeremiah that it's worth changing in trunk at the very least and consider backporting.
> 
> On Wed, Nov 9, 2022 at 5:10 PM Brandon Williams <dr...@gmail.com> wrote:
>> If CMS is gone, is there a friendlier alternative to G1?
>> 
>> On Wed, Nov 9, 2022 at 3:53 PM Josh McKenzie <jm...@apache.org> wrote:
>> >
>> > My recollection (and brief sleuthing now) surfaces: we've gone back and forth on the G1 vs. CMS debate over the years and I think we settled on "it all depends on your environment, workload, and you need to tune it anyway. It might be worth having a 'default' mode that selects one of the two based on heap size unless otherwise specified".
>> >
>> > I certainly wouldn't make changes to any defaults on a release between beta and rc personally.
>> >
>> > On Wed, Nov 9, 2022, at 4:20 PM, Jeff Jirsa wrote:
>> >
>> > G1 you can argue for with the changes in the JDK, though it's MUCH  less friendly to small heaps (e.g. probably our default simple user).
>> >
>> > Offheap memtables are different though. If someone wants to attest that offheap_objects get the same level of rigorous testing as the existing default, that'd be useful, but I'm pretty sure that's not true, and bugs like https://issues.apache.org/jira/browse/CASSANDRA-12125  (which remains undiagnosed) reinforce that it's less commonly used and may have latent undiscovered bugs for default users.
>> >
>> >
>> >
>> >
>> >
>> > On Wed, Nov 9, 2022 at 11:23 AM Mick Semb Wever <mc...@apache.org> wrote:
>> >
>> > Any objections to making these changes, at the very last minute, for 4.1-rc1 ?
>> > This is CASSANDRA-12029 and CASSANDRA-7486
>> >
>> > Provided we figure out patches for them in the next day or two.
>> >
>> >

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Brad <bs...@gmail.com>.

The default garbage collector in Java 11 is G1. It's designed to be
self-tuning, so I'd call it friendly.  We have run Java 8 and 11 on G1 in
production on all of our 1,000+ clusters for several years.

I'd agree with Jeremiah that it's worth changing in trunk at the very least
and consider backporting.

On Wed, Nov 9, 2022 at 5:10 PM Brandon Williams <dr...@gmail.com> wrote:

> If CMS is gone, is there a friendlier alternative to G1?
>
> On Wed, Nov 9, 2022 at 3:53 PM Josh McKenzie <jm...@apache.org> wrote:
> >
> > My recollection (and brief sleuthing now) surfaces: we've gone back and
> forth on the G1 vs. CMS debate over the years and I think we settled on "it
> all depends on your environment, workload, and you need to tune it anyway.
> It might be worth having a 'default' mode that selects one of the two based
> on heap size unless otherwise specified".
> >
> > I certainly wouldn't make changes to any defaults on a release between
> beta and rc personally.
> >
> > On Wed, Nov 9, 2022, at 4:20 PM, Jeff Jirsa wrote:
> >
> > G1 you can argue for with the changes in the JDK, though it's MUCH  less
> friendly to small heaps (e.g. probably our default simple user).
> >
> > Offheap memtables are different though. If someone wants to attest that
> offheap_objects get the same level of rigorous testing as the existing
> default, that'd be useful, but I'm pretty sure that's not true, and bugs
> like https://issues.apache.org/jira/browse/CASSANDRA-12125  (which
> remains undiagnosed) reinforce that it's less commonly used and may have
> latent undiscovered bugs for default users.
> >
> >
> >
> >
> >
> > On Wed, Nov 9, 2022 at 11:23 AM Mick Semb Wever <mc...@apache.org> wrote:
> >
> > Any objections to making these changes, at the very last minute, for
> 4.1-rc1 ?
> > This is CASSANDRA-12029 and CASSANDRA-7486
> >
> > Provided we figure out patches for them in the next day or two.
> >
> >
>

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Brandon Williams <dr...@gmail.com>.

> Can you define "friendlier" in the context of CMS?

Friendlier to small heaps, to Jeff's point about it being much less
friendly to them.

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Derek Chen-Becker <de...@chen-becker.org>.

There's a lot of work that's gone into G1 to the point where for
almost all workloads it will perform better than CMS. However, there
are almost no knobs to tune (most G1 params are "advisory" and G1 will
happily ignore them if it wants to), so there may not be a great
replacement if people are tuning CMS heavily for a specific workload.
Can you define "friendlier" in the context of CMS?

Derek

On Wed, Nov 9, 2022 at 3:10 PM Brandon Williams <dr...@gmail.com> wrote:
>
> If CMS is gone, is there a friendlier alternative to G1?
>
> On Wed, Nov 9, 2022 at 3:53 PM Josh McKenzie <jm...@apache.org> wrote:
> >
> > My recollection (and brief sleuthing now) surfaces: we've gone back and forth on the G1 vs. CMS debate over the years and I think we settled on "it all depends on your environment, workload, and you need to tune it anyway. It might be worth having a 'default' mode that selects one of the two based on heap size unless otherwise specified".
> >
> > I certainly wouldn't make changes to any defaults on a release between beta and rc personally.
> >
> > On Wed, Nov 9, 2022, at 4:20 PM, Jeff Jirsa wrote:
> >
> > G1 you can argue for with the changes in the JDK, though it's MUCH  less friendly to small heaps (e.g. probably our default simple user).
> >
> > Offheap memtables are different though. If someone wants to attest that offheap_objects get the same level of rigorous testing as the existing default, that'd be useful, but I'm pretty sure that's not true, and bugs like https://issues.apache.org/jira/browse/CASSANDRA-12125  (which remains undiagnosed) reinforce that it's less commonly used and may have latent undiscovered bugs for default users.
> >
> >
> >
> >
> >
> > On Wed, Nov 9, 2022 at 11:23 AM Mick Semb Wever <mc...@apache.org> wrote:
> >
> > Any objections to making these changes, at the very last minute, for 4.1-rc1 ?
> > This is CASSANDRA-12029 and CASSANDRA-7486
> >
> > Provided we figure out patches for them in the next day or two.
> >
> >



-- 
+---------------------------------------------------------------+
| Derek Chen-Becker                                             |
| GPG Key available at https://keybase.io/dchenbecker and       |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
+---------------------------------------------------------------+

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Brandon Williams <dr...@gmail.com>.

If CMS is gone, is there a friendlier alternative to G1?

On Wed, Nov 9, 2022 at 3:53 PM Josh McKenzie <jm...@apache.org> wrote:
>
> My recollection (and brief sleuthing now) surfaces: we've gone back and forth on the G1 vs. CMS debate over the years and I think we settled on "it all depends on your environment, workload, and you need to tune it anyway. It might be worth having a 'default' mode that selects one of the two based on heap size unless otherwise specified".
>
> I certainly wouldn't make changes to any defaults on a release between beta and rc personally.
>
> On Wed, Nov 9, 2022, at 4:20 PM, Jeff Jirsa wrote:
>
> G1 you can argue for with the changes in the JDK, though it's MUCH  less friendly to small heaps (e.g. probably our default simple user).
>
> Offheap memtables are different though. If someone wants to attest that offheap_objects get the same level of rigorous testing as the existing default, that'd be useful, but I'm pretty sure that's not true, and bugs like https://issues.apache.org/jira/browse/CASSANDRA-12125  (which remains undiagnosed) reinforce that it's less commonly used and may have latent undiscovered bugs for default users.
>
>
>
>
>
> On Wed, Nov 9, 2022 at 11:23 AM Mick Semb Wever <mc...@apache.org> wrote:
>
> Any objections to making these changes, at the very last minute, for 4.1-rc1 ?
> This is CASSANDRA-12029 and CASSANDRA-7486
>
> Provided we figure out patches for them in the next day or two.
>
>

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Josh McKenzie <jm...@apache.org>.

My recollection (and brief sleuthing now) surfaces: we've gone back and forth on the G1 vs. CMS debate over the years and I think we settled on "it all depends on your environment, workload, and you need to tune it anyway. It might be worth having a 'default' mode that selects one of the two based on heap size unless otherwise specified".

I certainly wouldn't make changes to any defaults on a release between beta and rc personally.

On Wed, Nov 9, 2022, at 4:20 PM, Jeff Jirsa wrote:
> G1 you can argue for with the changes in the JDK, though it's MUCH  less friendly to small heaps (e.g. probably our default simple user).
> 
> Offheap memtables are different though. If someone wants to attest that offheap_objects get the same level of rigorous testing as the existing default, that'd be useful, but I'm pretty sure that's not true, and bugs like https://issues.apache.org/jira/browse/CASSANDRA-12125  (which remains undiagnosed) reinforce that it's less commonly used and may have latent undiscovered bugs for default users. 
> 
> 
> 
> 
> 
> On Wed, Nov 9, 2022 at 11:23 AM Mick Semb Wever <mc...@apache.org> wrote:
>> Any objections to making these changes, at the very last minute, for 4.1-rc1 ? 
>> This is CASSANDRA-12029 and CASSANDRA-7486 
>> 
>> Provided we figure out patches for them in the next day or two.
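
Editorial sketch: the "'default' mode that selects one of the two based on heap size unless otherwise specified" that Josh recalls above could look roughly like the following. This is a minimal illustration, not anything from the Cassandra codebase. The function names and the 4 GB cutoff are hypothetical, since the thread never names a threshold; only the two HotSpot flags are real.

```python
from typing import List, Optional

# Hypothetical cutoff: the thread does not say at what heap size to switch.
SMALL_HEAP_THRESHOLD_MB = 4 * 1024


def choose_collector(heap_mb: int, override: Optional[str] = None) -> str:
    """Return "CMS" or "G1" for a heap size, honoring an explicit override."""
    if override is not None:
        if override not in ("CMS", "G1"):
            raise ValueError(f"unknown collector: {override}")
        return override
    # Assumption for illustration: small heaps keep CMS, larger heaps get G1.
    return "CMS" if heap_mb < SMALL_HEAP_THRESHOLD_MB else "G1"


def collector_flag(collector: str) -> List[str]:
    """Map the choice to the HotSpot flag that enables that collector."""
    flags = {
        "CMS": ["-XX:+UseConcMarkSweepGC"],
        "G1": ["-XX:+UseG1GC"],
    }
    return flags[collector]
```

Whether small heaps should really stay on CMS is exactly what the thread disputes, so the cutoff and its direction would have to come from measurement.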

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Jeff Jirsa <jj...@gmail.com>.

G1 you can argue for with the changes in the JDK, though it's MUCH  less
friendly to small heaps (e.g. probably our default simple user).

Offheap memtables are different though. If someone wants to attest that
offheap_objects get the same level of rigorous testing as the existing
default, that'd be useful, but I'm pretty sure that's not true, and bugs
like https://issues.apache.org/jira/browse/CASSANDRA-12125  (which remains
undiagnosed) reinforce that it's less commonly used and may have latent
undiscovered bugs for default users.

On Wed, Nov 9, 2022 at 11:23 AM Mick Semb Wever <mc...@apache.org> wrote:

> Any objections to making these changes, at the very last minute, for
> 4.1-rc1 ?
> This is CASSANDRA-12029 and CASSANDRA-7486
>
> Provided we figure out patches for them in the next day or two.
>
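
Editorial sketch: the memtable half of the proposal concerns the `memtable_allocation_type` setting in cassandra.yaml. For readers who want to see which value a node is using before trying `offheap_objects`, a small check could look like this. The helper name is hypothetical, and falling back to `heap_buffers` when the key is absent is an assumption based on the default this thread discusses. It reads the file as plain text rather than parsing YAML.

```python
import re


def memtable_allocation_type(yaml_text: str, default: str = "heap_buffers") -> str:
    """Return the configured memtable_allocation_type, or `default` if unset.

    Only an uncommented top-of-line key is matched, so a line such as
    "# memtable_allocation_type: offheap_objects" is ignored.
    """
    match = re.search(
        r"^\s*memtable_allocation_type:\s*(\S+)", yaml_text, re.MULTILINE
    )
    return match.group(1) if match else default
```

Usage would be along the lines of `memtable_allocation_type(open(path).read())`, where `path` points at the node's cassandra.yaml.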

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by "C. Scott Andreas" <sc...@paradoxica.net>.

I share David and Aleksey’s views on this.

We shouldn’t make major defaults changes right before RC. Might be worth adding a release note recommending users try them, and that they may become default in a future release though.

— Scott

> On Nov 16, 2022, at 3:38 PM, David Capwell <dc...@apple.com> wrote:
> 
> Getting poked in Slack to be more explicit in this thread… 
> 
> Switching to G1 on trunk, +1
> Switching to G1 on 4.1, -1.  4.1 is about to be released, and this isn’t a bug fix but a perf improvement ticket, so it should go through validation that the perf improvements are seen. There is not enough time left for that added performance work, so I strongly feel it should be pushed to 4.2/5.0, where it has plenty of time to be validated.  The ticket even asks to avoid validating the claims, saying 'Hoping we can skip due diligence on this ticket because the data is "in the past" already'.  Others have attempted both Shenandoah and ZGC and found mixed results, so nothing leads me to believe that won’t be true here either.
> 
>> On Nov 16, 2022, at 9:15 AM, J. D. Jordan <je...@gmail.com> wrote:
>> 
>> Heap -
>> +1 for G1 in trunk
>> +0 for G1 in 4.1 - I think it’s worthwhile and fairly well tested but I understand pushback against changing this so late in the game.
>> 
>> Memtable -
>> -1 for off heap in 4.1. I think this needs more testing and isn’t something to change at the last minute.
>> +1 for running performance/fuzz tests against the alternate memtable choices in trunk and switching if they don’t show regressions.
>> 
>>>> On Nov 16, 2022, at 10:48 AM, Josh McKenzie <jm...@apache.org> wrote:
>>> 
>>> 
>>> To clarify: -0 here on G1 as default for 4.1 as well; I'd like us to prioritize digging into G1's behavior on small heaps vs. CMS w/our default tuning sooner rather than later. With that info I'd likely be a strong +1 on the shift.
>>> 
>>> -1 on switching to offheap_objects for 4.1 RC; again, think this is just a small step away from being a +1 w/some more rigor around seeing the current state of the technology's intersections.
>>> 
>>> On Wed, Nov 16, 2022, at 7:47 AM, Aleksey Yeshchenko wrote:
>>>> All right. I’ll clarify then.
>>>> 
>>>> -0 on switching the default to G1 *this late* just before RC1.
>>>> -1 on switching the default to offheap_objects *for 4.1 RC1*, but all for it in principle, for 4.2, after we run some more tests and resolve the concerns raised by Jeff.
>>>> 
>>>> Let’s please try to avoid this kind of super late defaults switch going forward?
>>>> 
>>>> —
>>>> AY
>>>> 
>>>>> On 16 Nov 2022, at 03:27, Derek Chen-Becker <de...@chen-becker.org> wrote:
>>>>> 
>>>>> For the record, I'm +100 on G1. Take it with whatever sized grain of
>>>>> salt you think appropriate for a relative newcomer to the list, but
>>>>> I've spent my last 7-8 years dealing with the intersection of
>>>>> high-throughput, low latency systems and their interaction with GC and
>>>>> in my personal experience G1 outperforms CMS in all cases and with
>>>>> significantly less work (zero work, in many cases). The only things
>>>>> I've seen perform better *with a similar heap footprint* are GenShen
>>>>> (currently experimental) and Rust (beyond the scope of this topic).
>>>>> 
>>>>> Derek
>>>>> 
>>>>> On Tue, Nov 15, 2022 at 4:51 PM Jon Haddad <ru...@apache.org> wrote:
>>>>>> 
>>>>>> I'm curious what it would take for folks to be OK with merging this into 4.1?  How much additional time would you want to feel comfortable?
>>>>>> 
>>>>>> I should probably have been a little more vigorous in my +1 of Mick's PR.  For a little background - I worked on several hundred clusters while at TLP, mostly dealing with stability and performance issues.  A lot of them stemmed partially or wholly from the GC settings we ship in the project. Par New with CMS and small new gen results in a lot of premature promotion leading to high pause times into the hundreds of ms which pushes p99 latency through the roof.
>>>>>> 
>>>>>> I'm a big +1 in favor of G1 because it's not just better for most people but it's better for _every_ new Cassandra user.  The first experience that people have with the project is important, and our current GC settings are quite bad - so bad they lead to problems with stability in production.  The G1 settings are mostly hands off, result in shorter pause times and are a big improvement over the status quo.
>>>>>> 
>>>>>> Most folks don't do GC tuning, they use what we supply, and what we currently supply leads to a poor initial experience with the database.  I think we owe the community our best effort even if it means pushing the release back a little bit.
>>>>>> 
>>>>>> Just for some additional context, we're (Netflix) running 25K nodes on G1 across a variety of hardware in AWS with wildly varying workloads, and I haven't seen G1 be the root cause of a problem even once.  The settings that Mick is proposing are almost identical to what we use (we use half of heap up to 30GB).
>>>>>> 
>>>>>> I'd really appreciate it if we took a second to consider the community effect of another release that ships settings that cause significant pain for our users.
>>>>>> 
>>>>>> Jon
>>>>>> 
>>>>>> On 2022/11/10 21:49:36 Mick Semb Wever wrote:
>>>>>>>> 
>>>>>>>> In case of GC, reasonably extensive performance testing should be the
>>>>>>>> expectation. Potentially revisiting some of the G1 params for the 4.1
>>>>>>>> reality - quite a lot has changed since those optional defaults were
>>>>>>>> picked.
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> I've put our battle-tested g1 opts (from consultants at TLP and DataStax)
>>>>>>> in the patch for CASSANDRA-18027
>>>>>>> 
>>>>>>> In reality it is really not much of a change, g1 does make it simple.
>>>>>>> Picking the correct ParallelGCThreads and ConcGCThreads and the floor to
>>>>>>> the new heap (XX:NewSize) is still required, though we could do a much
>>>>>>> better job of dynamic defaults to them.
>>>>>>> 
>>>>>>> Alex Dejanovski's blog is a starting point:
>>>>>>> https://thelastpickle.com/blog/2020/06/29/cassandra_4-0_garbage_collectors_performance_benchmarks.html
>>>>>>> where this gc opt set was used (though it doesn't prove why those options
>>>>>>> are chosen)
>>>>>>> 
>>>>>>> The bar for objection to sneaking these into 4.1 was intended to be low,
>>>>>>> and I stand by those that raise concerns.
>>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> +---------------------------------------------------------------+
>>>>> | Derek Chen-Becker                                             |
>>>>> | GPG Key available at https://keybase.io/dchenbecker and       |
>>>>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>>>>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
>>>>> +---------------------------------------------------------------+
>>>> 
>>>> 
>>> 
>
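
Editorial sketch: the quoted messages above mention sizing the heap at "half of heap up to 30GB" and still having to pick `ParallelGCThreads` and `ConcGCThreads` by hand. A minimal way to assemble such a flag set is shown below. It assumes "half" means half of system RAM, which the thread does not spell out, and it is not the option set from the CASSANDRA-18027 patch. The function names are hypothetical and the thread counts are left as inputs because no values are given in the thread.

```python
from typing import List


def g1_heap_mb(system_ram_mb: int, cap_mb: int = 30 * 1024) -> int:
    """Half of system RAM, capped at 30 GB (assumed reading of the thread)."""
    return min(system_ram_mb // 2, cap_mb)


def g1_jvm_flags(
    system_ram_mb: int, parallel_gc_threads: int, conc_gc_threads: int
) -> List[str]:
    """Build a G1 flag list with a fixed-size heap and explicit GC threads."""
    heap = g1_heap_mb(system_ram_mb)
    return [
        "-XX:+UseG1GC",
        f"-Xms{heap}m",  # min and max heap set equal so the heap is not resized
        f"-Xmx{heap}m",
        f"-XX:ParallelGCThreads={parallel_gc_threads}",
        f"-XX:ConcGCThreads={conc_gc_threads}",
    ]
```

The "dynamic defaults" Mick mentions would replace the two thread-count arguments with values derived from the host's core count. That derivation is deliberately left out here because the thread does not describe it.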

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by "C. Scott Andreas" <sc...@paradoxica.net>.

Jon, thanks for flagging that I didn't get a reply to your question on the thread.

My main point in this thread is that I don't think post-beta is an appropriate time for a major prop change like this in the release cycle. Ideally at this point in the release cycle, major contributors and large users of Cassandra are running the build at minimum in pre-production environments, and hopefully in production environments too. Prop changes reset much of what's been learned by exercising the beta shortly before RC.

Adding some detail on your question re: G1 – which mostly boils down to some experience to the contrary. I don't have data from past tests easily accessible to me, so I'm writing from memory and deductive reasoning here.

The problem of garbage collection is minimizing a function of "memory overhead required to safely operate, program pause time, and CPU time burnt." ParNew+CMS are throughput-oriented collectors that commonly have higher throughput, lower CPU usage, and higher pause times than newer collectors like G1 and Shenandoah. This is a poor tradeoff for most applications.

Cassandra is unique here: internode requests speculate, masking latency within cluster that can be incurred by the pause phase of a collection. The Java Driver is also great at speculating, masking latency of a coordinator that may be pausing for a collection as well. While ParNew+CMS are an objectively poor choice for many systems, Cassandra's architecture as a majority-quorum database that can speculate both at the client and coordinator level avoids the worst of those pitfalls.

In cases where I and my colleagues have evaluated other collectors like G1 and Shenandoah, we've found lower pause times, ~unchanged or slightly higher client latency, and lower throughput. G1 testing may predate me, so I'll offer a more recent Shenandoah example. In a ~12-instance cluster that runs hot - averaging about 80% CPU - enabling Shenandoah resulted in about 5-10% lower request throughput after a couple days and a roughly equal increase in latency. While its micro-pause behavior was nice relative to ParNew's ~100-200ms pauses, it didn't make much of a difference due to internode and client speculation around it.

Again, my point in this thread is that I wouldn't alter defaults on the eve of an RC in a release cycle. We do know this will need to change soon. CMS is gone in JDK17, so consider this email an elegy :). As part of JDK17 readiness, our collector defaults must change. If someone is interested in picking up the work, I think now would be a great time to perform that measurement and propose new defaults for the project based on it - and I don't even have an objection to those landing in a patchlevel release if the measurements look really good.

But I wouldn't change the defaults on the eve of RC.

– Scott

On Nov 17, 2022, at 7:26 AM, Joseph Lynch <jo...@gmail.com> wrote:

> I'm surprised we released 4.0 without changing the default to G1 given
> that many Cassandra deployments have changed the project's default
> because it is incorrect. I know that 7486 broke a user 7 years ago,
> but I think we have had a ton of testing since then in the community
> to build our confidence. Not to mention that Java 9+ (released 2017)
> made G1 the default and Java 14 (2020) removes CMS entirely.
>
> I have personally done targeted AB testing of G1GC vs CMS in a
> controlled fashion using NDBench and our team had enough confidence in
> ~2019 to roll it to Netflix's entire fleet of O(1k) clusters and
> O(10k) instances running Java 8. We found it vastly superior to CMS in
> practically every way (no more 10s+ compacting STW phases after heap
> fragmentation, better tail latency at a coordinator/replica level,
> better average throughput, etc ...), and only identified a single very
> minor p99 regression on one cluster (~5%) which we didn't consider
> severe enough to roll back.
>
> Right now our project defaults are hurting 99 users to help 1; let
> that one user change the defaults? 4.1 seems like a great place to fix
> the bug, absent being able to do that let's at least fix it in trunk?
>
> -Joey
>
> On Thu, Nov 17, 2022 at 8:27 AM Jon Haddad <ru...@apache.org> wrote:
> >
> > I noticed nobody answered my actual question - what would it take for you to be comfortable?
> >
> > It seems that the need to do a release is now more important than the best interests of the new user's experience - despite having plenty of *production* experience showing that what we ship isn't even remotely close to usable.
> >
> > I tried to offer a compromise, and it's not cool with me that it was ignored by everyone objecting.
> >
> > Jon
> >
> > On 2022/11/17 08:34:53 Mick Semb Wever wrote:
> > > Ok, wrt G1 default, this won't go ahead for 4.1-rc1
> > >
> > > We can revisit it for 4.1.x
> > >
> > > We have a lot of voices here adamantly positive for it, and those of us
> > > that have done the performance testing over the years know why. But being
> > > called to prove it is totally valid, if you have data to any such tests
> > > please add them to the ticket 18027
> > >
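
Editorial sketch: Scott's argument above is that speculation masks collector pauses, because a request stalls only when every replica it could be served by happens to be paused at once. A toy model of that reasoning follows. It assumes pauses on different replicas are independent and ignores the speculation delay and the extra load that speculative requests add. The function name and the example numbers are illustrative, not measurements from the thread.

```python
def stalled_fraction(pause_fraction: float, replicas_tried: int) -> float:
    """Fraction of requests that hit a pause on every replica they try.

    pause_fraction: share of wall-clock time one replica spends paused for GC.
    replicas_tried: how many replicas a request can be (speculatively) sent to.
    """
    if not 0.0 <= pause_fraction <= 1.0:
        raise ValueError("pause_fraction must be between 0 and 1")
    if replicas_tried < 1:
        raise ValueError("replicas_tried must be at least 1")
    # Independent pauses: all replicas are paused together with probability p**n.
    return pause_fraction ** replicas_tried
```

Under this model, a replica paused 2% of the time stalls 2% of requests with no speculation, while `stalled_fraction(0.02, 2)` shows the much smaller share that stalls when a second replica can answer. That gap is the effect that can make pause-time improvements matter less to client latency than they would in a system without speculation.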

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Joseph Lynch <jo...@gmail.com>.

I'm surprised we released 4.0 without changing the default to G1 given
that many Cassandra deployments have changed the project's default
because it is incorrect. I know that 7486 broke a user 7 years ago,
but I think we have had a ton of testing since then in the community
to build our confidence. Not to mention that Java 9+ (released 2017)
made G1 the default and Java 14 (2020) removes CMS entirely.

I have personally done targeted AB testing of G1GC vs CMS in a
controlled fashion using NDBench and our team had enough confidence in
~2019 to roll it to Netflix's entire fleet of O(1k) clusters and
O(10k) instances running Java 8. We found it vastly superior to CMS in
practically every way (no more 10s+ compacting STW phases after heap
fragmentation, better tail latency at a coordinator/replica level,
better average throughput, etc ...), and only identified a single very
minor p99 regression on one cluster (~5%) which we didn't consider
severe enough to roll back.

Right now our project defaults are hurting 99 users to help 1; let
that one user change the defaults? 4.1 seems like a great place to fix
the bug, absent being able to do that let's at least fix it in trunk?

-Joey

On Thu, Nov 17, 2022 at 8:27 AM Jon Haddad <ru...@apache.org> wrote:
>
> I noticed nobody answered my actual question - what would it take for you to be comfortable?
>
> It seems that the need to do a release is now more important than the best interests of the new user's experience - despite having plenty of *production* experience showing that what we ship isn't even remotely close to usable.
>
> I tried to offer a compromise, and it's not cool with me that it was ignored by everyone objecting.
>
> Jon
>
> On 2022/11/17 08:34:53 Mick Semb Wever wrote:
> > Ok, wrt G1 default, this won't go ahead for 4.1-rc1
> >
> > We can revisit it for 4.1.x
> >
> > We have a lot of voices here adamantly positive for it, and those of us
> > that have done the performance testing over the years know why. But being
> > called to prove it is totally valid, if you have data to any such tests
> > please add them to the ticket 18027
> >

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Jon Haddad <ru...@apache.org>.

I noticed nobody answered my actual question - what would it take for you to be comfortable?  

It seems that the need to do a release is now more important than the best interests of the new user's experience - despite having plenty of *production* experience showing that what we ship isn't even remotely close to usable.

I tried to offer a compromise, and it's not cool with me that it was ignored by everyone objecting.

Jon

On 2022/11/17 08:34:53 Mick Semb Wever wrote:
> Ok, wrt G1 default, this won't go ahead for 4.1-rc1
> 
> We can revisit it for 4.1.x
> 
> We have a lot of voices here adamantly positive for it, and those of us
> that have done the performance testing over the years know why. But being
> called to prove it is totally valid, if you have data to any such tests
> please add them to the ticket 18027
>

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by guo Maxwell <cc...@gmail.com>.

Same as David Capwell: +1 on updating NEWS in 4.1.x and actually making the change in
4.x/5.0

On Fri, Jan 13, 2023 at 3:11 AM David Capwell <dc...@apple.com> wrote:

> I am cool with updating NEWS in 4.1.1 to recommend the change and change
> it in 4.2/5.0
>
>
> On Jan 12, 2023, at 10:56 AM, Josh McKenzie <jm...@apache.org> wrote:
>
> Potential compromise: We change it in trunk, and we NEWS.txt in the minor
> about that change in trunk, why, and recommend users consider qualifying
> the same change on their 4.1 release.
>
> In case it's not clear from me:
> +1 to changing on trunk for 5.0 here
> -1 to changing on minor release given how little (i.e. nonexistent) perf
> testing we have on the OSS project right now.
>
> On Thu, Jan 12, 2023, at 11:47 AM, Paulo Motta wrote:
>
> I tend to agree with Aleksey's sentiment. Why do we need to change the
> default in a minor release if we already provide G1 options for users that
> want to opt-in?
>
> On Thu, Jan 12, 2023 at 9:46 AM Aleksey Yeshchenko <al...@apple.com>
> wrote:
>
> Switching a major default in a minor release is way worse than doing it in
> a GA - less notice and visibility, many folks don’t even read minor version
> NEWS.txt before upgrading.
>
> Trunk is fine by me though.
>
> > On 12 Jan 2023, at 13:14, Mick Semb Wever <mc...@apache.org> wrote:
> >
> >> Ok, wrt G1 default, this won't go ahead for 4.1-rc1
> >>
> >> We can revisit it for 4.1.x
> >>
> >> We have a lot of voices here adamantly positive for it, and those of us
> that have done the performance testing over the years know why. But being
> called to prove it is totally valid, if you have data to any such tests
> please add them to the ticket 18027
> >
> >
> > Revisiting. Are there any vetoes to making G1 the default (and
> > updating the G1 settings, see the patch on
> > https://issues.apache.org/jira/browse/CASSANDRA-18027 ) for 4.1.1 ?
> >
> > IIUC , the summary of this thread till now was: there were no vetoes
> > to the change in trunk, but there were vetoes to 4.1.0 (because we
> > were inside the beta to GA window), and there was a desire to have
> > benchmarking data presented.
> >
> > WRT benchmarking, we have a separate thread for performance testing in
> > the project.  The ticket admittedly does not do its due diligence on
> > data presentation and analysis of smaller heaps: a precedent we should
> > be creating; but instead relies upon experience from many. Are we ok
> > with this this time around, or shall the patch only be applied to
> > trunk (where we have no choice w/ jdk17 landing)?
>
>
> --
you are the apple of my eye !

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by David Capwell <dc...@apple.com>.

I am cool with updating NEWS in 4.1.1 to recommend the change and change it in 4.2/5.0

> On Jan 12, 2023, at 10:56 AM, Josh McKenzie <jm...@apache.org> wrote:
> 
> Potential compromise: We change it in trunk, and we NEWS.txt in the minor about that change in trunk, why, and recommend users consider qualifying the same change on their 4.1 release.
> 
> In case it's not clear from me:
> +1 to changing on trunk for 5.0 here
> -1 to changing on minor release given how little (i.e. nonexistent) perf testing we have on the OSS project right now.
> 
> On Thu, Jan 12, 2023, at 11:47 AM, Paulo Motta wrote:
>> I tend to agree with Aleksey's sentiment. Why do we need to change the default in a minor release if we already provide G1 options for users that want to opt-in?
>> 
>> On Thu, Jan 12, 2023 at 9:46 AM Aleksey Yeshchenko <al...@apple.com> wrote:
>> Switching a major default in a minor release is way worse than doing it in a GA - less notice and visibility, many folks don’t even read minor version NEWS.txt before upgrading.
>> 
>> Trunk is fine by me though.
>> 
>> > On 12 Jan 2023, at 13:14, Mick Semb Wever <mc...@apache.org> wrote:
>> > 
>> >> Ok, wrt G1 default, this won't go ahead for 4.1-rc1
>> >> 
>> >> We can revisit it for 4.1.x
>> >> 
>> >> We have a lot of voices here adamantly positive for it, and those of us that have done the performance testing over the years know why. But being called to prove it is totally valid, if you have data to any such tests please add them to the ticket 18027
>> > 
>> > 
>> > Revisiting. Are there any vetoes to making G1 the default (and
>> > updating the G1 settings, see the patch on
>> > https://issues.apache.org/jira/browse/CASSANDRA-18027 ) for 4.1.1 ?
>> > 
>> > IIUC , the summary of this thread till now was: there were no vetoes
>> > to the change in trunk, but there were vetoes to 4.1.0 (because we
>> > were inside the beta to GA window), and there was a desire to have
>> > benchmarking data presented.
>> > 
>> > WRT benchmarking, we have a separate thread for performance testing in
>> > the project.  The ticket admittedly does not do its due diligence on
>> > data presentation and analysis of smaller heaps: a precedent we should
>> > be creating; but instead relies upon experience from many. Are we ok
>> > with this this time around, or shall the patch only be applied to
>> > trunk (where we have no choice w/ jdk17 landing)?

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Josh McKenzie <jm...@apache.org>.

Potential compromise: We change it in trunk, and we add a NEWS.txt entry in the minor about that change in trunk and why, recommending that users consider qualifying the same change on their 4.1 release.

In case it's not clear from me:
+1 to changing on trunk for 5.0 here
-1 to changing on minor release given how little (i.e. nonexistent) perf testing we have on the OSS project right now.

On Thu, Jan 12, 2023, at 11:47 AM, Paulo Motta wrote:
> I tend to agree with Aleksey's sentiment. Why do we need to change the default in a minor release if we already provide G1 options for users that want to opt-in?
> 
> On Thu, Jan 12, 2023 at 9:46 AM Aleksey Yeshchenko <al...@apple.com> wrote:
>> Switching a major default in a minor release is way worse than doing it in a GA - less notice and visibility, many folks don’t even read minor version NEWS.txt before upgrading.
>> 
>> Trunk is fine by me though.
>> 
>> > On 12 Jan 2023, at 13:14, Mick Semb Wever <mc...@apache.org> wrote:
>> > 
>> >> Ok, wrt G1 default, this won't go ahead for 4.1-rc1
>> >> 
>> >> We can revisit it for 4.1.x
>> >> 
>> >> We have a lot of voices here adamantly positive for it, and those of us that have done the performance testing over the years know why. But being called to prove it is totally valid, if you have data to any such tests please add them to the ticket 18027
>> > 
>> > 
>> > Revisiting. Are there any vetoes to making G1 the default (and
>> > updating the G1 settings, see the patch on
>> > https://issues.apache.org/jira/browse/CASSANDRA-18027 ) for 4.1.1 ?
>> > 
>> > IIUC , the summary of this thread till now was: there were no vetoes
>> > to the change in trunk, but there were vetoes to 4.1.0 (because we
>> > were inside the beta to GA window), and there was a desire to have
>> > benchmarking data presented.
>> > 
>> > WRT benchmarking, we have a separate thread for performance testing in
>> > the project.  The ticket admittedly does not do its due diligence on
>> > data presentation and analysis of smaller heaps: a precedent we should
>> > be creating; but instead relies upon experience from many. Are we ok
>> > with this this time around, or shall the patch only be applied to
>> > trunk (where we have no choice w/ jdk17 landing)?
>>

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Paulo Motta <pa...@gmail.com>.

I tend to agree with Aleksey's sentiment. Why do we need to change the
default in a minor release if we already provide G1 options for users that
want to opt-in?

On Thu, Jan 12, 2023 at 9:46 AM Aleksey Yeshchenko <al...@apple.com>
wrote:

> Switching a major default in a minor release is way worse than doing it in
> a GA - less notice and visibility, many folks don’t even read minor version
> NEWS.txt before upgrading.
>
> Trunk is fine by me though.
>
> > On 12 Jan 2023, at 13:14, Mick Semb Wever <mc...@apache.org> wrote:
> >
> >> Ok, wrt G1 default, this won't go ahead for 4.1-rc1
> >>
> >> We can revisit it for 4.1.x
> >>
> >> We have a lot of voices here adamantly positive for it, and those of us
> that have done the performance testing over the years know why. But being
> called to prove it is totally valid, if you have data to any such tests
> please add them to the ticket 18027
> >
> >
> > Revisiting. Are there any vetoes to making G1 the default (and
> > updating the G1 settings, see the patch on
> > https://issues.apache.org/jira/browse/CASSANDRA-18027 ) for 4.1.1 ?
> >
> > IIUC , the summary of this thread till now was: there were no vetoes
> > to the change in trunk, but there were vetoes to 4.1.0 (because we
> > were inside the beta to GA window), and there was a desire to have
> > benchmarking data presented.
> >
> > WRT benchmarking, we have a separate thread for performance testing in
> > the project.  The ticket admittedly does not do its due diligence on
> > data presentation and analysis of smaller heaps: a precedent we should
> > be creating; but instead relies upon experience from many. Are we ok
> > with this this time around, or shall the patch only be applied to
> > trunk (where we have no choice w/ jdk17 landing)?
>
>

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Aleksey Yeshchenko <al...@apple.com>.

Switching a major default in a minor release is way worse than doing it in a GA - less notice and visibility, many folks don’t even read minor version NEWS.txt before upgrading.

Trunk is fine by me though.

> On 12 Jan 2023, at 13:14, Mick Semb Wever <mc...@apache.org> wrote:
> 
>> Ok, wrt G1 default, this won't go ahead for 4.1-rc1
>> 
>> We can revisit it for 4.1.x
>> 
>> We have a lot of voices here adamantly positive for it, and those of us that have done the performance testing over the years know why. But being called to prove it is totally valid, if you have data to any such tests please add them to the ticket 18027
> 
> 
> Revisiting. Are there any vetoes to making G1 the default (and
> updating the G1 settings, see the patch on
> https://issues.apache.org/jira/browse/CASSANDRA-18027 ) for 4.1.1 ?
> 
> IIUC , the summary of this thread till now was: there were no vetoes
> to the change in trunk, but there were vetoes to 4.1.0 (because we
> were inside the beta to GA window), and there was a desire to have
> benchmarking data presented.
> 
> WRT benchmarking, we have a separate thread for performance testing in
> the project.  The ticket admittedly does not do its due diligence on
> data presentation and analysis of smaller heaps: a precedent we should
> be creating; but instead relies upon experience from many. Are we ok
> with this this time around, or shall the patch only be applied to
> trunk (where we have no choice w/ jdk17 landing)?

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Michael Shuler <mi...@pbandjelly.org>.

On 1/13/23 05:50, Mick Semb Wever wrote:
> Thanks for the support Brad, you're definitely not alone. Alas the 
> project works in a consensus model, i.e. off the objections made - which 
> have been all sound. A good compromise has been offered that I will move 
> forward on, and I'll also update the commented out G1 settings in 4.1.1 
> to match those becoming the default in trunk.

+1 to G1 default in trunk and a recommendation in 4.1.1 NEWS.txt. I 
agree with Aleksey and others, trunk is the right place to change defaults.

Kind regards,
Michael

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Mick Semb Wever <mc...@apache.org>.

> *+1* to changing to G1 on trunk for 5.0 and 4.1.1.  We have over a
> thousand clusters and over 10K nodes running on J8 and 11 with G1GC and
> memory management is excellent.
>


Thanks for the support Brad, you're definitely not alone. Alas the project
works in a consensus model, i.e. off the objections made - which have been
all sound. A good compromise has been offered that I will move forward on,
and I'll also update the commented out G1 settings in 4.1.1 to match those
becoming the default in trunk.
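
For anyone who wants to check which collector an options file actually enables, as opposed to what is only present as a commented-out suggestion, here is a minimal sketch. It assumes the usual jvm*-server.options layout of one flag per line with '#' starting a comment. The sample text and the function name split_flags are made up for illustration and are not the contents of any shipped file.

```python
def split_flags(options_text):
    """Return (active, commented) lists of JVM flags found in options_text.

    Assumes one flag per line and '#' as the comment marker.
    """
    active, commented = [], []
    for raw in options_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            candidate = line.lstrip("#").strip()
            # Keep only commented lines that look like a flag, not prose.
            if candidate.startswith("-"):
                commented.append(candidate)
        elif line.startswith("-"):
            active.append(line)
    return active, commented


# Illustrative sample only; not copied from a real Cassandra release.
SAMPLE = """
### CMS Settings
-XX:+UseConcMarkSweepGC

### G1 Settings
#-XX:+UseG1GC
#-XX:MaxGCPauseMillis=500
"""

active, commented = split_flags(SAMPLE)
print("active:   ", active)
print("commented:", commented)
```

Running something like this against the 4.1 file before and after an upgrade would show whether only the commented-out suggestions changed or the active collector did.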



> Excellent. Two observations: first we reverted MaxGCPauseMillis=200,
> which is the JVM default. Cassandra's jvm{8,11}-server.options has 500
> (commented out) for some reason. Second on some clusters with 'humongous
> allocations' we've had to increase G1HeapRegionSize in a few cases on
> clusters with very large partitions.
>
> CMS was deprecated in Java 9, so I don't know why Cassandra would still
> use it as the default.
>


Absolutely! Take a look at the patch, it aligns the G1 settings closer to
what you say.
https://github.com/apache/cassandra/compare/trunk...thelastpickle:cassandra:mck/7486/trunk


My apologies I did not create this ticket earlier.

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Brad <bs...@gmail.com>.

*+1* to changing to G1 on trunk for 5.0 and 4.1.1.  We have over a thousand
clusters and over 10K nodes running on Java 8 and 11 with G1GC, and memory
management is excellent. Two observations: first, we reverted
MaxGCPauseMillis to 200, which is the JVM default. Cassandra's
jvm{8,11}-server.options has 500 (commented out) for some reason. Second, on
a few clusters with very large partitions and 'humongous allocations' we've
had to increase G1HeapRegionSize.
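
To make the G1HeapRegionSize point concrete, here is a rough sketch of the arithmetic. It assumes the documented G1 behaviour on JDK 8/11: region sizes are powers of two from 1 MiB to 32 MiB, and an allocation of about half a region or more is handled as a humongous object. The helper names are hypothetical, and the 5 MiB figure is just an example, not a measurement from any cluster.

```python
MIB = 1024 * 1024

# Region sizes accepted by -XX:G1HeapRegionSize on JDK 8/11 (assumed range).
VALID_REGION_SIZES_MIB = [1, 2, 4, 8, 16, 32]


def humongous_threshold_bytes(region_size_mib):
    """Allocations of about this size or larger are treated as humongous."""
    if region_size_mib not in VALID_REGION_SIZES_MIB:
        raise ValueError("region size must be one of %s MiB" % VALID_REGION_SIZES_MIB)
    return region_size_mib * MIB // 2


def smallest_region_for(allocation_bytes):
    """Smallest region size (MiB) keeping allocation_bytes below the
    humongous threshold, or None if even 32 MiB regions are too small."""
    for size in VALID_REGION_SIZES_MIB:
        # Conservative: count "exactly half a region" as humongous too.
        if allocation_bytes < humongous_threshold_bytes(size):
            return size
    return None


size = smallest_region_for(5 * MIB)
print("-XX:G1HeapRegionSize=%dm" % size)
```

Larger regions mean fewer regions for the same heap, so this is a trade-off to test against the workload rather than a setting to raise blindly.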

CMS was deprecated in Java 9, so I don't know why Cassandra would still use
it as the default.

JEP 291: Deprecate the Concurrent Mark Sweep (CMS) Garbage Collector
https://openjdk.org/jeps/291

The change to off-heap memory sounds good, but maybe change on trunk (5.0)
not 4.1.

On Thu, Jan 12, 2023 at 8:16 AM Mick Semb Wever <mc...@apache.org> wrote:

> > Ok, wrt G1 default, this won't go ahead for 4.1-rc1
> >
> > We can revisit it for 4.1.x
> >
> > We have a lot of voices here adamantly positive for it, and those of us
> that have done the performance testing over the years know why. But being
> called to prove it is totally valid, if you have data to any such tests
> please add them to the ticket 18027
>
>
> Revisiting. Are there any vetoes to making G1 the default (and
> updating the G1 settings, see the patch on
> https://issues.apache.org/jira/browse/CASSANDRA-18027 ) for 4.1.1 ?
>
> IIUC , the summary of this thread till now was: there were no vetoes
> to the change in trunk, but there were vetoes to 4.1.0 (because we
> were inside the beta to GA window), and there was a desire to have
> benchmarking data presented.
>
> WRT benchmarking, we have a separate thread for performance testing in
> the project.  The ticket admittedly does not do its due diligence on
> data presentation and analysis of smaller heaps: a precedent we should
> be creating; but instead relies upon experience from many. Are we ok
> with this this time around, or shall the patch only be applied to
> trunk (where we have no choice w/ jdk17 landing)?
>

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Mick Semb Wever <mc...@apache.org>.

> Ok, wrt G1 default, this won't go ahead for 4.1-rc1
>
> We can revisit it for 4.1.x
>
> We have a lot of voices here adamantly positive for it, and those of us that have done the performance testing over the years know why. But being called to prove it is totally valid, if you have data to any such tests please add them to the ticket 18027


Revisiting. Are there any vetoes to making G1 the default (and
updating the G1 settings, see the patch on
https://issues.apache.org/jira/browse/CASSANDRA-18027 ) for 4.1.1 ?

IIUC , the summary of this thread till now was: there were no vetoes
to the change in trunk, but there were vetoes to 4.1.0 (because we
were inside the beta to GA window), and there was a desire to have
benchmarking data presented.

WRT benchmarking, we have a separate thread for performance testing in
the project.  The ticket admittedly does not do its due diligence on
data presentation and analysis of smaller heaps: a precedent we should
be creating; but instead relies upon experience from many. Are we ok
with this this time around, or shall the patch only be applied to
trunk (where we have no choice w/ jdk17 landing)?

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Mick Semb Wever <mc...@apache.org>.

Ok, wrt G1 default, this won't go ahead for 4.1-rc1

We can revisit it for 4.1.x

We have a lot of voices here adamantly positive for it, and those of us
that have done the performance testing over the years know why. But being
called to prove it is totally valid; if you have data from any such tests,
please add it to ticket CASSANDRA-18027

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by "C. Scott Andreas" <sc...@paradoxica.net>.

We have precedent for changing defaults that have near-universal positive impact in patchlevel releases, yep.

disk_access_mode: auto -> mmap_index_only comes to mind.

- Scott

> On Nov 16, 2022, at 6:49 PM, Derek Chen-Becker <de...@chen-becker.org> wrote:
> 
> I'm fine with not including G1 in 4.1, but would we consider inclusion
> for 4.1.X down the road once validation has been done?
> 
> Derek
> 
> 
>> On Wed, Nov 16, 2022 at 4:39 PM David Capwell <dc...@apple.com> wrote:
>> 
>> Getting poked in Slack to be more explicit in this thread…
>> 
>> Switching to G1 on trunk, +1
>> Switching to G1 on 4.1, -1.  4.1 is about to be released and this isn’t a bug fix but a perf improvement ticket, and as such it should go through validation that the perf improvements are seen. There is not enough time left for that added performance work, so I strongly feel it should be pushed to 4.2/5.0, where it has plenty of time to be validated.  The ticket even asks to avoid validating the claims, saying 'Hoping we can skip due diligence on this ticket because the data is "in the past" already'.  Others have attempted both Shenandoah and ZGC and found mixed results, so nothing leads me to believe that won’t be true here either.
>> 
>>>> On Nov 16, 2022, at 9:15 AM, J. D. Jordan <je...@gmail.com> wrote:
>>> 
>>> Heap -
>>> +1 for G1 in trunk
>>> +0 for G1 in 4.1 - I think it’s worthwhile and fairly well tested but I understand pushback against changing this so late in the game.
>>> 
>>> Memtable -
>>> -1 for off heap in 4.1. I think this needs more testing and isn’t something to change at the last minute.
>>> +1 for running performance/fuzz tests against the alternate memtable choices in trunk and switching if they don’t show regressions.
>>> 
>>>> On Nov 16, 2022, at 10:48 AM, Josh McKenzie <jm...@apache.org> wrote:
>>>> 
>>>> 
>>>> To clarify: -0 here on G1 as default for 4.1 as well; I'd like us to prioritize digging into G1's behavior on small heaps vs. CMS w/our default tuning sooner rather than later. With that info I'd likely be a strong +1 on the shift.
>>>> 
>>>> -1 on switching to offheap_objects for 4.1 RC; again, think this is just a small step away from being a +1 w/some more rigor around seeing the current state of the technology's intersections.
>>>> 
>>>> On Wed, Nov 16, 2022, at 7:47 AM, Aleksey Yeshchenko wrote:
>>>>> All right. I’ll clarify then.
>>>>> 
>>>>> -0 on switching the default to G1 *this late* just before RC1.
>>>>> -1 on switching the default to offheap_objects *for 4.1 RC1*, but all for it in principle, for 4.2, after we run some more tests and resolve the concerns raised by Jeff.
>>>>> 
>>>>> Let’s please try to avoid this kind of super late defaults switch going forward?
>>>>> 
>>>>> —
>>>>> AY
>>>>> 
>>>>>> On 16 Nov 2022, at 03:27, Derek Chen-Becker <de...@chen-becker.org> wrote:
>>>>>> 
>>>>>> For the record, I'm +100 on G1. Take it with whatever sized grain of
>>>>>> salt you think appropriate for a relative newcomer to the list, but
>>>>>> I've spent my last 7-8 years dealing with the intersection of
>>>>>> high-throughput, low latency systems and their interaction with GC and
>>>>>> in my personal experience G1 outperforms CMS in all cases and with
>>>>>> significantly less work (zero work, in many cases). The only things
>>>>>> I've seen perform better *with a similar heap footprint* are GenShen
>>>>>> (currently experimental) and Rust (beyond the scope of this topic).
>>>>>> 
>>>>>> Derek
>>>>>> 
>>>>>> On Tue, Nov 15, 2022 at 4:51 PM Jon Haddad <ru...@apache.org> wrote:
>>>>>>> 
>>>>>>> I'm curious what it would take for folks to be OK with merging this into 4.1?  How much additional time would you want to feel comfortable?
>>>>>>> 
>>>>>>> I should probably have been a little more vigorous in my +1 of Mick's PR.  For a little background - I worked on several hundred clusters while at TLP, mostly dealing with stability and performance issues.  A lot of them stemmed partially or wholly from the GC settings we ship in the project. Par New with CMS and small new gen results in a lot of premature promotion leading to high pause times into the hundreds of ms which pushes p99 latency through the roof.
>>>>>>> 
>>>>>>> I'm a big +1 in favor of G1 because it's not just better for most people but it's better for _every_ new Cassandra user.  The first experience that people have with the project is important, and our current GC settings are quite bad - so bad they lead to problems with stability in production.  The G1 settings are mostly hands off, result in shorter pause times and are a big improvement over the status quo.
>>>>>>> 
>>>>>>> Most folks don't do GC tuning, they use what we supply, and what we currently supply leads to a poor initial experience with the database.  I think we owe the community our best effort even if it means pushing the release back a little bit.
>>>>>>> 
>>>>>>> Just for some additional context, we're (Netflix) running 25K nodes on G1 across a variety of hardware in AWS with wildly varying workloads, and I haven't seen G1 be the root cause of a problem even once.  The settings that Mick is proposing are almost identical to what we use (we use half of heap up to 30GB).
>>>>>>> 
>>>>>>> I'd really appreciate it if we took a second to consider the community effect of another release that ships settings that cause significant pain for our users.
>>>>>>> 
>>>>>>> Jon
>>>>>>> 
>>>>>>> On 2022/11/10 21:49:36 Mick Semb Wever wrote:
>>>>>>>>> 
>>>>>>>>> In the case of GC, reasonably extensive performance testing should be
>>>>>>>>> the expectation. Potentially revisiting some of the G1 params for the
>>>>>>>>> 4.1 reality - quite a lot has changed since those optional defaults
>>>>>>>>> were picked.
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I've put our battle-tested g1 opts (from consultants at TLP and DataStax)
>>>>>>>> in the patch for CASSANDRA-18027
>>>>>>>> 
>>>>>>>> In reality it is really not much of a change, g1 does make it simple.
>>>>>>>> Picking the correct ParallelGCThreads and ConcGCThreads and the floor to
>>>>>>>> the new heap (XX:NewSize) is still required, though we could do a much
>>>>>>>> better job of dynamic defaults to them.
>>>>>>>> 
>>>>>>>> Alex Dejanovski's blog is a starting point:
>>>>>>>> https://thelastpickle.com/blog/2020/06/29/cassandra_4-0_garbage_collectors_performance_benchmarks.html
>>>>>>>> where this gc opt set was used (though it doesn't prove why those options
>>>>>>>> are chosen)
>>>>>>>> 
>>>>>>>> The bar for objection to sneaking these into 4.1 was intended to be low,
>>>>>>>> and I stand by those that raise concerns.
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> +---------------------------------------------------------------+
>>>>>> | Derek Chen-Becker                                             |
>>>>>> | GPG Key available at https://keybase.io/dchenbecker and       |
>>>>>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>>>>>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
>>>>>> +---------------------------------------------------------------+
>>>>> 
>>>>> 
>>>> 
>> 
> 
> 
> --
> +---------------------------------------------------------------+
> | Derek Chen-Becker                                             |
> | GPG Key available at https://keybase.io/dchenbecker and       |
> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> +---------------------------------------------------------------+

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Jeff Jirsa <jj...@gmail.com>.

On Thu, Nov 17, 2022 at 12:47 PM J. D. Jordan <je...@gmail.com>
wrote:

> -1 on providing a bunch of choices and forcing users to pick one. We
> should have a default and it should be “good enough” for most people. The
> people who want to dig in and try other gc settings can still do it, and we
> could provide them some profiles to start from, but there needs to be a
> default.  We need to be asking new operators less questions on install, not
> more.
>
> Re:experience with Shenandoah under high load, I have in the past seen the
> exact same thing for both Shenandoah and ZGC. Both of them have issues at
> high loads while performing great at moderate loads. I have not seen G1
> ever have such issues. So I would not be fine with a switch to Shenandoah
> or ZGC as the default without extensive testing on current JVM versions
> that have hopefully improved the behavior under load
>


I have personally reverted hundreds of machines off of G1 with 12G heaps on
jdk8, where (intelligently tuned) CMS with the same workload/heap size was
fine.

It was many years ago, and G1 has changed a lot, but "zero problems
with G1" is really AT LEAST 1 problem with G1, reported by someone who knew
how both Cassandra and the JVM work.

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Derek Chen-Becker <de...@chen-becker.org>.

On Thu, Nov 17, 2022 at 2:01 PM Josh McKenzie <jm...@apache.org> wrote:
> 3) Expert: Leave me alone. I tune my own GC

This is increasingly not a thing. I haven't looked at ZGC, but G1 and
Shenandoah provide a lot of knobs...that the collector will happily
ignore if it decides it knows better :)

-- 
+---------------------------------------------------------------+
| Derek Chen-Becker                                             |
| GPG Key available at https://keybase.io/dchenbecker and       |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
+---------------------------------------------------------------+

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Josh McKenzie <jm...@apache.org>.

> -1 on providing a bunch of choices and forcing users to pick one. We should have a default and it should be “good enough” for most people.
These are 2 different things (providing choices and whether we provide a default).

Sounds like you're against both not having a default *and* providing choices independently; I assume you're not in favor of having something "good enough" as the default but also providing other tuning options should operators be interested in testing them out?

I could see there being potentially 3 tiers of operator expertise / interest in this space:
1) No interest. Give me a good enough default; I don't want to think about this.
2) Moderate expertise. Give me a one line config change where I can bounce 3 nodes in a replica set to 3 different pre-configured profiles and see how it works for my workloads and pick one.
3) Expert: Leave me alone. I tune my own GC

So the above is possibly moot if we don't have the resources on the project to *test and provide* alternative GC profiles, but it sounds to me like we're not actually short on differently tuned GC config but are instead butting up against timing relative to release + view on what the right default should be.
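
To illustrate what tier 2 could look like, here is a small sketch of a one-setting profile lookup. Everything in it is hypothetical: the profile keys, the default, and the function name are invented for this example, and the file names are simply the ones floated earlier in this thread, not files that ship with Cassandra.

```python
# Hypothetical profile names mapped to the options files suggested in the thread.
GC_PROFILES = {
    "cms-write": "jvm11-CMS-write.options",
    "cms-mixed": "jvm11-CMS-mixed.options",
    "cms-read": "jvm11-CMS-read.options",
    "g1": "jvm11-G1.options",
    "zgc": "jvm11-ZGC.options",
    "shen": "jvm11-Shen.options",
}

# Tier 1 operators never set anything and get this (an assumed default).
DEFAULT_PROFILE = "g1"


def resolve_gc_profile(name=None):
    """Map a profile name from the main config to an options file name."""
    key = (name or DEFAULT_PROFILE).lower()
    try:
        return GC_PROFILES[key]
    except KeyError:
        known = ", ".join(sorted(GC_PROFILES))
        raise ValueError("unknown GC profile %r (known: %s)" % (name, known)) from None


print(resolve_gc_profile())       # default profile
print(resolve_gc_profile("ZGC"))  # explicit opt-in, case-insensitive
```

Failing loudly on an unknown name keeps a typo in the config from silently falling back to the default collector.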

On Thu, Nov 17, 2022, at 3:47 PM, J. D. Jordan wrote:
> -1 on providing a bunch of choices and forcing users to pick one. We should have a default and it should be “good enough” for most people. The people who want to dig in and try other gc settings can still do it, and we could provide them some profiles to start from, but there needs to be a default.  We need to be asking new operators less questions on install, not more.
> 
> Re:experience with Shenandoah under high load, I have in the past seen the exact same thing for both Shenandoah and ZGC. Both of them have issues at high loads while performing great at moderate loads. I have not seen G1 ever have such issues. So I would not be fine with a switch to Shenandoah or ZGC as the default without extensive testing on current JVM versions that have hopefully improved the behavior under load.
> 
> > On Nov 17, 2022, at 9:39 AM, Joseph Lynch <jo...@gmail.com> wrote:
> > It seems like this is a choice most users might not know how to make?
> > 
> > On Thu, Nov 17, 2022 at 7:06 AM Josh McKenzie <jm...@apache.org> wrote:
> >> 
> >> Have we ever discussed including multiple profiles that are simple to swap between and documented for their tested / intended use cases?
> >> 
> >> Then the burden of having a “sane” default for the wild variance of workloads people use it for would be somewhat mitigated. Sure, there’s always going to be folks that run the default and never think to change it but the UX could be as simple as a one line config change to swap between GC profiles and we could add and deprecate / remove over time.
> >> 
> >> Concretely, having config files such as:
> >> 
> >> jvm11-CMS-write.options
> >> jvm11-CMS-mixed.options
> >> jvm11-CMS-read.options
> >> jvm11-G1.options
> >> jvm11-ZGC.options
> >> jvm11-Shen.options
> >> 
> >> 
> >> Arguably we could take it a step further and not actually allow a C* node to startup without pointing to one of the config files from your primary config, and provide a clean mechanism to integrate that selection on headless installs.
> >> 
> >> Notably, this could be a terrible idea. But it does seem like we keep butting up against the complexity and mixed pressures of having the One True Way to GC via the default config and the lift to change that.
> >> 
> >> On Wed, Nov 16, 2022, at 9:49 PM, Derek Chen-Becker wrote:
> >> 
> >> I'm fine with not including G1 in 4.1, but would we consider inclusion
> >> for 4.1.X down the road once validation has been done?
> >> 
> >> Derek
> >> 
> >> 
> >> On Wed, Nov 16, 2022 at 4:39 PM David Capwell <dc...@apple.com> wrote:
> >>> Getting poked in Slack to be more explicit in this thread…
> >>> Switching to G1 on trunk, +1
> >>> Switching to G1 on 4.1, -1.  4.1 is about to be released and this isn’t a bug fix but a perf improvement ticket, and as such it should go through validation that the perf improvements are seen. There is not enough time left for that added performance work, so I strongly feel it should be pushed to 4.2/5.0, where it has plenty of time to be validated.  The ticket even asks to avoid validating the claims, saying 'Hoping we can skip due diligence on this ticket because the data is "in the past" already'.  Others have attempted both Shenandoah and ZGC and found mixed results, so nothing leads me to believe that won’t be true here either.
> >>>> On Nov 16, 2022, at 9:15 AM, J. D. Jordan <je...@gmail.com> wrote:
> >>>> Heap -
> >>>> +1 for G1 in trunk
> >>>> +0 for G1 in 4.1 - I think it’s worthwhile and fairly well tested but I understand pushback against changing this so late in the game.
> >>>> Memtable -
> >>>> -1 for off heap in 4.1. I think this needs more testing and isn’t something to change at the last minute.
> >>>> +1 for running performance/fuzz tests against the alternate memtable choices in trunk and switching if they don’t show regressions.
> >>>>> On Nov 16, 2022, at 10:48 AM, Josh McKenzie <jm...@apache.org> wrote:
> >>>>> 
> >>>>> To clarify: -0 here on G1 as default for 4.1 as well; I'd like us to prioritize digging into G1's behavior on small heaps vs. CMS w/our default tuning sooner rather than later. With that info I'd likely be a strong +1 on the shift.
> >>>>> -1 on switching to offheap_objects for 4.1 RC; again, think this is just a small step away from being a +1 w/some more rigor around seeing the current state of the technology's intersections.
> >>>>> On Wed, Nov 16, 2022, at 7:47 AM, Aleksey Yeshchenko wrote:
> >>>>>> All right. I’ll clarify then.
> >>>>>> -0 on switching the default to G1 *this late* just before RC1.
> >>>>>> -1 on switching the default to offheap_objects *for 4.1 RC1*, but all for it in principle, for 4.2, after we run some more tests and resolve the concerns raised by Jeff.
> >>>>>> Let’s please try to avoid this kind of super late defaults switch going forward?
> >>>>>> —
> >>>>>> AY
> >>>>>>> On 16 Nov 2022, at 03:27, Derek Chen-Becker <de...@chen-becker.org> wrote:
> >>>>>>> For the record, I'm +100 on G1. Take it with whatever sized grain of
> >>>>>>> salt you think appropriate for a relative newcomer to the list, but
> >>>>>>> I've spent my last 7-8 years dealing with the intersection of
> >>>>>>> high-throughput, low latency systems and their interaction with GC and
> >>>>>>> in my personal experience G1 outperforms CMS in all cases and with
> >>>>>>> significantly less work (zero work, in many cases). The only things
> >>>>>>> I've seen perform better *with a similar heap footprint* are GenShen
> >>>>>>> (currently experimental) and Rust (beyond the scope of this topic).
> >>>>>>> Derek
> >>>>>>> On Tue, Nov 15, 2022 at 4:51 PM Jon Haddad <ru...@apache.org> wrote:
> >>>>>>>> I'm curious what it would take for folks to be OK with merging this into 4.1?  How much additional time would you want to feel comfortable?
> >>>>>>>> I should probably have been a little more vigorous in my +1 of Mick's PR.  For a little background - I worked on several hundred clusters while at TLP, mostly dealing with stability and performance issues.  A lot of them stemmed partially or wholly from the GC settings we ship in the project. Par New with CMS and small new gen results in a lot of premature promotion leading to high pause times into the hundreds of ms which pushes p99 latency through the roof.
> >>>>>>>> I'm a big +1 in favor of G1 because it's not just better for most people but it's better for _every_ new Cassandra user.  The first experience that people have with the project is important, and our current GC settings are quite bad - so bad they lead to problems with stability in production.  The G1 settings are mostly hands off, result in shorter pause times and are a big improvement over the status quo.
> >>>>>>>> Most folks don't do GC tuning; they use what we supply, and what we currently supply leads to a poor initial experience with the database.  I think we owe the community our best effort even if it means pushing the release back a little bit.
> >>>>>>>> Just for some additional context, we're (Netflix) running 25K nodes on G1 across a variety of hardware in AWS with wildly varying workloads, and I haven't seen G1 be the root cause of a problem even once.  The settings that Mick is proposing are almost identical to what we use (we use half of heap up to 30GB).
> >>>>>>>> I'd really appreciate it if we took a second to consider the community effect of another release that ships settings that cause significant pain for our users.
> >>>>>>>> Jon
> >>>>>>>> On 2022/11/10 21:49:36 Mick Semb Wever wrote:
> >>>>>>>>>> In the case of GC, reasonably extensive performance testing should be the
> >>>>>>>>>> expectation. Potentially revisiting some of the G1 params for the 4.1
> >>>>>>>>>> reality - quite a lot has changed since those optional defaults were
> >>>>>>>>>> picked.
> >>>>>>>>> I've put our battle-tested g1 opts (from consultants at TLP and DataStax)
> >>>>>>>>> in the patch for CASSANDRA-18027
> >>>>>>>>> In reality it is really not much of a change, g1 does make it simple.
> >>>>>>>>> Picking the correct ParallelGCThreads and ConcGCThreads and the floor to
> >>>>>>>>> the new heap (XX:NewSize) is still required, though we could do a much
> >>>>>>>>> better job of dynamic defaults to them.
> >>>>>>>>> Alex Dejanovski's blog is a starting point:
> >>>>>>>>> https://thelastpickle.com/blog/2020/06/29/cassandra_4-0_garbage_collectors_performance_benchmarks.html
> >>>>>>>>> where this gc opt set was used (though it doesn't prove why those options
> >>>>>>>>> are chosen)
> >>>>>>>>> The bar for objection to sneaking these into 4.1 was intended to be low,
> >>>>>>>>> and I stand by those that raise concerns.
> >>>>>>> --
> >>>>>>> +---------------------------------------------------------------+
> >>>>>>> | Derek Chen-Becker                                             |
> >>>>>>> | GPG Key available at https://keybase.io/dchenbecker and       |
> >>>>>>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> >>>>>>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> >>>>>>> +---------------------------------------------------------------+
>

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Derek Chen-Becker <de...@chen-becker.org>.

I wouldn't recommend Shenandoah or ZGC, period. They're not designed
for the kind of workload you'll typically see running a database (high
throughput of objects that don't tenure) and both will fall over in
interesting ways under high allocation rate. GenShen is intended to
combine the generational goodness of G1 with the concurrent collection
of Shenandoah, and will likely perform better than G1 in terms of
pause time and better than Shenandoah for allocation rate and heap
utilization. GenShen, however, is experimental. Right now I would say
G1 is the best collector generally available.

In terms of providing data (beyond anecdotes), do we even agree on
what the baseline load test looks like? Are we going off of something
that's in dtest, or do we have a defined benchmarking suite somewhere?

Cheers,

Derek

On Thu, Nov 17, 2022 at 1:47 PM J. D. Jordan <je...@gmail.com> wrote:
>
> -1 on providing a bunch of choices and forcing users to pick one. We should have a default and it should be “good enough” for most people. The people who want to dig in and try other gc settings can still do it, and we could provide them some profiles to start from, but there needs to be a default.  We need to be asking new operators fewer questions on install, not more.
>
> Re: experience with Shenandoah under high load, I have in the past seen the exact same thing for both Shenandoah and ZGC. Both of them have issues at high loads while performing great at moderate loads. I have not seen G1 ever have such issues. So I would not be fine with a switch to Shenandoah or ZGC as the default without extensive testing on current JVM versions that have hopefully improved the behavior under load.
>
> > On Nov 17, 2022, at 9:39 AM, Joseph Lynch <jo...@gmail.com> wrote:
> > It seems like this is a choice most users might not know how to make?
> >
> > On Thu, Nov 17, 2022 at 7:06 AM Josh McKenzie <jm...@apache.org> wrote:
> >>
> >> Have we ever discussed including multiple profiles that are simple to swap between and documented for their tested / intended use cases?
> >>
> >> Then the burden of having a “sane” default for the wild variance of workloads people use it for would be somewhat mitigated. Sure, there’s always going to be folks that run the default and never think to change it but the UX could be as simple as a one line config change to swap between GC profiles and we could add and deprecate / remove over time.
> >>
> >> Concretely, having config files such as:
> >>
> >> jvm11-CMS-write.options
> >> jvm11-CMS-mixed.options
> >> jvm11-CMS-read.options
> >> jvm11-G1.options
> >> jvm11-ZGC.options
> >> jvm11-Shen.options
> >>
> >>
> >> Arguably we could take it a step further and not actually allow a C* node to startup without pointing to one of the config files from your primary config, and provide a clean mechanism to integrate that selection on headless installs.
> >>
> >> Notably, this could be a terrible idea. But it does seem like we keep butting up against the complexity and mixed pressures of having the One True Way to GC via the default config and the lift to change that.
> >>
> >> On Wed, Nov 16, 2022, at 9:49 PM, Derek Chen-Becker wrote:
> >>
> >> I'm fine with not including G1 in 4.1, but would we consider inclusion
> >> for 4.1.X down the road once validation has been done?
> >>
> >> Derek
> >>
> >>
> >> On Wed, Nov 16, 2022 at 4:39 PM David Capwell <dc...@apple.com> wrote:
> >>> Getting poked in Slack to be more explicit in this thread…
> >>> Switching to G1 on trunk, +1
> >>> Switching to G1 on 4.1, -1.  4.1 is about to be released, and this isn’t a bug fix but a perf improvement ticket; as such it should go through validation that the perf improvements are actually seen.  There is not enough time left for that added performance work, so I strongly feel it should be pushed to 4.2/5.0, where there is plenty of time to validate it.  The ticket even asks to avoid validating the claims, saying 'Hoping we can skip due diligence on this ticket because the data is "in the past" already'.  Others have attempted both Shenandoah and ZGC and found mixed results, so nothing leads me to believe that won’t be true here either.
> >>>> On Nov 16, 2022, at 9:15 AM, J. D. Jordan <je...@gmail.com> wrote:
> >>>> Heap -
> >>>> +1 for G1 in trunk
> >>>> +0 for G1 in 4.1 - I think it’s worthwhile and fairly well tested but I understand pushback against changing this so late in the game.
> >>>> Memtable -
> >>>> -1 for off heap in 4.1. I think this needs more testing and isn’t something to change at the last minute.
> >>>> +1 for running performance/fuzz tests against the alternate memtable choices in trunk and switching if they don’t show regressions.
> >>>>> On Nov 16, 2022, at 10:48 AM, Josh McKenzie <jm...@apache.org> wrote:
> >>>>> 
> >>>>> To clarify: -0 here on G1 as default for 4.1 as well; I'd like us to prioritize digging into G1's behavior on small heaps vs. CMS w/our default tuning sooner rather than later. With that info I'd likely be a strong +1 on the shift.
> >>>>> -1 on switching to offheap_objects for 4.1 RC; again, think this is just a small step away from being a +1 w/some more rigor around seeing the current state of the technology's intersections.
> >>>>> On Wed, Nov 16, 2022, at 7:47 AM, Aleksey Yeshchenko wrote:
> >>>>>> All right. I’ll clarify then.
> >>>>>> -0 on switching the default to G1 *this late* just before RC1.
> >>>>>> -1 on switching the default to offheap_objects *for 4.1 RC1*, but all for it in principle, for 4.2, after we run some more tests and resolve the concerns raised by Jeff.
> >>>>>> Let’s please try to avoid this kind of super late defaults switch going forward?
> >>>>>> —
> >>>>>> AY
> >>>>>>> On 16 Nov 2022, at 03:27, Derek Chen-Becker <de...@chen-becker.org> wrote:
> >>>>>>> For the record, I'm +100 on G1. Take it with whatever sized grain of
> >>>>>>> salt you think appropriate for a relative newcomer to the list, but
> >>>>>>> I've spent my last 7-8 years dealing with the intersection of
> >>>>>>> high-throughput, low latency systems and their interaction with GC and
> >>>>>>> in my personal experience G1 outperforms CMS in all cases and with
> >>>>>>> significantly less work (zero work, in many cases). The only things
> >>>>>>> I've seen perform better *with a similar heap footprint* are GenShen
> >>>>>>> (currently experimental) and Rust (beyond the scope of this topic).
> >>>>>>> Derek
> >>>>>>> On Tue, Nov 15, 2022 at 4:51 PM Jon Haddad <ru...@apache.org> wrote:
> >>>>>>>> I'm curious what it would take for folks to be OK with merging this into 4.1?  How much additional time would you want to feel comfortable?
> >>>>>>>> I should probably have been a little more vigorous in my +1 of Mick's PR.  For a little background - I worked on several hundred clusters while at TLP, mostly dealing with stability and performance issues.  A lot of them stemmed partially or wholly from the GC settings we ship in the project. Par New with CMS and small new gen results in a lot of premature promotion leading to high pause times into the hundreds of ms which pushes p99 latency through the roof.
> >>>>>>>> I'm a big +1 in favor of G1 because it's not just better for most people but it's better for _every_ new Cassandra user.  The first experience that people have with the project is important, and our current GC settings are quite bad - so bad they lead to problems with stability in production.  The G1 settings are mostly hands off, result in shorter pause times and are a big improvement over the status quo.
> >>>>>>>> Most folks don't do GC tuning; they use what we supply, and what we currently supply leads to a poor initial experience with the database.  I think we owe the community our best effort even if it means pushing the release back a little bit.
> >>>>>>>> Just for some additional context, we're (Netflix) running 25K nodes on G1 across a variety of hardware in AWS with wildly varying workloads, and I haven't seen G1 be the root cause of a problem even once.  The settings that Mick is proposing are almost identical to what we use (we use half of heap up to 30GB).
> >>>>>>>> I'd really appreciate it if we took a second to consider the community effect of another release that ships settings that cause significant pain for our users.
> >>>>>>>> Jon
> >>>>>>>> On 2022/11/10 21:49:36 Mick Semb Wever wrote:
> >>>>>>>>>> In the case of GC, reasonably extensive performance testing should be the
> >>>>>>>>>> expectation. Potentially revisiting some of the G1 params for the 4.1
> >>>>>>>>>> reality - quite a lot has changed since those optional defaults were
> >>>>>>>>>> picked.
> >>>>>>>>> I've put our battle-tested g1 opts (from consultants at TLP and DataStax)
> >>>>>>>>> in the patch for CASSANDRA-18027
> >>>>>>>>> In reality it is really not much of a change, g1 does make it simple.
> >>>>>>>>> Picking the correct ParallelGCThreads and ConcGCThreads and the floor to
> >>>>>>>>> the new heap (XX:NewSize) is still required, though we could do a much
> >>>>>>>>> better job of dynamic defaults to them.
> >>>>>>>>> Alex Dejanovski's blog is a starting point:
> >>>>>>>>> https://thelastpickle.com/blog/2020/06/29/cassandra_4-0_garbage_collectors_performance_benchmarks.html
> >>>>>>>>> where this gc opt set was used (though it doesn't prove why those options
> >>>>>>>>> are chosen)
> >>>>>>>>> The bar for objection to sneaking these into 4.1 was intended to be low,
> >>>>>>>>> and I stand by those that raise concerns.



-- 
+---------------------------------------------------------------+
| Derek Chen-Becker                                             |
| GPG Key available at https://keybase.io/dchenbecker and       |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
+---------------------------------------------------------------+

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by "J. D. Jordan" <je...@gmail.com>.

-1 on providing a bunch of choices and forcing users to pick one. We should have a default and it should be “good enough” for most people. The people who want to dig in and try other gc settings can still do it, and we could provide them some profiles to start from, but there needs to be a default.  We need to be asking new operators fewer questions on install, not more.

Re: experience with Shenandoah under high load, I have in the past seen the exact same thing for both Shenandoah and ZGC. Both of them have issues at high loads while performing great at moderate loads. I have not seen G1 ever have such issues. So I would not be fine with a switch to Shenandoah or ZGC as the default without extensive testing on current JVM versions that have hopefully improved the behavior under load.

> On Nov 17, 2022, at 9:39 AM, Joseph Lynch <jo...@gmail.com> wrote:
> It seems like this is a choice most users might not know how to make?
> 
> On Thu, Nov 17, 2022 at 7:06 AM Josh McKenzie <jm...@apache.org> wrote:
>> 
>> Have we ever discussed including multiple profiles that are simple to swap between and documented for their tested / intended use cases?
>> 
>> Then the burden of having a “sane” default for the wild variance of workloads people use it for would be somewhat mitigated. Sure, there’s always going to be folks that run the default and never think to change it but the UX could be as simple as a one line config change to swap between GC profiles and we could add and deprecate / remove over time.
>> 
>> Concretely, having config files such as:
>> 
>> jvm11-CMS-write.options
>> jvm11-CMS-mixed.options
>> jvm11-CMS-read.options
>> jvm11-G1.options
>> jvm11-ZGC.options
>> jvm11-Shen.options
>> 
>> 
>> Arguably we could take it a step further and not actually allow a C* node to startup without pointing to one of the config files from your primary config, and provide a clean mechanism to integrate that selection on headless installs.
>> 
>> Notably, this could be a terrible idea. But it does seem like we keep butting up against the complexity and mixed pressures of having the One True Way to GC via the default config and the lift to change that.
>> 
>> On Wed, Nov 16, 2022, at 9:49 PM, Derek Chen-Becker wrote:
>> 
>> I'm fine with not including G1 in 4.1, but would we consider inclusion
>> for 4.1.X down the road once validation has been done?
>> 
>> Derek
>> 
>> 
>> On Wed, Nov 16, 2022 at 4:39 PM David Capwell <dc...@apple.com> wrote:
>>> Getting poked in Slack to be more explicit in this thread…
>>> Switching to G1 on trunk, +1
>>> Switching to G1 on 4.1, -1.  4.1 is about to be released, and this isn’t a bug fix but a perf improvement ticket; as such it should go through validation that the perf improvements are actually seen.  There is not enough time left for that added performance work, so I strongly feel it should be pushed to 4.2/5.0, where there is plenty of time to validate it.  The ticket even asks to avoid validating the claims, saying 'Hoping we can skip due diligence on this ticket because the data is "in the past" already'.  Others have attempted both Shenandoah and ZGC and found mixed results, so nothing leads me to believe that won’t be true here either.
>>>> On Nov 16, 2022, at 9:15 AM, J. D. Jordan <je...@gmail.com> wrote:
>>>> Heap -
>>>> +1 for G1 in trunk
>>>> +0 for G1 in 4.1 - I think it’s worthwhile and fairly well tested but I understand pushback against changing this so late in the game.
>>>> Memtable -
>>>> -1 for off heap in 4.1. I think this needs more testing and isn’t something to change at the last minute.
>>>> +1 for running performance/fuzz tests against the alternate memtable choices in trunk and switching if they don’t show regressions.
>>>>> On Nov 16, 2022, at 10:48 AM, Josh McKenzie <jm...@apache.org> wrote:
>>>>> 
>>>>> To clarify: -0 here on G1 as default for 4.1 as well; I'd like us to prioritize digging into G1's behavior on small heaps vs. CMS w/our default tuning sooner rather than later. With that info I'd likely be a strong +1 on the shift.
>>>>> -1 on switching to offheap_objects for 4.1 RC; again, think this is just a small step away from being a +1 w/some more rigor around seeing the current state of the technology's intersections.
>>>>> On Wed, Nov 16, 2022, at 7:47 AM, Aleksey Yeshchenko wrote:
>>>>>> All right. I’ll clarify then.
>>>>>> -0 on switching the default to G1 *this late* just before RC1.
>>>>>> -1 on switching the default to offheap_objects *for 4.1 RC1*, but all for it in principle, for 4.2, after we run some more tests and resolve the concerns raised by Jeff.
>>>>>> Let’s please try to avoid this kind of super late defaults switch going forward?
>>>>>> —
>>>>>> AY
>>>>>>> On 16 Nov 2022, at 03:27, Derek Chen-Becker <de...@chen-becker.org> wrote:
>>>>>>> For the record, I'm +100 on G1. Take it with whatever sized grain of
>>>>>>> salt you think appropriate for a relative newcomer to the list, but
>>>>>>> I've spent my last 7-8 years dealing with the intersection of
>>>>>>> high-throughput, low latency systems and their interaction with GC and
>>>>>>> in my personal experience G1 outperforms CMS in all cases and with
>>>>>>> significantly less work (zero work, in many cases). The only things
>>>>>>> I've seen perform better *with a similar heap footprint* are GenShen
>>>>>>> (currently experimental) and Rust (beyond the scope of this topic).
>>>>>>> Derek
>>>>>>> On Tue, Nov 15, 2022 at 4:51 PM Jon Haddad <ru...@apache.org> wrote:
>>>>>>>> I'm curious what it would take for folks to be OK with merging this into 4.1?  How much additional time would you want to feel comfortable?
>>>>>>>> I should probably have been a little more vigorous in my +1 of Mick's PR.  For a little background - I worked on several hundred clusters while at TLP, mostly dealing with stability and performance issues.  A lot of them stemmed partially or wholly from the GC settings we ship in the project. Par New with CMS and small new gen results in a lot of premature promotion leading to high pause times into the hundreds of ms which pushes p99 latency through the roof.
>>>>>>>> I'm a big +1 in favor of G1 because it's not just better for most people but it's better for _every_ new Cassandra user.  The first experience that people have with the project is important, and our current GC settings are quite bad - so bad they lead to problems with stability in production.  The G1 settings are mostly hands off, result in shorter pause times and are a big improvement over the status quo.
>>>>>>>> Most folks don't do GC tuning; they use what we supply, and what we currently supply leads to a poor initial experience with the database.  I think we owe the community our best effort even if it means pushing the release back a little bit.
>>>>>>>> Just for some additional context, we're (Netflix) running 25K nodes on G1 across a variety of hardware in AWS with wildly varying workloads, and I haven't seen G1 be the root cause of a problem even once.  The settings that Mick is proposing are almost identical to what we use (we use half of heap up to 30GB).
>>>>>>>> I'd really appreciate it if we took a second to consider the community effect of another release that ships settings that cause significant pain for our users.
>>>>>>>> Jon
>>>>>>>> On 2022/11/10 21:49:36 Mick Semb Wever wrote:
>>>>>>>>>> In the case of GC, reasonably extensive performance testing should be the
>>>>>>>>>> expectation. Potentially revisiting some of the G1 params for the 4.1
>>>>>>>>>> reality - quite a lot has changed since those optional defaults were
>>>>>>>>>> picked.
>>>>>>>>> I've put our battle-tested g1 opts (from consultants at TLP and DataStax)
>>>>>>>>> in the patch for CASSANDRA-18027
>>>>>>>>> In reality it is really not much of a change, g1 does make it simple.
>>>>>>>>> Picking the correct ParallelGCThreads and ConcGCThreads and the floor to
>>>>>>>>> the new heap (XX:NewSize) is still required, though we could do a much
>>>>>>>>> better job of dynamic defaults to them.
>>>>>>>>> Alex Dejanovski's blog is a starting point:
>>>>>>>>> https://thelastpickle.com/blog/2020/06/29/cassandra_4-0_garbage_collectors_performance_benchmarks.html
>>>>>>>>> where this gc opt set was used (though it doesn't prove why those options
>>>>>>>>> are chosen)
>>>>>>>>> The bar for objection to sneaking these into 4.1 was intended to be low,
>>>>>>>>> and I stand by those that raise concerns.

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Joseph Lynch <jo...@gmail.com>.

It seems like this is a choice most users might not know how to make?

On Thu, Nov 17, 2022 at 7:06 AM Josh McKenzie <jm...@apache.org> wrote:
>
> Have we ever discussed including multiple profiles that are simple to swap between and documented for their tested / intended use cases?
>
> Then the burden of having a “sane” default for the wild variance of workloads people use it for would be somewhat mitigated. Sure, there’s always going to be folks that run the default and never think to change it but the UX could be as simple as a one line config change to swap between GC profiles and we could add and deprecate / remove over time.
>
> Concretely, having config files such as:
>
> jvm11-CMS-write.options
> jvm11-CMS-mixed.options
> jvm11-CMS-read.options
> jvm11-G1.options
> jvm11-ZGC.options
> jvm11-Shen.options
>
>
> Arguably we could take it a step further and not actually allow a C* node to startup without pointing to one of the config files from your primary config, and provide a clean mechanism to integrate that selection on headless installs.
>
> Notably, this could be a terrible idea. But it does seem like we keep butting up against the complexity and mixed pressures of having the One True Way to GC via the default config and the lift to change that.
>
> On Wed, Nov 16, 2022, at 9:49 PM, Derek Chen-Becker wrote:
>
> I'm fine with not including G1 in 4.1, but would we consider inclusion
> for 4.1.X down the road once validation has been done?
>
> Derek
>
>
> On Wed, Nov 16, 2022 at 4:39 PM David Capwell <dc...@apple.com> wrote:
> >
> > Getting poked in Slack to be more explicit in this thread…
> >
> > Switching to G1 on trunk, +1
> > Switching to G1 on 4.1, -1.  4.1 is about to be released, and this isn’t a bug fix but a perf improvement ticket; as such it should go through validation that the perf improvements are actually seen.  There is not enough time left for that added performance work, so I strongly feel it should be pushed to 4.2/5.0, where there is plenty of time to validate it.  The ticket even asks to avoid validating the claims, saying 'Hoping we can skip due diligence on this ticket because the data is "in the past" already'.  Others have attempted both Shenandoah and ZGC and found mixed results, so nothing leads me to believe that won’t be true here either.
> >
> > > On Nov 16, 2022, at 9:15 AM, J. D. Jordan <je...@gmail.com> wrote:
> > >
> > > Heap -
> > > +1 for G1 in trunk
> > > +0 for G1 in 4.1 - I think it’s worthwhile and fairly well tested but I understand pushback against changing this so late in the game.
> > >
> > > Memtable -
> > > -1 for off heap in 4.1. I think this needs more testing and isn’t something to change at the last minute.
> > > +1 for running performance/fuzz tests against the alternate memtable choices in trunk and switching if they don’t show regressions.
> > >
> > >> On Nov 16, 2022, at 10:48 AM, Josh McKenzie <jm...@apache.org> wrote:
> > >>
> > >> 
> > >> To clarify: -0 here on G1 as default for 4.1 as well; I'd like us to prioritize digging into G1's behavior on small heaps vs. CMS w/our default tuning sooner rather than later. With that info I'd likely be a strong +1 on the shift.
> > >>
> > >> -1 on switching to offheap_objects for 4.1 RC; again, think this is just a small step away from being a +1 w/some more rigor around seeing the current state of the technology's intersections.
> > >>
> > >> On Wed, Nov 16, 2022, at 7:47 AM, Aleksey Yeshchenko wrote:
> > >>> All right. I’ll clarify then.
> > >>>
> > >>> -0 on switching the default to G1 *this late* just before RC1.
> > >>> -1 on switching the default to offheap_objects *for 4.1 RC1*, but all for it in principle, for 4.2, after we run some more tests and resolve the concerns raised by Jeff.
> > >>>
> > >>> Let’s please try to avoid this kind of super late defaults switch going forward?
> > >>>
> > >>> —
> > >>> AY
> > >>>
> > >>> > On 16 Nov 2022, at 03:27, Derek Chen-Becker <de...@chen-becker.org> wrote:
> > >>> >
> > >>> > For the record, I'm +100 on G1. Take it with whatever sized grain of
> > >>> > salt you think appropriate for a relative newcomer to the list, but
> > >>> > I've spent my last 7-8 years dealing with the intersection of
> > >>> > high-throughput, low latency systems and their interaction with GC and
> > >>> > in my personal experience G1 outperforms CMS in all cases and with
> > >>> > significantly less work (zero work, in many cases). The only things
> > >>> > I've seen perform better *with a similar heap footprint* are GenShen
> > >>> > (currently experimental) and Rust (beyond the scope of this topic).
> > >>> >
> > >>> > Derek
> > >>> >
> > >>> > On Tue, Nov 15, 2022 at 4:51 PM Jon Haddad <ru...@apache.org> wrote:
> > >>> >>
> > >>> >> I'm curious what it would take for folks to be OK with merging this into 4.1?  How much additional time would you want to feel comfortable?
> > >>> >>
> > >>> >> I should probably have been a little more vigorous in my +1 of Mick's PR.  For a little background - I worked on several hundred clusters while at TLP, mostly dealing with stability and performance issues.  A lot of them stemmed partially or wholly from the GC settings we ship in the project. Par New with CMS and small new gen results in a lot of premature promotion leading to high pause times into the hundreds of ms which pushes p99 latency through the roof.
> > >>> >>
> > >>> >> I'm a big +1 in favor of G1 because it's not just better for most people but it's better for _every_ new Cassandra user.  The first experience that people have with the project is important, and our current GC settings are quite bad - so bad they lead to problems with stability in production.  The G1 settings are mostly hands off, result in shorter pause times and are a big improvement over the status quo.
> > >>> >>
> > >>> >> Most folks don't do GC tuning; they use what we supply, and what we currently supply leads to a poor initial experience with the database.  I think we owe the community our best effort even if it means pushing the release back a little bit.
> > >>> >>
> > >>> >> Just for some additional context, we're (Netflix) running 25K nodes on G1 across a variety of hardware in AWS with wildly varying workloads, and I haven't seen G1 be the root cause of a problem even once.  The settings that Mick is proposing are almost identical to what we use (we use half of heap up to 30GB).
> > >>> >>
> > >>> >> I'd really appreciate it if we took a second to consider the community effect of another release that ships settings that cause significant pain for our users.
> > >>> >>
> > >>> >> Jon
> > >>> >>
> > >>> >> On 2022/11/10 21:49:36 Mick Semb Wever wrote:
> > >>> >>>>
> > >>> >>>> In the case of GC, reasonably extensive performance testing should be the
> > >>> >>>> expectation. Potentially revisiting some of the G1 params for the 4.1
> > >>> >>>> reality - quite a lot has changed since those optional defaults were
> > >>> >>>> picked.
> > >>> >>>>
> > >>> >>>
> > >>> >>>
> > >>> >>> I've put our battle-tested g1 opts (from consultants at TLP and DataStax)
> > >>> >>> in the patch for CASSANDRA-18027
> > >>> >>>
> > >>> >>> In reality it is really not much of a change, g1 does make it simple.
> > >>> >>> Picking the correct ParallelGCThreads and ConcGCThreads and the floor to
> > >>> >>> the new heap (XX:NewSize) is still required, though we could do a much
> > >>> >>> better job of dynamic defaults to them.
> > >>> >>>
> > >>> >>> Alex Dejanovski's blog is a starting point:
> > >>> >>> https://thelastpickle.com/blog/2020/06/29/cassandra_4-0_garbage_collectors_performance_benchmarks.html
> > >>> >>> where this gc opt set was used (though it doesn't prove why those options
> > >>> >>> are chosen)
> > >>> >>>
> > >>> >>> The bar for objection to sneaking these into 4.1 was intended to be low,
> > >>> >>> and I stand by those that raise concerns.
> > >>> >>>
> > >>> >
> > >>> >
> > >>> >
> > >>> > --
> > >>> > +---------------------------------------------------------------+
> > >>> > | Derek Chen-Becker                                             |
> > >>> > | GPG Key available at https://keybase.io/dchenbecker and       |
> > >>> > | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> > >>> > | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> > >>> > +---------------------------------------------------------------+
> > >>>
> > >>>
> > >>
> >
>
>
> --
> +---------------------------------------------------------------+
> | Derek Chen-Becker                                             |
> | GPG Key available at https://keybase.io/dchenbecker and       |
> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> +---------------------------------------------------------------+
>
>

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Brandon Williams <dr...@gmail.com>.

On Thu, Nov 17, 2022 at 9:06 AM Josh McKenzie <jm...@apache.org> wrote:
>
> Arguably we could take it a step further and not actually allow a C* node to start up without pointing to one of the config files from your primary config, and provide a clean mechanism to integrate that selection on headless installs.

We could also automatically choose one based on the heap size (which, by
default, we already have to choose automatically as well).
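
As a rough illustration of what heap-based selection could look like, here is a minimal Python sketch. The helper name `pick_gc_profile` and the 8 GiB threshold are assumptions made up for this example, and the file names are borrowed from the profile list proposed elsewhere in this thread. Nothing here reflects what Cassandra actually ships or any tested recommendation.

```python
# Hypothetical sketch: choose a GC options file from the configured max heap.
# The threshold below is an illustrative assumption, not a tested value.

GIB = 1024 ** 3


def pick_gc_profile(max_heap_bytes: int) -> str:
    """Return the name of a (hypothetical) JVM options file for this heap size."""
    if max_heap_bytes <= 0:
        raise ValueError("max_heap_bytes must be positive")
    if max_heap_bytes < 8 * GIB:
        # Assumption: stay on CMS for small heaps until G1 has been
        # validated there, which the thread raises as an open question.
        return "jvm11-CMS-mixed.options"
    return "jvm11-G1.options"
```

For example, `pick_gc_profile(16 * GIB)` selects the G1 file under these assumed thresholds. The real cut-over point would have to come out of the performance validation discussed in this thread.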

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Josh McKenzie <jm...@apache.org>.

Have we ever discussed including multiple profiles that are simple to swap between and documented for their tested / intended use cases?

Then the burden of having a “sane” default for the wild variance of workloads people use it for would be somewhat mitigated. Sure, there’s always going to be folks that run the default and never think to change it, but the UX could be as simple as a one-line config change to swap between GC profiles, and we could add and deprecate / remove profiles over time.

Concretely, having config files such as:
> jvm11-CMS-write.options
> jvm11-CMS-mixed.options
> jvm11-CMS-read.options
> jvm11-G1.options
> jvm11-ZGC.options
> jvm11-Shen.options

Arguably we could take it a step further and not actually allow a C* node to start up without pointing to one of the config files from your primary config, and provide a clean mechanism to integrate that selection on headless installs.

Notably, this could be a terrible idea. But it *does* seem like we keep butting up against the complexity and mixed pressures of having the One True Way to GC via the default config and the lift to change that.
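
To make the "refuse to start without an explicit profile" idea concrete, here is a minimal Python sketch. The `gc_profile` config key, the `resolve_gc_profile` helper, and the `MissingGcProfileError` type are all hypothetical names invented for this example; only the options file names come from the list above. It is a sketch of the selection logic under those assumptions, not a description of how Cassandra loads its JVM options.

```python
# Hypothetical sketch: require an explicit GC profile before startup.
# "gc_profile" is an assumed config key; it does not exist in Cassandra today.

KNOWN_PROFILES = {
    "CMS-write": "jvm11-CMS-write.options",
    "CMS-mixed": "jvm11-CMS-mixed.options",
    "CMS-read": "jvm11-CMS-read.options",
    "G1": "jvm11-G1.options",
    "ZGC": "jvm11-ZGC.options",
    "Shen": "jvm11-Shen.options",
}


class MissingGcProfileError(Exception):
    """Raised when the primary config does not name a known GC profile."""


def resolve_gc_profile(config: dict) -> str:
    """Map the configured profile name to its options file, or fail loudly."""
    choices = ", ".join(sorted(KNOWN_PROFILES))
    name = config.get("gc_profile")
    if name is None:
        raise MissingGcProfileError(f"no gc_profile set; choose one of: {choices}")
    if name not in KNOWN_PROFILES:
        raise MissingGcProfileError(f"unknown gc_profile {name!r}; choose one of: {choices}")
    return KNOWN_PROFILES[name]
```

With this shape, swapping collectors is the one-line change described above (for instance `gc_profile: G1` in the primary config), and a headless install would only need a way to set that single key.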

On Wed, Nov 16, 2022, at 9:49 PM, Derek Chen-Becker wrote:
> I'm fine with not including G1 in 4.1, but would we consider inclusion
> for 4.1.X down the road once validation has been done?
> 
> Derek
> 
> 
> On Wed, Nov 16, 2022 at 4:39 PM David Capwell <dc...@apple.com> wrote:
> >
> > Getting poked in Slack to be more explicit in this thread…
> >
> > Switching to G1 on trunk, +1
> > Switching to G1 on 4.1, -1.  4.1 is about to be released, and this isn’t a bug fix but a perf improvement ticket; as such it should go through validation that the perf improvements are actually seen. There is not enough time left for that added performance work, so I strongly feel it should be pushed to 4.2/5.0, where there is plenty of time to validate it.  The ticket even asks to avoid validating the claims, saying 'Hoping we can skip due diligence on this ticket because the data is "in the past" already'.  Others have attempted both Shenandoah and ZGC and found mixed results, so nothing leads me to believe that won’t be true here either.
> >
> > > On Nov 16, 2022, at 9:15 AM, J. D. Jordan <je...@gmail.com> wrote:
> > >
> > > Heap -
> > > +1 for G1 in trunk
> > > +0 for G1 in 4.1 - I think it’s worthwhile and fairly well tested but I understand pushback against changing this so late in the game.
> > >
> > > Memtable -
> > > -1 for off heap in 4.1. I think this needs more testing and isn’t something to change at the last minute.
> > > +1 for running performance/fuzz tests against the alternate memtable choices in trunk and switching if they don’t show regressions.
> > >
> > >> On Nov 16, 2022, at 10:48 AM, Josh McKenzie <jm...@apache.org> wrote:
> > >>
> > >> 
> > >> To clarify: -0 here on G1 as default for 4.1 as well; I'd like us to prioritize digging into G1's behavior on small heaps vs. CMS w/our default tuning sooner rather than later. With that info I'd likely be a strong +1 on the shift.
> > >>
> > >> -1 on switching to offheap_objects for 4.1 RC; again, think this is just a small step away from being a +1 w/some more rigor around seeing the current state of the technology's intersections.
> > >>
> > >> On Wed, Nov 16, 2022, at 7:47 AM, Aleksey Yeshchenko wrote:
> > >>> All right. I’ll clarify then.
> > >>>
> > >>> -0 on switching the default to G1 *this late* just before RC1.
> > >>> -1 on switching the default to offheap_objects *for 4.1 RC1*, but all for it in principle, for 4.2, after we run some more tests and resolve the concerns raised by Jeff.
> > >>>
> > >>> Let’s please try to avoid this kind of super late defaults switch going forward?
> > >>>
> > >>> —
> > >>> AY
> > >>>
> > >>> > On 16 Nov 2022, at 03:27, Derek Chen-Becker <de...@chen-becker.org> wrote:
> > >>> >
> > >>> > For the record, I'm +100 on G1. Take it with whatever sized grain of
> > >>> > salt you think appropriate for a relative newcomer to the list, but
> > >>> > I've spent my last 7-8 years dealing with the intersection of
> > >>> > high-throughput, low latency systems and their interaction with GC and
> > >>> > in my personal experience G1 outperforms CMS in all cases and with
> > >>> > significantly less work (zero work, in many cases). The only things
> > >>> > I've seen perform better *with a similar heap footprint* are GenShen
> > >>> > (currently experimental) and Rust (beyond the scope of this topic).
> > >>> >
> > >>> > Derek
> > >>> >
> > >>> > On Tue, Nov 15, 2022 at 4:51 PM Jon Haddad <ru...@apache.org> wrote:
> > >>> >>
> > >>> >> I'm curious what it would take for folks to be OK with merging this into 4.1?  How much additional time would you want to feel comfortable?
> > >>> >>
> > >>> >> I should probably have been a little more vigorous in my +1 of Mick's PR.  For a little background - I worked on several hundred clusters while at TLP, mostly dealing with stability and performance issues.  A lot of them stemmed partially or wholly from the GC settings we ship in the project. Par New with CMS and small new gen results in a lot of premature promotion leading to high pause times into the hundreds of ms which pushes p99 latency through the roof.
> > >>> >>
> > >>> >> I'm a big +1 in favor of G1 because it's not just better for most people but it's better for _every_ new Cassandra user.  The first experience that people have with the project is important, and our current GC settings are quite bad - so bad they lead to problems with stability in production.  The G1 settings are mostly hands off, result in shorter pause times and are a big improvement over the status quo.
> > >>> >>
> > >>> >> Most folks don't do GC tuning, they use what we supply, and what we currently supply leads to a poor initial experience with the database.  I think we owe the community our best effort even if it means pushing the release back a little bit.
> > >>> >>
> > >>> >> Just for some additional context, we're (Netflix) running 25K nodes on G1 across a variety of hardware in AWS with wildly varying workloads, and I haven't seen G1 be the root cause of a problem even once.  The settings that Mick is proposing are almost identical to what we use (we use half of heap up to 30GB).
> > >>> >>
> > >>> >> I'd really appreciate it if we took a second to consider the community effect of another release that ships settings that cause significant pain for our users.
> > >>> >>
> > >>> >> Jon
> > >>> >>
> > >>> >> On 2022/11/10 21:49:36 Mick Semb Wever wrote:
> > >>> >>>>
> > >>> >>>> In case of GC, reasonably extensive performance testing should be the
> > >>> >>>> expectation. Potentially revisiting some of the G1 params for the 4.1
> > >>> >>>> reality - quite a lot has changed since those optional defaults were
> > >>> >>>> picked.
> > >>> >>>>
> > >>> >>>
> > >>> >>>
> > >>> >>> I've put our battle-tested g1 opts (from consultants at TLP and DataStax)
> > >>> >>> in the patch for CASSANDRA-18027
> > >>> >>>
> > >>> >>> In reality it is really not much of a change, g1 does make it simple.
> > >>> >>> Picking the correct ParallelGCThreads and ConcGCThreads and the floor to
> > >>> >>> the new heap (XX:NewSize) is still required, though we could do a much
> > >>> >>> better job of dynamic defaults to them.
> > >>> >>>
> > >>> >>> Alex Dejanovski's blog is a starting point:
> > >>> >>> https://thelastpickle.com/blog/2020/06/29/cassandra_4-0_garbage_collectors_performance_benchmarks.html
> > >>> >>> where this gc opt set was used (though it doesn't prove why those options
> > >>> >>> are chosen)
> > >>> >>>
> > >>> >>> The bar for objection to sneaking these into 4.1 was intended to be low,
> > >>> >>> and I stand by those that raise concerns.
> > >>> >>>
> > >>> >
> > >>> >
> > >>> >
> > >>> > --
> > >>> > +---------------------------------------------------------------+
> > >>> > | Derek Chen-Becker                                             |
> > >>> > | GPG Key available at https://keybase.io/dchenbecker and       |
> > >>> > | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> > >>> > | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> > >>> > +---------------------------------------------------------------+
> > >>>
> > >>>
> > >>
> >
> 
> 
> --
> +---------------------------------------------------------------+
> | Derek Chen-Becker                                             |
> | GPG Key available at https://keybase.io/dchenbecker and       |
> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> +---------------------------------------------------------------+
>

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Derek Chen-Becker <de...@chen-becker.org>.

I'm fine with not including G1 in 4.1, but would we consider inclusion
for 4.1.X down the road once validation has been done?

Derek


On Wed, Nov 16, 2022 at 4:39 PM David Capwell <dc...@apple.com> wrote:
>
> Getting poked in Slack to be more explicit in this thread…
>
> Switching to G1 on trunk, +1
> Switching to G1 on 4.1, -1.  4.1 is about to be released, and this isn’t a bug fix but a perf improvement ticket; as such it should go through validation that the perf improvements are actually seen. There is not enough time left for that added performance work, so I strongly feel it should be pushed to 4.2/5.0, where there is plenty of time to validate it.  The ticket even asks to avoid validating the claims, saying 'Hoping we can skip due diligence on this ticket because the data is "in the past" already'.  Others have attempted both Shenandoah and ZGC and found mixed results, so nothing leads me to believe that won’t be true here either.
>
> > On Nov 16, 2022, at 9:15 AM, J. D. Jordan <je...@gmail.com> wrote:
> >
> > Heap -
> > +1 for G1 in trunk
> > +0 for G1 in 4.1 - I think it’s worthwhile and fairly well tested but I understand pushback against changing this so late in the game.
> >
> > Memtable -
> > -1 for off heap in 4.1. I think this needs more testing and isn’t something to change at the last minute.
> > +1 for running performance/fuzz tests against the alternate memtable choices in trunk and switching if they don’t show regressions.
> >
> >> On Nov 16, 2022, at 10:48 AM, Josh McKenzie <jm...@apache.org> wrote:
> >>
> >> 
> >> To clarify: -0 here on G1 as default for 4.1 as well; I'd like us to prioritize digging into G1's behavior on small heaps vs. CMS w/our default tuning sooner rather than later. With that info I'd likely be a strong +1 on the shift.
> >>
> >> -1 on switching to offheap_objects for 4.1 RC; again, think this is just a small step away from being a +1 w/some more rigor around seeing the current state of the technology's intersections.
> >>
> >> On Wed, Nov 16, 2022, at 7:47 AM, Aleksey Yeshchenko wrote:
> >>> All right. I’ll clarify then.
> >>>
> >>> -0 on switching the default to G1 *this late* just before RC1.
> >>> -1 on switching the default to offheap_objects *for 4.1 RC1*, but all for it in principle, for 4.2, after we run some more tests and resolve the concerns raised by Jeff.
> >>>
> >>> Let’s please try to avoid this kind of super late defaults switch going forward?
> >>>
> >>> —
> >>> AY
> >>>
> >>> > On 16 Nov 2022, at 03:27, Derek Chen-Becker <de...@chen-becker.org> wrote:
> >>> >
> >>> > For the record, I'm +100 on G1. Take it with whatever sized grain of
> >>> > salt you think appropriate for a relative newcomer to the list, but
> >>> > I've spent my last 7-8 years dealing with the intersection of
> >>> > high-throughput, low latency systems and their interaction with GC and
> >>> > in my personal experience G1 outperforms CMS in all cases and with
> >>> > significantly less work (zero work, in many cases). The only things
> >>> > I've seen perform better *with a similar heap footprint* are GenShen
> >>> > (currently experimental) and Rust (beyond the scope of this topic).
> >>> >
> >>> > Derek
> >>> >
> >>> > On Tue, Nov 15, 2022 at 4:51 PM Jon Haddad <ru...@apache.org> wrote:
> >>> >>
> >>> >> I'm curious what it would take for folks to be OK with merging this into 4.1?  How much additional time would you want to feel comfortable?
> >>> >>
> >>> >> I should probably have been a little more vigorous in my +1 of Mick's PR.  For a little background - I worked on several hundred clusters while at TLP, mostly dealing with stability and performance issues.  A lot of them stemmed partially or wholly from the GC settings we ship in the project. Par New with CMS and small new gen results in a lot of premature promotion leading to high pause times into the hundreds of ms which pushes p99 latency through the roof.
> >>> >>
> >>> >> I'm a big +1 in favor of G1 because it's not just better for most people but it's better for _every_ new Cassandra user.  The first experience that people have with the project is important, and our current GC settings are quite bad - so bad they lead to problems with stability in production.  The G1 settings are mostly hands off, result in shorter pause times and are a big improvement over the status quo.
> >>> >>
> >>> >> Most folks don't do GC tuning, they use what we supply, and what we currently supply leads to a poor initial experience with the database.  I think we owe the community our best effort even if it means pushing the release back a little bit.
> >>> >>
> >>> >> Just for some additional context, we're (Netflix) running 25K nodes on G1 across a variety of hardware in AWS with wildly varying workloads, and I haven't seen G1 be the root cause of a problem even once.  The settings that Mick is proposing are almost identical to what we use (we use half of heap up to 30GB).
> >>> >>
> >>> >> I'd really appreciate it if we took a second to consider the community effect of another release that ships settings that cause significant pain for our users.
> >>> >>
> >>> >> Jon
> >>> >>
> >>> >> On 2022/11/10 21:49:36 Mick Semb Wever wrote:
> >>> >>>>
> >>> >>>> In case of GC, reasonably extensive performance testing should be the
> >>> >>>> expectation. Potentially revisiting some of the G1 params for the 4.1
> >>> >>>> reality - quite a lot has changed since those optional defaults were
> >>> >>>> picked.
> >>> >>>>
> >>> >>>
> >>> >>>
> >>> >>> I've put our battle-tested g1 opts (from consultants at TLP and DataStax)
> >>> >>> in the patch for CASSANDRA-18027
> >>> >>>
> >>> >>> In reality it is really not much of a change, g1 does make it simple.
> >>> >>> Picking the correct ParallelGCThreads and ConcGCThreads and the floor to
> >>> >>> the new heap (XX:NewSize) is still required, though we could do a much
> >>> >>> better job of dynamic defaults to them.
> >>> >>>
> >>> >>> Alex Dejanovski's blog is a starting point:
> >>> >>> https://thelastpickle.com/blog/2020/06/29/cassandra_4-0_garbage_collectors_performance_benchmarks.html
> >>> >>> where this gc opt set was used (though it doesn't prove why those options
> >>> >>> are chosen)
> >>> >>>
> >>> >>> The bar for objection to sneaking these into 4.1 was intended to be low,
> >>> >>> and I stand by those that raise concerns.
> >>> >>>
> >>> >
> >>> >
> >>> >
> >>> > --
> >>> > +---------------------------------------------------------------+
> >>> > | Derek Chen-Becker                                             |
> >>> > | GPG Key available at https://keybase.io/dchenbecker and       |
> >>> > | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> >>> > | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> >>> > +---------------------------------------------------------------+
> >>>
> >>>
> >>
>


--
+---------------------------------------------------------------+
| Derek Chen-Becker                                             |
| GPG Key available at https://keybase.io/dchenbecker and       |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
+---------------------------------------------------------------+

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by David Capwell <dc...@apple.com>.

Getting poked in Slack to be more explicit in this thread… 

Switching to G1 on trunk, +1
Switching to G1 on 4.1, -1.  4.1 is about to be released, and this isn’t a bug fix but a perf improvement ticket; as such it should go through validation that the perf improvements are actually seen. There is not enough time left for that added performance work, so I strongly feel it should be pushed to 4.2/5.0, where there is plenty of time to validate it.  The ticket even asks to avoid validating the claims, saying 'Hoping we can skip due diligence on this ticket because the data is "in the past" already'.  Others have attempted both Shenandoah and ZGC and found mixed results, so nothing leads me to believe that won’t be true here either.

> On Nov 16, 2022, at 9:15 AM, J. D. Jordan <je...@gmail.com> wrote:
> 
> Heap -
> +1 for G1 in trunk
> +0 for G1 in 4.1 - I think it’s worthwhile and fairly well tested but I understand pushback against changing this so late in the game.
> 
> Memtable -
> -1 for off heap in 4.1. I think this needs more testing and isn’t something to change at the last minute.
> +1 for running performance/fuzz tests against the alternate memtable choices in trunk and switching if they don’t show regressions.
> 
>> On Nov 16, 2022, at 10:48 AM, Josh McKenzie <jm...@apache.org> wrote:
>> 
>> 
>> To clarify: -0 here on G1 as default for 4.1 as well; I'd like us to prioritize digging into G1's behavior on small heaps vs. CMS w/our default tuning sooner rather than later. With that info I'd likely be a strong +1 on the shift.
>> 
>> -1 on switching to offheap_objects for 4.1 RC; again, think this is just a small step away from being a +1 w/some more rigor around seeing the current state of the technology's intersections.
>> 
>> On Wed, Nov 16, 2022, at 7:47 AM, Aleksey Yeshchenko wrote:
>>> All right. I’ll clarify then.
>>> 
>>> -0 on switching the default to G1 *this late* just before RC1.
>>> -1 on switching the default to offheap_objects *for 4.1 RC1*, but all for it in principle, for 4.2, after we run some more tests and resolve the concerns raised by Jeff.
>>> 
>>> Let’s please try to avoid this kind of super late defaults switch going forward?
>>> 
>>> —
>>> AY
>>> 
>>> > On 16 Nov 2022, at 03:27, Derek Chen-Becker <de...@chen-becker.org> wrote:
>>> > 
>>> > For the record, I'm +100 on G1. Take it with whatever sized grain of
>>> > salt you think appropriate for a relative newcomer to the list, but
>>> > I've spent my last 7-8 years dealing with the intersection of
>>> > high-throughput, low latency systems and their interaction with GC and
>>> > in my personal experience G1 outperforms CMS in all cases and with
>>> > significantly less work (zero work, in many cases). The only things
>>> > I've seen perform better *with a similar heap footprint* are GenShen
>>> > (currently experimental) and Rust (beyond the scope of this topic).
>>> > 
>>> > Derek
>>> > 
>>> > On Tue, Nov 15, 2022 at 4:51 PM Jon Haddad <ru...@apache.org> wrote:
>>> >> 
>>> >> I'm curious what it would take for folks to be OK with merging this into 4.1?  How much additional time would you want to feel comfortable?
>>> >> 
>>> >> I should probably have been a little more vigorous in my +1 of Mick's PR.  For a little background - I worked on several hundred clusters while at TLP, mostly dealing with stability and performance issues.  A lot of them stemmed partially or wholly from the GC settings we ship in the project. Par New with CMS and small new gen results in a lot of premature promotion leading to high pause times into the hundreds of ms which pushes p99 latency through the roof.
>>> >> 
>>> >> I'm a big +1 in favor of G1 because it's not just better for most people but it's better for _every_ new Cassandra user.  The first experience that people have with the project is important, and our current GC settings are quite bad - so bad they lead to problems with stability in production.  The G1 settings are mostly hands off, result in shorter pause times and are a big improvement over the status quo.
>>> >> 
>>> >> Most folks don't do GC tuning, they use what we supply, and what we currently supply leads to a poor initial experience with the database.  I think we owe the community our best effort even if it means pushing the release back a little bit.
>>> >> 
>>> >> Just for some additional context, we're (Netflix) running 25K nodes on G1 across a variety of hardware in AWS with wildly varying workloads, and I haven't seen G1 be the root cause of a problem even once.  The settings that Mick is proposing are almost identical to what we use (we use half of heap up to 30GB).
>>> >> 
>>> >> I'd really appreciate it if we took a second to consider the community effect of another release that ships settings that cause significant pain for our users.
>>> >> 
>>> >> Jon
>>> >> 
>>> >> On 2022/11/10 21:49:36 Mick Semb Wever wrote:
>>> >>>> 
>>> >>>> In case of GC, reasonably extensive performance testing should be the
>>> >>>> expectation. Potentially revisiting some of the G1 params for the 4.1
>>> >>>> reality - quite a lot has changed since those optional defaults were
>>> >>>> picked.
>>> >>>> 
>>> >>> 
>>> >>> 
>>> >>> I've put our battle-tested g1 opts (from consultants at TLP and DataStax)
>>> >>> in the patch for CASSANDRA-18027
>>> >>> 
>>> >>> In reality it is really not much of a change, g1 does make it simple.
>>> >>> Picking the correct ParallelGCThreads and ConcGCThreads and the floor to
>>> >>> the new heap (XX:NewSize) is still required, though we could do a much
>>> >>> better job of dynamic defaults to them.
>>> >>> 
>>> >>> Alex Dejanovski's blog is a starting point:
>>> >>> https://thelastpickle.com/blog/2020/06/29/cassandra_4-0_garbage_collectors_performance_benchmarks.html
>>> >>> where this gc opt set was used (though it doesn't prove why those options
>>> >>> are chosen)
>>> >>> 
>>> >>> The bar for objection to sneaking these into 4.1 was intended to be low,
>>> >>> and I stand by those that raise concerns.
>>> >>> 
>>> > 
>>> > 
>>> > 
>>> > -- 
>>> > +---------------------------------------------------------------+
>>> > | Derek Chen-Becker                                             |
>>> > | GPG Key available at https://keybase.io/dchenbecker and       |
>>> > | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>>> > | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
>>> > +---------------------------------------------------------------+
>>> 
>>> 
>>

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by "J. D. Jordan" <je...@gmail.com>.

Heap -
+1 for G1 in trunk
+0 for G1 in 4.1 - I think it’s worthwhile and fairly well tested but I understand pushback against changing this so late in the game.

Memtable -
-1 for off heap in 4.1. I think this needs more testing and isn’t something to change at the last minute.
+1 for running performance/fuzz tests against the alternate memtable choices in trunk and switching if they don’t show regressions.

> On Nov 16, 2022, at 10:48 AM, Josh McKenzie <jm...@apache.org> wrote:
>
> To clarify: -0 here on G1 as default for 4.1 as well; I'd like us to prioritize digging into G1's behavior on small heaps vs. CMS w/our default tuning sooner rather than later. With that info I'd likely be a strong +1 on the shift.
>
> -1 on switching to offheap_objects for 4.1 RC; again, think this is just a small step away from being a +1 w/some more rigor around seeing the current state of the technology's intersections.
>
> On Wed, Nov 16, 2022, at 7:47 AM, Aleksey Yeshchenko wrote:
>> All right. I’ll clarify then.
>>
>> -0 on switching the default to G1 *this late* just before RC1.
>> -1 on switching the default to offheap_objects *for 4.1 RC1*, but all for it in principle, for 4.2, after we run some more tests and resolve the concerns raised by Jeff.
>>
>> Let’s please try to avoid this kind of super late defaults switch going forward?
>>
>> --
>> AY
>>
>> > On 16 Nov 2022, at 03:27, Derek Chen-Becker <derek@chen-becker.org> wrote:
>> >
>> > For the record, I'm +100 on G1. Take it with whatever sized grain of
>> > salt you think appropriate for a relative newcomer to the list, but
>> > I've spent my last 7-8 years dealing with the intersection of
>> > high-throughput, low latency systems and their interaction with GC and
>> > in my personal experience G1 outperforms CMS in all cases and with
>> > significantly less work (zero work, in many cases). The only things
>> > I've seen perform better *with a similar heap footprint* are GenShen
>> > (currently experimental) and Rust (beyond the scope of this topic).
>> >
>> > Derek
>> >
>> > On Tue, Nov 15, 2022 at 4:51 PM Jon Haddad <rustyrazorblade@apache.org> wrote:
>> >>
>> >> I'm curious what it would take for folks to be OK with merging this into 4.1?  How much additional time would you want to feel comfortable?
>> >>
>> >> I should probably have been a little more vigorous in my +1 of Mick's PR.  For a little background - I worked on several hundred clusters while at TLP, mostly dealing with stability and performance issues.  A lot of them stemmed partially or wholly from the GC settings we ship in the project. Par New with CMS and small new gen results in a lot of premature promotion leading to high pause times into the hundreds of ms which pushes p99 latency through the roof.
>> >>
>> >> I'm a big +1 in favor of G1 because it's not just better for most people but it's better for _every_ new Cassandra user.  The first experience that people have with the project is important, and our current GC settings are quite bad - so bad they lead to problems with stability in production.  The G1 settings are mostly hands off, result in shorter pause times and are a big improvement over the status quo.
>> >>
>> >> Most folks don't do GC tuning, they use what we supply, and what we currently supply leads to a poor initial experience with the database.  I think we owe the community our best effort even if it means pushing the release back a little bit.
>> >>
>> >> Just for some additional context, we're (Netflix) running 25K nodes on G1 across a variety of hardware in AWS with wildly varying workloads, and I haven't seen G1 be the root cause of a problem even once.  The settings that Mick is proposing are almost identical to what we use (we use half of heap up to 30GB).
>> >>
>> >> I'd really appreciate it if we took a second to consider the community effect of another release that ships settings that cause significant pain for our users.
>> >>
>> >> Jon
>> >>
>> >> On 2022/11/10 21:49:36 Mick Semb Wever wrote:
>> >>>>
>> >>>> In case of GC, reasonably extensive performance testing should be the
>> >>>> expectation. Potentially revisiting some of the G1 params for the 4.1
>> >>>> reality - quite a lot has changed since those optional defaults were
>> >>>> picked.
>> >>>>
>> >>>
>> >>>
>> >>> I've put our battle-tested g1 opts (from consultants at TLP and DataStax)
>> >>> in the patch for CASSANDRA-18027
>> >>>
>> >>> In reality it is really not much of a change, g1 does make it simple.
>> >>> Picking the correct ParallelGCThreads and ConcGCThreads and the floor to
>> >>> the new heap (XX:NewSize) is still required, though we could do a much
>> >>> better job of dynamic defaults to them.
>> >>>
>> >>> Alex Dejanovski's blog is a starting point:
>> >>> https://thelastpickle.com/blog/2020/06/29/cassandra_4-0_garbage_collectors_performance_benchmarks.html
>> >>> where this gc opt set was used (though it doesn't prove why those options
>> >>> are chosen)
>> >>>
>> >>> The bar for objection to sneaking these into 4.1 was intended to be low,
>> >>> and I stand by those that raise concerns.
>> >>>
>> >
>> >
>> >
>> > --
>> > +---------------------------------------------------------------+
>> > | Derek Chen-Becker                                             |
>> > | GPG Key available at https://keybase.io/dchenbecker and       |
>> > | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
>> > | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
>> > +---------------------------------------------------------------+
>>
>>
>

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Josh McKenzie <jm...@apache.org>.

To clarify: -0 here on G1 as default for 4.1 as well; I'd like us to prioritize digging into G1's behavior on small heaps vs. CMS w/our default tuning sooner rather than later. With that info I'd likely be a strong +1 on the shift.

-1 on switching to offheap_objects for 4.1 RC; again, think this is just a small step away from being a +1 w/some more rigor around seeing the current state of the technology's intersections.

On Wed, Nov 16, 2022, at 7:47 AM, Aleksey Yeshchenko wrote:
> All right. I’ll clarify then.
> 
> -0 on switching the default to G1 *this late* just before RC1.
> -1 on switching the default offheap_objects *for 4.1 RC1*, but all for it in principle, for 4.2, after we run some more test and resolve the concerns raised by Jeff.
> 
> Let’s please try to avoid this kind of super late defaults switch going forward?
> 
> —
> AY
> 
> > On 16 Nov 2022, at 03:27, Derek Chen-Becker <de...@chen-becker.org> wrote:
> > 
> > For the record, I'm +100 on G1. Take it with whatever sized grain of
> > salt you think appropriate for a relative newcomer to the list, but
> > I've spent my last 7-8 years dealing with the intersection of
> > high-throughput, low latency systems and their interaction with GC and
> > in my personal experience G1 outperforms CMS in all cases and with
> > significantly less work (zero work, in many cases). The only things
> > I've seen perform better *with a similar heap footprint* are GenShen
> > (currently experimental) and Rust (beyond the scope of this topic).
> > 
> > Derek
> > 
> > On Tue, Nov 15, 2022 at 4:51 PM Jon Haddad <ru...@apache.org> wrote:
> >> 
> >> I'm curious what it would take for folks to be OK with merging this into 4.1?  How much additional time would you want to feel comfortable?
> >> 
> >> I should probably have been a little more vigorous in my +1 of Mick's PR.  For a little background - I worked on several hundred clusters while at TLP, mostly dealing with stability and performance issues.  A lot of them stemmed partially or wholly from the GC settings we ship in the project. Par New with CMS and small new gen results in a lot of premature promotion leading to high pause times into the hundreds of ms which pushes p99 latency through the roof.
> >> 
> >> I'm a big +1 in favor of G1 because it's not just better for most people but it's better for _every_ new Cassandra user.  The first experience that people have with the project is important, and our current GC settings are quite bad - so bad they lead to problems with stability in production.  The G1 settings are mostly hands off, result in shorter pause times and are a big improvement over the status quo.
> >> 
> >> Most folks don't do GC tuning, they use what we supply, and what we currently supply leads to a poor initial experience with the database.  I think we owe the community our best effort even if it means pushing the release back little bit.
> >> 
> >> Just for some additional context, we're (Netflix) running 25K nodes on G1 across a variety of hardware in AWS with wildly varying workloads, and I haven't seen G1 be the root cause of a problem even once.  The settings that Mick is proposing are almost identical to what we use (we use half of heap up to 30GB).
> >> 
> >> I'd really appreciate it if we took a second to consider the community effect of another release that ships settings that cause significant pain for our users.
> >> 
> >> Jon
> >> 
> >> On 2022/11/10 21:49:36 Mick Semb Wever wrote:
> >>>> 
> >>>> In case of GC, reasonably extensive performance testing should be the
> >>>> expectations. Potentially revisiting some of the G1 params for the 4.1
> >>>> reality - quite a lot has changed since those optional defaults where
> >>>> picked.
> >>>> 
> >>> 
> >>> 
> >>> I've put our battle-tested g1 opts (from consultants at TLP and DataStax)
> >>> in the patch for CASSANDRA-18027
> >>> 
> >>> In reality it is really not much of a change, g1 does make it simple.
> >>> Picking the correct ParallelGCThreads and ConcGCThreads and the floor to
> >>> the new heap (XX:NewSize) is still required, though we could do a much
> >>> better job of dynamic defaults to them.
> >>> 
> >>> Alex Dejanovski's blog is a starting point:
> >>> https://thelastpickle.com/blog/2020/06/29/cassandra_4-0_garbage_collectors_performance_benchmarks.html
> >>> where this gc opt set was used (though it doesn't prove why those options
> >>> are chosen)
> >>> 
> >>> The bar for objection to sneaking these into 4.1 was intended to be low,
> >>> and I stand by those that raise concerns.
> >>> 
> > 
> > 
> > 
> > -- 
> > +---------------------------------------------------------------+
> > | Derek Chen-Becker                                             |
> > | GPG Key available at https://keybase.io/dchenbecker and       |
> > | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> > | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> > +---------------------------------------------------------------+
> 
>

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Aleksey Yeshchenko <al...@apple.com>.

All right. I’ll clarify then.

-0 on switching the default to G1 *this late* just before RC1.
-1 on switching the default to offheap_objects *for 4.1 RC1*, but all for it in principle, for 4.2, after we run some more tests and resolve the concerns raised by Jeff.

Let’s please try to avoid this kind of super late defaults switch going forward?

—
AY

> On 16 Nov 2022, at 03:27, Derek Chen-Becker <de...@chen-becker.org> wrote:
> 
> For the record, I'm +100 on G1. Take it with whatever sized grain of
> salt you think appropriate for a relative newcomer to the list, but
> I've spent my last 7-8 years dealing with the intersection of
> high-throughput, low latency systems and their interaction with GC and
> in my personal experience G1 outperforms CMS in all cases and with
> significantly less work (zero work, in many cases). The only things
> I've seen perform better *with a similar heap footprint* are GenShen
> (currently experimental) and Rust (beyond the scope of this topic).
> 
> Derek
> 
> On Tue, Nov 15, 2022 at 4:51 PM Jon Haddad <ru...@apache.org> wrote:
>> 
>> I'm curious what it would take for folks to be OK with merging this into 4.1?  How much additional time would you want to feel comfortable?
>> 
>> I should probably have been a little more vigorous in my +1 of Mick's PR.  For a little background - I worked on several hundred clusters while at TLP, mostly dealing with stability and performance issues.  A lot of them stemmed partially or wholly from the GC settings we ship in the project. Par New with CMS and small new gen results in a lot of premature promotion leading to high pause times into the hundreds of ms which pushes p99 latency through the roof.
>> 
>> I'm a big +1 in favor of G1 because it's not just better for most people but it's better for _every_ new Cassandra user.  The first experience that people have with the project is important, and our current GC settings are quite bad - so bad they lead to problems with stability in production.  The G1 settings are mostly hands off, result in shorter pause times and are a big improvement over the status quo.
>> 
>> Most folks don't do GC tuning, they use what we supply, and what we currently supply leads to a poor initial experience with the database.  I think we owe the community our best effort even if it means pushing the release back little bit.
>> 
>> Just for some additional context, we're (Netflix) running 25K nodes on G1 across a variety of hardware in AWS with wildly varying workloads, and I haven't seen G1 be the root cause of a problem even once.  The settings that Mick is proposing are almost identical to what we use (we use half of heap up to 30GB).
>> 
>> I'd really appreciate it if we took a second to consider the community effect of another release that ships settings that cause significant pain for our users.
>> 
>> Jon
>> 
>> On 2022/11/10 21:49:36 Mick Semb Wever wrote:
>>>> 
>>>> In case of GC, reasonably extensive performance testing should be the
>>>> expectations. Potentially revisiting some of the G1 params for the 4.1
>>>> reality - quite a lot has changed since those optional defaults where
>>>> picked.
>>>> 
>>> 
>>> 
>>> I've put our battle-tested g1 opts (from consultants at TLP and DataStax)
>>> in the patch for CASSANDRA-18027
>>> 
>>> In reality it is really not much of a change, g1 does make it simple.
>>> Picking the correct ParallelGCThreads and ConcGCThreads and the floor to
>>> the new heap (XX:NewSize) is still required, though we could do a much
>>> better job of dynamic defaults to them.
>>> 
>>> Alex Dejanovski's blog is a starting point:
>>> https://thelastpickle.com/blog/2020/06/29/cassandra_4-0_garbage_collectors_performance_benchmarks.html
>>> where this gc opt set was used (though it doesn't prove why those options
>>> are chosen)
>>> 
>>> The bar for objection to sneaking these into 4.1 was intended to be low,
>>> and I stand by those that raise concerns.
>>> 
> 
> 
> 
> -- 
> +---------------------------------------------------------------+
> | Derek Chen-Becker                                             |
> | GPG Key available at https://keybase.io/dchenbecker and       |
> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> +---------------------------------------------------------------+

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Derek Chen-Becker <de...@chen-becker.org>.

For the record, I'm +100 on G1. Take it with whatever sized grain of
salt you think appropriate for a relative newcomer to the list, but
I've spent my last 7-8 years dealing with the intersection of
high-throughput, low latency systems and their interaction with GC and
in my personal experience G1 outperforms CMS in all cases and with
significantly less work (zero work, in many cases). The only things
I've seen perform better *with a similar heap footprint* are GenShen
(currently experimental) and Rust (beyond the scope of this topic).

Derek

On Tue, Nov 15, 2022 at 4:51 PM Jon Haddad <ru...@apache.org> wrote:
>
> I'm curious what it would take for folks to be OK with merging this into 4.1?  How much additional time would you want to feel comfortable?
>
> I should probably have been a little more vigorous in my +1 of Mick's PR.  For a little background - I worked on several hundred clusters while at TLP, mostly dealing with stability and performance issues.  A lot of them stemmed partially or wholly from the GC settings we ship in the project. Par New with CMS and small new gen results in a lot of premature promotion leading to high pause times into the hundreds of ms which pushes p99 latency through the roof.
>
> I'm a big +1 in favor of G1 because it's not just better for most people but it's better for _every_ new Cassandra user.  The first experience that people have with the project is important, and our current GC settings are quite bad - so bad they lead to problems with stability in production.  The G1 settings are mostly hands off, result in shorter pause times and are a big improvement over the status quo.
>
> Most folks don't do GC tuning, they use what we supply, and what we currently supply leads to a poor initial experience with the database.  I think we owe the community our best effort even if it means pushing the release back little bit.
>
> Just for some additional context, we're (Netflix) running 25K nodes on G1 across a variety of hardware in AWS with wildly varying workloads, and I haven't seen G1 be the root cause of a problem even once.  The settings that Mick is proposing are almost identical to what we use (we use half of heap up to 30GB).
>
> I'd really appreciate it if we took a second to consider the community effect of another release that ships settings that cause significant pain for our users.
>
> Jon
>
> On 2022/11/10 21:49:36 Mick Semb Wever wrote:
> > >
> > > In case of GC, reasonably extensive performance testing should be the
> > > expectations. Potentially revisiting some of the G1 params for the 4.1
> > > reality - quite a lot has changed since those optional defaults where
> > > picked.
> > >
> >
> >
> > I've put our battle-tested g1 opts (from consultants at TLP and DataStax)
> > in the patch for CASSANDRA-18027
> >
> > In reality it is really not much of a change, g1 does make it simple.
> > Picking the correct ParallelGCThreads and ConcGCThreads and the floor to
> > the new heap (XX:NewSize) is still required, though we could do a much
> > better job of dynamic defaults to them.
> >
> > Alex Dejanovski's blog is a starting point:
> > https://thelastpickle.com/blog/2020/06/29/cassandra_4-0_garbage_collectors_performance_benchmarks.html
> > where this gc opt set was used (though it doesn't prove why those options
> > are chosen)
> >
> > The bar for objection to sneaking these into 4.1 was intended to be low,
> > and I stand by those that raise concerns.
> >



-- 
+---------------------------------------------------------------+
| Derek Chen-Becker                                             |
| GPG Key available at https://keybase.io/dchenbecker and       |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
+---------------------------------------------------------------+

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Jon Haddad <ru...@apache.org>.

I'm curious what it would take for folks to be OK with merging this into 4.1?  How much additional time would you want to feel comfortable?  

I should probably have been a little more vigorous in my +1 of Mick's PR.  For a little background: I worked on several hundred clusters while at TLP, mostly dealing with stability and performance issues.  A lot of them stemmed partially or wholly from the GC settings we ship in the project. ParNew with CMS and a small new gen results in a lot of premature promotion, leading to pause times in the hundreds of ms, which pushes p99 latency through the roof.

I'm a big +1 in favor of G1 because it's not just better for most people but it's better for _every_ new Cassandra user.  The first experience that people have with the project is important, and our current GC settings are quite bad - so bad they lead to problems with stability in production.  The G1 settings are mostly hands off, result in shorter pause times and are a big improvement over the status quo.  

Most folks don't do GC tuning; they use what we supply, and what we currently supply leads to a poor initial experience with the database.  I think we owe the community our best effort even if it means pushing the release back a little bit.

Just for some additional context, we're (Netflix) running 25K nodes on G1 across a variety of hardware in AWS with wildly varying workloads, and I haven't seen G1 be the root cause of a problem even once.  The settings that Mick is proposing are almost identical to what we use (we use half of heap up to 30GB).  
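
The sizing rule mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: it reads "half of heap up to 30GB" as "half of physical RAM, capped at 30 GB", and it sets -Xms equal to -Xmx, which is a common practice but is not stated in this thread. The function names are hypothetical and are not part of Cassandra or of the CASSANDRA-18027 patch.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def suggest_heap_bytes(ram_bytes, cap_bytes=30 * GIB):
    """Return half of physical RAM, capped at cap_bytes.

    Assumption: this is one reading of "half of heap up to 30GB".
    """
    if ram_bytes <= 0:
        raise ValueError("ram_bytes must be positive")
    return min(ram_bytes // 2, cap_bytes)


def heap_flags(ram_bytes):
    """Build -Xms/-Xmx flags (in MiB) from the suggested heap size.

    Assumption: initial and maximum heap are set to the same value.
    """
    heap_mib = suggest_heap_bytes(ram_bytes) // MIB
    return [f"-Xms{heap_mib}m", f"-Xmx{heap_mib}m"]
```

For example, heap_flags(32 * GIB) yields a 16 GiB heap, while a 128 GiB machine is held at the 30 GiB cap.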

I'd really appreciate it if we took a second to consider the community effect of another release that ships settings that cause significant pain for our users.

Jon

On 2022/11/10 21:49:36 Mick Semb Wever wrote:
> >
> > In case of GC, reasonably extensive performance testing should be the
> > expectations. Potentially revisiting some of the G1 params for the 4.1
> > reality - quite a lot has changed since those optional defaults where
> > picked.
> >
> 
> 
> I've put our battle-tested g1 opts (from consultants at TLP and DataStax)
> in the patch for CASSANDRA-18027
> 
> In reality it is really not much of a change, g1 does make it simple.
> Picking the correct ParallelGCThreads and ConcGCThreads and the floor to
> the new heap (XX:NewSize) is still required, though we could do a much
> better job of dynamic defaults to them.
> 
> Alex Dejanovski's blog is a starting point:
> https://thelastpickle.com/blog/2020/06/29/cassandra_4-0_garbage_collectors_performance_benchmarks.html
> where this gc opt set was used (though it doesn't prove why those options
> are chosen)
> 
> The bar for objection to sneaking these into 4.1 was intended to be low,
> and I stand by those that raise concerns.
>

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Anthony Grasso <an...@gmail.com>.

+1 to switching to G1 as well. Most production clusters I've seen are
typically running with a heap size of 16 GB or higher which works well with
G1.

I agree with Elliott's comment; I think this change should go into 4.1
onwards (i.e. no change to the default JVM settings in 4.0).

On Fri, 11 Nov 2022 at 08:50, Mick Semb Wever <mc...@apache.org> wrote:

> In case of GC, reasonably extensive performance testing should be the
>> expectations. Potentially revisiting some of the G1 params for the 4.1
>> reality - quite a lot has changed since those optional defaults where
>> picked.
>>
>
>
> I've put our battle-tested g1 opts (from consultants at TLP and DataStax)
> in the patch for CASSANDRA-18027
>
> In reality it is really not much of a change, g1 does make it simple.
> Picking the correct ParallelGCThreads and ConcGCThreads and the floor to
> the new heap (XX:NewSize) is still required, though we could do a much
> better job of dynamic defaults to them.
>
> Alex Dejanovski's blog is a starting point:
> https://thelastpickle.com/blog/2020/06/29/cassandra_4-0_garbage_collectors_performance_benchmarks.html
> where this gc opt set was used (though it doesn't prove why those options
> are chosen)
>
> The bar for objection to sneaking these into 4.1 was intended to be low,
> and I stand by those that raise concerns.
>
>

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Mick Semb Wever <mc...@apache.org>.

>
> In case of GC, reasonably extensive performance testing should be the
> expectations. Potentially revisiting some of the G1 params for the 4.1
> reality - quite a lot has changed since those optional defaults where
> picked.
>


I've put our battle-tested g1 opts (from consultants at TLP and DataStax)
in the patch for CASSANDRA-18027

In reality it is really not much of a change; G1 does make it simple.
Picking the correct ParallelGCThreads and ConcGCThreads and the floor to
the new heap (-XX:NewSize) is still required, though we could do a much
better job of dynamic defaults for them.

Alex Dejanovski's blog is a starting point:
https://thelastpickle.com/blog/2020/06/29/cassandra_4-0_garbage_collectors_performance_benchmarks.html
where this gc opt set was used (though it doesn't explain why those options
were chosen)

The bar for objection to sneaking these into 4.1 was intended to be low,
and I stand by those that raise concerns.
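
As a rough illustration of what "dynamic defaults" for the GC thread counts could look like, here is a minimal sketch that derives ParallelGCThreads and ConcGCThreads from the core count. The scaling rule (one parallel thread per core up to 8, then 5/8 of a thread per additional core; concurrent threads at about a quarter of that) is an assumption chosen for illustration, and the function names are hypothetical. It is not the option set from CASSANDRA-18027 and it does not cover the new-gen floor (-XX:NewSize).

```python
def suggest_gc_threads(cpu_count):
    """Return (parallel, concurrent) GC thread counts for a core count.

    Assumption: the scaling rule below is illustrative only.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    # One parallel GC thread per core up to 8 cores,
    # then 5/8 of a thread for each core beyond that.
    if cpu_count <= 8:
        parallel = cpu_count
    else:
        parallel = 8 + (cpu_count - 8) * 5 // 8
    # Roughly a quarter of the parallel threads, but never fewer than 1.
    concurrent = max((parallel + 2) // 4, 1)
    return parallel, concurrent


def g1_thread_flags(cpu_count):
    """Render the suggested counts as JVM command-line flags."""
    parallel, concurrent = suggest_gc_threads(cpu_count)
    return [
        "-XX:+UseG1GC",
        f"-XX:ParallelGCThreads={parallel}",
        f"-XX:ConcGCThreads={concurrent}",
    ]
```

Calling g1_thread_flags(os.cpu_count() or 1) after importing os would produce flags for the local machine; any real default would need the kind of performance testing asked for elsewhere in this thread.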

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Aleksey Yeshchenko <al...@apple.com>.

I assume not with 4.0/4.1 though.

It might be a better default than CMS, but switching a major default like this (and memtable allocation) is not something that should be snuck in at the very last moment.

In case of GC, reasonably extensive performance testing should be the expectation. Potentially revisiting some of the G1 params for the 4.1 reality - quite a lot has changed since those optional defaults were picked.

> On 9 Nov 2022, at 21:13, Jeremiah D Jordan <je...@gmail.com> wrote:
> 
> At DataStax we’ve been shipping those optional G1 settings as the default for many years now, so I am +1 to at the very least making the change in trunk, but really I would think it fine to make it back in 4.0 and 4.1 as well.

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Jeremiah D Jordan <je...@gmail.com>.

At DataStax we’ve been shipping those optional G1 settings as the default for many years now, so I am +1 to at the very least making the change in trunk, but really I would think it fine to make it back in 4.0 and 4.1 as well.

-Jeremiah

> On Nov 9, 2022, at 1:32 PM, David Capwell <dc...@apple.com> wrote:
> 
> CASSANDRA-12029/CASSANDRA-7486 I am not in favor of doing for 4.1, we spend time validating the current settings, so changing at the last minute adds risk; so rather push that to 4.2/5.0
> 
> 
>> On Nov 9, 2022, at 11:25 AM, Brandon Williams <dr...@gmail.com> wrote:
>> 
>> CMS was deprecated in JDK 9, I don't see a good reason to follow it
>> until it's dying breath, and we already have G1 ready in the jvm
>> options files so this should be an easy switch, +1.
>> 
>> Kind Regards,
>> Brandon
>> 
>> On Wed, Nov 9, 2022 at 1:22 PM Mick Semb Wever <mc...@apache.org> wrote:
>>> 
>>> Any objections to making these changes, at the very last minute, for 4.1-rc1 ?
>>> This is CASSANDRA-12029 and CASSANDRA-7486
>>> 
>>> Provided we figure out patches for them in the next day or two.
>

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by David Capwell <dc...@apple.com>.

I am not in favor of doing CASSANDRA-12029/CASSANDRA-7486 for 4.1: we spent time validating the current settings, so changing them at the last minute adds risk. I would rather push that to 4.2/5.0.


> On Nov 9, 2022, at 11:25 AM, Brandon Williams <dr...@gmail.com> wrote:
> 
> CMS was deprecated in JDK 9, I don't see a good reason to follow it
> until it's dying breath, and we already have G1 ready in the jvm
> options files so this should be an easy switch, +1.
> 
> Kind Regards,
> Brandon
> 
> On Wed, Nov 9, 2022 at 1:22 PM Mick Semb Wever <mc...@apache.org> wrote:
>> 
>> Any objections to making these changes, at the very last minute, for 4.1-rc1 ?
>> This is CASSANDRA-12029 and CASSANDRA-7486
>> 
>> Provided we figure out patches for them in the next day or two.

Re: Should we change 4.1 to G1 and offheap_objects ?

Posted by Brandon Williams <dr...@gmail.com>.

CMS was deprecated in JDK 9. I don't see a good reason to follow it
until its dying breath, and we already have G1 ready in the jvm
options files, so this should be an easy switch. +1.

Kind Regards,
Brandon

On Wed, Nov 9, 2022 at 1:22 PM Mick Semb Wever <mc...@apache.org> wrote:
>
> Any objections to making these changes, at the very last minute, for 4.1-rc1 ?
> This is CASSANDRA-12029 and CASSANDRA-7486
>
> Provided we figure out patches for them in the next day or two.