Posted to dev@cassandra.apache.org by Carl Mueller <ca...@smartthings.com> on 2018/02/22 20:39:40 UTC

Why isn't there a separate JVM per table?

GC pauses may have been improved in newer releases, since we are in 2.1.x,
but I was wondering why cassandra uses one jvm for all tables and
keyspaces, intermingling the heap for on-JVM objects.

... so why doesn't cassandra spin off a jvm per table, so each jvm's heap and
gc can be tuned per table and gc pauses in one table don't impact other
tables? It would probably increase the number of endpoints if we avoid having
an overarching query router.

Re: Why isn't there a separate JVM per table?

Posted by Carl Mueller <ca...@smartthings.com>.
Alternative: JVM per vnode.


Re: Why isn't there a separate JVM per table?

Posted by Nate McCall <zz...@gmail.com>.
Agree that any first efforts around compaction should be on profiling.
Probably some low-hanging fruit there.


Re: Why isn't there a separate JVM per table?

Posted by Jeff Jirsa <jj...@gmail.com>.
Bloom filters are offheap.

To be honest, there may come a time when it makes sense to move compaction
into its own JVM, but it would be FAR less effort to just profile what
exists now and fix the problems.




Re: Why isn't there a separate JVM per table?

Posted by Rahul Singh <ra...@gmail.com>.
I agree with Jon. The actor-based model would be the logical approach to becoming more “efficient.” Until then, fault tolerance has to be built into the driver so it contacts another node if one is in the middle of a pause, and then reconciles the commitlog later.

I’ve seen many people bolt an external queue onto the system to deal with the GC issues by adding yet another layer of asynchronicity. (If that’s not a word, it is now.)

Even in systems like SQL Server there are internal queues that get locked up due to memory, storage, or CPU pressure. It’s not a GC pause, but it may as well be. Even with all the tweaking, the only way to get beyond that is distributed, asynchronous systems that are self-healing.
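
As a rough illustration of the driver-side fallback described above, here is a minimal sketch using the DataStax Java driver 3.x speculative execution policy. The contact point, keyspace, and table names are placeholders, and the delay/retry numbers are arbitrary:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.Statement;
    import com.datastax.driver.core.policies.ConstantSpeculativeExecutionPolicy;

    public class SpeculativeReadExample {
        public static void main(String[] args) {
            // If the first replica stalls (e.g. a long GC pause), the driver
            // fires the same read at another node after 200ms, up to 2 extra tries.
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")
                    .withSpeculativeExecutionPolicy(
                            new ConstantSpeculativeExecutionPolicy(200, 2))
                    .build();
            Session session = cluster.connect();

            // Speculative execution only applies to statements marked idempotent,
            // since the same query may end up executing more than once.
            Statement read = new SimpleStatement(
                    "SELECT * FROM my_ks.my_table WHERE id = ?", "some-id")
                    .setIdempotent(true);
            System.out.println(session.execute(read).one());
            cluster.close();
        }
    }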

--
Rahul Singh
rahul.singh@anant.us

Anant Corporation


Re: Why isn't there a separate JVM per table?

Posted by Brian Hess <br...@gmail.com>.
Something folks haven't raised, but which would be another impediment here: in Cassandra, if you submit a batch (logged or unlogged) for two tables in the same keyspace with the same partition, then Cassandra collapses them into the same Mutation and the two INSERTs are processed atomically. There are a few (maybe more than a few) things that take advantage of this fact.

If you move each table to its own JVM then you cannot really achieve this atomicity. So, at most you would want to consider a JVM per keyspace (or consider touching a lot of code or changing a pretty fundamental/deep contract in Cassandra). 
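
As a concrete illustration of that contract, a sketch using the DataStax Java driver 3.x: the keyspace and table shapes are hypothetical, and both INSERTs deliberately share the same partition key value, which is what lets the server collapse them into one Mutation.

    import com.datastax.driver.core.BatchStatement;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class SamePartitionBatch {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("my_ks");

            // Both inserts target the same keyspace and the same partition key
            // ("u1"), so the server applies them as a single Mutation, atomically
            // on each replica, per the contract described above.
            BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
            batch.add(new SimpleStatement(
                    "INSERT INTO users (id, name) VALUES (?, ?)", "u1", "Alice"));
            batch.add(new SimpleStatement(
                    "INSERT INTO user_emails (id, email) VALUES (?, ?)", "u1", "a@example.com"));
            session.execute(batch);
            cluster.close();
        }
    }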

---->Brian

Sent from my iPhone



Re: Why isn't there a separate JVM per table?

Posted by "J. D. Jordan" <je...@gmail.com>.
I would be careful with anything per table for memory sizing. We used to have many caches and things that could be tuned per table, but they have all since changed to being per node, as it was a real PITA to get them right.  Having to do per table heap/gc/memtable/cache tuning just sounds like a usability nightmare.

-Jeremiah 



Re: Why isn't there a separate JVM per table?

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
There's an incredible amount of work that would need to be done in order to
make any of this happen.  Basically a full rewrite of the entire codebase.
Years of effort.

The codebase would have to move to a shared-nothing actor & message based
communication mechanism before any of this is possible.  Fun in theory, but
considering removing singletons has been a multi-year, many-failure effort,
I suspect we might need 10 years to refactor Cassandra to use multiple
JVMs.  By then maybe we'll have a pauseless / low pause collector and it
won't matter.
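
Purely as a sketch of the shared-nothing, message-driven shape described above, and nothing like Cassandra's actual code, here is a toy manager/worker pair in plain Java (records need Java 16+): the worker owns no shared state and only consumes immutable messages.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class MessageDrivenCompaction {
        // Immutable message: workers never touch shared mutable state.
        record CompactionTask(String table, long bytes) {}

        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<CompactionTask> inbox = new LinkedBlockingQueue<>();

            // A worker (stand-in for a separate process/JVM) that only
            // communicates through the queue.
            Thread worker = new Thread(() -> {
                try {
                    while (true) {
                        CompactionTask task = inbox.take();
                        System.out.printf("compacting %s (%d bytes)%n",
                                task.table(), task.bytes());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            worker.setDaemon(true);
            worker.start();

            // The "smarter manager" decides what to schedule and just sends messages.
            inbox.put(new CompactionTask("my_ks.my_table", 128L << 20));
            Thread.sleep(100); // let the worker drain the queue before exit
        }
    }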


Re: Why isn't there a separate JVM per table?

Posted by kurt greaves <ku...@instaclustr.com>.
>
>  ... compaction on its own jvm was also something I was thinking about, but
> then I realized even more JVM sharding could be done at the table level.


Compaction in its own JVM makes sense. At the table level I'm not so sure
about. Gotta be some serious overheads from running that many JVMs.
Keyspace might be reasonable purely to isolate bad tables, but for the most
part I'd think isolating every table isn't that beneficial and pretty
complicated. In most cases people just fix their modelling so that they
don't generate large amounts of GC, and hopefully test enough so they know
how it will behave in production.

If we did it at the table level we would inevitably have to make each
individual table incredibly tunable, which would be a bit tedious IMO.
There's no way for us to smartly decide how much heap/memtable space/etc
each table should use (not without some decent AI, anyway).

Re: Why isn't there a separate JVM per table?

Posted by Carl Mueller <ca...@smartthings.com>.
Bloom filters... nevermind


Re: Why isn't there a separate JVM per table?

Posted by Carl Mueller <ca...@smartthings.com>.
Is the current reason for a large starting heap due to the memtable?

Re: Why isn't there a separate JVM per table?

Posted by Carl Mueller <ca...@smartthings.com>.
 ... compaction on its own jvm was also something I was thinking about, but
then I realized even more JVM sharding could be done at the table level.

Re: Why isn't there a separate JVM per table?

Posted by Jon Haddad <jo...@jonhaddad.com>.
Yeah, I’m in the compaction on its own JVM camp, in an ideal world where we’re isolating crazy GC-churning parts of the DB.  It would mean reworking how tasks are created and removal of all shared state in favor of messaging + a smarter manager, which imo would be a good idea regardless.

It might be a better use of time (especially for 4.0) to do some GC performance profiling and cut down on the allocations, since that doesn’t involve a massive effort.  
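
On the profiling point, one low-effort way to quantify allocations on a suspect code path is HotSpot's com.sun.management extension of ThreadMXBean; a rough, HotSpot-specific sketch, where the allocation loop is just a stand-in for the code under test:

    import java.lang.management.ManagementFactory;

    public class AllocationProbe {
        public static void main(String[] args) {
            // HotSpot-specific extension of the standard ThreadMXBean.
            com.sun.management.ThreadMXBean bean =
                    (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
            long tid = Thread.currentThread().getId();

            long before = bean.getThreadAllocatedBytes(tid);
            byte[][] garbage = new byte[1000][];
            for (int i = 0; i < garbage.length; i++) {
                garbage[i] = new byte[1024]; // the code path under test goes here
            }
            long after = bean.getThreadAllocatedBytes(tid);
            System.out.println("allocated ~" + (after - before)
                    + " bytes for " + garbage.length + " arrays");
        }
    }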

I’ve been meaning to do a little benchmarking and profiling for a while now, and it seems like a few others have the same inclination as well, maybe now is a good time to coordinate that.  A nice perf bump for 4.0 would be very rewarding.

Jon


Re: Why isn't there a separate JVM per table?

Posted by Nate McCall <zz...@gmail.com>.
I've heard a couple of folks pontificate on compaction in its own
process as well, given it has such a high impact on GC. Not sure about
the value of individual tables. Interesting idea though.



Re: Why isn't there a separate JVM per table?

Posted by Gary Dusbabek <gd...@gmail.com>.
I've given it some thought in the past. In the end, I usually talk myself
out of it because I think it increases the surface area for failure. That
is, managing N processes is more difficult than managing one process. But
if the additional failure modes are addressed, there are some interesting
possibilities.

For example, having gossip in its own process would decrease the odds that
a node is marked dead because STW GC is happening in the storage JVM. On
the flipside, you'd need checks to make sure that the gossip process can
recognize when the storage process has died vs just running a long GC.
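
A minimal sketch of the kind of check described here, assuming the storage JVM publishes a heartbeat timestamp over some IPC channel; ProcessHandle needs Java 9+, and all names are hypothetical. The design point is that liveness and responsiveness are separate signals: a dead pid means crashed, a stale heartbeat from a live pid most likely means a long pause.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Optional;

    public class StorageWatchdog {
        enum State { HEALTHY, LIKELY_GC_PAUSE, DEAD }

        // lastHeartbeat would be updated by the storage JVM over some IPC channel.
        static State check(long storagePid, Instant lastHeartbeat, Duration maxLag) {
            Optional<ProcessHandle> handle = ProcessHandle.of(storagePid); // Java 9+
            boolean alive = handle.map(ProcessHandle::isAlive).orElse(false);
            if (!alive) {
                return State.DEAD; // process exited: safe to report it down
            }
            boolean stale = lastHeartbeat.plus(maxLag).isBefore(Instant.now());
            // Alive but silent: almost certainly a stop-the-world pause,
            // so don't gossip it as dead.
            return stale ? State.LIKELY_GC_PAUSE : State.HEALTHY;
        }
    }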

I don't know that I'd go so far as to have separate processes for
keyspaces, etc.

There is probably some interesting work that could be done to support the
orgs who run multiple cassandra instances on the same node (multiple
gossipers in that case is at least a little wasteful).

I've also played around with using domain sockets for IPC inside of
cassandra. I never ran a proper benchmark, but there were some throughput
advantages to this approach.
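
For reference, JDK 16+ exposes Unix domain sockets directly through NIO (experiments like the one described above would earlier have needed JNI or a native transport); a self-contained sketch with a made-up socket path:

    import java.net.StandardProtocolFamily;
    import java.net.UnixDomainSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class DomainSocketIpc {
        public static void main(String[] args) throws Exception {
            Path socketPath = Path.of("/tmp/cassandra-ipc.sock"); // hypothetical path
            Files.deleteIfExists(socketPath);
            UnixDomainSocketAddress addr = UnixDomainSocketAddress.of(socketPath);

            try (ServerSocketChannel server =
                         ServerSocketChannel.open(StandardProtocolFamily.UNIX)) {
                server.bind(addr);
                try (SocketChannel client = SocketChannel.open(addr)) {
                    // The message sits in the socket buffer until the server accepts.
                    client.write(ByteBuffer.wrap("ping".getBytes(StandardCharsets.UTF_8)));
                    try (SocketChannel accepted = server.accept()) {
                        ByteBuffer buf = ByteBuffer.allocate(16);
                        accepted.read(buf);
                        System.out.println(new String(buf.array(), 0, buf.position(),
                                StandardCharsets.UTF_8));
                    }
                }
            } finally {
                Files.deleteIfExists(socketPath);
            }
        }
    }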

Cheers,

Gary.



Re: Why isn't there a separate JVM per table?

Posted by Michael Kjellman <kj...@apple.com>.
It's an interesting idea. I'd wonder how much overhead you'd end up with from message parsing, and whether that would negate any potential GC wins. Rick Branson had played around a bunch with running storage nodes and doubling down on the old "fat client" model. If you had 10000 tables (yes, it barely works, but we don't explicitly prevent it) you can't really run that many JVM processes on a single box; even at a few tens of MB of baseline footprint per JVM, that's hundreds of GB of memory before a single row is stored.
