Posted to dev@couchdb.apache.org by Russell Branca <ru...@chewbranca.com> on 2022/08/02 22:59:29 UTC

Chewbranca's Opinionated Vision for Apache CouchDB and Cloudant Dbcore

Hi all,

Chewbranca here. I've been fairly quiet for the last few years for a number of
reasons, but this year I've been becoming more involved again in the CouchDB
community and various Cloudant communities. I put together this writeup a while
ago to detail out what my personal views are on what I believe to be some of the
highest priority focus points for CouchDB/Cloudant and targeted approaches
for tackling these issues. Full disclosure, these are my own opinions discussing
my views on direction for Cloudant and CouchDB; most of the views here are
shared by a wider audience, but I want to be clear this is not a formal proposal
from Cloudant but rather the musings of a long standing Cloudant and CouchDB
member with plenty of strong opinions.

The first draft of this document was done as an internal writeup with the
intention of then creating a modified version that defines a set of core
"stakeholders" such that the various tasks defined demonstrate value to the core
stakeholders, but in particular, that all of this functionality is defined and
executed in a way to bring clear value to the Apache CouchDB project and Apache
CouchDB community. I strongly believe the majority of this functionality should
be in the core CouchDB database itself and I want to ensure that it is done in a
way to be clearly beneficial to _all_ stakeholders involved. I started adding in
the discussion of target audience and stakeholders, but the breakout details for
each of the discussion points below are incomplete.

Forgive the lack of polish in this writeup, it's still a draft, but I've already
sat on this longer than I wanted, and with the upcoming CouchDB developer
meeting tomorrow, I felt it was better to get it out now in a rough form than
worry about a completed writeup. We can always rip various pieces out and create
more specific and complete writeups. I've got some additional writeups and ideas
to get out to the mailing list as well, but this is a good starting point.

A quick note on naming, "Dbcore" is the name of the version of CouchDB that
Cloudant runs in production. Many years ago this was an actual fork, but since
the 2.x merge we're essentially running CouchDB upstream with additional
libraries added on for Cloudant specific functionality, like IAM auth and what
not. The majority of the fun stuff was open sourced and incorporated into the
3.x codebase. This document switches between Dbcore and CouchDB in various
areas, but again, this is more of a draft than a finished document.

Thanks for your time and happy hacking!


-Russell "Chewbranca" Branca


==============================================================================


# Chewbranca's Immediate Vision for Cloudant Dbcore and Apache CouchDB


Welcome to my vision of where to go with development and functionality of
Cloudant's Dbcore and Apache CouchDB, for the immediate future in 2022/2023.
This is an opinionated list of the things that I think are critically important
and also readily achievable. We may or may not get to the things in this list,
but here it is.

There's much to do, much that could be done, and endless possibilities. This
vision is targeted at the here and now, focused on immediately prevalent issues,
constraints, and challenges, skewed towards heavy hitters in terms of
time/resource investment to dividends paid. Not everything here is easy and
simple, but everything here is cherry picked because of its significant impact,
and also selected because it *needs* to be here.

This vision is also not exhaustive, for either Cloudant or CouchDB. I make no
claims that there aren't other important things to do, rather that the following
list of things *is* of substantial importance and provides a path to a paradigm
shift in the current state of Cloudant and CouchDB.


## Target Audience and Stakeholders

I want this writeup to target a few different core groups of people and
stakeholders to help understand motivations and help define scope. In
particular, I'm targeting three core groups of people:

  * CouchDB users/operators
  * Cloudant customers
  * Cloudant operators

As it stands today, these groups have very distinct interactions with the
database, and while there will also be distinctions between users of the system
and operators of the system, I think there's far more overlap between the three
groups than is currently represented. In particular, all three of these groups
need visibility into what the system is doing, which databases and documents are
the most active, and what impact requests have on the system. Both users
and operators need to have this visibility, and the more we can do to make it
easier for everyone to see into the activity of the system, the better.

The distinction between Cloudant operators and CouchDB operators is considerable,
as the tooling is fundamentally different, or rather heavily lacking in CouchDB
and attached at the periphery in Cloudant; which is neither ideal nor sustainable
in my opinion.

I think we can greatly improve the operability and visibility of the system for
all three groups of people by streamlining the tooling and building core
enhancements into CouchDB. Much of this writeup is geared towards cherry picked
improvements I think can have a significant impact. In the sections below, I've
elaborated on what benefit each of the different groups receives from the
changes and improvements, in order to help scope and justify my recommendations,
but also to make sure that the core stakeholders all benefit from these
improvements, and that we can minimize the differences of interactions with the
system between the groups.


Core goals of CouchDB users/operators:

  * Stable, reliable, performant, operable, and introspectable database system
  * Ability to pinpoint expensive requests
  * Visibility into what operations are utilizing resources
  * Easily scalable without hassle and complex re-architecting
  * Easy to operate, but also easy to know what to operate


Core goals of Cloudant customers:

  * Stable, reliable, performant, and introspectable database system
  * Ability to pinpoint expensive requests
  * Visibility into what operations are utilizing resources
  * Easily scalable without hassle and complex re-architecting


Core goals of Cloudant operators:

  * Stable, reliable, performant, operable, and introspectable database system
  * Ability to pinpoint expensive requests
  * Visibility into what operations are utilizing resources
  * Easily scalable without hassle and complex re-architecting
  * Easy to operate, but also easy to know what to operate


## Core Facets


All items below are directly focused on one or more of the following facets:

  * Scalability
  * Performance
  * Reliability
  * Robustness
  * Operability
  * Introspection
  * Predictability

Let's provide a quick description of what each of those means to me, what I see
are the core problems, and what are the core goals.


### Scalability


CouchDB is a durable distributed document store that can scale horizontally with
the number of nodes but also vertically with the volume of computation and
storage per node. CouchDB is managed by end users themselves, but knowing how
and when to scale a CouchDB cluster is still a dark art.

Cloudant is a "Cloud" database, part of IBM Cloud, that is powered by CouchDB.
Magical scaling is a primary expectation of modern Cloud computing, regardless
of the validity of that expectation, especially when taking into consideration
physical properties of shuffling bits around at the speed of light.

To me, our immediate concerns around scalability are being able to scale user
applications nearly linearly as we introduce more hardware, doing this
automagically for them, not needing operator intervention to do so, and
minimizing the ways customers can sabotage themselves in this endeavor. We
should be able to scale data volume and query computation independently of each
other, allowing for a wider variety of workloads.

Core issues:

  * Q scales single shard operations ~linearly, but inversely for aggregate
    operations. This is the fundamental scaling constraint of Cloudant/CouchDB
    and desperately needs to be solved.
  * IOQ2 has problematic workflows to address before it can be the default
  * Re-sharding is a manual process, should not necessitate ops involvement
  * Storage volume entirely too coupled to node count
    - we need to be able to add additional hard drives independently of the
      amount of computational power in the cluster
    - Storage and CPU are different axes and should be decoupled

Core Goals:

  * Decouple Q from computational complexity of aggregate queries
    - especially view queries, game changer. TXE delivered on this, now Dbcore
      needs to as well
    - See my "On the Scalability of CouchDB and Dbcore" writeup
      - NOTE: I'll post this soon
  * Ship `couch_file` cache, great overall, but also helps decouple Q
  * Ship IOQ2 fixes; enable IOQ2 by default
    - IOQ2 pid per shard fixes primary issues, but opens other questions around
      how we do rate limiting overall
    - Remove Clouseau requests from IOQ2
  * Scale core storage volume and computational capacity independently
  * Support multiple Dbcore volume mounts, drop use of raid 0
    - Have a volume mount per drive, rather than a raid 0 stripe on all drives
    - Balance database creation across all drives, rebalance during compaction

Core stakeholder benefits:

  * CouchDB users/operators:
    - more scalable and performant CouchDB system
    - eliminates many core scalability constraints
    - easier to run servers with multiple drives
    - more predictable scaling without needing lots of databases

  * Cloudant users:
    - more scalable and performant CouchDB/Cloudant service
    - eliminates many core scalability constraints
    - more predictable scaling without needing lots of databases

  * Cloudant operators:
    - easier to run servers with multiple drives
    - more predictably scalable service, easier to operate


### Performance


Performance can always be better, and how fast/slow we go heavily dictates the
markets and use cases we can serve. The vision here is less focused on making
things go faster, and more focused on preventing things from going slower. In
addition to not going slower, we should be able to better understand the
performance of the system and the _why_ of its performance.

Core issues:

  * Computational complexity of aggregate operations scales with Q
    - This *MUST* go away, full stop.
    - Eg view query performance needs to be dependent on rows queried, not
      overall size of the database or number of shards
    - CouchDB 4.x/TXE solved this, Dbcore needs to catch up
  * Every cluster is a performance snowflake; as is every request!
    - We need bounded performance, can't have individual requests invoke
      millions of operations, especially when that happens invisibly
    - Lack of cost accounting and visibility, can't even find the bad/expensive
      requests. This needs to be exposed to both ops and customers
  * Lack of back pressure results in unpredictably "bad" behavior and
    performance. Back pressure forces overload errors back to the client, but
    does so in a predictable and performant manner and allows us to cap unbounded
    queues and growth
  * IOQ2 noisy shards issues

Core goals:

  - See scalability core goal for "Decouple Q from computational complexity..."
  - No more unbounded growth!!!
  - Introduce back pressure
    - IOQ read timeouts, bubble back to caller
    - Throw 503 errors *IMMEDIATELY* rather than eventual 5xx errors after
      unbounded growth tanks the system
  - Eliminate "bad requests"; no "million dollar" requests
  - Extend CCM rate limits further into Dbcore
  - Fix IOQ2 to avoid noisy neighbors so it can be enabled by default
  - Potentially add rexi/fabric client/server rate limited queues
  - Introduce couch_file cache as a way to utilize heavy Erlang parallelism

Core stakeholder benefits:

  * CouchDB users/operators:
    - predictable performance, predictable overload
    - minimize unknown/unexpected behavior
    - make it easy to diagnose performance and overload issues
    - no unbounded requests

  * Cloudant users:

  * Cloudant operators:


### Reliability & Robustness


Before addressing these individually, I want to briefly cover both to clarify
the distinctions between the two that I'm highlighting here, as they're quite
similar. From the Google Oxford Dictionary definitions, we have:

  * Reliability: the quality of being trustworthy or of performing consistently
    well.
  * Robustness: the quality or condition of being strong and in good condition.

For the purposes here, I'm looking at "reliability" as Dbcore's ability to
provide predictable and consistent performance, on demand; and "robustness" as
Dbcore's ability to continue to operate in the face of innovative internal
failure scenarios and adverse external stimuli induced by customers. (For
simplicity in both of these definitions, I'm assuming non malicious load from
3rd parties and that the results delivered are "correct").

I want to take a step back and commend the various Cloudant employees and
CouchDB committers over the years, Dbcore/CouchDB is a pretty *damn* reliable
and robust database! Great job folks, Dbcore/CouchDB have been battle hardened
over many years of service and care, and we owe it to ourselves to remember that
it's put in work day in, day out for well over a decade. Huzzah and
congratulations!


### Reliability


Again, I'm using the definition of "Reliability" as "the quality of being
trustworthy or of performing consistently well".

Dbcore/CouchDB is a pretty damn reliable system... until it's not. Under normal
operating conditions the system is very reliable and performant, with minimal
issues arising. Unfortunately, shit inevitably always goes wrong, and that's a
major reason why customers pay us to provide the Cloudant service, so that when
things inevitably go wrong, we're there to clean up the mess. How we choose to
respond to issues and how we choose to have the database respond to issues
largely dictates our day to day operational overhead and also dictates what
types of service impairment are exposed to the customer. With dbcore, for
better or worse, we strive to *ALWAYS* be present and available, to always
return _something_, and to always save _something_ when data is written.

There are many advantages to this approach, but we err far too heavily on the
side of _always_ trying to perform the actions that have been asked of the
system, rather than establishing healthy boundaries and being able to say "no".
We allow the system to be stretched to its limits, beyond, and then even further
until it snaps violently. It's a classic example of trying to take on too much
work until the limits are surpassed and everything comes crashing down.

Fundamentally, we cannot continue to allow unbounded growth of pending work to
go into the system, and we need to introduce backpressure allowing us to say no
to additional work when overloaded. This is absolutely essential to achieving
consistent and trustworthy performance. A system that continues to take on more
work while progressively responding slower until it's so far backlogged that it
topples over and drops everything on the floor is *not* behaving in a
trustworthy manner, and is *not* performing consistently well.

As a service, saying "no" and throwing `503 Service Unavailable` errors are
often viewed as abject failure, but, all services inevitably fail, the question
is whether or not those systems fail in a trustworthy and consistently
performing way. Running reliably on a good day is easy, failing reliably on a
bad day is the true measure of the health and reliability of a service.
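
To make the shape of this concrete, here's a minimal, hypothetical sketch of the
pattern: bound the blocking call instead of waiting with `infinity`, and turn a
timeout into an explicit overload error that the http layer can map to a `503
Service Unavailable`. The module and names below are illustrative, not taken
from the actual codebase:

    %% Hypothetical sketch: a bounded call that sheds load instead of
    %% waiting forever. The server, message, and timeout are placeholders.
    -module(overload_sketch).
    -export([submit/3]).

    submit(Server, Work, Timeout) ->
        try
            gen_server:call(Server, {submit, Work}, Timeout)
        catch
            exit:{timeout, _} ->
                %% Surface overload explicitly; the http layer can turn this
                %% into a 503 rather than letting work queue up unbounded.
                {error, service_unavailable}
        end.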

Core issues:

  * Unbounded queue growth
    - can't answer every request in the world, need to be able to say "no"
  * During overload, the system performs progressively worse, progressively more
    inconsistently, and becomes progressively less trustworthy
  * Need to choose reliability instead of a "never say no" mentality

Core goals:

  * Introduce backpressure and load shedding
  * Embrace 503 errors
  * `infinity` is not real, stop pretending it's an ok duration for requests
    - IOQ timeouts of infinity need to go away
    - Load shed IOQ work that has timed out
  * Stop RPC work when coordinator request no longer present
  * Eliminate unbounded queue growth
  * Eliminate unbounded request duration, magnitude, and impact
    - Eg `/_changes` with a custom Javascript filter from `?since=0`


### Robustness


Again, I'm using the definition of "Robustness" as "the quality or condition of
being strong and in good condition."

Dbcore/CouchDB is a pretty damn robust system... until it's not. Under normal
operating conditions the system is very strong and resilient, with minimal
issues arising. Unfortunately, shit inevitably always goes wrong. Failures are
mostly well isolated in the system, in large part thanks to Erlang's OTP system
we make extensive use of, but also because we have spent a *lot* of human hours
diagnosing, debugging, fixing, and improving the system. That said, I think we
have improvements to make on our modus operandi in terms of what our catalysts
are for finding and fixing bugs. To provide a consistently strong and in good
condition service, we need to be aggressive with eliminating known defects,
minimizing the introduction of new defects, and auditing the system regularly to
keep ourselves honest.

I think a lot of our Robustness issues are shared with the Reliability category
in terms of allowing the system to get backed up with unbounded growth until it
crumbles, so I won't reiterate those here. Instead, for Robustness, I want to
focus on how we approach finding and fixing bugs.

Right now, we have an abundance of "bug reports" in our internal logging
systems, Dialyzer, xref, and other tools. These are factual bugs that we ignore
by letting the OTP system restart crashed processes, or by triggering full
dbcore reboots. We ignore these errors unless they're sufficiently service
impacting or a customer yells loudly enough. Many of these bugs have isolated
impact, but a number of them end up tanking larger parts of the supervision
hierarchies, and I suspect are responsible for a lot of service "blips" we
encounter. On the more extreme side, we have bugs like the `all_dbs_active`
errors that render the system in an unknown impaired state, with partial and
sporadic functionality.

We need to fix bugs we're aware of, eliminate the excessive quantity of logs
generated by these bugs, and improve the service by eliminating various known
sharp edges. Most of these bugs are readily accessible to be fixed, and
essentially have bug reports provided for us that we ignore. Some of these bugs
are more complicated to resolve than others, but there's considerable low
hanging fruit here to fix for large impact.

The `all_dbs_active` bug is an interesting case study here. The fundamental
issue is that opening a database handle inside CouchDB/dbcore can sometimes
return the error `all_dbs_active`, and most callers are not prepared to handle
that error, as it's not a common error and we currently do not ensure all errors
that can arise are properly handled. The major problem here is that any
subsystems unequipped to handle this error at best crash, or at worst, partially
fail in mysterious ways rendering various subsystems unresponsive or behaving in
unexpected ways. What's especially interesting in this case study is that I was
able to add a three line Dialyzer patch codifying the `all_dbs_active` error
return type, and it immediately found sixteen instances of the unhandled error!
This was a fantastic result and makes me incredibly excited by the prospect of
more heavily utilizing Dialyzer, which will deliver us bug reports with minimal
investment.
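
For illustration, the kind of spec change involved looks roughly like the
following. This is a sketch rather than the actual patch, and the function name
and the other error reason shown are placeholders; the key point is that naming
`all_dbs_active` in the return type lets Dialyzer flag callers that only pattern
match on the `{ok, Db}` success case:

    %% Sketch only: codify all_dbs_active as a possible error return so that
    %% Dialyzer can find callers that never handle it.
    -spec open(DbName :: binary(), Options :: [any()]) ->
        {ok, term()} | {error, all_dbs_active | not_found}.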

I think fixing known bugs, reducing the volume of bug logs, and leaning on
Dialyzer (and xref) to provide some powerful static analysis allowing us to
track down existing bugs, minimize new bugs, and have more confidence in the
codebase will go a long ways towards improving the Robustness of CouchDB as a
whole, but also the dbcore service.

Fixing random small bugs is not glorious, but I really do think it's one of the
most important aspects of this entire writeup.

Core issues:

  * Bug reports in logs are ignored
  * Bug reports in Dialyzer are ignored
  * Bug reports in xref are ignored
  * Brutal bugs like `all_dbs_active` exist and their extent/impact are unknown

Core goals:

  * Audit dbcore logs regularly and get all bugs into backlog to be fixed
  * Establish a culture of checking dbcore logs and keeping them minimal
  * Revamp usage of Dialyzer:
    - Greatly extend spec coverage of the code base
    - Fix bugs Dialyzer finds
    - Establish a culture of fixing new Dialyzer bugs found
  * Make use of xref (see the quick sketch after this list)
    - this is a no brainer, great tool, easy, and just delivers bug reports
  * Fix `all_dbs_active` error
    - Calling it out individually here as it's gnarly and exemplary of long
      standing bugs with massive service impact. This bug affects nearly all
      core subsystems of CouchDB/dbcore
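
To make the xref item above concrete, here's a quick, illustrative way to run a
couple of xref analyses by hand from an Erlang shell against a directory of
compiled beam files; the "ebin" path is a placeholder and CouchDB's actual build
integration may look different:

    %% Illustrative only: point xref at compiled modules and ask for the
    %% standard "bug report" analyses.
    1> xref:d("ebin").                    %% quick one-shot analysis
    2> xref:start(s).                     %% or drive it via the xref server
    3> xref:add_directory(s, "ebin").
    4> xref:analyze(s, undefined_function_calls).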


### Operability & Introspection


Similarly to Reliability and Robustness, there is a lot of overlap between
Operability and Introspection. For the purposes here, I'm focusing on
Introspection as our ability to understand what is happening in the cluster, and
Operability as our ability to interact with and modify the underlying state of
the cluster. These two have considerable overlap but have potentially different
audiences.


### Operability


My number one priority for Operability is to make it easier for Cloudant
operators and CouchDB users to operate clusters by simplifying what needs to be
done and reducing the amount of things that need human intervention. This is not
an exhaustive list, rather a cherry picked set of things I think are high
priority, and as such, there are two core focuses here.

First off is resharding. This *needs* to become a first class construct of
CouchDB/dbcore that is completely hands off for operators. I think we should
take it a step further and automate the resharding logic based on write volume
to the individual shards, but that requires disconnecting Q from the
computational complexity of view queries. So as a first step, we should
internalize resharding into CouchDB such that it's self managing and interacted
with by a jobs API similar to the replicator jobs system (although ideally
without direct database operations). Making it seamless so that operators do not
need to be involved outside of initiating the jobs is essential. Additionally,
this functionality should also include rebalancing, as the two are very similar
in functionality and if thought through properly could share the majority of
logic. Both of these items should also be done in a way that can be readily
regulated, allowing for safe migrations with minimal impacts on production
workloads.

Next up is making core operational tooling a first class construct of
CouchDB. Cloudant independently deploys operational tooling alongside Dbcore so
that it can be deployed outside of normal release processes as operational
issues arise. This is always done with the intent of moving it into CouchDB
core, but in practice it has a tendency to stay to the side, rather than
becoming core functionality of CouchDB that all users can benefit from. We need
to make the operational tooling a core part of CouchDB so that it's maintained,
improved, documented, etc, along with the core database. There are also many
improvements that can be made here, but getting operational tooling as a core
construct of the database is the first step.

Core issues:

  * resharding/rebalancing, while necessary and standard operations, are very
    complicated, error prone, and labor intensive
  * Lack of internalized operational tooling
  * Insufficient introspection to properly understand system usage and
    saturation

Core goals:

  * Managed resharding/rebalancing engine
    - ideally automatic based on write volume in conjunction with sorted key
      order view sharding
  * First class operational capabilities in CouchDB/dbcore
    - migrate Cloudant secondary ops tooling into the core database
    - establish culture of operational tooling in core database
    - we can and should do better on this front
  * Increased introspection capabilities to properly understand system usage,
    saturation, and to facilitate finding hot spots


### Introspection


Visibility is critical to understanding system usage, saturation, and overall
behavior. CouchDB has a variety of introspection capabilities, and combined with
the Erlang VM you can technically see whatever you want, but that doesn't make
it easy to see what's going on. My number one issue with CouchDB/dbcore
visibility is that we can't see the things we don't know about! With the large
variety of API capabilities and interactions within the core database, there is
much whose underlying cost we're not aware of. For instance, a filtered changes
feed from the beginning of a database's changes seq, with a moderately large
limit and a filter function that returns a small subset of the data, is a
singular http request: it's logged as a singular http request, and it creates
individual Erlang processes for the http request handler and the rpc handlers
per shard, but it can potentially manifest as an entire database scan that runs
every document through Javascript engine filters, only to return fewer than the
limit of rows, and it is damn hard to find if you're not looking for it.

CouchDB has a widely diverse API in terms of underlying resource cost of
satisfying the request. Individual doc lookups load a single doc from `N` shard
replicas, whereas filtered changes feeds can perform an entire database scan and
Javascript invocation for every doc in the system across all shard ranges, and
then views land somewhere in the middle where they're directly proportional to
the number of shards in the database. This leaves the conundrum of: do you scale
your database for single doc reads, or for view queries? In the current
scalability model of CouchDB, you must increase the number of shards to increase
the capabilities of CouchDB to utilize available CPU cores. Single doc
operations scale beautifully with the number of shards, for a single database or
many, whereas aggregate operations like view queries scale inversely with the
number of shards in the database, requiring a larger number of small databases
for horizontal scalability.

CouchDB 4.x fixed this by way of key ordered sharding that results in only
needing to interact with the appropriate shards for the query, and we can do
that on the 3.x codebase line as well, as I've written up elsewhere. But this is
the `Introspection` section of this vision, not performance, so the relevant
point here is that the wildly different consumption of resources per request type
is *challenging* to introspect and understand, and that needs to change. We need
to introduce cost accounting to how CouchDB utilizes resources both as a
function of incoming http requests and internal background jobs.

Luckily, I think getting this data and exposing it is not only one of the easiest
tasks on this list, but also one of the most impactful. Understanding what the
system is doing, where it's consuming resources, and why, is absolutely critical
for anyone interacting with CouchDB, from users to operators. And even to
developers, because without proper visibility into resource consumption, it's
challenging at best to understand what the expensive operations are. From my
analysis, view query performance computational complexity is roughly `O(Q^2 *
log(Q))`, and I've been vocal about that for a long time now, but things like
innocuous changes requests consuming massive resources continue to catch me off
guard because we can't see the data on _what_ is expensive without studying
particular areas in depth.

This is a solvable problem: we need to begin cost accounting database operations
for number of rpc calls, number of nodes involved, number of shards involved,
number of b-tree lookups, number of IO requests, quantity of data, number of JS
invocations, number of mergesorts, rev count of docs involved, magnitude of
requests in terms of docs processed and docs returned, number of Erlang
messages sent, etc, and then we need to log this data in a way that is
introspectable retroactively and also made readily available for real time
visibility in Fauxton and various metric engines. Knowing the most expensive
requests and where the system is spending its time is useful for _everyone_ who
interacts with CouchDB/Cloudant.
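
As a purely hypothetical sketch of the kind of per-request cost data described
above (the record and field names are assumptions, not an existing CouchDB
structure), a coordinator could accumulate something like the following and emit
it to logs, metrics, and per database endpoints:

    %% Hypothetical per-request cost record; field names are illustrative.
    -record(request_cost, {
        db                 :: binary(),           % database name
        endpoint           :: binary(),           % e.g. <<"_changes">>
        rpc_calls = 0      :: non_neg_integer(),  % rexi messages sent
        nodes = 0          :: non_neg_integer(),  % nodes involved
        shards = 0         :: non_neg_integer(),  % shard ranges touched
        btree_lookups = 0  :: non_neg_integer(),
        io_requests = 0    :: non_neg_integer(),
        js_invocations = 0 :: non_neg_integer(),  % filter/map calls
        docs_processed = 0 :: non_neg_integer(),
        docs_returned = 0  :: non_neg_integer()
    }).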

We have obvious scalability bottlenecks that can be fixed and targeted, but
until we get proper visibility into the cost of various aspects of the system we
won't know about the non obvious areas we need to fix, and we won't be able to
easily see _where_ those bottlenecks are being pushed. It's one thing to know
that you're overloading the JS engine, it's another thing to know why and where
that's happening. Similarly, it's one thing to know document updates are slow,
it's another thing to understand that it's because of that one heavily
conflicted document tanking performance.

Core issues:

  * Incredibly difficult to find low quantity but high impact requests
  * Lack of database resource accounting means we don't know what the system is
    doing at any given time or why, making it difficult to operate, difficult to
    scale and capacity plan, and very difficult for customers to understand the
    impact of their workload on the underlying system

Core goals:

  * Introduce cost accounting
    - Track resource consumption of all http requests and internal traffic
    - Log this data in an auditable way
    - Expose internal realtime tooling for operators to directly visualize
      current activity on the cluster
    - Expose this functionality to Fauxton so CouchDB and Cloudant users can
      utilize this info
    - Per database API endpoints making it easy to get isolated stats
  * Use cost accounting as a way to understand what the expensive requests are
    and make informed decisions about areas to fix to reduce said cost
  * Reduce log volume as detailed above, making logs more readily useful and
    more indicative of relevant/new issues


### Predictability


The most "predictable" thing about CouchDB and dbcore right now is that the
database will do it's best to handle whatever load is thrown at the system,
regardless of its ability to handle said load, which leads to unpredictable
behavior and fireworks. While it's noble to attempt to handle _all_ requests
thrown at the system, reality is that there are finite limits to what the
database can handle, and ignoring that reality leads to unpredicatable ways in
which the system can fail and when. This is not a good user experience for
anyone, and makes it very challenging to understand _why_ the system fails, when
it _will_ fail, and what the upper bounds are. Waiting for the system to trigger
timeouts and/or get OOM killed is ignoring the problem rather than a direct
approach to handle overload and introducing backpressure.

In my experience, the current approach makes it very difficult to understand and
predict how the system will behave when it approaches saturation limits and
overload, which results in users being perplexed as to why things went bad, and
leads to operators going on wild goose chases to understand why particular
subsystems did or did not fail when they're all emblematic of the same overload
failure scenarios.

Bottom line is that in my opinion we should strive to throw a predictable
overload error allowing the system to continue operations in a predictable manner
rather than allowing unbounded growth of queues until overload causes timeouts
and/or full system failure.


Core issues:

  * Unpredictable failure modes, but predictably inevitable
  * Unbounded growth leads to degraded performance and eventual total failure
  * Failures arise somewhat arbitrarily as a function of system saturation,
    making it difficult for operators to understand and near impossible for users

Core goals:

  * Introduce backpressure to deliver predictable errors at overload rather than
    unexpected total system failures and/or excessively long timeouts
  * Expose backpressure in a manner that makes it easy for the underlying user
    to understand what limits they tripped
  * No more unbounded growth
  * No more infinite timeout requests


## Core Audiences and Stakeholders


I intend to expand this section more to define a few stakeholders along the
lines of Cloudant users, CouchDB users, Cloudant operators, Cloudant/CouchDB
developers. I think this is useful in general, but in particular for this
writeup, I think the majority of the items here require core changes to CouchDB
and should not be done by way of plugins, so we need to ensure they're
beneficial to not just Cloudant but also all CouchDB users. A lot of these items
are self evident, like fixing bugs, which is obviously beneficial, but others are
more focused on hardening the core service, and for those, I think we need to
make these changes in ways that make it much easier for not just Cloudant
operators and Cloudant customers to gain visibility, insights, and operability in
the system, but for anyone using CouchDB to benefit from this visibility and ease
of use. Making it easier for CouchDB users to understand what the database is
doing, where it's spending resources, and what problems arise, and to take action
to diagnose and alleviate issues, is not only incredibly valuable to the CouchDB
community as a whole, but IMO would also go a long way in making it easier for
non core Cloudant/CouchDB developers to understand and operate the database,
which is currently far too complex and opaque.

The initial version of this writeup is for internal Cloudant audiences, but most
everything here is directly applicable to CouchDB and as such I intend to make a
modified version of this list to introduce to the CouchDB mailing list in the
near future to be part of the CouchDB 3.x vs 4.x discussions.


## Other Areas of Interest

There are a handful of other areas of interest in Cloudant and CouchDB that are
pain points and/or need to be addressed. Again, this list was never meant to be
exhaustive, and rather tightly focused on things I believe we *can* and *should*
deliver soon, but for the sake of visibility I'll mention a few here. Also, a
purpose of this writeup is to connect various threads of communication I've had
with various overlapping groups of people into one place, so a few of these
items are outlined below as further avenues of discussion.

  * Conflicts...
    - Yeap this is a gnarly one, nope I don't have an immediate simple solution
      in mind so it's not listed in the above list
  * CouchDB/Cloudant "Metal"
    - CouchDB/Cloudant "Metal" is the codename I've been bouncing around for the
      idea of a hardened CouchDB service built from the ground up to only focus
      on a subset of functionality in the current API that we can scale
      aggressively and predictably.
    - Due to time constraints I have not outlined this further here, but this is
      an intriguing option that maps well to both CouchDB users and Cloudant
      services billing by request/usage volume
    - The key thing is starting with a subset of core functionality we have
      confidence in our ability to scale, and then go from there, only adding
      additional layers to the stack we can scale properly
    - I think the best way we can encourage users to move off of expensive
      CouchDB/Cloudant API operations is by incentivizing them to use more
      scalable operations that utilize less hardware resources and then provide
      clear introspection capabilities demonstrating the impact of the slower
      and more costly API operations utilized
  * FDB style Namespacing
    - If we adopt the approach outlined in my CouchDB/Dbcore scalability
      writeup, I believe we could readily adopt a Namespace model like FDB
      utilizes, making a lot of the notions about databases and shards and on
      disk files go away. This would work fine for views, but Search would be
      trickier, although that's the case for Search in that scalability approach
      in general; I suspect we'll need Search to be a standalone service long
      term
  * Reintroduce Spans/Opentracing/whatever this thing is called
    - I really like the Span system we were putting into place allowing us to
      understand performance and throughput and various "spans" of the stack.
      Hopefully things have stabilized and we can use this someday
  * Disaggregated/heterogeneous CouchDB/Cloudant Architecture
    - CouchDB/Dbcore is a distributed, mostly stateless http app stack talking to
      a distributed database app all coupled into one system. This has served us
      well and has a number of advantages, but I think long term, especially for
      a CouchDB/Cloudant Metal like system, we need to properly isolate the
      storage layer allowing us to scale that independently from the http stack,
      the indexing stack, replication stack, etc.
  * New features in BEAM, and all the various BEAM VM args
    - Tons of low hanging fruit in the VM args options: things like starting
      process heap size or memory carriers, let alone all the various IO tuning
      knobs, are a wealth of potential
    - We're also leap frogging up to OTP 23, there's been considerable
      advancement in OTP over the years, what can we incorporate?
  * Consolidated Cloudant and CouchDB auth systems
    - Cloudant's auth system (not the IAM one) and CouchDB's auth system have a
      lot of overlap, but neither is a full replacement for the other and both
      have functionality desired in the other; we should merge these down to a
      singular, more capable auth system with proper RBAC
  * Consistent and Simple dev environment
    - Enough with having this be a mess. We need a simple consistent dev
      environment. I don't believe the argument around having folks use
      different setups to test things has panned out well, but it has certainly
      wasted much time and I don't want to keep dealing with this. Obviously this
      is still contentious and has some issues, but I personally want a reliable and
      maintained Vagrant VM setup and provisioner that can easily bootstrap all
      the things. I extracted the core setup I used a few years back:
      https://github.com/chewbranca/hackcouch


## Next Steps

This is currently an initial version of this document and I'm getting it out to
a wider audience to get more feedback sooner rather than later, especially with
the CouchDB 3.x vs 4.x discussion happening now. I intend to continue refining
this document a bit, and making a version we can share publicly. The other
primary next step for me is to detail out the various `Core Goals` bullet points
as actionable items in distinct tickets, with a better definition of goals and
scope for each item. We can then
discuss/edit/approve/delete those items, but I think having separate tickets
will simplify the discussion around many of these issues and help us see how to
achieve them.