You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@kudu.apache.org by mp...@apache.org on 2016/03/10 22:25:28 UTC

incubator-kudu git commit: Move tablet design docs to docs/design-docs

Repository: incubator-kudu
Updated Branches:
  refs/heads/master 03b2f1d26 -> b7f859ed1


Move tablet design docs to docs/design-docs

Also convert from plain text to MarkDown.

Change-Id: I657cb3eb8ce054447b0681594817a70e9538ce97
Reviewed-on: http://gerrit.cloudera.org:8080/2499
Reviewed-by: Adar Dembo <ad...@cloudera.com>
Tested-by: Mike Percy <mp...@apache.org>


Project: http://git-wip-us.apache.org/repos/asf/incubator-kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-kudu/commit/b7f859ed
Tree: http://git-wip-us.apache.org/repos/asf/incubator-kudu/tree/b7f859ed
Diff: http://git-wip-us.apache.org/repos/asf/incubator-kudu/diff/b7f859ed

Branch: refs/heads/master
Commit: b7f859ed11d891b51257b26761b9fe4897ed723f
Parents: 03b2f1d
Author: Mike Percy <mp...@apache.org>
Authored: Wed Mar 9 16:38:40 2016 +0200
Committer: Mike Percy <mp...@apache.org>
Committed: Thu Mar 10 21:24:47 2016 +0000

----------------------------------------------------------------------
 docs/design-docs/README.md                     |   4 +
 docs/design-docs/compaction-policy.md          | 420 ++++++++++++++++++++
 docs/design-docs/compaction.md                 |  94 +++++
 docs/design-docs/schema-change.md              | 114 ++++++
 docs/design-docs/triggering-maintenance-ops.md | 211 ++++++++++
 src/kudu/tablet/compaction-policy.txt          | 397 ------------------
 src/kudu/tablet/compaction.txt                 |  95 -----
 src/kudu/tablet/schema-change.txt              | 107 -----
 src/kudu/tablet/triggering-maintenance-ops.txt | 211 ----------
 9 files changed, 843 insertions(+), 810 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/b7f859ed/docs/design-docs/README.md
----------------------------------------------------------------------
diff --git a/docs/design-docs/README.md b/docs/design-docs/README.md
index bb88fea..b52063c 100644
--- a/docs/design-docs/README.md
+++ b/docs/design-docs/README.md
@@ -29,4 +29,8 @@ made.
 | [Master design](master.md) | Master | N/A |
 | [RPC design and impl. details](rpc.md) | RPC | N/A |
 | [Tablet design, impl. details and comparison to other systems](tablet.md) | Tablet | N/A |
+| [Tablet compaction design and impl.](compaction.md) | Tablet | N/A |
+| [Tablet compaction policy](compaction-policy.md) | Tablet | N/A |
+| [Schema change design](schema-change.md) | Master, Tablet | N/A |
+| [Maintenance operation scheduling](triggering-maintenance-ops.md) | Master, Tablet Server | N/A |
 | [C++ client design and impl. details](cpp-client.md) | Client | N/A |

http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/b7f859ed/docs/design-docs/compaction-policy.md
----------------------------------------------------------------------
diff --git a/docs/design-docs/compaction-policy.md b/docs/design-docs/compaction-policy.md
new file mode 100644
index 0000000..dee61d1
--- /dev/null
+++ b/docs/design-docs/compaction-policy.md
@@ -0,0 +1,420 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+Compaction Policy
+============================================================
+
+This document explains the policy of performing a compaction.
+For details explaining how compactions are implemented, see compaction.md.
+
+The compaction policy is responsible for selecting a set of rowsets to compact
+together. Compactions are necessary in order to reduce the number of DiskRowSets
+which must be consulted for various operations, thus improving the overall
+performance of the tablet.
+
+Coming up with a good compaction policy is a balancing act between several goals:
+
+1. Re-arrange the physical layout to be more efficient for subsequent operations.
+
+2. Do so without using too many resources in the compaction itself.
+
+3. Do so "smoothly" - spread work out over time so that operation performance is
+   predictable and reasonably constant.
+
+
+The following sections provide some analysis of the above goals:
+
+Benefit of compaction for subsequent operations
+============================================================
+
+In order to determine a good compaction policy, we want to define a cost measure
+for a given set of RowSets within a tablet. Consider the following set of
+RowSets:
+
+```
+   1     2      3     4    5
+|--A--||-B--||--C--||---D----|
+|--------------E-------------|
+                   |-F--|
+```
+
+In this diagram, the key space spans from left to right, and each RowSet is drawn
+as an interval based on its first and last contained key. We'll define a few terms
+for later use in this document:
+
+**Width**
+
+Let the Width of a RowSet be proportional to the percentage of key
+space that it spans. For example, rowset E has a width of 1, since
+it spans the whole tablet. Rowset B has width 0.2, since it spans
+about 20% of the tablet.
+
+Note that the Width is also the probability that any read in a
+uniform random read workload will have to consult that RowSet.
+
+**Height**
+
+The "Height" of a tablet at a given key is the number of rowsets
+whose key ranges contain that key. For example, the height of the
+above tablet at key 1 is 2, since rowsets A and E span that key.
+The height at key 4 is 3, since D, E, and F span that key.
+
+The Height at any key is the number of RowSets that will be have to
+be consulted for a random read of that key.
+
+Let us consider the cost of various operations on the tablet:
+
+Insert
+-------
+In order to Insert, each of the rowsets must be checked for a duplicate key. By
+storing the rowset ranges in an interval tree, we can efficiently determine the
+set of rowsets whose intervals may contain the key to be inserted, and thus the
+cost is linear in that number of rowsets:
+
+```
+Let n = the Height of the tablet at the given key
+Let B = the bloom filter false positive rate
+Let C_bf = cost of bloom filter check
+Let C_pk = cost of a primary key lookup
+Cost = n*C_bf + n*B*C_pk
+Cost = n(C_bf + B*C_pk)
+```
+
+Typically, B is approximately 1% or lower, so the bloom filter checks dominate this
+equation. However, in some cases where the primary key column is very large, every
+primary key check will incur a disk seek, meaning that `C_pk` is orders of magnitude
+higher than `C_bf` (which we expect to be in RAM or SSD). So, we cannot fully ignore
+the term resulting from the bloom filter misses.
+
+Random read
+------------
+The costs for random read are similar to the cost for inserts: given the known key,
+each potentially overlapping rowset must be queried.
+
+
+Short Scan
+-----------
+Scans cannot make use of bloom filters, so the cost is similar to the above, except
+that all overlapping rowsets must be seeked by PK:
+
+```
+Cost = n*C_pk
+```
+
+We assume a "short" scan is one in which the sequential IO cost after finding the start
+key is small compared to the seek cost. (eg assuming a 10ms seek time, 1MB or less of
+sequential IO).
+
+
+Long scan (e.g. full table scan):
+---------------------------------
+A long scan is likely to retrieve data from many rowsets. In this case, the size
+of the rowsets comes into play.
+
+Let S = the number of MB in the scan
+Let B = the disk bandwidth (MB/sec)
+Let n = the number of rowsets accessed, as before
+
+Assume that accessing each rowset costs 1 seek (same as `C_pk`).
+
+```
+Cost = n*C_pk + S/B
+```
+
+To summarize the above, all of the costs of operations are heavily dependent on the
+number of rowsets which must be accessed. Therefore, to minimize cost, we should
+follow the following strategies:
+
+1. In the case of point queries (inserts and random read/short scan), merge
+   rowsets which overlap in keyspace, thus reducing the average height of the
+   Tablet.
+
+2. In the case of longer scans, merge together rowsets to improve the ratio of
+   sequential IO to seeks.
+
+We can assume that, so long as the rowsets are reasonably large, goal #2 above has
+diminishing returns after rowsets achieve ~10MB or so of sequential IO for every
+seek (1 seek ~= 10ms, 10MB IO ~= 100ms). However, goal #1 has linear returns, so we
+focus on goal #1.
+
+
+Cost of doing a compaction
+============================================================
+According to the above analysis, the optimal configuration for a tablet is a
+single giant rowset which spans the entirety of the key space. This is
+intuitively true: a fully-compacted tablet is going to perform the best because
+every access will require at most one bloom filter check and one seek.
+
+However, it is obviously not optimal to simply compact all RowSets together in every
+compaction. This would be inefficient, since every compaction would rewrite the
+entire rowset, causing huge write amplification and wasted IO for only a small
+amount of efficiency gain.
+
+So, we need to consider not just how efficient the resulting tablet would be, but also
+how expensive it is to perform the candidate compaction. Only by weighing those two
+against each other can we decide on the best compaction to perform at any given point
+in time.
+
+For the purposes of this analysis, we consider the cost of a compaction to simply be
+the sum of the IO performed by the compaction. We'll assume that deletions are rare,
+in which case the output data size of a compaction is approximately equal to the
+input data size. We also assume that the compaction inputs are large enough that
+sequential IO outweighs any seeks required.
+
+Thus the cost of performing a compaction is O(input size).
+
+
+Incremental work
+============================================================
+The third goal for compaction is to be able to perform work incrementally. Doing
+frequent incremental compactions rather than occasional large ones results in a
+more consistent performance profile for end-user applications. Incremental work
+also allows the system to react more quickly to changes in workload: for example,
+if one area of the keyspace becomes hot, we would like to be able to quickly
+react and compact that area of the keyspace within a short time window.
+
+One way to achieve this goal is to put a bound on the amount of data that any
+given compaction will read and write. Bounding this data on the range of several
+hundred MB means that a compaction can occur in 10 seconds or less, allowing
+quick reaction time to shifts in workload.
+
+
+Proposed strategy:
+============================================================
+
+Limiting RowSet Sizes
+------------------------------
+The first key piece of the proposed compaction strategy is to limit the maximum size of
+any RowSet to a relatively small footprint - e.g 64MB or even less. This can be done
+by modifying the DiskRowSet writer code to "roll over" to a new rowset after the size
+threshold has been reached. Thus, even if flushing a larger dataset from memory, the
+on-disk rowset sizes can be limited.
+
+
+Flushes with limited RowSet size
+---------------------------------
+For example, imagine that the max rowset size is set to 64MB, and 150MB of data has
+accumulated in the MemRowSet before a flush. The resulting output of the flush, then
+looks like:
+
+```
+   A       B     C
+|------||------||--|
+  64MB    64MB  22MB
+```
+
+Note that even though the maximum DiskRowSet size is 64MB, the third flushed rowset
+will be smaller. In the future, we could esimate the on-disk data size and try to make
+the three RowSets approximately equal-sized, but it is not necessary for correctness.
+
+Compactions with limited RowSet size
+-------------------------------------
+Now imagine another scenario, where a Tablet flushes several times, each resulting in
+small files which span the entirety of the key space -- commonly seen in a uniform
+random insert load. After 3 flushes, the Tablet looks like:
+
+```
+       A (50MB)
+|-------------------|
+       B (50MB)
+|-------------------|
+       C (50MB)
+|-------------------|
+```
+
+Because the three rowset ranges overlap, every access to the tablet must query each of the
+rowsets (i.e the average rowset "depth" is 3). If the compaction policy selects these
+three RowSets for compaction, the compaction result will look like:
+
+```
+   D       E     F
+|------||------||--|
+  64MB    64MB  22MB
+```
+
+Essentially, the compaction reorganizes the data from overlapping rowsets into non-overlapping
+rowsets of a similar size. This reduces the average depth from 3 to 1, improving the
+Tablet performance.
+
+
+Dealing with large numbers of RowSets
+--------------------------------------
+With these limited sizes, a modestly sized Tablet (eg 20GB) will have on the order of hundreds
+of RowSets. In order to efficiently determine the set of RowSets which may contain a given
+query key or range, we have to change the Tablet code to store the RowSets in an interval
+tree instead of a simple list. The Interval Tree is a data structure which provides efficient
+query for the set of intervals overlapping a given query point or query interval.
+
+
+Intuition behind compaction selection policy
+---------------------------------------------
+As a simplification, assume for now that all RowSets are exactly the same size (rather
+than bounded under a maximum). Then, we can classify a RowSet as "good" or "bad" based on
+one simple factor: the smaller the range of key space that it spans, the better.
+Assuming a uniform insert workload, every flushed RowSet will span the entirety of the
+Tablet's key space -- and hence must be queried by every subsequent operation. Once there
+are multiple such flushed RowSets (A, B, and C in the diagram), compacting them results in
+skinnier rowsets D, E, and F.
+
+Intuitively, then, a good compaction policy finds rowsets which are wide and overlapping, and
+compacts them together, resulting in rowsets which are skinny and non-overlapping.
+
+Taking the cost factors developed above, we can look at compaction selection as an optimization
+problem: reduce the cost of the Tablet configuration as much as possible under a given IO budget.
+
+Per the analysis above, the cost of a single read or insert is linear in the "height" of the
+RowSets at the key being accessed. So, the average cost of operations can be calculated by
+integrating the tablet height across the key space, or equivalently adding up the widths
+of all of the RowSets. For example:
+
+```
+          |---A----| (width 10)
+     |-----B-------| (width 15)
+|-C-||-----D-------| (width 5, width 15)
+|--------E---------| (width 20)
+```
+
+So, the summed width = 20+5+15+15+10 = 65.
+
+Imagine that we choose to compact rowsets A, B, and D above, resulting in the following
+output:
+
+```
+|-C-||-F-||-G-||-H-| (width 5, width 5, width 5, width 5)
+|--------E---------| (width 20)
+```
+
+Note that the total number of bytes have not changed: we've just reorganized the bytes
+into a more compact form, reducing the average height of the tablet.
+
+Now the summed cost is 40. So, the compaction had benefit 25, using a budget of 3 units of IO
+(remember that rowsets are assumed to be constant size for this analysis).
+
+Another choice for the compaction might have been to compact B, D, and E, resulting in:
+
+```
+          |---A----| (width 10)
+|-C-|                (width 5)
+|---F--||--G--||-H-| (width 8, width 7, width 5)
+```
+
+This compaction reduced the tablet cost from 65 to 35 -- so its benefit was 30, using the same
+IO budget of 3.
+
+Given that the second compaction choice reduced the tablet height more using the same budget,
+it is a more optimal solution.
+
+Mathematical analysis
+-----------------------
+The reduction of cost due to a compaction is simple to calculate:
+
+Cost change = sum(original rowset widths) - sum(output rowset widths)
+
+We know that the output rowsets will not overlap at all, and that their total width will
+span the union of the input rowset ranges. Therefore:
+
+Cost change = sum(original rowset widths) - (union width of original rowsets)
+
+Note that, for this analysis, the key ranges are treated as integers. This can be extended
+to string keys in a straightforward manner by treating the string data as unsigned integers.
+
+Algorithm
+----------
+
+Given budget N rowsets:
+
+```
+For each pair of rowsets (A, B):
+  Evaluate BestForPair(A, B):
+
+BestForPair(A, B):
+  Let union width = max(A.max_key, B.max_key) - min(A.min_key, B.min_key)
+  Determine the subset R of rowsets that are fully contained within the range A, B
+  Evaluate PickRowsetsWithBudget(R, N):
+  Set objective = sum(rowset width) - union width
+  If objective > best objective:
+    best solution = this set
+
+PickRowsetsWithBudget(R, N):
+  Choose the N rowsets in R which which maximize sum(rowset width)
+```
+
+
+PickRowsetsWithBudget can be solved by simply sorting the rowsets by their width and
+choosing the top N.
+
+
+Extending algorithm to non-constant sizes
+------------------------------------------
+
+Even though we limit the maximum rowset size to a constant, some rowsets may be smaller
+due to more frequent flushes, etc. Thus, we would like to change the budget to be a number
+of MB of IO, rather than a simple count N of input files. The subproblem PickNRowSets then becomes:
+
+> Choose a set of RowSets such that their total file size falls within a budget, and
+> maximizes their total widths.
+
+This is an instance of the 0-1 knapsack problem, so we replace PickRowsetsWithBudget(R, N)
+with a knapsack problem solver.
+
+Computational complexity
+----------------------------
+
+The algorithm contains `O(n^2)` calls to BestForPair, each of which contains one instance of the
+0-1 knapsack problem, which has complexity `O(n * max_budget). Thus, the total complexity is cubic
+in the number of rowsets, which can become quite expensive when a given tablet may include on the
+order of a thousand rowsets.
+
+We can optimize the approach by changing the order in which we consider pairs (A, B) in the
+above-described algorithm:
+
+```
+For each rowset A:
+  candidates = all rowsets B such that B.min_key >= A.min_key
+  sort candidates B by increasing B.max
+  For each pair (A, B):
+    Evaluate BestForPair(A, B)
+```
+
+Considering the pairs in this order simplifies BestForPair as follows:
+
+```
+BestForPair(A, B):
+  Let union width = max(A.max_key, b.max_key) - min(A.min_key, B.min_key)
+  Determine the subset R of rowsets that are fully contained within the range A, B
+   ** Because B.max_key is non_decreasing, this subset R is identical to R in the
+      previous call, except that B is now added to the end. No extra loop
+      is required.
+  Evaluate PickRowsetsWithBudget(R, N):
+   ** This instantiation of the knapsack problem now is identical to the previous
+      instantiation, except with one additional item. Thus, it can be computed
+      incrementally from the previous solution.
+  Set objective = sum(rowset width) - union width
+  If objective > best objective:
+    best solution = this set
+```
+
+
+Additionally, upper bounds can be calculated by solving the simpler fractional knapsack
+problem and used to short-circuit the more complex calculations.
+
+
+Extending algorithm to non-uniform workloads
+--------------------------------------------
+
+The above analysis is done in terms of constant workloads. However, in practice, workloads
+may be skewed. Given that, it is more important to compact the areas of the key space which
+are seeing frequent access. The algorithms can be extended in a straightforward way by changing
+all references to the "width" of a rowset to instead be CDF(max key) - CDF(min key) where CDF
+is the cumulative distribution function for accesses over a lagging time window.

http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/b7f859ed/docs/design-docs/compaction.md
----------------------------------------------------------------------
diff --git a/docs/design-docs/compaction.md b/docs/design-docs/compaction.md
new file mode 100644
index 0000000..93361ff
--- /dev/null
+++ b/docs/design-docs/compaction.md
@@ -0,0 +1,94 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+Compaction design notes
+=======================
+
+This document explains the mechanics of performing a rowset flush/compaction.
+For details explaining how compactions are selected, see compaction-policy.md.
+NOTE: this does not describe anything about flushing delta stores to delta files!
+
+Goal: Take two or more RowSets with overlapping key ranges, and merge
+them into a new RowSet, while updates are concurrently being applied.
+The output RowSet should also garbage collect (i.e reclaim storage from)
+any rows which were deleted in the old RowSets.
+
+Let's start with the simple example of compacting from 1 input rowset to
+1 output rowset. This has the effect of removing GC-able data and
+applying updates. The compaction has two main phases:
+
+```
+      "flush_snap"
+           |
+           |
+  before   v
+<----------|
+              Phase 1:
+          merging/flushing
+           |-----------|
+                         Phase 2: migrate
+                         deltas
+                       |---------------|
+                                         compaction
+                                         complete
+                                       |----------->
+
+|--------------  time ----------------------------->
+```
+
+System steady state:
+  - Updates are applied only to the "source RowSet"
+
+Transition into Phase 1:
+  - Create a snapshot iterator to merge the input RowSets, and save the
+    associated MVCC snapshot state.
+
+Phase 1: merge/flush data:
+  - Use the iterator created above to create a new set of data for the output
+    RowSet. This will reflect any updates or deletes which arrived prior to the
+    start of phase 1, but no updates or deletes which arrive during either
+    phase of the compaction.
+
+  - Any mutations which arrive during this phase are applied only to the input
+    RowSets' delta tracking structures. Because the merge operates on a snapshot,
+    it will not take these into account in the output RowSet.
+
+Phase 2: migrate deltas from phase 1
+  - Any mutations which arrive during this phase should be applied to both the
+    input RowSet and the output RowSet. This is simple to do by duplicating
+    the key lookup into the output RowSet's key column when the update arrives.
+    This is implemented by swapping in a "DuplicatingRowSet" implementation which
+    forwards updates to both the input and output rowsets.
+
+  - Any reads during this phase must be served from the input RowSet, since the
+    output RowSet is missing the deltas which arrived during the merge phase.
+
+  - Because the merge output ignored any mutations which arrived during phase 1,
+    we must now 'migrate' those mutations to the output RowSet. This can be done
+    efficiently by collecting all of the deltas which were not included in the
+    snapshot iterator, and applying them to the output rowset's delta tracker.
+
+
+End of Phase 2: swap RowSets
+  - After Phase 2, the two RowSets have logically identical data, and they may
+    be atomically swapped. Once the output RowSet has been swapped in, new updates
+    only need to be applied to the output RowSet, and the old RowSet may be dropped.
+
+Extending to multiple RowSets
+-----------------------------
+
+The above algorithm can be extended to multiple RowSets equally well. At the beginning
+of the compaction, each RowSet is snapshotted, and a snapshot iterator created. A merge
+iterator then performs the merge of all of the snapshots in ascending key order.
+

http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/b7f859ed/docs/design-docs/schema-change.md
----------------------------------------------------------------------
diff --git a/docs/design-docs/schema-change.md b/docs/design-docs/schema-change.md
new file mode 100644
index 0000000..7351fd4
--- /dev/null
+++ b/docs/design-docs/schema-change.md
@@ -0,0 +1,114 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+Schema Changes
+============================================================
+
+Column IDs
+------------------------------
+Internal to a Schema, and not exposed to the user, each column in a schema has
+a unique identifier. The identifiers are integers which are not re-used,
+and serve to distinguish an old column from a new one in the case that they
+have the same name.
+
+For example:
+
+```
+> CREATE TABLE x (col_a int, col_b int);
+> INSERT INTO x VALUES (1, 1);
+> ALTER TABLE x DROP COLUMN col_b;
+> ALTER TABLE x ADD COLUMN col_b int not null default 999;
+```
+
+In this case, although the Schema at the end of the sequence looks the same
+as the one at the beginning, the correct data is:
+
+```
+> SELECT * from x;
+ col_a   | col_b
+------------------
+  1      | 999
+```
+
+In other words, we cannot re-materialize data from the old `col_b` into the new
+`col_b`.
+
+If we were to dump the initial schema and the new schema, we would see that although
+the two `col_b`s have the same name, they would have different column IDs.
+
+Column IDs are internal to the server and not sent by the user on RPCs. Clients
+specify columns by name. This is because we expect a client to continue to make
+queries like "`select sum(col_b) from x;`" without any refresh of the schema, even
+if the column is dropped and re-added with new data.
+
+Schemas specified in RPCs
+------------------------------
+
+When the user makes an RPC to read or write from a tablet, the RPC specifies only
+the names, types, and nullability of the columns. Internal to the server, we map
+the names to the internal IDs.
+
+If the user specifies a column name which does not exist in the latest schema,
+it is considered an error.
+
+If the type or nullability does not match, we also currently consider it an error.
+In the future, we may be able to adapt the data to the requested type (eg promote
+smaller to larger integers on read, promote non-null data to a nullable read, etc).
+
+Handling varying schemas at read time
+------------------------------
+
+```
+ + Tablet
+ |---- MemRowSet
+ |---- DiskRowSet N
+ |-------- CFileSet
+ |-------- Delta Tracker
+ |------------ Delta Memstore
+ |------------ Delta File N
+```
+
+Because the Schema of a table may change over time, different rowsets may have
+been written with different schemas. At read time, the server determines a Schema
+for the read based on the current metadata of the tablet. This Schema determines
+what to do as the read path encounters older data which was inserted prior to
+the schema change and thus may be  missing some columns.
+
+For each column in the read schema which is not present in the data, that column
+may be treated in one of two ways:
+
+1. In the case that the new column has a "read default" in the metadata, that
+   value is materialized for each cell.
+2. If no "read default" is present, then the column must be nullable. In that
+   case, a column of NULLs is materialized.
+
+Currently, Kudu does not handle type changes. In the future, we may also need to
+add type adapters to convert older data to the new type.
+
+When reading delta files, updates to columns which have since been removed are
+ignored. Updates to new columns are applied on top of the materialized default
+column data.
+
+Compaction
+------------------------------
+Each CFileSet and DeltaFile has a schema associated to describe the data in it.
+On compaction, CFileSet/DeltaFiles with different schemas may be aggregated into a new file.
+This new file will have the latest schema and all the rows must be projected.
+
+In the case of CFiles, the projection affects only the new columns, where the read default
+value will be written as data, or in case of "alter type" where the "encoding" is changed.
+
+In the case of DeltaFiles, the projection is essential since the RowChangeList is serialized
+with no hint of the schema used. This means that you can read a RowChangeList only if you
+know the exact serialization schema.

http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/b7f859ed/docs/design-docs/triggering-maintenance-ops.md
----------------------------------------------------------------------
diff --git a/docs/design-docs/triggering-maintenance-ops.md b/docs/design-docs/triggering-maintenance-ops.md
new file mode 100644
index 0000000..53f9c4e
--- /dev/null
+++ b/docs/design-docs/triggering-maintenance-ops.md
@@ -0,0 +1,211 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+Maintenance Op Scheduling
+===============================================================================
+
+For the purpose of this document, "maintenance operations" are any background
+processes that Kudu runs in the course of normal operation.  The
+MaintenanceManager must schedule these operations intelligently to keep the
+system operating smoothly.  Partly, this is a tradeoff between current
+performance and future performance.  For example, running a compaction will
+spend some I/O now in order to speed up insertions later.  Partly, this is a
+matter of performing necessary tasks that, if left undone, would compromise the
+stability of the system.  For example, if we never flushed MemRowSets, we would
+eventually run out of memory.  As memory gets low, admissions control will slow
+the pace of new requests getting accepted.
+
+
+Decision Criteria
+===============================================================================
+The most important things that we need to weigh in order to make good decisions
+are:
+1. memory usage
+2. tablet statistics
+3. the age of memrowsets
+
+Some other criteria that we considered, but rejected for v1 include:
+1. free disk space.
+2. load-balancing between disks or disksets which will be touched by
+   maintenance operations
+
+Free disk space should not be an issue in most competently administered setups.
+We may revisit this later, but for the initial version, it is best to assume we
+have enough space.
+
+We can't consider disk-based scheduling right now since we don't have support
+for multiple disks yet.
+
+
+Memory usage
+-------------------------------------------------------------------------------
+Memory usage can be broken down into a few buckets:
+1. System overhead (C++ data structures, operating system overheads, and so
+   forth).
+2. MemRowSets
+3. The LRU block cache
+
+We assume that #1 is relatively constant.  The maintenance op scheduler can
+make tradeoffs between #2 and #3 by deciding to flush certain MemRowSets to
+disk.
+
+We want to keep the total amount of memory held by #1, #2 and #3 from growing
+too large.  For now, our goal is to keep this sum relatively constant.  We have
+not yet implemented giving memory held by tcmalloc back to the operating system.
+
+
+Tablet Statistics
+-------------------------------------------------------------------------------
+If we know that a tablet's workload is scan-heavy (rather than insert-heavy),
+we may wish to do a major delta compaction for that tablet to speed up scans.
+It's probably smarter to do compactions on tables that are heavily used, than
+on obscure tables that don't see much traffic.
+
+This is probably the most difficult information source to make effective use
+of, simply because it involves many workload-dependent assumptions and
+heuristics.
+
+
+The Age of MemRowSet objects
+-------------------------------------------------------------------------------
+MemRowSet and DeltaMemRowSet objects must be flushed to disk when they get too
+old.  If we don't do this, the write-ahead log (WAL) will grow without bound.
+This growth would waste disk space and slow startup to a crawl, since the
+entire WAL must be traversed during the startup process.
+
+We should embed a WAL op id in each MemRowSets and DeltaMemRowSet.  The
+scheduler will look more favorably on the flushing of a MemRowSet as it ages.
+After the operation id falls too far behind, it will try to flush the MemRowSet
+no matter what.
+
+
+Maintenance Operation types
+===============================================================================
+
+Maintenance operations to reduce memory usage
+----------------------------------------
+
+These operations spend some I/O or CPU in order to free up memory usage. They
+may also incur further performance costs after completion. These cannot be
+delayed indefinitely, as RAM is a finite resource.
+
+
+MemStore Flush
+------------------------------
+Cost:
+- Sequential I/O now (writing the actual memstore contents to disk)
+- Sequential I/O later (frequent small flushes will cost more compactions down the road)
+
+Benefit:
+- RAM: frees up memory
+
+Other/wash:
+- At first glance, flushing might seem to increase cost of further insert/updates
+  because it adds a new RowSet. However, because memstores are not compressed in
+  any way, typically the newly flushed RowSet will be much smaller on disk than the
+  memstore that it came from. This means that, even if we have to cache the whole
+  result RowSet in the block cache, we're making much more effective use of RAM and
+  thus may _reduce_ the total number of actual I/Os.
+
+
+DeltaMemStore Flush
+------------------------------
+Basically the same costs as MemStore flush
+
+Additional benefits:
+TODO: flushing may also speed up scans substantially. Need to run experiments on this --
+how much better is scanning a static cached file compared to scanning the equivalent
+memstore. Maybe an order of magnitude.
+
+
+LRU cache eviction
+------------------------------
+Cost: slower reads, slower inserts if evicting key columns or blooms
+Benefit: frees RAM
+
+
+
+
+Maintenance operations to manage future performance
+----------------------------------------
+
+These operations expend some kind of I/O and CPU now in order to improve the performance
+of the system after they complete. They are only ever "necessary" in that if we put them
+off forever, the system will slow to a crawl eventually.
+
+
+Merging Compaction
+------------------------------
+Cost:
+- Sequential I/O now (reading input, re-writing output)
+
+Benefit:
+- reduce the number of RowSets: speeds up inserts, updates. Speeds up short scans where blooms don't apply.
+
+
+Minor Delta Compaction
+------------------------------
+Cost:
+- Sequential I/O (reading input, re-writing output)
+
+Benefit:
+- Speeds up scans -- fewer delta trackers to apply
+- May save disk space (eg when snapshot isolation is implemented, old version updates may be discarded)
+
+
+Major delta compaction
+------------------------------
+Cost:
+- Sequential I/O (reading input, re-writing output)
+
+Benefit:
+- Speeds up scans -- fewer delta trackers to apply, fewer total rows with deltas to apply.
+- Save disk space (eg when snapshot isolation is implemented, old version updates may be discarded)
+
+Relevant metrics:
+- for each column, % of rows in RowSet which have been updated
+- for each column, % of deltas which could be fully merged
+- workload: scan heavy vs insert/update heavy?
+
+
+Implementation Considerations
+===============================================================================
+Each tablet creates several MaintenanceOp objects, representing the various
+maintenance operations which can be performed on it.  It registers these
+operations with the MaintenanceManager.
+
+The MaintenanceManager has a main thread which periodically polls the
+registered MaintenanceOp objects and determines whether it should execute any
+of them.  The default polling interval is 250 ms, but this is configurable.
+Access to the MaintenanceOp is assumed to be thread-safe.  It's important to
+note that the scheduler can choose any op available to it.  It is not bound to
+execute operations on a first-come, first-serve basis.
+
+If the MaintenanceManager decides to execute one of these operations, it will
+run it in a thread-pool of configurable size.  We assume that maintenance
+operations are blocking and require a thread context.  If the operation fails,
+the MaintenanceManager will log a warning message and re-trigger the main
+thread.  The failed MaintenanceOp will not be retried until a configurable
+grace period has expired.
+
+The MaintenanceOp has various fields indicating how much memory it will
+probably free, how much CPU it will use, and so forth.  It also has a field
+which marks it as not currently executable.  For example, this may be used by
+some Ops that don't want multiple instances of themselves to run concurrently.
+
+We want to keep at least one thread free to run flush operations, so that we
+don't ever get into a situation where we need to free up memory, but all the
+maintenance op threads are working on compactions or other operations.
+Hopefully, most compactions will be reasonably short, so that we won't have to
+schedule long compactions differently than short ones.

http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/b7f859ed/src/kudu/tablet/compaction-policy.txt
----------------------------------------------------------------------
diff --git a/src/kudu/tablet/compaction-policy.txt b/src/kudu/tablet/compaction-policy.txt
deleted file mode 100644
index ee9a283..0000000
--- a/src/kudu/tablet/compaction-policy.txt
+++ /dev/null
@@ -1,397 +0,0 @@
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-
-This document explains the policy of performing a compaction.
-For details explaining how compactions are implemented, see compaction.txt.
-
-Compaction Policy
-============================================================
-
-The compaction policy is responsible for selecting a set of rowsets to compact
-together. Compactions are necessary in order to reduce the number of DiskRowSets
-which must be consulted for various operations, thus improving the overall
-performance of the tablet.
-
-Coming up with a good compaction policy is a balancing act between several goals:
-
-1) Re-arrange the physical layout to be more efficient for subsequent operations.
-
-2) Do so without using too many resources in the compaction itself.
-
-3) Do so "smoothly" - spread work out over time so that operation performance is
-   predictable and reasonably constant.
-
-
-The following sections provide some analysis of the above goals:
-
-Benefit of compaction for subsequent operations
-============================================================
-
-In order to determine a good compaction policy, we want to define a cost measure
-for a given set of RowSets within a tablet. Consider the following set of
-RowSets:
-
-
-   1     2      3     4    5
-|--A--||-B--||--C--||---D----|
-|--------------E-------------|
-                   |-F--|
-
-In this diagram, the key space spans from left to right, and each RowSet is drawn
-as an interval based on its first and last contained key. We'll define a few terms
-for later use in this document:
-
-  "Width"
-  -------
-  Let the Width of a RowSet be proportional to the percentage of key
-  space that it spans. For example, rowset E has a width of 1, since
-  it spans the whole tablet. Rowset B has width 0.2, since it spans
-  about 20% of the tablet.
-
-  Note that the Width is also the probability that any read in a
-  uniform random read workload will have to consult that RowSet.
-
-  "Height"
-  --------
-  The "Height" of a tablet at a given key is the number of rowsets
-  whose key ranges contain that key. For example, the height of the
-  above tablet at key 1 is 2, since rowsets A and E span that key.
-  The height at key 4 is 3, since D, E, and F span that key.
-
-  The Height at any key is the number of RowSets that will be have to
-  be consulted for a random read of that key.
-
-Let us consider the cost of various operations on the tablet:
-
-Insert
--------
-In order to Insert, each of the rowsets must be checked for a duplicate key. By
-storing the rowset ranges in an interval tree, we can efficiently determine the
-set of rowsets whose intervals may contain the key to be inserted, and thus the
-cost is linear in that number of rowsets:
-
-  Let n = the Height of the tablet at the given key
-  Let B = the bloom filter false positive rate
-  Let C_bf = cost of bloom filter check
-  Let C_pk = cost of a primary key lookup
-  Cost = n*C_bf + n*B*C_pk
-  Cost = n(C_bf + B*C_pk)
-
-Typically, B is approximately 1% or lower, so the bloom filter checks dominate this
-equation. However, in some cases where the primary key column is very large, every
-primary key check will incur a disk seek, meaning that C_pk is orders of magnitude
-higher than C_bf (which we expect to be in RAM or SSD). So, we cannot fully ignore
-the term resulting from the bloom filter misses.
-
-Random read
-------------
-The costs for random read are similar to the cost for inserts: given the known key,
-each potentially overlapping rowset must be queried.
-
-
-Short Scan
------------
-Scans cannot make use of bloom filters, so the cost is similar to the above, except
-that all overlapping rowsets must be seeked by PK:
-
-Cost = n*C_pk
-
-We assume a "short" scan is one in which the sequential IO cost after finding the start
-key is small compared to the seek cost. (eg assuming a 10ms seek time, 1MB or less of
-sequential IO).
-
-
-Long scan (e.g full table scan):
----------------------------------
-A long scan is likely to retrieve data from many rowsets. In this case, the size
-of the rowsets comes into play.
-
-Let S = the number of MB in the scan
-Let B = the disk bandwidth (MB/sec)
-Let n = the number of rowsets accessed, as before
-
-Assume that accessing each rowset costs 1 seek (same as C_pk).
-
-Cost = n*C_pk + S/B
-
-
-To summarize the above, all of the costs of operations are heavily dependent on the
-number of rowsets which must be accessed. Therefore, to minimize cost, we should
-follow the following strategies:
-
-1) In the case of point queries (inserts and random read/short scan), merge
-   rowsets which overlap in keyspace, thus reducing the average height of the
-   Tablet.
-
-2) In the case of longer scans, merge together rowsets to improve the ratio of
-   sequential IO to seeks.
-
-We can assume that, so long as the rowsets are reasonably large, goal #2 above has
-diminishing returns after rowsets achieve ~10MB or so of sequential IO for every
-seek (1 seek ~= 10ms, 10MB IO ~= 100ms). However, goal #1 has linear returns, so we
-focus on goal #1.
-
-
-Cost of doing a compaction
-============================================================
-According to the above analysis, the optimal configuration for a tablet is a
-single giant rowset which spans the entirety of the key space. This is
-intuitively true: a fully-compacted tablet is going to perform the best because
-every access will require at most one bloom filter check and one seek.
-
-However, it is obviously not optimal to simply compact all RowSets together in every
-compaction. This would be inefficient, since every compaction would rewrite the
-entire rowset, causing huge write amplification and wasted IO for only a small
-amount of efficiency gain.
-
-So, we need to consider not just how efficient the resulting tablet would be, but also
-how expensive it is to perform the candidate compaction. Only by weighing those two
-against each other can we decide on the best compaction to perform at any given point
-in time.
-
-For the purposes of this analysis, we consider the cost of a compaction to simply be
-the sum of the IO performed by the compaction. We'll assume that deletions are rare,
-in which case the output data size of a compaction is approximately equal to the
-input data size. We also assume that the compaction inputs are large enough that
-sequential IO outweighs any seeks required.
-
-Thus the cost of performing a compaction is O(input size).
-
-
-Incremental work
-============================================================
-The third goal for compaction is to be able to perform work incrementally. Doing
-frequent incremental compactions rather than occasional large ones results in a
-more consistent performance profile for end-user applications. Incremental work
-also allows the system to react more quickly to changes in workload: for example,
-if one area of the keyspace becomes hot, we would like to be able to quickly
-react and compact that area of the keyspace within a short time window.
-
-One way to achieve this goal is to put a bound on the amount of data that any
-given compaction will read and write. Bounding this data on the range of several
-hundred MB means that a compaction can occur in 10 seconds or less, allowing
-quick reaction time to shifts in workload.
-
-
-Proposed strategy:
-============================================================
-
-Limiting RowSet Sizes
-------------------------------
-The first key piece of the proposed compaction strategy is to limit the maximum size of
-any RowSet to a relatively small footprint - e.g 64MB or even less. This can be done
-by modifying the DiskRowSet writer code to "roll over" to a new rowset after the size
-threshold has been reached. Thus, even if flushing a larger dataset from memory, the
-on-disk rowset sizes can be limited.
-
-
-Flushes with limited RowSet size
----------------------------------
-For example, imagine that the max rowset size is set to 64MB, and 150MB of data has
-accumulated in the MemRowSet before a flush. The resulting output of the flush, then
-looks like:
-
-   A       B     C
-|------||------||--|
-  64MB    64MB  22MB
-
-Note that even though the maximum DiskRowSet size is 64MB, the third flushed rowset
-will be smaller. In the future, we could esimate the on-disk data size and try to make
-the three RowSets approximately equal-sized, but it is not necessary for correctness.
-
-Compactions with limited RowSet size
--------------------------------------
-Now imagine another scenario, where a Tablet flushes several times, each resulting in
-small files which span the entirety of the key space -- commonly seen in a uniform
-random insert load. After 3 flushes, the Tablet looks like:
-
-
-       A (50MB)
-|-------------------|
-       B (50MB)
-|-------------------|
-       C (50MB)
-|-------------------|
-
-
-Because the three rowset ranges overlap, every access to the tablet must query each of the
-rowsets (i.e the average rowset "depth" is 3). If the compaction policy selects these
-three RowSets for compaction, the compaction result will look like:
-
-   D       E     F
-|------||------||--|
-  64MB    64MB  22MB
-
-
-Essentially, the compaction reorganizes the data from overlapping rowsets into non-overlapping
-rowsets of a similar size. This reduces the average depth from 3 to 1, improving the
-Tablet performance.
-
-
-Dealing with large numbers of RowSets
---------------------------------------
-With these limited sizes, a modestly sized Tablet (eg 20GB) will have on the order of hundreds
-of RowSets. In order to efficiently determine the set of RowSets which may contain a given
-query key or range, we have to change the Tablet code to store the RowSets in an interval
-tree instead of a simple list. The Interval Tree is a data structure which provides efficient
-query for the set of intervals overlapping a given query point or query interval.
-
-
-Intuition behind compaction selection policy
----------------------------------------------
-As a simplification, assume for now that all RowSets are exactly the same size (rather
-than bounded under a maximum). Then, we can classify a RowSet as "good" or "bad" based on
-one simple factor: the smaller the range of key space that it spans, the better.
-Assuming a uniform insert workload, every flushed RowSet will span the entirety of the
-Tablet's key space -- and hence must be queried by every subsequent operation. Once there
-are multiple such flushed RowSets (A, B, and C in the diagram), compacting them results in
-skinnier rowsets D, E, and F.
-
-Intuitively, then, a good compaction policy finds rowsets which are wide and overlapping, and
-compacts them together, resulting in rowsets which are skinny and non-overlapping.
-
-Taking the cost factors developed above, we can look at compaction selection as an optimization
-problem: reduce the cost of the Tablet configuration as much as possible under a given IO budget.
-
-Per the analysis above, the cost of a single read or insert is linear in the "height" of the
-RowSets at the key being accessed. So, the average cost of operations can be calculated by
-integrating the tablet height across the key space, or equivalently adding up the widths
-of all of the RowSets. For example:
-
-          |---A----| (width 10)
-     |-----B-------| (width 15)
-|-C-||-----D-------| (width 5, width 15)
-|--------E---------| (width 20)
-
-So, the summed width = 20+5+15+15+10 = 65.
-
-Imagine that we choose to compact rowsets A, B, and D above, resulting in the following
-output:
-
-|-C-||-F-||-G-||-H-| (width 5, width 5, width 5, width 5)
-|--------E---------| (width 20)
-
-Note that the total number of bytes have not changed: we've just reorganized the bytes
-into a more compact form, reducing the average height of the tablet.
-
-Now the summed cost is 40. So, the compaction had benefit 25, using a budget of 3 units of IO
-(remember that rowsets are assumed to be constant size for this analysis).
-
-Another choice for the compaction might have been to compact B, D, and E, resulting in:
-          |---A----| (width 10)
-|-C-|                (width 5)
-|---F--||--G--||-H-| (width 8, width 7, width 5)
-
-This compaction reduced the tablet cost from 65 to 35 -- so its benefit was 30, using the same
-IO budget of 3.
-
-Given that the second compaction choice reduced the tablet height more using the same budget,
-it is a more optimal solution.
-
-Mathematical analysis
------------------------
-The reduction of cost due to a compaction is simple to calculate:
-
-Cost change = sum(original rowset widths) - sum(output rowset widths)
-
-We know that the output rowsets will not overlap at all, and that their total width will
-span the union of the input rowset ranges. Therefore:
-
-Cost change = sum(original rowset widths) - (union width of original rowsets)
-
-Note that, for this analysis, the key ranges are treated as integers. This can be extended
-to string keys in a straightforward manner by treating the string data as unsigned integers.
-
-Algorithm
-----------
-
-Given budget N rowsets:
-
-For each pair of rowsets (A, B):
-  Evaluate BestForPair(A, B):
-
-BestForPair(A, B):
-  Let union width = max(A.max_key, B.max_key) - min(A.min_key, B.min_key)
-  Determine the subset R of rowsets that are fully contained within the range A, B
-  Evaluate PickRowsetsWithBudget(R, N):
-  Set objective = sum(rowset width) - union width
-  If objective > best objective:
-    best solution = this set
-
-PickRowsetsWithBudget(R, N):
-  Choose the N rowsets in R which which maximize sum(rowset width)
-
-
-PickRowsetsWithBudget can be solved by simply sorting the rowsets by their width and
-choosing the top N.
-
-
-Extending algorithm to non-constant sizes
-------------------------------------------
-
-Even though we limit the maximum rowset size to a constant, some rowsets may be smaller
-due to more frequent flushes, etc. Thus, we would like to change the budget to be a number
-of MB of IO, rather than a simple count N of input files. The subproblem PickNRowSets then becomes:
-
-  Choose a set of RowSets such that their total file size falls within a budget, and
-  maximizes their total widths.
-
-This is an instance of the 0-1 knapsack problem, so we replace PickRowsetsWithBudget(R, N)
-with a knapsack problem solver.
-
-Computational complexity
-----------------------------
-
-The algorithm contains O(n^2) calls to BestForPair, each of which contains one instance of the
-0-1 knapsack problem, which has complexity O(n * max_budget). Thus, the total complexity is cubic
-in the number of rowsets, which can become quite expensive when a given tablet may include on the
-order of a thousand rowsets.
-
-We can optimize the approach by changing the order in which we consider pairs (A, B) in the
-above-described algorithm:
-
-For each rowset A:
-  candidates = all rowsets B such that B.min_key >= A.min_key
-  sort candidates B by increasing B.max
-  For each pair (A, B):
-    Evaluate BestForPair(A, B)
-
-Considering the pairs in this order simplifies BestForPair as follows:
-
-BestForPair(A, B):
-  Let union width = max(A.max_key, b.max_key) - min(A.min_key, B.min_key)
-  Determine the subset R of rowsets that are fully contained within the range A, B
-   ** Because B.max_key is non_decreasing, this subset R is identical to R in the
-      previous call, except that B is now added to the end. No extra loop
-      is required.
-  Evaluate PickRowsetsWithBudget(R, N):
-   ** This instantiation of the knapsack problem now is identical to the previous
-      instantiation, except with one additional item. Thus, it can be computed
-      incrementally from the previous solution.
-  Set objective = sum(rowset width) - union width
-  If objective > best objective:
-    best solution = this set
-
-
-Additionally, upper bounds can be calculated by solving the simpler fractional knapsack
-problem and used to short-circuit the more complex calculations.
-
-
-Extending algorithm to non-uniform workloads
---------------------------------------------
-
-The above analysis is done in terms of constant workloads. However, in practice, workloads
-may be skewed. Given that, it is more important to compact the areas of the key space which
-are seeing frequent access. The algorithms can be extended in a straightforward way by changing
-all references to the "width" of a rowset to instead be CDF(max key) - CDF(min key) where CDF
-is the cumulative distribution function for accesses over a lagging time window.

http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/b7f859ed/src/kudu/tablet/compaction.txt
----------------------------------------------------------------------
diff --git a/src/kudu/tablet/compaction.txt b/src/kudu/tablet/compaction.txt
deleted file mode 100644
index 49a8c85..0000000
--- a/src/kudu/tablet/compaction.txt
+++ /dev/null
@@ -1,95 +0,0 @@
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-
-This document explains the mechanics of performing a rowset flush/compaction.
-For details explaining how compactions are selected, see compaction-policy.txt.
-NOTE: this does not describe anything about flushing delta stores to delta files!
-
-Compaction design notes
-------------------------------------------------------------
-
-Goal: Take two or more RowSets with overlapping key ranges, and merge
-them into a new RowSet, while updates are concurrently being applied.
-The output RowSet should also garbage collect (i.e reclaim storage from)
-any rows which were deleted in the old RowSets.
-
-------------------------------
-
-Let's start with the simple example of compacting from 1 input rowset to
-1 output rowset. This has the effect of removing GC-able data and
-applying updates. The compaction has two main phases:
-
-
-      "flush_snap"
-           |
-           |
-  before   v
-<----------|
-              Phase 1:
-          merging/flushing
-           |-----------|
-                         Phase 2: migrate
-                         deltas
-                       |---------------|
-                                         compaction
-                                         complete
-                                       |----------->
-
-|--------------  time ----------------------------->
-
-
-System steady state:
-  - Updates are applied only to the "source RowSet"
-
-Transition into Phase 1:
-  - Create a snapshot iterator to merge the input RowSets, and save the
-    associated MVCC snapshot state.
-
-Phase 1: merge/flush data:
-  - Use the iterator created above to create a new set of data for the output
-    RowSet. This will reflect any updates or deletes which arrived prior to the
-    start of phase 1, but no updates or deletes which arrive during either
-    phase of the compaction.
-
-  - Any mutations which arrive during this phase are applied only to the input
-    RowSets' delta tracking structures. Because the merge operates on a snapshot,
-    it will not take these into account in the output RowSet.
-
-Phase 2: migrate deltas from phase 1
-  - Any mutations which arrive during this phase should be applied to both the
-    input RowSet and the output RowSet. This is simple to do by duplicating
-    the key lookup into the output RowSet's key column when the update arrives.
-    This is implemented by swapping in a "DuplicatingRowSet" implementation which
-    forwards updates to both the input and output rowsets.
-
-  - Any reads during this phase must be served from the input RowSet, since the
-    output RowSet is missing the deltas which arrived during the merge phase.
-
-  - Because the merge output ignored any mutations which arrived during phase 1,
-    we must now 'migrate' those mutations to the output RowSet. This can be done
-    efficiently by collecting all of the deltas which were not included in the
-    snapshot iterator, and applying them to the output rowset's delta tracker.
-
-
-End of Phase 2: swap RowSets
-  - After Phase 2, the two RowSets have logically identical data, and they may
-    be atomically swapped. Once the output RowSet has been swapped in, new updates
-    only need to be applied to the output RowSet, and the old RowSet may be dropped.
-
-Extending to multiple RowSets
-------------------------------
-
-The above algorithm can be extended to multiple RowSets equally well. At the beginning
-of the compaction, each RowSet is snapshotted, and a snapshot iterator created. A merge
-iterator then performs the merge of all of the snapshots in ascending key order.
-

http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/b7f859ed/src/kudu/tablet/schema-change.txt
----------------------------------------------------------------------
diff --git a/src/kudu/tablet/schema-change.txt b/src/kudu/tablet/schema-change.txt
deleted file mode 100644
index 6bbe175..0000000
--- a/src/kudu/tablet/schema-change.txt
+++ /dev/null
@@ -1,107 +0,0 @@
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-
-============================================================
-Schema Changes
-============================================================
-
-Column IDs
-------------------------------
-Internal to a Schema, and not exposed to the user, each column in a schema has
-a unique identifier. The identifiers are integers which are not re-used,
-and serve to distinguish an old column from a new one in the case that they
-have the same name.
-
-For example:
-
-> CREATE TABLE x (col_a int, col_b int);
-> INSERT INTO x VALUES (1, 1);
-> ALTER TABLE x DROP COLUMN col_b;
-> ALTER TABLE x ADD COLUMN col_b int not null default 999;
-
-In this case, although the Schema at the end of the sequence looks the same
-as the one at the beginning, the correct data is:
-
-> SELECT * from x;
- col_a   | col_b
-------------------
-  1      | 999
-
-In other words, we cannot re-materialize data from the old 'col_b' into the new
-'col_b'.
-
-If we were to dump the initial schema and the new schema, we would see that although
-the two 'col_b's have the same name, they would have different column IDs.
-
-Column IDs are internal to the server and not sent by the user on RPCs. Clients
-specify columns by name. This is because we expect a client to continue to make
-queries like "select sum(col_b) from x;" without any refresh of the schema, even
-if the column is dropped and re-added with new data.
-
-Schemas specified in RPCs
-------------------------------
-
-When the user makes an RPC to read or write from a tablet, the RPC specifies only
-the names, types, and nullability of the columns. Internal to the server, we map
-the names to the internal IDs.
-
-If the user specifies a column name which does not exist in the latest schema,
-it is considered an error.
-
-If the type or nullability does not match, we also currently consider it an error.
-In the future, we may be able to adapt the data to the requested type (eg promote
-smaller to larger integers on read, promote non-null data to a nullable read, etc).
-
-Handling varying schemas at read time
-------------------------------
- + Tablet
- |---- MemRowSet
- |---- DiskRowSet N
- |-------- CFileSet
- |-------- Delta Tracker
- |------------ Delta Memstore
- |------------ Delta File N
-
-Because the Schema of a table may change over time, different rowsets may have
-been written with different schemas. At read time, the server determines a Schema
-for the read based on the current metadata of the tablet. This Schema determines
-what to do as the read path encounters older data which was inserted prior to
-the schema change and thus may be  missing some columns.
-
-For each column in the read schema which is not present in the data, that column
-may be treated in one of two ways:
-
-  1) In the case that the new column has a "read default" in the metadata, that
-     value is materialized for each cell.
-  2) If no "read default" is present, then the column must be nullable. In that
-     case, a column of NULLs is materialized.
-
-Currently, Kudu does not handle type changes. In the future, we may also need to
-add type adapters to convert older data to the new type.
-
-When reading delta files, updates to columns which have since been removed are
-ignored. Updates to new columns are applied on top of the materialized default
-column data.
-
-Compaction
-------------------------------
-Each CFileSet and DeltaFile has a schema associated to describe the data in it.
-On compaction, CFileSet/DeltaFiles with different schemas may be aggregated into a new file.
-This new file will have the latest schema and all the rows must be projected.
-
-In the case of CFiles, the projection affects only the new columns, where the read default
-value will be written as data, or in case of "alter type" where the "encoding" is changed.
-
-In the case of DeltaFiles, the projection is essential since the RowChangeList is serialized
-with no hint of the schema used. This means that you can read a RowChangeList only if you
-know the exact serialization schema.

http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/b7f859ed/src/kudu/tablet/triggering-maintenance-ops.txt
----------------------------------------------------------------------
diff --git a/src/kudu/tablet/triggering-maintenance-ops.txt b/src/kudu/tablet/triggering-maintenance-ops.txt
deleted file mode 100644
index b9922fe..0000000
--- a/src/kudu/tablet/triggering-maintenance-ops.txt
+++ /dev/null
@@ -1,211 +0,0 @@
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-
-===============================================================================
-Maintenance Op Scheduling
-===============================================================================
-
-For the purpose of this document, "maintenance operations" are any background
-processes that Kudu runs in the course of normal operation.  The
-MaintenanceManager must schedule these operations intelligently to keep the
-system operating smoothly.  Partly, this is a tradeoff between current
-performance and future performance.  For example, running a compaction will
-spend some I/O now in order to speed up insertions later.  Partly, this is a
-matter of performing necessary tasks that, if left undone, would compromise the
-stability of the system.  For example, if we never flushed MemRowSets, we would
-eventually run out of memory.  As memory gets low, admissions control will slow
-the pace of new requests getting accepted.
-
-
-Decision Criteria
-===============================================================================
-The most important things that we need to weigh in order to make good decisions
-are:
-1. memory usage
-2. tablet statistics
-3. the age of memrowsets
-
-Some other criteria that we considered, but rejected for v1 include:
-1. free disk space.
-2. load-balancing between disks or disksets which will be touched by
-maintenance operations
-
-Free disk space should not be an issue in most competently administered setups.
-We may revisit this later, but for the initial version, it is best to assume we
-have enough space. 
-
-We can't consider disk-based scheduling right now since we don't have support
-for multiple disks yet.
-
-
-Memory usage
--------------------------------------------------------------------------------
-Memory usage can be broken down into a few buckets:
-1. System overhead (C++ data structures, operating system overheads, and so
-forth).
-2. MemRowSets
-3. The LRU block cache
-
-We assume that #1 is relatively constant.  The maintenance op scheduler can
-make tradeoffs between #2 and #3 by deciding to flush certain MemRowSets to
-disk.
-
-We want to keep the total amount of memory held by #1, #2 and #3 from growing
-too large.  For now, our goal is to keep this sum relatively constant.  We have
-not yet implemented giving memory held by tcmalloc back to the operating system.
-
-
-Tablet Statistics
--------------------------------------------------------------------------------
-If we know that a tablet's workload is scan-heavy (rather than insert-heavy),
-we may wish to do a major delta compaction for that tablet to speed up scans.
-It's probably smarter to do compactions on tables that are heavily used, than
-on obscure tables that don't see much traffic.
-
-This is probably the most difficult information source to make effective use
-of, simply because it involves many workload-dependent assumptions and
-heuristics.
-
-
-The Age of MemRowSet objects
--------------------------------------------------------------------------------
-MemRowSet and DeltaMemRowSet objects must be flushed to disk when they get too
-old.  If we don't do this, the write-ahead log (WAL) will grow without bound.
-This growth would waste disk space and slow startup to a crawl, since the
-entire WAL must be traversed during the startup process.
-
-We should embed a WAL op id in each MemRowSets and DeltaMemRowSet.  The
-scheduler will look more favorably on the flushing of a MemRowSet as it ages.
-After the operation id falls too far behind, it will try to flush the MemRowSet
-no matter what.
-
-
-Maintenance Operation types
-===============================================================================
-
-Maintenance operations to reduce memory usage
-----------------------------------------
-
-These operations spend some I/O or CPU in order to free up memory usage. They
-may also incur further performance costs after completion. These cannot be
-delayed indefinitely, as RAM is a finite resource.
-
-
-MemStore Flush
-------------------------------
-Cost:
-- Sequential I/O now (writing the actual memstore contents to disk)
-- Sequential I/O later (frequent small flushes will cost more compactions down the road)
-
-Benefit:
-- RAM: frees up memory
-
-Other/wash:
-- At first glance, flushing might seem to increase cost of further insert/updates
-  because it adds a new RowSet. However, because memstores are not compressed in
-  any way, typically the newly flushed RowSet will be much smaller on disk than the
-  memstore that it came from. This means that, even if we have to cache the whole
-  result RowSet in the block cache, we're making much more effective use of RAM and
-  thus may _reduce_ the total number of actual I/Os.
-
-
-DeltaMemStore Flush
-------------------------------
-Basically the same costs as MemStore flush
-
-Additional benefits:
-TODO: flushing may also speed up scans substantially. Need to run experiments on this --
-how much better is scanning a static cached file compared to scanning the equivalent
-memstore. Maybe an order of magnitude.
-
-
-LRU cache eviction
-------------------------------
-Cost: slower reads, slower inserts if evicting key columns or blooms
-Benefit: frees RAM
-
-
-
-
-Maintenance operations to manage future performance
-----------------------------------------
-
-These operations expend some kind of I/O and CPU now in order to improve the performance
-of the system after they complete. They are only ever "necessary" in that if we put them
-off forever, the system will slow to a crawl eventually.
-
-
-Merging Compaction
-------------------------------
-Cost:
-- Sequential I/O now (reading input, re-writing output)
-
-Benefit:
-- reduce the number of RowSets: speeds up inserts, updates. Speeds up short scans where blooms don't apply.
-
-
-Minor Delta Compaction
-------------------------------
-Cost:
-- Sequential I/O (reading input, re-writing output)
-
-Benefit:
-- Speeds up scans -- fewer delta trackers to apply
-- May save disk space (eg when snapshot isolation is implemented, old version updates may be discarded)
-
-
-Major delta compaction
-------------------------------
-Cost:
-- Sequential I/O (reading input, re-writing output)
-
-Benefit:
-- Speeds up scans -- fewer delta trackers to apply, fewer total rows with deltas to apply.
-- Save disk space (eg when snapshot isolation is implemented, old version updates may be discarded)
-
-Relevant metrics:
-- for each column, % of rows in RowSet which have been updated
-- for each column, % of deltas which could be fully merged
-- workload: scan heavy vs insert/update heavy?
-
-
-Implementation Considerations
-===============================================================================
-Each tablet creates several MaintenanceOp objects, representing the various
-maintenance operations which can be performed on it.  It registers these
-operations with the MaintenanceManager.
-
-The MaintenanceManager has a main thread which periodically polls the
-registered MaintenanceOp objects and determines whether it should execute any
-of them.  The default polling interval is 250 ms, but this is configurable.
-Access to the MaintenanceOp is assumed to be thread-safe.  It's important to
-note that the scheduler can choose any op available to it.  It is not bound to
-execute operations on a first-come, first-serve basis.  
-
-If the MaintenanceManager decides to execute one of these operations, it will
-run it in a thread-pool of configurable size.  We assume that maintenance
-operations are blocking and require a thread context.  If the operation fails,
-the MaintenanceManager will log a warning message and re-trigger the main
-thread.  The failed MaintenanceOp will not be retried until a configurable
-grace period has expired.
-
-The MaintenanceOp has various fields indicating how much memory it will
-probably free, how much CPU it will use, and so forth.  It also has a field
-which marks it as not currently executable.  For example, this may be used by
-some Ops that don't want multiple instances of themselves to run concurrently.
-
-We want to keep at least one thread free to run flush operations, so that we
-don't ever get into a situation where we need to free up memory, but all the
-maintenance op threads are working on compactions or other operations.
-Hopefully, most compactions will be reasonably short, so that we won't have to
-schedule long compactions differently than short ones.