Posted to commits@kudu.apache.org by da...@apache.org on 2016/02/18 23:29:54 UTC

[3/5] incubator-kudu git commit: Move design docs to the new folder and change the extension

Move design docs to the new folder and change the extension

This moves the design docs that we had sprinkled all over the code base to the
new design doc folder, changes the extension to .md, and adds them to the
design doc index. Though this doesn't do any kind of thorough markdown
conversion, it does introduce '```' where appropriate so that our nice ASCII
diagrams don't become garbled when rendered on github.

Change-Id: Iebeaba9e975453458ac993131dec6074c59d0eeb
Reviewed-on: http://gerrit.cloudera.org:8080/2227
Reviewed-by: Dan Burkert <da...@cloudera.com>
Tested-by: David Ribeiro Alves <da...@cloudera.com>


Project: http://git-wip-us.apache.org/repos/asf/incubator-kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-kudu/commit/c004cedc
Tree: http://git-wip-us.apache.org/repos/asf/incubator-kudu/tree/c004cedc
Diff: http://git-wip-us.apache.org/repos/asf/incubator-kudu/diff/c004cedc

Branch: refs/heads/master
Commit: c004cedc6cd5e32ec2d7ee9cd3c26e2f4d4d89ab
Parents: fc86225
Author: David Alves <da...@cloudera.com>
Authored: Thu Feb 18 11:58:16 2016 -0800
Committer: David Ribeiro Alves <da...@cloudera.com>
Committed: Thu Feb 18 21:03:02 2016 +0000

----------------------------------------------------------------------
 docs/design-docs/README.md                      |  21 +
 docs/design-docs/cfile.md                       | 188 +++++
 docs/design-docs/codegen.md                     | 248 ++++++
 docs/design-docs/consensus.md                   | 281 +++++++
 docs/design-docs/cpp-client.md                  | 129 ++++
 docs/design-docs/master.md                      | 239 ++++++
 docs/design-docs/rpc.md                         | 362 +++++++++
 .../scan-optimization-partition-pruning.md      |  14 +
 docs/design-docs/tablet.md                      | 760 +++++++++++++++++++
 src/kudu/cfile/README                           | 186 -----
 src/kudu/client/README                          | 132 ----
 src/kudu/codegen/README                         | 247 ------
 src/kudu/common/README                          |  16 -
 src/kudu/consensus/README                       | 280 -------
 src/kudu/master/README                          | 238 ------
 src/kudu/rpc/README                             | 361 ---------
 src/kudu/tablet/README                          | 759 ------------------
 17 files changed, 2242 insertions(+), 2219 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/c004cedc/docs/design-docs/README.md
----------------------------------------------------------------------
diff --git a/docs/design-docs/README.md b/docs/design-docs/README.md
index 454629f..bb88fea 100644
--- a/docs/design-docs/README.md
+++ b/docs/design-docs/README.md
@@ -1,3 +1,17 @@
+<!---
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
 # Design Docs
 
 This directory holds Kudu design documents. These documents are typically
@@ -9,3 +23,10 @@ made.
 | Document | Component(s) | Discussion |
 | -------- | ------------ | ---------- |
 | [Scan Optimization & Partition Pruning](scan-optimization-partition-pruning.md) | Client, Tablet | [gerrit](http://gerrit.cloudera.org:8080/#/c/2149/) |
+| [CFile format](cfile.md) | Tablet | N/A |
+| [Codegen API and impl. details](codegen.md) | Server | N/A |
+| [Consensus design](consensus.md) | Consensus | N/A |
+| [Master design](master.md) | Master | N/A |
+| [RPC design and impl. details](rpc.md) | RPC | N/A |
+| [Tablet design, impl. details and comparison to other systems](tablet.md) | Tablet | N/A |
+| [C++ client design and impl. details](cpp-client.md) | Client | N/A |

http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/c004cedc/docs/design-docs/cfile.md
----------------------------------------------------------------------
diff --git a/docs/design-docs/cfile.md b/docs/design-docs/cfile.md
new file mode 100644
index 0000000..4606d3b
--- /dev/null
+++ b/docs/design-docs/cfile.md
@@ -0,0 +1,188 @@
+<!---
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+CFile is a simple columnar format which stores multiple related B-Trees.
+
+
+File format
+-----------------
+```
+<header>
+<blocks>
+<btree root blocks>
+<footer>
+EOF
+```
+
+
+Header
+------
+
+```
+<magic>: the string 'kuducfil'
+<header length>: 32-bit unsigned integer length delimiter
+<header>: CFileHeaderPB protobuf
+```
+
+
+Footer
+------
+
+```
+<footer>: CFileFooterPB protobuf
+<magic>: the string 'kuducfil'
+<footer length> (length of protobuf)
+```
+
+==============================
+
+Data blocks:
+
+Data blocks are stored with various types of encodings.
+
+* Prefix Encoding
+
+Currently used for STRING blocks. This is based on the encoding used
+by LevelDB for its data blocks, more or less.
+
+Starts with a header of four uint32s, group-varint coded:
+```
+  <num elements>       \
+  <ordinal position>   |
+  <restart interval>   |  group varint 32
+  <unused>             /
+```
+Followed by prefix-compressed values. Each value is stored relative
+to the value preceding it using the following format:
+```
+  shared_bytes: varint32
+  unshared_bytes: varint32
+  delta: char[unshared_bytes]
+```
+Periodically there will be a "restart point" which is necessary for
+faster binary searching. At a "restart point", shared_bytes is
+0 but otherwise the encoding is the same.
+
+At the end of the block is a trailer with the offsets of the
+restart points:
+```
+  restart_points[num_restarts]:  uint32
+  num_restarts: uint32
+```
+The restart points are offsets relative to the start of the block,
+including the header.
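+
+As a sketch of how a reader walks these entries (the varint helper below is
+illustrative, standing in for Kudu's real coding utilities):
+```c++
+#include <cstdint>
+#include <string>
+
+// Decodes a base-128 varint32 at *p and advances *p past it.
+static uint32_t GetVarint32(const uint8_t** p) {
+  uint32_t result = 0;
+  int shift = 0;
+  while (**p & 0x80) {
+    result |= static_cast<uint32_t>(**p & 0x7f) << shift;
+    shift += 7;
+    ++*p;
+  }
+  result |= static_cast<uint32_t>(**p) << shift;
+  ++*p;
+  return result;
+}
+
+// Decodes one prefix-encoded entry at *p. 'prev' is the previously
+// decoded value; at a restart point shared_bytes is 0, so the value
+// is stored in full and 'prev' contributes nothing.
+static std::string DecodeEntry(const uint8_t** p, const std::string& prev) {
+  uint32_t shared = GetVarint32(p);    // bytes reused from 'prev'
+  uint32_t unshared = GetVarint32(p);  // bytes stored in 'delta'
+  std::string value = prev.substr(0, shared);
+  value.append(reinterpret_cast<const char*>(*p), unshared);
+  *p += unshared;
+  return value;
+}
+```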
+
+
+* Group Varint Frame-Of-Reference Encoding
+
+Used for uint32 blocks.
+
+Starts with a header:
+```
+<num elements>     \
+<min element>      |
+<ordinal position> | group varint 32
+<unused>           /
+```
+The ordinal position is the ordinal position of the first item in the
+file. For example, the first data block in the file has ordinal position
+0. If that block had 400 data entries, then the second data block would
+have ordinal position 400.
+
+Followed by the actual data, each set of 4 integers using group-varint.
+The last group is padded with 0s.
+Each integer is relative to the min element in the header.
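+
+A sketch of the frame-of-reference step (group-varint unpacking is elided;
+'deltas' holds the integers already unpacked from the groups):
+```c++
+#include <cstdint>
+#include <vector>
+
+// Reconstructs the original values from frame-of-reference deltas.
+// Padding zeros in the final group are dropped by stopping at
+// num_elements, per the header.
+static std::vector<uint32_t> DecodeFrameOfReference(
+    uint32_t min_element, uint32_t num_elements,
+    const std::vector<uint32_t>& deltas) {
+  std::vector<uint32_t> values;
+  values.reserve(num_elements);
+  for (uint32_t i = 0; i < num_elements; i++) {
+    values.push_back(min_element + deltas[i]);  // delta is relative to min
+  }
+  return values;
+}
+```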
+
+==============================
+
+Nullable Columns
+
+If a column is marked as nullable in the schema, a bitmap is used to keep track
+of the null and non-null rows.
+
+The bitmap is added at the beginning of the data block, and it is RLE-encoded.
+```
+  <num elements in the block>   : vint
+  <null bitmap size>            : vint
+  <null bitmap>                 : RLE encoding
+  <encoded non-null values>     : encoded data
+```
+Data Block Example - 4 items, the first and last are nulls.
+```
+  4        Num Elements in the block
+  1        Null Bitmap Size
+  0110     Null Bitmap
+  v2       Value of row 2
+  v3       Value of row 3
+```
+==============================
+
+Index blocks:
+
+The index blocks are organized in a B-Tree. As data blocks are written,
+they are appended to the end of a leaf index block. When a leaf index
+block reaches the configured block size, it is added to another index
+block higher up the tree, and a new leaf is started. If the intermediate
+index block fills, it will start a new intermediate block and spill into
+an even higher-layer internal block.
+
+For example:
+```
+                      [Int 0]
+           ------------------------------
+           |                            |
+        [Int 1]                       [Int 2]
+    -----------------            --------------
+    |       |       |            |             |
+[Leaf 0]     ...   [Leaf N]   [Leaf N+1]    [Leaf N+2]
+```
+
+In this case, we wrote N leaf blocks, which filled up the node labeled
+Int 1. At this point, the writer would create Int 0 with one entry pointing
+to Int 1. Further leaf blocks (N+1 and N+2) would be written to a new
+internal node (Int 2). When the file is completed, Int 2 will spill,
+adding its entry into Int 0 as well.
+
+Note that this strategy doesn't result in a fully balanced b-tree, but instead
+results in a 100% "fill factor" on all nodes in each level except for the last
+one written.
+
+There are two types of indexes:
+
+- Positional indexes: map ordinal position -> data block offset
+
+These are used to satisfy queries like: "seek to the Nth entry in this file"
+
+- Value-based indexes: responsible for mapping value -> data block offset
+
+These are only present in files which contain data stored in sorted order
+(e.g. key columns). They can satisfy seeks by value.
+
+
+An index block is encoded similarly for both types of indexes:
+```
+<key> <block offset> <block size>
+<key> <block offset> <block size>
+...
+   key: vint64 for positional, otherwise varint-length-prefixed string
+   offset: vint64
+   block size: vint32
+
+<offset to first key>   (fixed32)
+<offset to second key>  (fixed32)
+...
+   These offsets are relative to the start of the block.
+
+<trailer>
+   An IndexBlockTrailerPB protobuf
+<trailer length>
+```
+The trailer protobuf includes a field which designates whether the block
+is a leaf node or internal node of the B-Tree, allowing a reader to know
+whether the pointer is to another index block or to a data block.

http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/c004cedc/docs/design-docs/codegen.md
----------------------------------------------------------------------
diff --git a/docs/design-docs/codegen.md b/docs/design-docs/codegen.md
new file mode 100644
index 0000000..6d03c0c
--- /dev/null
+++ b/docs/design-docs/codegen.md
@@ -0,0 +1,248 @@
+<!---
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+===============================================================================
+Code Generation Interface
+===============================================================================
+
+The codegen directory houses code which is compiled with LLVM code generation
+utilities. The point of code generation is to produce, at run time, code that
+is optimized for data whose shape can only be described at run time. For
+instance, code which projects rows during a scan relies on the types of the
+data stored in each of the columns, but these are only determined by a run
+time schema. To address this, a row projector can be compiled into
+schema-specific machine code to run on the current rows.
+
+Note the following classes, whose headers are LLVM-independent and thus intended
+to be used by the rest of the project without introducing additional dependencies:
+
+CompilationManager (compilation_manager.h)
+RowProjector (row_projector.h)
+
+(Other classes also avoid LLVM headers, but they have little external use).
+
+CompilationManager
+------------------
+
+The compilation manager takes care of asynchronous compilation tasks. It
+accepts requests to compile new objects. If the requested object is already
+cached, then the compiled object is returned. Otherwise, the compilation request
+is enqueued and eventually carried out.
+
+The manager can be accessed (and thus compiled code requests can be made)
+by using the GetSingleton() method. Yes - there's a universal singleton for
+compilation management. See the header for details.
+
+The manager allows for waiting for all current compilations to finish, and can
+register its metrics (which include code cache performance) upon request.
+
+No cleanup is necessary for the CompilationManager. It registers a shutdown method
+with the exit handler.
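+
+As a hypothetical usage sketch (GetSingleton() is the documented entry point;
+the RequestRowProjector() name follows the Request...(inputs) pattern described
+later in this document and is an assumption, as is the header path):
+```c++
+// Sketch: ask the manager for a compiled row projector; on a cache miss
+// the manager enqueues an async compilation and the caller falls back to
+// the generic projector for this request.
+#include "kudu/codegen/compilation_manager.h"
+
+void ProjectRows(const Schema& base_schema, const Schema& projection) {
+  codegen::CompilationManager* manager =
+      codegen::CompilationManager::GetSingleton();
+  scoped_refptr<codegen::RowProjector> projector;
+  if (manager->RequestRowProjector(&base_schema, &projection, &projector)) {
+    // ... use the schema-specialized, code-generated projector ...
+  } else {
+    // ... fall back to common::RowProjector while compilation proceeds ...
+  }
+}
+```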
+
+Generated objects
+-----------------
+
+* codegen::RowProjector - A row projector has the same interface as a
+common::RowProjector, but supports a narrower scope of row types and arenas.
+It does not allow its schema to be reset (indeed, that's the point of compiling
+to a specific schema). The row projector's behavior is fully determined by
+the base and projection schemas. As such, the compilation manager expects those
+two items when retrieving a row projector.
+
+================================================================================
+Code Generation Implementation Details
+================================================================================
+
+Code generation works by creating what is essentially an assembly language
+file for the desired object, then handing off that assembly to the LLVM
+MCJIT compiler. The LLVM backend handles generating target-dependent machine
+code. After code generation, the machine code, which is represented as a
+shared object in memory, is dynamically linked to the invoking application
+(i.e., this one), and the newly generated code becomes available.
+
+Overview of LLVM-interfacing classes
+------------------------------------
+
+Most of the interfacing with LLVM is handled by the CodeGenerator
+(code_generator.h) and ModuleBuilder (module_builder.h) classes. The CodeGenerator
+takes care of setting up the static initializations that LLVM depends on and
+provides an interface which wraps around various calls to LLVM compilation
+functions.
+
+The ModuleBuilder takes care of the one-time construction of a module, which is
+LLVM's unit of code. A module is its own namespace containing functions that
+are compiled together. Currently, LLVM does not support having multiple
+modules per execution engine so the code is coupled with an ExecutionEngine
+instance which owns the generated code behind the scenes (the ExecutionEngine is
+the LLVM class responsible for actual compilation and running of the dynamically
+linked code). Note that throughout the directory the execution engine is
+referred to (and actually typedef-ed) as a JITCodeOwner, because to every
+single class except the ModuleBuilder that is all the execution engine is good
+for. Once the destructor of a JITCodeOwner object is called, the associated
+data is deleted.
+
+In turn, the ModuleBuilder provides a minimal interface to code-generating
+classes (classes that accept data specific to a certain request and create the
+LLVM IR - the assembly that was mentioned earlier - that is appropriate for
+the specific data). The classes fill up the module with the desired assembly.
+
+Sequence of operation
+---------------------
+
+The parts come together as follows (in the case that the code cache is empty).
+
+1. External component requests some compiled object for certain runtime-
+dependent data (e.g. a row projector for given base and projection schemas).
+2. The CompilationManager accepts the request, but finds no such object
+is cached.
+3. The CompilationManager enqueues a request to compile said object to its
+own threadpool, and responds with failure to the external component.
+4. Eventually, a thread becomes available to take on the compilation task. The
+task is dequeued and the CodeGenerator's compilation method for the request is
+called.
+5. The code generator checks that code generation is enabled, and makes a call
+to the appropriate code-generating classes.
+6. The classes rely on the ModuleBuilder to compile their code, after which
+they return pointers to the requested functions.
+
+Code-generating classes
+-----------------------
+
+As mentioned in steps (5) and (6), the code-generating classes are responsible
+for generating the LLVM IR which is compiled at run time for whatever specific
+requests the external components have.
+
+The "code-generating classes" implement the JITWrapper (jit_wrapper.h) interface.
+The base class requires an owning reference to a JITCodeOwner, intended to be the
+owner of the JIT-compiled code that the JITWrapper derived class refers to.
+
+On top of containing the JITCodeOwner and pointers to JIT-compiled functions,
+the JITWrapper also provides methods which enable code caching. Caching compiled
+code is essential because compilation times are prohibitively slow, so satisfying
+any single request with freshly compiled code is not an option. As such, each
+piece of compiled code should be associated with some run time determined data.
+
+In the case of a row projector, this data is a pair of schemas, for the base
+and the projection. In order to work for arbitrary types (so we do not need
+multiple code caches for each different compiled object), the JITWrapper
+implementation must be able to provide a byte string key encoding of its
+associated data. This provides the key for the aforementioned cache. Similarly,
+there should be a static method which allows encoding such a key without
+generating a new instance (every time there is a request made to the manager,
+the manager needs to generate the byte string key to look it up in the cache).
+
+For instance, the JITWrapper for RowProjector code, RowProjectorFunctions, has
+the following method:
+
+```c++
+static Status EncodeKey(const Schema& base, const Schema& proj,
+                        faststring* out);
+```
+
+For any given input (pair of schemas), the JITWrapper generates a unique key
+so that the cache can be looked up for the generated row projector in later
+requests (the manager handles the cache lookups).
+
+In order to keep one homogeneous cache of all the generated code, the keys
+need to be unique across classes, which is difficult to maintain because the
+encodings could conflict by accident. For this reason, a type identifier should
+be prefixed to the beginning of every key. This identifier is an enum, with
+values for each JITWrapper derived type, thus guaranteeing uniqueness between
+classes.
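+
+A minimal sketch of the prefixing scheme (the enum and helper here are
+illustrative, not the actual JITWrapper API):
+```c++
+#include <cstdint>
+#include <string>
+
+// Stand-in for JITWrapper::JITWrapperType: one value per code-generated
+// class, so keys can never collide across classes.
+enum class JitKeyType : uint8_t {
+  ROW_PROJECTOR = 0,
+  // ... one value per JITWrapper subclass ...
+};
+
+// Builds a cache key: type tag first, then a length-prefixed encoding
+// of the inputs, so that the concatenation stays unambiguous.
+static std::string EncodeCacheKey(JitKeyType type,
+                                  const std::string& base_schema_bytes,
+                                  const std::string& proj_schema_bytes) {
+  std::string key;
+  key.push_back(static_cast<char>(type));
+  uint32_t len = static_cast<uint32_t>(base_schema_bytes.size());
+  key.append(reinterpret_cast<const char*>(&len), sizeof(len));
+  key += base_schema_bytes;
+  key += proj_schema_bytes;
+  return key;
+}
+```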
+
+Guide to creating new codegenned classes
+----------------------------------------
+
+To add new classes with code generation, one needs to generate the appropriate
+JITWrapper and update the higher-level classes.
+
+First, the inputs to code generation need to be established (henceforth referred
+to as just "inputs").
+
+1. Making a new JITWrapper
+
+A new JITWrapper should derive from the JITWrapper class and expose a static
+key-generation method which returns a key given the inputs for the class. To
+satisfy the prefix condition, a new enum value must be added in
+JITWrapper::JITWrapperType.
+
+The JITWrapper derived class should have a creation method that generates
+a shared reference to an instance of itself. The JITWrappers should only
+be handled through shared references because this ensures that the code owner
+within the class is kept alive exactly as long as references to the compiled
+code exist (the derived class is the only class that should contain members
+which are pointers to the desired compiled functions for the given input).
+
+The actual creation of the compiled code is perhaps the hardest part. See the
+section below.
+
+2. Updating top-level classes
+
+On top of adding the new enum value in the JITWrapper enumeration, several other
+top-level classes should provide the interfaces necessary to use the new
+codegen class (the layer of interface classes enables separate components
+of Kudu to be independent of LLVM headers).
+
+In the CodeGenerator, there should be a Compile...(inputs) function which
+creates a scoped_refptr to the derived JITWrapper class by invoking the
+class' creation method. Note that the CodeGenerator should also print
+the appropriate LLVM disassembly if the flag is activated.
+
+The compilation manager should likewise offer a Request...(inputs) function
+that returns the requested compiled functions: it generates a key with the
+static encoding method mentioned above and looks the key up in the cache. If the
+cache lookup fails, the manager should submit a new compilation request. The
+cache hit metrics should be incremented appropriately.
+
+Guide to code generation
+------------------------
+
+The resources at the bottom of this document provide a good reference for
+LLVM IR. However, there should be little need to use much LLVM IR because the
+majority of the LLVM code can be precompiled.
+
+If you wish to execute certain functions A, B, or C based on the input data which
+takes on values 1, 2, or 3, then do the following:
+
+1. Write A, B, and C in an extern "C" namespace (to avoid name mangling) in
+codegen/precompiled.cc.
+2. When creating your derived JITWrapper class, create a ModuleBuilder. The
+builder should load your functions A, B, and C automatically.
+3. Create an LLVM IR function dependent on the inputs. I.e., if the input
+for code generation is 1, then the desired function would be A. In that case,
+request the function called "A" from the module builder. The builder, when
+compiled, will offer a pointer to the compiled function.
+
+Note in the above example the only utility of code generation is avoiding
+a couple of branches which decide on A, B, or C based on input data 1, 2, or 3.
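+
+For example, the precompiled side of step 1 might look like this
+(hypothetical functions standing in for A, B, and C):
+```c++
+// codegen/precompiled.cc (sketch): extern "C" keeps the symbol names
+// unmangled, so the IR built at run time can reference them by name.
+extern "C" {
+
+int HandleTypeOne(const char* row, int len) {  // stands in for "A"
+  return len;                                  // work specific to input 1
+}
+
+int HandleTypeTwo(const char* row, int len) {  // stands in for "B"
+  return len * 2;                              // work specific to input 2
+}
+
+}  // extern "C"
+```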
+
+Code generation gets much more mileage from constant propagation. To utilize this,
+one needs to generate a new function in LLVM IR at run time which passes
+arguments to the precompiled functions, with hopefully some relevant constants
+based on the input data. When LLVM compiles the module, it will propagate those
+constants, creating more efficient machine code.
+
+To create a function in a module at run time, you need to use a
+ModuleBuilder::LLVMBuilder. The builder emits LLVM IR dynamically. It is an
+alias for the llvm::IRBuilder<> class, whose API is available in the links at
+the bottom of this document. A worked example is available in row_projector.cc.
+
+Useful resources
+----------------
+http://llvm.org/docs/doxygen/html/index.html
+http://llvm.org/docs/tutorial/
+http://llvm.org/docs/LangRef.html
+
+Debugging
+---------
+
+Debug info is available by printing the generated code. See the flags declared
+in code_generator.cc for further details.

http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/c004cedc/docs/design-docs/consensus.md
----------------------------------------------------------------------
diff --git a/docs/design-docs/consensus.md b/docs/design-docs/consensus.md
new file mode 100644
index 0000000..a782884
--- /dev/null
+++ b/docs/design-docs/consensus.md
@@ -0,0 +1,281 @@
+<!---
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+This document introduces how Kudu will handle log replication and consistency
+using an algorithm known as Viewstamped Replication (VR) and a series of
+practical algorithms/techniques for recovery, reconfiguration, compactions, etc.
+This document introduces all the concepts directly related to Kudu; for any
+missing information please refer to the original papers [1,3,4].
+
+Quorums, in Kudu, are sets of collaborating processes that serve the purpose
+of keeping a consistent, replicated log of operations on a given data set, e.g.
+a tablet. This consistent, replicated log also plays the role of the Write
+Ahead Log (WAL) for the tablet. Throughout this document we use config
+participant and process interchangeably; these do not represent machines or OS
+processes, as machines and/or application daemons will participate in multiple
+configs.
+
+============================================================
+The write ahead log (WAL)
+============================================================
+
+The WAL provides strict ordering and durability guarantees:
+
+1) If calls to Reserve() are externally synchronized, the order in
+which entries are reserved will be the order in which they are
+committed to disk.
+
+2) If fsync is enabled (via the 'log_force_fsync_all' flag -- see
+log_util.cc; note: this is _DISABLED_ by default), then every single
+transaction is guaranteed to be synchronized to disk before its
+execution is deemed successful.
+
+The log uses group commit to increase performance, primarily by allowing
+throughput to scale with the number of writer threads while
+maintaining close to constant latency.
+
+============================================================
+Basic WAL usage
+============================================================
+
+To add operations to the log, the caller must obtain the lock and
+call Reserve() with a collection of operations and a pointer to the
+reserved entry (the latter being an out parameter). Then, the caller
+may release the lock and call the AsyncAppend() method with the
+reserved entry and a callback that will be invoked upon completion of
+the append. The AsyncAppend() method performs serialization and copying
+outside of the lock.
+
+For sample usage see local_consensus.cc and mt-log-test.cc.
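+
+In code, the pattern looks roughly like this (a sketch; the exact types and
+signatures live in log.h and may differ):
+```c++
+#include <mutex>
+#include <vector>
+
+// Sketch of the append path: reserve under the lock to fix the commit
+// order, then serialize, copy and append asynchronously outside of it.
+Status AppendOperations(Log* log, std::mutex* append_lock,
+                        const std::vector<consensus::OperationPB*>& ops,
+                        const StatusCallback& on_appended) {
+  LogEntryBatch* reserved = nullptr;
+  {
+    std::lock_guard<std::mutex> l(*append_lock);  // external synchronization
+    RETURN_NOT_OK(log->Reserve(ops, &reserved));
+  }
+  // Serialization and copying happen inside AsyncAppend(), outside the
+  // lock; 'on_appended' fires once the entry is appended (and synced,
+  // if fsync is enabled).
+  return log->AsyncAppend(reserved, on_appended);
+}
+```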
+
+=============================================================
+Group commit implementation details
+=============================================================
+
+Currently, the group commit implementation uses a blocking queue (see
+Log::entry_queue_ in log.h) and a separate long-running thread (see
+Log::AppendThread in log.cc). Since access to the queue is
+synchronized via a lock and only a single thread removes elements from
+the queue, the order in which elements are added to the queue will be
+the same as the order in which they are removed.
+
+The size of the queue is currently based on the number of entries, but
+this will eventually be changed to be based on the total size of all
+queued entries in bytes.
+
+=============================================================
+Reserving a slot for the entry
+=============================================================
+
+Currently Reserve() allocates memory for a new entry on the heap each
+time, marks the entry internally as "reserved" via a state enum, and
+adds it to the above-mentioned queue. In the future, a ring-buffer or
+another similar data structure could be used that would take the place
+of the queue and make allocation unnecessary.
+
+============================================================
+Copying the entry contents to the reserved slot
+============================================================
+
+AsyncAppend() serializes the contents of the entry to a buffer field
+in the entry object (currently the buffer is allocated at the same
+time as the entry itself); this avoids contention that would occur if
+a shared buffer was to be used.
+
+============================================================
+Synchronizing the entry contents to disk
+============================================================
+
+A separate appender thread waits until entries are added to the
+queue. Once the queue is no longer empty, the thread grabs all
+elements on the queue. Then for each dequeued entry, the appender
+waits until the entry is marked ready (see "Copying the entry contents
+to the reserved slot" above) and then appends the entry to the current
+log segment without synchronizing the underlying file with the
+filesystem (env::WritableFile::Append()).
+
+Note: this could be further optimized by calling AppendVector() with a
+vector of buffers from all of the consumed entries.
+
+Once all entries are successfully appended, the appender thread syncs
+the file to disk (env::WritableFile::Sync()) and (again) waits until
+more entries are added to the queue, or until the queue or the
+appender thread are shut down.
+
+============================================================
+Log segment files and asynchronous preallocation
+============================================================
+
+The log uses PosixWritableFile() for underlying storage. If preallocation
+is enabled ('--log_preallocate_segments' flag, defined in log_util.cc,
+true by default), then whenever a new segment is created, the
+underlying file is preallocated to a certain size in megabytes
+('--log_segment_size_mb', defined in log_util.cc, default 64). While
+the offset in the segment file is below the preallocated length,
+the cheaper fdatasync() operation is used instead of fsync().
+
+When the size of the current segment exceeds the preallocated size, a
+task is launched in a separate thread that begins preallocating the
+underlying file for the new log segment; meanwhile, until the task
+finishes, appends still go to the existing file.  Once the new file is
+preallocated, it is renamed to the correct name for the next segment
+and is swapped in place of the current segment.
+
+When the current segment is closed without reaching the preallocated
+size, the underlying file is truncated to the last written offset
+(i.e., the actual size).
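+
+The choice of sync call reduces to something like the following (a sketch,
+not the actual log code):
+```c++
+#include <sys/types.h>
+#include <unistd.h>
+
+// While the written offset stays inside the preallocated region, the file
+// size metadata is already durable, so the cheaper fdatasync() suffices;
+// past it, fsync() is needed to also persist the size update.
+static int SyncSegment(int fd, off_t written_offset, off_t preallocated_len) {
+  if (written_offset <= preallocated_len) {
+    return fdatasync(fd);  // flushes file data only
+  }
+  return fsync(fd);        // flushes file data and metadata
+}
+```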
+
+============================================================
+Quorums and roles within configs
+============================================================
+
+A config in Kudu is a fault-tolerant, consistent unit that serves requests for
+a single tablet. As long as there are 2f+1 participants available in a config,
+where f is the number of possibly faulty participants, the config will keep
+serving requests for its tablet and it is guaranteed that clients perceive a
+fully consistent, linearizable view of both data and operations on that data.
+The f parameter, defined table-wide through configuration, implicitly
+defines the size of the config: f=0 indicates a single node config, f=1
+indicates a 3 node config, f=2 indicates a 5 node config, etc. Quorums may
+overlap in the sense that each physical machine may be participating in
+multiple configs, usually one per each tablet that it serves.
+
+Within a single config, in steady state, i.e. when no peer is faulty, there
+are two main types of peers: the leader peer and the follower peers.
+The leader peer dictates the serialization of the operations throughout the
+config, its version of the sequence of data altering requests is the "truth"
+and any data altering request is only considered final (i.e. can be
+acknowledged to the client as successful) when a majority of the config
+acknowledges that they "agree" with the leader's view of the event order.
+In practice this means that all write requests are sent directly to the
+leader, which then replicates them to a majority of the followers before
+sending an ACK to the client. Follower peers are completely passive in
+steady state, only receiving data from the leader and acknowledging back.
+Follower peers only become active when the leader process stops and one
+of the followers (if there are any) must be elected leader.
+
+Participants in a config may be assigned the following roles:
+
+LEADER - The current leader of the config, receives requests from clients
+and serializes them to other nodes.
+
+FOLLOWER - Active participants in the config, whose votes count towards
+majority, replication count etc.
+
+LEARNER - Passive participants in the config, whose votes do not count
+towards majority or replication count. New nodes joining the config
+will have this role until they catch up and can be promoted to FOLLOWER.
+
+NON_PARTICIPANT - A peer that does not participate in a particular
+config. Mostly used to mark prior participants that stopped being so
+on a configuration change.
+
+The following diagram illustrates the possible state changes:
+
+```
+                 +------------+
+                 |  NON_PART  +---+
+                 +-----+------+   |
+Exist. RaftConfig?     |          | New RaftConfig?
+                 +-----v------+   |
+                 |  LEARNER   |   |
+                 +-----+------+   |
+                       |          |
+                 +-----v------+   |
+             +-->+  FOLLOW.   +<--+
+             |   +-----+------+
+             |         |
+             |   +-----v------+
+  Step Down  +<--+ CANDIDATE  |
+             ^   +-----+------+
+             |         |
+             |   +-----v------+
+             +<--+   LEADER   |
+                 +------------+
+```
+
+Additionally, all states can transition to NON_PARTICIPANT on configuration
+changes and/or peer timeout/death.
+
+============================================================
+Assembling/Rebooting a RaftConfig and RaftConfig States
+============================================================
+
+Prior to starting/rebooting a peer, the state in the WAL must have been replayed
+in a bootstrap phase. This process will yield an up-to-date Log and Tablet.
+The new/rebooting peer is then Init()'ed with this Log. The Log is queried
+for the last committed configuration entry (A Raft configuration consists of
+a set of peers (uuid and last known address) and hinted* roles). If there is
+none, it means this is a new config.
+
+After the peer has been Init()'ed, Start(Configuration) is called. The provided
+configuration is a hint which is only taken into account if there was no previous
+configuration*.
+
+Independently of whether the configuration is a new one (new config)
+or an old one (rebooting config), the config cannot start until a
+leader has been elected and replicates the configuration through
+consensus. This ensures that a majority of nodes agree that this is
+the most recent configuration.
+
+The provided configuration will always specify a leader -- in the case
+of a new config, it is chosen by the master, and in the case of a
+rebooted one, it is the configuration that was active before the node
+crashed. In either case, replicating this initial configuration
+entry happens in the exact same way as any other config entry,
+i.e. the LEADER will try and replicate it to FOLLOWERS. As usual if
+the LEADER fails, leader election is triggered and the new LEADER will
+try to replicate a new configuration.
+
+Only after the config has successfully replicated the initial configuration
+entry is the config ready to accept writes.
+
+
+Peers in the config can therefore be in the following states:
+
+BOOTSTRAPPING: The phase prior to initialization where the Log is being
+replayed. If a majority of peers are still BOOTSTRAPPING, the config doesn't
+exist yet.
+
+CONFIGURING: Until the current configuration is pushed through consensus. This
+is true for both new configs and rebooting configs. The peers do not accept
+client requests in this state. In this state, the Leader tries to replicate
+the configuration. Followers run failure detection and trigger leader election
+if the hinted leader doesn't successfully replicate within the configured
+timeout period.
+
+RUNNING: The LEADER peer accepts writes and replicates them through consensus.
+FOLLOWER replicas accept writes from the leader and ACK them.
+
+* The configuration provided on Start() can only be taken into account if there
+is an appropriate leader election algorithm. This can be added later but is not
+present in the initial implementation. Roles are hinted in the sense that the
+config initiator (usually the master) might hint what the roles for the peers
+in the config should be, but the config is the ultimate decider on whether that
+is possible or not.
+
+============================================================
+References
+============================================================
+[1] http://ramcloud.stanford.edu/raft.pdf
+
+[2] http://www.cs.berkeley.edu/~brewer/cs262/Aries.pdf
+
+[3] Viewstamped Replication: A New Primary Copy Method to Support
+Highly-Available Distributed Systems. B. Oki, B. Liskov
+http://www.pmg.csail.mit.edu/papers/vr.pdf
+
+[4] Viewstamped Replication Revisited. B. Liskov and J. Cowling 
+http://pmg.csail.mit.edu/papers/vr-revisited.pdf
+
+[5] Aether: A Scalable Approach to logging
+http://infoscience.epfl.ch/record/149436/files/vldb10aether.pdf

http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/c004cedc/docs/design-docs/cpp-client.md
----------------------------------------------------------------------
diff --git a/docs/design-docs/cpp-client.md b/docs/design-docs/cpp-client.md
new file mode 100644
index 0000000..f7623d1
--- /dev/null
+++ b/docs/design-docs/cpp-client.md
@@ -0,0 +1,129 @@
+<!---
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+```c++
+/*
+
+This file contains some example code for the C++ client. It will
+probably be eventually removed in favor of actual runnable examples,
+but serves as a guide/docs for the client API design for now.
+
+See class docs for KuduClient, KuduSession, KuduTable for proper docs.
+*/
+
+// This is an example of explicit batching done by the client.
+// This would be used in contexts like interactive webapps, where
+// you are likely going to set a short timeout.
+void ExplicitBatchingExample() {
+  // Get a reference to the tablet we want to insert into.
+  // Note that this may be done without a session, either before or
+  // after creating a session, since a session isn't tied to any
+  // particular table or set of tables.
+  scoped_refptr<KuduTable> t;
+  CHECK_OK(client_->OpenTable("my_table", &t));
+
+  // Create a new session. All data-access operations must happen through
+  // a session.
+  shared_ptr<KuduSession> session(client_->NewSession());
+
+  // Setting flush mode to MANUAL_FLUSH makes the session accumulate
+  // all operations until the next Flush() call. This is sort of like
+  // TCP_CORK.
+  CHECK_OK(session->SetFlushMode(KuduSession::MANUAL_FLUSH));
+
+  // Insert 100 rows.
+  for (int i = 0; i < 100; i++) {
+    gscoped_ptr<Insert> ins = t->NewInsert();
+    ins->mutable_row()->SetInt64("key", i);
+    ins->mutable_row()->SetInt64("val", i * 2);
+    // The insert should return immediately after moving the insert
+    // into the appropriate buffers. This always returns OK unless the
+    // Insert itself is invalid (e.g. missing a key column).
+    CHECK_OK(session->Apply(ins.Pass()));
+  }
+
+  // Update a row.
+  gscoped_ptr<Update> upd = t->NewUpdate();
+  upd->mutable_row()->SetInt64("key", 1);
+  upd->mutable_row()->SetInt64("val", 1 * 2 + 1);
+
+  // Delete a row.
+  gscoped_ptr<Delete> del = t->NewDelete();
+  del->mutable_row()->SetInt64("key", 2); // only specify key.
+
+  // Setting a timeout on the session applies to the next Flush call.
+  session->SetTimeoutMillis(300);
+
+  // After accumulating all of the stuff in the batch, call Flush()
+  // to send the updates in one go. This may be done either sync or async.
+  // Sync API example:
+  {
+    // Returns an Error if any insert in the batch had an issue.
+    CHECK_OK(session->Flush());
+    // Call session->GetPendingErrors() to get errors.
+  }
+
+  // Async API example:
+  {
+    // Returns immediately, calls Callback when either success or failure.
+    CHECK_OK(session->FlushAsync(MyCallback));
+    // TBD: should you be able to use the same session before the Callback has
+    // been called? Or require that you do nothing with this session while
+    // in-flight (which is more like what JDBC does I think)
+  }
+}
+
+// This is an example of how a "bulk ingest" program might work -- one in
+// which the client just wants to shove a bunch of data in, and perhaps
+// fail if it ever gets an error.
+void BulkIngestExample() {
+  scoped_refptr<KuduTable> t;
+  CHECK_OK(client_->OpenTable("my_table", &t));
+  shared_ptr<KuduSession> session(client_->NewSession());
+
+  // If the amount of buffered data in RAM is larger than this amount,
+  // blocks the writer from performing more inserts until memory has
+  // been freed (either by inserts succeeding or timing out).
+  session->SetBufferSpace(32 * 1024 * 1024);
+
+  // Set a long timeout for this kind of usecase. This determines how long
+  // Flush() may block for, as well as how long Apply() may block due to
+  // the buffer being full.
+  session->SetTimeoutMillis(60 * 1000);
+
+  // In AUTO_FLUSH_BACKGROUND mode, the session will try to accumulate batches
+  // for optimal efficiency, rather than flushing each operation.
+  CHECK_OK(session->SetFlushMode(KuduSession::AUTO_FLUSH_BACKGROUND));
+
+  for (int i = 0; i < 10000; i++) {
+    gscoped_ptr<Insert> ins = t->NewInsert();
+    ins->mutable_row()->SetInt64("key", i);
+    ins->mutable_row()->SetInt64("val", i * 2);
+    // This will start getting written in the background.
+    // If there are any pending errors, it will return a bad Status,
+    // and the user should call GetPendingErrors()
+    // This may block if the buffer is full.
+    CHECK_OK(session->Apply(ins.Pass()));
+    if (session->HasErrors()) {
+      LOG(FATAL) << "Failed to insert some rows: " << DumpErrors(session);
+    }
+  }
+  // Blocks until remaining buffered operations have been flushed.
+  // May also use the async API per above.
+  Status s = session->Flush();
+  if (!s.ok()) {
+    LOG(FATAL) << "Failed to insert some rows: " << DumpErrors(session);
+  }
+}
+```

http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/c004cedc/docs/design-docs/master.md
----------------------------------------------------------------------
diff --git a/docs/design-docs/master.md b/docs/design-docs/master.md
new file mode 100644
index 0000000..4a366aa
--- /dev/null
+++ b/docs/design-docs/master.md
@@ -0,0 +1,239 @@
+<!---
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+============================================================
+The Catalog Manager and System Tables
+============================================================
+
+The Catalog Manager keeps track of the Kudu tables and tablets defined by the
+user in the cluster.
+
+All the table and tablet information is stored in-memory in copy-on-write
+TableInfo / TabletInfo objects, as well as on-disk, in the "sys.catalog"
+Kudu system table hosted only on the Masters.  This system table is loaded
+into memory on Master startup.  At the time of this writing, the "sys.catalog"
+table consists of only a single tablet in order to provide strong consistency
+for the metadata under RAFT replication (as currently, each tablet has its own
+log).
+
+To add or modify a table or tablet, the Master writes the changes in memory
+but does not yet commit them; it then writes and flushes the system table to
+disk, and finally makes the changes visible in-memory (commits them) if the
+disk write (and, in a distributed master setup, config-based replication) is
+successful. This allows readers to access the in-memory state in a consistent
+way, even while a write is in-progress.
+
+This design prevents having to go through the whole scan path to service tablet
+location calls, which would be more expensive, and allows for easily keeping
+"soft" state in the Master for every Table and Tablet.
+
+The catalog manager maintains 3 hash-maps for looking up info in the sys table:
+- [Table Id] -> TableInfo
+- [Table Name] -> TableInfo
+- [Tablet Id] -> TabletInfo
+
+The TableInfo has a map [tablet-start-key] -> TabletInfo used to provide
+the tablet locations to the user based on a key-range request.
+
+
+Table Creation
+--------------
+
+The below corresponds to the code in CatalogManager::CreateTable().
+
+1. Client -> Master request: Create "table X" with N tablets and schema S.
+2. Master: CatalogManager::CreateTable():
+   a. Validate user request (e.g. ensure a valid schema).
+   b. Verify that the table name is not already taken.
+      TODO: What about old, deleted tables?
+   c. Add (in-memory) the new TableInfo (in "preparing" state).
+   d. Add (in-memory) the TabletInfo based on the user-provided pre-split-keys
+      field (in "preparing" state).
+   e. Write the tablets info to "sys.catalog"
+      (The Master process is killed if the write fails).
+      - Master begins writing to disk.
+      - Note: If the Master crashes or restarts here or at any time previous to
+        this point, the table will not exist when the Master comes back online.
+   f. Write the table info to "sys.catalog" with the "running" state
+      (The Master process is killed if the write fails).
+      - Master completes writing to disk.
+      - After this point, the table will exist and be re-created as necessary
+        at startup time after a crash or process restart.
+   g. Commit the "running" state to memory, which allows clients to see the table.
+3. Master -> Client response: The table has been created with some ID, i.e. "xyz"
+   (or, in case something went wrong, an error message).
+
+After this point in time, the table is reported as created, which means that if
+the cluster is shut down, when it starts back up the table will still exist.
+However, the tablets are not yet created (see Table Assignment, below).
+
+
+Table Deletion
+--------------
+
+When the user sends a DeleteTable request for table T, table T is marked as
+deleted by writing a "deleted" flag in the state field in T's record in the
+"sys.catalog" table, table T is removed from the in-memory "table names"
+map on the Master, and the table is marked as being "deleted" in the
+in-memory TableInfo / TabletInfo "state" field on the Master.
+TODO: Could this race with table deletion / creation??
+
+At this point, the table is no longer externally visible to clients via Master
+RPC calls, but the tablet configs that make up the table may still be up and
+running. New clients trying to open the table will get a NotFound error, while
+clients that already have the tablet locations cached may still be able to
+read and write to the tablet configs, as long as the corresponding tablet
+servers are online and their respective tablets have not yet been deleted.
+In some ways, this is similar to the design of FS unlink.
+
+The Master will asynchronously send a DeleteTablet RPC request to each tablet
+(one RPC request per tablet server in the config, for each tablet), and the
+tablets will therefore be deleted in parallel in some unspecified order. If the
+Master or tablet server goes offline before a particular DeleteTablet operation
+successfully completes, the Master will send a new DeleteTablet request at the
+time that the next heartbeat is received from the tablet that is to be deleted.
+
+A "Cleaner" process will be reponsible for removing the data from deleted tables
+and tablets in the future, both on-disk and cached in memory (TODO).
+
+
+Table Assignment (Tablet Creation)
+----------------------------------
+
+Once a table is created, the tablets must be created on a set of replicas. In
+order to do that, the master has to select the replicas and associate them to
+the tablet.
+
+For each tablet that has not been created, we select a set of replicas and a
+leader, and send the "create tablet" request. On the next TS heartbeat from the
+leader we can mark the tablet as "running", if reported. If we don't receive a
+"tablet created" report after ASSIGNMENT-TIMEOUT-MSEC, we replace the tablet
+with a new one, following these same steps for the new tablet.
+
+The Assignment is processed by the "CatalogManagerBgTasks" thread. This thread
+waits for an event, which can be one of:
+
+- Create Table (need to process the new tablet for assignment)
+- Assignment Timeout (some tablet request timeout expired, replace it)
+
+This is the current control flow:
+
+- CatalogManagerBgTasks thread:
+  1. Process Pending Assignments:
+     - For each tablet pending assignment:
+       - If tablet creation was already requested:
+          - If we did not receive a response yet, and the configurable
+            assignment timeout period has passed, mark the tablet as "replaced":
+            1. Delete the tablet if it ever reports in.
+            2. Create a new tablet in its place, add that tablet to the
+               "create table" list.
+       - Else, if the tablet is new (just created by CreateTable in "preparing" state):
+         - Add it to the "create tablet" list.
+     - Now, for each tablet in the "create tablet" list:
+       - Select a set of tablet servers to host the tablet config.
+       - Select a tablet server to be the initial config leader.
+       [BEGIN-WRITE-TO-DISK]
+       - Flush the "to create" to sys.catalog with state "creating"
+       [If something fails here, the "Process Pending Assignments" will
+        reprocess these tablets. As nothing was done, running tables will be replaced]
+       [END-WRITE-TO-DISK]
+       - For each tablet server in the config:
+         - Send an async CreateTablet() RPC request to the TS.
+           On TS-heartbeat, the Master will receive the notification of "tablet creation".
+     - Commit any changes in state to memory.
+       At this point the tablets marked as "running" are visible to the user.
+
+  2. Cleanup deleted tables & tablets (FIXME: is this implemented?):
+     - Remove the tables/tablets with "deleted" state from "sys.catalog"
+     - Remove the tablets with "deleted" state from the in-memory map
+     - Remove the tables with "deleted" state from the in-memory map
+
+When the TS receives a CreateTablet() RPC, it will attempt to create the tablet
+replica locally. Once it is successful, it will be added to the next tablet
+report. When the tablet is reported, the master-side ProcessTabletReport()
+function is called.
+
+If we find at this point that the reported tablet is in "creating" state, and
+the TS reporting the tablet is the leader selected during the assignment
+process (see CatalogManagerBgTasksThread above), the tablet will be marked as
+running and committed to disk, completing the assignment process.
+
+
+Alter Table
+-----------
+
+When the user sends an alter request, which may contain changes to the schema,
+table name or attributes, the Master will send a set of AlterTable() RPCs to
+each TS handling the set of tablets currently running. The Master will keep
+retrying in case of error.
+
+If a TS is down or goes down during an AlterTable request, on restart it will
+report the schema version that it is using, and if it is out of date, the Master
+will send an AlterTable request to that TS at that time.
+
+When the Master first comes online after being restarted, a full tablet report
+will be requested from each TS, and the tablet schema version sent on the next
+heartbeat will be used to determine if a given TS needs an AlterTable() call.
+
+============================================================
+Heartbeats and TSManager
+============================================================
+
+Heartbeats are sent by the TS to the master. Per master.proto, a
+heartbeat contains:
+
+1. Node instance information: permanent uuid, node sequence number
+(which is incremented each time the node is started).
+
+2. (Optional) registration. Sent either at TS startup or if the master
+responded to a previous heartbeat with "needs register" (see
+'Handling heartbeats' below for an explanation of when this response
+will be sent).
+
+3. (Optional) tablet report. Sent either when tablet information has
+changed, or if the master responded to a previous heartbeat with
+"needs a full tablet report" (see "Handling heartbeats" below for an
+explanation of when this response will be sent).
+
+Handling heartbeats
+-------------------
+
+Upon receiving a heartbeat from a TS, the master will:
+
+1) Check if the heartbeat has registration info. If so, register
+the TS instance with TSManager (see "TSManager" below for more
+details).
+
+2) Retrieve a TSDescriptor from TSManager. If the TSDescriptor
+is not found, reply to the TS with "need re-register" field set to
+true, and return early.
+
+3) Update the heartbeat time (see "TSManager" below) in the
+registration object.
+
+4) If the heartbeat contains a tablet report, the Catalog Manager will
+process the report and update its cache as well as the system tables
+(see "Catalog Manager" above). Otherwise, the master will respond to
+the TS requesting a full tablet report.
+
+5) Send a success response to the TS.
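+
+In code form, the handler for the five steps above is roughly (a sketch;
+assume this is a method of the master service, so ts_manager_ and
+catalog_manager_ are members, and the protobuf field names are illustrative):
+```c++
+// Sketch of the five heartbeat-handling steps.
+void HandleHeartbeat(const TSHeartbeatRequestPB& req,
+                     TSHeartbeatResponsePB* resp) {
+  if (req.has_registration()) {                         // (1) register TS
+    ts_manager_->RegisterTS(req.instance(), req.registration());
+  }
+  std::shared_ptr<TSDescriptor> desc;
+  if (!ts_manager_->LookupTS(req.instance(), &desc)) {  // (2) lookup
+    resp->set_needs_reregister(true);
+    return;
+  }
+  desc->UpdateHeartbeatTime();                          // (3) record time
+  if (req.has_tablet_report()) {                        // (4) process report
+    catalog_manager_->ProcessTabletReport(req.tablet_report(), resp);
+  } else {
+    resp->set_needs_full_tablet_report(true);
+  }
+  resp->set_success(true);                              // (5) respond
+}
+```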
+
+TSManager
+---------
+
+TSManager provides in-memory storage for information sent by the
+tablet server to the master (tablet servers that have been heard from,
+heartbeats, tablet reports, etc.). The information is stored in a
+map, where the key is the permanent uuid of a tablet server and the
+value is (a pointer to) a TSDescriptor.

http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/c004cedc/docs/design-docs/rpc.md
----------------------------------------------------------------------
diff --git a/docs/design-docs/rpc.md b/docs/design-docs/rpc.md
new file mode 100644
index 0000000..628aa33
--- /dev/null
+++ b/docs/design-docs/rpc.md
@@ -0,0 +1,362 @@
+<!---
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+===============================================================================
+RPC
+===============================================================================
+
+-------------------------------------------------------------------------------
+Intro
+-------------------------------------------------------------------------------
+The RPC layer makes communication with remote processes look like local
+function calls.  You can make either asynchronous calls, in which you provide a
+callback which is invoked later, or synchronous calls, where your thread blocks
+until the remote system responds.
+
+The wire format of Kudu RPC is very close to the wire format of Hadoop IPC in
+hadoop-3 and beyond. It is not identical since there are still some java-isms
+left in Hadoop IPC which we did not want to inherit. In addition, Kudu RPC has
+developed some extra features such as deadline propagation which are not
+available in Hadoop. However, the overall structure of the wire protocol is
+very similar.
+
+We use protocol buffers for serialization, and libev for non-blocking I/O.
+
+For some code examples, look in rpc-test.cc and rpc_stub-test.cc.
+
+-------------------------------------------------------------------------------
+Overview
+-------------------------------------------------------------------------------
+```
+                                        +------------------------------------+
+                                        | AcceptorPool                       |
+                                        |   a pool of threads which          |
+  +-------------------------+           |   call accept()                    |
+  | Proxy                   |           +------------------------------------+
+  |                         |                          | new socket
+  | The proxy is the object |                          V
+  | which has the remote    |           +------------------------------------+
+  | method definitions.     | --------> | Messenger                          |
+  |                         |           |                                    |
+  +-------------------------+           | +-----------+ +-----------+        |
+                                        | | reactor 1 | | reactor 2 | ...    |
+  +-------------------------+           | +-----------+ +-----------+        |
+  | ResponseCallback        | <-------- |                                    |<-.
+  |                         |           +------------------------------------+  |
+  | The callback which gets |                          |                        |
+  | invoked when the remote |                          V                        |
+  | end replies or the call |           +------------------------------------+  |
+  | otherwise terminates.   |           | ServicePool                        |  |
+  +-------------------------+           |   a pool of threads which          |  | Call responses
+                                        |   pull new inbound calls from a    |  | sent back via
+                                        |   work queue.                      |  | messenger.
+                                        +------------------------------------+  |
+                                                       |                        |
+                                                       v                        |
+                                        +------------------------------------+  |
+                                        | ServiceIf                          |  |
+                                        |   user-implemented class which     | /
+                                        |   handles new inbound RPCs         |
+                                        +------------------------------------+
+```
+Each reactor has a thread which uses epoll to handle many sockets using
+non-blocking I/O.  Blocking calls are implemented by the Proxy on top of
+non-blocking calls -- from the point of view of the Messenger, all calls are
+non-blocking.
+
+The acceptor pool and the service pool are optional components.  If you don't
+expect anyone to be connecting to you, you do not have to start them. If a server
+expects to listen on multiple ports (e.g. for different protocols), multiple
+AcceptorPools may be attached.
+
+-------------------------------------------------------------------------------
+Proxy classes
+-------------------------------------------------------------------------------
+
+Proxy classes are used by the client to send calls to a remote service.
+Calls may be made synchronously or asynchronously -- a synchronous call is simply
+a wrapper around the asynchronous version, which makes the call and then waits
+for the callback to be triggered.
+
+In order to make a call, the user must provide a method name, a request protobuf,
+a response protobuf, an RpcController, and a callback.
+
+Each RpcController object corresponds to exactly one in-flight call on the client.
+This class is where per-call settings may be adjusted before making an RPC --
+currently this is just timeout functionality, but in the future may include
+other call properties such as tracing information, priority classes, deadline
+propagation, etc.
+
+Upon issuing the asynchronous request, the RPC layer enqueues the call to be sent
+to the server and immediately returns. During this period, the caller thread
+may continue to send other RPCs or perform other processing while waiting for
+the callback to be triggered. In the future, we will provide an RPC cancellation
+function on the RpcController object in case the user determines that the call
+is no longer required.
+
+When the call completes, the RPC layer will invoke the provided ResponseCallback
+function from within the context of the reactor thread. Given this,
+ResponseCallbacks should be careful to never block, as blocking would prevent
+other threads from concurrently sending or receiving RPCs.
+
+The callback is invoked exactly once, regardless of the call's termination state.
+The user can determine the call's state by invoking methods on the RpcController
+object -- for example, whether the call succeeded, timed out, or suffered a
+transport error. If the call succeeds, the user-provided response protobuf
+will have been initialized to contain the result.
+
+Please see the accompanying documentation in the Proxy and RpcController classes
+for more information on the specific API, as well as the test cases in rpc-test.cc
+for example usage.
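+
+As a concrete illustration, an asynchronous call through a proxy looks
+roughly like the sketch below. The request/response types are hypothetical,
+and the exact parameter order of AsyncRequest is an approximation -- see
+rpc/proxy.h and rpc/rpc_controller.h for the authoritative signatures:
+```
+MyRequestPB req;      // hypothetical request/response protobufs
+MyResponsePB resp;
+kudu::rpc::RpcController controller;
+controller.set_timeout(kudu::MonoDelta::FromMilliseconds(500));
+
+// The callback runs on a reactor thread, so it must not block.
+proxy->AsyncRequest("MyMethod", req, &resp, &controller, [&]() {
+  if (controller.status().ok()) {
+    // 'resp' has been filled in with the result.
+  }
+});
+```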
+
+-------------------------------------------------------------------------------
+Generated Code
+-------------------------------------------------------------------------------
+
+In general, clients will use auto-generated subclasses of Proxy and ServiceIf to
+get additional type safety and nicer APIs.
+
+The generated proxy object has the same API as the generic Proxy, except that
+methods are generated for each RPC defined within the protobuf service. Each
+RPC has a synchronous and async version, corresponding to Proxy::AsyncRequest and
+Proxy::SyncRequest. These generated methods have an identical API to the generic
+one except that they are type-safe and do not require the method name to be passed.
+
+The generated ServiceIf class contains pure virtual methods for each of the RPCs
+in the service. Each method to be implemented has an API like:
+
+```
+  void MethodName(const RequestPB *req,
+      ResponsePB *resp, ::kudu::rpc::RpcContext *context);
+```
+
+The request PB is the user-provided request, and the response PB is a cleared
+protobuf ready to store the RPC response. Once the RPC response has been filled in,
+the service should call context->RespondSuccess(). This method may be called
+from any thread in the application at any point either before or after the
+actual handler method returns.
+
+In the case of an unexpected error, the generated code may alternatively call
+context->RespondFailure(...). However, for any error responses which should be
+parseable by the client code, it is preferable to define an error response inside
+the response protobuf itself -- this is a much more flexible way of returning
+actionable information with an error, given that Status just holds a string
+and not much else.
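+
+As a sketch, a handler for the 'Add' RPC from the test service in
+rpc/rtest.proto might look like the following (the implementing class
+name is illustrative):
+```
+void CalculatorServiceImpl::Add(const AddRequestPB* req,
+                                AddResponsePB* resp,
+                                ::kudu::rpc::RpcContext* context) {
+  resp->set_result(req->x() + req->y());
+  // Hand the response back to the RPC layer. This could equally well be
+  // deferred and called later from a different thread.
+  context->RespondSuccess();
+}
+```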
+
+See rpc/rpc-test-base.h for an example service implementation, as well as the
+documentation comments in rpc/service_if.h.
+
+-------------------------------------------------------------------------------
+ServiceIf classes
+-------------------------------------------------------------------------------
+ServiceIf classes are abstract interfaces that the server implements to handle
+incoming RPCs.  In general, each generated service has several virtual methods
+which you can override in order to implement the relevant function call.
+
+There is a ServicePool which you can use to coordinate several worker threads
+handling callbacks.
+
+-------------------------------------------------------------------------------
+RPC Sidecars
+-------------------------------------------------------------------------------
+RPC sidecars are used to avoid excess copies for large volumes of data.
+Prior to RPC sidecars, the sequence of steps for creating an RPC response
+on the server side would be as follows:
+
+1. Write the prepared response to a Google protobuf message.
+2. Pass the message off to the InboundCall class, which serializes the
+   protobuf into a process-local buffer.
+3. Copy the process-local buffer to the kernel buffer (send() to a socket).
+
+The client follows these steps in reverse order. On top of the extra copy,
+this procedure also forces us to use std::string, which compilers have
+difficulty inlining and which requires that reserved bytes be zeroed out --
+an unnecessary memset.
+
+Instead, sidecars provide a mechanism for handing a large store of data
+to the InboundCall class, which manages the response to a single
+RPC on the server side. When the rest of the message (i.e., the
+protobuf) is sent, the sidecar's data is written directly to the socket.
+
+The data is appended directly after the main message protobuf. Here's what
+a typical message looks like without sidecars:
+```
++------------------------------------------------+
+| Total message length (4 bytes)                 |
++------------------------------------------------+
+| RPC Header protobuf length (variable encoding) |
++------------------------------------------------+
+| RPC Header protobuf                            |
++------------------------------------------------+
+| Main message length (variable encoding)        |
++------------------------------------------------+
+| Main message protobuf                          |
++------------------------------------------------+
+```
+In this case, the main message length is equal to the protobuf's byte size.
+Since there are no sidecars, the header protobuf's sidecar_offsets list
+will be empty.
+
+Here's what it looks like with the sidecars:
+```
++------------------------------------------------+
+| Total message length (4 bytes)                 |
++------------------------------------------------+
+| RPC Header protobuf length (variable encoding) |
++------------------------------------------------+
+| RPC Header protobuf                            |
++------------------------------------------------+
+| Main message length (variable encoding)        |
++------------------------------------------------+ --- 0
+| Main message protobuf                          |
++------------------------------------------------+ --- sidecar_offsets(0)
+| Sidecar 0                                      |
++------------------------------------------------+ --- sidecar_offsets(1)
+| Sidecar 1                                      |
++------------------------------------------------+ --- sidecar_offsets(2)
+| Sidecar 2                                      |
++------------------------------------------------+ --- ...
+| ...                                            |
++------------------------------------------------+
+```
+When there are sidecars, the sidecar_offsets member in the header will be a
+nonempty list, whose values indicate the offset, measured from the beginning
+of the main message protobuf, of the start of each sidecar. The number
+of offsets indicates the number of sidecars. For example, if the main message
+protobuf is 100 bytes long and is followed by two 16-byte sidecars, then
+sidecar_offsets will be [100, 116].
+
+Then, on the client side, the sidecar locations are decoded and made available
+by RpcController::GetSidecars(), which returns a pointer to the array of all
+the sidecars. The caller must be sure to check that the sidecar index in the
+sidecar array is correct and in-bounds.
+
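+A hedged sketch of that client-side access follows; the Slice type and the
+way the application learns which sidecar holds its payload are assumptions
+for illustration, not part of the RPC layer's contract:
+```
+// Fetch the payload attached to a response as sidecar 'idx'. The RPC
+// layer only exposes the array, so validating 'idx' against the number
+// of sidecars the application expects is the caller's responsibility.
+kudu::Slice GetPayload(const kudu::rpc::RpcController& controller,
+                       int idx, int num_expected_sidecars) {
+  CHECK(idx >= 0 && idx < num_expected_sidecars);
+  const kudu::Slice* sidecars = controller.GetSidecars();
+  return sidecars[idx];  // points into the receive buffer; no extra copy
+}
+```
+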
+More information is available in rpc/rpc_sidecar.h.
+
+-------------------------------------------------------------------------------
+Wire Protocol
+-------------------------------------------------------------------------------
+
+Connection establishment and connection header
+----------------------------------------------
+
+After the client connects to a server, the client first sends a connection header.
+The connection header consists of the magic number "hrpc" and three
+single-byte flags, for a total of 7 bytes:
+```
++----------------------------------+
+|  "hrpc" 4 bytes                  |
++----------------------------------+
+|  Version (1 byte)                |
++----------------------------------+
+|  ServiceClass (1 byte)           |
++----------------------------------+
+|  AuthProtocol (1 byte)           |
++----------------------------------+
+```
+Currently, the RPC version is 9. The ServiceClass and AuthProtocol fields are unused.
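+
+A minimal sketch of the header a current client sends:
+```
+#include <cstdint>
+
+// The 7-byte connection header described above.
+static const uint8_t kConnectionHeader[7] = {
+    'h', 'r', 'p', 'c',  // magic number
+    9,                   // Version
+    0,                   // ServiceClass (unused)
+    0,                   // AuthProtocol (unused)
+};
+```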
+
+
+Message framing and request/response headers
+--------------------------------------------
+Aside from the initial connection header described above, all other messages are
+serialized as follows:
+```
+  total_size: (32-bit big-endian integer)
+    the size of the rest of the message, not including this 4-byte header
+
+  header: varint-prefixed header protobuf
+    - client->server messages use the RequestHeader protobuf
+    - server->client messages use the ResponseHeader protobuf
+
+  body: varint-prefixed protobuf
+    - for typical RPC calls, this is the user-specified request or response
+      protobuf
+    - for RPC calls which caused an error, the response is an ErrorResponsePB
+    - during SASL negotiation, this is a SaslMessagePB
+```
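+
+A hedged sketch of decoding one frame with the protobuf library follows,
+assuming the frame body (everything after the 4-byte length prefix) has
+already been read into a buffer. The function name is illustrative:
+```
+#include <cstdint>
+#include <google/protobuf/io/coded_stream.h>
+#include <google/protobuf/message.h>
+
+// 'header' is a RequestHeader or ResponseHeader depending on direction;
+// 'body' is the request/response protobuf (or ErrorResponsePB on error).
+bool ParseFrame(const uint8_t* data, int len,
+                google::protobuf::Message* header,
+                google::protobuf::Message* body) {
+  google::protobuf::io::CodedInputStream in(data, len);
+  uint32_t header_len;
+  if (!in.ReadVarint32(&header_len)) return false;
+  auto limit = in.PushLimit(header_len);
+  if (!header->ParseFromCodedStream(&in)) return false;
+  in.PopLimit(limit);
+  uint32_t body_len;
+  if (!in.ReadVarint32(&body_len)) return false;
+  limit = in.PushLimit(body_len);
+  if (!body->ParseFromCodedStream(&in)) return false;
+  in.PopLimit(limit);
+  return true;
+}
+```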
+
+Example packet capture
+----------------------
+An example call (captured with strace on rpc-test.cc) follows:
+```
+   "\x00\x00\x00\x17"   (total_size: 23 bytes to follow)
+   "\x09"  RequestHeader varint: 9 bytes
+    "\x08\x0a\x1a\x03\x41\x64\x64\x20\x01" (RequestHeader protobuf)
+      Decoded with protoc --decode=RequestHeader rpc_header.proto:
+      callId: 10
+      methodName: "Add"
+      requestParam: true
+
+   "\x0c"  Request parameter varint: 12 bytes
+    "\x08\xd4\x90\x80\x91\x01\x10\xf8\xcf\xc4\xed\x04" Request parameter
+      Decoded with protoc --decode=kudu.rpc_test.AddRequestPB rpc/rtest.proto
+      x: 304089172
+      y: 1303455736
+```
+
+
+SASL negotiation
+----------------
+After the initial connection header is sent, SASL negotiation begins.
+Kudu always uses SASL regardless of security settings. If no strong
+authentication is required, SASL PLAIN is used with no password.
+
+This SASL negotiation protocol matches the Hadoop protocol.
+The negotiation proceeds as described in this diagram:
+```
+                                CLIENT |        | SERVER
+                                       |        |
+(1) SaslMessagePB }                    |        |
+state=NEGOTIATE   } --------------------------> |
+                                       |        |
+                                       |        | { (2) SaslMessagePB
+                                       |        | { state=NEGOTIATE
+                                       | <------- { auths=<list of supported mechanisms>
+                                       |        |
+(3) SaslMessagePB                  }   |        |
+state=INITIATE                     }   |        |
+auths[0]=<chosen mechanism>        }   |        |
+token=<challenge response, if any> } ---------> |
+                                       |        |
+                                       |        | { (4) SaslMessagePB
+                                       |        | { state=CHALLENGE (or SUCCESS)
+                                       | <------- { token=<challenge token, if applicable>
+                                       |        |
+(5) SaslMessagePB          }           |        |
+state=RESPONSE             }           |        |
+token=<challenge response> } -----------------> |
+                                       |        |
+                                       |        | { GOTO (4) above
+                                       |        |
+
+```
+Each of the SaslMessagePBs above is framed as usual using RequestHeader or ResponseHeader
+protobufs. For each SASL message, the CallId should be set to '-33'.
+
+
+Connection context
+------------------
+Once the SASL negotiation is complete, before the first request, the client
+sends the server a special call with call_id -3. The body of this call is a
+ConnectionContextPB. The server should not respond to this call.
+
+
+Steady state
+------------
+During steady state operation, the client sends call protobufs prefixed by
+RequestHeader protobufs. The server sends responses prefixed by ResponseHeader
+protobufs.
+
+The client must send calls in strictly increasing 'call_id' order. The server
+may reject repeated calls or calls with lower IDs. The server's responses may
+arrive out of order; the client uses the 'call_id' in each response to
+associate it with the correct call.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/c004cedc/docs/design-docs/scan-optimization-partition-pruning.md
----------------------------------------------------------------------
diff --git a/docs/design-docs/scan-optimization-partition-pruning.md b/docs/design-docs/scan-optimization-partition-pruning.md
index 5546ee2..740dfd0 100644
--- a/docs/design-docs/scan-optimization-partition-pruning.md
+++ b/docs/design-docs/scan-optimization-partition-pruning.md
@@ -1,3 +1,17 @@
+<!---
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
 # Scan Optimization & Partition Pruning
 
 ## Background