Posted to dev@distributedlog.apache.org by si...@apache.org on 2017/04/26 18:41:26 UTC

[02/12] incubator-distributedlog git commit: Release 0.4.0-incubating

http://git-wip-us.apache.org/repos/asf/incubator-distributedlog/blob/3469fc87/website/docs/0.4.0-incubating/user_guide/configuration/client.rst
----------------------------------------------------------------------
diff --git a/website/docs/0.4.0-incubating/user_guide/configuration/client.rst b/website/docs/0.4.0-incubating/user_guide/configuration/client.rst
new file mode 100644
index 0000000..28dee4e
--- /dev/null
+++ b/website/docs/0.4.0-incubating/user_guide/configuration/client.rst
@@ -0,0 +1,110 @@
+---
+layout: default
+
+# Sub-level navigation
+sub-nav-group: user-guide
+sub-nav-parent: configuration
+sub-nav-id: client-configuration
+sub-nav-pos: 3
+sub-nav-title: Client Configuration
+---
+
+.. contents:: Client Configuration
+
+Client Configuration
+====================
+
+This section describes the settings used by the DistributedLog Write Proxy Client.
+
+Unlike the core library, the proxy client uses a builder to configure its settings.
+
+::
+
+    DistributedLogClient client = DistributedLogClientBuilder.newBuilder()
+        .name("test-client")
+        .clientId("test-client-id")
+        .finagleNameStr("inet!localhost:8080")
+        .statsReceiver(statsReceiver)
+        .build();
+
+Client Builder Settings
+-----------------------
+
+Common Settings
+~~~~~~~~~~~~~~~
+
+- *name(string)*: The name of the distributedlog client.
+- *clientId(string)*: The client id used for the underlying finagle client. It is a string identifier that the server
+  uses to identify the client, so the server can do bookkeeping and optionally reject unknown clients.
+- *requestTimeoutMs(int)*: The maximum time that a request could take before being declared a failure, in milliseconds.
+- *thriftmux(boolean)*: The flag to enable or disable using ThriftMux_ on the underlying channels.
+- *streamFailfast(boolean)*: The flag to enable or disable failing requests fast when the server responds `stream-not-ready`.
+  A stream is treated as not ready when it is initializing or rolling log segments. The setting only takes effect
+  when the write proxy also enables `failFastOnStreamNotReady`. A builder sketch applying these settings follows below.
+
+.. _ThriftMux: http://twitter.github.io/finagle/guide/Protocols.html#mux
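+
+Below is a sketch of applying these common settings through the client builder, extending the example above.
+The setter names are assumed to match the setting names listed here, and the values are illustrative only.
+
+::
+
+    DistributedLogClient client = DistributedLogClientBuilder.newBuilder()
+        .name("test-client")
+        .clientId("test-client-id")
+        .finagleNameStr("inet!localhost:8080")
+        .requestTimeoutMs(2000)    // declare a request failed after 2 seconds
+        .thriftmux(true)           // use ThriftMux on the underlying channels
+        .streamFailfast(true)      // fail fast when the server responds `stream-not-ready`
+        .build();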
+
+Environment Settings
+~~~~~~~~~~~~~~~~~~~~
+
+DistributedLog uses finagle Names_ to identify the network locations of write proxies.
+Names must be supplied when building a distributedlog client through `finagleNameStr` or
+`finagleNameStrs`.
+
+.. _Names: http://twitter.github.io/finagle/guide/Names.html
+
+- *finagleNameStr(string)*: The finagle name to locate write proxies.
+- *finagleNameStrs(string, string...)*: A list of finagle names. It is typically used by the global replicated log where there
+  are multiple regions of write proxies. The first parameter is the finagle name of the local region, while the remaining parameters
+  are the finagle names for remote regions.
+
+Redirection Settings
+~~~~~~~~~~~~~~~~~~~~
+
+The DistributedLog client can redirect requests to other write proxies when the write proxy it is accessing doesn't own the given stream.
+This section describes the settings related to redirection.
+
+- *redirectBackoffStartMs(int)*: The initial backoff for redirection, in milliseconds.
+- *redirectBackoffMaxMs(int)*: The maximum backoff for redirection, in milliseconds.
+- *maxRedirects(int)*: The maximum number of redirections that a request could take before being declared a failure.
+
+Channel Settings
+~~~~~~~~~~~~~~~~
+
+The DistributedLog client uses FinagleClient_ to establish connections to the write proxies. A finagle client will be
+created via ClientBuilder_ for each write proxy.
+
+.. _FinagleClient: https://twitter.github.io/finagle/guide/Clients.html
+
+.. _ClientBuilder: http://twitter.github.io/finagle/docs/index.html#com.twitter.finagle.builder.ClientBuilder
+
+- *clientBuilder(ClientBuilder)*: The finagle client builder used to build the connection to each write proxy.
+
+Ownership Cache Settings
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+The DistributedLog client maintains an ownership cache locally to achieve stable, deterministic request routing. Normally,
+the ownership cache is updated after a new owner is identified when performing stream related operations such as writes.
+The client also handshakes when initiating connections to a write proxy, and periodically for fast failure detection.
+During handshaking, the client can also pull the latest ownership mapping from the write proxies to update its local cache, which
+helps detect ownership changes quickly and avoids the latency penalty introduced by redirection when ownership changes.
+
+- *handshakeWithClientInfo(boolean)*: The flag to enable or disable pulling ownership mapping during handshaking.
+- *periodicHandshakeIntervalMs(long)*: The periodic handshake interval in milliseconds. At every provided interval, the DL client
+  handshakes with the existing proxies, which allows it to detect proxy failures. If the interval is greater than
+  `periodicOwnershipSyncIntervalMs`, the handshake also pulls the latest ownership mapping; otherwise, it does not. The default
+  value is 5 minutes. Setting it to 0 or a negative number disables this feature.
+- *periodicOwnershipSyncIntervalMs(long)*: The periodic ownership sync interval, in milliseconds. If periodic handshake is
+  enabled, the handshake will sync ownership if the elapsed time is greater than the sync interval.
+- *streamNameRegex(string)*: The regex matching the stream names whose ownership the client cares about. A sketch of these
+  settings is shown below.
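+
+The following sketch configures a client for fast ownership detection. The setter names are assumed to
+match the setting names above and the interval values are illustrative only.
+
+::
+
+    DistributedLogClient client = DistributedLogClientBuilder.newBuilder()
+        .name("test-client")
+        .clientId("test-client-id")
+        .finagleNameStr("inet!localhost:8080")
+        .handshakeWithClientInfo(true)                    // pull ownership mappings during handshakes
+        .periodicHandshakeIntervalMs(60 * 1000L)          // handshake with known proxies every minute
+        .periodicOwnershipSyncIntervalMs(5 * 60 * 1000L)  // sync ownership at most every 5 minutes
+        .streamNameRegex(".*")                            // track ownership for all streams
+        .build();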
+
+Constraint Settings
+~~~~~~~~~~~~~~~~~~~
+
+- *checksum(boolean)*: The flag to enable/disable checksum validation on requests sent to the proxy.
+
+Stats Settings
+~~~~~~~~~~~~~~
+
+- *statsReceiver(StatsReceiver)*: The stats receiver used for collecting stats exposed by this client.
+- *streamStatsReceiver(StatsReceiver)*: The stats receiver used for collecting per stream stats exposed by this client.

http://git-wip-us.apache.org/repos/asf/incubator-distributedlog/blob/3469fc87/website/docs/0.4.0-incubating/user_guide/configuration/core.rst
----------------------------------------------------------------------
diff --git a/website/docs/0.4.0-incubating/user_guide/configuration/core.rst b/website/docs/0.4.0-incubating/user_guide/configuration/core.rst
new file mode 100644
index 0000000..8916386
--- /dev/null
+++ b/website/docs/0.4.0-incubating/user_guide/configuration/core.rst
@@ -0,0 +1,424 @@
+---
+layout: default
+
+# Sub-level navigation
+sub-nav-group: user-guide
+sub-nav-parent: configuration
+sub-nav-id: core-library-configuration
+sub-nav-pos: 1
+sub-nav-title: Core Library Configuration
+---
+
+.. contents:: Core Library Configuration
+
+Core Library Configuration
+==========================
+
+This section describes the configuration settings used by DistributedLog Core Library.
+
+All the core library settings are managed in `DistributedLogConfiguration`, a properties-based
+configuration that extends Apache Commons `CompositeConfiguration`. All the DL settings are in camel case and prefixed with a
+meaningful component name. For example, `zkSessionTimeoutSeconds` means the session timeout
+for the `zk` component, in seconds.
+
+The default distributedlog configuration is constructed by instantiating a `DistributedLogConfiguration`.
+This distributedlog configuration will automatically load the settings specified via `SystemConfiguration`.
+
+::
+
+    DistributedLogConfiguration conf = new DistributedLogConfiguration();
+
+The recommended way is to load the configuration from a URL that points to a configuration file
+(`#loadConf(URL)`).
+
+::
+
+    String configFile = "/path/to/distributedlog/conf/file";
+    DistributedLogConfiguration conf = new DistributedLogConfiguration();
+    conf.loadConf(new File(configFile).toURI().toURL());
+
+ZooKeeper Settings
+------------------
+
+A distributedlog namespace usually creates two zookeeper client instances: one is used
+for DL metadata operations, while the other one is used by bookkeeper. All the zookeeper
+clients are *retryable* clients, which reconnect when their sessions expire.
+
+DL ZooKeeper Settings
+~~~~~~~~~~~~~~~~~~~~~
+
+- *zkSessionTimeoutSeconds*: ZooKeeper session timeout, in seconds. Default is 30 seconds.
+- *zkNumRetries*: Number of retries that each zookeeper request could attempt on retryable exceptions.
+  Default is 3.
+- *zkRetryStartBackoffMillis*: The initial backoff time of first retry of each zookeeper request, in milliseconds.
+  Default is 5000.
+- *zkRetryMaxBackoffMillis*: The max backoff time of retries of each zookeeper request, in milliseconds.
+  Default is 30000.
+- *zkcNumRetryThreads*: The number of retry threads used by this zookeeper client. Default is 1.
+- *zkRequestRateLimit*: The rate limit on requests sent by the zookeeper client, per second, implemented with a guava
+  `RateLimiter`. If the value is non-positive, rate limiting is disabled. Default is 0.
+- *zkAclId*: The digest id used for zookeeper ACL. If it is null, ACL is disabled. Default is null.
+
+BK ZooKeeper Settings
+~~~~~~~~~~~~~~~~~~~~~
+
+- *bkcZKSessionTimeoutSeconds*: ZooKeeper session timeout, in seconds. Default is 30 seconds.
+- *bkcZKNumRetries*: Number of retries that each zookeeper request could attempt on retryable exceptions.
+  Default is 3.
+- *bkcZKRetryStartBackoffMillis*: The initial backoff time of first retry of each zookeeper request, in milliseconds.
+  Default is 5000.
+- *bkcZKRetryMaxBackoffMillis*: The max backoff time of retries of each zookeeper request, in milliseconds.
+  Default is 30000.
+- *bkcZKRequestRateLimit*: The rate limit on requests sent by the zookeeper client, per second, implemented with a guava
+  `RateLimiter`. If the value is non-positive, rate limiting is disabled. Default is 0.
+
+There are a few rules to follow when optimizing the zookeeper settings:
+
+1. In general, a higher session timeout is better than a lower one, as it makes the zookeeper client
+   more resilient to network glitches.
+2. A lower backoff time is better for latency, as it triggers fast retries. But it
+   could trigger a retry storm if the backoff time is too low.
+3. Number of retries should be tuned based on the backoff time settings and corresponding latency SLA budget.
+4. BK and DL readers use the zookeeper client for metadata accesses. It is recommended to use a higher session timeout,
+   a higher number of retries and a proper backoff time.
+5. DL writers also use the zookeeper client for ownership tracking. They are required to act quickly on network glitches,
+   so it is recommended to use a low session timeout, a low backoff time and a proper number of retries, as in the
+   sketch after this list.
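+
+The sketch below applies these rules by overriding the documented keys via `setProperty` (inherited from
+Apache Commons `CompositeConfiguration`). The concrete values are illustrative assumptions, not recommended
+production numbers.
+
+::
+
+    // writer-side configuration: act quickly on network glitches
+    DistributedLogConfiguration writerConf = new DistributedLogConfiguration();
+    writerConf.setProperty("zkSessionTimeoutSeconds", 10);
+    writerConf.setProperty("zkNumRetries", 3);
+    writerConf.setProperty("zkRetryStartBackoffMillis", 200);
+    writerConf.setProperty("zkRetryMaxBackoffMillis", 2000);
+
+    // reader-side configuration: favor resilience over fast failure detection
+    DistributedLogConfiguration readerConf = new DistributedLogConfiguration();
+    readerConf.setProperty("zkSessionTimeoutSeconds", 60);
+    readerConf.setProperty("zkNumRetries", 5);
+    readerConf.setProperty("zkRetryStartBackoffMillis", 5000);
+    readerConf.setProperty("zkRetryMaxBackoffMillis", 30000);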
+
+BookKeeper Settings
+-------------------
+
+All the bookkeeper client configuration settings can be passed through `DistributedLogConfiguration` by
+prefixing them with `bkc.`. For example, `bkc.zkTimeout` in the distributedlog configuration will be applied as
+`zkTimeout` in the bookkeeper client configuration, as in the sketch below.
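+
+A minimal sketch of the `bkc.` prefix passthrough; the timeout value is illustrative only.
+
+::
+
+    DistributedLogConfiguration conf = new DistributedLogConfiguration();
+    // forwarded to the bookkeeper client configuration as `zkTimeout`
+    conf.setProperty("bkc.zkTimeout", 30000);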
+
+General Settings
+~~~~~~~~~~~~~~~~
+
+- *bkcNumIOThreads*: The number of I/O threads used by netty in bookkeeper client.
+  The default value is `numWorkerThreads`.
+
+Timer Settings
+~~~~~~~~~~~~~~
+
+- *timerTickDuration*: The tick duration in milliseconds that is used for the timeout
+  timer in the bookkeeper client. The default value is 100 milliseconds.
+- *timerNumTicks*: The number of ticks that are used for the timeout timer in the bookkeeper client.
+  The default value is 1024.
+
+Data Placement Settings
+~~~~~~~~~~~~~~~~~~~~~~~
+
+A log segment is backed by a bookkeeper `ledger`. A ledger's data is stored in an ensemble
+of bookies in a striped way. Each entry is written to a `write-quorum` number of bookies.
+The add operation completes once it receives responses from an `ack-quorum` number of bookies.
+The striping is done in a round-robin way in bookkeeper.
+
+For example, if we configure the ensemble-size to 5, the write-quorum-size to 3,
+and the ack-quorum-size to 2, the data will be striped as follows.
+
+::
+
+    | entry id | bk1 | bk2 | bk3 | bk4 | bk5 |
+    |     0    |  x  |  x  |  x  |     |     |
+    |     1    |     |  x  |  x  |  x  |     |
+    |     2    |     |     |  x  |  x  |  x  |
+    |     3    |  x  |     |     |  x  |  x  |
+    |     4    |  x  |  x  |     |     |  x  |
+    |     5    |  x  |  x  |  x  |     |     |
+
+We don't recommend striping within a log segment to increase bandwidth. Instead, we recommend using
+multiple distributedlog streams to increase bandwidth at the distributedlog level, so
+typically the ensemble size is set to the same value as the `write-quorum-size` (a sketch follows the settings list below).
+
+- *bkcEnsembleSize*: The ensemble size of the log segment. The default value is 3.
+- *bkcWriteQuorumSize*: The write quorum size of the log segment. The default value is 3.
+- *bkcAckQuorumSize*: The ack quorum size of the log segment. The default value is 2.
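+
+A sketch of overriding the placement settings, keeping the ensemble size equal to the write quorum size
+as recommended above (the values shown are the documented defaults):
+
+::
+
+    DistributedLogConfiguration conf = new DistributedLogConfiguration();
+    conf.setProperty("bkcEnsembleSize", 3);     // same as the write quorum - no striping within a log segment
+    conf.setProperty("bkcWriteQuorumSize", 3);  // each entry is written to all 3 bookies
+    conf.setProperty("bkcAckQuorumSize", 2);    // an add completes once 2 bookies respond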
+
+DNS Resolver Settings
++++++++++++++++++++++
+
+DistributedLog uses bookkeeper's `rack-aware` data placement policy for placing data across
+bookkeeper nodes. The `rack-aware` data placement uses a DNS resolver to resolve a bookie
+address into a network location and then uses those locations to build the network topology.
+
+There are two built-in DNS resolvers in DistributedLog:
+
+1. *DNSResolverForRacks*: It resolves domain names like `(region)-(rack)-xxx-xxx.*` to the
+   network location `/(region)/(rack)`. If resolution fails, it returns `/default-region/default-rack`.
+2. *DNSResolverForRows*: It resolves domain names like `(region)-(row)xx-xxx-xxx.*` to the
+   network location `/(region)/(row)`. If resolution fails, it returns `/default-region/default-row`.
+
+The DNS resolver could be loaded by reflection via `bkEnsemblePlacementDnsResolverClass`.
+
+`(region)` can be overridden via the configured `dnsResolverOverrides`. For example, if the
+host name is `(regionA)-(row1)-xx-yyy`, it would be resolved to `/regionA/row1` without any
+overrides. If the specified override is `(regionA)-(row1)-xx-yyy:regionB`,
+the resolved network location would be `/regionB/row1`. Allowing the region to be overridden provides
+optimization hints to bookkeeper when two `logical` regions are in the same or nearby locations.
+
+- *bkEnsemblePlacementDnsResolverClass*: The DNS resolver class for bookkeeper rack-aware ensemble placement.
+  The default value is `DNSResolverForRacks`.
+- *bkRowAwareEnsemblePlacement*: A flag that indicates whether `DNSResolverForRows` should be used.
+  If enabled, `DNSResolverForRows` will be used for DNS resolution in the rack-aware placement policy.
+  Otherwise, it would use the DNS resolver configured by `bkEnsemblePlacementDnsResolverClass`.
+- *dnsResolverOverrides*: The mapping used to override the region mapping derived by the DNS resolver.
+  The value is a string of host-region mapping pairs (`host:region`) separated by semicolons.
+  By default it is an empty string. A sketch of these settings follows below.
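+
+A sketch of configuring row-aware placement with region overrides. The host names and region names are
+purely illustrative.
+
+::
+
+    DistributedLogConfiguration conf = new DistributedLogConfiguration();
+    // use the row-aware resolver instead of the default rack-aware one
+    conf.setProperty("bkRowAwareEnsemblePlacement", true);
+    // map two hosts of one `logical` region into regionB; pairs are separated by semicolons
+    conf.setProperty("dnsResolverOverrides",
+        "regiona-row1-xx-yyy:regionB;regiona-row2-xx-zzz:regionB");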
+
+Namespace Configuration Settings
+--------------------------------
+
+This section lists all the general settings used by `DistributedLogNamespace`.
+
+Executor Settings
+~~~~~~~~~~~~~~~~~
+
+- *numWorkerThreads*: The number of worker threads used by the namespace instance.
+  The default value is the number of available processors.
+- *numReadAheadWorkerThreads*: The number of dedicated readahead worker threads used
+  by the namespace instance. If it is non-positive, readahead shares the same executor
+  as the other workers. Otherwise, a dedicated executor is created for readahead.
+  The default value is 0.
+- *numLockStateThreads*: The number of lock state threads used by the namespace instance.
+  The default value is 1.
+- *schedulerShutdownTimeoutMs*: The timeout value in milliseconds, for shutting down
+  schedulers in the namespace instance. The default value is 5000ms.
+- *useDaemonThread*: The flag indicating whether to use daemon threads for the DL executor threads.
+  The default value is false.
+
+Metadata Settings
+~~~~~~~~~~~~~~~~~
+
+The log segment metadata is serialized into a versioned string. The version in the log segment
+metadata allows us to evolve the metadata format. All the versions currently supported by distributedlog
+are listed in the table below.
+
++--------+-----------------------------------------------------------------------------------+
+|version |description                                                                        |
++========+===================================================================================+
+|   0    |Invalid version number.                                                            |
++--------+-----------------------------------------------------------------------------------+
+|   1    |Basic version number.                                                              |
+|        |Inprogress: start tx id, ledger id, region id                                      |
+|        |Completed: start/end tx id, ledger id, region id, record count and completion time |
++--------+-----------------------------------------------------------------------------------+
+|   2    |Introduced LSSN (LogSegment Sequence Number)                                       |
++--------+-----------------------------------------------------------------------------------+
+|   3    |Introduced Partial Truncated and Truncated status.                                 |
+|        |A min active (entry_id, slot_id) pair is recorded in completed log segment         |
+|        |metadata.                                                                          |
++--------+-----------------------------------------------------------------------------------+
+|   4    |Introduced Enveloped Entry Structure. None & LZ4 compression codec introduced.     |
++--------+-----------------------------------------------------------------------------------+
+|   5    |Introduced Sequence Id.                                                            |
++--------+-----------------------------------------------------------------------------------+
+
+A general rule for log segment metadata upgrades is described below, for upgrading
+from version *X* to version *X+1*:
+
+1. Upgrade the readers before upgrading the writers, so the readers are able to recognize the log segments of version *X+1*.
+2. Upgrade the writers to the new binary of version *X+1* only. Keep the configuration `ledgerMetadataLayoutVersion` unchanged - still at version *X*.
+3. Once all the writers are running the same binary of version *X+1*, update the writers again with `ledgerMetadataLayoutVersion` set to version *X+1*.
+
+**Available Settings**
+
+- *ledgerMetadataLayoutVersion*: The logsegment metadata layout version. The default value is 5. Apply for `writers` only.
+- *ledgerMetadataSkipMinVersionCheck*: The flag indicates whether DL should skip the minimum log segment metadata version check.
+  If it is true, DL will skip the check and read the log segment metadata as long as it can recognize it. Otherwise, it would fail
+  the read if the log segment's metadata version is less than the minimum version that DL supports. By default, it is disabled.
+- *firstLogsegmentSequenceNumber*: The first log segment sequence number to start with for a stream. The default value is 1.
+  The setting is only applied for writers, and only when upgrading metadata from version `1` to version `2`.
+  In this upgrade, old log segments need to be updated to add the log segment sequence number, and the writers start generating
+  new log segments with the new version starting from this `firstLogSegmentSequenceNumber`.
+- *maxIdSanityCheck*: The flag indicates whether DL should do sanity checks on transaction ids. If it is enabled, DL will throw
+  `TransactionIdOutOfOrderException` when it receives a transaction id smaller than the current maximum transaction id. By default,
+  it is enabled.
+- *encodeRegionIDInVersion*: The flag indicates whether DL should encode the region id into log segment metadata. In a global replicated
+  log, the log segments can be created in different regions. The region id in the log segment metadata helps identify the
+  region in which a log segment was created, which is useful for monitoring and troubleshooting.
+  By default, it is disabled.
+
+Namespace Settings
+~~~~~~~~~~~~~~~~~~
+
+- *federatedNamespaceEnabled*: The flag indicates whether DL should use federated namespace. By default, it is disabled.
+- *federatedMaxLogsPerSubnamespace*: The maximum number of log streams per sub-namespace in a federated namespace. By default, it is 15000.
+- *federatedCheckExistenceWhenCacheMiss*: The flag indicates whether to check the existence of a log stream in zookeeper
+  when a query misses the local cache of the federated namespace.
+
+Writer Configuration Settings
+-----------------------------
+
+General Settings
+~~~~~~~~~~~~~~~~
+
+- *createStreamIfNotExists*: The flag indicates whether to create a log stream if it doesn't exist. By default, it is true.
+- *compressionType*: The compression type used when enveloping the output buffer. The available compression types are
+  `none` and `lz4`. By default, it is `none` - no compression.
+- *failFastOnStreamNotReady*: The flag indicates whether to fail immediately if the stream is not ready rather than enqueueing
+  the request. A log stream is considered `not-ready` when it is either being initialized or rolling a new log
+  segment. If this is enabled, DL fails the write request immediately when the stream isn't ready. Otherwise, it
+  enqueues the request and waits for the stream to become ready. Consider turning it on for use cases that can retry
+  writing to other log streams, since fast failure lets the client retry other streams immediately.
+  By default, it is disabled.
+- *disableRollingOnLogSegmentError*: The flag to disable rolling the log segment when an error is encountered. By default, it is true.
+
+Durability Settings
+~~~~~~~~~~~~~~~~~~~
+
+- *isDurableWriteEnabled*: The flag indicates whether durable write is enabled. By default it is true.
+
+Transmit Settings
+~~~~~~~~~~~~~~~~~
+
+DL writes the log records into a transmit buffer before writing to bookkeeper. The following settings control
+the frequency of transmits and commits.
+
+- *writerOutputBufferSize*: The output buffer size in bytes. A larger buffer size will result in a higher compression ratio and fewer entries sent to bookkeeper, using the disk bandwidth more efficiently and improving throughput. Setting it to `0` asks DL to transmit the data immediately, which achieves low latency.
+- *periodicFlushFrequencyMilliSeconds*: The periodic flush frequency in milliseconds. If the setting is set to a positive value, the data in the transmit buffer will be flushed every half of the provided interval. Otherwise, periodic flushing is disabled. For example, if this setting is set to `10` milliseconds, the data will be flushed (`transmit`) every 5 milliseconds.
+- *enableImmediateFlush*: The flag to enable immediately flushing a control record. It controls how quickly data becomes visible to the readers. If this setting is true, DL flushes a control record immediately after transmitting the user data. The default value is false.
+- *minimumDelayBetweenImmediateFlushMilliSeconds*: The minimum delay between two immediate flushes, in milliseconds. This setting only takes effect when immediate flush is enabled. It is designed to tolerate bursty traffic when immediate flush is enabled, preventing too many control records from being sent to bookkeeper. A sketch of these settings follows below.
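+
+A sketch of tuning the transmit settings for low latency; the values are illustrative assumptions.
+
+::
+
+    DistributedLogConfiguration conf = new DistributedLogConfiguration();
+    conf.setProperty("writerOutputBufferSize", 0);                          // transmit records immediately
+    conf.setProperty("periodicFlushFrequencyMilliSeconds", 10);             // flush the transmit buffer every 5 ms
+    conf.setProperty("enableImmediateFlush", true);                         // make data visible to readers quickly
+    conf.setProperty("minimumDelayBetweenImmediateFlushMilliSeconds", 20);  // at most one control record per 20 ms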
+
+LogSegment Retention Settings
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following settings are related to log segment retention.
+
+- *logSegmentRetentionHours*: The log segment retention period, in hours. In other words, how long should DL keep the log segment once it is `truncated`.
+- *explicitTruncationByApp*: The flag indicates that truncation is managed explicitly by the application. If this is set, time based retention only cleans the log segments that are marked as `truncated`. By default it is disabled.
+
+LogSegment Rolling Settings
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following settings are related to log segment rolling.
+
+- *logSegmentRollingMinutes*: The log segment rolling interval, in minutes. If the setting is set to a positive value, DL will roll
+  log segments based on time. Otherwise, it will roll log segments based on size (`maxLogSegmentBytes`). The default value is 2 hours.
+- *maxLogSegmentBytes*: The maximum size of a log segment, in bytes. This setting only takes effect when time based rolling is disabled;
+  in that case, DL will roll a new log segment when the current one reaches the provided threshold. The default value is 256MB.
+- *logSegmentRollingConcurrency*: The concurrency of log segment rolling. If the value is positive, it means how many log segments
+  can be rolled at the same time. Otherwise, it is unlimited. The default value is 1.
+
+LogSegment Allocation Settings
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A bookkeeper ledger is allocated when a DL stream is rolling into a new log segment. To reduce the latency penalty on log segment rolling,
+a ledger allocator could be used for pre-allocating the ledgers for DL streams. This section describes the settings related to ledger
+allocation.
+
+- *enableLedgerAllocatorPool*: The flag indicates whether to use a ledger allocator pool or not. It is disabled by default. It is recommended
+  to enable it on write proxies.
+- *ledgerAllocatorPoolPath*: The path of the ledger allocator pool. The default value is ".allocation_pool". The allocator pool path has to
+  be prefixed with `"."`. A DL namespace is allowed to have multiple allocator pools, which act independently of each other.
+- *ledgerAllocatorPoolName*: The name of the ledger allocator pool. Default value is null. It is set by the write proxy on startup.
+- *ledgerAllocatorPoolCoreSize*: The number of ledger allocators in the pool. The default value is 20. A configuration sketch follows below.
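+
+A sketch of enabling a ledger allocator pool on a write proxy; the pool name and size are illustrative.
+
+::
+
+    DistributedLogConfiguration conf = new DistributedLogConfiguration();
+    conf.setProperty("enableLedgerAllocatorPool", true);             // pre-allocate ledgers to reduce rolling latency
+    conf.setProperty("ledgerAllocatorPoolPath", ".allocation_pool"); // must be prefixed with "."
+    conf.setProperty("ledgerAllocatorPoolName", "proxy-allocator");  // typically set by the write proxy on startup
+    conf.setProperty("ledgerAllocatorPoolCoreSize", 20);             // number of allocators in the pool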
+
+Write Limit Settings
+~~~~~~~~~~~~~~~~~~~~
+
+This section describes the settings related to queue-based write limiting.
+
+- *globalOutstandingWriteLimit*: The maximum number of outstanding writes. If this setting is set to a positive value, global
+  write limiting is enabled - when the number of outstanding writes goes above the threshold, subsequent requests will be rejected
+  with `OverCapacity` exceptions. Otherwise, it is disabled. The default value is 0.
+- *perWriterOutstandingWriteLimit*: The maximum number of outstanding writes per writer. It is similar to `globalOutstandingWriteLimit`
+  but applied per writer instance. The default value is 0.
+- *outstandingWriteLimitDarkmode*: The flag indicates whether write limiting is running in dark mode or not. If it is running in
+  dark mode, requests are not rejected when they are over the limit, but are just recorded in the stats. By default, it is in dark mode. It
+  is recommended to run in dark mode to understand the traffic pattern before enabling real write limiting.
+
+Lock Settings
+~~~~~~~~~~~~~
+
+This section describes the settings related to distributed lock used by the writers.
+
+- *lockTimeoutSeconds*: The lock timeout in seconds. The default value is 30. If it is 0 or negative, the caller attempts to claim
+  the lock: if there is no owner, the claim succeeds; otherwise it returns immediately and throws an exception indicating
+  the current owner.
+
+Reader Configuration Settings
+-----------------------------
+
+General Settings
+~~~~~~~~~~~~~~~~
+
+- *readLACLongPollTimeout*: The long poll timeout for reading `LastAddConfirmed` requests, in milliseconds.
+  The default value is 1 second. It is recommended to tune it to approximately the request arrival interval. Otherwise, it would
+  end up issuing unnecessarily short polls.
+
+ReadAhead Settings
+~~~~~~~~~~~~~~~~~~
+
+This section describes the settings related to readahead in DL readers; a tuning sketch follows the settings list.
+
+- *enableReadAhead*: Flag to enable read ahead in DL readers. It is enabled by default.
+- *readAheadMaxRecords*: The maximum number of records that will be cached in the readahead cache by the DL readers. The default value
+  is 10. A higher value will improve throughput but use more memory. It should be tuned properly to avoid excessive jvm gc if the reader cannot
+  keep up with the writing rate.
+- *readAheadBatchSize*: The maximum number of entries that the readahead worker will read in one batch. The default value is 2.
+  Increase the value to increase the concurrency of reading entries from bookkeeper. It is recommended to tune it to a proper value for
+  catch-up readers so as not to exhaust bookkeeper's bandwidth.
+- *readAheadWaitTimeOnEndOfStream*: The wait time if the reader reaches the end of the stream and there isn't any new inprogress log segment,
+  in milliseconds. The default value is 10 seconds.
+- *readAheadNoSuchLedgerExceptionOnReadLACErrorThresholdMillis*: If the readahead worker keeps receiving `NoSuchLedgerExists` exceptions
+  when reading `LastAddConfirmed` within the given period, it stops long polling `LastAddConfirmed`, re-initializes the ledger handle
+  and retries. The threshold is in milliseconds. The default value is 10 seconds.
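+
+A sketch of tuning readahead for a catch-up reader; the values are illustrative assumptions.
+
+::
+
+    DistributedLogConfiguration conf = new DistributedLogConfiguration();
+    conf.setProperty("readAheadMaxRecords", 100);  // cache more records to keep up with a fast writer
+    conf.setProperty("readAheadBatchSize", 4);     // read more entries per batch from bookkeeper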
+
+Reader Constraint Settings
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This section describes the constraint settings in DL reader.
+
+- *ignoreTruncationStatus*: The flag indicating whether to ignore truncation status when reading the records. By default, it is false.
+  The readers will not attempt to read a log segment that is marked as `Truncated` if this setting is false. It can be enabled for
+  tooling and troubleshooting.
+- *alertPositionOnTruncated*: The flag indicating whether to alert when a reader is positioned on a truncated segment. By default, it is true.
+  The reader will alert and fail if it is positioned at a `Truncated` log segment when the setting is true. It can be disabled for
+  tooling and troubleshooting.
+- *positionGapDetectionEnabled*: The flag indicating whether to enable position gap detection. This is a very strict constraint on the reader,
+  to prevent readers from missing records due to software bugs. It is enabled by default.
+
+Idle Reader Settings
+~~~~~~~~~~~~~~~~~~~~
+
+There is a mechanism to detect reader idleness, to prevent a reader from stalling due to bugs.
+
+- *readerIdleWarnThresholdMillis*: The warning threshold of the time that a reader is idle, in milliseconds. If a reader is
+  idle for longer than the threshold, it logs warnings. The default value is 2 minutes.
+- *readerIdleErrorThresholdMillis*: The error threshold of the time that a reader is idle, in milliseconds. If a reader is
+  idle for longer than the threshold, it throws `IdleReader` exceptions to notify applications. The default value is `Integer.MAX_VALUE`.
+
+Scan Settings
+~~~~~~~~~~~~~
+
+- *firstNumEntriesEachPerLastRecordScan*: The number of entries to scan in the first scan when reading the last record. The default value is 2.
+- *maxNumEntriesPerReadLastRecordScan*: The maximum number of entries for each scan when reading the last record. The default value is 16.
+
+Tracing/Stats Settings
+----------------------
+
+This section describes the settings related to tracing and stats.
+
+- *traceReadAheadDeliveryLatency*: Flag to enable tracing read ahead delivery latency. By default it is disabled.
+- *metadataLatencyWarnThresholdMs*: The warn threshold of metadata access latency, in milliseconds. If a metadata operation takes
+  more than the threshold, it would be logged. By default it is 1 second.
+- *dataLatencyWarnThresholdMs*: The warn threshold for data access latency, in milliseconds. If a data operation takes
+  more than the threshold, it would be logged. By default it is 2 seconds.
+- *traceReadAheadMetadataChanges*: Flag to enable tracing the major metadata changes in readahead. If it is enabled, it will log
+  the readahead metadata changes with precise timestamps, which is helpful for troubleshooting latency related issues. By default it
+  is disabled.
+- *enableTaskExecutionStats*: Flag to trace long running tasks and record task execution stats in the thread pools. It is disabled
+  by default.
+- *taskExecutionWarnTimeMicros*: The warn threshold for the task execution time, in microseconds. The default value is 100,000.
+- *enablePerStreamStat*: Flag to enable per stream stats. By default, it is disabled.
+
+Feature Provider Settings
+-------------------------
+
+- *featureProviderClass*: The feature provider class. The default value is `DefaultFeatureProvider`, which disables all the features
+  by default.
+

http://git-wip-us.apache.org/repos/asf/incubator-distributedlog/blob/3469fc87/website/docs/0.4.0-incubating/user_guide/configuration/main.rst
----------------------------------------------------------------------
diff --git a/website/docs/0.4.0-incubating/user_guide/configuration/main.rst b/website/docs/0.4.0-incubating/user_guide/configuration/main.rst
new file mode 100644
index 0000000..65865ef
--- /dev/null
+++ b/website/docs/0.4.0-incubating/user_guide/configuration/main.rst
@@ -0,0 +1,44 @@
+---
+layout: default
+
+# Top navigation
+top-nav-group: user-guide
+top-nav-pos: 5
+top-nav-title: Configuration
+
+# Sub-level navigation
+sub-nav-group: user-guide
+sub-nav-parent: user-guide
+sub-nav-id: configuration
+sub-nav-pos: 5
+sub-nav-title: Configuration
+---
+
+Configuration
+=============
+
+DistributedLog uses key-value pairs in the `property file format`__ for configuration. These values can be supplied from a file, from jvm system properties, or programmatically.
+
+.. _PropertyFileFormat: http://en.wikipedia.org/wiki/.properties
+
+__ PropertyFileFormat_
+
+In DistributedLog, we only put non-environment related settings in the configuration.
+Environment related settings, such as the zookeeper connect string and bookkeeper
+ledgers path, should not be loaded from the configuration. They should be added in the `namespace binding`.
+
+- `Core Library Configuration`_
+
+.. _Core Library Configuration: ./core
+
+- `Write Proxy Configuration`_
+
+.. _Write Proxy Configuration: ./proxy
+
+- `Write Proxy Client Configuration`_
+
+.. _Write Proxy Client Configuration: ./client
+
+- `Per Stream Configuration`_
+
+.. _Per Stream Configuration: ./perlog

http://git-wip-us.apache.org/repos/asf/incubator-distributedlog/blob/3469fc87/website/docs/0.4.0-incubating/user_guide/configuration/perlog.rst
----------------------------------------------------------------------
diff --git a/website/docs/0.4.0-incubating/user_guide/configuration/perlog.rst b/website/docs/0.4.0-incubating/user_guide/configuration/perlog.rst
new file mode 100644
index 0000000..c7abc66
--- /dev/null
+++ b/website/docs/0.4.0-incubating/user_guide/configuration/perlog.rst
@@ -0,0 +1,138 @@
+---
+layout: default
+
+# Sub-level navigation
+sub-nav-group: user-guide
+sub-nav-parent: configuration
+sub-nav-id: per-stream-configuration
+sub-nav-pos: 4
+sub-nav-title: Per Stream Configuration
+---
+
+.. contents:: Per Stream Configuration
+
+Per Stream Configuration
+========================
+
+Applications are allowed to override `DistributedLogConfiguration` for individual streams. This is achieved
+by supplying an overridden `DistributedLogConfiguration` when opening the distributedlog manager.
+
+::
+
+    DistributedLogNamespace namespace = ...;
+    DistributedLogConfiguration perStreamConf = new DistributedLogConfiguration();
+    perStreamConf.loadConf(...); // load configuration from a per stream configuration file
+    DistributedLogManager dlm = namespace.openLog("test-stream", Optional.of(perStreamConf), Optional.absent());
+
+Dynamic Configuration
+---------------------
+
+Besides overriding the normal `DistributedLogConfiguration` with a per stream configuration, DistributedLog also
+supports loading some configuration settings dynamically. The per stream dynamic settings are offered in
+`DynamicDistributedLogConfiguration`.
+
+File Based Dynamic Configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The default dynamic configuration implementation is based on properties files, reloading the file periodically.
+
+::
+
+    ConcurrentBaseConfiguration defaultConf = ...; // base config to fall through
+    ScheduledExecutorService executorService = ...; // executor used to reload the config periodically
+    int reloadPeriod = 60; // 60 seconds
+    TimeUnit reloadUnit = TimeUnit.SECONDS;
+    String configPath = "/path/to/per/stream/config/file";
+    File configFile = new File(configPath);
+    // load the file into a properties configuration builder
+    PropertiesConfigurationBuilder properties =
+        new PropertiesConfigurationBuilder(configFile.toURI().toURL());
+    // Construct the dynamic configuration
+    DynamicDistributedLogConfiguration dynConf = new DynamicDistributedLogConfiguration(defaultConf);
+    // add a configuration subscription to periodically reload the config from the file
+    ConfigurationSubscription subscription =
+        new ConfigurationSubscription(dynConf, properties, executorService, reloadPeriod, reloadUnit);
+
+Stream Config Provider
+~~~~~~~~~~~~~~~~~~~~~~
+
+The stream config provider is designed to manage and reload dynamic configs for individual streams.
+
+::
+
+    String perStreamConfigDir = "/path/to/per/stream/config/dir";
+    String defaultConfigPath = "/path/to/default/config/file";
+    StreamPartitionConverter converter = ...;
+    ScheduledExecutorService scheduler = ...;
+    int reloadPeriod = 60; // 60 seconds
+    TimeUnit reloadUnit = TimeUnit.SECONDS;
+    StreamConfigProvider provider = new ServiceStreamConfigProvider(
+        perStreamConfigDir,
+        defaultConfigPath,
+        converter,
+        scheduler,
+        reloadPeriod,
+        reloadUnit);
+
+    Optional<DynamicDistributedLogConfiguration> streamConf = provider.getDynamicStreamConfig("test-stream");
+
+- *perStreamConfigDir*: The directory that contains the configuration files for each stream. The file name is `<stream_name>.conf`.
+- *defaultConfigPath*: The default configuration file. If no stream configuration file is found in `perStreamConfigDir`,
+  the configuration is loaded from `defaultConfigPath`.
+- *StreamPartitionConverter*: A converter that converts stream names to partitions. DistributedLog doesn't provide built-in
+  partitions; it leaves the partition strategy to the application. Applications usually put the partition id in the dl stream name, so the
+  converter is used to group the streams and apply the same configuration. For example, if an application uses 3 streams and names them
+  `test-stream_000001`, `test-stream_000002` and `test-stream_000003`, a `StreamPartitionConverter` could be used to categorize them
+  as partitions of stream `test-stream` and apply the configuration from the file `test-stream.conf`.
+- *scheduler*: The executor service that reloads configuration periodically.
+- *reloadPeriod*: The reload period, in `reloadUnit`.
+
+Available Dynamic Settings
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Storage Settings
+~~~~~~~~~~~~~~~~
+
+- *logSegmentRetentionHours*: The log segment retention period, in hours. In other words, how long should DL keep the log segment once it is `truncated` or `completed`.
+- *bkcEnsembleSize*: The ensemble size of the log segment. The default value is 3.
+- *bkcWriteQuorumSize*: The write quorum size of the log segment. The default value is 3.
+- *bkcAckQuorumSize*: The ack quorum size of the log segment. The default value is 2.
+
+Transmit Settings
+~~~~~~~~~~~~~~~~~
+
+- *writerOutputBufferSize*: The output buffer size in bytes. A larger buffer size will result in a higher compression ratio and
+  fewer entries sent to bookkeeper, using the disk bandwidth more efficiently and improving throughput.
+  Setting it to `0` asks DL to transmit the data immediately, which achieves low latency.
+
+Durability Settings
+~~~~~~~~~~~~~~~~~~~
+
+- *isDurableWriteEnabled*: The flag indicates whether durable write is enabled. By default it is true.
+
+ReadAhead Settings
+~~~~~~~~~~~~~~~~~~
+
+- *readAheadMaxRecords*: The maximum number of records that will be cached in the readahead cache by the DL readers. The default value
+  is 10. A higher value will improve throughput but use more memory. It should be tuned properly to avoid excessive jvm gc if the reader cannot
+  keep up with the writing rate.
+- *readAheadBatchSize*: The maximum number of entries that the readahead worker will read in one batch. The default value is 2.
+  Increase the value to increase the concurrency of reading entries from bookkeeper. It is recommended to tune it to a proper value for
+  catch-up readers so as not to exhaust bookkeeper's bandwidth.
+
+Rate Limit Settings
+~~~~~~~~~~~~~~~~~~~
+
+All the rate limit settings have both `soft` and `hard` thresholds. If the throughput goes above the `soft` limit,
+the requests won't be rejected but are just recorded in the stats. If the throughput goes above the `hard` limit,
+the requests are rejected immediately.
+
+NOTE: `bps` stands for `bytes per second`, while `rps` stands for `requests per second`.
+
+- *bpsSoftWriteLimit*: The soft limit for bps. Setting it to 0 or a negative value will disable this feature.
+  By default it is disabled.
+- *bpsHardWriteLimit*: The hard limit for bps. Setting it to 0 or a negative value will disable this feature.
+  By default it is disabled.
+- *rpsSoftWriteLimit*: The soft limit for rps. Setting it to 0 or a negative value will disable this feature.
+  By default it is disabled.
+- *rpsHardWriteLimit*: The hard limit for rps. Setting it to 0 or a negative value will disable this feature.
+  By default it is disabled.

http://git-wip-us.apache.org/repos/asf/incubator-distributedlog/blob/3469fc87/website/docs/0.4.0-incubating/user_guide/configuration/proxy.rst
----------------------------------------------------------------------
diff --git a/website/docs/0.4.0-incubating/user_guide/configuration/proxy.rst b/website/docs/0.4.0-incubating/user_guide/configuration/proxy.rst
new file mode 100644
index 0000000..988b250
--- /dev/null
+++ b/website/docs/0.4.0-incubating/user_guide/configuration/proxy.rst
@@ -0,0 +1,82 @@
+---
+layout: default
+
+# Sub-level navigation
+sub-nav-group: user-guide
+sub-nav-parent: configuration
+sub-nav-id: write-proxy-configuration
+sub-nav-pos: 2
+sub-nav-title: Write Proxy Configuration
+---
+
+.. contents:: Write Proxy Configuration
+
+Write Proxy Configuration
+=========================
+
+This section describes the configuration settings used by DistributedLog Write Proxy.
+
+All the server related settings are managed in `ServerConfiguration`. Similar to `DistributedLogConfiguration`,
+it is also a properties-based configuration that extends Apache Commons `CompositeConfiguration`. All
+server related settings are in lower case and use `'_'` to concatenate words. For example, `server_region_id` means
+the region id used by the write proxy server.
+
+Server Configuration Settings
+-----------------------------
+
+- *server_dlsn_version*: The version of the serialized format of DLSN. The default value is 1. It is not recommended to change it.
+- *server_region_id*: The region id used by the server to instantiate a DL namespace. The default value is `LOCAL`.
+- *server_port*: The listen port of the write proxy. The default value is 0.
+- *server_shard*: The shard id used by the server to identify itself. It is optional but recommended to set. For example, if
+  the write proxy is running in `Apache Aurora`, you could use the instance id as the shard id. The default value is -1 (unset).
+- *server_threads*: The number of threads for the executor of this server. The default value is the number of available processors.
+- *server_enable_perstream_stat*: The flag to enable per stream stats in the write proxy. It is different from `enablePerStreamStat`
+  in the core library. The setting here controls exposing the per stream stats collected by the write proxy, while `enablePerStreamStat`
+  controls exposing the per stream stats collected by the core library. It is enabled by default.
+- *server_graceful_shutdown_period_ms*: The graceful shutdown period in milliseconds. The default value is 0.
+- *server_service_timeout_ms*: The timeout period for the execution of a stream operation in the write proxy. If it is positive, the
+  write proxy will time out requests that take longer than the threshold. Otherwise, the timeout feature is disabled.
+  By default, it is 0 (disabled).
+- *server_stream_probation_timeout_ms*: The time period that a stream should be kept in the cache in a probationary state after a service
+  timeout, in order to prevent the ownership from being reacquired. The unit is milliseconds. The default value is 5 minutes.
+- *stream_partition_converter_class*: The stream-to-partition converter class. The converter is used to group streams together so that
+  these streams can share the same `per-stream` configuration settings or other constraints. By default, it is the
+  `IdentityStreamPartitionConverter`, which doesn't group any streams. A configuration sketch follows below.
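+
+A minimal sketch of overriding server settings programmatically, assuming `ServerConfiguration` exposes a
+default constructor and the `setProperty` method inherited from `CompositeConfiguration`; the values are
+illustrative only.
+
+::
+
+    ServerConfiguration serverConf = new ServerConfiguration();
+    serverConf.setProperty("server_port", 8000);   // listen port of the write proxy
+    serverConf.setProperty("server_shard", 1);     // shard id identifying this proxy instance
+    serverConf.setProperty("server_threads", 16);  // number of executor threads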
+
+Rate Limit Settings
+~~~~~~~~~~~~~~~~~~~
+
+This section describes the rate limit settings per write proxy.
+
+All the rate limit settings have both `soft` and `hard` thresholds. If the throughput goes above the `soft` limit,
+the requests won't be rejected but are just recorded in the stats. If the throughput goes above the `hard` limit,
+the requests are rejected immediately.
+
+NOTE: `bps` stands for `bytes per second`, while `rps` stands for `requests per second`.
+
+- *bpsSoftServiceLimit*: The soft limit for bps. Setting it to 0 or a negative value will disable this feature.
+  By default it is disabled.
+- *bpsHardServiceLimit*: The hard limit for bps. Setting it to 0 or a negative value will disable this feature.
+  By default it is disabled.
+- *rpsSoftServiceLimit*: The soft limit for rps. Setting it to 0 or a negative value will disable this feature.
+  By default it is disabled.
+- *rpsHardServiceLimit*: The hard limit for rps. Setting it to 0 or a negative value will disable this feature.
+  By default it is disabled.
+
+There are two additional rate limiting settings that relate to stream acquisitions.
+
+- *rpsStreamAcquireServiceLimit*: The rate limit for rps. When the rps goes above this threshold, the write proxy
+  will stop accepting new streams.
+- *bpsStreamAcquireServiceLimit*: The rate limit for bps. When the bps goes above this threshold, the write proxy
+  will stop accepting new streams.
+
+Stream Limit Settings
+~~~~~~~~~~~~~~~~~~~~~
+
+This section describes the stream limit settings per write proxy. They are the constraints that each write proxy
+will apply when deciding whether to own given streams.
+
+- *maxAcquiredPartitionsPerProxy*: The maximum number of partitions per stream that a write proxy is allowed to
+  serve. Setting it to 0 or a negative value will disable this feature. By default it is unlimited.
+- *maxCachedPartitionsPerProxy*: The maximum number of partitions per stream that a write proxy is allowed to cache.
+  Setting it to 0 or a negative value will disable this feature. By default it is unlimited.

http://git-wip-us.apache.org/repos/asf/incubator-distributedlog/blob/3469fc87/website/docs/0.4.0-incubating/user_guide/considerations/main.rst
----------------------------------------------------------------------
diff --git a/website/docs/0.4.0-incubating/user_guide/considerations/main.rst b/website/docs/0.4.0-incubating/user_guide/considerations/main.rst
new file mode 100644
index 0000000..9a35b8b
--- /dev/null
+++ b/website/docs/0.4.0-incubating/user_guide/considerations/main.rst
@@ -0,0 +1,82 @@
+---
+layout: default
+
+# Top navigation
+top-nav-group: user-guide
+top-nav-pos: 2
+top-nav-title: Considerations
+
+# Sub-level navigation
+sub-nav-group: user-guide
+sub-nav-parent: user-guide
+sub-nav-id: architecture
+sub-nav-pos: 2
+sub-nav-title: Considerations
+
+---
+
+Considerations
+==============
+
+As different applications have different requirements, we've carefully considered the capabilities
+that should be included in DistributedLog, leaving the rest up to the applications. These considerations are:
+
+Consistency, Durability and Ordering
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The distributed systems literature commonly refers to two broad paradigms to use a log
+for building reliable replicated systems (Figure 1). The `Pub-Sub` paradigm usually
+refers to an active-active model where we keep a log of the incoming requests and each
+replica (reader) processes each request. The `Master-Slave` paradigm, by contrast, elects one
+replica as the master to process requests as they arrive and log the changes to its state.
+The other replicas, referred to as slaves, apply the state changes in the same order as
+the master, thereby being in sync and ready to take over the mastership if the current
+master fails. If the current master loses connectivity to the slaves due to a network
+partition, the slaves may elect a new master to continue forward progress. A fencing
+mechanism is necessary for the old master to discover that it has lost ownership and
+prevent it from modifying state after network connectivity is restored.
+
+.. figure:: ../../images/pubsub.png
+   :align: center
+   :width: 600px
+
+   Figure 1. The uses of a log in distributed systems
+
+
+These two different approaches indicate two different types of ordering requirements -
+`Write Ordering` and `Read Ordering`. `Write ordering` requires that all writes issued
+by the log writer be written in a strict order to the log, while `read ordering` only
+requires that any reader that reads the log stream should see the same record at any
+given position, the log records however may not appear in the same order that the writer
+wrote them. The replicated log service should be able to support both use cases. 
+
+Partitioning
+~~~~~~~~~~~~
+
+`Partitioning` (also known as sharding or bucketing) facilitates horizontal scale. The
+partitioning scheme depends on the characteristics of the application and is closely
+related to the ordering guarantees that the application requires. For example, a distributed
+key/value store that uses DistributedLog as its transaction log distributes the data into
+partitions, each of which is a unit of consistency. Modifications within each partition are
+required to be strictly ordered. On the other hand, real-time analytics workloads don't
+require strict order and can use *round-robin* partitioning to evenly distribute the reads and
+writes across all partitions. It is therefore prudent to provide applications the flexibility
+to choose a suitable partitioning scheme.
+
+Processing Semantics
+~~~~~~~~~~~~~~~~~~~~
+
+Applications typically choose between `at-least-once` and `exactly-once` processing semantics.
+`At-least-once` processing guarantees that the application will process all the log records,
+however when the application resumes after failure, previously processed records may be
+re-processed if they have not been acknowledged. `Exactly once` processing is a stricter
+guarantee where applications must see the effect of processing each record exactly once.
+`Exactly once` semantics can be achieved by maintaining reader positions together with the
+application state and atomically updating both the reader position and the effects of the
+corresponding log records. For instance, for strongly consistent updates in a distributed
+key/value store the reader position must be persisted atomically with the changes applied
+from the corresponding log records. Upon restart from a failure, the reader resumes from the
+last persisted position, thereby guaranteeing that each change is applied only once. With
+at-least-once processing guarantees, the application can store reader positions in an external
+store and update them periodically. Upon restart the application will reprocess messages since
+the last updated reader position.

http://git-wip-us.apache.org/repos/asf/incubator-distributedlog/blob/3469fc87/website/docs/0.4.0-incubating/user_guide/design/main.rst
----------------------------------------------------------------------
diff --git a/website/docs/0.4.0-incubating/user_guide/design/main.rst b/website/docs/0.4.0-incubating/user_guide/design/main.rst
new file mode 100644
index 0000000..998fd77
--- /dev/null
+++ b/website/docs/0.4.0-incubating/user_guide/design/main.rst
@@ -0,0 +1,230 @@
+---
+layout: default
+
+# Top navigation
+top-nav-group: user-guide
+top-nav-pos: 6
+top-nav-title: Detail Design
+
+# Sub-level navigation
+sub-nav-group: user-guide
+sub-nav-parent: user-guide
+sub-nav-id: detail-design
+sub-nav-pos: 6
+sub-nav-title: Detail Design
+---
+
+.. contents:: Detail Design
+
+Detail Design
+=============
+
+We will describe the design choices that we made while implementing DistributedLog and why we built such a layered architecture.
+
+Consistency
+-----------
+
+DistributedLog achieves strong consistency, using the `fencing` mechanism provided in the log segment store to guarantee data consistency
+and `versioned updates` in the metadata store to guarantee metadata consistency.
+
+LastAddConfirmed
+~~~~~~~~~~~~~~~~
+
+DistributedLog leverages bookkeeper's `LAC` (LastAddConfirmed) protocol - a variation of `two-phase-commit` algorithm to build its data pipeline
+and achieve consistency around it. Figure 1 illustrates the basic concepts of this protocol.
+
+.. figure:: ../../images/lacprotocol.png
+   :align: center
+
+   Figure 1. Consistency in Log Segment Store
+
+Each batched entry appended to a log segment will be assigned a monotonically increasing entry id by the log segment writer. All the entries are
+written asynchronously in a pipeline. The log segment writer maintains an in-memory pointer, called `LAP` (LastAddPushed), which is the
+entry id of the last batched entry pushed to the log segment store by the writer. The entries could be written out of order but are only acknowledged
+in entry id order. Along with the successful acknowledgements, the log segment writer also updates an in-memory pointer, called `LAC` (LastAddConfirmed).
+LAC is the entry id of the last entry that has been acknowledged by the writer. All the entries written between LAC and LAP are unacknowledged data,
+which is not visible to readers.
+
+The readers can read entries up to LAC as those entries are known to be durably replicated - thereby they can be safely read without the risk of violating
+read ordering. The writer includes the current LAC in each entry that it sends to BookKeeper. Therefore each subsequent entry makes the records in
+the previous entry visible to the readers. LAC updates can be piggybacked on the next entry written by the writer. Since readers are strictly
+followers, they can leverage LAC to read durable data from any of the replicas without the need for any communication or coordination with the writer.
+
+DL introduces one type of system record, called a `control record` - it acts as the `commit` request in the `two-phase-commit` algorithm.
+If no application records arrive within the specified SLA, the writer will generate a control record. Writing the control record advances
+the LAC of the log stream. The control record is added either immediately after receiving the acknowledgement for a user record, or periodically if
+no application records are added. This is configured as part of the writer's flushing policy. While control log records are present in the physical log stream,
+they are not delivered by the log readers to the application.
+
+Fencing
+~~~~~~~
+
+LAC is a very simple and useful mechanism to guarantee consistency across readers. But it is not enough to guarantee correctness when the ownership
+of a log stream changes - there might be multiple writers at the same time when a network partition happens. DistributedLog addresses this by `fencing`
+data in the log segment store and conditionally (via versioned set) updating log segment metadata in the metadata store. Fencing is a built-in mechanism in bookkeeper - when
+a client wants to fence a ledger, it sends a special fence request to all the replicas of that ledger; the bookies that host that ledger will change the state of
+that ledger to fenced. Once a ledger's state is changed to fenced, all write attempts to it fail immediately. The client claims a successful fence when
+it receives successful fence responses from a majority of the replicas.
+
+Figure 2 illustrates how DistributedLog works when the ownership of a log stream changes.
+
+.. figure:: ../../images/fencing.png
+   :align: center
+
+   Figure 2. Fencing & Consistency
+
+Whenever the ownership is changed from one writer to another (step 0), the new owner of the log stream first retrieves the list of log segments of
+that log stream along with their versions (the versions will be used for the versioned set when updating the log segments' metadata). The new owner finds
+the current inprogress log segment and recovers it in the following sequence:
+
+1. It first fences the log segment (step 2.1). Successful fencing means that no further writes will succeed after that point.
+2. If the old owner is merely network partitioned, it might still consider itself the owner and keep adding records to that log segment. But because the log segment has been fenced, all writes by the old owner will be rejected and fail (step 2.2). The old owner will then realize that it has lost the ownership and give up.
+3. Once the log segment is fenced, the new owner runs a recovery process to recover the log segment. Once the log segment is recovered, it issues a versioned set operation to the metadata store to convert the log segment status from inprogress to completed (step 2.3).
+4. A new inprogress log segment is created by the new writer to continue writing to this log stream (step 3).
+
+Completing an inprogress log segment and creating a new log segment can be executed in parallel to achieve fast log stream recovery. This reduces the latency
+penalty for writes during an ownership change.
+
+Creating a new log segment during an ownership change is the analog of '*obtaining an epoch during leader election*' in distributed consensus algorithms. It allows a clean
+implementation of a replicated log service, as the client that lost the ownership (aka mastership, lock) doesn't even know the identity of the new epoch (in DL,
+it is the new log segment id), so it can't accidentally write to the new log segment. DistributedLog leverages zookeeper's sequential znodes to generate new log segment ids.
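+
+Below is a minimal sketch of allocating a new log segment id with a ZooKeeper sequential znode. The
+znode path and client setup are illustrative and do not reflect the exact metadata layout that
+DistributedLog uses.
+
+::
+
+    import org.apache.zookeeper.CreateMode;
+    import org.apache.zookeeper.ZooDefs.Ids;
+    import org.apache.zookeeper.ZooKeeper;
+
+    ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 30000, null);
+    // ZooKeeper appends a monotonically increasing 10-digit suffix to the path,
+    // which can serve as the identity of the new epoch (the new log segment id).
+    String path = zk.create("/mystream/segments/segment-", new byte[0],
+                            Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
+    long newLogSegmentId = Long.parseLong(path.substring(path.lastIndexOf('-') + 1));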
+
+Ownership Tracking
+~~~~~~~~~~~~~~~~~~
+
+With the built-in fencing mechanism in the storage layer and metadata updates, DistributedLog doesn't require strict leader election
+to guarantee correctness. Therefore we use '`ownership tracking`' as opposed to '`leader election`' for log stream ownership management.
+
+DistributedLog uses ZooKeeper ephemeral znodes for tracking the ownership of log streams, since ZooKeeper already provides `sessions` that
+can be used to track leases for failure detection. In production environments, we tuned the zookeeper settings to ensure failures can be
+detected within one second. Such an aggressive bound on failure detection increases the possibility of false positives. If ownership flaps between
+write proxies, delays will result from writes blocking for log stream recovery. `Deterministic routing` allows multiple clients to choose the
+same write proxy to fail over to when the current owner proxy is unavailable. The details are described in Figure 3.
+
+.. figure:: ../../images/requestrouting.png
+   :align: center
+
+   Figure 3. Request Routing
+
+Applications write log records through the write client. The write client first looks up the `ownership cache`, a local cache that maps
+a log stream name to its corresponding log stream owner. If the stream is not cached yet, the client uses a consistent-hashing-based
+`routing service` to compute a candidate write proxy (step 1.1) and then sends the write request to this candidate write proxy (step 1.2). If the proxy
+already owns the log stream or can successfully claim the ownership, it satisfies the write request and responds back to the client (step 1.3).
+If it can't claim the ownership, it sends a response back to the client asking it to redirect to the right owner (step 1.4). All successful write requests
+keep the local ownership cache up-to-date, which helps avoid subsequent requests being redirected.
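+
+The sketch below models this client-side routing in Java. It is a simplification for illustration only;
+the actual client builds on the finagle-based routing service, and the class and method names here are
+hypothetical.
+
+::
+
+    import java.util.Map;
+    import java.util.TreeMap;
+    import java.util.concurrent.ConcurrentHashMap;
+
+    class StreamRouter {
+        // stream name -> write proxy that currently owns the stream
+        private final Map<String, String> ownershipCache = new ConcurrentHashMap<>();
+        // consistent hash ring: hash -> write proxy address
+        private final TreeMap<Integer, String> ring = new TreeMap<>();
+
+        String proxyFor(String stream) {
+            // 1. consult the local ownership cache first
+            String owner = ownershipCache.get(stream);
+            if (owner != null) {
+                return owner;
+            }
+            // 2. otherwise pick a candidate write proxy via consistent hashing
+            Map.Entry<Integer, String> candidate = ring.ceilingEntry(stream.hashCode());
+            return candidate != null ? candidate.getValue() : ring.firstEntry().getValue();
+        }
+
+        // Called when a proxy accepts a write or redirects the client to the real owner,
+        // so subsequent requests go directly to the owner.
+        void updateOwner(String stream, String owner) {
+            ownershipCache.put(stream, owner);
+        }
+    }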
+
+Streaming Reads
+---------------
+
+After the readers have caught up to the current tail of the log, DistributedLog provides readers with the ability to read new log records as they are
+published - a mechanism commonly known as `tailing` the log. Readers start out by **positioning** at a record in the log stream based on either DLSN or
+Transaction ID. The reader then starts **reading** records until it reaches the tail of the log stream. Once it has caught up with the writer, the reader waits
+for **notifications** about new log records or new log segments.
+
+Positioning
+~~~~~~~~~~~
+
+As mentioned above, there are 3 types of sequence numbers associated with a log record. Except for the sequence id, which is computed at read time, both the DLSN (implicit)
+and the Transaction ID (explicit) are attached to log records at write time. Applications can use either of them for positioning. The DLSN is the best sequence number
+for positioning, as it already identifies the log segment, the entry and the slot of the record in the log stream, so no additional search operations are required.
+Since the Transaction ID is assigned by applications, positioning a reader by transaction id first looks up the list of log segments to find which log segment
+contains the given transaction id and then looks up the records in the found log segment to figure out the actual position within that log segment.
+Both the lookup in the log segment list and the lookup within the found log segment use binary search to speed up the search. Although positioning by transaction id can be a
+bit slower than positioning by DLSN, it is useful for analytics workloads that rewind to analyze hours of old data when the transaction id is a timestamp.
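+
+The following is a sketch of the first level of this lookup - locating the log segment that contains a
+given transaction id. It is a simplified, synchronous illustration; the `LogSegment` type and its
+accessors are hypothetical.
+
+::
+
+    import java.util.List;
+
+    // Find the index of the log segment that contains txId, assuming the segments
+    // are ordered by their first transaction id.
+    static int findSegment(List<LogSegment> segments, long txId) {
+        int lo = 0, hi = segments.size() - 1, result = -1;
+        while (lo <= hi) {
+            int mid = (lo + hi) >>> 1;
+            if (segments.get(mid).firstTxId() <= txId) {
+                result = mid;    // candidate; keep searching to the right
+                lo = mid + 1;
+            } else {
+                hi = mid - 1;
+            }
+        }
+        // A second binary search over the entries of segments.get(result)
+        // then locates the exact record within that log segment.
+        return result;
+    }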
+
+Reading
+~~~~~~~
+
+Figure 4 illustrates reading batched entries from the log segment store. There are two basic read operations: read a given entry by entry id (a) and read LAC (b).
+
+.. figure:: ../../images/readrequests.png
+   :align: center
+
+   Figure 4. Read entries from log segment store
+
+Since an entry is immutable after it is appended to a log segment, a read of a given entry by entry id can go to any replica of that log segment and retry on other replicas
+if it encounters failures. In order to achieve a low, predictable 99.9 percentile latency even during bookie failures, a **speculative** read mechanism is deployed:
+a read request is sent to the first replica; if the client doesn't receive the response within a speculative timeout, it sends another request to the second replica;
+it then waits for the responses of both the first and the second replica; and so forth, until it receives a valid response to complete the read or it times out.
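+
+A rough sketch of this speculative read pattern is shown below. It is illustrative only: DistributedLog
+implements it inside the BookKeeper client with asynchronous callbacks, while this example uses a
+blocking completion service, and error handling and per-replica retries are omitted.
+
+::
+
+    import java.util.List;
+    import java.util.concurrent.*;
+
+    static byte[] speculativeRead(List<Callable<byte[]>> replicaReads, long speculativeTimeoutMs)
+            throws Exception {
+        ExecutorService pool = Executors.newCachedThreadPool();
+        CompletionService<byte[]> completions = new ExecutorCompletionService<>(pool);
+        try {
+            int sent = 0;
+            // Send to the first replica, then add one more replica per speculative timeout.
+            while (sent < replicaReads.size()) {
+                completions.submit(replicaReads.get(sent++));
+                Future<byte[]> done = completions.poll(speculativeTimeoutMs, TimeUnit.MILLISECONDS);
+                if (done != null) {
+                    return done.get();   // first response wins
+                }
+            }
+            return completions.take().get();  // all replicas tried; wait for any response
+        } finally {
+            pool.shutdownNow();
+        }
+    }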
+
+Reading the LAC is an operation for readers to catch up with the writer. It is typically a quorum-read operation to guarantee freshness: the client sends read requests
+to all replicas of the log segment and waits for the responses from a majority of them. It can be optimized into a best-effort quorum-read operation for tailing reads,
+which doesn't have to wait for quorum responses from the replicas and can return whenever it sees an advanced LAC.
+
+`Figure 4(c)` illustrates the third type of read request, called a `"Long Poll Read"`. It is a combination of (a) and (b), serving the purpose of
+reading the next available entry in the log segment. The client sends a long poll read request along with the next read entry id to the log segment store.
+If the log segment store has already seen the entry and it is committed (its entry id is not greater than LAC), it responds to the request immediately with the latest LAC
+and the requested entry. Otherwise, it waits for the LAC to advance to the given entry id and then responds with the requested entry. A similar speculative mechanism is
+deployed in long polling to achieve a predictable low 99.9 percentile latency.
+
+Notifications
+~~~~~~~~~~~~~
+
+Once the reader has caught up with the writer, it turns itself into `'notification'` mode. In this mode, it waits for notifications of new records
+via `long polling` reads (described above) and for `notifications` of log segment state changes. The notification mechanism for log segment state changes
+is provided by the Metadata Store - currently a ZooKeeper watcher. The notifications are triggered when an inprogress log segment is completed or a new inprogress
+log segment is created.
+
+ReadAhead
+~~~~~~~~~
+
+The reader reads ahead to proactively bring new data into its cache for applications to consume. This helps reduce the read latency, as newer
+data is already cached by the time applications consume it. DistributedLog uses the LAC as an indicator to detect whether a reader is still catching up or has already caught up, and
+adjusts the readahead pace based on the reader state and its consuming rate.
+
+LogSegment Lifecycle
+--------------------
+
+DistributedLog breaks a log stream down into multiple log segments based on the configured rolling policy. The current inprogress log segment is completed
+and a new log segment is created when the log segment has been written to for more than a configured rolling interval (aka time-based rolling),
+when the size of the log segment has reached a configured threshold (aka size-based rolling), or whenever the ownership of the log stream changes.
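+
+A minimal sketch of the rolling decision under these policies is shown below; the parameter names are
+illustrative and are not the actual DistributedLog configuration keys.
+
+::
+
+    // Returns true if the current inprogress log segment should be completed
+    // and a new log segment started.
+    static boolean shouldRoll(long segmentStartTimeMs, long segmentSizeBytes,
+                              long rollingIntervalMs, long maxSegmentBytes) {
+        long ageMs = System.currentTimeMillis() - segmentStartTimeMs;
+        return ageMs >= rollingIntervalMs             // time-based rolling
+            || segmentSizeBytes >= maxSegmentBytes;   // size-based rolling
+        // An ownership change also forces a roll; that case is handled by the recovery path.
+    }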
+
+A new log segment is created in the `Inprogress` state. It becomes a `Completed` log segment either when the writer rolls into a new log segment or
+when it is recovered after an ownership change. Once the log segment is completed, it is truncated later either by `explicit truncation` or by `expiration due to TTL timeout`.
+The log segment is marked as `Partially Truncated` along with a `Min-Active-DLSN` pointer when only a portion of its data is truncated, and as `Truncated` when
+the `Min-Active-DLSN` pointer reaches the end of the log segment. Truncated log segments can be moved to cold storage for longer retention or as a backup for
+disaster recovery, and are eventually deleted after TTL expiration. Figure 5 illustrates a log stream that contains 5 log segments, each of them in a
+different state. The dotted line describes the transition between states.
+
+.. figure:: ../../images/logsegments.png
+   :align: center
+
+   Figure 5. The lifecycle of log segments
+
+Distribution
+~~~~~~~~~~~~
+
+A log segment is placed on multiple log segment storage nodes according to the configured placement policy. DistributedLog uses a `rack-aware` placement policy for
+placing log segments in a local datacenter setup: the rack-aware placement policy guarantees that all the replicas of the same log segment are placed on
+different racks for network fault-tolerance. It uses a `region-aware` placement policy for placing log segments across multiple datacenters in a global setup
+(see more in the section `"Global Replicated Log"`), which guarantees that all the replicas of the same log segment are placed in multiple datacenters and that
+acknowledgements are received from a majority of the datacenters.
+
+As DistributedLog breaks the streams down into multiple log segments, the log segments can be evenly distributed across multiple log segment storage nodes
+for load balancing. This helps balance both the data distribution and the read workload. Figure 6 shows an example of how the data of 2 streams (*x*, *y*) is
+stored as 3 replicas in a *5-node* cluster in a balanced way.
+
+.. figure:: ../../images/distribution.png
+   :align: center
+
+   Figure 6. Log Segment Distribution Example
+
+Truncation
+~~~~~~~~~~
+
+As the writers keep writing records into the log streams, data accumulates. In DistributedLog,
+there are two ways to delete old data: `Explicit Truncation` and `TTL Expiration`.
+
+Applications are allowed to explicitly truncate a log stream to a given DLSN. Once the truncation request is
+received by the writer, the writer marks all the log segments whose log segment sequence number is less than
+the sequence number of that DLSN as `Truncated`. The log segment whose sequence number is the same as that of the
+DLSN is marked as `Partially Truncated`, with the DLSN recorded as the last active DLSN. A reader positioned at
+an already truncated position will be advanced to the last active DLSN. All the truncated log segments
+are still kept for a configured time period for disaster recovery, and the actual log segments are deleted
+and garbage collected via `TTL Expiration`.
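+
+The sketch below shows how a truncation request maps onto log segment states. It is illustrative only;
+the `LogSegment` and `DLSN` accessors used here are hypothetical simplifications.
+
+::
+
+    import java.util.List;
+
+    // Mark log segments for an explicit truncation up to truncationPoint (a DLSN).
+    static void truncate(List<LogSegment> segments, DLSN truncationPoint) {
+        long truncationSegment = truncationPoint.logSegmentSequenceNumber();
+        for (LogSegment segment : segments) {
+            if (segment.sequenceNumber() < truncationSegment) {
+                segment.markTruncated();
+            } else if (segment.sequenceNumber() == truncationSegment) {
+                // Only part of this segment is truncated; remember the boundary
+                // as the Min-Active-DLSN (the last active DLSN).
+                segment.markPartiallyTruncated(truncationPoint);
+            }
+        }
+    }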
+
+When a log segment is completed, its completion time is recorded as part of the log segment metadata.
+DistributedLog uses the `completion time` for TTL Expiration: all the log segments whose completion time has
+already passed the configured TTL period are deleted from the metadata store. After the log segments are deleted from
+the metadata store, they are garbage collected from the log segment store and their disk space is
+reclaimed.

http://git-wip-us.apache.org/repos/asf/incubator-distributedlog/blob/3469fc87/website/docs/0.4.0-incubating/user_guide/globalreplicatedlog/main.rst
----------------------------------------------------------------------
diff --git a/website/docs/0.4.0-incubating/user_guide/globalreplicatedlog/main.rst b/website/docs/0.4.0-incubating/user_guide/globalreplicatedlog/main.rst
new file mode 100644
index 0000000..818128f
--- /dev/null
+++ b/website/docs/0.4.0-incubating/user_guide/globalreplicatedlog/main.rst
@@ -0,0 +1,118 @@
+---
+layout: default
+
+# Top navigation
+top-nav-group: user-guide
+top-nav-pos: 7
+top-nav-title: Global Replicated Log
+
+# Sub-level navigation
+sub-nav-group: user-guide
+sub-nav-parent: user-guide
+sub-nav-id: global-replicated-log
+sub-nav-pos: 7
+sub-nav-title: Global Replicated Log
+---
+
+.. contents:: Global Replicated Log
+
+Global Replicated Log
+=====================
+
+A typical setup for DistributedLog is within a single datacenter. But a global setup is required for
+providing global replicated logs for a distributed key/value store to achieve strong consistency
+across multiple datacenters. `Global` here means across datacenters, which is different from
+`Local`, meaning within a datacenter.
+
+A global setup of DistributedLog is organized as a set of `regions`, where each region is the
+rough analog of a local setup. Regions are the unit of administrative deployment. The set of
+regions is also the set of locations across which data can be replicated. Regions can be added
+to or removed from a running system as new datacenters are brought into service and old ones
+are turned off, respectively. Regions are also the unit of physical isolation: there may be one
+or more regions in a datacenter if they have isolated power or network supplies.
+
+.. figure:: ../../images/globalreplicatedlog.png
+   :align: center
+   :width: 600px
+
+   Figure 1. Global Replicated Log
+
+Figure 1 illustrates the servers in a `Global Replicated Log` setup. There is no inter-datacenter
+communication between write proxies or log segment storage nodes. The only component whose hosts
+communicate across datacenters is the 'Global' metadata store, which is a global
+setup of ZooKeeper. Write clients talk to the write proxies in their local region to bootstrap
+the ownership cache and are redirected to the correct write proxies in other regions through direct TCP
+connections. Readers identify the regions of the log segment storage nodes according to
+the `region aware` placement policy, read from the local region most of the time, and
+speculatively try remote regions.
+
+Region Aware Data Placement Policy
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The region aware placement policy uses hierarchical allocation, wherein nodes are allocated so that data
+is spread uniformly across the available regions, and within each region it uses the `rack-aware`
+placement policy to spread the data uniformly across the available racks.
+
+The region aware placement policy is governed by a parameter that ensures the ack quorum covers at least
+*minRegionsForDurability* distinct regions. This ensures that the system can survive the failure of
+`(totalRegions - minRegionsForDurability)` regions without loss of availability. For example, if we
+have bookie nodes in *5* regions and *minRegionsForDurability* is *3*, then we can survive the
+failure of `(5 - 3) = 2` regions.
+
+The placement algorithm follows the following simple invariant:
+
+::
+
+    There is no combination of nodes that would satisfy the ack quorum with
+    less than "minRegionsForDurability" responses.
+
+
+This invariant ensures that enforcing the ack quorum is sufficient to guarantee that the entry has been made durable
+in *minRegionsForDurability* regions.
+
+The *minRegionsForDurability* requirement enforces constraints on the minimum ack quorum: we want to ensure
+that when we run in degraded mode - *i.e. when only a subset of the regions are available* - nodes still cannot
+be allocated in such a way that the ack quorum could be satisfied by fewer than *minRegionsForDurability*
+regions.
+
+For instance, consider the following scenario with three regions, each containing 20 bookie nodes:
+
+::
+
+    minRegionsForDurability = 2
+    ensemble size = write quorum = 15
+    ack quorum =  8
+
+
+Let's say that one of the regions is currently unavailable and we want to ensure that writes can still continue.
+The ensemble placement may then have to choose bookies from the two available regions. Given that *15* bookies have
+to be allocated, we will have to allocate at least *8* bookies from one of the remaining regions - but with an ack quorum
+of *8* we run the risk of satisfying the ack quorum with bookies from a single region. Therefore we must require that
+the ack quorum is greater than *8*.
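+
+The following is a back-of-the-envelope check of this constraint. It is a simplified model that assumes
+the ensemble is spread as uniformly as possible across the available regions; it is not the actual
+placement policy implementation.
+
+::
+
+    // Verify that the ack quorum cannot be satisfied from fewer than
+    // minRegionsForDurability regions, even when only that many regions are available.
+    static boolean isDurablePlacement(int ensembleSize, int ackQuorum,
+                                      int minRegionsForDurability) {
+        // With only minRegionsForDurability regions available, at least this many bookies
+        // of the ensemble must land in a single region: ceil(ensembleSize / minRegions).
+        int maxFromOneRegion =
+            (ensembleSize + minRegionsForDurability - 1) / minRegionsForDurability;
+        return ackQuorum > maxFromOneRegion;
+    }
+
+    // Example from above: isDurablePlacement(15, 8, 2) == false (8 is not > 8),
+    // so an ack quorum of 8 is unsafe here; it would have to be at least 9.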
+
+Cross Region Speculative Reads
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As discussed before, read requests can be satisfied by any replica of the data; however, for high availability,
+speculative requests are sent to multiple copies to ensure that at least one of the requests returns within
+the time specified by the *SLA*. The reader consults the data placement policy to get the list of replicas that
+can satisfy the request, in order of preference. This order is decided as follows:
+
+* The first node in the list is always the bookie node that is closest to the client - if more than one such node exists, one is chosen at random.
+* The second node is usually the closest node in a different failure domain. In the case of a two-level hierarchy, that would be a node in a different rack.
+* The third node is chosen from a different region (see the sketch below).
+
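+A simplified sketch of building this preference order is shown below. It is a toy model of the
+placement-policy lookup; the `Replica` type and its accessors are hypothetical.
+
+::
+
+    import java.util.ArrayList;
+    import java.util.Comparator;
+    import java.util.List;
+
+    // Order replicas so that the first few speculative reads cross failure domains:
+    // closest node first, then a node in another rack, then a node in another region.
+    static List<Replica> orderForSpeculativeReads(List<Replica> replicas, Replica client) {
+        List<Replica> ordered = new ArrayList<>(replicas);
+        ordered.sort(Comparator.comparingInt((Replica r) -> {
+            if (r.region().equals(client.region()) && r.rack().equals(client.rack())) {
+                return 0;   // same rack, same region: closest
+            } else if (r.region().equals(client.region())) {
+                return 1;   // same region, different rack
+            } else {
+                return 2;   // different region: last resort
+            }
+        }));
+        return ordered;
+    }
+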
+The delay between successive speculative read requests ensures that the probability of sending the *nth*
+speculative read request decays exponentially with *n*. This keeps the number of requests that go to
+farther nodes at a minimum. At the same time, crossing failure domains in the first
+few speculative requests improves the fault-tolerance of the reader. Transient node failures are transparently
+handled by the reader through this simple and generalizable speculative read policy. It can be thought of as
+the most granular form of failover, where each request essentially fails over to an alternate node if the
+primary node it attempted to access is unavailable. In practice we have found this to also better handle
+network congestion, where routes between specific pairs of nodes may become unavailable without necessarily
+making the nodes completely inaccessible.
+
+In addition to static decisions based on the location of the bookie nodes, we can also make dynamic decisions
+based on observed latency or failure rates from specific bookies. These statistics are tracked by the bookie
+client and are used to influence the order in which speculative read requests are scheduled. This again
+captures partial network outages that affect specific routes within the network.

http://git-wip-us.apache.org/repos/asf/incubator-distributedlog/blob/3469fc87/website/docs/0.4.0-incubating/user_guide/implementation/core.rst
----------------------------------------------------------------------
diff --git a/website/docs/0.4.0-incubating/user_guide/implementation/core.rst b/website/docs/0.4.0-incubating/user_guide/implementation/core.rst
new file mode 100644
index 0000000..1ab753a
--- /dev/null
+++ b/website/docs/0.4.0-incubating/user_guide/implementation/core.rst
@@ -0,0 +1,4 @@
+---
+layout: default
+---
+

http://git-wip-us.apache.org/repos/asf/incubator-distributedlog/blob/3469fc87/website/docs/0.4.0-incubating/user_guide/implementation/main.rst
----------------------------------------------------------------------
diff --git a/website/docs/0.4.0-incubating/user_guide/implementation/main.rst b/website/docs/0.4.0-incubating/user_guide/implementation/main.rst
new file mode 100644
index 0000000..fcc3844
--- /dev/null
+++ b/website/docs/0.4.0-incubating/user_guide/implementation/main.rst
@@ -0,0 +1,25 @@
+---
+layout: default
+
+# Top navigation
+top-nav-group: user-guide
+top-nav-pos: 8
+top-nav-title: Implementation
+
+# Sub-level navigation
+sub-nav-group: user-guide
+sub-nav-parent: user-guide
+sub-nav-id: implementation
+sub-nav-pos: 8
+sub-nav-title: Implementation
+---
+
+Implementation
+==============
+
+This page covers the implementation details for `DistributedLog`.
+
+- `Storage`_
+
+.. _Storage: ./storage
+