Posted to commits@zookeeper.apache.org by iv...@apache.org on 2011/11/30 00:15:03 UTC

svn commit: r1208129 - in /zookeeper/bookkeeper/trunk/doc: bookkeeperConfigParams.textile bookkeeperInternals.textile

Author: ivank
Date: Tue Nov 29 23:15:02 2011
New Revision: 1208129

URL: http://svn.apache.org/viewvc?rev=1208129&view=rev
Log:
BOOKKEEPER-122: Review BookKeeper server documentation (fpj & ivank) [Forgot to add 2 files]

Added:
    zookeeper/bookkeeper/trunk/doc/bookkeeperConfigParams.textile
    zookeeper/bookkeeper/trunk/doc/bookkeeperInternals.textile

Added: zookeeper/bookkeeper/trunk/doc/bookkeeperConfigParams.textile
URL: http://svn.apache.org/viewvc/zookeeper/bookkeeper/trunk/doc/bookkeeperConfigParams.textile?rev=1208129&view=auto
==============================================================================
--- zookeeper/bookkeeper/trunk/doc/bookkeeperConfigParams.textile (added)
+++ zookeeper/bookkeeper/trunk/doc/bookkeeperConfigParams.textile Tue Nov 29 23:15:02 2011
@@ -0,0 +1,46 @@
+Title:        BookKeeper Configuration Parameters
+Notice: Licensed under the Apache License, Version 2.0 (the "License");
+        you may not use this file except in compliance with the License. You may
+        obtain a copy of the License at "http://www.apache.org/licenses/LICENSE-2.0":http://www.apache.org/licenses/LICENSE-2.0.
+        .        
+        Unless required by applicable law or agreed to in writing,
+        software distributed under the License is distributed on an "AS IS"
+        BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+        implied. See the License for the specific language governing permissions
+        and limitations under the License.
+        .
+
+h1. BookKeeper Configuration Parameters
+
+This page describes the configuration parameters for a BookKeeper bookie. An example configuration is in "bookkeeper-server/conf/bk_server.conf".
+
+h3. General parameters
+
+| @zkServers@ | A list of one or more servers on which ZooKeeper is running, given as comma-separated values, e.g., zk1:2181,zk2:2181,zk3:2181. |
+| @zkTimeout@ | ZooKeeper client session timeout, in milliseconds. The bookie server exits if it receives SESSION_EXPIRED because it was partitioned off from ZooKeeper for longer than the session timeout. Long JVM garbage collection pauses or heavy disk I/O can also cause SESSION_EXPIRED; increasing this value can help avoid the issue. The default value is 10,000. |
+| @bookiePort@        | Port that the bookie server listens on. The default value is 3181. |
+| @journalDir@        | Directory to which BookKeeper writes its write-ahead log, ideally on a dedicated device. The default value is "/tmp/bk-txn". |
+| @ledgerDirs@        | Directory to which BookKeeper writes ledger snapshots. Multiple directories can be defined, comma separated, for example: /tmp/bk1-data,/tmp/bk2-data. Ideally the ledger dirs and the journal dir are each on a different device, which reduces contention between random I/O and sequential writes. It is possible to run with a single disk, but performance will be significantly lower. |
+| @logSizeLimit@      | Maximum size of an entry log file, in bytes. A new entry log file is created when the old one reaches this size limit. The default value is 2GB. |
+| @journalMaxSizeMB@  | Maximum size of a journal file, in megabytes. A new journal file is created when the old one reaches this size limit. The default value is 2048 (2GB). |
+| @journalMaxBackups@ | Maximum number of old journal files to keep. Keeping a number of old journal files might help data recovery in some special cases. The default value is 5. |
+| @gcWaitTime@        | Interval between garbage collection runs, in milliseconds. Garbage collection runs in the background, so running it too frequently hurts performance. If there is sufficient disk capacity, it is best to set this value high. |
+| @flushInterval@ | Interval at which ledger index pages are flushed to disk, in milliseconds. Flushing index files introduces random disk I/O. Consequently, it is important to have the journal dir and the ledger dirs each on different devices. However, if it is necessary to have the journal dir and the ledger dirs on the same device, one option is to increase the flush interval to get higher performance. The trade-off is that, upon a failure, the bookie will take longer to recover. |
+| @bookieDeathWatchInterval@ | Interval to check whether a bookie is dead or not, in milliseconds. |
+
+h3. NIO server settings
+
+| @serverTcpNoDelay@ | This setting enables or disables Nagle's algorithm, which improves the efficiency of TCP/IP networks by reducing the number of packets that need to be sent over the network. If you are sending many small messages, such that more than one can fit in a single IP packet, setting @serverTcpNoDelay@ to false to enable Nagle's algorithm can provide better performance. The default value is true. |
+
+h3. Ledger cache settings
+
+| @openFileLimit@ | Maximum number of ledger index files that can be open in a bookie. If the number of open ledger index files reaches this limit, the bookie starts to flush some ledger indexes from memory to disk. If flushing happens too frequently, performance suffers. You can tune this number to improve performance accordingly. |
+| @pageSize@ | Size of an index page in the ledger cache, in bytes. A larger index page can improve performance when writing pages to disk, which is efficient when you have a small number of ledgers and these ledgers have a similar number of entries. With a large number of ledgers and few entries per ledger, a smaller index page improves memory usage. |
+| @pageLimit@ | Maximum number of index pages to store in the ledger cache. If the number of index pages reaches this limit, the bookie server starts to flush ledger indexes from memory to disk. Increasing this value is an option when flushing becomes frequent. It is important to make sure, though, that pageLimit*pageSize does not exceed the JVM maximum memory limit; otherwise the JVM will raise an OutOfMemoryError. In general, increasing pageLimit and using a smaller index page gives better performance when there is a large number of ledgers with few entries per ledger. If pageLimit is -1, a bookie uses 1/3 of the JVM memory to compute the maximum number of index pages. |
+
+h3. Ledger manager settings
+
+| @ledgerManagerType@ | The type of ledger manager, which determines how ledgers are stored, managed and garbage collected. See "BookKeeper Internals":./bookkeeperInternals.html for detailed info. The default is flat. |
+| @zkLedgersRootPath@ | Root zookeeper path to store ledger metadata. Default is /ledgers. |
+
+
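
Editorial note on the @pageLimit@ row above: the sizing rule it states can be sketched in Java as follows. This only illustrates the arithmetic described in the table and is not the bookie's actual code. The class and method names (PageLimitSketch, effectivePageLimit, fitsInHeap) are hypothetical, and the page size used in main is an assumed example value, not a documented default.

```java
public class PageLimitSketch {

    // Hypothetical helper: when pageLimit is -1, derive the limit from one
    // third of the JVM memory, as the pageLimit description states;
    // otherwise return the configured value unchanged.
    static long effectivePageLimit(long pageLimit, long pageSize, long maxMemory) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        if (pageLimit == -1) {
            return (maxMemory / 3) / pageSize;
        }
        return pageLimit;
    }

    // Hypothetical check for the rule that pageLimit * pageSize must not
    // exceed the JVM maximum memory. Dividing instead of multiplying avoids
    // overflow in the product. Assumes pageSize > 0.
    static boolean fitsInHeap(long pageLimit, long pageSize, long maxMemory) {
        return pageLimit <= maxMemory / pageSize;
    }

    public static void main(String[] args) {
        long maxMemory = Runtime.getRuntime().maxMemory();
        long pageSize = 8192; // assumed example value
        long limit = effectivePageLimit(-1, pageSize, maxMemory);
        System.out.println("pageLimit=" + limit
                + " fitsInHeap=" + fitsInHeap(limit, pageSize, maxMemory));
    }
}
```

With these definitions, a configured limit of 101 pages of 8192 bytes would not fit in 819200 bytes of memory, while 100 pages would.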

Added: zookeeper/bookkeeper/trunk/doc/bookkeeperInternals.textile
URL: http://svn.apache.org/viewvc/zookeeper/bookkeeper/trunk/doc/bookkeeperInternals.textile?rev=1208129&view=auto
==============================================================================
--- zookeeper/bookkeeper/trunk/doc/bookkeeperInternals.textile (added)
+++ zookeeper/bookkeeper/trunk/doc/bookkeeperInternals.textile Tue Nov 29 23:15:02 2011
@@ -0,0 +1,84 @@
+Title:        BookKeeper Internals
+Notice: Licensed under the Apache License, Version 2.0 (the "License");
+        you may not use this file except in compliance with the License. You may
+        obtain a copy of the License at "http://www.apache.org/licenses/LICENSE-2.0":http://www.apache.org/licenses/LICENSE-2.0.
+        .        
+        Unless required by applicable law or agreed to in writing,
+        software distributed under the License is distributed on an "AS IS"
+        BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+        implied. See the License for the specific language governing permissions
+        and limitations under the License.
+        .
+
+h2. Bookie Internals
+
+p. A bookie server stores its data in multiple ledger directories and its journal files in a journal directory. Storing journal files in a separate directory from the data files, ideally on a separate device, increases throughput and decreases latency.
+
+h3. The Bookie Journal
+
+p. The journal directory contains one kind of file:
+
+* @{timestamp}.txn@ - holds transactions executed in the bookie server.
+
+p. Before persisting ledger index and data to disk, a bookie ensures that the transaction that represents the update is written to a journal in non-volatile storage. A new journal file, named with the current timestamp, is created when a bookie starts or when an old journal file reaches its maximum size.
+
+p. A bookie supports journal rolling to remove old journal files. In order to remove old journal files safely, the bookie server records LastLogMark in the Ledger Device, which indicates that all updates (including index and data) before LastLogMark have been persisted to the Ledger Device.
+
+p. LastLogMark contains two parts:
+
+* @LastLogId@ - indicates the journal file in which the transaction was persisted.
+* @LastLogPos@ - indicates the position at which the transaction was persisted in the LastLogId journal file.
+
+p. You may use the following settings to further fine-tune the behavior of journalling on bookies:
+
+| @journalMaxSizeMB@ | Journal file size limit. When a journal file reaches this limit, it is closed and a new journal file is created. |
+| @journalMaxBackups@ | How many old journal files, whose id is less than the journal id in LastLogMark, to keep. |
+
+bq. NOTE: Keeping a number of old journal files can be useful for manual recovery in special cases.
+
+h1. ZooKeeper Metadata
+
+p. BookKeeper requires a ZooKeeper installation to store metadata, and the list of ZooKeeper servers must be passed as a parameter to the constructor of the BookKeeper class (@org.apache.bookkeeper.client.BookKeeper@). To set up ZooKeeper, please check the "ZooKeeper documentation":http://zookeeper.apache.org/doc/trunk/index.html.
+
+p. BookKeeper provides two mechanisms to organize its metadata in ZooKeeper. By default, the @FlatLedgerManager@ is used, and 99% of users should never need to look at anything else. However, when there are many concurrently active ledgers (> 50,000), @HierarchicalLedgerManager@ should be used. For so many ledgers, a hierarchical approach is needed because of a limit ZooKeeper places on packet sizes; see "JIRA Issue":https://issues.apache.org/jira/browse/BOOKKEEPER-39.
+
+| @FlatLedgerManager@ | All ledger metadata are placed as children in a single zookeeper path. |
+| @HierarchicalLedgerManager@ | All ledger metadata are partitioned into 2-level znodes. |
+
+h2. Flat Ledger Manager
+
+p. All ledger metadata is stored under a single ZooKeeper path. Each ledger node is created as a ZooKeeper sequential node, which ensures that ledger ids are unique, and is prefixed with 'L'.
+
+p. A bookie server keeps track of its own active ledgers in a hash map, so it is easy for the bookie server to find which ledgers have been deleted from ZooKeeper and to garbage collect them. The garbage collection flow is as follows:
+
+* Fetch all existing ledgers from zookeeper (@zkActiveLedgers@).
+* Fetch all ledgers currently active within the Bookie (@bkActiveLedgers@).
+* Loop over @bkActiveLedgers@ to find those ledgers which do not exist in @zkActiveLedgers@ and garbage collect them.
+
+h2. Hierarchical Ledger Manager
+
+p. @HierarchicalLedgerManager@ first obtains a globally unique id from ZooKeeper using an EPHEMERAL_SEQUENTIAL znode.
+
+p. Since the ZooKeeper sequential counter has a format of %010d -- that is, 10 digits with 0 (zero) padding, e.g. "<path>0000000001" -- @HierarchicalLedgerManager@ splits the generated id into 3 parts:
+
+@{level1 (2 digits)}{level2 (4 digits)}{level3 (4 digits)}@
+
+p. These 3 parts are used to form the actual ledger node path used to store ledger metadata:
+
+@{ledgers_root_path}/{level1}/{level2}/L{level3}@
+
+p. E.g. ledger 0000000001 is split into the 3 parts 00, 0000 and 0001, and is stored in znode {ledgers_root_path}/00/0000/L0001. Each level2 znode can therefore hold at most 10000 ledgers, which avoids the problem of the child list being larger than the maximum ZooKeeper packet size.
+
+p. Bookie server manages its active ledgers in a sorted map, which simplifies access to active ledgers in a particular (level1, level2) partition.
+
+p. Garbage collection in bookie server is processed node by node as follows:
+
+* Fetch all level1 nodes by calling zk#getChildren(ledgerRootPath).
+** For each level1 node, fetch its level2 nodes.
+** For each partition (level1, level2):
+*** Fetch all existing ledgers from zookeeper belonging to partition (level1, level2) (@zkActiveLedgers@).
+*** Fetch all ledgers currently active in the bookie which belong to partition (level1, level2) (@bkActiveLedgers@).
+*** Loop over @bkActiveLedgers@ to find those ledgers which do not exist in @zkActiveLedgers@, and garbage collect them.
+
+bq. NOTE: Hierarchical Ledger Manager is better suited to managing a large number of ledgers in BookKeeper.
+
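
Editorial note on the Hierarchical Ledger Manager section above: the id splitting and the per-partition garbage collection comparison can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not the BookKeeper implementation. The class and method names (HierarchicalLedgerSketch, ledgerPath, ledgersToCollect) are hypothetical, ledger ids are assumed to be available as plain long values, and the two sets stand in for @bkActiveLedgers@ and @zkActiveLedgers@.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

public class HierarchicalLedgerSketch {

    // Builds {ledgers_root_path}/{level1}/{level2}/L{level3} by zero-padding
    // the id to 10 digits and splitting it into 2 + 4 + 4 digits.
    static String ledgerPath(String ledgersRootPath, long ledgerId) {
        if (ledgerId < 0 || ledgerId > 9_999_999_999L) {
            throw new IllegalArgumentException("ledger id must fit in 10 digits");
        }
        String id = String.format(Locale.ROOT, "%010d", ledgerId);
        String level1 = id.substring(0, 2);
        String level2 = id.substring(2, 6);
        String level3 = id.substring(6, 10);
        return ledgersRootPath + "/" + level1 + "/" + level2 + "/L" + level3;
    }

    // Returns the ledgers that are active in the bookie but no longer exist
    // in ZooKeeper, i.e. the candidates for garbage collection.
    static SortedSet<Long> ledgersToCollect(Set<Long> bkActiveLedgers,
                                            Set<Long> zkActiveLedgers) {
        SortedSet<Long> garbage = new TreeSet<>(bkActiveLedgers);
        garbage.removeAll(zkActiveLedgers);
        return garbage;
    }

    public static void main(String[] args) {
        // "/ledgers" is the documented default for zkLedgersRootPath.
        System.out.println(ledgerPath("/ledgers", 1L));

        Set<Long> bk = new HashSet<>(Arrays.asList(1L, 2L, 3L));
        Set<Long> zk = new HashSet<>(Arrays.asList(2L));
        System.out.println(ledgersToCollect(bk, zk));
    }
}
```

The same set difference applies to the flat ledger manager; the hierarchical variant simply runs it once per (level1, level2) partition so that no single child list has to be fetched in one ZooKeeper response.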