Posted to oak-commits@jackrabbit.apache.org by am...@apache.org on 2015/09/02 06:31:25 UTC

svn commit: r1700703 - in /jackrabbit/oak/trunk/oak-doc/src/site/markdown: osgi_config.md plugins/blobstore.md

Author: amitj
Date: Wed Sep  2 04:31:25 2015
New Revision: 1700703

URL: http://svn.apache.org/r1700703
Log:
OAK-301: Document Oak

- Updated Blob Garbage Collection documentation
- Details of GC in a shared DataStore

Modified:
    jackrabbit/oak/trunk/oak-doc/src/site/markdown/osgi_config.md
    jackrabbit/oak/trunk/oak-doc/src/site/markdown/plugins/blobstore.md

Modified: jackrabbit/oak/trunk/oak-doc/src/site/markdown/osgi_config.md
URL: http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-doc/src/site/markdown/osgi_config.md?rev=1700703&r1=1700702&r2=1700703&view=diff
==============================================================================
--- jackrabbit/oak/trunk/oak-doc/src/site/markdown/osgi_config.md (original)
+++ jackrabbit/oak/trunk/oak-doc/src/site/markdown/osgi_config.md Wed Sep  2 04:31:25 2015
@@ -204,6 +204,21 @@ cacheSizeInMB
 : Size in MB. In-memory cache for storing small files whose size is less than `maxCachedBinarySize`. This
   improves performance when many small binaries are accessed frequently.
 
+#### Oak - SharedS3DataStore (Since Oak 1.2.0)
+
+Supports a shared S3 DataStore
+
+_PID `org.apache.jackrabbit.oak.plugins.blob.datastore.SharedS3DataStore`_
+
+maxCachedBinarySize
+: Default - 17408 (17 KB)
+: Size in bytes. Binaries with size less than or equal to this size are stored in the in-memory cache
+
+cacheSizeInMB
+: Default - 16
+: Size in MB. In-memory cache for storing small files whose size is less than `maxCachedBinarySize`. This
+  improves performance when many small binaries are accessed frequently.
+
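As an illustration, a configuration file for this PID could look like the following sketch. The file name and the typed-value syntax follow the Apache Felix `.config` conventions; the property types are assumed here, and the values shown are just the documented defaults:

```
# org.apache.jackrabbit.oak.plugins.blob.datastore.SharedS3DataStore.config
# (illustrative sketch; values are the documented defaults)
maxCachedBinarySize=I"17408"
cacheSizeInMB=I"16"
```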
 ### System properties and Framework properties
 
 Following properties are supported by Oak. They are grouped in two parts _Stable_ and
@@ -273,4 +288,4 @@ need not be specified in config files
 [2]: http://jackrabbit.apache.org/api/2.4/org/apache/jackrabbit/core/data/FileDataStore.html
 [OAK-1645]: https://issues.apache.org/jira/browse/OAK-1645
 [doc-cache]: ./nodestore/documentmk.html#cache
-[persistent-cache]: ./nodestore/persistent-cache.html
\ No newline at end of file
+[persistent-cache]: ./nodestore/persistent-cache.html

Modified: jackrabbit/oak/trunk/oak-doc/src/site/markdown/plugins/blobstore.md
URL: http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-doc/src/site/markdown/plugins/blobstore.md?rev=1700703&r1=1700702&r2=1700703&view=diff
==============================================================================
--- jackrabbit/oak/trunk/oak-doc/src/site/markdown/plugins/blobstore.md (original)
+++ jackrabbit/oak/trunk/oak-doc/src/site/markdown/plugins/blobstore.md Wed Sep  2 04:31:25 2015
@@ -58,25 +58,6 @@ point, see also http://en.wikipedia.org/
 must use the SHA-2 family of hash functions for these applications
 after 2010". This might affect some potential users.
 
-### Blob Garbage Collection
-
-Oak implements a Mark and Sweep based Garbage Collection logic. 
- 
-1. Mark Phase - In this phase the binary references are marked in both
-   BlobStore and NodeStore
-    1. Mark BlobStore - GC logic would make a record of all the blob
-       references present in the BlobStore. In doing so it would only
-       consider those blobs which are older than a specified time 
-       interval. So only those blob references are fetched which are 
-       last modified say 24 hrs (default) ago. 
-    2. Mark NodeStore - GC logic would make a record of all the blob
-       references which are referred by any node present in NodeStore.
-       Note that any blob references from old revisions of node would also be 
-       considered as a valid references. 
-2. Sweep Phase - In this phase all blob references form Mark BlobStore phase 
-    which were not found in Mark NodeStore part would considered as GC candidates
-    and would be deleted.
-
 ### Support for Jackrabbit 2 DataStore
 
 Jackrabbit 2 used [DataStore][2] to store blobs. Oak supports usage of such 
@@ -129,6 +110,72 @@ one of the following can be used
 * S3DataStore - This should be used when binaries are stored in Amazon S3. Typically used when running
   in Amazon AWS
 
+### Blob Garbage Collection
+
+Blob Garbage Collection (GC) is applicable to the following blob stores:
+
+* DocumentNodeStore 
+    * MongoBlobStore/RDBBlobStore (Default blob stores for RDB & Mongo)
+    * FileDataStore
+    * S3DataStore
+    * SharedS3DataStore (since Oak 1.2.0)
+    
+* SegmentNodeStore 
+    * FileDataStore
+    * S3DataStore
+    * SharedS3DataStore (since Oak 1.2.0)
+
+Oak implements Mark and Sweep based garbage collection logic. 
+ 
+1. Mark Phase - In this phase the binary references are marked in both
+   the BlobStore and the NodeStore.
+    1. Mark BlobStore - The GC logic records all the blobs
+       present in the BlobStore. 
+    2. Mark NodeStore - The GC logic records all the blob
+       references that are referred to by any node present in the NodeStore.
+       Note that blob references from old revisions of a node are also 
+       considered valid references. 
+2. Sweep Phase - In this phase all blob references from the Mark BlobStore phase 
+    which were not found in the Mark NodeStore phase are considered GC candidates 
+    for deletion. Only blobs which are older than a specified time interval are 
+    deleted (by default, those last modified more than 24 hours ago).
+
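The candidate selection described above can be sketched as a small self-contained simulation. This is illustrative only, not Oak's implementation; the class, method, and variable names below are hypothetical:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class MarkSweepSketch {

    /**
     * Returns the GC candidates: blobs present in the blob store that are
     * not referenced from the node store and whose last-modified time is
     * older than maxAgeMillis.
     *
     * @param blobLastModified blob id -> last-modified timestamp (Mark BlobStore)
     * @param referencedBlobs  blob ids referenced by nodes (Mark NodeStore)
     */
    static Set<String> candidates(Map<String, Long> blobLastModified,
                                  Set<String> referencedBlobs,
                                  long now, long maxAgeMillis) {
        Set<String> result = new HashSet<>();
        for (Map.Entry<String, Long> e : blobLastModified.entrySet()) {
            boolean unreferenced = !referencedBlobs.contains(e.getKey());
            boolean oldEnough = now - e.getValue() > maxAgeMillis;
            if (unreferenced && oldEnough) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> blobs = new HashMap<>();
        blobs.put("old-referenced", 0L);   // old but referenced -> kept
        blobs.put("old-unreferenced", 0L); // old and unreferenced -> candidate
        blobs.put("new-unreferenced", 95L); // unreferenced but too recent -> kept
        Set<String> refs = new HashSet<>();
        refs.add("old-referenced");
        System.out.println(candidates(blobs, refs, 100L, 10L)); // prints [old-unreferenced]
    }
}
```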
+The garbage collection can be triggered by calling:
+
+* `MarkSweepGarbageCollector#collectGarbage()` (Oak 1.0.x)
+* `MarkSweepGarbageCollector#collectGarbage(false)` (Oak 1.2.x)
+ 
+#### Shared DataStore Blob Garbage Collection (Since 1.2.0)
+
+When a repository configured with a shared DataStore starts up, a unique repository id is registered. 
+In the DataStore this repository id is recorded as an empty file with the format `repository-[repository-id]` 
+(e.g. repository-988373a0-3efb-451e-ab4c-f7e794189273).
+The high-level process for garbage collection is still the same as described above, 
+but to support blob garbage collection in a shared DataStore the Mark and Sweep phases can be
+run independently.
+
+The details of the process are as follows:
+
+* The Mark NodeStore phase has to be executed for each of the repositories sharing the DataStore.
+    * This can be executed by running `MarkSweepGarbageCollector#collectGarbage(true)`, where true indicates mark only.
+    * All the references are collected in the DataStore in a file with the format `references-[repository-id]` 
+    (e.g. references-988373a0-3efb-451e-ab4c-f7e794189273).
+* On completion of the above process on all repositories, the sweep phase needs to be triggered.
+    * This can be executed by running `MarkSweepGarbageCollector#collectGarbage(false)` on one of the repositories, 
+    where false indicates that the sweep phase should also run. 
+    * The sweep process checks that a references file is available from every registered repository and aborts otherwise.
+    * All the available references are collected.
+    * All the blobs available in the DataStore are collected, and deletion candidates are identified as those 
+    blobs that do not appear among the referenced blobs. Only blobs older than a specified time interval relative 
+    to the earliest available references file are deleted (by default, those last modified more than 24 hours ago).
+    
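The sweep precondition above can be sketched as a self-contained check. The `repository-[id]`/`references-[id]` file-name convention mirrors the one described above; everything else (class name, method name, the use of plain string sets) is illustrative:

```java
import java.util.HashSet;
import java.util.Set;

public class SharedSweepSketch {

    /**
     * Sweep may only run when every repository that registered itself
     * (via a "repository-[id]" marker file) has also published its
     * references (a "references-[id]" file); otherwise it must abort.
     */
    static boolean canSweep(Set<String> dataStoreFiles) {
        for (String name : dataStoreFiles) {
            if (name.startsWith("repository-")) {
                String id = name.substring("repository-".length());
                if (!dataStoreFiles.contains("references-" + id)) {
                    // this repository has not completed its mark phase yet
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> files = new HashSet<>();
        files.add("repository-r1");
        files.add("repository-r2");
        files.add("references-r1");
        System.out.println(canSweep(files)); // prints false: r2 has no references file
        files.add("references-r2");
        System.out.println(canSweep(files)); // prints true
    }
}
```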
+The shared DataStore garbage collection is applicable for the following DataStore(s):
+
+* FileDataStore
+* SharedS3DataStore - Extends the S3DataStore to enable sharing of the data store
+  with multiple repositories
+ 
+
 [1]: http://serverfault.com/questions/52861/how-does-dropbox-version-upload-large-files
 [2]: http://wiki.apache.org/jackrabbit/DataStore
-[3]: http://jclouds.apache.org/start/blobstore/
\ No newline at end of file
+[3]: http://jclouds.apache.org/start/blobstore/