Posted to notifications@james.apache.org by GitBox <gi...@apache.org> on 2021/09/01 09:30:23 UTC
[GitHub] [james-project] Arsnael commented on a change in pull request #631: JAMES-3150 Document blob garbage collection
Arsnael commented on a change in pull request #631:
URL: https://github.com/apache/james-project/pull/631#discussion_r699117501
##########
File path: docs/modules/servers/pages/distributed/configure/blobstore.adoc
##########
@@ -35,11 +35,20 @@ If you choose to enable deduplication, the mails with the same content will be s
WARNING: Once this feature is enabled, there is no turning back as turning it off will lead to the deletion of all
the mails sharing the same content once one is deleted.
-This feature also requires a garbage collector mechanism to effectively drop blobs, which is not implemented yet.
+NOTE: If you are upgrading from James 3.5 or older, the deduplication was enabled.
-Consequently, all the requested deletions will not be performed, meaning that blobstore will only grow.
+Deduplication requires a garbage collector mechanism to effectively drop blobs. A first implementation
+based on bloom filters can be used and triggered using the WebAdmin REST API. See
+xref:distributed/operate/webadmin.adoc#_running_blob_garbage_collection[Running blob garbage collection].
-NOTE: If you are upgrading from James 3.5 or older, the deduplication was enabled.
+In order to avoid concurrency issues upon garbage collection, we slice th blobs in generation, the two more recent
Review comment:
```suggestion
In order to avoid concurrency issues upon garbage collection, we slice the blobs in generation, the two more recent
```
##########
File path: docs/modules/servers/pages/distributed/configure/blobstore.adoc
##########
@@ -35,11 +35,20 @@ If you choose to enable deduplication, the mails with the same content will be s
WARNING: Once this feature is enabled, there is no turning back as turning it off will lead to the deletion of all
the mails sharing the same content once one is deleted.
-This feature also requires a garbage collector mechanism to effectively drop blobs, which is not implemented yet.
+NOTE: If you are upgrading from James 3.5 or older, the deduplication was enabled.
-Consequently, all the requested deletions will not be performed, meaning that blobstore will only grow.
+Deduplication requires a garbage collector mechanism to effectively drop blobs. A first implementation
+based on bloom filters can be used and triggered using the WebAdmin REST API. See
+xref:distributed/operate/webadmin.adoc#_running_blob_garbage_collection[Running blob garbage collection].
-NOTE: If you are upgrading from James 3.5 or older, the deduplication was enabled.
+In order to avoid concurrency issues upon garbage collection, we slice th blobs in generation, the two more recent
+generation are not garbage collected.
+
+*deduplication.gc.generation.duration*: Allow controlling the duration of one generation. Longer implies better deduplication
+but deleted blobs will live longer. Duration, defaults on 30 days, the default unit is days.
Review comment:
```suggestion
but deleted blobs will live longer. Duration, defaults on 30 days, the default unit is in days.
```
##########
File path: docs/modules/servers/pages/distributed/operate/webadmin.adoc
##########
@@ -2306,6 +2306,45 @@ the following `additionalInformation`:
}
....
+== Running blob garbage collection
+
+When deduplication is enabled one need to explicitly run a garbage collection in order to delete no longer referenced
Review comment:
```suggestion
When deduplication is enabled one needs to explicitly run a garbage collection in order to delete no longer referenced
```
##########
File path: src/site/xdoc/server/config-blobstore.xml
##########
@@ -53,14 +53,34 @@
The generated startup warning log can be deactivated via the <code>cassandra.blob.store.disable.startup.warning</code> environment
variable being positioned to <code>false</code>.
</dd>
+ Deduplication requires a garbage collector mechanism to effectively drop blobs. A first implementation
+ based on bloom filters can be used and triggered using the WebAdmin REST API. See
+ xref:distributed/operate/webadmin.adoc#_running_blob_garbage_collection[Running blob garbage collection].
+ In order to avoid concurrency issues upon garbage collection, we slice th blobs in generation, the two more recent
+ generation are not garbage collected.
+
+ *deduplication.gc.generation.duration*: Allow controlling the duration of one generation. Longer implies better deduplication
+ but deleted blobs will live longer. Duration, defaults on 30 days, the default unit is days.
+
+ *deduplication.gc.generation.family*: Every time the duration is changes, this integer counter must be incremented to avoid
Review comment:
```suggestion
*deduplication.gc.generation.family*: Every time the duration is changed, this integer counter must be incremented to avoid
```
##########
File path: docs/modules/servers/pages/distributed/operate/webadmin.adoc
##########
@@ -2306,6 +2306,45 @@ the following `additionalInformation`:
}
....
+== Running blob garbage collection
+
+When deduplication is enabled one need to explicitly run a garbage collection in order to delete no longer referenced
+blobs.
+
+To do so:
+
+....
+curl -XDELETE http://ip:port/blobs?scope=unreferenced
+....
+
+link:#_endpoints_returning_a_task[More details about endpoints returning a task].
+
+Additional parameters include Bloom filter tuning parameters:
+
+ - *associatedProbability*: Allow to define the targeted false positive rate. Note that subsequent runs do not have the
+same false-positives.
+ - *expectedBlobCount*: Expected count of blobs used to size the bloom filters.
+
+The created task have the following additional information:
+
+....
+{
+ "referenceSourceCount": 3456,
+ "blobCount": 5678,
+ "gcedBlobCount": 1234,
+ "bloomFilterExpectedBlobCount": 10000,
+ "bloomFilterAssociatedProbability": 0.01
+}
+....
+
+Where:
+
+ - *bloomFilterExpectedBlobCount* correspond to the supplied *expectedBlobCount* query parameter.
+ - *bloomFilterAssociatedProbability* correspond to the supplied *associatedProbability* query parameter.
+ - *referenceSourceCount* is the count of distinct blob references encountered while populating the bloom filter.
+ - *blobCount* is the count of blob tried against the bloom filter. This value can be used to better size the bloom
Review comment:
```suggestion
- *blobCount* is the count of blobs tried against the bloom filter. This value can be used to better size the bloom
```
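As a side note on using *blobCount* to size the filter: the relationship between *expectedBlobCount* and *associatedProbability* can be sketched with the textbook Bloom filter formulas. This is only an illustration under the assumption that the standard formulas apply; the function name is hypothetical and nothing here is taken from the James implementation.

```python
import math
from typing import Tuple


def bloom_filter_size(expected_blob_count: int, false_positive_rate: float) -> Tuple[int, int]:
    """Return (bits, hash_functions) from the textbook Bloom filter formulas."""
    if expected_blob_count <= 0 or not 0.0 < false_positive_rate < 1.0:
        raise ValueError("need a positive count and a rate strictly between 0 and 1")
    # Optimal bit count: m = -n * ln(p) / (ln 2)^2
    bits = math.ceil(-expected_blob_count * math.log(false_positive_rate) / math.log(2) ** 2)
    # Optimal number of hash functions: k = (m / n) * ln 2
    hashes = max(1, round(bits / expected_blob_count * math.log(2)))
    return bits, hashes


# Inputs mirror the sample task output quoted in the diff.
bits, hashes = bloom_filter_size(10000, 0.01)
print(f"{bits} bits (~{bits / 8 / 1024:.1f} KiB), {hashes} hash functions")
```

If a run reports a *blobCount* well above the supplied *expectedBlobCount*, feeding the larger value back into the next run keeps the effective false positive rate close to the targeted one.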
##########
File path: docs/modules/servers/pages/distributed/configure/blobstore.adoc
##########
@@ -35,11 +35,20 @@ If you choose to enable deduplication, the mails with the same content will be s
WARNING: Once this feature is enabled, there is no turning back as turning it off will lead to the deletion of all
the mails sharing the same content once one is deleted.
-This feature also requires a garbage collector mechanism to effectively drop blobs, which is not implemented yet.
+NOTE: If you are upgrading from James 3.5 or older, the deduplication was enabled.
-Consequently, all the requested deletions will not be performed, meaning that blobstore will only grow.
+Deduplication requires a garbage collector mechanism to effectively drop blobs. A first implementation
+based on bloom filters can be used and triggered using the WebAdmin REST API. See
+xref:distributed/operate/webadmin.adoc#_running_blob_garbage_collection[Running blob garbage collection].
-NOTE: If you are upgrading from James 3.5 or older, the deduplication was enabled.
+In order to avoid concurrency issues upon garbage collection, we slice th blobs in generation, the two more recent
+generation are not garbage collected.
+
+*deduplication.gc.generation.duration*: Allow controlling the duration of one generation. Longer implies better deduplication
+but deleted blobs will live longer. Duration, defaults on 30 days, the default unit is days.
+
+*deduplication.gc.generation.family*: Every time the duration is changes, this integer counter must be incremented to avoid
Review comment:
```suggestion
*deduplication.gc.generation.family*: Every time the duration is changed, this integer counter must be incremented to avoid
```
##########
File path: docs/modules/servers/pages/distributed/configure/blobstore.adoc
##########
@@ -35,11 +35,20 @@ If you choose to enable deduplication, the mails with the same content will be s
WARNING: Once this feature is enabled, there is no turning back as turning it off will lead to the deletion of all
the mails sharing the same content once one is deleted.
-This feature also requires a garbage collector mechanism to effectively drop blobs, which is not implemented yet.
+NOTE: If you are upgrading from James 3.5 or older, the deduplication was enabled.
-Consequently, all the requested deletions will not be performed, meaning that blobstore will only grow.
+Deduplication requires a garbage collector mechanism to effectively drop blobs. A first implementation
+based on bloom filters can be used and triggered using the WebAdmin REST API. See
+xref:distributed/operate/webadmin.adoc#_running_blob_garbage_collection[Running blob garbage collection].
-NOTE: If you are upgrading from James 3.5 or older, the deduplication was enabled.
+In order to avoid concurrency issues upon garbage collection, we slice th blobs in generation, the two more recent
+generation are not garbage collected.
Review comment:
```suggestion
generations are not garbage collected.
```
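For readers of this thread, the generation slicing described in this hunk can be pictured with a small sketch. It assumes that a generation is simply the epoch time divided by the configured duration, and that a blob becomes eligible for collection only once it is older than the two most recent generations. The function names are hypothetical and are not taken from the James code base.

```python
from datetime import datetime, timedelta, timezone

# Default documented in the diff: one generation lasts 30 days.
GENERATION_DURATION = timedelta(days=30)


def generation_of(instant: datetime, duration: timedelta = GENERATION_DURATION) -> int:
    """Return the index of the generation that contains `instant` (assumed scheme)."""
    return int(instant.timestamp() // duration.total_seconds())


def is_collectable(blob_generation: int, now: datetime,
                   duration: timedelta = GENERATION_DURATION) -> bool:
    """A blob is eligible only if it is older than the two most recent generations."""
    return blob_generation < generation_of(now, duration) - 1


now = datetime(2021, 9, 1, tzinfo=timezone.utc)
current = generation_of(now)
print(is_collectable(current, now))      # current generation: kept
print(is_collectable(current - 1, now))  # previous generation: kept
print(is_collectable(current - 2, now))  # older generation: eligible
```

Under this reading, a longer duration widens the window in which identical content is deduplicated, at the cost of deleted blobs surviving for up to two full generations.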
##########
File path: src/site/markdown/server/manage-webadmin.md
##########
@@ -3517,6 +3518,46 @@ Response codes:
- 204: Operation succeeded
+## Running blob garbage collection
+
+When deduplication is enabled one need to explicitly run a garbage collection in order to delete no longer referenced
+blobs.
+
+To do so:
+
+```
+curl -XDELETE http://ip:port/blobs?scope=unreferenced
+```
+
+[More details about endpoints returning a task](#Endpoints_returning_a_task).
+
+Additional parameters include Bloom filter tuning parameters:
+
+ - **associatedProbability**: Allow to define the targeted false positive rate. Note that subsequent runs do not have the
+same false-positives.
+ - **expectedBlobCount**: Expected count of blobs used to size the bloom filters.
+
+The created task have the following additional information:
+
+```json
+{
+ "referenceSourceCount": 3456,
+ "blobCount": 5678,
+ "gcedBlobCount": 1234,
+ "bloomFilterExpectedBlobCount": 10000,
+ "bloomFilterAssociatedProbability": 0.01
+}
+```
+
+Where:
+
+ - **bloomFilterExpectedBlobCount** correspond to the supplied **expectedBlobCount** query parameter.
+ - **bloomFilterAssociatedProbability** correspond to the supplied **associatedProbability** query parameter.
+ - **referenceSourceCount** is the count of distinct blob references encountered while populating the bloom filter.
+ - **blobCount** is the count of blob tried against the bloom filter. This value can be used to better size the bloom
Review comment:
```suggestion
- **blobCount** is the count of blobs tried against the bloom filter. This value can be used to better size the bloom
```
##########
File path: docs/modules/servers/pages/distributed/operate/webadmin.adoc
##########
@@ -2306,6 +2306,45 @@ the following `additionalInformation`:
}
....
+== Running blob garbage collection
+
+When deduplication is enabled one need to explicitly run a garbage collection in order to delete no longer referenced
+blobs.
+
+To do so:
+
+....
+curl -XDELETE http://ip:port/blobs?scope=unreferenced
+....
+
+link:#_endpoints_returning_a_task[More details about endpoints returning a task].
+
+Additional parameters include Bloom filter tuning parameters:
+
+ - *associatedProbability*: Allow to define the targeted false positive rate. Note that subsequent runs do not have the
+same false-positives.
+ - *expectedBlobCount*: Expected count of blobs used to size the bloom filters.
+
+The created task have the following additional information:
Review comment:
```suggestion
The created task has the following additional information:
```
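To make the tuning parameters in this hunk concrete, here is a minimal sketch of how the request could be assembled with both query parameters set. It only builds the request and does not send it; the host and port are placeholders, and the helper name is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_gc_request(base_url: str, expected_blob_count: int,
                     associated_probability: float) -> Request:
    """Build (but do not send) the DELETE request that triggers blob garbage collection."""
    query = urlencode({
        "scope": "unreferenced",
        "expectedBlobCount": expected_blob_count,
        "associatedProbability": associated_probability,
    })
    return Request(f"{base_url}/blobs?{query}", method="DELETE")


# "http://localhost:8000" is a placeholder; substitute the WebAdmin ip:port.
request = build_gc_request("http://localhost:8000", 10000, 0.01)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would submit it; the response is expected to describe a task, as covered by the "endpoints returning a task" section referenced in the diff.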
##########
File path: src/site/markdown/server/manage-webadmin.md
##########
@@ -3517,6 +3518,46 @@ Response codes:
- 204: Operation succeeded
+## Running blob garbage collection
+
+When deduplication is enabled one need to explicitly run a garbage collection in order to delete no longer referenced
Review comment:
```suggestion
When deduplication is enabled one needs to explicitly run a garbage collection in order to delete no longer referenced
```
##########
File path: src/site/xdoc/server/config-blobstore.xml
##########
@@ -53,14 +53,34 @@
The generated startup warning log can be deactivated via the <code>cassandra.blob.store.disable.startup.warning</code> environment
variable being positioned to <code>false</code>.
</dd>
+ Deduplication requires a garbage collector mechanism to effectively drop blobs. A first implementation
+ based on bloom filters can be used and triggered using the WebAdmin REST API. See
+ xref:distributed/operate/webadmin.adoc#_running_blob_garbage_collection[Running blob garbage collection].
+ In order to avoid concurrency issues upon garbage collection, we slice th blobs in generation, the two more recent
+ generation are not garbage collected.
+
+ *deduplication.gc.generation.duration*: Allow controlling the duration of one generation. Longer implies better deduplication
+ but deleted blobs will live longer. Duration, defaults on 30 days, the default unit is days.
+
+ *deduplication.gc.generation.family*: Every time the duration is changes, this integer counter must be incremented to avoid
+ conflicts. Defaults to 1.
<dt><strong>deduplication/enable</strong></dt>
<dd>Mandatory. Supported value: true and false.</dd>
<dd>If you choose to enable deduplication, the mails with the same content will be stored only once.</dd>
<dd>Warning: Once this feature is enabled, there is no turning back as turning it off will lead to the deletion of all</dd>
<dd>the mails sharing the same content once one is deleted.</dd>
- <dd>This feature also requires a garbage collector mechanism to effectively drop blobs, which is not implemented yet.</dd>
- <dd>Consequently, all the requested deletions will not be performed, meaning that blobstore will only grow.</dd>
+ <dd>This feature also requires a garbage collector mechanism to effectively drop blobs. A first implementation
+ based on bloom filters can be used and triggered using the WebAdmin REST API. See
+ <a href="manage-webadmin.html#Running_blob_garbage_collection">Running blob garbage collection</a>.
+ In order to avoid concurrency issues upon garbage collection, we slice th blobs in generation, the two more recent
+ generation are not garbage collected.</dd>
Review comment:
```suggestion
generations are not garbage collected.</dd>
```
##########
File path: src/site/markdown/server/manage-webadmin.md
##########
@@ -3517,6 +3518,46 @@ Response codes:
- 204: Operation succeeded
+## Running blob garbage collection
+
+When deduplication is enabled one need to explicitly run a garbage collection in order to delete no longer referenced
+blobs.
+
+To do so:
+
+```
+curl -XDELETE http://ip:port/blobs?scope=unreferenced
+```
+
+[More details about endpoints returning a task](#Endpoints_returning_a_task).
+
+Additional parameters include Bloom filter tuning parameters:
+
+ - **associatedProbability**: Allow to define the targeted false positive rate. Note that subsequent runs do not have the
+same false-positives.
+ - **expectedBlobCount**: Expected count of blobs used to size the bloom filters.
+
+The created task have the following additional information:
Review comment:
```suggestion
The created task has the following additional information:
```
##########
File path: src/site/xdoc/server/config-blobstore.xml
##########
@@ -53,14 +53,34 @@
The generated startup warning log can be deactivated via the <code>cassandra.blob.store.disable.startup.warning</code> environment
variable being positioned to <code>false</code>.
</dd>
+ Deduplication requires a garbage collector mechanism to effectively drop blobs. A first implementation
+ based on bloom filters can be used and triggered using the WebAdmin REST API. See
+ xref:distributed/operate/webadmin.adoc#_running_blob_garbage_collection[Running blob garbage collection].
+ In order to avoid concurrency issues upon garbage collection, we slice th blobs in generation, the two more recent
+ generation are not garbage collected.
Review comment:
```suggestion
generations are not garbage collected.
```
##########
File path: src/site/xdoc/server/config-blobstore.xml
##########
@@ -53,14 +53,34 @@
The generated startup warning log can be deactivated via the <code>cassandra.blob.store.disable.startup.warning</code> environment
variable being positioned to <code>false</code>.
</dd>
+ Deduplication requires a garbage collector mechanism to effectively drop blobs. A first implementation
+ based on bloom filters can be used and triggered using the WebAdmin REST API. See
+ xref:distributed/operate/webadmin.adoc#_running_blob_garbage_collection[Running blob garbage collection].
+ In order to avoid concurrency issues upon garbage collection, we slice th blobs in generation, the two more recent
+ generation are not garbage collected.
+
+ *deduplication.gc.generation.duration*: Allow controlling the duration of one generation. Longer implies better deduplication
+ but deleted blobs will live longer. Duration, defaults on 30 days, the default unit is days.
+
+ *deduplication.gc.generation.family*: Every time the duration is changes, this integer counter must be incremented to avoid
+ conflicts. Defaults to 1.
<dt><strong>deduplication/enable</strong></dt>
<dd>Mandatory. Supported value: true and false.</dd>
<dd>If you choose to enable deduplication, the mails with the same content will be stored only once.</dd>
<dd>Warning: Once this feature is enabled, there is no turning back as turning it off will lead to the deletion of all</dd>
<dd>the mails sharing the same content once one is deleted.</dd>
- <dd>This feature also requires a garbage collector mechanism to effectively drop blobs, which is not implemented yet.</dd>
- <dd>Consequently, all the requested deletions will not be performed, meaning that blobstore will only grow.</dd>
+ <dd>This feature also requires a garbage collector mechanism to effectively drop blobs. A first implementation
+ based on bloom filters can be used and triggered using the WebAdmin REST API. See
+ <a href="manage-webadmin.html#Running_blob_garbage_collection">Running blob garbage collection</a>.
+ In order to avoid concurrency issues upon garbage collection, we slice th blobs in generation, the two more recent
+ generation are not garbage collected.</dd>
+ <dd><strong>deduplication.gc.generation.duration</strong></dd>
+ <dd>Allow controlling the duration of one generation. Longer implies better deduplication
+ but deleted blobs will live longer. Duration, defaults on 30 days, the default unit is days.</dd>
Review comment:
```suggestion
but deleted blobs will live longer. Duration, defaults on 30 days, the default unit is in days.</dd>
```
##########
File path: src/site/xdoc/server/config-blobstore.xml
##########
@@ -53,14 +53,34 @@
The generated startup warning log can be deactivated via the <code>cassandra.blob.store.disable.startup.warning</code> environment
variable being positioned to <code>false</code>.
</dd>
+ Deduplication requires a garbage collector mechanism to effectively drop blobs. A first implementation
Review comment:
Isn't this whole block outside the XML syntax, and a duplicate of the correctly formatted version below (line 73)?
##########
File path: src/site/xdoc/server/config-blobstore.xml
##########
@@ -53,14 +53,34 @@
The generated startup warning log can be deactivated via the <code>cassandra.blob.store.disable.startup.warning</code> environment
variable being positioned to <code>false</code>.
</dd>
+ Deduplication requires a garbage collector mechanism to effectively drop blobs. A first implementation
+ based on bloom filters can be used and triggered using the WebAdmin REST API. See
+ xref:distributed/operate/webadmin.adoc#_running_blob_garbage_collection[Running blob garbage collection].
+ In order to avoid concurrency issues upon garbage collection, we slice th blobs in generation, the two more recent
+ generation are not garbage collected.
+
+ *deduplication.gc.generation.duration*: Allow controlling the duration of one generation. Longer implies better deduplication
+ but deleted blobs will live longer. Duration, defaults on 30 days, the default unit is days.
+
+ *deduplication.gc.generation.family*: Every time the duration is changes, this integer counter must be incremented to avoid
+ conflicts. Defaults to 1.
<dt><strong>deduplication/enable</strong></dt>
<dd>Mandatory. Supported value: true and false.</dd>
<dd>If you choose to enable deduplication, the mails with the same content will be stored only once.</dd>
<dd>Warning: Once this feature is enabled, there is no turning back as turning it off will lead to the deletion of all</dd>
<dd>the mails sharing the same content once one is deleted.</dd>
- <dd>This feature also requires a garbage collector mechanism to effectively drop blobs, which is not implemented yet.</dd>
- <dd>Consequently, all the requested deletions will not be performed, meaning that blobstore will only grow.</dd>
+ <dd>This feature also requires a garbage collector mechanism to effectively drop blobs. A first implementation
+ based on bloom filters can be used and triggered using the WebAdmin REST API. See
+ <a href="manage-webadmin.html#Running_blob_garbage_collection">Running blob garbage collection</a>.
+ In order to avoid concurrency issues upon garbage collection, we slice th blobs in generation, the two more recent
Review comment:
```suggestion
In order to avoid concurrency issues upon garbage collection, we slice the blobs in generation, the two more recent
```
##########
File path: src/site/xdoc/server/config-blobstore.xml
##########
@@ -53,14 +53,34 @@
The generated startup warning log can be deactivated via the <code>cassandra.blob.store.disable.startup.warning</code> environment
variable being positioned to <code>false</code>.
</dd>
+ Deduplication requires a garbage collector mechanism to effectively drop blobs. A first implementation
+ based on bloom filters can be used and triggered using the WebAdmin REST API. See
+ xref:distributed/operate/webadmin.adoc#_running_blob_garbage_collection[Running blob garbage collection].
+ In order to avoid concurrency issues upon garbage collection, we slice th blobs in generation, the two more recent
+ generation are not garbage collected.
+
+ *deduplication.gc.generation.duration*: Allow controlling the duration of one generation. Longer implies better deduplication
+ but deleted blobs will live longer. Duration, defaults on 30 days, the default unit is days.
+
+ *deduplication.gc.generation.family*: Every time the duration is changes, this integer counter must be incremented to avoid
+ conflicts. Defaults to 1.
<dt><strong>deduplication/enable</strong></dt>
<dd>Mandatory. Supported value: true and false.</dd>
<dd>If you choose to enable deduplication, the mails with the same content will be stored only once.</dd>
<dd>Warning: Once this feature is enabled, there is no turning back as turning it off will lead to the deletion of all</dd>
<dd>the mails sharing the same content once one is deleted.</dd>
- <dd>This feature also requires a garbage collector mechanism to effectively drop blobs, which is not implemented yet.</dd>
- <dd>Consequently, all the requested deletions will not be performed, meaning that blobstore will only grow.</dd>
+ <dd>This feature also requires a garbage collector mechanism to effectively drop blobs. A first implementation
+ based on bloom filters can be used and triggered using the WebAdmin REST API. See
+ <a href="manage-webadmin.html#Running_blob_garbage_collection">Running blob garbage collection</a>.
+ In order to avoid concurrency issues upon garbage collection, we slice th blobs in generation, the two more recent
+ generation are not garbage collected.</dd>
+ <dd><strong>deduplication.gc.generation.duration</strong></dd>
+ <dd>Allow controlling the duration of one generation. Longer implies better deduplication
+ but deleted blobs will live longer. Duration, defaults on 30 days, the default unit is days.</dd>
+ <dd><strong>deduplication.gc.generation.family</strong></dd>
+ <dd>Every time the duration is changes, this integer counter must be incremented to avoid
Review comment:
```suggestion
<dd>Every time the duration is changed, this integer counter must be incremented to avoid
```
##########
File path: src/site/xdoc/server/config-blobstore.xml
##########
@@ -53,14 +53,34 @@
The generated startup warning log can be deactivated via the <code>cassandra.blob.store.disable.startup.warning</code> environment
variable being positioned to <code>false</code>.
</dd>
+ Deduplication requires a garbage collector mechanism to effectively drop blobs. A first implementation
+ based on bloom filters can be used and triggered using the WebAdmin REST API. See
+ xref:distributed/operate/webadmin.adoc#_running_blob_garbage_collection[Running blob garbage collection].
+ In order to avoid concurrency issues upon garbage collection, we slice th blobs in generation, the two more recent
+ generation are not garbage collected.
+
+ *deduplication.gc.generation.duration*: Allow controlling the duration of one generation. Longer implies better deduplication
+ but deleted blobs will live longer. Duration, defaults on 30 days, the default unit is days.
Review comment:
```suggestion
but deleted blobs will live longer. Duration, defaults on 30 days, the default unit is in days.
```
##########
File path: src/site/xdoc/server/config-blobstore.xml
##########
@@ -53,14 +53,34 @@
The generated startup warning log can be deactivated via the <code>cassandra.blob.store.disable.startup.warning</code> environment
variable being positioned to <code>false</code>.
</dd>
+ Deduplication requires a garbage collector mechanism to effectively drop blobs. A first implementation
+ based on bloom filters can be used and triggered using the WebAdmin REST API. See
+ xref:distributed/operate/webadmin.adoc#_running_blob_garbage_collection[Running blob garbage collection].
+ In order to avoid concurrency issues upon garbage collection, we slice th blobs in generation, the two more recent
Review comment:
```suggestion
In order to avoid concurrency issues upon garbage collection, we slice the blobs in generation, the two more recent
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: notifications-unsubscribe@james.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org