Posted to notifications@james.apache.org by GitBox <gi...@apache.org> on 2021/08/13 07:40:08 UTC

[GitHub] [james-project] vttranlina commented on a change in pull request #594: ADR Deduplicated blob garbage collection with Bloom Filters

vttranlina commented on a change in pull request #594:
URL: https://github.com/apache/james-project/pull/594#discussion_r688309427



##########
File path: src/adr/0049-deduplicated-blobs-gs-with-bloom-filters.md
##########
@@ -0,0 +1,110 @@
+# 49. Deduplicated blobs garbage collection with bloom filters
+
+Date: 2021-07-21
+
+## Status
+
+Accepted (lazy consensus).
+
+Not yet implemented.
+
+Proposes a simple-to-implement alternative to [39. Distributed blob garbage collector](0039-distributed-blob-garbage-collector.md).
+
+## Context
+
+The bodies, headers, and attachments of mails are stored as blobs in a blob store.
+
+In order to save space in those stores, blobs are de-duplicated using a hash of their content.
+
+To achieve this, the current blob store reads the content of the blob before saving it, and generates its id based on
+a hash of this content. This way, two blobs with the same content share the same id and are thus saved only once.
+
+This makes the safe deletion of one of those blobs a non-trivial problem, as we can't delete a blob without ensuring
+that all references to it are themselves deleted. For example, if two messages share the same blob, there is currently
+no way to tell, when deleting one message, whether the blob is still referenced by the other.
+
+## Decision
+
+To solve this, we propose a simple two-step algorithm run as a background garbage collection job.
+
+The **first step** consists of building a bloom filter from the entities referencing the blobs.
+
+In the **second step**, we iterate over the blobs and check them against the bloom filter to predict whether they are referenced or not.
+
+**Bloom filters** are probabilistic data structures. Here the reference prediction can produce false positives: we might
+skip some non-referenced blobs that should have been garbage collected. However, the associated probability can be tuned,
+and by adding a salt we can ensure subsequent runs have different sets of false positives, so that every unreferenced blob is
+eventually garbage collected.
+
+To avoid concurrency issues, where we could garbage collect a blob at the same time a new reference to it appears,
+a `reference generation` notion will be added. The de-duplicating ids of the blobs, which were previously constructed
+using only the hash of their content, will now include this `reference generation` too. To avoid synchronisation

Review comment:
       synchronisation -> synchronization

##########
File path: src/adr/0049-deduplicated-blobs-gs-with-bloom-filters.md
##########
@@ -0,0 +1,110 @@
+# 49. Deduplicated blobs garbage collection with bloom filters
+
+Date: 2021-07-21
+
+## Status
+
+Accepted (lazy consensus).
+
+Not yet implemented.
+
+Proposes a simple-to-implement alternative to [39. Distributed blob garbage collector](0039-distributed-blob-garbage-collector.md).
+
+## Context
+
+The bodies, headers, and attachments of mails are stored as blobs in a blob store.
+
+In order to save space in those stores, blobs are de-duplicated using a hash of their content.
+
+To achieve this, the current blob store reads the content of the blob before saving it, and generates its id based on
+a hash of this content. This way, two blobs with the same content share the same id and are thus saved only once.
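+
+As a rough illustration, a de-duplicating id derived from a content hash could look like the sketch below (the helper
+name and the choice of SHA-256 are assumptions made for this example, not the actual James blob id factory):
+
+```java
+import com.google.common.hash.Hashing;
+
+class DeduplicationIdSketch {
+    // Two blobs with identical bytes produce the same id, so saving the
+    // same content twice is effectively a no-op in the blob store.
+    static String deduplicatingId(byte[] content) {
+        return Hashing.sha256().hashBytes(content).toString();
+    }
+}
+```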
+
+This makes the safe deletion of one of those blobs a non-trivial problem, as we can't delete a blob without ensuring
+that all references to it are themselves deleted. For example, if two messages share the same blob, there is currently
+no way to tell, when deleting one message, whether the blob is still referenced by the other.
+
+## Decision
+
+To solve this, we propose a simple two-step algorithm run as a background garbage collection job.
+
+The **first step** consists of building a bloom filter from the entities referencing the blobs.
+
+In the **second step**, we iterate over the blobs and check them against the bloom filter to predict whether they are referenced or not.
+
+**Bloom filters** are probabilistic data structures. Here the reference prediction can produce false positives: we might
+skip some non-referenced blobs that should have been garbage collected. However, the associated probability can be tuned,
+and by adding a salt we can ensure subsequent runs have different sets of false positives, so that every unreferenced blob is
+eventually garbage collected.
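+
+A minimal sketch of the two steps, assuming Guava's `BloomFilter` and hypothetical inputs (`referencedBlobIds`,
+`storedBlobIds`); this is only an illustration, not the decided implementation:
+
+```java
+import com.google.common.hash.BloomFilter;
+import com.google.common.hash.Funnels;
+import java.nio.charset.StandardCharsets;
+import java.util.List;
+import java.util.UUID;
+import java.util.function.Consumer;
+
+class BloomFilterGcSketch {
+    // Step 1: build a salted bloom filter from every blobId still referenced by some entity.
+    static BloomFilter<String> buildFilter(List<String> referencedBlobIds, String salt) {
+        BloomFilter<String> filter = BloomFilter.create(
+            Funnels.stringFunnel(StandardCharsets.UTF_8),
+            Math.max(referencedBlobIds.size(), 1),
+            0.01); // tunable false-positive probability
+        referencedBlobIds.forEach(blobId -> filter.put(salt + blobId));
+        return filter;
+    }
+
+    // Step 2: iterate over stored blobs and delete only those the filter has definitely never seen.
+    static void collect(List<String> storedBlobIds, BloomFilter<String> filter, String salt, Consumer<String> deleteBlob) {
+        storedBlobIds.stream()
+            .filter(blobId -> !filter.mightContain(salt + blobId)) // bloom filters have no false negatives
+            .forEach(deleteBlob);
+    }
+
+    public static void main(String[] args) {
+        String salt = UUID.randomUUID().toString(); // a fresh salt per run shifts the false positives
+        BloomFilter<String> filter = buildFilter(List.of("blob-a", "blob-b"), salt);
+        collect(List.of("blob-a", "blob-b", "blob-orphan"), filter, salt,
+            blobId -> System.out.println("delete " + blobId));
+    }
+}
+```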
+
+To avoid concurrency issues, where we could garbage collect a blob at the same time a new reference to it appears,
+a `reference generation` notion will be added. The de-duplicating ids of the blobs, which were previously constructed
+using only the hash of their content, will now include this `reference generation` too. To avoid synchronisation
+issues, the `generation` will be time-based.
+
+So only blobs belonging to the `reference generation` `n-2` will be eligible for garbage collection, in order to avoid
+concurrency issues and to allow for clock skew.
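+
+A possible reading of the time-based generation and of the `n-2` eligibility rule (the one-month duration and the
+helper names below are assumptions made for the sake of the example):
+
+```java
+import java.time.Duration;
+import java.time.Instant;
+
+class GenerationSketch {
+    // Assumed duration; the actual value is meant to be configurable.
+    static final Duration GENERATION_DURATION = Duration.ofDays(30);
+
+    // A blob written now is tagged with the generation derived from the clock.
+    static long currentGeneration(Instant now) {
+        return now.getEpochSecond() / GENERATION_DURATION.getSeconds();
+    }
+
+    // Only blobs at least two generations old are eligible, leaving room for
+    // in-flight references and for clock skew between nodes.
+    static boolean eligibleForGc(long blobGeneration, Instant now) {
+        return blobGeneration <= currentGeneration(now) - 2;
+    }
+}
+```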
+
+Finally, we wish to offer the opportunity to configure, and reconfigure, the `generation` duration. In order to do so,
+we introduce a `generation family` part in the blobId. Incremented by the administrator on each change of the generation
+duration, it avoids conflicts between generations that would otherwise get the same number before and after the change:
+all blobIds with a different family are considered to belong to a distinct generation that is ready to be garbage collected. This
+allows arbitrary changes in the generation duration.
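+
+For instance, the blobId could concatenate the family, the generation, and the content hash; the exact layout below is
+only hypothetical:
+
+```java
+class BlobIdSketch {
+    // Hypothetical layout: <generation family>_<generation>_<content hash>.
+    // Bumping the family on a duration change prevents old and new generations
+    // from ever sharing a number, so old-family blobs can be collected safely.
+    static String blobId(int family, long generation, String contentHash) {
+        return family + "_" + generation + "_" + contentHash;
+    }
+}
+```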
+
+## Consequences
+
+We need to be able to list the blobIds of the blobs within the BlobStore. This operation should be supported by the BlobStore
+to prevent deduplication issues.
+
+We need to introduce a new API, `BlobReferenceSource`, listing the blobIds currently referenced by an entity. We will
+need to implement it for each kind of entity: attachments, messages, the mail queue, and mail repositories.
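+
+A sketch of what the `BlobReferenceSource` API and the blob listing operation could look like; only the
+`BlobReferenceSource` name comes from this document, the method names and reactive return types are assumptions:
+
+```java
+import org.apache.james.blob.api.BlobId;
+import org.reactivestreams.Publisher;
+
+// One implementation per kind of entity (attachments, messages, mail queue, mail repositories).
+interface BlobReferenceSource {
+    Publisher<BlobId> listReferencedBlobs();
+}
+
+// The blob store side needs a matching listing operation over all stored blobs.
+interface BlobIdLister {
+    Publisher<BlobId> listBlobs();
+}
+```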
+
+Garbage collection is a heavyweight task whose complexity is proportional to the dataset size.
+
+Regarding the tradeoff on the garbage collection generation duration:
+ - The longer the generation, the longer it will take to effectively delete blobs.
+ - However, the longer the generation, the more efficient deduplication will be (deduplication is of course scoped to a single
+   genetation).

Review comment:
       genetation -> generation




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@james.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


