Posted to dev@openwhisk.apache.org by Chetan Mehrotra <ch...@gmail.com> on 2018/03/26 10:28:30 UTC

AttachmentStore - Handling concurrent attachment updates

Last week I had a Slack call with Rodric about the AttachmentStore PR,
and as part of that we discussed the problem of handling concurrent
updates of attachments. The details below are based on that discussion.

As of now CouchDB can detect concurrent updates of an attachment thanks
to its built-in MVCC support. However, most of the object stores (like
S3/IBM COS, Azure Blob Store etc.) which are to be used for the new
AttachmentStore implementation do not provide any conditional update
and are designed more for immutable storage.

Consider an action update sequence, which is currently done in 2 parts:

1. Update the document
2. Upload the attachment

Now consider an AttachmentStore implementation (as per PR #3453
design) which stores attachment content against a key like

   whiskentity/<doc id>/<attachment name>

Where

1. whiskentity - Key prefix to store attachments related to Whisk entities
2. <doc id> - Id of the document to which the attachment is attached
3. <attachment name> - Name of the attachment, like `jarfile`

Object stores are optimized for direct key lookup and also allow
searches based on a key prefix. Hence the use of such a format, which
allows direct attachment lookup for readAttachment and lookup of all
attachments related to a specific doc for deleteAttachments.
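
A minimal sketch of this key layout, to make the two lookup styles
concrete. The object and method names are hypothetical and are not taken
from PR #3453; the listing is emulated over an in-memory set of keys.

```scala
// Hypothetical helper; names are illustrative only.
object AttachmentKeys {
  val Prefix = "whiskentity"

  // Full key used for a direct lookup of one attachment (readAttachment).
  def objectKey(docId: String, attachmentName: String): String =
    s"$Prefix/$docId/$attachmentName"

  // Key prefix matching every attachment of one document (deleteAttachments).
  def docPrefix(docId: String): String =
    s"$Prefix/$docId/"

  // Emulates a prefix listing over a set of stored keys.
  def keysForDoc(allKeys: Set[String], docId: String): Set[String] =
    allKeys.filter(_.startsWith(docPrefix(docId)))
}
```

The trailing slash in the prefix keeps a doc id such as `ns/hello` from
also matching `ns/hello2`.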

Now consider the following flow:

1. thread 1: updates the document and succeeds
2. thread 2: updates the document (based on thread 1) and succeeds
3. thread 2: attaches i.e. writes an attachment to the AttachmentStore
4. thread 1: attaches

This would result in a race condition where, in the end, the attachment
meant for the document state at #1 gets linked to the document at state
#2. To handle such cases we should switch to an immutable attachment
design.
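
The interleaving above can be replayed deterministically with in-memory
maps standing in for the document store and the object store. This is
only an illustration of the overwrite; the names and values are made up.

```scala
import scala.collection.mutable

object MutableAttachmentRace {
  // In-memory stand-ins (assumptions, not the real stores).
  val docs = mutable.Map.empty[String, String]  // doc id -> document state
  val blobs = mutable.Map.empty[String, String] // object key -> attachment content

  def key(docId: String): String = s"whiskentity/$docId/jarfile"

  // Replays the four steps in order; returns (final document, final blob).
  def replay(): (String, String) = {
    docs.clear()
    blobs.clear()
    docs("ns/hello") = "doc-from-thread-1"       // 1. thread 1 updates the document
    docs("ns/hello") = "doc-from-thread-2"       // 2. thread 2 updates the document
    blobs(key("ns/hello")) = "jar-from-thread-2" // 3. thread 2 attaches
    blobs(key("ns/hello")) = "jar-from-thread-1" // 4. thread 1 attaches, overwriting step 3
    (docs("ns/hello"), blobs(key("ns/hello")))
  }
}
```

Because both writers share one key, the document written by thread 2
ends up paired with the binary uploaded by thread 1.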

A - Proposal - Use Immutable Attachments
----------------------------------------------------

In the current flow we perform an "update" of an existing attachment
with a given name. For example, the current action update flow is:

1. Put document with attachment info

    "exec": {
        "kind": "java",
        "code": {
            "attachmentName": "jarfile",
            "attachmentType": "application/java-archive"
        },
        "binary": true,
        "main": "Hello"
    }

2. Attach the attachment with name set to value of `attachmentName`

Instead of that we should allow `ArtifactStore` (which in turn relies on
AttachmentStore) to generate the name and then save that name against
`attachmentName`. So the proposed flow is:

1. Upload the attachment and have ArtifactStore return a generated name

  protected[core] def attach(doc: DocInfo, contentType: ContentType, docStream: Source[ByteString, _])(
    implicit transid: TransactionId): Future[(DocInfo, AttachmentName)]

2. Then update the document with attachmentName set to name returned
in previous step

3. Then delete the old attachment after #2 completes successfully

With this approach the attachments would be immutable, and that would
enable:

1. Proper handling of concurrent updates
2. Simplified caching of attachments, as immutable objects can be cached easily
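
A rough, synchronous sketch of the proposed three steps, assuming
in-memory maps in place of the real stores and a random UUID as the
generated name. The real `attach` is asynchronous and returns a Future;
everything named here is hypothetical.

```scala
import java.util.UUID
import scala.collection.mutable

object ImmutableAttachmentFlow {
  // In-memory stand-ins (assumptions, not the real stores).
  val blobs = mutable.Map.empty[String, String] // generated name -> content
  val docs = mutable.Map.empty[String, String]  // doc id -> current attachmentName

  // Step 1: store the content under a freshly generated name and return it.
  def attach(content: String): String = {
    val name = UUID.randomUUID().toString
    blobs(name) = content
    name
  }

  // Steps 2 and 3: point the document at the new name, then delete the old blob.
  def update(docId: String, content: String): String = {
    val newName = attach(content)
    val oldName = docs.put(docId, newName) // previous name, if any
    oldName.foreach(n => blobs.remove(n))
    newName
  }

  // Resolves the attachment through the name recorded in the document.
  def read(docId: String): Option[String] =
    docs.get(docId).flatMap(blobs.get)
}
```

Since the name travels with the document, a reader always gets the
binary that belongs to the document state it loaded, and a blob is never
rewritten once stored.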

B - Orphaned Blob Garbage Collection
----------------------------------------------

With the above approach there is a possibility that some action update
flow fails midway, leaving orphaned blobs in the object store. To clean
them up we can implement garbage collection logic as part of wskadmin.
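
One possible shape for the orphan check, as a sketch only. It assumes we
can list stored blob names with their upload time and collect the names
referenced by documents. The minimum-age guard is my own assumption, to
avoid collecting a blob whose document update is still in flight.

```scala
object OrphanBlobGc {
  // stored: blob name -> upload time in millis; referenced: names used by documents.
  // Returns unreferenced blobs that are at least minAgeMillis old.
  def orphans(stored: Map[String, Long],
              referenced: Set[String],
              nowMillis: Long,
              minAgeMillis: Long): Set[String] =
    stored.collect {
      case (name, uploadedAt)
          if !referenced(name) && nowMillis - uploadedAt >= minAgeMillis =>
        name
    }.toSet
}
```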

Please share your feedback about the new proposal. I will start work
on a PR for the new proposal so that it's easier to discuss specific
semantics. Once this work is done we can come back to the AttachmentStore
PR and implement it as per the newer flow.

Chetan Mehrotra

Re: AttachmentStore - Handling concurrent attachment updates

Posted by Chetan Mehrotra <ch...@gmail.com>.
Hi Matt,

> Out of curiosity, should we be including some
> indicator of version of the attachment (binary) like a hash into the
> document naming for future update efficiencies (or even external security
> checks)?

The blobId is an opaque string for clients of ArtifactStore, so we can
encode more attributes there if required. For example, we can encode the
length of the binary as part of the id, so that if we need size details,
say to determine how to handle the binary, that can be done without a
remote call.
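
To illustrate, here is a sketch of one hypothetical id layout,
"<random uuid>-L<length in bytes>". Nothing about this format is decided;
it only shows that the size can be recovered from the id alone.

```scala
import java.util.UUID
import scala.util.Try

object BlobIds {
  // Hypothetical layout: "<random uuid>-L<length in bytes>".
  def generate(length: Long): String =
    s"${UUID.randomUUID()}-L$length"

  // Recovers the length from an id, or None if the id does not follow the layout.
  def lengthOf(blobId: String): Option[Long] = {
    val marker = blobId.lastIndexOf("-L")
    if (marker < 0) None
    else Try(blobId.substring(marker + 2).toLong).toOption
  }
}
```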

As to the topic of encoding a hash: most object stores do generate a
content hash and provide it as a metadata attribute and also as the
ETag value in the response after an upload is done. So that can be
fetched as part of some metadata API. If a user wants to validate that
the uploaded binary is stored correctly, we can return the ETag value
as part of the action create call.

Would such a metadata API meet the requirements you had in mind?

Chetan Mehrotra


On Mon, Mar 26, 2018 at 11:23 PM, Matt Rutkowski <mr...@us.ibm.com> wrote:
> Hi Chetan,
>
> Thanks for proposing this.  Out of curiosity, should we be including some
> indicator of version of the attachment (binary) like a hash into the
> document naming for future update efficiencies (or even external security
> checks)?  I guess these topics are on my mind given the hashing/signing of
> OW artifacts towards Apache release...
>
> Kind regards,
> Matt

Re: AttachmentStore - Handling concurrent attachment updates

Posted by Matt Rutkowski <mr...@us.ibm.com>.
Hi Chetan,

Thanks for proposing this.  Out of curiosity, should we be including some 
indicator of version of the attachment (binary) like a hash into the 
document naming for future update efficiencies (or even external security 
checks)?  I guess these topics are on my mind given the hashing/signing of 
OW artifacts towards Apache release...

Kind regards,
Matt



From:   Michael Marth <mm...@adobe.com.INVALID>
To:     "dev@openwhisk.apache.org" <de...@openwhisk.apache.org>
Date:   03/26/2018 11:49 AM
Subject:        Re: AttachmentStore - Handling concurrent attachment 
updates



Hi Chetan,

My2c: making the attachments immutable will yield great benefits as you
write below:

    1. Proper handling of concurrent updates
    2. Simplified caching of attachments as immutable objects can be
       cached easily

On #2: with immutable attachments caching becomes trivial which will help
with more distributed deployments (across different data centers).

Great proposal

Michael


Re: AttachmentStore - Handling concurrent attachment updates

Posted by Michael Marth <mm...@adobe.com.INVALID>.
Hi Chetan,

My2c: making the attachments immutable will yield great benefits as you
write below:

    1. Proper handling of concurrent updates
    2. Simplified caching of attachments as immutable objects can be
       cached easily

On #2: with immutable attachments caching becomes trivial which will help
with more distributed deployments (across different data centers).

Great proposal

Michael
