Posted to oak-dev@jackrabbit.apache.org by Matt Ryan <os...@mvryan.org> on 2017/08/04 00:27:39 UTC

[CompositeBlobStore] Delegate traversal algorithm

Hi,

I’ve been thinking the past few days about how a composite blob store might
go about prioritizing the delegate blob stores for reading and writing,
considering concepts like storage filters on a blob store, read-only blob
stores, and archive or “cold” blob stores (which we don’t currently have,
but could in the future).

Storage filters basically restrict what can be stored in a delegate - like
saying only blobs with a certain JCR property, etc.  (I realize there are
implications with this too - I’ll worry about that in a separate thread
someday.)

I’d like feedback on the following idea:
- Create a new public interface in Oak that can be injected into the
composite blob store and used to handle the delegate prioritization for
reads and writes.
- Create a default implementation of this interface that can be used in
most cases (see below).

This would allow the prioritization logic to be extended with new or more
custom algorithms for any future use cases, as needed, without tying it to
configuration.
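
To make the shape of that concrete, here is a minimal sketch of what such
an interface might look like.  The interface and method names are
hypothetical - nothing like this exists in Oak today - and only BlobStore
is a real Oak type:

import java.util.List;

import org.apache.jackrabbit.oak.spi.blob.BlobStore;

// Hypothetical - a placeholder for the proposed traversal interface.
public interface DelegateTraversalStrategy {

    // Delegates in the order they should be consulted when reading blobId.
    List<BlobStore> readSequence(String blobId);

    // Delegates in the order they should be tried when writing blobId.
    List<BlobStore> writeSequence(String blobId);
}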

The default implementation would be basically this:
- For reads:
  - Delegates with storage filters first
  - Delegates without storage filters next
  - Read-only delegates next (with filters first, then without)
  - Retry reads on delegates with filters that were previously skipped
(this is a special case)
  - Cold storage delegates last

- For writes:
  - Search for an existing blob first using the “read” algorithm - always
update an existing blob, if one is found (except in cold storage)
  - If not found:
    - Try delegates with storage filters first
    - Delegates without storage filters next

The special case to retry reads on delegates with filters that were
previously skipped is to handle configuration changes.  Essentially, if a
blob is stored in a delegate blob store, and then the configuration for
that delegate changes so that the blob wouldn’t be stored there if it were
being written now, we want to be able to locate it during the time between
when the configuration change happens and when some background curator
moves the blob to the correct location.
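
To illustrate the read ordering above, here is a rough sketch of how the
default implementation might rank delegates.  The Delegate class and its
flags are hypothetical stand-ins, and the special-case retry pass for
previously skipped filtered delegates is omitted for brevity:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class Delegate {
    final boolean hasFilter;
    final boolean readOnly;
    final boolean coldStorage;

    Delegate(boolean hasFilter, boolean readOnly, boolean coldStorage) {
        this.hasFilter = hasFilter;
        this.readOnly = readOnly;
        this.coldStorage = coldStorage;
    }

    // Lower rank means the delegate is consulted earlier on reads.
    int readRank() {
        if (coldStorage) {
            return 4;                     // cold storage delegates last
        }
        if (readOnly) {
            return hasFilter ? 2 : 3;     // read-only after read-write
        }
        return hasFilter ? 0 : 1;         // filtered read-write first
    }
}

class DefaultReadOrder {
    static List<Delegate> order(List<Delegate> delegates) {
        List<Delegate> ordered = new ArrayList<>(delegates);
        ordered.sort(Comparator.comparingInt(Delegate::readRank));
        return ordered;
    }
}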


So in short, I’d do the default implementation as described, but a
different implementation could be injected instead, if someone wanted a
more custom one.


WDYT?


-MR

Re: [CompositeBlobStore] Delegate traversal algorithm

Posted by Matt Ryan <os...@mvryan.org>.
Hi Thomas et al,

I’m still working through some issues on the wiki, but I updated much of it
to address some of Thomas’ concerns.

One area where I have simplified things has to do with the write process.
But I’m not sure my line of thinking is best.

Assume the data store receives a request to write blob ID 12345.  Assume
there are two delegate blob stores, D1 and D2.  Assume D1 has higher
priority than D2 and that D2 already contains blob ID 12345 - meaning the
blob is already stored in D2.  Assume that there is no restriction to
disallow the blob from being stored in D1.

Since D1 has higher priority than D2, blob ID 12345 should be stored in D1,
but instead is stored in D2.

Originally I thought the system should look for a match for blob ID 12345,
which it would find in D2; the question then is whether it should update
the last access time of blob ID 12345 in D2, or instead write to D1, which
is the proper location for it.

After Thomas’ comments I wonder if it should write to D1.  Future reads
would get the most recently written version so the system would work
consistently.  If that were to happen, I assume blob ID 12345 in D2 would
now be unreferenced and eventually garbage collected.  Is that true?  Would
we consider it appropriate to just let it get garbage collected?

Or would it be better to simply update the last access time of blob ID
12345 in D2 and continue referencing it from that location?
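
As a rough illustration of the first option, here is a hedged sketch; the
wrapper class is made up, and only DataStore.addRecord() is the real
Jackrabbit API:

import java.io.InputStream;
import java.util.List;

import org.apache.jackrabbit.core.data.DataRecord;
import org.apache.jackrabbit.core.data.DataStore;
import org.apache.jackrabbit.core.data.DataStoreException;

// Hypothetical: always write to the highest-priority delegate that accepts
// the blob, even though a copy already exists in a lower-priority delegate.
// The copy in D2 then becomes unreferenced and is left for garbage
// collection to reclaim.
class WriteToHighestPriority {
    private final List<DataStore> delegatesByPriority;   // e.g. [D1, D2]

    WriteToHighestPriority(List<DataStore> delegatesByPriority) {
        this.delegatesByPriority = delegatesByPriority;
    }

    DataRecord addRecord(InputStream stream) throws DataStoreException {
        // addRecord() is idempotent for identical content, so writing the
        // blob again to D1 is safe even though D2 already holds it.
        return delegatesByPriority.get(0).addRecord(stream);
    }
}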


-MR


On August 15, 2017 at 4:06:56 PM, Matt Ryan (oss@mvryan.org) wrote:

[quoted messages elided; they appear in full elsewhere in this thread]

Re: [CompositeBlobStore] Delegate traversal algorithm

Posted by Thomas Mueller <mu...@adobe.com.INVALID>.
Hi,

The Bloom filter is something to consider, to speed up reading. It's not strictly needed of course.

Yes, I would consider using Bloom filters to more quickly find out where an entry is stored, if there are multiple possibilities. So, one filter per "delegate". In our case, the most logical place to do that is for the read-only stores. They could also be used for read-write stores (created during garbage collection for example). Sure, they would not always be up-to-date, but most (let's say 90%) binaries are older than the last GC, so it would speed up that case (and have basically no cost for new entries, as the filter is in memory).
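
As a sketch of the per-delegate filter idea (assuming Guava's BloomFilter,
which Oak already has on its classpath; the wiring shown is hypothetical):

import java.nio.charset.StandardCharsets;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

// One filter per delegate; it could be (re)built while iterating all blob
// ids during garbage collection, as suggested above.
class DelegateBloomFilter {
    private final BloomFilter<String> filter =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8),
                    1_000_000,   // expected number of blob ids
                    0.01);       // accepted false-positive probability

    void recordBlobId(String blobId) {
        filter.put(blobId);
    }

    // false means "definitely not in this delegate", so the (possibly
    // remote) lookup can be skipped; true means "maybe", so the delegate
    // must still be consulted.
    boolean mightContain(String blobId) {
        return filter.mightContain(blobId);
    }
}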

Regards,
Thomas





On 16.08.17, 02:25, "Matt Ryan" <os...@mvryan.org> wrote:

    [quoted messages elided; they appear in full elsewhere in this thread]


Re: [CompositeBlobStore] Delegate traversal algorithm

Posted by Matt Ryan <os...@mvryan.org>.
Hi Thomas (and everyone else):

I wanted to ask about a comment you made in the wiki where you said "Bloom
filters should be mentioned (are they used, if yes how, if not why not).”
I assume that since you included that comment, you think they probably
should be used.

I believe the intended use of a Bloom filter in this case would be for read
operations, to quickly determine if a blob id is not stored anywhere in the
system.  Let me know if you had another use in mind.

If that’s the use case, I wonder how we would reasonably come up with a
useful guess as to the appropriate size of the filter.  Someone with more
experience using them could maybe offer some insight here as to appropriate
values for an expected number of insertions and the appropriate expected
false positive probability.
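
For a rough sense of scale (standard Bloom filter math, not an
Oak-specific recommendation): the optimal filter size is
m = -n * ln(p) / (ln 2)^2 bits for n insertions at false-positive
probability p, so even generous guesses stay small:

// Standard Bloom filter sizing; the example values are arbitrary.
class BloomFilterSizing {
    public static void main(String[] args) {
        long n = 1_000_000;   // expected number of blob ids
        double p = 0.01;      // desired false-positive probability (1%)
        double bits = -n * Math.log(p) / (Math.log(2) * Math.log(2));
        // Roughly 9.6 bits per entry, about 1.1 MiB in total here, so
        // overestimating the insertion count is fairly cheap.
        System.out.printf("%.0f bits (%.1f MiB)%n",
                bits, bits / 8 / 1024 / 1024);
    }
}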

It seems like we could also use more than one Bloom filter, one for each
delegate to say whether the blob id is located in that particular
delegate.  Not sure if you were thinking more along those lines or just a
single Bloom filter for the entire composite as a whole, or both.

-MR

On August 15, 2017 at 4:06:56 PM, Matt Ryan (oss@mvryan.org) wrote:

[quoted messages elided; they appear in full elsewhere in this thread]

Re: [CompositeBlobStore] Delegate traversal algorithm

Posted by Matt Ryan <os...@mvryan.org>.
Hi Thomas,

After emailing I saw you also provided comments in-line on the wiki.  I’ll
work through those and reply back on-list when I think I have addressed
them.  Thanks for doing that also!

-MR


On August 15, 2017 at 2:01:04 PM, Matt Ryan (oss@mvryan.org) wrote:

[quoted messages elided; they appear in full elsewhere in this thread]

Re: [CompositeBlobStore] Delegate traversal algorithm

Posted by Matt Ryan <os...@mvryan.org>.
Hi Thomas,

Thank you for taking the time to offer a review.  I’ve been going through
the suggested readings and will continue to do so.

Some comments inline below.


On August 15, 2017 at 12:25:54 AM, Thomas Mueller (mueller@adobe.com.invalid)
wrote:

Hi,

It is important to understand which operations are available in the JCR
API, the DataStore API, and the concept of revisions we use for Oak. For
example,

* The DataStore API doesn’t support updating a binary.


This is of course true.  The interface supports only an “addRecord()”
capability to put a blob into the data store.  The javadoc there clearly
expects the possibility that the record may already exist:  "If the same
stream already exists in another record, then that record is returned
instead of creating a new one.”

Implementations handle the details of what happens when the blob already
exists.  For example, the “write()” method in the S3Backend class clearly
distinguishes between the two, since the way to handle this via the AWS
SDK differs for an update versus a create:
https://svn.apache.org/repos/asf/jackrabbit/oak/trunk/oak-blob-cloud/src/main/java/org/apache/jackrabbit/oak/blob/cloud/s3/S3Backend.java

It is still the case that from the data store’s point of view there is no
difference between the two, so it doesn’t support a distinction.

The original data store concept can take this approach because it only has
one place for the data to go.  The composite blob store has more than one
place the data could go, so I believe there is a possibility that the data
could exist in a delegate blob store that is not the first blob store that
the data could be written to.

What should happen in that case?  I assumed we should try to find a match
first, and prefer updating to creating new.  I’m not sure exactly how that
would happen though, since the name only matches if the content hash is the
same (unless there’s a collision of course), and otherwise it’s a new blob
anyway.
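
One hedged way it could happen: spool the incoming stream once while
hashing it, so the content-derived id is known before any delegate is
consulted.  Everything below is a sketch, not Oak code; it assumes the
delegates use a SHA-1 content hash as the record id, the way Jackrabbit's
FileDataStore does:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

import org.apache.jackrabbit.core.data.DataIdentifier;
import org.apache.jackrabbit.core.data.DataRecord;
import org.apache.jackrabbit.core.data.DataStore;
import org.apache.jackrabbit.core.data.DataStoreException;

// Hypothetical "find a match first" write path for a composite store.
class FindMatchFirstWriter {
    private final List<DataStore> delegatesInWriteOrder;

    FindMatchFirstWriter(List<DataStore> delegatesInWriteOrder) {
        this.delegatesInWriteOrder = delegatesInWriteOrder;
    }

    DataRecord addRecord(InputStream in)
            throws IOException, NoSuchAlgorithmException, DataStoreException {
        Path spool = Files.createTempFile("blob", ".tmp");
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            try (DigestInputStream din = new DigestInputStream(in, sha1)) {
                Files.copy(din, spool, StandardCopyOption.REPLACE_EXISTING);
            }
            DataIdentifier id = new DataIdentifier(toHex(sha1.digest()));
            // Prefer an existing copy anywhere over creating a new one.
            for (DataStore delegate : delegatesInWriteOrder) {
                DataRecord existing = delegate.getRecordIfStored(id);
                if (existing != null) {
                    return existing;
                }
            }
            // No match: write to the first delegate in priority order.
            try (InputStream spooled = Files.newInputStream(spool)) {
                return delegatesInWriteOrder.get(0).addRecord(spooled);
            }
        } finally {
            Files.deleteIfExists(spool);
        }
    }

    private static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}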



* A node might have multiple revisions.
* In the Oak revision model, you can't update a reference of an old
revision.


Does the data store even know about this?  I assumed this was all handled
at a higher level, and that once the data store is told to add a record
it’s already been determined that the write is okay, even if it ends up
that the stream being written already exists somewhere.



* The JCR API allows creating binaries without nodes via ValueFactory (so
it's not possible to use storage filters at that time).


I admit, I’m unsure how this might ever work.  Maybe it would have to be
solved by curation later.



What you didn't address is how to read if there are multiple possible
storage locations, so I assume you didn't think about that case. In my
view, this should be supported. You might want to read up on LSM trees on
how to do that: using bloom filters for example.


I’ll look at what I wrote again and try to make this more clear.  I did
think about it - at least if we are talking about the same thing.

What I was thinking was that a read would go through each delegate in the
specified order and attempt to find a match, returning the first match
found.  The order and the algorithm used would depend on the traversal
strategy implementation.  It would be possible to use LSM trees - I can
see how they would be used - although I wonder if that would be overkill,
since I’d expect most real-world uses of the composite data store to have
two or three delegates at most.  What do you think?
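
A minimal sketch of that first-match read, using the real Jackrabbit
DataStore API but a made-up composite wrapper; the delegate order would
come from the traversal strategy:

import java.util.List;

import org.apache.jackrabbit.core.data.DataIdentifier;
import org.apache.jackrabbit.core.data.DataRecord;
import org.apache.jackrabbit.core.data.DataStore;
import org.apache.jackrabbit.core.data.DataStoreException;

// Hypothetical composite read: consult delegates in order, first hit wins.
class CompositeRead {
    private final List<DataStore> orderedDelegates;

    CompositeRead(List<DataStore> orderedDelegates) {
        this.orderedDelegates = orderedDelegates;
    }

    DataRecord getRecordIfStored(DataIdentifier id) throws DataStoreException {
        for (DataStore delegate : orderedDelegates) {
            DataRecord record = delegate.getRecordIfStored(id);
            if (record != null) {
                return record;    // first match found is returned
            }
        }
        return null;    // the blob is not stored in any delegate
    }
}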

Is that what you were referring to, or are you talking about what to do if
a blob exists in more than one location at the same time?  Or something
else entirely?  I’m not sure I understand what you are referring to.


Thanks again for the review.


-MR

Re: [CompositeBlobStore] Delegate traversal algorithm

Posted by Thomas Mueller <mu...@adobe.com.INVALID>.
Hi,

It is important to understand which operations are available in the JCR API, the DataStore API, and the concept of revisions we use for Oak. For example, 

* The DataStore API doesn’t support updating a binary.
* A node might have multiple revisions.
* In the Oak revision model, you can't update a reference of an old revision.
* The JCR API allows creating binaries without nodes via ValueFactory (so it's not possible to use storage filters at that time).

What you didn't address is how to read if there are multiple possible storage locations, so I assume you didn't think about that case. In my view, this should be supported. You might want to read up on LSM trees to see how to do that: using Bloom filters, for example.

Suggested readings:
* https://docs.adobe.com/content/docs/en/spec/jsr170/javadocs/jcr-2.0/index.html
* https://docs.adobe.com/content/docs/en/spec/jcr/1.0/index.html
* https://en.wikipedia.org/wiki/Content-addressable_storage
* https://en.wikipedia.org/wiki/Log-structured_merge-tree

Regards,
Thomas



On 15.08.17, 08:00, "Thomas Mueller" <mu...@adobe.com> wrote:

    [quoted messages elided; they appear in full elsewhere in this thread]


Re: [CompositeBlobStore] Delegate traversal algorithm

Posted by Thomas Mueller <mu...@adobe.com.INVALID>.
Hi,

I read your wiki update, and this caught my eye:

>  If a match is found, the write is treated as an update; if no match is found, the write is treated as a create.

In the DataStore, there is no such thing as an update. There are only the following operations:

* write
* read
* delete, via garbage collection

See also https://en.wikipedia.org/wiki/Content-addressable_storage

Regards,
Thomas


On 14.08.17, 17:17, "Matt Ryan" <os...@mvryan.org> wrote:

    [quoted messages elided; they appear in full elsewhere in this thread]


Re: [CompositeBlobStore] Delegate traversal algorithm

Posted by Matt Ryan <os...@mvryan.org>.
Bump.  If anyone has feedback I’d love to hear it.


On August 3, 2017 at 6:27:39 PM, Matt Ryan (oss@mvryan.org) wrote:

[quoted message elided; it appears in full at the top of this thread]