Posted to oak-dev@jackrabbit.apache.org by Thomas Mueller <mu...@adobe.com.INVALID> on 2017/07/03 13:37:45 UTC

Re: [DiSCUSS] - highly vs rarely used data

Hi,

> a property on the node, e.g. "archiveState=toArchive"

I wonder if we _can_ easily write to the version store? Also, some nodetypes don't allow such properties? It might need to be a hidden property, but then you can't use the JCR API. Or maintain this data in a "shadow" structure (not with the nodes), which would complicate move operations.

If I were a customer, I wouldn't want to *manually* mark / unmark binaries to be moved to / from long-term storage. I would probably just want to rely on automatic management. But I'm not a customer, so my opinion is not that relevant.

> Using a property directly specified for this purpose gives us more direct control over how it is being used I think.

Sure, but it also comes with some complexities.

Regards,
Thomas




Re: [DiSCUSS] - highly vs rarely used data

Posted by Davide Giannella <da...@apache.org>.
On 04/07/2017 11:48, Julian Sedding wrote:
> A much more important and difficult question to answer IMHO is how to
> deal with the slow retrieval of archived content. And if needed, how
> to expose the slow availability (i.e. unavailable now but available
> later) to the end user (or application layer). To me this sounds
> tricky if we want to stick to the JCR API.

I think we should NOT touch the JCR API but rather use/expose the Oak
API for such features, and let a consuming application leverage one,
the other or both.

If we are going to touch the JCR API we should probably sit down a while
and think about JCR API 3 ;)

Davide



Re: [DiSCUSS] - highly vs rarely used data

Posted by Thomas Mueller <mu...@adobe.com.INVALID>.
Hi,

> (a) the implementation of an automatism is not *quite* what they need/want
> (b) they want to be able to manually select (or more likely override)
> whether a file can be archived

Well, behind the scenes, we need a way to move entries to / from cold storage anyway. But in my view, that's a low-level API, and I wouldn't expose it at first. Instead I would concentrate on implementing an automatic solution that has no API (except for some config options). If it later turns out the low-level API is needed, it can still be added. I wouldn't introduce it as public API right from the start just because we _think_ it _might_ be needed at some point later, because having to maintain the API is expensive.

What I would introduce right from the start is a way to measure which binaries were read recently, and how frequently. But even for that, no public API is needed at first (except for maybe logging some statistics).
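To make the measuring part more concrete, here is a minimal sketch of an internal read tracker. All names (BinaryReadStats, recordRead, readSince) are made up for illustration and are not existing Oak classes; the current time is passed in explicitly so the behaviour is easy to check.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical internal helper, not part of Oak: tracks how often and how
// recently each binary was read, keyed by blob id.
class BinaryReadStats {
    private static final class Entry {
        long readCount;
        long lastReadMillis;
    }

    private final Map<String, Entry> entries = new HashMap<>();

    // Called whenever a binary is read.
    void recordRead(String blobId, long nowMillis) {
        Entry e = entries.computeIfAbsent(blobId, k -> new Entry());
        e.readCount++;
        e.lastReadMillis = nowMillis;
    }

    // Number of recorded reads; 0 for a binary that was never read.
    long readCount(String blobId) {
        Entry e = entries.get(blobId);
        return e == null ? 0 : e.readCount;
    }

    // True if the binary was read at or after the given point in time.
    boolean readSince(String blobId, long sinceMillis) {
        Entry e = entries.get(blobId);
        return e != null && e.lastReadMillis >= sinceMillis;
    }

    // One line per binary, suitable for logging some statistics.
    String summary() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Entry> me : entries.entrySet()) {
            sb.append(me.getKey()).append(": reads=").append(me.getValue().readCount)
              .append(", lastRead=").append(me.getValue().lastReadMillis).append('\n');
        }
        return sb.toString();
    }
}
```

Nothing here needs to be public API; an automatic mover could query readSince() internally, and summary() would only feed the log.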

> Thus I suggest to come up with a pluggable "strategy" interface

That is too abstract for me. I think it is very important to have a concrete behaviour and API, otherwise discussing it is not possible.

> A much more important and difficult question to answer IMHO is how to deal with the slow retrieval of archived content.

My concrete suggestion would be, as I wrote: if it's in cold storage, throw an exception saying so, and load the binary into hot storage. A few minutes later, re-reading will not throw an exception as it's in hot storage. So, there is no API change needed, except for a new exception class (subclass of RepositoryException). An application can catch those exceptions and deal with them in a special way (write that the binary is not currently available). Possibly the new exception could have a method "doNotMoveBinary()" in case moving is not needed, but by default the binary should be moved, so that old applications don't have to be changed at all (backward compatibility).
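A rough sketch of how this could look, under several assumptions: the exception name BinaryInColdStorageException and the TieredBinaryStore class are invented for illustration, the two maps stand in for real hot and cold back ends, and RepositoryException is a local stand-in for javax.jcr.RepositoryException so the example compiles without the JCR jar.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Stand-in for javax.jcr.RepositoryException (assumption, to stay self-contained).
class RepositoryException extends Exception {
    RepositoryException(String message) { super(message); }
}

// Hypothetical name; the proposal only asks for "a new exception class".
class BinaryInColdStorageException extends RepositoryException {
    private final Runnable cancelMove;

    BinaryInColdStorageException(String blobId, Runnable cancelMove) {
        super("Binary " + blobId + " is in cold storage and not available yet");
        this.cancelMove = cancelMove;
    }

    // Opt out of the default behaviour of moving the binary to hot storage.
    void doNotMoveBinary() { cancelMove.run(); }
}

// Toy two-tier store; plain maps stand in for the real back ends.
class TieredBinaryStore {
    private final Map<String, byte[]> hot = new HashMap<>();
    private final Map<String, byte[]> cold = new HashMap<>();
    private final Set<String> pendingMoves = new HashSet<>();

    void putCold(String blobId, byte[] data) { cold.put(blobId, data); }

    byte[] read(String blobId) throws RepositoryException {
        byte[] data = hot.get(blobId);
        if (data != null) return data;
        if (cold.containsKey(blobId)) {
            // By default a failed read schedules the move to hot storage.
            pendingMoves.add(blobId);
            throw new BinaryInColdStorageException(blobId, () -> pendingMoves.remove(blobId));
        }
        throw new RepositoryException("Unknown binary " + blobId);
    }

    // Simulates the slow background restore finishing "a few minutes later".
    void completePendingMoves() {
        for (String blobId : pendingMoves) hot.put(blobId, cold.remove(blobId));
        pendingMoves.clear();
    }
}

class ColdStorageDemo {
    public static void main(String[] args) throws Exception {
        TieredBinaryStore store = new TieredBinaryStore();
        store.putCold("b1", new byte[] {1, 2, 3});
        try {
            store.read("b1");
        } catch (BinaryInColdStorageException e) {
            System.out.println("Not available right now: " + e.getMessage());
        }
        store.completePendingMoves();
        System.out.println("Bytes after restore: " + store.read("b1").length);
    }
}
```

An old application that only catches RepositoryException keeps working unchanged and still triggers the move; only applications that know about the new subclass can call doNotMoveBinary().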

What is your concrete suggestion?

Regards,
Thomas 


Re: [DiSCUSS] - highly vs rarely used data

Posted by Bertrand Delacretaz <bd...@apache.org>.
Hi Thomas,

On Tue, Jul 11, 2017 at 3:14 PM, Thomas Mueller
<mu...@adobe.com.invalid> wrote:
> ...if it's in cold storage, throw an exception saying so, and load the binary into hot storage.
> A few minutes later, re-reading will not throw an exception as it's in hot storage....

Ok great, sorry that I missed that earlier.

Note that the exception should not prevent the client from getting the
rest of the data (other properties) of the same Node - I suppose
that's natural if the exception is thrown when calling
Binary.getStream().

-Bertrand

Re: [DiSCUSS] - highly vs rarely used data

Posted by Thomas Mueller <mu...@adobe.com.INVALID>.
Hi,

On 10.07.17, 11:18, "Bertrand Delacretaz" <bd...@apache.org> wrote:
> Throw an exception maybe? BinaryNotAvailableAtThisTime, including an
> ETA for availability. The application can then decide how to handle
> that.

Bertrand, this is exactly what I have suggested in two previous mails:

My concrete suggestion would be, as I wrote: if it's in cold storage, throw an exception saying so, and load the binary into hot storage. A few minutes later, re-reading will not throw an exception as it's in hot storage. So, there is no API change needed, except for a new exception class (subclass of RepositoryException). An application can catch those exceptions and deal with them in a special way (write that the binary is not currently available). Possibly the new exception could have a method "doNotMoveBinary()" in case moving is not needed, but by default the binary should be moved, so that old applications don't have to be changed at all (backward compatibility).

Regards,
Thomas
 


Re: [DiSCUSS] - highly vs rarely used data

Posted by Chetan Mehrotra <ch...@gmail.com>.
I would prefer a two-phase implementation here.

A - CompositeBlobStore
---------------------------------

Support multiple BlobStores plugged into an Oak setup and provide an
API for the layer above to select which BlobStore should be used. This
forms the lowermost layer in the stack. Such a feature should support:

1. Selecting which store a binary should be written to
2. How binary gets read
3. Support Blob GC
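
The three points above can be sketched roughly as follows. This is only an illustration: SimpleBlobStore, InMemoryBlobStore and CompositeStoreSketch are made-up names and deliberately do not mirror Oak's actual BlobStore interface, and the GC shown is a naive "delete everything that is not referenced" pass.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical minimal store contract (not Oak's BlobStore interface).
interface SimpleBlobStore {
    void write(String blobId, byte[] data);
    byte[] read(String blobId);      // returns null if the blob is absent
    Set<String> blobIds();           // snapshot of all ids in this store
    void delete(String blobId);
}

class InMemoryBlobStore implements SimpleBlobStore {
    private final Map<String, byte[]> blobs = new HashMap<>();
    public void write(String blobId, byte[] data) { blobs.put(blobId, data); }
    public byte[] read(String blobId) { return blobs.get(blobId); }
    public Set<String> blobIds() { return new HashSet<>(blobs.keySet()); }
    public void delete(String blobId) { blobs.remove(blobId); }
}

class CompositeStoreSketch {
    // Registration order doubles as read order.
    private final Map<String, SimpleBlobStore> stores = new LinkedHashMap<>();

    void addStore(String name, SimpleBlobStore store) { stores.put(name, store); }

    // 1. The layer above selects which store a binary is written to.
    void write(String storeName, String blobId, byte[] data) {
        SimpleBlobStore store = stores.get(storeName);
        if (store == null) throw new IllegalArgumentException("No such store: " + storeName);
        store.write(blobId, data);
    }

    // 2. Reads ask each delegate in turn until one has the blob.
    byte[] read(String blobId) {
        for (SimpleBlobStore store : stores.values()) {
            byte[] data = store.read(blobId);
            if (data != null) return data;
        }
        return null;
    }

    // 3. Blob GC spans all delegates: drop every blob that is not referenced.
    int collectGarbage(Set<String> referencedIds) {
        int removed = 0;
        for (SimpleBlobStore store : stores.values()) {
            for (String id : store.blobIds()) {
                if (!referencedIds.contains(id)) {
                    store.delete(id);
                    removed++;
                }
            }
        }
        return removed;
    }
}
```

Keeping the store selection as an explicit argument leaves the decision (path based, access based, ...) entirely to the layer above.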

B - BinaryStorage Support
------------------------------------

Once we have A implemented, the layer above can implement some logic
to manage where binaries are stored without requiring major changes in
core. For example, Oak can extend the current extension point in
BlobStatsCollector to allow plugging in a custom stats collector. The
application can then use this to build logic that moves content based
on various heuristics:

1. Path Based
2. Access Based

The application can then use a standard API to "copy/move" a binary
from one store to another.

We can also provide some out-of-the-box implementation, but the key
thing here is that it should be built on top of Oak Core and hence be
pluggable.

Given that we have been discussing enhancements in the Binary area for
a long time now [1], it would be better to get #A implemented now with
an eye on the requirements of #B, so that we make some progress here.

Chetan Mehrotra
[1] https://wiki.apache.org/jackrabbit/JCR%20Binary%20Usecase

Re: [DiSCUSS] - highly vs rarely used data

Posted by Bertrand Delacretaz <bd...@apache.org>.
On Tue, Jul 4, 2017 at 12:48 PM, Julian Sedding <js...@gmail.com> wrote:
> ...I suggest to come up with a pluggable "strategy" interface and
> provide a sensible default implementation...

Big +1 to that, requirements can vary widely IMO, also depending on
the characteristics of whatever cold storage is used.

> ...A much more important and difficult question to answer IMHO is how to
> deal with the slow retrieval of archived content....

Throw an exception maybe? BinaryNotAvailableAtThisTime, including an
ETA for availability. The application can then decide how to handle
that.

-Bertrand

Re: [DiSCUSS] - highly vs rarely used data

Posted by Julian Sedding <js...@gmail.com>.
From my experience working with customers, I can pretty much guarantee
that sooner or later:

(a) the implementation of an automatism is not *quite* what they need/want
(b) they want to be able to manually select (or more likely override)
whether a file can be archived

Thus I suggest to come up with a pluggable "strategy" interface and
provide a sensible default implementation. The default will be fine
for most customers/users, but advanced use-cases can be implemented by
substituting the implementation. Implementations could then also
respect manually set flags (=properties) if desired.
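
To illustrate what I mean, a rough sketch of such an interface with a default and an overriding implementation. All names (ArchiveStrategy, BinaryInfo, AgeBasedStrategy, ManualOverrideStrategy) are invented for this mail, and the manual flag is assumed to come from some property on the node.

```java
// Hypothetical description of a binary as seen by a strategy.
final class BinaryInfo {
    final String path;
    final long lastReadMillis;
    final Boolean manualFlag; // null = no manual flag set on the node

    BinaryInfo(String path, long lastReadMillis, Boolean manualFlag) {
        this.path = path;
        this.lastReadMillis = lastReadMillis;
        this.manualFlag = manualFlag;
    }
}

// The pluggable part: decides whether a binary may go to cold storage.
interface ArchiveStrategy {
    boolean shouldArchive(BinaryInfo info, long nowMillis);
}

// Possible default: archive whatever has not been read for a while.
class AgeBasedStrategy implements ArchiveStrategy {
    private final long maxIdleMillis;

    AgeBasedStrategy(long maxIdleMillis) { this.maxIdleMillis = maxIdleMillis; }

    public boolean shouldArchive(BinaryInfo info, long nowMillis) {
        return nowMillis - info.lastReadMillis > maxIdleMillis;
    }
}

// Advanced use case: a manually set flag wins, otherwise defer to a fallback.
class ManualOverrideStrategy implements ArchiveStrategy {
    private final ArchiveStrategy fallback;

    ManualOverrideStrategy(ArchiveStrategy fallback) { this.fallback = fallback; }

    public boolean shouldArchive(BinaryInfo info, long nowMillis) {
        if (info.manualFlag != null) return info.manualFlag;
        return fallback.shouldArchive(info, nowMillis);
    }
}
```

Most setups would only ever see the age-based default; the override is composed on top instead of being baked into it.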

A much more important and difficult question to answer IMHO is how to
deal with the slow retrieval of archived content. And if needed, how
to expose the slow availability (i.e. unavailable now but available
later) to the end user (or application layer). To me this sounds
tricky if we want to stick to the JCR API.

Regards
Julian



On Mon, Jul 3, 2017 at 4:33 PM, Tommaso Teofili
<to...@gmail.com> wrote:
> I am sure there are both use cases for automatic vs manual/controlled
> collection of unused data, however if I were a user I would personally not
> want to care about this. While I'd be happy to know that my repo is faster
> / smaller / cleaner / whatever it'd sound overly complex to deal with JCR
> and Oak constraints and behaviours from the application layer.
> IMHO if we want to have such a feature in Oak to save resources, it should
> be the persistence layer's responsibility to say "hey, this content has not
> been accessed for ages, let's try to reclaim some resources from it" (which
> could mean moving it to cold storage, compressing it or anything else).
>
> My 2 cents,
> Tommaso

Re: [DiSCUSS] - highly vs rarely used data

Posted by Tommaso Teofili <to...@gmail.com>.
I am sure there are both use cases for automatic vs manual/controlled
collection of unused data, however if I were a user I would personally not
want to care about this. While I'd be happy to know that my repo is faster
/ smaller / cleaner / whatever it'd sound overly complex to deal with JCR
and Oak constraints and behaviours from the application layer.
IMHO if we want to have such a feature in Oak to save resources, it should
be the persistence layer's responsibility to say "hey, this content has not
been accessed for ages, let's try to reclaim some resources from it" (which
could mean moving it to cold storage, compressing it or anything else).

My 2 cents,
Tommaso



On Mon, 3 Jul 2017 at 15:46, Thomas Mueller
<mu...@adobe.com.invalid> wrote:

> Hi,
>
> > a property on the node, e.g. "archiveState=toArchive"
>
> I wonder if we _can_ easily write to the version store? Also, some
> nodetypes don't allow such properties? It might need to be a hidden
> property, but then you can't use the JCR API. Or maintain this data in a
> "shadow" structure (not with the nodes), which would complicate move
> operations.
>
> If I were a customer, I wouldn't want to *manually* mark / unmark binaries
> to be moved to / from long-term storage. I would probably just want to rely
> on automatic management. But I'm not a customer, so my opinion is not that
> relevant.
>
> > Using a property directly specified for this purpose gives us more
> direct control over how it is being used I think.
>
> Sure, but it also comes with some complexities.
>
> Regards,
> Thomas