Posted to dev@jackrabbit.apache.org by Jukka Zitting <ju...@gmail.com> on 2007/04/01 21:37:17 UTC

Next Generation Persistence

Hi,

Based on some idle thinking I've written up a proposed model for next
generation persistence in Jackrabbit. Instead of trying to explain the
whole idea in an email I've written a web page (with diagrams!) to
describe the model in more detail. You can find the model description
at http://jackrabbit.apache.org/dev/ngp.html (it's not linked in the
menu).

To summarize, the model I'm proposing brings a heavily abstracted
persistence layer up as the central focus point of the entire
repository architecture. The persistence model I'm proposing has
implications all the way to things like clustering implementation,
node type management, and session handling. In fact it would
revolutionize almost all parts of Jackrabbit core. ;-)

On the other hand, instead of seeing the proposed model as a
revolution, it could be seen simply as a way to raise the prominence
of the ChangeLog class over the ItemState model. Currently ItemStates
are the central concept in Jackrabbit and the ChangeLog class is just
a way to group together a set of ItemState changes. In the model I'm
proposing the ChangeLog, or a revision, would instead become the
central concept that governs not only read and write operations but
also things like transactions and observation to a much greater degree
than it does today.
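
A minimal sketch of what "a revision as the central concept" could look
like. This is an illustration only, not taken from the proposal page: the
class name Revision, its read method, and the use of plain strings for
item identifiers and values are all assumptions. Each revision stores only
the changes made on top of its parent, and a read walks back through the
chain until it finds the item.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: an immutable revision that records only the item
// changes made on top of its parent revision.
final class Revision {
    private final Revision parent;              // null for the initial revision
    private final Map<String, String> changes;  // item id -> value, null value = removed

    Revision(Revision parent, Map<String, String> changes) {
        this.parent = parent;
        this.changes = new HashMap<>(changes);  // defensive copy, never modified afterwards
    }

    // Walks back through the revision chain until the item is found.
    // Returns null if the item was removed or never existed.
    String read(String itemId) {
        for (Revision r = this; r != null; r = r.parent) {
            if (r.changes.containsKey(itemId)) {
                return r.changes.get(itemId);
            }
        }
        return null;
    }
}
```

In this sketch a reader that holds on to an older revision keeps seeing a
stable view after newer revisions are added; the price is that a single
read may have to visit several revisions.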

There are a number of open issues with the proposal, especially how to
make it perform well and not use too much disk space, but I feel that
the model is interesting enough for serious consideration and perhaps
even early prototyping. I'm especially interested in hearing your
thoughts on how feasible you think such a model would be and what
potential pitfalls I may have missed. Any other comments and ideas are
also welcome.

BR,

Jukka Zitting

Re: Next Generation Persistence

Posted by Tobias Bocanegra <to...@day.com>.

hi jukka,
good work. very nice draft! i was working on a similar idea of a new
persistence model which went a bit further in some details and was
much like how subversion stores its content [0].
i see 2 additional fundamental paradigms that a persistence layer
should be based on:
1. copies must be very cheap (and just recorded as a reference)
2. multiple (or all) workspaces must be able to live on the same
'persistence tree'

the first point basically describes a 'copy by reference'. this allows
very fast implementations of copy, checkin, restore, etc. the second
paradigm allows very fast inter-workspace operations, like clone,
copy, etc.

regards, toby

[0] http://subversion.tigris.org/design.html#server.fs.struct


-- 
-----------------------------------------< tobias.bocanegra@day.com >---
Tobias Bocanegra, Day Management AG, Barfuesserplatz 6, CH - 4001 Basel
T +41 61 226 98 98, F +41 61 226 98 97
-----------------------------------------------< http://www.day.com >---

Re: Next Generation Persistence

Posted by Stefan Guggisberg <st...@gmail.com>.

hi jukka

nice draft! i like the approach (using MVCC in jackrabbit's core). while the
draft might be a bit overenthusiastic wrt the benefits, the proposed model
is certainly worth a serious evaluation. i am pretty confident that the pros
will clearly outweigh the cons in the end. the best way to identify the issues
of the proposed model change would IMO be to start work on a prototype.

IMO the major challenge will be read performance.

cheers
stefan

Re: Next Generation Persistence

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On 4/11/07, Marcel Reutegger <ma...@gmx.net> wrote:
> Jukka Zitting wrote:
> > With a Subversion-like structure a node does not have a direct parent
> > reference which could be used to construct the node path. Even an
> > indirect parent identifier doesn't work if we want to support
> > Subversion-style zero-cost copying and moving of nodes and subtrees.
>
> moving is already a cheap operation because only the parent node id changes.
>
> copying in JCR is fundamentally different from subversion. in subversion the
> uuid of a resource does not change when you do a copy. In JCR a copied
> referenceable node *must* get a new UUID. Therefore copy in JCR is potentially
> costly and cannot be implemented in subversion style.

The cheap copy feature would be more useful for versioning operations
like checkin, update, and clone.

BR,

Jukka Zitting

Re: Next Generation Persistence

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On 4/12/07, Tobias Bocanegra <to...@day.com> wrote:
> > copying in JCR is fundamentally different from subversion. in subversion the
> > uuid of a resource does not change when you do a copy. In JCR a copied
> > referenceable node *must* get a new UUID. Therefore copy in JCR is potentially
> > costly and cannot be implemented in subversion style.
>
> not necessarily, i made a prototype, that proves otherwise. on copy,
> you create a new node that references the old one, and just overlays
> the modified properties. this works quite well and fast.

Nice!

BR,

Jukka Zitting

Re: Next Generation Persistence

Posted by Tobias Bocanegra <to...@day.com>.

> copying in JCR is fundamentally different from subversion. in subversion the
> uuid of a resource does not change when you do a copy. In JCR a copied
> referenceable node *must* get a new UUID. Therefore copy in JCR is potentially
> costly and cannot be implemented in subversion style.

not necessarily, i made a prototype, that proves otherwise. on copy,
you create a new node that references the old one, and just overlays
the modified properties. this works quite well and fast.
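
A rough sketch of the overlay idea described above. It is not the
prototype mentioned in this message: the class OverlayNode and its methods
are hypothetical names, properties are simplified to strings, and the
sketch assumes the source node is not modified after it has been copied
(as would be the case for stored revisions).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch: a copy references its source node instead of
// duplicating it, and only stores the properties that differ.
final class OverlayNode {
    private final String uuid;
    private final OverlayNode base;  // null for a node that owns all its data
    private final Map<String, String> overlay = new HashMap<>();

    OverlayNode(OverlayNode base) {
        // Every node, including a copy, gets its own fresh UUID.
        this.uuid = UUID.randomUUID().toString();
        this.base = base;
    }

    String getUuid() {
        return uuid;
    }

    // Writes only ever touch the overlay of this node.
    void setProperty(String name, String value) {
        overlay.put(name, value);
    }

    // Reads check the overlay first and fall back to the referenced node.
    String getProperty(String name) {
        if (overlay.containsKey(name)) {
            return overlay.get(name);
        }
        return base != null ? base.getProperty(name) : null;
    }

    // The copy costs the same no matter how many properties the source has.
    OverlayNode copy() {
        return new OverlayNode(this);
    }
}
```

With this layout the new UUID required for a copied referenceable node is
just another value owned by the copy, so it does not force the rest of the
node to be duplicated.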

regards, toby

Re: Next Generation Persistence

Posted by Marcel Reutegger <ma...@gmx.net>.

Jukka Zitting wrote:
> With a Subversion-like structure a node does not have a direct parent
> reference which could be used to construct the node path. Even an
> indirect parent identifier doesn't work if we want to support
> Subversion-style zero-cost copying and moving of nodes and subtrees.

moving is already a cheap operation because only the parent node id changes.

copying in JCR is fundamentally different from subversion. in subversion the 
uuid of a resource does not change when you do a copy. In JCR a copied 
referenceable node *must* get a new UUID. Therefore copy in JCR is potentially 
costly and cannot be implemented in subversion style.

regards
  marcel

Re: Next Generation Persistence

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On 4/11/07, Nicolas <nt...@gmail.com> wrote:
> Why would getNodeByUUID(...).getPath() become so expensive? It seems we
> could keep a pointer to the last value, no?

With a Subversion-like structure a node does not have a direct parent
reference which could be used to construct the node path. Even an
indirect parent identifier doesn't work if we want to support
Subversion-style zero-cost copying and moving of nodes and subtrees.
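
To illustrate the cost, here is a small sketch of a tree whose nodes only
know their children. The TreeNode class and the findPath helper are
hypothetical and not part of Jackrabbit. Without a parent reference the
path of a node can only be found by searching down from the root, which
in the worst case visits every node.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: nodes hold child references only, no parent reference.
final class TreeNode {
    final String uuid;
    final Map<String, TreeNode> children = new LinkedHashMap<>();

    TreeNode(String uuid) {
        this.uuid = uuid;
    }

    TreeNode addChild(String name, TreeNode child) {
        children.put(name, child);
        return child;
    }

    // Depth-first search from the root; returns the first matching path,
    // or null if no node with the given UUID is reachable.
    static String findPath(TreeNode root, String uuid) {
        Deque<String> segments = new ArrayDeque<>();
        return search(root, uuid, segments) ? "/" + String.join("/", segments) : null;
    }

    private static boolean search(TreeNode node, String uuid, Deque<String> segments) {
        if (node.uuid.equals(uuid)) {
            return true;
        }
        for (Map.Entry<String, TreeNode> entry : node.children.entrySet()) {
            segments.addLast(entry.getKey());
            if (search(entry.getValue(), uuid, segments)) {
                return true;
            }
            segments.removeLast();  // backtrack, the node is not below this child
        }
        return false;
    }
}
```

A separate index from UUID to path could avoid the search, but it would
have to be updated for a whole subtree on every copy or move, which works
against the zero-cost goal.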

BR,

Jukka Zitting

Re: Next Generation Persistence

Posted by Nicolas <nt...@gmail.com>.

Hi Jukka,

Why would getNodeByUUID(...).getPath() become so expensive? It seems we
could keep a pointer to the last value, no?

Nicolas

On 4/11/07, Jukka Zitting <ju...@gmail.com> wrote:
>
> Hi,
>
> Thanks for comments and pointers to further information.
>
> I think using a Subversion-like structure might make sense, though a
> call like getNodeByUUID(...).getPath() could become quite expensive.
> We probably should prioritize the performance requirements of
> different operations to help guide the design.
>

Re: Next Generation Persistence

Posted by Angela Schreiber <an...@day.com>.

hi jukka

> Another issue that came up is whether and how such a persistence model
> would work with the SPI. I had considered the SPI as the primary
> interface to use when prototyping/implementing this persistence
> proposal, but it seems that the handling of the transient space as a
> "draft revision" doesn't resonate well with the SPI model of keeping
> the transient space on the client side. Any ideas on how to best
> resolve this?

i don't have a solution at hand, but i think it's worth keeping
the SPI in mind when redesigning the jackrabbit core,
for the reasons you kept pointing out during the last month, such
as multiple efforts and bringing together the 2 jcr implementations.
second, i think that one of the problems with jackrabbit
core is its monolithic structure and the lack of clear separation
between the various layers. this was one of the reasons that forced
us to throw away almost everything we initially wanted to use
from jackrabbit-core while building the jcr2spi layer.

kind regards
angela

Re: Next Generation Persistence

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On 4/13/07, Roy T. Fielding <fi...@gbiv.com> wrote:
> It sounds to me more like the transient space needs its own backing
> store mechanism.  It doesn't make sense for unsaved changes to
> be sent across the SPI, for the same reason it doesn't make sense
> to send workspace edits to the subversion repo before an explicit
> commit.

I fully agree when the SPI is used as a remoting layer, but I don't
think it's necessary or even beneficial to have a separate backing
store mechanism for the transient space when accessing a local
repository. I also believe that the most performance-critical
deployments will use either fully local or very low-latency network
connections to the repository backend. You wouldn't want your
performance-critical CMS application to access the content repository
over the Internet.

As of now the SPI layer implies making an extra copy of transient
changes. It is possible to avoid extra copying of binary (and other)
values with the current SPI, but transient spaces with large numbers
of nodes and properties still face this problem. I'm wondering if we
could modify the SPI somehow to allow the client to only keep track of
item identifiers instead of the full transient item states and let the
SPI implementation decide whether to use a separate client-side
backing store for the item states.

BR,

Jukka Zitting

Re: Next Generation Persistence

Posted by "Roy T. Fielding" <fi...@gbiv.com>.

On Apr 13, 2007, at 2:59 AM, Jukka Zitting wrote:
> On 4/13/07, Marcel Reutegger <ma...@gmx.net> wrote:
>> hmm, I see. there seems to be a fundamental mismatch between the spi
>> and the ngp design. The spi clearly decouples the transient changes from
>> the server whereas the ngp rather integrates them more tightly into the
>> core.
>
> Agreed. I think the mismatch is an example of the more general
> tradeoffs between remote and local access. For a remote client it
> definitely makes sense to keep the transient space local as long as
> possible, but for a local client this is not a strict requirement,
> just one design option among others.

It sounds to me more like the transient space needs its own backing
store mechanism.  It doesn't make sense for unsaved changes to
be sent across the SPI, for the same reason it doesn't make sense
to send workspace edits to the subversion repo before an explicit
commit.  Ideally, we should be able to support offline editing of
the workspace for any nodes that are already copied to the client,
without having to worry about the memory size.  Usually that means
keeping only a window of memory active and persisting changes outside
that window to the local disk.
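
One way to picture the "window of memory" idea is the sketch below. It is
only an illustration under stated assumptions: the class name
WindowedTransientStore, the string-valued item states, and the use of
temporary files as the local backing store are all made up for the
example and are not part of Jackrabbit or the SPI. The most recently used
items stay in memory and anything pushed out of the window is written to
the local disk.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: keeps at most windowSize unsaved items in memory
// and spills the least recently used ones to files on the local disk.
final class WindowedTransientStore {
    private final Path spillDir;
    private final Map<String, Path> spilled = new HashMap<>();
    private final LinkedHashMap<String, String> window;

    WindowedTransientStore(final int windowSize) {
        try {
            this.spillDir = Files.createTempDirectory("transient");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Access-ordered map: the eldest entry is the least recently used one.
        this.window = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                if (size() <= windowSize) {
                    return false;
                }
                spill(eldest.getKey(), eldest.getValue());
                return true;
            }
        };
    }

    private void spill(String id, String value) {
        try {
            Path file = Files.createTempFile(spillDir, "item", ".txt");
            Files.write(file, value.getBytes(StandardCharsets.UTF_8));
            spilled.put(id, file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void put(String id, String value) {
        discardSpilled(id);      // an older on-disk copy is now stale
        window.put(id, value);   // may push another item out to disk
    }

    String get(String id) {
        String value = window.get(id);
        if (value != null) {
            return value;
        }
        Path file = spilled.get(id);
        if (file == null) {
            return null;
        }
        try {
            value = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        discardSpilled(id);
        window.put(id, value);   // bring the item back into the window
        return value;
    }

    private void discardSpilled(String id) {
        Path file = spilled.remove(id);
        if (file != null) {
            try {
                Files.deleteIfExists(file);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    int inMemory() {
        return window.size();
    }
}
```

The memory footprint is bounded by the window size rather than by the
number of unsaved changes, at the cost of disk I/O for items that fall
outside the window.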

....Roy

Re: Next Generation Persistence

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On 4/13/07, Marcel Reutegger <ma...@gmx.net> wrote:
> hmm, I see. there seems to be a fundamental mismatch between the spi and the ngp
> design. The spi clearly decouples the transient changes from the server whereas
> the ngp rather integrates them more tightly into the core.

Agreed. I think the mismatch is an example of the more general
tradeoffs between remote and local access. For a remote client it
definitely makes sense to keep the transient space local as long as
possible, but for a local client this is not a strict requirement,
just one design option among others.

I'm not sure how serious this mismatch is in practice. It would be
nice if we could resolve it, but apart from duplicating the code
related to transient changes and requiring extra copies of transient
content it doesn't seem to have too big an impact on most use cases.

BR,

Jukka Zitting

Re: Next Generation Persistence

Posted by Marcel Reutegger <ma...@gmx.net>.

Jukka Zitting wrote:
> The copies I'm concerned about are created when a client wants to read
> a property that has not yet been persisted. For example if I want to
> do a large XML import and postprocess it before persisting all the
> changes. The "draft revision" model would nicely support such a use
> case without excessive memory requirements or having to make another
> copy of the subtree when saving it.
> 
>> counter question: would it be possible to create a draft revision that is
>> immediately persisted?
> 
> Certainly, but see the concerns above. Doing that loses some of the
> nice session management features outlined in the proposal.
> 
>> I'd rather not create objects in the core or the server that are long
>> lived and potentially not used because a client decides to discard a
>> draft revision.
> 
> A draft revision that gets discarded would just get removed from the
> disk, so I don't think this is a problem.

hmm, I see. there seems to be a fundamental mismatch between the spi and the ngp 
design. The spi clearly decouples the transient changes from the server whereas 
the ngp rather integrates them more tightly into the core.

regards
  marcel

Re: Next Generation Persistence

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On 4/13/07, Marcel Reutegger <ma...@gmx.net> wrote:
> Jukka Zitting wrote:
> > In fact most of that could already be achieved by a Batch
> > implementation that is backed by the draft revision. The main problem
> > at the moment is that the client still needs to keep a full copy of
> > everything it passes to the Batch in order to properly support reading
> > transient changes. Would it be possible to add getter methods to the
> > Batch interface?
>
> a batch is distinct from transient changes because it potentially is only a
> subset of the transient changes (when save is called on an item that is not the
> root node and the workspace has changes not just below that item). as a
> consequence a batch is only created when save is called. whether a 'full copy'
> is required depends on the SPI implementation. it is also possible to interpret
> the client-side calls on the batch directly into server-side calls of the native
> store. in that case no copies are created except whatever is required/needed in
> the native storage.

The copies I'm concerned about are created when a client wants to read
a property that has not yet been persisted. For example if I want to
do a large XML import and postprocess it before persisting all the
changes. The "draft revision" model would nicely support such a use
case without excessive memory requirements or having to make another
copy of the subtree when saving it.

> counter question: would it be possible to create a draft revision that is
> immediately persisted?

Certainly, but see the concerns above. Doing that loses some of the
nice session management features outlined in the proposal.

> I'd rather not create objects in the core or the server that are long lived and
> potentially not used because a client decides to discard a draft revision.

A draft revision that gets discarded would just get removed from the
disk, so I don't think this is a problem.

BR,

Jukka Zitting

Re: Next Generation Persistence

Posted by Marcel Reutegger <ma...@gmx.net>.

Jukka Zitting wrote:
> In fact most of that could already be achieved by a Batch
> implementation that is backed by the draft revision. The main problem
> at the moment is that the client still needs to keep a full copy of
> everything it passes to the Batch in order to properly support reading
> transient changes. Would it be possible to add getter methods to the
> Batch interface?

a batch is distinct from transient changes because it potentially is only a 
subset of the transient changes (when save is called on an item that is not the 
root node and the workspace has changes not just below that item). as a 
consequence a batch is only created when save is called. whether a 'full copy' 
is required depends on the SPI implementation. it is also possible to interpret 
the client-side calls on the batch directly into server-side calls of the native 
store. in that case no copies are created except whatever is required/needed in 
the native storage.
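
The difference between the transient changes and a batch can be sketched
as follows. The sketch does not use the actual SPI Batch interface: the
TransientSpace class, its methods, and the representation of a change as
a path plus a string value are assumptions made for the example. Saving
an item that is not the root only moves the changes at or below that item
into the batch, and everything else stays transient.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: the transient space holds all unsaved changes,
// a batch holds only those at or below the item that save was called on.
final class TransientSpace {
    private final TreeMap<String, String> changes = new TreeMap<>();

    void modify(String path, String value) {
        changes.put(path, value);
    }

    // Removes the changes in scope of the saved item and returns them as a batch.
    Map<String, String> createBatch(String savedPath) {
        Map<String, String> batch = new TreeMap<>();
        // Append "/" so that saving /a does not pick up a sibling such as /ab.
        String prefix = savedPath.equals("/") ? "/" : savedPath + "/";
        changes.entrySet().removeIf(entry -> {
            String path = entry.getKey();
            boolean inScope = path.equals(savedPath) || path.startsWith(prefix);
            if (inScope) {
                batch.put(path, entry.getValue());
            }
            return inScope;
        });
        return batch;
    }

    int pending() {
        return changes.size();
    }
}
```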

counter question: would it be possible to create a draft revision that is 
immediately persisted?

I'd rather not create objects in the core or the server that are long lived and 
potentially not used because a client decides to discard a draft revision.

regards
  marcel

Re: Next Generation Persistence

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On 4/11/07, Marcel Reutegger <ma...@gmx.net> wrote:
> A SPI implementation using NGP could simply create a draft revision that
> includes the changes of a Batch when it is submitted. The draft revision will
> only live for a short period. Maybe the implementation could be optimized to
> directly 'stream' the draft revision into persistent storage to avoid recreating
> the changes in memory again.

Yeah, that's doable. I'm most concerned about cases where a large
number of nodes or a large binary is saved through the transient
space, so that the information needs to be temporarily saved on disk.
It would be better if that temporary storage could already be the
draft revision in which case no extra copying would be needed.

In fact most of that could already be achieved by a Batch
implementation that is backed by the draft revision. The main problem
at the moment is that the client still needs to keep a full copy of
everything it passes to the Batch in order to properly support reading
transient changes. Would it be possible to add getter methods to the
Batch interface?

BR,

Jukka Zitting

Re: Next Generation Persistence

Posted by Marcel Reutegger <ma...@gmx.net>.

Jukka Zitting wrote:
> Another issue that came up is whether and how such a persistence model
> would work with the SPI. I had considered the SPI as the primary
> interface to use when prototyping/implementing this persistence
> proposal, but it seems that the handling of the transient space as a
> "draft revision" doesn't resonate well with the SPI model of keeping
> the transient space on the client side. Any ideas on how to best
> resolve this?

A SPI implementation using NGP could simply create a draft revision that 
includes the changes of a Batch when it is submitted. The draft revision will 
only live for a short period. Maybe the implementation could be optimized to 
directly 'stream' the draft revision into persistent storage to avoid recreating 
the changes in memory again.

regards
  marcel

Re: Next Generation Persistence

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

Thanks for comments and pointers to further information.

I think using a Subversion-like structure might make sense, though a
call like getNodeByUUID(...).getPath() could become quite expensive.
We probably should prioritize the performance requirements of
different operations to help guide the design.

Another issue that came up is whether and how such a persistence model
would work with the SPI. I had considered the SPI as the primary
interface to use when prototyping/implementing this persistence
proposal, but it seems that the handling of the transient space as a
"draft revision" doesn't resonate well with the SPI model of keeping
the transient space on the client side. Any ideas on how to best
resolve this?

BR,

Jukka Zitting