Posted to dev@jackrabbit.apache.org by Jukka Zitting <ju...@gmail.com> on 2007/06/05 04:41:52 UTC

Re: NGP: Value records

Hi,

On 5/16/07, Jukka Zitting <ju...@gmail.com> wrote:
> On 5/12/07, Jukka Zitting <ju...@gmail.com> wrote:
> > Based on the feedback I agree that it probably doesn't make sense to
> > keep track of unique copies of all values. However, avoiding extra
> > copies of large binaries is still a very nice feature, so I'd still
> > like to keep the single copy idea for those values. This is in fact
> > something that we might want to consider already for Jackrabbit 1.4
> > regardless of what we'll do with the NGP proposal.
>
> See JCR-926 for a practical application of this idea to current Jackrabbit.

I just did a quick prototype where I made the InternalValue class turn
all incoming binary streams into data records using a global data
store. Internally the value would just be represented by the data
identifier.
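
To make this a bit more concrete, the data store interface could look
roughly like the sketch below. The names and signatures are illustrative
only, not the actual prototype code:

    import java.io.IOException;
    import java.io.InputStream;

    // Illustrative only: a content-addressed store for binary values.
    // The identifier is the hex-encoded SHA-1 hash of the content, so
    // identical binaries are stored exactly once.
    public interface DataStore {

        // Consume the stream, store its content (unless an identical
        // record already exists) and return the record identifier.
        String addRecord(InputStream stream) throws IOException;

        // Open a new stream on the content of a previously added record.
        InputStream getStream(String identifier) throws IOException;

        // Length in bytes of a previously added record.
        long getLength(String identifier) throws IOException;
    }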

This allowed me to simplify quite a few things (for example to drop
all BLOBStore classes and custom handling of binary properties) and to
achieve *major* performance improvements for cases where large (>
100kB) binaries are handled. For example the time to save a large file
was essentially cut in half and things like versioning or cloning
trees with large binaries would easily become faster by an order of
magnitude. With this change it is possible for example to copy a DVD
image file in milliseconds. What's even better, not only did this
change remove extra copying of binary values, it also pushed all
binaries out of the persistence or item state managers so that no
binary read or write operation would ever lock the repository!

The downside of the change is that it requires backwards-incompatible
changes in jackrabbit-core, most notably pulling all blob handling out
of the existing persistence managers. Adopting the data store concept
would thus require migration of all existing repositories. Luckily
such migration would likely be relatively straightforward and we could
write tools to simplify the upgrade, but it would still be a major
undertaking.

I would very much like to go forward with this approach, but I'm not
sure when would be the right time to do that. Should we already target
the 1.4 release in September/October, or would it be better to wait
for Jackrabbit 2.0 sometime next year? Alternatively, should we go for
a 2.0 release already this year with this and some other structural
changes, and have Jackrabbit 3.0 be the JSR 283 reference
implementation?

BR,

Jukka Zitting

Re: NGP: Value records

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 6/6/07, Stefan Guggisberg <st...@gmail.com> wrote:
> something that just crossed my mind: i know a number of people
> want to store everything (config, meta data, binaries and content)
> in the same db in order to allow easy backup/restore of an entire
> repository. currently they can do so by using DatabaseFileSystem
> and the externalBLOBs=false option of DatabasePersistenceManager.
>
> do you plan to support db persistence for the binary store as well?

Yes. The DataStore interface should be very database-friendly, so there
shouldn't be any issues in implementing DB persistence for binaries. A
schema like this should do the trick:

    CREATE TABLE datastore ( id SERIAL, sha1 CHAR(40), data BLOB );

The getRecord() method would essentially be:

    SELECT data FROM datastore WHERE sha1=?

The addRecord() method would essentially be:

    INSERT INTO datastore (data) VALUES (?) -- calculate sha1 while inserting
    IF (SELECT 1 FROM datastore WHERE sha1=?) THEN
        -- identical content is already stored, so drop the new copy
        DELETE FROM datastore WHERE id=?
    ELSE
        -- first copy of this content, label the new row with its hash
        UPDATE datastore SET sha1=? WHERE id=?
    END IF
    COMMIT
    RETURN sha1
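
In JDBC terms, addRecord() could look roughly like the sketch below. Only
the table and column names come from the schema above; the class name and
everything else is made up for illustration. The point is that the SHA-1
digest is computed while the stream is written to the database, so the
binary is never buffered separately:

    import java.io.InputStream;
    import java.security.DigestInputStream;
    import java.security.MessageDigest;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    // Illustrative sketch only; no statement cleanup, error handling or retries.
    public class DbDataStoreSketch {

        public String addRecord(Connection con, InputStream stream)
                throws Exception {
            con.setAutoCommit(false);

            // Digest the content on the fly while the driver reads the stream.
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            DigestInputStream digested = new DigestInputStream(stream, digest);

            // Insert the binary first, without a sha1 value.
            PreparedStatement insert = con.prepareStatement(
                    "INSERT INTO datastore (data) VALUES (?)",
                    new String[] { "id" });
            insert.setBinaryStream(1, digested);
            insert.executeUpdate();
            ResultSet keys = insert.getGeneratedKeys();
            keys.next();
            long id = keys.getLong(1);
            String sha1 = toHex(digest.digest());

            // If identical content is already stored, drop the new copy;
            // otherwise label the new row with its hash.
            PreparedStatement check = con.prepareStatement(
                    "SELECT 1 FROM datastore WHERE sha1 = ?");
            check.setString(1, sha1);
            if (check.executeQuery().next()) {
                PreparedStatement delete = con.prepareStatement(
                        "DELETE FROM datastore WHERE id = ?");
                delete.setLong(1, id);
                delete.executeUpdate();
            } else {
                PreparedStatement update = con.prepareStatement(
                        "UPDATE datastore SET sha1 = ? WHERE id = ?");
                update.setString(1, sha1);
                update.setLong(2, id);
                update.executeUpdate();
            }
            con.commit();
            return sha1;
        }

        private static String toHex(byte[] bytes) {
            StringBuilder hex = new StringBuilder();
            for (byte b : bytes) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }
    }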

BR,

Jukka Zitting

Re: NGP: Value records

Posted by Stefan Guggisberg <st...@gmail.com>.
On 6/6/07, Stefan Guggisberg <st...@gmail.com> wrote:
> hi jukka,
>
> On 6/5/07, Jukka Zitting <ju...@gmail.com> wrote:
> > Hi,
> >
> > On 5/16/07, Jukka Zitting <ju...@gmail.com> wrote:
> > > On 5/12/07, Jukka Zitting <ju...@gmail.com> wrote:
> > > > Based on the feedback I agree that it probably doesn't make sense to
> > > > keep track of unique copies of all values. However, avoiding extra
> > > > copies of large binaries is still a very nice feature, so I'd still
> > > > like to keep the single copy idea for those values. This is in fact
> > > > something that we might want to consider already for Jackrabbit 1.4
> > > > regardless of what we'll do with the NGP proposal.
> > >
> > > See JCR-926 for a practical application of this idea to current Jackrabbit.
> >
> > I just did a quick prototype where I made the InternalValue class turn
> > all incoming binary streams into data records using a global data
> > store. Internally the value would just be represented by the data
> > identifier.
> >
> > This allowed me to simplify quite a few things (for example to drop
> > all BLOBStore classes and custom handling of binary properties) and to
> > achieve *major* performance improvements for cases where large (>
> > 100kB) binaries are handled. For example the time to save a large file
> > was essentially cut in half and things like versioning or cloning
> > trees with large binaries would easily become faster by an order of
> > magnitude. With this change it is possible for example to copy a DVD
> > image file in milliseconds. What's even better, not only did this
> > change remove extra copying of binary values, it also pushed all
> > binaries out of the persistence or item state managers so that no
> > binary read or write operation would ever lock the repository!
>
> awesome, that's great news!
>
> is there a way to purge the binary store, i.e. remove unreferenced data?
> i am a bit concerned that doing a lot of add/remove operations would
> quickly exhaust available storage space. at least we need a concept
> for how to deal with this kind of situation.

something that just crossed my mind: i know a number of people
want to store everything (config, meta data, binaries and content)
in the same db in order to allow easy backup/restore of an entire
repository. currently they can do so by using DatabaseFileSystem
and the externalBLOBs=false option of DatabasePersistenceManager.

do you plan to support db persistence for the binary store as well?

cheers
stefan

>
> >
> > The downside of the change is that it requires backwards-incompatible
> > changes in jackrabbit-core, most notably pulling all blob handling out
> > of the existing persistence managers. Adopting the data store concept
> > would thus require migration of all existing repositories. Luckily
> > such migration would likely be relatively straightforward and we could
> > write tools to simplify the upgrade, but it would still be a major
> > undertaking.
> >
> > I would very much like to go forward with this approach, but I'm not
> > sure when would be the right time to do that. Should we already target
> > the 1.4 release in September/October, or would it be better to wait
> > for Jackrabbit 2.0 sometime next year? Alternatively, should we go for
> > a 2.0 release already this year with this and some other structural
> > changes, and have Jackrabbit 3.0 be the JSR 283 reference
> > implementation?
>
> since the jsr-283 public review is just around the corner we'll have to
> start work on the ri pretty soon. therefore i think the ri should target
> v2.0.
>
> wrt integrating JCR-926, both 1.4 and 2.0 would be fine with me.
>
> cheers
> stefan
>
> >
> > BR,
> >
> > Jukka Zitting
> >
>

Re: NGP: Value records

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 6/6/07, Stefan Guggisberg <st...@gmail.com> wrote:
> is there a way to purge the binary store, i.e. remove unreferenced data?
> i am a bit concerned that doing a lot of add/remove operations would
> quickly exhaust available storage space. at least we need a concept
> for how to deal with this kind of situation.

I was thinking of using a garbage collector to reclaim unreferenced
data. I haven't yet implemented anything like that and there will be
open issues to resolve if we want to allow more than one repository to
use the same data store for binaries, but I don't think those are
questions we can't resolve. I'll look into the details. It would be
good to have at least a rudimentary solution available before
releasing any of this code.
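
One possible shape for such a collector (every name below is hypothetical,
and the multi-repository questions above are not addressed here): a
mark-and-sweep pass that refreshes a last-modified timestamp on every
record still referenced by some binary property, then deletes whatever was
not touched since the scan started.

    import java.io.IOException;

    // Very rough sketch of the mark-and-sweep idea; everything here is
    // hypothetical, not existing code.
    public class DataStoreGarbageCollector {

        // The two extra operations the data store would need to support.
        public interface CollectableDataStore {
            void touch(String identifier) throws IOException;          // mark
            int deleteAllOlderThan(long timestamp) throws IOException; // sweep
        }

        // Walks all persisted binary properties (all workspaces plus the
        // version store) and reports the data identifiers they reference.
        public interface ReferenceScanner {
            Iterable<String> referencedIdentifiers() throws IOException;
        }

        private final CollectableDataStore store;
        private final ReferenceScanner scanner;

        public DataStoreGarbageCollector(
                CollectableDataStore store, ReferenceScanner scanner) {
            this.store = store;
            this.scanner = scanner;
        }

        public void collect() throws IOException {
            long scanStart = System.currentTimeMillis();

            // Mark phase: refresh the timestamp of every record that is
            // still referenced somewhere in the repository.
            for (String identifier : scanner.referencedIdentifiers()) {
                store.touch(identifier);
            }

            // Sweep phase: anything not touched since the scan started is
            // unreferenced; records added during the scan have a newer
            // timestamp and therefore survive.
            store.deleteAllOlderThan(scanStart);
        }
    }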

> since the jsr-283 public review is just around the corner we'll have to
> start work on the ri pretty soon. therefore i think the ri should target
> v2.0.
>
> wrt intergating JCR-926 both 1.4 and 2.0 would be fine with me.

OK. We can initially target 1.4, but if it seems like we won't have the
required migration tools and other supporting code and documentation,
then we can postpone this to 2.0.

BR,

Jukka Zitting

Re: NGP: Value records

Posted by Stefan Guggisberg <st...@gmail.com>.
hi jukka,

On 6/5/07, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> On 5/16/07, Jukka Zitting <ju...@gmail.com> wrote:
> > On 5/12/07, Jukka Zitting <ju...@gmail.com> wrote:
> > > Based on the feedback I agree that it probably doesn't make sense to
> > > keep track of unique copies of all values. However, avoiding extra
> > > copies of large binaries is still a very nice feature, so I'd still
> > > like to keep the single copy idea for those values. This is in fact
> > > something that we might want to consider already for Jackrabbit 1.4
> > > regardless of what we'll do with the NGP proposal.
> >
> > See JCR-926 for a practical application of this idea to current Jackrabbit.
>
> I just did a quick prototype where I made the InternalValue class turn
> all incoming binary streams into data records using a global data
> store. Internally the value would just be represented by the data
> identifier.
>
> This allowed me to simplify quite a few things (for example to drop
> all BLOBStore classes and custom handling of binary properties) and to
> achieve *major* performance improvements for cases where large (>
> 100kB) binaries are handled. For example the time to save a large file
> was essentially cut in half and things like versioning or cloning
> trees with large binaries would easily become faster by an order of
> magnitude. With this change it is possible for example to copy a DVD
> image file in milliseconds. What's even better, not only did this
> change remove extra copying of binary values, it also pushed all
> binaries out of the persistence or item state managers so that no
> binary read or write operation would ever lock the repository!

awesome, that's great news!

is there a way to purge the binary store, i.e. remove unreferenced data?
i am a bit concerned that doing a lot of add/remove operations would
quickly exhaust available storage space. at least we need a concept
for how to deal with this kind of situation.

>
> The downside of the change is that it requires backwards-incompatible
> changes in jackrabbit-core, most notably pulling all blob handling out
> of the existing persistence managers. Adopting the data store concept
> would thus require migration of all existing repositories. Luckily
> such migration would likely be relatively straightforward and we could
> write tools to simplify the upgrade, but it would still be a major
> undertaking.
>
> I would very much like to go forward with this approach, but I'm not
> sure when would be the right time to do that. Should we already target
> the 1.4 release in September/October, or would it be better to wait
> for Jackrabbit 2.0 sometime next year? Alternatively, should we go for
> a 2.0 release already this year with this and some other structural
> changes, and have Jackrabbit 3.0 be the JSR 283 reference
> implementation?

since the jsr-283 public review is just around the corner we'll have to
start work on the ri pretty soon. therefore i think the ri should target
v2.0.

wrt integrating JCR-926, both 1.4 and 2.0 would be fine with me.

cheers
stefan

>
> BR,
>
> Jukka Zitting
>