Posted to dev@jackrabbit.apache.org by Ian Boston <ie...@tfd.co.uk> on 2007/04/27 02:10:40 UTC

Jackrabbit 1.2.3, RecordInput/DatabaseJournal

I want to extend or reimplement DatabaseJournal (core.cluster), but there 
is a dependency on RecordInput, which is a protected class (or at least 
has default scope).

So you can't extend AbstractDatabaseJournal except in the same package 
(perhaps that's the answer).

Is there a reason for this, or was it an oversight?

The reason I want to extend it is that I am embedding Jackrabbit into Sakai 
(www.sakaiproject.org), and I would prefer to use a DataSource rather 
than a DriverManager-delivered connection, even if I get the connection and 
keep it.

Thanks
Ian

Re: Jackrabbit 1.2.3, RecordInput/DatabaseJournal

Posted by Tim Kettering <ti...@vivakos.com>.
At the risk of mentioning "the other" products, I wanted to comment
that Alfresco takes this approach: storing content metadata in the
DB and the actual content in the file system.  Because we make extensive
use of Alfresco in some of our product offerings here, I've had
plenty of opportunity to look under the hood at their code and see
how they've set things up in general.  I'm not sure how many
specifics I should go into right now, though...

But since Alfresco 2.0 source code is released under the GPL, I see  
no real reason why one cannot look at their source for ideas on how  
this can be done.

-tim

On May 7, 2007, at 11:49 AM, Dominique Pfister wrote:

> Hi Ian,
>
> On 5/5/07, Ian Boston <ie...@tfd.co.uk> wrote:
>> Would Commons Transaction work for this?
>>
>> http://jakarta.apache.org/commons/transaction/file/index.html
>>
>> I'm happy to look into the FileSystemBLOBStore.
>
> Yes, that should work and this would certainly make a great
> contribution! As far as I can see, these are the things that
> complicate the matter:
>
> - One has to somehow associate the database transaction and the
>  transaction in the file system in order to ensure consistency and
>  commit either both or none.
> - Right now, FileSystemBLOBStore will use repository-relative
>  directories, either ${rep.home}/workspaces/${wsp.name}/blobs or
>  ${rep.home}/version/blobs, which will not work in a clustered
>  environment, unless those directories are mount points. In order to
>  simplify the setup, those directories should be made configurable.
>
> Again, I'd love to see some solution that transactionally (!)
> persists simple data in a database and blobs in a filesystem folder.
>
> Dominique


Re: Jackrabbit 1.2.3, RecordInput/DatabaseJournal

Posted by Dominique Pfister <do...@day.com>.
Hi Ian,

On 5/5/07, Ian Boston <ie...@tfd.co.uk> wrote:
> Would Commons Transaction work for this?
>
> http://jakarta.apache.org/commons/transaction/file/index.html
>
> I'm happy to look into the FileSystemBLOBStore.

Yes, that should work and this would certainly make a great
contribution! As far as I can see, these are the things that
complicate the matter:

- One has to somehow associate the database transaction and the
  transaction in the file system in order to ensure consistency and
  commit either both or none.
- Right now, FileSystemBLOBStore will use repository-relative
  directories, either ${rep.home}/workspaces/${wsp.name}/blobs or
  ${rep.home}/version/blobs, which will not work in a clustered
  environment, unless those directories are mount points. In order to
  simplify the setup, those directories should be made configurable.

Again, I'd love to see some solution that transactionally (!)
persists simple data in a database and blobs in a filesystem folder.
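
One way to approximate the "commit either both or none" requirement,
short of a real two-phase commit, is to stage the blob in a temporary
file, run the database work, and only then move the file to its final
name. The following is a minimal sketch under stated assumptions:
StagedBlobWriter and the dbWork callback are hypothetical names made up
for illustration, not Jackrabbit or Commons Transaction APIs. It uses
java.nio.file (Java 7 or later), and a crash between the database commit
and the file move would still need a recovery step that is not shown.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration; not part of Jackrabbit. */
public class StagedBlobWriter {

    private final Path blobDir;

    public StagedBlobWriter(Path blobDir) {
        this.blobDir = blobDir;
    }

    /**
     * Writes the data to a temporary file, runs the database work, and
     * publishes the blob under its final name only if that work succeeded.
     */
    public Path store(String blobId, byte[] data, Callable<Void> dbWork)
            throws Exception {
        Files.createDirectories(blobDir);
        Path staged = Files.createTempFile(blobDir, blobId, ".tmp");
        try {
            Files.write(staged, data);
            // Stands in for "execute the SQL statements and commit".
            dbWork.call();
            Path target = blobDir.resolve(blobId);
            // Same directory, so the rename can be atomic on most file systems.
            return Files.move(staged, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (Exception e) {
            // Roll back the staged blob if anything above failed.
            Files.deleteIfExists(staged);
            throw e;
        }
    }
}
```

If the database work throws, the staged file is deleted and no blob
becomes visible under its final name. The reverse failure (database
committed, move failed) is the case that still needs real transaction
coordination.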

Dominique

Re: Jackrabbit 1.2.3, RecordInput/DatabaseJournal

Posted by Ian Boston <ie...@tfd.co.uk>.
Dominique Pfister wrote:
> Hi Ian,
> 
> On 4/27/07, Ian Boston <ie...@tfd.co.uk> wrote:
>>
>> Dominique,
>>
>> We have over the past 3-4 years moved away from database persistence for
>> the body of files, since a number of universities have 1TB of data or
>> more.
>>
>> I have no problem putting the metadata in the DB, but if we put the
>> bodies in as well the DBAs throw a fit. It is just about bearable for
>> Oracle, although shifting backups starts to become a problem, but we have
>> seen some interesting results when a few 100G go into a MySQL DB under
>> InnoDB, not least in query times.
>>
>> So the question becomes: how bad, transactionally, is having a DB-based
>> PersistenceManager and content (the BLOBs) on the filesystem?
> 
> Quite bad. The DbBLOBStore, which stores blobs in the DB, uses the
> same underlying JDBC connection and will therefore atomically save all
> other changes along with the blobs. The FileSystemBLOBStore, on the
> other hand, only fulfills the D(urability) of the ACID transaction
> properties; e.g., it could save some blob even if the database
> operation fails, and would therefore break consistency. IMHO, the time
> needed to make it transaction-safe is considerable.


Would Commons Transaction work for this?

http://jakarta.apache.org/commons/transaction/file/index.html

I'm happy to look into the FileSystemBLOBStore.

> 
>> Any pointers would be extremely helpful.
> 
> There might be DB-specific extensions that tell the DB to store large
> files externally in the file system (e.g. Oracle's BFILE) but that
> would imply coding some custom database persistence manager that knows
> how to deal with this situation. Not sure whether those extensions
> still work in a transactional context, though.

We moved away from putting raw content into the DB (non-JSR-170 store) 
a few years ago, when DBAs reported lots of problems in production. If 
it can be avoided, I'd prefer not to go back there.

Ian



Re: Jackrabbit 1.2.3, RecordInput/DatabaseJournal

Posted by Dominique Pfister <do...@day.com>.
Hi Ian,

On 4/27/07, Ian Boston <ie...@tfd.co.uk> wrote:
>
> Dominique,
>
> We have over the past 3-4 years moved away from database persistence for
> the body of files, since a number of universities have 1TB of data or more.
>
> I have no problem putting the metadata in the DB, but if we put the
> bodies in as well the DBAs throw a fit. It is just about bearable for
> Oracle, although shifting backups starts to become a problem, but we have
> seen some interesting results when a few 100G go into a MySQL DB under
> InnoDB, not least in query times.
>
> So the question becomes: how bad, transactionally, is having a DB-based
> PersistenceManager and content (the BLOBs) on the filesystem?

Quite bad. The DbBLOBStore, which stores blobs in the DB, uses the
same underlying JDBC connection and will therefore atomically save all
other changes along with the blobs. The FileSystemBLOBStore, on the
other hand, only fulfills the D(urability) of the ACID transaction
properties; e.g., it could save some blob even if the database
operation fails, and would therefore break consistency. IMHO, the time
needed to make it transaction-safe is considerable.

> Any pointers would be extremely helpful.

There might be DB-specific extensions that tell the DB to store large
files externally in the file system (e.g. Oracle's BFILE) but that
would imply coding some custom database persistence manager that knows
how to deal with this situation. Not sure whether those extensions
still work in a transactional context, though.

Kind regards
Dominique


Re: Jackrabbit 1.2.3, RecordInput/DatabaseJournal

Posted by Ian Boston <ie...@tfd.co.uk>.
Dominique,

We have over the past 3-4 years moved away from database persistence for 
the body of files, since a number of universities have 1TB of data or more.

I have no problem putting the metadata in the DB, but if we put the 
bodies in as well the DBAs throw a fit. It is just about bearable for 
Oracle, although shifting backups starts to become a problem, but we have 
seen some interesting results when a few 100G go into a MySQL DB under 
InnoDB, not least in query times.

So the question becomes: how bad, transactionally, is having a DB-based 
PersistenceManager and content (the BLOBs) on the filesystem?

I might be getting confused at this point, and confusing you with my 
lack of knowledge and terminology, so here is what I am using in my 
Workspace definition:

    <PersistenceManager
        class="org.sakaiproject.jcr.jackrabbit.sakai.SakaiPersistanceManager">
        <param name="schema" value="${db.dialect}"/>
        <param name="schemaObjectPrefix" value="jcr_${wsp.name}_"/>
        <param name="externalBLOBs" value="${content.filesystem}"/>
    </PersistenceManager>

Here SakaiPersistanceManager simply overrides the getConnection() 
method of the standard DB persistence manager.

The DB is a standalone MySQL or Oracle instance.

Any pointers would be extremely helpful.

Thanks
Ian




Re: Jackrabbit 1.2.3, RecordInput/DatabaseJournal

Posted by Dominique Pfister <do...@day.com>.
Hi Ian,

On 4/27/07, Ian Boston <ie...@tfd.co.uk> wrote:
> One quick question: which parts of the repository filesystem {rep.home}
> should be in shared space and which in local space on the cluster node?
> I'm using content on the filesystem.

In a clustered environment, using content on the filesystem is not
recommended: since the journal contains only the modified item's
id, not the content itself, all nodes have to save the content in
the same location. Changes made by one node in the cluster should be
isolated from other nodes until the change is actually committed, a
condition the filesystem-based persistence managers do not fulfill.

I'd rather take a database-based persistence manager, where the
database is running standalone and not embedded. If you already use
the DatabaseJournal with a JDBC datasource, it would probably make
sense to use the same database to save your repository data.

Kind regards
Dominique


Re: Jackrabbit 1.2.3, RecordInput/DatabaseJournal

Posted by Ian Boston <ie...@tfd.co.uk>.
Dominique,
Cool, thanks.

I've noticed that if I use the same package name, then I can do what I 
want. That is not exactly good practice, but since 99% of the code in my 
reimplementation is your code, it's not quite so bad, and it makes more 
sense with your license header and all.

One quick question: which parts of the repository filesystem {rep.home} 
should be in shared space and which in local space on the cluster node? 
I'm using content on the filesystem.


Ian



Re: Jackrabbit 1.2.3, RecordInput/DatabaseJournal

Posted by Dominique Pfister <do...@day.com>.
Hi Ian,

On 4/27/07, Ian Boston <ie...@tfd.co.uk> wrote:
> I want to extend or reimplement DatabaseJournal (core.cluster), but there
> is a dependency on RecordInput, which is a protected class (or at least
> has default scope).
>
> So you can't extend AbstractDatabaseJournal except in the same package
> (perhaps that's the answer).
>
> Is there a reason for this, or was it an oversight?

This is definitely an oversight. Ideally, DatabaseJournal should have
a protected method named "getConnection" that may be overridden to
change the way a connection is acquired. I will file a bug for this.
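
The extension point described here could look roughly like the
following. This is a sketch only: SketchJournal and DataSourceJournal are
hypothetical classes written for illustration and do not reproduce the
actual Jackrabbit DatabaseJournal code. Only the idea of a protected,
overridable getConnection method comes from this message.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import javax.sql.DataSource;

/** Hypothetical stand-in for a journal that owns a JDBC connection. */
class SketchJournal {

    private final String url;
    private Connection connection;

    SketchJournal(String url) {
        this.url = url;
    }

    /** Default behaviour: ask DriverManager. Subclasses may override. */
    protected Connection getConnection() throws SQLException {
        return DriverManager.getConnection(url);
    }

    /** Acquires the connection once and keeps it for later use. */
    public void init() throws SQLException {
        connection = getConnection();
    }

    public Connection current() {
        return connection;
    }
}

/** Variant that obtains its connection from a container-managed DataSource. */
class DataSourceJournal extends SketchJournal {

    private final DataSource dataSource;

    DataSourceJournal(DataSource dataSource) {
        super(null); // the JDBC URL is unused in this variant
        this.dataSource = dataSource;
    }

    @Override
    protected Connection getConnection() throws SQLException {
        return dataSource.getConnection();
    }
}
```

With a hook like this, an embedding application such as Sakai would
only need the small subclass, and the rest of the journal logic would
stay untouched.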

> The reason I want to extend it is that I am embedding Jackrabbit into Sakai
> (www.sakaiproject.org), and I would prefer to use a DataSource rather
> than a DriverManager-delivered connection, even if I get the connection and
> keep it.

For the time being, if using a DataSource is an absolute must, there
is nothing I can suggest other than checking out the source code from
svn, applying the required changes directly to your local copy, and
building a new, customized version.

Kind regards
Dominique