Posted to dev@jackrabbit.apache.org by Miro Walker <mi...@cognifide.com> on 2006/02/02 11:38:31 UTC

DB Persistence manager implementation

Hi,

We've been discussing the DB PM implementation, and have a couple of
questions regarding the implementation of this. At the moment, the
Simple DB PM appears to have been implemented using a single connection
with all write operations synchronised on a single object. This would
imply that all writes to the database are single threaded, effectively
making any application using it also run single threaded for write
operations. This appears to have two implications:

1. Performance - in a multi-user system, having single-threaded writes
to the database will make the JDBC connection a serious bottleneck as
soon as the application comes under load. It also means that any
background processing that needs to iterate over the repository making
changes (and we have a few of those) will effectively bring all other
users to a grinding halt. 

2. Transactions - we haven't tested this (as the recent support for
transactions in versioning operations has not been integrated into our
system), but it appears that if a single connection is being used,
then we can only have a single transaction active at any one time. So,
if each user tries to execute a transaction with multiple write
operations in it, and these transactions are to be propagated through to
the database, then each transaction must complete before the next can
begin. This would either mean we get exceptions if the system attempts
to interleave operations from different transactions or that each
transaction must complete in full before another can begin, further
compounding the performance issue.

In addition to the implications of using a single synchronised
connection, another issue appears to be that the system will be unable
to recover from a connection failure. For example, if the system were
deployed onto a highly available database cluster, then in the event of
DB instance failure, any open connections will be killed, but can quite
happily be reopened later. Jackrabbit appears to create a connection on
initialisation, and has no way to recover if that connection is killed.

I know that questions around implementing support for connection pooling
on the DB have been raised before and then dismissed as unimportant, but
this appears to me to be pretty fundamental. By using a connection pool
implementation that supports recreating dead connections and tying a
connection to a transaction context, multiple
transactions could run in parallel, helping throughput and making the
system more reliable.

What do people think? Could we look to use Jakarta commons dbcp?
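To make the "recreating dead connections" idea concrete, here is a minimal hand-rolled sketch (all class and method names are hypothetical; a real deployment would simply use a pooling library such as commons-dbcp rather than this):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hand-rolled sketch of a validating pool (hypothetical names; a real
// deployment would use a library such as Jakarta commons-dbcp instead).
class SimplePool<C> {
    private final Supplier<C> factory;   // opens a new connection
    private final Predicate<C> isAlive;  // validates a pooled connection
    private final BlockingQueue<C> idle;

    SimplePool(Supplier<C> factory, Predicate<C> isAlive, int capacity) {
        this.factory = factory;
        this.isAlive = isAlive;
        this.idle = new ArrayBlockingQueue<>(capacity);
    }

    // Borrow a connection, transparently recreating dead ones.
    C borrow() {
        C c = idle.poll();
        if (c == null || !isAlive.test(c)) {
            c = factory.get(); // dead or missing: open a fresh connection
        }
        return c;
    }

    // Return a connection to the pool (dropped if the pool is full).
    void release(C c) {
        idle.offer(c);
    }
}
```

The key point is the validation on borrow: a killed connection is silently replaced instead of poisoning every later write.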

Cheers,

Miro

Re: DB Persistence manager implementation

Posted by Serge Huber <sh...@jahia.com>.
I also think it would be a good thing for the DB PM to use JNDI 
datasource lookup; this is the most common (modern) way of accessing DB 
resources now.

Regards,
  Serge Huber.

Martin Perez wrote:
> On 2/2/06, Jukka Zitting <ju...@gmail.com> wrote:
>> Hi,
>>
>> You mean that DB PM would use JNDI to lookup the DataSource? This
>> sounds like a good idea; you may want to file a Jira issue for it.
>
> Yes, that's it. My mistake, I should have written the magic word. :-)
>
> JIRA issue created: JCR-313
>
> Martin


Re: DB Persistence manager implementation

Posted by Martin Perez <mp...@gmail.com>.
On 2/2/06, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> You mean that DB PM would use JNDI to lookup the DataSource? This
> sounds like a good idea; you may want to file a Jira issue for it.

Yes, that's it. My mistake, I should have written the magic word. :-)

JIRA issue created: JCR-313

Martin

Re: DB Persistence manager implementation

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 2/2/06, Martin Perez <mp...@gmail.com> wrote:
> Maybe I have not explained myself very well. I'm referring to something like
> this:
>
>    <PersistenceManager
> class="org.apache.jackrabbit.core.state.db.SimpleDbPersistenceManager">
>        <param name="dataSource" value="jdbc/JackrabbitDS"/>

You mean that DB PM would use JNDI to lookup the DataSource? This
sounds like a good idea; you may want to file a Jira issue for it.

BR,

Jukka Zitting

--
Yukatan - http://yukatan.fi/ - info@yukatan.fi
Software craftmanship, JCR consulting, and Java development

Re: DB Persistence manager implementation

Posted by Martin Perez <mp...@gmail.com>.
¿¿??

Maybe I have not explained myself very well. I'm referring to something like
this:

   <PersistenceManager
class="org.apache.jackrabbit.core.state.db.SimpleDbPersistenceManager">
       <param name="dataSource" value="jdbc/JackrabbitDS"/>

       <!-- think also about a way to pass params to the data source;
            it should be simple -->

       <param name="schema" value="mysql"/>
       <param name="schemaObjectPrefix" value="${wsp.name}_"/>
       <param name="externalBLOBs" value="false"/>
   </PersistenceManager>

It seems easy to me, and it should also be fairly easy to use. The
DataSource automatically provides connection pooling, and it does not
introduce any API dependencies.
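As a sketch of what the lookup could look like inside a persistence manager's initialisation (the parameter handling and helper names here are hypothetical, not existing Jackrabbit API):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import javax.naming.InitialContext;
import javax.naming.NamingException;
import javax.sql.DataSource;

// Hypothetical sketch -- not existing Jackrabbit API.
class DataSourceResolver {
    // Heuristic to tell a JNDI name ("jdbc/JackrabbitDS",
    // "java:comp/env/jdbc/JackrabbitDS") from a plain JDBC URL
    // ("jdbc:mysql://...").
    static boolean isJndiName(String value) {
        return value.startsWith("java:") || value.startsWith("jdbc/");
    }

    // Open a connection either via a container-managed (pooled) DataSource
    // or via DriverManager, depending on the configured string value.
    static Connection open(String value) throws NamingException, SQLException {
        if (isJndiName(value)) {
            DataSource ds = (DataSource) new InitialContext().lookup(value);
            return ds.getConnection(); // pooling handled by the container
        }
        return DriverManager.getConnection(value);
    }
}
```

Because the JNDI name is still just a string, such a scheme would even fit the current string-only configuration mechanism.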

Martin

On 2/2/06, Jukka Zitting <ju...@gmail.com> wrote:
>
> Hi,
>
> On 2/2/06, Martin Perez <mp...@gmail.com> wrote:
> > what about allowing the definition of a DataSource reference instead of
> > having to hardcode JDBC URLs?
>
> The current Jackrabbit configuration mechanism only allows string
> configuration parameters for the persistence managers. This
> essentially restricts us to using JDBC URLs. There has been
> discussion about a more flexible configuration mechanism, but this is
> not an immediate goal.
>
> BR,
>
> Jukka Zitting
>
> --
> Yukatan - http://yukatan.fi/ - info@yukatan.fi
> Software craftmanship, JCR consulting, and Java development
>

Re: DB Persistence manager implementation

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 2/2/06, Martin Perez <mp...@gmail.com> wrote:
> what about allowing the definition of a DataSource reference instead of
> having to hardcode JDBC URLs?

The current Jackrabbit configuration mechanism only allows string
configuration parameters for the persistence managers. This
essentially restricts us to using JDBC URLs. There has been
discussion about a more flexible configuration mechanism, but this is
not an immediate goal.
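For contrast, the current string-only style looks roughly like this (parameter names as commonly shown for SimpleDbPersistenceManager; treat the values as illustrative):

```xml
<PersistenceManager
 class="org.apache.jackrabbit.core.state.db.SimpleDbPersistenceManager">
    <param name="driver" value="com.mysql.jdbc.Driver"/>
    <param name="url" value="jdbc:mysql://localhost:3306/jackrabbit"/>
    <param name="user" value="jackrabbit"/>
    <param name="password" value="secret"/>
    <param name="schema" value="mysql"/>
    <param name="schemaObjectPrefix" value="${wsp.name}_"/>
    <param name="externalBLOBs" value="false"/>
</PersistenceManager>
```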

BR,

Jukka Zitting

--
Yukatan - http://yukatan.fi/ - info@yukatan.fi
Software craftmanship, JCR consulting, and Java development

Re: DB Persistence manager implementation

Posted by Martin Perez <mp...@gmail.com>.
Not an expert on DB Persistence managers but...

what about allowing the definition of a DataSource reference instead of
having to hardcode JDBC URLs? As a consequence, there would also be no need
for a DBCP dependency.

Martin

On 2/2/06, Miro Walker <mi...@cognifide.com> wrote:
>
> [...]
>

Re: DB Persistence manager implementation

Posted by Brian Moseley <bc...@osafoundation.org>.
On 2/2/06, Marcel Reutegger <ma...@gmx.net> wrote:

> IMO the purpose of the SimpleDbPersistenceManager is mainly embedded
> databases where a connection failure is highly unlikely, as there is no
> network in between.

there seems to be a lot of fuss about SDPM being "simple" or mainly
for embedded use, but an increasing number of people seem to want to
use jackrabbit against network dbs with connection pools. so let's
give them what they want.

how about renaming SDPM (since "simple" is a vague, non-descriptive
term) to EmbeddedDbPM and providing GenericNetworkDbPM. what would
this new class do that's different from SDPM, other than using JNDI to
look up its connection? i'm not a PM expert, but i have looked over
SDPM a few times, and i don't remember anything else that would need
to change, except maybe the closing of the connection at the end of
the PM's lifecycle.

> If concurrent write performance should become a real issue that's where
> we first have to deal with it.

i can't quantify it yet, but there is some intuition that points to
this being the bottleneck in running cosmo under heavy load (hundreds
of simultaneous webdav requests).

Re: DB Persistence manager implementation

Posted by Marcel Reutegger <ma...@gmx.net>.
Miro Walker wrote:
> We've been discussing the DB PM implementation, and have a couple of
> questions regarding the implementation of this. At the moment, the
> Simple DB PM appears to have been implemented using a single connection
> with all write operations synchronised on a single object. This would
> imply that all writes to the database are single threaded, effectively
> making any application using it also run single threaded for write
> operations. This appears to have two implications:

this is not quite true. the actual store operation on the persistence 
manager is synchronized. however most of the write calls from different 
threads to the JCR api in jackrabbit will not block each other because 
those changes are made in a private transient scope. only the final save 
or commit of the transaction is serialized. that's only one part of the 
whole write process.
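Marcel's point can be sketched in miniature (these are illustrative classes, not actual Jackrabbit types):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative classes, not actual Jackrabbit types.
// The shared store serializes only the final commit ...
class SharedStore {
    private final Map<String, String> items = new HashMap<>();

    // The only serialized step: applying a session's change set.
    synchronized void save(Map<String, String> changes) {
        items.putAll(changes);
    }

    synchronized String get(String id) {
        return items.get(id);
    }
}

// ... while each session buffers its writes in a private transient scope,
// so writes from different sessions never block each other.
class SessionSketch {
    private final Map<String, String> transientScope = new HashMap<>();
    private final SharedStore store;

    SessionSketch(SharedStore store) {
        this.store = store;
    }

    void setProperty(String id, String value) {
        transientScope.put(id, value); // touches only private state
    }

    void save() {
        store.save(transientScope); // the serialized commit point
        transientScope.clear();
    }
}
```

Until save() is called, a session's changes live only in its own map, which is why concurrent writers do not contend on the single synchronized store.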

> 1. Performance - in a multi-user system, having single-threaded writes
> to the database will make the JDBC connection a serious bottleneck as
> soon as the application comes under load. It also means that any
> background processing that needs to iterate over the repository making
> changes (and we have a few of those) will effectively bring all other
> users to a grinding halt. 

this depends very much on the use case. again, all changes that such a 
background process makes are first made in a transient scope, and other 
sessions are only affected (if at all) when the changes are stored in the 
persistence manager.
while one session stores changes, other sessions are still able to read 
certain items, as long as those are available in the 
LocalItemStateManager. only when other sessions access items that are not 
available in their LocalItemStateManager will they be blocked until the 
store is finished.

> 2. Transactions - we haven't tested this (as the recent support for
> transactions in versioning operations has not been integrated into our
> system), but it appears that if a single connection is being used,
> then we can only have a single transaction active at any one time. So,
> if each user tries to execute a transaction with multiple write
> operations in it, and these transactions are to be propagated through to
> the database, then each transaction must complete before the next can
> begin. This would either mean we get exceptions if the system attempts
> to interleave operations from different transactions or that each
> transaction must complete in full before another can begin, further
> compounding the performance issue.

the scopes of a JCR transaction and a transaction on the underlying 
database that is used by jackrabbit are not the same. A JCR transaction 
starts with the first modified item, whereas the transaction of the 
underlying database starts with the call to Item.save() or 
Session.save() or the JTA transaction commit (whatever you prefer ;)).

that basically means JCR transactions can run in parallel for most of 
the time, only the commit phase of the JCR transaction is serialized.

> In addition to the implications of using a single synchronised
> connection, another issue appears to be that the system will be unable
> to recover from a connection failure. For example, if the system were
> deployed onto a highly available database cluster, then in the event of
> DB instance failure, any open connections will be killed, but can quite
> happily be reopened later. Jackrabbit appears to create a connection on
> initialisation, and has no way to recover if that connection is killed.

This is certainly an issue with the SimpleDbPersistenceManager. I guess 
that's why it is called Simple...

IMO the purpose of the SimpleDbPersistenceManager is mainly embedded 
databases where a connection failure is highly unlikely, as there is no 
network in between.

> I know that questions around implementing support for connection pooling
> on the DB have been raised before and then dismissed as unimportant, but
> this appears to me to be pretty fundamental. By using a connection pool
> implementation that supports recreating dead connections and tying a
> connection to a transaction context, multiple
> transactions could run in parallel, helping throughput and making the
> system more reliable.

even if such a persistence manager allows concurrent writes, it is still 
the responsibility of the caller to ensure consistency. in our case 
that's the SharedItemStateManager. And that's the place where 
transactions are currently serialized, but only on commit.

If concurrent write performance should become a real issue that's where 
we first have to deal with it.

regards
  marcel

Re: DB Persistence manager implementation

Posted by Chandresh Turakhia <ch...@bhartitelesoft.com>.
Team,

Both Oracle and IBM now give away free versions of their databases. Can we 
start shipping a DB PersistenceManager for Oracle / DB2 built into 
Jackrabbit, since it can be packaged?

I do agree it would be nice to look at Jakarta commons dbcp.

Quick question 1) Oracle and DB2 come with multimedia features. Since much 
CMS content is multimedia, can we extend the OraclePersistenceManager to 
include the same? Can anyone suggest how to use most of the Oracle 
interMedia features within the repository architecture? I would want to do 
the same for my project.

Quick question 2) Most CMSs want to store metadata as RDF. Oracle supports 
RDF storage. Can the repository be extended for the same?

Thanks in advance.


Chand

FYI

http://www-306.ibm.com/software/data/db2/udb/edition-expressc.html
DB2 Express-C is completely "free" to download, develop, deploy, test, run, 
embed, and redistribute.
Restrictions:
  - Maximum processors: 2
  - Maximum addressable memory: 4GB

www.oracle.com/technology/products/database/xe/index.html
Oracle Database 10g Express Edition (Oracle Database XE) is an entry-level, 
small-footprint database based on the Oracle Database 10g Release 2 code 
base that's free to develop, deploy, and distribute; fast to download; and 
simple to administer. Oracle Database XE is a great starter database for:

  - Developers working on PHP, Java, .NET, and Open Source applications
With Oracle Database XE, currently available as a Beta release for Windows 
and Linux, you can now develop and deploy applications with a powerful, 
proven, industry-leading infrastructure, and then upgrade when necessary 
without costly and complex migrations. A production release is scheduled for 
early 2006.

Restrictions:
Oracle Database XE can be installed on any size host machine with any number 
of CPUs, but this free version of the world's leading database will store up 
to 4GB of user data, use up to 1GB of memory, and use one CPU on the host 
machine.
Regards

Chand

----- Original Message ----- 
From: "Miro Walker" <mi...@cognifide.com>
To: <ja...@incubator.apache.org>
Sent: Thursday, February 02, 2006 2:38 AM
Subject: DB Persistence manager implementation


> [...]