Posted to users@jackrabbit.apache.org by Miro Walker <mi...@gmail.com> on 2006/11/15 09:46:48 UTC

Re: JackRabbit maturity?- robustness, performance and scalability

Hi Shaun,

Our experience with production systems has largely been with Day's
commercially licensed version of Jackrabbit, CRX, which contains some
proprietary extensions. However, it's sufficiently similar that many
of the points you raise have similar answers across both systems.

Our experiences to date have indicated that there isn't a straight
answer to the questions you ask - much depends upon what you are
trying to do with the system. For example:

>  * performance with lots of nodes - any comments on the best persistence
> manager/config to use over and above the FAQ comments.

Key factors here are:
* your data model - Jackrabbit does not handle large flat node
hierarchies well, so it is sometimes necessary to artificially deepen
the hierarchy to address this.
* the persistence manager - the way in which JR stores data in the
underlying database has a big effect on performance (e.g. remote vs.
local db, persistence manager mapping to database tables).
* use of versioning / transactions - use of these features carries a
performance overhead (in some cases significant).
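
As a rough illustration of the "deepen the hierarchy" point, here is a minimal sketch of one way to spread many sibling nodes over intermediate levels. It is plain Java rather than JCR API code, the class and method names are hypothetical, and the choice of two levels of 256 buckets each is an assumption to tune for your own data:

```java
import java.util.Locale;

public final class BucketedPath {
    private BucketedPath() {}

    /**
     * Returns a relative path of the form "xx/yy/nodeName", where xx and yy
     * are two hex-encoded bytes taken from the name's hash code.
     * The two-level, 256-bucket layout is an assumption, not a Jackrabbit rule.
     */
    public static String bucketedPath(String nodeName) {
        int hash = nodeName.hashCode();
        // Each level uses one byte of the hash, giving up to 256 children per level.
        String level1 = String.format(Locale.ROOT, "%02x", (hash >>> 8) & 0xff);
        String level2 = String.format(Locale.ROOT, "%02x", hash & 0xff);
        return level1 + "/" + level2 + "/" + nodeName;
    }

    public static void main(String[] args) {
        // Hypothetical node name; the same name always maps to the same path.
        System.out.println(bucketedPath("article-000123"));
    }
}
```

The resulting path would then be used when creating or looking up the node, so that no single parent ends up with an unbounded number of direct children.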

Reliability
>  * reliability of the persistence - how likely is corruption of the
> persisted objects?

Again this depends... Use of versionable nodes seems to be a problem
at the moment. We've seen significant issues with data loss and
corruption in live environments because of the current transaction
handling when storing versionable nodes. This is to do with the fact
that JR does not have support for true distributed transactions, but
maintains separate connections to the workspace and the version
storage. If one of these fails and rolls back you can end up with a
corrupt repository that then needs to be fixed "by hand" with possible
loss of data.
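
To make the failure mode concrete, here is a toy model of two stores that each commit independently. This is not Jackrabbit code: the class names, the stored values and the commit order are all assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

public final class TwoStoreCommit {
    private TwoStoreCommit() {}

    /** A toy store with its own commit, standing in for one database connection. */
    public static final class Store {
        public final Map<String, String> committed = new HashMap<>();
        public final Map<String, String> pending = new HashMap<>();
        public boolean failOnCommit;

        public void write(String key, String value) {
            pending.put(key, value);
        }

        public void commit() {
            if (failOnCommit) {
                pending.clear(); // rolls back this store only
                throw new IllegalStateException("commit failed");
            }
            committed.putAll(pending);
            pending.clear();
        }
    }

    /** Commits each store separately; returns true only if both commits succeeded. */
    public static boolean saveVersionable(Store workspace, Store versions, String nodeId) {
        workspace.write(nodeId, "baseVersion=1.1");
        versions.write(nodeId, "1.1");
        try {
            workspace.commit(); // once this succeeds it cannot be undone here
            versions.commit();  // may still fail independently
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Store workspace = new Store();
        Store versions = new Store();
        versions.failOnCommit = true; // simulate a failure on the second connection
        boolean ok = saveVersionable(workspace, versions, "node-1");
        System.out.println("ok=" + ok
                + " workspace=" + workspace.committed
                + " versions=" + versions.committed);
    }
}
```

In this model the workspace ends up pointing at a version that the version store never recorded, which is the kind of inconsistency that then has to be repaired manually. A true distributed (two-phase) commit would avoid it by not finalising either side until both are prepared.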

There are other issues, such as current lack of failover support,
search-indexes not being transactional (afaik still?), the need to
restart jackrabbit in the event of transient loss of connectivity to
the database, etc., but these are comparatively more minor.

>  * scalability - has JackRabbit been proven to handle lots of concurrent
> access? Can it yet be clustered? Any equivalent to the replication provided
> by Day?

There's some work Dominique's doing now on clustering - see JCR-623
(http://issues.apache.org/jira/browse/JCR-623). In terms of concurrent
simple read access, JR is pretty damned fast, so handling lots (how
much are you thinking of here?) of concurrent access is unlikely to be
a problem even without clustering support. For write access or
versioning, etc.

>
> Any insight from developers with live systems based on JackRabbit would be
> gratefully received and provide reassurance that JackRabbit is a suitable
> choice.
>

Hope that's useful and hasn't put you off too much :-).

Miro

Re: JackRabbit maturity?- robustness, performance and scalability

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On 11/15/06, Stefan Guggisberg <st...@gmail.com> wrote:
> On 11/15/06, Miro Walker <mi...@gmail.com> wrote:
> > The issue with the SimpleDBPM is that it stores all data as name-value
> > pairs in the DB, which can be low-performance (writing a single node
> > can require numerous insert statements). There are other approaches
>
> i don't know what you mean by name-value pairs, however a node is persisted
> using *one* insert/update statement.

He's referring to the fact that properties are stored in separate
records. Thus, if you save a node with 10 properties, you end up with
11 database records. CRX and some other proprietary persistence
managers save the whole node in a single record, achieving a major
performance boost for some use cases.
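
To put numbers on that, here is a small back-of-the-envelope sketch. The class and method names are made up for illustration, and the repository size is hypothetical; only the "one record per node plus one per property" rule comes from the description above.

```java
public final class RecordCount {
    private RecordCount() {}

    /** One record per node plus one record per property. */
    public static long separatePropertyRecords(long nodes, int propertiesPerNode) {
        return nodes * (1L + propertiesPerNode);
    }

    /** Whole node, properties included, stored in a single record. */
    public static long singleRecordPerNode(long nodes) {
        return nodes;
    }

    public static void main(String[] args) {
        long nodes = 100_000L; // hypothetical repository size
        int props = 10;        // properties per node, as in the example above
        System.out.println(separatePropertyRecords(nodes, props)); // 1100000
        System.out.println(singleRecordPerNode(nodes));            // 100000
    }
}
```

Record count is of course only a proxy; actual performance also depends on record size, indexing and how the records are batched.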

> > using DDL to dynamically generate tables in the DB that conform to
> > nodetypes (so an entire node can be written with a single insert), but
>
> i am still waiting for the proof that a normalized or node type-reflecting
> schema would be superior in terms of performance. i'd also be interested
> to learn how residual child nodes/properties (e.g. nt:unstructured), non-typed
> properties and multi-valued properties would be handled.

There's definitely still a lot of uncharted territory in finding the
optimal persistence models for JCR content trees. For example I'd be
very interested in experimenting with a direct binary persistence
model, which could potentially offer major performance gains. I'm also
convinced that the current PersistenceManager interface is not optimal
from a performance point of view. I'll get back to these on the dev
list when I have some concrete suggestions.

BR,

Jukka Zitting

Re: JackRabbit maturity?- robustness, performance and scalability

Posted by Stefan Guggisberg <st...@gmail.com>.

On 11/15/06, Miro Walker <mi...@gmail.com> wrote:
> Shaun,
>
> >  * versioning - our current model makes use of nt:hierarchyNode,
> > mix:referenceable, mix:lockable and mix:versionable. From your comments it
> > sounds like using mix:versionable will significantly reduce the reliability
> > of JackRabbit. Would you recommend NOT using mix:versionable therefore?
> >
> Well, it's a tricky question. We've been using mix:versionable and it
> does what it says on the tin, but we have had issues as described (see
> also JCR-566, JCR-630, JCR-631). All versioning will have a large
> overhead - your data volume will grow very quickly and you're adding a
> lot of complexity, so if versioning is optional in your application,
> it may be better not to use it. If you do have to include versioning,
> then using jackrabbit's support for it is still probably better than
> trying to roll your own, as these issues will certainly be resolved
> eventually.
>
> >  * persistence - we'd prefer to use the SimpleDbPersistenceManager with
> > MySql. Is this a popular/reliable combination?
> >
> The issue with the SimpleDBPM is that it stores all data as name-value
> pairs in the DB, which can be low-performance (writing a single node
> can require numerous insert statements). There are other approaches

i don't know what you mean by name-value pairs, however a node is persisted
using *one* insert/update statement.

> using DDL to dynamically generate tables in the DB that conform to
> nodetypes (so an entire node can be written with a single insert), but

i am still waiting for the proof that a normalized or node type-reflecting
schema would be superior in terms of performance. i'd also be interested
to learn how residual child nodes/properties (e.g. nt:unstructured), non-typed
properties and multi-valued properties would be handled.

cheers
stefan

> not currently committed to jackrabbit itself.
>
> >  * Fix "by hand" - Given that some persistence managers use binary
> > serialization, how do you go about correcting the integrity of the database?
> > The prospect scares me but it's not uncommon with applications operating
> > on top of schemas with complex referential integrity.
> >
> Well, it depends on the issue, but there are a couple of approaches -
> you can extract the blobs from the DB, reverse-engineer the underlying
> data, edit it and write it back. Alternatively you can use jackrabbit
> itself to edit the data. This would be done by, for example, editing
> the version history itself. This is normally read-only, so to do so
> requires bypassing some of the standard JCR API methods.
>
> >  * you mentioned Day CRX. We also installed this and were initially
> > impressed with the polished package however since then we've found some
> > significant problems with the Content Explorer etc which sow the seed of
> > doubt that there are potentially bigger issues under the covers. It's
> > important for us to have a commercial alternative so I'd welcome any
> > comments/experiences on using Day versus JackRabbit - for example, is
> > mix:versionable viable with Day?
> >
> I'm happy to discuss this, but this is probably not the best forum -
> feel free to email me offline.
>
> Cheers,
>
> Miro
>

Re: JackRabbit maturity?- robustness, performance and scalability

Posted by Miro Walker <mi...@gmail.com>.

Shaun,

>  * versioning - our current model makes use of nt:hierarchyNode,
> mix:referenceable, mix:lockable and mix:versionable. From your comments it
> sounds like using mix:versionable will significantly reduce the reliability
> of JackRabbit. Would you recommend NOT using mix:versionable therefore?
>
Well, it's a tricky question. We've been using mix:versionable and it
does what it says on the tin, but we have had issues as described (see
also JCR-566, JCR-630, JCR-631). All versioning will have a large
overhead - your data volume will grow very quickly and you're adding a
lot of complexity, so if versioning is optional in your application,
it may be better not to use it. If you do have to include versioning,
then using jackrabbit's support for it is still probably better than
trying to roll your own, as these issues will certainly be resolved
eventually.

>  * persistence - we'd prefer to use the SimpleDbPersistenceManager with
> MySql. Is this a popular/reliable combination?
>
The issue with the SimpleDBPM is that it stores all data as name-value
pairs in the DB, which can be low-performance (writing a single node
can require numerous insert statements). There are other approaches
using DDL to dynamically generate tables in the DB that conform to
nodetypes (so an entire node can be written with a single insert), but
not currently committed to jackrabbit itself.

>  * Fix "by hand" - Given that some persistence managers use binary
> serialization, how do you go about correcting the integrity of the database?
> The prospect scares me but it's not uncommon with applications operating
> on top of schemas with complex referential integrity.
>
Well, it depends on the issue, but there are a couple of approaches -
you can extract the blobs from the DB, reverse-engineer the underlying
data, edit it and write it back. Alternatively you can use jackrabbit
itself to edit the data. This would be done by, for example, editing
the version history itself. This is normally read-only, so to do so
requires bypassing some of the standard JCR API methods.

>  * you mentioned Day CRX. We also installed this and were initially
> impressed with the polished package however since then we've found some
> significant problems with the Content Explorer etc which sow the seed of
> doubt that there are potentially bigger issues under the covers. It's
> important for us to have a commercial alternative so I'd welcome any
> comments/experiences on using Day versus JackRabbit - for example, is
> mix:versionable viable with Day?
>
I'm happy to discuss this, but this is probably not the best forum -
feel free to email me offline.

Cheers,

Miro

RE: JackRabbit maturity?- robustness, performance and scalability

Posted by Shaun Barriball <sb...@yahoo.co.uk>.

Hi Miro et al,
Thanks for the detailed insight. To pick up on some key points:

 * versioning - our current model makes use of nt:hierarchyNode,
mix:referenceable, mix:lockable and mix:versionable. From your comments it
sounds like using mix:versionable will significantly reduce the reliability
of JackRabbit. Would you recommend NOT using mix:versionable therefore?

 * persistence - we'd prefer to use the SimpleDbPersistenceManager with
MySql. Is this a popular/reliable combination?

 * Fix "by hand" - Given that some persistence managers use binary
serialization, how do you go about correcting the integrity of the database?
The prospect scares me but it's not uncommon with applications operating
on top of schemas with complex referential integrity.

 * you mentioned Day CRX. We also installed this and were initially
impressed with the polished package however since then we've found some
significant problems with the Content Explorer etc which sow the seed of
doubt that there are potentially bigger issues under the covers. It's
important for us to have a commercial alternative so I'd welcome any
comments/experiences on using Day versus JackRabbit - for example, is
mix:versionable viable with Day?

Overall, your comments haven't 'put me off'. All persistence tiers have
their problems as they mature - this doesn't negate the value-add JackRabbit
provides over and above building a custom OR/RDBMS solution. I'm happy to
share our results with this list as we perform various tests.

Regards,
Shaun.

-----Original Message-----
From: Miro Walker [mailto:miro.walker@gmail.com] 
Sent: 15 November 2006 08:47
To: users@jackrabbit.apache.org
Subject: Re: JackRabbit maturity?- robustness, performance and scalability

Hi Shaun,

Our experience with production systems has largely been with Day's
commercially licensed version of Jackrabbit, CRX, which contains some
proprietary extensions. However, it's sufficiently similar that many of the
points you raise have similar answers across both systems.

Our experiences to date have indicated that there isn't a straight answer to
the questions you ask - much depends upon what you are trying to do with
the system. For example:

>  * performance with lots of nodes - any comments on the best 
> persistence manager/config to use over and above the FAQ comments.

Key factors here are:
* your data model - Jackrabbit does not handle large flat node hierarchies
well, so it is sometimes necessary to artificially deepen the hierarchy to
address this.
* the persistence manager - the way in which JR stores data in the
underlying database has a big effect on performance (e.g. remote vs.
local db, persistence manager mapping to database tables).
* use of versioning / transactions - use of these features carries a
performance overhead (in some cases significant).

Reliability
>  * reliability of the persistence - how likely is corruption of the 
> persisted objects?

Again this depends... Use of versionable nodes seems to be a problem at the
moment. We've seen significant issues with data loss and corruption in live
environments because of the current transaction handling when storing
versionable nodes. This is to do with the fact that JR does not have support
for true distributed transactions, but maintains separate connections to the
workspace and the version storage. If one of these fails and rolls back you
can end up with a corrupt repository that then needs to be fixed "by hand"
with possible loss of data.

There are other issues, such as current lack of failover support,
search-indexes not being transactional (afaik still?), the need to restart
jackrabbit in the event of transient loss of connectivity to the database,
etc., but these are comparatively more minor.

>  * scalability - has JackRabbit been proven to handle lots of 
> concurrent access? Can it yet be clustered? Any equivalent to the 
> replication provided by Day?

There's some work Dominique's doing now on clustering - see JCR-623
(http://issues.apache.org/jira/browse/JCR-623). In terms of concurrent
simple read access, JR is pretty damned fast, so handling lots (how much are
you thinking of here?) of concurrent access is unlikely to be a problem even
without clustering support. For write access or versioning, etc.

>
> Any insight from developers with live systems based on JackRabbit 
> would be gratefully received and provide reassurance that JackRabbit 
> is a suitable choice.
>

Hope that's useful and hasn't put you off too much :-).

Miro
