Posted to dev@jackrabbit.apache.org by Walter Raboch <wr...@ingen.at> on 2005/07/06 22:14:40 UTC

Scalability/Clustering

Hi all,

we just plan to use JackRabbit in an e-learning project with a few
hundred concurrent users. Therefore I am a little concerned about
scalability.

Some figures we forecast for the first expansion stage:
  1.000.000 Nodes
10.000.000 Properties (around 10 properties/node)
      3.000 Named Users (about 10% concurrent)

We think of a n-tier architecture with a web and application layer, a
repository layer and the database layer with 2 or more nodes for each
layer. There are both Java and .NET applications accessing the content
in the repository, so we are planning to implement a .NET client for
JSR170 too.

What would be the best deployment model for such a situation in your
opinion?

Are there any efforts to make jackrabbit clustered for a load-sharing
scenario (no session failover at the repository layer)?

After reading a lot of code, I think following changes should do it:

- extending ObservationManager to send and receive Events to
   and from other nodes

- implementing/extending an ORM Layer (Hibernate with shared caching for
   performance). The persistence implementation should be aware of the
   node types and allow a type specific mapping to tables. So we can map
   nodetypes with many instances to own tables while maintaining
   flexibility for new "simple" nodetypes.

- extending LockManager to sync locks with other Nodes

- Lucene should be independent on each node but be aware of new nodes
   and changes -> Events from ObservationManager

- Config - the cluster should have a central place for config management

- some intelligence in the JCR-RMI client to find a content repository
   node from the cluster depending on node state (load, shutdown, ...)
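
To make the ObservationManager point concrete, a rough sketch of such an
event relay could look like this (all class names are hypothetical, not
actual Jackrabbit API; a real transport such as JMS or JGroups would
replace the in-memory bus):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of cluster-wide observation: every repository node
// broadcasts its local change events and applies the events of its peers.
// None of these types exist in Jackrabbit; they only illustrate the idea.
public class ClusterEventRelayDemo {

    static class ChangeEvent {
        final String path, type, originId;
        ChangeEvent(String path, String type, String originId) {
            this.path = path; this.type = type; this.originId = originId;
        }
    }

    // trivial in-memory "network", standing in for a real transport
    static class InMemoryBus {
        private final List<Consumer<ChangeEvent>> subscribers = new ArrayList<>();
        void subscribe(Consumer<ChangeEvent> handler) { subscribers.add(handler); }
        void broadcast(ChangeEvent e) { subscribers.forEach(s -> s.accept(e)); }
    }

    static class ClusterEventRelay {
        final List<ChangeEvent> remoteEvents = new ArrayList<>();

        private final String localId;
        private final InMemoryBus bus;

        ClusterEventRelay(String localId, InMemoryBus bus) {
            this.localId = localId;
            this.bus = bus;
            // apply remote events locally, ignoring echoes of our own broadcasts
            bus.subscribe(e -> {
                if (!e.originId.equals(localId)) remoteEvents.add(e);
            });
        }

        // would be called by the local observation mechanism on each change
        void onLocalEvent(String path, String type) {
            bus.broadcast(new ChangeEvent(path, type, localId));
        }
    }

    public static void main(String[] args) {
        InMemoryBus bus = new InMemoryBus();
        ClusterEventRelay nodeA = new ClusterEventRelay("A", bus);
        ClusterEventRelay nodeB = new ClusterEventRelay("B", bus);

        nodeA.onLocalEvent("/courses/java101", "NODE_ADDED");

        System.out.println(nodeB.remoteEvents.size()); // prints 1: B saw A's change
        System.out.println(nodeA.remoteEvents.size()); // prints 0: A ignores its echo
    }
}
```

The LockManager synchronization could presumably ride on the same channel.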

What else should be synchronized between the nodes?
Did I overlook something?

I am happy about any suggestions, even if you discourage us from using
jackrabbit. Of course we would release some of these developments to the
community - if someone is interested.

thx in advance,

cheers
Walter




Re: Scalability/Clustering

Posted by Andreas Hartmann <an...@apache.org>.
Hi Walter and Jackrabbit devs,

Walter Raboch wrote:
> Hi all,
> 
> we just plan to use JackRabbit in an e-learning project with a few
> hundred concurrent users. Therefore I am a little concerned about
> scalability.

did you find a solution to the scalability problem?

We're faced with a similar architectural challenge (web application
with up to about 1000 concurrent users, mostly read operations,
probably based on Cocoon/Lenya + Tomcat).

Could you perhaps share any experience on whether Jackrabbit is suitable
for a project of this scale? We have to investigate whether the major load
will be on the web application or on the repository, but I suspect
that the repository traffic is rather moderate (queries to select
one of 100.000 items will probably cause the biggest impact).

Thank you very much for any hints!

-- Andreas

> 
> Some figures we forecast for the first expansion stage:
>  1.000.000 Nodes
> 10.000.000 Properties (around 10 properties/node)
>      3.000 Named Users (about 10% concurrent)
> 
> We think of a n-tier architecture with a web and application layer, a
> repository layer and the database layer with 2 or more nodes for each
> layer. There are both Java and .NET applications accessing the content
> in the repository, so we are planning to implement a .NET client for
> JSR170 too.
> 
> What would be the best deployment model for such a situation in your
> opinion?
> 
> Are there any efforts to make jackrabbit clustered for a load sharing
> scenario (no session failover at repository layer) ?
> 
> After reading a lot of code, I think following changes should do it:
> 
> - extending ObservationManager to send and receive Events to
>   and from other nodes
> 
> - implementing/extending an ORM Layer (Hibernate with shared caching for
>   performance). The persistence implementation should be aware of the
>   node types and allow a type specific mapping to tables. So we can map
>   nodetypes with many instances to own tables while maintaining
>   flexibility for new "simple" nodetypes.
> 
> - extending LockManager to sync locks with other Nodes
> 
> - Lucene should be independent on each node but be aware of new nodes
>   and changes -> Events from ObservationManager
> 
> - Config - the cluster should have a central place for config management
> 
> - some intelligence in the JCR-RMI client to find a content repository
>   node from the cluster depending on node state (load, shutdown, ...)
> 
> What else should be synchronized between the nodes?
> Did I overlook something?
> 
> I am happy about any suggestions, even if you discourage us from using
> jackrabbit. Of course we would release some of these developments to the
> community - if someone is interested.
> 
> thx in advance,
> 
> cheers
> Walter
> 
> 
> 
> 


-- 
Andreas Hartmann
Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
andreas.hartmann@wyona.com                     andreas@apache.org

Re: Scalability/Clustering

Posted by Serge Huber <sh...@jahia.com>.
David Nuescheler wrote:

>we just recently ran a test using jackrabbit and cqfs
>populating roughly 5m items (~500k nodes) and
>even without using an rdbms back end we did not
>run into issues. the performance of the persistence layer 
>degraded over time though.
>  
>
Don't you mean you got good performance because you were NOT using a 
database? Although I've been a proponent of DB storage, I also know 
that there will always be an overhead compared to raw file access. There 
are other advantages though (as you've summarized here: 
http://www.day.com/site/en/index/products/content-centric_infrastructure/content_repository/crx_faq.html 
:) )

>>Are there any efforts to make jackrabbit clustered for a load sharing
>>scenario (no session failover at repository layer) ?
>>    
>>
>i think there are a couple of caches that need to be made 
>clusterable (or at least pluggable) in the jackrabbit core for 
>that to happen efficiently, it has to be done very carefully, 
>but it should not be too much work i think.
>
>this is definitely on the roadmap and investigations into that
>direction have already happened.
>  
>
 From what I have seen, making the cache implementation pluggable would 
be a good and necessary first step. It then becomes possible to use OSCache, 
JBoss TreeCache or Tangosol Coherence, which all handle clustered caches.
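
As a sketch of how small that pluggable boundary could be (the interface is
invented for illustration; Jackrabbit's real cache internals differ), a
clustered implementation from one of those products would simply replace
the local one:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a pluggable item cache: the core codes against a
// minimal interface, so a clustered implementation (OSCache, JBoss TreeCache,
// Tangosol Coherence) can be swapped in via configuration.
public class PluggableCacheDemo {

    interface ItemCache {
        Object get(String itemId);
        void put(String itemId, Object state);
        void evict(String itemId); // a clustered impl would propagate this to peers
    }

    // default single-VM implementation
    static class LocalItemCache implements ItemCache {
        private final Map<String, Object> map = new HashMap<>();
        public Object get(String itemId) { return map.get(itemId); }
        public void put(String itemId, Object state) { map.put(itemId, state); }
        public void evict(String itemId) { map.remove(itemId); }
    }

    public static void main(String[] args) {
        ItemCache cache = new LocalItemCache(); // chosen by configuration in a real setup
        cache.put("node-42", "some item state");
        System.out.println(cache.get("node-42")); // prints: some item state
        cache.evict("node-42");
        System.out.println(cache.get("node-42")); // prints: null
    }
}
```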

>>- implementing/extending an ORM Layer (Hibernate with shared caching for
>>  performance). The persistence implementation should be aware of the
>>  node types and allow a type specific mapping to tables. So we can map
>>  nodetypes with many instances to own tables while maintaining
>>  flexibility for new "simple" nodetypes.
>>    
>>
>
>  
>
One quick note about the current ORM implementation: the current 
implementation that I've worked on with Jackrabbit can be improved. Feel 
free to have a look and contribute! But what David is saying is true: 
for performance, the higher you can cache, the better!

Regards,
  Serge Huber.

Re: Scalability/Clustering

Posted by Serge Huber <sh...@jahia.com>.
Walter Raboch wrote:

>
> There would be a hybrid solution too: store structure info and 
> attributes in a DBMS and BLOBs in the filesystem. The project "daisy" 
> is using just this approach. (http://new.cocoondev.org/daisy/index.html)

Again, storing BLOBs in the filesystem can cause problems in cluster 
scenarios. You then have the options of:

- sharing the filesystem using an NFS-like sharing system. You also need 
to set it up to avoid single-point-of-failure problems.
- replicating, but this means that not all nodes will immediately 
have coherent content.

>
>> One quick note about the current ORM implementation. The current 
>> implementation that I've worked on with Jackrabbit can be improved. 
>
> > Feel free to have a look and contribute! But what David is saying
> > is true: for performance, the higher you can cache, the better!
>
> I am glad that you already invested so much time in a base I can work 
> on. I like your solution, but would prefer making the mapping 
> configurable on a per-NodeType basis. I just started working on this.
>
Ok, I'm curious to see what it will look like :)

Regards,
  Serge Huber.

Re: Scalability/Clustering

Posted by Walter Raboch <wr...@ingen.at>.
Hi David, Hi Serge,

> cool. i am currently trying to get at least a common .NET port
> of the API put together in jackrabbit (just like markus did it for PHP)
> are you interested in helping with that?
> i think a .NET client using the WebDAV JCR remoting could 
> be a very interesting option.
> http://www.day.com/jsr170/server/JCR_Webdav_Protocol.zip

yes I am interested... is there already some code, or how do we begin?


filesystem vs database:

I see the advantages of both ways, but think that database storage is

- easier to sell to a customer, because they have trusted databases for
   decades now

- backup: there are many solutions out there, and the databases are
   already backed up at customer sites - so no extra effort

- more scalable: databases have been tuned for large amounts of data
  (especially small entities; we all know that BLOBs kill a DBMS)

I would be fine with filesystem storage if replication (fully 
transactional over the cluster) is available. But this has to be totally 
transparent to the JCR client.

I understand the deployment with multiple JCR repositories, each holding a 
subset of data for a specific user group plus some shared, replicated 
data that does not change frequently. But to support this, you have to 
group users, which is extremely hard, especially in our planned application.

There would be a hybrid solution too: store structure info and 
attributes in a DBMS and BLOBs in the filesystem. The project "daisy" is 
using just this approach. (http://new.cocoondev.org/daisy/index.html)
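
As a sketch of that hybrid split (table layout and file naming invented for
illustration; a map stands in for the DBMS):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the hybrid approach: structure and properties go to
// the DBMS, while BLOBs are written to the filesystem and only their location
// is stored with the properties. A map stands in for the properties table.
public class HybridStoreDemo {

    static final Map<String, String> propertyTable = new HashMap<>();

    // stands in for an INSERT into the properties table
    static void storeProperty(String nodeId, String name, String value) {
        propertyTable.put(nodeId + "/" + name, value);
    }

    static void storeBlob(Path blobDir, String nodeId, String name, byte[] data)
            throws Exception {
        Path file = blobDir.resolve(nodeId + "_" + name + ".bin");
        Files.write(file, data); // the bytes live on the filesystem ...
        // ... while the DB row holds only a reference to the file
        storeProperty(nodeId, name, "file:" + file.getFileName());
    }

    public static void main(String[] args) throws Exception {
        Path blobDir = Files.createTempDirectory("blobs");
        storeProperty("n1", "title", "Course 101");
        storeBlob(blobDir, "n1", "content", "lots of bytes...".getBytes());

        System.out.println(propertyTable.get("n1/title"));   // prints: Course 101
        System.out.println(propertyTable.get("n1/content")); // prints: file:n1_content.bin
    }
}
```

In a cluster, the filesystem side then still needs its own sharing or
replication story, as discussed above.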

>>Are there any efforts to make jackrabbit clustered for a load sharing
>>scenario (no session failover at repository layer) ?
> 
> i think there are a couple of caches that need to be made 
> clusterable (or at least pluggable) in the jackrabbit core for 
> that to happen efficiently, it has to be done very carefully, 
> but it should not be too much work i think.
> 
> this is definitely on the roadmap and investigations into that
> direction have already happened.

is there any information around about these investigations?

> From what I have seen making the cache implementation pluggable 
> would be a good necessary first step. It then becomes possible to 
 > use OSCache, JBossTreeCache or Tangosol Coherence that all handle
 > clustered caches.

I have been thinking about the same approach. I like the plugin concept 
because you can better tune jackrabbit for the situation at hand.

>>After reading a lot of code, I think following changes should do it:
>>- extending ObservationManager to send and receive Events to
>>  and from other nodes
> 
> maybe... personally i would like to have that functionality closer
> to the core, to keep things as transactional as possible across
> the cluster.

It's OK - the closer to the core, the more transparent the solution is for 
other parts of jackrabbit. What would you recommend?

>>- implementing/extending an ORM Layer (Hibernate with shared caching for
>>  performance). The persistence implementation should be aware of the
>>  node types and allow a type specific mapping to tables. So we can map
>>  nodetypes with many instances to own tables while maintaining
>>  flexibility for new "simple" nodetypes.
> 
> i think that you may get a better performance impact by implementing
> the shared cache on a higher layer in the jackrabbit architecture.
> on a completely different note, some people probably also like to map 
> nodetypes to tables for "aesthetic" reasons...

> One quick note about the current ORM implementation. The current 
> implementation that I've worked on with Jackrabbit can be improved. 
 > Feel free to have a look and contribute ! But what David is saying
 > is true : for performance, the higher you can cache, the better !

I am glad that you already invested so much time in a base I can work 
on. I like your solution, but would prefer making the mapping 
configurable on a per-NodeType basis. I just started working on this.
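
As a sketch of what such a per-NodeType mapping could look like (all names
invented for illustration): high-volume types get their own table, everything
else falls through to a generic one.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-NodeType table mapping: node types with many
// instances get a dedicated table, all other types fall back to a generic one.
public class NodeTypeMappingDemo {

    static class TableMapping {
        private final Map<String, String> byNodeType = new HashMap<>();
        private final String defaultTable;

        TableMapping(String defaultTable) { this.defaultTable = defaultTable; }

        void map(String nodeType, String table) { byNodeType.put(nodeType, table); }

        String tableFor(String nodeType) {
            return byNodeType.getOrDefault(nodeType, defaultTable);
        }
    }

    public static void main(String[] args) {
        TableMapping mapping = new TableMapping("GENERIC_ITEMS");
        mapping.map("elearning:course", "COURSES");      // high-volume type
        mapping.map("elearning:testResult", "RESULTS");

        System.out.println(mapping.tableFor("elearning:course"));  // prints: COURSES
        System.out.println(mapping.tableFor("nt:unstructured"));   // prints: GENERIC_ITEMS
    }
}
```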

>>What else should be synchronized between the nodes?
>>Did I overlook something?
> 
> i think this list sounds like a good start...

Can someone explain the decision-making process in the project to me? How 
do we arrive at a decision on these modifications?

cheers,

Walter

Re: Scalability/Clustering

Posted by David Nuescheler <da...@gmail.com>.
hi walter,

sounds very interesting...

> we just plan to use JackRabbit in an e-learning project with a few
> hundred concurrent users. Therefore I am a little concerned about
> scalability.
> Some figures we forecast for the first expansion stage:
>  1.000.000 Nodes
> 10.000.000 Properties (around 10 properties/node)
>      3.000 Named Users (about 10% concurrent)
we just recently ran a test using jackrabbit and cqfs
populating roughly 5m items (~500k nodes) and
even without using an rdbms back end we did not
run into issues. the performance of the persistence layer 
degraded over time though.

> We think of a n-tier architecture with a web and application layer, a
> repository layer and the database layer with 2 or more nodes for each
> layer. There are either Java and .net applications accessing the content
> in the repository, so we are planning to implement a .NET client for
> JSR170 too.
cool. i am currently trying to get at least a common .NET port
of the API put together in jackrabbit (just like markus did it for PHP)
are you interested in helping with that?
i think a .NET client using the WebDAV JCR remoting could 
be a very interesting option.
http://www.day.com/jsr170/server/JCR_Webdav_Protocol.zip

> What would be the best deployment model for such a situation in your
> opinion?
personally, i think that it depends on the nature of the application.
the e-learning applications that i know do a lot of reading of "course 
material" and a relatively limited amount of writing operations 
(test results, user tracking, ...)

i think that repository based content replication lends itself to 
distribute the course material to multiple entirely independent
cluster nodes.

with respect to the communication protocol of the clients 
i think depending on the application either an rmi-layer (for java
obviously) or a webdav-based client may be a good choice.

> Are there any efforts to make jackrabbit clustered for a load sharing
> scenario (no session failover at repository layer) ?
i think there are a couple of caches that need to be made 
clusterable (or at least pluggable) in the jackrabbit core for 
that to happen efficiently. it has to be done very carefully, 
but it should not be too much work i think.

this is definitely on the roadmap and investigations in that
direction have already happened.

> After reading a lot of code, I think following changes should do it:
> - extending ObservationManager to send and receive Events to
>   and from other nodes
maybe... personally i would like to have that functionality closer
to the core, to keep things as transactional as possible across
the cluster.

> - implementing/extending an ORM Layer (Hibernate with shared caching for
>   performance). The persistence implementation should be aware of the
>   node types and allow a type specific mapping to tables. So we can map
>   nodetypes with many instances to own tables while maintaining
>   flexibility for new "simple" nodetypes.
i think that you may get a better performance impact by implementing
the shared cache on a higher layer in the jackrabbit architecture.
on a completely different note, some people probably also like to map 
nodetypes to tables for "aesthetic" reasons...

> - extending LockManager to sync locks with other Nodes
> - Lucene should be independent on each node but be aware of new nodes
>   and changes -> Events from ObservationManager
true.

> - Config - the cluster should have a central place for config management
sure. i think that's a nice-to-have though ;)

> - some intelligence in the JCR-RMI client to find a content repository
>   node from the cluster depending on node state (load, shutdown, ...)
assuming that you are using rmi, yes. 
if you are using webdav you may be using general 
http load-balancing infrastructure, right?

> What else should be synchronized between the nodes?
> Did I overlook something?
i think this list sounds like a good start...

> I am happy about any suggestions, even if you discourage us from using
> jackrabbit. Of course we would release some of these developments to the
> community - if someone is interested.
sure, very interested ;)

> I recommend evaluating jackrabbit since I see much future for 
> the JSR170 standard ...
i am glad to hear that. i think going with jsr-170 also allows
the customer (at a later date) to even change the implementation 
if the requirements should change drastically. still protecting 
all the investments made into the applications, clients, etc...
whether someone wants to use an opensource or a commercial
jsr-170 compliant content repository remains a question of 
personal taste and total cost of ownership.

> ... but I am concerned about the mentioned scalability issue.
i am not too worried about that, i think the metrics that you
specified are definitely very doable if one is willing to spend a little
bit of time on tweaking ...

regards,
david