Posted to users@jackrabbit.apache.org by Christian Stocker <me...@chregu.tv> on 2010/11/16 07:56:07 UTC

Scalable, failsafe and distributed Jackrabbit setup?

Hi

Lots of buzzwords in the subject, but let me explain.

(Please read on past the first paragraph even though I'll mention PHP :)
The whole discussion should be of interest to everyone, not just
"PHP people".)

We (we == some people from the PHP/Symfony community) are currently in
the process of developing a new "Content Management Framework" (see
http://cmf.symfony-project.org/ for some more info), where we'd like to
use JCR as the storage API and Jackrabbit as the easy medium-term
backend for it. As the transport layer between PHP and Jackrabbit, we
will use the not-yet-finished http://liip.to/jackalope, which uses the
DavEx protocol.

So much for the background; now I'm looking for the best way to set up
the Jackrabbit side so that it is potentially scalable, failsafe and
distributed. As far as I understand it, the "traditional" clustered
setup of a Jackrabbit server looks like http://flic.kr/p/8TQL1N, with
one central database:

* Failsafe: It is only as failsafe as the central database. In the case
of MySQL (which is not a given, btw), we'd have to set up a master-slave
scenario where the slave takes over the master role should the master go
down. No idea how feasible that is for Jackrabbit (I guess it means a
lot of manual work and monitoring from the outside).

* Scalable: It's as scalable as that single master database is. If your
app produces more reads or writes than the db can handle, you're doomed.
But as the db schema is very lightweight, maybe it can be assumed that
most websites never hit that ceiling.

* Distributed: Not really (except for the Jackrabbit nodes). You have
one fat database in one location.
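
For concreteness: as far as I understand the clustering docs, each node
in such a setup points its journal (and its persistence managers) at the
one shared database, configured in repository.xml roughly like the
sketch below. Hostname, credentials and schema prefix are just
placeholders:

  <!-- sketch only: each cluster node gets a unique id, all nodes share
       the same journal database -->
  <Cluster id="node1" syncDelay="2000">
    <Journal class="org.apache.jackrabbit.core.journal.DatabaseJournal">
      <param name="driver" value="com.mysql.jdbc.Driver"/>
      <param name="url" value="jdbc:mysql://db-master/jackrabbit"/>
      <param name="user" value="jackrabbit"/>
      <param name="password" value="secret"/>
      <param name="databaseType" value="mysql"/>
      <param name="schemaObjectPrefix" value="journal_"/>
    </Journal>
  </Cluster>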

Is that the way most Jackrabbit setups are done? And the database is
never the performance bottleneck?

Now, coming from the LAMP world, I'm traditionally more used to a
master-slave scenario (if not using one of those new fancy NoSQL
approaches), where there's one master db server for writes, which
replicates to many slave db servers for all the reads. Assuming your
typical website has many, many more reads than writes, this usually
scales pretty well.

In a Jackrabbit world, I imagine this could look something like
http://flic.kr/p/8TMG1Z

(We would do the read/write differentiation on the PHP/client level, but
if Jackrabbit could do that by itself, it would save us some logic on
the client side.)
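
To illustrate what I mean by doing that differentiation on the client
level, here is a rough PHP sketch. It only relies on the PHPCR
interfaces; the router class itself is hypothetical (not part of
Jackalope), and I'm leaving out how the two sessions get created against
the master's and a slave's DavEx endpoints, since that part of Jackalope
isn't final yet:

  <?php
  // Hypothetical helper: reads go to a session bound to the nearest
  // slave, writes go to a session bound to the single master.
  class ReadWriteSessionRouter
  {
      private $readSession;   // \PHPCR\SessionInterface against a slave
      private $writeSession;  // \PHPCR\SessionInterface against the master

      public function __construct(\PHPCR\SessionInterface $readSession,
                                  \PHPCR\SessionInterface $writeSession)
      {
          $this->readSession  = $readSession;
          $this->writeSession = $writeSession;
      }

      public function getNode($path)
      {
          // reads never touch the master
          return $this->readSession->getNode($path);
      }

      public function addNode($parentPath, $name)
      {
          // writes always go to the master and are saved right away
          $node = $this->writeSession->getNode($parentPath)->addNode($name);
          $this->writeSession->save();
          return $node;
      }
  }

The obvious caveat is replication lag: right after a write, the slave
you read from may not have seen the change yet.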

* Failsafe: As above, as long as a slave can be put in charge when the
master goes down, this can be made failsafe.

* Scalable: It scales very well for reads; writes are still an issue
(but I can live with that).

* Distributed: You could move parts of the setup to other locations and
the read performance wouldn't degrade. Write latency is usually not as
critical as read latency, so I could live with the longer roundtrips for
writes. And if you don't need writes (in general or in a "fail"
scenario), you can even keep serving your websites while the two
locations are disconnected.

One of the technical problems with this approach could be:
http://jackrabbit.510166.n4.nabble.com/Reading-repository-content-from-a-read-only-MySQL-td522668.html

This approach still has the write-to-a-single-db problem, but I can live
with that (as most approaches have this problem). To avoid it, you'd
probably need a totally different approach than an RDBMS, e.g. CouchDB,
which has replication built in from the ground up. Has anyone tried to
use something like that as a persistence manager (PM)?

So what do you think? Is my approach feasible? Am I overthinking it, and
is the first approach by far good enough? I'm not saying that I need the
full setup yet, I just don't want to get into trouble later, when we
actually need it and have to refactor a lot.

Any input is very appreciated

chregu

-- 
Liip AG  //  Feldstrasse 133 //  CH-8004 Zurich
Tel +41 43 500 39 81 // Mobile +41 76 561 88 60
www.liip.ch // blog.liip.ch // GnuPG 0x0748D5FE


Re: Scalable, failsafe and distributed Jackrabbit setup?

Posted by Lukas Kahwe Smith <ml...@pooteeweet.org>.
On 13.05.2011, at 16:54, Lukas Kahwe Smith wrote:

> 
> On 16.11.2010, at 19:26, Michael Wechner wrote:
> 
>> On 11/16/10 6:56 AM, Christian Stocker wrote:
>>> So what do you think? Is my approach feasible? Am I overthinking it and
>>> the first approach is by far good enough? I don't say, that I need the
>>> full setup yet, I just don't want to get into trouble later, when we
>>> actually would need it and have to refactor a lot.
>> 
>> I understand your concerns, but as long as your CMF instances are using the API
>> consistently you shouldn't have to worry much about refactoring at some later stage,
>> whereas this doesn't mean that you don't have to exchange your persistence implementation
>> at some later stage and hence I would suggest that you assume from the very beginning that you will have
>> to migrate your data from one implementation into another one.
>> 
>> Re the actual persistence implementation I don't think there is a one-fits-all solution,
>> but rather it depends on what kind of data you are dealing with and what kind of queries
>> you want to make and as you are pointing out write versus read access, whereas with
>> webapps becoming more interactive I think that write access becomes much more important
>> than today and hence its important to keep this mind re your setup.
> 
> 
> FYI
> 
> http://blog.liip.ch/archive/2011/05/04/how-to-make-jackrabbit-globally-distributable-fail-safe-and-scalable-in-one-go.html
> 
> One of the things we are looking to do here is to use this setup to have a spare slave with jackrabbit running on some small machine, so that in case our data center disappears forcing a rebuild we already have a clone able [1] setup ready. One thing we want to ensure is that this clone is consistent. 
> 
> I have only found docs for CRX on this, but it seems to also work with a stock Jackrabbit to do Lucene index checks on startup:
> http://dev.day.com/content/kb/home/Crx/CrxSystemAdministration/HowToCheckLuceneIndex.html
> 
> I wonder if there is already some way to trigger these checks on a running Jackrabbit?


I should have read the CheckIndex docs to the end:
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/index/CheckIndex.html

java -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex pathToIndex [-fix] [-segment X] [-segment Y]
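
For a stock Jackrabbit the per-workspace search index lives under
<repository-home>/workspaces/<workspace>/index, so with the lucene-core
jar that ships with your Jackrabbit version on the classpath, a check of
the default workspace would look something like this (jar name and paths
are just an example):

java -cp lucene-core.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex repository/workspaces/default/index

As the CheckIndex docs warn, only run it with -fix against a stopped
instance and after backing up the index, since -fix drops any segments
it considers corrupt.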

regards,
Lukas Kahwe Smith
mls@pooteeweet.org




Re: Scalable, failsafe and distributed Jackrabbit setup?

Posted by Lukas Kahwe Smith <ml...@pooteeweet.org>.
On 16.11.2010, at 19:26, Michael Wechner wrote:

> On 11/16/10 6:56 AM, Christian Stocker wrote:
>> So what do you think? Is my approach feasible? Am I overthinking it and
>> the first approach is by far good enough? I don't say, that I need the
>> full setup yet, I just don't want to get into trouble later, when we
>> actually would need it and have to refactor a lot.
> 
> I understand your concerns, but as long as your CMF instances are using the API
> consistently you shouldn't have to worry much about refactoring at some later stage,
> whereas this doesn't mean that you don't have to exchange your persistence implementation
> at some later stage and hence I would suggest that you assume from the very beginning that you will have
> to migrate your data from one implementation into another one.
> 
> Re the actual persistence implementation I don't think there is a one-fits-all solution,
> but rather it depends on what kind of data you are dealing with and what kind of queries
> you want to make and as you are pointing out write versus read access, whereas with
> webapps becoming more interactive I think that write access becomes much more important
> than today and hence its important to keep this mind re your setup.


FYI

http://blog.liip.ch/archive/2011/05/04/how-to-make-jackrabbit-globally-distributable-fail-safe-and-scalable-in-one-go.html

One of the things we are looking to do here is to use this setup to have a spare slave with Jackrabbit running on some small machine, so that in case our data center disappears and forces a rebuild, we already have a cloneable [1] setup ready. One thing we want to ensure is that this clone is consistent.

I have only found docs on this for CRX, but running Lucene index checks on startup seems to work with stock Jackrabbit as well:
http://dev.day.com/content/kb/home/Crx/CrxSystemAdministration/HowToCheckLuceneIndex.html

I wonder if there is already some way to trigger these checks on a running Jackrabbit?

regards,
Lukas Kahwe Smith
mls@pooteeweet.org


[1] http://blog.liip.ch/archive/2011/05/10/add-new-instances-to-your-jackrabbit-cluster-the-non-time-consuming-way.html


Re: Scalable, failsafe and distributed Jackrabbit setup?

Posted by Michael Wechner <mi...@wyona.com>.
On 11/16/10 6:56 AM, Christian Stocker wrote:
> So what do you think? Is my approach feasible? Am I overthinking it and
> the first approach is by far good enough? I don't say, that I need the
> full setup yet, I just don't want to get into trouble later, when we
> actually would need it and have to refactor a lot.

I understand your concerns, but as long as your CMF instances use the
API consistently, you shouldn't have to worry much about refactoring at
some later stage. That doesn't mean you will never have to exchange your
persistence implementation, though, so I would suggest you assume from
the very beginning that you will have to migrate your data from one
implementation to another at some point.

Re the actual persistence implementation, I don't think there is a
one-size-fits-all solution; it rather depends on what kind of data you
are dealing with and what kind of queries you want to make. You point
out write versus read access, and with webapps becoming more interactive
I think write access will become much more important than it is today,
so it's important to keep this in mind for your setup.

HTH

Michael

> Any input is very appreciated
>
> chregu
>