Posted to users@jena.apache.org by Paolo Castagna <ca...@googlemail.com> on 2011/04/07 15:45:53 UTC

How to replicate TDB indexes using rsync?

Hi,
one thing people often do to increase availability is to use replication.
Having replicated RDF stores also allows you to load balance requests
across multiple machines in order to increase query throughput.
Replicas can act as a sort of backup, but you need to be careful: errors
are replicated as well, so replication, in general, does not eliminate
the need for backups. (What's the best way to back up a TDB store?)

However, in the presence of updates you are left with the problem of
keeping your replicas in sync.

For simplicity, I was thinking of trying a master/slave(s) architecture:
one Fuseki server acting as master and running with the --update option,
and a few replicas running in read-only mode.
I am thinking of using rsync to sync the TDB indexes between master and slaves.
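
To make the rsync part concrete, here is a minimal sketch in Python of how
the command for one replica could be assembled. The helper name, the host
and the paths are made up for illustration; -a and --delete are standard
rsync options, and the trailing slash on the source tells rsync to copy the
contents of the directory rather than the directory itself.

```python
import shlex


def build_rsync_command(tdb_dir, replica_host, replica_dir):
    """Return the argv list for copying a TDB directory to one replica.

    Hypothetical helper: host and paths are illustrative assumptions.
    """
    # Trailing slash: copy the contents of tdb_dir, not the directory itself.
    source = tdb_dir.rstrip("/") + "/"
    target = f"{replica_host}:{replica_dir.rstrip('/')}/"
    # -a preserves permissions and timestamps; --delete removes files on the
    # replica that no longer exist on the master, so both directories match.
    return ["rsync", "-a", "--delete", source, target]


if __name__ == "__main__":
    cmd = build_rsync_command("/var/lib/tdb/DB", "replica1.example.org",
                              "/var/lib/tdb/DB")
    print(shlex.join(cmd))  # only prints the command; nothing is copied
```

Returning a list rather than a string means it can be passed to
subprocess.run() without going through a shell.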

However, the replication needs to be coordinated, and updates must be
forbidden while it is in progress. The master should become read-only and
sync everything to disk; only then can replication start.

A similar thing would need to happen for slaves. Slaves should probably
be taken off-line while replication is going on and Fuseki restarted as
soon as replication finishes. I expect replication to be quite fast,
after the first time.
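
A sketch of that slave-side sequence in Python, under the assumption that
stopping Fuseki, pulling the files and starting Fuseki are each wrapped in
a caller-supplied function (all three hooks are hypothetical). It returns
the downtime in seconds, which would be one way to check how fast a sync
really is after the first one.

```python
import time


def refresh_replica(stop_server, pull_indexes, start_server):
    """Take a replica off-line, refresh its TDB files, bring it back.

    The three arguments are caller-supplied callables (assumptions):
    stop_server stops the read-only Fuseki, pull_indexes runs the copy,
    start_server starts Fuseki again. Returns the downtime in seconds.
    """
    started = time.monotonic()
    stop_server()
    # If the copy raises, the exception propagates and the replica stays
    # off-line instead of serving half-copied index files.
    pull_indexes()
    start_server()
    return time.monotonic() - started
```

Leaving a replica off-line after a failed copy is a design choice of the
sketch: serving no data seemed safer than serving a partially copied store.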

I am sending this email to validate my thinking and to ask if anyone else
has used rsync to manage replicated TDB stores with/without Fuseki.

Thank you,
Paolo


Re: How to replicate TDB indexes using rsync?

Posted by Paolo Castagna <ca...@googlemail.com>.

Andy Seaborne wrote:
> It will only work when the DB is sync'ed ... maybe something to add to
> the transaction work.  Depending on details, a multi-phase locking
> scheme on write-transaction start or commit would be useful (cf. SQLite);
> then "sync-in-progress" would not need to stop readers (still, it all
> needs a great deal of care!).

Hi Andy,
thanks for your reply.

So, perhaps the easiest thing to do right now would be to have some code to
co-ordinate the replication, on the master:

  - take a write lock (so no updates, and for now no reads either, are possible)
  - sync the DB to disk
  - trigger replication with rsync
  - release the lock
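
A rough sketch of those four steps in Python, just to fix ideas. The lock
here is a plain threading.Lock standing in for whatever lock actually
guards the store, and sync_to_disk and run_rsync are hypothetical callables
supplied by the caller; none of this is TDB or Fuseki API.

```python
import threading

# Stand-in for the store's write lock (an assumption for this sketch).
write_lock = threading.Lock()


def replicate_from_master(sync_to_disk, run_rsync, lock=write_lock):
    """Run the four steps: take lock, sync, rsync, release lock."""
    with lock:          # 1. take the write lock; updates have to wait
        sync_to_disk()  # 2. flush the DB to disk
        run_rsync()     # 3. copy the files while nothing can change
    # 4. the lock is released on leaving the with-block, even on error
```

Using a with-block means a failed rsync cannot leave the master locked.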

Slaves could replicate from one another and, since they are read-only, there
is no need to sync the DB or take a write lock.
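
One possible reading of this, sketched in Python: the master copies to a
single slave only (so its write lock is held for one transfer), and after
that every up-to-date slave feeds one stale slave per round. The function
name and the round-based scheme are assumptions made for illustration.

```python
def plan_fan_out(master, replicas):
    """Plan copy rounds as lists of (source, target) pairs.

    Round 1 is master -> first replica. In later rounds only replicas
    act as sources, and the pairs within a round could run in parallel.
    """
    stale = list(replicas)
    if not stale:
        return []
    first = stale.pop(0)
    rounds = [[(master, first)]]
    fresh = [first]  # replicas that already hold the new data
    while stale:
        # Pair each up-to-date replica with one stale replica.
        pairs = list(zip(fresh, stale))
        rounds.append(pairs)
        stale = stale[len(pairs):]
        fresh.extend(target for _, target in pairs)
    return rounds
```

With four slaves a, b, c and d this gives three rounds: master to a, then
a to b, then a to c together with b to d.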

I've not tested how fast or slow this would be, or how long the write lock
would need to be held... but it might work for scenarios where updates are
infrequent and clients are happy to queue them while replication is going on.

I think the master/slave(s) replication is a scenario to keep in mind and if
there is something simple which can be done to make it possible and as easy as
possible for people, it would not be bad at all for production systems which
require high availability (for reads).

Paolo


Re: How to replicate TDB indexes using rsync?

Posted by Andy Seaborne <an...@epimorphics.com>.

It will only work when the DB is sync'ed ... maybe something to add to
the transaction work.  Depending on details, a multi-phase locking
scheme on write-transaction start or commit would be useful (cf. SQLite);
then "sync-in-progress" would not need to stop readers (still, it all
needs a great deal of care!).
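
As a rough illustration of the idea (not TDB code, and much simpler than
what SQLite really does), the Python sketch below adds a "reserved" phase
between reading and exclusive writing, loosely modelled on SQLite's SHARED,
RESERVED and EXCLUSIVE lock states. The class and method names are invented
for the example. A sync job takes only the reserved phase, so writers are
held back while readers carry on.

```python
import threading


class PhasedLock:
    """Toy multi-phase lock: readers, one reservation, exclusive access."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._reserved = False   # held by one writer or one sync job
        self._exclusive = False  # a writer is (about to be) changing files

    def begin_read(self, timeout=None):
        with self._cond:
            if not self._cond.wait_for(lambda: not self._exclusive, timeout):
                return False
            self._readers += 1
            return True

    def end_read(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def reserve(self, timeout=None):
        """Taken at write-transaction start, or by the sync job."""
        with self._cond:
            if not self._cond.wait_for(lambda: not self._reserved, timeout):
                return False
            self._reserved = True
            return True

    def upgrade_to_exclusive(self, timeout=None):
        """Writer only, at commit: block new readers, wait for old ones."""
        with self._cond:
            self._exclusive = True
            if self._cond.wait_for(lambda: self._readers == 0, timeout):
                return True
            self._exclusive = False  # gave up; let readers in again
            self._cond.notify_all()
            return False

    def release(self):
        with self._cond:
            self._reserved = False
            self._exclusive = False
            self._cond.notify_all()
```

A sync job would call reserve(), run the copy and call release(), never
upgrading to exclusive, so readers are not stopped at any point. A writer
would call reserve() at transaction start and upgrade_to_exclusive() at
commit.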

		Andy