Posted to users@jena.apache.org by Al Baker <aj...@gmail.com> on 2011/05/28 09:30:20 UTC

TDB Reliability

I've been testing TDB for a while, and am very impressed with its
performance.  However, I do see the various emails on the mailing lists
warning of touching the files while an application with TDB is open
(presumably with an open Jena Model attached to the TDB directory).

What kind of reliability does TDB have to survive a power hit or application
crash?

Are there some steps to take consistent and regular backups to mitigate any
issues?

Basically looking to have some level of confidence that I can use TDB in
production, take a reasonable number of steps to ensure reliability, and be
confident that I'll always either have a valid TDB store, or a way to
incrementally backup/rollback in the case of a severe crash/file system
error.

Thanks,
Al Baker

Re: TDB Reliability

Posted by Andy Seaborne <an...@epimorphics.com>.

On 30/05/11 05:41, Al Baker wrote:
> Hi Andy,
>
> Great to hear about the transaction work.
>
> For the N-Quad export on a live database, do you mean that a running
> application - with open Jena models, and possibly a thread writing to it
> will not interfere with the N-Quad dump - the only gotcha would be a
> possible missed quad just before/after the export operation?  If so, I think
> that would suffice for the short-term.

An n-quad dump is another read operation - MRSW applies.

> I guess another possibility is, within the app, to fire up a thread and do a
> Model.write out to the filesystem to save the entire model.

Yes - again, it's a read operation.

> Regarding bulk imports - I'm actually finding regular Jena model
> manipulation runs very fast with TDB.  Within seconds I can have a 100k
> statement TDB store set up.

Ok - if your data is 100K statements, then many of these issues do not
apply very strongly.  At that sort of size, the data is completely cached
in memory.  It's when you get into the many tens of millions of triples
that things get complicated, because operations take an appreciable length
of time.

> I'm not looking to take on the giant datasets out there; I'm coming at this
> from a practical how-to-build-apps perspective.  Speaking of which, it would be
> nice if the cache was controllable - similar to Ehcache (or maybe an idea
> for a future project for TDB to use Ehcache) - TTL, max in memory, etc.

EhCache is interesting (caveat understanding its license dependencies)
and certainly better cache control is desirable.

What would be good is to have a "scan-resistant" eviction policy (such as
ARC or LRU-2) which retains very commonly used blocks even if the access
pattern includes a pass through a large proportion of the database.
Currently, the B+tree code has separate caches for branches and leaves,
which reduces the scan problem somewhat.  The old B-Tree code was
susceptible to scan access patterns making the caching inefficient.
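[Editor's note: to show why segmenting the cache helps against scans, here is a minimal segmented-LRU sketch in plain Java. This is an illustration only, not TDB's actual cache code: keys promoted on a second hit live in a protected segment that a one-pass scan of cold keys cannot evict from.]

```java
import java.util.LinkedHashMap;

// Minimal segmented-LRU sketch (illustrative, not TDB's cache):
// new keys enter a small probationary segment; only keys hit a
// second time are promoted to the protected segment, so a single
// pass over many cold keys cannot push out frequently used entries.
class SegmentedLru<K, V> {
    private final int probationCap, protectedCap;
    // accessOrder=true makes iteration order least-recently-used first
    private final LinkedHashMap<K, V> probation = new LinkedHashMap<>(16, 0.75f, true);
    private final LinkedHashMap<K, V> guarded = new LinkedHashMap<>(16, 0.75f, true);

    SegmentedLru(int probationCap, int protectedCap) {
        this.probationCap = probationCap;
        this.protectedCap = protectedCap;
    }

    V get(K key) {
        V v = guarded.get(key);          // a hit here refreshes LRU order
        if (v != null) return v;
        v = probation.remove(key);
        if (v != null) {                 // second hit: promote to protected
            guarded.put(key, v);
            evictOldest(guarded, protectedCap);
        }
        return v;
    }

    void put(K key, V value) {
        if (guarded.containsKey(key)) { guarded.put(key, value); return; }
        probation.put(key, value);       // one-shot keys only ever live here
        evictOldest(probation, probationCap);
    }

    boolean isProtected(K key) { return guarded.containsKey(key); }

    private void evictOldest(LinkedHashMap<K, V> seg, int cap) {
        while (seg.size() > cap) {
            K oldest = seg.keySet().iterator().next();
            seg.remove(oldest);
        }
    }
}
```

A plain LRU would evict the hot block during a scan; here the probationary segment absorbs the one-shot keys.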

	Andy


>
> Thanks,
> Al
>
>
> On Sun, May 29, 2011 at 2:52 PM, Andy Seaborne<
> andy.seaborne@epimorphics.com>  wrote:
>
>>
>>
>> On 28/05/11 08:30, Al Baker wrote:
>>
>>> I've been testing TDB for a while, and am very impressed with its
>>> performance.  However, I do see the various emails on the mailing lists
>>> warning of touching the files while an application with TDB is open
>>> (presumably with an open Jena Model attached to the TDB directory).
>>>
>>> What kind of reliability does TDB have to survive a power hit or
>>> application
>>> crash?
>>>
>>> Are there some steps to take consistent and regular backups to mitigate
>>> any
>>> issues?
>>>
>>> Basically looking to have some level of confidence that I can use TDB in
>>> production, take a reasonable number of steps to ensure reliability, and
>>> be
>>> confident that I'll always either have a valid TDB store, or a way to
>>> incrementally backup/rollback in the case of a severe crash/file system
>>> error.
>>>
>>> Thanks,
>>> Al Baker
>>>
>>
>> Hi Al,
>>
>> Currently, TDB provides some update capabilities but relies on the
>> application maintaining MRSW (Multiple Reader Or Single Writer) concurrency
>> semantics together with a clean shutdown.  Many of the reports are due to
>> letting two writers access the database at the same time or crashes without
>> ensuring a sync() is done which currently is important for updates.
>>
>> For read-only usage, the database is safe - it is not modified or
>> reorganised by reads, so loss of machines or applications does not damage
>> the on-disk database.
>>
>> TDB is an in-process database - one JVM controls the database.  Having two
>> JVMs managing the files will also cause damage.
>>
>> You can backup a database by copy but only from a running system if you
>> co-ordinate with a sync() which makes the on-disk structures consistent.
>>   Stopping the DB is better and is needed on some OS's but dumping to N-Quads
>> can be done on a live database.
>>
>> For updates, there are periods of vulnerability.  This is being addressed
>> by adding ACID transactions to TDB.  The transaction system is based on
>> write-ahead logging; read requests go straight to the DB as before so
>> performance there will be unchanged.
>>
>> The disk format is (probably) going to be unchanged.  There are some
>> improvements that can be made but they aren't necessary.
>>
>> The bulk loader used to build a database from scratch will provide the best
>> load performance.  It will remain non-transactional. Transactions will be
>> aimed at non-bulk updates.  Where the practical boundary will be will emerge
>> in testing.
>>
>> The transaction work is active-work-in-progress [*] but I'm not going to
>> give specific release schedules except to say that as an open source
>> project, "release early, release often" of development versions will happen.
>>
>>         Andy
>>
>> [*] Indeed, I'm writing a journaled file abstraction at this moment.
>>
>

Re: TDB Reliability

Posted by Al Baker <aj...@gmail.com>.
Hi Andy,

Great to hear about the transaction work.

For the N-Quad export on a live database, do you mean that a running
application - with open Jena models, and possibly a thread writing to it
will not interfere with the N-Quad dump - the only gotcha would be a
possible missed quad just before/after the export operation?  If so, I think
that would suffice for the short-term.

I guess another possibility is, within the app, to fire up a thread and do a
Model.write out to the filesystem to save the entire model.

Regarding bulk imports - I'm actually finding regular Jena model
manipulation runs very fast with TDB.  Within seconds I can have a 100k
statement TDB store set up.

I'm not looking to take on the giant datasets out there; I'm coming at this
from a practical how-to-build-apps perspective.  Speaking of which, it would be
nice if the cache was controllable - similar to Ehcache (or maybe an idea
for a future project for TDB to use Ehcache) - TTL, max in memory, etc.

Thanks,
Al


On Sun, May 29, 2011 at 2:52 PM, Andy Seaborne <
andy.seaborne@epimorphics.com> wrote:

>
>
> On 28/05/11 08:30, Al Baker wrote:
>
>> I've been testing TDB for a while, and am very impressed with its
>> performance.  However, I do see the various emails on the mailing lists
>> warning of touching the files while an application with TDB is open
>> (presumably with an open Jena Model attached to the TDB directory).
>>
>> What kind of reliability does TDB have to survive a power hit or
>> application
>> crash?
>>
>> Are there some steps to take consistent and regular backups to mitigate
>> any
>> issues?
>>
>> Basically looking to have some level of confidence that I can use TDB in
>> production, take a reasonable number of steps to ensure reliability, and
>> be
>> confident that I'll always either have a valid TDB store, or a way to
>> incrementally backup/rollback in the case of a severe crash/file system
>> error.
>>
>> Thanks,
>> Al Baker
>>
>
> Hi Al,
>
> Currently, TDB provides some update capabilities but relies on the
> application maintaining MRSW (Multiple Reader Or Single Writer) concurrency
> semantics together with a clean shutdown.  Many of the reports are due to
> letting two writers access the database at the same time or crashes without
> ensuring a sync() is done which currently is important for updates.
>
> For read-only usage, the database is safe - it is not modified or reorganised
> by reads, so loss of machines or applications does not damage the on-disk
> database.
>
> TDB is an in-process database - one JVM controls the database.  Having two
> JVMs managing the files will also cause damage.
>
> You can backup a database by copy but only from a running system if you
> co-ordinate with a sync() which makes the on-disk structures consistent.
>  Stopping the DB is better and is needed on some OS's but dumping to N-Quads
> can be done on a live database.
>
> For updates, there are periods of vulnerability.  This is being addressed
> by adding ACID transactions to TDB.  The transaction system is based on
> write-ahead logging; read requests go straight to the DB as before so
> performance there will be unchanged.
>
> The disk format is (probably) going to be unchanged.  There are some
> improvements that can be made but they aren't necessary.
>
> The bulk loader used to build a database from scratch will provide the best
> load performance.  It will remain non-transactional. Transactions will be
> aimed at non-bulk updates.  Where the practical boundary will be will emerge
> in testing.
>
> The transaction work is active-work-in-progress [*] but I'm not going to
> give specific release schedules except to say that as an open source
> project, "release early, release often" of development versions will happen.
>
>        Andy
>
> [*] Indeed, I'm writing a journaled file abstraction at this moment.
>

Re: TDB Reliability

Posted by Andy Seaborne <an...@epimorphics.com>.

On 28/05/11 08:30, Al Baker wrote:
> I've been testing TDB for a while, and am very impressed with its
> performance.  However, I do see the various emails on the mailing lists
> warning of touching the files while an application with TDB is open
> (presumably with an open Jena Model attached to the TDB directory).
>
> What kind of reliability does TDB have to survive a power hit or application
> crash?
>
> Are there some steps to take consistent and regular backups to mitigate any
> issues?
>
> Basically looking to have some level of confidence that I can use TDB in
> production, take a reasonable number of steps to ensure reliability, and be
> confident that I'll always either have a valid TDB store, or a way to
> incrementally backup/rollback in the case of a severe crash/file system
> error.
>
> Thanks,
> Al Baker

Hi Al,

Currently, TDB provides some update capabilities but relies on the 
application maintaining MRSW (Multiple Reader Or Single Writer) 
concurrency semantics together with a clean shutdown.  Many of the 
reports are due to letting two writers access the database at the same 
time or crashes without ensuring a sync() is done which currently is 
important for updates.
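[Editor's note: for readers wondering what "maintaining MRSW" looks like in application code, here is a sketch using only the JDK's ReentrantReadWriteLock. This is a stand-in: Jena has its own Lock interface with enterCriticalSection/leaveCriticalSection around a Model, and the HashMap here stands in for a TDB-backed store.]

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// MRSW guard sketch: many concurrent readers OR a single writer,
// never both at once. Illustrative only - in a real application the
// lock would wrap operations on a TDB-backed Jena Model, and a
// sync() would follow each write to flush on-disk structures.
class MrswStore {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<String, String> store = new HashMap<>();

    void write(String key, String value) {
        lock.writeLock().lock();          // excludes all readers and writers
        try {
            store.put(key, value);
            // a real TDB store would call sync() here
        } finally {
            lock.writeLock().unlock();
        }
    }

    String read(String key) {
        lock.readLock().lock();           // many threads may hold this together
        try {
            return store.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```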

For read-only usage, the database is safe - it is not modified or
reorganised by reads, so loss of machines or applications does not damage
the on-disk database.

TDB is an in-process database - one JVM controls the database.  Having
two JVMs managing the files will also cause damage.

You can back up a database by copying it, but only from a running system
if you co-ordinate with a sync(), which makes the on-disk structures
consistent.  Stopping the DB is better, and is needed on some OSs, but
dumping to N-Quads can be done on a live database.
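[Editor's note: a sketch of the copy step in plain Java, assuming writers are quiesced and a sync() has already been issued. The paths and the surrounding coordination are placeholders, not TDB specifics.]

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Backup-by-copy sketch: recursively copy a database directory.
// Only valid if a sync() has made the on-disk files consistent and
// no writer runs during the copy (MRSW discipline applies here too).
class TdbBackup {
    static void copyDir(Path src, Path dst) throws IOException {
        try (var paths = Files.walk(src)) {            // parents before children
            for (Path p : (Iterable<Path>) paths::iterator) {
                Path target = dst.resolve(src.relativize(p).toString());
                if (Files.isDirectory(p)) {
                    Files.createDirectories(target);
                } else {
                    Files.copy(p, target, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }
}
```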

For updates, there are periods of vulnerability.  This is being 
addressed by adding ACID transactions to TDB.  The transaction system is 
based on write-ahead logging; read requests go straight to the DB as 
before so performance there will be unchanged.
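[Editor's note: the idea behind write-ahead logging can be shown in a few lines. This is illustrative only - TDB's actual journal format and recovery are more involved - but the invariant is the same: record the update durably in a log before applying it in place, and on restart redo whatever the log contains.]

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

// Write-ahead-logging sketch: every update is appended and flushed
// to the log (SYNC) before being applied to the in-memory state, so
// a crash between the two steps can be redone from the log.
class WalStore {
    private final Path log;
    private final Map<String, String> data = new HashMap<>();

    WalStore(Path log) { this.log = log; }

    void put(String key, String value) throws IOException {
        // 1. durably record the intent...
        Files.writeString(log, key + "=" + value + "\n",
                StandardOpenOption.CREATE, StandardOpenOption.APPEND,
                StandardOpenOption.SYNC);
        // 2. ...then apply it in place
        data.put(key, value);
    }

    // Redo pass after a crash: replay every update recorded in the log.
    void recover() throws IOException {
        if (!Files.exists(log)) return;
        for (String line : Files.readAllLines(log)) {
            int eq = line.indexOf('=');
            if (eq > 0) data.put(line.substring(0, eq), line.substring(eq + 1));
        }
    }

    String get(String key) { return data.get(key); }
}
```

Reads never touch the log, which is why read performance is unaffected by adding it.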

The disk format is (probably) going to be unchanged.  There are some 
improvements that can be made but they aren't necessary.

The bulk loader used to build a database from scratch will provide the
best load performance.  It will remain non-transactional; transactions
will be aimed at non-bulk updates.  Where the practical boundary lies
will emerge in testing.

The transaction work is active-work-in-progress [*] but I'm not going to 
give specific release schedules except to say that as an open source 
project, "release early, release often" of development versions will happen.

	Andy

[*] Indeed, I'm writing a journaled file abstraction at this moment.