Posted to users@jena.apache.org by "Dr. André Lanka" <ma...@dr-lanka.de> on 2012/05/25 10:40:34 UTC

Ideas for an efficient TDB check?

Hello Jena-Users,

we are using Jena+TDB in production and are looking for an efficient
method to check the validity of the TDB files on disk.

Our situation is as follows.

With Jena 2.6.4 and TDB 0.8.10, each of our servers keeps triples in up
to 4000 different TDB stores on its local hard drive. On average each
store holds 1 million triples (with high variance). To keep our system
running smoothly, we need massively parallel write access to the
different stores, so a single huge named graph is not an option. We also
need all stores to be open and accessible.

To get that large number of TDB stores open in parallel, we customised
the TDB code for our needs. For instance, we introduced read caches
shared between all stores (to avoid memory problems), and we added basic
capabilities to roll back transactions. (We took control of all data
read from or written to ObjectFile and BlockMgr.)

So, in our situation we can't switch to the new TDB version overnight.

Now, the problem is that we had some disk issues a few days ago and want
to check which stores are broken (we know some of them are).

Our initial idea is to iterate over all statements in the store and
collect every S, P and O used there. The second step would be to check
that each such URI is correctly mapped to a NodeId, and vice versa.
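The round-trip check sketched in this paragraph can be illustrated with toy in-memory tables (the real node table lives on disk; the class and map names below are illustrative, not TDB API):

```java
import java.util.*;

// Toy model of the proposed round-trip check: every node occurring in the
// triples must map to a NodeId, and that NodeId must map back to the same
// node. The two maps stand in for the two directions of TDB's node table.
public class RoundTripCheck {
    static boolean check(List<String[]> triples,
                         Map<String, Long> nodeToId,
                         Map<Long, String> idToNode) {
        for (String[] t : triples) {
            for (String node : t) {                  // S, P and O
                Long id = nodeToId.get(node);        // forward lookup
                if (id == null) return false;        // missing mapping
                if (!node.equals(idToNode.get(id)))  // reverse lookup must agree
                    return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Long> fwd = new HashMap<>();
        Map<Long, String> rev = new HashMap<>();
        fwd.put("ex:s", 1L); rev.put(1L, "ex:s");
        fwd.put("ex:p", 2L); rev.put(2L, "ex:p");
        fwd.put("ex:o", 3L); rev.put(3L, "ex:o");
        List<String[]> triples =
            Collections.singletonList(new String[]{"ex:s", "ex:p", "ex:o"});
        System.out.println(check(triples, fwd, rev)); // true: tables agree
        rev.put(3L, "ex:corrupted");                  // simulate a damaged entry
        System.out.println(check(triples, fwd, rev)); // false: mismatch detected
    }
}
```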

Unfortunately, we are not sure whether this will cover every possible
file problem. We also suspect there could be a more efficient way to
check the internal data structures.


Any ideas (both high and low level) are highly appreciated.


Thanks in advance
André

-- 
Dr. André Lanka  *  0178 / 134 44 47  *  http://dr-lanka.de

Re: Ideas for an efficient TDB check?

Posted by "Dr. André Lanka" <ma...@dr-lanka.de>.
Hi Andy,

On 30.05.2012 14:13, Andy Seaborne wrote:
> On 30/05/12 08:36, "Dr. André Lanka" wrote:
>> Another crucial point is the possibility to query the
>> original (unmodified) data with SPARQL during a write transaction
>> (from the same thread).
> 
> If you have two datasets at the same location, one reading the original
> data and one writing, that will work.   It's a lock-free mechanism
> until the commit is flushed.

That's great :-)
So, the global cache is left. We'll investigate this (and implement if
applicable) as soon as we have a larger time block for this.


Thanks again for your help
André


Re: Ideas for an efficient TDB check?

Posted by Andy Seaborne <an...@apache.org>.
On 30/05/12 08:36, "Dr. André Lanka" wrote:
> Another crucial point is the possibility to query the
> original (unmodified) data with SPARQL during a write transaction
> (from the same thread).

If you have two datasets at the same location, one reading the original 
data and one writing, that will work.   It's a lock-free mechanism 
until the commit is flushed.

	Andy

Re: Ideas for an efficient TDB check?

Posted by "Dr. André Lanka" <ma...@dr-lanka.de>.
Hi Andy,

many thanks for your ideas and suggestions. Paolo wrote a check that
covers most of your ideas. We'll implement the "missing" ones soon.

On 26.05.2012 20:19, Andy Seaborne wrote:
>> customised the TDB code for our needs. For instance we introduced read
>> caches shared between all stores (to avoid memory problems). Also we
>> introduced basic capabilities to roll back transactions. (We took
>> control over all data read from or written to ObjectFile and BlockMgr).
> 
> Would you consider contributing your improvements back to Jena?

Yes, of course. We will merge our changes if you find them useful. :-)
At the moment, though, the global cache is entangled with our
transaction management. Since the latter is obsolete as of TDB 0.9, we
would prefer to integrate and contribute our changes into 0.9.


>> So, in our situation we can't switch to the new TDB version over night.
> 
> OK - as you probably know, transactions in 0.9.0 do provide robust update.

Yes, we are eager to switch to the new version. But first we have to
check whether we can open thousands of stores on a single machine
without memory issues.
Another crucial point is the possibility to query the
original (unmodified) data with SPARQL during a write transaction (from
the same thread). This is necessary because we rely on the delta (what
was added to and what was removed from the Model) during a write
operation. The ModelChangedListener is very helpful here. Unfortunately,
the MCL gets informed about any statement that is _requested_ to be
deleted, regardless of whether it exists in the store. The same holds
for added statements. So we decided to add logic that lets us read the
original data and check which statements are really removed or added.
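The "real delta" logic just described can be sketched with plain sets standing in for the pre-transaction model and the listener's reported changes (names are illustrative, not Jena API):

```java
import java.util.*;

// Sketch of computing the effective delta: a ModelChangedListener-style
// callback reports every *requested* add/remove, so the actual changes
// must be filtered against the original (pre-transaction) data.
public class Delta {
    // statements actually added = requested adds not already present
    static Set<String> effectiveAdds(Set<String> original, Set<String> requestedAdds) {
        Set<String> out = new HashSet<>(requestedAdds);
        out.removeAll(original);
        return out;
    }

    // statements actually removed = requested removes that were present
    static Set<String> effectiveRemoves(Set<String> original, Set<String> requestedRemoves) {
        Set<String> out = new HashSet<>(requestedRemoves);
        out.retainAll(original);
        return out;
    }

    public static void main(String[] args) {
        Set<String> original = new HashSet<>(Arrays.asList("s1", "s2"));
        Set<String> requested = new HashSet<>(Arrays.asList("s2", "s3"));
        // "s3" is genuinely new; "s2" already existed
        System.out.println(effectiveAdds(original, requested));    // [s3]
        // only "s2" can actually be removed; "s3" was never there
        System.out.println(effectiveRemoves(original, requested)); // [s2]
    }
}
```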

These things keep us from a fast switch. Nonetheless, as soon as we have
time, we'll look into it and implement it. And of course, if you like
the changes and find them useful, we'll share them with the community.


Thanks for your great semantic framework -- it's amazing. :-)


Greetings from Hojoki
André


Re: Ideas for an efficient TDB check?

Posted by Andy Seaborne <an...@apache.org>.
On 25/05/12 09:40, "Dr. André Lanka" wrote:
> Hello Jena-Users,
>
> we are using Jena+TDB in production and are looking for an efficient
> method to check the validity of the TDB files on disk.
>
> Our situation is as follows.
>
> With Jena 2.6.4 and TDB 0.8.10 each of our servers stores triples in up
> to 4000 different TDB stores stored on its local hard drive. On average
> each store owns 1 million triples (with high variance). To get our
> system working fluently, we need massive parallel write access to the
> different stores, so one huge named graph is no alternative. Also we
> need to have all stores open and accessible.
>
> In order to get that large number of TDB stores opened in parallel, we
> customised the TDB code for our needs. For instance we introduced read
> caches shared between all stores (to avoid memory problems). Also we
> introduced basic capabilities to roll back transactions. (We took
> control over all data read from or written to ObjectFile and BlockMgr).

Would you consider contributing your improvements back to Jena?

> So, in our situation we can't switch to the new TDB version over night.

OK - as you probably know, transactions in 0.9.0 do provide robust update.

> Now, the problem is that we had some disk issues a few days ago and want
> to check which stores have got broken (We know some of them are broken).
>
> Our initial idea is to iterate over all statements in the store and
> collect any S, P and O used in the store. Second step would be to check
> if any such URI is correctly mapped to an nodeID. And the other way round.
>
> Unfortunately we are not sure, if this will cover any possible file
> problem. Also, we think there could be a more efficient way to check the
> internal data structures.
>
>
> So, any idea (both high and low level) is highly appreciated.

Just iterating over S/P/O isn't enough, I'm afraid.  It iterates over 
a single index, so it does not check the other indexes.

A way to check the disk files would be to run
(for the default graph):

Graph.find(null, null, null)

which checks the SPO index and the node table in the id->string direction.

Then dump each index (you'll need to write code for this) in tuple 
order, one row per line of output.  TupleIndexes reorder their entries 
back to primary table order, so the POS index will consist of lines in 
the order S,P,O.

Sort each index dump and compare them.  They should be the same.
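The dump, reorder, sort and compare procedure can be sketched like this, with toy long arrays standing in for on-disk index records (the column permutations are assumptions read off the index names, e.g. a POS index holds tuples as P,O,S):

```java
import java.util.*;

// Sketch of the index cross-check: each index holds the same tuples in a
// different column order, so mapping every dump back to primary (S,P,O)
// order and sorting must yield identical dumps for a healthy store.
public class IndexCompare {
    // perm[i] = primary (S,P,O) position of the index's i-th column,
    // e.g. POS -> {1, 2, 0}: column 0 holds P, which is primary position 1
    static List<String> toPrimaryOrder(List<long[]> index, int[] perm) {
        List<String> out = new ArrayList<>();
        for (long[] t : index) {
            long[] spo = new long[3];
            for (int i = 0; i < 3; i++)
                spo[perm[i]] = t[i];            // undo the index's column reorder
            out.add(spo[0] + " " + spo[1] + " " + spo[2]);
        }
        Collections.sort(out);                  // sorted dumps compare line by line
        return out;
    }

    public static void main(String[] args) {
        // same two triples, stored in SPO order and in POS order
        List<long[]> spoIndex = Arrays.asList(new long[]{1, 2, 3}, new long[]{4, 5, 6});
        List<long[]> posIndex = Arrays.asList(new long[]{2, 3, 1}, new long[]{5, 6, 4});
        boolean same = toPrimaryOrder(spoIndex, new int[]{0, 1, 2})
                .equals(toPrimaryOrder(posIndex, new int[]{1, 2, 0}));
        System.out.println(same); // true: the indexes agree
    }
}
```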

There are also the prefix index and the Node->NodeId table.

The Node->NodeId mapping can be checked by taking apart the node string 
table and looking up each Node in the Node->NodeId table.
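That last check can be sketched with toy maps standing in for the on-disk structures, under the assumption (roughly true of TDB's format) that a NodeId encodes the node's offset in the string file; the names here are illustrative:

```java
import java.util.*;

// Sketch of the Node->NodeId consistency check: walk the node string
// table (offset -> node string) and verify the Node->NodeId index maps
// each string back to exactly that offset. Plain maps stand in for the
// on-disk ObjectFile and B+Tree index.
public class NodeTableCheck {
    static List<String> findBroken(Map<Long, String> stringTable,
                                   Map<String, Long> nodeToId) {
        List<String> broken = new ArrayList<>();
        for (Map.Entry<Long, String> e : stringTable.entrySet()) {
            Long id = nodeToId.get(e.getValue());
            if (id == null || !id.equals(e.getKey()))
                broken.add(e.getValue());     // missing or mismatched entry
        }
        return broken;
    }

    public static void main(String[] args) {
        Map<Long, String> table = new LinkedHashMap<>();
        table.put(0L, "ex:a");
        table.put(40L, "ex:b");
        Map<String, Long> index = new HashMap<>();
        index.put("ex:a", 0L);
        index.put("ex:b", 48L);                    // corrupted offset
        System.out.println(findBroken(table, index)); // [ex:b]
    }
}
```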

	Andy

Re: Ideas for an efficient TDB check?

Posted by "Dr. André Lanka" <ma...@dr-lanka.de>.
Hi Paolo,

thanks for your reply and for your great piece of software. It covers
almost all the ideas Andy mentioned. I'll check whether the
tuple-by-tuple comparison Andy suggested can be done without performance
issues.

We use your solution, with a few changes*, to check our stores
periodically. We are curious how many stores are broken.

I had to change some lines of code, though I'm not sure why. You use
new RecordFactory(SystemTDB.LenIndexQuadRecord, 0), which fails for us;
we have to use SystemTDB.LenIndexTripleRecord instead. Perhaps because
we only use the default graph?
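One plausible reason the two constants differ: an index record is just the tuple's NodeIds laid end to end, so with 8-byte NodeIds (an assumption about the on-disk format) a triple record is one NodeId shorter than a quad record. A toy calculation:

```java
// Toy arithmetic behind the record-length constants. The 8-byte NodeId
// size is an assumption about TDB's on-disk format, not a quoted API value.
public class RecordLen {
    static final int SIZE_OF_NODE_ID = 8;   // bytes per NodeId (assumption)

    // an index record concatenates one NodeId per tuple component
    static int recordLen(int tupleWidth) {
        return tupleWidth * SIZE_OF_NODE_ID;
    }

    public static void main(String[] args) {
        System.out.println(recordLen(3)); // 24: triple record (S,P,O)
        System.out.println(recordLen(4)); // 32: quad record (G,S,P,O)
    }
}
```

A quad store's indexes carry one extra NodeId per record, so a checker built for quads would read records of the wrong length against triple indexes, which would explain the failure on a default-graph-only store.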

Anyways, we are really grateful for your help :-)

Thanks
André


*For instance we omit outputting the direct nodes. They are caught by
NodeTableInline and never reach the underlying NodeTableDirect.

Re: Ideas for an efficient TDB check?

Posted by Paolo Castagna <ca...@googlemail.com>.
Hi André,
I know exactly how you feel, and I have had exactly the same need at times.

How do you know whether your TDB indexes are all fine?

Add the word 'production' to that and everything becomes more 'fun'. :-)
Fortunately, we use replication and have the ability to replay updates
going back as far as we want/need. This makes things more 'relaxing'.
But this is not the answer you are searching for right now.

I do not have *the* answer for you, nor a tool, but in the past I've
done something similar to what you suggested: a sort of TDB index
verifier/health checker. See [1]; it's just a quick and dirty solution
(not scalable... it keeps stuff in memory, etc.), but perhaps it gives
you some ideas.

If a TDB health-checking utility is useful and feasible, we should
probably open a JIRA issue for it and gather ideas on how best to
implement it. It should not be too much work.

You are still using TDB 0.8.10, but the on-disk format hasn't changed...
so it's reasonable to expect such functionality to work with your
indexes as well.

My 2 cents,
Paolo

 [1]
https://github.com/castagna/tdbloader4/blob/f5363fa49d16a04a362898c1a5084ade620ee81b/src/test/java/dev/TDBVerifier.java


Dr. André Lanka wrote:
> Hello Jena-Users,
> 
> we are using Jena+TDB in production and are looking for an efficient
> method to check the validity of the TDB files on disk.
> 
> Our situation is as follows.
> 
> With Jena 2.6.4 and TDB 0.8.10 each of our servers stores triples in up
> to 4000 different TDB stores stored on its local hard drive. On average
> each store owns 1 million triples (with high variance). To get our
> system working fluently, we need massive parallel write access to the
> different stores, so one huge named graph is no alternative. Also we
> need to have all stores open and accessible.
> 
> In order to get that large number of TDB stores opened in parallel, we
> customised the TDB code for our needs. For instance we introduced read
> caches shared between all stores (to avoid memory problems). Also we
> introduced basic capabilities to roll back transactions. (We took
> control over all data read from or written to ObjectFile and BlockMgr).
> 
> So, in our situation we can't switch to the new TDB version over night.
> 
> Now, the problem is that we had some disk issues a few days ago and want
> to check which stores have got broken (We know some of them are broken).
> 
> Our initial idea is to iterate over all statements in the store and
> collect any S, P and O used in the store. Second step would be to check
> if any such URI is correctly mapped to an nodeID. And the other way round.
> 
> Unfortunately we are not sure, if this will cover any possible file
> problem. Also, we think there could be a more efficient way to check the
> internal data structures.
> 
> 
> So, any idea (both high and low level) is highly appreciated.
> 
> 
> Thanks in advance
> André
>