Posted to users@jena.apache.org by Andy Seaborne <an...@apache.org> on 2020/05/02 09:59:04 UTC

Re: TDB2 store grows unexpectedly


On 27/04/2020 09:30, Jan Šmucr wrote:
> Hello Andy,
> 
> It’s TDB2. See this graph: https://ibb.co/SPScf9V
> 
> At the end of the day it’s up to twice the compacted size even when compacting daily.
> So as I understand it, it’s an implementation detail, right?

Yes, that's right.

> 
> Jan
> 
> On 2020/04/24 20:27:29, Andy Seaborne <an...@apache.org> wrote:
>> Hi Jan,
>>
>> You don't say whether this is TDB1 or TDB2 - they behave differently.
>> Both grow, though by varying degrees: TDB1 somewhat less quickly than
>> TDB2, and more noticeably so with small databases like 40M triples.
>>
>> TDB2 has a compaction function which, in effect, does the dump/restore
>> but faster and with no downtime for read operations.  Unfortunately this
>> isn't available from within Fuseki, so a stop-compact-restart is
>> necessary (it could be done - the code just isn't written yet).
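>>
>> A minimal sketch of that compaction call via the TDB2 Java API (the
>> database directory is a placeholder):
>>
>>     import org.apache.jena.sparql.core.DatasetGraph;
>>     import org.apache.jena.tdb2.DatabaseMgr;
>>
>>     public class CompactTDB2 {
>>         public static void main(String[] args) {
>>             // Connect to the switchable TDB2 database container.
>>             DatasetGraph dsg = DatabaseMgr.connectDatasetGraph("/path/to/tdb2-db");
>>             // Write a fresh, packed copy of the current state and switch
>>             // over to it; read operations are not blocked meanwhile.
>>             DatabaseMgr.compact(dsg);
>>         }
>>     }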
>>
>> TDB2 uses an MVCC data structure - when you add data, the new data is
>> added to copied blocks in the index. This has the advantage of allowing
>> arbitrarily large transactions, including bulk loads, while the server
>> is running. Loads are faster as well because some data is written to
>> disk asynchronously by the OS while the update is in progress.
>> Outstanding read transactions continue reading the old data.  Indeed,
>> for TDB2, it would be possible to run a query on any previous state of
>> the database - it never forgets until a compaction happens.
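>>
>> For illustration, a sketch of a large update running in one transaction
>> while readers carry on (directory and file names are placeholders):
>>
>>     import org.apache.jena.query.Dataset;
>>     import org.apache.jena.system.Txn;
>>     import org.apache.jena.tdb2.TDB2Factory;
>>
>>     public class TxnExample {
>>         public static void main(String[] args) {
>>             Dataset ds = TDB2Factory.connectDataset("/path/to/tdb2-db");
>>             // The write goes to freshly copied index blocks; read
>>             // transactions started earlier keep seeing the old state.
>>             Txn.executeWrite(ds, () ->
>>                 ds.getDefaultModel().read("big-data.ttl"));
>>             Txn.executeRead(ds, () ->
>>                 System.out.println("Triples: " + ds.getDefaultModel().size()));
>>         }
>>     }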
>>
>> TDB1 grows too, but more slowly. It does not always reuse index blocks
>> freed up when B+Tree blocks are split.  TDB1 buffers a transaction's
>> changes and does not write them back to the main database until after
>> the transaction has finished, which limits the transaction size.
>>
>> At the moment, for both cases, offline repacking is necessary.
>>
>>       Andy
>>
>> On 24/04/2020 11:07, Jan Šmucr wrote:
>>> Hello.
>>>
>>> I'm building a file processing workflow monitoring system based on Jena Fuseki. The goal for this system is to be purely additive. Individual events are pieces of knowledge about each of the jobs, eventually connected via various identifiers and references. Finally, I can search for an event and, with basic knowledge of the schema, rebuild a graph representing the whole processing job and display it to the customer. It seems to work and I'm happy with the whole idea.
>>>
>>> There's however one thing I'd like to solve, and that is the incredible amount of space the triplestore consumes. Currently there are about 40M triples (approximately 2 months of traffic) and, if unmaintained, the amount of disk space the database consumes is huge. The maintenance process is to stop Fuseki -> dump database -> backup -> delete the old database -> load the dump -> start Fuseki. At this point the database is at most 10% of what it was before the maintenance. Then it grows back, and even more so, as new data is added.
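>>>
>>> For reference, a minimal Java sketch of the dump and reload steps of that cycle (paths and file names are placeholders):
>>>
>>>     import java.io.FileOutputStream;
>>>     import java.io.OutputStream;
>>>     import org.apache.jena.riot.Lang;
>>>     import org.apache.jena.riot.RDFDataMgr;
>>>     import org.apache.jena.sparql.core.DatasetGraph;
>>>     import org.apache.jena.system.Txn;
>>>     import org.apache.jena.tdb2.DatabaseMgr;
>>>
>>>     public class DumpReload {
>>>         public static void main(String[] args) throws Exception {
>>>             DatasetGraph old = DatabaseMgr.connectDatasetGraph("/old/tdb2-db");
>>>             // Dump the whole dataset as N-Quads inside a read transaction.
>>>             try (OutputStream out = new FileOutputStream("dump.nq")) {
>>>                 Txn.executeRead(old, () -> RDFDataMgr.write(out, old, Lang.NQUADS));
>>>             }
>>>             // Reload the dump into a fresh database directory.
>>>             DatasetGraph fresh = DatabaseMgr.connectDatasetGraph("/new/tdb2-db");
>>>             Txn.executeWrite(fresh, () -> RDFDataMgr.read(fresh, "dump.nq"));
>>>         }
>>>     }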
>>>
>>> Note that approximately half of the triples in the inserts might already be in the database. Example:
>>>
>>> ### First event
>>>
>>> e:MyReceiveEvent
>>>       a e:Event ;
>>>       a e:Receive ;
>>>       e:subjectMessage e:MyMessage .
>>> e:MyMessage
>>>       a e:Message ;
>>>       e:guid "MyMessage"^^xsd:string .
>>>
>>> ### Second event
>>>
>>> e:MySendEvent
>>>       a e:Event ;
>>>       a e:Send ;
>>>       e:subjectMessage e:MyMessage .
>>> e:MyMessage
>>>       a e:Message ;
>>>       e:guid "MyMessage"^^xsd:string .
>>>
>>> This is because I need to use inserts only, and I don't know anything about the triplestore contents at the point I emit the event. My testing, however, didn't show any database growth when I inserted such duplicates on purpose.
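>>>
>>> For illustration, a sketch of emitting one event as a single INSERT DATA request (the endpoint URL and the e: namespace are placeholders):
>>>
>>>     import org.apache.jena.update.UpdateExecutionFactory;
>>>     import org.apache.jena.update.UpdateFactory;
>>>     import org.apache.jena.update.UpdateRequest;
>>>
>>>     public class EmitEvent {
>>>         public static void main(String[] args) {
>>>             // Re-inserting a triple that already exists does not add a
>>>             // second copy to the dataset, so the update is idempotent
>>>             // at the data level.
>>>             String update =
>>>                 "PREFIX e: <http://example.org/events#>\n" +
>>>                 "INSERT DATA {\n" +
>>>                 "  e:MySendEvent a e:Event , e:Send ;\n" +
>>>                 "                e:subjectMessage e:MyMessage .\n" +
>>>                 "  e:MyMessage   a e:Message ;\n" +
>>>                 "                e:guid \"MyMessage\" .\n" +
>>>                 "}";
>>>             UpdateRequest req = UpdateFactory.create(update);
>>>             // Send to the Fuseki update endpoint.
>>>             UpdateExecutionFactory.createRemote(req,
>>>                 "http://localhost:3030/ds/update").execute();
>>>         }
>>>     }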
>>>
>>> How can I fight this? Is it because of all the inserts and the triplestore's design? What steps should I take to optimize the store for this scenario?
>>>
>>> Thank you very much for your responses.
>>>
>>> Jan
>>>
>>>
>>