Posted to users@jena.apache.org by Lorenz Buehmann <bu...@informatik.uni-leipzig.de> on 2023/02/01 07:20:00 UTC

Re: Re: Why does the OSPG.dat file grows so much more than all other files?

Interesting insights from both of you, thanks.

@Andy do you have a rough idea why only the OSPG index was that large 
compared to the others? What kind of updates would lead to that result?


On 30.01.23 21:40, Andy Seaborne wrote:
> Elton - thanks for the update.
>
> The index sizes look much more like what I was getting using 100e6 
> BSBM data as a test.
>
> Inline ...
>
> On 30/01/2023 01:49, Elton Soares wrote:
>> Hi Lorenz and Andy,
>>
>> Thank you for your quick responses and suggestions.
>>
>> Q: "Do you have lots of may large literals in your data?"
>> A: I cannot be sure yet, but as Andy mentioned, the documentation 
>> indicates that the indexes store 8-byte entries rather than the 
>> literals' string representations 
>> (https://jena.apache.org/documentation/tdb/architecture.html). 
>> Initially we thought that OSPG.dat might be so much larger either 
>> because the number of objects is far greater than the number of 
>> predicates, subjects and graphs, or because the literals stored in 
>> those objects are very large. After discussing what the documentation 
>> says, we concluded that both hypotheses are very unlikely, although we 
>> could easily be convinced otherwise by someone who knows the source 
>> code better than we do.
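>>
>> For reference, a query along these lines (it may be slow on a dataset 
>> of this size) can check the large-literal hypothesis directly:
>> ```
>> SELECT (COUNT(?object) as ?literals)
>>        (AVG(STRLEN(STR(?object))) as ?avgLength)
>>        (MAX(STRLEN(STR(?object))) as ?maxLength)
>> WHERE {
>>    GRAPH ?graph {
>>      ?subject ?predicate ?object
>>      FILTER(isLiteral(?object))
>>    }
>> }
>> ```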
>>
>> Q: "Also, did you try a compaction on the database? If not, can you 
>> try it and post the new file sizes afterwards? Note, they will be 
>> located in a new ./Data-XXXX directory, e.g. before Data-0001 and 
>> afterwards Data-0002"
>>
>> Following your suggestion, I tried two compaction strategies on this 
>> dataset to see which one would work best.
>> The one I'm calling "official" uses the "/$/compact" endpoint; the one 
>> I'm calling "unofficial" creates an NQuads backup and loads it into a 
>> new dataset with the TDBLoader.
>> I attempted the second strategy because a StackOverflow post suggested 
>> that it could be significantly more efficient than the "official" one 
>> (https://stackoverflow.com/questions/60501386/compacting-a-dataset-in-apache-jena-fuseki/60631699#60631699).
>
> Could be - backup-restore is done offline, so a bulk loader can be used 
> for the restore (and you get a backup file as a record).
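>
> Roughly, the two routes look like this (the dataset name, port, and 
> file name below are illustrative):
>
> ```
> # "Official": compact in place; the result appears as a new Data-NNNN
> # directory inside the same database location.
> curl -XPOST 'http://localhost:3030/$/compact/my-dataset'
>
> # "Unofficial": dump to N-Quads, then bulk load into a fresh database
> # while the server is not using it.
> curl -XPOST 'http://localhost:3030/$/backup/my-dataset'
> # ... the backup file is written under the server's backups/ directory ...
> tdb2.tdbloader --loc /path/to/new/my-dataset my-dataset_backup.nq.gz
> ```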
>
>> We will consider upgrading our Jena Fuseki server to version 4.7.0, 
>> although it is not yet clear that the growth we saw in the OSPG.dat 
>> could be avoided by the changes made from 4.4.0 to 4.7.0. I'll try to 
>> take some time to look into the changelog more carefully to see if 
>> there is anything that seems to relate to that.
>
> From your original sizes, would I be right in guessing you hadn't 
> compacted at all, and also that you do a significant amount of updates?
>
> 4.7.0 wouldn't change the growth situation - it does make compaction 
> in a live server more reliable.
>
>> Here is a summary of the results I've obtained with both compaction 
>> strategies (in markdown notation):
>>
>> ## Original Dataset
>>
>> RDF Stats:
>>   - Triples: 65222513 (Approximately 65 million)
>>   - Subjects: 20434264 (Approximately 20 million)
>>   - Objects: 8565221 (Approximately 8 million)
>>   - Graphs: 213531 (Approximately 213 thousand)
>>   - Predicates: 153
>>
>> Disk Stats:
>> - my-dataset/Data-0001: 671GB
>> - my-dataset/Data-0001/OSPG.dat: 243GB
>> - my-dataset/Data-0001/nodes.dat: 76GB
>> - my-dataset/Data-0001/POSG.dat: 35GB
>> - my-dataset/Data-0001/nodes.idn: 33GB
>> - my-dataset/Data-0001/POSG.idn: 29GB
>> - my-dataset/Data-0001/OSPG.idn: 27GB
>> - ...
>>
>> ## Dataset Replica ("unofficial" compaction strategy)
>>
>> Description: Backed up the dataset as NQuads and restored it as a new 
>> dataset with TDBLoader.
>>
>> References:
>> - https://jena.apache.org/documentation/tdb2/tdb2_admin.html#backup
>> - https://jena.apache.org/documentation/tdb2/tdb2_cmds.html
>>
>> RDF Stats:
>>   - Triples: 65222513 (Approximately 65 million)
>>   - Subjects: 20434264 (Approximately 20 million)
>>   - Objects: 8565221 (Approximately 8 million)
>>   - Graphs: 213531 (Approximately 213 thousand)
>>   - Predicates: 153
>>
>> Disk Stats:
>> - my-dataset-replica/Data-0001: 23GB
>> - my-dataset-replica/Data-0001/OSPG.dat: 3.5GB
>> - my-dataset-replica/Data-0001/nodes.dat: 680MB
>> - my-dataset-replica/Data-0001/POSG.dat: 3.6GB
>
> Those look like much more realistic sizes for 65e6 triples spread over 
> multiple named graphs.  I was getting similar for 100e6 BSBM data.
>
>> - my-dataset-replica/Data-0001/nodes.idn: 8.0M
>> - my-dataset-replica/Data-0001/POSG.idn: 32M
>> - my-dataset-replica/Data-0001/OSPG.idn: 32M
>> - ...
>>
>>
>> ## Compacted Dataset ("official" compaction strategy)
>>
>> Description: Compacted using the `/$/compact/` endpoint, generating a new 
>> Data-NNNN folder within the same dataset.
>>
>> References:
>> - https://jena.apache.org/documentation/tdb2/tdb2_admin.html#compaction
>>
>> RDF Stats:
>>   - Triples: 65222513 (Approximately 65 million)
>>   - Subjects: 20434264 (Approximately 20 million)
>>   - Objects: 8565221 (Approximately 8 million)
>>   - Graphs: 213531 (Approximately 213 thousand)
>>   - Predicates: 153
>>
>> Disk Stats:
>> - my-dataset/Data-0002: 23GB
>> - my-dataset/Data-0002/OSPG.dat: 3.7GB
>> - my-dataset/Data-0002/nodes.dat: 680MB
>> - my-dataset/Data-0002/POSG.dat: 3.8GB
>> - my-dataset/Data-0002/nodes.idn: 8.0M
>> - my-dataset/Data-0002/POSG.idn: 40M
>> - my-dataset/Data-0002/OSPG.idn: 32M
>> - ...
>>
>> ## Comparison
>>
>> RDF Stats:
>>   - Triples: Same Count
>>   - Subjects: Same Count
>>   - Objects: Same Count
>>   - Graphs: Same Count
>>   - Predicates: Same Count
>>
>> Disk Stats:
>> - Total Space: ~29x reduction with both strategies
>> - OSPG.dat: ~69x reduction with replication and ~65x reduction with 
>> compaction
>> - nodes.dat: ~111x reduction with both strategies
>> - POSG.dat: ~9.7x reduction with replication and ~7.6x reduction with 
>> compaction
>> - nodes.idn: ~4125x reduction with both strategies
>> - POSG.idn: ~906x reduction with replication and ~725x reduction with 
>> compaction
>> - OSPG.idn: ~843.75x reduction with both strategies
>>
>> ## Queries used to obtain the RDF Stats
>>
>> ### Triples
>> ```
>> SELECT (COUNT(*) as ?count)
>> WHERE {
>>    GRAPH ?graph {
>>      ?subject ?predicate ?object
>>    }
>> }
>> ```
>>
>> ### Graphs
>> ```
>> SELECT (COUNT(DISTINCT ?graph) as ?count)
>> WHERE {
>>    GRAPH ?graph {
>>      ?subject ?predicate ?object
>>    }
>> }
>> ```
>>
>> ### Subjects
>>
>> ```
>> SELECT (COUNT(DISTINCT ?subject) as ?count)
>> WHERE {
>>    GRAPH ?graph {
>>      ?subject ?predicate ?object
>>    }
>> }
>> ```
>>
>> ### Predicates
>> ```
>> SELECT (COUNT(DISTINCT ?predicate) as ?count)
>> WHERE {
>>    GRAPH ?graph {
>>      ?subject ?predicate ?object
>>    }
>> }
>> ```
>>
>> ### Objects
>> ```
>> SELECT (COUNT(DISTINCT ?object) as ?count)
>> WHERE {
>>    GRAPH ?graph {
>>      ?subject ?predicate ?object
>>    }
>> }
>> ```
>>
>> ## Commands used to measure the Disk Stats
>>
>> ### File Sizes
>> ```
>> ls -lh --sort=size
>> ```
>>
>> ### Directory Sizes
>> ```
>> du -h
>> ```
>>
>> Best Regards
>>
>> On 28/01/23 11:01, "Andy Seaborne" <andy@apache.org> wrote:
>>
>>
>> I don't see how OSPG can be a considerably different size. Small 
>> variations happen but this does not look small.
>>
>>
>> Lorenz's advice to run a compaction and see what the index sizes are
>> is a good idea. A backup would also be a good idea because something is
>> unexpected (backup uses GSPO).
>>
>>
>> There have been some fixes in compaction since 4.4.0 related to
>> compacting while also active in Fuseki.
>>
>>
>> This index does not store the literals' string representations - they
>> are referenced via the 8-byte entries. In OSPG, the index entries are 4
>> slots of 8 bytes.
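>>
>> As a back-of-the-envelope check, 4 slots of 8 bytes per quad puts the
>> raw entry volume for ~65 million quads at only about 2 GiB, nowhere
>> near 243GB:
>> ```
>> # 65222513 quads x 4 slots x 8 bytes per entry, expressed in MiB
>> echo $(( 65222513 * 32 / 1024 / 1024 ))   # prints 1990
>> ```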
>>
>>
>> Andy
>>
>>
>> (Unrelated comment below)
>>
>>
>> On 28/01/2023 07:47, Lorenz Buehmann wrote:
>>> Hi Elton,
>>>
>>> Do you have lots of large literals in your data?
>>>
>>> Also, did you try a compaction on the database? If not, can you try it
>>> and post the new file sizes afterwards? Note, they will be located in a
>>> new ./Data-XXXX directory, e.g. before Data-0001 and afterwards 
>>> Data-0002
>>>
>>> By the way, we're now at Jena 4.7.0 - you might have a look at the
>>> release notes of the last 3 versions in case they mention things you
>>> have noticed while running your current Fuseki. If not, just keep it 
>>> running if you're happy
>>> with it, of course.
>>
>>
>>>
>>>
>>> Cheers,
>>> Lorenz
>>>
>>> On 28.01.23 03:10, Elton Soares wrote:
>>>> Dear Jena Community,
>>>>
>>>> I'm running Jena Fuseki Version 4.4.0 as a container on an OpenShift
>>>> Cluster.
>>>> OS Version Info (cat /etc/os-release):
>>>> NAME="Red Hat Enterprise Linux"
>>>> VERSION="8.5 (Ootpa)"
>>>> ID="rhel"
>>>> ID_LIKE="fedora" ="8.5"
>>>> ...
>>>>
>>>> Hardware Info (from Jena Fuseki initialization log):
>>>> [2023-01-27 20:08:59] Server INFO Memory: 32.0 GiB
>>>> [2023-01-27 20:08:59] Server INFO Java: 11.0.14.1
>>>> [2023-01-27 20:08:59] Server INFO OS: Linux
>>>> 3.10.0-1160.76.1.el7.x86_64 amd64
>>>> [2023-01-27 20:08:59] Server INFO PID: 1
>>>>
>>>>
>>>> Disk Info (df -h):
>>>> Filesystem               Size  Used  Avail Use%  Mounted on
>>>> overlay                   99G   76G   18G   82%  /
>>>> tmpfs                     64M     0   64M    0%  /dev
>>>> tmpfs                     63G     0   63G    0%  /sys/fs/cgroup
>>>> shm                       64M     0   64M    0%  /dev/shm
>>>> /dev/mapper/docker_data   99G   76G   18G   82%  /config
>>>> /data                    1.0T  677G  348G   67%  /usr/app/run
>>>> tmpfs                     40G   24K   40G    1%
>>>>
>>>>
>>>> My dataset is built using TDB2, and currently has the following RDF
>>>> Stats:
>>>> · Triples: 65KK (Approximately 65 million)
>>>> · Subjects: ~20KK (Approximately 20 million)
>>>> · Objects: ~8KK (Approximately 8 million)
>>>> · Graphs: ~213K (Approximately 213 thousand)
>>>> · Predicates: 153
>>>>
>>>>
>>>> The files corresponding to this dataset alone on disk sum up to
>>>> approximately 671GB (measured with du -h). From these, the largest
>>>> files are:
>>>> · /usr/app/run/databases/my-dataset/Data-0001/OSPG.dat: 243GB
>>>> · /usr/app/run/databases/my-dataset/Data-0001/nodes.dat: 76GB
>>>> · /usr/app/run/databases/my-dataset/Data-0001/POSG.dat: 35GB
>>>> · /usr/app/run/databases/my-dataset/Data-0001/nodes.idn: 33GB
>>>> · /usr/app/run/databases/my-dataset/Data-0001/POSG.idn: 29GB
>>>> · /usr/app/run/databases/my-dataset/Data-0001/OSPG.idn: 27GB
>>>>
>>>>
>>>> I've looked into several documentation pages, source code, forums, ...
>>>> but nowhere was I able to find an explanation for why OSPG.dat is so
>>>> much larger than all the other files.
>>>> I've been using Jena for quite some time now and I'm well aware that
>>>> its indexes grow significantly during usage, especially when triples
>>>> are added across multiple requests (transactional workloads).
>>>> Even so, the size of this particular file (OSPG.dat) surprised me,
>>>> as in my prior experience the indexes would never get larger than the
>>>> nodes.dat file.
>>>> Is there a reasonable explanation for this based on the content of the
>>>> dataset or the way it was generated? Could this be an indexing bug
>>>> within TDB2?
>>>> Thank you for your support!
>>>> For completeness, here is the assembler configuration for my dataset:
>>>> @prefix :      <http://base/#> .
>>>> @prefix fuseki: <http://jena.apache.org/fuseki#> .
>>>> @prefix ja:     <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>>> @prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>>> @prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
>>>> @prefix root:   <http://dev-test-jena-fuseki/$/datasets> .
>>>> @prefix tdb2:   <http://jena.apache.org/2016/tdb#> .
>>
>>
>> It only needs:
>>
>>
>>
>>
>>>> :service_tdb_my-dataset
>>>>     rdf:type fuseki:Service ;
>>>>     rdfs:label "TDB my-dataset" ;
>>>>     fuseki:dataset :ds_my-dataset ;
>>>>     fuseki:name "my-dataset" ;
>>>>     fuseki:serviceQuery "sparql" , "query" ;
>>>>     fuseki:serviceReadGraphStore "get" ;
>>>>     fuseki:serviceReadWriteGraphStore "data" ;
>>>>     fuseki:serviceUpdate "update" ;
>>>>     fuseki:serviceUpload "upload" .
>>
>>
>>>> :ds_my-dataset rdf:type tdb2:DatasetTDB2 ;
>>>>     tdb2:location "run/databases/my-dataset" ;
>>>>     tdb2:unionDefaultGraph true ;
>>>>     ja:context [ ja:cxtName "arq:optFilterPlacement" ;
>>>>                  ja:cxtValue "false" ] .
>>
>>
>> The rest can go.
>>
>>
>>>>
>>>> This issue has also been published at
>>>> https://stackoverflow.com/questions/75264889/why-does-the-ospg-dat-file-grows-so-much-more-than-all-other-files
>>>>
>>>>
>>>>
>>
>>
>>

Re: Why does the OSPG.dat file grows so much more than all other files?

Posted by "Rob @ DNR" <rv...@dotnetrdf.org>.
Speculating heavily here:

Each index is sorted on its keys, which, as already noted earlier in the thread, are a sequence of 8-byte Node IDs.  If the update patterns for this dataset primarily involve changing the objects of the quads, that could lead to much more frequent rebalancing and rewriting of the O-based index.  This could be particularly true if the objects are of a type that TDB inlines, e.g. integers, since the Node ID encoding preserves ordering for some inlined types (this allows range-based scans to optimise some query filters).  So frequently updating an object that has an ordered, inlineable value would cause the index entry for that quad to be shuffled elsewhere in the B+Tree again and again.

So, hypothetically, if you have triples of the form <subject> <predicate> "1"^^xsd:integer where the object is some counter/metric you are frequently updating, that'd cause lots of churn in the OSPG index.  The other indexes would be less affected because the other nodes in the quad change less frequently.
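
A hypothetical update of that shape (the IRIs and the counter predicate below are invented purely for illustration) would look something like:

```
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX ex:  <http://example.org/>

DELETE { GRAPH ?g { ex:sensor1 ex:count ?old } }
INSERT { GRAPH ?g { ex:sensor1 ex:count ?new } }
WHERE {
  GRAPH ?g { ex:sensor1 ex:count ?old }
  BIND (?old + "1"^^xsd:integer AS ?new)
}
```

Because O leads the key in OSPG, each such update moves the quad's entry to a completely different part of that B+Tree; in the other orderings the changed object sits later in the key, so the entry moves far less.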

Rob

From: Andy Seaborne <an...@apache.org>
Date: Wednesday, 1 February 2023 at 10:11
To: users@jena.apache.org
Subject: Re: Why does the OSPG.dat file grows so much more than all other files?


On 01/02/2023 07:20, Lorenz Buehmann wrote:
> Interesting insights from both of you, thanks.
>
> @Andy do you have a rough idea why only the OSPG index was that large
> compared to the others? What kind of updates would lead to that result?

No. There's no reason in the TDB code to treat one index differently.

I suspect that something on the container host did something, or that a
second container ran against the same database files at some point.

The index is possibly corrupt - the compaction uses SPOG and does not
touch OSPG, so the compacted DB comes out valid.

We don't know much about the usage - clearly there is a high update rate,
but over what time period?

     Andy


Re: Why does the OSPG.dat file grows so much more than all other files?

Posted by Andy Seaborne <an...@apache.org>.

On 01/02/2023 07:20, Lorenz Buehmann wrote:
> Interesting insights from both of you, thanks.
> 
> @Andy do you have a rough idea why only the OSPG index was that large 
> compared to the others? What kind of updates would lead to that result?

No. There's no reason in the TDB code to treat one index differently.

I suspect that something on the container host did something, or that a 
second container ran against the same database files at some point.

The index is possibly corrupt - the compaction uses SPOG and does not 
touch OSPG, so the compacted DB comes out valid.

We don't know much about the usage - clearly there is a high update rate, 
but over what time period?

     Andy
