Posted to users@jena.apache.org by Vinay Mahamuni <vi...@thoughtworks.com> on 2022/01/27 06:14:50 UTC

How to optimize TDB disk storage?

Hello,

I am using Apache Jena v4.3.2 + Fuseki + TDB2 persistent disk storage. I am
using Jena RDFConnection to connect to the Fuseki server. I am sending 50k
triples in one update. This is mostly new data (only a few triples will
match existing data). These data are instances based on an ontology.
Please have a look at the attached file showing how much the disk usage
increases with each update. For 1.5 million triples, it took around 1.2GB.
We want to store a few billion triples, so this bytes-per-triple ratio
won't be acceptable for our use case.
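
For reference, sending one such batch over RDFConnection looks roughly
like the sketch below. This is an illustration only, not the actual code
behind this thread; the endpoint URL and the batch-building step are
placeholders.

    // Minimal sketch: push one batch of ~50k triples to Fuseki via RDFConnection.
    // The dataset URL and how the batch model is filled are assumptions.
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdfconnection.RDFConnection;
    import org.apache.jena.rdfconnection.RDFConnectionFactory;

    public class BatchUpdateExample {
        public static void main(String[] args) {
            String service = "http://localhost:3030/ds";   // hypothetical Fuseki dataset URL

            Model batch = ModelFactory.createDefaultModel();
            // ... add the ~50k ontology-based instance triples to 'batch' here ...

            try (RDFConnection conn = RDFConnectionFactory.connect(service)) {
                conn.load(batch);   // one HTTP request, one write transaction on the server
            }
        }
    }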

When I used the tdb2.tdbcompact tool, the data volume shrank to 400MB.
But this extra step needs to be performed manually to optimise the storage.

My questions are as follows:

   1. Why do 30 update queries of 50k triples each take 3 times more disk
   space than a single update query of 1500k triples? The data stored is
   the same, but the space consumed is higher in the first case.
   2. Is there any other way to solve this storage problem?
   3. What existing strategies can be used to optimise the storage
   space while writing data?
   4. Is there any new development under way to use less disk space for
   write/update queries?


Thanks,
Vinay Mahamuni

Re: How to optimize TDB disk storage?

Posted by Vinay Mahamuni <vi...@thoughtworks.com>.
Hi Andy,

Thank you very much for the answers.

Regards,
Vinay Mahamuni

Re: How to optimize TDB disk storage?

Posted by Andy Seaborne <an...@apache.org>.
Hi Vinay,


On 27/01/2022 06:14, Vinay Mahamuni wrote:
> Hello,
> 
> I am using Apache Jena v4.3.2 + Fuseki + TDB2 persistent disk storage. I
> am using Jena RDFConnection to connect to the Fuseki server. I am
> sending 50k triples in one update. This is mostly new data (only a few
> triples will match existing data). These data are instances based
> on an ontology. Please have a look at the attached file showing how
> much the disk usage increases with each update. For 1.5 million triples,
> it took around 1.2GB. We want to store a few billion triples, so this
> bytes-per-triple ratio won't be acceptable for our use case.
> 
> When I used the tdb2.tdbcompact tool, the data volume shrank to 400MB.
> But this extra step needs to be performed manually to optimise the storage.

It can be triggered by an admin process with e.g. "cron".

It doesn't have to be done very often unless your volume of 50k triple 
transactions is very high - in which case I suggest batching them into 
larger units.
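
For illustration, such a scheduled job could trigger the compaction over
HTTP instead of running the command-line tool by hand. The sketch below
is an assumption-laden example: it presumes the Fuseki admin area is
enabled and exposes the /$/compact/{dataset} endpoint; the server URL and
dataset name "ds" are placeholders.

    // Hedged sketch: ask a running Fuseki server to compact a TDB2 dataset.
    // Assumes the admin endpoint /$/compact/{name} is available on this install.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class TriggerCompaction {
        public static void main(String[] args) throws Exception {
            String compactUrl = "http://localhost:3030/$/compact/ds";  // placeholder

            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create(compactUrl))
                    .POST(HttpRequest.BodyPublishers.noBody())
                    .build();

            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            // A 2xx status means the request was accepted; progress and the
            // resulting Data-NNNN directory are reported in the server log.
            System.out.println("Compact request returned HTTP " + response.statusCode());
        }
    }

A cron entry could run this (or an equivalent curl call) at a quiet time of day.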

> 
> My questions are as follows:
> 
>  1. Why do 30 update queries of 50k triples each take 3 times more disk
>     space than a single update query of 1500k triples? The data stored is
>     the same, but the space consumed is higher in the first case.

TDB2 uses an MVCC/copy-on-write scheme for transaction isolation. It
gives a very high isolation guarantee (serializable).

That means there is a per-transaction overhead here which is recovered
by compact. In fact, the space can't be recovered at the time of the
write because the old data may still be in use by read-transactions
seeing the pre-write state.

Compact is similar (not identical) to PostgreSQL VACUUM.

Note that all the additional space is recovered by "compact". The active
directory is the highest-numbered "Data-NNNN". You can delete the earlier
ones once the "compact" has finished, as logged in the server log. Or zip
them and keep them as backups - Fuseki has released them and does not
touch them.  Caution: on MS Windows, due to a long-standing (10+ year)
Java JDK issue, the server has to be stopped and restarted to properly
release old files.
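
As an illustration of that cleanup step, the sketch below lists the
"Data-NNNN" directories under a database directory and reports every one
except the highest-numbered (active) one. It is a hedged example: the
database path is a placeholder, it only prints candidates, and nothing
should be deleted until the compaction has finished.

    // Minimal sketch: find superseded "Data-NNNN" directories after a compaction.
    // Printing only; archiving or deleting them is left to the operator.
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    public class ListOldDataDirs {
        public static void main(String[] args) throws IOException {
            Path dbDir = Paths.get("/data/tdb2/ds");   // hypothetical TDB2 database directory

            List<Path> dataDirs;
            try (var entries = Files.list(dbDir)) {
                dataDirs = entries
                        .filter(Files::isDirectory)
                        .filter(p -> p.getFileName().toString().matches("Data-\\d+"))
                        .sorted(Comparator.comparingLong(
                                (Path p) -> Long.parseLong(p.getFileName().toString().substring("Data-".length()))))
                        .collect(Collectors.toList());
            }
            if (dataDirs.isEmpty())
                return;

            Path active = dataDirs.get(dataDirs.size() - 1);   // highest number = active storage
            System.out.println("Active storage : " + active);
            dataDirs.subList(0, dataDirs.size() - 1)
                    .forEach(p -> System.out.println("Safe to archive: " + p));
        }
    }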

It doesn't matter whether it was one large write transaction or 100
write transactions: the compacted database will be the same size. The
database will have grown more for 100 writes than for 1, but more space
is then recovered, and the new data storage is the same size once you
delete the now-unused storage areas.

>  2. Is there any other way to solve this storage problem?

Schedule "compact", delete the old data storage.

If the updates are a stream of updates without reading the database,
write a big file (N-Triples, Turtle: just write them all concatenated
into a single file).
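
From the Java side, that can be as simple as the sketch below: instead of
POSTing each 50k-triple batch to Fuseki, append each batch as N-Triples
to one file. This is illustrative only; the file path and where the
batches come from are assumptions.

    // Hedged sketch: accumulate update batches into a single N-Triples file
    // for later offline loading, rather than sending each batch to Fuseki.
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.riot.Lang;
    import org.apache.jena.riot.RDFDataMgr;

    public class AppendBatchesToFile {
        // N-Triples is line-based, so concatenating batches into one file is safe.
        public static void appendBatch(String file, Model batch) throws Exception {
            try (OutputStream out = new FileOutputStream(file, true)) {   // append mode
                RDFDataMgr.write(out, batch, Lang.NTRIPLES);
            }
        }
    }

The resulting file can then be bulk-loaded offline, as described next.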

You can also consider, instead of loading into Fuseki, using the bulk
loader tdb2.tdbloader to build the database offline, then putting it in
place, then starting Fuseki. The bulk loader is significantly faster when
sizes get into the hundreds of millions of triples.

>  3. What existing strategies can be used to optimise the storage
>     space while writing data?
>  4. Is there any new development under way to use less disk space for
>     write/update queries?

Just plans that need resources!

It would be nice to have server-side transactions over several updates
(which is beyond what the SPARQL protocol can do).

--

I've tried TDB with other storage systems (e.g. RocksDB) but the ability 
to directly write the on-disk format is useful - it makes the bulk 
loader work.

--

There are other issues as well in your use case.

It also depends on the data: if many triples have unique literals/URIs,
the node table is proportionately large.

     Andy

> 
> 
> Thanks,
> Vinay Mahamuni