Posted to users@jena.apache.org by Osma Suominen <os...@helsinki.fi> on 2017/11/16 09:17:14 UTC

TDB1/TDB2 disk space with and without named graphs

Hi,

I've been testing how much disk space TDB1 and TDB2 databases require. I 
was surprised to find out that the disk usage varies quite a lot 
depending on whether data is loaded to the default graph vs. a named 
graph. For TDB1 it also depends on the loading method: tdbloader2 gives 
much more compact databases than tdbloader.

My dataset is about 38M triples of bibliographic data in an N-Triples 
file. There are some blank nodes in the data (unfortunately). Here I'm 
testing only the initial load without a preexisting database.

I used Jena 3.5.0 command line tools. I didn't pay much attention to 
loading times since I'm running this on a VM with shared CPU and disk 
resources and the background load varies over time. I think 
tdb2.tdbloader with named graphs was the slowest at around 42 minutes.

For tdb2.tdbloader and tdbloader2 I had to convert the file to N-Quads 
first to be able to load into a named graph. For tdb2.tdbloader loading 
into a named graph didn't work (JENA-1422, being worked on by Andy); 
tdbloader2 doesn't provide a --graph option at all.
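
The thread does not say how the N-Triples file was converted to N-Quads. As a minimal sketch of one way to do it, the statement terminator can be replaced by a graph label followed by a new terminator. The function name and the example graph IRI are hypothetical, and the code assumes one statement per line with no trailing comment after the final dot.

```python
def ntriples_to_nquads(lines, graph_iri):
    """Yield N-Quads lines that place every triple in the named graph graph_iri.

    Assumption: each statement sits on one line and has no trailing comment
    after the terminating '.'.
    """
    for line in lines:
        stripped = line.strip()
        # Blank lines and comment lines carry no triple; pass them through.
        if not stripped or stripped.startswith("#"):
            yield stripped
            continue
        if not stripped.endswith("."):
            raise ValueError(f"not a complete N-Triples statement: {stripped!r}")
        # Drop the terminating '.', then append the graph label and a new '.'.
        statement = stripped[:-1].rstrip()
        yield f"{statement} <{graph_iri}> ."


example = [
    '<http://example.org/s> <http://example.org/p> "a . b" .',
    "_:b0 <http://example.org/p> <http://example.org/o> .",
]
for quad in ntriples_to_nquads(example, "http://example.org/graph"):
    print(quad)
```

Only the last dot on each line is touched, so dots inside literals are left alone. For a real file the same generator can be fed an open file object and its output written line by line.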


TDB1 results:

5.2G    tdb-bib-default-tdbloader
3.3G    tdb-bib-default-tdbloader2

13G     tdb-bib-named-tdbloader
7.6G    tdb-bib-named-tdbloader2


TDB2 results:

5.2G    tdb2-bib-default
12G     tdb2-bib-named


Some conclusions:

* Loading the same data into a named graph instead of the default graph 
uses a lot more disk space
* There is no huge difference between TDB1 and TDB2 disk space usage 
when doing an apples-to-apples comparison (i.e. either using only the 
default graph for both, or a named graph for both)
* For TDB1, using tdbloader2 instead of tdbloader results in a much 
smaller (around 40%) database, both when using the default graph only 
and when using a named graph
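
The arithmetic behind these conclusions can be checked with a short script. It uses only the sizes listed in the results above; the helper names are just for illustration.

```python
# Sizes in GB, as listed in the results above.
sizes = {
    "tdb-bib-default-tdbloader": 5.2,
    "tdb-bib-default-tdbloader2": 3.3,
    "tdb-bib-named-tdbloader": 13.0,
    "tdb-bib-named-tdbloader2": 7.6,
    "tdb2-bib-default": 5.2,
    "tdb2-bib-named": 12.0,
}


def growth(named, default):
    """How many times larger the named-graph database is."""
    return sizes[named] / sizes[default]


def saving(small, large):
    """Fraction of disk space saved by the smaller database."""
    return 1 - sizes[small] / sizes[large]


named_overhead = {
    "TDB1 tdbloader": growth("tdb-bib-named-tdbloader", "tdb-bib-default-tdbloader"),
    "TDB1 tdbloader2": growth("tdb-bib-named-tdbloader2", "tdb-bib-default-tdbloader2"),
    "TDB2 tdb2.tdbloader": growth("tdb2-bib-named", "tdb2-bib-default"),
}
loader2_saving = {
    "default graph": saving("tdb-bib-default-tdbloader2", "tdb-bib-default-tdbloader"),
    "named graph": saving("tdb-bib-named-tdbloader2", "tdb-bib-named-tdbloader"),
}

for label, value in named_overhead.items():
    print(f"{label}: named graph is {value:.2f}x the default-graph size")
for label, value in loader2_saving.items():
    print(f"tdbloader2 vs tdbloader, {label}: {value:.0%} smaller")
```

This gives a named-graph factor of roughly 2.3 to 2.5 and a tdbloader2 saving of about 37% (default graph) and 42% (named graph).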

My larger goal is to decide whether to use TDB1 or TDB2 (or something 
else, like HDT or Blazegraph...) for a new bibliographic Linked Data 
service. Disk space is a factor (though not the most important one) in 
the calculation.

-Osma


-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: TDB1/TDB2 disk space with and without named graphs

Posted by Osma Suominen <os...@helsinki.fi>.
Dave Reynolds kirjoitti 16.11.2017 klo 13:42:

> If a TDB image only has a default graph, no named graphs at all, then it 
> acts as a triple store and only needs the three SPO, POS, OSP indexes. In 
> that configuration it doesn't generate the graph indexes at all.
> 
> As soon as you have one named graph (even if small) then it acts as a 
> quad store and needs all 9 indexes (GOSP etc). The extra indexes take 
> more space, even if the underlying quad count is the same as the triple 
> count in the default-only case.

Okay. But if I load the big dataset into the default graph, then load a 
very small dataset into a named graph, the size doesn't change. It's 
still around 5GB.

-Osma



Re: TDB1/TDB2 disk space with and without named graphs

Posted by Dave Reynolds <da...@gmail.com>.
On 16/11/17 11:30, Osma Suominen wrote:
> 
> 
> Rob Vesse kirjoitti 16.11.2017 klo 13:13:
> 
>> This is by design. As has been discussed in the past, tdbloader2 
>> produces maximally packed B+Trees by preprocessing the data, which 
>> minimises disk space usage.
> [...]
>> As Andy mentioned on an earlier thread, tdb2.tdbloader essentially 
>> has the same behaviour as tdbloader. Because of the different data 
>> structures, load performance should be much better anyway, and he did 
>> not think there would be much benefit to having a tdb2.tdbloader2 
>> variant. Also, given the different data structures, I’m not sure this 
>> would be as practical.
> 
> Right. I was just surprised at how big the difference is.
> 
> tdbloader2 is both fast and space-efficient, which makes it a lot more 
> appealing than tdb2.tdbloader, which in my (very limited) experience is 
> slow and space-hungry (but similar to tdbloader for TDB1).
> 
> But the real surprise was the space overhead of named graphs. More than 
> twice the space just because I decide to put the data in a named graph 
> instead of the default graph? And that seems to be the case both for 
> TDB1 (both tdbloader and tdbloader2) and TDB2.

I assume you are seeing the difference between a triple store and quad 
store configuration.

If a TDB image only has a default graph, no named graphs at all, then it 
acts as a triple store and only needs the three SPO, POS, OSP indexes. In 
that configuration it doesn't generate the graph indexes at all.

As soon as you have one named graph (even if small) then it acts as a 
quad store and needs all 9 indexes (GOSP etc). The extra indexes take 
more space, even if the underlying quad count is the same as the triple 
count in the default-only case.
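
A rough back-of-envelope estimate of the index footprint, as a sketch only. The figures below are assumptions, not stated in this thread: fixed 8-byte node identifiers, the 9 indexes being 3 triple orderings plus 6 quad orderings, each statement stored once per ordering that applies to it, and no allowance for the node table or B+Tree page overhead.

```python
NODE_ID_BYTES = 8      # assumption: fixed-width node identifiers
TRIPLE_INDEXES = 3     # SPO, POS, OSP
QUAD_INDEXES = 6       # assumption: 9 indexes in total = 3 triple + 6 quad


def index_bytes(statements, slots, index_count):
    """Raw bytes for `statements` records of `slots` node ids in each index."""
    return statements * slots * NODE_ID_BYTES * index_count


n = 38_000_000  # dataset size reported at the start of the thread

default_only = index_bytes(n, slots=3, index_count=TRIPLE_INDEXES)
named_only = index_bytes(n, slots=4, index_count=QUAD_INDEXES)

print(f"default graph only: {default_only / 1e9:.2f} GB of index records")
print(f"one named graph:    {named_only / 1e9:.2f} GB of index records")
print(f"ratio:              {named_only / default_only:.2f}")
```

Under these assumptions the quad records need about 7.3 GB against about 2.7 GB for the triple records, a factor of about 2.7. That is the same order as the measured sizes reported earlier (for example 3.3G vs 7.6G with tdbloader2), where the node table adds a cost that does not depend on the graph.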

Dave


Re: TDB1/TDB2 disk space with and without named graphs

Posted by Andy Seaborne <an...@apache.org>.

On 16/11/17 11:30, Osma Suominen wrote:
> 
> 
> Rob Vesse kirjoitti 16.11.2017 klo 13:13:
> 
>> This is by design. As has been discussed in the past, tdbloader2 
>> produces maximally packed B+Trees by preprocessing the data, which 
>> minimises disk space usage.
> [...]
>> As Andy mentioned on an earlier thread, tdb2.tdbloader essentially 
>> has the same behaviour as tdbloader. Because of the different data 
>> structures, load performance should be much better anyway, and he did 
>> not think there would be much benefit to having a tdb2.tdbloader2 
>> variant. Also, given the different data structures, I’m not sure this 
>> would be as practical.
> 
> Right. I was just surprised at how big the difference is.
> 
> tdbloader2 is both fast and space-efficient,

but does not run on MS Windows.

> that makes it a lot more 
> appealing than tdb2.tdbloader which in my (very limited) experience is 
> slow and space-hungry (but similar to tdbloader for TDB1).
> 
> But the real surprise was the space overhead of named graphs. More than 
> twice the space just because I decide to put the data in a named graph 
> instead of the default graph? And that seems to be the case both for 
> TDB1 (both tdbloader and tdbloader2) and TDB2.

The tdbloader in TDB2 is a simple one. I found that this current simple one 
was faster than expected (maybe because it is append-only and so disk 
friendly, or maybe because I was using an SSD for testing).

Rather than wait for that work area to be done, I thought it would be 
good to contribute and release TDB2 now. It's experimental.

The TDB1 loaders could be ported.


TDB2 goals are to address the scale limitations on transactions, the 
write-back queue overload problems, a better architecture e.g. fully 
integrate in jena-text transactions, and no quirks about models across 
transactions.

The style of tdbloader2 could be used to compact databases down. The 
current compaction is a simple one, good for getting a trusted one to work.

	Andy

Re: TDB1/TDB2 disk space with and without named graphs

Posted by Osma Suominen <os...@helsinki.fi>.

Rob Vesse kirjoitti 16.11.2017 klo 13:13:

> This is by design. As has been discussed in the past, tdbloader2 produces maximally packed B+Trees by preprocessing the data, which minimises disk space usage.
[...]
> As Andy mentioned on an earlier thread, tdb2.tdbloader essentially has the same behaviour as tdbloader. Because of the different data structures, load performance should be much better anyway, and he did not think there would be much benefit to having a tdb2.tdbloader2 variant. Also, given the different data structures, I’m not sure this would be as practical.

Right. I was just surprised at how big the difference is.

tdbloader2 is both fast and space-efficient, which makes it a lot more 
appealing than tdb2.tdbloader, which in my (very limited) experience is 
slow and space-hungry (but similar to tdbloader for TDB1).

But the real surprise was the space overhead of named graphs. More than 
twice the space just because I decide to put the data in a named graph 
instead of the default graph? And that seems to be the case both for 
TDB1 (both tdbloader and tdbloader2) and TDB2.

-Osma



Re: TDB1/TDB2 disk space with and without named graphs

Posted by Rob Vesse <rv...@dotnetrdf.org>.
On 16/11/2017, 09:17, "Osma Suominen" <os...@helsinki.fi> wrote:

    For TDB1 it also depends on the loading method: tdbloader2 gives 
    much more compact databases than tdbloader.

This is by design. As has been discussed in the past, tdbloader2 produces maximally packed B+Trees by preprocessing the data, which minimises disk space usage.

tdbloader just inserts data in the order that it is encountered in the inputs, which likely has zero relation to the ideal data ordering that tdbloader2 leverages.

The other key difference is that tdbloader2 is only able to create new databases while tdbloader is able to insert into an existing database.
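
To illustrate why the ordering matters, here is a toy model of B+Tree leaf pages. It is a sketch, not Jena's loader code: the page capacity, the split-in-half rule, and the function names are all assumptions made for the illustration. It compares the number of leaf pages after inserting keys in arrival order with the number needed when sorted keys are packed into full pages.

```python
import bisect
import math
import random


def leaves_after_incremental_inserts(keys, capacity):
    """Count leaf pages after inserting keys one by one, splitting full leaves in half."""
    leaves = [[]]             # each leaf holds its keys in sorted order
    lows = [float("-inf")]    # lows[i] is the lower bound of keys routed to leaves[i]
    for key in keys:
        i = bisect.bisect_right(lows, key) - 1
        leaf = leaves[i]
        bisect.insort(leaf, key)
        if len(leaf) > capacity:
            # Overflow: move the upper half into a new right-hand sibling.
            mid = len(leaf) // 2
            right = leaf[mid:]
            del leaf[mid:]
            leaves.insert(i + 1, right)
            lows.insert(i + 1, right[0])
    return len(leaves)


def leaves_after_packed_build(key_count, capacity):
    """Count leaf pages when sorted keys are written into completely full pages."""
    return math.ceil(key_count / capacity)


random.seed(0)
keys = random.sample(range(1_000_000), 20_000)
incremental = leaves_after_incremental_inserts(keys, capacity=100)
packed = leaves_after_packed_build(len(keys), capacity=100)
print(f"incremental inserts: {incremental} leaves, packed build: {packed} leaves")
```

Every split leaves two half-full pages behind, so the incremental build ends up with more pages than the packed one for the same keys. The real size difference also depends on details this model ignores.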

As Andy mentioned on an earlier thread, tdb2.tdbloader essentially has the same behaviour as tdbloader. Because of the different data structures, load performance should be much better anyway, and he did not think there would be much benefit to having a tdb2.tdbloader2 variant. Also, given the different data structures, I’m not sure this would be as practical.

Rob





Re: TDB1/TDB2 disk space with and without named graphs

Posted by Andy Seaborne <an...@apache.org>.

On 16/11/17 21:37, Andy Seaborne wrote:
> 
> 
> On 16/11/17 20:36, Osma Suominen wrote:
>> Andy Seaborne kirjoitti 16.11.2017 klo 22:04:
>>
>>> TDB1 or HDT.
>>>
>>> TDB2 has no benefits for you at 40M triples with occasional updates.
>>
>> Compaction would be a benefit, if it could be automated. But 
>> apparently not in the current state (see today's dev@ thread).
>>
>>> TDB2 goals are to address the scale limitations on transactions, the 
>>> write-back queue overload problems, a better architecture e.g. fully 
>>> integrate in jena-text transactions, and no quirks about models 
>>> across transactions. TDB2 is experimental at this stage.
>>
>> Understood.
>>
>>> (You could use DatasetGraphSwitchable in TDB2 to make a switchable 
>>> HDT backed database.)
>>
>> Thanks for the tip!
>>
>> I think there's a lot of potential in HDT, it's just hampered by 
>> implementation bugs and lack of resources on the hdt-java side. For my 
>> use case it would be almost perfect, but the hdt-java implementation 
>> doesn't support union default graph functionality [1]. It could be 
>> added of course, just hasn't been.
> 
> Fuseki (well, ARQ) supports union graph on all datasets these days.
> 
> It will be a loop over graphs if necessary, and suppressing duplicates 
> is expensive in the general case. Putting graphs one by one into a 
> general purpose RDF Dataset (DatasetImpl) means a loop.
> 
> (they use dataset in the general sense of "collection of data", not RDF 
> Dataset)
> 
>      Andy
> 
>>
>> -Osma
>>
>> [1] https://github.com/rdfhdt/hdt-java/issues/3

... isn't about union graphs.

It is about whether FROM/FROM NAMED pick graphs from the service 
dataset, which in Fuseki1 they don't.

It is reported that

     select * where {graph ?g {?s ?p ?o}}

works, so named graphs are working; that has nothing to do with HDT.

If they upgrade to the current Fuseki2, that should fix it, and then
GRAPH <urn:x-arq:UnionGraph> should work.
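
As a sketch of what such a request could look like from a client, the snippet below builds a union-graph query and the corresponding GET URL. The endpoint URL and the function names are hypothetical; only the graph name comes from the text above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

UNION_GRAPH = "urn:x-arq:UnionGraph"  # graph name mentioned above


def union_graph_query(limit=10):
    """Build a SELECT query over the union of all named graphs."""
    return f"SELECT * WHERE {{ GRAPH <{UNION_GRAPH}> {{ ?s ?p ?o }} }} LIMIT {limit}"


def query_url(endpoint, query):
    """Encode the query as the 'query' parameter of a GET request URL."""
    return f"{endpoint}?{urlencode({'query': query})}"


# Hypothetical endpoint; substitute the query URL of the actual service.
url = query_url("http://localhost:3030/ds/sparql", union_graph_query())
print(url)
```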

     Andy

>>
>>

Re: TDB1/TDB2 disk space with and without named graphs

Posted by Andy Seaborne <an...@apache.org>.

On 16/11/17 20:36, Osma Suominen wrote:
> Andy Seaborne kirjoitti 16.11.2017 klo 22:04:
> 
>> TDB1 or HDT.
>>
>> TDB2 has no benefits for you at 40M triples with occasional updates.
> 
> Compaction would be a benefit, if it could be automated. But apparently 
> not in the current state (see today's dev@ thread).
> 
>> TDB2 goals are to address the scale limitations on transactions, the 
>> write-back queue overload problems, a better architecture e.g. fully 
>> integrate in jena-text transactions, and no quirks about models across 
>> transactions. TDB2 is experimental at this stage.
> 
> Understood.
> 
>> (You could use DatasetGraphSwitchable in TDB2 to make a switchable HDT 
>> backed database.)
> 
> Thanks for the tip!
> 
> I think there's a lot of potential in HDT, it's just hampered by 
> implementation bugs and lack of resources on the hdt-java side. For my 
> use case it would be almost perfect, but the hdt-java implementation 
> doesn't support union default graph functionality [1]. It could be added 
> of course, just hasn't been.

Fuseki (well, ARQ) supports union graph on all datasets these days.

It will be a loop over graphs if necessary, and suppressing duplicates 
is expensive in the general case. Putting graphs one by one into a 
general purpose RDF Dataset (DatasetImpl) means a loop.
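
A minimal sketch of that general case, to show where the cost comes from. This is not ARQ's code: the dataset is modelled as a plain dict from graph name to a list of triples, and the function name is made up for the illustration.

```python
def union_graph(named_graphs):
    """Yield each distinct triple found in any named graph exactly once."""
    seen = set()  # every triple yielded so far; this is the expensive part
    for triples in named_graphs.values():
        for triple in triples:
            if triple not in seen:
                seen.add(triple)
                yield triple


dataset = {
    "http://example.org/g1": [("s1", "p", "o1"), ("s1", "p", "o2")],
    "http://example.org/g2": [("s1", "p", "o1"), ("s2", "p", "o1")],
}
for triple in union_graph(dataset):
    print(triple)
```

The loop has to visit every graph, and the `seen` set grows with the number of distinct triples, which is why duplicate suppression is costly when the storage layer cannot do it itself.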

(they use dataset in the general sense of "collection of data", not RDF 
Dataset)

     Andy

> 
> -Osma
> 
> [1] https://github.com/rdfhdt/hdt-java/issues/3
> 
> 

Re: TDB1/TDB2 disk space with and without named graphs

Posted by Osma Suominen <os...@helsinki.fi>.
Andy Seaborne kirjoitti 16.11.2017 klo 22:04:

> TDB1 or HDT.
> 
> TDB2 has no benefits for you at 40M triples with occasional updates.

Compaction would be a benefit, if it could be automated. But apparently 
not in the current state (see today's dev@ thread).

> TDB2 goals are to address the scale limitations on transactions, the 
> write-back queue overload problems, a better architecture e.g. fully 
> integrate in jena-text transactions, and no quirks about models across 
> transactions. TDB2 is experimental at this stage.

Understood.

> (You could use DatasetGraphSwitchable in TDB2 to make a switchable HDT 
> backed database.)

Thanks for the tip!

I think there's a lot of potential in HDT, it's just hampered by 
implementation bugs and lack of resources on the hdt-java side. For my 
use case it would be almost perfect, but the hdt-java implementation 
doesn't support union default graph functionality [1]. It could be added 
of course, just hasn't been.

-Osma

[1] https://github.com/rdfhdt/hdt-java/issues/3



Re: TDB1/TDB2 disk space with and without named graphs

Posted by Andy Seaborne <an...@apache.org>.

On 16/11/17 09:17, Osma Suominen wrote:
> Hi,
> 
> I've been testing how much disk space TDB1 and TDB2 databases require. I 
> was surprised to find out that the disk usage varies quite a lot 
> depending on whether data is loaded to the default graph vs. a named 
> graph. For TDB1 it also depends on the loading method: tdbloader2 gives 
> much more compact databases than tdbloader.

Elsewhere on this thread.

> My dataset is about 38M triples of bibliographic data in an N-Triples 
> file. There are some blank nodes in the data (unfortunately). Here I'm 
> testing only the initial load without a preexisting database.

Blank nodes make no difference to TDB1 or TDB2.

> I used Jena 3.5.0 command line tools. I didn't pay much attention to 
> loading times since I'm running this on a VM with shared CPU and disk 
> resources and the background load varies over time. I think 
> tdb2.tdbloader with named graphs was the slowest at around 42 minutes.
> 
> For tdb2.tdbloader and tdbloader2 I had to convert the file to N-Quads 
> first to be able to load into a named graph. For tdb2.tdbloader loading 
> into a named graph didn't work (JENA-1422, being worked on by Andy); 
> tdbloader2 doesn't provide a --graph option at all.

Indeed, it does not. It only loads quads. You have to convert the data.

> TDB1 results:
> 
> 5.2G    tdb-bib-default-tdbloader
> 3.3G    tdb-bib-default-tdbloader2
> 
> 13G     tdb-bib-named-tdbloader
> 7.6G    tdb-bib-named-tdbloader2
> 
> 
> TDB2 results:
> 
> 5.2G    tdb2-bib-default
> 12G     tdb2-bib-named

Initially you should get _approximately_ the same size for 
tdb2.tdbloader and tdbloader (i.e. loader1).

tdbloader2 (for TDB1) datasets will expand over time (updates) and lose 
their compactness.

> Some conclusions:
> 
> * Loading the same data into a named graph instead of the default graph 
> uses a lot more disk space
> * There is no huge difference between TDB1 and TDB2 disk space usage 
> when doing an apples-to-apples comparison (i.e. either using only the 
> default graph for both, or a named graph for both)

There will be if you update in place later.

> * For TDB1, using tdbloader2 instead of tdbloader results in a much 
> smaller (around 40%) database, both when using the default graph only 
> and when using a named graph
> 
> My larger goal is to decide whether to use TDB1 or TDB2 (or something 
> else, like HDT or Blazegraph...) for a new bibliographic Linked Data 
> service. Disk space is a factor (though not the most important one) in 
> the calculation.

TDB1 or HDT.

TDB2 has no benefits for you at 40M triples with occasional updates.

TDB2 goals are to address the scale limitations on transactions, the 
write-back queue overload problems, a better architecture e.g. fully 
integrate in jena-text transactions, and no quirks about models across 
transactions. TDB2 is experimental at this stage.

(You could use DatasetGraphSwitchable in TDB2 to make a switchable HDT 
backed database.)

> 
> -Osma

     Andy