Posted to users@jena.apache.org by Laura Morales <la...@mail.com> on 2017/11/25 11:40:09 UTC

Estimating TDB2 size

Is it possible to estimate the size of a TDB2 store from an nt/turtle/xml input file, without actually creating the store? Is there maybe a tool for this?

Re: Estimating TDB2 size

Posted by Andy Seaborne <an...@apache.org>.
Another factor is the ratio of triples to terms.

In some data, the same terms are used often; that is obviously true for 
properties much of the time. If there are a lot of triples with the same 
subject, there is a higher number of triples compared to terms.

Some data is more linked: subjects and objects are the same terms.

TDB stores terms once.

     Andy
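
If you want to get a feel for that ratio for your own file before loading anything, a minimal sketch along these lines works with Jena's streaming parser (RDFParser/StreamRDF). Note it is only a sketch: holding every distinct term in a HashSet will itself use a lot of memory on big files.

    import java.util.HashSet;
    import java.util.Set;

    import org.apache.jena.graph.Node;
    import org.apache.jena.graph.Triple;
    import org.apache.jena.riot.RDFParser;
    import org.apache.jena.riot.system.StreamRDFBase;

    // Count triples and distinct RDF terms in a file without building a store,
    // to get a feel for the triples-to-terms ratio described above.
    public class TermRatio {
        public static void main(String[] args) {
            Set<Node> terms = new HashSet<>();
            long[] triples = { 0 };

            RDFParser.source(args[0])            // e.g. "data.nt" or "data.nt.gz"
                     .parse(new StreamRDFBase() {
                         @Override
                         public void triple(Triple t) {
                             triples[0]++;
                             terms.add(t.getSubject());
                             terms.add(t.getPredicate());
                             terms.add(t.getObject());
                         }
                     });

            System.out.printf("triples=%d, distinct terms=%d, ratio=%.2f%n",
                    triples[0], terms.size(), (double) triples[0] / terms.size());
        }
    }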

On 25/11/17 11:42, ajs6f wrote:
> Andy may be able to be more precise, but I can tell you right away that it's not a straightforward function. How many literals are there "per triple"? How big are the literals, on average? How many unique bnodes and URIs? All of these things will change the eventual size of the database.
> 
> ajs6f
> 
>> On Nov 25, 2017, at 6:40 AM, Laura Morales <la...@mail.com> wrote:
>>
>> Is it possible to estimate the size of a TDB2 store from one of nt/turtle/xml input file, without actually creating the store? Is there maybe a tool for this?
> 

Re: Estimating TDB2 size

Posted by Andy Seaborne <an...@apache.org>.
Laura,

HDT does not provide update.

So it can keep compressed data structures. In fact, it uses as much 
compression as it can, unconstrained by any need to support updates. 
Updating such content-compressed data structures would be very expensive 
(e.g. the dictionary changes).

     Andy
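
A toy illustration of the point, nothing like HDT's actual layout: if the dictionary is a sorted array and a term's id is simply its position, inserting one new term renumbers every term that sorts after it, so all stored triple ids referring to those terms would have to be rewritten as well.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Toy model of a positional (sorted-array) dictionary, to show why updates
    // are expensive: inserting a term shifts the ids of every term that sorts
    // after it, invalidating all stored triple ids that refer to those terms.
    public class PositionalDictionary {
        private final List<String> sorted = new ArrayList<>();

        // Build once from the distinct terms; id == position in sorted order.
        public PositionalDictionary(List<String> terms) {
            sorted.addAll(terms);
            Collections.sort(sorted);
        }

        public int idOf(String term) {
            return Collections.binarySearch(sorted, term);   // >= 0 if present
        }

        // "Update": a read-only format never has to pay this cost, which is
        // why it can compress far more aggressively than an updatable store.
        public void add(String term) {
            int pos = -(Collections.binarySearch(sorted, term) + 1);
            sorted.add(pos, term);                           // O(n) shift, ids change
        }
    }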

On 26/11/17 16:44, Laura Morales wrote:
> HDT does actually create more indices "out of band", that it they create a separate file.hdt.index files. The combined size however is still much smaller than a TDB store of the same file, but I don't know if this is down to TDB simply having more indices.
>   
>   
> 
> Sent: Sunday, November 26, 2017 at 5:20 PM
> From: ajs6f <aj...@apache.org>
> To: users@jena.apache.org
> Subject: Re: Estimating TDB2 size
> You have to start with the understanding that the indexes in a database are not the same thing nor for the same purpose as a simple file of triples or quads. TDB1 and 2 store the same triple several times in different orders (SPO, OPS, etc.) in order to be able to answer arbitrary queries with good performance, which is a common technique.
> 
> There is no reason to expect a database that is capable of answering arbitrary queries with good performance to be as small as a file, which is not.
> 
> ajs6f
> 

Re: Estimating TDB2 size

Posted by Andy Seaborne <an...@apache.org>.
See "5. Querying HDT-encoded datasets"
www.rdfhdt.org/technical-specification/


On 26/11/17 16:58, ajs6f wrote:
> The point is that HDT is expressly designed to be compact at all costs, and TDB is not at all. The indexes are one (important) aspect. From http://www.rdfhdt.org/what-is-hdt/:
> 
> "The internal compression techniques of HDT allow that most part of the data (or even the whole dataset) can be kept in main memory, which is several orders of magnitude faster than disks."
> 
> "HDT is read-only, so it can dispatch many queries per second using multiple threads."
> 
> That is a radically different design that Jena TDB, which relies on OS-provided file caching and offers transactional updates. Benchmarking is hard at best, and comparing software with different priorities and intentions doesn't help.
> 
> Otherwise, you could compare HDT with Jena's in-memory datasets, which (obviously) do expect that the data is kept in memory.
> 
> ajs6f
> 
>> On Nov 26, 2017, at 11:44 AM, Laura Morales <la...@mail.com> wrote:
>>
>> HDT does actually create more indices "out of band", that it they create a separate file.hdt.index files. The combined size however is still much smaller than a TDB store of the same file, but I don't know if this is down to TDB simply having more indices.
>>   
>>   
>>
>> Sent: Sunday, November 26, 2017 at 5:20 PM
>> From: ajs6f <aj...@apache.org>
>> To: users@jena.apache.org
>> Subject: Re: Estimating TDB2 size
>> You have to start with the understanding that the indexes in a database are not the same thing nor for the same purpose as a simple file of triples or quads. TDB1 and 2 store the same triple several times in different orders (SPO, OPS, etc.) in order to be able to answer arbitrary queries with good performance, which is a common technique.
>>
>> There is no reason to expect a database that is capable of answering arbitrary queries with good performance to be as small as a file, which is not.
>>
>> ajs6f
> 

Re: Estimating TDB2 size

Posted by ajs6f <aj...@apache.org>.
The point is that HDT is expressly designed to be compact at all costs, and TDB is not at all. The indexes are one (important) aspect. From http://www.rdfhdt.org/what-is-hdt/:

"The internal compression techniques of HDT allow that most part of the data (or even the whole dataset) can be kept in main memory, which is several orders of magnitude faster than disks."

"HDT is read-only, so it can dispatch many queries per second using multiple threads."

That is a radically different design than Jena TDB, which relies on OS-provided file caching and offers transactional updates. Benchmarking is hard at best, and comparing software with different priorities and intentions doesn't help.

Otherwise, you could compare HDT with Jena's in-memory datasets, which (obviously) do expect that the data is kept in memory.

ajs6f

> On Nov 26, 2017, at 11:44 AM, Laura Morales <la...@mail.com> wrote:
> 
> HDT does actually create more indices "out of band", that it they create a separate file.hdt.index files. The combined size however is still much smaller than a TDB store of the same file, but I don't know if this is down to TDB simply having more indices.
>  
>  
> 
> Sent: Sunday, November 26, 2017 at 5:20 PM
> From: ajs6f <aj...@apache.org>
> To: users@jena.apache.org
> Subject: Re: Estimating TDB2 size
> You have to start with the understanding that the indexes in a database are not the same thing nor for the same purpose as a simple file of triples or quads. TDB1 and 2 store the same triple several times in different orders (SPO, OPS, etc.) in order to be able to answer arbitrary queries with good performance, which is a common technique.
> 
> There is no reason to expect a database that is capable of answering arbitrary queries with good performance to be as small as a file, which is not.
> 
> ajs6f


Re: Estimating TDB2 size

Posted by Laura Morales <la...@mail.com>.
HDT does actually create more indices "out of band", that is, it creates separate file.hdt.index files. The combined size, however, is still much smaller than a TDB store of the same file, but I don't know if this is down to TDB simply having more indices.
 
 

Sent: Sunday, November 26, 2017 at 5:20 PM
From: ajs6f <aj...@apache.org>
To: users@jena.apache.org
Subject: Re: Estimating TDB2 size
You have to start with the understanding that the indexes in a database are not the same thing nor for the same purpose as a simple file of triples or quads. TDB1 and 2 store the same triple several times in different orders (SPO, OPS, etc.) in order to be able to answer arbitrary queries with good performance, which is a common technique.

There is no reason to expect a database that is capable of answering arbitrary queries with good performance to be as small as a file, which is not.

ajs6f

Re: Estimating TDB2 size

Posted by Rob Vesse <rv...@dotnetrdf.org>.
I see that you specifically said that your data file was compressed with GZip. You can often get up to around 32 times compression for N-Triples because its verbosity makes it extremely well suited to GZip compression. Therefore, your uncompressed data is probably more like 6-7 GB, so full-coverage indexing and dictionary encoding are roughly doubling the actual data size.
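
Spelled out with the numbers from this thread (the gzip ratio is a rule-of-thumb guess, not a measurement):

    // Back-of-the-envelope arithmetic with the figures mentioned in this thread.
    public class RoughSizes {
        public static void main(String[] args) {
            double gzippedGB = 0.219;            // the 219 MB nt.gz file
            double gzipRatio = 30.0;             // N-Triples often compresses ~30x with gzip
            double rawGB     = gzippedGB * gzipRatio;

            long   triples   = 33_000_000L;      // reported triple count
            double tdbBytes  = 13e9;             // reported TDB size on disk

            System.out.printf("uncompressed N-Triples ~ %.1f GB%n", rawGB);
            System.out.printf("TDB on disk ~ %.0f bytes per triple%n", tdbBytes / triples);
        }
    }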

Rob

On 26/11/2017, 18:11, "Andrew U. Frank" <fr...@geoinfo.tuwien.ac.at> wrote:

    of course, my comparison was naive - and I should have known better, 
    working with databases back in the 1980s. still it surprises when a 200 
    mb files becomes 13 gb - given that diskspace and memory is inexpensive 
    compared to human time waiting for responses, the design choices are 
    amply justified. i will buy more memory (;-)
    
    andrew
    
    
    On 11/26/2017 11:20 AM, ajs6f wrote:
    > You have to start with the understanding that the indexes in a database are not the same thing nor for the same purpose as a simple file of triples or quads. TDB1 and 2 store the same triple several times in different orders (SPO, OPS, etc.) in order to be able to answer arbitrary queries with good performance, which is a common technique.
    >
    > There is no reason to expect a database that is capable of answering arbitrary queries with good performance to be as small as a file, which is not.
    >
    > ajs6f
    >
    >> On Nov 26, 2017, at 11:17 AM, Andrew U. Frank <fr...@geoinfo.tuwien.ac.at> wrote:
    >>
    >> thank you for the explanations:
    >>
    >> to laura: i guess HDT would reduce the size of my files considerably. where could i find information how to use fuseki with HDT? i might be worth trying and see how response time changes.
    >>
    >> to andy: am i correct to understand that a triple (uri p literal) is translated in two triples (uri p uriX) and a second one (uriX s literal) for some properties p and s? is there any reuse of existing literals? that would give for each literal triple approx. 60 bytes?
    >>
    >> i still do not undestand how a triple needs about 300 bytes of storage? (or how an nt.gzip file of 219 M igives a TDB database of 13 GB)
    >>
    >> size of the database is of concern to me and I think it influences performance through the use of IO time.
    >>
    >> thank you all very much for the clarifications!
    >>
    >> andrew
    >>
    >>
    >>
    >> On 11/26/2017 07:30 AM, Andy Seaborne wrote:
    >>> Every RDFTerm gets a NodeId in TDB.  A triple is 3 NodeIds.
    >>>
    >>> There is a big cache, NodeId->RDFTerm.
    >>>
    >>> In TDB1 and TDB2, a NodeId is stored as 8 bytes. TDB2 design is an int and long (96 bits) - the current implementation is using 64 bits.
    >>>
    >>> It is very common as a design to dictionary (intern) terms because joins can be done by comparing a integers, not testing whether two strings are the same, which is much more expensive.
    >>>
    >>> In addition TDBx inlines numbers integers and date/times and some others.
    >>>
    >>> https://jena.apache.org/documentation/tdb/architecture.html
    >>>
    >>> TDBx could, but doesn't, store compressed data on disk. There are pros and cons of this.
    >>>
    >>>      Andy
    >>>
    >>> On 26/11/17 08:30, Laura Morales wrote:
    >>>> Perhaps a bit tangential but this is somehow related to how HDT stores its data (I've run some tests with Fuseki + HDT store instead of TDB). Basically, they assign each subject, predicate, and object an integer value. It keeps an index to map integers with the corresponding string (of the original value), and then they store every triple using integers instead of strings (something like "1 2 9. 8 2 1 ." and so forth. The drawback I think is that they have to translate indices/strings back and forth at each query, nonetheless the response time is still impressive (milliseconds), and it compresses the original file *a lot*. By a lot I mean that for Wikidata (not the full file though, but one with about 2.3 billion triples) the HDT is more or less 40GB, and gz-compressed about 10GB. The problem is that their rdf2hdt tool is so inefficient that it does everything in RAM, so to convert something like wikidata you'd need at least a machine with 512GB of ram (or swap if you have a fast enough swap :D). Also the tool looks like it can't handle files with more than 2^32 triples, although HDT (the format) does handle them. So as long as you can handle the conversion, if you want to save space you could benefit from using a HDT store rather than using TDB.
    >>>>
    >>>>
    >>>>
    >>>> Sent: Sunday, November 26, 2017 at 5:30 AM
    >>>> From: "Andrew U. Frank" <fr...@geoinfo.tuwien.ac.at>
    >>>> To: users@jena.apache.org
    >>>> Subject: Re: Estimating TDB2 size
    >>>> i have   specific questiosn in relation to what ajs6f said:
    >>>>
    >>>> i have a TDB store with 1/3 triples with very small literals (3-5 char),
    >>>> where the same sequence is often repeated. would i get smaller store and
    >>>> better performance if these were URI of the character sequence (stored
    >>>> once for each repeated case)? any guess how much I could improve?
    >>>>
    >>>> does the size of the URI play a role in the amount of storage used. i
    >>>> observe that i have for 33 M triples a TDB size (files) of 13 GB, which
    >>>> means about 300 byte per triple. the literals are all short (very seldom
    >>>> more than 10 char, mostly 5 - words from english text). is is a named
    >>>> graph, if this makes a difference.
    >>>>
    >>>> thank you!
    >>>>
    >>>> andrew
    >>>>
    >> -- 
    >> em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
    >>                                  +43 1 58801 12710 direct
    >> Geoinformation, TU Wien          +43 1 58801 12700 office
    >> Gusshausstr. 27-29               +43 1 55801 12799 fax
    >> 1040 Wien Austria                +43 676 419 25 72 mobil
    >>
    
    -- 
    em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
                                      +43 1 58801 12710 direct
    Geoinformation, TU Wien          +43 1 58801 12700 office
    Gusshausstr. 27-29               +43 1 55801 12799 fax
    1040 Wien Austria                +43 676 419 25 72 mobil
    
    
    





Re: Estimating TDB2 size

Posted by "Andrew U. Frank" <fr...@geoinfo.tuwien.ac.at>.
Of course, my comparison was naive - and I should have known better, 
having worked with databases back in the 1980s. Still, it surprises when a 
200 MB file becomes 13 GB. Given that disk space and memory are inexpensive 
compared to human time waiting for responses, the design choices are 
amply justified. I will buy more memory (;-)

andrew


On 11/26/2017 11:20 AM, ajs6f wrote:
> You have to start with the understanding that the indexes in a database are not the same thing nor for the same purpose as a simple file of triples or quads. TDB1 and 2 store the same triple several times in different orders (SPO, OPS, etc.) in order to be able to answer arbitrary queries with good performance, which is a common technique.
>
> There is no reason to expect a database that is capable of answering arbitrary queries with good performance to be as small as a file, which is not.
>
> ajs6f
>
>> On Nov 26, 2017, at 11:17 AM, Andrew U. Frank <fr...@geoinfo.tuwien.ac.at> wrote:
>>
>> thank you for the explanations:
>>
>> to laura: i guess HDT would reduce the size of my files considerably. where could i find information how to use fuseki with HDT? i might be worth trying and see how response time changes.
>>
>> to andy: am i correct to understand that a triple (uri p literal) is translated in two triples (uri p uriX) and a second one (uriX s literal) for some properties p and s? is there any reuse of existing literals? that would give for each literal triple approx. 60 bytes?
>>
>> i still do not undestand how a triple needs about 300 bytes of storage? (or how an nt.gzip file of 219 M igives a TDB database of 13 GB)
>>
>> size of the database is of concern to me and I think it influences performance through the use of IO time.
>>
>> thank you all very much for the clarifications!
>>
>> andrew
>>
>>
>>
>> On 11/26/2017 07:30 AM, Andy Seaborne wrote:
>>> Every RDFTerm gets a NodeId in TDB.  A triple is 3 NodeIds.
>>>
>>> There is a big cache, NodeId->RDFTerm.
>>>
>>> In TDB1 and TDB2, a NodeId is stored as 8 bytes. TDB2 design is an int and long (96 bits) - the current implementation is using 64 bits.
>>>
>>> It is very common as a design to dictionary (intern) terms because joins can be done by comparing a integers, not testing whether two strings are the same, which is much more expensive.
>>>
>>> In addition TDBx inlines numbers integers and date/times and some others.
>>>
>>> https://jena.apache.org/documentation/tdb/architecture.html
>>>
>>> TDBx could, but doesn't, store compressed data on disk. There are pros and cons of this.
>>>
>>>      Andy
>>>
>>> On 26/11/17 08:30, Laura Morales wrote:
>>>> Perhaps a bit tangential but this is somehow related to how HDT stores its data (I've run some tests with Fuseki + HDT store instead of TDB). Basically, they assign each subject, predicate, and object an integer value. It keeps an index to map integers with the corresponding string (of the original value), and then they store every triple using integers instead of strings (something like "1 2 9. 8 2 1 ." and so forth. The drawback I think is that they have to translate indices/strings back and forth at each query, nonetheless the response time is still impressive (milliseconds), and it compresses the original file *a lot*. By a lot I mean that for Wikidata (not the full file though, but one with about 2.3 billion triples) the HDT is more or less 40GB, and gz-compressed about 10GB. The problem is that their rdf2hdt tool is so inefficient that it does everything in RAM, so to convert something like wikidata you'd need at least a machine with 512GB of ram (or swap if you have a fast enough swap :D). Also the tool looks like it can't handle files with more than 2^32 triples, although HDT (the format) does handle them. So as long as you can handle the conversion, if you want to save space you could benefit from using a HDT store rather than using TDB.
>>>>
>>>>
>>>>
>>>> Sent: Sunday, November 26, 2017 at 5:30 AM
>>>> From: "Andrew U. Frank" <fr...@geoinfo.tuwien.ac.at>
>>>> To: users@jena.apache.org
>>>> Subject: Re: Estimating TDB2 size
>>>> i have   specific questiosn in relation to what ajs6f said:
>>>>
>>>> i have a TDB store with 1/3 triples with very small literals (3-5 char),
>>>> where the same sequence is often repeated. would i get smaller store and
>>>> better performance if these were URI of the character sequence (stored
>>>> once for each repeated case)? any guess how much I could improve?
>>>>
>>>> does the size of the URI play a role in the amount of storage used. i
>>>> observe that i have for 33 M triples a TDB size (files) of 13 GB, which
>>>> means about 300 byte per triple. the literals are all short (very seldom
>>>> more than 10 char, mostly 5 - words from english text). is is a named
>>>> graph, if this makes a difference.
>>>>
>>>> thank you!
>>>>
>>>> andrew
>>>>
>> -- 
>> em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
>>                                  +43 1 58801 12710 direct
>> Geoinformation, TU Wien          +43 1 58801 12700 office
>> Gusshausstr. 27-29               +43 1 55801 12799 fax
>> 1040 Wien Austria                +43 676 419 25 72 mobil
>>

-- 
em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
                                  +43 1 58801 12710 direct
Geoinformation, TU Wien          +43 1 58801 12700 office
Gusshausstr. 27-29               +43 1 55801 12799 fax
1040 Wien Austria                +43 676 419 25 72 mobil



Re: Estimating TDB2 size

Posted by ajs6f <aj...@apache.org>.
You have to start with the understanding that the indexes in a database are not the same thing nor for the same purpose as a simple file of triples or quads. TDB1 and 2 store the same triple several times in different orders (SPO, OPS, etc.) in order to be able to answer arbitrary queries with good performance, which is a common technique.

There is no reason to expect a database that is capable of answering arbitrary queries with good performance to be as small as a file, which is not.

ajs6f
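
A very small sketch of that idea, nothing like TDB's actual B+trees: the same id-triples are kept sorted in several permutations, so whichever components of a pattern are bound, one of the orders turns the lookup into a prefix range scan.

    import java.util.Comparator;
    import java.util.NavigableSet;
    import java.util.TreeSet;

    // Toy illustration of multi-order triple indexes (not TDB's real code).
    public class MultiOrderIndexes {
        private static final Comparator<long[]> LEX = (a, b) -> {
            for (int i = 0; i < 3; i++) {
                int c = Long.compare(a[i], b[i]);
                if (c != 0) return c;
            }
            return 0;
        };

        private final NavigableSet<long[]> spo = new TreeSet<>(LEX);
        private final NavigableSet<long[]> pos = new TreeSet<>(LEX);
        private final NavigableSet<long[]> osp = new TreeSet<>(LEX);

        public void add(long s, long p, long o) {
            spo.add(new long[] { s, p, o });   // the same triple is stored three times,
            pos.add(new long[] { p, o, s });   // which is a large part of why a store
            osp.add(new long[] { o, s, p });   // is bigger than the flat data file
        }

        // Choose the order whose leading columns are bound (simplified).
        public NavigableSet<long[]> chooseIndex(boolean sBound, boolean pBound, boolean oBound) {
            if (sBound && !oBound) return spo;   // (s,?,?), (s,p,?)
            if (pBound)            return pos;   // (?,p,?), (?,p,o), (s,p,o)
            if (oBound)            return osp;   // (?,?,o), (s,?,o)
            return spo;                          // (?,?,?) -> scan everything
        }
    }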

> On Nov 26, 2017, at 11:17 AM, Andrew U. Frank <fr...@geoinfo.tuwien.ac.at> wrote:
> 
> thank you for the explanations:
> 
> to laura: i guess HDT would reduce the size of my files considerably. where could i find information how to use fuseki with HDT? i might be worth trying and see how response time changes.
> 
> to andy: am i correct to understand that a triple (uri p literal) is translated in two triples (uri p uriX) and a second one (uriX s literal) for some properties p and s? is there any reuse of existing literals? that would give for each literal triple approx. 60 bytes?
> 
> i still do not undestand how a triple needs about 300 bytes of storage? (or how an nt.gzip file of 219 M igives a TDB database of 13 GB)
> 
> size of the database is of concern to me and I think it influences performance through the use of IO time.
> 
> thank you all very much for the clarifications!
> 
> andrew
> 
> 
> 
> On 11/26/2017 07:30 AM, Andy Seaborne wrote:
>> Every RDFTerm gets a NodeId in TDB.  A triple is 3 NodeIds.
>> 
>> There is a big cache, NodeId->RDFTerm.
>> 
>> In TDB1 and TDB2, a NodeId is stored as 8 bytes. TDB2 design is an int and long (96 bits) - the current implementation is using 64 bits.
>> 
>> It is very common as a design to dictionary (intern) terms because joins can be done by comparing a integers, not testing whether two strings are the same, which is much more expensive.
>> 
>> In addition TDBx inlines numbers integers and date/times and some others.
>> 
>> https://jena.apache.org/documentation/tdb/architecture.html
>> 
>> TDBx could, but doesn't, store compressed data on disk. There are pros and cons of this.
>> 
>>     Andy
>> 
>> On 26/11/17 08:30, Laura Morales wrote:
>>> Perhaps a bit tangential but this is somehow related to how HDT stores its data (I've run some tests with Fuseki + HDT store instead of TDB). Basically, they assign each subject, predicate, and object an integer value. It keeps an index to map integers with the corresponding string (of the original value), and then they store every triple using integers instead of strings (something like "1 2 9. 8 2 1 ." and so forth. The drawback I think is that they have to translate indices/strings back and forth at each query, nonetheless the response time is still impressive (milliseconds), and it compresses the original file *a lot*. By a lot I mean that for Wikidata (not the full file though, but one with about 2.3 billion triples) the HDT is more or less 40GB, and gz-compressed about 10GB. The problem is that their rdf2hdt tool is so inefficient that it does everything in RAM, so to convert something like wikidata you'd need at least a machine with 512GB of ram (or swap if you have a fast enough swap :D). Also the tool looks like it can't handle files with more than 2^32 triples, although HDT (the format) does handle them. So as long as you can handle the conversion, if you want to save space you could benefit from using a HDT store rather than using TDB.
>>> 
>>> 
>>> 
>>> Sent: Sunday, November 26, 2017 at 5:30 AM
>>> From: "Andrew U. Frank" <fr...@geoinfo.tuwien.ac.at>
>>> To: users@jena.apache.org
>>> Subject: Re: Estimating TDB2 size
>>> i have   specific questiosn in relation to what ajs6f said:
>>> 
>>> i have a TDB store with 1/3 triples with very small literals (3-5 char),
>>> where the same sequence is often repeated. would i get smaller store and
>>> better performance if these were URI of the character sequence (stored
>>> once for each repeated case)? any guess how much I could improve?
>>> 
>>> does the size of the URI play a role in the amount of storage used. i
>>> observe that i have for 33 M triples a TDB size (files) of 13 GB, which
>>> means about 300 byte per triple. the literals are all short (very seldom
>>> more than 10 char, mostly 5 - words from english text). is is a named
>>> graph, if this makes a difference.
>>> 
>>> thank you!
>>> 
>>> andrew
>>> 
> 
> -- 
> em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
>                                 +43 1 58801 12710 direct
> Geoinformation, TU Wien          +43 1 58801 12700 office
> Gusshausstr. 27-29               +43 1 55801 12799 fax
> 1040 Wien Austria                +43 676 419 25 72 mobil
> 


Re: Estimating TDB2 size

Posted by Laura Morales <la...@mail.com>.
> where could i find information how to use fuseki with HDT?

http://www.rdfhdt.org/manual-of-hdt-integration-with-jena/

Please report back your findings. I'm curious to read more feedback about this.

Re: Estimating TDB2 size

Posted by "Andrew U. Frank" <fr...@geoinfo.tuwien.ac.at>.
Thank you for the explanations:

To Laura: I guess HDT would reduce the size of my files considerably. 
Where could I find information on how to use Fuseki with HDT? It might be 
worth trying, to see how the response time changes.

To Andy: am I correct to understand that a triple (uri p literal) is 
translated into two triples, (uri p uriX) and a second one (uriX s literal), 
for some properties p and s? Is there any reuse of existing literals? 
That would give approx. 60 bytes for each literal triple?

I still do not understand how a triple needs about 300 bytes of storage 
(or how an nt.gzip file of 219 MB gives a TDB database of 13 GB).

The size of the database is of concern to me, and I think it influences 
performance through IO time.

Thank you all very much for the clarifications!

andrew



On 11/26/2017 07:30 AM, Andy Seaborne wrote:
> Every RDFTerm gets a NodeId in TDB.  A triple is 3 NodeIds.
>
> There is a big cache, NodeId->RDFTerm.
>
> In TDB1 and TDB2, a NodeId is stored as 8 bytes. TDB2 design is an int 
> and long (96 bits) - the current implementation is using 64 bits.
>
> It is very common as a design to dictionary (intern) terms because 
> joins can be done by comparing a integers, not testing whether two 
> strings are the same, which is much more expensive.
>
> In addition TDBx inlines numbers integers and date/times and some others.
>
> https://jena.apache.org/documentation/tdb/architecture.html
>
> TDBx could, but doesn't, store compressed data on disk. There are pros 
> and cons of this.
>
>     Andy
>
> On 26/11/17 08:30, Laura Morales wrote:
>> Perhaps a bit tangential but this is somehow related to how HDT 
>> stores its data (I've run some tests with Fuseki + HDT store instead 
>> of TDB). Basically, they assign each subject, predicate, and object 
>> an integer value. It keeps an index to map integers with the 
>> corresponding string (of the original value), and then they store 
>> every triple using integers instead of strings (something like "1 2 
>> 9. 8 2 1 ." and so forth. The drawback I think is that they have to 
>> translate indices/strings back and forth at each query, nonetheless 
>> the response time is still impressive (milliseconds), and it 
>> compresses the original file *a lot*. By a lot I mean that for 
>> Wikidata (not the full file though, but one with about 2.3 billion 
>> triples) the HDT is more or less 40GB, and gz-compressed about 10GB. 
>> The problem is that their rdf2hdt tool is so inefficient that it does 
>> everything in RAM, so to convert something like wikidata you'd need 
>> at least a machine with 512GB of ram (or swap if you have a fast 
>> enough swap :D). Also the tool looks like it can't handle files with 
>> more than 2^32 triples, although HDT (the format) does handle them. 
>> So as long as you can handle the conversion, if you want to save 
>> space you could benefit from using a HDT store rather than using TDB.
>>
>>
>>
>> Sent: Sunday, November 26, 2017 at 5:30 AM
>> From: "Andrew U. Frank" <fr...@geoinfo.tuwien.ac.at>
>> To: users@jena.apache.org
>> Subject: Re: Estimating TDB2 size
>> i have   specific questiosn in relation to what ajs6f said:
>>
>> i have a TDB store with 1/3 triples with very small literals (3-5 char),
>> where the same sequence is often repeated. would i get smaller store and
>> better performance if these were URI of the character sequence (stored
>> once for each repeated case)? any guess how much I could improve?
>>
>> does the size of the URI play a role in the amount of storage used. i
>> observe that i have for 33 M triples a TDB size (files) of 13 GB, which
>> means about 300 byte per triple. the literals are all short (very seldom
>> more than 10 char, mostly 5 - words from english text). is is a named
>> graph, if this makes a difference.
>>
>> thank you!
>>
>> andrew
>>

-- 
em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
                                  +43 1 58801 12710 direct
Geoinformation, TU Wien          +43 1 58801 12700 office
Gusshausstr. 27-29               +43 1 55801 12799 fax
1040 Wien Austria                +43 676 419 25 72 mobil


Re: Estimating TDB2 size

Posted by Andy Seaborne <an...@apache.org>.
Also the compression of the dictionary and the triples-with-integers.

     Andy

On 26/11/17 13:27, ajs6f wrote:
> TDB is offering indexes in multiple orders to support fast query. HDT [1] appears to store one order and add additional information when loaded into main memory. That might be part of it.
> 
> ajs6f
> 
> [1] http://www.rdfhdt.org/hdt-internals/
> 
>> On Nov 26, 2017, at 7:56 AM, Laura Morales <la...@mail.com> wrote:
>>
>> I wonder... if TDB like HDT uses integers instead of strings, why is there such a difference in the store size? HDT files are so much smaller.
>>   
>>   
>>
>> Sent: Sunday, November 26, 2017 at 1:30 PM
>> From: "Andy Seaborne" <an...@apache.org>
>> To: users@jena.apache.org
>> Subject: Re: Estimating TDB2 size
>> Every RDFTerm gets a NodeId in TDB. A triple is 3 NodeIds.
>>
>> There is a big cache, NodeId->RDFTerm.
>>
>> In TDB1 and TDB2, a NodeId is stored as 8 bytes. TDB2 design is an int
>> and long (96 bits) - the current implementation is using 64 bits.
>>
>> It is very common as a design to dictionary (intern) terms because joins
>> can be done by comparing a integers, not testing whether two strings are
>> the same, which is much more expensive.
>>
>> In addition TDBx inlines numbers integers and date/times and some others.
>>
>> https://jena.apache.org/documentation/tdb/architecture.html
>>
>> TDBx could, but doesn't, store compressed data on disk. There are pros
>> and cons of this.
>>
>> Andy
> 

Re: Estimating TDB2 size

Posted by ajs6f <aj...@apache.org>.
TDB is offering indexes in multiple orders to support fast query. HDT [1] appears to store one order and add additional information when loaded into main memory. That might be part of it.

ajs6f

[1] http://www.rdfhdt.org/hdt-internals/

> On Nov 26, 2017, at 7:56 AM, Laura Morales <la...@mail.com> wrote:
> 
> I wonder... if TDB like HDT uses integers instead of strings, why is there such a difference in the store size? HDT files are so much smaller.
>  
>  
> 
> Sent: Sunday, November 26, 2017 at 1:30 PM
> From: "Andy Seaborne" <an...@apache.org>
> To: users@jena.apache.org
> Subject: Re: Estimating TDB2 size
> Every RDFTerm gets a NodeId in TDB. A triple is 3 NodeIds.
> 
> There is a big cache, NodeId->RDFTerm.
> 
> In TDB1 and TDB2, a NodeId is stored as 8 bytes. TDB2 design is an int
> and long (96 bits) - the current implementation is using 64 bits.
> 
> It is very common as a design to dictionary (intern) terms because joins
> can be done by comparing a integers, not testing whether two strings are
> the same, which is much more expensive.
> 
> In addition TDBx inlines numbers integers and date/times and some others.
> 
> https://jena.apache.org/documentation/tdb/architecture.html
> 
> TDBx could, but doesn't, store compressed data on disk. There are pros
> and cons of this.
> 
> Andy


Re: Estimating TDB2 size

Posted by Laura Morales <la...@mail.com>.
I wonder... if TDB, like HDT, uses integers instead of strings, why is there such a difference in the store size? HDT files are so much smaller.
 
 

Sent: Sunday, November 26, 2017 at 1:30 PM
From: "Andy Seaborne" <an...@apache.org>
To: users@jena.apache.org
Subject: Re: Estimating TDB2 size
Every RDFTerm gets a NodeId in TDB. A triple is 3 NodeIds.

There is a big cache, NodeId->RDFTerm.

In TDB1 and TDB2, a NodeId is stored as 8 bytes. TDB2 design is an int
and long (96 bits) - the current implementation is using 64 bits.

It is very common as a design to dictionary (intern) terms because joins
can be done by comparing integers, not by testing whether two strings are
the same, which is much more expensive.

In addition, TDBx inlines numbers (integers), date/times and some others.

https://jena.apache.org/documentation/tdb/architecture.html

TDBx could, but doesn't, store compressed data on disk. There are pros
and cons of this.

Andy

Re: Estimating TDB2 size

Posted by Andy Seaborne <an...@apache.org>.
Every RDFTerm gets a NodeId in TDB.  A triple is 3 NodeIds.

There is a big cache, NodeId->RDFTerm.

In TDB1 and TDB2, a NodeId is stored as 8 bytes. TDB2 design is an int 
and long (96 bits) - the current implementation is using 64 bits.

It is very common as a design to dictionary (intern) terms because joins 
can be done by comparing integers, not by testing whether two strings are 
the same, which is much more expensive.

In addition, TDBx inlines numbers (integers), date/times and some others.

https://jena.apache.org/documentation/tdb/architecture.html

TDBx could, but doesn't, store compressed data on disk. There are pros 
and cons of this.

     Andy
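
In code terms, the interning idea is roughly the following. It is a minimal sketch, not TDB's actual node table, and the last method is only a hint at what inlining means.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Minimal sketch of term interning: each distinct RDF term is stored once
    // and given a fixed-width id; triples become three ids, and joins compare
    // ids (cheap) instead of strings (expensive).
    public class TermDictionary {
        private final Map<String, Long> termToId = new HashMap<>();
        private final List<String> idToTerm = new ArrayList<>();
        private final List<long[]> triples = new ArrayList<>();

        private long intern(String term) {
            return termToId.computeIfAbsent(term, t -> {
                idToTerm.add(t);
                return (long) (idToTerm.size() - 1);
            });
        }

        public void add(String s, String p, String o) {
            triples.add(new long[] { intern(s), intern(p), intern(o) });
        }

        public String term(long id) {        // the id -> term direction,
            return idToTerm.get((int) id);   // which TDB fronts with a big cache
        }

        // Hint at inlining: a small numeric value could be packed straight into
        // the id (with a tag bit) instead of going through the dictionary at all.
        public static boolean couldInline(String lexicalForm) {
            try {
                Long.parseLong(lexicalForm);
                return true;
            } catch (NumberFormatException e) {
                return false;
            }
        }
    }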

On 26/11/17 08:30, Laura Morales wrote:
> Perhaps a bit tangential but this is somehow related to how HDT stores its data (I've run some tests with Fuseki + HDT store instead of TDB). Basically, they assign each subject, predicate, and object an integer value. It keeps an index to map integers with the corresponding string (of the original value), and then they store every triple using integers instead of strings (something like "1 2 9. 8 2 1 ." and so forth. The drawback I think is that they have to translate indices/strings back and forth at each query, nonetheless the response time is still impressive (milliseconds), and it compresses the original file *a lot*. By a lot I mean that for Wikidata (not the full file though, but one with about 2.3 billion triples) the HDT is more or less 40GB, and gz-compressed about 10GB. The problem is that their rdf2hdt tool is so inefficient that it does everything in RAM, so to convert something like wikidata you'd need at least a machine with 512GB of ram (or swap if you have a fast enough swap :D). Also the tool looks like it can't handle files with more than 2^32 triples, although HDT (the format) does handle them. So as long as you can handle the conversion, if you want to save space you could benefit from using a HDT store rather than using TDB.
> 
> 
> 
> Sent: Sunday, November 26, 2017 at 5:30 AM
> From: "Andrew U. Frank" <fr...@geoinfo.tuwien.ac.at>
> To: users@jena.apache.org
> Subject: Re: Estimating TDB2 size
> i have   specific questiosn in relation to what ajs6f said:
> 
> i have a TDB store with 1/3 triples with very small literals (3-5 char),
> where the same sequence is often repeated. would i get smaller store and
> better performance if these were URI of the character sequence (stored
> once for each repeated case)? any guess how much I could improve?
> 
> does the size of the URI play a role in the amount of storage used. i
> observe that i have for 33 M triples a TDB size (files) of 13 GB, which
> means about 300 byte per triple. the literals are all short (very seldom
> more than 10 char, mostly 5 - words from english text). is is a named
> graph, if this makes a difference.
> 
> thank you!
> 
> andrew
> 

Re: Estimating TDB2 size

Posted by Laura Morales <la...@mail.com>.
Perhaps a bit tangential, but this is somehow related to how HDT stores its data (I've run some tests with Fuseki + an HDT store instead of TDB). Basically, they assign each subject, predicate, and object an integer value. They keep an index mapping each integer to the corresponding string (the original value), and then they store every triple using integers instead of strings (something like "1 2 9 ." and "8 2 1 ." and so forth). The drawback, I think, is that they have to translate indices/strings back and forth at each query; nonetheless the response time is still impressive (milliseconds), and it compresses the original file *a lot*. By a lot I mean that for Wikidata (not the full file, but one with about 2.3 billion triples) the HDT is more or less 40 GB, and gz-compressed about 10 GB. The problem is that their rdf2hdt tool is so inefficient that it does everything in RAM, so to convert something like Wikidata you'd need at least a machine with 512 GB of RAM (or swap, if you have a fast enough swap :D). Also the tool looks like it can't handle files with more than 2^32 triples, although HDT (the format) does handle them. So as long as you can handle the conversion, if you want to save space you could benefit from using an HDT store rather than TDB.



Sent: Sunday, November 26, 2017 at 5:30 AM
From: "Andrew U. Frank" <fr...@geoinfo.tuwien.ac.at>
To: users@jena.apache.org
Subject: Re: Estimating TDB2 size
I have specific questions in relation to what ajs6f said:

I have a TDB store where 1/3 of the triples have very small literals (3-5 chars),
and the same character sequence is often repeated. Would I get a smaller store and
better performance if these were URIs for the character sequences (stored
once for each repeated case)? Any guess how much I could improve?

Does the size of the URIs play a role in the amount of storage used? I
observe that for 33 M triples I have a TDB size (files) of 13 GB, which
means about 300 bytes per triple. The literals are all short (very seldom
more than 10 chars, mostly 5 - words from English text). It is a named
graph, if this makes a difference.

thank you!

andrew

Re: Estimating TDB2 size

Posted by "Andrew U. Frank" <fr...@geoinfo.tuwien.ac.at>.
I have specific questions in relation to what ajs6f said:

I have a TDB store where 1/3 of the triples have very small literals (3-5 chars), 
and the same character sequence is often repeated. Would I get a smaller store and 
better performance if these were URIs for the character sequences (stored 
once for each repeated case)? Any guess how much I could improve?

Does the size of the URIs play a role in the amount of storage used? I 
observe that for 33 M triples I have a TDB size (files) of 13 GB, which 
means about 300 bytes per triple. The literals are all short (very seldom 
more than 10 chars, mostly 5 - words from English text). It is a named 
graph, if this makes a difference.

thank you!

andrew


On 11/25/2017 06:42 AM, ajs6f wrote:
> Andy may be able to be more precise, but I can tell you right away that it's not a straightforward function. How many literals are there "per triple"? How big are the literals, on average? How many unique bnodes and URIs? All of these things will change the eventual size of the database.
>
> ajs6f
>
>> On Nov 25, 2017, at 6:40 AM, Laura Morales <la...@mail.com> wrote:
>>
>> Is it possible to estimate the size of a TDB2 store from one of nt/turtle/xml input file, without actually creating the store? Is there maybe a tool for this?

-- 
em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
                                  +43 1 58801 12710 direct
Geoinformation, TU Wien          +43 1 58801 12700 office
Gusshausstr. 27-29               +43 1 55801 12799 fax
1040 Wien Austria                +43 676 419 25 72 mobil


Re: Estimating TDB2 size

Posted by ajs6f <aj...@apache.org>.
Andy may be able to be more precise, but I can tell you right away that it's not a straightforward function. How many literals are there "per triple"? How big are the literals, on average? How many unique bnodes and URIs? All of these things will change the eventual size of the database.

ajs6f

> On Nov 25, 2017, at 6:40 AM, Laura Morales <la...@mail.com> wrote:
> 
> Is it possible to estimate the size of a TDB2 store from one of nt/turtle/xml input file, without actually creating the store? Is there maybe a tool for this?
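
A very crude figure can be put together from exactly these counts plus the layout described elsewhere in the thread (3 NodeIds of 8 bytes per triple, several index orders, a node table). All the constants below are guesses for illustration, not measurements.

    // Very rough TDB size estimate from simple counts of the input data.
    // Real sizes depend on B+tree fill factors, the node table encoding,
    // transaction journal state, and so on.
    public class CrudeTdbEstimate {
        public static long estimateBytes(long triples, long distinctTerms, double avgTermBytes) {
            int indexes          = 3;     // e.g. SPO, POS, OSP orderings
            int bytesPerNodeId   = 8;     // a NodeId is stored as 8 bytes
            double btreeOverhead = 2.0;   // guess: index blocks are not packed full

            double indexBytes = triples * indexes * 3 * bytesPerNodeId * btreeOverhead;
            double nodeTable  = distinctTerms * (avgTermBytes + 16);   // term + id/offset overhead (guess)
            return (long) (indexBytes + nodeTable);
        }

        public static void main(String[] args) {
            // Illustrative numbers only: 33 M triples, 10 M distinct terms, ~60 bytes per term.
            long estimate = estimateBytes(33_000_000L, 10_000_000L, 60);
            System.out.printf("~%.1f GB (very rough, more a lower bound than a prediction)%n", estimate / 1e9);
        }
    }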