Posted to users@jena.apache.org by Mikael Pesonen <mi...@lingsoft.fi> on 2018/06/04 12:18:22 UTC
Querying used disk size
Hi,
What would be the best way to estimate how much disk space (in bytes) a single
graph is using in Fuseki?
The only option that came to mind is to get the disk usage of the entire database
with a Linux system call and scale it by the ratio of the number of triples in the
graph to the number of triples in all graphs. That would be a rough estimate.
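A minimal sketch of that proportional estimate, in Python. The function names are hypothetical, and the two triple counts are assumed to come from separate SPARQL COUNT queries against the dataset (not shown here). It sums file sizes under the database directory and scales the total by the graph's share of triples.

```python
import os


def directory_bytes(path):
    """Sum the sizes of all regular files under path (the database directory)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def estimate_graph_bytes(db_bytes, graph_triples, total_triples):
    """Scale the total database size by the graph's share of all triples.

    graph_triples and total_triples are assumed inputs, e.g. from queries like
      SELECT (COUNT(*) AS ?n) WHERE { GRAPH <g> { ?s ?p ?o } }
    """
    if total_triples <= 0:
        raise ValueError("total_triples must be positive")
    return db_bytes * graph_triples / total_triples


# Hypothetical numbers: a 10 MB database where the graph holds 250 of 1000 triples.
example = estimate_graph_bytes(10_000_000, 250, 1000)
```

This treats every triple as costing the same number of bytes, which is why the result can only be a rough estimate.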
Thank you
--
Lingsoft - 30 years of Leading Language Management
www.lingsoft.fi
Speech Applications - Language Management - Translation - Reader's and Writer's Tools - Text Tools - E-books and M-books
Mikael Pesonen
System Engineer
e-mail: mikael.pesonen@lingsoft.fi
Tel. +358 2 279 3300
Time zone: GMT+2
Helsinki Office
Eteläranta 10
FI-00130 Helsinki
FINLAND
Turku Office
Kauppiaskatu 5 A
FI-20100 Turku
FINLAND
Re: Querying used disk size
Posted by Mikael Pesonen <mi...@lingsoft.fi>.
Thank you, Rob, for the confirmation. A monthly graph export could be
an option for getting a second opinion.
Br
Re: Querying used disk size
Posted by Rob Vesse <rv...@dotnetrdf.org>.
That's usually what I see done in the literature.
Accounting for the exact amount of disk usage is difficult for a number of reasons:
- Terms are dictionary encoded, so each URI, literal and blank node identifier is stored only once and mapped to an internal constant-size identifier (64 bits for TDB1). So however many times a term is used, its storage is its encoded size plus N times the identifier size, where N is the number of uses. How "shared" disk usage contributes to an individual graph is therefore subject to interpretation.
- Similarly, there is no reference counting for terms. So if data is deleted from a graph, some of the disk usage is never reclaimed, and there is no way to track this. On the other hand, if you want to know how many times a given term is used, you need to query the database to find that out.
- Index size will vary depending upon the data, including how it was loaded and how many updates have happened. For example, tdbloader2 will produce maximally packed indexes, but as soon as you start running updates the indexes will expand as the B+Trees get rebalanced. And again, how do you account for the overhead of the on-disk index data structures?
One "hack" might be to export the graph in question, import it into a separate TDB instance and get the disk size of that. However, as explained above, you would end up overestimating to some extent.
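To make the dictionary arithmetic concrete, here is a toy model in Python. It is only a sketch: the term size and use counts are made-up numbers, the function name is hypothetical, and it counts one identifier per use while ignoring all index overhead. It shows why copies of individual graphs in separate instances add up to more than the shared store.

```python
ID_BYTES = 8  # 64-bit identifier, as stated above for TDB1


def term_bytes(encoded_size, uses):
    """Storage for one term: the encoded term once, plus one identifier per use."""
    return encoded_size + uses * ID_BYTES


# Hypothetical term: 40 bytes encoded, used 3 times in graph A and 7 times in graph B.
shared = term_bytes(40, 3 + 7)  # one store holding both graphs
copy_a = term_bytes(40, 3)      # graph A exported into its own instance
copy_b = term_bytes(40, 7)      # graph B exported into its own instance

# The encoded term is counted once per copy, so the copies overcount by its size.
overcount = copy_a + copy_b - shared
```

In this model the overcount equals the encoded size of every term the two graphs share, so the more vocabulary a graph shares with others, the further the separate-instance figure drifts above its "fair" share.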
Rob