Posted to users@jena.apache.org by Mikael Pesonen <mi...@lingsoft.fi> on 2018/06/04 12:18:22 UTC

Querying used disk size

Hi,

What would be the best way to estimate how much disk space (in bytes) a 
single graph is using in Fuseki?

The only option that came to mind is to get the entire database's disk 
usage with a Linux system call and scale it by the graph's share of the 
triples across all graphs. That would be a rough estimate.
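
The proportional estimate described above could be sketched roughly as follows. This is only an illustration: the function names `directory_size_bytes` and `proportional_estimate` are made up for this example, the database directory is whatever location your Fuseki dataset uses, and the triple counts are assumed to come from a query such as `COUNT_PER_GRAPH` run against the dataset.

```python
import os


def directory_size_bytes(path):
    """Sum the sizes of all regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def proportional_estimate(db_bytes, graph_triples, total_triples):
    """Scale the whole database size by the graph's share of all triples."""
    if total_triples <= 0:
        raise ValueError("total_triples must be positive")
    return db_bytes * graph_triples / total_triples


# SPARQL that counts triples per named graph (the default graph is not included).
COUNT_PER_GRAPH = """
SELECT ?g (COUNT(*) AS ?triples)
WHERE { GRAPH ?g { ?s ?p ?o } }
GROUP BY ?g
"""
```

For example, a database directory of 1000 bytes and a graph holding 1 of 4 triples gives `proportional_estimate(1000, 1, 4)`, i.e. 250.0 bytes. The result is only as good as the assumption that every triple costs about the same on disk.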

Thank you

-- 
Lingsoft - 30 years of Leading Language Management

www.lingsoft.fi

Speech Applications - Language Management - Translation - Reader's and Writer's Tools - Text Tools - E-books and M-books

Mikael Pesonen
System Engineer

e-mail: mikael.pesonen@lingsoft.fi
Tel. +358 2 279 3300

Time zone: GMT+2

Helsinki Office
Eteläranta 10
FI-00130 Helsinki
FINLAND

Turku Office
Kauppiaskatu 5 A
FI-20100 Turku
FINLAND


Re: Querying used disk size

Posted by Mikael Pesonen <mi...@lingsoft.fi>.
Thank you, Rob, for the confirmation. A monthly graph export could be 
an option, to get a second opinion.

Br



Re: Querying used disk size

Posted by Rob Vesse <rv...@dotnetrdf.org>.
That's usually what I see done in the literature.

Accounting for the exact amount of disk usage is difficult for a number of reasons:

- Terms are dictionary encoded, so each URI, literal and blank node identifier is stored only once and mapped to an internal constant-size identifier (64 bits for TDB1). So however many times a term is used, its storage is its encoded size plus N times the identifier size. How "shared" disk usage contributes to an individual graph is therefore subject to interpretation.
- Similarly, there is no reference counting for terms. So if data is deleted from a graph, some of the disk usage is never reclaimed, and there is no way to track this. On the other hand, if you want to know how many times a given term is used, you need to query the database to find that out.
- Index size will vary depending upon the data, including how it was loaded and how many updates have happened. For example, tdbloader2 will produce maximally packed indexes, but as soon as you start running updates the indexes will expand as the B+Trees get rebalanced. And again, how do you account for the overhead of the on-disk index data structures?
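
To make the "shared" usage point concrete, here is a toy model of dictionary encoding. It is a sketch, not TDB's actual on-disk layout: it assumes quads given as `(graph, subject, predicate, object)` strings, counts a term's cost as its UTF-8 length, charges four 8-byte identifiers per quad for a single index copy, and ignores the graph name's own dictionary entry. The function name `graph_usage` is made up for this example.

```python
NODE_ID_BYTES = 8  # 64-bit identifiers, as stated for TDB1 above


def graph_usage(quads, graph):
    """Split one graph's toy storage into id bytes, exclusive and shared term bytes."""
    in_graph = [q for q in quads if q[0] == graph]
    elsewhere = [q for q in quads if q[0] != graph]
    # Distinct subject/predicate/object terms inside and outside the graph.
    terms_in = {t for q in in_graph for t in q[1:]}
    terms_out = {t for q in elsewhere for t in q[1:]}

    def size(terms):
        # Each distinct term is stored once in the dictionary.
        return sum(len(t.encode("utf-8")) for t in terms)

    return {
        "id_bytes": len(in_graph) * 4 * NODE_ID_BYTES,
        "exclusive_term_bytes": size(terms_in - terms_out),
        "shared_term_bytes": size(terms_in & terms_out),
    }


quads = [
    ("g1", "ex:a", "rdf:type", "ex:Thing"),
    ("g1", "ex:a", "ex:name", '"A"'),
    ("g2", "ex:b", "rdf:type", "ex:Thing"),
]
usage = graph_usage(quads, "g1")
```

Here `usage` holds 64 id bytes, 14 bytes of terms used only by `g1`, and 16 bytes of terms also used by `g2`. Whether those 16 shared bytes count fully, partly, or not at all towards `g1` is exactly the interpretation question raised above.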

One "hack" might be to export the graph in question, import it into a separate TDB instance and get the disk size of that. However, as explained above, you would end up overestimating to some extent.
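
The export step of that "hack" might look something like the sketch below. It assumes the dataset exposes the SPARQL Graph Store Protocol under a service named `data` (a common Fuseki configuration, but check your own setup); the dataset URL, graph URI and function names are placeholders invented for this example.

```python
import shutil
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def graph_export_request(dataset_url, graph_uri):
    """Build a Graph Store Protocol GET request for one named graph as N-Triples."""
    # Assumption: the graph store service is called "data" on this dataset.
    url = dataset_url.rstrip("/") + "/data?" + urlencode({"graph": graph_uri})
    return Request(url, headers={"Accept": "application/n-triples"})


def export_graph(dataset_url, graph_uri, out_path):
    """Stream the graph to a local file (needs a running server)."""
    request = graph_export_request(dataset_url, graph_uri)
    with urlopen(request) as response, open(out_path, "wb") as out:
        shutil.copyfileobj(response, out)


# Hypothetical dataset and graph names, for illustration only.
req = graph_export_request("http://localhost:3030/ds", "http://example.org/g1")
```

The exported file could then be loaded into an empty scratch database with Jena's command-line loader (for instance `tdbloader --loc=<scratch dir> <file>`, assuming the Jena tools are installed) and the scratch directory measured with `du`.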

Rob
