Posted to users@jena.apache.org by Luís Moreira de Sousa <lu...@protonmail.ch.INVALID> on 2020/03/12 11:26:35 UTC

Memory management with Fuseki

Dear all,

I loaded six RDF datasets into Fuseki, with sizes ranging between 20 KB and 20 MB. To host these six datasets (in persistent mode) Fuseki is using over 1 GB of RAM and could soon get killed by the system (it runs on a container platform).

This demand on RAM for such small datasets appears excessive. What strategies are there to limit the RAM used by Fuseki?

Thank you.

--
Luís

Re: Memory management with Fuseki

Posted by Andy Seaborne <an...@apache.org>.
More information:

About Java and containers and sizing:

Summary: things got better at Java 10, so running with Java 11 is a good idea.

https://www.docker.com/blog/improved-docker-container-integration-with-java-10/
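
For example (an illustrative sketch, not a tested recommendation), from Java 10 onwards the heap can be capped as a fraction of the container's memory limit rather than with an absolute -Xmx, using the same command the image entrypoint runs:

java -XX:MaxRAMPercentage=50 -cp "*:/javalibs/*" org.apache.jena.fuseki.cmd.FusekiCmd

With a 1GB container limit that caps the heap at roughly 512MB, leaving the rest for off-heap use.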

     Andy

On 17/04/2020 10:58, Rob Vesse wrote:
> Okay, that's very helpful
> 
> So one thing that jumps out at me looking at that Dockerfile and its associated entrypoint script is that it starts the JVM without any explicit heap size settings.  When that is done the JVM picks default heap sizes itself, which normally would be fine.  However, in a container the amount of memory apparently available may not actually reflect the external limits that the container runtime/orchestrator is imposing.  Just as a practical example, I ran the container locally (using docker run to drop into a shell) and ran the same basic Java command the entrypoint runs, adding extra arguments to have the JVM dump its settings, and I saw a heap size of ~3GB:
> 
> bash-4.3$ java -cp "*:/javalibs/*" -XX:+PrintFlagsFinal -version | grep -iE 'HeapSize'
>      uintx ErgoHeapSizeLimit                         = 0                                   {product}
>      uintx HeapSizePerGCThread                       = 87241520                            {product}
>      uintx InitialHeapSize                          := 197132288                           {product}
>      uintx LargePageHeapSizeThreshold                = 134217728                           {product}
>      uintx MaxHeapSize                              := 3141533696                          {product}
> 
> I repeated the same experiment running the container inside a Kubernetes pod with a 1GB resource limit, and the JVM still picked a 3GB limit.
> 
> This is a common problem that can occur in any containerised environment; it would be better to modify the Dockerfile to explicitly set heap sizes that match the resource limits your container orchestrator is going to impose on you.  Be aware when choosing a heap size that a lot of TDB memory usage is off-heap, so set a JVM heap size that takes that into account - perhaps try -Xmx512m, leaving half your memory for off-heap usage (assuming the 1GB resource limit you state).  You'll likely need to experiment to find settings that work for your workload.
> 
> Hope this helps,
> 
> Rob
> 
> On 17/04/2020, 09:26, "Luís Moreira de Sousa" <lu...@protonmail.ch.INVALID> wrote:
> 
>      Hi all, some answers below to the many questions.
>      
>      1. This Fuseki instance is based on the image maintained at DockerHub by the secoresearch account. Copies of the Dockerfile and tdb.cfg files are at the end of this message. There is no other code involved.
>      
>      2. The image is deployed to an Openshift cluster with a default resource base of 1 CPU and 1 GB of RAM. The intention is to use Fuseki as a component of an information system that is easy to deploy by institutions in developing countries, where resources may be limited and know-how lacking. These resources have proven sufficient to run software such as Postgres or MapServer.
>      
>      3. Openshift provides a user interface to easily monitor the resources taken up by a running container (aka pod); no code is involved in this monitoring. It is also possible to launch a shell session into the container and monitor that way. At the end of the message is a printout from top showing that nothing else is running in this particular container. All memory is used either by Fuseki or the system.
>      
>      4. The datasets I have been using to test Fuseki were created with rdflib and are saved as RDF/XML. Each contains some dozens of objects of interest and their relations, extracted from a larger database. The largest of these RDF files contains just under 100 000 triples and occupies 20 MB on disk. I uploaded a new chart with more meaningful labels (https://pasteboard.co/J4cfPM9.png). Each point in the chart is a dataset; the x axis (horizontal) is the number of triples in the dataset and the y axis (vertical) is the additional memory required by Fuseki once the dataset is added. Again, note that all datasets are uploaded in persistent mode.
>      
>      5. Regarding the JVM, the manual simply states that the heap size is somewhat dependent on the kind of queries run. But the problem on this end is with dataset upload. At this stage I do not know what to modify in the JVM set-up, or how.
>      
>      Thank you for your help.
>      
>      Dockerfile
>      ----------
>      FROM secoresearch/fuseki:latest
>      
>      # Set environment variables
>      ENV ADMIN_PASSWORD toto
>      ENV ENABLE_DATA_WRITE true
>      ENV ENABLE_UPDATE true
>      ENV ENABLE_UPLOAD true
>      
>      # Add in config files
>      COPY ./tdb.cfg $FUSEKI_BASE/tdb.cfg
>      COPY ./tdb.cfg $FUSEKI_HOME/tdb.cfg
>      
>      
>      tdb.cfg
>      -------
>      {
>        "tdb.node2nodeid_cache_size" :  50000 ,
>        "tdb.nodeid2node_cache_size" :  250000
>      }
>      
>      top
>      ---
>      Mem: 39251812K used, 26724204K free, 21104K shrd, 58340K buff, 23792776K cached
>      CPU:   9% usr   5% sys   0% nic  84% idle   0% io   0% irq   0% sirq
>      Load average: 2.02 1.93 1.75 3/4355 114
>        PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
>          1     0 9008     S   20528m  30%   4   0% java -cp *:/javalibs/* org.apache.jena.fuseki.cmd.FusekiCmd
>        109   102 9008     S     1520   0%   7   0% /bin/sh
>        102     0 9008     S     1512   0%   1   0% /bin/sh -c TERM="xterm-termite" /bin/sh
>        110   109 9008     R     1508   0%   1   0% top
>      
>      
>      
>      
>      --
>      Luís
>      
> 
> 
> 
> 

Re: Memory management with Fuseki

Posted by Rob Vesse <rv...@dotnetrdf.org>.
Okay, that's very helpful

So one thing that jumps out at me looking at that Dockerfile and its associated entrypoint script is that it starts the JVM without any explicit heap size settings.  When that is done the JVM picks default heap sizes itself, which normally would be fine.  However, in a container the amount of memory apparently available may not actually reflect the external limits that the container runtime/orchestrator is imposing.  Just as a practical example, I ran the container locally (using docker run to drop into a shell) and ran the same basic Java command the entrypoint runs, adding extra arguments to have the JVM dump its settings, and I saw a heap size of ~3GB:

bash-4.3$ java -cp "*:/javalibs/*" -XX:+PrintFlagsFinal -version | grep -iE 'HeapSize'
    uintx ErgoHeapSizeLimit                         = 0                                   {product}
    uintx HeapSizePerGCThread                       = 87241520                            {product}
    uintx InitialHeapSize                          := 197132288                           {product}
    uintx LargePageHeapSizeThreshold                = 134217728                           {product}
    uintx MaxHeapSize                              := 3141533696                          {product}
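
Something along these lines reproduces that local shell (a sketch only; the entrypoint override and shell path are assumptions about the image):

docker run --rm -it --entrypoint /bin/bash secoresearch/fuseki:latest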

I repeated the same experiment running the container inside a Kubernetes pod with a 1GB resource limit, and the JVM still picked a 3GB limit.

This is a common problem that can occur in any containerised environment; it would be better to modify the Dockerfile to explicitly set heap sizes that match the resource limits your container orchestrator is going to impose on you.  Be aware when choosing a heap size that a lot of TDB memory usage is off-heap, so set a JVM heap size that takes that into account - perhaps try -Xmx512m, leaving half your memory for off-heap usage (assuming the 1GB resource limit you state).  You'll likely need to experiment to find settings that work for your workload.
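
For example, a minimal sketch of that change (the JAVA_OPTIONS variable name is an assumption - check how the image's entrypoint script actually builds the java command and adjust accordingly):

FROM secoresearch/fuseki:latest

# Assumption: the entrypoint passes $JAVA_OPTIONS to the java invocation.
# If it does not, adjust the entrypoint so the command becomes e.g.:
#   java -Xmx512m -cp "*:/javalibs/*" org.apache.jena.fuseki.cmd.FusekiCmd
ENV JAVA_OPTIONS -Xmx512m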

Hope this helps,

Rob

On 17/04/2020, 09:26, "Luís Moreira de Sousa" <lu...@protonmail.ch.INVALID> wrote:

    Hi all, some answers below to the many questions.
    
    1. This Fuseki instance is based on the image maintained at DockerHub by the secoresearch account. Copies of the Dockerfile and tdb.cfg files are at the end of this message. There is no other code involved.
    
    2. The image is deployed to an Openshift cluster with a default resource base of 1 CPU and 1 GB of RAM. The intention is to use Fuseki as a component of an information system that is easy to deploy by institutions in developing countries, where resources may be limited and know-how lacking. These resources have proven sufficient to run software such as Postgres or MapServer.
    
    3. Openshift provides a user interface to easily monitor the resources taken up by a running container (aka pod); no code is involved in this monitoring. It is also possible to launch a shell session into the container and monitor that way. At the end of the message is a printout from top showing that nothing else is running in this particular container. All memory is used either by Fuseki or the system.
    
    4. The datasets I have been using to test Fuseki were created with rdflib and are saved as RDF/XML. Each contains some dozens of objects of interest and their relations, extracted from a larger database. The largest of these RDF files contains just under 100 000 triples and occupies 20 MB on disk. I uploaded a new chart with more meaningful labels (https://pasteboard.co/J4cfPM9.png). Each point in the chart is a dataset; the x axis (horizontal) is the number of triples in the dataset and the y axis (vertical) is the additional memory required by Fuseki once the dataset is added. Again, note that all datasets are uploaded in persistent mode.
    
    5. Regarding the JVM, the manual simply states that the heap size is somewhat dependent on the kind of queries run. But the problem on this end is with dataset upload. At this stage I do not know what to modify in the JVM set-up, or how.
    
    Thank you for your help.
    
    Dockerfile
    ----------
    FROM secoresearch/fuseki:latest
    
    # Set environment variables
    ENV ADMIN_PASSWORD toto
    ENV ENABLE_DATA_WRITE true
    ENV ENABLE_UPDATE true
    ENV ENABLE_UPLOAD true
    
    # Add in config files
    COPY ./tdb.cfg $FUSEKI_BASE/tdb.cfg
    COPY ./tdb.cfg $FUSEKI_HOME/tdb.cfg
    
    
    tdb.cfg
    -------
    {
      "tdb.node2nodeid_cache_size" :  50000 ,
      "tdb.nodeid2node_cache_size" :  250000 ,
    }
    
    top
    ---
    Mem: 39251812K used, 26724204K free, 21104K shrd, 58340K buff, 23792776K cached
    CPU:   9% usr   5% sys   0% nic  84% idle   0% io   0% irq   0% sirq
    Load average: 2.02 1.93 1.75 3/4355 114
      PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
        1     0 9008     S   20528m  30%   4   0% java -cp *:/javalibs/* org.apache.jena.fuseki.cmd.FusekiCmd
      109   102 9008     S     1520   0%   7   0% /bin/sh
      102     0 9008     S     1512   0%   1   0% /bin/sh -c TERM="xterm-termite" /bin/sh
      110   109 9008     R     1508   0%   1   0% top
    
    
    
    
    --
    Luís
    





Re: Memory management with Fuseki

Posted by Luís Moreira de Sousa <lu...@protonmail.ch.INVALID>.
Hi all, some answers below to the many questions.

1. This Fuseki instance is based on the image maintained at DockerHub by the secoresearch account. Copies of the Dockerfile and tdb.cfg files are at the end of this message. There is no other code involved.

2. The image is deployed to an Openshift cluster with a default resource base of 1 CPU and 1 GB of RAM. The intention is to use Fuseki as a component of an information system that is easy to deploy by institutions in developing countries, where resources may be limited and know-how lacking. These resources have proven sufficient to run software such as Postgres or MapServer.

3. Openshift provides a user interface to easily monitor the resources taken up by a running container (aka pod); no code is involved in this monitoring. It is also possible to launch a shell session into the container and monitor that way. At the end of the message is a printout from top showing that nothing else is running in this particular container. All memory is used either by Fuseki or the system.

4. The datasets I have been using to test Fuseki were created with rdflib and are saved as RDF/XML. Each contains some dozens of objects of interest and their relations, extracted from a larger database. The largest of these RDF files contains just under 100 000 triples and occupies 20 MB on disk. I uploaded a new chart with more meaningful labels (https://pasteboard.co/J4cfPM9.png). Each point in the chart is a dataset; the x axis (horizontal) is the number of triples in the dataset and the y axis (vertical) is the additional memory required by Fuseki once the dataset is added. Again, note that all datasets are uploaded in persistent mode.

5. Regarding the JVM, the manual simply states that the heap size is somewhat dependent on the kind of queries run. But the problem on this end is with dataset upload. At this stage I do not know what to modify in the JVM set-up, or how.

Thank you for your help.

Dockerfile
----------
FROM secoresearch/fuseki:latest

# Set environment variables
ENV ADMIN_PASSWORD toto
ENV ENABLE_DATA_WRITE true
ENV ENABLE_UPDATE true
ENV ENABLE_UPLOAD true

# Add in config files
COPY ./tdb.cfg $FUSEKI_BASE/tdb.cfg
COPY ./tdb.cfg $FUSEKI_HOME/tdb.cfg


tdb.cfg
-------
{
  "tdb.node2nodeid_cache_size" :  50000 ,
  "tdb.nodeid2node_cache_size" :  250000 ,
}

top
---
Mem: 39251812K used, 26724204K free, 21104K shrd, 58340K buff, 23792776K cached
CPU:   9% usr   5% sys   0% nic  84% idle   0% io   0% irq   0% sirq
Load average: 2.02 1.93 1.75 3/4355 114
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
    1     0 9008     S   20528m  30%   4   0% java -cp *:/javalibs/* org.apache.jena.fuseki.cmd.FusekiCmd
  109   102 9008     S     1520   0%   7   0% /bin/sh
  102     0 9008     S     1512   0%   1   0% /bin/sh -c TERM="xterm-termite" /bin/sh
  110   109 9008     R     1508   0%   1   0% top




--
Luís

Re: Memory management with Fuseki

Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
The OP said

> In attachment you can find a chart plotting memory use increase against dataset size. There is no visible correlation, but on average each additional triple requires upwards of 30 MB of RAM.
but those numbers can't be correct ...

The y axis denotes the memory consumption in GB? Sure? Not MB?

And the x axis the number of triples? Is it a logarithmic scale, or are those
really just 20 000 triples? In general, why would somebody benchmark
such small datasets?

Also, why should the lowest number of triples (~10 000) consume 120GB?
That looks weird - at least not what I think would happen with Fuseki alone,
nor with any other tool. If that's true, something else is consuming memory
or there is some leak.

Again, MB vs  GB?

And how did you estimate "30MB per triple"? I can't believe this.

Please show code of your experiments as well as details of the
experimental setup.

On 16.04.20 12:16, Andy Seaborne wrote:
> What do we know so far?
>
> 1/ 6 datasets, up to 20MB each (file size? format? Compressed?
> Inference?)
>
> (is that datasets or graphs?)
>
> 2/ At 1G the system kills processes.
>
> What we don't know:
>
> A/ Heap size
>
> B/ Machine RAM size - TDB uses memory mapped files so this matters. It also
> means the process size may look large (the database files are mapped)
> but this is virtual memory, not real RAM.
>
> C/ Why 1G? That is pretty small for a general purpose Java program. Java
> needs space to load the code and the basic overheads of running a
> webserver. (this is with built-in Jetty? Not Fuseki in Tomcat?)
>
> I don't understand the graph - what is 120G?
>
>     Andy
>
> On 16/04/2020 10:11, Rob Vesse wrote:
>> I find the implied figures hard to believe, as Lorenz has said you
>> will need to share your findings via some other service since this
>> mailing list does not permit attachments.
>>
>> Many people use Fuseki and TDB to host datasets in the hundreds of
>> millions (if not billions) of triples in production environments e.g.
>> much of UK Open Data from govt agencies is backed by Fuseki/TDB in
>> one form or another.  Also the memory usage of Fuseki/TDB cannot
>> realistically be reduced to something as crude as MB/triples because
>> the memory management going on within the JVM and TDB is far more
>> complicated than that, see my previous reply to your earlier
>> questions [1]
>>
>> Rob
>>
>> [1]
>> https://lists.apache.org/thread.html/rf76be4fba2d9679f346dd7482d9925293eb768bbedce3feff7bb4376%40%3Cusers.jena.apache.org%3E
>>
>> On 16/04/2020, 08:47, "Lorenz Buehmann"
>> <bu...@informatik.uni-leipzig.de> wrote:
>>
>>      No attachments possible on this mailing list. Use some external
>> service
>>      to share attachments please or try to embed it as image (in case
>> it's
>>      just an image) as you did in your other thread. Or just use Gist
>>           On 16.04.20 09:27, Luís Moreira de Sousa wrote:
>>      > Dear all,
>>      >
>>      > I have been tweaking the tdb.node2nodeid_cache_size and
>>      > tdb.nodeid2node_cache_size parameters as Andy suggested. They
>> indeed reduce the RAM used by Fuseki, but not to a point where it
>> becomes usable. In attachment you can find a chart plotting memory
>> use increase against dataset size. There is no visible correlation,
>> but on average each additional triple requires upwards of 30 MB of RAM.
>>      >
>>      > The actual datasets I work with count triples in the millions
>> (from relational databases with tens of thousands of records). Even
>> if I ever convince a data centre to provide the required amounts of
>> RAM to a single container, the costs will be prohibitive.
>>      >
>>      > Can anyone provide their experiences with Fuseki in
>> production? Particularly in micro-services/containerised platforms?
>>      >
>>      > Thank you.
>>      >
>>      > --
>>      > Luís
>>      >
>>      >
>>          
>>
>>
>>


Re: Memory management with Fuseki

Posted by Andy Seaborne <an...@apache.org>.
What do we know so far?

1/ 6 datasets, up to 20MB each (file size? format? Compressed? Inference?)

(is that datasets or graphs?)

2/ At 1G the system kills processes.

What we don't know:

A/ Heap size

B/ Machine RAM size - TDB uses memory mapped files so this matters. It also
means the process size may look large (the database files are mapped) 
but this is virtual memory, not real RAM.

C/ Why 1G? That is pretty small for a general purpose Java program. Java
needs space to load the code and the basic overheads of running a 
webserver. (this is with built-in Jetty? Not Fuseki in Tomcat?)

I don't understand the graph - what is 120G?

     Andy

On 16/04/2020 10:11, Rob Vesse wrote:
> I find the implied figures hard to believe, as Lorenz has said you will need to share your findings via some other service since this mailing list does not permit attachments.
> 
> Many people use Fuseki and TDB to host datasets in the hundreds of millions (if not billions) of triples in production environments e.g. much of UK Open Data from govt agencies is backed by Fuseki/TDB in one form or another.  Also the memory usage of Fuseki/TDB cannot realistically be reduced to something as crude as MB/triples because the memory management going on within the JVM and TDB is far more complicated than that, see my previous reply to your earlier questions [1]
> 
> Rob
> 
> [1] https://lists.apache.org/thread.html/rf76be4fba2d9679f346dd7482d9925293eb768bbedce3feff7bb4376%40%3Cusers.jena.apache.org%3E
> 
> On 16/04/2020, 08:47, "Lorenz Buehmann" <bu...@informatik.uni-leipzig.de> wrote:
> 
>      No attachments possible on this mailing list. Use some external service
>      to share attachments please or try to embed it as image (in case it's
>      just an image) as you did in your other thread. Or just use Gist
>      
>      On 16.04.20 09:27, Luís Moreira de Sousa wrote:
>      > Dear all,
>      >
>      > I have been tweaking the tdb.node2nodeid_cache_size and
>      > tdb.nodeid2node_cache_size parameters as Andy suggested. They indeed reduce the RAM used by Fuseki, but not to a point where it becomes usable. In attachment you can find a chart plotting memory use increase against dataset size. There is no visible correlation, but on average each additional triple requires upwards of 30 MB of RAM.
>      >
>      > The actual datasets I work with count triples in the millions (from relational databases with tens of thousands of records). Even if I ever convince a data centre to provide the required amounts of RAM to a single container, the costs will be prohibitive.
>      >
>      > Can anyone provide their experiences with Fuseki in production? Particularly in micro-services/containerised platforms?
>      >
>      > Thank you.
>      >
>      > --
>      > Luís
>      >
>      >
>      
>      
> 
> 
> 
> 

Re: Memory management with Fuseki

Posted by Rob Vesse <rv...@dotnetrdf.org>.
I find the implied figures hard to believe, as Lorenz has said you will need to share your findings via some other service since this mailing list does not permit attachments.

Many people use Fuseki and TDB to host datasets in the hundreds of millions (if not billions) of triples in production environments e.g. much of UK Open Data from govt agencies is backed by Fuseki/TDB in one form or another.  Also the memory usage of Fuseki/TDB cannot realistically be reduced to something as crude as MB/triples because the memory management going on within the JVM and TDB is far more complicated than that, see my previous reply to your earlier questions [1]

Rob

[1] https://lists.apache.org/thread.html/rf76be4fba2d9679f346dd7482d9925293eb768bbedce3feff7bb4376%40%3Cusers.jena.apache.org%3E

On 16/04/2020, 08:47, "Lorenz Buehmann" <bu...@informatik.uni-leipzig.de> wrote:

    No attachments possible on this mailing list. Use some external service
    to share attachments please or try to embed it as image (in case it's
    just an image) as you did in your other thread. Or just use Gist
    
    On 16.04.20 09:27, Luís Moreira de Sousa wrote:
    > Dear all,
    >
    > I have been tweaking the tdb.node2nodeid_cache_size and
    > tdb.nodeid2node_cache_size parameters as Andy suggested. They indeed reduce the RAM used by Fuseki, but not to a point where it becomes usable. In attachment you can find a chart plotting memory use increase against dataset size. There is no visible correlation, but on average each additional triple requires upwards of 30 MB of RAM.
    >
    > The actual datasets I work with count triples in the millions (from relational databases with tens of thousands of records). Even if I ever convince a data centre to provide the required amounts of RAM to a single container, the costs will be prohibitive.
    >
    > Can anyone provide their experiences with Fuseki in production? Particularly in micro-services/containerised platforms?
    >
    > Thank you.
    >
    > --
    > Luís
    >
    >
    
    





Re: Memory management with Fuseki

Posted by Luís Moreira de Sousa <lu...@protonmail.ch.INVALID>.
Hi Lorenz,

someone got a picture through in a previous message, so I wonder if this issue affects everybody in the same way. In any case, here is a link to Pasteboard:

https://pasteboard.co/J43bRYp.png

Regards.

--
Luís

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Thursday, April 16, 2020 9:40 AM, Lorenz Buehmann <bu...@informatik.uni-leipzig.de> wrote:

&gt; No attachments possible on this mailing list. Use some external service
&gt; to share attachments please or try to embed it as image (in case it's
&gt; just an image) as you did in your other thread. Or just use Gist
&gt;
&gt; On 16.04.20 09:27, Luí­s Moreira de Sousa wrote:
&gt;
&gt; &gt; Dear all,
&gt; &gt; I have been tweaking the tdb.node2nodeid_cache_size and
&gt; &gt; tdb.nodeid2node_cache_size parameters as Andy suggested. They indeed reduce the RAM used by Fuseki, but not to a point where it becomes usable. In attachment you can find a chart plotting memory use increase against dataset size. There is no visible correlation, but on average each additional triplet requires upwards of 30 MB of RAM.
&gt; &gt; The actual datasets I work with count triplets in the millions (from relational databases with tens of thousands of records). Even if I ever convince a data centre to provide the required amounts of RAM to a single container, the costs will be prohibitive.
&gt; &gt; Can anyone provide their experiences with Fuseki in production? Particularly in micro-services/containerised platforms?
&gt; &gt; Thank you.
&gt; &gt; --
&gt; &gt; Luís


Re: Memory management with Fuseki

Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
No attachments possible on this mailing list. Use some external service
to share attachments please or try to embed it as image (in case it's
just an image) as you did in your other thread. Or just use Gist

On 16.04.20 09:27, Luís Moreira de Sousa wrote:
> Dear all,
>
> I have been tweaking the tdb.node2nodeid_cache_size and
> tdb.nodeid2node_cache_size parameters as Andy suggested. They indeed reduce the RAM used by Fuseki, but not to a point where it becomes usable. In attachment you can find a chart plotting memory use increase against dataset size. There is no visible correlation, but on average each additional triple requires upwards of 30 MB of RAM.
>
> The actual datasets I work with count triples in the millions (from relational databases with tens of thousands of records). Even if I ever convince a data centre to provide the required amounts of RAM to a single container, the costs will be prohibitive.
>
> Can anyone provide their experiences with Fuseki in production? Particularly in micro-services/containerised platforms?
>
> Thank you.
>
> --
> Luís
>
>


Re: Memory management with Fuseki

Posted by Luís Moreira de Sousa <lu...@protonmail.ch.INVALID>.
Dear all,

I have been tweaking the tdb.node2nodeid_cache_size and
tdb.nodeid2node_cache_size parameters as Andy suggested. They indeed reduce the RAM used by Fuseki, but not to a point where it becomes usable. In attachment you can find a chart plotting memory use increase against dataset size. There is no visible correlation, but on average each additional triple requires upwards of 30 MB of RAM.

The actual datasets I work with count triples in the millions (from relational databases with tens of thousands of records). Even if I ever convince a data centre to provide the required amounts of RAM to a single container, the costs will be prohibitive.

Can anyone provide their experiences with Fuseki in production? Particularly in micro-services/containerised platforms?

Thank you.

--
Luís



Re: Memory management with Fuseki

Posted by Rob Vesse <rv...@dotnetrdf.org>.
It's also worth noting that, since Fuseki is a Java-based application, the JVM has its own memory management: it asks the OS for some amount of memory, which is then divided up between the Java objects inside the process.  Often the heap size may be larger than the memory the application needs.  The Fuseki scripts set a default heap size based on prior experience (i.e. a good general-purpose default), but that may not be suitable for all environments and may need customising.  Also, Fuseki uses memory-mapped files when backed by TDB databases; these are accounted for separately from the JVM heap but should show up in OS-level accounting for the JVM.

TL;DR The amount of memory you see consumed at the OS level does not necessarily correlate directly with the amount of memory used for your datasets.  To determine that you would need to attach a JVM profiler to the running Fuseki application.
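
As a rough sketch of that kind of inspection (assuming the standard JDK tools are present in the container; <pid> is the Fuseki java process id, often 1 inside a container):

jcmd <pid> VM.flags        # effective JVM flags, including MaxHeapSize
jstat -gcutil <pid> 5000   # heap occupancy and GC activity every 5 seconds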

You may want to look at the Fuseki script in detail and adjust the JVM memory settings for your use case.

Rob

On 12/03/2020, 13:26, "Andy Seaborne" <an...@apache.org> wrote:

    
    
    On 12/03/2020 11:26, Luís Moreira de Sousa wrote:
    > Dear all,
    > 
    > I loaded six RDF datasets into Fuseki, with sizes ranging between 20 KB and 20 MB. To host these six datasets (in persistent mode) Fuseki is using over 1 GB of RAM and could soon get killed by the system (it runs on a container platform).
    
    How much RAM are you giving Fuseki?
    (Why "could soon get killed"?)
    
    > 
    > This demand on RAM for such small datasets appears excessive. What strategies are there to limit the RAM used by Fuseki?
    
    The persistent store can be controlled with
    
    https://jena.apache.org/documentation/tdb/store-parameters.html
    
    Specifically:
    
    tdb.node2nodeid_cache_size
    tdb.nodeid2node_cache_size
    
    Put them in a file tdb.cfg in the database directory.
    
    Do not change the database layout parameters after it is built!
    
    
    Or put them all in one dataset as named graphs.
    Or load them as plain files (if they are read-only)
    
         Andy
    
    > 
    > Thank you.
    > 
    > --
    > Luís
    > 
    





Re: Memory management with Fuseki

Posted by Andy Seaborne <an...@apache.org>.

On 12/03/2020 11:26, Luís Moreira de Sousa wrote:
> Dear all,
> 
> I loaded six RDF datasets into Fuseki, with sizes ranging between 20 KB and 20 MB. To host these six datasets (in persistent mode) Fuseki is using over 1 GB of RAM and could soon get killed by the system (it runs on a container platform).

How much RAM are you giving Fuseki?
(Why "could soon get killed"?)

> 
> This demand on RAM for such small datasets appears excessive. What strategies are there to limit the RAM used by Fuseki?

The persistent store can be controlled with

https://jena.apache.org/documentation/tdb/store-parameters.html

Specifically:

tdb.node2nodeid_cache_size
tdb.nodeid2node_cache_size

Put them in a file tdb.cfg in the database directory.

Do not change the database layout parameters after it is built!
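
For illustration only (the values are placeholders, not tuned recommendations; see the store parameters page above for the full syntax), a tdb.cfg in the database directory might look like:

{
  "tdb.node2nodeid_cache_size" : 10000 ,
  "tdb.nodeid2node_cache_size" : 50000
}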


Or put them all in one dataset as named graphs.
Or load them as plain files (if they are read-only)

     Andy

> 
> Thank you.
> 
> --
> Luís
>