Posted to users@jena.apache.org by Osma Suominen <os...@aalto.fi> on 2012/10/01 11:21:42 UTC

Re: Fuseki leaks memory on PUT requests

28.09.2012 19:27, Andy Seaborne wrote:

> Can you use a bit more heap?  The default is just a general default,
> including small 32 bit machines.
>
> I have it running at 2G and have executed 75 PUTs and it's still going.

Hi Andy!

Thanks for the quick reply and for testing this yourself. You're right, 
I made a hasty conclusion, and giving more heap to the JVM does seem to 
help. I tried using 2GB heap and could run 200 PUTs without problems on 
a recent snapshot. So there is indeed no memory leak.

This seems to be a GC issue: if you run many PUTs with a small heap size 
the GC doesn't get around to freeing enough memory before it's too late, 
despite the sleeping between PUTs. When I watched the process memory 
consumption using top in the latest test run, there was a steady rise to 
around 2GB and then suddenly 600-700MB is released when the GC kicks in. 
This process then repeats every dozen requests or so.
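(As an aside: `top` shows the process's resident set size, which the JVM rarely returns to the OS, so the in-heap sawtooth is easier to see from inside the JVM or with the `-verbose:gc` flag. A minimal, self-contained illustration of sampling heap usage in-process - not part of Fuseki, just the measurement idea:)

```java
// Illustration only: sample the JVM's own heap usage, the way a small
// monitoring loop could, instead of watching process RSS in `top`.
public class HeapSample {
    /** Currently used heap in bytes. */
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long before = usedHeap();
        // Allocate ~10 MB of garbage, then drop the reference.
        byte[][] junk = new byte[10][];
        for (int i = 0; i < junk.length; i++) junk[i] = new byte[1 << 20];
        long after = usedHeap();
        junk = null;
        System.gc();   // a hint only; when collection happens is up to the GC
        System.out.printf("before=%dMB after=%dMB post-gc=%dMB%n",
                before >> 20, after >> 20, usedHeap() >> 20);
    }
}
```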

I will see whether tuning the GC parameters would help. It's a bit 
frustrating - I'm trying to set up a public SPARQL endpoint on a 
dedicated server machine and PUTs are the easiest way to update the data 
from outside the server, SOA-style. The server is a 64bit RHEL6 running 
Fuseki with 3GB heap and I can easily push it over the edge by accident 
with a few relatively small (<1M triples) PUTs. Total physical memory is 
4GB, so there's not that much room for increasing the heap size - okay, 
I should just get more memory...

-Osma

-- 
Osma Suominen | Osma.Suominen@aalto.fi | +358 40 5255 882
Aalto University, Department of Media Technology, Semantic Computing 
Research Group
Room 2541, Otaniementie 17, Espoo, Finland; P.O. Box 15500, FI-00076 
Aalto, Finland

Re: Fuseki leaks memory on PUT requests

Posted by Andy Seaborne <an...@apache.org>.
On 01/10/12 10:21, Osma Suominen wrote:
> 28.09.2012 19:27, Andy Seaborne wrote:
>
>> Can you use a bit more heap?  The default is just a general default,
>> including small 32 bit machines.
>>
>> I have it running at 2G and have executed 75 PUTs and it's still going.
>
> Hi Andy!
>
> Thanks for the quick reply and for testing this yourself. You're right,
> I made a hasty conclusion, and giving more heap to the JVM does seem to
> help. I tried using 2GB heap and could run 200 PUTs without problems on
> a recent snapshot. So there is indeed no memory leak.
>
> This seems to be a GC issue: if you run many PUTs with a small heap size
> the GC doesn't get around to freeing enough memory before it's too late,
> despite the sleeping between PUTs. When I watched the process memory
> consumption using top in the latest test run, there was a steady rise to
> around 2GB and then suddenly 600-700MB is released when the GC kicks in.
> This process then repeats every dozen requests or so.
>
> I will see whether tuning the GC parameters would help. It's a bit
> frustrating - I'm trying to set up a public SPARQL endpoint on a
> dedicated server machine and PUTs are the easiest way to update the data
> from outside the server, SOA-style. The server is a 64bit RHEL6 running
> Fuseki with 3GB heap and I can easily push it over the edge by accident
> with a few relatively small (<1M triples) PUTs. Total physical memory is
> 4GB, so there's not that much room for increasing the heap size - okay,
> I should just get more memory...
>
> -Osma
>

It's not a GC issue, at least not in the normal low level sense.

Write transactions are batched together for write-back to the main 
database after they are committed. They are in the journal on-disk but 
also the in-memory structures are retained for access to a view of the 
database with the transactions applied.  These take memory.  (it's the 
indexes - the node data is written back in the prepare file because it's 
an append-only file).

The batching size is set to 10 - after 10 writes, the system flushes the 
journal and drops the in-memory structures.  So if you get past that 
point, it should go "forever".
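The flush-after-N-commits behaviour described above can be sketched roughly like this (a simplified illustration of the mechanism's shape, not TDB's actual transaction code):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of write-back batching: committed write
// transactions are kept in memory (and in the on-disk journal) until
// a batch threshold is reached, then flushed to the main database.
// This is NOT TDB's actual code, just the shape of the mechanism.
public class WriteBackQueue {
    private final int batchSize;
    private final List<Object> pending = new ArrayList<>();

    public WriteBackQueue(int batchSize) { this.batchSize = batchSize; }

    /** Called when a write transaction commits. */
    public void commit(Object txnState) {
        pending.add(txnState);              // in-memory view retained
        if (pending.size() >= batchSize)
            flush();
    }

    /** Write the journal back to the main database and free memory. */
    private void flush() {
        // ... apply each pending transaction to the base database ...
        pending.clear();                    // in-memory structures dropped
    }

    public int pendingCount() { return pending.size(); }
}
```

With `batchSize` fixed at 10, memory for the pending transactions accumulates until the tenth commit, which is why a run of fewer than 10 large PUTs can hold a lot of heap.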

And every incoming request is parsed in-memory to check validity of the 
RDF.  Also a source of RAM usage.

What the system should do is:
1/ use a persistent-but-cached layer for completed transactions
2/ be tunable (*)
3/ Notice a store is transactional and use that instead of parsing to an 
in-memory graph

but does not currently offer those features.   Contributions welcome.

	Andy

(*) I have tended to avoid lots of configuration options as I find in 
other systems lots of knobs to tweak is unhelpful overall.  Either 
people use the default or it needs deep magic to control.


Re: Fuseki leaks memory on PUT requests

Posted by Stephen Allen <sa...@apache.org>.
On Tue, Oct 2, 2012 at 12:32 AM, Osma Suominen <os...@aalto.fi> wrote:
> Hi Andy!
>
> 01.10.2012 23:33, Andy Seaborne kirjoitti:
>
>
>> It's not a GC issue, at least not in the normal low level sense.
>>
>> Write transactions are batched together for write-back to the main
>> database after they are committed. They are in the journal on-disk but
>> also the in-memory structures are retained for access to a view of the
>> database with the transactions applied.  These take memory.  (it's the
>> indexes - the node data is written back in the prepare file because it's
>> an append-only file).
>>
>> The batching size is set to 10 - after 10 writes, the system flushes the
>> journal and drops the in-memory structures.  So if you get past that
>> point, it should go "forever".
>>
>> And every incoming request is parsed in-memory to check validity of the
>> RDF.  Also a source of RAM usage.
>
>
> Ah, thanks a lot! Now I understand what I was seeing. When I PUT several
> (but <10) datasets, Fuseki will temporarily eat a lot of memory. And now my
> problem is that for my datasets, this is more than the available heap.
>
> I understand that batching is performed for performance reasons (I just read
> JENA-256), but in my scenario, writes (using PUT) are usually rather big and
> infrequent (so write performance is not important, or at least not much
> helped by batching) except when I sometimes want to update every dataset in
> one go, so there will be several large PUTs and Fuseki will run out of heap
> unless I restart it in between the PUTs.
>
>
>> What the system should do is:
>> 1/ use a persistent-but-cached layer for completed transactions
>> 2/ be tunable (*)
>> 3/ Notice a store is transactional and use that instead of parsing to an
>> in-memory graph
>>
>> but does not currently offer those features.   Contributions welcome.
>>
>>         Andy
>>
>> (*) I have tended to avoid lots of configuration options as I find in
>> other systems lots of knobs to tweak is unhelpful overall.  Either
>> people use the default or it needs deep magic to control.
>
>
> I understand, nothing is perfect and there are always possible improvements
> to be made. And also I understand the aversion of knobs.
>
> In my case, I would like to see in Fuseki and/or TDB a way to either
> 1) reduce the batch size to something less than 10 (say, 2 or 5),
> 2) turn off batching completely,
> 3) make batching behavior dependent on the size (in triples or megabytes) of
> the accumulated queue, so a queue of large writes would be flushed sooner
> than a queue of small writes, or
> 4) make batching behavior dependent on time, so that if no further writes
> are performed in a certain time (say, 10 seconds or a minute) then the
> flushing will be done regardless of the size of the accumulated write queue
>
> I guess 1 or 2 would be in the tunable category, while 3 and 4 would maybe
> qualify as deep magic :)
>
> But now that I understand what's happening I can at least work around the
> problem.
>

A decent win would be to address what Andy mentioned as his number 3.
I've been working in this area lately on the SPARQL Update query side
(PUT is part of the SPARQL Graph Store Protocol).  But I hope to get
to that in time.

Meanwhile, if you really need to reduce memory, you can try the
following (untested) patch against the jena-fuseki project.  Adjust
the 10000 constant to something lower if needed.

-Stephen


Index: jena-fuseki/src/main/java/org/apache/jena/fuseki/servlets/SPARQL_Upload.java
===================================================================
--- jena-fuseki/src/main/java/org/apache/jena/fuseki/servlets/SPARQL_Upload.java	(revision 1392600)
+++ jena-fuseki/src/main/java/org/apache/jena/fuseki/servlets/SPARQL_Upload.java	(working copy)
@@ -36,6 +36,8 @@
 import org.apache.jena.fuseki.http.HttpSC ;
 import org.apache.jena.fuseki.server.DatasetRef ;
 import org.apache.jena.iri.IRI ;
+import org.openjena.atlas.data.ThresholdPolicy ;
+import org.openjena.atlas.data.ThresholdPolicyFactory ;
 import org.openjena.atlas.lib.Sink ;
 import org.openjena.atlas.web.ContentType ;
 import org.openjena.riot.* ;
@@ -46,6 +48,7 @@
 import com.hp.hpl.jena.graph.Graph ;
 import com.hp.hpl.jena.graph.Node ;
 import com.hp.hpl.jena.graph.Triple ;
+import com.hp.hpl.jena.sparql.graph.GraphDefaultDataBag ;
 import com.hp.hpl.jena.sparql.graph.GraphFactory ;

 public class SPARQL_Upload extends SPARQL_ServletBase
@@ -95,7 +98,10 @@
         // Locking only needed over the insert into dataset
         try {
             String graphName = null ;
-            Graph graphTmp = GraphFactory.createGraphMem() ;
+            //Graph graphTmp = GraphFactory.createGraphMem() ;
+            ThresholdPolicy<Triple> policy = ThresholdPolicyFactory.count(10000) ;  // Need to read the proper setting from a Context object
+            Graph graphTmp = new GraphDefaultDataBag(policy) ;  // We don't care that dupes can appear in here
+
             Node gn = null ;
             String name = null ;
             ContentType ct = null ;
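
For anyone unfamiliar with the DataBag classes the patch uses: the idea is a collection that keeps items on the heap up to a threshold and spills the rest to disk. A concept sketch of that spill behaviour (NOT the actual atlas implementation, just the idea behind `ThresholdPolicyFactory.count(10000)`):

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

// Concept sketch of a threshold-based spill container: hold up to
// `threshold` items in memory, append the rest to a temporary file,
// so heap usage stays bounded regardless of input size.
public class SpillList {
    private final int threshold;
    private final List<String> inMemory = new ArrayList<>();
    private PrintWriter spillOut;
    private int spilled = 0;

    public SpillList(int threshold) { this.threshold = threshold; }

    public void add(String item) throws IOException {
        if (inMemory.size() < threshold) {
            inMemory.add(item);
        } else {
            if (spillOut == null) {
                File spillFile = File.createTempFile("spill", ".tmp");
                spillFile.deleteOnExit();
                spillOut = new PrintWriter(new FileWriter(spillFile));
            }
            spillOut.println(item);   // bounded heap use from here on
            spilled++;
        }
    }

    public int inMemoryCount() { return inMemory.size(); }
    public int spilledCount()  { return spilled; }
}
```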

Re: Fuseki leaks memory on PUT requests

Posted by Osma Suominen <os...@aalto.fi>.
Hi Andy!

01.10.2012 23:33, Andy Seaborne kirjoitti:

> It's not a GC issue, at least not in the normal low level sense.
>
> Write transactions are batched together for write-back to the main
> database after they are committed. They are in the journal on-disk but
> also the in-memory structures are retained for access to a view of the
> database with the transactions applied.  These take memory.  (it's the
> indexes - the node data is written back in the prepare file because it's
> an append-only file).
>
> The batching size is set to 10 - after 10 writes, the system flushes the
> journal and drops the in-memory structures.  So if you get past that
> point, it should go "forever".
>
> And every incoming request is parsed in-memory to check validity of the
> RDF.  Also a source of RAM usage.

Ah, thanks a lot! Now I understand what I was seeing. When I PUT several 
(but <10) datasets, Fuseki will temporarily eat a lot of memory. And now 
my problem is that for my datasets, this is more than the available heap.

I understand that batching is performed for performance reasons (I just 
read JENA-256), but in my scenario, writes (using PUT) are usually 
rather big and infrequent (so write performance is not important, or at 
least not much helped by batching) except when I sometimes want to 
update every dataset in one go, so there will be several large PUTs and 
Fuseki will run out of heap unless I restart it in between the PUTs.

> What the system should do is:
> 1/ use a persistent-but-cached layer for completed transactions
> 2/ be tunable (*)
> 3/ Notice a store is transactional and use that instead of parsing to an
> in-memory graph
>
> but does not currently offer those features.   Contributions welcome.
>
> 	Andy
>
> (*) I have tended to avoid lots of configuration options as I find in
> other systems lots of knobs to tweak is unhelpful overall.  Either
> people use the default or it needs deep magic to control.

I understand, nothing is perfect and there are always possible 
improvements to be made. And also I understand the aversion of knobs.

In my case, I would like to see in Fuseki and/or TDB a way to either
1) reduce the batch size to something less than 10 (say, 2 or 5),
2) turn off batching completely,
3) make batching behavior dependent on the size (in triples or 
megabytes) of the accumulated queue, so a queue of large writes would be 
flushed sooner than a queue of small writes, or
4) make batching behavior dependent on time, so that if no further 
writes are performed in a certain time (say, 10 seconds or a minute) 
then the flushing will be done regardless of the size of the accumulated 
write queue

I guess 1 or 2 would be in the tunable category, while 3 and 4 would 
maybe qualify as deep magic :)
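
(For what it's worth, options 1, 3 and 4 could all be expressed as one flush-policy predicate. A hypothetical sketch - none of these knobs exist in TDB as of this thread, all names are invented for illustration:)

```java
// Hypothetical flush policy combining suggestions 1, 3 and 4 above:
// flush the accumulated write queue when it holds too many commits,
// too many triples, or has been idle too long. These settings do NOT
// exist in TDB; the class and fields are invented for illustration.
public class FlushPolicy {
    private final int maxCommits;       // suggestion 1: smaller batch size
    private final long maxTriples;      // suggestion 3: size-based flush
    private final long maxIdleMillis;   // suggestion 4: time-based flush

    public FlushPolicy(int maxCommits, long maxTriples, long maxIdleMillis) {
        this.maxCommits = maxCommits;
        this.maxTriples = maxTriples;
        this.maxIdleMillis = maxIdleMillis;
    }

    /** True if the accumulated write queue should be flushed now. */
    public boolean shouldFlush(int commits, long triples, long idleMillis) {
        return commits >= maxCommits
            || triples >= maxTriples
            || idleMillis >= maxIdleMillis;
    }
}
```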

But now that I understand what's happening I can at least work around 
the problem.

-Osma


-- 
Osma Suominen | Osma.Suominen@aalto.fi | +358 40 5255 882
Aalto University, Department of Media Technology, Semantic Computing 
Research Group
Room 2541, Otaniementie 17, Espoo, Finland; P.O. Box 15500, FI-00076 
Aalto, Finland