Posted to users@jena.apache.org by Osma Suominen <os...@aalto.fi> on 2012/10/01 11:21:42 UTC
Re: Fuseki leaks memory on PUT requests
28.09.2012 19:27, Andy Seaborne wrote:
> Can you use a bit more heap? The default is just a general default,
> including small 32 bit machines.
>
> I have it running at 2G and have executed 75 PUTs and it's still going.
Hi Andy!
Thanks for the quick reply and for testing this yourself. You're right,
I made a hasty conclusion, and giving more heap to the JVM does seem to
help. I tried using 2GB heap and could run 200 PUTs without problems on
a recent snapshot. So there is indeed no memory leak.
This seems to be a GC issue: if you run many PUTs with a small heap size
the GC doesn't get around to freeing enough memory before it's too late,
despite the sleeping between PUTs. When I watched the process memory
consumption using top in the latest test run, there was a steady rise to
around 2GB and then suddenly 600-700MB is released when the GC kicks in.
This process then repeats every dozen requests or so.
I will see whether tuning the GC parameters would help. It's a bit
frustrating - I'm trying to set up a public SPARQL endpoint on a
dedicated server machine and PUTs are the easiest way to update the data
from outside the server, SOA-style. The server is a 64bit RHEL6 running
Fuseki with 3GB heap and I can easily push it over the edge by accident
with a few relatively small (<1M triples) PUTs. Total physical memory is
4GB, so there's not that much room for increasing the heap size - okay,
I should just get more memory...
-Osma
--
Osma Suominen | Osma.Suominen@aalto.fi | +358 40 5255 882
Aalto University, Department of Media Technology, Semantic Computing
Research Group
Room 2541, Otaniementie 17, Espoo, Finland; P.O. Box 15500, FI-00076
Aalto, Finland
Re: Fuseki leaks memory on PUT requests
Posted by Andy Seaborne <an...@apache.org>.
On 01/10/12 10:21, Osma Suominen wrote:
> 28.09.2012 19:27, Andy Seaborne wrote:
>
>> Can you use a bit more heap? The default is just a general default,
>> including small 32 bit machines.
>>
>> I have it running at 2G and have executed 75 PUTs and it's still going.
>
> Hi Andy!
>
> Thanks for the quick reply and for testing this yourself. You're right,
> I made a hasty conclusion, and giving more heap to the JVM does seem to
> help. I tried using 2GB heap and could run 200 PUTs without problems on
> a recent snapshot. So there is indeed no memory leak.
>
> This seems to be a GC issue: if you run many PUTs with a small heap size
> the GC doesn't get around to freeing enough memory before it's too late,
> despite the sleeping between PUTs. When I watched the process memory
> consumption using top in the latest test run, there was a steady rise to
> around 2GB and then suddenly 600-700MB is released when the GC kicks in.
> This process then repeats every dozen requests or so.
>
> I will see whether tuning the GC parameters would help. It's a bit
> frustrating - I'm trying to set up a public SPARQL endpoint on a
> dedicated server machine and PUTs are the easiest way to update the data
> from outside the server, SOA-style. The server is a 64bit RHEL6 running
> Fuseki with 3GB heap and I can easily push it over the edge by accident
> with a few relatively small (<1M triples) PUTs. Total physical memory is
> 4GB, so there's not that much room for increasing the heap size - okay,
> I should just get more memory...
>
> -Osma
>
It's not a GC issue, at least not in the normal low level sense.
Write transactions are batched together for write-back to the main
database after they are committed. They are in the journal on-disk but
also the in-memory structures are retained for access to a view of the
database with the transactions applied. These take memory. (it's the
indexes - the node data is written back in the prepare file because it's
an append-only file).
The batching size is set to 10 - after 10 writes, the system flushes the
journal and drops the in-memory structures. So if you get past that
point, it should go "forever".
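For illustration, the write-back batching described above can be sketched in a few lines of Java. The class and method names here are made up for the sketch, not the actual TDB internals:

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of TDB-style write batching: committed transactions stay in
 *  memory (and in the on-disk journal) until a fixed number of commits
 *  has accumulated, then everything is written back to the main database
 *  and the in-memory structures are dropped. Illustrative names only. */
class WriteBatchSketch {
    static final int QUEUE_BATCH_SIZE = 10;  // TDB's default batch size
    private final List<String> pending = new ArrayList<>(); // stand-in for retained in-memory indexes
    private int flushes = 0;

    void commit(String txnData) {
        pending.add(txnData);                // journal on disk + structures kept in RAM
        if (pending.size() >= QUEUE_BATCH_SIZE)
            flush();
    }

    void flush() {
        pending.clear();                     // write-back done: release the in-memory views
        flushes++;
    }

    int pendingCount() { return pending.size(); }
    int flushCount()   { return flushes; }

    public static void main(String[] args) {
        WriteBatchSketch db = new WriteBatchSketch();
        for (int i = 0; i < 25; i++)
            db.commit("txn" + i);
        // 25 commits => flushes at 10 and 20; 5 transactions still held in RAM
        System.out.println(db.flushCount() + " flushes, " + db.pendingCount() + " pending");
    }
}
```

The sketch shows why heap use grows between flushes: memory is only reclaimed at the batch boundary, so a handful of large PUTs can exhaust the heap before the flush point is reached.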
And every incoming request is parsed in-memory to check the validity of the
RDF. That is also a source of RAM usage.
What the system should do is:
1/ use a persistent-but-cached layer for completed transactions
2/ be tunable (*)
3/ Notice a store is transactional and use that instead of parsing to an
in-memory graph
but does not currently offer those features. Contributions welcome.
Andy
(*) I have tended to avoid lots of configuration options as I find in
other systems lots of knobs to tweak is unhelpful overall. Either
people use the default or it needs deep magic to control.
Re: Fuseki leaks memory on PUT requests
Posted by Stephen Allen <sa...@apache.org>.
On Tue, Oct 2, 2012 at 12:32 AM, Osma Suominen <os...@aalto.fi> wrote:
> Hi Andy!
>
> 01.10.2012 23:33, Andy Seaborne kirjoitti:
>
>
>> It's not a GC issue, at least not in the normal low level sense.
>>
>> Write transactions are batched together for write-back to the main
>> database after they are committed. They are in the journal on-disk but
>> also the in-memory structures are retained for access to a view of the
>> database with the transactions applied. These take memory. (it's the
>> indexes - the node data is written back in the prepare file because it's
>> an append-only file).
>>
>> The batching size is set to 10 - after 10 writes, the system flushes the
>> journal and drops the in-memory structures. So if you get past that
>> point, it should go "forever".
>>
>> And every incoming request is parsed in-memory to check the validity of the
>> RDF. That is also a source of RAM usage.
>
>
> Ah, thanks a lot! Now I understand what I was seeing. When I PUT several
> (but <10) datasets, Fuseki will temporarily eat a lot of memory. And now my
> problem is that for my datasets, this is more than the available heap.
>
> I understand that batching is performed for performance reasons (I just read
> JENA-256), but in my scenario, writes (using PUT) are usually rather big and
> infrequent (so write performance is not important, or at least not much
> helped by batching) except when I sometimes want to update every dataset in
> one go, so there will be several large PUTs and Fuseki will run out of heap
> unless I restart it in between the PUTs.
>
>
>> What the system should do is:
>> 1/ use a persistent-but-cached layer for completed transactions
>> 2/ be tunable (*)
>> 3/ Notice a store is transactional and use that instead of parsing to an
>> in-memory graph
>>
>> but does not currently offer those features. Contributions welcome.
>>
>> Andy
>>
>> (*) I have tended to avoid lots of configuration options as I find in
>> other systems lots of knobs to tweak is unhelpful overall. Either
>> people use the default or it needs deep magic to control.
>
>
> I understand, nothing is perfect and there are always possible improvements
> to be made. And I also understand the aversion to knobs.
>
> In my case, I would like to see in Fuseki and/or TDB a way to either
> 1) reduce the batch size to something less than 10 (say, 2 or 5),
> 2) turn off batching completely,
> 3) make batching behavior dependent on the size (in triples or megabytes) of
> the accumulated queue, so a queue of large writes would be flushed sooner
> than a queue of small writes, or
> 4) make batching behavior dependent on time, so that if no further writes
> are performed in a certain time (say, 10 seconds or a minute) then the
> flushing will be done regardless of the size of the accumulated write queue
>
> I guess 1 or 2 would be in the tunable category, while 3 and 4 would maybe
> qualify as deep magic :)
>
> But now that I understand what's happening I can at least work around the
> problem.
>
A decent win would be to address what Andy mentioned as his number 3.
I've been working in this area lately on the SPARQL Update Query side
(PUT is part of the SPARQL Graph Store HTTP Protocol). But I hope to get to
that in time.
Meanwhile, if you really need to reduce memory, you can try the
following (untested) patch against the jena-fuseki project. Adjust
the 10000 constant to something lower if needed.
-Stephen
Index: jena-fuseki/src/main/java/org/apache/jena/fuseki/servlets/SPARQL_Upload.java
===================================================================
--- jena-fuseki/src/main/java/org/apache/jena/fuseki/servlets/SPARQL_Upload.java (revision 1392600)
+++ jena-fuseki/src/main/java/org/apache/jena/fuseki/servlets/SPARQL_Upload.java (working copy)
@@ -36,6 +36,8 @@
 import org.apache.jena.fuseki.http.HttpSC ;
 import org.apache.jena.fuseki.server.DatasetRef ;
 import org.apache.jena.iri.IRI ;
+import org.openjena.atlas.data.ThresholdPolicy ;
+import org.openjena.atlas.data.ThresholdPolicyFactory ;
 import org.openjena.atlas.lib.Sink ;
 import org.openjena.atlas.web.ContentType ;
 import org.openjena.riot.* ;
@@ -46,6 +48,7 @@
 import com.hp.hpl.jena.graph.Graph ;
 import com.hp.hpl.jena.graph.Node ;
 import com.hp.hpl.jena.graph.Triple ;
+import com.hp.hpl.jena.sparql.graph.GraphDefaultDataBag ;
 import com.hp.hpl.jena.sparql.graph.GraphFactory ;
 
 public class SPARQL_Upload extends SPARQL_ServletBase
@@ -95,7 +98,10 @@
         // Locking only needed over the insert into dataset
         try {
             String graphName = null ;
-            Graph graphTmp = GraphFactory.createGraphMem() ;
+            //Graph graphTmp = GraphFactory.createGraphMem() ;
+            ThresholdPolicy<Triple> policy = ThresholdPolicyFactory.count(10000) ; // Need to read the proper setting from a Context object
+            Graph graphTmp = new GraphDefaultDataBag(policy) ; // We don't care that dupes can appear in here
+
             Node gn = null ;
             String name = null ;
             ContentType ct = null ;
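For readers unfamiliar with the threshold/data-bag idea in the patch above, here is a simplified stand-in in plain Java. It is NOT the actual org.openjena.atlas.data implementation, just a sketch of the policy: keep items on the heap up to a count threshold, then spill further items to a temp file so heap usage stays bounded.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Sketch of a count-threshold "data bag": the first N items stay on the
 *  heap, everything after that is appended to a spill file on disk. */
class SpillBagSketch implements AutoCloseable {
    private final int threshold;
    private final List<String> heap = new ArrayList<>();
    private Path spillFile;
    private BufferedWriter spill;
    private long count = 0;

    SpillBagSketch(int threshold) { this.threshold = threshold; }

    void add(String item) {
        count++;
        if (heap.size() < threshold) {       // still under the in-memory limit
            heap.add(item);
            return;
        }
        try {                                // past the limit: spill to disk
            if (spill == null) {
                spillFile = Files.createTempFile("databag", ".tmp");
                spill = Files.newBufferedWriter(spillFile, StandardCharsets.UTF_8);
            }
            spill.write(item);
            spill.newLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    long size()     { return count; }
    int  inMemory() { return heap.size(); }

    @Override public void close() {
        try {
            if (spill != null) { spill.close(); Files.deleteIfExists(spillFile); }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        try (SpillBagSketch bag = new SpillBagSketch(10000)) {
            for (int i = 0; i < 25000; i++)
                bag.add("<urn:s> <urn:p> \"" + i + "\" .");
            // 25000 items total, but only the first 10000 are held on the heap
            System.out.println(bag.size() + " items, " + bag.inMemory() + " on heap");
        }
    }
}
```

This is the trade-off the patch makes: uploads larger than the threshold cost disk I/O during parsing, but the servlet's heap footprint per request is capped.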
Re: Fuseki leaks memory on PUT requests
Posted by Osma Suominen <os...@aalto.fi>.
Hi Andy!
01.10.2012 23:33, Andy Seaborne kirjoitti:
> It's not a GC issue, at least not in the normal low level sense.
>
> Write transactions are batched together for write-back to the main
> database after they are committed. They are in the journal on-disk but
> also the in-memory structures are retained for access to a view of the
> database with the transactions applied. These take memory. (it's the
> indexes - the node data is written back in the prepare file because it's
> an append-only file).
>
> The batching size is set to 10 - after 10 writes, the system flushes the
> journal and drops the in-memory structures. So if you get past that
> point, it should go "forever".
>
> And every incoming request is parsed in-memory to check the validity of the
> RDF. That is also a source of RAM usage.
Ah, thanks a lot! Now I understand what I was seeing. When I PUT several
(but <10) datasets, Fuseki will temporarily eat a lot of memory. And now
my problem is that for my datasets, this is more than the available heap.
I understand that batching is performed for performance reasons (I just
read JENA-256), but in my scenario, writes (using PUT) are usually
rather big and infrequent (so write performance is not important, or at
least not much helped by batching) except when I sometimes want to
update every dataset in one go, so there will be several large PUTs and
Fuseki will run out of heap unless I restart it in between the PUTs.
> What the system should do is:
> 1/ use a persistent-but-cached layer for completed transactions
> 2/ be tunable (*)
> 3/ Notice a store is transactional and use that instead of parsing to an
> in-memory graph
>
> but does not currently offer those features. Contributions welcome.
>
> Andy
>
> (*) I have tended to avoid lots of configuration options as I find in
> other systems lots of knobs to tweak is unhelpful overall. Either
> people use the default or it needs deep magic to control.
I understand, nothing is perfect and there are always possible
improvements to be made. And I also understand the aversion to knobs.
In my case, I would like to see in Fuseki and/or TDB a way to either
1) reduce the batch size to something less than 10 (say, 2 or 5),
2) turn off batching completely,
3) make batching behavior dependent on the size (in triples or
megabytes) of the accumulated queue, so a queue of large writes would be
flushed sooner than a queue of small writes, or
4) make batching behavior dependent on time, so that if no further
writes are performed in a certain time (say, 10 seconds or a minute)
then the flushing will be done regardless of the size of the accumulated
write queue
I guess 1 or 2 would be in the tunable category, while 3 and 4 would
maybe qualify as deep magic :)
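Suggestions 3 and 4 could be sketched as a small flush policy. All names below are hypothetical; TDB does not currently expose such a policy:

```java
/** Sketch of suggestions 3 and 4 above: flush the accumulated write queue
 *  when it grows past a triple-count threshold OR when no further write
 *  arrives within an idle timeout. Hypothetical API, for illustration only. */
class FlushPolicySketch {
    private final long maxQueuedTriples;   // suggestion 3: size trigger
    private final long maxIdleMillis;      // suggestion 4: time trigger
    private long queuedTriples = 0;
    private long lastWriteMillis;

    FlushPolicySketch(long maxQueuedTriples, long maxIdleMillis, long nowMillis) {
        this.maxQueuedTriples = maxQueuedTriples;
        this.maxIdleMillis = maxIdleMillis;
        this.lastWriteMillis = nowMillis;
    }

    void recordWrite(long triples, long nowMillis) {
        queuedTriples += triples;
        lastWriteMillis = nowMillis;
    }

    boolean shouldFlush(long nowMillis) {
        return queuedTriples >= maxQueuedTriples                                    // queue got big
            || (queuedTriples > 0 && nowMillis - lastWriteMillis >= maxIdleMillis); // writes went idle
    }

    void flushed() { queuedTriples = 0; }

    public static void main(String[] args) {
        // 1M-triple threshold, 10-second idle timeout
        FlushPolicySketch policy = new FlushPolicySketch(1_000_000, 10_000, 0);
        policy.recordWrite(900_000, 1_000);              // one large PUT
        System.out.println(policy.shouldFlush(2_000));   // false: under both triggers
        policy.recordWrite(200_000, 3_000);              // second PUT passes 1M queued triples
        System.out.println(policy.shouldFlush(3_000));   // true: size trigger
        policy.flushed();
        policy.recordWrite(10_000, 4_000);               // small write, then a long quiet period
        System.out.println(policy.shouldFlush(20_000));  // true: idle trigger
    }
}
```

With something like this, a burst of large PUTs would be flushed after the first one or two, while streams of small writes would still get the full batching benefit.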
But now that I understand what's happening I can at least work around
the problem.
-Osma