Posted to dev@clerezza.apache.org by Minto van der Sluis <mi...@xup.nl> on 2013/11/28 14:17:41 UTC

Is Clerezza leaking memory?

Hi,

I just ran into some peculiar behavior.

For my current project I have to import 633 files, each containing approx.
20 MB of XML data (13 GB in total). When importing this data into a
single graph I hit an out-of-memory exception on the 7th file.

Looking at the heap I noticed that after restarting the application I
could load a few more files. So I started looking for the bundle that
consumed all the memory. It happened to be the Clerezza TDB Storage
provider. See the following image (GC = garbage collection):

Looking more closely I noticed that Apache Jena is able to close a graph
(graph.close()), but Clerezza is not using this feature and keeps the
graph open all the time.
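For illustration, this is roughly what closing the model after each import
would look like against the Jena API (a minimal sketch; the store path,
graph name and file name are made up):

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class ImportOneFile {
        public static void main(String[] args) {
            // Open (or create) the TDB store and get a view on the target graph.
            Dataset dataset = TDBFactory.createDataset("/path/to/tdb");
            Model model = dataset.getNamedModel("http://example.org/import-graph");
            // Import one file, then close the view when done with it.
            model.read("file:imports/part-007.rdf");
            model.close(); // the close() that Clerezza never issues
        }
    }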

How best to tackle this memory issue?

Regards,

Minto

-- 
ir. ing. Minto van der Sluis
Software innovator / renovator
Xup BV

Mobiel: +31 (0) 626 014541


Re: Is Clerezza leaking memory?

Posted by Reto Gmür <re...@apache.org>.
Hi Minto,

It would be great if you could check if the problem is still there with the
resolution I propose for CLEREZZA-871 (just pushed).

Cheers,
Reto




On Fri, Nov 29, 2013 at 6:15 PM, Andy Seaborne <an...@apache.org> wrote:

>  On 29/11/13 12:31, Minto van der Sluis wrote:
>
> Andy Seaborne schreef op 29-11-2013 9:39:
>
> On 28/11/13 13:17, Minto van der Sluis wrote:
>
> Hi,
>
> I just ran into some peculiar behavior.
>
> For my current project I have to import 633 files, each containing approx.
> 20 MB of XML data (13 GB in total). When importing this data into a
> single graph I hit an out-of-memory exception on the 7th file.
>
> Looking at the heap I noticed that after restarting the application I
> could load a few more files. So I started looking for the bundle that
> consumed all the memory. It happened to be the Clerezza TDB Storage
> provider. See the following image (GC = garbage collection):
>
> Looking more closely I noticed that Apache Jena is able to close a graph
> (graph.close()), but Clerezza is not using this feature and keeps the
> graph open all the time.
>
>
> Jena graphs backed by TDB are simply views of the dataset - they don't
> have any state associated with them directly.  If the references become
> inaccessible, GC should clean up.
>
> Hi Andy,
>
> The problem, as far as I can tell, is not in Jena TDB itself. The Jena TDB
> bundle is still active/running. Only the Clerezza TDB Provider bundle is
> stopped (by me). As my image shows, a normal GC does not release all of
> the memory. Only after stopping the Clerezza TDB Provider is the memory
> allocated for importing released. Stopping this particular bundle makes
> all Jena data structures inaccessible and eligible for GC, just as the
> image shows.
>
> My reasoning is that the Clerezza TDB Provider has a map with weak
> references to Jena models, but these references are never properly
> garbage collected. Since I use the same graph all the time, all data
> accumulates, resulting in the out-of-memory error. Looking at a memory
> dump, most space is occupied by byte arrays containing the imported data.
>
> I use a nasty hack to prevent this dreaded out-of-memory error. After
> every import I restart the Clerezza TDB Provider bundle programmatically
> (hail OSGi, for I wouldn't know how to do this without OSGi). This way I
> have been able to import more than 300 files in a row (still running).
>
> Regards,
>
> Minto
>
>
> It does look like something in Clerezza is holding memory.  Do note that
> TDB has internal caches, so it will grow as well.  Datasets are kept
> around because they are expensive to re-warm, and the node table cache is
> in-heap.  Other caches are not in-heap (64-bit mode).
>
> If you want to bulk import, you could load the TDB database directly,
> using the bulk loader.  Indeed, it can be worthwhile taking the input,
> creating an N-Quads file, with lots of checking and validation of the data,
> then loading the N-Quads. It's annoying to get partway through a large
> load and find the data isn't perfect.
>
>     Andy
>
>
>
>

Re: Is Clerezza leaking memory?

Posted by Andy Seaborne <an...@apache.org>.
On 29/11/13 12:31, Minto van der Sluis wrote:
> Andy Seaborne schreef op 29-11-2013 9:39:
>> On 28/11/13 13:17, Minto van der Sluis wrote:
>>> Hi,
>>>
>>> I just ran into some peculiar behavior.
>>>
>>> For my current project I have to import 633 files, each containing
>>> approx. 20 MB of XML data (13 GB in total). When importing this
>>> data into a single graph I hit an out-of-memory exception on the 7th
>>> file.
>>>
>>> Looking at the heap I noticed that after restarting the application 
>>> I could load a few more files. So I started looking for the bundle 
>>> that consumed all the memory. It happened to be the Clerezza TDB 
>>> Storage provider. See the following image (GC = garbage collection):
>>>
>>> Looking more closely I noticed that Apache Jena is able to close a
>>> graph (graph.close()), but Clerezza is not using this feature and
>>> keeps the graph open all the time.
>>
>> Jena graphs backed by TDB are simply views of the dataset - they 
>> don't have any state associated with them directly.  If the references
>> become inaccessible, GC should clean up.
> Hi Andy,
>
> The problem, as far as I can tell, is not in Jena TDB itself. The Jena 
> TDB bundle is still active/running. Only the Clerezza TDB Provider 
> bundle is stopped (by me). As my image shows, a normal GC does not
> release all of the memory. Only after stopping the Clerezza TDB
> Provider is the memory allocated for importing released. Stopping this
> particular bundle makes all Jena data structures inaccessible and
> eligible for GC, just as the image shows.
>
> My reasoning is that the Clerezza TDB Provider has a map with weak
> references to Jena models, but these references are never properly
> garbage collected. Since I use the same graph all the time, all data
> accumulates, resulting in the out-of-memory error. Looking at a memory
> dump, most space is occupied by byte arrays containing the imported data.
>
> I use a nasty hack to prevent this dreaded out-of-memory error. After
> every import I restart the Clerezza TDB Provider bundle programmatically
> (hail OSGi, for I wouldn't know how to do this without OSGi). This way I
> have been able to import more than 300 files in a row (still running).
>
> Regards,
>
> Minto

It does look like something in Clerezza is holding memory.  Do note that
TDB has internal caches, so it will grow as well.  Datasets are kept
around because they are expensive to re-warm, and the node table cache is
in-heap.  Other caches are not in-heap (64-bit mode).

If you want to bulk import, you could load the TDB database directly, 
using the bulk loader.  Indeed, it can be worthwhile taking the input, 
creating an N-Quads file, with lots of checking and validation of the 
data, then loading the N-Quads. It's annoying to get partway through a
large load and find the data isn't perfect.
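For example (file and directory names made up; riot and tdbloader are the
command-line tools that ship with Jena/TDB):

    # Check the data first, then bulk load the N-Quads into the TDB database.
    riot --validate data.nq
    tdbloader --loc=/path/to/tdb data.nq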

     Andy




Re: Is Clerezza leaking memory?

Posted by Minto van der Sluis <mi...@xup.nl>.
Andy Seaborne schreef op 29-11-2013 9:39:
> On 28/11/13 13:17, Minto van der Sluis wrote:
>> Hi,
>>
>> I just ran into some peculiar behavior.
>>
>> For my current project I have to import 633 files, each containing
>> approx. 20 MB of XML data (13 GB in total). When importing this data
>> into a single graph I hit an out-of-memory exception on the 7th file.
>>
>> Looking at the heap I noticed that after restarting the application I
>> could load a few more files. So I started looking for the bundle that
>> consumed all the memory. It happened to be the Clerezza TDB Storage
>> provider. See the following image (GC = garbage collection):
>>
>> Looking more closely I noticed that Apache Jena is able to close a
>> graph (graph.close()), but Clerezza is not using this feature and
>> keeps the graph open all the time.
>
> Jena graphs backed by TDB are simply views of the dataset - they don't
> have any state associated with them directly.  If the references become
> inaccessible, GC should clean up.
Hi Andy,

The problem, as far as I can tell, is not in Jena TDB itself. The Jena
TDB bundle is still active/running. Only the Clerezza TDB Provider
bundle is stopped (by me). As my image shows, a normal GC does not
release all of the memory. Only after stopping the Clerezza TDB Provider
is the memory allocated for importing released. Stopping this particular
bundle makes all Jena data structures inaccessible and eligible for GC,
just as the image shows.

My reasoning is that the Clerezza TDB Provider has a map with weak
references to Jena models, but these references are never properly
garbage collected. Since I use the same graph all the time, all data
accumulates, resulting in the out-of-memory error. Looking at a memory
dump, most space is occupied by byte arrays containing the imported data.
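To illustrate the kind of pattern I suspect (a hypothetical sketch, not
the actual Clerezza source: a WeakHashMap only drops an entry once its
key becomes unreachable, so a strongly held key pins its value forever):

    import java.util.Map;
    import java.util.WeakHashMap;

    public class WeakCacheDemo {
        // Weak KEYS, but each value stays strongly referenced by the map
        // entry until the key itself becomes unreachable.
        private static final Map<String, byte[]> cache =
                new WeakHashMap<String, byte[]>();
        // The graph name is held strongly (an interned string literal),
        // so its entry can never be cleared.
        private static final String GRAPH = "http://example.org/import-graph";

        public static void main(String[] args) {
            cache.put(GRAPH, new byte[20 * 1024 * 1024]); // ~20 MB, like one import
            System.gc();
            // Prints true: the key is strongly reachable, so the ~20 MB survives GC.
            System.out.println("entry survived GC: " + cache.containsKey(GRAPH));
        }
    }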

I use a nasty hack to prevent this dreaded out-of-memory error. After
every import I restart the Clerezza TDB Provider bundle programmatically
(hail OSGi, for I wouldn't know how to do this without OSGi). This way I
have been able to import more than 300 files in a row (still running).
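The hack itself boils down to a few lines of OSGi (a sketch; the
provider's bundle symbolic name here is my assumption and may differ):

    import org.osgi.framework.Bundle;
    import org.osgi.framework.BundleContext;
    import org.osgi.framework.BundleException;

    public class ProviderRestarter {
        // Restart the provider bundle so its internal maps become
        // unreachable and the accumulated data finally becomes GC-eligible.
        public void restart(BundleContext context) throws BundleException {
            for (Bundle bundle : context.getBundles()) {
                if ("org.apache.clerezza.rdf.jena.tdb.storage"
                        .equals(bundle.getSymbolicName())) {
                    bundle.stop();
                    bundle.start();
                }
            }
        }
    }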

Regards,

Minto




Re: Is Clerezza leaking memory?

Posted by Andy Seaborne <an...@apache.org>.
On 28/11/13 13:17, Minto van der Sluis wrote:
> Hi,
>
> I just ran into some peculiar behavior.
>
> For my current project I have to import 633 files, each containing
> approx. 20 MB of XML data (13 GB in total). When importing this data
> into a single graph I hit an out-of-memory exception on the 7th file.
>
> Looking at the heap I noticed that after restarting the application I 
> could load a few more files. So I started looking for the bundle that 
> consumed all the memory. It happened to be the Clerezza TDB Storage 
> provider. See the following image (GC = garbage collection):
>
> Looking more closely I noticed that Apache Jena is able to close a
> graph (graph.close()), but Clerezza is not using this feature and
> keeps the graph open all the time.

Jena graphs backed by TDB are simply views of the dataset - they don't 
have any state associated with them directly.  If the references become
inaccessible, GC should clean up.
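In code terms, something like this (store location and graph name made up):

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class ViewDemo {
        public static void main(String[] args) {
            Dataset dataset = TDBFactory.createDataset("/path/to/tdb");
            // Each call hands back an independent, lightweight view over
            // the same storage; the view itself owns no data.
            Model v1 = dataset.getNamedModel("http://example.org/graph");
            Model v2 = dataset.getNamedModel("http://example.org/graph");
            v1 = null;
            v2 = null; // once unreferenced, the views are ordinary GC food;
                       // the dataset and its caches are what live on
        }
    }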

     Andy


>
> How best to tackle this memory issue?
>
> Regards,
>
> Minto
> -- 
> ir. ing. Minto van der Sluis
> Software innovator / renovator
> Xup BV
>
> Mobiel: +31 (0) 626 014541