Posted to dev@clerezza.apache.org by Reto Gmür <re...@apache.org> on 2014/01/29 12:19:08 UTC

Re: Is Clerezza leaking memory?

Hi Minto,

It would be great if you could check if the problem is still there with the
resolution I propose for CLEREZZA-871 (just pushed).

Cheers,
Reto




On Fri, Nov 29, 2013 at 6:15 PM, Andy Seaborne <an...@apache.org> wrote:

>  On 29/11/13 12:31, Minto van der Sluis wrote:
>
> Andy Seaborne schreef op 29-11-2013 9:39:
>
> On 28/11/13 13:17, Minto van der Sluis wrote:
>
> Hi,
>
> I just ran into some peculiar behavior.
>
> For my current project I have to import 633 files each containing approx
> 20 MB of xml data (a total of 13 GB). When importing this data into a
> single graph I hit an out of memory exception on the 7th file.
>
> Looking at the heap I noticed that after restarting the application I
> could load a few more files. So I started looking for the bundle that
> consumed all the memory. It happened to be the Clerezza TDB Storage
> provider. See the following image (GC = garbage collection):
>
> [image not included in the plain-text archive]
>
> Looking more closely I noticed that Apache Jena is able to close a graph
> (graph.close()), but Clerezza is not using this feature and keeps the
> graph open all the time.
>
>
> Jena graphs backed by TDB are simply views of the dataset - they don't
> have any state associated with them directly.  If the references become
> inaccessible, GC should clean up.
>
> Hi Andy,
>
> The problem, as far as I can tell, is not in Jena TDB itself. The Jena TDB
> bundle is still active/running. Only the Clerezza TDB Provider bundle is
> stopped (by me). As my image shows, a normal GC does not release all of
> the memory. Only after stopping the Clerezza TDB Provider is the memory
> allocated for importing released. By stopping this particular bundle, all
> Jena data structures become inaccessible and eligible for GC, just like
> the image shows.
>
> My reasoning is that although the Clerezza TDB Provider has a map with weak
> references to Jena models, these references are never properly garbage
> collected. Since I use the same graph all the time, all data accumulates,
> eventually resulting in out of memory. Looking at a memory dump, most
> space is occupied by byte arrays containing the imported data.
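For illustration (I don't know the actual Clerezza code): one classic way a map of weak references can still leak in Java is when the cached value holds a strong reference back to its own key, so the key is never only-weakly reachable and the entry is never cleared. A minimal self-contained sketch of that pattern — class and field names here are made up, not Clerezza's:

```java
import java.util.Map;
import java.util.WeakHashMap;

public class WeakMapLeak {

    // Stand-in for a cached model holding a large chunk of imported data.
    static class CachedModel {
        final byte[] data = new byte[1_000_000];
        final String graphName; // strong reference back to the map key
        CachedModel(String graphName) { this.graphName = graphName; }
    }

    static int entriesAfterGC() {
        Map<String, CachedModel> cache = new WeakHashMap<>();
        String key = new String("mygraph");
        cache.put(key, new CachedModel(key)); // value keeps its key strongly reachable
        key = null;  // drop the caller's reference to the key

        System.gc(); // request a collection

        // The entry survives: the key is still strongly reachable through
        // the value, so WeakHashMap never clears it.
        return cache.size();
    }

    public static void main(String[] args) {
        System.out.println("entries after GC: " + entriesAfterGC());
    }
}
```

If the cached values (or anything else strongly reachable) point back at the keys, the "weak" map behaves like a strong one and the imported data piles up exactly as described.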
>
> I use a nasty hack to prevent this dreaded out of memory error. After every
> import I restart the Clerezza TDB Provider bundle programmatically (hail
> OSGi, for I wouldn't know how to do this without OSGi). This way I have
> been able to import more than 300 files in a row (still running).
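For reference, a restart hack like this can be sketched with the standard OSGi API. This is only an illustrative sketch, not Minto's actual code: it assumes you already have a BundleContext (e.g. from a bundle activator), and the symbolic name below is a guess at the provider bundle's name.

```java
import org.osgi.framework.Bundle;
import org.osgi.framework.BundleContext;
import org.osgi.framework.BundleException;

// Sketch: restart the TDB provider bundle to release whatever it caches.
// The symbolic name is hypothetical; check it in your OSGi console.
void restartTdbProvider(BundleContext context) throws BundleException {
    for (Bundle bundle : context.getBundles()) {
        if ("org.apache.clerezza.rdf.jena.tdb.storage".equals(bundle.getSymbolicName())) {
            bundle.stop();   // stopping drops the provider's references
            bundle.start();  // restart so subsequent imports keep working
        }
    }
}
```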
>
> Regards,
>
> Minto
>
>
> It does look like something in Clerezza is holding memory.  Do note that
> TDB has internal caches, so it will grow as well.  Datasets are kept around
> because they are expensive to re-warm, and the node table cache is
> in-heap.  Other caches are not in-heap (64-bit mode).
>
> If you want to bulk import, you could load the TDB database directly,
> using the bulk loader.  Indeed, it can be worthwhile taking the input,
> creating an N-Quads file with lots of checking and validation of the data,
> and then loading the N-Quads. It's annoying to get part way through a large
> load and find the data isn't perfect.
>
>     Andy
>
>
>
>