Posted to users@jackrabbit.apache.org by "Roll, Kevin" <Ke...@idexx.com> on 2015/11/23 14:13:18 UTC

Memory usage

We have started to encounter OutOfMemoryError failures on Jackrabbit under heavy pressure (it's worth noting that we are using the full Sling stack). I've discovered that Lucene keeps a full index of the repository in memory, and this terrifies me because we are already having problems just in a test scenario, and the repository will only grow. Unfortunately we are forced to run this system on older 32-bit hardware in the field that does not have any room to expand memory-wise. Are there any options I can tweak to reduce the memory footprint? Any other things I can disable that will cut down on memory usage? Is Oak better in this regard? Thanks!


Re: Memory usage

Posted by Clay Ferguson <wc...@gmail.com>.
Some ideas:
  - Make sure you HAVE set the largest JVM heap size you can get away
with (duh.)
  - Try the "-server" JVM option
  - Use a profiler to see what is truly using memory
  - Look at the buffer size settings on your physical DB (MySQL / Mongo?)
  - Try closing the session after every 100 or 1000 saves and starting fresh,
to make sure everything is flushed to disk (see the sketch after this list)
  - Use log statements to write the free VM memory to the log file after each
commit, to check how fast it goes down and to find any jumps in memory use
  - I have never used a profiler myself, but a lot of people use them to find
memory-hogging code
  - Are you sure you are closing all streams and resources in finally
blocks, and not leaking resources related to the actual image processing?
  - Are there some GC options you can tweak to help?
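
A minimal sketch of the batched-save and memory-logging ideas above, assuming
a plain javax.jcr.Session obtained elsewhere; the batch size, the "runs" path,
and the property name are illustrative, not anything Jackrabbit requires:

    import javax.jcr.Node;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;

    public class BatchedImport {

        private static final int BATCH_SIZE = 100; // tune for your workload

        // Save every BATCH_SIZE new nodes and log free heap after each commit.
        public static void importImages(Session session, Iterable<byte[]> images)
                throws RepositoryException {
            Node parent = session.getRootNode().getNode("runs"); // made-up path
            int count = 0;
            for (byte[] image : images) {
                Node node = parent.addNode("image-" + count, "nt:unstructured");
                node.setProperty("length", (long) image.length);
                if (++count % BATCH_SIZE == 0) {
                    session.save();         // flush pending changes to storage
                    session.refresh(false); // drop transient state held in memory
                    System.out.println("free heap after commit: "
                            + Runtime.getRuntime().freeMemory() + " bytes");
                }
            }
            session.save(); // flush the final partial batch
        }
    }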



Best regards,
Clay Ferguson
wclayf@gmail.com


On Mon, Nov 23, 2015 at 11:13 AM, Roll, Kevin <Ke...@idexx.com> wrote:

> Our use case is the following: an external process generates 70 images,
> each around ~700k in size. These are uploaded as sub-nodes under a master
> node that encapsulates the run. There are also some sister nodes that
> contain a modest amount of metadata about each image and the run that
> generated it. In general most of the writing consists of a client POSTing
> these images into the repository via Sling; there are then some event
> handlers and tasks that look at the data that arrived. The only subsequent
> writes at present are some properties that are set after these images are
> examined and replicated into another system. So, I don't expect much at all
> in the way of concurrent read/write; it's mainly write a bunch and then
> read it back later.
>
> By heavy pressure what I mean is that we have a test lab running
> continuously against this system. It's a lot more traffic than can be
> expected in the real world, but it is good for shaking out problems. What
> concerns me is that according to the documentation an entire Lucene index
> is kept in memory. Right now we don’t do any pruning - our repository only
> grows larger. This implies to me that the index will only grow as well and
> we will ultimately run out of memory no matter how big the heap is.
> Hopefully I'm wrong about that.
>
> At the moment we have no JVM flags set. The SearchIndex configuration is
> also default (by default I mean what came with Sling), although I am
> looking at turning off supportHighlighting and putting a small value for
> resultFetchSize, say 100.
>
> -----Original Message-----
> From: Ben Frisoni [mailto:frisonib@gmail.com]
> Sent: Monday, November 23, 2015 11:55 AM
> To: users@jackrabbit.apache.org
> Subject: Re: Memory usage
>
> A little more description of the term 'heavy pressure' might help. Does
> this involve concurrent read operations, write operations, or both?
>
> Also, some other things that affect performance:
> 1. What JVM parameters are set?
> 2. Do you have any custom index configurations set?
> 3. What does your repository.xml look like?
>
> This background info might help with answering your question.
>

Re: Memory usage

Posted by Clay Ferguson <wc...@gmail.com>.
Kevin,
That word "generation" just means the most recent limited set of buffers.
Don't worry, Lucene doesn't hold its entire index in memory. I'm certain of
that. It does its buffering using as little memory as possible, just like
database engines, etc. As I said with my list of guesses, I'd say it's 99%
likely that your memory problem is not related to JCR or Lucene, but just a
leak you should be able to find.

Best regards,
Clay Ferguson
wclayf@gmail.com


On Mon, Nov 23, 2015 at 9:16 PM, Roll, Kevin <Ke...@idexx.com> wrote:

> Hi, Ben. I was referring to the following page:
>
> https://jackrabbit.apache.org/jcr/search-implementation.html
>
> "The most recent generation of the search index is held completely in
> memory."
>
> Perhaps I am misreading this, or perhaps it is wrong, but I interpreted
> that to mean that the size of the index in memory would be proportional to
> the repository size. I hope this is not true!
>
> I am currently trying to get information from our QA team about the
> approximate number of nodes in the repository. We are not currently setting
> an explicit heap size - in the dumps I've examined it seems to run out
> around 240 MB. I'm pushing to set something explicit, but I'm now hearing
> that older hardware has only 1 GB of memory, which gives us practically
> nowhere to go.
>
> The queries that I'm doing are not very fancy... for example: "select *
> from [nt:resource] where [jcr:mimeType] like 'image%%'". I'm actually
> rewriting that task so the query will be even simpler.
>
> Thanks for the help!
>
>
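
A sketch of how a query like the one quoted above runs through the JCR API;
the surrounding code assumes a JCR 2.0 Session and is illustrative only. (The
doubled %% in the mail is likely a format-string escape; a single % is the
SQL LIKE wildcard.)

    import javax.jcr.Node;
    import javax.jcr.NodeIterator;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;
    import javax.jcr.query.Query;
    import javax.jcr.query.QueryManager;
    import javax.jcr.query.QueryResult;

    public class MimeTypeQuery {

        // Runs the JCR-SQL2 statement quoted above and walks the result set.
        public static void findImages(Session session) throws RepositoryException {
            QueryManager qm = session.getWorkspace().getQueryManager();
            Query query = qm.createQuery(
                    "select * from [nt:resource] where [jcr:mimeType] like 'image%'",
                    Query.JCR_SQL2);
            QueryResult result = query.execute();
            for (NodeIterator it = result.getNodes(); it.hasNext(); ) {
                Node node = it.nextNode();
                System.out.println(node.getPath()); // or process the node here
            }
        }
    }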
> -----Original Message-----
> From: Ben Frisoni [mailto:frisonib@gmail.com]
> Sent: Monday, November 23, 2015 5:21 PM
> To: users@jackrabbit.apache.org
> Subject: Re: Memory usage
>
> It is a good idea to turn off supportHighlighting, especially if you aren't
> using the functionality; it takes up a lot of extra space within the index.
> I am not sure where you heard that the Lucene index is kept in memory, but I
> am pretty certain that is wrong. Can you point me to the documentation
> saying this?
>
> Also, what data set sizes are you querying against (10k nodes? 100k nodes?
> 1 million nodes?).
> What heap size do you have set on the jvm?
> Reducing the resultFetchSize should help reduce the memory footprint on
> queries.
> I am assuming you are using the QueryManager to retrieve nodes. Can you
> give an example query that you are using?
>
> I have developed a patch to improve query performance on large data sets
> with jackrabbit 2.x. I should be done soon if I can gather together a few
> hours to finish up my work. If you would like you can give that a try once
> I finish.
>
> Some other repository settings you might want to look at are:
>  <PersistenceManager
>      class="org.apache.jackrabbit.core.persistence.pool.DerbyPersistenceManager">
>    <param name="bundleCacheSize" value="256"/>
>  </PersistenceManager>
>  <ISMLocking class="org.apache.jackrabbit.core.state.FineGrainedISMLocking"/>
>
>
> Hope this helps.
>
>

Re: Memory usage

Posted by Clay Ferguson <wc...@gmail.com>.
Kevin,
Oh, maybe Sling can't do LIMIT. I didn't realize (or notice) you were on
Sling; my bad. In my product (meta64.com) I didn't go with Sling, and I talk
directly to the Java API itself.

Best regards,
Clay Ferguson
wclayf@gmail.com


On Tue, Nov 24, 2015 at 3:58 PM, Roll, Kevin <Ke...@idexx.com> wrote:

> That's in JBoss, guy. Maybe it works there, but it doesn't in Sling... I
> tried it!
>
> -----Original Message-----
> From: Clay Ferguson [mailto:wclayf@gmail.com]
> Sent: Tuesday, November 24, 2015 4:47 PM
> To: users@jackrabbit.apache.org
> Subject: Re: Memory usage
>
> Come on Kevin, I just googled it and found it immediately bro. :)
>
>
> https://docs.jboss.org/jbossdna/0.7/manuals/reference/html/jcr-query-and-search.html#jcr-sql2-limits
>
> Best regards,
> Clay Ferguson
> wclayf@gmail.com
>
>
> On Tue, Nov 24, 2015 at 3:30 PM, Roll, Kevin <Ke...@idexx.com> wrote:
>
> > Unfortunately the use of 'limit' is not supported (via JCR-SQL2 queries):
> >
> > https://issues.apache.org/jira/browse/SLING-1873
> >
> > I set resultFetchSize to a very low number and I was still able to iterate
> > through a larger result set, although this may have been batched behind the
> > scenes. I'm hoping that my new flag-based task will drastically cut down
> > the result set size and prevent the runaway memory usage anyway.
> >
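
Worth noting: even where the SQL2 grammar's LIMIT keyword is not accepted, the
JCR 2.0 API exposes a limit on the Query object itself. A sketch, assuming a
JCR 2.0 QueryManager (untested against Sling's wrappers):

    import javax.jcr.RepositoryException;
    import javax.jcr.Session;
    import javax.jcr.query.Query;
    import javax.jcr.query.QueryResult;

    public class BoundedQuery {

        // Fetches at most 'limit' rows starting at 'offset', regardless of
        // whether the query language itself supports a LIMIT keyword.
        public static QueryResult page(Session session, long offset, long limit)
                throws RepositoryException {
            Query query = session.getWorkspace().getQueryManager().createQuery(
                    "select * from [nt:resource]", Query.JCR_SQL2);
            query.setOffset(offset); // JCR 2.0: skip the first 'offset' rows
            query.setLimit(limit);   // JCR 2.0: cap the result set size
            return query.execute();
        }
    }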
> >
> > From: Clay Ferguson [mailto:wclayf@gmail.com]
> > Sent: Tuesday, November 24, 2015 1:35 PM
> > To: users@jackrabbit.apache.org
> > Subject: Re: Memory usage
> >
> > point #1. In SQL2 you can just build your query string dynamically and put
> > in the time of the last replication. So really I don't see the limitation
> > there. You would always just build your queries with the correct date on
> > it. But like you said, that is a 'weak' solution. I think actually the
> > 'dirty flag' kind of thing, or 'needs replication' flag, is better because
> > you can do it node-by-node and at any time, and you can shut down and
> > restart and it will always pick up where it left off. With timestamps you
> > can run into situations where one cycle only half processed (failure for
> > whatever reason), and then your dates get messed up. So if I were you I'd
> > do the flag approach. Seems more bulletproof. So if you have systems A,
> > B, C, where A needs to replicate out to B and C, then what you'd do is every
> > time you modify or create an A node, you set B_DIRTY=true and C_DIRTY=true
> > on the A node, and that flags it to show a replication is pending. Sounds
> > like you are on the right track; you just need to set a LIMIT on your query
> > so that it only grabs 100 or so at a time. I know MySQL has a LIMIT. Maybe
> > SQL2 does also. You'd just keep running 100 at a time using LIMIT until one
> > of the queries comes back empty. It will use hardly any memory, and be
> > bulletproof AND always easily restartable/resumable.
> >
> > Best regards,
> > Clay Ferguson
> > wclayf@gmail.com
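
A sketch of the batching loop Clay describes, combined with the dirty-flag
idea; the needsReplication flag, the query, and the replicate() call are
made up for illustration:

    import javax.jcr.Node;
    import javax.jcr.NodeIterator;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;
    import javax.jcr.query.Query;

    public class ReplicationSweep {

        // Repeatedly grabs up to 100 flagged nodes, replicates them, clears
        // the flag, and saves, until a query comes back empty.
        public static void sweep(Session session) throws RepositoryException {
            while (true) {
                Query query = session.getWorkspace().getQueryManager().createQuery(
                        "select * from [nt:base] "
                                + "where [needsReplication] = CAST('true' AS BOOLEAN)",
                        Query.JCR_SQL2);
                query.setLimit(100); // one small batch at a time keeps memory flat
                NodeIterator batch = query.execute().getNodes();
                if (!batch.hasNext()) {
                    return; // nothing left to replicate
                }
                while (batch.hasNext()) {
                    Node node = batch.nextNode();
                    // replicate(node); // hypothetical push to the other system
                    node.setProperty("needsReplication", false); // clear the flag
                }
                session.save(); // commit the cleared flags before the next batch
            }
        }
    }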
> >
> >
> > On Tue, Nov 24, 2015 at 11:56 AM, Roll, Kevin <Ke...@idexx.com>
> > wrote:
> >
> > > Basically we replicate images and associated metadata to another system.
> > > One of the use cases is that the user marks an image as interesting in the
> > > local system. This metadata change (or any other) then needs to propagate
> > > to the other system. So, I am querying for nodes where jcr:lastModified is
> > > greater than another date, which is the timestamp of the last replication.
> > >
> > > My understanding is that JCR-SQL2 can only do a comparison where the
> > > second operand is static. I am working on a different approach where I set
> > > a flag on any node that needs to be replicated. I have event handlers for
> > > added and changed nodes - at that moment it is trivial to determine whether
> > > the node should be flagged. I realized it is much easier than trying to
> > > figure it out later. The "later" case arises because we have the option to
> > > switch this replication on and off, and there may be a situation where it
> > > is turned on and must catch up with a backlog of work. This way I can simply
> > > query all nodes with the flag set (I have a scheduled task that looks for
> > > nodes needing replication).
> > >
> > > If there's a date comparison trick it might help as an interim solution
> > > until I get this other idea up and running.
> > >
> > > Thanks!
> > >
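
For the event-handler part, a sketch using JCR observation; the flag name is
hypothetical and the error handling is minimal:

    import javax.jcr.RepositoryException;
    import javax.jcr.Session;
    import javax.jcr.observation.Event;
    import javax.jcr.observation.EventIterator;
    import javax.jcr.observation.EventListener;

    public class ReplicationFlagger implements EventListener {

        private final Session session;

        public ReplicationFlagger(Session session) {
            this.session = session;
        }

        // Registers this listener for node additions and property changes
        // anywhere under the root.
        public void register() throws RepositoryException {
            session.getWorkspace().getObservationManager().addEventListener(
                    this,
                    Event.NODE_ADDED | Event.PROPERTY_CHANGED,
                    "/",   // watch the whole workspace
                    true,  // isDeep: include the entire subtree
                    null,  // any UUID
                    null,  // any node type
                    true); // noLocal: ignore our own writes, e.g. the flag itself
        }

        @Override
        public void onEvent(EventIterator events) {
            try {
                while (events.hasNext()) {
                    Event event = events.nextEvent();
                    // Mark the affected node as needing replication (made-up flag).
                    session.getNode(nodePath(event))
                           .setProperty("needsReplication", true);
                }
                session.save();
            } catch (RepositoryException e) {
                e.printStackTrace(); // log properly and consider retries in real code
            }
        }

        // For property events the node path is the parent of the event path.
        private static String nodePath(Event event) throws RepositoryException {
            String path = event.getPath();
            if (event.getType() != Event.PROPERTY_CHANGED) {
                return path;
            }
            int idx = path.lastIndexOf('/');
            return idx <= 0 ? "/" : path.substring(0, idx);
        }
    }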
> > > -----Original Message-----
> > > From: Clay Ferguson [mailto:wclayf@gmail.com]
> > > Sent: Tuesday, November 24, 2015 12:15 PM
> > > To: users@jackrabbit.apache.org
> > > Subject: Re: Memory usage
> > >
> > > glad you're gettin' closer.
> > >
> > > If you want, tell us more about the date range problem, because I may know
> > > a solution (or workaround). Remember dates can be treated as integers if
> > > you really need to. Integers are the fastest and most powerful data type
> > > for dbs to handle too. So there should be a good clean solution unless you
> > > have a VERY unusual situation.
> > >
> > > Best regards,
> > > Clay Ferguson
> > > wclayf@gmail.com
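
On the date point: a JCR-SQL2 statement can embed the last-replication
timestamp as a literal, so only one operand has to be dynamic at query-build
time. A sketch; the node type and property come from Kevin's mails, the rest
is assumption:

    import java.util.Calendar;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;
    import javax.jcr.query.Query;
    import javax.jcr.query.QueryResult;

    public class ModifiedSince {

        // Builds the JCR-SQL2 statement dynamically, embedding the timestamp
        // of the last replication as a DATE literal.
        public static QueryResult modifiedSince(Session session, Calendar lastRun)
                throws RepositoryException {
            // A JCR DATE value renders as an ISO-8601 string, which CAST accepts.
            String iso = session.getValueFactory().createValue(lastRun).getString();
            String stmt = "select * from [nt:resource] where [jcr:lastModified] > "
                    + "CAST('" + iso + "' AS DATE)";
            return session.getWorkspace().getQueryManager()
                    .createQuery(stmt, Query.JCR_SQL2).execute();
        }
    }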
> > >
> > >
> > > On Tue, Nov 24, 2015 at 10:14 AM, Roll, Kevin <Ke...@idexx.com>
> > > wrote:
> > >
> > > > I think I am hot on the trail. I noticed this morning that the top objects
> > > > in the heap dump are not just Lucene, they are classes related to query
> > > > results. Due to a limitation in the Jackrabbit query language (specifically
> > > > the inability to compare two dynamic dates) I am running a query that
> > > > returns a result set proportional to the size of the repository (in other
> > > > words it is unbounded). resultFetchSize is unlimited by default, so I think
> > > > I am getting larger and larger query results until I run out of space.
> > > >
> > > > I already changed this parameter yesterday, so I will see what happens
> > > > with the testing today. In the bigger picture I'm working on a better way
> > > > to mark and query the nodes I'm interested in so I don't have to perform an
> > > > unbounded query.
> > > >
> > > > Thanks again for the excellent support.
> > > >
> > > > P.S. We build and run a standalone Sling jar - it runs separately from our
> > > > main application.
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: Ben Frisoni [mailto:frisonib@gmail.com]
> > > > Sent: Tuesday, November 24, 2015 11:05 AM
> > > > To: users@jackrabbit.apache.org
> > > > Subject: Re: Memory usage
> > > >
> > > > So just as Clay has mentioned above, Jackrabbit does not hold the complete
> > > > Lucene index in memory. How it actually works is that there is a VolatileIndex
> > > > which is in memory. Any updates to the Lucene index are first done there and
> > > > then are committed to the file system based on the threshold parameters.
> > > > This was obviously implemented for performance reasons.
> > > > http://wiki.apache.org/jackrabbit/Search
> > > > Parameters:
> > > > 1. maxVolatileIndexSize (default 1048576): the maximum volatile index
> > > > size in bytes until it is written to disk. The default value is 1 MB.
> > > > 2. volatileIdleTime (default 3): idle time in seconds until the volatile
> > > > index part is moved to a persistent index even though minMergeDocs is
> > > > not reached.
> > > >
> > > > 1 GB is quite low. My company has run a production instance of Jackrabbit
> > > > for over two years with 1 GB of memory and it has not had any issues.
> > > > The only time I saw huge spikes in memory consumption is on large
> > > > operations such as cloning a node with many descendants or querying a data
> > > > set with a 10k+ result size.
> > > >
> > > > You said you have gathered a heap dump; this should point you in the
> > > > direction of which objects are consuming the majority of the heap. This
> > > > would be a good start to see if it is Jackrabbit causing the issue or your
> > > > application.
> > > > What type of deployment (
> > > > http://jackrabbit.apache.org/jcr/deployment-models.html) of Jackrabbit
> > > > are you guys running? Is it completely isolated or embedded in your
> > > > application?

RE: Memory usage

Posted by "Roll, Kevin" <Ke...@idexx.com>.
That's in JBoss, guy. Maybe it works there, but it doesn't in Sling... I tried it!

-----Original Message-----
From: Clay Ferguson [mailto:wclayf@gmail.com] 
Sent: Tuesday, November 24, 2015 4:47 PM
To: users@jackrabbit.apache.org
Subject: Re: Memory usage

Come on Kevin, I just googled it and found it immediately bro. :)

https://docs.jboss.org/jbossdna/0.7/manuals/reference/html/jcr-query-and-search.html#jcr-sql2-limits

Best regards,
Clay Ferguson
wclayf@gmail.com


On Tue, Nov 24, 2015 at 3:30 PM, Roll, Kevin <Ke...@idexx.com> wrote:

> Unfortunately the use of 'limit' is not supported (via JCR-SQL2 queries):
>
> https://issues.apache.org/jira/browse/SLING-1873
>
> I set resultFetchSize to a very low number and I was still able to iterate
> through a larger result set, although this may have been batched behind the
> scenes. I'm hoping that my new flag-based task will drastically cut down
> the result set size and prevent the runaway memory usage anyway.
>
>
> From: Clay Ferguson [mailto:wclayf@gmail.com]
> Sent: Tuesday, November 24, 2015 1:35 PM
> To: users@jackrabbit.apache.org
> Subject: Re: Memory usage
>
> point #1. In SQL2 you can just build your query string dynamically and put
> in the time of the last replication. So really I don't see the limitation
> there. You would always just build your queries with the correct date on
> it. But like you said, that is a 'weak" solution. I think actually the
> 'dirty flag' kind of thing or 'needs replication flag' is better because
> you can do it node-by-node and at any time, and you can shutdown and
> restart and it will always pickup where it left off. With timestamps you
> can run into situations where at one cycle it only half processed (failure
> for whaever reason), and then your dates get messed up. So if I were you'd
> do the flag approach. Seems more bullet proof. So if you have systems A ,
> B, C where a needs to replicate out to B and C, then what you'd do is ever
> time you modify or create an A node, you set B_DIRTY=true, and C_DIRTY=true
> on the A node, and that flags it to know a replication is pending. Sounds
> like you are on the right track you just need to set a LIMIT on your query
> so that it only grabs 100 or so at a time. I know MySQL has a LIMIT. Maybe
> SQL2 does also. You'd just keep running 100 at a time using LIMIT until one
> of the queries comes back empty. Will use hardly any memory, and be
> bullet-proof AND always easily restartable/resumable.
>
> Best regards,
> Clay Ferguson
> wclayf@gmail.com
>
>
> On Tue, Nov 24, 2015 at 11:56 AM, Roll, Kevin <Ke...@idexx.com>
> wrote:
>
> > Basically we replicate images and associated metadata to another system.
> > One of the use cases is that the user marks an image as interesting in
> the
> > local system. This metadata change (or any other) needs to then propagate
> > to the other system. So, I am querying for nodes where jcr:lastModified
> is
> > greater than another Date which is the timestamp of the last replication.
> >
> > My understanding is that JCR-SQL2 can only do a comparison where the
> > second operand is static. I am working on a different approach where I
> set
> > a flag on any node that needs to be replicated. I have event handlers for
> > added and changed nodes - at that moment it is trivial to determine
> whether
> > the node should be flagged. I realized it is much easier than trying to
> > figure it out later. The "later" case arises because we have the option
> to
> > switch this replication on and off, and there may be a situation where it
> > becomes on and must catch up with a backlog of work. This way I can
> simply
> > query all nodes with the flag set (I have a scheduled task that looks for
> > nodes needing replication).
> >
> > If there's a date comparison trick it might help as an interim solution
> > until I get this other idea up and running.
> >
> > Thanks!
> >
> > -----Original Message-----
> > From: Clay Ferguson [mailto:wclayf@gmail.com]
> > Sent: Tuesday, November 24, 2015 12:15 PM
> > To: users@jackrabbit.apache.org
> > Subject: Re: Memory usage
> >
> > glad you're gettin' closer.
> >
> > If you want, tell us more about the date range problem, because I may
> know
> > a solution (or workaround). Remember dates can be treated as integers if
> > you really need to. Integers are the fastest and most powerful data type
> > for dbs to handle too. So there should be a good clean solution unless
> you
> > have a VERY unusual situation.
> >
> > Best regards,
> > Clay Ferguson
> > wclayf@gmail.com
> >
> >
> > On Tue, Nov 24, 2015 at 10:14 AM, Roll, Kevin <Ke...@idexx.com>
> > wrote:
> >
> > > I think I am hot on the trail. I noticed this morning that the top
> > objects
> > > in the heap dump are not just Lucene, they are classes related to query
> > > results. Due to a limitation in the Jackrabbit query language
> > (specifically
> > > the inability to compare two dynamic dates) I am running a query that
> > > returns a result set proportional to the size of the repository (in
> other
> > > words it is unbounded). resultFetchSize is unlimited by default, so I
> > think
> > > I am getting larger and larger query results until I run out of space.
> > >
> > > I already changed this parameter yesterday, so I will see what happens
> > > with the testing today. In the bigger picture I'm working on a better
> way
> > > to mark and query the nodes I'm interested in so I don't have to
> perform
> > an
> > > unbounded query.
> > >
> > > Thanks again for the excellent support.
> > >
> > > P.S. We build and run a standalone Sling jar - it runs separately from
> > our
> > > main application.
> > >
> > >
> > > -----Original Message-----
> > > From: Ben Frisoni [mailto:frisonib@gmail.com]
> > > Sent: Tuesday, November 24, 2015 11:05 AM
> > > To: users@jackrabbit.apache.org
> > > Subject: Re: Memory usage
> > >
> > > So just as Clay has mentioned above, Jackrabbit does not hold the
> > complete
> > > Lucene index in memory. How it actually works is there is a
> VolatileIndex
> > > which is memory. Any updates to the Lucene Index are first done here
> and
> > > then are committed to the FileSystem based on the threshold parameters.
> > > This was obviously implemented for performance reasons.
> > > http://wiki.apache.org/jackrabbit/Search
> > > Parameters:
> > > 1.
> > >
> > > maxVolatileIndexSize
> > >
> > > 1048576
> > >
> > > The maximum volatile index size in bytes until it is written to disk.
> The
> > > default value is 1MB.
> > >
> > > 2.
> > >
> > > volatileIdleTime
> > >
> > > 3
> > >
> > > Idle time in seconds until the volatile index part is moved to a
> > persistent
> > > index even though minMergeDocs is not reached.
> > >
> > > 1GB is quite low. My company has ran for over two years a production
> > > instance of Jackrabbit with 1 GB of memory and it has not had any
> issues.
> > > The only time I saw huge spikes on memory consumption is on large
> > > operations such as cloning a node with many descendants or querying a
> > data
> > > set with a 10k+ result size.
> > >
> > > You said you have gathered a heap dump, this should point you in the
> > > direction of what objects are consuming majority of the heap. This
> would
> > be
> > > a good start to see if it is jackrabbit causing the issue or your
> > > application.
> > > What type of deployment (
> > > http://jackrabbit.apache.org/jcr/deployment-models.html) of jackrabbit
> > are
> > > you guys running? Is it completed isolated or embedded in your
> > application?
> > >
> > > On Mon, Nov 23, 2015 at 10:16 PM, Roll, Kevin <Ke...@idexx.com>
> > > wrote:
> > >
> > > > Hi, Ben. I was referring to the following page:
> > > >
> > > > https://jackrabbit.apache.org/jcr/search-implementation.html
> > > >
> > > > "The most recent generation of the search index is held completely in
> > > > memory."
> > > >
> > > > Perhaps I am misreading this, or perhaps it is wrong, but I
> interpreted
> > > > that to mean that the size of the index in memory would be
> proportional
> > > to
> > > > the repository size. I hope this is not true!
> > > >
> > > > I am currently trying to get information from our QA team about the
> > > > approximate number of nodes in the repository. We are not currently
> > > setting
> > > > an explicit heap size - in the dumps I've examined it seems to run
> out
> > > > around 240Mb. I'm pushing to set something explicit but I'm now
> hearing
> > > > that older hardware has only 1Gb of memory, which gives us
> practically
> > > > nowhere to go.
> > > >
> > > > The queries that I'm doing are not very fancy... for example:
> "select *
> > > > from [nt:resource] where [jcr:mimeType] like 'image%%'". I'm actually
> > > > rewriting that task so the query will be even simpler.
> > > >
> > > > Thanks for the help!
> > > >
> > > >
> > > > users@jackrabbit.apache.org
> > > > -----Original Message-----
> > > > From: Ben Frisoni [mailto:frisonib@gmail.com]
> > > > Sent: Monday, November 23, 2015 5:21 PM
> > > > To: users@jackrabbit.apache.org
> > > > Subject: Re: Memory usage
> > > >
> > > > It is a good idea to turn off supportHighlighting especially if you
> > > aren't
> > > > using the functionality. It takes up a lot of extra space within the
> > > index.
> > > > I am not sure where you heard that the Lucene Index is kept in memory
> > > but I
> > > > am pretty certain that is wrong. Can you point me to the
> documentation
> > > > saying this?
> > > >
> > > > Also what data set sizes are you querying against (10k nodes ? 100k
> > > nodes?
> > > > 1 mil nodes?).
> > > > What heap size do you have set on the jvm?
> > > > Reducing the resultFetchSize should help reduce the memory footprint
> on
> > > > queries.
> > > > I am assuming you are using the QueryManager to retrieve nodes. Can
> you
> > > > give an example query that you are using?
> > > >
> > > > I have developed a patch to improve query performance on large data
> > sets
> > > > with jackrabbit 2.x. I should be done soon if I can gather together a
> > few
> > > > hours to finish up my work. If you would like you can give that a try
> > > once
> > > > I finish.
> > > >
> > > > Some other repository settings you might want to look at are:
> > > >  <PersistenceManager
> > > >
> > > >
> > >
> >
> class="org.apache.jackrabbit.core.persistence.pool.DerbyPersistenceManager">
> > > >       <param name="bundleCacheSize" value="256"/>
> > > > </PersistenceManager>
> > > >  <ISMLocking
> > > > class="org.apache.jackrabbit.core.state.FineGrainedISMLocking"/>
> > > >
> > > >
> > > > Hope this helps.
> > > >
> > > >
> > >
> >
>

Re: Memory usage

Posted by Clay Ferguson <wc...@gmail.com>.
Come on Kevin, I just googled it and found it immediately bro. :)

https://docs.jboss.org/jbossdna/0.7/manuals/reference/html/jcr-query-and-search.html#jcr-sql2-limits

Best regards,
Clay Ferguson
wclayf@gmail.com


On Tue, Nov 24, 2015 at 3:30 PM, Roll, Kevin <Ke...@idexx.com> wrote:

> Unfortunately the use of 'limit' is not supported (via JCR-SQL2 queries):
>
> https://issues.apache.org/jira/browse/SLING-1873
>
> I set resultFetchSize to a very low number and I was still able to iterate
> through a larger result set, although this may have been batched behind the
> scenes. I'm hoping that my new flag-based task will drastically cut down
> the result set size and prevent the runaway memory usage anyway.
>
>
> From: Clay Ferguson [mailto:wclayf@gmail.com]
> Sent: Tuesday, November 24, 2015 1:35 PM
> To: users@jackrabbit.apache.org
> Subject: Re: Memory usage
>
> point #1. In SQL2 you can just build your query string dynamically and put
> in the time of the last replication. So really I don't see the limitation
> there. You would always just build your queries with the correct date on
> it. But like you said, that is a 'weak" solution. I think actually the
> 'dirty flag' kind of thing or 'needs replication flag' is better because
> you can do it node-by-node and at any time, and you can shutdown and
> restart and it will always pickup where it left off. With timestamps you
> can run into situations where at one cycle it only half processed (failure
> for whaever reason), and then your dates get messed up. So if I were you'd
> do the flag approach. Seems more bullet proof. So if you have systems A ,
> B, C where a needs to replicate out to B and C, then what you'd do is ever
> time you modify or create an A node, you set B_DIRTY=true, and C_DIRTY=true
> on the A node, and that flags it to know a replication is pending. Sounds
> like you are on the right track you just need to set a LIMIT on your query
> so that it only grabs 100 or so at a time. I know MySQL has a LIMIT. Maybe
> SQL2 does also. You'd just keep running 100 at a time using LIMIT until one
> of the queries comes back empty. Will use hardly any memory, and be
> bullet-proof AND always easily restartable/resumable.
>
> Best regards,
> Clay Ferguson
> wclayf@gmail.com
>
>
> On Tue, Nov 24, 2015 at 11:56 AM, Roll, Kevin <Ke...@idexx.com>
> wrote:
>
> > Basically we replicate images and associated metadata to another system.
> > One of the use cases is that the user marks an image as interesting in
> the
> > local system. This metadata change (or any other) needs to then propagate
> > to the other system. So, I am querying for nodes where jcr:lastModified
> is
> > greater than another Date which is the timestamp of the last replication.
> >
> > My understanding is that JCR-SQL2 can only do a comparison where the
> > second operand is static. I am working on a different approach where I
> set
> > a flag on any node that needs to be replicated. I have event handlers for
> > added and changed nodes - at that moment it is trivial to determine
> whether
> > the node should be flagged. I realized it is much easier than trying to
> > figure it out later. The "later" case arises because we have the option
> to
> > switch this replication on and off, and there may be a situation where it
> > becomes on and must catch up with a backlog of work. This way I can
> simply
> > query all nodes with the flag set (I have a scheduled task that looks for
> > nodes needing replication).
> >
> > If there's a date comparison trick it might help as an interim solution
> > until I get this other idea up and running.
> >
> > Thanks!
> >
> > -----Original Message-----
> > From: Clay Ferguson [mailto:wclayf@gmail.com]
> > Sent: Tuesday, November 24, 2015 12:15 PM
> > To: users@jackrabbit.apache.org
> > Subject: Re: Memory usage
> >
> > glad you're gettin' closer.
> >
> > If you want, tell us more about the date range problem, because I may
> know
> > a solution (or workaround). Remember dates can be treated as integers if
> > you really need to. Integers are the fastest and most powerful data type
> > for dbs to handle too. So there should be a good clean solution unless
> you
> > have a VERY unusual situation.
> >
> > Best regards,
> > Clay Ferguson
> > wclayf@gmail.com
> >
> >
> > On Tue, Nov 24, 2015 at 10:14 AM, Roll, Kevin <Ke...@idexx.com>
> > wrote:
> >
> > > I think I am hot on the trail. I noticed this morning that the top
> > objects
> > > in the heap dump are not just Lucene, they are classes related to query
> > > results. Due to a limitation in the Jackrabbit query language
> > (specifically
> > > the inability to compare two dynamic dates) I am running a query that
> > > returns a result set proportional to the size of the repository (in
> other
> > > words it is unbounded). resultFetchSize is unlimited by default, so I
> > think
> > > I am getting larger and larger query results until I run out of space.
> > >
> > > I already changed this parameter yesterday, so I will see what happens
> > > with the testing today. In the bigger picture I'm working on a better
> way
> > > to mark and query the nodes I'm interested in so I don't have to
> perform
> > an
> > > unbounded query.
> > >
> > > Thanks again for the excellent support.
> > >
> > > P.S. We build and run a standalone Sling jar - it runs separately from
> > our
> > > main application.
> > >
> > >
> > > -----Original Message-----
> > > From: Ben Frisoni [mailto:frisonib@gmail.com]
> > > Sent: Tuesday, November 24, 2015 11:05 AM
> > > To: users@jackrabbit.apache.org
> > > Subject: Re: Memory usage
> > >
> > > So just as Clay has mentioned above, Jackrabbit does not hold the
> > complete
> > > Lucene index in memory. How it actually works is there is a
> VolatileIndex
> > > which is memory. Any updates to the Lucene Index are first done here
> and
> > > then are committed to the FileSystem based on the threshold parameters.
> > > This was obviously implemented for performance reasons.
> > > http://wiki.apache.org/jackrabbit/Search
> > > Parameters:
> > > 1.
> > >
> > > maxVolatileIndexSize
> > >
> > > 1048576
> > >
> > > The maximum volatile index size in bytes until it is written to disk.
> The
> > > default value is 1MB.
> > >
> > > 2.
> > >
> > > volatileIdleTime
> > >
> > > 3
> > >
> > > Idle time in seconds until the volatile index part is moved to a
> > persistent
> > > index even though minMergeDocs is not reached.
> > >
> > > 1GB is quite low. My company has ran for over two years a production
> > > instance of Jackrabbit with 1 GB of memory and it has not had any
> issues.
> > > The only time I saw huge spikes on memory consumption is on large
> > > operations such as cloning a node with many descendants or querying a
> > data
> > > set with a 10k+ result size.
> > >
> > > You said you have gathered a heap dump, this should point you in the
> > > direction of what objects are consuming majority of the heap. This
> would
> > be
> > > a good start to see if it is jackrabbit causing the issue or your
> > > application.
> > > What type of deployment (
> > > http://jackrabbit.apache.org/jcr/deployment-models.html) of jackrabbit
> > are
> > > you guys running? Is it completed isolated or embedded in your
> > application?
> > >
> > > On Mon, Nov 23, 2015 at 10:16 PM, Roll, Kevin <Ke...@idexx.com>
> > > wrote:
> > >
> > > > Hi, Ben. I was referring to the following page:
> > > >
> > > > https://jackrabbit.apache.org/jcr/search-implementation.html
> > > >
> > > > "The most recent generation of the search index is held completely in
> > > > memory."
> > > >
> > > > Perhaps I am misreading this, or perhaps it is wrong, but I
> interpreted
> > > > that to mean that the size of the index in memory would be
> proportional
> > > to
> > > > the repository size. I hope this is not true!
> > > >
> > > > I am currently trying to get information from our QA team about the
> > > > approximate number of nodes in the repository. We are not currently
> > > setting
> > > > an explicit heap size - in the dumps I've examined it seems to run
> out
> > > > around 240Mb. I'm pushing to set something explicit but I'm now
> hearing
> > > > that older hardware has only 1Gb of memory, which gives us
> practically
> > > > nowhere to go.
> > > >
> > > > The queries that I'm doing are not very fancy... for example:
> "select *
> > > > from [nt:resource] where [jcr:mimeType] like 'image%%'". I'm actually
> > > > rewriting that task so the query will be even simpler.
> > > >
> > > > Thanks for the help!
> > > >
> > > >
> > > > users@jackrabbit.apache.org
> > > > -----Original Message-----
> > > > From: Ben Frisoni [mailto:frisonib@gmail.com]
> > > > Sent: Monday, November 23, 2015 5:21 PM
> > > > To: users@jackrabbit.apache.org
> > > > Subject: Re: Memory usage
> > > >
> > > > It is a good idea to turn off supportHighlighting especially if you
> > > aren't
> > > > using the functionality. It takes up a lot of extra space within the
> > > index.
> > > > I am not sure where you heard that the Lucene Index is kept in memory
> > > but I
> > > > am pretty certain that is wrong. Can you point me to the
> documentation
> > > > saying this?
> > > >
> > > > Also what data set sizes are you querying against (10k nodes ? 100k
> > > nodes?
> > > > 1 mil nodes?).
> > > > What heap size do you have set on the jvm?
> > > > Reducing the resultFetchSize should help reduce the memory footprint
> on
> > > > queries.
> > > > I am assuming you are using the QueryManager to retrieve nodes. Can
> you
> > > > give an example query that you are using?
> > > >
> > > > I have developed a patch to improve query performance on large data
> > sets
> > > > with jackrabbit 2.x. I should be done soon if I can gather together a
> > few
> > > > hours to finish up my work. If you would like you can give that a try
> > > once
> > > > I finish.
> > > >
> > > > Some other repository settings you might want to look at are:
> > > >  <PersistenceManager
> > > >
> > > >
> > >
> >
> class="org.apache.jackrabbit.core.persistence.pool.DerbyPersistenceManager">
> > > >       <param name="bundleCacheSize" value="256"/>
> > > > </PersistenceManager>
> > > >  <ISMLocking
> > > > class="org.apache.jackrabbit.core.state.FineGrainedISMLocking"/>
> > > >
> > > >
> > > > Hope this helps.
> > > >
> > > >
> > >
> >
>

RE: Memory usage

Posted by "Roll, Kevin" <Ke...@idexx.com>.
Unfortunately the use of 'limit' is not supported (via JCR-SQL2 queries):

https://issues.apache.org/jira/browse/SLING-1873

I set resultFetchSize to a very low number and I was still able to iterate through a larger result set, although this may have been batched behind the scenes. I'm hoping that my new flag-based task will drastically cut down the result set size and prevent the runaway memory usage anyway.


From: Clay Ferguson [mailto:wclayf@gmail.com] 
Sent: Tuesday, November 24, 2015 1:35 PM
To: users@jackrabbit.apache.org
Subject: Re: Memory usage

point #1. In SQL2 you can just build your query string dynamically and put
in the time of the last replication. So really I don't see the limitation
there. You would always just build your queries with the correct date on
it. But like you said, that is a 'weak" solution. I think actually the
'dirty flag' kind of thing or 'needs replication flag' is better because
you can do it node-by-node and at any time, and you can shutdown and
restart and it will always pickup where it left off. With timestamps you
can run into situations where at one cycle it only half processed (failure
for whaever reason), and then your dates get messed up. So if I were you'd
do the flag approach. Seems more bullet proof. So if you have systems A ,
B, C where a needs to replicate out to B and C, then what you'd do is ever
time you modify or create an A node, you set B_DIRTY=true, and C_DIRTY=true
on the A node, and that flags it to know a replication is pending. Sounds
like you are on the right track you just need to set a LIMIT on your query
so that it only grabs 100 or so at a time. I know MySQL has a LIMIT. Maybe
SQL2 does also. You'd just keep running 100 at a time using LIMIT until one
of the queries comes back empty. Will use hardly any memory, and be
bullet-proof AND always easily restartable/resumable.

Best regards,
Clay Ferguson
wclayf@gmail.com


On Tue, Nov 24, 2015 at 11:56 AM, Roll, Kevin <Ke...@idexx.com> wrote:

> Basically we replicate images and associated metadata to another system.
> One of the use cases is that the user marks an image as interesting in the
> local system. This metadata change (or any other) needs to then propagate
> to the other system. So, I am querying for nodes where jcr:lastModified is
> greater than another Date which is the timestamp of the last replication.
>
> My understanding is that JCR-SQL2 can only do a comparison where the
> second operand is static. I am working on a different approach where I set
> a flag on any node that needs to be replicated. I have event handlers for
> added and changed nodes - at that moment it is trivial to determine whether
> the node should be flagged. I realized it is much easier than trying to
> figure it out later. The "later" case arises because we have the option to
> switch this replication on and off, and there may be a situation where it
> becomes on and must catch up with a backlog of work. This way I can simply
> query all nodes with the flag set (I have a scheduled task that looks for
> nodes needing replication).
>
> If there's a date comparison trick it might help as an interim solution
> until I get this other idea up and running.
>
> Thanks!
>
> -----Original Message-----
> From: Clay Ferguson [mailto:wclayf@gmail.com]
> Sent: Tuesday, November 24, 2015 12:15 PM
> To: users@jackrabbit.apache.org
> Subject: Re: Memory usage
>
> glad you're gettin' closer.
>
> If you want, tell us more about the date range problem, because I may know
> a solution (or workaround). Remember dates can be treated as integers if
> you really need to. Integers are the fastest and most powerful data type
> for dbs to handle too. So there should be a good clean solution unless you
> have a VERY unusual situation.
>
> Best regards,
> Clay Ferguson
> wclayf@gmail.com
>
>
> On Tue, Nov 24, 2015 at 10:14 AM, Roll, Kevin <Ke...@idexx.com>
> wrote:
>
> > I think I am hot on the trail. I noticed this morning that the top
> objects
> > in the heap dump are not just Lucene, they are classes related to query
> > results. Due to a limitation in the Jackrabbit query language
> (specifically
> > the inability to compare two dynamic dates) I am running a query that
> > returns a result set proportional to the size of the repository (in other
> > words it is unbounded). resultFetchSize is unlimited by default, so I
> think
> > I am getting larger and larger query results until I run out of space.
> >
> > I already changed this parameter yesterday, so I will see what happens
> > with the testing today. In the bigger picture I'm working on a better way
> > to mark and query the nodes I'm interested in so I don't have to perform
> an
> > unbounded query.
> >
> > Thanks again for the excellent support.
> >
> > P.S. We build and run a standalone Sling jar - it runs separately from
> our
> > main application.
> >
> >
> > -----Original Message-----
> > From: Ben Frisoni [mailto:frisonib@gmail.com]
> > Sent: Tuesday, November 24, 2015 11:05 AM
> > To: users@jackrabbit.apache.org
> > Subject: Re: Memory usage
> >
> > So just as Clay has mentioned above, Jackrabbit does not hold the
> complete
> > Lucene index in memory. How it actually works is there is a VolatileIndex
> > which is memory. Any updates to the Lucene Index are first done here and
> > then are committed to the FileSystem based on the threshold parameters.
> > This was obviously implemented for performance reasons.
> > http://wiki.apache.org/jackrabbit/Search
> > Parameters:
> > 1.
> >
> > maxVolatileIndexSize
> >
> > 1048576
> >
> > The maximum volatile index size in bytes until it is written to disk. The
> > default value is 1MB.
> >
> > 2.
> >
> > volatileIdleTime
> >
> > 3
> >
> > Idle time in seconds until the volatile index part is moved to a
> persistent
> > index even though minMergeDocs is not reached.
> >
> > 1GB is quite low. My company has ran for over two years a production
> > instance of Jackrabbit with 1 GB of memory and it has not had any issues.
> > The only time I saw huge spikes on memory consumption is on large
> > operations such as cloning a node with many descendants or querying a
> data
> > set with a 10k+ result size.
> >
> > You said you have gathered a heap dump, this should point you in the
> > direction of what objects are consuming majority of the heap. This would
> be
> > a good start to see if it is jackrabbit causing the issue or your
> > application.
> > What type of deployment (
> > http://jackrabbit.apache.org/jcr/deployment-models.html) of jackrabbit
> are
> > you guys running? Is it completed isolated or embedded in your
> application?
> >
> > On Mon, Nov 23, 2015 at 10:16 PM, Roll, Kevin <Ke...@idexx.com>
> > wrote:
> >
> > > Hi, Ben. I was referring to the following page:
> > >
> > > https://jackrabbit.apache.org/jcr/search-implementation.html
> > >
> > > "The most recent generation of the search index is held completely in
> > > memory."
> > >
> > > Perhaps I am misreading this, or perhaps it is wrong, but I interpreted
> > > that to mean that the size of the index in memory would be proportional
> > to
> > > the repository size. I hope this is not true!
> > >
> > > I am currently trying to get information from our QA team about the
> > > approximate number of nodes in the repository. We are not currently
> > setting
> > > an explicit heap size - in the dumps I've examined it seems to run out
> > > around 240Mb. I'm pushing to set something explicit but I'm now hearing
> > > that older hardware has only 1Gb of memory, which gives us practically
> > > nowhere to go.
> > >
> > > The queries that I'm doing are not very fancy... for example: "select *
> > > from [nt:resource] where [jcr:mimeType] like 'image%%'". I'm actually
> > > rewriting that task so the query will be even simpler.
> > >
> > > Thanks for the help!
> > >
> > >
> > > users@jackrabbit.apache.org
> > > -----Original Message-----
> > > From: Ben Frisoni [mailto:frisonib@gmail.com]
> > > Sent: Monday, November 23, 2015 5:21 PM
> > > To: users@jackrabbit.apache.org
> > > Subject: Re: Memory usage
> > >
> > > It is a good idea to turn off supportHighlighting especially if you
> > aren't
> > > using the functionality. It takes up a lot of extra space within the
> > index.
> > > I am not sure where you heard that the Lucene Index is kept in memory
> > but I
> > > am pretty certain that is wrong. Can you point me to the documentation
> > > saying this?
> > >
> > > Also what data set sizes are you querying against (10k nodes ? 100k
> > nodes?
> > > 1 mil nodes?).
> > > What heap size do you have set on the jvm?
> > > Reducing the resultFetchSize should help reduce the memory footprint on
> > > queries.
> > > I am assuming you are using the QueryManager to retrieve nodes. Can you
> > > give an example query that you are using?
> > >
> > > I have developed a patch to improve query performance on large data
> sets
> > > with jackrabbit 2.x. I should be done soon if I can gather together a
> few
> > > hours to finish up my work. If you would like you can give that a try
> > once
> > > I finish.
> > >
> > > Some other repository settings you might want to look at are:
> > >  <PersistenceManager
> > >
> > >
> >
> class="org.apache.jackrabbit.core.persistence.pool.DerbyPersistenceManager">
> > >       <param name="bundleCacheSize" value="256"/>
> > > </PersistenceManager>
> > >  <ISMLocking
> > > class="org.apache.jackrabbit.core.state.FineGrainedISMLocking"/>
> > >
> > >
> > > Hope this helps.
> > >
> > >
> >
>

Re: Memory usage

Posted by Clay Ferguson <wc...@gmail.com>.
point #1. In SQL2 you can just build your query string dynamically and put
in the time of the last replication. So really I don't see the limitation
there. You would always just build your queries with the correct date on
it. But like you said, that is a 'weak" solution. I think actually the
'dirty flag' kind of thing or 'needs replication flag' is better because
you can do it node-by-node and at any time, and you can shutdown and
restart and it will always pickup where it left off. With timestamps you
can run into situations where at one cycle it only half processed (failure
for whaever reason), and then your dates get messed up. So if I were you'd
do the flag approach. Seems more bullet proof. So if you have systems A ,
B, C where a needs to replicate out to B and C, then what you'd do is ever
time you modify or create an A node, you set B_DIRTY=true, and C_DIRTY=true
on the A node, and that flags it to know a replication is pending. Sounds
like you are on the right track you just need to set a LIMIT on your query
so that it only grabs 100 or so at a time. I know MySQL has a LIMIT. Maybe
SQL2 does also. You'd just keep running 100 at a time using LIMIT until one
of the queries comes back empty. Will use hardly any memory, and be
bullet-proof AND always easily restartable/resumable.

Best regards,
Clay Ferguson
wclayf@gmail.com


On Tue, Nov 24, 2015 at 11:56 AM, Roll, Kevin <Ke...@idexx.com> wrote:

> Basically we replicate images and associated metadata to another system.
> One of the use cases is that the user marks an image as interesting in the
> local system. This metadata change (or any other) needs to then propagate
> to the other system. So, I am querying for nodes where jcr:lastModified is
> greater than another Date which is the timestamp of the last replication.
>
> My understanding is that JCR-SQL2 can only do a comparison where the
> second operand is static. I am working on a different approach where I set
> a flag on any node that needs to be replicated. I have event handlers for
> added and changed nodes - at that moment it is trivial to determine whether
> the node should be flagged. I realized it is much easier than trying to
> figure it out later. The "later" case arises because we have the option to
> switch this replication on and off, and there may be a situation where it
> becomes on and must catch up with a backlog of work. This way I can simply
> query all nodes with the flag set (I have a scheduled task that looks for
> nodes needing replication).
>
> If there's a date comparison trick it might help as an interim solution
> until I get this other idea up and running.
>
> Thanks!
>
> -----Original Message-----
> From: Clay Ferguson [mailto:wclayf@gmail.com]
> Sent: Tuesday, November 24, 2015 12:15 PM
> To: users@jackrabbit.apache.org
> Subject: Re: Memory usage
>
> glad you're gettin' closer.
>
> If you want, tell us more about the date range problem, because I may know
> a solution (or workaround). Remember dates can be treated as integers if
> you really need to. Integers are the fastest and most powerful data type
> for dbs to handle too. So there should be a good clean solution unless you
> have a VERY unusual situation.
>
> Best regards,
> Clay Ferguson
> wclayf@gmail.com
>
>

RE: Memory usage

Posted by "Roll, Kevin" <Ke...@idexx.com>.
Basically we replicate images and associated metadata to another system. One of the use cases is that the user marks an image as interesting in the local system. This metadata change (or any other) then needs to propagate to the other system. So, I am querying for nodes where jcr:lastModified is greater than another date, which is the timestamp of the last replication.

My understanding is that JCR-SQL2 can only do a comparison where the second operand is static. I am working on a different approach where I set a flag on any node that needs to be replicated. I have event handlers for added and changed nodes - at that moment it is trivial to determine whether the node should be flagged. I realized it is much easier than trying to figure it out later. The "later" case arises because we have the option to switch this replication on and off, and there may be a situation where it is switched back on and must catch up with a backlog of work. This way I can simply query all nodes with the flag set (I have a scheduled task that looks for nodes needing replication).
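
For illustration, here is roughly what I have in mind - a minimal sketch
only, where the property name (needsReplication) and the content path are
placeholders and the error handling is stubbed out. A real listener would
also have to resolve property-changed events to their parent node:

import javax.jcr.Node;
import javax.jcr.Session;
import javax.jcr.observation.Event;
import javax.jcr.observation.EventIterator;
import javax.jcr.observation.EventListener;

public class ReplicationFlagger implements EventListener {

    private final Session session;

    public ReplicationFlagger(Session session) {
        this.session = session;
    }

    public void onEvent(EventIterator events) {
        try {
            while (events.hasNext()) {
                Event event = events.nextEvent();
                // NODE_ADDED events report the path of the new node
                Node node = session.getNode(event.getPath());
                node.setProperty("needsReplication", true); // placeholder flag
            }
            session.save();
        } catch (Exception e) {
            // a real implementation should log and recover here
        }
    }
}

// Registered for node-added events under a placeholder content root:
// session.getWorkspace().getObservationManager().addEventListener(
//         new ReplicationFlagger(session), Event.NODE_ADDED,
//         "/content", true, null, null, false);

The scheduled task can then pick up any backlog with a query along the
lines of: select * from [nt:base] as n where n.[needsReplication] =
CAST('true' AS BOOLEAN).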

If there's a date comparison trick it might help as an interim solution until I get this other idea up and running.

Thanks!


Re: Memory usage

Posted by Clay Ferguson <wc...@gmail.com>.
glad you're gettin' closer.

If you want, tell us more about the date range problem, because I may know
a solution (or workaround). Remember dates can be treated as integers if
you really need to. Integers are the fastest and most powerful data type
for dbs to handle too. So there should be a good clean solution unless you
have a VERY unusual situation.
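
For instance - just a rough, untested sketch - you can compute the cutoff
on the java side and inline it as a static date literal, so the query
never has to compare two dynamic dates. ISO8601 here is the helper class
from jackrabbit-jcr-commons:

import java.util.Calendar;

import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;

import org.apache.jackrabbit.util.ISO8601;

public class ChangedSinceQuery {

    // lastReplication is read from wherever the replication timestamp
    // is persisted
    public static Query build(Session session, Calendar lastReplication)
            throws RepositoryException {
        QueryManager qm = session.getWorkspace().getQueryManager();
        // the cutoff is baked into the statement as a static literal,
        // so only one operand is "dynamic"
        String stmt = "SELECT * FROM [nt:resource] AS r"
                + " WHERE r.[jcr:lastModified] > CAST('"
                + ISO8601.format(lastReplication) + "' AS DATE)";
        return qm.createQuery(stmt, Query.JCR_SQL2);
    }
}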

Best regards,
Clay Ferguson
wclayf@gmail.com



Re: Memory usage

Posted by Ben Frisoni <fr...@gmail.com>.
Yeah, that makes sense with a 256MB heap size.
The default resultFetchSize, from the documentation on
http://wiki.apache.org/jackrabbit/Search:

resultFetchSize (default: 2147483647; since 1.2.1)
The number of results the query handler should initially fetch when a query
is executed. Default value: Integer.MAX_VALUE (-> all).

You can also set the result size dynamically on the query instance being
executed.
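
For example - a sketch; setLimit and setOffset are standard JCR 2.0 Query
methods, and the statement is just the image query from earlier in the
thread:

import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;
import javax.jcr.query.QueryResult;

public class BoundedQueryExample {

    public static QueryResult firstPage(Session session)
            throws RepositoryException {
        QueryManager qm = session.getWorkspace().getQueryManager();
        Query q = qm.createQuery(
                "SELECT * FROM [nt:resource] AS r"
                        + " WHERE r.[jcr:mimeType] LIKE 'image%'",
                Query.JCR_SQL2);
        q.setLimit(100);  // never pull more than 100 rows at a time
        q.setOffset(0);   // starting row, for paging through the rest
        return q.execute();
    }
}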

I would suggest adding an extra 1GB of RAM to the machine hosting your
Jackrabbit instance. That way you can have at least 1GB for Jackrabbit and
1GB for the OS. You will see great improvements.


RE: Memory usage

Posted by "Roll, Kevin" <Ke...@idexx.com>.
I think I am hot on the trail. I noticed this morning that the top objects in the heap dump are not just Lucene, they are classes related to query results. Due to a limitation in the Jackrabbit query language (specifically the inability to compare two dynamic dates) I am running a query that returns a result set proportional to the size of the repository (in other words it is unbounded). resultFetchSize is unlimited by default, so I think I am getting larger and larger query results until I run out of space.

I already changed this parameter yesterday, so I will see what happens with the testing today. In the bigger picture I'm working on a better way to mark and query the nodes I'm interested in so I don't have to perform an unbounded query.

Thanks again for the excellent support.

P.S. We build and run a standalone Sling jar - it runs separately from our main application.



Re: Memory usage

Posted by Ben Frisoni <fr...@gmail.com>.
So just as Clay has mentioned above, Jackrabbit does not hold the complete
Lucene index in memory. How it actually works is that there is a
VolatileIndex which is held in memory. Any updates to the Lucene index are
first made there and then committed to the file system based on the
threshold parameters. This was obviously implemented for performance
reasons.
http://wiki.apache.org/jackrabbit/Search
Parameters:

1. maxVolatileIndexSize (default: 1048576)
The maximum volatile index size in bytes until it is written to disk. The
default value is 1MB.

2. volatileIdleTime (default: 3)
Idle time in seconds until the volatile index part is moved to a
persistent index even though minMergeDocs is not reached.
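
As a sketch of where those live, they are plain SearchIndex parameters in
the workspace configuration (the values below are the defaults, and the
path param is just the usual index location):

<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <param name="maxVolatileIndexSize" value="1048576"/>
  <param name="volatileIdleTime" value="3"/>
</SearchIndex>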

1GB is quite low, but my company has run a production instance of
Jackrabbit with 1GB of memory for over two years and it has not had any
issues. The only time I saw huge spikes in memory consumption was on large
operations such as cloning a node with many descendants or querying a data
set with a 10k+ result size.

You said you have gathered a heap dump; this should point you in the
direction of what objects are consuming the majority of the heap. This
would be a good start to see whether it is Jackrabbit or your application
causing the issue.
What type of deployment
(http://jackrabbit.apache.org/jcr/deployment-models.html) of Jackrabbit
are you running? Is it completely isolated or embedded in your
application?


RE: Memory usage

Posted by "Roll, Kevin" <Ke...@idexx.com>.
Hi, Ben. I was referring to the following page:

https://jackrabbit.apache.org/jcr/search-implementation.html

"The most recent generation of the search index is held completely in memory."

Perhaps I am misreading this, or perhaps it is wrong, but I interpreted that to mean that the size of the index in memory would be proportional to the repository size. I hope this is not true!

I am currently trying to get information from our QA team about the approximate number of nodes in the repository. We are not currently setting an explicit heap size - in the dumps I've examined it seems to run out around 240MB. I'm pushing to set something explicit, but I'm now hearing that older hardware has only 1GB of memory, which gives us practically nowhere to go.

The queries that I'm doing are not very fancy... for example: "select * from [nt:resource] where [jcr:mimeType] like 'image%'". I'm actually rewriting that task so the query will be even simpler.

Thanks for the help!



Re: Memory usage

Posted by Ben Frisoni <fr...@gmail.com>.
It is a good idea to turn off supportHighlighting especially if you aren't
using the functionality. It takes up a lot of extra space within the index.
I am not sure where you heard that the Lucene Index is kept in memory but I
am pretty certain that is wrong. Can you point me to the documentation
saying this?

Also, what data set sizes are you querying against (10k nodes? 100k nodes?
1 million nodes?)
What heap size do you have set on the JVM?
Reducing the resultFetchSize should help reduce the memory footprint of
queries.
I am assuming you are using the QueryManager to retrieve nodes. Can you
give an example query that you are using?

I have developed a patch to improve query performance on large data sets
with jackrabbit 2.x. I should be done soon if I can gather together a few
hours to finish up my work. If you would like you can give that a try once
I finish.

Some other repository settings you might want to look at are:

<PersistenceManager
    class="org.apache.jackrabbit.core.persistence.pool.DerbyPersistenceManager">
  <param name="bundleCacheSize" value="256"/>
</PersistenceManager>
<ISMLocking
    class="org.apache.jackrabbit.core.state.FineGrainedISMLocking"/>


Hope this helps.


RE: Memory usage

Posted by "Roll, Kevin" <Ke...@idexx.com>.
Our use case is the following: an external process generates 70 images, each around 700KB in size. These are uploaded as sub-nodes under a master node that encapsulates the run. There are also some sister nodes that contain a modest amount of metadata about each image and the run that generated it. In general most of the writing consists of a client POSTing these images into the repository via Sling; there are then some event handlers and tasks that look at the data that arrived. The only subsequent writes at present are some properties that are set after these images are examined and replicated into another system. So, I don't expect much at all in the way of concurrent read/write; it's mainly write a bunch and then read it back later.

By heavy pressure what I mean is that we have a test lab running continuously against this system. It's a lot more traffic than can be expected in the real world, but it is good for shaking out problems. What concerns me is that according to the documentation an entire Lucene index is kept in memory. Right now we don’t do any pruning - our repository only grows larger. This implies to me that the index will only grow as well and we will ultimately run out of memory no matter how big the heap is. Hopefully I'm wrong about that.

At the moment we have no JVM flags set. The SearchIndex configuration is also default (by default I mean what came with Sling), although I am looking at turning off supportHighlighting and putting a small value for resultFetchSize, say 100.
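
Concretely, I'm thinking of something like this in the SearchIndex element
- a sketch only; the two params are from the Jackrabbit wiki and the rest
of the element is whatever shipped with Sling:

<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <param name="supportHighlighting" value="false"/>
  <param name="resultFetchSize" value="100"/>
</SearchIndex>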


Re: Memory usage

Posted by Ben Frisoni <fr...@gmail.com>.
A little more description of what you mean by heavy pressure might help.
Does this involve concurrent read operations, write operations, or both?

Also, some other things that affect performance:
1. What JVM parameters are set?
2. Do you have any custom index configurations set?
3. What does your repository.xml look like?

This background info might help with answering your question.
